# Travel Destination Recommendation System Notebook

#### Authors
* 1
* 2 
* 3
* 4
* 5
* 6


## Problem Statement

The goal is to build a machine learning model that predicts hotel ratings from customer reviews, budget, location, and residence type. The dataset was scraped from TripAdvisor and contains information about hotels, including their ratings, reviews, amenities, pricing, geographical coordinates, and residence types (e.g., hotel, bed and breakfast, specialty lodging). By combining the review text with these additional factors, the model should accurately predict the ratings of new, unseen hotels.

Approach:

Data Preprocessing: Clean the text reviews by tokenizing them and removing stopwords and punctuation. Convert the cleaned text into a numerical representation suitable for modeling. Handle any missing values in the budget, location, and residence type columns.
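The cleaning steps described above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual pipeline: the stopword list, the `clean_review` and `bag_of_words` helpers, and the sample review are all assumptions made for the example.

```python
import string

# Tiny illustrative stopword list (an assumption); a real pipeline would
# likely use a fuller list from an NLP library.
STOPWORDS = {"a", "an", "and", "the", "is", "was", "were", "it", "to", "of", "in", "but"}


def clean_review(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    return [tokens.count(word) for word in vocabulary]


# Hypothetical review text, used only to demonstrate the helpers
review = "The room was clean, and the staff were friendly!"
tokens = clean_review(review)

# Hypothetical three-word vocabulary; a real one is built from the corpus
vocab = ["clean", "friendly", "dirty"]
vector = bag_of_words(tokens, vocab)

print(tokens)
print(vector)
```

A bag-of-words count is only one possible numerical representation; TF-IDF weights or embeddings would slot into the same place.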

Feature Engineering: Extract additional features from the reviews, such as sentiment scores and review length. Engineer new features from budget, location, and residence type, such as price range categories, geographical distance from landmarks, and one-hot encodings of residence types.
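A few of the features named above can be sketched as small helper functions. Everything here is hypothetical: the function names, the price thresholds, and the coordinates are chosen for illustration and are not taken from the dataset.

```python
import math


def review_length(text):
    """Number of whitespace-separated words in a review."""
    return len(text.split())


def price_category(price, low=50, high=150):
    """Bucket a nightly price into a range; thresholds are assumed values."""
    if price < low:
        return "budget"
    if price < high:
        return "mid-range"
    return "luxury"


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over the given categories."""
    return [1 if value == c else 0 for c in categories]


# Residence types as listed in the problem statement
RESIDENCE_TYPES = ["hotel", "bed and breakfast", "specialty lodging"]

# Hypothetical listing and landmark coordinates
features = {
    "review_length": review_length("Great location and friendly staff"),
    "price_category": price_category(80),
    "distance_km": haversine_km(-1.30, 36.80, -1.25, 36.85),
    "residence": one_hot("bed and breakfast", RESIDENCE_TYPES),
}
print(features)
```

The haversine formula treats the Earth as a sphere, which is usually accurate enough for a distance-to-landmark feature.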

Model Selection: Experiment with different supervised learning models, such as linear regression, decision trees, random forests, and neural networks, to find the one that best predicts hotel ratings from customer reviews, budget, location, and residence type. Compare the models using regression metrics such as mean squared error (MSE) and mean absolute error (MAE).
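To make the two metrics concrete, here is a minimal sketch that computes them by hand. The rating values are made up for illustration; in practice the equivalent functions from a library such as scikit-learn would normally be used.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences; penalizes large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences; expressed in rating units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true and predicted hotel ratings
y_true = [4.0, 3.0, 5.0]
y_pred = [3.5, 3.0, 4.0]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(f"MSE: {mse:.4f}, MAE: {mae:.4f}")
```

MAE reads directly as "average rating points off", while MSE weights the one-point miss on the third hotel four times as heavily as the half-point miss on the first.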

Model Training and Evaluation: Split the dataset into training and testing sets. Train the selected model on the training set and evaluate its performance on the testing set. Tune the model's hyperparameters to improve its accuracy, and use cross-validation to assess how well it generalizes.
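The splitting logic behind a hold-out test set and k-fold cross-validation can be sketched with index lists alone. The helper names, the 80/20 split, and the choice of five folds are assumptions for illustration; scikit-learn's `train_test_split` and `KFold` provide the same functionality.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(n, k=5):
    """Return k (train, validation) index pairs covering every row once."""
    # Spread any remainder over the first n % k folds
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, validation))
        start += size
    return folds


# Hypothetical dataset of 10 rows
train_idx, test_idx = train_test_split_indices(10)
folds = k_fold_indices(len(train_idx), k=4)

print(len(train_idx), len(test_idx))
for fold_train, fold_val in folds:
    print(len(fold_train), len(fold_val))
```

Cross-validating only within the training indices keeps the test set untouched until the final evaluation.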

## Objectives

## Data Understanding

In [1]:
# Importing necessary libraries
import pandas as pd
import json
import glob

In [2]:
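# Read several JSON files and merge them into a single DataFrame
def read_json_files(json_files):
    dfs = []
    for file in json_files:
        # Parse the file, then close it before building the DataFrame
        with open(file, encoding='utf-8') as f:
            json_data = json.load(f)
        dfs.append(pd.DataFrame(json_data))

    # Stack the per-file frames and renumber the index from 0
    merged_df = pd.concat(dfs, ignore_index=True)
    return merged_df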


In [3]:
json_files = [
    'Data/egypt.json',
    'Data/ethiopia.json',
    'Data/kenya.json',
    'Data/rwanda.json',
    'Data/drc.json',
    'Data/nigeria.json',
]
df = read_json_files(json_files)


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11122 entries, 0 to 11121
Data columns (total 65 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     11122 non-null  object 
 1   type                   11122 non-null  object 
 2   category               11122 non-null  object 
 3   subcategories          10234 non-null  object 
 4   name                   11122 non-null  object 
 5   locationString         10234 non-null  object 
 6   description            6505 non-null   object 
 7   image                  8907 non-null   object 
 8   photoCount             11122 non-null  int64  
 9   awards                 10234 non-null  object 
 10  rankingPosition        8181 non-null   float64
 11  rating                 8188 non-null   float64
 12  rawRanking             8181 non-null   float64
 13  phone                  7330 non-null   object 
 14  address                10234 non-null  object 
 15  ad

## EDA and Data Munging

## Modelling

## Model Evaluation

## Tuning

## Deployment

## Conclusion and Recommendations