# Travel Destination Recommendation System Notebook

#### Authors
* 1
* 2 
* 3
* 4
* 5
* 6


## Problem Statement

The goal is to build a machine learning model that predicts hotel ratings from customer reviews, budget, location, and residence type. The dataset was scraped from TripAdvisor and contains information about each listing, including ratings, reviews, amenities, pricing, geographical coordinates, and residence type (e.g., hotel, bed and breakfast, specialty lodging). By analyzing the review text alongside these additional factors, the objective is to accurately predict the ratings of new, unseen hotels.

Approach:

Data Preprocessing: Clean the text reviews by removing stopwords and punctuation and by tokenizing the text. Convert the text into a numerical representation suitable for modeling. Handle any missing values in the budget, location, and residence type columns.
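
The text-cleaning step might look like the minimal sketch below. The function name `preprocess_review` and the tiny `STOPWORDS` set are assumptions made for illustration; a real pipeline would likely use a fuller stopword list and a more careful tokenizer.

```python
import string

# Tiny illustrative stopword list (an assumption, not the project's actual list)
STOPWORDS = {"a", "an", "and", "the", "is", "was", "were", "of", "to", "in", "it", "at"}

def preprocess_review(text):
    """Lowercase, strip punctuation, tokenize on whitespace, and drop stopwords."""
    if not text:  # treat None or empty strings as having no tokens
        return []
    lowered = text.lower()
    # Remove every ASCII punctuation character
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOPWORDS]

print(preprocess_review("The pool was clean, and the staff were friendly!"))
# ['pool', 'clean', 'staff', 'friendly']
```

The resulting token lists can then be passed to whichever vectorizer is chosen for the numerical representation.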

Feature Engineering: Extract additional features from the dataset, such as review sentiment scores and review length. Engineer new features related to budget, location, and residence type, such as price range categories, geographical distance from landmarks, and one-hot encoding of residence types.
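
As a rough sketch of three of these features, the helpers below compute a great-circle distance to a landmark, a one-hot encoding, and a review length. The helper names, the `RESIDENCE_TYPES` labels, and the landmark coordinates are all hypothetical; only the listing coordinates come from the sample output in this notebook.

```python
import math

# Illustrative category labels (an assumption about how residence types are coded)
RESIDENCE_TYPES = ["hotel", "bed_and_breakfast", "specialty_lodging"]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def one_hot(value, categories):
    """Return a 0/1 indicator dict with one key per known category."""
    return {f"is_{c}": int(value == c) for c in categories}

def review_length(text):
    """Number of whitespace-separated words; 0 for missing text."""
    return len(text.split()) if text else 0

# Listing coordinates from the sample rows; the landmark location is made up
listing = (27.962564, 34.39381)
hypothetical_landmark = (27.9158, 34.3300)
print(round(haversine_km(*listing, *hypothetical_landmark), 1))
print(one_hot("hotel", RESIDENCE_TYPES))
```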

Model Selection: Experiment with different supervised learning models, such as linear regression, decision trees, random forests, or neural networks, to find the one that best predicts hotel ratings from customer reviews, budget, location, and residence type. Compare the models using regression metrics such as mean squared error (MSE) or mean absolute error (MAE).
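
For reference, the two metrics named above can be computed as in this small sketch. The rating values are invented for illustration; in practice a library implementation would normally be used instead of hand-written functions.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences; penalizes large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences; expressed in the same units as the rating."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up ratings, for illustration only
actual = [5.0, 4.0, 3.0, 4.5]
predicted = [4.5, 4.0, 4.0, 4.5]

print(mean_squared_error(actual, predicted))   # 0.3125
print(mean_absolute_error(actual, predicted))  # 0.375
```

Because MSE squares each error, the single one-star miss dominates it, while MAE weights all errors linearly.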

Model Training and Evaluation: Split the dataset into training and testing sets. Train the selected model on the training set and evaluate its performance on the testing set. Tune the model's hyperparameters to reduce its prediction error, and use cross-validation to assess how well it generalizes.
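
The cross-validation idea can be sketched without any particular model. The example below splits the sample indices into k folds and scores a baseline that always predicts the mean training rating. The function names and the baseline are assumptions chosen for illustration, not the notebook's eventual method.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs that partition range(n_samples) into k folds.

    Assumes n_samples >= k so that no fold is empty.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def cross_validate_mean_baseline(ratings, k=5):
    """Average MAE across k folds of a baseline that predicts the training mean."""
    fold_maes = []
    for train_idx, test_idx in k_fold_indices(len(ratings), k):
        train_mean = sum(ratings[j] for j in train_idx) / len(train_idx)
        mae = sum(abs(ratings[j] - train_mean) for j in test_idx) / len(test_idx)
        fold_maes.append(mae)
    return sum(fold_maes) / len(fold_maes)

# Made-up ratings, for illustration only
sample_ratings = [5.0, 4.5, 4.0, 3.5, 5.0, 4.0, 3.0, 4.5, 5.0, 4.0]
print(cross_validate_mean_baseline(sample_ratings, k=5))
```

A baseline like this gives a reference error that any trained model should beat.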

## Objectives

## Data Understanding

In [18]:
# Importing necessary libraries
import pandas as pd
import json
import glob

In [19]:
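# Read several JSON files and merge them into a single DataFrame
def read_json_files(json_files):
    """Load each JSON file into a DataFrame and concatenate the results."""
    dfs = []
    for file in json_files:
        # Assumes the files are UTF-8 encoded, the standard JSON encoding
        with open(file, encoding="utf-8") as f:
            json_data = json.load(f)
        dfs.append(pd.DataFrame(json_data))

    # ignore_index=True gives the merged frame a fresh 0..n-1 index
    merged_df = pd.concat(dfs, ignore_index=True)
    return merged_df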


In [25]:
# Forward slashes work on Windows, macOS, and Linux alike
json_files = ['Data/Egypt.json', 'Data/Ethiopia.json', 'Data/Kenya.json', 'Data/Rwanda.json', 'Data/DRC.json',
              'Data/Nigeria.json', 'Data/Uganda.json', 'Data/Madagascar.json', 'Data/Morocco.json',
              'Data/Tanzania.json', 'Data/Seychelles.json', 'Data/namibia.json', 'Data/southafrica.json',
              'Data/malawi.json', 'Data/capeverde.json', 'Data/ghana.json', 'Data/botswana.json', 'Data/zambia.json',
              'Data/senegal.json']
df = read_json_files(json_files)
df.head()

Unnamed: 0,id,type,category,subcategories,name,locationString,description,image,photoCount,awards,...,hours,menuWebUrl,establishmentTypes,ownersTopReasons,rentalDescriptions,photos,bedroomInfo,bathroomInfo,bathCount,baseDailyRate
0,4022415,ATTRACTION,attraction,[Nightlife],Soho House Sharm El Sheikh,"Sharm El Sheikh, South Sinai, Red Sea and Sinai",Welcome to Soho House Sharm El Sheikh! The bes...,https://media-cdn.tripadvisor.com/media/photo-...,119,[],...,,,,,,,,,,
1,19730066,ATTRACTION,attraction,"[Shopping, Museums]",Nobles Art Gallery,"Luxor, Nile River Valley",Nobles Art Gallery is the best store in Luxor ...,https://media-cdn.tripadvisor.com/media/photo-...,105,[],...,,,,,,,,,,
2,8011182,ATTRACTION,attraction,[Outdoor Activities],YallaHorse Riding,"El Gouna, Hurghada, Red Sea and Sinai",Riding in El Gouna is an unforgettable experie...,https://media-cdn.tripadvisor.com/media/photo-...,362,[],...,,,,,,,,,,
3,7371664,ATTRACTION,attraction,[Spas & Wellness],Mividaspa at Jaz Aquamarine Resort,"Hurghada, Red Sea and Sinai",Mividaspa is fast earning a top reputation due...,https://media-cdn.tripadvisor.com/media/photo-...,67,[],...,,,,,,,,,,
4,17523327,ATTRACTION,attraction,"[Other, Transportation]",Sharm Airport Transfers Karim,"Sharm El Sheikh, South Sinai, Red Sea and Sinai",Airport transfer service safe reliable drivers...,https://media-cdn.tripadvisor.com/media/photo-...,25,[],...,,,,,,,,,,
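
The notebook imports `glob` but lists the files by hand. As an alternative, the file list could be discovered automatically, as in the sketch below. The helper name `find_json_files` is hypothetical, and the sketch assumes every `.json` file in the `Data` folder should be loaded.

```python
import glob
import os

def find_json_files(data_dir="Data"):
    """Return the sorted paths of all .json files directly inside data_dir."""
    pattern = os.path.join(data_dir, "*.json")
    # sorted() makes the load order deterministic across platforms
    return sorted(glob.glob(pattern))

print(find_json_files("Data"))
```

The returned list can be passed to `read_json_files` in place of the hard-coded list, so that new country files are picked up without editing the notebook.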


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35836 entries, 0 to 35835
Data columns (total 65 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     35836 non-null  object 
 1   type                   35836 non-null  object 
 2   category               35836 non-null  object 
 3   subcategories          34497 non-null  object 
 4   name                   35836 non-null  object 
 5   locationString         34497 non-null  object 
 6   description            20129 non-null  object 
 7   image                  28495 non-null  object 
 8   photoCount             35836 non-null  int64  
 9   awards                 34497 non-null  object 
 10  rankingPosition        26570 non-null  float64
 11  rating                 26706 non-null  float64
 12  rawRanking             26570 non-null  float64
 13  phone                  24666 non-null  object 
 14  address                34494 non-null  object 
 15  ad

In [26]:
# Save the merged data to CSV
# Note: list- and dict-valued columns are written as their string representations
df.to_csv('compiled_data.csv', index=False)

In [22]:
df.columns

Index(['id', 'type', 'category', 'subcategories', 'name', 'locationString',
       'description', 'image', 'photoCount', 'awards', 'rankingPosition',
       'rating', 'rawRanking', 'phone', 'address', 'addressObj', 'localName',
       'localAddress', 'localLangCode', 'email', 'latitude', 'longitude',
       'webUrl', 'website', 'rankingString', 'rankingDenominator',
       'neighborhoodLocations', 'nearestMetroStations', 'ancestorLocations',
       'ratingHistogram', 'numberOfReviews', 'reviewTags', 'reviews',
       'booking', 'offerGroup', 'subtype', 'hotelClass',
       'hotelClassAttribution', 'amenities', 'numberOfRooms', 'priceLevel',
       'priceRange', 'roomTips', 'checkInDate', 'checkOutDate', 'offers',
       'guideFeaturedInCopy', 'isClosed', 'isLongClosed', 'openNowText',
       'cuisines', 'mealTypes', 'dishes', 'features', 'dietaryRestrictions',
       'hours', 'menuWebUrl', 'establishmentTypes', 'ownersTopReasons',
       'rentalDescriptions', 'photos', 'bedroomInfo', '

In [24]:
# Return the percentage of missing values in each column
def missing_values_percentage(df):
    return df.isnull().sum() / len(df) * 100

column_percentages = missing_values_percentage(df)
columns_above_50_percent = column_percentages[column_percentages > 50]
print(columns_above_50_percent)


localAddress              88.796183
localLangCode             81.141869
booking                   91.874093
offerGroup                91.874093
subtype                   74.148901
hotelClassAttribution     80.154035
numberOfRooms             52.349593
priceLevel                60.288537
priceRange                60.545262
guideFeaturedInCopy       99.785132
isClosed                  98.839156
isLongClosed              98.839156
openNowText               99.486550
cuisines                  98.839156
mealTypes                 98.839156
dishes                    98.839156
features                  98.839156
dietaryRestrictions       98.839156
hours                     99.486550
menuWebUrl                99.880009
establishmentTypes        98.839156
ownersTopReasons         100.000000
rentalDescriptions        96.263534
photos                    96.263534
bedroomInfo               96.263534
bathroomInfo              96.263534
bathCount                 96.263534
baseDailyRate             96

In [31]:
# Columns retained for the analysis
columns_to_keep = ["id", "description", "category", "subcategories", "name", "addressObj",
                   "rating", "latitude", "longitude", "numberOfReviews", "reviewTags",
                   "amenities", "priceRange"]

# Select the desired columns; .copy() avoids modifying a view of df later on
df_filtered = df[columns_to_keep].copy()
df_filtered.head()

Unnamed: 0,id,description,category,subcategories,name,addressObj,rating,latitude,longitude,numberOfReviews,reviewTags,amenities,priceRange
0,4022415,Welcome to Soho House Sharm El Sheikh! The bes...,attraction,[Nightlife],Soho House Sharm El Sheikh,"{'street1': 'Soho Square, White Knight Beach',...",5.0,27.962564,34.39381,198,"[{'text': 'nice cocktails', 'reviews': 4}, {'t...",,
1,19730066,Nobles Art Gallery is the best store in Luxor ...,attraction,"[Shopping, Museums]",Nobles Art Gallery,"{'street1': '17 Corniche El Nile Street', 'str...",5.0,,,211,"[{'text': 'winter palace', 'reviews': 16}, {'t...",,
2,8011182,Riding in El Gouna is an unforgettable experie...,attraction,[Outdoor Activities],YallaHorse Riding,"{'street1': None, 'street2': None, 'city': 'El...",5.0,27.40301,33.670258,269,"[{'text': 'well taken care', 'reviews': 10}, {...",,
3,7371664,Mividaspa is fast earning a top reputation due...,attraction,[Spas & Wellness],Mividaspa at Jaz Aquamarine Resort,"{'street1': 'South Sahl Hashish Road.', 'stree...",5.0,27.092207,33.84492,372,"[{'text': 'indian head massage', 'reviews': 2}...",,
4,17523327,Airport transfer service safe reliable drivers...,attraction,"[Other, Transportation]",Sharm Airport Transfers Karim,"{'street1': 'King Abdullah Street Naama Bay', ...",5.0,,,351,"[{'text': 'always on time', 'reviews': 31}, {'...",,


## EDA and Data Munging

## Modelling

## Model Evaluation

## Tuning

## Deployment

## Conclusion and Recommendations