# Travel Destination Recommendation System Notebook

#### Authors
* 1
* 2 
* 3
* 4
* 5
* 6


## Problem Statement

The goal is to build a machine learning model that predicts hotel ratings from customer reviews, budget, location, and type of residence. The dataset was scraped from TripAdvisor and contains information about various hotels, including their ratings, reviews, amenities, pricing, geographical coordinates, and residence types (e.g., hotel, bed and breakfast, specialty lodging). By analyzing the review text together with these additional factors, the model should accurately predict the ratings of new, unseen hotels.

Approach:

Data Preprocessing: Clean the text reviews by removing stopwords and punctuation and by tokenizing the text. Convert the text data into a numerical representation suitable for modeling. Handle missing values, if any, in the budget, location, and residence type columns.
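
The preprocessing step can be sketched with the standard library alone. This is a minimal illustration, not the notebook's final pipeline: the stop-word set, the `preprocess_review` and `bag_of_words` helpers, and the sample review are all hypothetical.

```python
import string

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration);
# a real pipeline would use a fuller list such as the one shipped with NLTK.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "to", "of", "in", "it"}


def preprocess_review(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in cleaned.split() if token not in STOP_WORDS]


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    return [tokens.count(word) for word in vocabulary]


# Made-up review text and vocabulary, used only to show the shape of the output
tokens = preprocess_review("The pool was clean, and the staff was friendly!")
vocabulary = ["pool", "clean", "staff", "friendly", "noisy"]
vector = bag_of_words(tokens, vocabulary)
print(tokens)
print(vector)
```

A count vector like this is only one possible numerical representation; TF-IDF weights or embeddings would slot into the same place.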

Feature Engineering: Extract additional features from the reviews, such as sentiment scores and review length. Engineer new features related to budget, location, and residence type, such as price range categories, geographical distance from landmarks, and one-hot encodings of residence types.
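
Three of the features named above can be sketched as plain functions. The price thresholds, the category labels, and the helper names are assumptions made for illustration; the distance uses the standard haversine formula on latitude and longitude in degrees.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def price_category(price, low=50, high=150):
    """Bucket a nightly price; the thresholds are illustrative assumptions."""
    if price < low:
        return "budget"
    if price < high:
        return "mid-range"
    return "luxury"


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]


residence_types = ["hotel", "bed and breakfast", "specialty lodging"]
print(price_category(100))
print(one_hot("hotel", residence_types))
```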

Model Selection: Experiment with different supervised learning models, such as linear regression, decision trees, random forests, or neural networks, to find the one that best predicts hotel ratings from customer reviews, budget, location, and residence type. Compare the models using regression metrics such as mean squared error (MSE) or mean absolute error (MAE).
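
For reference, both metrics are simple averages over the prediction errors. The sketch below computes them by hand on made-up ratings; the numbers are illustrative and are not results from this dataset.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences; penalizes large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences; expressed in the same units as the rating."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual and predicted ratings
actual = [4.5, 3.0, 5.0, 2.5]
predicted = [4.0, 3.5, 4.5, 3.5]

mse = mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
print(mse, mae)
```

Because MSE squares each error, the single 1.0-point miss in this example contributes more to it than the three 0.5-point misses combined.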

Model Training and Evaluation: Split the dataset into training and testing sets. Train the selected model on the training set and fine-tune its hyperparameters using cross-validation on the training data. Evaluate the final model on the held-out testing set to assess how well it generalizes.
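
The splitting logic can be sketched on row indices alone. This is a minimal illustration assuming 100 samples, an 80/20 split, and 5 folds; the function names are hypothetical, and a library routine would normally replace them.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Hold out a test set first, then cross-validate on the training indices only
train_idx, test_idx = train_test_split_indices(100)
folds = list(k_fold_indices(train_idx, k=5))
print(len(train_idx), len(test_idx), len(folds))
```

Keeping the test indices out of the folds means hyperparameters are tuned without ever looking at the data used for the final evaluation.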

## Objectives

## Data Understanding

In [51]:
# Importing necessary libraries
import pandas as pd
import json
import glob

In [52]:
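# Read several JSON files and merge them into a single DataFrame
def read_json_files(json_files):
    dfs = []
    for file in json_files:
        # Assumes each file holds a list of records and is UTF-8 encoded
        with open(file, encoding='utf-8') as f:
            json_data = json.load(f)
        dfs.append(pd.DataFrame(json_data))

    # ignore_index=True gives the merged rows one continuous index
    merged_df = pd.concat(dfs, ignore_index=True)
    return merged_df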


In [53]:
json_files = ['Data\\Egypt.json', 'Data\\Ethiopia.json', 'Data\\Kenya.json', 'Data\\Rwanda.json', 'Data\\DRC.json',
               'Data\\Nigeria.json', 'Data\\Uganda.json', 'Data\\Madagascar.json', 'Data\\Morocco.json',
               'Data\\Tanzania.json', 'Data\\Seychelles.json', 'Data\\namibia.json', 'Data\\southafrica.json', 
               'Data\\malawi.json', 'Data\\capeverde.json', 'Data\\ghana.json', 'Data\\botswana.json', 'Data\\zambia.json' , 
               'Data\\senegal.json']
df = read_json_files(json_files)
df.head()

Unnamed: 0,id,type,category,subcategories,name,locationString,description,image,photoCount,awards,...,hours,menuWebUrl,establishmentTypes,ownersTopReasons,rentalDescriptions,photos,bedroomInfo,bathroomInfo,bathCount,baseDailyRate
0,4022415,ATTRACTION,attraction,[Nightlife],Soho House Sharm El Sheikh,"Sharm El Sheikh, South Sinai, Red Sea and Sinai",Welcome to Soho House Sharm El Sheikh! The bes...,https://media-cdn.tripadvisor.com/media/photo-...,119,[],...,,,,,,,,,,
1,19730066,ATTRACTION,attraction,"[Shopping, Museums]",Nobles Art Gallery,"Luxor, Nile River Valley",Nobles Art Gallery is the best store in Luxor ...,https://media-cdn.tripadvisor.com/media/photo-...,105,[],...,,,,,,,,,,
2,8011182,ATTRACTION,attraction,[Outdoor Activities],YallaHorse Riding,"El Gouna, Hurghada, Red Sea and Sinai",Riding in El Gouna is an unforgettable experie...,https://media-cdn.tripadvisor.com/media/photo-...,362,[],...,,,,,,,,,,
3,7371664,ATTRACTION,attraction,[Spas & Wellness],Mividaspa at Jaz Aquamarine Resort,"Hurghada, Red Sea and Sinai",Mividaspa is fast earning a top reputation due...,https://media-cdn.tripadvisor.com/media/photo-...,67,[],...,,,,,,,,,,
4,17523327,ATTRACTION,attraction,"[Other, Transportation]",Sharm Airport Transfers Karim,"Sharm El Sheikh, South Sinai, Red Sea and Sinai",Airport transfer service safe reliable drivers...,https://media-cdn.tripadvisor.com/media/photo-...,25,[],...,,,,,,,,,,
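
The file paths in the loading cell use Windows-style backslashes, so they may not resolve on Linux or macOS. A portable variant could build them with `pathlib`, as sketched below. The `json_paths` helper is hypothetical; only the `Data` directory name and the file names come from the cell above.

```python
from pathlib import Path

DATA_DIR = Path("Data")  # directory name taken from the loading cell


def json_paths(file_names, data_dir=DATA_DIR):
    """Join each file name onto the data directory using the OS path separator."""
    return [data_dir / name for name in file_names]


# Two of the country files, used here only as an example
paths = json_paths(["Egypt.json", "Kenya.json"])
print([str(p) for p in paths])
```

If every JSON file in the directory should be loaded, `sorted(DATA_DIR.glob("*.json"))` would avoid listing the names by hand.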


In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35836 entries, 0 to 35835
Data columns (total 65 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     35836 non-null  object 
 1   type                   35836 non-null  object 
 2   category               35836 non-null  object 
 3   subcategories          34497 non-null  object 
 4   name                   35836 non-null  object 
 5   locationString         34497 non-null  object 
 6   description            20129 non-null  object 
 7   image                  28495 non-null  object 
 8   photoCount             35836 non-null  int64  
 9   awards                 34497 non-null  object 
 10  rankingPosition        26570 non-null  float64
 11  rating                 26706 non-null  float64
 12  rawRanking             26570 non-null  float64
 13  phone                  24666 non-null  object 
 14  address                34494 non-null  object 
 15  ad

In [55]:
# converting to csv
# df.to_csv('compiled_data.csv', index=False)

In [56]:
df.columns

Index(['id', 'type', 'category', 'subcategories', 'name', 'locationString',
       'description', 'image', 'photoCount', 'awards', 'rankingPosition',
       'rating', 'rawRanking', 'phone', 'address', 'addressObj', 'localName',
       'localAddress', 'localLangCode', 'email', 'latitude', 'longitude',
       'webUrl', 'website', 'rankingString', 'rankingDenominator',
       'neighborhoodLocations', 'nearestMetroStations', 'ancestorLocations',
       'ratingHistogram', 'numberOfReviews', 'reviewTags', 'reviews',
       'booking', 'offerGroup', 'subtype', 'hotelClass',
       'hotelClassAttribution', 'amenities', 'numberOfRooms', 'priceLevel',
       'priceRange', 'roomTips', 'checkInDate', 'checkOutDate', 'offers',
       'guideFeaturedInCopy', 'isClosed', 'isLongClosed', 'openNowText',
       'cuisines', 'mealTypes', 'dishes', 'features', 'dietaryRestrictions',
       'hours', 'menuWebUrl', 'establishmentTypes', 'ownersTopReasons',
       'rentalDescriptions', 'photos', 'bedroomInfo', '

In [57]:
# Define a function that returns the percentage of missing values in each column

def missing_values_percentage(df):
    return df.isnull().sum() / len(df) * 100

column_percentages = missing_values_percentage(df)
columns_above_50_percent = column_percentages[column_percentages > 50]
print(columns_above_50_percent)


localAddress              88.796183
localLangCode             81.141869
booking                   91.874093
offerGroup                91.874093
subtype                   74.148901
hotelClassAttribution     80.154035
numberOfRooms             52.349593
priceLevel                60.288537
priceRange                60.545262
guideFeaturedInCopy       99.785132
isClosed                  98.839156
isLongClosed              98.839156
openNowText               99.486550
cuisines                  98.839156
mealTypes                 98.839156
dishes                    98.839156
features                  98.839156
dietaryRestrictions       98.839156
hours                     99.486550
menuWebUrl                99.880009
establishmentTypes        98.839156
ownersTopReasons         100.000000
rentalDescriptions        96.263534
photos                    96.263534
bedroomInfo               96.263534
bathroomInfo              96.263534
bathCount                 96.263534
baseDailyRate             96
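
One way to act on these percentages is to drop every column above a chosen cutoff. The sketch below assumes a 50% threshold and uses a toy DataFrame; `drop_sparse_columns` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def drop_sparse_columns(frame, threshold=50.0):
    """Return a copy of `frame` without columns whose missing share exceeds `threshold` percent."""
    missing_pct = frame.isnull().mean() * 100
    sparse_columns = missing_pct[missing_pct > threshold].index
    return frame.drop(columns=sparse_columns)


# Toy data (made up): 'rating' is 25% missing, 'menuWebUrl' is 75% missing
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "rating": [4.5, None, 4.0, 3.5],
    "menuWebUrl": [None, None, None, "x"],
})
trimmed = drop_sparse_columns(toy)
print(list(trimmed.columns))
```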

In [58]:
# Check null values and filter columns with more than 10000 null values
null_counts = df.isnull().sum()
columns_above_threshold = null_counts[null_counts > 10000].index
# Print the columns with more than 10000 null values
list(columns_above_threshold)

['description',
 'phone',
 'localAddress',
 'localLangCode',
 'email',
 'website',
 'booking',
 'offerGroup',
 'subtype',
 'hotelClass',
 'hotelClassAttribution',
 'numberOfRooms',
 'priceLevel',
 'priceRange',
 'roomTips',
 'checkInDate',
 'checkOutDate',
 'offers',
 'guideFeaturedInCopy',
 'isClosed',
 'isLongClosed',
 'openNowText',
 'cuisines',
 'mealTypes',
 'dishes',
 'features',
 'dietaryRestrictions',
 'hours',
 'menuWebUrl',
 'establishmentTypes',
 'ownersTopReasons',
 'rentalDescriptions',
 'photos',
 'bedroomInfo',
 'bathroomInfo',
 'bathCount',
 'baseDailyRate']

In [59]:
# Columns retained for analysis; the sparsely populated columns listed above are left out
columns_to_keep = ['id','type', 'category', 'subcategories', 'name', 'locationString','image','photoCount','awards',
 'rankingPosition','rating','rawRanking','address','addressObj', 'localName', 'latitude','longitude',
'webUrl', 'rankingString','rankingDenominator','neighborhoodLocations', 'nearestMetroStations','ancestorLocations', 
'ratingHistogram','numberOfReviews','reviewTags','reviews','amenities']

# Select the desired columns from the DataFrame
df_filtered = df[columns_to_keep]
df_filtered.head()

Unnamed: 0,id,type,category,subcategories,name,locationString,image,photoCount,awards,rankingPosition,...,rankingString,rankingDenominator,neighborhoodLocations,nearestMetroStations,ancestorLocations,ratingHistogram,numberOfReviews,reviewTags,reviews,amenities
0,4022415,ATTRACTION,attraction,[Nightlife],Soho House Sharm El Sheikh,"Sharm El Sheikh, South Sinai, Red Sea and Sinai",https://media-cdn.tripadvisor.com/media/photo-...,119,[],2.0,...,#2 of 45 Nightlife in Sharm El Sheikh,45,[],[],"[{'id': '297555', 'name': 'Sharm El Sheikh', '...","{'count1': 1, 'count2': 3, 'count3': 4, 'count...",198,"[{'text': 'nice cocktails', 'reviews': 4}, {'t...",[],
1,19730066,ATTRACTION,attraction,"[Shopping, Museums]",Nobles Art Gallery,"Luxor, Nile River Valley",https://media-cdn.tripadvisor.com/media/photo-...,105,[],1.0,...,#1 of 59 Shopping in Luxor,59,[],[],"[{'id': '294205', 'name': 'Luxor', 'abbreviati...","{'count1': 0, 'count2': 1, 'count3': 0, 'count...",211,"[{'text': 'winter palace', 'reviews': 16}, {'t...",[],
2,8011182,ATTRACTION,attraction,[Outdoor Activities],YallaHorse Riding,"El Gouna, Hurghada, Red Sea and Sinai",https://media-cdn.tripadvisor.com/media/photo-...,362,[],4.0,...,#4 of 86 Outdoor Activities in El Gouna,86,[],[],"[{'id': '297548', 'name': 'El Gouna', 'abbrevi...","{'count1': 0, 'count2': 1, 'count3': 1, 'count...",269,"[{'text': 'well taken care', 'reviews': 10}, {...",[],
3,7371664,ATTRACTION,attraction,[Spas & Wellness],Mividaspa at Jaz Aquamarine Resort,"Hurghada, Red Sea and Sinai",https://media-cdn.tripadvisor.com/media/photo-...,67,[],1.0,...,#1 of 35 Spas & Wellness in Hurghada,35,[],[],"[{'id': '297549', 'name': 'Hurghada', 'abbrevi...","{'count1': 1, 'count2': 1, 'count3': 5, 'count...",372,"[{'text': 'indian head massage', 'reviews': 2}...",[],
4,17523327,ATTRACTION,attraction,"[Other, Transportation]",Sharm Airport Transfers Karim,"Sharm El Sheikh, South Sinai, Red Sea and Sinai",https://media-cdn.tripadvisor.com/media/photo-...,25,[],1.0,...,#1 of 104 Transportation in Sharm El Sheikh,104,[],[],"[{'id': '297555', 'name': 'Sharm El Sheikh', '...","{'count1': 1, 'count2': 1, 'count3': 1, 'count...",351,"[{'text': 'always on time', 'reviews': 31}, {'...",[],


**'numberOfReviews', 'reviewTags' and 'reviews' columns**

In [66]:
review_df = df_filtered[['numberOfReviews', 'reviewTags', 'reviews']]
review_df


Unnamed: 0,numberOfReviews,reviewTags,reviews
0,198,"[{'text': 'nice cocktails', 'reviews': 4}, {'t...",[]
1,211,"[{'text': 'winter palace', 'reviews': 16}, {'t...",[]
2,269,"[{'text': 'well taken care', 'reviews': 10}, {...",[]
3,372,"[{'text': 'indian head massage', 'reviews': 2}...",[]
4,351,"[{'text': 'always on time', 'reviews': 31}, {'...",[]
...,...,...,...
35831,0,[],[]
35832,0,[],[]
35833,0,[],[]
35834,0,[],[]


In [62]:
# Number of missing values in the reviewTags column
df_filtered['reviewTags'].isnull().sum()

1339

- The *'numberOfReviews'* column represents the number of reviews for each tourist destination.
- The *'reviewTags'* column holds the tags associated with the reviews. It has 1339 missing values.
- The *'reviews'* column holds the same empty list '[ ]' in the rows shown above, so it provides no useful information about the data and does not contribute to the analysis or modeling process.

- The *'reviewTags'* column contains a list of dictionaries, where each dictionary represents one tag. Each dictionary has two keys: 'text' and 'reviews'. The 'text' key holds the tag text, and the 'reviews' key holds the number of reviews associated with that tag.

- The tags therefore summarize the feedback that reviewers gave about each destination, together with how many reviews each tag is associated with.
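
A list of dictionaries with 'text' and 'reviews' keys can be flattened into plain text, or ranked by review count, before any text processing. The helpers below are a hypothetical sketch. The first sample entry mirrors a value visible in the output above; the second is made up.

```python
def tags_to_text(review_tags):
    """Join the 'text' of every tag into one string; anything that is not a list yields ''."""
    if not isinstance(review_tags, list):
        return ""  # covers None and NaN, which pandas uses for missing cells
    return " ".join(tag["text"] for tag in review_tags)


def top_tags(review_tags, n=3):
    """Return the texts of the n tags with the highest 'reviews' count."""
    if not isinstance(review_tags, list):
        return []
    ranked = sorted(review_tags, key=lambda tag: tag["reviews"], reverse=True)
    return [tag["text"] for tag in ranked[:n]]


sample = [
    {"text": "nice cocktails", "reviews": 4},
    {"text": "great music", "reviews": 9},  # hypothetical entry
]
print(tags_to_text(sample))
print(top_tags(sample, n=1))
```

Applied to a whole column, these could be used as `df_filtered['reviewTags'].apply(tags_to_text)`, assuming every non-missing cell is already a Python list rather than a string.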



In [70]:
import string

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Note: the NLTK stop-word and tokenizer resources must be downloaded once with nltk.download()

# Example review tags
review_tags = ['Well taken care', 'Experienced riders', 'Beautiful horses!', 'Nice photos']

# Convert to lowercase and remove all ASCII punctuation (not only '!')
punctuation_table = str.maketrans('', '', string.punctuation)
review_tags_cleaned = [tag.lower().translate(punctuation_table) for tag in review_tags]

# Remove stop words
stop_words = set(stopwords.words('english'))
review_tags_cleaned = [
    ' '.join(word for word in word_tokenize(tag) if word not in stop_words)
    for tag in review_tags_cleaned
]

# Create a DataFrame with the cleaned review tags
# (named tags_df so the main df loaded earlier is not overwritten)
tags_df = pd.DataFrame({'reviewTags': review_tags_cleaned})
print(tags_df)


           reviewTags
0     well taken care
1  experienced riders
2    beautiful horses
3         nice photos


In [32]:
# converting to csv
# df_filtered.to_csv('condensed_data.csv', index=False)

## EDA and Data Munging

## Modelling

## Model Evaluation

## Tuning

## Deployment

## Conclusion and Recommendations