# Travel Destination Recommendation System Notebook

#### Authors
* 1
* 2 
* 3
* 4
* 5
* 6


## Problem Statement

The goal is to build a machine learning model that predicts hotel ratings from customer reviews, budget, location, and type of residence. The dataset was scraped from TripAdvisor and contains information about various hotels, including their ratings, reviews, amenities, pricing, geographical coordinates, and residence types (e.g., hotel, bed and breakfast, specialty lodging). By analyzing the review text together with these additional factors, we aim to develop a model that accurately predicts the ratings of new, unseen hotels.

Approach:

Data Preprocessing: Clean and preprocess the text reviews by removing stopwords, punctuation, and performing tokenization. Convert the text data into a numerical representation suitable for modeling. Handle missing values, if any, in the budget, location, and residence type columns.
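
As a rough illustration of the cleaning steps just described, the sketch below lowercases a review, strips punctuation, tokenizes on whitespace, removes stopwords, and turns the result into a simple word-count vector. The stopword list, the vocabulary, and the function names `preprocess_review` and `bag_of_words` are all made up for this example; a real pipeline would likely use a fuller stopword list and a library vectorizer.

```python
import string

# Hypothetical, deliberately tiny stopword list used only for illustration
STOPWORDS = {"a", "an", "and", "the", "is", "was", "were", "of", "to", "in", "it", "very"}

def preprocess_review(text):
    """Lowercase, replace punctuation with spaces, tokenize, drop stopwords."""
    text = text.lower()
    # Map every punctuation character to a space so words stay separated
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.translate(table).split()
    return [tok for tok in tokens if tok not in STOPWORDS]

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    return [tokens.count(word) for word in vocabulary]

tokens = preprocess_review("The room was clean, and the staff were very friendly!")
vocabulary = ["clean", "friendly", "room", "staff", "dirty"]  # assumed vocabulary
vector = bag_of_words(tokens, vocabulary)
print(tokens)
print(vector)
```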

Feature Engineering: Extract additional features from the dataset, such as review sentiment scores, review length, and any other relevant information. Engineer new features related to budget, location, and residence type, such as price range categories, geographical distance from landmarks, and one-hot encoding of residence types.
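
The three engineered features named above could be sketched roughly as follows. Everything here is illustrative: the toy rows, the numeric `price` column, the price cut points, and the landmark coordinates are invented, and using `subtype` as the residence-type column is an assumption about the dataset.

```python
import math
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

# Toy rows invented for this sketch (not taken from the scraped data)
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "latitude": [-1.2921, -4.0435, -1.2921],
    "longitude": [36.8219, 39.6682, 36.8219],
    "price": [40, 120, 300],
    "subtype": ["hotel", "bed and breakfast", "specialty lodging"],
})

# Hypothetical landmark coordinates
landmark_lat, landmark_lon = -1.2921, 36.8219

# Distance of each row from the landmark
toy["distance_km"] = toy.apply(
    lambda row: haversine_km(row["latitude"], row["longitude"], landmark_lat, landmark_lon),
    axis=1,
)

# Price range categories with assumed cut points
toy["price_band"] = pd.cut(
    toy["price"], bins=[0, 50, 150, float("inf")], labels=["budget", "mid", "luxury"]
)

# One-hot encode the residence type
toy = pd.get_dummies(toy, columns=["subtype"], prefix="subtype")
print(toy.head())
```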

Model Selection: Experiment with different supervised learning models, such as linear regression, decision trees, random forests, or neural networks, to find the best model for predicting hotel ratings considering customer reviews, budget, location, and residence type. Evaluate the models using appropriate evaluation metrics like mean squared error (MSE) or mean absolute error (MAE).
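
For reference, the two evaluation metrics mentioned above can be written out in a few lines. This is a plain-Python sketch on invented ratings; in practice a library implementation (for example the functions of the same name in scikit-learn) would normally be used.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented ratings, used only to exercise the two functions
y_true = [4.0, 3.5, 5.0, 2.0]
y_pred = [4.5, 3.5, 4.0, 3.0]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(mse, mae)
```

MSE penalizes large errors more heavily than MAE because each error is squared before averaging, which is worth keeping in mind when comparing models on a bounded rating scale.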

Model Training and Evaluation: Split the dataset into training and testing sets. Train the selected model on the training set and evaluate its performance on the testing set. Fine-tune the model parameters to improve its accuracy. Perform cross-validation to assess the model's generalization capabilities.
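
The splitting and cross-validation steps can be illustrated with row indices alone. The helper names `train_test_split_indices` and `k_fold_indices`, the 80/20 split, and the seed are assumptions made for this sketch; scikit-learn's `train_test_split` and `KFold` provide production-ready equivalents.

```python
import random

def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for each of k folds."""
    # Deal the indices round-robin into k folds
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

train_idx, test_idx = train_test_split_indices(10, test_fraction=0.2)
folds = list(k_fold_indices(train_idx, k=4))
print(len(train_idx), len(test_idx), len(folds))
```

The test indices are set aside once and never appear in any fold, so the cross-validation scores are computed on the training portion only.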

## Objectives

## Data Understanding

In [40]:
# Importing necessary libraries
import pandas as pd
import json
import glob

In [41]:
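# Read several JSON files and combine them into a single DataFrame
def read_json_files(json_files):
    dfs = []
    for file in json_files:
        with open(file) as f:
            json_data = json.load(f)
        # Load the parsed JSON into its own DataFrame
        dfs.append(pd.DataFrame(json_data))

    # Stack the per-file frames and renumber the rows from 0
    merged_df = pd.concat(dfs, ignore_index=True)
    return merged_df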


In [42]:
json_files = ['../Data/botswana.json', '../Data/capeverde.json', '../Data/drc.json', '../Data/egypt.json', '../Data/ethiopia.json', '../Data/ghana.json', '../Data/kenya.json',
              '../Data/madagascar.json', '../Data/malawi.json', '../Data/morocco.json', '../Data/namibia.json', '../Data/nigeria.json', '../Data/rwanda.json',
              '../Data/senegal.json', '../Data/seychelles.json', '../Data/south_africa.json', '../Data/tanzania.json', '../Data/uganda.json', '../Data/zambia.json']
df = read_json_files(json_files)
df.head()
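
The notebook imports `glob` but lists the files by hand. One possible alternative, sketched below, is to collect the paths with a pattern. The helper name `list_json_files` is hypothetical, and the approach assumes every `.json` file in the folder is a country file that should be loaded.

```python
import glob
import os
import tempfile

def list_json_files(data_dir):
    """Return the sorted paths of all .json files directly inside data_dir."""
    return sorted(glob.glob(os.path.join(data_dir, "*.json")))

# Demonstration on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for filename in ["kenya.json", "botswana.json", "notes.txt"]:
        open(os.path.join(tmp, filename), "w").close()
    found = [os.path.basename(path) for path in list_json_files(tmp)]

print(found)
```

In the notebook this would amount to passing `list_json_files('../Data')` to `read_json_files`, at the cost of silently picking up any extra JSON file placed in that folder.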

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35836 entries, 0 to 35835
Data columns (total 65 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     35836 non-null  object 
 1   type                   35836 non-null  object 
 2   category               35836 non-null  object 
 3   subcategories          34497 non-null  object 
 4   name                   35836 non-null  object 
 5   locationString         34497 non-null  object 
 6   description            20129 non-null  object 
 7   image                  28495 non-null  object 
 8   photoCount             35836 non-null  int64  
 9   awards                 34497 non-null  object 
 10  rankingPosition        26570 non-null  float64
 11  rating                 26706 non-null  float64
 12  rawRanking             26570 non-null  float64
 13  phone                  24666 non-null  object 
 14  address                34494 non-null  object 
 15  ad

In [None]:
# converting to csv
# df.to_csv('compiled_data.csv', index=False)

In [None]:
df.columns

Index(['id', 'type', 'category', 'subcategories', 'name', 'locationString',
       'description', 'image', 'photoCount', 'awards', 'rankingPosition',
       'rating', 'rawRanking', 'phone', 'address', 'addressObj', 'localName',
       'localAddress', 'email', 'latitude', 'longitude', 'webUrl', 'website',
       'rankingString', 'rankingDenominator', 'neighborhoodLocations',
       'nearestMetroStations', 'ancestorLocations', 'ratingHistogram',
       'numberOfReviews', 'reviewTags', 'reviews', 'booking', 'offerGroup',
       'subtype', 'hotelClass', 'amenities', 'numberOfRooms', 'priceLevel',
       'priceRange', 'roomTips', 'checkInDate', 'checkOutDate', 'offers',
       'hotelClassAttribution', 'localLangCode', 'isClosed', 'isLongClosed',
       'openNowText', 'cuisines', 'mealTypes', 'dishes', 'features',
       'dietaryRestrictions', 'hours', 'menuWebUrl', 'establishmentTypes',
       'ownersTopReasons', 'guideFeaturedInCopy', 'rentalDescriptions',
       'photos', 'bedroomInfo', '

In [None]:
# Define a function that returns the percentage of missing values in each column

def missing_values_percentage(df):
    return df.isnull().sum() / len(df) * 100

column_percentages = missing_values_percentage(df)
columns_above_50_percent = column_percentages[column_percentages > 50]
print(columns_above_50_percent)


localAddress              88.796183
booking                   91.874093
offerGroup                91.874093
subtype                   74.148901
numberOfRooms             52.349593
priceLevel                60.288537
priceRange                60.545262
hotelClassAttribution     80.154035
localLangCode             81.141869
isClosed                  98.839156
isLongClosed              98.839156
openNowText               99.486550
cuisines                  98.839156
mealTypes                 98.839156
dishes                    98.839156
features                  98.839156
dietaryRestrictions       98.839156
hours                     99.486550
menuWebUrl                99.880009
establishmentTypes        98.839156
ownersTopReasons         100.000000
guideFeaturedInCopy       99.785132
rentalDescriptions        96.263534
photos                    96.263534
bedroomInfo               96.263534
bathroomInfo              96.263534
bathCount                 96.263534
baseDailyRate             96

In [None]:
# Check null values and filter columns with more than 10000 null values
null_counts = df.isnull().sum()
columns_above_threshold = null_counts[null_counts > 10000].index
# Print the columns with more than 10000 null values
list(columns_above_threshold)

['description',
 'phone',
 'localAddress',
 'email',
 'website',
 'booking',
 'offerGroup',
 'subtype',
 'hotelClass',
 'numberOfRooms',
 'priceLevel',
 'priceRange',
 'roomTips',
 'checkInDate',
 'checkOutDate',
 'offers',
 'hotelClassAttribution',
 'localLangCode',
 'isClosed',
 'isLongClosed',
 'openNowText',
 'cuisines',
 'mealTypes',
 'dishes',
 'features',
 'dietaryRestrictions',
 'hours',
 'menuWebUrl',
 'establishmentTypes',
 'ownersTopReasons',
 'guideFeaturedInCopy',
 'rentalDescriptions',
 'photos',
 'bedroomInfo',
 'bathroomInfo',
 'bathCount',
 'baseDailyRate']
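
The same threshold can also be used to build the list of columns to keep, rather than typing it out. The sketch below shows the idea on a toy frame; the helper name `columns_within_null_limit` and the toy values are invented, and whether the result would match the hand-written list in the next cell exactly is an assumption that has not been checked here.

```python
import pandas as pd

def columns_within_null_limit(df, max_nulls):
    """Return the names of columns whose null count does not exceed max_nulls."""
    null_counts = df.isnull().sum()
    return list(null_counts[null_counts <= max_nulls].index)

# Toy frame invented for this sketch
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "rating": [4.5, None, 3.0, 5.0],
    "menuWebUrl": [None, None, None, "x"],
})

kept = columns_within_null_limit(toy, max_nulls=1)
# .copy() makes the filtered frame independent of the original
toy_filtered = toy[kept].copy()
print(kept)
```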

In [None]:
# Columns to retain for analysis
columns_to_keep = ['id','type', 'category', 'subcategories', 'name', 'locationString','image','photoCount','awards',
 'rankingPosition','rating','rawRanking','address','addressObj', 'localName', 'latitude','longitude',
'webUrl', 'rankingString','rankingDenominator','neighborhoodLocations', 'nearestMetroStations','ancestorLocations', 
'ratingHistogram','numberOfReviews','reviewTags','reviews','amenities']

# Select the desired columns; .copy() returns an independent DataFrame,
# which avoids SettingWithCopyWarning on later assignments
df_filtered = df[columns_to_keep].copy()
df_filtered.head()

Unnamed: 0,id,type,category,subcategories,name,locationString,image,photoCount,awards,rankingPosition,...,rankingString,rankingDenominator,neighborhoodLocations,nearestMetroStations,ancestorLocations,ratingHistogram,numberOfReviews,reviewTags,reviews,amenities
0,1171922,ATTRACTION,attraction,[Sights & Landmarks],Khwai River Bridge,"Okavango Delta, North-West District",https://media-cdn.tripadvisor.com/media/photo-...,24,[],3.0,...,#3 of 5 things to do in Okavango Delta,5,[],[],"[{'id': '472673', 'name': 'Okavango Delta', 'a...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",45,[],[],
1,2513264,ATTRACTION,attraction,[Nature & Parks],Gaborone Game Reserve,"Gaborone, South-East District",https://media-cdn.tripadvisor.com/media/photo-...,84,[],7.0,...,#7 of 25 things to do in Gaborone,25,[],[],"[{'id': '293767', 'name': 'Gaborone', 'abbrevi...","{'count1': 4, 'count2': 11, 'count3': 35, 'cou...",115,"[{'text': 'eland', 'reviews': 7}, {'text': 'an...",[],
2,3247057,ATTRACTION,attraction,[Sights & Landmarks],ISKCON Gaborone,"Gaborone, South-East District",https://media-cdn.tripadvisor.com/media/photo-...,21,[],5.0,...,#5 of 25 things to do in Gaborone,25,[],[],"[{'id': '293767', 'name': 'Gaborone', 'abbrevi...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",23,[],[],
3,478934,ATTRACTION,attraction,[Nature & Parks],Serondela Reserve,"Chobe National Park, North-West District",https://media-cdn.tripadvisor.com/media/photo-...,65,[],3.0,...,#3 of 8 things to do in Chobe National Park,8,[],[],"[{'id': '472669', 'name': 'Chobe National Park...","{'count1': 2, 'count2': 0, 'count3': 0, 'count...",34,"[{'text': 'the river', 'reviews': 6}, {'text':...",[],
4,7931216,ATTRACTION,attraction,[Nature & Parks],Khutse Game Reserve,"Gaborone, South-East District",https://media-cdn.tripadvisor.com/media/photo-...,26,[],8.0,...,#8 of 25 things to do in Gaborone,25,[],[],"[{'id': '293767', 'name': 'Gaborone', 'abbrevi...","{'count1': 0, 'count2': 1, 'count3': 6, 'count...",29,"[{'text': 'bucket shower', 'reviews': 5}, {'te...",[],


**'numberOfReviews', 'reviewTags' and 'reviews' columns**

In [None]:
review_df = pd.DataFrame(df_filtered, columns=['numberOfReviews', 'reviewTags', 'reviews'])
review_df['reviewTags'][5]

[{'text': 'one night', 'reviews': 2},
 {'text': 'farm', 'reviews': 14},
 {'text': 'lucy', 'reviews': 13},
 {'text': 'botswana', 'reviews': 11},
 {'text': 'stay', 'reviews': 5},
 {'text': 'host', 'reviews': 5},
 {'text': 'dinner', 'reviews': 5},
 {'text': 'chickens', 'reviews': 2},
 {'text': 'nata', 'reviews': 2},
 {'text': 'campfire', 'reviews': 2},
 {'text': 'breakfast', 'reviews': 2},
 {'text': 'food', 'reviews': 4},
 {'text': 'gaborone', 'reviews': 2}]

In [None]:
# Number of missing values in the reviewTags column
df_filtered['reviewTags'].isnull().sum()

1339

- The *'numberOfReviews'* column represents the number of reviews for each tourist destination.

- The *'reviews'* column holds the same empty list '[ ]' in its rows, so it provides no useful information about the data and does not contribute to the analysis or modeling process.

- The *'reviewTags'* column holds the tags associated with the reviews. Each value appears to be a list of dictionaries, where each dictionary represents one tag and has two keys: 'text' and 'reviews'. The 'text' key holds the tag itself (a keyword or short phrase drawn from the reviews), and the 'reviews' key holds the number of reviews associated with that tag.

- From this we can infer that the tags summarise the recurring topics in the feedback for a destination, with the count indicating how often each topic comes up.

- We convert each 'reviewTags' value into a list of dictionaries (non-list values become an empty list), extract the 'text' values, and join them, so the column ends up holding a comma-separated string of tags.
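
To make the structure of a 'reviewTags' value concrete, the sketch below ranks the tags of one destination by how many reviews mention them, using entries from the sample output shown earlier. The helper name `top_tags` is made up for this example, and reading 'reviews' as a mention count is the interpretation assumed here.

```python
def top_tags(entries, n=3):
    """Return the n tag texts with the highest 'reviews' count."""
    if not isinstance(entries, list):
        # Missing values (e.g. NaN) carry no tags
        return []
    ranked = sorted(entries, key=lambda entry: entry["reviews"], reverse=True)
    return [entry["text"] for entry in ranked[:n]]

# A few entries copied from the sample 'reviewTags' value shown earlier
sample = [
    {"text": "one night", "reviews": 2},
    {"text": "farm", "reviews": 14},
    {"text": "lucy", "reviews": 13},
    {"text": "botswana", "reviews": 11},
    {"text": "stay", "reviews": 5},
]

most_mentioned = top_tags(sample, n=3)
as_string = ",".join(entry["text"] for entry in sample)
print(most_mentioned)
print(as_string)
```

Joining the 'text' values discards the counts, so a ranking like this is only possible before the column is flattened into a string.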



In [39]:
def clean_review_tags(df):
    """Replace each 'reviewTags' list of dicts with a comma-separated string of tag texts."""
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    # Non-list entries (e.g. NaN) become an empty string
    df['reviewTags'] = df['reviewTags'].apply(
        lambda entries: ','.join(entry['text'] for entry in entries)
        if isinstance(entries, list) else ''
    )
    return df

# Usage
df_filtered = clean_review_tags(df_filtered)
df_filtered.head()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_filtered['reviewTags'] = df_filtered['reviewTags'].apply(lambda tags: [tag['text'] for tag in tags])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_filtered['reviewTags'] = df_filtered['reviewTags'].apply(lambda tags: ','.join(tags))


Unnamed: 0,id,type,category,subcategories,name,locationString,image,photoCount,awards,rankingPosition,...,rankingString,rankingDenominator,neighborhoodLocations,nearestMetroStations,ancestorLocations,ratingHistogram,numberOfReviews,reviewTags,reviews,amenities
0,1171922,ATTRACTION,attraction,[Sights & Landmarks],Khwai River Bridge,"Okavango Delta, North-West District",https://media-cdn.tripadvisor.com/media/photo-...,24,[],3.0,...,#3 of 5 things to do in Okavango Delta,5,[],[],"[{'id': '472673', 'name': 'Okavango Delta', 'a...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",45,,[],
1,2513264,ATTRACTION,attraction,[Nature & Parks],Gaborone Game Reserve,"Gaborone, South-East District",https://media-cdn.tripadvisor.com/media/photo-...,84,[],7.0,...,#7 of 25 things to do in Gaborone,25,[],[],"[{'id': '293767', 'name': 'Gaborone', 'abbrevi...","{'count1': 4, 'count2': 11, 'count3': 35, 'cou...",115,"eland,animals",[],
2,3247057,ATTRACTION,attraction,[Sights & Landmarks],ISKCON Gaborone,"Gaborone, South-East District",https://media-cdn.tripadvisor.com/media/photo-...,21,[],5.0,...,#5 of 25 things to do in Gaborone,25,[],[],"[{'id': '293767', 'name': 'Gaborone', 'abbrevi...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",23,,[],
3,478934,ATTRACTION,attraction,[Nature & Parks],Serondela Reserve,"Chobe National Park, North-West District",https://media-cdn.tripadvisor.com/media/photo-...,65,[],3.0,...,#3 of 8 things to do in Chobe National Park,8,[],[],"[{'id': '472669', 'name': 'Chobe National Park...","{'count1': 2, 'count2': 0, 'count3': 0, 'count...",34,"the river,hippos,chobe",[],
4,7931216,ATTRACTION,attraction,[Nature & Parks],Khutse Game Reserve,"Gaborone, South-East District",https://media-cdn.tripadvisor.com/media/photo-...,26,[],8.0,...,#8 of 25 things to do in Gaborone,25,[],[],"[{'id': '293767', 'name': 'Gaborone', 'abbrevi...","{'count1': 0, 'count2': 1, 'count3': 6, 'count...",29,"bucket shower,game reserve,latrine,hartebeest,...",[],


In [51]:
# converting to csv
# df_filtered.to_csv('condensed_data.csv', index=False)

## EDA and Data Munging

## Modelling

## Model Evaluation

## Tuning

## Deployment

## Conclusion and Recommendations