# Predicting preferred destination  based on taste and preference

The goal is to build a machine learning model that can predict hotel ratings based on customer reviews, budget, specific locations, and the type of residence. The dataset is scraped from TripAdvisor and it contains information about various hotels, including their ratings, reviews, amenities, pricing, geographical coordinates, and residence types (e.g., hotel, bed and breakfast, specialty lodging). By analyzing the text reviews along with these additional factors, the objective is to develop a model that can accurately predict the ratings of new, unseen hotels based on customer reviews, budget constraints, location preferences, and residence type.

Approach:

Data Preprocessing: Clean and preprocess the text reviews by removing stopwords, punctuation, and performing tokenization. Convert the text data into a numerical representation suitable for modeling. Handle missing values, if any, in the budget, location, and residence type columns.

Feature Engineering: Extract additional features from the dataset, such as review sentiment scores, review length, and any other relevant information. Engineer new features related to budget, location, and residence type, such as price range categories, geographical distance from landmarks, and one-hot encoding of residence types.

Model Selection: Experiment with different supervised learning models, such as linear regression, decision trees, random forests, or neural networks, to find the best model for predicting hotel ratings considering customer reviews, budget, location, and residence type. Evaluate the models using appropriate evaluation metrics like mean squared error (MSE) or mean absolute error (MAE).

Model Training and Evaluation: Split the dataset into training and testing sets. Train the selected model on the training set and evaluate its performance on the testing set. Fine-tune the model parameters to improve its accuracy. Perform cross-validation to assess the model's generalization capabilities.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

import json
import glob
import re

In [5]:
def read_json_files(json_files):
    dfs = []
    for file in json_files:
        with open(file) as f:
            json_data = json.load(f)
            df = pd.DataFrame(json_data)
            dfs.append(df)

    return pd.concat(dfs, ignore_index=True)



In [10]:
df=pd.read_csv(r"C:\Users\User\Desktop\travel-destination-recommendation-sys\compiled_data.csv")
df

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


Unnamed: 0,id,type,category,subcategories,name,locationString,description,image,photoCount,awards,...,hours,menuWebUrl,establishmentTypes,ownersTopReasons,rentalDescriptions,photos,bedroomInfo,bathroomInfo,bathCount,baseDailyRate
0,4022415,ATTRACTION,attraction,['Nightlife'],Soho House Sharm El Sheikh,"Sharm El Sheikh, South Sinai, Red Sea and Sinai",Welcome to Soho House Sharm El Sheikh! The bes...,https://media-cdn.tripadvisor.com/media/photo-...,119,[],...,,,,,,,,,,
1,19730066,ATTRACTION,attraction,"['Shopping', 'Museums']",Nobles Art Gallery,"Luxor, Nile River Valley",Nobles Art Gallery is the best store in Luxor ...,https://media-cdn.tripadvisor.com/media/photo-...,105,[],...,,,,,,,,,,
2,8011182,ATTRACTION,attraction,['Outdoor Activities'],YallaHorse Riding,"El Gouna, Hurghada, Red Sea and Sinai",Riding in El Gouna is an unforgettable experie...,https://media-cdn.tripadvisor.com/media/photo-...,362,[],...,,,,,,,,,,
3,7371664,ATTRACTION,attraction,['Spas & Wellness'],Mividaspa at Jaz Aquamarine Resort,"Hurghada, Red Sea and Sinai",Mividaspa is fast earning a top reputation due...,https://media-cdn.tripadvisor.com/media/photo-...,67,[],...,,,,,,,,,,
4,17523327,ATTRACTION,attraction,"['Other', 'Transportation']",Sharm Airport Transfers Karim,"Sharm El Sheikh, South Sinai, Red Sea and Sinai",Airport transfer service safe reliable drivers...,https://media-cdn.tripadvisor.com/media/photo-...,25,[],...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35831,12233032,HOTEL,hotel,['Specialty Lodging'],Sandcreek Village,"Joal Fadiouth, La Petite Cote, Thies Region",,https://media-cdn.tripadvisor.com/media/partne...,0,[],...,,,,,,,,,,
35832,10071000,HOTEL,hotel,['Bed and Breakfast'],Chambres d'Hotes,"Nianing, La Petite Cote, Thies Region",,,0,[],...,,,,,,,,,,
35833,23686418,HOTEL,hotel,['Specialty Lodging'],Sessene,"Fatick, Fatick Region",,,0,[],...,,,,,,,,,,
35834,15756049,HOTEL,hotel,['Bed and Breakfast'],Havre de paix aux Almadie,"Ngor, Dakar, Dakar Region",,,0,[],...,,,,,,,,,,


In [11]:
df.isnull().sum()

id                   0
type                 0
category             0
subcategories     1339
name                 1
                 ...  
photos           34497
bedroomInfo      35133
bathroomInfo     34500
bathCount        34497
baseDailyRate    34568
Length: 65, dtype: int64

In [12]:
df.columns

Index(['id', 'type', 'category', 'subcategories', 'name', 'locationString',
       'description', 'image', 'photoCount', 'awards', 'rankingPosition',
       'rating', 'rawRanking', 'phone', 'address', 'addressObj', 'localName',
       'localAddress', 'localLangCode', 'email', 'latitude', 'longitude',
       'webUrl', 'website', 'rankingString', 'rankingDenominator',
       'neighborhoodLocations', 'nearestMetroStations', 'ancestorLocations',
       'ratingHistogram', 'numberOfReviews', 'reviewTags', 'reviews',
       'booking', 'offerGroup', 'subtype', 'hotelClass',
       'hotelClassAttribution', 'amenities', 'numberOfRooms', 'priceLevel',
       'priceRange', 'roomTips', 'checkInDate', 'checkOutDate', 'offers',
       'guideFeaturedInCopy', 'isClosed', 'isLongClosed', 'openNowText',
       'cuisines', 'mealTypes', 'dishes', 'features', 'dietaryRestrictions',
       'hours', 'menuWebUrl', 'establishmentTypes', 'ownersTopReasons',
       'rentalDescriptions', 'photos', 'bedroomInfo', '

In [13]:
df.reviewTags.value_counts()

[]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      

In [14]:
# Check null values and filter columns with more than 4000 null values
null_counts = df.isnull().sum()
columns_above_threshold = null_counts[null_counts > 10000].index

# Print the columns with more than 4000 null values
list(columns_above_threshold)


['description',
 'phone',
 'localName',
 'localAddress',
 'localLangCode',
 'email',
 'website',
 'booking',
 'offerGroup',
 'subtype',
 'hotelClass',
 'hotelClassAttribution',
 'numberOfRooms',
 'priceLevel',
 'priceRange',
 'roomTips',
 'checkInDate',
 'checkOutDate',
 'offers',
 'guideFeaturedInCopy',
 'isClosed',
 'isLongClosed',
 'openNowText',
 'cuisines',
 'mealTypes',
 'dishes',
 'features',
 'dietaryRestrictions',
 'hours',
 'menuWebUrl',
 'establishmentTypes',
 'ownersTopReasons',
 'rentalDescriptions',
 'photos',
 'bedroomInfo',
 'bathroomInfo',
 'bathCount',
 'baseDailyRate']

In [15]:
# we will drop the following columns because they do not have any contribution to our objectives.
# some also contain too many null values to fill. 
cols_to_drop = columns_above_threshold

df.drop(columns=cols_to_drop, inplace=True)

In [16]:
list(df.columns)

['id',
 'type',
 'category',
 'subcategories',
 'name',
 'locationString',
 'image',
 'photoCount',
 'awards',
 'rankingPosition',
 'rating',
 'rawRanking',
 'address',
 'addressObj',
 'latitude',
 'longitude',
 'webUrl',
 'rankingString',
 'rankingDenominator',
 'neighborhoodLocations',
 'nearestMetroStations',
 'ancestorLocations',
 'ratingHistogram',
 'numberOfReviews',
 'reviewTags',
 'reviews',
 'amenities']

In [17]:
df[['locationString','rankingPosition','rawRanking','rankingString','rankingDenominator']]

Unnamed: 0,locationString,rankingPosition,rawRanking,rankingString,rankingDenominator
0,"Sharm El Sheikh, South Sinai, Red Sea and Sinai",2.0,4.349033,#2 of 45 Nightlife in Sharm El Sheikh,45.0
1,"Luxor, Nile River Valley",1.0,4.434324,#1 of 59 Shopping in Luxor,59.0
2,"El Gouna, Hurghada, Red Sea and Sinai",4.0,4.404173,#4 of 86 Outdoor Activities in El Gouna,86.0
3,"Hurghada, Red Sea and Sinai",1.0,4.362678,#1 of 35 Spas & Wellness in Hurghada,35.0
4,"Sharm El Sheikh, South Sinai, Red Sea and Sinai",1.0,4.453663,#1 of 104 Transportation in Sharm El Sheikh,104.0
...,...,...,...,...,...
35831,"Joal Fadiouth, La Petite Cote, Thies Region",,,,
35832,"Nianing, La Petite Cote, Thies Region",,,,
35833,"Fatick, Fatick Region",,,,
35834,"Ngor, Dakar, Dakar Region",,,,


In [18]:
df[['name','rankingString', 'type']]

Unnamed: 0,name,rankingString,type
0,Soho House Sharm El Sheikh,#2 of 45 Nightlife in Sharm El Sheikh,ATTRACTION
1,Nobles Art Gallery,#1 of 59 Shopping in Luxor,ATTRACTION
2,YallaHorse Riding,#4 of 86 Outdoor Activities in El Gouna,ATTRACTION
3,Mividaspa at Jaz Aquamarine Resort,#1 of 35 Spas & Wellness in Hurghada,ATTRACTION
4,Sharm Airport Transfers Karim,#1 of 104 Transportation in Sharm El Sheikh,ATTRACTION
...,...,...,...
35831,Sandcreek Village,,HOTEL
35832,Chambres d'Hotes,,HOTEL
35833,Sessene,,HOTEL
35834,Havre de paix aux Almadie,,HOTEL


In [19]:


# Assuming your data is in a DataFrame called 'df' and the column is named 'rankingString'
# Create new columns
df['RankingType'] = ""
df['Location'] = ""
df['Numerator'] = ""
df['Denominator'] = ""

# Iterate through the rows and extract the information
for index, row in df.iterrows():
    # Check if the value is NaN
    if pd.isnull(row['rankingString']):
        continue

    if match := re.match(
        r'#(\d+)\s+of\s+(\d+)\s+(.*?)\s+in\s+(.*?)$', row['rankingString']
    ):
        numerator = match.group(1)
        denominator = match.group(2)
        ranking_type = match.group(3)
        location = match.group(4)

        # Update the new columns
        df.at[index, 'RankingType'] = ranking_type
        df.at[index, 'Location'] = location
        df.at[index, 'Numerator'] = numerator
        df.at[index, 'Denominator'] = denominator



In [None]:
df.RankingType.value_counts()

In [14]:
df.columns

Index(['id', 'type', 'category', 'subcategories', 'name', 'locationString',
       'image', 'photoCount', 'awards', 'rankingPosition', 'rating',
       'rawRanking', 'address', 'addressObj', 'localName', 'latitude',
       'longitude', 'webUrl', 'rankingString', 'rankingDenominator',
       'neighborhoodLocations', 'nearestMetroStations', 'ancestorLocations',
       'ratingHistogram', 'numberOfReviews', 'reviewTags', 'reviews',
       'amenities', 'RankingType', 'Location', 'Numerator', 'Denominator'],
      dtype='object')

After splitting the rankingString column to their respective elements. we observe below that the new column ranking type has some elements that are similar but grouped differently. 

In [50]:
df.RankingType.value_counts()

Specialty lodging          13571
                            9930
things to do                5348
hotels                      4961
Water & Amusement Parks      559
Transportation               532
places to eat                487
Shopping                     163
Nightlife                    133
Spas & Wellness              115
Classes & Workshops           37
Name: RankingType, dtype: int64

We will then combine similar elements to have reduced distribution within the column

In [49]:
# Define the mappings to combine similar values
mappings = {
    'hotel': 'hotels',
    'B&B / Inn': 'B&Bs / Inns',
    'Sights & Landmarks': 'Nature & Parks',
    'Fun & Games': 'Outdoor Activities',
    'Boat Tours & Water Sports': 'Water & Amusement Parks',
    'Traveler Resources': 'Shopping',
    'Concerts & Shows': 'Nightlife',
    'Food & Drink': 'places to eat',
    'Nature & Parks': 'things to do',
    'Museums': 'things to do',
    'Tours' : 'things to do',
    'Outdoor Activities': 'things to do',
    'B&Bs / Inns': 'Specialty lodging'
}

# Replace the values in the 'Ranking Type' column
df['RankingType'] = df['RankingType'].replace(mappings)

In [51]:
df

Unnamed: 0,id,type,category,subcategories,name,locationString,image,photoCount,awards,rankingPosition,...,ancestorLocations,ratingHistogram,numberOfReviews,reviewTags,reviews,amenities,RankingType,Location,Numerator,Denominator
0,2704993,ATTRACTION,attraction,[Nature & Parks],Congoloisirs,Kinshasa,https://media-cdn.tripadvisor.com/media/photo-...,9,[],17.0,...,"[{'id': '294187', 'name': 'Kinshasa', 'abbrevi...","{'count1': 0, 'count2': 0, 'count3': 4, 'count...",9,[],[],,things to do,Kinshasa,17,105
1,1536776,ATTRACTION,attraction,[Nature & Parks],Okapi Wildlife Reserve,Orientale Province,https://media-cdn.tripadvisor.com/media/photo-...,3,[],1.0,...,"[{'id': '1536771', 'name': 'Orientale Province...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",2,[],[],,things to do,Orientale Province,1,4
2,13203729,ATTRACTION,attraction,"[Shopping, Food & Drink]",Marche Nouveau DAIPN,Kinshasa,https://media-cdn.tripadvisor.com/media/photo-...,12,[],21.0,...,"[{'id': '294187', 'name': 'Kinshasa', 'abbrevi...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",3,[],[],,things to do,Kinshasa,21,105
3,8661504,HOTEL,hotel,[Specialty Lodging],Bukima Tented Camp,"Rumangabo, North Kivu Province",https://media-cdn.tripadvisor.com/media/photo-...,79,[],2.0,...,"[{'id': '3656749', 'name': 'Rumangabo', 'abbre...","{'count1': 1, 'count2': 0, 'count3': 0, 'count...",34,[],[],"Restaurant, Mountain View",Specialty lodging,Rumangabo,2,3
4,10414108,HOTEL,hotel,[Specialty Lodging],"Tchegera Island Tented Camp, Virunga National ...","Goma, North Kivu Province",https://media-cdn.tripadvisor.com/media/photo-...,109,[],1.0,...,"[{'id': '303843', 'name': 'Goma', 'abbreviatio...","{'count1': 0, 'count2': 0, 'count3': 1, 'count...",29,"[{'text': 'gorilla trekking', 'reviews': 3}, {...",[],"Multilingual Staff, Restaurant, Bar/Lounge, Fr...",Specialty lodging,Goma,1,17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35831,12216827,HOTEL,hotel,[Specialty Lodging],Casa Santos Pinto,"Curral das Vacas, Santo Antao",,0,[],,...,"[{'id': '12880045', 'name': 'Curral das Vacas'...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",0,[],[],"Shuttle Bus Service, Restaurant, Bar/Lounge, P...",,,,
35832,23200009,HOTEL,hotel,[Bed and Breakfast],Kelly GuestHouse - Lovely Bedroom - Plateau Ci...,"Praia, Santiago",https://media-cdn.tripadvisor.com/media/partne...,0,[],,...,"[{'id': '293775', 'name': 'Praia', 'abbreviati...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",0,[],[],,,,,
35833,13423426,HOTEL,hotel,[Bed and Breakfast],Luz Esperanca,"Pedra Badejo, Santiago",,0,[],,...,"[{'id': '1601793', 'name': 'Pedra Badejo', 'ab...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",0,[],[],"Kids Activities, Free parking, Airport transpo...",,,,
35834,12957229,HOTEL,hotel,[Specialty Lodging],Pensao Entre Nos,"Tarrafal, Santiago",,0,[],,...,"[{'id': '482851', 'name': 'Tarrafal', 'abbrevi...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",0,[],[],"Kids Activities, Free parking, Airport transpo...",,,,


In [53]:
df.RankingType

0             things to do
1             things to do
2             things to do
3        Specialty lodging
4        Specialty lodging
               ...        
35831                     
35832                     
35833                     
35834                     
35835                     
Name: RankingType, Length: 35836, dtype: object

In [61]:
empty_rows = df[df['RankingType'].isnull() | df['RankingType'].eq('')]
empty_rows[['RankingType', 'name', 'type']]


Unnamed: 0,RankingType,name,type
203,,Salonga National Park,ATTRACTION
304,,Les Assemblees de l'Eternel,ATTRACTION
308,,Aquasplash,ATTRACTION
309,,Parc de la Vallee de la N'Sele,ATTRACTION
310,,Cathedrale de Butembo,ATTRACTION
...,...,...,...
35831,,Casa Santos Pinto,HOTEL
35832,,Kelly GuestHouse - Lovely Bedroom - Plateau Ci...,HOTEL
35833,,Luz Esperanca,HOTEL
35834,,Pensao Entre Nos,HOTEL


In [65]:
speciality_lodging_rows = empty_rows[empty_rows['type'] == 'HOTEL'][['RankingType', 'name', 'type']]
speciality_lodging_rows

Unnamed: 0,RankingType,name,type
607,,Espace Fakala III,HOTEL
608,,Hotel Bercail,HOTEL
609,,Hotel Silem,HOTEL
610,,Guest House Mwamini,HOTEL
611,,La Refuge Hotel,HOTEL
...,...,...,...
35831,,Casa Santos Pinto,HOTEL
35832,,Kelly GuestHouse - Lovely Bedroom - Plateau Ci...,HOTEL
35833,,Luz Esperanca,HOTEL
35834,,Pensao Entre Nos,HOTEL


In [55]:
null_values = df[df['RankingType'].isna()]
null_values

Unnamed: 0,id,type,category,subcategories,name,locationString,image,photoCount,awards,rankingPosition,...,ancestorLocations,ratingHistogram,numberOfReviews,reviewTags,reviews,amenities,RankingType,Location,Numerator,Denominator


In [20]:
# Replace NaN values with "bathroom only" where type is "attraction"
df.loc[(df['type'] == 'RESTAURANT') & (df['amenities'].isna()), 'amenities'] = 'restaurant'


In [21]:
df.loc[(df['type'] == 'ATTRACTION') & (df['amenities'].isna()), 'amenities'] = 'bathroom only'

In [22]:
df['amenities'] = df['amenities'].apply(lambda x: ', '.join(x) if isinstance(x, list) else '')


In [23]:
df['amenities'].isnull().value_counts()

False    35836
Name: amenities, dtype: int64

In [24]:
df['amenities'].isna().value_counts()

False    35836
Name: amenities, dtype: int64

In [25]:

hotel_rows = df[df['type'] == 'RESTAURANT']
hotel_amenities = hotel_rows['amenities']
hotel_amenities

59        
60        
61        
62        
64        
        ..
26858     
26860     
26862     
26896     
26898     
Name: amenities, Length: 416, dtype: object

In [39]:
df[['type', 'amenities']]

Unnamed: 0,type,amenities
0,ATTRACTION,bathroom only
1,ATTRACTION,bathroom only
2,ATTRACTION,bathroom only
3,HOTEL,"[Restaurant, Mountain View]"
4,HOTEL,"[Multilingual Staff, Restaurant, Bar/Lounge, F..."
...,...,...
35831,HOTEL,"[Shuttle Bus Service, Restaurant, Bar/Lounge, ..."
35832,HOTEL,[]
35833,HOTEL,"[Kids Activities, Free parking, Airport transp..."
35834,HOTEL,"[Kids Activities, Free parking, Airport transp..."


In [2]:
from pandas_profiling import ProfileReport

  from pandas_profiling import ProfileReport


In [69]:
import pandas_profiling


In [1]:
profile_trip = pandas_profiling.ProfileReport(df)
profile_trip.to_file("df.html")

NameError: name 'pandas_profiling' is not defined