# Predicting preferred destination  based on taste and preference

The goal is to build a machine learning model that can predict hotel ratings based on customer reviews, budget, specific locations, and the type of residence. The dataset is scraped from TripAdvisor and it contains information about various hotels, including their ratings, reviews, amenities, pricing, geographical coordinates, and residence types (e.g., hotel, bed and breakfast, specialty lodging). By analyzing the text reviews along with these additional factors, the objective is to develop a model that can accurately predict the ratings of new, unseen hotels based on customer reviews, budget constraints, location preferences, and residence type.

Approach:

Data Preprocessing: Clean and preprocess the text reviews by removing stopwords, punctuation, and performing tokenization. Convert the text data into a numerical representation suitable for modeling. Handle missing values, if any, in the budget, location, and residence type columns.

Feature Engineering: Extract additional features from the dataset, such as review sentiment scores, review length, and any other relevant information. Engineer new features related to budget, location, and residence type, such as price range categories, geographical distance from landmarks, and one-hot encoding of residence types.

Model Selection: Experiment with different supervised learning models, such as linear regression, decision trees, random forests, or neural networks, to find the best model for predicting hotel ratings considering customer reviews, budget, location, and residence type. Evaluate the models using appropriate evaluation metrics like mean squared error (MSE) or mean absolute error (MAE).

Model Training and Evaluation: Split the dataset into training and testing sets. Train the selected model on the training set and evaluate its performance on the testing set. Fine-tune the model parameters to improve its accuracy. Perform cross-validation to assess the model's generalization capabilities.

In [1]:
import pandas as pd
import json
import glob
import re

In [2]:

def read_json_files(json_files):
    dfs = []
    for file in json_files:
        with open(file, encoding='utf-8') as f:
            json_data = json.load(f)
            df = pd.DataFrame(json_data)
            dfs.append(df)
    return pd.concat(dfs)

json_files = ['..\Data\drc.json', '..\Data\egypt.json', '..\Data\ethiopia.json',
                '..\Data\kenya.json', '..\Data\Madagascar.json', '..\Data\morocco.json',
                r'..\Data\nigeria.json', r'..\Data\rwanda.json', '..\Data\seychelles.json',
                r'..\Data\tanzania.json', r'..\Data\uganda.json', r'..\Data\namibia.json',
                '..\Data\south_africa.json', '..\Data\malawi.json', '..\Data\Senegal.json',
                '..\Data\zambia.json', '..\Data\Ghana.json', '..\Data\Botswana.json',
                '..\Data\capeverde.json']

df = read_json_files(json_files)


In [3]:
df

Unnamed: 0,id,type,category,subcategories,name,locationString,description,image,photoCount,awards,...,establishmentTypes,ownersTopReasons,localLangCode,guideFeaturedInCopy,rentalDescriptions,photos,bedroomInfo,bathroomInfo,bathCount,baseDailyRate
0,2704993,ATTRACTION,attraction,[Nature & Parks],Congoloisirs,Kinshasa,,https://media-cdn.tripadvisor.com/media/photo-...,9,[],...,,,,,,,,,,
1,1536776,ATTRACTION,attraction,[Nature & Parks],Okapi Wildlife Reserve,Orientale Province,,https://media-cdn.tripadvisor.com/media/photo-...,3,[],...,,,,,,,,,,
2,13203729,ATTRACTION,attraction,"[Shopping, Food & Drink]",Marche Nouveau DAIPN,Kinshasa,,https://media-cdn.tripadvisor.com/media/photo-...,12,[],...,,,,,,,,,,
3,8661504,HOTEL,hotel,[Specialty Lodging],Bukima Tented Camp,"Rumangabo, North Kivu Province",Just outside the Virunga National Park boundar...,https://media-cdn.tripadvisor.com/media/photo-...,79,[],...,,,,,,,,,,
4,10414108,HOTEL,hotel,[Specialty Lodging],"Tchegera Island Tented Camp, Virunga National ...","Goma, North Kivu Province",,https://media-cdn.tripadvisor.com/media/photo-...,109,[],...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1273,12216827,HOTEL,hotel,[Specialty Lodging],Casa Santos Pinto,"Curral das Vacas, Santo Antao",,,0,[],...,,,,,,,,,,
1274,23200009,HOTEL,hotel,[Bed and Breakfast],Kelly GuestHouse - Lovely Bedroom - Plateau Ci...,"Praia, Santiago",,https://media-cdn.tripadvisor.com/media/partne...,0,[],...,,,,,,,,,,
1275,13423426,HOTEL,hotel,[Bed and Breakfast],Luz Esperanca,"Pedra Badejo, Santiago",,,0,[],...,,,,,,,,,,
1276,12957229,HOTEL,hotel,[Specialty Lodging],Pensao Entre Nos,"Tarrafal, Santiago",,,0,[],...,,,,,,,,,,


In [4]:
df.isnull().sum()

id                   0
type                 0
category             0
subcategories     1339
name                 0
                 ...  
photos           34497
bedroomInfo      34497
bathroomInfo     34497
bathCount        34497
baseDailyRate    34568
Length: 65, dtype: int64

In [5]:
df.columns

Index(['id', 'type', 'category', 'subcategories', 'name', 'locationString',
       'description', 'image', 'photoCount', 'awards', 'rankingPosition',
       'rating', 'rawRanking', 'phone', 'address', 'addressObj', 'localName',
       'localAddress', 'email', 'latitude', 'longitude', 'webUrl', 'website',
       'rankingString', 'rankingDenominator', 'neighborhoodLocations',
       'nearestMetroStations', 'ancestorLocations', 'ratingHistogram',
       'numberOfReviews', 'reviewTags', 'reviews', 'booking', 'offerGroup',
       'subtype', 'hotelClass', 'amenities', 'numberOfRooms', 'priceLevel',
       'priceRange', 'roomTips', 'checkInDate', 'checkOutDate', 'offers',
       'hotelClassAttribution', 'isClosed', 'isLongClosed', 'openNowText',
       'cuisines', 'mealTypes', 'dishes', 'features', 'dietaryRestrictions',
       'hours', 'menuWebUrl', 'establishmentTypes', 'ownersTopReasons',
       'localLangCode', 'guideFeaturedInCopy', 'rentalDescriptions', 'photos',
       'bedroomInfo', '

In [6]:
df.reviewTags.value_counts()

[]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      

In [7]:
# Check null values and filter columns with more than 4000 null values
null_counts = df.isnull().sum()
columns_above_threshold = null_counts[null_counts > 10000].index

# Print the columns with more than 4000 null values
list(columns_above_threshold)


['description',
 'phone',
 'localAddress',
 'email',
 'website',
 'booking',
 'offerGroup',
 'subtype',
 'hotelClass',
 'numberOfRooms',
 'priceLevel',
 'priceRange',
 'roomTips',
 'checkInDate',
 'checkOutDate',
 'offers',
 'hotelClassAttribution',
 'isClosed',
 'isLongClosed',
 'openNowText',
 'cuisines',
 'mealTypes',
 'dishes',
 'features',
 'dietaryRestrictions',
 'hours',
 'menuWebUrl',
 'establishmentTypes',
 'ownersTopReasons',
 'localLangCode',
 'guideFeaturedInCopy',
 'rentalDescriptions',
 'photos',
 'bedroomInfo',
 'bathroomInfo',
 'bathCount',
 'baseDailyRate']

In [8]:
# we will drop the following columns because they do not have any contribution to our objectives.
# some also contain too many null values to fill. 
cols_to_drop = columns_above_threshold

df.drop(columns=cols_to_drop, inplace=True)

In [9]:
list(df.columns)

['id',
 'type',
 'category',
 'subcategories',
 'name',
 'locationString',
 'image',
 'photoCount',
 'awards',
 'rankingPosition',
 'rating',
 'rawRanking',
 'address',
 'addressObj',
 'localName',
 'latitude',
 'longitude',
 'webUrl',
 'rankingString',
 'rankingDenominator',
 'neighborhoodLocations',
 'nearestMetroStations',
 'ancestorLocations',
 'ratingHistogram',
 'numberOfReviews',
 'reviewTags',
 'reviews',
 'amenities']

In [10]:
df[['locationString','rankingPosition','rawRanking','rankingString','rankingDenominator']]

Unnamed: 0,locationString,rankingPosition,rawRanking,rankingString,rankingDenominator
0,Kinshasa,17.0,2.778074,#17 of 105 things to do in Kinshasa,105
1,Orientale Province,1.0,2.751658,#1 of 4 things to do in Orientale Province,4
2,Kinshasa,21.0,2.773659,#21 of 105 things to do in Kinshasa,105
3,"Rumangabo, North Kivu Province",2.0,3.351389,#2 of 3 Specialty lodging in Rumangabo,3
4,"Goma, North Kivu Province",1.0,3.464931,#1 of 17 Specialty lodging in Goma,17
...,...,...,...,...,...
1273,"Curral das Vacas, Santo Antao",,,,
1274,"Praia, Santiago",,,,
1275,"Pedra Badejo, Santiago",,,,
1276,"Tarrafal, Santiago",,,,


In [11]:
df[['name','rankingString']]

Unnamed: 0,name,rankingString
0,Congoloisirs,#17 of 105 things to do in Kinshasa
1,Okapi Wildlife Reserve,#1 of 4 things to do in Orientale Province
2,Marche Nouveau DAIPN,#21 of 105 things to do in Kinshasa
3,Bukima Tented Camp,#2 of 3 Specialty lodging in Rumangabo
4,"Tchegera Island Tented Camp, Virunga National ...",#1 of 17 Specialty lodging in Goma
...,...,...
1273,Casa Santos Pinto,
1274,Kelly GuestHouse - Lovely Bedroom - Plateau Ci...,
1275,Luz Esperanca,
1276,Pensao Entre Nos,


In [12]:


# Assuming your data is in a DataFrame called 'df' and the column is named 'rankingString'
# Create new columns
df['RankingType'] = ""
df['Location'] = ""
df['Numerator'] = ""
df['Denominator'] = ""

# Iterate through the rows and extract the information
for index, row in df.iterrows():
    # Check if the value is NaN
    if pd.isnull(row['rankingString']):
        continue
    
    # Use regular expressions to extract the information
    match = re.match(r'#(\d+)\s+of\s+(\d+)\s+(.*?)\s+in\s+(.*?)$', row['rankingString'])
    
    # Check if the match is successful
    if match:
        numerator = match.group(1)
        denominator = match.group(2)
        ranking_type = match.group(3)
        location = match.group(4)
        
        # Update the new columns
        df.at[index, 'RankingType'] = ranking_type
        df.at[index, 'Location'] = location
        df.at[index, 'Numerator'] = numerator
        df.at[index, 'Denominator'] = denominator



In [13]:
df.columns

Index(['id', 'type', 'category', 'subcategories', 'name', 'locationString',
       'image', 'photoCount', 'awards', 'rankingPosition', 'rating',
       'rawRanking', 'address', 'addressObj', 'localName', 'latitude',
       'longitude', 'webUrl', 'rankingString', 'rankingDenominator',
       'neighborhoodLocations', 'nearestMetroStations', 'ancestorLocations',
       'ratingHistogram', 'numberOfReviews', 'reviewTags', 'reviews',
       'amenities', 'RankingType', 'Location', 'Numerator', 'Denominator'],
      dtype='object')

In [14]:
df.RankingType.value_counts()

B&Bs / Inns                  11523
Specialty lodging             9814
hotels                        4896
things to do                  3306
Boat Tours & Water Sports     1501
Outdoor Activities             950
Tours                          817
B&B / Inn                      633
Transportation                 608
Shopping                       361
Nightlife                      323
Food & Drink                   304
hotel                          299
Spas & Wellness                247
Classes & Workshops            133
Fun & Games                     76
Concerts & Shows                19
Museums                         19
                                 7
Name: RankingType, dtype: int64

In [16]:
# Define the mappings to combine similar values
mappings = {
    'hotel': 'hotels',
    'B&B / Inn': 'B&Bs / Inns',
    
}

# Replace the values in the 'Ranking Type' column
df['RankingType'] = df['RankingType'].replace(mappings)

In [17]:
df.amenities

0                                                     NaN
1                                                     NaN
2                                                     NaN
3                             [Restaurant, Mountain View]
4       [Multilingual Staff, Restaurant, Bar/Lounge, F...
                              ...                        
1273    [Shuttle Bus Service, Restaurant, Bar/Lounge, ...
1274                                                   []
1275    [Kids Activities, Free parking, Airport transp...
1276    [Kids Activities, Free parking, Airport transp...
1277                                                   []
Name: amenities, Length: 35836, dtype: object

In [18]:
null_values = df[df['amenities'].isnull()]
null_values

Unnamed: 0,id,type,category,subcategories,name,locationString,image,photoCount,awards,rankingPosition,...,ancestorLocations,ratingHistogram,numberOfReviews,reviewTags,reviews,amenities,RankingType,Location,Numerator,Denominator
0,2704993,ATTRACTION,attraction,[Nature & Parks],Congoloisirs,Kinshasa,https://media-cdn.tripadvisor.com/media/photo-...,9,[],17.0,...,"[{'id': '294187', 'name': 'Kinshasa', 'abbrevi...","{'count1': 0, 'count2': 0, 'count3': 4, 'count...",9,[],[],,things to do,Cidade Velha,1,9
1,1536776,ATTRACTION,attraction,[Nature & Parks],Okapi Wildlife Reserve,Orientale Province,https://media-cdn.tripadvisor.com/media/photo-...,3,[],1.0,...,"[{'id': '1536771', 'name': 'Orientale Province...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",2,[],[],,things to do,Praia,1,20
2,13203729,ATTRACTION,attraction,"[Shopping, Food & Drink]",Marche Nouveau DAIPN,Kinshasa,https://media-cdn.tripadvisor.com/media/photo-...,12,[],21.0,...,"[{'id': '294187', 'name': 'Kinshasa', 'abbrevi...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",3,[],[],,things to do,Boa Vista,4,30
6,19492774,ATTRACTION,attraction,[Outdoor Activities],Cercle Elais,Kinshasa,https://media-cdn.tripadvisor.com/media/photo-...,8,[],14.0,...,"[{'id': '294187', 'name': 'Kinshasa', 'abbrevi...","{'count1': 1, 'count2': 0, 'count3': 0, 'count...",6,"[{'text': 'pool', 'reviews': 2}]",[],,things to do,Santa Maria,4,22
7,4889528,ATTRACTION,attraction,[Sights & Landmarks],Eglise CBFC-Gombe,Kinshasa,https://media-cdn.tripadvisor.com/media/photo-...,13,[],9.0,...,"[{'id': '294187', 'name': 'Kinshasa', 'abbrevi...","{'count1': 0, 'count2': 0, 'count3': 1, 'count...",11,[],[],,hotels,Ribeira Grande,1,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
741,25810865,ATTRACTION,attraction,[Spas & Wellness],"Renova Spa, RIU Palace Boa Vista","Rabil, Boa Vista",https://media-cdn.tripadvisor.com/media/photo-...,7,[],,...,"[{'id': '13867343', 'name': 'Rabil', 'abbrevia...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",0,[],[],,Specialty lodging,Gaborone,18,108
743,17504922,ATTRACTION,attraction,"[Tours, Other, Boat Tours & Water Sports, Tran...",Sea Turtle Sal,"Santa Maria, Ilha do Sal",https://media-cdn.tripadvisor.com/media/photo-...,6,[],13.0,...,"[{'id': '482848', 'name': 'Santa Maria', 'abbr...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",2,[],[],,Transportation,Santa Maria,13,27
747,25416857,ATTRACTION,attraction,[Tours],Sidy Tours,"Sal Rei, Boa Vista",,0,[],,...,"[{'id': '1185333', 'name': 'Sal Rei', 'abbrevi...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",0,[],[],,B&Bs / Inns,Gaborone,21,125
749,20282100,ATTRACTION,attraction,"[Tours, Other, Transportation, Outdoor Activit...","Over Clauds Tours- Servico de Guia, Lda.","Mindelo, Sao Vicente",https://media-cdn.tripadvisor.com/media/photo-...,1,[],,...,"[{'id': '482855', 'name': 'Mindelo', 'abbrevia...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",0,[],[],,Specialty lodging,Maun,48,81


In [21]:
df['amenities'] = df['amenities'].apply(lambda x: tuple(x) if isinstance(x, list) else x)
grouped_counts = df[['name', 'amenities']].groupby('amenities').value_counts()

In [22]:
df

Unnamed: 0,id,type,category,subcategories,name,locationString,image,photoCount,awards,rankingPosition,...,ancestorLocations,ratingHistogram,numberOfReviews,reviewTags,reviews,amenities,RankingType,Location,Numerator,Denominator
0,2704993,ATTRACTION,attraction,[Nature & Parks],Congoloisirs,Kinshasa,https://media-cdn.tripadvisor.com/media/photo-...,9,[],17.0,...,"[{'id': '294187', 'name': 'Kinshasa', 'abbrevi...","{'count1': 0, 'count2': 0, 'count3': 4, 'count...",9,[],[],,things to do,Cidade Velha,1,9
1,1536776,ATTRACTION,attraction,[Nature & Parks],Okapi Wildlife Reserve,Orientale Province,https://media-cdn.tripadvisor.com/media/photo-...,3,[],1.0,...,"[{'id': '1536771', 'name': 'Orientale Province...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",2,[],[],,things to do,Praia,1,20
2,13203729,ATTRACTION,attraction,"[Shopping, Food & Drink]",Marche Nouveau DAIPN,Kinshasa,https://media-cdn.tripadvisor.com/media/photo-...,12,[],21.0,...,"[{'id': '294187', 'name': 'Kinshasa', 'abbrevi...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",3,[],[],,things to do,Boa Vista,4,30
3,8661504,HOTEL,hotel,[Specialty Lodging],Bukima Tented Camp,"Rumangabo, North Kivu Province",https://media-cdn.tripadvisor.com/media/photo-...,79,[],2.0,...,"[{'id': '3656749', 'name': 'Rumangabo', 'abbre...","{'count1': 1, 'count2': 0, 'count3': 0, 'count...",34,[],[],"(Restaurant, Mountain View)",things to do,Boa Vista,3,30
4,10414108,HOTEL,hotel,[Specialty Lodging],"Tchegera Island Tented Camp, Virunga National ...","Goma, North Kivu Province",https://media-cdn.tripadvisor.com/media/photo-...,109,[],1.0,...,"[{'id': '303843', 'name': 'Goma', 'abbreviatio...","{'count1': 0, 'count2': 0, 'count3': 1, 'count...",29,"[{'text': 'gorilla trekking', 'reviews': 3}, {...",[],"(Multilingual Staff, Restaurant, Bar/Lounge, F...",B&Bs / Inns,Cha das Caldeiras,1,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1273,12216827,HOTEL,hotel,[Specialty Lodging],Casa Santos Pinto,"Curral das Vacas, Santo Antao",,0,[],,...,"[{'id': '12880045', 'name': 'Curral das Vacas'...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",0,[],[],"(Shuttle Bus Service, Restaurant, Bar/Lounge, ...",Specialty lodging,Sekondi-Takoradi,4,33
1274,23200009,HOTEL,hotel,[Bed and Breakfast],Kelly GuestHouse - Lovely Bedroom - Plateau Ci...,"Praia, Santiago",https://media-cdn.tripadvisor.com/media/partne...,0,[],,...,"[{'id': '293775', 'name': 'Praia', 'abbreviati...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",0,[],[],(),B&Bs / Inns,Accra,74,363
1275,13423426,HOTEL,hotel,[Bed and Breakfast],Luz Esperanca,"Pedra Badejo, Santiago",,0,[],,...,"[{'id': '1601793', 'name': 'Pedra Badejo', 'ab...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",0,[],[],"(Kids Activities, Free parking, Airport transp...",Specialty lodging,Bolgatanga,3,19
1276,12957229,HOTEL,hotel,[Specialty Lodging],Pensao Entre Nos,"Tarrafal, Santiago",,0,[],,...,"[{'id': '482851', 'name': 'Tarrafal', 'abbrevi...","{'count1': 0, 'count2': 0, 'count3': 0, 'count...",0,[],[],"(Kids Activities, Free parking, Airport transp...",B&Bs / Inns,Techiman,1,18


In [None]:
# data.to_csv(r"E:\Documents\data science\Capstone\data1")

In [None]:
from pandas_profiling import ProfileReport

In [None]:
import pandas_profiling


In [None]:
profile_trip = pandas_profiling.ProfileReport(df)
profile_trip.to_file("df.html")