# Predicting preferred destination based on taste and preference

The goal is to build a machine learning model that predicts hotel ratings from customer reviews, budget, location, and type of residence. The dataset, scraped from TripAdvisor, contains information about various hotels, including their ratings, reviews, amenities, pricing, geographical coordinates, and residence types (e.g., hotel, bed and breakfast, specialty lodging). By analyzing the review text together with these additional factors, the model should accurately predict the ratings of new, unseen hotels.

Approach:

Data Preprocessing: Clean and preprocess the text reviews by removing stopwords and punctuation and tokenizing the text. Convert the text data into a numerical representation suitable for modeling. Handle any missing values in the budget, location, and residence type columns.
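
The text-cleaning step can be sketched as follows. This is a minimal illustration using only the standard library; the `STOPWORDS` set and the `clean_review` helper are hypothetical names introduced here, and a real pipeline would likely use a fuller stopword list from an NLP library.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only
STOPWORDS = {"the", "a", "an", "and", "was", "is", "in", "of", "to", "it"}

def clean_review(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    # Remove every ASCII punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]

print(clean_review("The room was clean, and the staff is friendly!"))
# prints ['room', 'clean', 'staff', 'friendly']
```

The resulting token lists would still need to be turned into numbers (for example, word counts) before they can be fed to a model.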

Feature Engineering: Extract additional features from the dataset, such as review sentiment scores, review length, and any other relevant information. Engineer new features related to budget, location, and residence type, such as price range categories, geographical distance from landmarks, and one-hot encoding of residence types.
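
Two of the features mentioned above, distance from a landmark and price range categories, might look like the sketch below. The function names and the price thresholds (`low`, `high`) are assumptions chosen for illustration, not values taken from the dataset.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def price_category(daily_rate, low=50.0, high=150.0):
    """Bucket a nightly rate into a coarse category; thresholds are assumptions."""
    if daily_rate is None:
        return "unknown"
    if daily_rate < low:
        return "budget"
    if daily_rate < high:
        return "mid-range"
    return "luxury"

# One degree of longitude along the equator is roughly 111 km
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 1))
print(price_category(100.0))
```

In the notebook, `haversine_km` could be applied to the `latitude` and `longitude` columns against a chosen landmark's coordinates, and `price_category` to a numeric price column once one has been cleaned.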

Model Selection: Experiment with different supervised learning models, such as linear regression, decision trees, random forests, or neural networks, to find the best model for predicting hotel ratings considering customer reviews, budget, location, and residence type. Evaluate the models using appropriate evaluation metrics like mean squared error (MSE) or mean absolute error (MAE).
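
For reference, the two metrics named above can be computed by hand as in this small sketch. The rating values are made-up numbers for illustration; in practice the equivalent functions in `sklearn.metrics` would normally be used.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences; penalizes large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences; in the same units as the ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up ratings for illustration only
ratings_true = [4.0, 5.0, 3.0, 4.5]
ratings_pred = [4.5, 4.0, 3.0, 4.0]

print(mean_squared_error(ratings_true, ratings_pred))   # prints 0.375
print(mean_absolute_error(ratings_true, ratings_pred))  # prints 0.5
```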

Model Training and Evaluation: Split the dataset into training and testing sets. Train the selected model on the training set and evaluate its performance on the testing set. Fine-tune the model parameters to improve its accuracy. Perform cross-validation to assess the model's generalization capabilities.
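
The splitting and cross-validation steps can be illustrated with plain index lists. This is a sketch under simple assumptions (shuffled rows, a fixed seed, hypothetical helper names); a library such as scikit-learn provides ready-made equivalents.

```python
import random

def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]

def k_fold_indices(n_rows, k=5, seed=42):
    """Yield (train, validation) index lists for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

train_idx, test_idx = train_test_split_indices(10)
print(len(train_idx), len(test_idx))  # prints 8 2
```

Each row appears in exactly one validation fold, so averaging the metric over the folds gives an estimate of how the model generalizes.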

In [51]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns

import json
import glob
import re

In [52]:
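def read_json_files(json_files):
    """Load each JSON file into a DataFrame and stack them into one."""
    dfs = []
    for file in json_files:
        with open(file) as f:
            json_data = json.load(f)
            df = pd.DataFrame(json_data)
            dfs.append(df)

    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(dfs, ignore_index=True)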



In [53]:
# Raw strings keep the Windows-style backslashes from being read as escape sequences
json_files = [r'..\Data\drc.json', r'..\Data\egypt.json', r'..\Data\ethiopia.json',
              r'..\Data\kenya.json', r'..\Data\Madagascar.json', r'..\Data\morocco.json',
              r'..\Data\nigeria.json', r'..\Data\rwanda.json', r'..\Data\seychelles.json',
              r'..\Data\tanzania.json', r'..\Data\uganda.json', r'..\Data\namibia.json',
              r'..\Data\south_africa.json', r'..\Data\malawi.json', r'..\Data\Senegal.json',
              r'..\Data\zambia.json', r'..\Data\Ghana.json', r'..\Data\Botswana.json',
              r'..\Data\capeverde.json']
df = read_json_files(json_files)
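
As an alternative to listing every file by hand, the paths could be collected automatically. The sketch below assumes that all country files sit directly in `../Data` and end in `.json`; `list_json_files` is a hypothetical helper, and it would also pick up any other `.json` file in that folder.

```python
from pathlib import Path

def list_json_files(data_dir):
    """Return all .json files directly inside data_dir as sorted path strings."""
    return sorted(str(p) for p in Path(data_dir).glob("*.json"))

# Assumed usage in this notebook (pathlib builds the separator for the current OS):
# json_files = list_json_files(Path("..") / "Data")
# df = read_json_files(json_files)
```

Building paths with `pathlib` also avoids the backslash-escaping issue and works unchanged on non-Windows systems.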



In [54]:
df

Unnamed: 0,id,type,category,subcategories,name,locationString,description,image,photoCount,awards,...,establishmentTypes,ownersTopReasons,localLangCode,guideFeaturedInCopy,rentalDescriptions,photos,bedroomInfo,bathroomInfo,bathCount,baseDailyRate
0,2704993,ATTRACTION,attraction,[Nature & Parks],Congoloisirs,Kinshasa,,https://media-cdn.tripadvisor.com/media/photo-...,9,[],...,,,,,,,,,,
1,1536776,ATTRACTION,attraction,[Nature & Parks],Okapi Wildlife Reserve,Orientale Province,,https://media-cdn.tripadvisor.com/media/photo-...,3,[],...,,,,,,,,,,
2,13203729,ATTRACTION,attraction,"[Shopping, Food & Drink]",Marche Nouveau DAIPN,Kinshasa,,https://media-cdn.tripadvisor.com/media/photo-...,12,[],...,,,,,,,,,,
3,8661504,HOTEL,hotel,[Specialty Lodging],Bukima Tented Camp,"Rumangabo, North Kivu Province",Just outside the Virunga National Park boundar...,https://media-cdn.tripadvisor.com/media/photo-...,79,[],...,,,,,,,,,,
4,10414108,HOTEL,hotel,[Specialty Lodging],"Tchegera Island Tented Camp, Virunga National ...","Goma, North Kivu Province",,https://media-cdn.tripadvisor.com/media/photo-...,109,[],...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35831,12216827,HOTEL,hotel,[Specialty Lodging],Casa Santos Pinto,"Curral das Vacas, Santo Antao",,,0,[],...,,,,,,,,,,
35832,23200009,HOTEL,hotel,[Bed and Breakfast],Kelly GuestHouse - Lovely Bedroom - Plateau Ci...,"Praia, Santiago",,https://media-cdn.tripadvisor.com/media/partne...,0,[],...,,,,,,,,,,
35833,13423426,HOTEL,hotel,[Bed and Breakfast],Luz Esperanca,"Pedra Badejo, Santiago",,,0,[],...,,,,,,,,,,
35834,12957229,HOTEL,hotel,[Specialty Lodging],Pensao Entre Nos,"Tarrafal, Santiago",,,0,[],...,,,,,,,,,,


In [55]:
df.isnull().sum()

id                   0
type                 0
category             0
subcategories     1339
name                 0
                 ...  
photos           34497
bedroomInfo      34497
bathroomInfo     34497
bathCount        34497
baseDailyRate    34568
Length: 65, dtype: int64

In [56]:
df.columns

Index(['id', 'type', 'category', 'subcategories', 'name', 'locationString',
       'description', 'image', 'photoCount', 'awards', 'rankingPosition',
       'rating', 'rawRanking', 'phone', 'address', 'addressObj', 'localName',
       'localAddress', 'email', 'latitude', 'longitude', 'webUrl', 'website',
       'rankingString', 'rankingDenominator', 'neighborhoodLocations',
       'nearestMetroStations', 'ancestorLocations', 'ratingHistogram',
       'numberOfReviews', 'reviewTags', 'reviews', 'booking', 'offerGroup',
       'subtype', 'hotelClass', 'amenities', 'numberOfRooms', 'priceLevel',
       'priceRange', 'roomTips', 'checkInDate', 'checkOutDate', 'offers',
       'hotelClassAttribution', 'isClosed', 'isLongClosed', 'openNowText',
       'cuisines', 'mealTypes', 'dishes', 'features', 'dietaryRestrictions',
       'hours', 'menuWebUrl', 'establishmentTypes', 'ownersTopReasons',
       'localLangCode', 'guideFeaturedInCopy', 'rentalDescriptions', 'photos',
       'bedroomInfo', '

In [57]:
df.reviewTags.value_counts()

[]

In [58]:
# Check null values and filter columns with more than 10000 null values
null_counts = df.isnull().sum()
columns_above_threshold = null_counts[null_counts > 10000].index

# Print the columns with more than 10000 null values
list(columns_above_threshold)


['description',
 'phone',
 'localAddress',
 'email',
 'website',
 'booking',
 'offerGroup',
 'subtype',
 'hotelClass',
 'numberOfRooms',
 'priceLevel',
 'priceRange',
 'roomTips',
 'checkInDate',
 'checkOutDate',
 'offers',
 'hotelClassAttribution',
 'isClosed',
 'isLongClosed',
 'openNowText',
 'cuisines',
 'mealTypes',
 'dishes',
 'features',
 'dietaryRestrictions',
 'hours',
 'menuWebUrl',
 'establishmentTypes',
 'ownersTopReasons',
 'localLangCode',
 'guideFeaturedInCopy',
 'rentalDescriptions',
 'photos',
 'bedroomInfo',
 'bathroomInfo',
 'bathCount',
 'baseDailyRate']

#### Removing irrelevant columns
Several columns are not useful for our analysis. These include **'image'**, **'photoCount'**, **'awards'**, **'phone'**, **'address'**, **'email'**, **'webUrl'**, **'website'**, **'neighborhoodLocations'**, **'nearestMetroStations'**, **'booking'**, **'offerGroup'**, **'subtype'**, **'hotelClass'**, **'roomTips'**, **'checkInDate'**, **'checkOutDate'**, **'offers'**, **'hotelClassAttribution'**, **'localLangCode'**, **'isClosed'**, **'isLongClosed'**, **'openNowText'**, **'dietaryRestrictions'**, **'hours'**, **'menuWebUrl'**, **'establishmentTypes'**, **'ownersTopReasons'**, **'guideFeaturedInCopy'**, **'rentalDescriptions'** and **'photos'**.

In [59]:

columns_to_drop = ['image', 'photoCount', 'awards', 'phone', 'address', 'email', 
                   'webUrl', 'website', 'neighborhoodLocations', 'nearestMetroStations', 
                   'booking', 'offerGroup', 'subtype', 'hotelClass', 'roomTips', 'checkInDate', 
                   'checkOutDate', 'offers', 'hotelClassAttribution', 'localLangCode', 'isClosed', 
                   'isLongClosed', 'openNowText', 'dietaryRestrictions', 'hours', 'menuWebUrl', 
                   'establishmentTypes', 'ownersTopReasons', 'guideFeaturedInCopy', 'rentalDescriptions','photos']
df.drop(columns=columns_to_drop, inplace=True)
df.head()

Unnamed: 0,id,type,category,subcategories,name,locationString,description,rankingPosition,rating,rawRanking,...,priceLevel,priceRange,cuisines,mealTypes,dishes,features,bedroomInfo,bathroomInfo,bathCount,baseDailyRate
0,2704993,ATTRACTION,attraction,[Nature & Parks],Congoloisirs,Kinshasa,,17.0,4.0,2.778074,...,,,,,,,,,,
1,1536776,ATTRACTION,attraction,[Nature & Parks],Okapi Wildlife Reserve,Orientale Province,,1.0,5.0,2.751658,...,,,,,,,,,,
2,13203729,ATTRACTION,attraction,"[Shopping, Food & Drink]",Marche Nouveau DAIPN,Kinshasa,,21.0,5.0,2.773659,...,,,,,,,,,,
3,8661504,HOTEL,hotel,[Specialty Lodging],Bukima Tented Camp,"Rumangabo, North Kivu Province",Just outside the Virunga National Park boundar...,2.0,4.5,3.351389,...,,,,,,,,,,
4,10414108,HOTEL,hotel,[Specialty Lodging],"Tchegera Island Tented Camp, Virunga National ...","Goma, North Kivu Province",,1.0,5.0,3.464931,...,,,,,,,,,,


In [60]:
# we will drop the following columns because they do not have any contribution to our objectives.
# some also contain too many null values to fill. 
# cols_to_drop = columns_above_threshold

# df.drop(columns=cols_to_drop, inplace=True)

In [61]:
list(df.columns)

['id',
 'type',
 'category',
 'subcategories',
 'name',
 'locationString',
 'description',
 'rankingPosition',
 'rating',
 'rawRanking',
 'addressObj',
 'localName',
 'localAddress',
 'latitude',
 'longitude',
 'rankingString',
 'rankingDenominator',
 'ancestorLocations',
 'ratingHistogram',
 'numberOfReviews',
 'reviewTags',
 'reviews',
 'amenities',
 'numberOfRooms',
 'priceLevel',
 'priceRange',
 'cuisines',
 'mealTypes',
 'dishes',
 'features',
 'bedroomInfo',
 'bathroomInfo',
 'bathCount',
 'baseDailyRate']

In [50]:
df[['locationString','rankingPosition','rawRanking','rankingString','rankingDenominator']]

Unnamed: 0,locationString,rankingPosition,rawRanking,rankingString,rankingDenominator
0,Kinshasa,17.0,2.778074,17.0,105
1,Orientale Province,1.0,2.751658,1.0,4
2,Kinshasa,21.0,2.773659,21.0,105
3,"Rumangabo, North Kivu Province",2.0,3.351389,2.0,3
4,"Goma, North Kivu Province",1.0,3.464931,1.0,17
...,...,...,...,...,...
35831,"Curral das Vacas, Santo Antao",,,52.0,
35832,"Praia, Santiago",,,52.0,
35833,"Pedra Badejo, Santiago",,,52.0,
35834,"Tarrafal, Santiago",,,52.0,


In [48]:
list(df['rankingString'])

['#17 of 105 things to do in Kinshasa',
 '#1 of 4 things to do in Orientale Province',
 '#21 of 105 things to do in Kinshasa',
 '#2 of 3 Specialty lodging in Rumangabo',
 '#1 of 17 Specialty lodging in Goma',
 '#14 of 43 hotels in Kinshasa',
 '#14 of 105 things to do in Kinshasa',
 '#9 of 105 things to do in Kinshasa',
 '#1 of 12 things to do in Lubumbashi',
 '#1 of 5 Specialty lodging in Matadi',
 '#10 of 105 things to do in Kinshasa',
 '#7 of 105 things to do in Kinshasa',
 '#23 of 105 things to do in Kinshasa',
 '#8 of 105 things to do in Kinshasa',
 '#11 of 105 things to do in Kinshasa',
 '#12 of 105 things to do in Kinshasa',
 '#15 of 105 things to do in Kinshasa',
 '#21 of 43 hotels in Kinshasa',
 '#9 of 67 B&Bs / Inns in Kinshasa',
 '#3 of 105 things to do in Kinshasa',
 '#1 of 2 things to do in Kisantu',
 '#4 of 105 things to do in Kinshasa',
 '#6 of 105 things to do in Kinshasa',
 '#2 of 105 things to do in Kinshasa',
 '#3 of 3 Specialty lodging in Rumangabo',
 '#1 of 1 things

In [13]:
df[['name','rankingString', 'type']]

Unnamed: 0,name,rankingString,type
0,Congoloisirs,#17 of 105 things to do in Kinshasa,ATTRACTION
1,Okapi Wildlife Reserve,#1 of 4 things to do in Orientale Province,ATTRACTION
2,Marche Nouveau DAIPN,#21 of 105 things to do in Kinshasa,ATTRACTION
3,Bukima Tented Camp,#2 of 3 Specialty lodging in Rumangabo,HOTEL
4,"Tchegera Island Tented Camp, Virunga National ...",#1 of 17 Specialty lodging in Goma,HOTEL
...,...,...,...
35831,Casa Santos Pinto,,HOTEL
35832,Kelly GuestHouse - Lovely Bedroom - Plateau Ci...,,HOTEL
35833,Luz Esperanca,,HOTEL
35834,Pensao Entre Nos,,HOTEL


In [14]:

# Create new columns
df['RankingType'] = ""
df['Location'] = ""
df['Numerator'] = ""
df['Denominator'] = ""

# Iterate through the rows and extract the information
for index, row in df.iterrows():
    # Check if the value is NaN
    if pd.isnull(row['rankingString']):
        continue

    if match := re.match(
        r'#(\d+)\s+of\s+(\d+)\s+(.*?)\s+in\s+(.*?)$', row['rankingString']
    ):
        numerator = match.group(1)
        denominator = match.group(2)
        ranking_type = match.group(3)
        location = match.group(4)

        # Update the new columns
        df.at[index, 'RankingType'] = ranking_type
        df.at[index, 'Location'] = location
        df.at[index, 'Numerator'] = numerator
        df.at[index, 'Denominator'] = denominator
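

The same extraction can be written without a row-by-row loop. The sketch below applies pandas' `Series.str.extract` to a small made-up frame rather than the full dataset; the named groups become the new column names, and unmatched or missing rows are filled with empty strings to mirror the loop above.

```python
import pandas as pd

# Small sample using ranking strings of the form shown in the notebook output
sample = pd.DataFrame({
    "rankingString": [
        "#17 of 105 things to do in Kinshasa",
        "#2 of 3 Specialty lodging in Rumangabo",
        None,
    ]
})

# Named groups become column names; ^ anchors at the start like re.match
pattern = (
    r"^#(?P<Numerator>\d+)\s+of\s+(?P<Denominator>\d+)\s+"
    r"(?P<RankingType>.*?)\s+in\s+(?P<Location>.*)$"
)

parts = sample["rankingString"].str.extract(pattern).fillna("")
sample = sample.join(parts)
print(sample[["Numerator", "Denominator", "RankingType", "Location"]])
```

On a frame of this notebook's size, the vectorized form is typically much faster than `iterrows`, although the extracted values remain strings in both versions.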



In [15]:
df.RankingType.value_counts()

                             9930
Specialty lodging            7287
B&Bs / Inns                  6045
hotels                       4718
things to do                 3263
Outdoor Activities           1298
Tours                         693
Boat Tours & Water Sports     558
Transportation                532
places to eat                 326
hotel                         243
B&B / Inn                     239
Shopping                      162
Food & Drink                  161
Nightlife                     126
Spas & Wellness               115
Fun & Games                    73
Classes & Workshops            37
Nature & Parks                 12
Museums                         8
Concerts & Shows                7
Traveler Resources              1
Water & Amusement Parks         1
Sights & Landmarks              1
Name: RankingType, dtype: int64

In [16]:
df.columns

Index(['id', 'type', 'category', 'subcategories', 'name', 'locationString',
       'description', 'rankingPosition', 'rating', 'rawRanking', 'addressObj',
       'localName', 'localAddress', 'latitude', 'longitude', 'rankingString',
       'rankingDenominator', 'ancestorLocations', 'ratingHistogram',
       'numberOfReviews', 'reviewTags', 'reviews', 'amenities',
       'numberOfRooms', 'priceLevel', 'priceRange', 'cuisines', 'mealTypes',
       'dishes', 'features', 'bedroomInfo', 'bathroomInfo', 'bathCount',
       'baseDailyRate', 'RankingType', 'Location', 'Numerator', 'Denominator'],
      dtype='object')

After splitting the rankingString column into its components, we observe that the new RankingType column contains values that are similar but labelled differently (for example, 'hotel' and 'hotels').

We therefore combine similar values to reduce the number of distinct categories in the column.

In [17]:
# Define the mappings to combine similar values
mappings = {
    'hotel': 'hotels',
    'B&B / Inn': 'B&Bs / Inns',
    'Sights & Landmarks': 'Nature & Parks',
    'Fun & Games': 'Outdoor Activities',
    'Boat Tours & Water Sports': 'Water & Amusement Parks',
    'Traveler Resources': 'Shopping',
    'Concerts & Shows': 'Nightlife',
    'Food & Drink': 'places to eat',
    'Nature & Parks': 'things to do',
    'Museums': 'things to do',
    'Tours' : 'things to do',
    'Outdoor Activities': 'things to do',
    'B&Bs / Inns': 'Specialty lodging'
}

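# Replace the values in the 'RankingType' column.
# Note: replace() applies the dict in a single pass, so chained entries
# (e.g. 'B&B / Inn' -> 'B&Bs / Inns' -> 'Specialty lodging') are not followed through.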
df['RankingType'] = df['RankingType'].replace(mappings)
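
Because `Series.replace` with a dict maps each original value only once, a value such as 'B&B / Inn' ends up as 'B&Bs / Inns' rather than 'Specialty lodging'. If fully merged categories are wanted (an assumption about intent, since the counts in this notebook reflect the single-pass behaviour), the dict could be flattened first. `resolve_chains` is a hypothetical helper written for this sketch.

```python
def resolve_chains(mapping):
    """Follow each key through the mapping until the target is no longer a key."""
    resolved = {}
    for key in mapping:
        target = mapping[key]
        seen = {key}  # guards against accidental cycles
        while target in mapping and target not in seen:
            seen.add(target)
            target = mapping[target]
        resolved[key] = target
    return resolved

# Subset of the mappings above, for illustration
example = {
    "B&B / Inn": "B&Bs / Inns",
    "B&Bs / Inns": "Specialty lodging",
    "Sights & Landmarks": "Nature & Parks",
    "Nature & Parks": "things to do",
}
flat = resolve_chains(example)
print(flat["B&B / Inn"])           # prints Specialty lodging
print(flat["Sights & Landmarks"])  # prints things to do
```

Passing the flattened dict to `replace` would then merge each chain in one step.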

In [18]:
df.RankingType.value_counts()

Specialty lodging          13332
                            9930
things to do                5274
hotels                      4961
Water & Amusement Parks      559
Transportation               532
places to eat                487
B&Bs / Inns                  239
Shopping                     163
Nightlife                    133
Spas & Wellness              115
Outdoor Activities            73
Classes & Workshops           37
Nature & Parks                 1
Name: RankingType, dtype: int64

In [63]:
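# Note: this cell needs the 'Location', 'RankingType', 'Numerator' and 'Denominator'
# columns created by the extraction cell above. If that cell has not been run in the
# current session, groupby raises KeyError: 'Location'.

# Treat empty strings as missing and convert the ranking parts to numbers
df['Numerator'] = pd.to_numeric(df['Numerator'], errors='coerce')
df['Denominator'] = pd.to_numeric(df['Denominator'], errors='coerce')
df['Location'] = df['Location'].fillna('')

# Fill missing numerators within each (Location, RankingType) group
df['Numerator'] = (df.groupby(['Location', 'RankingType'])['Numerator']
                     .transform(lambda x: x.ffill().bfill()))

# Replace each denominator with its group total (vectorized instead of a row loop).
# Assumes a numeric total is intended; summing the string column would concatenate digits.
df['Denominator'] = df.groupby(['Location', 'RankingType'])['Denominator'].transform('sum')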

KeyError: 'Location'

In [46]:
df.Numerator

0         17
1          1
2         21
3          2
4          1
        ... 
35831    NaN
35832    NaN
35833    NaN
35834    NaN
35835    NaN
Name: Numerator, Length: 35836, dtype: object

In [39]:
df[['RankingType', 'name', 'type']]

Unnamed: 0,RankingType,name,type
0,things to do,Congoloisirs,ATTRACTION
1,things to do,Okapi Wildlife Reserve,ATTRACTION
2,things to do,Marche Nouveau DAIPN,ATTRACTION
3,Specialty lodging,Bukima Tented Camp,HOTEL
4,Specialty lodging,"Tchegera Island Tented Camp, Virunga National ...",HOTEL
...,...,...,...
35831,hotel,Casa Santos Pinto,HOTEL
35832,hotel,Kelly GuestHouse - Lovely Bedroom - Plateau Ci...,HOTEL
35833,hotel,Luz Esperanca,HOTEL
35834,hotel,Pensao Entre Nos,HOTEL


In [21]:
#empty_rows = df[df['RankingType'].isnull() | df['RankingType'].eq('')]
#empty_rows[['RankingType', 'name', 'type']]


In [22]:
#speciality_lodging_rows = empty_rows[empty_rows['RankingType'] == 'things to do'][['RankingType', 'name', 'type']]
#speciality_lodging_rows

In [38]:
# Define the mapping of types to ranking types.
# Note: np.random.choice is evaluated once, when the dict is built, so every
# HOTEL row receives the same randomly chosen label (not a per-row draw).
type_mapping = {
    'ATTRACTION': 'things to do',
    'HOTEL': np.random.choice(['hotel', 'Specialty lodging'], size=1)[0],
    #'OTHER_TYPE_1': 'ranking type 1',
    #'OTHER_TYPE_2': 'ranking type 2',
    # Add more types and their corresponding ranking types as needed
}

# Fill empty rows in RankingType based on type; rows whose type is not in the
# mapping keep their existing (empty) value instead of becoming NaN
mapped = df['type'].map(type_mapping)
df['RankingType'] = np.where((df['RankingType'] == '') & mapped.notna(), mapped, df['RankingType'])

In [24]:
#null_values = df[df['RankingType'].isna()]
#null_values

In [25]:
# Replace missing amenities with 'restaurant' where type is RESTAURANT
df.loc[(df['type'] == 'RESTAURANT') & (df['amenities'].isna()), 'amenities'] = 'restaurant'


In [26]:
# Replace missing amenities with 'bathroom only' where type is ATTRACTION
df.loc[(df['type'] == 'ATTRACTION') & (df['amenities'].isna()), 'amenities'] = 'bathroom only'

In [27]:
#
# df['amenities'] = df['amenities'].apply(lambda x: ', '.join(x) if isinstance(x, list) else '')


In [28]:
df['amenities'].isnull().value_counts()

False    35836
Name: amenities, dtype: int64

In [29]:

#hotel_rows = df[df['type'] == 'HOTEL']
#hotel_amenities = hotel_rows['amenities']
#hotel_amenities

In [30]:
df[['type', 'amenities']]

Unnamed: 0,type,amenities
0,ATTRACTION,bathroom only
1,ATTRACTION,bathroom only
2,ATTRACTION,bathroom only
3,HOTEL,"[Restaurant, Mountain View]"
4,HOTEL,"[Multilingual Staff, Restaurant, Bar/Lounge, F..."
...,...,...
35831,HOTEL,"[Shuttle Bus Service, Restaurant, Bar/Lounge, ..."
35832,HOTEL,[]
35833,HOTEL,"[Kids Activities, Free parking, Airport transp..."
35834,HOTEL,"[Kids Activities, Free parking, Airport transp..."


In [31]:
#from pandas_profiling import ProfileReport

In [32]:
#import pandas_profiling


In [33]:
#profile_trip = pandas_profiling.ProfileReport(df)
#profile_trip.to_file("df.html")