# Travel Destination Recommendation System Notebook

#### Authors
* 1
* 2 
* 3
* 4
* 5
* 6


## Problem Statement

The goal is to build a machine learning model that can predict hotel ratings based on customer reviews, budget, specific locations, and the type of residence. The dataset is scraped from TripAdvisor and it contains information about various hotels, including their ratings, reviews, amenities, pricing, geographical coordinates, and residence types (e.g., hotel, bed and breakfast, specialty lodging). By analyzing the text reviews along with these additional factors, the objective is to develop a model that can accurately predict the ratings of new, unseen hotels based on customer reviews, budget constraints, location preferences, and residence type.

Approach:

Data Preprocessing: Clean and preprocess the text reviews by removing stopwords, punctuation, and performing tokenization. Convert the text data into a numerical representation suitable for modeling. Handle missing values, if any, in the budget, location, and residence type columns.

Feature Engineering: Extract additional features from the dataset, such as review sentiment scores, review length, and any other relevant information. Engineer new features related to budget, location, and residence type, such as price range categories, geographical distance from landmarks, and one-hot encoding of residence types.

Model Selection: Experiment with different supervised learning models, such as linear regression, decision trees, random forests, or neural networks, to find the best model for predicting hotel ratings considering customer reviews, budget, location, and residence type. Evaluate the models using appropriate evaluation metrics like mean squared error (MSE) or mean absolute error (MAE).

Model Training and Evaluation: Split the dataset into training and testing sets. Train the selected model on the training set and evaluate its performance on the testing set. Fine-tune the model parameters to improve its accuracy. Perform cross-validation to assess the model's generalization capabilities.
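The text-preprocessing step described above can be sketched as follows. This is a minimal illustration rather than the project's actual pipeline: the stopword list and the function name `preprocess_review` are assumptions made for the example.

```python
import string

# Tiny illustrative stopword list (an assumption; a real project would use a fuller list).
STOPWORDS = {"the", "a", "an", "and", "was", "is", "to", "of", "in", "it"}


def preprocess_review(text):
    """Lowercase, strip punctuation, tokenize on whitespace, and drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOPWORDS]


tokens = preprocess_review("The room was clean, and the staff were friendly!")
print(tokens)
```

The resulting token lists could then be passed to a vectorizer to obtain the numerical representation mentioned in the approach.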

## Objectives

## Data Understanding

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np

import json
import glob
import re

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go


import scipy.stats as stats



from my_functions import DataCleaning

In [2]:
cleaner = DataCleaning()

In [3]:
# read the data 
json_files = [
    r'..\Data\drc.json', r'..\Data\egypt.json', r'..\Data\ethiopia.json',
    r'..\Data\kenya.json', r'..\Data\Madagascar.json', r'..\Data\morocco.json',
    r'..\Data\nigeria.json', r'..\Data\rwanda.json', r'..\Data\seychelles.json',
    r'..\Data\tanzania.json', r'..\Data\uganda.json', r'..\Data\namibia.json',
    r'..\Data\south_africa.json', r'..\Data\malawi.json', r'..\Data\Senegal.json',
    r'..\Data\zambia.json', r'..\Data\Ghana.json', r'..\Data\Botswana.json', 
    r'..\Data\capeverde.json'
]
cleaner.read_json_files(json_files)
df = cleaner.df
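The `DataCleaning` class lives in `my_functions` and is not shown in this notebook. As a rough idea of what loading several per-country exports into one DataFrame can look like, here is a hedged sketch; the standalone function `read_json_files` below is hypothetical and assumes each file holds a JSON list of records.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def read_json_files(paths):
    """Load each JSON file (assumed to be a list of records) and stack them into one DataFrame."""
    frames = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            records = json.load(fh)
        frames.append(pd.DataFrame(records))
    return pd.concat(frames, ignore_index=True)


# Demo with two throwaway files standing in for the per-country exports.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for name, rows in [("a.json", [{"id": 1, "type": "HOTEL"}]),
                       ("b.json", [{"id": 2, "type": "ATTRACTION"}])]:
        file_path = Path(tmp) / name
        file_path.write_text(json.dumps(rows), encoding="utf-8")
        demo_paths.append(file_path)
    combined = read_json_files(demo_paths)

print(combined.shape)
```

Building the paths with `pathlib.Path` also avoids hard-coding Windows backslashes, which keeps the notebook portable across operating systems.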

In [4]:
# preview the data
df.head()

Unnamed: 0,id,type,category,subcategories,name,locationString,description,image,photoCount,awards,...,establishmentTypes,ownersTopReasons,localLangCode,guideFeaturedInCopy,rentalDescriptions,photos,bedroomInfo,bathroomInfo,bathCount,baseDailyRate
0,2704993,ATTRACTION,attraction,[Nature & Parks],Congoloisirs,Kinshasa,,https://media-cdn.tripadvisor.com/media/photo-...,9,[],...,,,,,,,,,,
1,1536776,ATTRACTION,attraction,[Nature & Parks],Okapi Wildlife Reserve,Orientale Province,,https://media-cdn.tripadvisor.com/media/photo-...,3,[],...,,,,,,,,,,
2,13203729,ATTRACTION,attraction,"[Shopping, Food & Drink]",Marche Nouveau DAIPN,Kinshasa,,https://media-cdn.tripadvisor.com/media/photo-...,12,[],...,,,,,,,,,,
3,8661504,HOTEL,hotel,[Specialty Lodging],Bukima Tented Camp,"Rumangabo, North Kivu Province",Just outside the Virunga National Park boundar...,https://media-cdn.tripadvisor.com/media/photo-...,79,[],...,,,,,,,,,,
4,10414108,HOTEL,hotel,[Specialty Lodging],"Tchegera Island Tented Camp, Virunga National ...","Goma, North Kivu Province",,https://media-cdn.tripadvisor.com/media/photo-...,109,[],...,,,,,,,,,,


The data has 35,836 rows and 65 columns. Most of these columns are not useful for the analysis, and many contain a large share of missing values that we cannot reliably impute, so we remove them from the dataset.

### Data Cleaning

In [5]:
# Removing irrelevant columns. 
columns_to_drop = ['image', 'photoCount', 'awards', 'phone', 'address', 'email', 'localAddress', 'webUrl', 'website',
                   'neighborhoodLocations', 'nearestMetroStations', 'ancestorLocations', 'booking', 'offerGroup',
                   'subtype', 'hotelClass', 'roomTips', 'checkInDate', 'category', 'checkOutDate', 'offers',
                   'hotelClassAttribution', 'localLangCode', 'isClosed', 'ratingHistogram', 'isLongClosed',
                   'openNowText', 'dietaryRestrictions', 'hours', 'menuWebUrl', 'localName', 'establishmentTypes',
                   'ownersTopReasons', 'guideFeaturedInCopy', 'rentalDescriptions', 'photos']

cleaner.drop_columns(columns_to_drop)
cleaner.get_preview(df)

Unnamed: 0,id,type,subcategories,name,locationString,description,rankingPosition,rating,rawRanking,addressObj,...,priceLevel,priceRange,cuisines,mealTypes,dishes,features,bedroomInfo,bathroomInfo,bathCount,baseDailyRate
0,2704993,ATTRACTION,[Nature & Parks],Congoloisirs,Kinshasa,,17.0,4.0,2.778074,"{'street1': 'Avenue de la Liberation', 'street...",...,,,,,,,,,,
1,1536776,ATTRACTION,[Nature & Parks],Okapi Wildlife Reserve,Orientale Province,,1.0,5.0,2.751658,"{'street1': '', 'street2': '', 'city': None, '...",...,,,,,,,,,,
2,13203729,ATTRACTION,"[Shopping, Food & Drink]",Marche Nouveau DAIPN,Kinshasa,,21.0,5.0,2.773659,"{'street1': 'Place des evolues', 'street2': No...",...,,,,,,,,,,
3,8661504,HOTEL,[Specialty Lodging],Bukima Tented Camp,"Rumangabo, North Kivu Province",Just outside the Virunga National Park boundar...,2.0,4.5,3.351389,"{'street1': 'Virunga National Park', 'street2'...",...,,,,,,,,,,
4,10414108,HOTEL,[Specialty Lodging],"Tchegera Island Tented Camp, Virunga National ...","Goma, North Kivu Province",,1.0,5.0,3.464931,"{'street1': None, 'street2': None, 'city': 'Go...",...,,,,,,,,,,


### Check for missing values

In [6]:
# Calculates the percentage of missing values in each column
cleaner.missing_values_percentage(df)

features              98.839156
dishes                98.839156
mealTypes             98.839156
cuisines              98.839156
baseDailyRate         96.461659
bathCount             96.263534
bathroomInfo          96.263534
bedroomInfo           96.263534
priceRange            60.545262
priceLevel            60.288537
numberOfRooms         52.349593
description           43.830227
amenities             27.011943
rankingDenominator    25.856680
rankingString         25.856680
rawRanking            25.856680
rankingPosition       25.856680
rating                25.477174
longitude             16.000670
latitude              16.000670
locationString         3.736466
reviewTags             3.736466
addressObj             3.736466
subcategories          3.736466
dtype: float64

The columns features, dishes, mealTypes, cuisines, baseDailyRate, bathCount, bathroomInfo, and bedroomInfo are each more than 90 percent missing, so we drop them.
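The missing-value report and the threshold-based drop can be reproduced with plain pandas. The sketch below is an assumption about what the `DataCleaning` helpers do, not their actual implementation; the function names and the demo frame are illustrative.

```python
import numpy as np
import pandas as pd


def missing_percentage(frame):
    """Percentage of missing values per column, highest first, omitting complete columns."""
    pct = frame.isna().mean() * 100
    return pct[pct > 0].sort_values(ascending=False)


def drop_above_threshold(frame, threshold):
    """Return a frame without the columns whose missing percentage exceeds `threshold`."""
    pct = frame.isna().mean() * 100
    return frame.loc[:, pct <= threshold]


demo = pd.DataFrame({
    "rating": [4.0, np.nan, 5.0, 4.5],
    "dishes": [np.nan, np.nan, np.nan, np.nan],
    "name": ["a", "b", "c", "d"],
})
report = missing_percentage(demo)
trimmed = drop_above_threshold(demo, 90)
print(report)
print(list(trimmed.columns))
```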

In [7]:
# Drops columns with missing values percentage above the specified threshold of 90
cleaner.drop_above_threshold(90)
cleaner.get_preview(df)

Unnamed: 0,id,type,subcategories,name,locationString,description,rankingPosition,rating,rawRanking,addressObj,...,longitude,rankingString,rankingDenominator,numberOfReviews,reviewTags,reviews,amenities,numberOfRooms,priceLevel,priceRange
0,2704993,ATTRACTION,[Nature & Parks],Congoloisirs,Kinshasa,,17.0,4.0,2.778074,"{'street1': 'Avenue de la Liberation', 'street...",...,,#17 of 105 things to do in Kinshasa,105,9,[],[],,,,
1,1536776,ATTRACTION,[Nature & Parks],Okapi Wildlife Reserve,Orientale Province,,1.0,5.0,2.751658,"{'street1': '', 'street2': '', 'city': None, '...",...,,#1 of 4 things to do in Orientale Province,4,2,[],[],,,,
2,13203729,ATTRACTION,"[Shopping, Food & Drink]",Marche Nouveau DAIPN,Kinshasa,,21.0,5.0,2.773659,"{'street1': 'Place des evolues', 'street2': No...",...,15.3087,#21 of 105 things to do in Kinshasa,105,3,[],[],,,,
3,8661504,HOTEL,[Specialty Lodging],Bukima Tented Camp,"Rumangabo, North Kivu Province",Just outside the Virunga National Park boundar...,2.0,4.5,3.351389,"{'street1': 'Virunga National Park', 'street2'...",...,29.43431,#2 of 3 Specialty lodging in Rumangabo,3,34,[],[],"[Restaurant, Mountain View]",6.0,,
4,10414108,HOTEL,[Specialty Lodging],"Tchegera Island Tented Camp, Virunga National ...","Goma, North Kivu Province",,1.0,5.0,3.464931,"{'street1': None, 'street2': None, 'city': 'Go...",...,29.117218,#1 of 17 Specialty lodging in Goma,17,29,"[{'text': 'gorilla trekking', 'reviews': 3}, {...",[],"[Multilingual Staff, Restaurant, Bar/Lounge, F...",6.0,,




In [None]:



outlier_latitudes = [10.8, 23.58, 18.02, 38.69, 35.80, 40.43, 32.96, 38.10, 0.5769, -5.986, -19.62, -0.5236, 15.05,
                     21.16, 21.25, 20.93, 22.46, 24.02, 0.69, 1.50, 10.99, 13.081]
outlier_longitudes = [-68.30, -69.54, -63.04, -9.4, -7.50, -3.70, 11.98, 25.81, 81.51, 0, -14.27, -21.81, -39.59,
                      -39.04, -38.17, -37.59, -36.64, -34.67, 0, 103.86, 76.96, 80.274]




cleaner.split_price_range()
cleaner.fill_missing_prices()
cleaner.clean_amenities()
cleaner.replace_nan_amenities()
cleaner.populate_empty_lists(['restaurant', 'bathroom', 'room'])
cleaner.extract_ranking_info()
mappings = {
    'hotel': 'hotels',
    'B&B / Inn': 'B&Bs / Inns',
    'Sights & Landmarks': 'Nature & Parks',
    'Fun & Games': 'Outdoor Activities',
    'Boat Tours & Water Sports': 'Water & Amusement Parks',
    'Traveler Resources': 'Shopping',
    'Concerts & Shows': 'Nightlife',
    'Food & Drink': 'places to eat',
    'Nature & Parks': 'things to do',
    'Museums': 'things to do',
    'Tours': 'things to do',
    'Outdoor Activities': 'things to do',
    'B&Bs / Inns': 'Specialty lodging',
}
cleaner.replace_ranking_types(mappings)
cleaner.split_ranking_string()
cleaner.calculate_regional_rating()
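type_mapping = {
    'ATTRACTION': 'things to do',
    # np.random.choice is evaluated once, here, so the mapping holds a single randomly
    # chosen label for HOTEL, and that label can change between runs unless a seed is set.
    'HOTEL': np.random.choice(['hotel', 'Specialty lodging'], size=1)[0],
}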
cleaner.fill_ranking_type(type_mapping)
cleaner.clean_ratings()
cleaner.clean_review_tags()
cleaner.fill_missing_coordinates()
cleaner.remove_outliers(outlier_latitudes, outlier_longitudes)
cleaner.clean_subcategories()
cleaner.drop_missing_values(['addressObj'])
cleaner.extract_country_and_city()
cleaner.drop_unused_columns(['rankingPosition', 'addressObj', 'rawRanking', 'rankingString', 'rankingDenominator',
                             'reviews', 'numberOfRooms', 'priceLevel', 'priceRange', 'reviewTags', 'Location',
                             'rankingtype', 'Numerator', 'Denominator'])
cleaner.replace_empty_strings()
cleaner.drop_rows_with_nan()
cleaner.save_to_csv(r'../Data/clean_data.csv')


~~~ Split to new notebook here~~~~~~~

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np

import json
import glob
import re
import pickle

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

from sklearn.preprocessing import StandardScaler, normalize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import r2_score

from surprise import Dataset, Reader, KNNBasic, SVD, NMF, KNNWithMeans, SVDpp
from surprise.model_selection import train_test_split
from surprise import accuracy as sup_accuracy



import warnings
# Ignore future deprecation warnings
warnings.filterwarnings("ignore", category=FutureWarning)


In [None]:
clean_df = pd.read_csv('../Data/clean_data.csv')

In [None]:
clean_df.head()

In [None]:
location = ['Chichiriviche']
filtered_data = clean_df[clean_df['locationString'].isin(location)]
len(filtered_data)

In [None]:
countries = ['Spain', 'Portugal', 'Venezuela', 'Carribbean', 'Georgia']
filtered_data = clean_df[clean_df['country'].isin(countries)]
len(filtered_data)

In [None]:
# Drop the filtered rows from the original DataFrame
clean_df = clean_df.drop(filtered_data.index)

In [None]:
# Count how often each subcategory value occurs
subcategory_counts = clean_df['subcategories'].value_counts()

# Select five subcategories; the slice starts at position 1, so the most frequent one is skipped
top_subcategories = subcategory_counts.iloc[1:6]

# Plot the selected subcategories
plt.figure(figsize=(10, 6))
top_subcategories.plot(kind='bar')
plt.title('Top Subcategories Destinations')
plt.xlabel('Subcategory')
plt.ylabel('Count')
plt.savefig('../Data/images/top_subcategories')
plt.show()

In [None]:
# Plotting 'subcategories' (top 10)
plt.figure(figsize=(10, 6))
top_10_subcategories = clean_df['subcategories'].explode().value_counts().head(10)
top_10_subcategories.plot(kind='bar')
plt.title('Top 10 subcategories')
plt.xlabel('Subcategory')
plt.ylabel('Count')
plt.savefig('../Data/images/top_10_subcategories_individually.png')
plt.show()
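One caveat with `explode()`: it only splits values that are real Python lists. After a CSV round trip, list-valued columns usually come back as strings, in which case `explode()` leaves them untouched. The sketch below assumes the column holds stringified lists such as `"['Nature & Parks']"` (an assumption, since the cleaned file is not shown) and parses them first; `parse_list` is a hypothetical helper.

```python
import ast

import pandas as pd

demo = pd.DataFrame({
    "subcategories": [
        "['Nature & Parks']",
        "['Shopping', 'Food & Drink']",
        "['Nature & Parks']",
    ]
})


def parse_list(cell):
    """Turn a stringified list back into a list; wrap anything else in a one-item list."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return [cell]
    return value if isinstance(value, list) else [cell]


counts = demo["subcategories"].apply(parse_list).explode().value_counts()
print(counts)
```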

In [None]:
# Histograms of the numerical columns
# Select the numerical variables to plot
num_cols_to_plot = clean_df.select_dtypes(include=['int64', 'float64']).columns
print("Columns to plot:", num_cols_to_plot)
# Create a histogram for each variable
clean_df[num_cols_to_plot].hist(figsize=(25, 12))
plt.savefig('../Data/images/Columnstoplot')
plt.show()

In [None]:
# Detect coordinate outliers with the z-score method
from scipy import stats  # provides zscore; not imported at the top of this notebook

zscore_threshold = 3  # Adjust this threshold based on your data and requirements
coords = clean_df[['latitude', 'longitude']]
within_bounds = clean_df['latitude'].between(-35, 37) & clean_df['longitude'].between(-25, 60)
extreme_z = (np.abs(stats.zscore(coords)) > zscore_threshold).any(axis=1)
outliers = clean_df[within_bounds & extreme_z]

# Blank out the outlier coordinates on a copy used only for the map,
# so that later cells still see clean_df unchanged
map_df = clean_df.copy()
map_df.loc[outliers.index, ['latitude', 'longitude']] = np.nan


# Define the map layout
layout = go.Layout(
    title='Places to visit by Location',
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        # 'stamen-terrain' tiles now need a Stadia Maps API key, so a token-free style is used
        style='open-street-map',
        bearing=0,
        center=dict(lat=8, lon=20),
        pitch=0,
        zoom=2
    ),
)

# Define the map data as a scatter plot of the coordinates
data = go.Scattermapbox(
    lat=map_df['latitude'],
    lon=map_df['longitude'],
    mode='markers',
    marker=dict(
        size=5,
        color=map_df['rating'],
        opacity=0.8
    ),
    text=['Price: ${}'.format(i) for i in map_df['UpperPrice']],
    hovertext=map_df.apply(lambda x: f"Ranking Type: {x['RankingType']}, Location: {x['locationString']}", axis=1),
)


# Create the map figure, save it, and show it
fig = go.Figure(data=[data], layout=layout)
# plt.savefig would save an empty matplotlib figure; Plotly figures are exported with write_image
fig.write_image('../Data/images/map.png')  # static export requires the kaleido package
fig.show()
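The coordinate check in the cell above combines a bounding box with z-scores. A slightly different rule, sketched here as an alternative rather than as the notebook's method, flags a row when it falls outside the box or has an extreme z-score. The bounds reuse the limits from the cell above; the function name and demo points are assumptions.

```python
import numpy as np
import pandas as pd

# Same latitude/longitude limits as the cell above.
AFRICA_BOUNDS = {"lat": (-35, 37), "lon": (-25, 60)}


def flag_coordinate_outliers(frame, z_threshold=3.0):
    """Flag rows that lie outside the bounding box or have an extreme coordinate z-score."""
    lat, lon = frame["latitude"], frame["longitude"]
    inside_box = lat.between(*AFRICA_BOUNDS["lat"]) & lon.between(*AFRICA_BOUNDS["lon"])
    coords = frame[["latitude", "longitude"]]
    z = (coords - coords.mean()) / coords.std(ddof=0)
    extreme_z = (z.abs() > z_threshold).any(axis=1)
    return ~inside_box | extreme_z


demo = pd.DataFrame({
    "latitude": [-1.3, 9.0, 30.0, 40.43],    # last point is north of the box
    "longitude": [36.8, 7.4, 31.2, -3.70],
})
flags = flag_coordinate_outliers(demo)
print(flags.tolist())
```

Rows with missing coordinates are also flagged by this rule, because `between` returns `False` for NaN.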

In [None]:
clean_df.columns

In [None]:
# Select the variables you want to plot
cols_to_plot = ['latitude', 'longitude', 'numberOfReviews', 'LowerPrice', 'UpperPrice']

# Create a subplot grid
fig, axes = plt.subplots(nrows=1, ncols=len(cols_to_plot), figsize=(12, 12), sharey=True)

# Create a boxplot for each variable in a separate subplot
for i, col in enumerate(cols_to_plot):
    axes[i].boxplot(clean_df[col])
    axes[i].set_title(col)
    axes[i].tick_params(axis='both', which='major')

# Adjust spacing between subplots
plt.tight_layout()
# Save the figure
# plt.savefig(r"../Data/images/Outliers.png")
# Show the figure
plt.show()

In [None]:
# Checking for outliers in the 'LowerPrice' column
plt.boxplot(clean_df['LowerPrice'])
plt.xlabel("LowerPrice", fontsize=12)
plt.ylabel("Price", fontsize=12)
plt.title("LowerPrice Distribution", fontsize=15)
# save the figure
# plt.savefig("../Data/images/lower_price_outliers_plot")
plt.show()

In [None]:
# Checking for outliers in the 'UpperPrice' column
plt.boxplot(clean_df['UpperPrice'])
plt.xlabel("UpperPrice", fontsize=12)
plt.ylabel("Price", fontsize=12)
plt.title("UpperPrice Distribution", fontsize=15)
# save the figure
# plt.savefig("../Data/images/upper_price_outliers_plot")
plt.show()

In [None]:
## Multicollinearity

# Create a correlation matrix from the numeric columns only
corr_matrix = clean_df.corr(numeric_only=True)
# Create a fig size
plt.figure(figsize=(16, 16))
# Mask the upper triangle (including the diagonal) so only the lower triangle is shown
mask = np.zeros_like(corr_matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Plot the heatmap with the mask applied
sns.heatmap(corr_matrix, mask=mask, cmap='magma', center=0, annot=True)
# Save the figure before showing it; calling savefig after plt.show() would write an empty image
plt.savefig('../Data/images/multicollinearity.png')
plt.show()
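A heatmap is easy to scan but hard to act on when there are many columns. As a complement, the sketch below lists the column pairs whose absolute correlation exceeds a cutoff. The function name, the 0.8 cutoff, and the demo values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair appears once and the diagonal is skipped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


demo = pd.DataFrame({
    "LowerPrice": [50, 80, 120, 200],
    "UpperPrice": [100, 160, 240, 400],  # exactly twice LowerPrice
    "rating": [4.5, 3.0, 5.0, 3.5],
})
pairs = correlated_pairs(demo)
print(pairs)
```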

## Modelling

Step 1: Prepare the data

Load the sample data into a suitable data structure, such as a pandas DataFrame.
Preprocess the data if necessary, including handling missing values, converting categorical variables to numerical representations, and normalizing numerical features.

Step 2: Split the data

Split the data into training and testing sets. Typically, an 80-20 split is used, but you can adjust the ratio based on the size of your dataset.

Step 3: Choose recommendation models

There are several recommendation models you can choose from, depending on the nature of your data and the problem you want to solve. Here are a few popular models:

* Collaborative Filtering: This approach recommends items based on users' past behavior and preferences.
* Content-Based Filtering: This approach recommends items based on the similarity between items' characteristics and users' preferences.
* Matrix Factorization: This approach decomposes the user-item rating matrix to find latent factors and make recommendations.
* Neural Networks: You can also use deep learning models like neural networks for recommendation tasks.

Step 4: Train and evaluate the models

For each model you choose, train it using the training set.
Evaluate the trained model's performance using appropriate evaluation metrics such as precision, recall, or Mean Average Precision (MAP).
Repeat the training and evaluation process for each model.

Step 5: Choose the best model

Compare the performance of the different models based on the evaluation metrics.
Select the model that performs best according to your evaluation criteria.

Step 6: Fine-tune and optimize the chosen model

Once you have selected the best model, you can further fine-tune and optimize its hyperparameters using techniques like cross-validation or grid search.

Step 7: Deploy the recommendation system

Once you are satisfied with the performance of your chosen and optimized model, you can deploy it to make real-time recommendations.
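To make the content-based option from Step 3 concrete, here is a minimal sketch that ranks items by cosine similarity of their feature vectors. The feature matrix and the function name `top_similar` are illustrative assumptions, not part of this project's code.

```python
import numpy as np


def top_similar(item_features, item_index, k=2):
    """Return the indices of the k items most similar to `item_index` by cosine similarity."""
    norms = np.linalg.norm(item_features, axis=1, keepdims=True)
    unit = item_features / np.where(norms == 0, 1.0, norms)  # avoid dividing by zero
    scores = unit @ unit[item_index]
    scores[item_index] = -np.inf  # never recommend the query item itself
    return np.argsort(-scores)[:k]


# Rows are items; columns are hypothetical features (e.g. encoded amenities or categories).
features = np.array([
    [1.0, 0.0, 1.0],  # item 0 (query)
    [1.0, 0.0, 0.9],  # item 1: almost the same profile as item 0
    [0.0, 1.0, 0.0],  # item 2: shares nothing with item 0
    [1.0, 1.0, 1.0],  # item 3: partial overlap
])
neighbours = top_similar(features, item_index=0, k=2)
print(neighbours.tolist())
```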

In [None]:
clean_df.columns

In [None]:
clean_df[['id', 'rating','name', 'Rank', 'Total','regional_rating','subcategories', 'RankingType', 'locationString','country','city', 'LowerPrice', 'UpperPrice']]


### 1. Prepare the data

#### * Dealing with outliers in the numerical columns

In [None]:
from sklearn.cluster import KMeans

# Select the numerical features for clustering
#numerical_columns = clean_df.select_dtypes(include=[np.number]).columns
#numerical_data = clean_df[numerical_columns]

# Apply K-means clustering
#kmeans = KMeans(n_clusters=3)  # Specify the number of clusters
#kmeans.fit(numerical_data)

# Assign each data point to a cluster
#labels = kmeans.labels_

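# Identify the cluster with the outliers (assumed here to be the smallest cluster;
# argmax would pick the largest cluster and remove most of the data)
#outlier_cluster = np.argmin(np.bincount(labels))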

# Remove the rows belonging to the outlier cluster
#clean_df = clean_df[labels != outlier_cluster]


In [None]:
clean_df.shape

#### * Cleaning and transforming textual data

In [None]:
textual_data = clean_df[['subcategories', 'RankingType', 'locationString', 'country', 'city', 'amenities']]
textual_data

In [None]:
# Convert object columns to categorical
clean_df['type'] = clean_df['type'].astype('category')
clean_df['amenities'] = clean_df['amenities'].astype('category')
clean_df['subcategories'] = clean_df['subcategories'].astype('category')
#clean_df['locationString'] = clean_df['locationString'].astype('category')

In [None]:
# Create a list of unique values in the column
unique_subcategory_values = list(clean_df["subcategories"].unique())

# Create a dictionary that maps each unique value to a unique number
subcategory_map = {}
for index, value in enumerate(unique_subcategory_values):
    subcategory_map[value] = index + 1
    
# Create a new column with the encoded values
clean_df['subcategories_mapped'] = clean_df['subcategories'].map(subcategory_map)


In [None]:
# Create a list of unique values in the column
unique_amenities_values = list(clean_df["amenities"].unique())

# Create a dictionary that maps each unique value to a unique number
amenities_mapping = {}
for index, value in enumerate(unique_amenities_values):
    amenities_mapping[value] = index + 1

# Use the map() function to map the values in the column to their respective numbers
clean_df["amenities_mapped"] = clean_df["amenities"].map(amenities_mapping)

In [None]:
# Create a list of unique values in the column
unique_RankingType_values = list(clean_df["RankingType"].unique())

# Create a dictionary that maps each unique value to a unique number
RankingType_mapping = {}
for index, value in enumerate(unique_RankingType_values):
    RankingType_mapping[value] = index + 1

# Use the map() function to map the values in the column to their respective numbers
clean_df["RankingType_mapped"] = clean_df["RankingType"].map(RankingType_mapping)

In [None]:
# Create a list of unique values in the column
unique_locationString_values = list(clean_df["locationString"].unique())

# Create a dictionary that maps each unique value to a unique number
locationString_mapping = {}
for index, value in enumerate(unique_locationString_values):
    locationString_mapping[value] = index + 1

# Use the map() function to map the values in the column to their respective numbers
clean_df["locationString_mapped"] = clean_df["locationString"].map(locationString_mapping)

In [None]:
# Create a list of unique values in the column
unique_country_values = list(clean_df["country"].unique())

# Create a dictionary that maps each unique value to a unique number
country_mapping = {}
for index, value in enumerate(unique_country_values):
    country_mapping[value] = index + 1

# Use the map() function to map the values in the column to their respective numbers
clean_df["country_mapped"] = clean_df["country"].map(country_mapping)

In [None]:
# Create a list of unique values in the column
unique_type_values = list(clean_df["type"].unique())

# Create a dictionary that maps each unique value to a unique number
type_mapping = {}
for index, value in enumerate(unique_type_values):
    type_mapping[value] = index + 1

# Use the map() function to map the values in the column to their respective numbers
clean_df["type_mapped"] = clean_df["type"].map(type_mapping)
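The six encoding cells above repeat one pattern: number the unique values of a column in order of first appearance, starting at 1. A single helper built on `pd.factorize` produces the same codes. This is a sketch; the helper name `add_integer_codes` and the demo frame are assumptions.

```python
import pandas as pd


def add_integer_codes(frame, columns):
    """Add a `<column>_mapped` column holding 1-based codes in order of first appearance."""
    out = frame.copy()
    for col in columns:
        codes, _ = pd.factorize(out[col])
        out[f"{col}_mapped"] = codes + 1  # factorize is 0-based; missing values become 0
    return out


demo = pd.DataFrame({
    "country": ["Kenya", "Egypt", "Kenya"],
    "type": ["HOTEL", "ATTRACTION", "HOTEL"],
})
encoded = add_integer_codes(demo, ["country", "type"])
print(encoded)
```

Integer codes impose an arbitrary order on the categories. For models that read numeric magnitude as meaningful, the one-hot encoding mentioned in the approach section may be the safer choice.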

In [None]:
clean_df.head()

### * GridSearch

In [None]:


# 3. Define the feature space
#features = ['rating', 'Rank', 'Total', 'regional_rating', 'amenities_mapped', 'RankingType_mapped', 'locationString_mapped','country_mapped', 'type_mapped']

# 4. Define the model and parameter grid
#model = SVR()
#param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# 5. Split the data
#X = clean_df[features]
#y = clean_df['subcategories_mapped']
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 6. Perform grid search
#grid_search = GridSearchCV(model, param_grid, scoring='neg_mean_squared_error')
#grid_search.fit(X_train, y_train)

# 7. Fit the data
#best_model = grid_search.best_estimator_

# 8. Evaluate the results
#best_params = grid_search.best_params_
#mse = -grid_search.best_score_
#rmse = np.sqrt(mse)

# Print the best parameters and evaluation metrics
#print("Best Parameters:", best_params)
#print("RMSE:", rmse)

In [None]:
clean_df.info()

### * Normalization and Standardization

In [None]:
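# Select the numerical columns for normalization
numerical_columns = ['rating', 'Rank', 'Total', 'regional_rating', 'LowerPrice', 'UpperPrice']

# Normalize the numerical columns.
# sklearn's normalize() rescales each row by default (axis=1); axis=0 rescales each
# column to unit L2 norm, which matches the per-column intent here.
scaler = normalize
normalized_data = clean_df.copy()
normalized_data[numerical_columns] = scaler(clean_df[numerical_columns], axis=0)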
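Normalization and standardization are easy to conflate, so here is a small NumPy-only comparison of three common rescalings. The toy matrix is an assumption; the formulas correspond to what scikit-learn's `normalize` (row-wise by default), `StandardScaler`, and `MinMaxScaler` compute.

```python
import numpy as np

# Toy data: column 0 is a rating, column 1 is a price (illustrative values).
X = np.array([[4.0, 100.0],
              [5.0, 300.0],
              [3.0, 200.0]])

# Row-wise L2 normalization: every row (sample) is scaled to unit length.
row_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Column-wise standardization: every column gets zero mean and unit variance.
col_standard = (X - X.mean(axis=0)) / X.std(axis=0)

# Column-wise min-max scaling: every column is mapped onto [0, 1].
col_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(row_unit)
print(col_standard)
print(col_minmax)
```

With row-wise normalization the large price values dominate each row, so the rating column is squeezed toward zero. That is why per-column scaling is usually the better fit for features measured in different units.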

In [None]:
clean_df.dtypes


### Baseline Model

In [None]:
# Load the data into Surprise Dataset format.
# Surprise reads the three columns as (user, item, rating), so 'id' plays the user role
# and 'Rank' the item role here.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(clean_df[['id', 'Rank', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Train the model (KNNBasic has no random_state parameter, so none is passed)
model = KNNBasic()
model.fit(trainset)

# Evaluate the model; accuracy1 holds the RMSE
predictions1 = model.test(testset)
accuracy1 = sup_accuracy.rmse(predictions1)
mae1 = sup_accuracy.mae(predictions1)

The RMSE (Root Mean Squared Error) is the square root of the mean squared difference between the actual and predicted ratings, so large errors are penalized more heavily than small ones. A lower RMSE indicates better accuracy, as it means the model's predictions are closer to the actual ratings.

The MAE (Mean Absolute Error) measures the average absolute difference between the actual ratings and the predicted ratings. Like RMSE, a lower MAE value indicates better accuracy.

In this case, the model achieved an RMSE of 0.7981 and an MAE of 0.5355 on a 1 to 5 rating scale, meaning its predictions deviate from the actual ratings by roughly half a rating point on average.

Overall, these evaluation results suggest that the model trained using the KNNBasic algorithm performs reasonably well in predicting ratings for the test set.
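As a worked example of the two metrics, the sketch below computes RMSE and MAE by hand on four made-up ratings (the numbers are illustrative, not taken from the dataset).

```python
import numpy as np

actual = np.array([5.0, 4.0, 3.0, 4.5])
predicted = np.array([4.0, 4.0, 4.0, 4.0])

errors = predicted - actual            # [-1.0, 0.0, 1.0, -0.5]
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((1 + 0 + 1 + 0.25) / 4) = 0.75
mae = np.mean(np.abs(errors))          # (1 + 0 + 1 + 0.5) / 4 = 0.625

print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}")
```

RMSE is never smaller than MAE on the same errors; a wide gap between the two points to a few large misses rather than many small ones.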

In [None]:
for prediction in predictions1:
    print(f"Predicted rating: {prediction.est:.2f}")
    print(f"Actual rating: {prediction.r_ui:.2f}")
    print("---")

In [None]:
threshold = 3  # Define the threshold for positive predictions

true_positives = 0
false_positives = 0
false_negatives = 0

for prediction in predictions1:
    if prediction.est >= threshold:
        if prediction.r_ui >= threshold:
            true_positives += 1
        else:
            false_positives += 1
    elif prediction.r_ui >= threshold:
        false_negatives += 1

precision1 = true_positives / (true_positives + false_positives)
recall1 = true_positives / (true_positives + false_negatives)

print(f"Precision: {precision1:.2f}")
print(f"Recall: {recall1:.2f}")


Precision: Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive. It indicates how accurate the model is when it predicts positive instances. A precision score of 0.97 means that 97% of the instances predicted as positive were actually positive.

Recall: Recall, also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive instances out of all actual positive instances. It indicates how well the model captures the positive instances. A recall score of 1.00 means that the model successfully identified all positive instances.
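The precision and recall loop is repeated for each model in this notebook, so it can be wrapped in one function. The sketch below is a hedged refactoring: the function name is an assumption, and the `Prediction` namedtuple here only mimics the `est` and `r_ui` fields of Surprise's prediction objects so that the example runs without the library.

```python
from collections import namedtuple

# Stand-in for Surprise predictions; only the two fields used below are modelled.
Prediction = namedtuple("Prediction", ["est", "r_ui"])


def precision_recall(predictions, threshold=3.0):
    """Treat ratings at or above `threshold` as positive and return (precision, recall)."""
    true_pos = false_pos = false_neg = 0
    for p in predictions:
        predicted_positive = p.est >= threshold
        actual_positive = p.r_ui >= threshold
        if predicted_positive and actual_positive:
            true_pos += 1
        elif predicted_positive:
            false_pos += 1
        elif actual_positive:
            false_neg += 1
    # Guard against empty denominators instead of raising ZeroDivisionError.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


demo = [
    Prediction(est=4.2, r_ui=5.0),  # true positive
    Prediction(est=3.5, r_ui=2.0),  # false positive
    Prediction(est=2.5, r_ui=4.0),  # false negative
    Prediction(est=2.0, r_ui=1.0),  # true negative
]
precision, recall = precision_recall(demo)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```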

The RMSE (Root Mean Squared Error) value of 0.7598 indicates the average prediction error of the model on the test set. A lower RMSE value indicates better accuracy of the model's predictions.

In the context of collaborative filtering recommendation systems, the RMSE represents how well the model is able to predict user ratings for items. A lower RMSE implies that the model is better at predicting user preferences and provides more accurate recommendations.

An RMSE of 0.7598 suggests that the model has reasonably good predictive performance, as can be observed below.

In [None]:
clean_df.columns

In [None]:
# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(clean_df[['id', 'Rank', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Train the model
model2 = SVD(random_state=42)
model2.fit(trainset)

# Evaluate the model
predictions2 = model2.test(testset)
accuracy2 = sup_accuracy.rmse(predictions2)
mae2 = sup_accuracy.mae(predictions2)

The SVD model performs almost the same as the baseline.

In [None]:
for prediction in predictions2:
    print(f"Predicted rating: {prediction.est:.2f}")
    print(f"Actual rating: {prediction.r_ui:.2f}")
    print("---")

In the code below, we will iterate over the predictions and increment the corresponding counters based on the predicted ratings and actual ratings. Then, we calculate precision by dividing the number of true positives by the sum of true positives and false positives. Recall is calculated by dividing the number of true positives by the sum of true positives and false negatives.

Note that this calculation assumes a binary classification problem where ratings above the threshold are considered positive and ratings below the threshold are considered negative. 

In [None]:
threshold = 3  # Define the threshold for positive predictions

true_positives = 0
false_positives = 0
false_negatives = 0

for prediction in predictions2:
    if prediction.est >= threshold:
        if prediction.r_ui >= threshold:
            true_positives += 1
        else:
            false_positives += 1
    elif prediction.r_ui >= threshold:
        false_negatives += 1

precision2 = true_positives / (true_positives + false_positives)
recall2 = true_positives / (true_positives + false_negatives)

print(f"Precision: {precision2:.2f}")
print(f"Recall: {recall2:.2f}")


Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive. It indicates how accurate the model is when it predicts positive instances. A precision score of 0.97 means that 97% of the instances predicted as positive were actually positive.

Recall, also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive instances out of all actual positive instances. It indicates how well the model captures the positive instances. A recall score of 1.00 means that the model successfully identified all positive instances.

In [None]:
# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(clean_df[['id', 'regional_rating', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Train the model
model3 = KNNBasic(random_state=42)
model3.fit(trainset)

# Evaluate the model
predictions3 = model3.test(testset)
accuracy3 = sup_accuracy.rmse(predictions3)
mae3 = sup_accuracy.mae(predictions3)
threshold = 3  # Define the threshold for positive predictions

true_positives = 0
false_positives = 0
false_negatives = 0

for prediction in predictions3:
    if prediction.est >= threshold:
        if prediction.r_ui >= threshold:
            true_positives += 1
        else:
            false_positives += 1
    elif prediction.r_ui >= threshold:
        false_negatives += 1

precision3 = true_positives / (true_positives + false_positives)
recall3 = true_positives / (true_positives + false_negatives)

print(f"Precision: {precision3:.2f}")
print(f"Recall: {recall3:.2f}")

: 

Precision: 0.97
Precision is a metric that measures the proportion of correctly predicted positive ratings (relevant items) out of the total items predicted as positive. A precision of 0.97 indicates that the model's predictions for positive ratings are highly accurate, with a high proportion of correctly predicted relevant items.

Recall: 1.00
Recall is a metric that measures the proportion of correctly predicted positive ratings (relevant items) out of the total actual positive ratings. A recall of 1.00 indicates that the model is able to capture all the relevant items in its predictions, without missing any.

These additional metrics provide insights into the model's performance in terms of precision and recall. A high precision indicates that the model's positive predictions are reliable, while a high recall indicates that the model is able to identify most of the relevant items.

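Overall, the model shows high precision and recall in addition to the low RMSE and MAE values mentioned previously. These figures should be read with some care: with a threshold of 3, a recall of 1.00 means every actual positive was predicted as positive, and a model that predicts nearly every rating at or above 3 would score similarly if most hotels in the test set are rated 3 or higher.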
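One way to put precision and recall figures like these in context is to compare them with a trivial baseline that always predicts the same rating. The sketch below uses made-up ratings, not the notebook's data, and a hypothetical helper named `precision_recall` that applies the same thresholding rule as the cells above. It only illustrates that a constant prediction above the threshold reaches a recall of 1.00, with precision equal to the share of actual positives.

```python
def precision_recall(actual, predicted, threshold=3):
    """Precision and recall when ratings >= threshold count as positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if p >= threshold and a >= threshold)
    fp = sum(1 for a, p in pairs if p >= threshold and a < threshold)
    fn = sum(1 for a, p in pairs if p < threshold and a >= threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up ratings: 9 of the 10 values are at or above the threshold of 3.
actual = [5, 4, 4.5, 3, 5, 4, 3.5, 5, 4, 2]

# A "model" that ignores its input and always predicts 4.
constant_prediction = [4.0] * len(actual)

baseline_precision, baseline_recall = precision_recall(actual, constant_prediction)
print(f"Baseline precision: {baseline_precision:.2f}")
print(f"Baseline recall: {baseline_recall:.2f}")
```

If a trained model's precision and recall are close to this kind of baseline, the error metrics (RMSE, MAE) and R-squared are likely to say more about its quality.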

In [None]:
# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(clean_df[['id', 'regional_rating', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Train the model
model4 = SVD(random_state=42)
model4.fit(trainset)

# Evaluate the model
predictions4 = model4.test(testset)
accuracy4 = sup_accuracy.rmse(predictions4)
mae4 = sup_accuracy.mae(predictions4)
threshold = 3  # Define the threshold for positive predictions

true_positives = 0
false_positives = 0
false_negatives = 0

for prediction in predictions4:
    if prediction.est >= threshold:
        if prediction.r_ui >= threshold:
            true_positives += 1
        else:
            false_positives += 1
    elif prediction.r_ui >= threshold:
        false_negatives += 1

precision4 = true_positives / (true_positives + false_positives)
recall4 = true_positives / (true_positives + false_negatives)

print(f"Precision: {precision4:.2f}")
print(f"Recall: {recall4:.2f}")

: 

This model performs just as well as the previous one.

In [None]:
# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(clean_df[['id', 'regional_rating', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Train the model
model5 = NMF(random_state=42)
model5.fit(trainset)

# Evaluate the model
predictions5 = model5.test(testset)
accuracy5 = sup_accuracy.rmse(predictions5)
mae5 = sup_accuracy.mae(predictions5)
threshold = 3  # Define the threshold for positive predictions

true_positives = 0
false_positives = 0
false_negatives = 0

for prediction in predictions5:
    if prediction.est >= threshold:
        if prediction.r_ui >= threshold:
            true_positives += 1
        else:
            false_positives += 1
    elif prediction.r_ui >= threshold:
        false_negatives += 1

precision5 = true_positives / (true_positives + false_positives)
recall5 = true_positives / (true_positives + false_negatives)

print(f"Precision: {precision5:.2f}")
print(f"Recall: {recall5:.2f}")

: 

This model's performance is also unchanged.

In [None]:
# model with KNNwithMeans
# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(clean_df[['id', 'subcategories', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Define the item-based collaborative filtering model
model6 = KNNWithMeans(sim_options={'user_based': False})

# Train the model
model6.fit(trainset)

# Make predictions on the test set
predictions6 = model6.test(testset)

# Evaluate the model using RMSE
rmse_score6 = sup_accuracy.rmse(predictions6)
mae6 = sup_accuracy.mae(predictions6)
#print("MAE:", mae6)

: 

In this case, the RMSE is 0.7981 and the MAE is 0.5355. The MAE means that the predicted ratings deviate from the actual ratings by 0.5355 on average. The RMSE is the square root of the mean squared deviation, so it penalizes large errors more heavily than the MAE does. For both metrics, lower values indicate better accuracy.

In [None]:


# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(clean_df[['id', 'regional_rating', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Train the model
model7 = SVDpp(random_state=42)
model7.fit(trainset)

# Evaluate the model
predictions7 = model7.test(testset)

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings = [pred.r_ui for pred in predictions7]
predicted_ratings = [pred.est for pred in predictions7]

# Calculate the R-squared value
r_squared = r2_score(actual_ratings, predicted_ratings)
rmse_score7 = sup_accuracy.rmse(predictions7)
mae7 = sup_accuracy.mae(predictions7)
# Calculate the R-squared value using Surprise's accuracy module
# r_squared = accuracy.rsquared(predictions)
# Print the R-squared value
print("R-squared:", r_squared) 


: 

RMSE: 0.7897
RMSE (Root Mean Squared Error) is a metric that measures the average magnitude of the differences between the predicted ratings and the actual ratings. A lower RMSE value indicates that the model's predictions are closer to the actual ratings, suggesting better accuracy.

MAE: 0.5364
MAE (Mean Absolute Error) is a metric that measures the average magnitude of the differences between the predicted ratings and the actual ratings, without considering the direction of the differences. Like RMSE, a lower MAE value indicates better accuracy in the model's predictions.

R-squared: 0.0208
R-squared is a statistical metric that represents the proportion of the variance in the dependent variable (actual ratings) that can be explained by the independent variable (predicted ratings). An R-squared value of 0.0208 suggests that only a small portion of the variance in the actual ratings can be explained by the predicted ratings. In other words, the model's predictions may not capture the full complexity of the data and may have limited explanatory power.

Based on these metrics, the model has relatively low RMSE and MAE values, indicating good accuracy in its predictions. However, the low R-squared value suggests that the model may not fully capture the underlying patterns in the data and may have limited predictive power.

We will check the other models' performance on the R-squared metric to see how they compare.

In [None]:
# Extract the actual ratings and predicted ratings from the predictions
actual_ratings1 = [pred.r_ui for pred in predictions1]
predicted_ratings1 = [pred.est for pred in predictions1]

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings2 = [pred.r_ui for pred in predictions2]
predicted_ratings2 = [pred.est for pred in predictions2]

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings3 = [pred.r_ui for pred in predictions3]
predicted_ratings3 = [pred.est for pred in predictions3]

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings4 = [pred.r_ui for pred in predictions4]
predicted_ratings4 = [pred.est for pred in predictions4]

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings5 = [pred.r_ui for pred in predictions5]
predicted_ratings5 = [pred.est for pred in predictions5]

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings6 = [pred.r_ui for pred in predictions6]
predicted_ratings6 = [pred.est for pred in predictions6]

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings7 = [pred.r_ui for pred in predictions7]
predicted_ratings7 = [pred.est for pred in predictions7]

# List of predictions and corresponding names
prediction_sets = [
    (predictions1, "Predictions 1"),
    (predictions2, "Predictions 2"),
    (predictions3, "Predictions 3"),
    (predictions4, "Predictions 4"),
    (predictions5, "Predictions 5"),
    (predictions6, "Predictions 6"),
    (predictions7, "Predictions 7")
]

# Iterate over the prediction sets
for predictions, name in prediction_sets:
    # Extract the actual ratings and predicted ratings from the predictions
    actual_ratings = [pred.r_ui for pred in predictions]
    predicted_ratings = [pred.est for pred in predictions]

    # Print the results
    print("Results for", name)
    print("Actual Ratings:", actual_ratings)
    print("Predicted Ratings:", predicted_ratings)
    print()



: 

In [None]:
# List of actual ratings and predicted ratings
actual_ratings_list = [actual_ratings1, actual_ratings2, actual_ratings3, actual_ratings4, actual_ratings5, actual_ratings6, actual_ratings7 ]
predicted_ratings_list = [predicted_ratings1, predicted_ratings2, predicted_ratings3, predicted_ratings4, predicted_ratings5, predicted_ratings6, predicted_ratings7]

# Loop through the ratings lists
for i in range(len(actual_ratings_list)):
    actual_ratings = actual_ratings_list[i]
    predicted_ratings = predicted_ratings_list[i]
    
    # Calculate the R-squared value
    r_squared = r2_score(actual_ratings, predicted_ratings)
    
    # Print the R-squared value
    print(f"R-squared for Set {i+1}: {r_squared}")

: 

R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher value indicates a better fit of the model to the data. In this case, the R-squared values are negative, which means the predictions have a larger squared error than simply predicting the mean of the actual ratings every time. The models therefore do not fit the data well and may not be providing meaningful predictions.
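To make the meaning of a negative R-squared concrete, the sketch below computes it by hand from the standard definition, 1 minus the ratio of the residual sum of squares to the total sum of squares, which is the same definition `r2_score` uses. The ratings are made up and `r_squared` is a hypothetical helper written for this illustration only.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up actual ratings with a mean of 4.0.
actual = [3.0, 4.0, 5.0, 4.0]

# Always predicting the mean gives R^2 = 0 by construction.
mean_prediction = [4.0] * len(actual)
r2_mean = r_squared(actual, mean_prediction)

# Always predicting 5.0 has a larger squared error than the mean, so R^2 < 0.
biased_prediction = [5.0] * len(actual)
r2_biased = r_squared(actual, biased_prediction)

print("R-squared (predict the mean):", r2_mean)
print("R-squared (always predict 5):", r2_biased)
```

A value of 0 is therefore the score of the "always predict the mean" baseline, and anything below 0 is worse than that baseline.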

In [None]:
from sklearn.preprocessing import MinMaxScaler
# Work on a copy so the original clean_df is left unchanged
clean_df_scaled = clean_df[['id', 'regional_rating', 'rating']].copy()
# Apply MinMaxScaler to the 'rating' and 'regional_rating' columns
scaler = MinMaxScaler()
clean_df_scaled[['rating', 'regional_rating']] = scaler.fit_transform(clean_df[['rating', 'regional_rating']])

clean_df_scaled[['id', 'rating', 'regional_rating']]

: 

In [None]:
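# Load the data into Surprise Dataset format
# NOTE: 'rating' in clean_df_scaled was min-max scaled to [0, 1] above, so
# rating_scale=(1, 5) no longer matches the data. Surprise clips estimates
# to the rating scale, so predictions here cannot fall below 1.
reader = Reader(rating_scale=(1, 5))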
data = Dataset.load_from_df(clean_df_scaled[['id','regional_rating', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Define the SVD++ matrix factorization model
model8 = SVDpp(random_state=42)

# Train the model
model8.fit(trainset)

# Make predictions on the test set
predictions8 = model8.test(testset)

# Evaluate the model using RMSE
rmse_score8 = sup_accuracy.rmse(predictions8)
mae8 = sup_accuracy.mae(predictions8)

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings8 = [pred.r_ui for pred in predictions8]
predicted_ratings8 = [pred.est for pred in predictions8]

# Calculate the R-squared value
r_squared8 = r2_score(actual_ratings8, predicted_ratings8)
print("R-squared:", r_squared8) 

: 

RMSE: 0.1842
RMSE (Root Mean Squared Error) is a metric that measures the average magnitude of the differences between the predicted values and the actual values. A lower RMSE value indicates that the model's predictions are closer to the actual values, suggesting better accuracy.

MAE: 0.0920
MAE (Mean Absolute Error) is a metric that measures the average magnitude of the differences between the predicted values and the actual values, without considering the direction of the differences. Like RMSE, a lower MAE value indicates better accuracy in the model's predictions.

R-squared: -0.3321
R-squared is a statistical metric that represents the proportion of the variance in the dependent variable (actual values) that can be explained by the independent variable (predicted values). A negative R-squared value means the model's squared error is larger than that of always predicting the mean of the actual values, so the model does not capture the underlying patterns in the data and has limited predictive power.

Based on these metrics, the model has low RMSE and MAE values, but these are measured on ratings that were min-max scaled to the range 0 to 1, so they are not directly comparable with the errors reported on the original 1-to-5 scale. The negative R-squared value suggests that the model is not a good fit for the data and is not able to explain the variance in the actual values.
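The effect of rescaling on these metrics can be checked on a small example. The sketch below uses made-up ratings and hypothetical helpers (`rmse`, `r_squared`, `min_max`), and it assumes both actual and predicted values are rescaled with the same fixed range of 1 to 5. Under that assumption the RMSE shrinks by the width of the range, while R-squared does not change, which is why R-squared is the safer metric for comparing models across scales.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def min_max(values, low=1.0, high=5.0):
    """Rescale values from [low, high] to [0, 1] (assumed fixed range)."""
    return [(v - low) / (high - low) for v in values]

# Made-up ratings on the original 1-to-5 scale.
actual = [3.0, 4.0, 5.0, 4.0]
predicted = [3.5, 4.0, 4.5, 4.5]

rmse_raw = rmse(actual, predicted)
rmse_scaled = rmse(min_max(actual), min_max(predicted))
r2_raw = r_squared(actual, predicted)
r2_scaled = r_squared(min_max(actual), min_max(predicted))

print(f"RMSE on 1-5 scale: {rmse_raw:.4f}")
print(f"RMSE on 0-1 scale: {rmse_scaled:.4f}")
print(f"R-squared on 1-5 scale: {r2_raw:.4f}")
print(f"R-squared on 0-1 scale: {r2_scaled:.4f}")
```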

In [None]:
# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(normalized_data[['id', 'regional_rating', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Define the SVD++ matrix factorization model
model9 = SVDpp(random_state=42)

# Train the model
model9.fit(trainset)

# Make predictions on the test set
predictions9 = model9.test(testset)

# Evaluate the model using RMSE
rmse_score9 = sup_accuracy.rmse(predictions9)
mae9 = sup_accuracy.mae(predictions9)

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings9 = [pred.r_ui for pred in predictions9]
predicted_ratings9 = [pred.est for pred in predictions9]

# Calculate the R-squared value
r_squared9 = r2_score(actual_ratings9, predicted_ratings9)
print("R-squared:", r_squared9) 

: 

RMSE: 0.9944
RMSE (Root Mean Squared Error) is a metric that measures the average magnitude of the differences between the predicted values and the actual values. A lower RMSE value indicates that the model's predictions are closer to the actual values, suggesting better accuracy.

MAE: 0.9944
MAE (Mean Absolute Error) is a metric that measures the average magnitude of the differences between the predicted values and the actual values, without considering the direction of the differences. Like RMSE, a lower MAE value indicates better accuracy in the model's predictions.

R-squared: -22346.1757
R-squared is a statistical metric that represents the proportion of the variance in the dependent variable (actual values) that can be explained by the independent variable (predicted values). A negative R-squared value means the model's squared error is larger than that of always predicting the mean of the actual values. A value of -22346.1757 means the squared error is roughly 22,000 times that of the mean baseline, so the model does not capture the underlying patterns in the data at all.

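Based on these metrics, the model has high RMSE and MAE values, indicating lower accuracy in its predictions. The identical RMSE and MAE values also indicate that nearly every prediction is off by the same absolute amount, which is what a constant offset between the predictions and the actual values would produce. Additionally, the negative R-squared value suggests that the model is not able to explain the variance in the actual values and is not a good fit for the data.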

In [None]:
# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(normalized_data[['id', 'regional_rating', 'rating']], reader)

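# Build the full trainset (every row, including those in the earlier test split)
item_attributes = normalized_data[['id', 'subcategories', 'locationString']]
data_feat = data.build_full_trainset()
# NOTE: Surprise's algorithms only use the (user, item, rating) triples, so this
# attribute is stored on the object but has no effect on training.
data_feat.item_attributes = item_attributes

# Define the NMF matrix factorization model
model10 = NMF(random_state=42)

# Train the model
model10.fit(data_feat)

# Make predictions on the test set from the previous cell
# (these rows are also in the full trainset, so the scores are optimistic)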
predictions10 = model10.test(testset)

# Evaluate the model using RMSE
rmse_score10 = sup_accuracy.rmse(predictions10)
mae10 = sup_accuracy.mae(predictions10)

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings10 = [pred.r_ui for pred in predictions10]
predicted_ratings10 = [pred.est for pred in predictions10]

# Calculate the R-squared value
r_squared10 = r2_score(actual_ratings10, predicted_ratings10)
print("R-squared:", r_squared10) 



: 

The RMSE and MAE values are identical to those of the previous model, indicating that both models have the same average deviation between the actual ratings and the predicted ratings. Similarly, the R-squared value shows that both models explain a similar proportion of variance in the actual ratings.

Based on these evaluation results, we can conclude that the two models perform similarly in terms of predicting ratings for the test set. However, it's worth noting that the code for Model 2 also includes additional functionality to make item-based recommendations for a specific item.

In [None]:
# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(clean_df_scaled[['id', 'regional_rating', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Define the SVD++ matrix factorization model with regularization
model11 = SVDpp(reg_all=0.01, random_state=42)

# Train the model
model11.fit(trainset)

# Make predictions on the test set
predictions11 = model11.test(testset)

# Evaluate the model using RMSE
rmse_score11 = sup_accuracy.rmse(predictions11)
mae11 = sup_accuracy.mae(predictions11)

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings11 = [pred.r_ui for pred in predictions11]
predicted_ratings11 = [pred.est for pred in predictions11]

# Calculate the R-squared value
r_squared11 = r2_score(actual_ratings11, predicted_ratings11)
print("R-squared:", r_squared11)


: 

RMSE (Root Mean Squared Error) measures the typical magnitude of the difference between the actual ratings and the predicted ratings, with large errors weighted more heavily. A lower RMSE value indicates better accuracy.

MAE (Mean Absolute Error) measures the average absolute difference between the actual ratings and the predicted ratings. A lower MAE value indicates better accuracy.

R-squared is a statistical measure that represents the proportion of the variance in the dependent variable (actual ratings) that can be explained by the independent variables (predicted ratings). A higher R-squared value indicates a better fit of the model to the data.

In this case, the model has achieved a low RMSE and MAE, which indicates good accuracy in predicting the ratings. However, the negative R-squared value suggests that the model does not fit the data well and performs worse than a simple mean model.

In [None]:
from surprise.model_selection import cross_validate
# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(clean_df_scaled[['id', 'regional_rating', 'rating']], reader)
# Define the model
model12 = SVDpp()
# Perform cross-validation
cv_results = cross_validate(model12, data, measures=['RMSE'], cv=5, verbose=True)
# Access the RMSE scores for each fold
rmse_scores = cv_results['test_rmse']
# Calculate the average RMSE
avg_rmse = sum(rmse_scores) / len(rmse_scores)
print("Cross-Validation Results")
print("RMSE Scores:", rmse_scores)
print("Average RMSE:", avg_rmse)

: 

In [None]:
from collections import defaultdict

def calculate_precision_recall(predictions, threshold):
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    
    for prediction in predictions:
        if prediction.est >= threshold:
            if prediction.r_ui >= threshold:
                true_positives += 1
            else:
                false_positives += 1
        elif prediction.r_ui >= threshold:
            false_negatives += 1
    
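    # Guard against division by zero when nothing is predicted (or actually) positive
    predicted_positives = true_positives + false_positives
    actual_positives = true_positives + false_negatives
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0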
    
    return precision, recall

def calculate_topn_hit_rate(predictions, topn):
    total_users = len(set([pred.uid for pred in predictions]))
    topn_hits = 0
    
    user_ratings = defaultdict(list)
    for pred in predictions:
        user_ratings[pred.uid].append((pred.iid, pred.est))
    
    for uid, ratings in user_ratings.items():
        ratings.sort(key=lambda x: x[1], reverse=True)
        topn_predictions = [iid for (iid, _) in ratings[:topn]]
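        # NOTE: this compares a user ID against a list of item IDs, so it only
        # counts a hit when the two ID spaces happen to share a value.
        if uid in topn_predictions:
            topn_hits += 1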
    
    hit_rate = topn_hits / total_users
    
    return hit_rate

def evaluate_prediction_sets(prediction_sets, threshold, topn):
    for predictions, name in prediction_sets:
        precision, recall = calculate_precision_recall(predictions, threshold)
        hit_rate = calculate_topn_hit_rate(predictions, topn)
        
        print("Results for", name)
        print(f"Precision: {precision:.2f}")
        print(f"Recall: {recall:.2f}")
        print(f"Top-{topn} Hit Rate: {hit_rate:.2f}")
        print()


: 

In [None]:

# Usage example
prediction_sets = [
    (predictions1, "Predictions 1"),
    (predictions2, "Predictions 2"),
    (predictions3, "Predictions 3"),
    (predictions4, "Predictions 4"),
    (predictions5, "Predictions 5"),
    (predictions6, "Predictions 6"),
    (predictions7, "Predictions 7"),
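    # NOTE: the entries labelled 8 to 11 reuse predictions6, so their
    # results repeat those of "Predictions 6".
    (predictions6, "Predictions 8"),
    (predictions6, "Predictions 9"),
    (predictions6, "Predictions 10"),
    (predictions6, "Predictions 11"),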
]

threshold = 3.5
topn = 5

evaluate_prediction_sets(prediction_sets, threshold, topn)


: 

The findings indicate that for all sets of predictions, the precision is 0.94, meaning that 94% of the items predicted at or above the threshold were actually rated at or above it. The recall is 1.00, indicating that every item actually rated at or above the threshold was also predicted at or above it. However, the top-5 hit rate is 0.00.

The zero hit rate should not be read as evidence that the recommendations miss the users' preferences. As written, `calculate_topn_hit_rate` checks whether a user's ID appears among that user's top-5 item IDs, which rarely happens when users and items are identified by different columns. In addition, this dataset contains no real user input, so a per-user ranking has limited meaning here.
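A hit rate that compares like with like could be sketched as follows. This is one possible definition, not the notebook's: a user counts as a hit when at least one of their top-N items by estimated rating has an actual rating at or above the threshold. The `Prediction` namedtuple mirrors the fields of Surprise's prediction objects (`uid`, `iid`, `r_ui`, `est`, `details`) so the example runs without the library. The function name `topn_hit_rate` and the toy data are assumptions made for illustration.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same field names as Surprise's Prediction objects.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

def topn_hit_rate(predictions, topn=5, threshold=3.5):
    """Share of users with at least one truly relevant item in their top-N."""
    per_user = defaultdict(list)
    for pred in predictions:
        per_user[pred.uid].append(pred)
    if not per_user:
        return 0.0

    hits = 0
    for user_predictions in per_user.values():
        # Rank this user's items by estimated rating, highest first.
        top = sorted(user_predictions, key=lambda p: p.est, reverse=True)[:topn]
        # A hit means a top-N item was actually rated at or above the threshold.
        if any(p.r_ui >= threshold for p in top):
            hits += 1
    return hits / len(per_user)

# Made-up predictions for two users.
toy_predictions = [
    Prediction("u1", "a", 5.0, 4.8, None),
    Prediction("u1", "b", 2.0, 3.0, None),
    Prediction("u2", "a", 2.0, 4.5, None),
    Prediction("u2", "c", 4.0, 2.5, None),
]

hit_rate = topn_hit_rate(toy_predictions, topn=1, threshold=3.5)
print(f"Top-1 hit rate: {hit_rate:.2f}")
```

With only a handful of ratings per user, as in a dataset without real user input, a metric like this is still of limited value, but it at least measures what its name suggests.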

#### Ensemble Methods

Ensemble methods combine multiple base models to improve the overall predictive performance.
Here we blend the base models by averaging: each model predicts a rating for every pair in the test set, and the blended prediction is the mean of those estimates. Majority voting, the usual choice for classifiers, does not apply directly to continuous rating estimates.

In [None]:

# Select the three columns Surprise expects (user, item, rating)
df = pd.DataFrame(clean_df_scaled[['id', 'regional_rating', 'rating']])

# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df, reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Define the base models
models = [
    NMF(random_state=42),
    SVD(random_state=42),
    SVDpp(random_state=42)
]

# Train the base models
model_predictions = []
for model in models:
    model.fit(trainset)
    predictions = model.test(testset)
    model_predictions.append(predictions)

# Combine the predictions from the base models
blended_predictions = []
for i in range(len(testset)):
    ratings = [pred[i].est for pred in model_predictions]
    blended_rating = sum(ratings) / len(ratings)
    user, item, true_rating = testset[i]
    blended_predictions.append((user, item, true_rating, blended_rating, None))

# Evaluate the blended predictions
blended_rmse = sup_accuracy.rmse(blended_predictions)
blended_mae = sup_accuracy.mae(blended_predictions)

print("Blended RMSE:", blended_rmse)
print("Blended MAE:", blended_mae)

: 

The blended predictions have an RMSE of approximately 0.1820 and an MAE of approximately 0.1157.

The RMSE is the square root of the mean squared difference between the predicted ratings and the true ratings, expressed in the same units as the ratings. Because `clean_df_scaled` holds min-max scaled ratings, those units run from 0 to 1 here rather than from 1 to 5. A lower RMSE indicates better accuracy, with 0 being a perfect match between the predicted and true ratings.

The MAE measures the average absolute difference between the predicted ratings and the true ratings. Like RMSE, a lower MAE value indicates better accuracy, with 0 being a perfect match between the predicted and true ratings.

In this case, the errors are small in absolute terms, but they should be compared against other results on the same scaled data rather than against the errors reported on the original rating scale.

In [None]:
#def calculate_mae(predictions):
    # Compute MAE for the provided predictions
    #mae = sup_accuracy.mae(predictions)
    #return mae

: 

In [None]:
# function to compute RMSE and MAE
#def calculate_metrics(data):
    # Split the data into train and test sets
    #trainset, testset = train_test_split(data, test_size=0.2)
    # Train the model
    #model = KNNBasic(random_state=42)
    #model.fit(trainset)
    # Evaluate the model
    #predictions = model.test(testset)
    # Compute and return RMSE and MAE
    #metrics = {}
    #metrics['rmse'] = accuracy.rmse(predictions)
    #metrics['mae'] = accuracy.mae(predictions)
    #return metrics

: 

In [None]:
# Select the columns relevant for vectorization from the dataset above
vectorization_columns = clean_df[['name', 'subcategories', 'amenities', 'amenities_mapped', 'subcategories_mapped']]

: 

In [None]:
vectorization_columns

: 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert relevant data into a list of strings
documents = []
for _, row in vectorization_columns.iterrows():
    name = row['name']
    subcategories = row['subcategories_mapped']
    amenities = row['amenities_mapped']
    doc = f"{name} {subcategories} {amenities}"
    documents.append(doc)

# Apply TF-IDF vectorization
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)

: 

In [None]:
from sklearn.metrics.pairwise import linear_kernel

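# Compute the cosine similarity matrix. TfidfVectorizer L2-normalizes each row
# by default, so the linear kernel (dot product) equals cosine similarity.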
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)


: 
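The cell above uses `linear_kernel` where a cosine similarity is wanted. That works because `TfidfVectorizer` normalizes each row to unit L2 length by default, and the dot product of two unit vectors is their cosine similarity. The sketch below checks this identity in plain Python on two made-up vectors. The helpers `dot`, `l2_normalize` and `cosine` are written for this illustration and are not part of scikit-learn.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cosine(u, v):
    """Cosine similarity from its definition."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Made-up term weights for two documents.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 1.0, 1.0]

dot_of_normalized = dot(l2_normalize(doc_a), l2_normalize(doc_b))
cosine_of_raw = cosine(doc_a, doc_b)

print(f"Dot product of normalized vectors: {dot_of_normalized:.6f}")
print(f"Cosine similarity of raw vectors:  {cosine_of_raw:.6f}")
```

If the vectorizer were created with `norm=None`, the identity would no longer hold and `cosine_similarity` would be the appropriate call.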

In [None]:
def get_item_recommendations(item_index, cosine_similarities, top_n=5):
    # Get similarity scores for the item
    item_scores = list(enumerate(cosine_similarities[item_index]))

    # Sort items based on similarity scores
    item_scores = sorted(item_scores, key=lambda x: x[1], reverse=True)

    # Get top-N similar items
    top_items = item_scores[1 : top_n + 1]  # Exclude the item itself

    return top_items

# Get recommendations for a specific item (e.g., item with index 0)
item_index = 0
recommendations = get_item_recommendations(item_index, cosine_similarities)

# Print the top 5 recommendations
for item_id, similarity in recommendations:
    print(f"Item ID: {item_id}, Similarity: {similarity}")

: 

In [None]:
def recommend_attraction(clean_df, rating_threshold):
    # Keep only rows whose rating exceeds the threshold, with the columns shown to the user
    recommendations = clean_df[clean_df['rating'] > rating_threshold][['name', 'LowerPrice', 'UpperPrice','amenities', 'type', 'country']]

    # Reset the index of the recommendations DataFrame
    recommendations = recommendations.reset_index(drop=True)

    return recommendations


: 

## Model Three

In [None]:
# Construct the TF-IDF Matrix
tfidfv2=TfidfVectorizer(analyzer='word', stop_words='english')
tfidfv_matrix2=tfidfv2.fit_transform(clean_df['amenities'])
print(tfidfv_matrix2.todense())
tfidfv_matrix2.todense().shape

: 

In [None]:
#with open('../Data/tfidf_matrix2.pkl', 'wb') as f:
 #   pickle.dump(tfidfv_matrix2, f)

: 

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
# Calculate similarity matrix
cosine_sim2 = cosine_similarity(tfidfv_matrix2, tfidfv_matrix2)

: 

In [None]:
# Create a pandas Series to map place names to their indices
indices = pd.Series(data = list(clean_df.index), index = clean_df['name'])
indices

: 

In [None]:
def recommend_place(name, cosine_sim2, data):
    # Map each place name to its row position (positions match the rows of cosine_sim2)
    indices = {title: index for index, title in enumerate(data['name'])}

    # Get the row position of the place that matches the name
    idx = indices[name]

    # Get the pairwise similarity scores of all places with that place
    sim_scores = list(enumerate(cosine_sim2[idx]))

    # Sort the places by similarity score, most similar first
    sim_scores.sort(key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar places, skipping the first entry (the place itself)
    sim_scores = sim_scores[1:11]

    # Get the row positions of those places
    place_positions = [x for x, _ in sim_scores]

    return data.set_index('name').iloc[place_positions][
        ['country', 'RankingType', 'subcategories', 'LowerPrice', 'UpperPrice']
    ]

: 

In [None]:
recommend_place("St. Catherine's Monastery Guesthouse", cosine_sim2, clean_df)

: 

In [None]:
def recommend_amenities(amenity, cosine_sim2, data):
    # Map each amenities string to the position of the first row that has it.
    # Positions must match the rows of cosine_sim2, which was built from
    # clean_df['amenities'] in row order, so `data` should be that same DataFrame.
    indices = {}
    for position, amen in enumerate(data['amenities']):
        indices.setdefault(amen, position)

    # Check if the amenity exists in the dictionary
    if amenity in indices:
        # Get the row position of the amenity that matches the provided amenity
        idx = indices[amenity]

        # Get the pairwise similarity scores of all rows with that row
        sim_scores = list(enumerate(cosine_sim2[idx]))

        # Sort the rows by similarity score, most similar first
        sim_scores.sort(key=lambda x: x[1], reverse=True)

        # Keep the 10 most similar rows, skipping the first entry (the row itself)
        sim_scores = sim_scores[1:11]

        # Get the row positions of the similar amenities
        similar_positions = [x for x, _ in sim_scores]

        # Retrieve the recommended similar amenities from the original DataFrame
        recommended_amenities = data.iloc[similar_positions]['amenities']
        return recommended_amenities
    else:
        return "Amenity not found."

: 

In [None]:
# Split the values in the 'amenities' column and convert them into sets
amenities_sets = clean_df['amenities'].str.split(', ').apply(set)

# Merge the sets of amenities to create a consolidated set of unique amenities
consolidated_amenities = set().union(*amenities_sets)

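# Rebuild each row's amenities as a single string. Every row's set is a subset of
# the consolidated set, so the intersection is the row's own de-duplicated amenities;
# sorting keeps the output order stable.
clean_df['consolidated_amenities'] = amenities_sets.apply(lambda x: ', '.join(sorted(consolidated_amenities.intersection(x))))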

# Print the updated DataFrame with consolidated amenities
clean_df[[ 'consolidated_amenities']]


: 

In [None]:
def combine_similar_amenities(amenities):
    combined_amenities = []

    for amenity in amenities:
        amenity = amenity.lower().strip()  # Convert to lowercase and remove leading/trailing whitespaces
        
        # Combine similar amenities based on specific rules
        if amenity == "free wifi":
            amenity = "wifi"
        elif amenity == "free internet":
            amenity = "internet"
        elif amenity == "free parking":
            amenity = "secured parking"
        elif amenity == "kids pool":
            amenity = "pool"
        # Add more rules for other similar amenities if needed
        
        combined_amenities.append(amenity)
    
    return combined_amenities

# Example usage
amenities = [
    "bathroom only",
    "restaurant",
    "Restaurant",
    "Kids Activities",
    "Free Wifi",
    "Free parking",
    "Wifi",
    "Bar/Lounge",
    "Internet",
    "Restaurant",
    "Free Internet",
    "Breakfast included",
    "Room service",
    "Refrigerator in room",
    "Clothes Rack",
    "Private Bathrooms",
    "Rooftop Terrace",
    "Hair Dryer",
]

consolidated_amenities = combine_similar_amenities(amenities)
print(consolidated_amenities)


: 
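As the list of merge rules grows, a lookup table can be easier to maintain than a chain of `elif` branches. The sketch below is one possible variant, not the notebook's method: it maps each normalized amenity through a dictionary and also drops duplicates while keeping the first-seen order. The names `AMENITY_SYNONYMS` and `normalize_amenities` and the synonym pairs are assumptions chosen for illustration.

```python
# Hypothetical synonym table; the pairs are illustrative only
AMENITY_SYNONYMS = {
    "free wifi": "wifi",
    "free internet": "internet",
    "kids pool": "pool",
}


def normalize_amenities(amenities, synonyms=AMENITY_SYNONYMS):
    """Lowercase, strip, map synonyms, and drop duplicates (order preserved)."""
    seen = set()
    result = []
    for raw in amenities:
        key = raw.lower().strip()
        # Fall back to the cleaned name when no synonym is registered
        key = synonyms.get(key, key)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


print(normalize_amenities(["Free Wifi", "Wifi", "Restaurant", "restaurant ", "Kids Pool"]))
# ['wifi', 'restaurant', 'pool']
```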

In [None]:
from ipywidgets import interact_manual
from IPython.display import display, HTML
from ipywidgets import Dropdown


def recommend_amenities(selected_amenity, cosine_sim2, data):
    # Create a dictionary to map amenities to their indices
    indices = {amen: index for index, amen in enumerate(data['consolidated_amenities'])}

    # Check if the amenity exists in the dictionary
    if selected_amenity in indices:
        # Get the index of the amenity that matches the provided amenity
        idx = indices[selected_amenity]

        # Get the pairwise similarity scores of all amenities with that amenity
        sim_scores = list(enumerate(cosine_sim2[idx]))

        # Sort the amenities based on the similarity scores
        sim_scores.sort(key=lambda x: x[1], reverse=True)

        # Get the scores of the 10 most similar amenities
        sim_scores = sim_scores[1:11]

        # Get the amenity indices
        indices = [x for x, _ in sim_scores]

        return data.set_index('consolidated_amenities').iloc[indices][
            [
                'country',
                'RankingType',
                'subcategories',
                'LowerPrice',
                'UpperPrice',
            ]
        ]
    else:
        return "Amenity not found."

# Create a dropdown menu with the available amenities
amenities_dropdown = Dropdown(options=clean_df['consolidated_amenities'].unique(), description='Select Amenity:')

#@interact_manual(amenity=amenities_dropdown)
def get_recommended_amenities(amenity):
    recommended_amenities = recommend_amenities(amenity, cosine_sim2, clean_df)
    if isinstance(recommended_amenities, str):
        display(HTML(recommended_amenities))
    else:
        display(recommended_amenities)

interact_manual(get_recommended_amenities, amenity=amenities_dropdown)

: 

In [None]:
recommend_amenities('Pool', cosine_sim2, clean_df)

: 

In [None]:
# with open('../Data/tfidf_matrix.pkl', 'wb') as f:
#     pickle.dump(tfidf_matrix, f)
# with open('../Data/.cosine_similarities.pkl', 'wb') as f:
#     pickle.dump(cosine_similarities, f)
# with open('../Data/tfidf_matrix2.pkl', 'wb') as f:
#     pickle.dump(tfidf_matrix2, f)
# with open('../Data/clean_df.pkl', 'wb') as f:
#     pickle.dump(clean_df, f)

# with open('../Data/.cosine_sim2.pkl', 'wb') as f:
#     pickle.dump(cosine_sim2, f)

# with open('../Data/.indices.pkl', 'wb') as f:
#     pickle.dump(indices, f)

: 
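The commented-out cell above repeats the same `open`/`pickle.dump` pattern once per object. A small helper pair could do the same job in a loop, as sketched below. The function names `save_artifacts` and `load_artifact` are hypothetical, and the example writes to a temporary directory rather than to `../Data/` so that it can run anywhere.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifacts(artifacts, directory):
    """Pickle each object in `artifacts` to <directory>/<name>.pkl."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for name, obj in artifacts.items():
        with open(directory / f"{name}.pkl", "wb") as f:
            pickle.dump(obj, f)


def load_artifact(name, directory):
    """Load one pickled object back by name."""
    with open(Path(directory) / f"{name}.pkl", "rb") as f:
        return pickle.load(f)


# Round trip with a toy similarity matrix (illustrative values only)
with tempfile.TemporaryDirectory() as tmp:
    save_artifacts({"cosine_sim2": [[1.0, 0.5], [0.5, 1.0]]}, tmp)
    restored = load_artifact("cosine_sim2", tmp)

print(restored)
```

Unpickling can execute arbitrary code, so files should only be loaded from a trusted source.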

In [None]:
def get_similar_items(amenity_input, clean_df, cosine_similarities, top_n=5):
    # Normalise the query; matching below is case-insensitive
    query = amenity_input.strip()
    # 'consolidated_amenities' holds one comma-separated string per row,
    # so look for the query as a literal substring of that string
    mask = clean_df['consolidated_amenities'].str.contains(query, case=False, regex=False, na=False)
    if not mask.any():
        print("No similar items found for the entered amenity.")
        return []
    # Position of the first matching row; cosine_similarities is indexed by position
    item_index = int(mask.to_numpy().argmax())
    item_scores = list(enumerate(cosine_similarities[item_index]))
    item_scores = sorted(item_scores, key=lambda x: x[1], reverse=True)
    top_items = item_scores[:top_n]
    return top_items
# Enter the amenity for which you want to find similar items
amenity_input = input("Enter the amenity: ")
# Get similar items based on the entered amenity
similar_items = get_similar_items(amenity_input, clean_df, cosine_similarities)
if similar_items:
    # Print the top recommendations based on the entered amenity
    for item_id, similarity in similar_items:
        item_data = clean_df.iloc[item_id]
        name = item_data['name']
        subcategories = item_data['subcategories']
        amenities = item_data['consolidated_amenities']
        print(f"Item ID: {item_id}, Similarity: {similarity}")
        print(f"Name: {name}")
        print(f"Subcategories: {subcategories}")
        print(f"Amenities: {amenities}")
        print("----------------------")

: 

: 

## Model Evaluation

In [None]:
import ast

import numpy as np
import streamlit as st


class RecommenderSystem:
    def __init__(self, clean_df, tfidfv_matrix2, cosine_sim2, cosine_similarities, indices):
        self.clean_df = clean_df
        self.tfidfv_matrix2 = tfidfv_matrix2
        self.cosine_sim2 = cosine_sim2
        self.cosine_similarities = cosine_similarities
        self.indices = indices

    def recommend_attraction(self, rating_threshold):
        # Filter the DataFrame based on the rating threshold
        recommendations = self.clean_df[self.clean_df['rating'] > rating_threshold][['name', 'LowerPrice', 'UpperPrice','amenities', 'type', 'country']]

        # Reset the index of the recommendations DataFrame
        recommendations.reset_index(drop=True, inplace=True)

        return recommendations

    def recommend_amenities(self, query):
        # Parse stringified amenity lists without modifying the stored DataFrame
        amenities = self.clean_df['amenities'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

        # Boolean mask of the rows whose amenity list contains the query
        mask = amenities.apply(lambda x: query in x if isinstance(x, list) else False).to_numpy()

        # Check if the specified amenity exists in the dataset
        if not mask.any():
            st.error(f"Error: '{query}' does not exist in the dataset.")
            return None

        # Average the similarity rows of the matching items to get one score per item
        # (assumption: cosine_sim2 is a dense array and the mean is an acceptable aggregate)
        sim_scores = np.asarray(self.cosine_sim2)[mask].mean(axis=0)

        # Rank all items from most to least similar
        order = np.argsort(sim_scores)[::-1]

        # Get the recommended items
        recommended_items = self.clean_df.iloc[order]

        return recommended_items

    def recommend_place(self, name):
        # Create a dictionary to map each country to a row index (the last row for that country wins)
        indices = {title: index for index, title in enumerate(self.clean_df['country'])}

        # Check if the specified place exists in the dataset
        if name not in indices:
            st.error(f"Error: '{name}' does not exist in the dataset.")
            return None

        # Get the index of the specified place
        idx = indices[name]

        # Get the pairwise similarity scores of all places with the specified place
        sim_scores = list(enumerate(self.cosine_similarities[idx]))

        # Sort the places based on the similarity scores
        sim_scores.sort(key=lambda x: x[1], reverse=True)

        # Get the scores of the 10 most similar places
        sim_scores = sim_scores[1:11]

        # Get the indices of the top-N similar places
        indices = [x for x, _ in sim_scores]

        # Get the recommended places
        recommended_places = self.clean_df.set_index('country').iloc[indices][
            [
                'name',
                'RankingType',
                'subcategories',
                'LowerPrice',
                'UpperPrice',
            ]
        ]

        return recommended_places

    def get_item_recommendations(self, item_index, top_n=5):
        # Get similarity scores for the item
        item_scores = list(enumerate(self.cosine_similarities[item_index]))

        # Sort items based on similarity scores
        item_scores = sorted(item_scores, key=lambda x: x[1], reverse=True)

        # Get top-N similar items
        top_items = item_scores[1:top_n + 1]  # Exclude the item itself

        return top_items



: 

In [None]:
# Pass the objects themselves, not their names as strings
hybrid = RecommenderSystem(clean_df, tfidf_matrix2, cosine_sim2, cosine_similarities, indices)

: 

In [None]:
hybrid.recommend_place('South Africa')

: 
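The cells above return recommendations but do not score them. One simple offline check for a content-based recommender, sketched here under stated assumptions, is the share of each item's top-k neighbours that fall in the same category as the item itself. The metric name `category_hit_rate`, the toy matrix, and the use of a category label as a relevance proxy are all assumptions and not part of the original notebook. In this dataset a column such as `subcategories` might play the role of the label.

```python
def category_hit_rate(sim_matrix, categories, k=3):
    """Average share of each item's top-k neighbours that share its category."""
    rates = []
    for i, row in enumerate(sim_matrix):
        # Rank every other item by similarity to item i, highest first
        neighbours = sorted(
            (j for j in range(len(row)) if j != i),
            key=lambda j: row[j],
            reverse=True,
        )[:k]
        if not neighbours:
            continue
        hits = sum(categories[j] == categories[i] for j in neighbours)
        rates.append(hits / len(neighbours))
    return sum(rates) / len(rates) if rates else 0.0


# Toy data (illustrative values only)
toy_sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
toy_categories = ["Hotel", "Hotel", "Specialty Lodging"]

print(category_hit_rate(toy_sim, toy_categories, k=1))
```

In the toy data the two hotels retrieve each other and the specialty lodging retrieves a hotel, so two of the three queries count as hits. A higher value only says that neighbours tend to share a label. It does not measure whether users would accept the recommendations.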

## Tuning

: 

: 

## Deployment

In [None]:
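import pandas as pd
import streamlit as st


class AfricuraRecommender:
    def __init__(self, data_path, cosine_sim_path):
        self.clean_df = pd.read_csv(data_path)
        # Keep the similarity matrix as a NumPy array so rows can be selected by position
        self.cosine_similarities = pd.read_csv(cosine_sim_path).to_numpy()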

    def combine_similar_amenities(self, amenities):
        combined_amenities = []

        for amenity in amenities:
            amenity = amenity.lower().strip()  # Convert to lowercase and remove leading/trailing whitespaces

            # Combine similar amenities based on specific rules
            if amenity == "free wifi":
                amenity = "wifi"
            elif amenity == "free internet":
                amenity = "internet"
            # Add more rules for other similar amenities if needed

            combined_amenities.append(amenity)

        return combined_amenities

    def recommend_amenities(self, selected_amenity):
        # Create a dictionary to map amenities to their indices
        indices = {amen: index for index, amen in enumerate(self.clean_df['consolidated_amenities'])}

        # Check if the amenity exists in the dictionary
        if selected_amenity in indices:
            # Get the index of the amenity that matches the provided amenity
            idx = indices[selected_amenity]

            # Get the pairwise similarity scores of all amenities with that amenity
            sim_scores = list(enumerate(self.cosine_similarities[idx]))

            # Sort the amenities based on the similarity scores
            sim_scores.sort(key=lambda x: x[1], reverse=True)

            # Get the scores of the 10 most similar amenities
            sim_scores = sim_scores[1:11]

            # Get the amenity indices
            indices = [x for x, _ in sim_scores]

            return self.clean_df.set_index('consolidated_amenities').iloc[indices][
                [
                    'country',
                    'RankingType',
                    'subcategories',
                    'LowerPrice',
                    'UpperPrice',
                ]
            ]
        else:
            return "Amenity not found."

    def recommend_place(self, name):
        # Create a dictionary to map place names to their indices
        indices = {title: index for index, title in enumerate(self.clean_df['name'])}

        # Check if the specified place exists in the dataset
        if name not in indices:
            st.error(f"Error: '{name}' does not exist in the dataset.")
            return None

        # Get the index of the specified place
        idx = indices[name]

        # Get the pairwise similarity scores of all places with the specified place
        sim_scores = list(enumerate(self.cosine_similarities[idx]))

        # Sort the places based on the similarity scores
        sim_scores.sort(key=lambda x: x[1], reverse=True)

        # Get the scores of the 10 most similar places
        sim_scores = sim_scores[1:11]

        # Get the indices of the top-N similar places
        indices = [x for x, _ in sim_scores]

        # Get the recommended places
        recommended_places = self.clean_df.set_index('name').iloc[indices][
            [
                'country',
                'RankingType',
                'subcategories',
                'LowerPrice',
                'UpperPrice',
            ]
        ]

        return recommended_places

    def get_item_recommendations(self, item_index, top_n=5):
        # Get similarity scores for the item
        item_scores = list(enumerate(self.cosine_similarities[item_index]))

        # Sort items based on similarity scores
        item_scores = sorted(item_scores, key=lambda x: x[1], reverse=True)

        # Get top-N similar items
        top_items = item_scores[1:top_n + 1]  # Exclude the item itself

        return top_items


: 

: 

## Conclusion and Recommendations