# TRAVEL  DESTINATION RECOMMENDATION SYSTEM
## Modelling

In [13]:
# Importing necessary libraries
import pandas as pd
import json
import glob
import re
import string


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go


from sklearn.preprocessing import normalize, MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.metrics import r2_score
from sklearn.cluster import KMeans


from surprise import Dataset, Reader, KNNBasic, SVD, NMF, KNNWithMeans, SVDpp
from surprise.model_selection import train_test_split
from surprise import accuracy as sup_accuracy
from surprise.prediction_algorithms.matrix_factorization import NMF
from surprise.prediction_algorithms.matrix_factorization import SVD
from surprise.prediction_algorithms.matrix_factorization import SVDpp
from surprise.model_selection import cross_validate


import warnings
# Ignore future deprecation warnings
warnings.filterwarnings("ignore", category=FutureWarning)

sns.set_style('darkgrid')

from Cleaner import DataProcessor, PerformanceMetrics, recommend_place, recommend_amenities, recommend_attraction, recommend_country,  RecommendationEngine

Step 1: Prepare the data

Load the sample data into a suitable data structure, such as a pandas DataFrame.
Preprocess the data if necessary, including handling missing values, converting categorical variables to numerical representations, and normalizing numerical features.

Step 2: Split the data

Split the data into training and testing sets. Typically, an 80-20 split is used, but you can adjust the ratio based on the size of your dataset.

Step 3: Choose recommendation models

There are several recommendation models you can choose from, depending on the nature of your data and the problem you want to solve. Here are a few popular models:
Collaborative Filtering: This approach recommends items based on users' past behavior and preferences.

Content-Based Filtering: This approach recommends items based on the similarity between items' characteristics and users' preferences.

Matrix Factorization: This approach decomposes the user-item rating matrix to find latent factors and make recommendations.

Neural Networks: You can also use deep learning models like neural networks for recommendation tasks.

Step 4: Train and evaluate the models

For each model you choose, train it using the training set.
Evaluate the trained model's performance using appropriate evaluation metrics such as precision, recall, or Mean Average Precision (MAP).
Repeat the training and evaluation process for each model.

Step 5: Choose the best model

Compare the performance of the different models based on the evaluation metrics.
Select the model that performs best according to your evaluation criteria.

Step 6: Fine-tune and optimize the chosen model

Once you have selected the best model, you can further fine-tune and optimize its hyperparameters using techniques like cross-validation or grid search.

Step 7: Deploy the recommendation system

Once you are satisfied with the performance of your chosen and optimized model, you can deploy it to make real-time recommendations.

In [14]:
#loading 'clean_data' into df
clean_df = pd.read_csv('Data/clean_data.csv')

In [15]:
clean_df.columns

Index(['id', 'type', 'subcategories', 'name', 'locationString', 'description',
       'rating', 'latitude', 'longitude', 'numberOfReviews', 'amenities',
       'LowerPrice', 'UpperPrice', 'RankingType', 'Rank', 'Total',
       'regional_rating', 'country', 'city'],
      dtype='object')

In [16]:
clean_df.shape

(14484, 19)

### 1. Prepare the data
 * Cleaning and transforming textual data
 * Normalization and Scaling

The amenities column is really large with many amenities repeated, we will reduce the dimensionality of the amenities column and return a shorter list. 

automating the process of encoding categorical columns in the DataFrame by creating a mapping of unique values to unique numbers and adding new columns with the mapped values.

In [17]:
# Convert object columns to categorical
clean_df['type'] = clean_df['type'].astype('category')
clean_df['amenities'] = clean_df['amenities'].astype('category')
clean_df['subcategories'] = clean_df['subcategories'].astype('category')

In [18]:
# Instantiate the processor from the cleaner function file
processor = DataProcessor(clean_df)

In [19]:
processor.process_data()

(             id        type  \
 0       8661504       HOTEL   
 1        312427       HOTEL   
 2       5560515  ATTRACTION   
 3      12274281       HOTEL   
 4        481185  ATTRACTION   
 ...         ...         ...   
 14479  15528300       HOTEL   
 14480   2080050       HOTEL   
 14481   2509231       HOTEL   
 14482  13351042       HOTEL   
 14483  12691260       HOTEL   
 
                                            subcategories  \
 0                                      Specialty Lodging   
 1                                                  Hotel   
 2      Shopping, Sights & Landmarks, Museums, Nature ...   
 3                                      Bed and Breakfast   
 4                                         Nature & Parks   
 ...                                                  ...   
 14479                                  Specialty Lodging   
 14480                                  Bed and Breakfast   
 14481                                  Bed and Breakfast   
 144

In [20]:
# assigning the MinMax scaled data
clean_df_scaled = processor.clean_df_scaled
clean_df_scaled

Unnamed: 0,id,type,subcategories,name,locationString,description,rating,latitude,longitude,numberOfReviews,...,Total,regional_rating,country,city,combined_amenities,subcategories_mapped,amenities_mapped,RankingType_mapped,country_mapped,type_mapped
0,8661504,HOTEL,Specialty Lodging,Bukima Tented Camp,"Rumangabo, North Kivu Province",Just outside the Virunga National Park boundar...,0.9,-1.38,29.43,34,...,3.0,0.000268,Democratic Republic of the Congo,Rumangabo,"Restaurant, Mountain View",1,1,1,1,1
1,312427,HOTEL,Hotel,Grand Hotel Kinshasa,Kinshasa,Overlooking the Congo River in the downtown re...,0.6,-4.31,15.27,183,...,43.0,0.001110,Democratic Republic of the Congo,Kinshasa,"Free Internet, Banquet Room, Fitness center, B...",2,2,2,1,1
2,5560515,ATTRACTION,"Shopping, Sights & Landmarks, Museums, Nature ...",Symphonie des Arts,Kinshasa,La Symphonie des arts est depuis plus de 50 an...,0.7,-4.33,15.26,30,...,105.0,0.003215,Democratic Republic of the Congo,Kinshasa,bathroom only,3,3,3,1,2
3,12274281,HOTEL,Bed and Breakfast,Ixoras Hotel,Kinshasa,"Located in Kinshasa, 10 km from Mbatu Museum, ...",1.0,-4.35,15.33,1,...,67.0,0.003454,Democratic Republic of the Congo,Kinshasa,"Safe, Free Internet, Blackout Curtains, Housek...",4,4,1,1,1
4,481185,ATTRACTION,Nature & Parks,Ma Vallee,Kinshasa,"Surrounded by the equatorial forest, Ma Vallée...",0.8,-4.49,15.28,88,...,105.0,0.027599,Democratic Republic of the Congo,Kinshasa,bathroom only,5,3,3,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14479,15528300,HOTEL,Specialty Lodging,Bobbywashere,"Sal Rei, Boa Vista",Bobbywashere apartments are located a few minu...,0.4,16.17,-22.91,1,...,37.0,0.000630,Cape Verde,Sal Rei,"Internet, Free Internet, Free parking, Kitchen...",1,9876,1,22,1
14480,2080050,HOTEL,Bed and Breakfast,Hotel America,"Praia, Santiago",Looking for a place to stay in Praia? Then loo...,0.6,14.92,-23.51,8,...,41.0,0.000380,Cape Verde,Praia,"Non-smoking hotel, Restaurant, Public Wifi, Be...",4,9877,1,22,1
14481,2509231,HOTEL,Bed and Breakfast,Hotel Casa Felicidade,"Praia, Santiago",Hotel Casa Felicidade is an excellent choice f...,0.5,14.92,-23.51,8,...,41.0,0.000419,Cape Verde,Praia,"Safe, Housekeeping, Laundry Service, Refrigera...",4,9878,1,22,1
14482,13351042,HOTEL,Bed and Breakfast,De Prince Pensao,"Sao Filipe, Fogo",See why so many travelers make De Prince Pensa...,0.2,14.89,-24.50,1,...,22.0,0.000250,Cape Verde,Sao Filipe,"Safe, Housekeeping, Walking Tours, Laundry Ser...",4,9879,1,22,1


In [21]:
# assigning the normalized data
clean_df_norm = processor.clean_df_norm
clean_df_norm

Unnamed: 0,id,type,subcategories,name,locationString,description,rating,latitude,longitude,numberOfReviews,...,Total,regional_rating,country,city,combined_amenities,subcategories_mapped,amenities_mapped,RankingType_mapped,country_mapped,type_mapped
0,8661504,HOTEL,Specialty Lodging,Bukima Tented Camp,"Rumangabo, North Kivu Province",Just outside the Virunga National Park boundar...,0.000168,-1.38,29.43,34,...,0.000112,0.000056,Democratic Republic of the Congo,Rumangabo,"Restaurant, Mountain View",1,1,1,1,1
1,312427,HOTEL,Hotel,Grand Hotel Kinshasa,Kinshasa,Overlooking the Congo River in the downtown re...,0.000051,-4.31,15.27,183,...,0.000734,0.000052,Democratic Republic of the Congo,Kinshasa,"Free Internet, Banquet Room, Fitness center, B...",2,2,2,1,1
2,5560515,ATTRACTION,"Shopping, Sights & Landmarks, Museums, Nature ...",Symphonie des Arts,Kinshasa,La Symphonie des arts est depuis plus de 50 an...,0.010546,-4.33,15.26,30,...,0.316367,0.021091,Democratic Republic of the Congo,Kinshasa,bathroom only,3,3,3,1,2
3,12274281,HOTEL,Bed and Breakfast,Ixoras Hotel,Kinshasa,"Located in Kinshasa, 10 km from Mbatu Museum, ...",0.000242,-4.35,15.33,1,...,0.003239,0.000360,Democratic Republic of the Congo,Kinshasa,"Safe, Free Internet, Blackout Curtains, Housek...",4,4,1,1,1
4,481185,ATTRACTION,Nature & Parks,Ma Vallee,Kinshasa,"Surrounded by the equatorial forest, Ma Vallée...",0.011918,-4.49,15.28,88,...,0.312850,0.156425,Democratic Republic of the Congo,Kinshasa,bathroom only,5,3,3,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14479,15528300,HOTEL,Specialty Lodging,Bobbywashere,"Sal Rei, Boa Vista",Bobbywashere apartments are located a few minu...,0.000104,16.17,-22.91,1,...,0.001926,0.000113,Cape Verde,Sal Rei,"Internet, Free Internet, Free parking, Kitchen...",1,9876,1,22,1
14480,2080050,HOTEL,Bed and Breakfast,Hotel America,"Praia, Santiago",Looking for a place to stay in Praia? Then loo...,0.000112,14.92,-23.51,8,...,0.001527,0.000064,Cape Verde,Praia,"Non-smoking hotel, Restaurant, Public Wifi, Be...",4,9877,1,22,1
14481,2509231,HOTEL,Bed and Breakfast,Hotel Casa Felicidade,"Praia, Santiago",Hotel Casa Felicidade is an excellent choice f...,0.000093,14.92,-23.51,8,...,0.001527,0.000066,Cape Verde,Praia,"Safe, Housekeeping, Laundry Service, Refrigera...",4,9878,1,22,1
14482,13351042,HOTEL,Bed and Breakfast,De Prince Pensao,"Sao Filipe, Fogo",See why so many travelers make De Prince Pensa...,0.000037,14.89,-24.50,1,...,0.000819,0.000055,Cape Verde,Sao Filipe,"Safe, Housekeeping, Walking Tours, Laundry Ser...",4,9879,1,22,1


In [22]:
clean_df = processor.clean_df

In [23]:
# checking resulting dataset columns
clean_df.columns

Index(['id', 'type', 'subcategories', 'name', 'locationString', 'description',
       'rating', 'latitude', 'longitude', 'numberOfReviews', 'amenities',
       'LowerPrice', 'UpperPrice', 'RankingType', 'Rank', 'Total',
       'regional_rating', 'country', 'city', 'combined_amenities',
       'subcategories_mapped', 'amenities_mapped', 'RankingType_mapped',
       'country_mapped', 'type_mapped'],
      dtype='object')

### Baseline Model

>KNN Basic

In [24]:
# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(clean_df[['id', 'Rank', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Train the model
model1 = KNNBasic(random_state=42)
model1.fit(trainset)

# Evaluate the model
predictions1 = model1.test(testset)
accuracy1 = sup_accuracy.rmse(predictions1)
mae1 = sup_accuracy.mae(predictions1)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.7157
MAE:  0.5230


Root Mean Square Error (RMSE) is a measure of the model's prediction accuracy. In the context of recommendation systems, it quantifies the average difference between the predicted ratings and the actual ratings given by the users. A lower RMSE value indicates better model performance. In this case, the RMSE is 0.7157, which suggests that the model's predictions have a relatively low level of error.

In [25]:
counter = 0
for prediction in predictions1:
    print(f"Predicted rating: {prediction.est:.2f}")
    print(f"Actual rating: {prediction.r_ui:.2f}")
    print("---")
    counter += 1
    if counter == 10:
        break

Predicted rating: 4.42
Actual rating: 3.00
---
Predicted rating: 4.42
Actual rating: 4.00
---
Predicted rating: 4.42
Actual rating: 3.50
---
Predicted rating: 4.42
Actual rating: 4.50
---
Predicted rating: 4.42
Actual rating: 3.00
---
Predicted rating: 4.42
Actual rating: 4.50
---
Predicted rating: 4.42
Actual rating: 5.00
---
Predicted rating: 4.42
Actual rating: 4.50
---
Predicted rating: 4.42
Actual rating: 5.00
---
Predicted rating: 4.42
Actual rating: 4.50
---


In [26]:
threshold = 3  # Define the threshold for positive predictions


metrics = PerformanceMetrics(threshold, predictions1)
metrics.calculate_metrics()
metrics.display_metrics()

Precision: 0.98
Recall: 1.00


Precision: Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive. It indicates how accurate the model is when it predicts positive instances. A precision score of 0.97 means that 97% of the instances predicted as positive were actually positive.

Recall: Recall, also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive instances out of all actual positive instances. It indicates how well the model captures the positive instances. A recall score of 1.00 means that the model successfully identified all positive instances.

### Model 2

>SVD

In [27]:
# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data2 = Dataset.load_from_df(clean_df[['id', 'Rank', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data2, test_size=0.2, random_state=42)

# Train the model
model2 = SVD(random_state=42)
model2.fit(trainset)

# Evaluate the model
predictions2 = model2.test(testset)
# Test with RMSE
accuracy2 = sup_accuracy.rmse(predictions2)
mae2 = sup_accuracy.mae(predictions2)


RMSE: 0.7066
MAE:  0.5142


RMSE of 0.7066 means that, on average, the predictions made by the model have an error of approximately 70% of the  units. 

In [28]:
counter = 0
for prediction in predictions2:
    print(f"Predicted rating: {prediction.est:.2f}")
    print(f"Actual rating: {prediction.r_ui:.2f}")
    print("---")
    counter += 1
    if counter == 10:
        break

Predicted rating: 4.50
Actual rating: 3.00
---
Predicted rating: 4.37
Actual rating: 4.00
---
Predicted rating: 4.35
Actual rating: 3.50
---
Predicted rating: 4.64
Actual rating: 4.50
---
Predicted rating: 4.51
Actual rating: 3.00
---
Predicted rating: 4.38
Actual rating: 4.50
---
Predicted rating: 4.64
Actual rating: 5.00
---
Predicted rating: 4.49
Actual rating: 4.50
---
Predicted rating: 4.49
Actual rating: 5.00
---
Predicted rating: 4.64
Actual rating: 4.50
---


In the code below, we will iterate over the predictions and increment the corresponding counters based on the predicted ratings and actual ratings. Then, we calculate precision by dividing the number of true positives by the sum of true positives and false positives. Recall is calculated by dividing the number of true positives by the sum of true positives and false negatives.

Note that this calculation assumes a binary classification problem where ratings above the threshold are considered positive and ratings below the threshold are considered negative. 

In [29]:
threshold = 3  # Define the threshold for positive predictions


metrics = PerformanceMetrics(threshold, predictions2)
metrics.calculate_metrics()
metrics.display_metrics()


Precision: 0.98
Recall: 1.00


Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive. It indicates how accurate the model is when it predicts positive instances. A precision score of 0.85 means that 85% of the instances predicted as positive were actually positive.

Recall, also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive instances out of all actual positive instances. It indicates how well the model captures the positive instances. A recall score of 1.00 means that the model successfully identified all positive instances.

### Model 3

>KNNwithMeans

In [30]:
# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data3 = Dataset.load_from_df(clean_df[['id', 'regional_rating', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data3, test_size=0.2, random_state=42)

# Train the model
model3 = KNNWithMeans(random_state=42)
model3.fit(trainset)

# Evaluate the model
predictions3 = model3.test(testset)
accuracy3 = sup_accuracy.rmse(predictions3)
mae3 = sup_accuracy.mae(predictions3)

threshold = 3  # Define the threshold for positive predictions


metrics = PerformanceMetrics(threshold, predictions3)
metrics.calculate_metrics()
metrics.display_metrics()

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.7157
MAE:  0.5230
Precision: 0.98
Recall: 1.00


The RMSE value suggests that the model's predictions have an average deviation of 0.7157 from the actual ratings.
A precision value of 0.98 means that out of all the recommendations predicted as positive by the model, 98% of them are actually relevant or accurate.
A recall value of 1.00 means that out of all the actual positive recommendations, the model is able to identify and predict 100% of them accurately.

## Tuning. 
Scaled the numerical using the MinMaxScaler function as well as the Normalize fnction

#### Model 4

>SVD

In [31]:
# Load the scaled data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data4 = Dataset.load_from_df(clean_df_norm[['id', 'regional_rating', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data4, test_size=0.2, random_state=42)

# Train the model
model4 = SVD(random_state=42)
model4.fit(trainset)

# Evaluate the model
predictions4 = model4.test(testset)
accuracy4 = sup_accuracy.rmse(predictions4)
mae4 = sup_accuracy.mae(predictions4)

threshold = 0  # Define the threshold for positive predictions


metrics = PerformanceMetrics(threshold, predictions4)
metrics.calculate_metrics()
metrics.display_metrics()

RMSE: 0.9963
MAE:  0.9963
Precision: 1.00
Recall: 1.00


When the data is normalized, it means that the ratings are transformed to a common scale, typically between 0 and 1, to remove any biases or differences in the rating scales among different users or items. This normalization process allows for fairer comparisons and calculations in the recommendation system.

RMSE (Root Mean Square Error) is a measure of the average difference between the predicted and actual ratings. In this case, the RMSE is 0.9963, indicating that, on average, the predicted ratings deviate from the actual ratings by approximately 1.

MAE (Mean Absolute Error) represents the average absolute difference between the predicted and actual ratings. Here, the MAE is also 0.9963, indicating that, on average, the absolute difference between the predicted and actual ratings is approximately 1.

Precision measures the proportion of correctly predicted positive ratings out of all positive predictions. With a threshold of 0, the precision is 1.00, indicating that all positive predictions (ratings above the threshold) were correct.

Recall calculates the proportion of correctly predicted positive ratings out of all actual positive ratings. With a threshold of 0, the recall is also 1.00, indicating that all actual positive ratings were correctly predicted.

Since the threshold is set to 0, all ratings are considered positive, resulting in perfect precision and recall. However, changing the threshold to a higher value would likely lead to a decrease in precision and recall, as fewer ratings would be classified as positive.

#### Model 5

>NMF

In [32]:
# Load the scaled data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data5 = Dataset.load_from_df(clean_df_scaled[['id', 'regional_rating', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data5, test_size=0.2, random_state=42)

# Train the model
model5 = NMF(random_state=42)
model5.fit(trainset)

# Evaluate the model
predictions5 = model5.test(testset)
accuracy5 = sup_accuracy.rmse(predictions5)
mae5 = sup_accuracy.mae(predictions4)

threshold = 1  # Define the threshold for positive predictions


metrics = PerformanceMetrics(threshold, predictions5)
metrics.calculate_metrics()
metrics.display_metrics()

RMSE: 0.1854
MAE:  0.9963
Precision: 0.40
Recall: 1.00


The RMSE value suggests that the model's predictions have an average deviation of 0.1854 from the actual ratings. A precision value of 0.40 means that out of all the recommendations predicted as positive by the model, 40% of them are actually relevant or accurate. A recall value of 1 means that the model is able to identify and predict all of the actual positive recommendations accurately.

#### Model 6

>KNNWithMeans

In [33]:
# model with KNNwithMeans
# Load the scaled data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data6 = Dataset.load_from_df(clean_df_scaled[['id', 'subcategories', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data6, test_size=0.2, random_state=42)

# Define the item-based collaborative filtering model
model6 = KNNWithMeans(sim_options={'user_based': False})

# Train the model
model6.fit(trainset)

# Make predictions on the test set
predictions6 = model6.test(testset)

# Evaluate the model using RMSE
rmse_score6 = sup_accuracy.rmse(predictions6)
mae6 = sup_accuracy.mae(predictions6)

threshold = 1  # Define the threshold for positive predictions


metrics = PerformanceMetrics(threshold, predictions6)
metrics.calculate_metrics()
metrics.display_metrics()

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.1854
MAE:  0.1178
Precision: 0.40
Recall: 1.00


The root mean squared error (RMSE) for the predictions on the test set is 0.1854. RMSE is a measure of the difference between the predicted ratings and the actual ratings, with lower values indicating better performance.

#### Model 7

>SVDpp

In [34]:
# Load the scaled data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data7 = Dataset.load_from_df(clean_df_scaled[['id','regional_rating', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data7, test_size=0.2, random_state=42)

# Define the item-based collaborative filtering model
model7 = SVDpp(random_state=42)

# Train the model
model7.fit(trainset)

# Make predictions on the test set
predictions7 = model7.test(testset)

# Evaluate the model using RMSE
rmse_score7 = sup_accuracy.rmse(predictions7)
mae7 = sup_accuracy.mae(predictions7)

threshold = 1  # Define the threshold for positive predictions


metrics = PerformanceMetrics(threshold, predictions7)
metrics.calculate_metrics()
metrics.display_metrics()


RMSE: 0.1854
MAE:  0.1178
Precision: 0.40
Recall: 1.00


RMSE (Root Mean Squared Error): The RMSE value of 0.1854 indicates the average difference between the predicted ratings and the actual ratings in the test set. Lower values of RMSE indicate better accuracy.

MAE (Mean Absolute Error): The MAE value of 0.1178 represents the average absolute difference between the predicted ratings and the actual ratings. Similar to RMSE, lower MAE values indicate better accuracy.

Precision: The precision value of 0.40 indicates the proportion of true positive predictions among all positive predictions made by the model. It measures the accuracy of the positive predictions.

Recall: The recall value of 1.00 signifies that the model successfully identified all relevant items in the test set. It represents the proportion of true positive predictions among all actual positive instances.

Overall, the model demonstrates a relatively low RMSE and MAE, suggesting good accuracy in predicting ratings. The precision of 0.40 indicates that 40% of the positive predictions made by the model are accurate, while the recall of 1.00 indicates that the model successfully identified all relevant items.

### Evaluation

In [35]:
# Extract the actual ratings and predicted ratings from the predictions
actual_ratings1 = [pred.r_ui for pred in predictions1]
predicted_ratings1 = [pred.est for pred in predictions1]

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings2 = [pred.r_ui for pred in predictions2]
predicted_ratings2 = [pred.est for pred in predictions2]

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings3 = [pred.r_ui for pred in predictions3]
predicted_ratings3 = [pred.est for pred in predictions3]

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings4 = [pred.r_ui for pred in predictions4]
predicted_ratings4 = [pred.est for pred in predictions4]

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings5 = [pred.r_ui for pred in predictions5]
predicted_ratings5 = [pred.est for pred in predictions5]

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings6 = [pred.r_ui for pred in predictions6]
predicted_ratings6 = [pred.est for pred in predictions6]

# Extract the actual ratings and predicted ratings from the predictions
actual_ratings7 = [pred.r_ui for pred in predictions7]
predicted_ratings7 = [pred.est for pred in predictions7]


# List of predictions and corresponding names
prediction_sets = [
    (predictions1, "Predictions 1"),
    (predictions2, "Predictions 2"),
    (predictions3, "Predictions 3"),
    (predictions4, "Predictions 4"),
    (predictions5, "Predictions 5"),
    (predictions6, "Predictions 6"),
    (predictions7, "Predictions 7")
]

# Iterate over the prediction sets
counter = 0
for predictions, name in prediction_sets:
    # Extract the actual ratings and predicted ratings from the predictions
    actual_ratings = [pred.r_ui for pred in predictions]
    predicted_ratings = [pred.est for pred in predictions]

    # Print the results
    print("Results for", name)
    print("Actual Ratings:", actual_ratings)
    print("Predicted Ratings:", predicted_ratings)
    print()
    counter += 1
    if counter == 5:
        break 

Results for Predictions 1
Actual Ratings: [3.0, 4.0, 3.5, 4.5, 3.0, 4.5, 5.0, 4.5, 5.0, 4.5, 3.5, 4.0, 4.5, 4.5, 4.0, 4.5, 4.5, 4.0, 4.5, 5.0, 3.5, 4.5, 3.5, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.5, 3.5, 5.0, 5.0, 4.0, 3.5, 2.5, 4.5, 3.5, 4.5, 4.5, 5.0, 4.0, 3.5, 5.0, 5.0, 5.0, 5.0, 4.5, 5.0, 4.5, 3.5, 3.5, 5.0, 5.0, 5.0, 4.5, 4.5, 5.0, 4.5, 4.5, 5.0, 5.0, 5.0, 4.0, 4.5, 4.5, 4.5, 4.5, 4.0, 2.5, 4.0, 4.0, 4.5, 4.5, 5.0, 4.5, 3.5, 3.0, 4.0, 4.5, 5.0, 4.5, 5.0, 5.0, 5.0, 4.0, 4.0, 5.0, 5.0, 4.5, 4.5, 4.0, 4.5, 3.5, 4.5, 5.0, 5.0, 4.5, 5.0, 5.0, 5.0, 5.0, 4.5, 3.5, 5.0, 3.0, 4.0, 5.0, 3.5, 3.5, 4.5, 4.0, 5.0, 3.5, 4.5, 4.5, 4.5, 5.0, 4.5, 4.5, 4.5, 5.0, 5.0, 4.5, 5.0, 5.0, 4.0, 4.5, 5.0, 5.0, 4.5, 4.5, 4.0, 4.5, 4.0, 4.5, 4.5, 5.0, 5.0, 5.0, 4.5, 4.0, 5.0, 5.0, 5.0, 4.0, 5.0, 4.0, 4.0, 5.0, 3.0, 4.0, 3.5, 5.0, 5.0, 5.0, 4.0, 4.0, 5.0, 4.0, 4.5, 4.0, 3.5, 3.0, 4.5, 4.5, 5.0, 4.5, 4.5, 4.0, 4.5, 4.5, 5.0, 5.0, 4.5, 5.0, 5.0, 5.0, 4.5, 3.0, 5.0, 3.5, 5.0, 3.5, 4.5, 5.0, 4.0, 5.0, 4.5, 5.0, 4.5, 3.

## Ensemble Methods

Ensemble methods combine multiple base models to improve the overall predictive performance.

We use the voting-based ensemble method called "Majority Voting" or "Voting Classifier". This method combines the predictions from multiple base models and selects the recommendation with the majority of votes.

In [36]:

# Convert Surprise Dataset to pandas DataFrame
df = pd.DataFrame(clean_df[['id', 'regional_rating', 'rating']])

# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df, reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Define the base models
models = [
    NMF(random_state=42),
    SVD(random_state=42),
    SVDpp(random_state=42)
]

# Train the base models
model_predictions = []
for model in models:
    model.fit(trainset)
    predictions = model.test(testset)
    model_predictions.append(predictions)

# Combine the predictions from the base models
blended_predictions = []
for i in range(len(testset)):
    ratings = [pred[i].est for pred in model_predictions]
    blended_rating = sum(ratings) / len(ratings)
    user, item, true_rating = testset[i]
    blended_predictions.append((user, item, true_rating, blended_rating, None))

# Evaluate the blended predictions
blended_rmse = sup_accuracy.rmse(blended_predictions)
blended_mae = sup_accuracy.mae(blended_predictions)

threshold = 3

metrics = PerformanceMetrics(threshold, blended_predictions)
metrics.calculate_metrics()
metrics.display_metrics()

#print("Blended RMSE:", blended_rmse)
#print("Blended MAE:", blended_mae)


RMSE: 0.7034
MAE:  0.5099
Precision: 0.98
Recall: 1.00


RMSE (Root Mean Squared Error): The RMSE value of 0.7034 indicates the average difference between the predicted ratings and the actual ratings in the test set. Lower values of RMSE indicate better accuracy.

MAE (Mean Absolute Error): The MAE value of 0.5099 represents the average absolute difference between the predicted ratings and the actual ratings. Similar to RMSE, lower MAE values indicate better accuracy.

Precision: The precision value of 0.98 indicates that 98% of the positive predictions made by the model are accurate. It measures the accuracy of the positive predictions.

Recall: The recall value of 1.00 signifies that the model successfully identified all relevant items in the test set. It represents the proportion of true positive predictions among all actual positive instances. 

Overall, the model demonstrates relatively low RMSE and MAE values, suggesting good accuracy in predicting ratings. The precision of 0.98 indicates that the model is highly accurate in its positive predictions. The perfect recall value of 1.00  suggests that the model successfully identified all relevant items in the test set.

We scaled the dataset to improve the model

In [37]:


# Convert Surprise Dataset to pandas DataFrame
df = pd.DataFrame(clean_df_scaled[['id', 'regional_rating', 'rating']])

# Load the data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df, reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Define the base models
models = [
    NMF(random_state=42),
    SVD(random_state=42),
    SVDpp(random_state=42)
]

# Train the base models
model_predictions2 = []
for model in models:
    model.fit(trainset)
    predictions = model.test(testset)
    model_predictions2.append(predictions)

# Combine the predictions from the base models
blended_predictions2 = []
for i in range(len(testset)):
    ratings = [pred[i].est for pred in model_predictions2]
    blended_rating2 = sum(ratings) / len(ratings)
    user, item, true_rating = testset[i]
    blended_predictions2.append((user, item, true_rating, blended_rating2, None))

# Evaluate the blended predictions
blended_rmse2 = sup_accuracy.rmse(blended_predictions2)
blended_mae2 = sup_accuracy.mae(blended_predictions2)

threshold = 1

metrics = PerformanceMetrics(threshold, blended_predictions2)
metrics.calculate_metrics()
metrics.display_metrics()

#print("Blended RMSE:", blended_rmse)
#print("Blended MAE:", blended_mae)


RMSE: 0.1854
MAE:  0.1178
Precision: 0.40
Recall: 1.00


RMSE (Root Mean Squared Error): The RMSE value of 0.1854 indicates the average difference between the predicted ratings and the actual ratings in the test set. Lower values of RMSE indicate better accuracy.

MAE (Mean Absolute Error): The MAE value of 0.1178 represents the average absolute difference between the predicted ratings and the actual ratings. Similar to RMSE, lower MAE values indicate better accuracy.

Precision: The precision value of 0.40 indicates that 40% of the positive predictions made by the model are accurate. It measures the accuracy of the positive predictions.

Recall: The recall value of 1.00 signifies that the model successfully identified all relevant items in the test set. It represents the proportion of true positive predictions among all actual positive instances.

Overall, the model demonstrates relatively low RMSE and MAE values, suggesting good accuracy in predicting ratings. The precision of 0.40 indicates that 40% of the positive predictions made by the model are accurate. The recall of 1.00 indicates that the model successfully identified all relevant items in the test set.

In [38]:

# Load the scaled data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(clean_df_scaled[['id', 'regional_rating', 'rating']], reader)
# Define the model
model = SVDpp()
# Perform cross-validation
cv_results = cross_validate(model, data, measures=['RMSE'], cv=5, verbose=True)
# Access the RMSE scores for each fold
rmse_scores = cv_results['test_rmse']
# Calculate the average RMSE
avg_rmse = sum(rmse_scores) / len(rmse_scores)
print("Cross-Validation Results")
print("RMSE Scores:", rmse_scores)
print("Average RMSE:", avg_rmse)

Evaluating RMSE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.1829  0.1780  0.1866  0.1907  0.1859  0.1848  0.0042  
Fit time          0.24    0.27    0.22    0.21    0.20    0.23    0.02    
Test time         0.03    0.03    0.04    0.03    0.03    0.03    0.01    
Cross-Validation Results
RMSE Scores: [0.18286071 0.1780111  0.18655165 0.19065137 0.18585141]
Average RMSE: 0.18478524943508096


The output provided shows the evaluation results of the SVDpp algorithm using 5-fold cross-validation. Here's the interpretation of the metrics:

RMSE (testset): The RMSE values for each fold of the cross-validation are shown. RMSE represents the root mean squared error, which measures the average difference between the predicted ratings and the actual ratings in the test set. Lower RMSE values indicate better accuracy. The mean RMSE across all folds is 0.1848, with a standard deviation of 0.0045.

Fit time: The fit time values represent the time taken to train the SVDpp algorithm for each fold of the cross-validation. It shows the computational time in seconds. The mean fit time is 0.29 seconds, with the fastest fold taking 0.20 seconds and the slowest fold taking 0.42 seconds.

Test time: The test time values represent the time taken to make predictions on the test set for each fold of the cross-validation. It shows the computational time in seconds. The mean test time is 0.04 seconds, with the fastest fold taking 0.02 seconds and the slowest fold taking 0.07 seconds.

Cross-Validation Results: The RMSE scores for each fold of the cross-validation are shown. These scores indicate the performance of the SVDpp algorithm on different splits of the data. The average RMSE across all folds is 0.1848, which provides an overall measure of the algorithm's accuracy.

Based on these results, the SVDpp algorithm demonstrates good performance with relatively low RMSE values and reasonable computational time.

In [39]:

# Load the normalized data into Surprise Dataset format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(clean_df_norm[['id', 'Rank', 'rating']], reader)
# Define the model
model = SVDpp()
# Perform cross-validation
cv_results = cross_validate(model, data, measures=['RMSE'], cv=5, verbose=True)
# Access the RMSE scores for each fold
rmse_scores = cv_results['test_rmse']
# Calculate the average RMSE
avg_rmse = sum(rmse_scores) / len(rmse_scores)
print("Cross-Validation Results")
print("RMSE Scores:", rmse_scores)
print("Average RMSE:", avg_rmse)

Evaluating RMSE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9962  0.9963  0.9961  0.9963  0.9962  0.9962  0.0001  
Fit time          0.27    0.20    0.22    0.25    0.22    0.23    0.02    
Test time         0.05    0.03    0.03    0.04    0.03    0.04    0.01    
Cross-Validation Results
RMSE Scores: [0.9962169  0.99625774 0.99608627 0.99634747 0.9961972 ]
Average RMSE: 0.9962211142037353


The mean RMSE across all folds is 0.9962, with a standard deviation of 0.0000. This indicates that the SVDpp algorithm's predictions have an average deviation of approximately 0.9962 from the actual ratings in the test sets.

Additionally, the evaluation provides information on the fit time and test time for each fold. The fit time represents the time taken to train the model on each fold, while the test time represents the time taken to make predictions on the test sets.

In the cross-validation results section, the RMSE scores for each fold are displayed, and the average RMSE across all folds is calculated to be 0.9962.

These results provide insights into the performance of the SVDpp algorithm on the given dataset. The lower the RMSE score, the better the algorithm's predictive accuracy. In this case, the average RMSE of 0.9962 suggests that the SVDpp algorithm performs with a reasonable level of accuracy.

## Model Selection

#### Model Selection 1

Below we will go further and develop the recomendation system itself. 

In [40]:
# creating a relevant columns from the above dataset 
vectorization_columns = clean_df[['name', 'subcategories', 'amenities']]
vectorization_columns

Unnamed: 0,name,subcategories,amenities
0,Bukima Tented Camp,Specialty Lodging,"Restaurant, Mountain View"
1,Grand Hotel Kinshasa,Hotel,"Restaurant, Pool, Business center, Room servic..."
2,Symphonie des Arts,"Shopping, Sights & Landmarks, Museums, Nature ...",bathroom only
3,Ixoras Hotel,Bed and Breakfast,"Internet, Room service, Free Internet, Free pa..."
4,Ma Vallee,Nature & Parks,bathroom only
...,...,...,...
14479,Bobbywashere,Specialty Lodging,"Internet, Free Internet, Free parking, Kitchen..."
14480,Hotel America,Bed and Breakfast,"Restaurant, Suites, Internet, Shuttle Bus Serv..."
14481,Hotel Casa Felicidade,Bed and Breakfast,"Pool, Wheelchair access, Kitchenette, Free Wif..."
14482,De Prince Pensao,Bed and Breakfast,"Beachfront, Room service, Free parking, Restau..."


The purpose of the code below is to convert relevant data into a list of strings and then apply TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to the list of strings. we will use some key columns like the subcategories and ammenities column to get this matrix.

In [62]:
# Convert relevant data into a list of strings
documents = []
for _, row in vectorization_columns.iterrows():
    name = row['name']
    subcategories = row['subcategories']
    amenities = row['amenities']
    doc = f"{name} {subcategories} {amenities}"
    documents.append(doc)

# Apply TF-IDF vectorization
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)

The TF-IDF matrix represents the numerical representation of the text data, where each row corresponds to a document, and each column represents a specific term. The values in the matrix indicate the importance of each term in each document based on its frequency and inverse document frequency.

This process of converting text data into a numerical representation allows for the application of machine learning algorithms that require numerical input.

In [42]:
# Compute cosine similarity matrix
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

By passing tfidf_matrix as both arguments, the linear_kernel function computes the cosine similarity between each pair of documents in the TF-IDF matrix.

The resulting cosine_similarities matrix is a square matrix, where each element represents the cosine similarity score between a pair of documents.

The cosine similarity score measures the similarity between two vectors (in this case, the TF-IDF vectors of the documents) based on the cosine of the angle between them. Higher values indicate greater similarity, while lower values indicate dissimilarity.

The resulting cosine_similarities matrix can be used for tasks such as document similarity analysis, document clustering, or recommendation systems.

In [43]:
def get_item_recommendations(item_index, cosine_similarities, top_n=5):
    # Get similarity scores for the item
    item_scores = list(enumerate(cosine_similarities[item_index]))

    # Sort items based on similarity scores
    item_scores = sorted(item_scores, key=lambda x: x[1], reverse=True)

    return item_scores[1 : top_n + 1]

# Get recommendations for a specific item (e.g., item with index 0)
item_index = 0
recommendations = get_item_recommendations(item_index, cosine_similarities)

# Print the top 5 recommendations
for item_id, similarity in recommendations:
    print(f"Item ID: {item_id}, Similarity: {similarity}")

Item ID: 7083, Similarity: 0.430179480961992
Item ID: 14023, Similarity: 0.4098755656447982
Item ID: 9023, Similarity: 0.40368614474635844
Item ID: 9127, Similarity: 0.3850735383065387
Item ID: 11692, Similarity: 0.3787768378979146


The code above allows us to easily retrieve top item recommendations based on the cosine similarity. We can specify the item index for which we want recommendations and customize the number of top recommendations to retrieve.

#### model selection 2

We will create a new matrix now only using the description column

In [44]:
# Construct the TF-IDF Matrix
tfidfv2=TfidfVectorizer(analyzer='word', stop_words='english')
tfidfv_matrix2=tfidfv2.fit_transform(clean_df['description'])
print(tfidfv_matrix2.todense())
tfidfv_matrix2.todense().shape

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


(14484, 36530)

In [45]:
# Calculate similarity matrix
cosine_sim2 = cosine_similarity(tfidfv_matrix2, tfidfv_matrix2)

We will then create an index dataframe of the name column to be used in the recomender. 

In [46]:
# Create a Pandas Series to map movie titles to their indices
indices = pd.Series(data = list(clean_df.index), index = clean_df['name'])
indices

name
Bukima Tented Camp           0
Grand Hotel Kinshasa         1
Symphonie des Arts           2
Ixoras Hotel                 3
Ma Vallee                    4
                         ...  
Bobbywashere             14479
Hotel America            14480
Hotel Casa Felicidade    14481
De Prince Pensao         14482
Hotel Cachoeira          14483
Length: 14484, dtype: int64

>>> Place recommender

Below we will use the above indices ans similarities to recomed places based on the name. 

In [47]:
recommend_place("Nairobi National Park", cosine_similarities, cosine_sim2, clean_df)

Unnamed: 0_level_0,country,RankingType,subcategories,LowerPrice,UpperPrice
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Game View Lodge,South Africa,Specialty lodging,Specialty Lodging,7032.0,7314.0
Airport Planet Lodge,Tanzania,hotels,Hotel,29255.0,34740.0
Great Seasons Hotel,Rwanda,hotels,Hotel,13221.0,17862.0
Kaazi Beach Hotel,Uganda,hotels,Hotel,13924.0,14205.0
Town Lodge Executive,Democratic Republic of the Congo,Specialty lodging,Specialty Lodging,13502.0,16596.0
"Imperial Heights Hotel, Entebbe",Uganda,hotels,Hotel,15190.0,20675.0
The Retreat at Ngorongoro,Tanzania,Specialty lodging,Specialty Lodging,65401.0,77215.0
Madikwe Safari Lodge,South Africa,Specialty lodging,Specialty Lodging,153446.0,168214.0
Grand Royal Swiss Hotel,Kenya,Specialty lodging,Bed and Breakfast,12096.0,14487.0
The Beach Hotel,South Africa,hotels,Hotel,13361.0,17440.0


>>> Amenities recommender

Below we will create the recommendation system based on the amenities. 

In [48]:
recommend_amenities('restaurant', cosine_similarities, cosine_sim2, clean_df)

Unnamed: 0_level_0,name,country,RankingType,subcategories,LowerPrice,UpperPrice
combined_amenities,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Afrikaans, Safe, Free Internet, Banquet Room, Smoking rooms available, Fitness center, Poolside Bar, Breakfast Buffet, Bar/Lounge, Non-smoking rooms",Zanzibar Beach Resort,Tanzania,hotels,Hotel,11674.0,15331.0
"Free Internet, Smoking rooms available, Fitness center, Bar/Lounge, Non-smoking rooms, Conference Facilities, Outdoor pool, Laundry Service, Dry Cleaning, Internet",Hill View Hotel,Ghana,hotels,Hotel,6048.0,7595.0
"Free Internet, Banquet Room, Smoking rooms available, Fitness center, Free airport transportation, Bar/Lounge, Non-smoking rooms, Conference Facilities, Outdoor pool, Laundry Service",Oceanic Resort,Ghana,Specialty lodging,Bed and Breakfast,14065.0,35162.0
"Free Internet, Free airport transportation, Bar/Lounge, Housekeeping, Non-smoking rooms, Conference Facilities, Outdoor pool, Laundry Service, Parking, Internet",Eastgate Hotel,Ghana,Specialty lodging,Bed and Breakfast,13361.0,15471.0
"Free Internet, Banquet Room, Bar/Lounge, Non-smoking rooms, Conference Facilities, Outdoor pool, Laundry Service, Dry Cleaning, Internet, Non-smoking hotel",The View Boutique Hotel,South Africa,hotels,Hotel,299156.0,368073.0
"Free Internet, Banquet Room, Bar/Lounge, Non-smoking rooms, Conference Facilities, Laundry Service, Seating Area, Internet, Non-smoking hotel, Refrigerator in room",Hill View Hotel,Tanzania,Specialty lodging,Bed and Breakfast,13777.392428,23045.636246
"Newspaper, Safe, Free Internet, Banquet Room, Fitness center, Free airport transportation, Poolside Bar, Breakfast Buffet, Bar/Lounge, Housekeeping",Airport View Hotel,Ghana,hotels,Hotel,16878.0,19409.0
"Free Internet, Banquet Room, Smoking rooms available, Bar/Lounge, Non-smoking rooms, Conference Facilities, Laundry Service, Internet, Non-smoking hotel, Free parking",Airport Hotel,Senegal,Specialty lodging,Bed and Breakfast,12658.0,15752.0
"Free Internet, Banquet Room, Smoking rooms available, Fitness center, Breakfast Buffet, Bar/Lounge, Non-smoking rooms, Conference Facilities, Laundry Service, Dry Cleaning",Hotel Royal,Democratic Republic of the Congo,hotels,Hotel,18143.0,26160.0
"Free Internet, Banquet Room, Bar/Lounge, Non-smoking rooms, Conference Facilities, Outdoor pool, Laundry Service, Dry Cleaning, Internet, Non-smoking hotel",Airport West Hotel,Ghana,hotels,Hotel,16596.0,20816.0


>>> rating recommender

In [76]:
recommend_attraction(cosine_sim2, cosine_similarities, 2)

Unnamed: 0,name,LowerPrice,UpperPrice,amenities,type,country
0,MJ Furu,281.000000,4219.000000,restaurant,RESTAURANT,Democratic Republic of the Congo
1,Hareg Bahir Dar B&B,13777.392428,23045.636246,"Kids Activities, Shuttle Bus Service, Room ser...",HOTEL,Ethiopia
2,New Vision Addis Hotel,7032.000000,21097.000000,"Internet, Kids Activities, Shuttle Bus Service...",HOTEL,Ethiopia
3,Bole Addis Guest House,13777.392428,23045.636246,"Free parking, Restaurant, Bar/Lounge, Free Wif...",HOTEL,Ethiopia
4,Sparky Hotel,13777.392428,23045.636246,"Flatscreen TV, Bath / Shower, Telephone",HOTEL,Ethiopia
...,...,...,...,...,...,...
75,Centre Lodge,13777.392428,23045.636246,"Kids Activities, Free parking, Restaurant, Bar...",HOTEL,Botswana
76,Mendes & Mendes Rent a car,141.000000,281.000000,bathroom only,ATTRACTION,Cape Verde
77,Open Sky,13777.392428,23045.636246,"Kids Activities, Restaurant, Air conditioning,...",HOTEL,Cape Verde
78,Residencial Paraiso,13777.392428,23045.636246,"Internet, Kids Activities, Room service, Free ...",HOTEL,Cape Verde


>>> country recomender

In [77]:
recommend_country('Kenya', cosine_sim2, cosine_similarities, clean_df)

Unnamed: 0_level_0,name,city,RankingType,subcategories,LowerPrice,UpperPrice
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Kenya,Grand Royal Swiss Hotel,Kisumu,Specialty lodging,Bed and Breakfast,12096.0,14487.0
Kenya,House of Waine,Nairobi,hotels,Hotel,78762.0,81013.0
Kenya,CityBlue Creekside Hotel & Suites.,Mombasa,hotels,Hotel,7736.0,10127.0
Kenya,Tamarind Tree Hotel,Nairobi,hotels,Hotel,21097.0,24754.0
Kenya,African House Resort,Malindi,Specialty lodging,Bed and Breakfast,11392.0,14065.0
Kenya,Fairview Nairobi,Nairobi,hotels,Hotel,15752.0,23910.0


Below we wil then create a new combined system called the Hybrid recomender which will intergrate the above recomenders into one for the final deployment. 

#### Hybrid System

We will instantiate the class RecommendationEngine that has all the recomenders defined therein

In [51]:
# Instantiate the RecommendationEngine
engine = RecommendationEngine(cosine_similarities, cosine_sim2, clean_df)

In [52]:
# Call the recommend Attractions function
attractions = pd.DataFrame(engine.recommend_attraction(0))
print("Recommended Attractions:" )
attractions


Recommended Attractions:


Unnamed: 0,name,LowerPrice,UpperPrice,amenities,type,country
0,Mashrabia Gallery of fine arts,141.0,281.0,bathroom only,ATTRACTION,Egypt
1,Waldorf Astoria Cairo,58931.0,65963.0,"Fitness center, Pool, Room service, Restaurant...",HOTEL,Egypt
2,Corta International Hotel,6329.0,11252.0,"Internet, Room service, Free Internet, Free pa...",HOTEL,Ethiopia
3,Boutik Coeur de Foret,141.0,281.0,bathroom only,ATTRACTION,Madagascar
4,Castello Motel Anosizato,13777.392428,23045.636246,"Shuttle Bus Service, Free parking, Restaurant,...",HOTEL,Madagascar
5,Ryde Fitness,141.0,281.0,bathroom only,ATTRACTION,Nigeria
6,Paramount Sports Shop,141.0,281.0,bathroom only,ATTRACTION,Nigeria
7,RHOMTRIP,141.0,281.0,bathroom only,ATTRACTION,Nigeria
8,Relaxazzy100,141.0,281.0,bathroom only,ATTRACTION,Nigeria
9,Lagos City Tours,141.0,281.0,bathroom only,ATTRACTION,Nigeria


In [53]:
# Call the recommend Countries function
countries = pd.DataFrame(engine.recommend_country('Madagascar'))
print("\nRecommended Countries:")
countries



Recommended Countries:


Unnamed: 0_level_0,name,city,RankingType,subcategories,LowerPrice,UpperPrice
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Madagascar,Sunny Garden Hotel,Antananarivo,hotels,Hotel,9705.0,12096.0
Madagascar,Hotel Marina Beach Toamasina,Toamasina (Tamatave),Specialty lodging,Bed and Breakfast,13777.392428,23045.636246
Madagascar,Hotel Au bois vert,Antananarivo,hotels,Hotel,13777.392428,23045.636246
Madagascar,Grand Hotel,Antsiranana (Diego Suarez),hotels,Hotel,12799.0,15612.0
Madagascar,Hotel Les Boucaniers,Ambatoloaka,hotels,Hotel,5345.0,6188.0
Madagascar,Radama Hotel,Antananarivo,hotels,Hotel,4641.0,6188.0
Madagascar,Moringa Hotel,Toliara,hotels,Hotel,9705.0,11392.0
Madagascar,Talinjoo Hotel,Tolanaro,Specialty lodging,Specialty Lodging,9845.0,14065.0
Madagascar,Hotel Residence Sarimanok,Ambatoloaka,hotels,Hotel,12236.0,13361.0
Madagascar,Azura Hotel & Spa,Tolanaro,hotels,Hotel,3657.0,5626.0


In [54]:
# Call the recommend Places function
places = pd.DataFrame(engine.recommend_place('Nairobi National Park'))
print("\nRecommended Places:")
places



Recommended Places:


Unnamed: 0_level_0,country,RankingType,subcategories,LowerPrice,UpperPrice
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Game View Lodge,South Africa,Specialty lodging,Specialty Lodging,7032.0,7314.0
Airport Planet Lodge,Tanzania,hotels,Hotel,29255.0,34740.0
Great Seasons Hotel,Rwanda,hotels,Hotel,13221.0,17862.0
Kaazi Beach Hotel,Uganda,hotels,Hotel,13924.0,14205.0
Town Lodge Executive,Democratic Republic of the Congo,Specialty lodging,Specialty Lodging,13502.0,16596.0
"Imperial Heights Hotel, Entebbe",Uganda,hotels,Hotel,15190.0,20675.0
The Retreat at Ngorongoro,Tanzania,Specialty lodging,Specialty Lodging,65401.0,77215.0
Madikwe Safari Lodge,South Africa,Specialty lodging,Specialty Lodging,153446.0,168214.0
Grand Royal Swiss Hotel,Kenya,Specialty lodging,Bed and Breakfast,12096.0,14487.0
The Beach Hotel,South Africa,hotels,Hotel,13361.0,17440.0


In [55]:
# Call the recommend Amenities function
amenities = pd.DataFrame(engine.recommend_amenities('Kids Activities'))
print("\nRecommended Amenities:")
amenities


Recommended Amenities:


Unnamed: 0_level_0,name,country,RankingType,subcategories,LowerPrice,UpperPrice
combined_amenities,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Swahili, Safe, Free Internet, Banquet Room, Free airport transportation, Bar/Lounge, Blackout Curtains, Conference Facilities, Laundry Service, Oven",Great Seasons Hotel,Rwanda,hotels,Hotel,13221.0,17862.0
"Swahili, Free Internet, Banquet Room, Fitness center, Free airport transportation, Bar/Lounge, Blackout Curtains, Conference Facilities, Walking Tours, Laundry Service",Colibri Inn Hotel,Democratic Republic of the Congo,Specialty lodging,Specialty Lodging,15893.0,17722.0
"Newspaper, Safe, Free Internet, Fitness center, Breakfast Buffet, Bar/Lounge, Housekeeping, Non-smoking rooms, Conference Facilities, Outdoor pool",The Prestige Hotel Suites,Uganda,hotels,Hotel,11955.0,14768.0
"Safe, Free Internet, Fitness center, Free airport transportation, Bar/Lounge, Blackout Curtains, Conference Facilities, Walking Tours, Laundry Service, Internet","Imperial Heights Hotel, Entebbe",Uganda,hotels,Hotel,15190.0,20675.0
"Swahili, Safe, Free Internet, Banquet Room, Fitness center, Free airport transportation, Bar/Lounge, Blackout Curtains, Conference Facilities, Walking Tours",Town Lodge Executive,Democratic Republic of the Congo,Specialty lodging,Specialty Lodging,13502.0,16596.0
"Afrikaans, Additional Bathroom, Safe, Free Internet, Breakfast Buffet, Bar/Lounge, Housekeeping, Pool / Beach Towels, Non-smoking rooms, Outdoor pool",Game View Lodge,South Africa,Specialty lodging,Specialty Lodging,7032.0,7314.0
"Safe, Free Internet, Banquet Room, Xhosa, Dutch, Bar/Lounge, Blackout Curtains, Conference Facilities, Laundry Service, Internet",Clico Boutique Hotel,South Africa,hotels,Hotel,11674.0,14205.0
"Safe, Free Internet, Banquet Room, Fitness center, Free airport transportation, Breakfast Buffet, Bar/Lounge, Housekeeping, Non-smoking rooms, Conference Facilities",Cresta Lodge Gaborone,Botswana,hotels,Hotel,9001.0,30098.0
"Safe, Free Internet, Banquet Room, Heated pool, Fitness center, Poolside Bar, Breakfast Buffet, Bar/Lounge, Housekeeping, Non-smoking rooms",The Residence Boutique Hotel,South Africa,hotels,Hotel,26582.0,38115.0
"Swahili, Safe, Free Internet, Banquet Room, Fitness center, Bar/Lounge, Blackout Curtains, Conference Facilities, Laundry Service, Full Body Massage",Grand Royal Swiss Hotel,Kenya,Specialty lodging,Bed and Breakfast,12096.0,14487.0


## Deployment

Here we will focus on preparing the above hybrid model ready for deployment. After first run we will hash them out to prevent rewriting of the initial pickle.

##### pickling models and dependancies

In [None]:


#try:
    #with open('Data\\tfidf_matrix.pkl', 'wb') as file:
        #pickle.dump(tfidf_matrix, file)
   # print("File written successfully.")
#except Exception as e:
   # print("Error occurred while writing the file:", e)


File written successfully.


In [69]:

#try:
    #with open('Data\\.cosine_similarities.pkl', 'wb') as file:
        #pickle.dump(cosine_similarities, file)
    #print("File written successfully.")
#except Exception as e:
    #print("Error occurred while writing the file:", e)


File written successfully.


In [70]:

#try:#
   # with open('Data\\.tfidf_matrix2.pkl', 'wb') as file:
        #pickle.dump(tfidfv_matrix2, file)
   # print("File written successfully.")
#except Exception as e:#
    #print("Error occurred while writing the file:", e)



File written successfully.


In [71]:

#try:
    #with open('Data\\clean_df.pkl', 'wb') as file:
     #   pickle.dump(clean_df, file)
    #print("File written successfully.")
#except Exception as e:
   # print("Error occurred while writing the file:", e)




File written successfully.


In [72]:

#try:
  #  with open('Data\\.cosine_sim2.pkl', 'wb') as file:
   #     pickle.dump(cosine_sim2, file)
   # print("File written successfully.")
#except Exception as e:
  #  print("Error occurred while writing the file:", e)


File written successfully.


In [73]:

#try:
  #  with open('Data\\.indices.pkl', 'wb') as file:
  #      pickle.dump(indices, file)
   # print("File written successfully.")
#except Exception as e:
   # print("Error occurred while writing the file:", e)



File written successfully.


In [74]:

#try:
    #with open('Data\\.engine.pkl', 'wb') as file:
    #    pickle.dump(engine, file)
    #print("File written successfully.")
#except Exception as e:
   # print("Error occurred while writing the file:", e)




File written successfully.


## Conclusion and Recommendations


> * Conclusions

`Enhanced Personalized Experiences:` A travel recommender system offers tailored recommendations based on user preferences, enabling travelers to discover destinations and experiences that align with their interests.

`Increased Tourism Diversity:` By showcasing lesser-known destinations, the recommender system can encourage travelers to explore new places, diversifying tourism and distributing visitor traffic more evenly.

`Positive User Satisfaction:` The system provides accurate and relevant travel suggestions, ensuring users have memorable and satisfying experiences, leading to positive reviews and repeat visits.

`Targeted Marketing Opportunities:` Leveraging user data, the recommender system enables targeted promotions and offers, attracting more tourists and driving tourism growth.

`Continuous Improvement:` By analyzing user feedback and travel patterns, the system continually learns and improves its recommendations, enhancing the overall travel experience for users.


> * Recommendations:

`Tailored Recommendations:` By analyzing user preferences, such as destination preferences, travel interests, budget, and previous travel history, a recommender system can generate customized travel suggestions. These recommendations can help travelers discover new destinations and experiences that align with their interests, thereby encouraging them to explore new places and increase their overall travel frequency.

`Enhanced User Experience:` Recommender systems can improve the user experience by offering relevant and accurate travel suggestions. By considering factors like travel season, weather conditions, local events, and traveler reviews, the system can provide valuable information to users, ensuring they have a memorable and satisfying travel experience. Positive experiences are likely to lead to repeat visits and positive word-of-mouth, attracting more tourists to the recommended destinations.

`Increased Visibility for Lesser-Known Destinations:` Recommender systems can showcase lesser-known or off-the-beaten-path destinations that might not receive as much attention from traditional marketing efforts. By highlighting these hidden gems and presenting them as viable options to travelers, the recommender system can help diversify tourism and distribute visitor traffic more evenly across different regions. This, in turn, can stimulate local economies and create opportunities for sustainable tourism development.

`Targeted Promotions and Offers:` By leveraging user data and preferences, travel recommender systems can enable targeted marketing campaigns. This allows tourism authorities, travel agencies, and local businesses to offer personalized promotions, discounts, and packages to potential travelers. These tailored incentives can increase the likelihood of conversion, attracting more tourists and driving tourism growth.

`Continuous Learning and Improvement:` Travel recommender systems can gather feedback from users, monitor their interactions, and analyze their travel patterns. This information can be used to continually refine and improve the recommendations provided. As the system becomes more accurate and better aligned with user preferences over time, it can enhance the overall travel experience and generate even more significant tourism growth.


Overall, your recommendation system has significantly improved the travel planning process for tourists in Africa. By leveraging machine learning techniques and analyzing various factors, users can now receive personalized recommendations, make informed decisions, and enjoy a more satisfying travel experience. Continuous improvement and gathering user feedback will further enhance the system's accuracy and relevance over time.