In [1]:
import pandas as pd
import numpy as np
import warnings
# Ignore all warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('data/data.csv')

In [3]:
# Filter the rows where the 'category' is 'attraction'
attraction_data = data[data['category'] == 'attraction'][[
    'name', 'category', 'rating', 'numberOfReviews', 'photoCount', 'texts',
    'reviews', 'weighted_sentiment', 'adjusted_sentiment', 'bigram_counts'
]]

# Filter the rows where the 'category' is 'hotel'
hotel_data = data[data['category'] == 'hotel'][[
    'name', 'category', 'rating', 'numberOfReviews', 'photoCount',
    'priceRange', 'reviewTags', 'priceLevel', 'texts', 'reviews',
    'lowerPrice', 'upperPrice', 'weighted_sentiment', 'adjusted_sentiment',
    'bigram_counts', 'priceLevelencoded'
]]

In [4]:
attraction_data.shape

(1708, 10)

In [5]:
hotel_data.shape

(2836, 16)

In [6]:
attraction_df = attraction_data.dropna()
attraction_df.shape

(1707, 10)

In [7]:
hotel_df = hotel_data.dropna()
hotel_df.shape

(2252, 16)

### Modeling Attraction

#### Content-Based Recommendation with KNN

In this model, `rating` is the only feature: given a rating, the model returns the attractions whose ratings are most similar to it.

For example, if a user rates an attraction highly, the system recommends other attractions with similarly high ratings.

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [9]:
X = attraction_df[['rating']]
names = attraction_df['name']  

In [10]:
# Scale the feature
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [11]:
# Initialize and fit the KNN model
knn = NearestNeighbors(n_neighbors=5, metric='cosine')
knn.fit(X_scaled)

In [12]:
def recommend_attractions(rating, top_n=5):
    """
    Recommend attractions based on a given rating.
    
    Parameters:
    - rating: The rating to find similar attractions.
    - top_n: The number of similar attractions to recommend.
    
    Returns:
    - recommended_names: List of recommended attraction names.
    - distances: List of distances of the recommended attractions.
    """
    # Scale the given rating
    scaled_rating = scaler.transform([[rating]])
    
    # Find the nearest neighbors
    distances, indices = knn.kneighbors(scaled_rating, n_neighbors=top_n)
    
    # Get the recommended attractions
    recommended_names = names.iloc[indices.flatten()].values
    return recommended_names, distances.flatten()

In [13]:
# Test Recommendation Function
example_rating = 5  
recommended_names, distances = recommend_attractions(example_rating)
print("Recommended attractions:", recommended_names)
print("Distances:", distances)

Recommended attractions: ['Live in Love Kenya tours and travel' 'Africa Safari Trips'
 'Afya Bora Spa' 'Zidis Studio' 'Kempinski The Spa']
Distances: [0. 0. 0. 0. 0.]


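The recommendation function returns a distance of `0` for every neighbor. Exact rating ties would produce this, but the choice of metric is a more likely cause: with a single feature, cosine distance depends only on the sign of the scaled value, so any two ratings on the same side of the mean are at distance 0 no matter how far apart they are. The neighbors returned are therefore not guaranteed to be the closest ratings, and a magnitude-sensitive metric such as `euclidean` would be a better fit for one-dimensional data.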
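The effect of the metric on one-dimensional data can be checked in isolation. The sketch below is an illustration only: `toy_ratings` is a made-up array, not taken from `data.csv`, and the comparison with Euclidean distance is an assumption about what a reasonable alternative might look like, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances
from sklearn.preprocessing import StandardScaler

# Hypothetical ratings, for illustration only (not taken from data.csv)
toy_ratings = np.array([[3.0], [3.5], [4.5], [5.0]])
toy_scaled = StandardScaler().fit_transform(toy_ratings)

# Query: a rating of 5.0, i.e. the last row after scaling
query = toy_scaled[-1:]

# Distance from the query to every row under each metric
cos_d = cosine_distances(query, toy_scaled).ravel()
euc_d = euclidean_distances(query, toy_scaled).ravel()

print("cosine:   ", np.round(cos_d, 3))
print("euclidean:", np.round(euc_d, 3))
```

On this toy input the cosine row contains only values of 0 and 2 (up to floating-point rounding), while the Euclidean row grows with the gap between ratings.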

In [14]:
# Split the data into training and testing sets
X_train, X_test = train_test_split(X_scaled, test_size=0.2, random_state=42)

In [15]:
# Fit a separate KNN model for evaluation so the `knn` used by
# recommend_attractions (fit on all rows, aligned with `names`) is not overwritten
knn_eval = NearestNeighbors(n_neighbors=5, metric='cosine')
knn_eval.fit(X_train)

# Find the 5 nearest training neighbors of each test sample
distances, indices = knn_eval.kneighbors(X_test, n_neighbors=5)

# Predict each test sample's scaled rating as the mean of its neighbors' scaled ratings
def predict_ratings(indices, X_train):
    return np.mean([X_train[idx] for idx in indices], axis=1)

# Compute predicted (scaled) ratings for test data
y_pred = predict_ratings(indices, X_train)

# Compute Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
mse = mean_squared_error(X_test, y_pred)
rmse = np.sqrt(mse)

# Compute Mean Absolute Error (MAE)
mae = mean_absolute_error(X_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"Mean Absolute Error (MAE): {mae}")

Mean Squared Error (MSE): 0.1458254636550625
Root Mean Squared Error (RMSE): 0.38187100394644063
Mean Absolute Error (MAE): 0.1589658717277632


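All three metrics are computed on the standardized ratings (`X_scaled`), so they are expressed in standard deviations of `rating`, not in rating points. The `MSE` of `0.146` is the mean squared difference between predicted and actual scaled ratings. The `RMSE` of `0.382` is its square root and weights large errors more heavily, while the `MAE` of `0.159` means predictions deviate from the true values by about 0.16 standard deviations on average. These errors are fairly small, but because the only feature is the rating itself, they mainly show that each test attraction has training neighbors with similar ratings.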
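Because the evaluation runs on arrays derived from `X_scaled`, the reported errors are in standardized units. One way to read them in rating points is to multiply RMSE and MAE by the standard deviation stored in the scaler's `scale_` attribute. The sketch below uses made-up numbers: `toy_ratings`, `toy_scaler`, `y_true_scaled`, and `y_pred_scaled` are hypothetical names, not values from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler

# Hypothetical ratings used only to fit a scaler for this illustration
toy_ratings = np.array([[3.0], [4.0], [4.5], [5.0]])
toy_scaler = StandardScaler().fit(toy_ratings)

# Hypothetical true and predicted ratings, moved into standardized units
y_true_scaled = toy_scaler.transform(np.array([[4.0], [5.0]]))
y_pred_scaled = toy_scaler.transform(np.array([[4.5], [4.5]]))

rmse_scaled = np.sqrt(mean_squared_error(y_true_scaled, y_pred_scaled))
mae_scaled = mean_absolute_error(y_true_scaled, y_pred_scaled)

# scale_ holds the per-feature standard deviation learned by the scaler
std = toy_scaler.scale_[0]
rmse_rating = rmse_scaled * std
mae_rating = mae_scaled * std

print(f"RMSE: {rmse_scaled:.3f} scaled -> {rmse_rating:.3f} rating points")
print(f"MAE:  {mae_scaled:.3f} scaled -> {mae_rating:.3f} rating points")
```

In this toy case both predictions are off by half a rating point, so the converted RMSE and MAE both come out at 0.5. An MSE would need to be multiplied by the squared standard deviation instead.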

##### NOTE:
    
We did not use accuracy or recall because they are classification metrics and do not apply to predicting numeric ratings. Instead, we used Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE), which measure how closely the predicted ratings match the actual ratings and are better suited to evaluating regression-style predictions.
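For reference, the three error metrics can be written out directly with NumPy. This is a small sketch on hypothetical values (`y_true` and `y_pred` below are made up) meant only to show the definitions, not to reproduce the numbers above.

```python
import numpy as np

# Hypothetical actual and predicted ratings, for illustration only
y_true = np.array([4.0, 5.0, 3.5, 4.5])
y_pred = np.array([4.5, 4.5, 3.5, 4.0])

errors = y_pred - y_true

mse = np.mean(errors ** 2)      # mean of squared errors
rmse = np.sqrt(mse)             # square root of MSE, same units as the target
mae = np.mean(np.abs(errors))   # mean of absolute errors

print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}, MAE: {mae:.4f}")
```

RMSE is never smaller than MAE on the same errors, and the gap between them widens when a few errors are much larger than the rest.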