# Yelp Recommendation System

## Introduction

In this project, we build a **Recommendation System** using Yelp's publicly available dataset. Yelp is a platform that provides user-generated reviews and ratings for local businesses, including restaurants, stores, and service providers. This project focuses on creating a recommendation system that can suggest businesses (e.g., restaurants) to users based on their preferences and past interactions.

### Objective

The main objective of this project is to develop a recommendation system that can predict ratings for businesses that a user has not yet interacted with. The system will leverage **Collaborative Filtering** techniques, which are widely used in recommendation systems to predict a user’s rating for an item based on the ratings of similar users.

### Why Yelp?

Yelp's dataset contains detailed information on business reviews, including:

- **User reviews**: Ratings and textual reviews submitted by users.
- **Business information**: Details about the business, such as name, location, and categories (e.g., restaurant, café).
- **User data**: User profiles and historical activity, including reviews written and ratings given.

These rich datasets allow us to build personalized recommendations that match users with businesses based on their preferences.

### Problem Statement

The goal of this project is to recommend businesses that a user is likely to rate highly, and in doing so to maximize user satisfaction.

### Approach

1. **Data Preprocessing**:
    - Load and clean the Yelp dataset, including users, businesses, and reviews data.
    - Handle missing values, duplicate entries, and other anomalies in the data.

2. **Collaborative Filtering**:
    - Construct a **user-item interaction matrix** that records each user's rating for each business.
    - Use similarity measures such as **Pearson Correlation** or **Cosine Similarity** to identify similar users.
    - Implement **user-based collaborative filtering** to recommend businesses based on the preferences of similar users.

3. **Evaluation**:
    - Evaluate the performance of the recommendation system using metrics such as **Root Mean Squared Error (RMSE)**, **Precision**, and **Recall**.
    - Fine-tune the model based on evaluation metrics to improve recommendation accuracy.

  
### Dataset

The dataset used in this project comes from the **Yelp Open Dataset**. The files relevant to this project are:

- **Business**: Information about businesses such as name, location, and categories.
- **Reviews**: Reviews from users including ratings, text, and date of submission.
- **Users**: Data about Yelp users, including their IDs, profile details, and review history.

### Why Collaborative Filtering?

Collaborative filtering is a popular method for building recommendation systems because it doesn’t require prior knowledge about the items themselves (in this case, businesses) but instead relies on past behavior of users. This allows the system to provide personalized recommendations based on the ratings and behaviors of other similar users.

### Conclusion

This project demonstrates how collaborative filtering can be applied to a real-world dataset like Yelp to provide personalized business recommendations. By leveraging user preferences and behaviors, the system can enhance the user experience by recommending businesses they are likely to enjoy based on the preferences of other similar users.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from scipy.stats import pearsonr
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import mean_squared_error


In [2]:
yelp_df=pd.read_csv('7817_1.csv')
yelp_df.head()

Unnamed: 0,id,asins,brand,categories,colors,dateAdded,dateUpdated,dimension,ean,keys,...,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username,sizes,upc,weight
0,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",,,Cristina M,,,205 grams
1,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,Allow me to preface this with a little history...,One Simply Could Not Ask For More,,,Ricky,,,205 grams
2,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,4.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,,,Tedd Gardiner,,,205 grams
3,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I bought one of the first Paperwhites and have...,Love / Hate relationship,,,Dougal,,,205 grams
4,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I have to say upfront - I don't like coroporat...,I LOVE IT,,,Miljan David Tanic,,,205 grams


In [3]:
yelp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1597 entries, 0 to 1596
Data columns (total 27 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    1597 non-null   object 
 1   asins                 1597 non-null   object 
 2   brand                 1597 non-null   object 
 3   categories            1597 non-null   object 
 4   colors                774 non-null    object 
 5   dateAdded             1597 non-null   object 
 6   dateUpdated           1597 non-null   object 
 7   dimension             565 non-null    object 
 8   ean                   898 non-null    float64
 9   keys                  1597 non-null   object 
 10  manufacturer          965 non-null    object 
 11  manufacturerNumber    902 non-null    object 
 12  name                  1597 non-null   object 
 13  prices                1597 non-null   object 
 14  reviews.date          1217 non-null   object 
 15  reviews.doRecommend  

In [4]:
yelp_df.columns

Index(['id', 'asins', 'brand', 'categories', 'colors', 'dateAdded',
       'dateUpdated', 'dimension', 'ean', 'keys', 'manufacturer',
       'manufacturerNumber', 'name', 'prices', 'reviews.date',
       'reviews.doRecommend', 'reviews.numHelpful', 'reviews.rating',
       'reviews.sourceURLs', 'reviews.text', 'reviews.title',
       'reviews.userCity', 'reviews.userProvince', 'reviews.username', 'sizes',
       'upc', 'weight'],
      dtype='object')

### Dataset Columns Explanation

Here’s a brief overview of the columns in the dataset:

- **`id`**: Unique identifier for each product.
- **`asins`**: Amazon Standard Identification Number, a unique identifier for products on Amazon.
- **`brand`**: Brand name of the product.
- **`categories`**: Categories or classifications the product belongs to.
- **`colors`**: Available colors of the product.
- **`dateAdded`**: Date when the product was added to the platform.
- **`dateUpdated`**: Date when the product details were last updated.
- **`dimension`**: Physical dimensions of the product.
- **`ean`**: European Article Number, a unique product identifier.
- **`keys`**: Keywords associated with the product.
- **`manufacturer`**: Name of the product manufacturer.
- **`manufacturerNumber`**: Manufacturer’s specific product number.
- **`name`**: Name or title of the product.
- **`prices`**: Price of the product.
- **`reviews.date`**: Date when the review was posted.
- **`reviews.doRecommend`**: Whether the reviewer recommends the product (Yes/No).
- **`reviews.numHelpful`**: Number of users who found the review helpful.
- **`reviews.rating`**: Rating given by the user for the product.
- **`reviews.sourceURLs`**: URLs linking to the source of the review.
- **`reviews.text`**: Text content of the review.
- **`reviews.title`**: Title or heading of the review.
- **`reviews.userCity`**: City of the user who wrote the review.
- **`reviews.userProvince`**: Province or state of the user who wrote the review.
- **`reviews.username`**: Username of the reviewer.
- **`sizes`**: Available sizes for the product.
- **`upc`**: Universal Product Code, a unique identifier for products.
- **`weight`**: Weight of the product.


In [5]:
# Filter necessary columns
yelp_df = yelp_df[['reviews.username', 'name', 'reviews.rating']]
yelp_df

Unnamed: 0,reviews.username,name,reviews.rating
0,Cristina M,Kindle Paperwhite,5.0
1,Ricky,Kindle Paperwhite,5.0
2,Tedd Gardiner,Kindle Paperwhite,4.0
3,Dougal,Kindle Paperwhite,5.0
4,Miljan David Tanic,Kindle Paperwhite,5.0
...,...,...,...
1592,GregAmandawith4,Alexa Voice Remote for Amazon Fire TV and Fire...,3.0
1593,Amazon Customer,Alexa Voice Remote for Amazon Fire TV and Fire...,1.0
1594,Amazon Customer,Alexa Voice Remote for Amazon Fire TV and Fire...,1.0
1595,Meg Ashley,Alexa Voice Remote for Amazon Fire TV and Fire...,3.0


### Filtering Necessary Columns

In recommendation system development, it's crucial to work with relevant data to build accurate and efficient models. The loaded dataset has 27 columns, but most of them describe the item itself (brand, dimensions, prices) rather than how users interact with it. Collaborative filtering only needs to know who rated what and how highly, so we keep three columns: `reviews.username` (the user), `name` (the item), and `reviews.rating` (the rating).


In [6]:
# Drop rows with missing data
yelp_df = yelp_df.dropna()  # reassign instead of inplace=True to avoid SettingWithCopyWarning on the column slice

In [7]:
yelp_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1177 entries, 0 to 1596
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   reviews.username  1177 non-null   object 
 1   name              1177 non-null   object 
 2   reviews.rating    1177 non-null   float64
dtypes: float64(1), object(2)
memory usage: 36.8+ KB


In [8]:
# Create a user-item rating matrix
rating_matrix = yelp_df.pivot_table(index='reviews.username', columns='name', values='reviews.rating')
rating_matrix

name,Alexa Voice Remote for Amazon Echo and Echo Dot,Alexa Voice Remote for Amazon Fire TV and Fire TV Stick,All-New Amazon Fire 7 Tablet Case (7th Generation,All-New Amazon Fire HD 8 Tablet Case (7th Generation,All-New Amazon Fire TV Game Controller,All-New Amazon Kid-Proof Case for Amazon Fire 7 Tablet (7th Generation,All-New Amazon Kid-Proof Case for Amazon Fire HD 8 Tablet (7th Generation,All-New Fire 7 Kids Edition Tablet,All-New Fire 7 Tablet with Alexa,All-New Fire HD 8 Kids Edition Tablet,...,Kindle Keyboard,Kindle Oasis E-reader with Leather Charging Cover - Walnut,Kindle Oasis with Leather Charging Cover - Black,Kindle Paperwhite,Kindle Paperwhite 3G,Kindle Paperwhite E-reader - Black,Kindle Voyage E-reader,Kindle for Kids Bundle with the latest Kindle E-reader,Moshi Anti-Glare No Bubble Screen Protector for the Fire Phone,Replacement Remote for Amazon Fire TV Stick
reviews.username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1-Apr,,,,,,,,,,,...,,,,,,,,,,
1215,,,,,,,,,,,...,,,,,,,,,,
1234,,,,,,,,,,,...,,,,,,,,,,
1soni,,,,,,,,,,,...,,,,,,,,,,
25Firefighter,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
wadas1989,,,,,,,,,,,...,,,,,,,,,,
wax0pal,,,,,,,,,,,...,,,,,,,,,,
william lombardo,,,,,,,,,,,...,,,,,,,,,,
wirelesssassyowner,,,,,,,,,,,...,,,,,,,,,,


### User-Item Rating Matrix

The user-item rating matrix represents the relationship between users (rows) and items (columns) through their ratings. It is the foundation of the recommendation system because it:

1. **Captures User Preferences**: Shows how each user rated each item.
2. **Identifies Similar Users**: Lets us compare users' rating rows to find users with similar preferences.
3. **Exposes Missing Ratings**: The empty cells are the ratings we want to predict, so that we can suggest items a user may like based on others' ratings.

This matrix is essential for collaborative filtering, a technique used to make personalized recommendations.
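As a small illustration of the structure described above, the sketch below builds a rating matrix from a handful of made-up reviews. The user names, item names, and ratings are hypothetical; only the column names mirror the notebook. Note that `pivot_table` averages ratings by default when the same user rates the same item more than once.

```python
import pandas as pd

# Hypothetical toy reviews; the user and item names are made up for illustration.
reviews = pd.DataFrame({
    "reviews.username": ["ann", "ann", "bob", "bob", "cat"],
    "name": ["Kindle", "Echo", "Kindle", "Fire TV", "Echo"],
    "reviews.rating": [5.0, 3.0, 4.0, 2.0, 1.0],
})

# Rows are users, columns are items; user-item pairs without a rating become NaN.
toy_matrix = reviews.pivot_table(
    index="reviews.username", columns="name", values="reviews.rating"
)

# Sparsity: the share of user-item cells that have no rating.
sparsity = toy_matrix.isna().to_numpy().mean()

print(toy_matrix)
print(f"sparsity = {sparsity:.2f}")
```

Even in this tiny example almost half of the cells are empty; real review data is usually far sparser, which is why the handling of missing values matters so much in the steps that follow.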


### Fill Missing Values in the Rating Matrix


To handle sparse data in the user-item rating matrix (which often contains many NaN values), we replace these NaN values with 0. This is done to enable efficient computation and analysis.


In [9]:
# Replace NaN values with 0 for computation (as sparse data will contain many NaNs)
rating_matrix_filled = rating_matrix.fillna(0)
rating_matrix_filled

name,Alexa Voice Remote for Amazon Echo and Echo Dot,Alexa Voice Remote for Amazon Fire TV and Fire TV Stick,All-New Amazon Fire 7 Tablet Case (7th Generation,All-New Amazon Fire HD 8 Tablet Case (7th Generation,All-New Amazon Fire TV Game Controller,All-New Amazon Kid-Proof Case for Amazon Fire 7 Tablet (7th Generation,All-New Amazon Kid-Proof Case for Amazon Fire HD 8 Tablet (7th Generation,All-New Fire 7 Kids Edition Tablet,All-New Fire 7 Tablet with Alexa,All-New Fire HD 8 Kids Edition Tablet,...,Kindle Keyboard,Kindle Oasis E-reader with Leather Charging Cover - Walnut,Kindle Oasis with Leather Charging Cover - Black,Kindle Paperwhite,Kindle Paperwhite 3G,Kindle Paperwhite E-reader - Black,Kindle Voyage E-reader,Kindle for Kids Bundle with the latest Kindle E-reader,Moshi Anti-Glare No Bubble Screen Protector for the Fire Phone,Replacement Remote for Amazon Fire TV Stick
reviews.username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1-Apr,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1215,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1234,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1soni,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25Firefighter,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
wadas1989,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
wax0pal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
william lombardo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
wirelesssassyowner,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


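### Explanation:

`rating_matrix` is the user-item matrix where rows are users, columns are items, and the values are ratings.

The pairwise distance computation requires numeric inputs without NaN, so we replace missing ratings with 0.

Zero-filling has a side effect: every unrated item is treated as a rating of 0, so the correlation is computed over all items rather than only over the items both users rated. On a sparse matrix, the resulting similarity largely reflects which items two users rated, not how they rated them.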
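To see what zero-filling does to Pearson correlation on very sparse rows, here is a minimal sketch with hypothetical users who each rated a single item. The number of item columns (62) is an assumption, chosen because it reproduces the value -0.016393 (which equals -1/61) that appears in the similarity matrix computed later in the notebook.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equal-length 1-D arrays."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

n_items = 62  # assumed number of item columns (hypothetical)

# Two users who rated only item 0, with opposite opinions (5 stars vs 1 star).
fan = np.zeros(n_items)
fan[0] = 5.0
critic = np.zeros(n_items)
critic[0] = 1.0

# A third user who rated only a different item.
other = np.zeros(n_items)
other[1] = 4.0

r_same_item = pearson(fan, critic)
r_diff_item = pearson(fan, other)

print(round(r_same_item, 6))  # 1.0
print(round(r_diff_item, 6))  # -0.016393
```

In this sketch the 5-star and 1-star reviewers of the same item come out as perfectly similar, while any two users who rated different single items get -1/(n_items - 1). Restricting the correlation to co-rated items, or mean-centering before filling, are common ways to avoid this effect.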

# Compute Pearson Correlation for All Users


In [10]:
# Compute the similarity matrix using Pearson correlation

user_similarity = 1 - pairwise_distances(rating_matrix_filled, metric='correlation')
user_similarity

array([[ 1.        ,  1.        ,  1.        , ..., -0.01639344,
         1.        , -0.01639344],
       [ 1.        ,  1.        ,  1.        , ..., -0.01639344,
         1.        , -0.01639344],
       [ 1.        ,  1.        ,  1.        , ..., -0.01639344,
         1.        , -0.01639344],
       ...,
       [-0.01639344, -0.01639344, -0.01639344, ...,  1.        ,
        -0.01639344, -0.01639344],
       [ 1.        ,  1.        ,  1.        , ..., -0.01639344,
         1.        , -0.01639344],
       [-0.01639344, -0.01639344, -0.01639344, ..., -0.01639344,
        -0.01639344,  1.        ]])

### Explanation:

Pearson correlation is a statistical measure that quantifies the linear relationship between two sets of numbers (e.g., the ratings given by two users).

The Pearson correlation formula is:

$ \text{Pearson Correlation} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}} $

where $x_i$ and $y_i$ are the two users' entries for item $i$, and $\bar{x}$ and $\bar{y}$ are the means of each user's row.

scikit-learn's `pairwise_distances` computes pairwise distances (dissimilarities) between the rows of a matrix. Setting `metric='correlation'` returns the correlation distance, defined as 1 - Pearson correlation, for each pair of users:

- 0 means the two users are perfectly positively correlated.
- 1 means no correlation.
- 2 means perfectly inverse correlation.

Since we need similarity, not distance, we subtract the result from 1, which recovers the Pearson correlation in the range [-1, 1].

Output: `user_similarity` is a square matrix where the entry (i, j) represents the similarity between user i and user j.
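The formula can be checked by hand with NumPy alone. The sketch below uses a small hypothetical zero-filled matrix, computes the Pearson similarity row by row from the definition, and derives the correlation distance from it. It is meant as an illustration of the relationship between the two quantities, not as a replacement for the scikit-learn call.

```python
import numpy as np

# Hypothetical zero-filled rating rows for three users over four items.
ratings = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 5.0, 4.0],
])

# Center each row on its own mean, then scale it to unit length.
centered = ratings - ratings.mean(axis=1, keepdims=True)
unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)

# Pearson similarity is the dot product of the centered unit vectors.
similarity = unit @ unit.T

# Correlation distance is 1 - r, so it lies in [0, 2].
distance = 1.0 - similarity

print(np.round(similarity, 3))
print(np.allclose(similarity, np.corrcoef(ratings)))  # True
```

The diagonal of `similarity` is 1 (each user is perfectly correlated with themselves), and the matrix is symmetric, which matches the shape of the `user_similarity` output above.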

# Convert to DataFrame for Easy Access

In [11]:
# Convert the similarity matrix to a DataFrame for easy indexing
user_similarity_df = pd.DataFrame(user_similarity, index=rating_matrix.index, columns=rating_matrix.index)
user_similarity_df

reviews.username,1-Apr,1215,1234,1soni,25Firefighter,5bros,7011,A. Dent Aragorn,A. Younan,A.C,...,toeka,ts120,txtech1997,unplug,vishal,wadas1989,wax0pal,william lombardo,wirelesssassyowner,zman
reviews.username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1-Apr,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,-0.016393,-0.016393,-0.04191,...,1.000000,1.000000,-0.016393,1.000000,1.000000,-0.016393,1.000000,-0.016393,1.000000,-0.016393
1215,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,-0.016393,-0.016393,-0.04191,...,1.000000,1.000000,-0.016393,1.000000,1.000000,-0.016393,1.000000,-0.016393,1.000000,-0.016393
1234,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,-0.016393,-0.016393,-0.04191,...,1.000000,1.000000,-0.016393,1.000000,1.000000,-0.016393,1.000000,-0.016393,1.000000,-0.016393
1soni,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,-0.016393,-0.016393,-0.04191,...,1.000000,1.000000,-0.016393,1.000000,1.000000,-0.016393,1.000000,-0.016393,1.000000,-0.016393
25Firefighter,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,-0.016393,-0.016393,-0.04191,...,1.000000,1.000000,-0.016393,1.000000,1.000000,-0.016393,1.000000,-0.016393,1.000000,-0.016393
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
wadas1989,-0.016393,-0.016393,-0.016393,-0.016393,-0.016393,-0.016393,-0.016393,-0.016393,-0.016393,-0.04191,...,-0.016393,-0.016393,-0.016393,-0.016393,-0.016393,1.000000,-0.016393,-0.016393,-0.016393,-0.016393
wax0pal,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,-0.016393,-0.016393,-0.04191,...,1.000000,1.000000,-0.016393,1.000000,1.000000,-0.016393,1.000000,-0.016393,1.000000,-0.016393
william lombardo,-0.016393,-0.016393,-0.016393,-0.016393,-0.016393,-0.016393,-0.016393,-0.016393,-0.016393,-0.04191,...,-0.016393,-0.016393,-0.016393,-0.016393,-0.016393,-0.016393,-0.016393,1.000000,-0.016393,-0.016393
wirelesssassyowner,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,-0.016393,-0.016393,-0.04191,...,1.000000,1.000000,-0.016393,1.000000,1.000000,-0.016393,1.000000,-0.016393,1.000000,-0.016393


### Explanation:

We convert the similarity matrix into a DataFrame so that specific rows and columns can be accessed using usernames as labels.

For instance, `user_similarity_df.loc['user1', 'user2']` gives the similarity between user1 and user2.
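Label-based access also makes it easy to look up a user's nearest neighbors. The following sketch uses a hypothetical 4x4 similarity table; the user names, the scores, and the helper `top_neighbors` are all made up for illustration.

```python
import pandas as pd

users = ["ann", "bob", "cat", "dan"]  # hypothetical user labels
sim = pd.DataFrame(
    [[1.0, 0.8, -0.2, 0.1],
     [0.8, 1.0, 0.0, 0.3],
     [-0.2, 0.0, 1.0, 0.6],
     [0.1, 0.3, 0.6, 1.0]],
    index=users, columns=users,
)

def top_neighbors(similarity_df, user_id, k=2):
    """Return the k users most similar to user_id, excluding the user itself."""
    return similarity_df.loc[user_id].drop(user_id).nlargest(k)

pair_score = sim.loc["ann", "bob"]          # similarity between two named users
neighbors = top_neighbors(sim, "cat", k=2)  # the two users closest to "cat"

print(pair_score)
print(neighbors)
```

Dropping the user's own label before ranking matters, because the self-similarity of 1.0 would otherwise always come first.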

# Recommend Top N Items for a Specific User

In [16]:

# Function to recommend top N items for a specific user
def recommend_items_optimized(user_id, rating_matrix, similarity_matrix, top_n=5):
    # Position of the target user in the rating matrix
    user_idx = rating_matrix.index.get_loc(user_id)

    # Similarity scores between the target user and every user
    similarities = similarity_matrix[user_idx]

    # Weight each user's zero-filled ratings by their similarity to the target user,
    # so that users with stronger correlations count more.
    # The matrix is filled here instead of relying on a global variable.
    weighted_ratings = similarities.dot(rating_matrix.fillna(0)) / np.abs(similarities).sum()

    # Exclude items the user has already rated, so that we never recommend
    # something the user has already interacted with
    already_rated = rating_matrix.loc[user_id].dropna().index
    recommendations = pd.Series(weighted_ratings, index=rating_matrix.columns).drop(already_rated)

    # Return the top N items with the highest predicted scores
    return recommendations.sort_values(ascending=False).head(top_n)

# Recommend top 5 items for a specific user

user_id = rating_matrix.index[0]  # Replace with a specific user ID
top_5_recommendations = recommend_items_optimized(user_id, rating_matrix, user_similarity)

print(f"Top 5 recommendations for user {user_id}:\n{top_5_recommendations}")


Top 5 recommendations for user 1-Apr:
name
Echo Show - Black                          0.006085
Certified Refurbished Fire HD 10 Tablet    0.005377
All-New Fire HD 8 Tablet with Alexa        0.004858
Amazon Fire TV                             0.001957
Fire Kids Edition Tablet                  -0.000248
dtype: float64
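The scores above are far below the rating scale because the weighted sum is divided by the total absolute similarity of all users, whether or not they rated the item. A common alternative, shown here as a sketch and not as the notebook's method, centers each neighbor's ratings on that neighbor's mean and normalizes only over neighbors who actually rated the item. The function name `predict_ratings` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def predict_ratings(user_id, rating_matrix, similarity_df):
    """Mean-centered, similarity-weighted rating predictions for one user."""
    user_means = rating_matrix.mean(axis=1)              # NaN entries are skipped
    centered = rating_matrix.sub(user_means, axis=0)     # deviation from each user's mean
    weights = similarity_df.loc[user_id].drop(user_id)   # leave out the target user
    neighbors = centered.loc[weights.index]

    # Similarity-weighted sum of neighbor deviations (missing ratings contribute 0).
    numerator = neighbors.fillna(0).mul(weights, axis=0).sum(axis=0)
    # Only neighbors who actually rated an item count toward its denominator.
    rated = neighbors.notna().astype(float)
    denominator = rated.mul(weights.abs(), axis=0).sum(axis=0)

    # Items that no neighbor rated get NaN instead of a division by zero.
    return user_means[user_id] + numerator / denominator.replace(0, np.nan)

# Hypothetical toy data: three users, three items, made-up similarities.
toy_ratings = pd.DataFrame(
    {"A": [5.0, 4.0, 1.0], "B": [3.0, 2.0, np.nan], "C": [np.nan, 5.0, 2.0]},
    index=["ann", "bob", "cat"],
)
toy_sim = pd.DataFrame(
    [[1.0, 0.9, -0.5], [0.9, 1.0, 0.1], [-0.5, 0.1, 1.0]],
    index=toy_ratings.index, columns=toy_ratings.index,
)

predicted = predict_ratings("ann", toy_ratings, toy_sim)
unseen = predicted[toy_ratings.loc["ann"].isna()]  # keep only items ann has not rated
print(unseen.sort_values(ascending=False))
```

With this normalization the prediction for item C lands near 4.68, which is on the same 1-to-5 scale as the input ratings and can be compared directly with a relevance threshold.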


# Evaluate the Recommender Using Precision, Recall, and RMSE

In [13]:
# Step 1: Fill NaN values in the rating matrix
"""
Replace all NaN values in the rating matrix with 0. 
This ensures we have no missing data when performing calculations.
"""
rating_matrix_filled = rating_matrix.fillna(0)

# Step 2: Initialize accumulators for metrics
"""
Set up variables to accumulate precision, recall, and RMSE values for all users. 
These will be averaged later to evaluate the recommendation system's performance.
"""
total_precision = 0
total_recall = 0
total_rmse = 0
num_users = rating_matrix.shape[0]  # Total number of users in the rating matrix

# Parameters
"""
Define evaluation parameters:
- `top_n`: The number of recommendations to consider for evaluation.
- `threshold`: The minimum rating required for an item to be considered relevant.
"""
top_n = 5    # Number of top recommendations to evaluate
threshold = 4  # Rating threshold to consider an item relevant

# Step 3: Iterate over each user to calculate metrics
"""
Loop through each user in the dataset. For each user:
- Predict ratings for all items.
- Calculate evaluation metrics like RMSE, precision, and recall.
"""
for user_idx, user_id in enumerate(rating_matrix.index):
    # Calculate predicted ratings for the user
    """
    Compute predicted ratings for items using a weighted average of ratings from similar users.
    Similarity scores (from `user_similarity`) serve as weights.
    Normalize the predicted ratings by dividing by the sum of the absolute similarity scores.
    """
    weighted_ratings = user_similarity[user_idx].dot(rating_matrix_filled) / np.abs(user_similarity[user_idx]).sum()
    
    # Extract the actual and predicted ratings
    """
    Retrieve the user's actual ratings and store predicted ratings for all items.
    - `actual_ratings`: The actual ratings given by the user.
    - `predicted_ratings`: The predicted ratings computed for the user.
    """
    actual_ratings = rating_matrix.loc[user_id]
    predicted_ratings = pd.Series(weighted_ratings, index=rating_matrix.columns)
    
    # Step 4: Calculate RMSE for the user
    """
    Identify items that the user has rated (`common_items`) and compute RMSE (error) 
    between the actual and predicted ratings for these items. Accumulate the RMSE for all users.
    """
    common_items = ~actual_ratings.isna()  # Items rated by the user
    # np.sqrt is used because the `squared=False` argument is deprecated in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(actual_ratings[common_items], predicted_ratings[common_items]))
    total_rmse += rmse
    
    # Step 5: Precision and Recall
    # Get top-N recommendations, excluding already rated items
    """
    Exclude items the user has already rated from recommendations. 
    Sort the remaining items by predicted rating in descending order and select the top N.
    """
    already_rated = actual_ratings.dropna().index
    recommendations = predicted_ratings.drop(already_rated).sort_values(ascending=False).head(top_n)
    
    # Find relevant and recommended items
    """
    Identify:
    - `relevant_items`: Items the user rated as relevant (ratings ≥ threshold).
    - `recommended_items`: Items included in the top-N recommendations.
    """
    relevant_items = set(actual_ratings[actual_ratings >= threshold].dropna().index)
    recommended_items = set(recommendations.index)
    
    # Calculate true positives
    """
    Determine `true_positives` (items that are both relevant and recommended).
    Use these to calculate precision and recall for the user.
    """
    true_positives = len(recommended_items & relevant_items)
    
    # Calculate precision and recall for this user
    """
    - Precision: Proportion of recommended items that are relevant.
    - Recall: Proportion of relevant items that are recommended.
    """
    precision = true_positives / len(recommended_items) if len(recommended_items) > 0 else 0
    recall = true_positives / len(relevant_items) if len(relevant_items) > 0 else 0
    
    # Accumulate metrics
    """
    Add the user's precision and recall to the running totals.
    """
    total_precision += precision
    total_recall += recall

# Step 6: Calculate averages
"""
Compute the average precision, recall, and RMSE across all users.
These averages represent the overall performance of the recommendation system.
"""
avg_precision = total_precision / num_users
avg_recall = total_recall / num_users
avg_rmse = total_rmse / num_users

# Step 7: Print the metrics
"""
Display the final evaluation metrics for the recommendation system:
- Precision: How accurate are the recommendations?
- Recall: How comprehensive are the recommendations?
- RMSE: How accurate are the predicted ratings compared to actual ratings?
"""
print("Evaluation Metrics:")
print(f"Precision: {avg_precision:.2f}")
print(f"Recall: {avg_recall:.2f}")
print(f"RMSE: {avg_rmse:.2f}")


Evaluation Metrics:
Precision: 0.00
Recall: 0.00
RMSE: 1.38


# Explanation of the Evaluation Metrics

1. Precision = 0.00:

Precision measures the proportion of recommended items that are relevant.
A precision of 0.00 means that none of the items in the top-5 recommendations for any user were rated as "relevant" (rating >= threshold of 4). In this evaluation that outcome is guaranteed by construction: the recommendations exclude every item the user has already rated, while the relevant items are drawn only from the items the user has rated, so the two sets can never overlap. The score therefore says nothing about recommendation quality. Measuring it requires hiding some known ratings from the model and checking whether the hidden items are recommended.

2. Recall = 0.00:

Recall measures the proportion of relevant items that are recommended.
A recall of 0.00 means that none of the items deemed "relevant" by the users (ratings >= 4) were among the recommendations. This follows from the same disjoint-sets issue and is not evidence about how well the system captures users' preferences.

3. RMSE = 1.38:

RMSE (Root Mean Squared Error) evaluates the accuracy of the predicted ratings compared to the actual ratings.
A value of 1.38 corresponds to a typical error of about 1.4 stars, which is substantial on a 1-to-5 rating scale. Because each user's own ratings are part of the input used to predict them, the error on genuinely unseen ratings may well be higher.
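For precision and recall to be informative, relevance has to be judged against ratings the model did not see. The sketch below assumes such a hold-out already exists (for example, a random sample of review rows set aside before building the rating matrix) and shows only the scoring step. The function name, item names, and numbers are hypothetical.

```python
import pandas as pd

def precision_recall_at_k(scores, held_out, k=5, threshold=4.0):
    """Precision and recall of the top-k scored items against held-out ratings.

    scores:   predicted score per candidate item (items seen in training removed)
    held_out: true ratings that were hidden from the model for this user
    """
    recommended = set(scores.nlargest(k).index)
    relevant = set(held_out[held_out >= threshold].index)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical user: these two ratings were hidden before computing similarities.
held_out = pd.Series({"Kindle Voyage": 5.0, "Fire TV": 2.0})

# Hypothetical predicted scores for items the user had not rated in the training data.
scores = pd.Series({
    "Kindle Voyage": 4.6,
    "Fire TV": 3.9,
    "Echo Dot": 4.1,
    "Fire HD 8": 2.0,
})

precision, recall = precision_recall_at_k(scores, held_out, k=2)
print(f"precision@2 = {precision:.2f}, recall@2 = {recall:.2f}")
```

Because the held-out items remain eligible for recommendation, the recommended and relevant sets can overlap, and the metrics can take values other than zero.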

# Possible Reasons for Poor Precision and Recall:

1. Similarity Matrix Issues:

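The `user_similarity` matrix might not capture meaningful relationships between users. In the output shown earlier, many of the visible entries are exactly 1.0 or -0.016393, which suggests that the zero-filled correlation is driven by which items users rated rather than by how they rated them.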

2. Cold Start Problem:

Users or items with too few ratings can result in unreliable predictions and recommendations.

3. Threshold Too High:

A threshold of 4 may be too strict. If most users rate items below 4, the system will struggle to classify items as "relevant."

4. Normalization:

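The predicted scores are not on the rating scale: the top recommendations shown earlier score around 0.006. The weighted sum is divided by the total absolute similarity across all users, including the many users who never rated the item, which pulls every prediction toward 0.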

5. Sparse Data:

If the rating matrix is too sparse (many missing values), it becomes difficult for collaborative filtering to identify patterns and make meaningful recommendations.

# Next Steps to Improve:

1. Adjust the Threshold:

Experiment with lower thresholds (e.g., 3.0) to see if more items are classified as relevant, improving precision and recall.

2. Data Preprocessing:

Normalize the ratings matrix to reduce the effect of outliers or skewed rating distributions. Center ratings by subtracting the user or item mean.

3. Improve Similarity Calculation:

Try different similarity measures, such as cosine similarity or adjusted Pearson correlation.

4. Hybrid Recommendation System:

Combine user-based and item-based collaborative filtering to address the limitations of either approach individually.

5. Address Sparsity:

Use matrix factorization techniques (e.g., SVD or NMF) to handle sparse datasets better.

6. Increase Recommendations (top_n):

Experiment with a higher value for top_n (e.g., 10) to test if a broader recommendation list improves metrics.

7. Additional Metrics:

Evaluate the system using additional metrics like F1-score or coverage to get a more comprehensive view of its performance.
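As one illustration of the matrix factorization idea in step 5, the sketch below applies a plain truncated SVD to a small hypothetical rating matrix. It fills missing entries with each user's mean before factorizing, which is a simplification; dedicated recommender libraries typically fit the factors on the observed ratings only. The matrix values and the choice of `k = 2` are assumptions for the example.

```python
import numpy as np

# Hypothetical 4x4 rating matrix; np.nan marks missing ratings.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, 5.0, np.nan],
    [np.nan, 1.0, 4.0, 5.0],
])

known = ~np.isnan(R)
user_means = np.nanmean(R, axis=1, keepdims=True)

# Center on each user's mean and treat missing entries as "average" (0 deviation).
centered = np.where(known, R - user_means, 0.0)

# Keep only the k largest singular values to get a low-rank approximation.
k = 2
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Add the means back and clip to the 1-5 rating scale.
predicted = np.clip(approx + user_means, 1.0, 5.0)

print(np.round(predicted, 2))
```

Every cell of `predicted` now holds a value on the rating scale, including the cells that were missing in `R`, so the matrix can be ranked per user in the same way as the similarity-based scores.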