# Collaborative Filtering Recommendation System (User-Based)
Dataset: https://www.kaggle.com/datasets/chauanhdat/movielens-latest-small-for-education

## Overview

This notebook implements a **User-Based Collaborative Filtering (CF) Recommendation System** using the MovieLens dataset. The goal is to provide personalized movie recommendations based on the preferences of similar users.

## Collaborative Filtering (CF)

Collaborative Filtering is a popular recommendation approach that suggests items to users based on **past interactions** between users and items. Unlike content-based systems, CF does not rely on item attributes (like movie genres or descriptions) but instead **learns from user behavior**, such as ratings or purchase history.

There are two main types of collaborative filtering:

1. **User-Based CF:**  
   - Finds users similar to the target user based on rating patterns.  
   - Recommends items that similar users liked but the target user has not seen.  
   - Example: “Users who rated Toy Story highly also liked Aladdin.”

2. **Item-Based CF:**  
   - Finds items similar to those the target user has already rated.  
   - Recommends items similar to the ones the user liked.  
   - Example: “If you liked Toy Story, you might also like Toy Story 2.”

---

## Cosine Similarity

To determine similarity between users (or items), this notebook uses **Cosine Similarity**, a metric that measures the angle between two vectors in a multi-dimensional space:

\[
\text{Cosine Similarity} = \frac{A \cdot B}{||A|| \cdot ||B||}
\]

Where:  
- \(A\) and \(B\) are rating vectors of two users (or items).  
- A value closer to 1 means more similar; closer to 0 means less similar. Because ratings are non-negative, the value here always lies between 0 and 1.  

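Cosine similarity is widely used because it **compares the direction of two rating vectors** rather than their magnitude, so two users whose ratings are proportional get a high similarity even if one of them rates higher overall. It is not invariant to a constant offset, and because this notebook stores unrated movies as 0, those zeros also enter the calculation.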
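
As a small illustration of the formula above, the sketch below computes cosine similarity with NumPy for made-up rating vectors. The vectors and the helper name `cosine` are hypothetical and not part of the notebook. It suggests that scaling a vector leaves the similarity unchanged, while adding a constant offset does not.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors (hypothetical helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up ratings for the same four movies
user_a = np.array([1.0, 2.0, 2.0, 1.0])
user_b = 2 * user_a      # same pattern, every rating doubled
user_c = user_a + 2.0    # same pattern, shifted up by a constant

sim_scaled = cosine(user_a, user_b)    # scaling does not change the angle
sim_shifted = cosine(user_a, user_c)   # a constant offset does change it

print(round(sim_scaled, 4), round(sim_shifted, 4))
```

The shifted vector still scores high, but no longer exactly 1, which is one reason mean-centering ratings is sometimes considered before computing similarity.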

---

## Why User-Based CF with Cosine Similarity?

- User-Based CF captures **shared preferences among users**.  
- Cosine similarity turns each pair of users' rating vectors into a single score, so the most similar users can be ranked.  
- This method works well on datasets like MovieLens, where users rate a set of movies, allowing the system to recommend **movies a user hasn’t seen but is likely to enjoy**.

---

## Workflow of This Notebook

1. **Load and inspect the dataset:** Explore `movies.csv` and `ratings.csv`.
2. **Data preprocessing:** Create a **user-item rating matrix**.
3. **Compute similarity:** Use **cosine similarity** to find similar users.
4. **Generate recommendations:** Predict unseen movies for a target user based on similar users’ ratings.
5. **Evaluation and exploration:** Check recommendations for a user and analyze their relevance.


## Import Libraries

In this step, we import the essential Python libraries for building our recommendation system:

- `pandas`: Used for **data manipulation and analysis**, particularly for handling CSV files and creating dataframes.
- `numpy`: Provides support for **numerical operations**, including array computations.
- `cosine_similarity` from `sklearn.metrics.pairwise`: Used to compute **cosine similarity** between users or items, which is the core of our collaborative filtering approach.


In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


In [2]:

movies = pd.read_csv('/kaggle/input/movielens-latest-small-for-education/ml-latest-small/movies.csv')
ratings = pd.read_csv('/kaggle/input/movielens-latest-small-for-education/ml-latest-small/ratings.csv')


In [3]:
movies.head()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


In [5]:
movies.isnull().sum()
ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [7]:

print("Unique users:", ratings['userId'].nunique())
print("Unique movies:", ratings['movieId'].nunique())


Unique users: 610
Unique movies: 9724


## Explore Ratings Distribution

Before building the recommendation system, it is important to understand the **distribution of ratings** in the dataset.

- `ratings['rating'].describe()` provides statistical summaries:
  - **count**: Total number of ratings in the dataset.
  - **mean**: Average rating (3.50 in this dataset), above the 2.75 midpoint of the 0.5-5.0 scale, so users tend to rate favorably.
  - **std**: Standard deviation (1.04), indicating variability in user ratings.
  - **min / max**: Minimum and maximum ratings (0.5 to 5.0).
  - **25%, 50%, 75%**: Quartiles (3.0, 3.5, and 4.0), so half of all ratings fall between 3.0 and 4.0.

This helps identify patterns, biases, or sparsity in user ratings, which is useful for building an effective collaborative filtering system.


In [8]:
# Ratings distribution
print(ratings['rating'].describe())


count    100836.000000
mean          3.501557
std           1.042529
min           0.500000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: rating, dtype: float64
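
The sparsity mentioned above can be quantified directly. From the counts printed earlier, 100,836 ratings over 610 users and 9,724 movies fill about 1.7% of all possible user-movie pairs. The following is a minimal sketch on a made-up frame; `toy_ratings` and `density` are illustrative names, and the same two lines could be applied to the real `ratings` DataFrame.

```python
import pandas as pd

# Toy stand-in for ratings.csv (values are made up for illustration)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 10, 20, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

n_users = toy_ratings["userId"].nunique()
n_movies = toy_ratings["movieId"].nunique()

# Fraction of (user, movie) pairs that actually have a rating
density = len(toy_ratings) / (n_users * n_movies)
print(f"density: {density:.2%}, sparsity: {1 - density:.2%}")

# How often each rating value occurs, ordered by rating value
print(toy_ratings["rating"].value_counts().sort_index())
```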


## Analyze User and Movie Activity

It is important to explore how active users are and how popular movies are before building a recommendation system.

- **Number of movies rated by each user (`user_counts`)**:  
  - Shows how many ratings each user has provided.  
  - Helps identify **active vs. inactive users**.
  
- **Number of ratings per movie (`movie_counts`)**:  
  - Shows how many users rated each movie.  
  - Helps identify **popular vs. less-rated movies**.

We display the top few users and movies to get a quick sense of the dataset’s activity patterns.  

> Note: `value_counts()` is sorted by default from **most to least ratings**.


In [11]:
# How many movies each user rated
user_counts = ratings['userId'].value_counts()
user_counts.head()

userId
414    2698
599    2478
474    2108
448    1864
274    1346
Name: count, dtype: int64

In [23]:
# How many ratings each movie has
movie_counts = ratings['movieId'].value_counts()
movie_counts.head()

movieId
356     329
318     317
296     307
593     279
2571    278
Name: count, dtype: int64
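
Beyond the top few entries, it can help to summarize the whole activity distribution. The sketch below uses a made-up frame to show one way to do that; `toy_ratings`, `min_ratings`, and `rare_share` are illustrative, and the threshold of 2 is arbitrary.

```python
import pandas as pd

# Toy stand-in for the userId and movieId columns of ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
})

user_counts = toy_ratings["userId"].value_counts()    # ratings per user, largest first
movie_counts = toy_ratings["movieId"].value_counts()  # ratings per movie, largest first

# Summary statistics of user activity (mean, quartiles, max)
print(user_counts.describe())

# Share of movies rated by fewer than `min_ratings` users
min_ratings = 2
rare_share = (movie_counts < min_ratings).mean()
print(f"movies with < {min_ratings} ratings: {rare_share:.0%}")
```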

## Create User-Item Rating Matrix

To implement user-based collaborative filtering, we need a **user-item matrix**:

- **Rows (`index=userId`)** represent individual users.  
- **Columns (`columns=movieId`)** represent movies.  
- **Values (`values=rating`)** represent the rating a user gave to a movie.  

`fillna(0)` fills missing ratings with `0`, meaning the user has not rated that movie yet. Note that cosine similarity then treats these zeros as actual values, not as missing data.


In [13]:
user_item_matrix = ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)
print(user_item_matrix.head())


movieId  1       2       3       4       5       6       7       8       \
userId                                                                    
1           4.0     0.0     4.0     0.0     0.0     4.0     0.0     0.0   
2           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
3           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
4           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
5           4.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   

movieId  9       10      ...  193565  193567  193571  193573  193579  193581  \
userId                   ...                                                   
1           0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0   
2           0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0   
3           0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0   
4           0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0
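
To make the reshaping concrete, here is a minimal sketch on a made-up ratings frame (`toy_ratings` and the other names are illustrative). It also shows one caveat: `pivot` raises a `ValueError` if a (user, movie) pair appears more than once, whereas `pivot_table` aggregates the duplicates.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 3.0, 5.0, 2.5],
})

# Long format -> wide format: one row per user, one column per movie
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")
print(toy_matrix)  # unrated pairs appear as NaN

# Replace NaN with 0 so every user has a complete numeric vector
toy_filled = toy_matrix.fillna(0)
print(toy_filled)

# A duplicated (userId, movieId) pair makes pivot fail ...
dup = pd.concat([toy_ratings, toy_ratings.iloc[[0]]], ignore_index=True)
try:
    dup.pivot(index="userId", columns="movieId", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# ... while pivot_table averages the duplicates
dup_matrix = dup.pivot_table(index="userId", columns="movieId",
                             values="rating", aggfunc="mean")
```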

## Compute User Similarity

Now that we have the user-item matrix, we calculate **similarity between users** using **cosine similarity**:

- `cosine_similarity(user_item_matrix)` computes pairwise similarity scores between all users based on their rating vectors.
- Users who have rated movies similarly will have a **higher similarity score** (closer to 1).  
- Users who share few rated movies, or who rate them very differently, will have a **lower similarity score** (closer to 0).

We convert the result into a **DataFrame** with `userId` as both rows and columns for easier lookup.

> Example: `user_similarity.loc[1, 2]` gives the similarity between user 1 and user 2.


In [15]:
user_similarity = cosine_similarity(user_item_matrix)
user_similarity = pd.DataFrame(user_similarity, index=user_item_matrix.index, columns=user_item_matrix.index)
user_similarity.head()


userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,1.0,0.027283,0.05972,0.194395,0.12908,0.128152,0.158744,0.136968,0.064263,0.016875,...,0.080554,0.164455,0.221486,0.070669,0.153625,0.164191,0.269389,0.291097,0.093572,0.145321
2,0.027283,1.0,0.0,0.003726,0.016614,0.025333,0.027585,0.027257,0.0,0.067445,...,0.202671,0.016866,0.011997,0.0,0.0,0.028429,0.012948,0.046211,0.027565,0.102427
3,0.05972,0.0,1.0,0.002251,0.00502,0.003936,0.0,0.004941,0.0,0.0,...,0.005048,0.004892,0.024992,0.0,0.010694,0.012993,0.019247,0.021128,0.0,0.032119
4,0.194395,0.003726,0.002251,1.0,0.128659,0.088491,0.11512,0.062969,0.011361,0.031163,...,0.085938,0.128273,0.307973,0.052985,0.084584,0.200395,0.131746,0.149858,0.032198,0.107683
5,0.12908,0.016614,0.00502,0.128659,1.0,0.300349,0.108342,0.429075,0.0,0.030611,...,0.068048,0.418747,0.110148,0.258773,0.148758,0.106435,0.152866,0.135535,0.261232,0.060792
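
Once the similarity table exists, a user's nearest neighbours can be read off by dropping the user's own entry (always 1.0) and taking the largest remaining values. The following is a minimal sketch with a made-up 3-user matrix. Cosine similarity is computed in NumPy here only to keep the block self-contained; `toy_matrix`, `toy_similarity`, and `neighbours` are illustrative names.

```python
import numpy as np
import pandas as pd

toy_matrix = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [4.0, 5.0, 0.0],
     [0.0, 0.0, 3.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

# Cosine similarity: scale each row to unit length, then take dot products.
# Assumes every user has at least one rating (no all-zero rows).
values = toy_matrix.to_numpy()
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
toy_similarity = pd.DataFrame(unit @ unit.T,
                              index=toy_matrix.index,
                              columns=toy_matrix.index)

# Most similar users to user 1, excluding user 1 itself
neighbours = toy_similarity[1].drop(1).nlargest(2)
print(neighbours)
```

With the notebook's real table, the analogous call would be `user_similarity[1].drop(1).nlargest(5)`.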


## User-Based Recommendation Function

This function generates **personalized movie recommendations** for a given user using **user-based collaborative filtering**:

1. **Compute similarity scores (`sim_scores`)**:  
   - Extract similarity values between the target user (`user_id`) and all other users from the `user_similarity` matrix.

2. **Calculate weighted ratings (`weighted_ratings`)**:  
   - Multiply each user's ratings by that user's similarity to the target user and sum the results per movie.  
   - Divide by the sum of all similarity scores to normalize. Because unrated movies count as 0, the result is a relevance score rather than a rating on the 0.5-5.0 scale.

3. **Filter out already rated movies (`already_rated`)**:  
   - Ensure we only recommend movies the user hasn’t seen yet.

4. **Sort and select top N recommendations**:  
   - Sort the predicted ratings in descending order.  
   - Keep only the top `top_n` movies with the highest predicted ratings.

5. **Return readable results (`recommended_movies`)**:  
   - Merge the movie titles from `movies.csv`.  
   - Include the predicted rating for transparency.

> Example: `recommend_user(1, top_n=5)` returns the **top 5 movie recommendations for user 1** based on what similar users liked.


In [16]:
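def recommend_user(user_id, top_n=5):
    # Similarity of the target user to every user (the user's own entry is 1.0)
    sim_scores = user_similarity[user_id]
    # Similarity-weighted average per movie; unrated entries count as 0
    weighted_ratings = user_item_matrix.T.dot(sim_scores) / sim_scores.sum()
    # Keep only movies the target user has not rated
    user_ratings = user_item_matrix.loc[user_id]
    already_rated = user_ratings[user_ratings > 0].index
    recommendations = weighted_ratings.drop(already_rated)
    recommendations = recommendations.sort_values(ascending=False).head(top_n)
    # Pair each title with its own score via movieId (positional assignment can misalign them)
    recommended_movies = movies[movies['movieId'].isin(recommendations.index)].copy()
    recommended_movies['predicted_rating'] = recommended_movies['movieId'].map(recommendations)
    recommended_movies = recommended_movies.sort_values('predicted_rating', ascending=False)
    return recommended_movies[['title', 'predicted_rating']]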
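
To see what the weighted average does numerically, the sketch below applies the same formula to a made-up 3-user, 3-movie matrix. Because unrated entries are stored as 0 and the denominator sums similarity over all users, the score is pulled toward 0 for movies that few neighbours rated, so it can fall well below the ratings those neighbours actually gave. A common alternative, shown only for comparison and not what this notebook uses, divides by the similarity of the users who rated each movie. All names and numbers here are illustrative.

```python
import numpy as np
import pandas as pd

toy_matrix = pd.DataFrame(
    [[5.0, 0.0, 0.0],   # user 1 (target) rated only movie 10
     [4.0, 4.0, 0.0],   # user 2
     [0.0, 0.0, 5.0]],  # user 3
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

# Made-up similarities of every user to user 1 (user 1 to itself is 1.0)
sim_scores = pd.Series([1.0, 0.5, 0.5], index=toy_matrix.index)

# Formula used in recommend_user: zeros count as ratings
score_all = toy_matrix.T.dot(sim_scores) / sim_scores.sum()

# Alternative: normalize only over users who actually rated each movie
rated_mask = (toy_matrix > 0).astype(float)
score_rated = toy_matrix.T.dot(sim_scores) / rated_mask.T.dot(sim_scores)

print(pd.DataFrame({"all_users": score_all, "raters_only": score_rated}))
```

For movie 20, rated 4.0 by a single neighbour, the first formula yields 1.0 while the second yields 4.0. The second variant would need a guard against division by zero for movies that no similar user has rated.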


## Check Movies Already Rated by the User

Before generating recommendations, it's useful to **see which movies the target user has already rated**:

1. **Filter ratings for the target user** (`userId == 1`).  
2. **Merge with the movies dataset** to get the movie titles instead of just `movieId`.  
3. **Select relevant columns**: `title` and `rating`.  
4. **Sort by rating descending** to see the user's favorite movies first.  

This helps us:
- Understand the user’s preferences.
- Ensure that recommendations **exclude movies the user has already rated**.

> Example: Displays the top 10 movies user 1 has rated highest.


In [19]:
user1_ratings = ratings[ratings['userId'] == 1].merge(movies, on='movieId')[['title', 'rating']]
print("Movies already rated by User 1:")
user1_ratings.sort_values(by='rating', ascending=False).head(10)


Movies already rated by User 1:


                                           title  rating
231                 M*A*S*H (a.k.a. MASH) (1970)     5.0
185                             Excalibur (1981)     5.0
 89    Indiana Jones and the Last Crusade (1989)     5.0
 90                  Pink Floyd: The Wall (1982)     5.0
190                 From Russia with Love (1963)     5.0
189                            Goldfinger (1964)     5.0
188                      Dirty Dozen, The (1967)     5.0
186                    Gulliver's Travels (1939)     5.0
184                       American Beauty (1999)     5.0
179  South Park: Bigger, Longer and Uncut (1999)     5.0


## Generate Top Recommendations for the User

Now we generate **personalized movie recommendations** for user 1 using the `recommend_user` function:

1. **Call the function** with `user_id = 1` and `top_n = 5` to get the top 5 recommendations.  
2. **Print the results**, which include:
   - Movie `title`
   - Predicted rating (`predicted_rating`) based on similar users’ preferences.

This allows us to see which **new movies user 1 is most likely to enjoy**, according to user-based collaborative filtering.


In [20]:
print("Top 5 recommendations for User 1:")
print(recommend_user(1, top_n=5))


Top 5 recommendations for User 1:
                                                  title  predicted_rating
277                    Shawshank Redemption, The (1994)          2.622414
507                   Terminator 2: Judgment Day (1991)          2.061920
659                               Godfather, The (1972)          1.836914
2078                            Sixth Sense, The (1999)          1.643315
3638  Lord of the Rings: The Fellowship of the Ring,...          1.605043
