# Movie Recommendation Project Summary

- **Dataset**: MovieLens dataset containing movies, user ratings, and user IDs.
- **Goal**: Build a recommendation system to suggest similar movies based on user ratings.

## Steps I Followed

1. **Data Loading and Cleaning**
   - Loaded `movies.csv` and `ratings.csv`.
   - Merged datasets to create a single table with `userId`, `movieId`, `rating`, and `title`.
   - Dropped rows with a missing `title`.

2. **Popularity Analysis**
   - Calculated the total number of ratings for each movie.
   - Filtered out movies with very few ratings (kept only movies with ≥50 ratings) so that similarity scores rest on enough ratings to be meaningful.

3. **Pivot Table Creation**
   - Created a **user-movie rating matrix**:
     - Rows = movies
     - Columns = users
     - Values = ratings (filled missing ratings with 0)
   - This matrix is essential for correlation-based and kNN recommendations.

4. **Correlation Analysis (Optional)**
   - Used Pearson correlation to find movies with similar rating patterns.
   - This helps recommend movies that users with similar tastes liked.

5. **Collaborative Filtering Using k-Nearest Neighbors (kNN)**
   - Converted the pivot table to a **scipy sparse matrix** for efficiency.
   - Trained a **kNN model** with `metric='cosine'` and `algorithm='brute'`.
   - Queried the model to find the **nearest neighbors** (similar movies) for a given movie.
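
Step 4 does not appear in the code cells of this notebook, so here is a minimal sketch of how a Pearson-based lookup could work. It assumes a users × movies matrix in which unrated entries are left as `NaN` rather than filled with 0; the `toy_ratings` frame and its movie names are hypothetical illustration data, not part of MovieLens.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are users, columns are movie titles.
# NaN marks "not rated"; the correlation skips those pairs instead of treating them as 0.
toy_ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0, 2.0, np.nan],
        "Movie B": [4.0, 5.0, 2.0, 1.0, 3.0],
        "Movie C": [1.0, 2.0, 5.0, 4.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

# Pearson correlation of every movie's ratings with those of "Movie A"
similar_to_a = toy_ratings.corrwith(toy_ratings["Movie A"]).sort_values(ascending=False)
print(similar_to_a)
```

On this toy data, "Movie B" correlates positively with "Movie A" and "Movie C" negatively, which is the signal a correlation-based recommender would rank on.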

## Key Takeaways
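- Filtering for popular movies makes the similarity scores **more reliable**, because each movie is compared using many ratings.
- The pivot table puts the data in movie × user form, and the sparse matrix keeps calculations **efficient** when most entries are zero.
- kNN on the movie rating vectors gives **item-based recommendations**: movies similar to a given movie, judged by how users rated them.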


In [1]:
## Dataset URL: https://grouplens.org/datasets/movielens/latest/

# Import required libraries
import pandas as pd   # For data manipulation and analysis
import numpy as np    # For numerical operations

In [3]:
# Load the movies dataset
# - Only use 'movieId' and 'title' columns for simplicity
# - Set data types for efficiency
movies_df = pd.read_csv(
    'movies.csv',
    usecols=['movieId', 'title'],
    dtype={'movieId': 'int32', 'title': 'str'}
)

# Load the ratings dataset
# - Only use 'userId', 'movieId', and 'rating' columns
# - Set data types for efficiency and memory optimization
rating_df = pd.read_csv(
    'ratings.csv',
    usecols=['userId', 'movieId', 'rating'],
    dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'}
)

In [4]:
# Display the first 5 rows of the movies dataset
# - Helps me quickly check the structure and content of the dataset
movies_df.head()

,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [5]:
# Display the first 5 rows of the ratings dataset
# - Helps me quickly check the structure and content of the ratings data
rating_df.head()

,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [6]:
# Merge the ratings and movies datasets
# - Join on 'movieId' so each rating has the corresponding movie title
df = pd.merge(rating_df, movies_df, on='movieId')

# Display the first 5 rows of the merged dataset
df.head()

,userId,movieId,rating,title
0,1,1,4.0,Toy Story (1995)
1,1,3,4.0,Grumpier Old Men (1995)
2,1,6,4.0,Heat (1995)
3,1,47,5.0,Seven (a.k.a. Se7en) (1995)
4,1,50,5.0,"Usual Suspects, The (1995)"
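
`pd.merge` defaults to an inner join, so ratings whose `movieId` has no match in the movies table are dropped silently. A minimal sketch of how the join could be checked, assuming small hypothetical stand-ins (`toy_ratings`, `toy_movies`) for `rating_df` and `movies_df`:

```python
import pandas as pd

# Hypothetical miniature versions of rating_df and movies_df
toy_ratings = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [1, 3, 99], "rating": [4.0, 4.0, 2.5]}
)
toy_movies = pd.DataFrame(
    {"movieId": [1, 3], "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"]}
)

# validate="many_to_one" raises MergeError if a movieId is duplicated in toy_movies.
# how="left" with indicator=True keeps unmatched ratings and labels them "left_only".
checked = pd.merge(
    toy_ratings, toy_movies, on="movieId", how="left",
    validate="many_to_one", indicator=True,
)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["userId", "movieId", "rating"]])
```

If `unmatched` is empty on the real data, the inner join used in the notebook loses nothing.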


In [7]:
# Remove rows where 'title' is missing
combine_movie_rating = df.dropna(axis=0, subset=['title'])

# Count the total number of ratings for each movie
# - group by 'title' and count the number of ratings
# - reset index to get a proper DataFrame
# - rename the column to 'totalRatingCount' for clarity
movie_ratingCount = (
    combine_movie_rating
    .groupby(by=['title'])['rating']
    .count()
    .reset_index()
    .rename(columns={'rating': 'totalRatingCount'})
    [['title', 'totalRatingCount']]
)

# Display the first 5 rows to check the rating counts
movie_ratingCount.head()

,title,totalRatingCount
0,'71 (2014),1
1,'Hellboy': The Seeds of Creation (2004),1
2,'Round Midnight (1986),2
3,'Salem's Lot (2004),1
4,'Til There Was You (1997),2


In [8]:
# Merge the original ratings with the total rating count for each movie
# - This allows us to know both the individual ratings and the popularity of each movie
rating_with_totalRatingCount = combine_movie_rating.merge(
    movie_ratingCount,
    on='title',
    how='left'
)

# Display the first 5 rows to verify the merge
rating_with_totalRatingCount.head()

,userId,movieId,rating,title,totalRatingCount
0,1,1,4.0,Toy Story (1995),215
1,1,3,4.0,Grumpier Old Men (1995),52
2,1,6,4.0,Heat (1995),102
3,1,47,5.0,Seven (a.k.a. Se7en) (1995),203
4,1,50,5.0,"Usual Suspects, The (1995)",204
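
The count-then-merge steps above can possibly be collapsed into one step with `groupby(...).transform('count')`. This is a sketch on a hypothetical `toy` frame standing in for `combine_movie_rating`, not the approach the notebook itself uses:

```python
import pandas as pd

# Hypothetical stand-in for combine_movie_rating
toy = pd.DataFrame(
    {
        "userId": [1, 2, 3, 1, 2],
        "title": ["Heat (1995)", "Heat (1995)", "Heat (1995)",
                  "Jumanji (1995)", "Jumanji (1995)"],
        "rating": [4.0, 5.0, 3.0, 2.0, 3.5],
    }
)

# transform("count") returns one value per original row, aligned to toy's index,
# so the per-title count can be assigned directly as a new column.
toy["totalRatingCount"] = toy.groupby("title")["rating"].transform("count")
print(toy)
```

The separate `movie_ratingCount` table is still useful for `describe()`, so both forms have their place.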


In [9]:
# Set pandas display option to format floats with 3 decimal places
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Display descriptive statistics for the total number of ratings per movie
# - Helps understand the distribution of how many ratings movies received
print(movie_ratingCount['totalRatingCount'].describe())

count   9719.000
mean      10.375
std       22.406
min        1.000
25%        1.000
50%        3.000
75%        9.000
max      329.000
Name: totalRatingCount, dtype: float64
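
The summary above shows a long-tailed distribution: the median movie has 3 ratings while the maximum is 329. One way to reason about a cutoff is to count how many movies survive each candidate threshold. A small sketch, assuming a hypothetical `toy_counts` series in place of `movie_ratingCount['totalRatingCount']`:

```python
import pandas as pd

# Hypothetical per-movie rating counts
toy_counts = pd.Series([1, 1, 3, 9, 52, 102, 215], name="totalRatingCount")

# For each candidate threshold, count how many movies would be kept
kept = {t: int((toy_counts >= t).sum()) for t in (10, 50, 100)}
print(kept)
```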


In [10]:
# Set a threshold to filter only popular movies
# - Only include movies that have received at least 50 ratings
popularity_threshold = 50
rating_popular_movie = rating_with_totalRatingCount.query('totalRatingCount >= @popularity_threshold')

# Display the first 5 rows to verify the filtering
rating_popular_movie.head()

,userId,movieId,rating,title,totalRatingCount
0,1,1,4.0,Toy Story (1995),215
1,1,3,4.0,Grumpier Old Men (1995),52
2,1,6,4.0,Heat (1995),102
3,1,47,5.0,Seven (a.k.a. Se7en) (1995),203
4,1,50,5.0,"Usual Suspects, The (1995)",204


In [11]:
# Check the shape of the filtered popular movies dataset
# - Returns (number of ratings kept after the popularity threshold, number of columns)
rating_popular_movie.shape

(41362, 5)

In [12]:
## First, let's create a pivot matrix

# - Rows represent movies (by title)
# - Columns represent users (by userId)
# - Values are the ratings
# - Missing ratings are filled with 0 as a placeholder for "not rated" (not an actual rating of 0)
movie_features_df = rating_popular_movie.pivot_table(
    index='title',
    columns='userId',
    values='rating'
).fillna(0)

# Display the first 5 rows of the pivot matrix
movie_features_df.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title
10 Things I Hate About You (1999),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0
12 Angry Men (1957),0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2001: A Space Odyssey (1968),0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,5.0,0.0,0.0,5.0,0.0,3.0,0.0,4.5
28 Days Later (2002),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,0.0,5.0
300 (2007),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,5.0,0.0,4.0
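
For larger datasets, the dense pivot table can be skipped by building the sparse matrix straight from the long-format ratings. The following is a sketch under assumptions: `toy` is a hypothetical stand-in for `rating_popular_movie`, and each (title, user) pair is assumed to appear once.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical stand-in for rating_popular_movie (one row per user-movie rating)
toy = pd.DataFrame(
    {
        "userId": [1, 2, 2, 3],
        "title": ["Heat (1995)", "Heat (1995)", "Jumanji (1995)", "Jumanji (1995)"],
        "rating": [4.0, 5.0, 3.0, 2.5],
    }
)

# Categorical codes give each title and each user a stable 0-based position
title_cat = toy["title"].astype("category")
user_cat = toy["userId"].astype("category")

# Build the movies x users matrix from (value, (row, col)) triples.
# Unrated cells are simply absent, which a sparse matrix treats as 0.
# Note: duplicate (title, user) pairs would be summed here, not averaged
# as pivot_table does by default.
sparse_ratings = csr_matrix(
    (
        toy["rating"].to_numpy(),
        (title_cat.cat.codes.to_numpy(), user_cat.cat.codes.to_numpy()),
    ),
    shape=(len(title_cat.cat.categories), len(user_cat.cat.categories)),
)
print(sparse_ratings.toarray())
```

`title_cat.cat.categories` plays the role of `movie_features_df.index` when mapping row positions back to titles.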


In [13]:
# Convert the movie features pivot table to a scipy sparse matrix
# - A sparse matrix is efficient here because most entries are zeros (unrated movies)
from scipy.sparse import csr_matrix
movie_features_df_matrix = csr_matrix(movie_features_df.values)

# Import k-Nearest Neighbors algorithm
from sklearn.neighbors import NearestNeighbors

# Initialize the kNN model
# - metric='cosine' uses cosine distance (1 - cosine similarity), so smaller values mean more similar movies
# - algorithm='brute' performs a brute-force search for nearest neighbors
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')

# Fit the kNN model to the movie features matrix
model_knn.fit(movie_features_df_matrix)
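
To make the `metric='cosine'` choice concrete, here is a NumPy-only sketch of the quantity being compared: cosine distance, defined as 1 minus the cosine similarity of two rating vectors. The helper `cosine_distance` and the three vectors are hypothetical illustration values.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical rating vectors for three movies across four users (0 = not rated)
movie_x = [5.0, 0.0, 3.0, 0.0]
movie_y = [4.0, 0.0, 2.0, 0.0]   # rated by the same users as movie_x
movie_z = [0.0, 4.0, 0.0, 5.0]   # rated by a disjoint set of users

print(cosine_distance(movie_x, movie_y))
print(cosine_distance(movie_x, movie_z))
```

Because unrated entries are zero-filled, two movies with no raters in common always end up at distance 1, regardless of the ratings themselves.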

In [14]:
# Check the shape of the movie features pivot matrix
# - Returns (number of movies, number of users)
movie_features_df.shape

(450, 606)

In [15]:
# Randomly select a movie row index to query
query_index = np.random.choice(movie_features_df.shape[0])
print(query_index)  # Display the index of the selected movie

# Find the 6 nearest neighbors of the selected movie using kNN
# - iloc[query_index, :] selects the movie vector
# - .values.reshape(1, -1) reshapes it for sklearn
# - n_neighbors=6 returns the movie itself + 5 similar movies
distances, indices = model_knn.kneighbors(
    movie_features_df.iloc[query_index, :].values.reshape(1, -1),
    n_neighbors=6
)

35


In [16]:
# Display the first 5 rows of the movie features pivot table
# - Rows are movie titles
# - Columns are user IDs
# - Values are ratings (0 if the user didn't rate the movie)
movie_features_df.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title
10 Things I Hate About You (1999),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0
12 Angry Men (1957),0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2001: A Space Odyssey (1968),0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,5.0,0.0,0.0,5.0,0.0,3.0,0.0,4.5
28 Days Later (2002),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,0.0,5.0
300 (2007),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,5.0,0.0,4.0


In [17]:
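# Print the query movie, then its 5 nearest neighbors with their cosine distances
# - Position 0 is normally the query movie itself, so it is used for the header
neighbor_indices = indices.flatten()
neighbor_distances = distances.flatten()
for i in range(len(neighbor_distances)):
    if i == 0:
        print('Recommendations for {0}:\n'.format(movie_features_df.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, movie_features_df.index[neighbor_indices[i]], neighbor_distances[i]))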

Recommendations for Austin Powers: The Spy Who Shagged Me (1999):

1: Austin Powers: International Man of Mystery (1997), with distance of 0.2481430172920227:
2: American Pie (1999), with distance of 0.3662390112876892:
3: South Park: Bigger, Longer and Uncut (1999), with distance of 0.43402284383773804:
4: Indiana Jones and the Temple of Doom (1984), with distance of 0.4383839964866638:
5: Men in Black (a.k.a. MIB) (1997), with distance of 0.44469213485717773:
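
The query above picks a random row. To look up a specific movie by title instead, the same steps could be wrapped in a helper. This is a sketch, not part of the original notebook: `similar_movies` is a hypothetical function name, and `toy_features` is a made-up 4 × 4 matrix standing in for `movie_features_df`.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def similar_movies(features_df, model, title, n=5):
    """Return up to n (title, distance) pairs closest to `title`, excluding the title itself."""
    row = features_df.index.get_loc(title)
    distances, indices = model.kneighbors(
        features_df.iloc[[row]].to_numpy(), n_neighbors=n + 1
    )
    pairs = zip(indices.flatten(), distances.flatten())
    return [(features_df.index[i], float(d)) for i, d in pairs if i != row][:n]

# Hypothetical 4-movie, 4-user matrix standing in for movie_features_df
toy_features = pd.DataFrame(
    [[5.0, 4.0, 0.0, 0.0],
     [4.0, 5.0, 0.0, 1.0],
     [0.0, 0.0, 5.0, 4.0],
     [0.0, 1.0, 4.0, 5.0]],
    index=pd.Index(["Movie A", "Movie B", "Movie C", "Movie D"], name="title"),
)
toy_model = NearestNeighbors(metric="cosine", algorithm="brute")
toy_model.fit(csr_matrix(toy_features.to_numpy()))

result = similar_movies(toy_features, toy_model, "Movie A", n=2)
print(result)
```

Filtering on the row position, rather than always dropping the first neighbor, avoids discarding a real recommendation when another movie ties with the query at distance 0.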
