Aharon Rabson <br>
Movie Recommender System

In [3]:
# Basic imports
import pandas as pd
import numpy as np

In [13]:
# Read movies and ratings data
movies_df = pd.read_csv("movies.csv")
ratings_df = pd.read_csv("ratings.csv")

# Display basic information about each dataset
# Note: DataFrame.info() prints its summary and returns None, so each tuple below starts with None
movies_info = movies_df.info(), movies_df.head()
ratings_info = ratings_df.info(), ratings_df.head()

movies_info, ratings_info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


((None,
     movieId                               title  \
  0        1                    Toy Story (1995)   
  1        2                      Jumanji (1995)   
  2        3             Grumpier Old Men (1995)   
  3        4            Waiting to Exhale (1995)   
  4        5  Father of the Bride Part II (1995)   
  
                                          genres  
  0  Adventure|Animation|Children|Comedy|Fantasy  
  1                   Adventure|Children|Fantasy  
  2                               Comedy|Romance  
  3                         Comedy|Drama|Romance  
  4                                       Comedy  ),
 (None,
     userId  movieId  rating  timestamp
  0       1        1     4.0  964982703
  1       1        3     4.0  964981247
  2       1        6     4.0  964982224
  3       1       47     5.0  964983815
  4       1       50     5.0  964982931))

In [15]:
# Pivot the ratings data to create a user-item matrix
# Rows are users, columns are movies, and values are ratings
user_movie_matrix = ratings_df.pivot(index='userId', columns='movieId', values='rating')

# Display the shape and a preview of the matrix
matrix_info = user_movie_matrix.shape, user_movie_matrix.head()

matrix_info

((610, 9724),
 movieId  1       2       3       4       5       6       7       8       \
 userId                                                                    
 1           4.0     NaN     4.0     NaN     NaN     4.0     NaN     NaN   
 2           NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN   
 3           NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN   
 4           NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN   
 5           4.0     NaN     NaN     NaN     NaN     NaN     NaN     NaN   
 
 movieId  9       10      ...  193565  193567  193571  193573  193579  193581  \
 userId                   ...                                                   
 1           NaN     NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN   
 2           NaN     NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN   
 3           NaN     NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN   
 4           NaN     NaN  ...     NaN     NaN  

The user-item matrix has 610 users and 9,724 movies, with ratings as the matrix values. <br>
Each row represents a user, and each column represents a movie.<br>
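
As a rough illustration of how sparse such a matrix is, the sketch below builds a user-item matrix from a tiny made-up ratings table and measures the share of cells that hold a rating. The `toy_ratings` values are hypothetical and are not taken from ratings.csv. For the real data, 100,836 ratings spread over 610 × 9,724 cells works out to roughly 1.7% of cells filled.

```python
import pandas as pd

# Hypothetical toy ratings, using the same column names as ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.5],
})

# Rows are users, columns are movies; unrated pairs become NaN
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")

# Fraction of user-movie cells that actually hold a rating
density = toy_matrix.notna().sum().sum() / toy_matrix.size
print(toy_matrix.shape, round(density, 3))
```

Note that `pivot` raises an error if a user has rated the same movie more than once; the output above suggests that is not the case for this dataset.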

I'll now implement a collaborative filtering-based recommendation algorithm. To make recommendations,<br>
I’ll use the cosine similarity between movies to identify the most similar ones based on user ratings.<br>
This similarity measure will help in recommending movies that are often liked by users who also <br>
liked the input movie. 

In [24]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

I imported cosine_similarity from sklearn.metrics.pairwise, which calculates the <br>
cosine similarity between vectors. This metric is commonly used in recommendation <br>
systems because it measures the angle between two vectors (representing movies) in <br>
a high-dimensional space, indicating how similar they are based on user ratings.<br>
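
To make the metric concrete, here is a minimal sketch that computes cosine similarity by hand (dot product divided by the product of the vector norms) and compares it with the sklearn result. The two rating vectors `movie_a` and `movie_b` are made-up values for illustration, with 0 standing in for "not rated".

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical rating vectors for two movies (one entry per user, 0 = not rated)
movie_a = np.array([4.0, 0.0, 5.0, 3.0])
movie_b = np.array([5.0, 0.0, 4.0, 0.0])

# Cosine similarity: dot product divided by the product of the vector norms
manual = movie_a @ movie_b / (np.linalg.norm(movie_a) * np.linalg.norm(movie_b))

# cosine_similarity expects 2-D input with one row per vector
from_sklearn = cosine_similarity(movie_a.reshape(1, -1), movie_b.reshape(1, -1))[0, 0]

print(round(manual, 4), round(from_sklearn, 4))
```

Because ratings are non-negative, the score falls between 0 and 1, and it does not change if one vector is scaled by a positive constant.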

In [29]:
# 1. Replace NaN values with 0 for similarity calculation
user_movie_matrix_filled = user_movie_matrix.fillna(0)

# 2. Calculate the cosine similarity matrix between movies (transpose is to get movie-wise similarity)
movie_similarity = cosine_similarity(user_movie_matrix_filled.T)

# 3. Convert the similarity matrix to a DataFrame for easier lookup
movie_similarity_df = pd.DataFrame(movie_similarity, index=user_movie_matrix.columns, 
                                   columns=user_movie_matrix.columns)

# 4. Function to get movie recommendations based on similarity
def get_recommendations(movie_title, movies_df, movie_similarity_df, top_n=10):
    # Find the movieId for the given movie title
    movie_id = movies_df[movies_df['title'] == movie_title]['movieId']
    if movie_id.empty:
        return "Movie not found in the dataset."
    movie_id = movie_id.values[0]
    
    # movies_df lists 9,742 movies but only 9,724 have ratings, so guard against a missing column
    if movie_id not in movie_similarity_df.columns:
        return "Movie has no ratings in the dataset."
    
    # Get similarity scores for this movie and sort by similarity
    similarity_scores = movie_similarity_df[movie_id].sort_values(ascending=False)
    
    # Exclude the input movie itself and get the top_n similar movies
    similar_movie_ids = similarity_scores.iloc[1:top_n+1].index
    # Note: isin() returns rows in movies_df order, not in order of similarity
    recommendations = movies_df[movies_df['movieId'].isin(similar_movie_ids)]
    
    return recommendations[['title', 'genres']]

# Test the recommendation function with an example movie title
example_movie_title = "Toy Story (1995)"
recommendations = get_recommendations(example_movie_title, movies_df, movie_similarity_df)

recommendations

Unnamed: 0,title,genres
224,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
314,Forrest Gump (1994),Comedy|Drama|Romance|War
322,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX
418,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller
546,Mission: Impossible (1996),Action|Adventure|Mystery|Thriller
615,Independence Day (a.k.a. ID4) (1996),Action|Adventure|Sci-Fi|Thriller
911,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Sci-Fi
964,Groundhog Day (1993),Comedy|Fantasy|Romance
969,Back to the Future (1985),Adventure|Comedy|Sci-Fi
2355,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy


1. The user_movie_matrix is a pivot table where rows are users, columns are movies, and values are <br>
ratings. Since users haven't rated all movies, this matrix contains missing values (NaN). I replaced<br>
these NaN values with 0, so an unrated movie is treated the same as a rating of 0 in the similarity<br>
calculation. This is a simple way to handle missing data in collaborative filtering, but it means the<br>
score depends heavily on how many users rated both movies. <br>
2. I calculated the cosine similarity between movies by transposing user_movie_matrix_filled.<br>
The transpose ensures we are comparing columns (movies) rather than rows (users). This results<br>
in a similarity matrix where each entry represents the similarity score between two movies.<br>
3. The similarity matrix was converted into a DataFrame for easier lookup, with movieId as both<br>
row and column indices. This enables quick access to similarity scores between any pair of movies,<br>
based on their IDs.<br>
4. The get_recommendations function finds movies similar to the input movie:<br>
- Retrieve Movie ID: Given a movie title, I locate its corresponding movieId in movies_df.<br>
If the title isn't found, the function returns an error message instead of a DataFrame.<br>
- Calculate Similarity Scores: Using the movie_similarity_df, I retrieve and sort similarity<br>
scores for the specified movie in descending order, placing the most similar movies at the top.<br>
- Exclude Input Movie: To avoid recommending the input movie itself, I skip the topmost score<br>
(the movie's similarity to itself).<br>
- Return Recommendations: Finally, I retrieve the top n (default 10) similar movies from movies_df<br>
and return their titles and genres, in the order the rows appear in movies_df. <br>
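
The steps above return the matches in dataset order. If the ranking itself matters, one possible variation is to look the titles up by ranked movieId so the result stays sorted by similarity. The sketch below is an assumption-laden illustration, not the notebook's method: `get_ranked_recommendations`, `toy_movies`, and `toy_ratings` are hypothetical names and values, and the function takes a movieId rather than a title to keep it short.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data with the same column names as movies.csv / ratings.csv
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
    "genres": ["Comedy", "Drama", "Comedy", "Action"],
})
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3],
    "movieId": [1, 3, 2, 1, 3, 1, 4],
    "rating":  [5.0, 5.0, 1.0, 4.0, 4.0, 3.0, 2.0],
})

# Same pipeline as in the notebook: pivot, fill NaN with 0, item-item cosine similarity
filled = toy_ratings.pivot(index="userId", columns="movieId", values="rating").fillna(0)
sim_df = pd.DataFrame(cosine_similarity(filled.T), index=filled.columns, columns=filled.columns)

def get_ranked_recommendations(movie_id, movies_df, sim_df, top_n=10):
    # Drop the input movie by label, then keep the top_n highest scores
    scores = sim_df[movie_id].drop(movie_id).nlargest(top_n)
    # Look titles up in ranked order so the result stays sorted by similarity
    ranked = movies_df.set_index("movieId").loc[scores.index, ["title", "genres"]]
    return ranked.assign(similarity=scores.values)

top2 = get_ranked_recommendations(1, toy_movies, sim_df, top_n=2)
print(top2)
```

Dropping the input movie by label, rather than skipping the first row, also avoids relying on the input movie sorting first when another movie ties with a similarity of 1.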

The recommender system generated ten movie recommendations for the input movie <br>
"Toy Story (1995)". The recommended movies are popular titles with overlapping genres or audience appeal, <br>
listed here in the order they appear in movies_df rather than by similarity score:<br>

Star Wars: Episode IV - A New Hope (1977) <br>
Forrest Gump (1994)<br>
The Lion King (1994)<br>
Jurassic Park (1993)<br>
Mission: Impossible (1996)<br>
Independence Day (1996)<br>
Star Wars: Episode VI - Return of the Jedi (1983)<br>
Groundhog Day (1993)<br>
Back to the Future (1985)<br>
Toy Story 2 (1999)<br>