# MovieLens Recommender System

## DSC 630

## Week 9

## Predictive Analytics Assignment Week 9

## David Berberena

## 8/4/2024

# Program Start

In [1]:
# Pandas is imported here to read and work with the MovieLens small datasets. I am also importing the cosine_similarity()
# function from scikit-learn's metrics.pairwise module. After reading up on recommender systems, I found that cosine 
# similarity is one of the most popular measures used to determine the similarity of two vectors (in this case, each 
# vector holds one movie's ratings across all users).

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# I will read in both the movies and ratings datasets. 
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

# The head() function is used to verify that the data has been correctly loaded.

print(movies.head())
print(ratings.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


In [2]:
# I see that the movie titles in the movies dataset include the year each movie was released. As the year is not something
# a user would typically enter (or even remember off the top of their head), I will move the year into its own column. 
# str.extract() with a regular expression captures the four digits inside the parentheses into a new year column, and 
# str.replace() followed by str.strip() then removes the year from the title column. This allows the user to enter only 
# the movie name, without the year, and still receive the ten recommended movies. 

movies['year'] = movies['title'].str.extract(r'\((\d{4})\)', expand = False)
movies['title'] = movies['title'].str.replace(r'\(\d{4}\)', '', regex = True).str.strip()

# I will also make the movie titles lowercase so they can be easily found when a user inputs a movie title in the wrong 
# casing (the user input's casing will be addressed later).

movies['title'] = movies['title'].str.lower()

# The head() function is used to verify that the data has been correctly transformed.

movies.head()

   movieId                        title                                       genres  year
0        1                    toy story  Adventure|Animation|Children|Comedy|Fantasy  1995
1        2                      jumanji                   Adventure|Children|Fantasy  1995
2        3             grumpier old men                               Comedy|Romance  1995
3        4            waiting to exhale                         Comedy|Drama|Romance  1995
4        5  father of the bride part ii                                       Comedy  1995
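
The title cleanup above can be checked on a few rows before applying it to the full dataset. The sketch below uses a hypothetical `sample` DataFrame (these rows are made up for illustration, not copied from movies.csv) to show two side effects worth knowing about: a title without a parenthesised year gets a missing value in the year column, and two movies that differ only by year end up with the same title once the year is removed.

```python
import pandas as pd

# Hypothetical sample titles for illustration only.
sample = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['Toy Story (1995)', 'Sabrina (1954)', 'Sabrina (1995)', 'Untitled Film'],
})

# Capture the four-digit year, then strip it from the title and lowercase the result.
sample['year'] = sample['title'].str.extract(r'\((\d{4})\)', expand=False)
sample['title'] = (
    sample['title']
    .str.replace(r'\(\d{4}\)', '', regex=True)
    .str.strip()
    .str.lower()
)

# Rows whose title had no parenthesised year end up with a missing year.
missing_year = sample[sample['year'].isna()]

# Rows whose title is no longer unique once the year has been removed.
ambiguous = sample[sample['title'].duplicated(keep=False)]

print(sample)
print(ambiguous[['movieId', 'title', 'year']])
```

If the real dataset contains such repeated titles, a lookup by title alone would pick whichever matching row comes first, so keeping the year column available for disambiguation may be useful.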


In [3]:
# As I am choosing to use the cosine_similarity() function to assess movie similarity scores, I need to transform the 
# ratings dataset (which contains the relevant movie ratings) into a matrix usable by the scoring function. To do this, I 
# will be using the pivot() function to specify the matrix being built with users as the rows, the movies themselves as the 
# columns, and the ratings as the values. 

ratings_matrix = ratings.pivot(index='userId', columns='movieId', values='rating')

# Now that the matrix is made, there are many NaN values as each user did not rate every movie within the dataset. I will 
# fill the missing values with 0 (indicating the users did not rate the movies) using fillna(). 

ratings_matrix.fillna(0, inplace = True)

# The head() function is used to verify that the ratings dataset matrix has been properly pivoted and filled with zeroes.

ratings_matrix.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
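
To make the reshaping step easier to follow, here is a minimal sketch on a hypothetical `toy_ratings` table (three users, three movies, invented ratings). It shows how pivot() turns the long list of ratings into a user-by-movie grid, how many cells are empty before filling, and what the grid looks like after fillna(0).

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2, 3, 3],
    'movieId': [10, 20, 10, 20, 30],
    'rating': [4.0, 5.0, 3.0, 2.0, 4.5],
})

# Users become rows, movies become columns, ratings become the cell values.
toy_matrix = toy_ratings.pivot(index='userId', columns='movieId', values='rating')

# A 3 x 3 grid has 9 cells but only 5 ratings exist, so 4 cells are NaN.
n_missing = int(toy_matrix.isna().sum().sum())

# Replace the missing ratings with 0 so the matrix is fully numeric.
toy_matrix = toy_matrix.fillna(0)

print(n_missing)
print(toy_matrix)
```

Note that pivot() raises a ValueError if the same user and movie pair appears more than once; pivot_table() with an aggregation function is one alternative in that situation.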


In [4]:
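# The cosine_similarity() function can now be used to calculate the similarity of the movie vectors. Cosine similarity 
# depends on the angle between two vectors rather than on their magnitude, which is why I have chosen it as the similarity
# measure for the recommender system. The ratings matrix is transposed so that each row passed to the function is a movie 
# rather than a user.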

movie_similarity_scores = cosine_similarity(ratings_matrix.T)

# The similarity scores are returned as a two-dimensional NumPy array, so for ease of manipulation, I will convert the 
# scores into a new DataFrame which I will craft to look like a correlation matrix, with the rows and columns being the 
# movie IDs and the values being the similarity scores.

recommendations = pd.DataFrame(movie_similarity_scores, index = ratings_matrix.columns, columns = ratings_matrix.columns)

# The head() function is used to verify that the DataFrame has been correctly created.

recommendations.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
movieId
1,1.0,0.410562,0.296917,0.035573,0.308762,0.376316,0.277491,0.131629,0.232586,0.395573,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.410562,1.0,0.282438,0.106415,0.287795,0.297009,0.228576,0.172498,0.044835,0.417693,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.296917,0.282438,1.0,0.092406,0.417802,0.284257,0.402831,0.313434,0.30484,0.242954,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.035573,0.106415,0.092406,1.0,0.188376,0.089685,0.275035,0.158022,0.0,0.095598,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.308762,0.287795,0.417802,0.188376,1.0,0.298969,0.474002,0.283523,0.335058,0.218061,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
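
As a small illustration of what the similarity scores mean, the sketch below computes cosine similarity by hand with NumPy and compares it with scikit-learn's result. The `toy` array and the `cosine` helper are hypothetical and exist only for this example. Movies A and B receive proportional ratings, so their similarity is 1 even though the rating magnitudes differ, while movies A and C share no raters and score 0.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are hypothetical users; columns are hypothetical movies A, B and C.
toy = np.array([
    [4.0, 2.0, 0.0],
    [2.0, 1.0, 0.0],
    [0.0, 0.0, 5.0],
])

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Transposing makes each row a movie vector, as in the notebook.
movie_a, movie_b, movie_c = toy.T

by_hand_ab = cosine(movie_a, movie_b)  # proportional ratings
by_hand_ac = cosine(movie_a, movie_c)  # no users in common

# scikit-learn returns the full movie-by-movie matrix in one call.
sklearn_scores = cosine_similarity(toy.T)

print(by_hand_ab, by_hand_ac)
print(sklearn_scores)
```

Because ratings are never negative, every score in a matrix built this way falls between 0 and 1.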


In [5]:
# For the recommender system to let a user input any movie within the dataset and receive the ten recommended movies based
# on cosine similarity score, I will craft a custom function that takes an input movie from a user, finds the movie ID of 
# the user's movie choice, retrieves the similarity scores for that movie, sorts them in descending order, extracts the 
# movie IDs of the top ten scores, links the movie titles to those ten IDs, and returns them for the user to read as 
# recommendations. The function takes three required parameters: the user's movie title, the movies dataset, and the 
# recommendations similarity score DataFrame. A fourth, optional parameter sets the number of recommendations and 
# defaults to 10.

def user_recommended_movies(user_movie, movies, recommendations, num_similar_movies = 10):
    
# To extract the movie ID from the user's choice of movie, Boolean indexing is used along with the values attribute. The 
# user_movie parameter is passed through the lower() function first, as this is the movie title that the user inputs into 
# the program. This way, any casing issue with the input is handled and the title still matches the lowercase titles in 
# the movies dataset.
    
    user_movie = user_movie.lower()
    try:
        movie_ID = movies[movies['title'] == user_movie]['movieId'].values[0]
    except IndexError:
        raise ValueError(f"Movie '{user_movie}' not found in the dataset.")
    
# The similarity scores of that movie ID are then obtained through the perusal of the recommendations similarity matrix. 
# These scores are then arranged in descending order using sort_values().

    movie_similarity_scores = recommendations[movie_ID]
    top_down_movies = movie_similarity_scores.sort_values(ascending = False)
    
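# Using the iloc indexer, I can retrieve the movie IDs for the top ten movies by cosine similarity score. The slice starts
# at position 1 because the highest score (position 0) normally belongs to the chosen movie itself, with a similarity of 1.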

    top_ten_movies = top_down_movies.iloc[1:num_similar_movies + 1].index
    
# Now I need to find each movie ID within the movies dataset and capture the movie titles belonging to those IDs. I am 
# employing the isin() function along with more Boolean indexing, which returns the titles in the order they appear in the
# movies dataset rather than in order of similarity score. A list comprehension then converts the lowercase movie titles 
# to title casing for user display purposes using the title() function.

    recommended_movies = movies[movies['movieId'].isin(top_ten_movies)]['title'].values
    recommended_movies = [movie.title() for movie in recommended_movies]

# To ensure that I receive the recommended movies variable at the end of the function, I will return that variable here.    
    
    return recommended_movies
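
The function above returns the correct ten movies, but isin() lists them in dataset order. If the list should instead run from most to least similar, one possible variation is sketched below. Everything in it is hypothetical: `ranked_titles`, `toy_movies`, and `toy_similarity` are invented for illustration and are not part of the notebook. The sketch also drops the chosen movie by its label rather than by position, which avoids relying on it being sorted first when another movie ties at a score of 1.

```python
import pandas as pd

# Hypothetical stand-ins for the movies table and the similarity matrix.
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['alpha', 'beta', 'gamma', 'delta'],
})
ids = toy_movies['movieId'].tolist()
toy_similarity = pd.DataFrame(
    [[1.0, 0.2, 0.5, 0.9],
     [0.2, 1.0, 0.1, 0.3],
     [0.5, 0.1, 1.0, 0.4],
     [0.9, 0.3, 0.4, 1.0]],
    index=ids, columns=ids,
)

def ranked_titles(movie_id, movies, similarity, n=2):
    """Return the n most similar titles, ordered from highest to lowest score."""
    # Remove the movie's similarity with itself by label, not by position.
    scores = similarity[movie_id].drop(movie_id)
    top_ids = scores.sort_values(ascending=False).head(n).index
    # Look titles up one ID at a time so the ranking order is preserved.
    titles = movies.set_index('movieId')['title']
    return [titles.loc[i].title() for i in top_ids]

ranked = ranked_titles(1, toy_movies, toy_similarity)

# For comparison: isin() keeps the row order of the movies table.
isin_order = toy_movies[toy_movies['movieId'].isin([3, 4])]['title'].tolist()

print(ranked)
print(isin_order)
```

In this toy data, movie 4 is more similar to movie 1 than movie 3 is, so the ranked list puts it first while the isin() lookup lists movie 3 first.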

In [None]:
# To make the recommendation program interactive, I will craft a small program that allows the user to input a movie that 
# exists within the dataset to receive the ten recommended movies. 

while True:
    print('Welcome to the Movie Recommendation Program!\n'  
      'This program allows you to get movie recommendations based on movies you have seen.\n')
    
    user_movie = input('Please type a movie you have watched that you would like recommendations for.\n').lower()
    try:
        recommended_movies = user_recommended_movies(user_movie, movies, recommendations)
        print(f'\nMovies recommended for "{user_movie}":\n')
        for index, movie in enumerate(recommended_movies, start = 1):
            print(f'{index}. {movie}')
    except ValueError as e:
        print(e)
        
    repeated_recommendation = input('\nWould you like to look up recommendations for another movie? (Yes/No)\n')
    if repeated_recommendation.lower() not in ['yes', 'y']:
        print('Thank you for using the Movie Recommendation Program! Have a nice day!')
        break

Welcome to the Movie Recommendation Program!
This program allows you to get movie recommendations based on movies you have seen.

Please type a movie you have watched that you would like recommendations for.
deadpool

Movies recommended for "deadpool":

1. Edge Of Tomorrow
2. Guardians Of The Galaxy
3. John Wick
4. Kingsman: The Secret Service
5. Mad Max: Fury Road
6. Star Wars: Episode Vii - The Force Awakens
7. Avengers: Age Of Ultron
8. Ant-Man
9. The Martian
10. Zootopia

Would you like to look up recommendations for another movie? (Yes/No)
yes
Welcome to the Movie Recommendation Program!
This program allows you to get movie recommendations based on movies you have seen.



In [None]:
user_movie = 'batman begins'
recommended_movies = user_recommended_movies(user_movie, movies, recommendations)
print(f'\nMovies recommended for "{user_movie}":\n')
for index, movie in enumerate(recommended_movies, start = 1):
    print(f'{index}. {movie}')

The recommender system I have created here takes the MovieLens small datasets (the movies and ratings CSV files) and uses them to recommend ten additional movies to a user based on one movie they enter. This is done by transforming the ratings dataset into a user-by-movie matrix and passing it to the cosine similarity function. In general, cosine similarity ranges from -1 to 1, where -1 indicates two vectors pointing in opposite directions and 1 indicates two vectors pointing in the same direction; because the ratings here are never negative, the scores in this matrix fall between 0 and 1. A function then uses this similarity score matrix to retrieve the scores for the user's chosen movie, sort them in descending order, and return the ten movies with the highest similarity scores. My program adds user-friendly features so that users can look up as many movies within the dataset as they wish.

## References

1. Gomes, N. D. (2023, January 31). The cosine similarity and its use in recommendation systems. Retrieved July 29, 2024, from https://naomy-gomes.medium.com/the-cosine-similarity-and-its-use-in-recommendation-systems-cb2ebd811ce1