Derek Lamb
DSC630 
Week 10 Assignment

In [53]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load the datasets
movies_df = pd.read_csv('/Users/dereklamb/Downloads/ml-latest-small/movies.csv')
ratings_df = pd.read_csv('/Users/dereklamb/Downloads/ml-latest-small/ratings.csv')


print(movies_df.head())
print(ratings_df.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                                 Comedy|Drama

In [54]:
# Merge datasets on movieId
movie_ratings_df = pd.merge(ratings_df, movies_df, on='movieId')

# Split genres and create a binary DataFrame
genres_split = movies_df['genres'].str.get_dummies(sep='|')
genre_matrix = genres_split.values

# Compute cosine similarity between movies
cosine_sim = cosine_similarity(genre_matrix)

# Recommendation Function
def recommend_movies(input_movie, num_recommendations=10):
    # Check if the input movie is in the dataset
    if input_movie not in movies_df['title'].values:
        return "Movie not found in the dataset."

    # Get the index of the input movie
    idx = movies_df[movies_df['title'] == input_movie].index[0]

    # Get the pairwise similarity scores of all movies with the input movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

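    # Drop the input movie by index (movies with identical genres tie at 1.0,
    # so the input is not guaranteed to sort first), then keep the top N
    sim_scores = [s for s in sim_scores if s[0] != idx][:num_recommendations]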

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top recommended movies
    return movies_df['title'].iloc[movie_indices].tolist()

In [55]:
# Example usage of the recommender system
recommended_movies = recommend_movies("Jumanji (1995)", num_recommendations=5)
print(recommended_movies)

['Indian in the Cupboard, The (1995)', 'NeverEnding Story III, The (1994)', 'Escape to Witch Mountain (1975)', "Darby O'Gill and the Little People (1959)", 'Return to Oz (1985)']


This recommender system uses a content-based filtering approach: it suggests movies based on their features (in this case, genres). Content-based systems generally recommend items that resemble ones a user has already liked; this implementation takes a single movie title as input and returns the movies whose genre profiles are most similar to it.

Steps Performed

Load the Datasets:
Objective: To read the movie and ratings data into memory.
Here, we import the pandas library and load two CSV files: movies.csv (which contains movie titles and genres) and ratings.csv (which contains user ratings for movies).

Merge the Datasets:
Objective: To combine the movie and rating data based on the common column movieId.
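This step creates a new DataFrame, movie_ratings_df, containing each user rating alongside the corresponding movie title and genres. The genre-based recommender itself reads only movies_df, so the merged frame is not used in the later steps.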
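
As a small illustration of what the merge does, the sketch below uses two made-up frames (the toy rows and titles are assumptions, not the MovieLens files). By default pd.merge performs an inner join, so a movie with no ratings does not appear in the result.

```python
import pandas as pd

# Toy stand-ins for ratings.csv and movies.csv (illustrative values only)
toy_ratings = pd.DataFrame({
    "userId": [1, 2, 1],
    "movieId": [1, 1, 2],
    "rating": [4.0, 3.5, 5.0],
})
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Comedy", "Drama", "Comedy|Drama"],
})

# Inner join on movieId: one row per rating, with title and genres attached
toy_merged = pd.merge(toy_ratings, toy_movies, on="movieId")
print(toy_merged)
```

"Movie C" has no ratings in the toy data, so it is absent from toy_merged.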

Feature Engineering:
Objective: To convert the genres of each movie into a format suitable for analysis.
The genres column is split on the "|" separator into one column per genre, where each value is binary (0 or 1) and indicates whether the movie has that genre. The result is a binary DataFrame, genres_split, which is then converted into a NumPy array called genre_matrix.
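
To make the encoding concrete, here is a minimal sketch that applies the same str.get_dummies call to three genre strings taken from the listing above. The variable names toy_genres and toy_dummies are placeholders for this example.

```python
import pandas as pd

# Three genre strings in the same pipe-separated format as movies.csv
toy_genres = pd.Series([
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
    "Comedy|Drama",
])

# One column per distinct genre; 1 where the movie has that genre, else 0
toy_dummies = toy_genres.str.get_dummies(sep="|")
print(toy_dummies)
```

The columns come out in alphabetical order (Adventure, Children, Comedy, Drama, Fantasy, Romance), and each row is the binary genre vector for one movie.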

Calculate Similarity:
Objective: To compute the similarity between movies based on their genres.
Cosine similarity is computed between every pair of rows of the binary genre matrix. The resulting cosine_sim array is square, with one row and one column per movie; the element at row i, column j measures how similar movies i and j are by genre, from 0 (no shared genres) to 1 (identical genre sets).
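
For two binary genre vectors, cosine similarity reduces to the number of shared genres divided by the square root of the product of the two genre counts. The sketch below works this out by hand for Toy Story and Jumanji, using the genres shown in the listing above. The cosine helper is written only for this illustration (it is not the scikit-learn function) and assumes neither vector is all zeros.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Column order: Adventure, Animation, Children, Comedy, Fantasy
toy_story = [1, 1, 1, 1, 1]  # Adventure|Animation|Children|Comedy|Fantasy
jumanji = [1, 0, 1, 0, 1]    # Adventure|Children|Fantasy

# 3 shared genres / (sqrt(5) * sqrt(3)) = 3 / sqrt(15)
print(round(cosine(toy_story, jumanji), 4))  # 0.7746
```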

Recommendation Function:
Objective: To create a function that takes a movie title as input and returns a list of recommended movies based on their similarity to the input movie.
Inside the function:
The input movie is checked against the movie titles to ensure it exists in the dataset.
The index of the input movie is obtained to retrieve its similarity scores from the cosine_sim array.
Similarity scores are sorted in descending order to identify the most similar movies.
The indices of the top similar movies (excluding the input movie itself) are extracted and used to return their titles.
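
Skipping the first sorted entry removes the input movie only when it happens to sort first. Movies with identical genre sets all score 1.0, so one hedge is to exclude the input by index before ranking. The sketch below shows that idea on a made-up similarity row; top_n_similar and the scores are hypothetical and pure Python, not part of the notebook.

```python
def top_n_similar(scores, idx, n):
    """Return the indices of the n highest scores, skipping position idx."""
    ranked = sorted(
        (i for i in range(len(scores)) if i != idx),
        key=lambda i: scores[i],
        reverse=True,
    )
    return ranked[:n]

# Hypothetical similarity row for the movie at index 2.
# Index 0 has the same genres as index 2, so both score 1.0.
row = [1.0, 0.5, 1.0, 0.0, 0.8]
print(top_n_similar(row, idx=2, n=3))  # [0, 4, 1]
```

With this row, sorting every index and dropping the first entry would discard index 0 and keep index 2 (the input itself), because Python's sort is stable and index 0 precedes index 2 among the tied scores.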

Usage Example:
Objective: To demonstrate how to use the recommendation function.
This step calls the recommend_movies function with a movie title ("Jumanji (1995)") and prints the top 5 recommended movies based on genre similarity. The title must match the title column exactly, including the release year.
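
Because the lookup is an exact string match, and some titles place the article at the end (for example "Indian in the Cupboard, The (1995)"), it can help to search for the stored title first. The helper below is a hypothetical sketch, not part of the notebook; it assumes a pandas Series of titles such as movies_df['title'].

```python
import pandas as pd

def find_titles(titles, query):
    """Hypothetical helper: titles containing query, ignoring case."""
    mask = titles.str.contains(query, case=False, regex=False)
    return titles[mask].tolist()

# A few titles copied from the output above
toy_titles = pd.Series([
    "Toy Story (1995)",
    "Jumanji (1995)",
    "Indian in the Cupboard, The (1995)",
])

print(find_titles(toy_titles, "jumanji"))   # ['Jumanji (1995)']
print(find_titles(toy_titles, "cupboard"))  # ['Indian in the Cupboard, The (1995)']
```

A title returned by the helper can then be passed to recommend_movies unchanged.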

Summary
The recommender system operates by:
- Loading the relevant movie and ratings data.
- Merging the datasets to link ratings with movie titles and genres.
- Processing the genre information into a binary format for analysis.
- Calculating the similarity between movies using cosine similarity based on their genres.
- Providing a function that recommends similar movies when a user inputs a title.

This method uses content-based filtering, which relies on the items' attributes (in this case, movie genres) to make recommendations. Given a film a user already likes, the system can help them discover other movies with a similar genre profile.