# S.a.M.: Movie Recommender Feature

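The movie recommendation system will suggest movies the user might like to watch, based on their previously watched movies or the genres they like. The version in this notebook uses k-nearest neighbors (KNN) with cosine distance over user ratings, i.e. item-based collaborative filtering; genres are loaded but not yet used as features.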

# Step 1: Loading Data

First, import the necessary libraries, then read the data.

In [68]:
import numpy as np
import pandas as pd
from datetime import datetime
from sklearn.neighbors import NearestNeighbors


In [69]:
movieNames = pd.read_csv('movies.csv')
movieRatings = pd.read_csv('ratings.csv')

In [70]:
movieRatings.head() # First dataset

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


## Step 1a: Convert time stamp column from UNIX to readable format

In [71]:
# The timestamp column holds UNIX epoch seconds, so convert it to a readable datetime.
# Note: datetime.fromtimestamp uses the local timezone of the machine running the notebook.
def UNIXtoReadable(ts):
    return pd.to_datetime(datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))

movieRatings.timestamp = movieRatings.timestamp.apply(UNIXtoReadable)
movieRatings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,2000-07-30 11:45:03
1,1,3,4.0,2000-07-30 11:20:47
2,1,6,4.0,2000-07-30 11:37:04
3,1,47,5.0,2000-07-30 12:03:35
4,1,50,5.0,2000-07-30 11:48:51
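As a possible alternative to applying a Python function row by row, pandas can convert the whole column in one call. The sketch below is not what the notebook runs; the `sample` frame is made up from the first raw timestamps shown earlier. It interprets the integers as seconds since the epoch in UTC, so the hours can differ from the output above, which reflects the local timezone of the original machine.

```python
import pandas as pd

# Small sample mirroring the first raw timestamps of ratings.csv shown above
sample = pd.DataFrame({"timestamp": [964982703, 964981247, 964982224]})

# Vectorized conversion: the integers are read as seconds since the epoch (UTC)
sample["timestamp"] = pd.to_datetime(sample["timestamp"], unit="s")
print(sample)
```

On a large ratings table this usually runs much faster than `apply`, at the cost of producing UTC rather than local times.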


In [72]:
movieNames.head() # Second dataset

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


# Step 2: Merge the first and second datasets

In [73]:
movies = pd.merge(movieRatings, movieNames, how = 'inner')
movies.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,2000-07-30 11:45:03,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,2000-07-30 11:20:47,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,2000-07-30 11:37:04,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,2000-07-30 12:03:35,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,2000-07-30 11:48:51,"Usual Suspects, The (1995)",Crime|Mystery|Thriller


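In the pd.merge() function, the how parameter specifies the type of join to perform between the two DataFrames. In this case, how='inner' performs an inner join. Because no on parameter is given, pandas joins on the column the two DataFrames have in common, which here is movieId.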

An inner join returns only the rows where there is a match in both DataFrames based on the specified join column(s). In other words, it keeps only the rows that have a common value in the specified column(s) across both DataFrames.
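The behavior described above can be seen on a small example. This is only an illustrative sketch: the `ratings` and `titles` frames are hypothetical toy data, with a movieId 99 that has a rating but no title entry.

```python
import pandas as pd

# Hypothetical toy data: movieId 99 is rated but has no title; movieId 2 has a title but no rating
ratings = pd.DataFrame({"userId": [1, 1, 2], "movieId": [1, 99, 1], "rating": [4.0, 3.0, 5.0]})
titles = pd.DataFrame({"movieId": [1, 2], "title": ["Toy Story (1995)", "Jumanji (1995)"]})

# With no `on` argument, merge joins on the columns both frames share (here: movieId)
merged = pd.merge(ratings, titles, how="inner")
print(merged)
```

Only the two ratings for movieId 1 survive: the rating for movieId 99 and the title for movieId 2 have no match on the other side, so the inner join drops them.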

In [74]:
print("The dimensions of our merged dataset movies are:", movies.shape)

The dimensions of our merged dataset movies are: (100836, 6)


Our merged dataset consists of 100,836 ratings (one row per user-movie rating) with 6 columns: userId, movieId, rating, timestamp, title, and genres.

## Step 2a: Remove the movie year from movie titles

In [75]:
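# Keep the text before the first '(' and drop the trailing space.
# Titles with an earlier parenthesis, e.g. 'Seven (a.k.a. Se7en) (1995)', are cut at that point.
movies['title'] = movies.title.str.split('(').str[0].str[:-1]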
movies.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,2000-07-30 11:45:03,Toy Story,Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,2000-07-30 11:20:47,Grumpier Old Men,Comedy|Romance
2,1,6,4.0,2000-07-30 11:37:04,Heat,Action|Crime|Thriller
3,1,47,5.0,2000-07-30 12:03:35,Seven,Mystery|Thriller
4,1,50,5.0,2000-07-30 11:48:51,"Usual Suspects, The",Crime|Mystery|Thriller
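Splitting on the first parenthesis also removes alternative titles, which is why "Seven (a.k.a. Se7en)" becomes "Seven" above. If only the year should be removed, one option is a regular expression anchored at the end of the string. This is a sketch of that alternative, not the approach used for the outputs in this notebook, and the `titles` series is a hand-picked sample.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "Seven (a.k.a. Se7en) (1995)",
    "Usual Suspects, The (1995)",
])

# Remove only a trailing "(YYYY)" group plus any surrounding whitespace
cleaned = titles.str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)
print(cleaned.tolist())
```

Titles without a trailing four-digit year are left unchanged by this pattern.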


# Step 3: Create a dictionary mapping movieIds to corresponding movie titles

This dictionary can also be used to replace movieIds with their titles in other parts of the code.

In [76]:
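# Create a (movieId: title) dictionary so movieIds can be replaced with their titles.
# Deduplicate on movieId (not title) so that every movieId keeps an entry, even when two movies share a title.
movieIdDict = movies.drop_duplicates('movieId')[['movieId', 'title']].set_index('movieId').to_dict()['title']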

# First 5 elements of this dictionary
list(movieIdDict.items())[:5]

[(1, 'Toy Story'),
 (3, 'Grumpier Old Men'),
 (6, 'Heat'),
 (47, 'Seven'),
 (50, 'Usual Suspects, The')]
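One thing to watch when building such a dictionary is which column is used for deduplication. The sketch below uses hypothetical rows in which two different movieIds end up with the same cleaned title; whether this occurs in the real data is not checked here.

```python
import pandas as pd

# Hypothetical rows: movieIds 10 and 20 are different movies that share a title
toy = pd.DataFrame({"movieId": [10, 20, 10], "title": ["Hamlet", "Hamlet", "Hamlet"]})

byTitle = toy.drop_duplicates("title").set_index("movieId")["title"].to_dict()
byId = toy.drop_duplicates("movieId").set_index("movieId")["title"].to_dict()

print(byTitle)  # only one of the two ids survives
print(byId)     # both ids are kept
```

An id missing from the dictionary would later map to NaN when the pivot table columns are renamed, so keeping one entry per movieId is the safer choice.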

# Step 4: Create a pivot table dataRecommendation from the movies DataFrame 

In [77]:
# Create a pivot table with one row per user (userId), one column per movie (movieId), and ratings as values; unrated movies are filled with 0
dataRecommendation = movies.pivot(index='userId', columns='movieId', values='rating').fillna(0)

# Replacing dataRecommendation columns with the movie titles
dataRecommendation.columns = dataRecommendation.columns.map(movieIdDict)

# Show the ratings given by the first 5 users to the first 5 movies
dataRecommendation.head(5).iloc[:, :5]

userId,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II
1,4.0,0.0,4.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0


# Step 5: Use k-nearest neighbors (KNN) for item-based collaborative filtering to generate movie recommendations

In [78]:
# Use KNN to find the movies closest to the target movie, using cosine distance between rating vectors.
knn = NearestNeighbors(n_neighbors=11, metric='cosine', algorithm='brute')
# Transpose the pivot table so that each row is one movie's vector of user ratings
knn.fit(dataRecommendation.values.T)

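Here, each movie's feature vector is its column of user ratings in the pivot table. Other attributes, such as genres, could be added as features but are not used in this notebook. Cosine is chosen as the metric because it compares the direction of two rating vectors rather than their magnitude, so a movie with many ratings and a movie with few can still be close. Note that scikit-learn's 'cosine' metric is the cosine distance, which is 1 minus the cosine similarity, so smaller values mean more similar movies.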
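To make the metric concrete, the sketch below computes the cosine distance by hand with NumPy. The helper `cosine_distance` and the three rating vectors are hypothetical and only serve as an illustration.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical rating vectors: one entry per user, 0 meaning "not rated"
movieA = np.array([4.0, 0.0, 5.0])
movieB = np.array([2.0, 0.0, 2.5])   # same direction as movieA, half the scale
movieC = np.array([0.0, 3.0, 0.0])   # rated by a disjoint set of users

print(cosine_distance(movieA, movieB))  # close to 0: same rating pattern
print(cosine_distance(movieA, movieC))  # 1: no users in common
```

Scaling a vector does not change its distance to another vector, which is why movieA and movieB count as nearly identical here.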

In [79]:
# Here are our movie recommendations for Toy Story
recommendationResult = list(knn.kneighbors([dataRecommendation['Toy Story'].values], 8))

recommendationResult  # The first array gives the cosine distances. The second array gives the column positions of the matching movies in dataRecommendation. We'll need to convert them to a more readable form.

[array([[0.        , 0.42739874, 0.4343632 , 0.43573831, 0.44261183,
         0.45290409, 0.45885465, 0.4589107 ]]),
 array([[   0, 2353,  418,  615,  224,  314,  322,  910]])]

This step generates movie recommendations for Toy Story. The first array holds the cosine distance between the target movie and each recommended movie (0 for Toy Story itself, which is its own nearest neighbor). The second array holds the column position of each movie in dataRecommendation, not its movieId.
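The shape of this output is easier to read on a tiny example. The sketch below fits the same kind of model on a hypothetical 3-movie matrix (`movieVectors` and `model` are illustrative names, not part of the notebook) and queries it with the first movie.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical movie-by-user matrix: each row is one movie's rating vector
movieVectors = np.array([
    [5.0, 4.0, 0.0],   # position 0
    [4.0, 5.0, 0.0],   # position 1, rated much like position 0
    [0.0, 0.0, 3.0],   # position 2, rated by a different user
])

model = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
model.fit(movieVectors)

# Query with the vector at position 0 and ask for its 2 nearest rows
distances, positions = model.kneighbors([movieVectors[0]], 2)
print(distances)  # first entry is about 0 (the query itself)
print(positions)  # row positions in movieVectors, not movie ids
```

If a similarity score is preferred for display, it can be derived as 1 minus each returned distance.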

In [80]:
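# Build a DataFrame that stores each neighbor's column position (labelled movieId) and its cosine distance.
# Despite the column name, the values are cosine distances (1 - cosine similarity), not angles in degrees.
recommendations = pd.DataFrame(np.vstack((recommendationResult[1], recommendationResult[0])),
                 index=['movieId', 'Cosine_Similarity (degree)']).T
# Drop the first row, which is Toy Story itself
recommendations = recommendations.drop([0]).reset_index(drop=True)
recommendations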

Unnamed: 0,movieId,Cosine_Similarity (degree)
0,2353.0,0.427399
1,418.0,0.434363
2,615.0,0.435738
3,224.0,0.442612
4,314.0,0.452904
5,322.0,0.458855
6,910.0,0.458911


The cell above creates a DataFrame that stores the movie recommendations.

In [81]:
# Map each column position in dataRecommendation to the movie title at that position
positionToTitle = dict(enumerate(dataRecommendation.columns))
recommendations.movieId = recommendations.movieId.map(positionToTitle)

recommendations

Unnamed: 0,movieId,Cosine_Similarity (degree)
0,Toy Story 2,0.427399
1,Jurassic Park,0.434363
2,Independence Day,0.435738
3,Star Wars: Episode IV - A New Hope,0.442612
4,Forrest Gump,0.452904
5,"Lion King, The",0.458855
6,Star Wars: Episode VI - Return of the Jedi,0.458911


The cell above maps the values in the movieId column, which are column positions in dataRecommendation, to movie titles.

## Step 5a: Put it all together and create a function to generate movie recommendations

In [82]:
def movieRecommendation(movie_title, num_recommendations):
    # Check that the movie exists as a column of the dataRecommendation pivot table
    if movie_title not in dataRecommendation.columns:
        print(f"The movie '{movie_title}' does not exist in the dataset.")
        return None
    # Column position of the target movie; if several movies share a title, use the first
    position = np.flatnonzero(dataRecommendation.columns == movie_title)[0]
    target = dataRecommendation.iloc[:, position].values
    # Ask for one extra neighbor because the closest match is the movie itself
    distances, positions = knn.kneighbors([target], num_recommendations + 1)
    # Look up titles by column position and keep the cosine distances alongside them
    recommendations = pd.DataFrame({'movieId': dataRecommendation.columns[positions[0]],
                                    'Cosine_Similarity (degree)': distances[0]})
    # Drop the first row (the target movie itself)
    return recommendations.drop([0]).reset_index(drop=True)

In [83]:
movieRecommendation('Toy Story', 7)

Unnamed: 0,movieId,Cosine_Similarity (degree)
0,Toy Story 2,0.427399
1,Jurassic Park,0.434363
2,Independence Day,0.435738
3,Star Wars: Episode IV - A New Hope,0.442612
4,Forrest Gump,0.452904
5,"Lion King, The",0.458855
6,Star Wars: Episode VI - Return of the Jedi,0.458911


In [84]:
movieRecommendation('Movie Does Not Exist', 1)

The movie 'Movie Does Not Exist' does not exist in the dataset.


Accuracy

Recall

F1

Dataset visualization