# MovieLens Recommendation Systems

**Authors:**  Tony Bai, Eddie Lin, Douglas Lu, Shane Mangold
***

## Overview

In this project, we explore different types of recommendation systems using the classic MovieLens 100K dataset.

## Business Problem
![image.png](attachment:image.png)

Every recommendation system these days, whether it's Hulu, Netflix or Prime Video, is limited to the movies that are available on its own platform. To circumvent this limitation, we aim to build a third-party movie recommender that takes a title and recommends similar movies without being limited to a specific streaming platform.

## Data Understanding

The data for this project comes from the well-known MovieLens 100k dataset. The data represent user ratings for a given movie as well as the genres tied to each movie. Though many variables are included, the target variable for a recommendation system is somewhat ambiguous. Some of our models predict the rating a given user would give a movie, while other models recommend movies based on inputs and have no ground truth to compare against. Thus, we must choose evaluation metrics that suit each type of model.

### Summary Data:

>The movies CSV file contains 9742 movies, each with a movieId, title, and genres. There are no null values in this dataset.

>The ratings CSV file contains 100836 ratings, which is far more than the number of rows in the movies CSV file. There are also no null values in this dataset.

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from surprise import Reader, Dataset, SVDpp, accuracy
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import linear_kernel
from surprise.model_selection import cross_validate, GridSearchCV, train_test_split
from surprise.prediction_algorithms import SVD, KNNWithMeans, KNNBasic, KNNBaseline
from sklearn.feature_extraction.text import TfidfVectorizer

%matplotlib inline


In [None]:
#load dataset
movies = pd.read_csv('Data/movies.csv')
ratings = pd.read_csv('Data/ratings.csv')              

In [None]:
#exploratory data
sns.set_style('whitegrid')
sns.set(font_scale = 1)
sns.kdeplot(ratings['rating'].fillna(ratings['rating'].median()));

As the plot shows, most ratings in this dataset are above 3, which suggests that most users are generous with their ratings.

In [None]:
movie_df2 = movies.copy()
movie_df2['genres'] = movie_df2['genres'].str.strip().str.split('|')
genres_df = movie_df2.explode('genres')

#initiate graph 
fig, ax = plt.subplots(figsize = (30,10))


#set the x and y parameters
x = genres_df.groupby('genres')['movieId'].count().sort_values(ascending = False).index
height = genres_df.groupby('genres')['movieId'].count().sort_values(ascending = False)

#set axes labels
ax.set_xlabel('Movie Categories',size = 20)
ax.set_ylabel('Number of Movies',size = 20)
ax.set_title('Categories by Popularity',size = 25)

#display y-label and ticks on both left and right side
plt.tick_params(labelright = True)
ax.yaxis.set_ticks_position('both')

#set the tick sizes for graph
plt.yticks(size = 11)
plt.xticks(rotation = 0, size = 11)

#plot the bar graph of categories vs movies
ax.bar(x,height);

The entire dataset contains around 100,000 ratings across 9,742 movies, and you can see that the top few movie genres make up the majority of the data. This could pose a problem later on if we were to make recommendations based on genre.

## Data Preparation

For our analysis, we dropped the timestamp information and used only the movies.csv and ratings.csv files. Given the nature of the filtering models we created, much of the remaining data was not necessary to address the business problem.

***

In [None]:
movies.info()

In [None]:
movies.head()

In [None]:
ratings.info()

In [None]:
ratings.head()

In [None]:
#join ratings with movie titles and genres on movieId
#all columns are kept because later cells rely on userId, rating and genres
df_joined = pd.merge(ratings, movies)
df_joined.head()

## Data Modeling


## SVD with Surprise

We decided to use the Surprise package for our first model because of its built-in tools and ease of use.
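
To make the idea behind SVD-style matrix factorization more concrete, the sketch below fits a tiny biased factor model with stochastic gradient descent in plain Python. It is only an illustration under our own assumptions: the toy ratings, the `train_mf` and `training_rmse` helpers, and the hyperparameters are hypothetical, and this is not Surprise's implementation.

```python
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i with SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    b_u = {u: 0.0 for u in users}                      # user biases
    b_i = {i: 0.0 for i in items}                      # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return mu + b_u[u] + b_i[i] + dot

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # gradient steps with L2 regularization
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict

def training_rmse(predict, ratings):
    return (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# hypothetical (userId, movieId, rating) triples
toy_ratings = [(1, 10, 5.0), (1, 20, 3.0), (2, 10, 4.0),
               (2, 30, 1.0), (3, 20, 2.0), (3, 30, 1.5)]
global_mean = sum(r for _, _, r in toy_ratings) / len(toy_ratings)
baseline = training_rmse(lambda u, i: global_mean, toy_ratings)
model = train_mf(toy_ratings)
fitted = training_rmse(model, toy_ratings)
print(f"global-mean RMSE: {baseline:.3f}, factor-model training RMSE: {fitted:.3f}")
```

The `n_factors` and `reg` arguments play the same role as the `n_factors` and `reg_all` values searched over in the grid search.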

In [None]:
reader = Reader(rating_scale=(0.5, 5.0))
#Surprise expects the columns in the order user, item, rating
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
dataset = data.build_full_trainset()

In [None]:
#need to add more to the grid search later
params = {'n_factors': [20, 50, 100],
         'reg_all': [0.02, 0.05, 0.1]}
g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1)
g_s_svd.fit(data)

print(g_s_svd.best_score)
print(g_s_svd.best_params)

## K-Nearest Neighbors

We then used a KNN model to predict ratings for our movies. Since a recommendation system built only on KNN predictions would simply return the movies with the highest predicted scores, we decided to look further into collaborative filtering as a modeling technique in order to make our users' movie recommendations more personal.
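
As a rough illustration of how a user-based KNN prediction works, the sketch below computes Pearson similarity between users and predicts a rating as a mean-centred weighted average over the nearest neighbours. This is a simplified variant written for this notebook, not Surprise's `KNNBaseline` (which centres on baseline estimates rather than user means); the toy ratings and the `pearson` and `predict` helpers are hypothetical.

```python
from math import sqrt

toy = {  # hypothetical ratings: user -> {movie: rating}
    "ann": {"A": 5.0, "B": 4.0, "C": 1.0},
    "bob": {"A": 4.5, "B": 4.0, "C": 1.5, "D": 5.0},
    "cat": {"A": 1.0, "B": 2.0, "C": 5.0, "D": 1.0},
}

def pearson(r_u, r_v):
    """Pearson correlation over the movies both users rated (0.0 if undefined)."""
    common = sorted(set(r_u) & set(r_v))
    if len(common) < 2:
        return 0.0
    mu_u = sum(r_u[m] for m in common) / len(common)
    mu_v = sum(r_v[m] for m in common) / len(common)
    num = sum((r_u[m] - mu_u) * (r_v[m] - mu_v) for m in common)
    den = (sqrt(sum((r_u[m] - mu_u) ** 2 for m in common))
           * sqrt(sum((r_v[m] - mu_v) ** 2 for m in common)))
    return num / den if den else 0.0

def predict(ratings, user, movie, k=2):
    """Mean-centred weighted average over the k most similar users who rated `movie`."""
    mean_u = sum(ratings[user].values()) / len(ratings[user])
    neighbours = sorted(
        ((pearson(ratings[user], r_v), v)
         for v, r_v in ratings.items() if v != user and movie in r_v),
        reverse=True,
    )[:k]
    # only neighbours with positive similarity contribute
    neighbours = [(s, v) for s, v in neighbours if s > 0]
    if not neighbours:
        return mean_u
    num = sum(s * (ratings[v][movie] - sum(ratings[v].values()) / len(ratings[v]))
              for s, v in neighbours)
    return mean_u + num / sum(s for s, _ in neighbours)

pred_d = predict(toy, "ann", "D")
print(round(pred_d, 2))
```

In this toy example, "ann" agrees closely with "bob" and disagrees with "cat", so the prediction for movie "D" follows "bob" and lands above "ann"'s own average rating.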

In [None]:
knn_baseline = KNNBaseline(sim_options={'name':'pearson', 'user_based':True})
cv_knn_baseline = cross_validate(knn_baseline,data)

In [None]:
for i in cv_knn_baseline.items():
    print(i)

np.mean(cv_knn_baseline['test_rmse'])

In [None]:
svd = SVD(n_factors= 50, reg_all=0.05)
svd.fit(dataset)
svd.predict(2, 4)

In [None]:
#created a function that uses your ratings/input to forecast and recommend other movies
def movie_rater(movie_df,num, genre=None):
    userID = 1000
    rating_list = []
    while num > 0:
        if genre:
            movie = movie_df[movie_df['genres'].str.contains(genre)].sample(1)
        else:
            movie = movie_df.sample(1)
        print(movie)
        rating = input('How do you rate this movie on a scale of 1-5, press n if you have not seen :\n')
        if rating == 'n':
            continue
        else:
            #input() returns a string, so convert the rating to a number before storing it
            rating_one_movie = {'userId':userID,'movieId':movie['movieId'].values[0],'rating':float(rating)}
            rating_list.append(rating_one_movie) 
            num -= 1
    return rating_list      

In [None]:
user_rating = movie_rater(df_joined, 3, 'Action')

In [None]:
#obtain the user inputed ratings
user_rating

In [None]:
df_joined

In [None]:
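#merging the new inputted data onto the main dataframe
df_new = df_joined.drop(['genres','title','timestamp'], axis=1)
#DataFrame.append was removed in pandas 2.0, so we concatenate instead
new_ratings = pd.concat([df_new, pd.DataFrame(user_rating)], ignore_index=True)
new_ratings['rating'] = new_ratings['rating'].astype(float)

#convert this into a Surprise dataset
new_data = Dataset.load_from_df(new_ratings,reader)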

In [None]:
#using Surprise's SVD with grid searched params
svd_ = SVD(n_factors= 50, reg_all=0.05)
svd_.fit(new_data.build_full_trainset())

In [None]:
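#evaluate the model with 5-fold cross-validation

data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

#cross_validate refits the algorithm on every fold, so we pass a fresh SVD
#instance to leave the fitted svd and svd_ models untouched
cross_validate(SVD(n_factors=50, reg_all=0.05), data, measures=['RMSE', 'MAE'], cv=5, verbose=True)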

In [None]:
#appending everything into a list
#here we are specifically predicting how user (user 32) would rate every single movie in the dataset
list_of_movies = []
for m_id in df_new['movieId'].unique():
    list_of_movies.append( (m_id,svd_.predict(32,m_id)[3]))

In [None]:
# as you can see, this list covers every movie in the ratings data
len(list_of_movies)

In [None]:
#returns the first 5 movies in the list
list_of_movies[:5]

In [None]:
#sort the list from highest predicted scores
ranked_movies = sorted(list_of_movies, key=lambda x:x[1], reverse=True)

In [None]:
ranked_movies[:5]

In [None]:
#function to turn our recommendations into movie titles
def recommended_movies(user_ratings,movie_title_df,n):
    #user_ratings is a list of (movieId, predicted rating) tuples sorted best first
    for idx, rec in enumerate(user_ratings[:n]):
        #take the title string itself rather than printing the whole pandas Series
        title = movie_title_df.loc[movie_title_df['movieId'] == int(rec[0]), 'title'].values[0]
        print('Recommendation # ', idx+1, ': ', title, '\n')

recommended_movies(ranked_movies,movies,4)

## Content-Based Filtering

![image.png](attachment:image.png)
Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback. Here we chose to measure similarity with **Cosine Similarity**, which is the cosine of the angle between two non-zero vectors. For non-negative vectors such as TF-IDF weights, the score is on a scale of 0-1. A score of 0 means that the two vectors are orthogonal and have nothing in common, while a score of 1 means that they point in the same direction.

>Below we use a TF-IDF vectorizer and a linear kernel to compute the pairwise similarities.
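
The reason a linear kernel can stand in for cosine similarity is that the dot product of two unit-length vectors is exactly their cosine. The sketch below shows this on binary genre vectors in plain Python; it omits the IDF weighting for brevity, and the movie titles and helper names are hypothetical.

```python
from math import sqrt

genres = {  # hypothetical titles with MovieLens-style pipe-separated genres
    "Movie A": "Action|Adventure|Sci-Fi",
    "Movie B": "Action|Sci-Fi",
    "Movie C": "Comedy|Romance",
}
vocab = sorted({g for s in genres.values() for g in s.split("|")})

def unit_vector(genre_string):
    """Binary genre indicator vector scaled to unit L2 length."""
    present = set(genre_string.split("|"))
    raw = [1.0 if g in present else 0.0 for g in vocab]
    norm = sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

def dot(a, b):
    """Linear kernel: a plain dot product."""
    return sum(x * y for x, y in zip(a, b))

vectors = {title: unit_vector(s) for title, s in genres.items()}
sim_ab = dot(vectors["Movie A"], vectors["Movie B"])  # two shared genres
sim_ac = dot(vectors["Movie A"], vectors["Movie C"])  # no shared genres
print(round(sim_ab, 3), round(sim_ac, 3))
```

Movies that share genres score close to 1 and movies with no genre in common score 0, which is the behaviour the recommender relies on.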

In [None]:
#define a TF-IDF Vectorizer Object; the raw string avoids an invalid escape sequence warning for \-
tfidf_movies_genres = TfidfVectorizer(token_pattern = r'[a-zA-Z0-9\-]+')

In [None]:
tfidf_movies_genres_matrix = tfidf_movies_genres.fit_transform(movies['genres'])

#TfidfVectorizer L2-normalizes each row by default, so the linear kernel (dot product) equals the cosine similarity
cosine_score = linear_kernel(tfidf_movies_genres_matrix, tfidf_movies_genres_matrix)

In [None]:
#using TF-IDF Vectorizer Object to calculate recommendations
def get_recommendations_based_on_genres(movie_title, cosine_score=cosine_score):
    """
    Calculates the top movies to recommend based on the genres of the given movie title.
    :param movie_title: title of the movie used as the basis for recommendations
    :param cosine_score: pairwise cosine similarity matrix between movies
    :return: titles of the movies recommended to the user
    """
    #Get the index of the movie that matches the title
    index_mov = movies.loc[movies['title'].isin([movie_title])]
    index_mov = index_mov.index
    
    #Get the pairwise similarity scores of all movies with that movie
    sim_scores_movies = list(enumerate(cosine_score[index_mov][0]))
    
    #Sort the movies based on the similarity scores
    sim_scores_movies = sorted(sim_scores_movies, key=lambda x: x[1], reverse=True)

    #Get the scores of the 10 most similar movies
    sim_scores_movies = sim_scores_movies[0:10]
    
    
    #Get the movie indices
    movie_indices = [i[0] for i in sim_scores_movies if i[0] not in index_mov]
    
    #Return the most similar movies
    return movies['title'].iloc[movie_indices]

In [None]:
#the output can contain repeated titles; we could not refine this yet because of time constraints
get_recommendations_based_on_genres('Big Bully (1996)')

In [None]:
def get_recommendation_content_model(userId):
    """
    This function puts the previous function to use and recommends for specific user 
    :param userId: userid of user
    :return: Titles of movies recommended to user
    """
    df_rating_filtered = ratings[ratings["userId"] == userId]

    #titles of the movies that the user has already watched
    watched_titles = movies.loc[movies["movieId"].isin(df_rating_filtered["movieId"]), "title"].tolist()

    #generates recommendations based on the titles the user has already watched and puts them into a list
    #(Series.iteritems was removed in pandas 2.0, so we iterate over the values directly)
    recommended_movie_list = []
    for title in watched_titles:
        recommended_movie_list.extend(get_recommendations_based_on_genres(title))

    #removing already watched movies from the recommended list
    return set(recommended_movie_list) - set(watched_titles)

In [None]:
get_recommendation_content_model(33)

The content-based recommendation gives us a nice list of movies, but how do we rank these movies and determine which ones to suggest? Before defining a metric, we decided to explore collaborative filtering.

## Collaborative Filtering
### Item-Item Filtering
![image.png](attachment:image.png)

In item-to-item filtering, the model looks for movies similar to the ones the user has already rated and recommends the most similar ones. Here the similarity is calculated from user ratings. Thus, similar movies are rated alike by the same users and are not necessarily similar in genre.
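
The idea can be sketched on a handful of ratings: each movie becomes a vector of ratings with one entry per user, and movies whose vectors point in a similar direction are treated as similar. The toy triples and the helper names below are hypothetical, and the example is only a small stand-in for the full pivot-table approach.

```python
from math import sqrt

ratings = [  # hypothetical (userId, movieId, rating) triples
    (1, 10, 5.0), (1, 20, 4.5), (1, 30, 1.0),
    (2, 10, 4.0), (2, 20, 5.0),
    (3, 30, 4.5), (3, 10, 1.0),
]
users = sorted({u for u, _, _ in ratings})
items = sorted({m for _, m, _ in ratings})

# one row per movie, one column per user; unrated cells are filled with 0
matrix = {m: [0.0] * len(users) for m in items}
for u, m, r in ratings:
    matrix[m][users.index(u)] = r

def cosine(a, b):
    """Cosine similarity between two rating vectors (0.0 if either is all zeros)."""
    num = sum(x * y for x, y in zip(a, b))
    den = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def most_similar(movie_id):
    """Return (similarity, movieId) for the other movie with the highest similarity."""
    return max((cosine(matrix[movie_id], matrix[m]), m) for m in items if m != movie_id)

best_sim, best_movie = most_similar(10)
print(best_movie, round(best_sim, 3))
```

Movies 10 and 20 are rated highly by the same two users, so they come out as most similar even though nothing in the data says they share a genre.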

In [None]:
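#creates the movie-by-user ratings matrix
movies_ratings=pd.merge(movies, ratings)
ratings_matrix_items = movies_ratings.pivot_table(index=['movieId'],columns=['userId'],values='rating')
#reindex so that row i matches row i of the movies dataframe, including movies without any ratings
ratings_matrix_items = ratings_matrix_items.reindex(movies['movieId']).reset_index(drop=True)
ratings_matrix_items.fillna( 0, inplace = True )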

In [None]:
ratings_matrix_items

In [None]:
#pairwise cosine similarities between the movies
#pairwise_distances returns the cosine distance, so similarity = 1 - distance
movie_similarity = 1 - pairwise_distances( ratings_matrix_items.values, metric="cosine")

#filling the diagonal with 0s so that a movie is never returned as its own most similar movie
np.fill_diagonal( movie_similarity, 0 ) 
ratings_matrix_items = pd.DataFrame( movie_similarity )
ratings_matrix_items

In [None]:
#function that stores the item-to-item similarity scores for a given movie:
def item_similarity(movieName): 
    """
    Adds a 'similarity' column to the movies dataframe holding the similarity of every movie to the named movie
    :param movieName: title of the movie
    """
    matches = movies[movies['title']==movieName].index.tolist()
    if not matches:
        print("Sorry, the movie is not in the database!")
        return

    #row of the similarity matrix for this movie, aligned with the index of the movies dataframe
    movies['similarity'] = ratings_matrix_items.iloc[matches[0]]

In [None]:
#single function that returns movie recs based on not liked movies
def dissimilar_movies_rec(movie):
    item_similarity(movie)
    dissim_mov= movies.sort_values( ["similarity"], ascending = True )
    return dissim_mov.iloc[1:50, :4]


In [None]:
dissimilar_movies_rec("Die Hard (1988)")

In [None]:
def recommendedMoviesAsperItemSimilarity(user_id):
    """
     Recommending movies which the user hasn't watched as per item similarity.
     We use the function above to add a similarity column and rank the dataframe by it.

     The threshold here assumes that the user has a movie rated 4 or above, and the algorithm finds movies similar to it.

     If the user doesn't have a movie rated 4 or above, we take a movie rated 3 or below (disliked/neutral) and
     aim to find the movies most dissimilar to it.

    :param user_id: user_id to whom movies need to be recommended
    :return: set of recommended movieIds
    """
    user_rated = movies_ratings[movies_ratings.userId == user_id]
    #use a set of values: `in` on a pandas Series checks the index, not the values
    seen = set(user_rated['movieId'])
    liked = user_rated[user_rated.rating >= 4]

    if not liked.empty:
        #candidates are the movies most similar to a movie the user liked
        item_similarity(liked['title'].iloc[0])
        candidates = movies[movies['similarity'] >= 0.45]['movieId']
        n = 24
    else:
        #recommend based on disliked movies if the selected user does not have a movie rated 4 or above
        print("This user has not rated any movie 4 or above to start the recommendation. Trying to find dissimilar movies...")
        disliked = user_rated[user_rated.rating <= 3]
        item_similarity(disliked['title'].iloc[0])
        candidates = movies[movies['similarity'] < 0.1]['movieId']
        n = 9

    #keep the unseen candidates and return the movies behind their n highest individual ratings
    unseen = [movieId for movieId in candidates if movieId not in seen]
    df_recommended_item = ratings[ratings.movieId.isin(unseen)]
    best = df_recommended_item.sort_values("rating", ascending=False)[:n]
    return set(best['movieId'].tolist())

In [None]:
recommendedMoviesAsperItemSimilarity(54)

In [None]:
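def movieIdToTitle(listMovieIDs):
    """
    Converts a list of movieIds into a list of movie titles
    """
    movie_titles= list()
    for movie_id in listMovieIDs:
        #take the title string itself rather than the whole pandas Series
        movie_titles.append(movies[movies['movieId']==movie_id]['title'].values[0])
    return movie_titles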

In [None]:
movieIdToTitle(recommendedMoviesAsperItemSimilarity(54))

## Hybrid Model

Very often, in the recommendation world, data engineers are faced with the cold start problem. The cold start problem occurs when a new user or item has too few ratings for the model to make accurate predictions. Here we aim to address the cold start problem in the case of a new user. In the hybrid model, a new user is asked to rate a few movies within a genre, and the function then recommends movies based on those ratings.

In [None]:
#implementing multiple recommendation systems together
def hybrid_model(user_id= None, new_user=False):
    
    """user_id: The id of the user that you would like to make recommendations for
    
       new_user: if you are a new user input True for this parameter.
                 Our function will then ask you for a genre and give you movies from that genre for you to rate.
                 This addresses the cold start problem
    """
    
    if new_user:
        genre= input('name a genre of movies that you like: ')
        #pass the genre the user typed in instead of a hard-coded one
        user_rating = movie_rater(df_joined, 4, genre)
        
        #DataFrame.append was removed in pandas 2.0, so we concatenate instead
        new_ratings = pd.concat([df_new, pd.DataFrame(user_rating)], ignore_index=True)
        new_ratings['rating'] = new_ratings['rating'].astype(float)
        new_data = Dataset.load_from_df(new_ratings,reader)
        
        svd_ = SVD(n_factors=200, n_epochs= 40, reg_all= 0.1)
        svd_.fit(new_data.build_full_trainset())
        
        #predict for the userId that movie_rater assigned to the new user
        new_user_id = user_rating[0]['userId']
        list_of_movies = []
        
        for m_id in df_new['movieId'].unique():
            list_of_movies.append( (m_id,svd_.predict(new_user_id,m_id)[3]))
        
        ranked_movies = sorted(list_of_movies, key=lambda x:x[1], reverse=True)
        
        print("\n\n")
        recommended_movies(ranked_movies, movies, 5)
        
        
    else:
        recommended_movies_by_content_model = get_recommendation_content_model(user_id)
        #copy the filtered rows so that adding the svd_rating column does not write to a view of movies
        recommended_movies_by_content_model = movies[movies["title"].isin(recommended_movies_by_content_model)].copy()
    
    
        for key, col in recommended_movies_by_content_model.iterrows():
            predict = svd.predict(user_id, col["movieId"])
            recommended_movies_by_content_model.loc[key, "svd_rating"] = predict.est
    
    
        return recommended_movies_by_content_model.sort_values("svd_rating", ascending=False).iloc[0:11]

In [None]:
hybrid_model(33)

## User-Item filtering

The assumption of the collaborative filtering approach is that if two users have similar opinions on one piece of content, then one of them is more likely to share the other's opinion on an unseen piece of content. In an implementation similar to item-item similarity, we created a matrix but kept users as rows and movieIds as columns. Then, we used the resulting vectors to calculate the pairwise similarity between users.

In [None]:
movies.head()

In [None]:
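ratings_matrix_users = movies_ratings.pivot_table(index=['userId'],columns=['movieId'],values='rating').reset_index(drop=True)
ratings_matrix_users.fillna( 0, inplace = True )
#pairwise_distances returns the cosine distance, so similarity = 1 - distance
user_similarity = 1 - pairwise_distances( ratings_matrix_users.values, metric="cosine" )
#a user should never be returned as their own most similar user
np.fill_diagonal( user_similarity, 0 ) 
#row and column i hold the user with userId i + 1 because the index was reset
ratings_matrix_users = pd.DataFrame( user_similarity )
ratings_matrix_users.head()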

Now that we have a user-to-user similarity matrix, we look for the maximum value in each row to find the most similar user.

In [None]:
#returns the most similar user 
similar_user_series= ratings_matrix_users.idxmax(axis=1)
df_similar_user= similar_user_series.to_frame()
df_similar_user.columns=['similarUser']
df_similar_user.head()

In [None]:
movies.head()

In [None]:
df_similar_user.iloc[0]

Similar to our previous models, this model takes in the userId of the user that we are recommending movies to. On the basis of the most similar user, we filter for unseen movies that are highly rated by that "lookalike" and recommend them to the user.
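
The filtering step can be sketched independently of pandas: given a user and an already chosen lookalike, keep the lookalike's highly rated movies that the user has not rated yet. The `recommend_from_lookalike` helper, the `min_rating` cut-off of 4.0 and the toy ratings below are assumptions made for illustration only.

```python
def recommend_from_lookalike(user_ratings, user, lookalike, n=10, min_rating=4.0):
    """Movies the lookalike rated at least `min_rating` that `user` has not rated, best first."""
    seen = set(user_ratings[user])
    candidates = [
        (rating, movie_id)
        for movie_id, rating in user_ratings[lookalike].items()
        if movie_id not in seen and rating >= min_rating
    ]
    # sort by rating, highest first, and keep the top n movie ids
    return [movie_id for rating, movie_id in sorted(candidates, reverse=True)[:n]]

# hypothetical ratings: userId -> {movieId: rating}
user_ratings = {
    1: {10: 5.0, 20: 3.0},
    2: {10: 4.5, 30: 5.0, 40: 4.0, 50: 2.0},
}
recs = recommend_from_lookalike(user_ratings, user=1, lookalike=2)
print(recs)
```

Movie 10 is dropped because user 1 has already rated it, and movie 50 is dropped because the lookalike rated it poorly.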

In [None]:
def getRecommendedMoviesAsperUserSimilarity(userId):
    """
     Recommending movies which the user hasn't watched as per user similarity
    :param userId: userId to whom movies need to be recommended
    :return: movieIds of the recommended movies
    """
    #all the movies the user has rated, as a set of values (`in` on a Series checks the index, not the values)
    user2Movies= set(ratings[ratings['userId']== userId]['movieId'])
    #looks up the most similar user; the similarity matrix is 0-indexed while userIds start at 1
    sim_user=df_similar_user.iloc[userId - 1,0] + 1

    #keeps the similar user's ratings for movies the user has not watched yet
    df_recommended= movies_ratings[(movies_ratings.userId==sim_user) & (~movies_ratings.movieId.isin(user2Movies))]
    best10=df_recommended.sort_values(['rating'], ascending = False )[:10]
    return best10['movieId']

In [None]:
getRecommendedMoviesAsperUserSimilarity(234)

In [None]:
user_id=243
recommend_movies= movieIdToTitle(getRecommendedMoviesAsperUserSimilarity(user_id))
print("Movies you should watch are:\n")
print(recommend_movies)

### Evaluation of Model

Challenges with user similarity

User similarity can only be calculated for users who have already rated some movies, so this recommendation technique does not work for new users. The system needs to wait until the user has watched and rated some movies, i.e. the cold start problem.

In [None]:
def get_user_similar_movies( user1, user2 ):  
    """
     Returns the movies that both users have rated, with each user's rating
    :param user1,user2: user ids of the two users to compare
    :return: dataframe of common movies with the ratings of both users
    """
    common_movies = movies_ratings[movies_ratings.userId == user1].merge(
      movies_ratings[movies_ratings.userId == user2],
      on = "movieId",
      how = "inner" )
    common_movies.drop(['movieId','genres_x','genres_y', 'timestamp_x','timestamp_y','title_y'],axis=1,inplace=True)
    return common_movies

In [None]:
get_user_similar_movies(587,511)

## SVD++

In [None]:
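#reuse the MovieLens ratings loaded above so that the results are comparable with the other models
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)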
trainset, testset = train_test_split(data, test_size=.2)

In [None]:
algo_svdpp = SVDpp(n_factors=150, n_epochs=8, lr_all=0.005, reg_all=0.1)
algo_svdpp.fit(trainset)
test_pred = algo_svdpp.test(testset)
print("SVDpp : Test Set")
accuracy.rmse(test_pred, verbose=True)

In [None]:
#selecting several columns after groupby requires a list, i.e. double brackets
df_2 = df_joined.groupby('title')[["userId","rating"]].count()[lambda x: x> 1].sort_values(by = "rating", ascending = False)

In [None]:
df_2 = df_joined.groupby(['title','rating']).count()["userId"][lambda x: x> 1].sort_values(ascending = False)

In [None]:
df_2

In [None]:
df_3=df_joined.drop('timestamp', axis=1)

In [None]:
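#aggregate only the numeric columns; the mean is not defined for the title and genres strings
df_3= df_3.groupby('movieId')[['userId', 'rating']].agg(['size', 'mean'])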

In [None]:
a= df_3['userId']['size'] ==1

In [None]:
a.value_counts()

## Evaluation

Evaluating a recommendation system can be tricky because there are many different metrics to look at. Common offline metrics for evaluating models include RMSE, MAE, recall and accuracy. However, a high error for a model does not necessarily correspond to a poor recommendation system. For this reason, we would also recommend evaluating a recommendation system with "online" methods, such as A/B testing different versions of it.
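
For reference, RMSE and MAE can be computed by hand as in the sketch below. The `rmse` and `mae` helpers and the example ratings are hypothetical; in the notebook itself these numbers come from Surprise's `cross_validate`.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in rating points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# hypothetical true and predicted ratings for four user-movie pairs
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(rmse_value, mae_value)
```

Both metrics are in the same units as the ratings, so an MAE of 0.625 means the predictions are off by a little over half a star on average.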

In [None]:
print('Knn')
for i in cv_knn_baseline.items():
    print(i)
#mean RMSE of the KNN model across the folds
print(np.mean(cv_knn_baseline['test_rmse']))

In [None]:
print('svd')
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

As you can see, the RMSE is around 0.87 and the MAE is around 0.67 for our KNN and SVD models, which is pretty decent. We chose to look closely at RMSE since it tells us how far off our predicted ratings are from the ratings that users actually gave. This score does not fully reflect the performance of our recommendation system, because it measures how well we predict ratings rather than how good the resulting recommendations are. We would still like to refine our models and reduce the error, because our hybrid model relies on collaborative filtering predictions.

## Coverage

Another metric we decided to look at is coverage: what percentage of movies does our system recommend from the total pool of movies available.
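
As a minimal sketch, coverage is the number of distinct movies that appear in at least one recommendation list divided by the size of the catalog. The `catalog_coverage` helper and the toy catalog below are hypothetical and only illustrate the calculation.

```python
def catalog_coverage(recommendation_lists, catalog):
    """Share of the catalog that appears in at least one recommendation list."""
    catalog = set(catalog)
    recommended = set()
    for recs in recommendation_lists:
        recommended.update(recs)
    # ignore anything recommended that is not part of the catalog
    return len(recommended & catalog) / len(catalog)

# hypothetical catalog of five movies and the lists shown to three users
catalog = ["A", "B", "C", "D", "E"]
recommendation_lists = [["A", "B"], ["B", "C"], ["A"]]

coverage = catalog_coverage(recommendation_lists, catalog)
print(coverage)
```

Here three of the five movies are ever recommended, so movies "D" and "E" would never reach a user no matter who asks.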

In [None]:
movie_names= list(movies.title.unique())

sugg=[]
for movie in movie_names:
    sugg.append(get_recommendations_based_on_genres(movie))
    
flat_list = [item for sublist in sugg for item in sublist]
flat_list = set(flat_list)
len(flat_list)/len(movie_names)

In [None]:
sugg=[]
for user in range(610):
    sugg.append(getRecommendedMoviesAsperUserSimilarity(user))
    
flat_list = [item for sublist in sugg for item in sublist]
flat_list = set(flat_list)
len(flat_list)/len(movie_names)

>Our user-to-user model covers 4.5% of the catalog and our content-based (genre) model covers 32%. Due to the skew of the genres and because our content-based model uses genre to make recommendations, many genres do not get selected. We will aim to improve upon this by allowing users more options in selecting the movies they want and by adjusting the interface to allow the user to further filter the output. For example, we could recommend only movies that are more recent or that fall within a specific timeframe. Another example would be allowing the user to set a minimum number of reviews before a movie is recommended. We recognize that additional work will be needed to improve the coverage metric.

## Conclusions

These algorithms provided a baseline knowledge of how a sophisticated combined filter can be accomplished. The next steps for this project would be to combine the models together in a web implementation that has a more sophisticated user interface.