# Movies Recommendation System

In [40]:
from math import sqrt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics.pairwise import pairwise_distances
from scipy.sparse.linalg import svds 

In [41]:
movies_df=pd.read_csv("./src/datasets/movies2.csv") #https://grouplens.org/datasets/movielens/

In [42]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [43]:
ratings_df=pd.read_csv("./src/datasets/ratings2.csv")

In [44]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


## Content based Filtering:

The Content-Based Recommender relies on the similarity of the items being recommended. The basic idea is that if you like an item, then you will also like a “similar” item. It generally works well when it’s easy to determine the context/properties of each item.

A content based recommender works with data that the user provides, either explicitly movie ratings for the MovieLens dataset. Based on that data, a user profile is generated, which is then used to make suggestions to the user. As the user provides more inputs or takes actions on the recommendations, the engine becomes more and more accurate.

In [45]:
#TfidfVectorizer function transforms text to feature vectors that can be used as input to estimator

# Define a TF-IDF Vectorizer Object.
tfidf_movies_genres = TfidfVectorizer(token_pattern = '[a-zA-Z0-9\-]+') 

#Replace NaN with an empty string
movies_df['genres'] = movies_df['genres'].replace(to_replace="(no genres listed)", value="")

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_movies_genres_matrix = tfidf_movies_genres.fit_transform(movies_df['genres'])
cosine_sim_movies = linear_kernel(tfidf_movies_genres_matrix, tfidf_movies_genres_matrix)#Cosine Similarity to calculate a numeric quantity that denotes the similarity between two movies.
print(cosine_sim_movies)

[[1.         0.81357774 0.15276924 ... 0.         0.4210373  0.26758648]
 [0.81357774 1.         0.         ... 0.         0.         0.        ]
 [0.15276924 0.         1.         ... 0.         0.         0.57091541]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.4210373  0.         0.         ... 0.         1.         0.        ]
 [0.26758648 0.         0.57091541 ... 0.         0.         1.        ]]


**Content-Based Recommendation Engine that computes similarity between movies based on movie genres. It will suggest movies that are most similar to a particular movie based on its genre:**

 - No need for data on other users, thus no cold-start or sparsity problems
 - Can recommend to users with unique tastes
 - Can recommend new & unpopular items
 - Can provide explanations for recommended items by listing content-features that caused an item to be recommended (in this case, movie genres)
 - Does not recommend items outside a user’s content profile

In [46]:
def get_similar_movies(movie_title, cosine_sim_movies=cosine_sim_movies):
    # Get the index of the movie that matches the title
    idx_movie = movies_df.loc[movies_df['title'].isin([movie_title])]
    idx_movie = idx_movie.index
    # Get the pairwsie similarity scores of all movies with that movie
    sim_scores_movies = list(enumerate(cosine_sim_movies[idx_movie][0]))
    # Sort the movies based on the similarity scores
    sim_scores_movies = sorted(sim_scores_movies, key=lambda x: x[1], reverse=True)
    # Get the scores of the 20 most similar movies
    sim_scores_movies = sim_scores_movies[1:21]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores_movies]
    # Return the top 2 most similar movies
    return movies_df['title'].iloc[movie_indices]

In [47]:
get_similar_movies("Toy Story (1995)")

1706                                          Antz (1998)
2355                                   Toy Story 2 (1999)
2809       Adventures of Rocky and Bullwinkle, The (2000)
3000                     Emperor's New Groove, The (2000)
3568                                Monsters, Inc. (2001)
6194                                     Wild, The (2006)
6486                               Shrek the Third (2007)
6948                       Tale of Despereaux, The (2008)
7760    Asterix and the Vikings (Astérix et les Viking...
8219                                         Turbo (2013)
8927                             The Good Dinosaur (2015)
9430                                         Moana (2016)
8900                                    Inside Out (2015)
1505                           Black Cauldron, The (1985)
1577                        Lord of the Rings, The (1978)
2539                We're Back! A Dinosaur's Story (1993)
3230                     Atlantis: The Lost Empire (2001)
3336          

**Find top movies to be recommended to user based on movie that the user has watched:**

In [48]:
def get_user_similarmovie(userId):
    recommended_movie_list = []
    movie_list = []
    df_rating_filtered = ratings_df[ratings_df["userId"]== userId]
    for key, row in df_rating_filtered.iterrows():
        movie_list.append((movies_df["title"][row["movieId"]==movies_df["movieId"]]).values) 
    for index, movie in enumerate(movie_list):
        for key, movie_recommended in get_similar_movies(movie[0]).iteritems():
            recommended_movie_list.append(movie_recommended)

    # removing already watched movie from recommended list    
    for movie_title in recommended_movie_list:
        if movie_title in movie_list:
            recommended_movie_list.remove(movie_title)
    
    return set(recommended_movie_list)

In [49]:
get_user_similarmovie(2)

{'39 Steps, The (1935)',
 '8MM (1999)',
 'A Man Called Blade (1977)',
 'Above the Rim (1994)',
 'Abyss, The (1989)',
 'Ace Ventura: When Nature Calls (1995)',
 'Air America (1990)',
 'Alexander Nevsky (Aleksandr Nevskiy) (1938)',
 'All Quiet on the Western Front (1930)',
 'Alone in the Dark II (2008)',
 'Amateur (1994)',
 'American Buffalo (1996)',
 'American Outlaws (2001)',
 'Angels and Insects (1995)',
 'Animals are Beautiful People (1974)',
 'Anne Frank Remembered (1995)',
 'Apocalypse Now (1979)',
 'Aristocrats, The (2005)',
 'Assassins (1995)',
 'Avatar (2009)',
 'Avengers, The (2012)',
 'Awfully Big Adventure, An (1995)',
 'Babysitter, The (1995)',
 'Bait (2000)',
 'Basketball Diaries, The (1995)',
 'Batman (1989)',
 'Batman Begins (2005)',
 'Battleship (2012)',
 'Beauty of the Day (Belle de jour) (1967)',
 'Bed of Roses (1996)',
 'Before Sunrise (1995)',
 'Behind Enemy Lines (2001)',
 'Ben-Hur (1959)',
 'Beverly Hills Ninja (1997)',
 'Bill Cosby, Himself (1983)',
 'Billy Madiso

## Collaborative Filtering

The Collaborative Filtering Recommender is entirely based on the past behavior and not on the context. More specifically, it is based on the similarity in preferences, tastes and choices of two users. It analyses how similar the tastes of one user is to another and makes recommendations on the basis of that.

The algorithm has a very interesting property of being able to do feature learning on its own, which means that it can start to learn for itself what features to use.

### Item-Item Collaborative Filtering:

ITEM-ITEM collaborative filtering look for items that are similar to the articles that user has already rated and recommend most similar articles.

Instead of finding user’s look-alike, we try finding movie’s look-alike. Once we have movie’s look-alike matrix, we can easily recommend alike movies to user who have rated any movie from the dataset. This algorithm is far less resource consuming than user-user collaborative filtering. Hence, for a new user, the algorithm takes far lesser time than user-user collaborate as we don’t need all similarity scores between users. And with fixed number of movies, movie-movie look alike matrix is fixed over time

In [50]:
movies_ratings=pd.merge(movies_df, ratings_df)
movies_ratings.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483


In [51]:
ratings_matrix_items = movies_ratings.pivot_table(index=['movieId'],columns=['userId'],values='rating').reset_index(drop=True)
ratings_matrix_items.fillna( 0, inplace = True )
ratings_matrix_items.shape

(9724, 610)

In [52]:
ratings_matrix_items.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
0,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
1,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0


Here Pivot table function is used as we want one to one maping between movies, user and their rating. So by default pivot_table command takes average if we have multiple values of one combination.

In [53]:
movie_similarity = 1 - pairwise_distances(ratings_matrix_items.to_numpy(), metric="cosine" )
np.fill_diagonal( movie_similarity, 0 ) #Filling diagonals with 0s for future use when sorting is done
ratings_matrix_items = pd.DataFrame( movie_similarity )
ratings_matrix_items.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9714,9715,9716,9717,9718,9719,9720,9721,9722,9723
0,0.0,0.410562,0.296917,0.035573,0.308762,0.376316,0.277491,0.131629,0.232586,0.395573,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.410562,0.0,0.282438,0.106415,0.287795,0.297009,0.228576,0.172498,0.044835,0.417693,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.296917,0.282438,0.0,0.092406,0.417802,0.284257,0.402831,0.313434,0.30484,0.242954,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.035573,0.106415,0.092406,0.0,0.188376,0.089685,0.275035,0.158022,0.0,0.095598,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.308762,0.287795,0.417802,0.188376,0.0,0.298969,0.474002,0.283523,0.335058,0.218061,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Find similar movies:**

In [54]:
def item_similarity(movieName): #Find similar movies
    try:
        user_inp=movieName
        inp=movies_df[movies_df['title']==user_inp].index.tolist()
        inp=inp[0]

        movies_df['similarity'] = ratings_matrix_items.iloc[inp]
        movies_df.columns = ['movie_id', 'title', 'release_date','similarity']
    except:
        print("Sorry, the movie is not in the database!")

In [55]:
def recommendedMoviesAsperItemSimilarity(user_id):#Recommending movie which user hasn't watched as per Item Similarity
    user_movie= movies_ratings[(movies_ratings.userId==user_id) & movies_ratings.rating.isin([5,4.5])][['title']]
    user_movie=user_movie.iloc[0,0]
    item_similarity(user_movie)
    sorted_movies_as_per_userChoice=movies_df.sort_values( ["similarity"], ascending = False )
    sorted_movies_as_per_userChoice=sorted_movies_as_per_userChoice[sorted_movies_as_per_userChoice['similarity'] >=0.45]['movie_id']
    recommended_movies=list()
    df_recommended_item=pd.DataFrame()
    user2Movies= ratings_df[ratings_df['userId']== user_id]['movieId']
    for movieId in sorted_movies_as_per_userChoice:
            if movieId not in user2Movies:
                df_new= ratings_df[(ratings_df.movieId==movieId)]
                df_recommended_item=pd.concat([df_recommended_item,df_new])
            top_movies=df_recommended_item.sort_values(["rating"], ascending = False)[1:21] 
    return top_movies['movieId']

In [56]:
def movieIdToTitle(listMovieIDs):
    movie_titles= list()
    for id in listMovieIDs:
        movie_titles.append(movies_df[movies_df['movie_id']==id]['title'])
    return movie_titles

In [57]:
print("Recommended movies,:\n",movieIdToTitle(recommendedMoviesAsperItemSimilarity(50)))

Recommended movies,:
 [659    Godfather, The (1972)
Name: title, dtype: object, 922    Godfather: Part II, The (1974)
Name: title, dtype: object, 510    Silence of the Lambs, The (1991)
Name: title, dtype: object, 922    Godfather: Part II, The (1974)
Name: title, dtype: object, 922    Godfather: Part II, The (1974)
Name: title, dtype: object, 659    Godfather, The (1972)
Name: title, dtype: object, 659    Godfather, The (1972)
Name: title, dtype: object, 659    Godfather, The (1972)
Name: title, dtype: object, 1644    Willow (1988)
Name: title, dtype: object, 922    Godfather: Part II, The (1974)
Name: title, dtype: object, 922    Godfather: Part II, The (1974)
Name: title, dtype: object, 659    Godfather, The (1972)
Name: title, dtype: object, 659    Godfather, The (1972)
Name: title, dtype: object, 659    Godfather, The (1972)
Name: title, dtype: object, 659    Godfather, The (1972)
Name: title, dtype: object, 510    Silence of the Lambs, The (1991)
Name: title, dtype: object, 922  

### User-Item Filtering:

Here we find look alike users based on similarity and recommend movies which first user’s look-alike has chosen in past.

In [58]:
ratings_matrix_users = movies_ratings.pivot_table(index=['userId'],columns=['movieId'],values='rating').reset_index(drop=True)
ratings_matrix_users.fillna( 0, inplace = True )
movie_similarity = 1 - pairwise_distances( ratings_matrix_users.to_numpy(), metric="cosine" )
np.fill_diagonal( movie_similarity, 0 ) #Filling diagonals with 0s for future use when sorting is done
ratings_matrix_users = pd.DataFrame( movie_similarity )
ratings_matrix_users.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,600,601,602,603,604,605,606,607,608,609
0,0.0,0.027283,0.05972,0.194395,0.12908,0.128152,0.158744,0.136968,0.064263,0.016875,...,0.080554,0.164455,0.221486,0.070669,0.153625,0.164191,0.269389,0.291097,0.093572,0.145321
1,0.027283,0.0,0.0,0.003726,0.016614,0.025333,0.027585,0.027257,0.0,0.067445,...,0.202671,0.016866,0.011997,0.0,0.0,0.028429,0.012948,0.046211,0.027565,0.102427
2,0.05972,0.0,0.0,0.002251,0.00502,0.003936,0.0,0.004941,0.0,0.0,...,0.005048,0.004892,0.024992,0.0,0.010694,0.012993,0.019247,0.021128,0.0,0.032119
3,0.194395,0.003726,0.002251,0.0,0.128659,0.088491,0.11512,0.062969,0.011361,0.031163,...,0.085938,0.128273,0.307973,0.052985,0.084584,0.200395,0.131746,0.149858,0.032198,0.107683
4,0.12908,0.016614,0.00502,0.128659,0.0,0.300349,0.108342,0.429075,0.0,0.030611,...,0.068048,0.418747,0.110148,0.258773,0.148758,0.106435,0.152866,0.135535,0.261232,0.060792


In [59]:
ratings_matrix_users.idxmax(axis=1)

0      265
1      365
2      312
3      390
4      469
      ... 
605    473
606    569
607    479
608    339
609    248
Length: 610, dtype: int64

In [60]:
ratings_matrix_users.idxmax(axis=1).sample( 10, random_state = 10 )

547     76
241    467
277    337
348    454
218    238
407    278
352     45
97     600
381     20
607    479
dtype: int64

In [61]:
similar_user_series= ratings_matrix_users.idxmax(axis=1)
df_similar_user= similar_user_series.to_frame()
df_similar_user.head

<bound method NDFrame.head of        0
0    265
1    365
2    312
3    390
4    469
..   ...
605  473
606  569
607  479
608  339
609  248

[610 rows x 1 columns]>

In [62]:
movieId_recommended=list()
def getRecommendedMoviesAsperUserSimilarity(userId):#Recommending movies which user hasn't watched as per User Similarity
    user2Movies= ratings_df[ratings_df['userId']== userId]['movieId']
    sim_user=df_similar_user.iloc[0,0]
    df_recommended=pd.DataFrame(columns=['movieId','title','genres','userId','rating','timestamp'])
    for movieId in ratings_df[ratings_df['userId']== sim_user]['movieId']:
        if movieId not in user2Movies:
            df_new= movies_ratings[(movies_ratings.userId==sim_user) & (movies_ratings.movieId==movieId)]
            df_recommended=pd.concat([df_recommended,df_new])
        top_movies=df_recommended.sort_values(['rating'], ascending = False )[1:21]  
    return top_movies['movieId']

In [63]:
recommend_movies= movieIdToTitle(getRecommendedMoviesAsperUserSimilarity(50))
print("Movies you should watch are:\n")
print(recommend_movies)

Movies you should watch are:

[1431    Rocky (1976)
Name: title, dtype: object, 742    African Queen, The (1951)
Name: title, dtype: object, 733    It's a Wonderful Life (1946)
Name: title, dtype: object, 939    Terminator, The (1984)
Name: title, dtype: object, 969    Back to the Future (1985)
Name: title, dtype: object, 510    Silence of the Lambs, The (1991)
Name: title, dtype: object, 1057    Star Trek II: The Wrath of Khan (1982)
Name: title, dtype: object, 1059    Star Trek IV: The Voyage Home (1986)
Name: title, dtype: object, 1939    Matrix, The (1999)
Name: title, dtype: object, 275    Stargate (1994)
Name: title, dtype: object, 898    Star Wars: Episode V - The Empire Strikes Back...
Name: title, dtype: object, 224    Star Wars: Episode IV - A New Hope (1977)
Name: title, dtype: object, 2836    X-Men (2000)
Name: title, dtype: object, 1422    On the Waterfront (1954)
Name: title, dtype: object, 958    Stand by Me (1986)
Name: title, dtype: object, 1832    Civil Action, A (199

# Evaluating the model

In [27]:
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)
from surprise import SVDpp
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV
from surprise.model_selection import cross_validate
import os
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
from surprise import accuracy
from surprise.model_selection import KFold

In [27]:
df_movies_ratings=pd.merge(movies_df, ratings_df)
df_movies_ratings.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483


In [28]:
def get_user_similar_movies( user1, user2 ):#Returning common movies and ratings of same for both the users
    common_movies = movies_ratings[movies_ratings.userId == user1].merge(
      movies_ratings[movies_ratings.userId == user2],
      on = "movieId",
      how = "inner" )
    common_movies.drop(['movieId','genres_x','genres_y', 'timestamp_x','timestamp_y','title_y'],axis=1,inplace=True)
    return common_movies

**Try with users 587 and 511:**

In [29]:
get_user_similar_movies(587,511)

Unnamed: 0,title_x,userId_x,rating_x,userId_y,rating_y
0,Forrest Gump (1994),587,4.0,511,4.5
1,Life Is Beautiful (La Vita è bella) (1997),587,5.0,511,4.5
2,"Matrix, The (1999)",587,4.0,511,5.0


# Singular Value Decomposition(SVD):

### Matrix Factorization:

 Matrix factorization is widely used for recommender systems where it can deal better with scalability and sparsity than Memory-based CF  

In [31]:
number_users = ratings_df.userId.unique().shape[0]
nuumber_movies = ratings_df.movieId.unique().shape[0]
print('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_movies))

Number of users = 7120 | Number of movies = 14026


**Ratings matrix to be one row per user and one column per movie:**

In [32]:
ratings_pivot = ratings_df.pivot(index = 'userId', columns ='movieId', values = 'rating').fillna(0)
ratings_pivot.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,129350,129354,129428,129707,130052,130073,130219,130462,130490,130642
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**De-mean the data (normalize by each users mean) and convert it from a dataframe to a numpy array:**

In [33]:
R = ratings_pivot.to_numpy()
#print(R)
user_ratings_mean = np.mean(R, axis = 1)
#print(user_ratings_mean.shape)
print(user_ratings_mean.size)
ratings_demeaned = R - user_ratings_mean.reshape(-1, 1) # Making the user_ratings_mean vertical by reshaping

7120


### Model-Based Collaborative Filtering

In [34]:
#Check sparsity of dataset ratings
sparsity = round(1.0 - len(ratings_df) / float(n_users * n_movies), 3)
print('The sparsity level of MovieLens100K dataset is ' +  str(sparsity * 100) + '%')

The sparsity level of MovieLens100K dataset is 99.0%


### SVD Model:

In [35]:
U, sigma, Vt = svds(Ratings_demeaned, k = 50) #let’s me choose how many latent factors I want to use to approximate the original ratings matrix
print('Size of sigma: ' , sigma.size)

Size of sigma:  50


In [36]:
sigma = np.diag(sigma) #As I’m going to leverage matrix multiplication to get predictions, I’ll convert the Sum (now are values) to the diagonal matrix form.
print('Shape of sigma: ', sigma.shape)
print(sigma)

Shape of sigma:  (50, 50)
[[ 145.10150012    0.            0.         ...    0.
     0.            0.        ]
 [   0.          146.9116344     0.         ...    0.
     0.            0.        ]
 [   0.            0.          147.60927986 ...    0.
     0.            0.        ]
 ...
 [   0.            0.            0.         ...  629.60367499
     0.            0.        ]
 [   0.            0.            0.         ...    0.
   672.51929702    0.        ]
 [   0.            0.            0.         ...    0.
     0.         1599.39618943]]


In [37]:
print('Shape of U: ', U.shape)
print('Shape of Vt: ', Vt.shape)

Shape of U:  (7120, 50)
Shape of Vt:  (50, 14026)


### Making Predictions from the Decomposed Matrices

*Make movie ratings predictions for every user:*

In [38]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1) #add the user means back to get the actual star ratings prediction.
print('All user predicted rating : ', all_user_predicted_ratings.shape)

All user predicted rating :  (7120, 14026)


In [39]:
print('Rating Dataframe column names', Ratings.columns) 

Rating Dataframe column names Int64Index([     1,      2,      3,      4,      5,      6,      7,      8,
                 9,     10,
            ...
            129350, 129354, 129428, 129707, 130052, 130073, 130219, 130462,
            130490, 130642],
           dtype='int64', name='movieId', length=14026)


With the predictions matrix for every user, we can build a function to recommend movies for any user. We return the list of movies the user has already rated, for the sake of comparison.

In [40]:
preds = pd.DataFrame(all_user_predicted_ratings, columns = Ratings.columns)
preds.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,129350,129354,129428,129707,130052,130073,130219,130462,130490,130642
0,0.540995,0.843632,0.175701,-0.01176,-0.181734,0.135975,-0.459275,-0.023876,-0.207347,-0.016,...,0.006033,-0.008744,-0.00096,-0.002264,-0.003274,0.001616,-0.008724,-0.010165,0.001714,-0.004818
1,1.126602,0.074863,0.333645,0.084757,0.208162,0.154196,0.434022,0.002525,0.108228,-0.105898,...,-0.003558,-0.00701,-0.003883,-0.001358,-0.00139,0.003223,-0.004999,0.001781,0.007028,-0.000433
2,1.937884,0.908821,-0.061293,-0.022979,-0.041665,0.473491,0.02552,0.010201,-0.037875,0.242482,...,0.00294,0.011799,0.006577,0.004205,0.002523,0.025215,-0.00275,0.00652,0.006823,-0.000506
3,-0.522958,0.675686,0.390795,0.005326,0.252884,0.825424,0.063328,0.083529,0.206329,0.968592,...,0.002022,0.001737,0.002378,0.002758,0.001924,0.000865,0.001771,0.001563,0.004668,0.000457
4,2.145759,1.071243,1.257973,0.103505,1.185075,0.597056,1.349056,0.173922,0.300962,1.364036,...,-0.001485,0.010323,-0.009227,0.00154,0.001488,0.001313,0.000757,0.001082,0.006296,0.002579


**Find movies with the highest predicted rating that the specified user hasn’t already rated:**

In [41]:
def recommend_movies(predictions, userID, movies, original_ratings, num_recommendations):
    # Get and sort the user's predictions
    user_row_number = userID - 1 # User ID starts at 1, not 0
    sorted_user_predictions = predictions.iloc[user_row_number].sort_values(ascending=False) # User ID starts at 1
    # Get the user's data and merge in the movie information.
    user_data = original_ratings[original_ratings.userId == (userID)]
    user_full = (user_data.merge(movies, how = 'left', left_on = 'movieId', right_on = 'movieId').
                     sort_values(['rating'], ascending=False))
    print('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies[~movies['movieId'].isin(user_full['movieId'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movieId',
               right_on = 'movieId').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1])

    return user_full, recommendations #return the movies with the highest predicted rating that the specified user hasn’t already rated

In [42]:
already_rated, predictions = recommend_movies(preds, 150, movies_df, ratings_df, 20)

User 150 has already rated 26 movies.
Recommending highest 20 predicted ratings movies not already rated.


In [43]:
# Top 20 movies that User 150 has rated 
already_rated.head(20)

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
25,150,66934,5.0,1293713611,Dr. Horrible's Sing-Along Blog (2008),Comedy|Drama|Musical|Sci-Fi
24,150,7153,4.0,1293713549,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy
23,150,5952,4.0,1293713551,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy
22,150,4993,4.0,1293713546,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy
6,150,1033,4.0,1293713406,"Fox and the Hound, The (1981)",Animation|Children|Drama
7,150,1298,4.0,1293713323,Pink Floyd: The Wall (1982),Drama|Musical
9,150,1586,4.0,1293713300,G.I. Jane (1997),Action|Drama
20,150,2926,4.0,1293713446,Hairspray (1988),Comedy|Drama
12,150,2114,4.0,1293713450,"Outsiders, The (1983)",Drama
18,150,2502,4.0,1293713701,Office Space (1999),Comedy|Crime


In [44]:
# Top 20 movies that User 15 hopefully will enjoy
predictions

Unnamed: 0,movieId,title,genres
2467,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller
2852,2959,Fight Club (1999),Action|Crime|Drama|Thriller
1165,1197,"Princess Bride, The (1987)",Action|Adventure|Comedy|Fantasy|Romance
6405,6539,Pirates of the Caribbean: The Curse of the Bla...,Action|Adventure|Comedy|Fantasy
2657,2762,"Sixth Sense, The (1999)",Drama|Horror|Mystery
1106,1136,Monty Python and the Holy Grail (1975),Adventure|Comedy|Fantasy
12500,58559,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX
107,110,Braveheart (1995),Action|Drama|War
4190,4306,Shrek (2001),Adventure|Animation|Children|Comedy|Fantasy|Ro...
3466,3578,Gladiator (2000),Action|Adventure|Drama


It’s good to see that, although I didn’t actually use the genre of the movie as a feature, the truncated matrix factorization features “picked up” on the underlying tastes and preferences of the user

## Model Evaluation

Surprise library provides various ready-to-use powerful prediction algorithms including (SVD) to evaluate its RMSE (Root Mean Squared Error) on the MovieLens dataset. It is a Python Scikit-Learn’s building and analyzing recommender systems.

In [28]:
#https://surprise.readthedocs.io/en/stable/getting_started.html#use-cross-validation-iterators
# Load Reader library
reader = Reader()
# Load ratings dataset with Dataset library
data = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)
# Split the dataset for 5-fold evaluation
kf=KFold(n_splits=5)
#Use the SVD algorithm.
algo = SVD()
for trainset, testset in kf.split(data):
    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)
    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

RMSE: 0.8773
RMSE: 0.8713
RMSE: 0.8842
RMSE: 0.8676
RMSE: 0.8699


In [29]:
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8703  0.8821  0.8699  0.8750  0.8671  0.8729  0.0053  
MAE (testset)     0.6704  0.6764  0.6659  0.6728  0.6662  0.6703  0.0040  
Fit time          7.07    8.07    9.80    8.15    7.94    8.21    0.89    
Test time         0.26    0.25    0.40    0.28    0.31    0.30    0.05    


{'test_rmse': array([0.87028626, 0.88211373, 0.86988925, 0.87503327, 0.86710013]),
 'test_mae': array([0.67036793, 0.67642521, 0.66589648, 0.67276235, 0.66621527]),
 'fit_time': (7.074045896530151,
  8.073004245758057,
  9.802165985107422,
  8.154356002807617,
  7.938354253768921),
 'test_time': (0.262376070022583,
  0.25001001358032227,
  0.40036511421203613,
  0.2817268371582031,
  0.3109269142150879)}

**Train dataset and predictions:**

In [30]:
trainset = data.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x12b97ba90>

In [33]:
ratings_df[ratings_df['userId'] == 25]

Unnamed: 0,userId,movieId,rating,timestamp
4012,25,231,4.0,1535470451
4013,25,260,5.0,1535470429
4014,25,527,5.0,1535470432
4015,25,1198,5.0,1535470495
4016,25,2028,5.0,1535470505
4017,25,2571,5.0,1535470427
4018,25,3578,5.0,1535470497
4019,25,4993,5.0,1535470421
4020,25,5952,5.0,1535470419
4021,25,7153,5.0,1535470418


In [34]:
algo.predict(25, 2000)

Prediction(uid=150, iid=1994, r_ui=None, est=3.6096067667372678, details={'was_impossible': False})

In [35]:
files_dir = os.path.expanduser('./ml-100k')
reader = Reader('ml-100k')

In [36]:
# Use movielens-100K
data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=.15)

type(data)

surprise.dataset.DatasetAutoFolds

# SVD++

In [37]:
algo_svdpp = SVDpp(n_factors=160, n_epochs=10, lr_all=0.005, reg_all=0.1)
algo_svdpp.fit(trainset)
test_pred = algo_svdpp.test(testset)
print("SVDpp : Test Set")
accuracy.rmse(test_pred, verbose=True)

SVDpp : Test Set
RMSE: 0.9375


0.9374722725341056

# Evaluating Collaborative Filtering

In [38]:
def evaluation_collaborative_svd_model(userId,userOrItem):#hydrid the functionality of Collaborative based and svd based model to see if ratings of predicted movies 
    movieIdsList= list()
    movieRatingList=list()
    movieIdRating= pd.DataFrame(columns=['movieId','rating'])
    if userOrItem== True:
        movieIdsList=getRecommendedMoviesAsperUserSimilarity(userId)
    else:
        movieIdsList=recommendedMoviesAsperItemSimilarity(user_id)
    for movieId in movieIdsList:
        predict = algo.predict(userId, movieId)
        movieRatingList.append([movieId,predict.est])
        movieIdRating = pd.DataFrame(np.array(movieRatingList), columns=['movieId','rating'])
        count=movieIdRating[(movieIdRating['rating'])>=3]['movieId'].count()
        total=movieIdRating.shape[0]
        hit_ratio= count/total
    return hit_ratio

Hit Ratio It is ratio of number of hits/ Total recommendation

In [64]:
print("Hit ratio of User-user collaborative filtering")
print(evaluation_collaborative_svd_model(2,True))
print("Hit ratio of Item-Item collaborative filtering")
print(evaluation_collaborative_svd_model(2,False))

Hit ratio of User-user collaborative filtering
1.0
Hit ratio of Item-Item collaborative filtering
1.0


# Hybrid model:

Content Based Filtering + SVD:
- Run Content based filtering and determine the movies which we want to recommend to the user.
- Filter and sort the recommendations of CF using SVD predicted ratings.

In [65]:
df_movies=pd.read_csv("./src/datasets/movies2.csv")

In [66]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [67]:
df_ratings=pd.read_csv("./src/datasets/ratings2.csv")

In [68]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


**Predict the rating that a user would give to a movie that he has not yet rated:**

In [69]:
def hybrid_content_svd_model(userId): ##hydrid the functionality of content based and svd based model to recommend user top 10 movies. 
    recommended_movie_list = []
    movie_list = []
    df_rating_filtered = df_ratings[df_ratings["userId"]== userId]
    for key, row in df_rating_filtered.iterrows():
        movie_list.append((df_movies["title"][row["movieId"]==df_movies["movieId"]]).values) 
    for index, movie in enumerate(movie_list):
        for key, movie_recommended in get_similar_movies(movie[0]).iteritems():
            recommended_movie_list.append(movie_recommended)
    for movie_title in recommended_movie_list:
        if movie_title in movie_list:
            recommended_movie_list.remove(movie_title)
    recommended_movies_by_content=set(recommended_movie_list)
    recommended_movies_by_content_model = movies_df[movies_df.apply(lambda movie: movie["title"] in recommended_movies_by_content, axis=1)]
    for key, columns in recommended_movies_by_content_model.iterrows():
        predict = algo.predict(userId, columns["movie_id"])
        recommended_movies_by_content_model.loc[key, "svd_rating"] = predict.est
    return recommended_movies_by_content_model.sort_values("svd_rating", ascending=False).iloc[0:11]

In [70]:
hybrid_content_svd_model(2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


Unnamed: 0,movie_id,title,release_date,similarity,svd_rating
922,1221,"Godfather: Part II, The (1974)",Crime|Drama,0.525002,4.470523
914,1213,Goodfellas (1990),Crime|Drama,0.569947,4.387041
210,246,Hoop Dreams (1994),Documentary,0.148774,4.347391
659,858,"Godfather, The (1972)",Crime|Drama,0.488569,4.34379
909,1208,Apocalypse Now (1979),Action|Drama|War,0.339778,4.325483
613,778,Trainspotting (1996),Comedy|Crime|Drama,0.388216,4.311783
135,162,Crumb (1994),Documentary,0.252189,4.248894
2743,3681,For a Few Dollars More (Per qualche dollaro in...,Action|Drama|Thriller|Western,0.243336,4.240694
4764,7090,Hero (Ying xiong) (2002),Action|Adventure|Drama,0.189629,4.234739
27,28,Persuasion (1995),Drama|Romance,0.106808,4.203169


In [72]:
hybrid_content_svd_model(50)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


Unnamed: 0,movie_id,title,release_date,similarity,svd_rating
257,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,0.438956,3.530294
922,1221,"Godfather: Part II, The (1974)",Crime|Drama,0.525002,3.470829
686,904,Rear Window (1954),Mystery|Thriller,0.481921,3.409146
914,1213,Goodfellas (1990),Crime|Drama,0.569947,3.369722
46,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,0.394586,3.361001
2462,3275,"Boondock Saints, The (2000)",Action|Crime|Drama|Thriller,0.082847,3.340848
950,1251,8 1/2 (8½) (1963),Drama|Fantasy,0.476977,3.334019
2996,4011,Snatch (2000),Comedy|Crime|Thriller,0.242567,3.32707
210,246,Hoop Dreams (1994),Documentary,0.148774,3.32165
910,1209,Once Upon a Time in the West (C'era una volta ...,Action|Drama|Western,0.477393,3.315937
