# A Movie Recommendation System

Before diving into the problem, let's import the required libraries.


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from numpy.linalg import norm
import matplotlib.pyplot as plt
# % matplotlib inline

***Collaborative Filtering*** considers other users’ reactions while making recommendations. It does not need any additional features of the items apart from user reactions to make recommendations.

Collaborative systems can be:

    1.Memory based: 
                    a)The system stores the user-item interaction matrix and builds an item similarity or user similarity matrix,
                      depending on which type of filtering is used, item based or user based. 
                    b)The main difference between the two is the way recommendations are made. 
                    c)In user-based filtering, if user A’s ratings are similar to those of another user B, the items that B liked
                      are recommended to A.
                    d)In item-based filtering, if user A likes an item x, the items y and z that are similar to x
                      are recommended to A.
    2.Model based:  The large interaction matrix is compressed using dimensionality reduction or clustering algorithms, and a machine
                    learning model is fitted to predict the rating a user will give to an item. Storing the full matrix is not required here.
                    The interaction matrix is reduced to a movie feature vector for each movie and a user feature vector for each user.

A ***Content Based Filtering*** system tries to infer the preferences of a user from the features of the items he/she has reacted to. It does not require other users' data to make recommendations to one user.


Content-based filtering is suitable for providing personalized recommendations that match user preferences and interests, while collaborative filtering can provide surprising and diverse recommendations that expose users to new or popular items.

# Movie ratings dataset

In [2]:
#Load data
ratings=pd.read_csv('ratings_small.csv')
meta=pd.read_csv('movies_metadata.csv')
links=pd.read_csv('links_small.csv')
credit=pd.read_csv('credits.csv')





*   **ratings** holds rating values
*   **meta**    holds metadata of the movies
*   **links**   contains TMDB and IMDb ids
*   **credit**  contains cast and crew info about movies



In [3]:
meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [4]:
credit.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


# Preprocessing and cleaning of the dataset

First, we convert the id values from strings to numbers using the pd.to_numeric function.

PS: non-numeric values in id become NaN because of the argument **errors='coerce'**

In [5]:
meta['id']=pd.to_numeric(meta['id'],errors='coerce')

Drop the rows whose id was non-numeric (now NaN after the conversion)

In [6]:
meta=meta.dropna(subset='id')       #eliminate garbage entries
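
The two cleaning steps above can be tried on a small made-up Series. This is only a sketch: the values are illustrative and not taken from `movies_metadata.csv`.

```python
import pandas as pd

# Hypothetical id column mixing valid ids with one malformed entry
ids = pd.Series(['862', '8844', '1997-08-20', '15602'])

# Strings that cannot be parsed as numbers become NaN instead of raising an error
numeric_ids = pd.to_numeric(ids, errors='coerce')
print(numeric_ids.isna().sum())   # how many entries could not be parsed

# Dropping the NaN entries leaves only the valid ids
clean_ids = numeric_ids.dropna().astype(int)
print(clean_ids.tolist())
```

Without `errors='coerce'`, `pd.to_numeric` raises an exception on the first unparseable value by default.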

Let's merge the loaded datasets into a single dataframe

In [7]:
df=pd.merge(links,meta,left_on='tmdbId',right_on='id',how='inner')
df=pd.merge(df,ratings,on='movieId')
df=pd.merge(df,credit,on='id')
# df 
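
As a rough illustration of what the first `how='inner'` merge does, here is a sketch on two tiny made-up frames (`links_toy` and `meta_toy` are hypothetical stand-ins for `links` and `meta`). Only rows whose keys appear on both sides survive.

```python
import pandas as pd

# Hypothetical stand-ins: one link has no tmdbId, one movie has no link
links_toy = pd.DataFrame({'movieId': [1, 2, 3],
                          'tmdbId': [862.0, 8844.0, None]})
meta_toy = pd.DataFrame({'id': [862.0, 8844.0, 99999.0],
                         'title': ['Toy Story', 'Jumanji', 'Unlinked']})

# Inner join on differently named key columns
merged = pd.merge(links_toy, meta_toy, left_on='tmdbId', right_on='id', how='inner')
print(merged[['movieId', 'title']])
```

Comparing `len(merged)` with the lengths of the inputs is a quick way to see how many rows an inner merge discarded.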

In [8]:
df = df.drop_duplicates(subset=['userId', 'movieId'], keep='last')
# drop_duplicates() removes repeated (userId, movieId) pairs, keeping the last occurrence
df
# df is the final dataframe

Unnamed: 0,movieId,imdbId,tmdbId,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,...,tagline,title,video,vote_average,vote_count,userId,rating,timestamp,cast,crew
0,1,114709,862.0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862.0,tt0114709,...,,Toy Story,False,7.7,5415.0,7,3.0,851866703,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
1,1,114709,862.0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862.0,tt0114709,...,,Toy Story,False,7.7,5415.0,9,4.0,938629179,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
2,1,114709,862.0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862.0,tt0114709,...,,Toy Story,False,7.7,5415.0,13,5.0,1331380058,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
3,1,114709,862.0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862.0,tt0114709,...,,Toy Story,False,7.7,5415.0,15,2.0,997938310,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
4,1,114709,862.0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862.0,tt0114709,...,,Toy Story,False,7.7,5415.0,19,3.0,855190091,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99925,161918,4831420,390989.0,False,"{'id': 286023, 'name': 'Sharknado Collection',...",0,"[{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam...",http://www.syfy.com/sharknado4,390989.0,tt4831420,...,"What happens in Vegas, stays in Vegas. Unless ...",Sharknado 4: The 4th Awakens,False,4.3,88.0,624,1.5,1472929873,"[{'cast_id': 0, 'character': 'Fin Shepard', 'c...","[{'credit_id': '56ffae0cc3a3686ea7001e00', 'de..."
99926,161944,255313,159550.0,False,,8000000,"[{'id': 18, 'name': 'Drama'}]",,159550.0,tt0255313,...,,The Last Brickmaker in America,False,7.0,1.0,287,5.0,1470167824,"[{'cast_id': 1, 'character': 'Henry Cobb', 'cr...","[{'credit_id': '544475aac3a36819fb000578', 'de..."
99927,162542,5165344,392572.0,False,,1000000,"[{'id': 53, 'name': 'Thriller'}, {'id': 10749,...",,392572.0,tt5165344,...,Decorated Officer. Devoted Family Man. Defendi...,Rustom,False,7.3,25.0,611,5.0,1471520667,"[{'cast_id': 0, 'character': 'Rustom Pavri', '...","[{'credit_id': '5951baf692514129c4016600', 'de..."
99928,162672,3859980,402672.0,False,,15050000,"[{'id': 12, 'name': 'Adventure'}, {'id': 18, '...",,402672.0,tt3859980,...,,Mohenjo Daro,False,6.7,26.0,611,3.0,1471523986,"[{'cast_id': 0, 'character': 'Sarman', 'credit...","[{'credit_id': '57cd5d3592514179d50018e8', 'de..."


Let's extract more information from the dataset, such as the genres, which are stored as JSON-like strings.

In [9]:
final=df.copy()  # copy() creates a deep copy by default, i.e. changes to one dataframe do not affect the other
final['genres']=df['genres'].str.replace("'",'"')

In [10]:
import json
final['genres'] =final['genres'].apply(json.loads)   # apply() calls our function on every element of the pandas Series

In [11]:
final['genres'] = final['genres'].apply(lambda x: str([genre['name'] for genre in x]))
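
Replacing single quotes with double quotes before `json.loads` can break when a value itself contains an apostrophe. A possible alternative, sketched here on made-up strings, is `ast.literal_eval`, which parses Python-literal text directly. The helper name `genre_names` and the sample values are assumptions for illustration.

```python
import ast
import pandas as pd

# Made-up genres strings in the same Python-literal style as the metadata file
raw = pd.Series([
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 18, 'name': \"Children's Drama\"}]",  # hypothetical name with an apostrophe
    "[]",
])

def genre_names(text):
    """Parse a Python-literal list of dicts and return only the genre names."""
    return [genre['name'] for genre in ast.literal_eval(text)]

names = raw.apply(genre_names)
print(names.tolist())
```

`ast.literal_eval` only evaluates literals (lists, dicts, strings, numbers), so it does not execute arbitrary code from the file.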

In [12]:
final    # genres now holds the genre names of each movie (stored as the string form of a list)

Unnamed: 0,movieId,imdbId,tmdbId,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,...,tagline,title,video,vote_average,vote_count,userId,rating,timestamp,cast,crew
0,1,114709,862.0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"['Animation', 'Comedy', 'Family']",http://toystory.disney.com/toy-story,862.0,tt0114709,...,,Toy Story,False,7.7,5415.0,7,3.0,851866703,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
1,1,114709,862.0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"['Animation', 'Comedy', 'Family']",http://toystory.disney.com/toy-story,862.0,tt0114709,...,,Toy Story,False,7.7,5415.0,9,4.0,938629179,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
2,1,114709,862.0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"['Animation', 'Comedy', 'Family']",http://toystory.disney.com/toy-story,862.0,tt0114709,...,,Toy Story,False,7.7,5415.0,13,5.0,1331380058,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
3,1,114709,862.0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"['Animation', 'Comedy', 'Family']",http://toystory.disney.com/toy-story,862.0,tt0114709,...,,Toy Story,False,7.7,5415.0,15,2.0,997938310,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
4,1,114709,862.0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"['Animation', 'Comedy', 'Family']",http://toystory.disney.com/toy-story,862.0,tt0114709,...,,Toy Story,False,7.7,5415.0,19,3.0,855190091,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99925,161918,4831420,390989.0,False,"{'id': 286023, 'name': 'Sharknado Collection',...",0,"['Comedy', 'Horror', 'Science Fiction']",http://www.syfy.com/sharknado4,390989.0,tt4831420,...,"What happens in Vegas, stays in Vegas. Unless ...",Sharknado 4: The 4th Awakens,False,4.3,88.0,624,1.5,1472929873,"[{'cast_id': 0, 'character': 'Fin Shepard', 'c...","[{'credit_id': '56ffae0cc3a3686ea7001e00', 'de..."
99926,161944,255313,159550.0,False,,8000000,['Drama'],,159550.0,tt0255313,...,,The Last Brickmaker in America,False,7.0,1.0,287,5.0,1470167824,"[{'cast_id': 1, 'character': 'Henry Cobb', 'cr...","[{'credit_id': '544475aac3a36819fb000578', 'de..."
99927,162542,5165344,392572.0,False,,1000000,"['Thriller', 'Romance']",,392572.0,tt5165344,...,Decorated Officer. Devoted Family Man. Defendi...,Rustom,False,7.3,25.0,611,5.0,1471520667,"[{'cast_id': 0, 'character': 'Rustom Pavri', '...","[{'credit_id': '5951baf692514129c4016600', 'de..."
99928,162672,3859980,402672.0,False,,15050000,"['Adventure', 'Drama', 'History', 'Romance']",,402672.0,tt3859980,...,,Mohenjo Daro,False,6.7,26.0,611,3.0,1471523986,"[{'cast_id': 0, 'character': 'Sarman', 'credit...","[{'credit_id': '57cd5d3592514179d50018e8', 'de..."


Now we create the user interaction matrix, whose rows represent movies, columns represent users and entries represent ratings.

We use the pivot() function with index, columns and values as arguments.

Finally, we create the utility matrix using to_numpy().

In [13]:
utili=df.pivot(index='movieId',columns='userId',values='rating')
utili=utili.fillna(0)
# Ratings range from 0.5 to 5 in increments of 0.5, so the minimum rating is 0.5.
# We can therefore use 0 to mark missing (NaN) ratings
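
To make the reshaping concrete, here is a sketch of the same `pivot` and `fillna` steps on a tiny made-up ratings table (`toy` is hypothetical data, not part of the dataset).

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0],
})

# Rows become movies, columns become users; missing pairs are filled with 0
toy_utili = toy.pivot(index='movieId', columns='userId', values='rating').fillna(0)
print(toy_utili)

# Binary indicator of which entries were actually rated
toy_R = (toy_utili.to_numpy() > 0).astype(int)
print(toy_R)
```

`pivot` raises a `ValueError` if the same (movieId, userId) pair occurs more than once, which is why duplicates have to be dropped before this step.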

In [14]:
utili.head()

userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,4.0,0.0,...,0.0,4.0,3.5,0.0,0.0,0.0,0.0,0.0,4.0,5.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
Y=utili.to_numpy()
print(Y.shape)
Y

(9025, 671)


array([[0., 0., 0., ..., 0., 4., 5.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Below is the R matrix, which has entry 1 where a movie has been rated by a user and 0 where it has not

In [16]:
R=(Y>0).astype(int)
R

array([[0, 0, 0, ..., 0, 1, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

# Making Recommendations

***CALCULATING SIMILARITY SCORES***

For the item-based collaborative filtering model, we calculate the similarity between each pair of movies and form a similarity matrix S, where S[i,j] represents the similarity between the ith and jth movies of the utility matrix.

The most commonly used similarity metrics are:

    1.Cosine Similarity
    2.Euclidean Distance
    3.Adjusted Cosine Similarity

***Euclidean distance***

In [17]:
# nm=Y.shape[0]
# S1=np.zeros((nm,nm))
# for i in range(nm):
#   for j in range(nm):
#     S1[i,j]=np.sum(np.square(Y[i]-Y[j]))
# S1=np.sqrt(S1)

The run time of the above code, which calculates the Euclidean distance from scratch, is very high, so we use an efficient approach below

In [18]:
from sklearn.metrics.pairwise import euclidean_distances
S1=euclidean_distances(Y,Y)
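
The double loop can also be avoided in plain NumPy by expanding the squared distance as ‖a‖² + ‖b‖² − 2a·b. The sketch below (with a hypothetical helper `pairwise_euclidean` and random toy data) checks that expansion against the naive loop. It is an illustration of the idea, not a claim about how scikit-learn is implemented internally.

```python
import numpy as np

def pairwise_euclidean(A):
    """Pairwise Euclidean distances between the rows of A, without Python loops."""
    sq = np.sum(A * A, axis=1)                        # squared norm of each row
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)  # squared distances
    np.maximum(d2, 0.0, out=d2)                       # clip tiny negatives from rounding
    return np.sqrt(d2)

rng = np.random.default_rng(0)
A = rng.random((5, 4))                                # toy stand-in for the utility matrix
D = pairwise_euclidean(A)

# Reference result from the straightforward double loop
D_loop = np.array([[np.sqrt(np.sum((a - b) ** 2)) for b in A] for a in A])
print(np.allclose(D, D_loop, atol=1e-6))
```

The speed-up comes from doing the bulk of the work in a single matrix product instead of one Python-level operation per pair of movies.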

***Cosine similarity***

In [19]:
# nm=Y.shape[0]
# S2=np.zeros((nm,nm))
# for i in range(nm):
#   for j in range(nm):
#     S2[i,j]=np.dot(Y[i],Y[j])
#     S2[i,j]=S2[i,j]/(norm(Y[i])*norm(Y[j]))
# S2

Similarly, the run time of the above loop is large, so it is not preferred

In [20]:
from sklearn.metrics.pairwise import cosine_similarity
S2=cosine_similarity(Y)
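
For intuition, cosine similarity between all rows can be written as one matrix product after scaling each row to unit length. This is a minimal NumPy sketch; `cosine_sim_rows` and the toy matrix are hypothetical, and the treatment of all-zero rows (similarity 0) is a choice made here.

```python
import numpy as np

def cosine_sim_rows(A, eps=1e-12):
    """Cosine similarity between every pair of rows of A."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    A_unit = A / np.maximum(norms, eps)   # all-zero rows stay all-zero
    return A_unit @ A_unit.T

A = np.array([[1.0, 0.0],
              [2.0, 0.0],    # same direction as row 0
              [0.0, 3.0],    # orthogonal to row 0
              [0.0, 0.0]])   # no ratings at all
S = cosine_sim_rows(A)
print(S)
```

Rows 0 and 1 point in the same direction and get similarity 1 regardless of their lengths, which is the property that distinguishes cosine similarity from Euclidean distance.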

***Adjusted cosine similarity***

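Preprocess the data by subtracting each movie's mean rating (computed over the rated entries of its row) from every entry of that row. This normalization centers the observed ratings of each movie around 0. Note that `Y-u` also shifts the unrated entries, which are stored as 0, to minus the movie's mean.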

In [21]:
u=(np.sum(Y,axis=1)/(np.sum(R,axis=1)+1e-12)).reshape(-1,1)  #mean matrix

In [22]:
S3=cosine_similarity(Y-u)
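
Because `Y-u` subtracts the mean from every entry of a row, unrated entries become negative. One possible variant, sketched below, centers only the rated entries and leaves unrated ones at 0. The helper `centered_cosine` and the toy matrix are hypothetical, and this masking is an assumption rather than the method used in this notebook.

```python
import numpy as np

def centered_cosine(Y, eps=1e-12):
    """Cosine similarity between rows after centering only the rated entries."""
    R = (Y > 0).astype(float)                         # 1 where a rating exists
    counts = np.maximum(R.sum(axis=1, keepdims=True), 1.0)
    row_mean = Y.sum(axis=1, keepdims=True) / counts  # mean over rated entries
    Yc = (Y - row_mean) * R                           # unrated entries stay 0
    norms = np.linalg.norm(Yc, axis=1, keepdims=True)
    Yn = Yc / np.maximum(norms, eps)
    return Yn @ Yn.T

# Toy utility matrix: rows are movies, columns are users, 0 means unrated
Y_toy = np.array([[5.0, 3.0, 0.0],
                  [4.0, 2.0, 0.0],
                  [1.0, 5.0, 4.0]])
S_toy = centered_cosine(Y_toy)
print(np.round(S_toy, 3))
```

In the toy data, movies 0 and 1 are rated in the same pattern by the same users and get similarity 1, while movie 2 is rated in the opposite pattern and gets a negative similarity with movie 0.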

Predict a rating from the similarity matrix

(Ratings are better predicted using trained models, though.)


In [23]:
def predict_rating(userId,movieId,s2):
    index=utili.index.get_loc(movieId)
    user_rated_rating=utili.loc[:,userId].to_numpy()
    return np.dot(s2[index],user_rated_rating)/np.sum(s2[index])


In [24]:
predict_rating(1,31,S2)+u[1,0]

3.42433570136621
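
`predict_rating` divides by the sum of all similarities in the row, while movies the user has not rated contribute 0 to the numerator. A common variant restricts both sums to the movies the user has actually rated. The sketch below uses a hypothetical helper `predict_from_rated` on made-up numbers and is not the notebook's method.

```python
import numpy as np

def predict_from_rated(sim_row, user_ratings, eps=1e-12):
    """Similarity-weighted average over the movies this user has rated."""
    rated = user_ratings > 0                 # mask of movies with a rating
    weights = sim_row[rated]
    return float(np.dot(weights, user_ratings[rated]) / (np.abs(weights).sum() + eps))

# Similarities of the target movie to 4 movies, and the user's ratings of them
sim_row = np.array([1.0, 0.8, 0.1, 0.5])
user_ratings = np.array([0.0, 4.0, 2.0, 0.0])   # 0 means unrated

pred = predict_from_rated(sim_row, user_ratings)
print(round(pred, 3))
```

With non-negative similarities, this keeps the prediction between the lowest and highest rating the user has given, because it is a weighted average of those ratings only.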

# The Final Test

Let's get top 5 movies similar to your favourite movie!

In [25]:
fav_movie=input("Enter your movie: ")

Enter your movie:  Toy Story


In [26]:
movieId=final.loc[final['title'] == fav_movie, 'movieId'].values[0]
index=utili.index.get_loc(movieId)

In [27]:
sorted_indices = sorted(range(len(S2[index])), key=lambda i: S2[index,i],reverse=True)
top_5=sorted_indices[0:6]
top_5 # this list holds 6 indices: the input movie itself first, followed by the 5 most similar movies

[0, 2501, 232, 321, 641, 1015]
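
The same ranking can be obtained with `np.argsort`, which also makes it easy to exclude the query movie explicitly instead of skipping the first position. `top_k_similar` and the toy matrix are hypothetical names used only for this sketch.

```python
import numpy as np

def top_k_similar(S, index, k=5):
    """Indices of the k rows most similar to `index`, excluding `index` itself."""
    order = np.argsort(-S[index], kind='stable')   # descending similarity
    order = order[order != index]                  # drop the query movie
    return order[:k]

# Toy symmetric similarity matrix for 4 movies
S_toy = np.array([[1.0, 0.2, 0.9, 0.5],
                  [0.2, 1.0, 0.1, 0.3],
                  [0.9, 0.1, 1.0, 0.4],
                  [0.5, 0.3, 0.4, 1.0]])
print(top_k_similar(S_toy, 0, k=2))
```

For a distance matrix such as `S1`, smaller values mean more similar movies, so the sort order would need to be ascending (`np.argsort(S[index])`).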

In [28]:
for k in range(1,6):   #do not consider the input movie
  a=utili.index[top_5[k]]
  print(k,'.',final.loc[final['movieId'] == a, 'title'].values[0], final.loc[final['movieId'] == a, 'genres'].values[0])

1 . Toy Story 2 ['Animation', 'Comedy', 'Family']
2 . Star Wars ['Adventure', 'Action', 'Science Fiction']
3 . Forrest Gump ['Comedy', 'Drama', 'Romance']
4 . Independence Day ['Action', 'Adventure', 'Science Fiction']
5 . Groundhog Day ['Romance', 'Fantasy', 'Drama', 'Comedy']


Results for the input movie (Toy Story) with each similarity measure:

**Adjusted Cosine Similarity:**
    
    1 . Toy Story 2 ['Animation', 'Comedy', 'Family']
    2 . A Bug's Life ['Adventure', 'Animation', 'Comedy', 'Family']
    3 . Toy Story 3 ['Animation', 'Comedy', 'Family']
    4 . Monsters, Inc. ['Animation', 'Comedy', 'Family']
    5 . Tarzan ['Animation', 'Comedy', 'Family']
    
**Cosine Similarity:**

    1 . Toy Story 2 ['Animation', 'Comedy', 'Family']
    2 . Star Wars ['Adventure', 'Action', 'Science Fiction']
    3 . Forrest Gump ['Comedy', 'Drama', 'Romance']
    4 . Independence Day ['Action', 'Adventure', 'Science Fiction']
    5 . Groundhog Day ['Romance', 'Fantasy', 'Drama', 'Comedy']

**Euclidean Distances:**

    1 . Toy Story 2 ['Animation', 'Comedy', 'Family']
    2 . A Bug's Life ['Adventure', 'Animation', 'Comedy', 'Family']
    3 . Groundhog Day ['Romance', 'Fantasy', 'Drama', 'Comedy']
    4 . Independence Day ['Action', 'Adventure', 'Science Fiction']
    5 . Monsters, Inc. ['Animation', 'Comedy', 'Family']

Observe that, for this input, the recommendations made by adjusted cosine similarity are better than those of the other two: all five list Animation and Family among their genres, like the input movie.

The above is a simple model that suggests movies similar to an input movie. It is not user specific, i.e. it does not take into account the user's past watching experience, so it suggests the same movies irrespective of the user.

We can also suggest movies specific to a user, based on the user's past watching experience.

Let's do this using the User-Factorization method.

Let's add a new user with userId=0 and rate some movies of your choice.

In [29]:
utili[0]=np.nan
utili.loc[1,0]=3
utili.loc[163949,0]=4
utili.loc[2,0]=2
utili.loc[4,0]=3.5
utili.loc[1061,0]=5
utili.loc[1029,0]=4
utili.loc[3671,0]=3
utili=utili.fillna(0)
Y=utili.to_numpy()
print(Y.shape)
Y   #new utility matrix

(9025, 672)


array([[0., 0., 0., ..., 4., 5., 3.],
       [0., 0., 0., ..., 0., 0., 2.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 4.]])

In [30]:
R=(Y>0).astype(int)  #R is a matrix which contains binary label 1 for movies which have been rated

In [31]:
u=np.sum(Y,axis=1)/(np.sum(R,axis=1)+1e-12)
u=u.reshape(Y.shape[0],1)
Y=Y-u
# We normalize the ratings by subtracting the average rating of each movie (mean over the rated entries of its row), so that the rated entries of each movie average 0

In [32]:
num_movies,num_users=Y.shape
num_features = 500

In [33]:
def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    
    # Vectorized implementation of the cost function for speed, using TensorFlow operations
          # X (num_movies,num_features): matrix of movie features
          # W (num_users,num_features) : matrix of user parameters
          # b (1, num_users)           : bias vector of user parameters
          # Y (num_movies,num_users)   : matrix of mean-normalized ratings
          # R (num_movies,num_users)   : 1 where a rating exists, 0 otherwise
          # lambda_ (float)            : regularization parameter
    
    j = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y)*R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J

# initialize Parameters (W, X), use tf.Variable to track these variables
tf.random.set_seed(1234) # for consistent results
W = tf.Variable(tf.random.normal((num_users,  num_features),dtype=tf.float64),  name='W')
X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float64),  name='X')
b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float64),  name='b')

# Instantiate an optimizer.
optimizer = keras.optimizers.Adam(learning_rate=1e-1)
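
Written out, the quantity computed by `cofi_cost_func_v` corresponds to the expression below. The index notation is introduced here for illustration: $x_i$ is row $i$ of `X`, $w_j$ is row $j$ of `W`, $b_j$ is entry $j$ of `b`, and $\lambda$ is `lambda_`.

```latex
J(X, W, b) \;=\; \frac{1}{2} \sum_{(i,j)\,:\,R_{ij}=1} \left( x_i^{\top} w_j + b_j - Y_{ij} \right)^{2}
\;+\; \frac{\lambda}{2} \left( \sum_{i} \lVert x_i \rVert^{2} + \sum_{j} \lVert w_j \rVert^{2} \right)
```

Multiplying the residual by `R` in the code is what restricts the first sum to rated entries, so the shifted values of unrated entries in `Y` do not affect the cost.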


In [34]:
iterations = 300
lamb = 1
costs=[]
for iter in range(iterations):
    with tf.GradientTape() as tape:
        # Compute the cost
        cost_value = cofi_cost_func_v(X, W, b, Y, R, lamb)

    # Use the gradient tape to automatically compute gradients with respect to loss
    grads = tape.gradient( cost_value, [X,W,b] )

    # Run one step of gradient descent by changing the values of parameters
    optimizer.apply_gradients( zip(grads, [X,W,b]) )
    # print cost every 20th iteration
    if iter % 20 == 0:
        costs.append(cost_value)
        print(f"Training loss at iteration {iter}: {cost_value:0.1f}")

Training loss at iteration 0: 27594174.2
Training loss at iteration 20: 1976541.3
Training loss at iteration 40: 1036670.1
Training loss at iteration 60: 641384.5
Training loss at iteration 80: 422773.9
Training loss at iteration 100: 290376.2
Training loss at iteration 120: 205860.5
Training loss at iteration 140: 149805.6
Training loss at iteration 160: 111447.2
Training loss at iteration 180: 84493.8
Training loss at iteration 200: 65120.4
Training loss at iteration 220: 50923.8
Training loss at iteration 240: 40349.2
Training loss at iteration 260: 32364.0
Training loss at iteration 280: 26265.6


In [58]:
# plt.plot(costs)
# plt.show()

We have now learned the movie feature vectors (X) and the user feature vectors (W), along with the user biases (b)

# Evaluation

In [36]:
Y+u  #actual ratings

array([[0., 0., 0., ..., 4., 5., 3.],
       [0., 0., 0., ..., 0., 0., 2.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 4.]])

In [37]:
p = np.matmul(X.numpy(), np.transpose(W.numpy())) + b.numpy()
#restore the mean
pm = p + u
pm  #predicted ratings

array([[2.11367035, 2.77343834, 1.40197207, ..., 3.9982926 , 4.95563522,
        3.0109457 ],
       [3.36082232, 3.60747085, 2.01135966, ..., 3.82489626, 1.8310824 ,
        2.03620771],
       [2.16120527, 1.55998548, 2.6633357 , ..., 3.06425634, 3.1926419 ,
        2.49527687],
       ...,
       [4.5430735 , 5.21949416, 4.4096274 , ..., 5.17428396, 5.48247641,
        4.51027765],
       [2.54278637, 3.24962434, 2.44298426, ..., 3.19119117, 3.43286775,
        2.48718606],
       [4.14889906, 4.67472945, 3.9155405 , ..., 4.65991507, 4.90307244,
        4.07809339]])

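RMSE (root mean square error) and MAE (mean absolute error) are two metrics that can be used to evaluate our trained model. Here they are computed over the same rated entries the model was trained on, so they measure how well the model fits the training data rather than how well it generalizes.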

In [38]:
rmse=np.sum(np.square((Y+u-pm)*R))/np.sum(R)  #rmse
rmse=np.sqrt(rmse)
mae=np.sum(np.abs((Y+u-pm)*R))/np.sum(R)
print('RMSE for above trained model:',rmse)
print('MAE for above trained model:',mae)

RMSE for above trained model: 0.033102047930094615
MAE for above trained model: 0.024205589984094722
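
To estimate how the model behaves on ratings it has not seen, one option is to hold out a fraction of the rated entries before training and compute the errors only on those. The sketch below shows the bookkeeping on toy data. `split_mask` and `masked_errors` are hypothetical helpers, and the 25% hold-out is an arbitrary choice.

```python
import numpy as np

def split_mask(R, test_fraction=0.2, seed=0):
    """Split the rated entries of R into disjoint train and test masks."""
    rng = np.random.default_rng(seed)
    rated = np.argwhere(R == 1)                       # (row, col) of every rating
    n_test = int(round(test_fraction * len(rated)))
    test_idx = rated[rng.permutation(len(rated))[:n_test]]
    R_test = np.zeros_like(R)
    R_test[test_idx[:, 0], test_idx[:, 1]] = 1
    return R - R_test, R_test

def masked_errors(actual, predicted, mask):
    """RMSE and MAE over the entries where mask == 1."""
    diff = (actual - predicted)[mask.astype(bool)]
    return float(np.sqrt(np.mean(diff ** 2))), float(np.mean(np.abs(diff)))

R_toy = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 1, 1]])
actual = np.array([[4.0, 0.0, 3.0], [5.0, 2.0, 0.0], [0.0, 1.0, 4.0], [3.0, 3.0, 5.0]])
predicted = actual + 0.5                              # pretend every prediction is off by 0.5

R_train, R_test = split_mask(R_toy, test_fraction=0.25)
rmse, mae = masked_errors(actual, predicted, R_test)
print(R_test.sum(), rmse, mae)
```

In the notebook's setting, `R_train` would take the place of `R` in the cost function, and `R_test` would take its place in the RMSE and MAE computation.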


Make a predict function to predict the rating of a movie for a particular user

In [39]:
def predict(movie_index,user_index,x,w,b):
  y_cap=np.dot(x[movie_index],w[user_index])+b[0,user_index]+u[movie_index,0]
  return y_cap

The user_rated Series stores the movies already rated by our target user

In [40]:
userId=0      #target userId
user_rated=(utili[userId])[((utili[userId]) != 0)]
user_rated

movieId
1         3.0
2         2.0
4         3.5
1029      4.0
1061      5.0
3671      3.0
163949    4.0
Name: 0, dtype: float64

In [41]:
user_ratedId=user_rated.index
user_ratings=user_rated.to_numpy()

In [42]:
ui=utili.columns.get_loc(userId)
rate={}
for Id in (utili.index):
    mi=utili.index.get_loc(Id)
    rate[mi]={}
    rate[mi]['Id']=Id
    rate[mi]['pre_rate']=predict(mi,ui,X.numpy(),W.numpy(),b.numpy())

The above loop has a high runtime due to large number of movies.
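
The per-movie loop can also be replaced by a single matrix-vector product that predicts all movies for one user at once. This is a sketch with random toy parameters. `predict_all_for_user` is a hypothetical helper that mirrors the formula used in `predict`.

```python
import numpy as np

def predict_all_for_user(X, W, b, u, user_index):
    """Predicted ratings of every movie for one user, with the movie means restored."""
    return X @ W[user_index] + b[0, user_index] + u[:, 0]

# Toy parameters with the same shapes as in the notebook (6 movies, 4 users, 3 features)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(6, 3))
W_toy = rng.normal(size=(4, 3))
b_toy = rng.normal(size=(1, 4))
u_toy = rng.random((6, 1))

fast = predict_all_for_user(X_toy, W_toy, b_toy, u_toy, user_index=2)

# Same computation, one movie at a time, for comparison
slow = np.array([np.dot(X_toy[i], W_toy[2]) + b_toy[0, 2] + u_toy[i, 0] for i in range(6)])
print(np.allclose(fast, slow))
```

The result is a 1-D array aligned with the rows of the utility matrix, so it can be paired with the movie index directly to build a sorted table of predictions.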
    
We can use a K-nearest-neighbours approach, which finds the K movies most similar to each movie the user has watched and predicts a rating for each of them. The ones already watched are filtered out and the top 5 movies are recommended to the user.

In [43]:
opm=(pd.DataFrame.from_dict(rate, orient='index'))
opm = opm.sort_values('pre_rate', ascending=False).reset_index(drop=True)
# opm dataframe contains predicted ratings of movies in descending order of ratings

In [44]:
opm.head()

Unnamed: 0,Id,pre_rate
0,1061,4.844034
1,4011,4.806826
2,95377,4.625383
3,98615,4.560868
4,820,4.546996


We will suggest the top 5 movies to the user

In [45]:
t=0
for Id in (opm['Id']):
    if Id not in user_ratedId:
        print(final.loc[final['movieId'] == Id, 'title'].values[0],final.loc[final['movieId'] == Id, 'genres'].values[0])
        t+=1
    if t==5:
        break
    

Snatch ['Thriller', 'Crime']
One Man Band ['Animation', 'Family']
Burn Up! ['Animation']
Death in the Garden ['Adventure', 'Drama']
29th and Gay ['Comedy']


***Generalized recommendations (not specific to any user) can also be made using this method***

Check below!

In [46]:
# df.loc[df['movieId'] == 1, 'title'].values[0]

In [47]:
movieId=final.loc[final['title'] == fav_movie, 'movieId'].values[0]  #your fav movie
mi=utili.index.get_loc(movieId)

In [48]:
score={}
for i in range(num_movies):
  score[i]={}
  score[i]['score']=np.dot(X[i].numpy(),X[mi].numpy())/(norm(X[i].numpy())*norm(X[mi].numpy()))
  score[i]['id']=utili.index[i]


In [49]:
klm=(pd.DataFrame.from_dict(score, orient='index'))
klm = klm.sort_values('score', ascending=False).reset_index(drop=True)

In [50]:
klm.head()

Unnamed: 0,score,id
0,1.0,1
1,0.159152,4155
2,0.153625,2257
3,0.150473,126548
4,0.149143,5834


In [51]:
# print(klm.loc[klm['id'] == 3114, 'score'].values[0])

In [52]:
Top_5=klm['id'].iloc[1:6]
c=Top_5.to_numpy()
c  #array containing movieId of top5 movies

array([  4155,   2257, 126548,   5834,   8914], dtype=int64)

In [53]:
for movie in (c):
    print(final.loc[final['movieId'] == movie, 'title'].values[0],':',final.loc[final['movieId'] == movie, 'genres'].values[0])

Sweet November : ['Drama', 'Romance']
No Small Affair : ['Comedy', 'Drama', 'Music', 'Romance']
The DUFF : ['Romance', 'Comedy']
Fingers : ['Drama', 'Action', 'Thriller']
Primer : ['Science Fiction', 'Drama', 'Thriller']


Well, this did not produce recommendations as good as those made by the memory-based model.

The User-Factorization method is better suited to predicting ratings for unwatched movies.

That was our movie recommendation system.