In this project, I've built a **Recommender system** using the MovieLens Dataset.

Given the name of a movie, the model should output the top 5 movies most similar to it.

First, we create a rating matrix and normalise it.
Then, we compute the SVD (singular value decomposition) of this normalised rating matrix.

We then define a function that computes the cosine similarity between a given movie and every other movie.
Based on that cosine similarity, we sort the movies from most to least similar and return the top 5 matches for the given movie.

In [1]:
# Import all the required libraries

import numpy as np
import pandas as pd

#### Read the Dataset from three files containing the ratings, movies and users info

In [2]:
ratings_data = pd.read_csv('ratings.dat', delimiter='::', names=['userId', 'movieId', 'rating', 'timestamp'], engine='python')
movies_data = pd.read_csv('movies.dat', delimiter='::', names=['movieId', 'movie_title', 'genres'], engine='python', encoding='ISO-8859-1')
user_data = pd.read_csv('users.dat', delimiter='::', names=['userId', 'Gender', 'Age', 'Occupation', 'Zip-code'], engine='python')

In [3]:
ratings_data

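| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
| ... | ... | ... | ... | ... |
| 1000204 | 6040 | 1091 | 1 | 956716541 |
| 1000205 | 6040 | 1094 | 5 | 956704887 |
| 1000206 | 6040 | 562 | 5 | 956704746 |
| 1000207 | 6040 | 1096 | 4 | 956715648 |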


In [4]:
movies_data

| | movieId | movie_title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| ... | ... | ... | ... |
| 3878 | 3948 | Meet the Parents (2000) | Comedy |
| 3879 | 3949 | Requiem for a Dream (2000) | Drama |
| 3880 | 3950 | Tigerland (2000) | Drama |
| 3881 | 3951 | Two Family House (2000) | Drama |


In [5]:
user_data

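| | userId | Gender | Age | Occupation | Zip-code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 02460 |
| 4 | 5 | M | 25 | 20 | 55455 |
| ... | ... | ... | ... | ... | ... |
| 6035 | 6036 | F | 25 | 15 | 32603 |
| 6036 | 6037 | F | 45 | 1 | 76006 |
| 6037 | 6038 | F | 56 | 1 | 14706 |
| 6038 | 6039 | F | 45 | 0 | 01060 |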


### Rating Matrix

The rows of the matrix represent movies and the columns represent users. Entries for movies a user has not rated are 0.

In [6]:
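# Start from zeros so that unrated (movie, user) cells are 0 rather than uninitialised memory
rating_matrix = np.zeros(shape=(np.max(ratings_data.movieId.to_numpy()), np.max(ratings_data.userId.to_numpy())), dtype=np.uint8)
# movieId and userId start at 1, so subtract 1 to get row and column indices
rating_matrix[ratings_data.movieId.to_numpy()-1, ratings_data.userId.to_numpy()-1] = ratings_data.rating.to_numpy()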
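
To make the indexing concrete, here is a minimal sketch on a made-up ratings table. The `toy_ratings` frame and its values are invented for illustration and are not taken from MovieLens. It assumes, as above, that IDs start at 1, and it uses `np.zeros` so that unrated cells are explicitly 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 users, movie IDs up to 4 (movie 3 is never rated)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [1, 4, 2, 4],
    "rating":  [5, 3, 4, 1],
})

movie_idx = toy_ratings.movieId.to_numpy() - 1  # IDs start at 1, indices at 0
user_idx = toy_ratings.userId.to_numpy() - 1

# Rows are movies, columns are users; zeros mark missing ratings
toy_matrix = np.zeros((movie_idx.max() + 1, user_idx.max() + 1), dtype=np.uint8)
toy_matrix[movie_idx, user_idx] = toy_ratings.rating.to_numpy()

print(toy_matrix)
```

The row for movie 3 stays all zeros because nobody rated it, which is also what happens for unused movie IDs in the full matrix.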

In [7]:
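# Normalise the rating matrix

# Standardise every entry with the global mean and standard deviation
temp_norm_matrix = (rating_matrix - rating_matrix.mean()) / rating_matrix.std()
# Transpose to users x movies and scale by a constant factor
normalised_matrix = temp_norm_matrix.T / np.sqrt(rating_matrix.shape[0] - 1)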
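
As a quick sanity check of what this normalisation does, the sketch below applies the same two steps to a small made-up matrix (the `toy` array is an assumption for illustration only) and confirms that the standardised entries have global mean 0 and standard deviation 1, and that the result is laid out as users x movies.

```python
import numpy as np

# Made-up rating matrix: 4 movies x 3 users
toy = np.array([[5, 0, 0],
                [0, 4, 0],
                [0, 0, 0],
                [3, 0, 1]], dtype=np.uint8)

standardised = (toy - toy.mean()) / toy.std()        # global mean 0, global std 1
scaled = standardised.T / np.sqrt(toy.shape[0] - 1)  # users x movies after the transpose

print(scaled.shape)
print(np.isclose(standardised.mean(), 0.0), np.isclose(standardised.std(), 1.0))
```

Because the final division uses one constant for every entry, it rescales the singular values but does not change the singular vectors, so it should not affect the cosine similarities computed later.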

### Singular Value Decomposition

$$ \mathbf {M} =\mathbf {U\Sigma V^{T}} $$

SVD breaks down any real matrix (viewed as a linear transformation) into three fundamental parts: the transposed right singular matrix $\mathbf{V^{T}}$ (a rotation or reflection that prepares the space for scaling), a diagonal matrix $\mathbf{\Sigma}$ of singular values (axis-aligned scaling), and the left singular matrix $\mathbf{U}$ (another rotation or reflection that moves the scaled space into its final orientation).

In [8]:
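# To compute the SVD of the normalised matrix, we can use the svd function from np.linalg.
# Note: np.linalg.svd returns the third factor already transposed, so the variable V below holds V^T.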

U,S,V = np.linalg.svd(normalised_matrix)
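
The factorisation can be checked on a small example. The sketch below uses a made-up 2 x 3 matrix (not the rating matrix) and passes `full_matrices=False` to get the reduced factors. It verifies that multiplying the three factors back together recovers the original matrix, and that the third return value of `np.linalg.svd` is already $\mathbf{V^{T}}$.

```python
import numpy as np

# Made-up matrix for illustration
M = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Vt is V transposed: np.linalg.svd returns the right factor in that form
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Rebuild M from its factors: U @ diag(S) @ Vt
reconstructed = U @ np.diag(S) @ Vt
print(np.allclose(M, reconstructed))

# Singular values come back sorted in descending order
print(np.all(S[:-1] >= S[1:]))

# The rows of Vt are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(len(S))))
```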

### Cosine Similarity
Now we compute the cosine similarity to measure how similar any two movies are. For two vectors $x$ and $y$, the cosine similarity is given by:

$$ \mathrm{cosine}(x,y) = \frac{x\cdot y}{\|x\| \, \|y\|} $$
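
As a minimal sketch of this formula, the helper below (a hypothetical `cosine_similarity` function, not part of the notebook's own code) computes the value for two 1-D NumPy vectors.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two 1-D vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 2.0, 3.0])
same_direction = cosine_similarity(a, 2 * a)  # parallel vectors: 1 up to rounding
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # 0

print(same_direction, orthogonal)
```

The value depends only on the direction of the vectors, not their length, which is why scaling `a` by 2 leaves the similarity at 1.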

In [9]:
# Of the U, S and V obtained through SVD, V (the right singular matrix) is the one we'll use for the
# cosine similarity: the normalised matrix is users x movies, so V relates the movies to the latent factors.
# Each row of V.transpose() is one movie and each column is one latent factor.
column = V.transpose()[:, :500]
# The value 500 above is the number of singular values (latent factors) I'm using and can be changed.
# 500 gave me the closest output for the Universal Soldier (id=2808) example.
modulus = np.sqrt(np.einsum('ab,ab->a', column, column))  # Euclidean norm of each movie's row
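
The `einsum` expression `'ab,ab->a'` multiplies the array with itself element-wise and sums along each row, so its square root is the Euclidean norm of every row. The sketch below checks this on a small made-up array against `np.linalg.norm`.

```python
import numpy as np

# Made-up array: 3 rows, 2 columns
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [0.0, 2.0]])

norms_einsum = np.sqrt(np.einsum('ab,ab->a', X, X))  # sum of squares along each row
norms_linalg = np.linalg.norm(X, axis=1)             # same quantity via np.linalg.norm

print(norms_einsum)  # [5. 1. 2.]
print(np.allclose(norms_einsum, norms_linalg))
```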

In [10]:
def print_top5_recommendations(movieId):
    title = movies_data[movies_data.movieId == movieId].movie_title.to_numpy()[0]
    print('Top 5 Recommendations for ' + title + ': \n')
    i = movieId - 1  # movieId is one more than the row index of that movie (row 0 is movieId 1, Toy Story)
    r = column[i, :]  # latent-factor vector (row) of this particular movie
    cos_sim = np.dot(r, column.T) / (modulus[i] * modulus)  # cosine similarity with every movie
    descending_list = np.argsort(-cos_sim)  # indices sorted from most similar to least similar
    # Drop the movie itself (normally the most similar entry), then keep the top 5
    top_five = descending_list[descending_list != i][:5]
    for similar_id in top_five + 1:  # convert row indices back to movie IDs
        print(movies_data[movies_data.movieId == similar_id].movie_title.to_numpy()[0])

##### Get the top 5 recommendations given a movie ID

In [11]:
print_top5_recommendations(2808) # 2808 is the id for Universal Soldier(1992)

Top 5 Recommendations for Universal Soldier (1992): 

Soldier (1998)
Solo (1996)
Universal Soldier: The Return (1999)
Judge Dredd (1995)
Timecop (1994)
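
Since the function above takes a movie ID rather than a title, a small lookup step is needed to go from a name to an ID. The sketch below is one possible way to do that. The `find_movie_ids` helper is hypothetical (it is not part of the notebook), and `toy_movies` is a stand-in for `movies_data` built from three rows shown earlier.

```python
import pandas as pd

def find_movie_ids(movies, title_fragment):
    """Return the movieId and movie_title of rows whose title contains the fragment."""
    # Plain substring match, ignoring case; regex=False so "(1995)" is taken literally
    mask = movies.movie_title.str.contains(title_fragment, case=False, regex=False)
    return movies.loc[mask, ['movieId', 'movie_title']]

# Hypothetical stand-in for movies_data
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'movie_title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})

matches = find_movie_ids(toy_movies, 'toy story')
print(matches.movieId.tolist())  # [1]
```

With the real data, an ID found this way could then be passed to `print_top5_recommendations`.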
