**Movie recommendation system**

---
Download and extract the dataset from the following link: https://www.kaggle.com/datasets/shubhammehta21/movie-lens-small-latest-dataset?select=links.csv
Run the cell below, click "Choose Files" and select 'movies.csv' and 'ratings.csv'.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving movies.csv to movies (3).csv
Saving ratings.csv to ratings (3).csv


**Imports**

---


pandas will be used to manipulate the datasets and NumPy to perform the matrix operations.

In [None]:
import pandas as pd
import io
import numpy as np


**The Solution**

---

We will use content-based filtering to recommend movies to the user based on the user's ratings of a few movies.
We will mainly work with two matrices: a user matrix and a genre matrix.
The user matrix contains the ratings that the user gives. The genre matrix initially contains 1 if a movie belongs to a given genre and 0 otherwise.

I chose the matrix factorization algorithm for its simplicity and effectiveness, and because it is widely used on various platforms.
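
To make the two matrices concrete, here is a minimal sketch on made-up data. The three movies, the three genres and the ratings are all hypothetical and do not come from the dataset. It only illustrates the idea of weighting genre rows by ratings and then scoring movies against the resulting profile.

```python
import numpy as np

# Hypothetical genre matrix: one row per movie, one column per genre
# columns: Comedy, Drama, Action
genre_matrix = np.array([
    [1, 0, 0],   # movie A: Comedy
    [1, 1, 0],   # movie B: Comedy, Drama
    [0, 0, 1],   # movie C: Action
], dtype=float)

# Hypothetical ratings for movies A, B and C, in the same row order
user_ratings = np.array([4.0, 2.0, 1.0])

# Rating-weighted sum of the genre rows, normalized so the weights sum to 1
profile = user_ratings @ genre_matrix
profile = profile / profile.sum()

# Score each movie by the dot product of its genre row with the profile
scores = genre_matrix @ profile
print(profile)
print(scores)
```

A movie that combines several genres the user rated highly ends up with a larger score than a movie with a single weakly rated genre.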

Comments explain the preprocessing of the datasets.

In [None]:
#Storing the movie information into a pandas dataframe
movies_df = pd.read_csv('movies.csv')
#Storing the user information into a pandas dataframe
ratings_df = pd.read_csv('ratings.csv')
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
#Using regular expressions to find a year stored between parentheses
#We specify the parentheses so we don't conflict with movies that may have years in their titles
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
#Removing the parentheses
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)
#Removing the years from the 'title' column (regex=True is required, otherwise the pattern is treated as plain text)
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
#Using the strip function to get rid of any ending whitespace characters that may have appeared
movies_df['title'] = movies_df['title'].str.strip()
movies_df.head()


Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
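
The year extraction can also be written with a single capture group. This is a small sketch on a hand-made series of titles (the sample titles are assumptions chosen for illustration), showing that a year at the start of a title is left alone because only a four-digit number inside parentheses is matched.

```python
import pandas as pd

# Hand-made sample titles; the last one has a year-like number in the title itself
titles = pd.Series([
    'Toy Story (1995)',
    'Jumanji (1995)',
    '2001: A Space Odyssey (1968)',
])

# The capture group keeps only the digits between the parentheses
year = titles.str.extract(r'\((\d{4})\)', expand=False)

# Remove the parenthesized year and any whitespace left around the title
clean_title = titles.str.replace(r'\s*\(\d{4}\)', '', regex=True).str.strip()

print(pd.DataFrame({'title': clean_title, 'year': year}))
```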


In [None]:
#Every genre is separated by a | so we simply have to call the split function on |
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


In [None]:
#Copying the movie dataframe into a new one so that movies_df itself stays unchanged
moviesWithGenres_df = movies_df.copy()

#For every row in the dataframe, iterate through the list of genres and place a 1 into the corresponding column
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        moviesWithGenres_df.at[index, genre] = 1

#Filling in the NaN values with 0 to show that a movie doesn't have that column's genre
moviesWithGenres_df = moviesWithGenres_df.fillna(0)
display(moviesWithGenres_df)




Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic,"[Action, Animation, Comedy, Fantasy]",2017,0.0,1.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9738,193583,No Game No Life: Zero,"[Animation, Comedy, Fantasy]",2017,0.0,1.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9739,193585,Flint,[Drama],2017,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9740,193587,Bungo Stray Dogs: Dead Apple,"[Action, Animation]",2018,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
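
The same 0/1 genre columns can be built without a Python loop. The sketch below is one possible alternative, not the method used in this notebook: it assumes a small stand-in frame named `sample` (three rows copied from the table above) and uses pandas' `str.get_dummies`. Note that the resulting columns come out in alphabetical order rather than in order of first appearance.

```python
import pandas as pd

# Small stand-in for movies_df after the split step
sample = pd.DataFrame({
    'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men'],
    'genres': [
        ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'],
        ['Adventure', 'Children', 'Fantasy'],
        ['Comedy', 'Romance'],
    ],
})

# Join each list back into a '|' separated string, then one-hot encode it
one_hot = sample['genres'].str.join('|').str.get_dummies(sep='|').astype(float)

# Attach the genre columns to the original rows
encoded = pd.concat([sample, one_hot], axis=1)
print(encoded.drop(columns='genres'))
```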


In [None]:
# dropping columns we don't need
# in drop('timestamp', axis=1), axis=1 means a column and 'timestamp' is the column heading we want to drop
ratings_df = ratings_df.drop('timestamp', axis=1)
ratings_df.head()


Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


Here the user input is hardcoded. The movies are from the dataset. More movies can be added.

In [None]:
userInput = [
            {'title':'Jumanji', 'rating':1.5},
            {'title':'Toy Story', 'rating':0.5},
            {'title':'Flint', 'rating':5},
            {'title':"Waiting to Exhale", 'rating':3.5},
            {'title':'Assassins', 'rating':2.0}
         ] 
# converting user input list into a dataframe
inputMovies = pd.DataFrame(userInput)
# df.shape returns the dimensions of the matrix
print(inputMovies.shape)

(5, 2)


In [None]:
#Filtering out the movies by title
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]

#Then merging it so we can get the movieId. It's implicitly merging it by title.
inputMovies = pd.merge(inputId, inputMovies)

#Dropping information we won't use from the input dataframe
inputMovies = inputMovies.drop(['genres', 'year'], axis=1)

#Final input dataframe
#If a movie you added above isn't here, then it might not be in the original
#dataframe or it might be spelled differently; please check capitalisation.
inputMovies
print(inputId)

      movieId              title  \
0           1          Toy Story   
1           2            Jumanji   
3           4  Waiting to Exhale   
22         23          Assassins   
9739   193585              Flint   

                                                 genres  year  
0     [Adventure, Animation, Children, Comedy, Fantasy]  1995  
1                        [Adventure, Children, Fantasy]  1995  
3                              [Comedy, Drama, Romance]  1995  
22                            [Action, Crime, Thriller]  1995  
9739                                            [Drama]  2017  
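
Titles that are misspelled or missing from the dataset silently disappear in the merge. As a sketch of one way to catch them, the example below merges on an explicit key and uses the `indicator` option of `merge`. The frames `catalog` and `user_input` and the title 'Not In Catalog' are hypothetical stand-ins.

```python
import pandas as pd

# Stand-in for the relevant columns of movies_df
catalog = pd.DataFrame({
    'movieId': [1, 2, 4],
    'title': ['Toy Story', 'Jumanji', 'Waiting to Exhale'],
})

# Stand-in for the user input, including one title that is not in the catalog
user_input = pd.DataFrame([
    {'title': 'Jumanji', 'rating': 1.5},
    {'title': 'Toy Story', 'rating': 0.5},
    {'title': 'Not In Catalog', 'rating': 4.0},
])

# A right merge keeps every input row; '_merge' tells us which ones found a match
checked = catalog.merge(user_input, on='title', how='right', indicator=True)
missing = checked.loc[checked['_merge'] == 'right_only', 'title'].tolist()
print('Titles not found:', missing)

# The inner merge keeps only matched titles, in the order of the catalog
matched = catalog.merge(user_input, on='title', how='inner')
print(matched)
```

The matched rows follow the catalog order, not the order in which the user typed the titles, which matters for any later step that pairs ratings with rows by position.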


  


In [None]:
#Filtering out the movies from the input
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]
userMovies

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,23,Assassins,"[Action, Crime, Thriller]",1995,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9739,193585,Flint,[Drama],2017,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# dropping unnecessary columns for now

userMovies = userMovies.drop(['movieId', 'title', 'genres', 'year'], axis=1)
# moving the original row labels into an 'index' column (it is dropped in the next cell)
userMovies.reset_index(inplace=True)
display(userMovies)

Unnamed: 0,index,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,9739,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
userMovies = userMovies.drop(['index'], axis=1)

# keeping only the rating column of the input
inputMovies = inputMovies.drop(['movieId', 'title'], axis=1)
# converting the genre table to a NumPy array (one row per rated movie, one column per genre)
np_array = userMovies.to_numpy()

# ratings as a 1 x 5 row vector
# NOTE: the order must match the rows of userMovies, which follow the dataset order
# (Toy Story, Jumanji, Waiting to Exhale, Assassins, Flint), not the order of userInput.
# The values below are listed in userInput order, so reorder them when you change the ratings.
a = np.array([[1.5, 0.5, 5.0, 3.5, 2.0]])

# unnormalized user profile: rating-weighted sum of the genre rows (shape 1 x 20)
arr = np.dot(a, np_array)

# sum of all weights (to divide by)
x = np.sum(arr)

# creating the user profile: dividing every genre weight by the total so the weights sum to 1
arr = arr / x
display(arr)

array([[0.05479452, 0.04109589, 0.05479452, 0.17808219, 0.05479452,
        0.1369863 , 0.19178082, 0.09589041, 0.09589041, 0.09589041,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ]])
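
The rating vector has to follow the same row order as the genre rows it multiplies. A minimal sketch of one way to guarantee that is to take the ratings from the merged input frame and look the genre rows up by `movieId`, instead of typing the ratings by hand. The frames `merged_input` and `genre_rows` are small hypothetical stand-ins with only three movies and three genres.

```python
import numpy as np
import pandas as pd

# Stand-in for the merged input: movieId, title and the user's rating
merged_input = pd.DataFrame({
    'movieId': [1, 2, 193585],
    'title': ['Toy Story', 'Jumanji', 'Flint'],
    'rating': [0.5, 1.5, 5.0],
})

# Stand-in genre table, indexed by movieId (columns: Adventure, Children, Drama)
genre_rows = pd.DataFrame(
    [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]],
    index=[1, 2, 193585],
    columns=['Adventure', 'Children', 'Drama'],
)

# Selecting by movieId puts the genre rows in the same order as the ratings
aligned_genres = genre_rows.loc[merged_input['movieId']].to_numpy()
ratings = merged_input['rating'].to_numpy()

# Rating-weighted genre profile, normalized to sum to 1
profile = ratings @ aligned_genres
profile = profile / profile.sum()
print(dict(zip(genre_rows.columns, profile.round(4))))
```

With this layout, changing a rating in the input frame changes the profile without any second list to keep in sync.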

In [None]:
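# user profile as a 1 x 20 row vector
new = arr.reshape(1, 20)
# keeping a reference to the full table (movieId and title are needed later)
j = moviesWithGenres_df
d = moviesWithGenres_df[['movieId']]
# keeping only the 20 genre columns
moviesWithGenres_df = moviesWithGenres_df.drop(['movieId', 'title', 'genres', 'year'], axis=1)
new.shape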

(1, 20)


In [None]:
# score of every movie = dot product of the user profile with the movie's genre vector
hey = moviesWithGenres_df.to_numpy().transpose()   # 20 genres x number of movies
nte = np.matmul(new, hey)                           # 1 x number of movies
final = pd.DataFrame(nte.transpose(), index=moviesWithGenres_df.index)

# attaching the movieId and title of every movie to its score
final['movieId'] = j['movieId']
final['title'] = j['title']

final.columns = ['rating', 'movieId', 'title']

final.sort_values(by=['rating'], ascending=False, inplace=True)
#display(final)

top_15 = final.head(15)
display(top_15)

Unnamed: 0,rating,movieId,title
3460,0.835616,4719,Osmosis Jones
3608,0.753425,4956,"Stunt Man, The"
7441,0.712329,81132,Rubber
9106,0.69863,144606,Confessions of a Dangerous Mind
2903,0.69863,3893,Nurse Betty
1394,0.69863,1912,Out of Sight
400,0.671233,459,"Getaway, The"
8597,0.671233,117646,Dragonheart 2: A New Beginning
505,0.657534,587,Ghost
4693,0.657534,7007,"Last Boy Scout, The"
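
An optional refinement is to leave out the movies the user has already rated before taking the top entries. The sketch below assumes a small hypothetical score table (`scores`) and a hypothetical list of already rated ids (`already_rated`), and uses `nlargest` in place of sorting the whole frame.

```python
import pandas as pd

# Hypothetical score table in the same layout as the final dataframe
scores = pd.DataFrame({
    'rating': [0.2, 0.9, 0.5, 0.9],
    'movieId': [10, 11, 12, 13],
    'title': ['A', 'B', 'C', 'D'],
})

# Hypothetical ids of movies the user has already rated
already_rated = [11]

# Keep only unseen movies, then take the two highest scores
candidates = scores[~scores['movieId'].isin(already_rated)]
top = candidates.nlargest(2, 'rating')
print(top)
```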


References: https://towardsdatascience.com/recommendation-system-matrix-factorization-d61978660b4b
The article contains a detailed explanation of matrix factorization using content-based and collaborative filtering.