The objective is to build a recommender system for movies.

A dataset containing titles, genres, keywords, synopses, etc. will be downloaded:
https://www.kaggle.com/tmdb/tmdb-movie-metadata


A query consists of the title of a movie, which must exist in the dataset. The top 5 recommendations are then returned based on that movie.

In [1]:
import numpy as np
import pandas as pd
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances


In [None]:
!wget https://lazyprogrammer.me/course_files/nlp/tmdb_5000_movies.csv


In [None]:
moviedata=pd.read_csv('tmdb_5000_movies.csv')


In [None]:
moviedata.head()

Some of the columns store their values as JSON strings.

Inspecting the first row of data:

In [None]:
row1=moviedata.iloc[0]
row1

In [None]:
row1['genres']

This is a JSON string that encodes a list of objects. The useful attribute here is "name", which contains the name of each genre.

In [None]:
row1['keywords']


Here too, the only attribute required is "name".

We need to convert this JSON string to a usable format.
The json.loads() function parses a valid JSON string and converts it into the
corresponding Python object, which in this case is a list of dictionaries:

In [None]:
x1=json.loads(row1['genres'])
x1

We need to convert this list of dictionaries to a single string of text that is suitable for TF-IDF.

In [None]:
" ".join("".join(x["name"].split()) for x in x1)

For every entry, we take the value of the "name" key, split it on whitespace,
and join the pieces back together with the empty string '' (so that Science Fiction becomes the single token ScienceFiction).

This is done for every genre "name", and the outer join() then
concatenates all genre tokens with a single space between them.
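
As a small worked example of the split-and-join step, the sketch below runs it on a hand-written JSON string. The string `raw` is a hypothetical stand-in in the same shape as a genres cell, not a value copied from the dataset.

```python
import json

# Hypothetical JSON string in the same shape as a 'genres' cell.
raw = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'

entries = json.loads(raw)  # parsed into a list of dictionaries

# Remove the whitespace inside each name so that every genre becomes one token.
tokens = ["".join(entry["name"].split()) for entry in entries]

# Join the tokens with single spaces to form the text passed to TF-IDF.
text = " ".join(tokens)
print(text)  # Action ScienceFiction
```

Collapsing multi-word names matters because the vectorizer would otherwise treat "Science" and "Fiction" as two unrelated tokens.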

We need to do this for the genres and keywords of every row in the dataset,
which the following function implements:

In [None]:
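def genres_and_keywords_to_string(row):
    # Parse the JSON strings into lists of dictionaries
    genres = json.loads(row['genres'])
    keywords = json.loads(row['keywords'])

    # Collapse each name into one token (e.g. "Science Fiction" -> "ScienceFiction")
    genres = " ".join("".join(x["name"].split()) for x in genres)
    keywords = " ".join("".join(y["name"].split()) for y in keywords)

    # Combine the genre and keyword tokens into a single string
    return "%s %s" % (genres, keywords)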

This function converts the useful tokens from each row's genres and keywords into a single string.

Note that, within the scope of this program, only the genres and keywords are used to compare movies.

Now we can apply this function to each row and store the result in a new column named string:

In [None]:
moviedata['string']=moviedata.apply(genres_and_keywords_to_string,axis=1)


Next, we create an instance of the TF-IDF vectorizer:

In [None]:
tfvectorizer=TfidfVectorizer(max_features=2000)

In [None]:
M1= tfvectorizer.fit_transform(moviedata['string'])

In [None]:
M1
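
To make the shape of this matrix concrete, here is a minimal sketch on a toy corpus. The three strings in `docs` are hypothetical, and `max_features=4` is chosen only to keep the example small. It illustrates two properties of TfidfVectorizer: tokens are lowercased by default, and `max_features` caps the vocabulary at the most frequent terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical strings in the same shape as the 'string' column.
docs = [
    "Action ScienceFiction spacewar",
    "Action Thriller heist",
    "Romance Drama",
]

# Keep only the 4 most frequent terms, analogous to max_features=2000 above.
vec = TfidfVectorizer(max_features=4)
mat = vec.fit_transform(docs)

print(mat.shape)        # one row per document, one column per kept term
print(vec.vocabulary_)  # maps each kept (lowercased) term to its column
```

Row i of the matrix corresponds to document i, which is what later allows a dataframe index to be used to look up a movie's vector.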

When we receive a query, we use the row index to map between the vectorizer's matrix and the original dataframe,
since both have their rows in the same order.

So we need a mapping from each movie title to its index in the dataframe:

In [None]:
movie2idx=pd.Series(moviedata.index,index=moviedata['title'])

In [None]:
movie2idx

And now, a function to get the index of a movie from its title :

In [None]:
def find_index(name):
    return movie2idx[name]

Now, for example, we can get the index of the movie Scream 3 from the dataframe

In [None]:
j=find_index('Scream 3')
j
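
One detail of this lookup is worth illustrating: if a title occurs more than once, indexing the mapping returns a Series of positions rather than a single integer. The sketch below uses hypothetical titles (`"Alpha"`, `"Beta"`) to show both cases and one possible way to pick the first match.

```python
import pandas as pd

# Hypothetical titles; "Alpha" deliberately occurs twice.
titles = pd.Series(["Alpha", "Beta", "Alpha"])
title_to_idx = pd.Series(titles.index, index=titles)

unique_hit = title_to_idx["Beta"]      # a single integer
duplicate_hit = title_to_idx["Alpha"]  # a Series holding both positions

print(type(unique_hit).__name__, type(duplicate_hit).__name__)

# One possible policy: fall back to the first matching row.
first_match = duplicate_hit.iloc[0] if isinstance(duplicate_hit, pd.Series) else duplicate_hit
print(first_match)  # 0
```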

And use that index to lookup the corresponding TFIDF vector for Scream 3 in the vectorizer's matrix (M1)

In [None]:
query=M1[j]
query.toarray()

So, we have established a link between the vectorizer's matrix and the original dataframe.

Now, assuming we have obtained the user's query, we have to find its similarity to the other movies in order to recommend
similar ones,
i.e. we have to compute the similarity of the given movie's vector to all the other movie vectors.

In [None]:
scores = cosine_similarity(query,M1)
scores
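
For intuition about what these scores mean, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand with NumPy on three small hypothetical vectors. The helper `cosine` is illustrative only and is not part of scikit-learn.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-term vectors.
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])  # shares one term with a
c = np.array([0.0, 1.0, 0.0])  # shares no term with a

print(cosine(a, a))  # approximately 1: identical direction
print(cosine(a, b))  # approximately 0.5: partial overlap
print(cosine(a, c))  # 0.0: no overlap
```

Since TF-IDF weights are non-negative, the scores for movie vectors fall between 0 and 1, and a movie always scores about 1 against itself.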

The array storing the similarity scores has shape 1xN, so we flatten it into a 1-D array:

In [None]:
scores= scores.flatten()

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.plot(scores)

Now we need the most similar movies, that is, the ones with the largest similarity scores.
So we need to sort the scores in descending order.

To make things simpler, we can use the argsort() function, which sorts a given set of elements but reports the result indirectly, as the indices of the elements in sorted order.

In this case, it will return the dataframe indices of the movies in descending order of their similarity scores.

argsort() sorts in ascending order, so we pass (-scores) to make it effectively sort the scores in
descending order.
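
A minimal sketch of the negate-then-argsort trick on a hypothetical four-element score array:

```python
import numpy as np

# Hypothetical similarity scores for four movies; index 1 plays the query movie.
toy_scores = np.array([0.2, 1.0, 0.7, 0.0])

# argsort sorts ascending, so negating the scores yields a descending order.
order = (-toy_scores).argsort()
print(order)  # [1 2 0 3]

# Skipping position 0 drops the query movie and keeps the next best matches.
top2 = order[1:3]
print(top2)   # [2 0]
```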

In [None]:
result_scores =(-scores).argsort()

In [None]:
plt.plot((-scores).argsort())

Now that we have the movie indices in descending order of similarity, we need to extract the top 5.
We skip the first entry, since the query movie has the highest similarity score with itself.


In [None]:
result_indices= result_scores[1:6]

And now we extract the corresponding movie titles for these indices:

In [None]:
recd_movies=moviedata['title'].iloc[result_indices]
recd_movies

Thus, from a query, we have obtained the 5 most similar movies.

To implement the entire process as a function:

In [None]:
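def get_recommendations(movie):
    idX = find_index(movie)
    # Duplicate titles return a Series of indices; use the first match
    if isinstance(idX, pd.Series):
        idX = idX.iloc[0]
    # Rank all movies by cosine similarity to the query, skipping the query itself
    scores = cosine_similarity(M1[idX], M1).flatten()
    result = (-scores).argsort()[1:6]
    print(moviedata['title'].iloc[result])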

In [None]:
get_recommendations('Fury')
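
As a possible variation, the sketch below runs the whole pipeline on a tiny hypothetical dataset and differs from the function above in three ways that are assumptions on our part: it returns a list instead of printing, it returns an empty list for an unknown title, and it excludes the query movie by index rather than by dropping the first sorted entry. The names `toy`, `toy_matrix`, `toy_title_to_idx` and `recommend` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for the 'title' and 'string' columns.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "string": [
        "Action ScienceFiction spacewar",
        "Action ScienceFiction alien",
        "Romance Drama wedding",
        "Action Thriller heist",
    ],
})

toy_matrix = TfidfVectorizer().fit_transform(toy["string"])
toy_title_to_idx = pd.Series(toy.index, index=toy["title"])

def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title`, or [] if it is unknown."""
    if title not in toy_title_to_idx.index:
        return []
    idx = toy_title_to_idx[title]
    if isinstance(idx, pd.Series):  # duplicate titles: use the first match
        idx = idx.iloc[0]
    scores = cosine_similarity(toy_matrix[idx], toy_matrix).flatten()
    scores[idx] = -np.inf           # exclude the query movie explicitly
    order = (-scores).argsort()[:top_n]
    return toy["title"].iloc[order].tolist()

print(recommend("A"))  # ['B', 'D']
print(recommend("Z"))  # []
```

Excluding the query by index avoids relying on the query movie always being ranked first, which may not hold if another movie has an identical vector.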

