In [None]:
%pip install pandas

In [None]:
import numpy as np
import pandas as pd

Importing the Python libraries.

In [None]:
movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")

Loading the two datasets, movies and credits.

In [None]:
movies.head()

In [None]:
credits.head()

In [None]:
movies.shape

`shape` gives the number of rows and columns in a dataset.

We have 4803 movies stored in rows, along with 20 columns of different attributes.

In [None]:
credits.shape

Like the 'movies' dataset, the 'credits' dataset has 4803 movies, along with 4 columns.

In [None]:
movies = movies.merge(credits,on='title')

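Merging both datasets on 'title' so we can work with them as one. The movies dataset has 20 columns and credits has 4; since the 'title' column is common to both, the merged dataset has 23 columns in total. The row count can be slightly higher than 4803, because a title that appears more than once in both datasets is matched with every row sharing that title.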
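
A minimal sketch of how `merge` arrives at its row and column counts, using two tiny made-up frames (the names `left` and `right` and their values are illustrative and not taken from the dataset). The shared key column appears once in the result, and a repeated title produces extra rows.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are made up for illustration).
left = pd.DataFrame({"title": ["A", "B", "B"], "budget": [1, 2, 3]})
right = pd.DataFrame({"title": ["A", "B", "B"], "cast": ["x", "y", "z"]})

merged = left.merge(right, on="title")

# Columns: 2 + 2 - 1 shared key = 3.
# Rows: "A" matches once; each of the two "B" rows on the left matches
# both "B" rows on the right (2 * 2 = 4), giving 5 rows in total.
print(merged.shape)  # (5, 3)
```

Comparing `shape` before and after a merge is a quick way to notice duplicated keys.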

In [None]:
movies.head()

In [None]:
movies.info()

In [None]:
movies = movies[['id',  'title','genres', 'keywords', 'overview', 'cast', 'crew']]

Keeping only the columns that are useful for recommending movies and dropping the rest.
['id',  'title','genres', 'keywords', 'overview', 'cast', 'crew'] are the columns we'll use in the next steps.

In [None]:
movies.head(1)

In [None]:
movies.isnull().sum()

Checking whether any column in the movies dataset has missing values. Since only 3 values are missing in the dataset, we can safely drop those rows.

In [None]:
movies.dropna(inplace=True)

This line removes every row of the DataFrame that has at least one missing value. Because `inplace=True` modifies `movies` itself instead of returning a copy, the DataFrame no longer contains any missing values after this step.
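
As a small illustration of these two steps, here is a sketch on a made-up three-row frame (the frame `df` and its values are assumptions for the example). It also shows that the remaining rows keep their original index labels, so labels and row positions no longer coincide after dropping.

```python
import pandas as pd

# Toy frame with one missing overview (values are made up).
df = pd.DataFrame({"title": ["A", "B", "C"],
                   "overview": ["text", None, "more text"]})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column["overview"])  # 1

# Drop every row that contains at least one missing value, in place.
df.dropna(inplace=True)

# Index labels are not renumbered: label 1 is simply gone.
print(list(df.index))  # [0, 2]
```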

In [None]:
movies.isnull().sum()

In [None]:
movies.duplicated().sum()

Checking for duplicate rows in the dataset. Since there are none, the result is 0.

In [None]:
movies.iloc[0].genres

In [None]:
import ast

Since the genres column is stored in an awkward format, we want to convert each value into a plain list, such as ['Action', 'Adventure', 'Fantasy', 'Science Fiction']

In [None]:
def convert(obj):
    L=[]
    for i in ast.literal_eval(obj):
        L.append(i['name'])
    return L

Looping over the raw value directly will not work, because each genres entry is a string that only looks like a list of dictionaries. First, we have to convert the string into an actual list of Python dictionaries.

For this purpose, Python provides a function called "ast.literal_eval()".

"ast.literal_eval()" safely evaluates a string containing a Python literal, such as a dictionary, list, tuple, etc., and converts it into the corresponding Python object. Unlike "eval()", it does not execute arbitrary code.

In [None]:
import ast
ast.literal_eval('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'
)

In [None]:
movies['genres'] = movies['genres'].apply(convert)

In [None]:
movies.head()

Now, as we can see, the genres have all been simplified into lists. We'll do the same for keywords.

In [None]:
movies['keywords'] = movies['keywords'].apply(convert)

In [None]:
movies.head()

In [None]:
movies['cast'][0]

We don't need all the other information. In this string too, only the 'name' key is important. Also, we'll fetch only the first 3 actors' names from the above data.

In [None]:
def convert3(obj):
    L=[]
    counter = 0 
    for i in ast.literal_eval(obj):
        if counter != 3:
            L.append(i['name'])
            counter += 1
        else:
            break
    return L

In [None]:
movies['cast'] = movies['cast'].apply(convert3)

In [None]:
movies['cast'][1]

In [None]:
movies.head()

In [None]:
movies['crew'][1]

From the 'crew' column, only the director's name will get fetched.

In [None]:
import ast
def direct(obj):
    L=[]
    for i in ast.literal_eval(obj):
        if i['job']=='Director':
            L.append(i['name'])
            break
    return L

In [None]:
movies['crew']=movies['crew'].apply(direct)

In [None]:
movies['crew']

In [None]:
movies.head()

Now that the other columns are sorted out, we will split the overview column into a list of words, so that each word acts as a tag for searching.

In [None]:
movies['overview'][0]

In [None]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())

In [None]:
movies.head()

We will remove the spaces inside multi-word entries in genres, keywords, cast and crew, so that each entry (for example, a full name) becomes a single tag.
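
A small sketch of why this helps, assuming the tags are later split on whitespace (the two names are only example values): without joining, two different people who share a first name would also share a token.

```python
# Two different people who share a first name (example values).
names = ["Sam Worthington", "Sam Mendes"]

# Splitting on whitespace without joining: both produce the token "Sam".
split_tokens = [set(n.split()) for n in names]
shared_before = split_tokens[0] & split_tokens[1]

# Removing the inner space first: each name becomes one distinct token.
joined_tokens = [{n.replace(" ", "")} for n in names]
shared_after = joined_tokens[0] & joined_tokens[1]

print(shared_before)  # {'Sam'}
print(shared_after)   # set()
```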

In [None]:
movies['genres'] = movies['genres'].apply(lambda x:[i.replace(' ','') for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x:[i.replace(' ','') for i in x])
movies['cast'] = movies['cast'].apply(lambda x:[i.replace(' ','') for i in x])
movies['crew'] = movies['crew'].apply(lambda x:[i.replace(' ','') for i in x])

In [None]:
movies.head()

As we'll be using ['overview','genres','keywords','cast','crew'] as tags to search for movies, we'll merge them together into one column.

In [None]:
movies['tags']= movies['overview']+ movies['genres']+ movies['keywords']+ movies['cast']+ movies['crew']

In [None]:
movies.head()

Now that all the columns that serve as tags are combined, we'll drop the other columns by creating a new DataFrame that contains only the columns we need.

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when new_df is modified in later cells
new_df = movies[['id', 'title', 'tags']].copy()

In [None]:
new_df

In [None]:
new_df['tags'] = new_df['tags'].apply(lambda x: " ".join(x))

We have joined each movie's list of tags into a single space-separated string, combining the pieces that we had previously kept separate for the sake of simplicity.

In [None]:
new_df.head()

In [None]:
new_df['tags'][0]

In [None]:
new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())

As a general recommendation, it is easier to work with text when all the characters are in lowercase, so that 'Action' and 'action' count as the same word.

In [None]:
new_df

In [None]:
new_df['tags'][0]

In the above tags column, we have many unnecessary words like 'in', 'a', 'the' and so on. These are known as stop words: they carry little meaning on their own and mainly express the relationship between other words.

To simplify the data, the Bag of Words technique is used along with vectorization.
Vectorization counts how often each of the most frequent words occurs in each movie's tags, which gives every movie a numerical vector that we can compare to find similar movies.
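
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch on two made-up tag strings. It is not how `CountVectorizer` is implemented; the tiny stop-word set and the helper name `bag_of_words` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical, tiny stop-word list; real lists are much longer.
STOP_WORDS = {"in", "a", "the", "of"}

def bag_of_words(texts):
    """Return (vocabulary, count vectors) for a list of lowercase strings."""
    # Split each text on whitespace and drop stop words.
    tokenized = [[w for w in t.split() if w not in STOP_WORDS] for t in texts]
    # The vocabulary is every remaining word, in a fixed (sorted) order.
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary word, 0 if the word is absent.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, vecs = bag_of_words(["a hero in the space war",
                            "the war of the space hero hero"])
print(vocab)  # ['hero', 'space', 'war']
print(vecs)   # [[1, 1, 1], [2, 1, 1]]
```

Each position in a vector corresponds to one vocabulary word, which is what lets two movies be compared numerically.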


How exactly are we going to find the top 5 similar movies? First, we turn each movie's text information into a numerical vector. Each vector is a point in a high-dimensional space, so we can measure the distance between one movie and every other movie in the dataset. The 5 movies at the shortest distance from the movie are the top 5 similar movies.

For this, Python has a separate library that minimizes the work, known as scikit-learn.

In [None]:
%pip install scikit-learn

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Keep the 5000 most frequent words and ignore English stop words
cv = CountVectorizer(max_features=5000, stop_words='english')

In [None]:
vectors = cv.fit_transform(new_df['tags']).toarray()

In [None]:
vectors

The `vectors` array holds, for each movie, the counts of the meaningful words in its tags column.

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cv.get_feature_names_out()

Now, words like 'action' and 'actions' are similar, but since they differ in their characters, they are treated as different coordinates.
Here, we'll apply stemming. With stemming, related forms such as 'dance' and 'dancing' are reduced to one common stem (the Porter stemmer used below gives 'danc' for both).
That is, words are reduced to a basic form, which is not always a dictionary word.

nltk is a 'natural language toolkit' that will help us do the same thing.

In [None]:
%pip install nltk

In [None]:
import nltk

In [None]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [None]:
def stem(text):
    y=[]
    
    for i in text.split():
        y.append(ps.stem(i))
    
    return " ".join(y)

In [None]:
new_df['tags'] = new_df['tags'].apply(stem)

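By stemming, we reduce words to their most basic form, so different forms of the same word no longer count as separate words. Note that `vectors` was computed before this step, so the vectorization cell (`cv.fit_transform`) has to be re-run after stemming for it to affect the similarity scores.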

In [None]:
new_df['tags'].shape

Now we have to calculate the distance between the vectors of two movies. That is, we will calculate the similarity between them using their numerical vectors.
Here, we have two options for measuring the distance:
1) Euclidean distance
2) angular or cosine distance

Euclidean distance is the tip-to-tip distance between two vectors. It becomes less reliable for high-dimensional data like ours, and it also depends on how many words a movie's tags contain.
That's why we will use the cosine distance, which depends only on the angle between the vectors.

For this too, sklearn provides a function, 'cosine_similarity'.
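
A minimal sketch of the difference between the two measures, written in plain Python rather than with sklearn (the helper names and the two toy count vectors are assumptions for illustration). The second vector has the same word proportions as the first but twice as many words.

```python
import math

def euclidean_distance(u, v):
    # Straight-line (tip-to-tip) distance between two vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_sim(u, v):
    # Cosine of the angle between two vectors: dot product over the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

short_tags = [1, 2, 0]  # word counts for a movie with few tags
long_tags = [2, 4, 0]   # same proportions, twice as many words

# The Euclidean distance is clearly nonzero, although the content is alike.
print(euclidean_distance(short_tags, long_tags))
# The cosine similarity is 1 (up to floating-point rounding): same direction.
print(cosine_sim(short_tags, long_tags))
```

Because cosine similarity ignores vector length, a movie with a long description is not pushed away from a similar movie with a short one.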

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
similarity = cosine_similarity(vectors)

In [None]:
similarity

In [None]:
similarity.shape

We got (4806, 4806) as the shape of the similarity matrix, since we calculated the similarity of each of the 4806 movies with all 4806 movies.

From this we can say that the higher the similarity value, the smaller the distance between two movies, and vice versa.

Now, we have to create a function that finds the top 5 movies with the least distance from the movie we searched for.
-First, when we input the movie name, we look up the index of that movie.
-Then, we fetch the similarity between that movie and all the other movies. But here we have an issue: to find the least distance, we have to sort the similarity values in reverse order, from highest to lowest, and sorting loses the original positions of the movies. So we use the 'enumerate' function, which turns the list of similarities into (index, similarity) tuples, so that each value keeps its index when sorted.
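
The enumerate-then-sort step can be sketched on a short made-up list of scores (the values are illustrative, with position 0 standing for the searched movie itself):

```python
# Made-up similarity scores of one movie against four movies.
# Position 0 is the movie compared with itself, so its score is 1.0.
scores = [1.0, 0.2, 0.7, 0.5]

# enumerate pairs each score with its position, so the position survives sorting.
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Skip the first entry (the movie itself) and keep the next two.
top_two = ranked[1:3]
print(top_two)  # [(2, 0.7), (3, 0.5)]
```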

In [None]:
sorted(list(enumerate(similarity[0])),reverse=True,key=lambda x:x[1])

In [None]:
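def recommend(movie):
    # similarity is indexed by row position, and the index labels have gaps after dropna(),
    # so look up the positional row number rather than the index label.
    movie_index = np.flatnonzero((new_df['title'] == movie).to_numpy())[0]
    distances = similarity[movie_index]
    # Sort (position, similarity) pairs by similarity, skip the movie itself, keep the top 5.
    movies_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]
    for i in movies_list:
        print(i[0])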

In [None]:
recommend('Avatar')

Now that we have the indexes of the top 5 movies similar to the one we searched for, we will look up the movie names from these indexes.

In [None]:
print(new_df.iloc[539].title)

Using this code, we can fetch a movie's title from its index.
Hence, keeping the rest of the function as it is, we will add this lookup to the function.

In [None]:
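def recommend(movie):
    # similarity is indexed by row position, and the index labels have gaps after dropna(),
    # so look up the positional row number rather than the index label.
    movie_index = np.flatnonzero((new_df['title'] == movie).to_numpy())[0]
    distances = similarity[movie_index]
    # Sort (position, similarity) pairs by similarity, skip the movie itself, keep the top 5.
    movies_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]
    for i in movies_list:
        # iloc is positional, matching the positions used in similarity.
        print(new_df.iloc[i[0]].title)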

In [None]:
recommend('Batman Begins')

Now we get the top 5 recommended movies when we search for the name of any movie.

Now that we have developed the recommendation model on the dataset, we'll turn this backend into a website.