## Building a Movie Recommendation System with Python using NLP and Cosine Similarity Algorithm

### Introduction

This notebook builds a content-based movie recommendation system using natural language processing techniques. The system recommends movies based on their similarity to a movie the user has previously watched and enjoyed. The dataset used in this project is the [TMDB 5000 Movie Dataset](https://www.kaggle.com/tmdb/tmdb-movie-metadata), which contains metadata on approximately 5000 movies.

The method used in this project is content-based filtering, a recommendation technique that analyzes the features of an item in order to recommend similar items. Here, the features are the textual data associated with each movie (its overview, tagline and title), and movies whose text is similar are recommended together.

The various steps involved in creating the recommendation system are:

1. **Data preprocessing**: This step involves cleaning and preparing the dataset for analysis. We will remove any unnecessary columns and rows, handle missing values and preprocess the textual data by tokenizing, stemming and vectorizing it.

2. **Feature extraction**: In this step, we will use vectorization techniques such as CountVectorizer and TfidfTransformer to extract features from the preprocessed textual data.

3. **Calculating similarity scores**: We will calculate similarity scores between each pair of movies using techniques such as cosine similarity and linear kernel.

4. **Building the recommendation system**: Finally, we will build the recommendation system by implementing a function that takes a movie title as input and outputs a list of similar movie titles.


#### Import the relevant Python libraries

In [18]:
import pandas as pd 
import numpy as np

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

from nltk import word_tokenize
from nltk.stem import PorterStemmer

#### Read the dataset from a CSV file, and check its first few rows

In [19]:
df = pd.read_csv("C:\\Users\\StudentIn\\Downloads\\tmdb_5000_movies.csv", encoding= "UTF-8") ##original dataset available on kaggle https://www.kaggle.com/tmdb/tmdb-movie-metadata

df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [20]:
#Checking the dimensions of the dataframe
print(df.columns)
print(df.shape)


Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')
(4803, 20)


#### Concatenate the chosen text fields of the original dataframe into one document per movie, print the resulting text from the first row for verification, and create a list of titles

This section of code concatenates three columns of text data from the movie dataset (overview, tagline, and title) into a single string for each movie. That string is the document that will later be vectorized and used to compute similarity scores between movies.

The code selects the three columns from the dataframe df and applies a lambda function to each row (axis=1) that joins the three values into a single string, separated by spaces. The astype(str) call converts non-string values to strings; this includes missing values, which typically end up as the literal text `nan` rather than being dropped. The resulting Series of strings is saved in the variable text.

In [21]:
text = df[['overview', 'tagline', 'title']].apply(lambda x: ' '.join(x.astype(str)), axis=1) #concat the chosen text fields to create document vector
print(text[0]) #check the result

title = df['title']
title.head() #to check the results

In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Enter the World of Pandora. Avatar


0                                      Avatar
1    Pirates of the Caribbean: At World's End
2                                     Spectre
3                       The Dark Knight Rises
4                                 John Carter
Name: title, dtype: object
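
Because stringifying a missing value yields the text `nan`, a movie without a tagline can end up with `nan` as one of its tokens. The sketch below uses a small made-up frame (not the TMDB data) to show the effect and one possible alternative, which fills missing values with an empty string before joining. The frame `toy` and the names `joined_raw` and `joined_clean` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame standing in for the overview/tagline/title columns.
toy = pd.DataFrame({
    "overview": ["A marine on an alien moon.", "A spy thriller."],
    "tagline": ["Enter the world.", np.nan],  # second tagline is missing
    "title": ["Avatar", "Spectre"],
})
cols = ["overview", "tagline", "title"]

# Stringifying every value: the missing tagline becomes the text "nan".
joined_raw = toy[cols].apply(lambda row: " ".join(str(v) for v in row), axis=1)

# Alternative: replace missing values with "" first, then join and tidy whitespace.
joined_clean = (
    toy[cols]
    .fillna("")
    .apply(lambda row: " ".join(row), axis=1)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

print(joined_raw[1])
print(joined_clean[1])
```

Whether the extra `nan` token matters in practice depends on how many rows have missing fields; since it would be shared by many documents, its IDF weight would be comparatively low.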

#### Define a new simplified dataframe containing only the title and the text string description of each movie



In [22]:
df_min = pd.DataFrame({'title':title, 'text':text})
df_min.head()

Unnamed: 0,title,text
0,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,Spectre,A cryptic message from Bond’s past sends him o...
3,The Dark Knight Rises,Following the death of District Attorney Harve...
4,John Carter,"John Carter is a war-weary, former military ca..."


#### Define a StemTokenizer class that stems the tokens of a document using the Porter Stemmer algorithm

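The code defines a custom tokenizer class. Its instances are callable: given a document, they split it into tokens with NLTK's word_tokenize and reduce each token to its stem using the Porter stemming algorithm. An instance of this class is passed to CountVectorizer in the next step to tokenize the text data in the movie dataset.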

In [23]:
class StemTokenizer:
    def __init__(self):
        self.porter = PorterStemmer()
    def __call__(self, doc): 
        tokens = word_tokenize(doc)
        return [self.porter.stem(t) for t in tokens]

#### Create a CountVectorizer object with lowercase words and a tokenizer that stems the words

This code creates a document-term matrix of raw term counts from the concatenated text of each movie. The TF-IDF weighting is applied to this matrix in the following steps, and the weighted matrix is what the similarity scores are computed from.

In [24]:
#first of all, compute the TF to get the count of all the unique words (to apply later to the TFIDF)

vectors=CountVectorizer(lowercase= True, tokenizer = StemTokenizer())

doc_term_matrix = vectors.fit_transform(df_min['text'])



#### Create a TfidfTransformer object and fit it to the document-term matrix

This code calculates the IDF values for each term in the movie dataset. IDF values measure the rarity of a term in a corpus, and are used to weight the importance of each term in the document-term matrix.

In [25]:
# Determine idf
idf = TfidfTransformer()

idf.fit(doc_term_matrix)
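
For reference, with its default settings (`smooth_idf=True`) scikit-learn's `TfidfTransformer` computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The sketch below re-implements that formula in plain Python; the function name `smooth_idf` is made up for illustration and is not part of scikit-learn.

```python
import math

def smooth_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed IDF, following the formula documented for TfidfTransformer defaults."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 4803  # number of rows reported by df.shape

rare = smooth_idf(n_docs, 1)             # term found in a single movie description
everywhere = smooth_idf(n_docs, n_docs)  # term found in every description

print(round(rare, 6))        # 8.784057
print(round(everywhere, 6))  # 1.0
```

The first value is the upper bound for this corpus (a term that occurs in only one movie), and 1.0 is the lower bound (a term that occurs in every movie).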

#### Create a dataframe that displays the IDF (inverse document frequency) values for each term in the document-term matrix, sorted by decreasing IDF values



In [26]:
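#calculate inverse doc frequency
# idf.idf_ is ordered by column index; get_feature_names_out() (scikit-learn >= 1.0)
# returns the terms in that same order, whereas vocabulary_.keys() is in first-seen
# order and would attach the wrong term to each value
idf_df = pd.DataFrame(idf.idf_, index=vectors.get_feature_names_out(), columns=['idf'])

idf_df.sort_values(by=['idf'], ascending=False)

The output below was produced with the index taken from vectors.vocabulary_.keys(), so its term labels are not aligned with the IDF values (rare stems such as barsoom appear next to the lowest scores). The values themselves are unaffected.

Unnamed: 0,idf
herzling,8.784057
nine-year-old,8.784057
vitti,8.784057
counsel,8.784057
fundrais,8.784057
...,...
prisoner-of-war,1.235764
barsoom,1.163107
dewitt,1.105499
psychoneurot,1.083988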


#### Apply the IDF weights to the document-term matrix to create a TF-IDF matrix

In [27]:


tf_idf = idf.transform(doc_term_matrix)
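
As a side note, scikit-learn also offers `TfidfVectorizer`, which combines the counting and weighting steps and accepts the same `tokenizer` argument. The sketch below, on a made-up three-document corpus with the default tokenizer, checks that the two routes give the same matrix and that each row has unit L2 norm under the default settings. It is only an illustration of the equivalence, not a change to the notebook's approach.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical mini corpus standing in for the movie descriptions.
docs = [
    "marine explores alien moon",
    "spy follows cryptic message",
    "alien message reaches marine base",
]

# Two-step route, as in the notebook: raw counts, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
row_norms = np.linalg.norm(one_step.toarray(), axis=1)

print(same)       # True when both routes agree
print(row_norms)  # each row should have length 1 (default norm='l2')
```

Keeping the two steps separate, as the notebook does, has the advantage that the raw counts and the fitted IDF values can each be inspected on their own.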

#### Create a pandas series that connects the rows in the sparse matrix with the titles of the movies in the 'title' column of the df_min dataframe

In [28]:

indices = pd.Series(df_min.index, index=df_min['title']) #connecting the rows in the sparse matrix with the titles (same rows)
indices.head()

title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
dtype: int64
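
One thing to keep in mind with a title-indexed Series: if two movies share a title, looking that title up returns several positions instead of one. Whether the TMDB data contains repeated titles is not checked in this notebook, so the sketch below uses made-up titles to show the behavior and one possible way to keep only the first occurrence. The names `titles_demo`, `indices_demo` and `first_only` are illustrative.

```python
import pandas as pd

# Hypothetical titles; "Batman" appears twice on purpose.
titles_demo = pd.Series(["Avatar", "Batman", "Spectre", "Batman"], name="title")
indices_demo = pd.Series(titles_demo.index, index=titles_demo)

single = indices_demo["Avatar"]  # a single integer position
dup = indices_demo["Batman"]     # a Series holding both positions

# One option: keep only the first row for each title.
first_only = indices_demo[~indices_demo.index.duplicated(keep="first")]

print(single)
print(list(dup))
print(first_only["Batman"])
```

Keeping the first occurrence is a simplification; disambiguating by release year would be another option if the extra column is carried along.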

#### Define a function that takes a movie title, a cosine similarity matrix, and the indices series, and returns the titles of the 5 most similar movies

The cell below defines the lookup function and then computes the similarity matrix from the TF-IDF document vectors. Cosine similarity measures the cosine of the angle between two non-zero vectors and is commonly used to compare text documents. Because TfidfTransformer L2-normalizes each row by default, the plain dot product computed by linear_kernel on the TF-IDF matrix is equal to the cosine similarity.
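
The relationship between the dot product and cosine similarity can be checked by hand. The sketch below uses two small made-up vectors and plain Python helpers (`dot`, `l2_normalize` and `cosine` are illustrative names, not library functions) to show that the dot product of L2-normalized vectors matches the cosine similarity of the originals.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

cos_ab = cosine(a, b)                           # 4 / sqrt(30)
lin_ab = dot(l2_normalize(a), l2_normalize(b))  # dot product after normalization

print(cos_ab, lin_ab)
```

Skipping the normalization inside the similarity call is the reason linear_kernel is often preferred over cosine_similarity when the rows are already unit length.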

In [29]:
def get_rec(title, cosine_sim, indices):
    # Row position of the requested movie (assumes the title is unique in the dataset)
    idx = indices[title]

    # Get the pairwise similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (normally the movie itself) and keep the 5 most similar movies
    sim_scores = sim_scores[1:6]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top similar movies
    return df_min['title'].iloc[movie_indices]


# TF-IDF rows are L2-normalized by default, so this dot product is the cosine similarity
cosine_sim = linear_kernel(tf_idf, tf_idf)

#### Finally, call the get_rec() function with the cosine similarity matrix (built from the TF-IDF matrix with linear_kernel()) to get the 5 most similar movies to the chosen movie title

In [30]:
print(get_rec("Lara Croft Tomb Raider: The Cradle of Life", cosine_sim, indices)) ##insert movie title

332     Lara Croft: Tomb Raider
0                        Avatar
4457              Pandora's Box
1902                    The Box
1632        The Next Three Days
Name: title, dtype: object
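
The function drops the first entry of the sorted list on the assumption that it is the query movie. If another movie had an identical description (score 1.0) and a lower row position, the query itself could end up in the results instead. A possible variant, sketched below in plain Python on a made-up similarity row, excludes the query by its position and uses `heapq.nlargest` so the full row does not need to be sorted. The function name `top_k_similar` is hypothetical.

```python
import heapq

def top_k_similar(sim_row, query_idx, k=5):
    """Return positions of the k highest scores in sim_row, skipping query_idx."""
    candidates = ((score, i) for i, score in enumerate(sim_row) if i != query_idx)
    best = heapq.nlargest(k, candidates)
    return [i for _score, i in best]

# Hypothetical similarity row for movie 0; movie 3 has an identical description.
row = [1.0, 0.12, 0.40, 1.0, 0.05, 0.33]

print(top_k_similar(row, query_idx=0, k=3))  # [3, 2, 5]
```

The returned positions could then be passed to `df_min['title'].iloc[...]` in the same way as in get_rec.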
