# Movie Recommendation System

Recommendation systems are of three types -

- Content Based - Based on the content you are currently watching

- Collaborative Based - Based on what users like you watch

- Hybrid - Mix of Both

In this project, we will be using a content-based recommendation system.

# Project Flow (Overview of Things)

 - Preprocessing the data to minimise errors
 - Building Model by Training and Testing
 - Converting to a website
 - Deploying it on Heroku

 Dataset Used - TMDB 5000 Movie Dataset

In [3]:
#importing all necessary libraries
import numpy as np
import pandas as pd
import ast
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

# Data Preprocessing

In [4]:
#creating dataframe to work upon
movies = pd.read_csv('./tmdb_5000_movies.csv')
credit = pd.read_csv('./tmdb_5000_credits.csv')

In [5]:
#merging both data frames
movies = movies.merge(credit, on='title')

In [6]:
#removing unwanted Columns
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]

In [7]:
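#removing Missing Data and resetting the index so that row labels match row positions (the similarity matrix built later is indexed by position)
movies = movies.dropna().reset_index(drop=True)

#checking for Duplicate Data (No Duplicates)
duplicate_count = movies.duplicated().sum()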

In [8]:
#helper function to extract the 'name' values we want from our dataframe's columns
def extract(obj):
    genres = []
    for i in ast.literal_eval(obj):
        genres.append(i['name'])
    return genres

#helper function to extract top 3 cast members
def extract_cast(obj):
    #slicing keeps at most the first 3 entries
    return [i['name'] for i in ast.literal_eval(obj)[:3]]

#helper function to extract director
def extract_director(obj):
    director = []
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            director.append(i['name'])
            break
    return director
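
To make the helper functions above concrete, here is a small sketch of how `ast.literal_eval` turns one stringified cell into Python objects. The sample string is made up and only mimics the shape of a `genres` cell, and `extract_names` with its `limit` parameter is a hypothetical stand-in for the helpers, not part of the project code.

```python
import ast

# Hypothetical sample mimicking the shape of one 'genres' cell (not real dataset content)
sample = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'

def extract_names(obj, limit=None):
    """Parse a stringified list of dicts and return the 'name' values."""
    parsed = ast.literal_eval(obj)  # evaluates literals only, unlike eval()
    names = [item['name'] for item in parsed]
    return names[:limit] if limit is not None else names

print(extract_names(sample))           # ['Action', 'Science Fiction']
print(extract_names(sample, limit=1))  # ['Action']
```

Passing `limit=3` would behave like the cast helper, which keeps only the first three names.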

In [9]:
#extracting genres as words by calling the helper function above
movies['genres'] = movies['genres'].apply(extract)

#same for keywords
movies['keywords'] = movies['keywords'].apply(extract)

#now for cast
movies['cast'] = movies['cast'].apply(extract_cast)

#now for director
movies['crew'] = movies['crew'].apply(extract_director)

#now overview as words (split() with no argument also handles repeated spaces and newlines)
movies['overview'] = movies['overview'].apply(lambda x:x.split())

In [10]:
#removing spaces inside names, genres etc so that each one becomes a single word (e.g. 'Sam Worthington' becomes 'SamWorthington')
movies['genres'] = movies['genres'].apply(lambda x:[i.replace(' ', '') for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x:[i.replace(' ', '') for i in x])
movies['cast'] = movies['cast'].apply(lambda x:[i.replace(' ', '') for i in x])
movies['crew'] = movies['crew'].apply(lambda x:[i.replace(' ', '') for i in x])
#no need for overview as it doesn't have any spaces left
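
Why removing the spaces helps can be seen in a small sketch: once the tags are split into words, two different people who share a first name would otherwise contribute a common word. The names below are only an illustration, and `tokens` is a hypothetical helper.

```python
def tokens(names, keep_spaces):
    """Return the set of lowercase words produced by a list of names."""
    if not keep_spaces:
        names = [name.replace(' ', '') for name in names]
    return set(' '.join(names).lower().split())

movie_a = ['Sam Worthington']
movie_b = ['Sam Mendes']

# With spaces kept, both movies share the word 'sam'
print(tokens(movie_a, True) & tokens(movie_b, True))    # {'sam'}

# With spaces removed, each name is a single distinct word
print(tokens(movie_a, False) & tokens(movie_b, False))  # set()
```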

In [11]:
#merging all those columns into tags
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

#extracting all the necessary columns out (copy avoids pandas' SettingWithCopyWarning on the assignment below)
movies = movies[['movie_id', 'title', 'tags']].copy()

#converting tags back to lower case strings
movies['tags'] = movies['tags'].apply(lambda x: " ".join(x).lower())

# Building Model

Approaches -

- Count the words two movies have in common and rank by the highest count
- Text Vectorization and finding similar (closest) vectors

We will go with Text Vectorization here

# Text Vectorization

We will use the bag of words technique here.

Bag of words Basic Algorithm -

- combine all tags
- find the most common words (excluding stop words*, and after stemming the words**)
- count how often each of those words occurs in each movie's tags, giving one vector per movie
- find the closest vectors, which will be our recommended movies (instead of Euclidean*** distance, we will use cosine**** similarity)

*stop words - words like in, are, a, we, it etc. They are excluded as they do not contribute much to the meaning of a sentence

**stemming - reducing words like 'action' and 'actions', which would otherwise be counted as different words, to one common stem. This will be done before vectorizing

***Euclidean distance - the ordinary straight-line distance between two points. It is less reliable in higher dimensions

****Cosine similarity - closeness measured by the angle between two vectors: the smaller the angle, the more similar the movies
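
The choice of angle over straight-line distance can be illustrated with a from-scratch sketch. It uses only the standard library rather than scikit-learn, and the vocabulary, the tag strings and the helper names (`count_vector`, `euclidean_distance`, `cosine_similarity_pair`) are all assumptions made up for this example.

```python
import math
from collections import Counter

def count_vector(text, vocabulary):
    """Bag-of-words vector: how often each vocabulary word occurs in text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity_pair(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vocabulary = ['alien', 'space', 'love', 'war']  # toy vocabulary (assumption)
short_tags = 'alien space war'
long_tags = 'alien space war ' * 3              # same words, three times as long
other_tags = 'love love war'

a = count_vector(short_tags, vocabulary)  # [1, 1, 0, 1]
b = count_vector(long_tags, vocabulary)   # [3, 3, 0, 3]
c = count_vector(other_tags, vocabulary)  # [0, 0, 2, 1]

# Euclidean distance puts the unrelated tags (c) closer to a than the longer copy (b)
print(euclidean_distance(a, b), euclidean_distance(a, c))

# Cosine similarity ignores length: a and b point the same way, a and c barely overlap
print(cosine_similarity_pair(a, b), cosine_similarity_pair(a, c))
```

In this toy case the longer tags use exactly the same words as the short ones, yet only the cosine measure treats them as the closest match.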

In [12]:
#Helper function to stem our tags
ps = PorterStemmer()

def stem(text):
    tags = []
    for i in text.split():
        tags.append(ps.stem(i))
    
    return " ".join(tags)

In [13]:
#stemming our tags
movies['tags'] = movies['tags'].apply(stem)

In [14]:
#creating a CountVectorizer object
#max_features - keep only the 5000 most frequent words, stop_words - remove common English stop words
cv = CountVectorizer(max_features = 5000, stop_words = 'english')

In [15]:
#counting how often each of the 5000 most common words occurs in every movie's tags. each row of the result is one movie's frequency vector. the individual words do not matter to us, only their frequencies
#fit_transform returns a sparse matrix, as most of the 5000 words will not be there in any given movie's tags. toarray() converts it to a regular numpy array
vectors = cv.fit_transform(movies['tags']).toarray()

In [16]:
# function to give us words
#cv.get_feature_names_out() 

In [17]:
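#cosine similarity works the opposite way to a distance: higher means closer. for word counts it gives values between 0 and 1, 1 meaning the same direction and 0 meaning no words in common. will give a matrix with the similarity of each movie with every other movie. the diagonal is each movie compared with itself, so it will be 1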
similarity = cosine_similarity(vectors)

In [18]:
#looking up the name in the title column. if found, that row will be fetched. of that row we are fetching the item at 0th index which is the index of that movie in dataframe

# movies['title'] == 'Avatar' will give a boolean Series which is True for the rows whose title matches

# movies[movies['title'] == 'Avatar'] will give the whole matching row as we are passing that boolean Series as a filter

# movies[movies['title'] == 'Avatar'].index[0] will give just the index of the first match

# we will get the index, from that we will get its row of scores from similarity and then sort them from highest to lowest while keeping each score's index with it so that we can fetch the recommended movies from our dataframe

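def recommend(movie):
    movie_index = movies[movies['title'] == movie].index[0]
    distances = similarity[movie_index]
    #the first item after sorting is the movie itself, so we take the next 5
    movie_list = sorted(enumerate(distances), reverse = True, key = lambda x: x[1])[1:6]
    
    for index, similarity_score in movie_list:
        #title is fetched first as reusing the same quote type inside an f-string is a syntax error before Python 3.12
        title = movies.iloc[index]['title']
        print(f'"{title}" matches with a score of {similarity_score * 100}')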
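
The sorting step inside `recommend` can be tried on its own with a small sketch. The scores below are made-up numbers standing in for one row of the similarity matrix, and only the top 2 are kept instead of the top 5.

```python
# Toy similarity row for the movie at position 0 (made-up numbers)
distances = [1.0, 0.10, 0.45, 0.30, 0.05]

# Pair each score with its position, then sort by score, highest first
ranked = sorted(enumerate(distances), key=lambda pair: pair[1], reverse=True)

# The first entry is the movie itself (score 1.0), so skip it
top_two = ranked[1:3]
print(top_two)  # [(2, 0.45), (3, 0.3)]
```

Keeping the position next to each score is what lets us look the titles up in the dataframe afterwards.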

In [19]:
movie_name = input("Enter Movie Name (Be wary of the title, it should match exactly, including case and spacing) - ")
recommend(movie_name)

"Aliens vs Predator: Requiem" matches with a score of 28.676966733820226
"Aliens" matches with a score of 26.901379342448518
"Falcon Rising" matches with a score of 26.051302464767538
"Independence Day" matches with a score of 25.560859370538303
"Titan A.E." matches with a score of 25.03866978335957


In [24]:
#exporting titles
with open('./movies_dict.pkl', 'wb') as f:
    pickle.dump(movies.to_dict(), f)

#exporting similarity matrix
with open('./similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)
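
Reading the exported files back, as the website step would presumably do, might look like the sketch below. It is only an illustration under stated assumptions: the dictionary is toy data in the shape produced by `to_dict()`, and a temporary directory is used so that no real file is touched.

```python
import os
import pickle
import tempfile

# Stand-in for movies.to_dict(): column name -> {row index -> value} (toy data)
movies_dict = {'movie_id': {0: 1, 1: 2}, 'title': {0: 'Movie A', 1: 'Movie B'}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'movies_dict.pkl')

    # Write the object; the with block closes the file automatically
    with open(path, 'wb') as f:
        pickle.dump(movies_dict, f)

    # Read it back in binary mode
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(loaded == movies_dict)  # True
```

On the real file, passing the loaded dictionary to `pd.DataFrame` should give back a dataframe with the same columns.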