I, [Sudhindra V](https://www.linkedin.com/in/vsudhindra/), am building an ML-based recommendation engine in collaboration with [Mr. Rocky Jagtiani](https://www.linkedin.com/today/author/rocky-jagtiani-3b390649/).
> This is a simple data science project: a movie recommendation system that suggests movies similar to one you already know, based on that movie's overview and metadata.

> Dataset: `tmdb_5000_credits.csv` and `tmdb_5000_movies.csv`, both from Kaggle

> Tech stack: Python, pandas, scikit-learn

> Recommended links : 

> https://datascience.suvenconsultants.com  ( For DS / AI / ML )

> https://monster.suvenconsultants.com  ( For Web development )

Recommender systems are among the most popular applications of data science today. They are used to predict the "rating" or "preference" that a user would give to an item. Almost every major tech company has applied them in some form. Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow.

Recommender systems have also been developed to explore research articles and experts, collaborators, and financial services. 

Recommender systems can be classified into two types:

> **Content-based recommenders**: suggest items similar to a particular item. This system uses item metadata (for movies: genre, director, description, actors, etc.) to make these recommendations. The general idea is that if a person likes a particular item, he or she will also like an item that is similar to it. To find such items, the system relies on the metadata of items the user has engaged with in the past. A good example is YouTube, which suggests new videos you might watch based on your history.

> **Collaborative filtering engines**: these systems are widely used. They try to predict the rating or preference that a user would give an item based on the past ratings and preferences of other users. Unlike their content-based counterparts, collaborative filters do not require item metadata.

In short:

* Content-based -> item metadata (meta tags)
* Collaborative filtering -> consumer or user behaviour

Collaborative filtering suffers from the **cold start problem**:

* You have just started your website.
* You do not yet have any ratings or user preferences to base recommendations on.

Solution: content-based + collaborative -> hybrid model
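
The hybrid idea can be sketched in a few lines. This is only an illustration under stated assumptions: the helper `hybrid_score`, the blend weight `alpha`, the `min_ratings` threshold, and the toy score arrays are all hypothetical and are not part of this notebook's pipeline.

```python
import numpy as np

def hybrid_score(content_scores, collab_scores, n_ratings, min_ratings=5, alpha=0.5):
    """Blend content-based and collaborative scores (hypothetical helper).

    While fewer than `min_ratings` user ratings exist (the cold-start phase),
    only the content-based scores are used.
    """
    content_scores = np.asarray(content_scores, dtype=float)
    if n_ratings < min_ratings:
        return content_scores
    collab_scores = np.asarray(collab_scores, dtype=float)
    return alpha * content_scores + (1 - alpha) * collab_scores

# Toy scores for three candidate items (made-up numbers)
content = [0.9, 0.2, 0.4]
collab = [0.1, 0.8, 0.4]

cold = hybrid_score(content, collab, n_ratings=0)    # new site: content only
warm = hybrid_score(content, collab, n_ratings=100)  # enough ratings: blend
print(cold)
print(warm)
```

A fixed weight is the simplest possible blend; it is used here only to show how content-based scores can cover for missing user behaviour.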

Here we are going to implement **Content Based Filtering**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Import Pandas
import pandas as pd

# Loading Data sets
full_url='/kaggle/input/tmdb-movie-metadata/tmdb_5000_credits.csv'

full_url1='/kaggle/input/tmdb-movie-metadata/tmdb_5000_movies.csv'

credits = pd.read_csv(full_url)
movies=pd.read_csv(full_url1)

In [None]:
# Show the first 5 rows of the credits dataset
credits.head()

In [None]:
# Show the first 5 rows of the movies dataset
movies.head()

In [None]:
# Printing the shapes of both the datasets
print("Credits:",credits.shape)
print("Movies:",movies.shape)

In [None]:
# Rename 'movie_id' to 'id' so that both datasets share the same key column
credits_renamed = credits.rename(columns={'movie_id': 'id'})
credits_renamed.head()

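**Merge (inner join)**: keeps only the rows whose key appears in both tables. For this dataset, the intersection of the two files contains 4803 common rows.

A small example with two tables, Emp and Dept:

Emp

| Name | Dept ID |
|------|---------|
| A    | 100     |
| B    | 101     |

Dept

| Dept ID | Dept  |
|---------|-------|
| 100     | IT    |
| 102     | SALES |

Emp inner join Dept:

| Name | Dept ID | Dept |
|------|---------|------|
| A    | 100     | IT   |

No employee works for the SALES department as of now, and B works for an unknown department, so neither appears in the inner join.

**Outer joins**:

1. Left: the output of the inner join plus all rows of the left table that did not match.

Emp left join Dept:

| Name | Dept ID | Dept |
|------|---------|------|
| A    | 100     | IT   |
| B    | 101     | NULL |

2. Right: the output of the inner join plus all rows of the right table that did not match.

Emp right join Dept:

| Name | Dept ID | Dept  |
|------|---------|-------|
| A    | 100     | IT    |
| NULL | 102     | SALES |

3. Full: inner + left + right, i.e. every row from both tables.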
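
The Emp/Dept example can be reproduced with `DataFrame.merge` and its `how` parameter. This is a minimal sketch; the column names `name`, `dept_id` and `dept` are assumptions chosen for illustration, since the notes above give no headers. pandas shows the NULL cells as `NaN`.

```python
import pandas as pd

# Toy tables from the notes (column names are illustrative assumptions)
emp = pd.DataFrame({"name": ["A", "B"], "dept_id": [100, 101]})
dept = pd.DataFrame({"dept_id": [100, 102], "dept": ["IT", "SALES"]})

inner = emp.merge(dept, on="dept_id", how="inner")  # only matching keys
left = emp.merge(dept, on="dept_id", how="left")    # all Emp rows
right = emp.merge(dept, on="dept_id", how="right")  # all Dept rows
full = emp.merge(dept, on="dept_id", how="outer")   # all rows from both

for label, frame in [("inner", inner), ("left", left), ("right", right), ("full", full)]:
    print(label, "->", len(frame), "row(s)")
    print(frame, end="\n\n")
```

`merge` defaults to `how="inner"`, which is why the notebook's call without a `how` argument keeps only the movies present in both files.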

In [None]:
# Merging both data sets
merge=movies.merge(credits_renamed,on='id')
merge.head()

In [None]:
# List the column names of the merged DataFrame
my_list = merge.columns.values.tolist()
print(my_list)

In [None]:
# Dropping unnecessary columns 
cleaned=merge.drop(columns=['homepage','title_x','title_y','status','production_countries'])
cleaned.head(2)

In [None]:
# List the column names that remain after dropping
my_list1 = cleaned.columns.values.tolist()
print(my_list1)

In [None]:
cleaned['overview'].head()

In [None]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english',ngram_range=(1,3),min_df=3,analyzer='word')

#Replace NaN with an empty string
cleaned['overview'] = cleaned['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(cleaned['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

In [None]:
# The TF-IDF matrix is sparse and has no .tolist() method.
# To inspect actual values, convert a small slice to a dense array.
print(tfidf_matrix[:2].toarray())

In [None]:
from sklearn.metrics.pairwise import linear_kernel

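# Compute the cosine similarity matrix.
# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# computed by linear_kernel equals the cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)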
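
Why `linear_kernel` is enough here can be checked on a tiny corpus. The sketch below uses three made-up overview strings (not taken from the dataset) and compares the dot products with `cosine_similarity`; it relies on the default `norm='l2'` setting of `TfidfVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up overviews, for illustration only
toy_overviews = [
    "a marine is sent to an alien moon",
    "a pirate captain sails to the edge of the world",
    "an alien invasion threatens the world",
]

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_overviews)

dot_products = linear_kernel(toy_tfidf, toy_tfidf)
cosines = cosine_similarity(toy_tfidf, toy_tfidf)

# Rows are unit length, so both matrices agree and the diagonal is 1
print(np.allclose(dot_products, cosines))
print(np.round(dot_products, 2))
```

On normalized rows the two functions give the same result, and `linear_kernel` skips the extra normalization step.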

In [None]:
print(cosine_sim.shape)
print(cosine_sim[0])
print(cosine_sim[1])

**We are going to define a function that takes a movie title as input and outputs a list of the 10 most similar movies. First, we need a reverse mapping from movie titles to DataFrame indices. In other words, we need a mechanism to identify the index of a movie in our metadata DataFrame given its title.**

In [None]:
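# Construct a reverse map from movie titles to row indices
indices = pd.Series(cleaned.index, index=cleaned['original_title'])
# Series.drop_duplicates() would compare the values (row indices), which are already unique.
# Drop duplicated titles instead, so that a lookup returns a single index.
indices = indices[~indices.index.duplicated(keep='first')]
indices[:5]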

We are now in a good position to define our recommendation function. These are the steps it will follow:

* Get the index of the movie given its title.
* Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is the position and the second is the similarity score.
* Sort this list of tuples by the similarity score, that is, the second element.
* Get the top 10 elements of the list. Ignore the first element, as it refers to the movie itself (the movie most similar to a particular movie is the movie itself).
* Return the titles corresponding to the indices of the top elements.
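
The steps above can be tried on a toy example before running them on the real data. Everything in this sketch is made up for illustration: the four titles, the 4x4 similarity matrix, and the helper name `toy_recommend`.

```python
import numpy as np
import pandas as pd

# Hypothetical titles and a hand-written symmetric similarity matrix
toy_titles = pd.Series(["Movie A", "Movie B", "Movie C", "Movie D"])
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.6, 0.4, 1.0],
])

# Reverse map: title -> row position
toy_indices = pd.Series(toy_titles.index, index=toy_titles)

def toy_recommend(title, top_n=2):
    idx = toy_indices[title]                        # 1. index of the movie
    scores = list(enumerate(toy_sim[idx]))          # 2. (position, score) tuples
    scores = sorted(scores, key=lambda pair: pair[1], reverse=True)  # 3. sort by score
    scores = scores[1:top_n + 1]                    # 4. skip the movie itself
    return toy_titles.iloc[[pos for pos, _ in scores]].tolist()      # 5. titles

print(toy_recommend("Movie A"))  # ['Movie C', 'Movie B']
```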

In [None]:
def get_recommendations(title, sim_matrix):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(sim_matrix[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return cleaned['original_title'].iloc[movie_indices]

In [None]:
# Getting the recommendation
get_recommendations('Avatar',cosine_sim)

In [None]:
# Getting the recommendation
get_recommendations('The Dark Knight Rises',cosine_sim)

# Enhancements

In [None]:
cleaned.columns

In [None]:
## have a look at the way data is stored in columns like crew or cast
cleaned['crew'].values[0]

In [None]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']

for feature in features:
    cleaned[feature] = cleaned[feature].apply(literal_eval)
    
## about literal_eval()    
## https://stackoverflow.com/questions/15197673/
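
What `literal_eval` does to one cell can be seen on a short string. The string below is a hand-written example in the style of the `genres` column, not a value copied from the dataset.

```python
from ast import literal_eval

# A stringified list of dicts, similar in shape to the 'genres' column
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)  # safely evaluates Python literals only

print(type(raw).__name__, "->", type(parsed).__name__)  # str -> list
print([entry["name"] for entry in parsed])              # ['Action', 'Adventure']
```

Unlike `eval`, `literal_eval` accepts only literals (strings, numbers, lists, dicts, and so on), so it will not execute arbitrary code found in the CSV.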

In [None]:
## Let's look at the data stored for the 0th movie.
cleaned['crew'].values[0]

## Notice: it is now a list of dict objects.

In [None]:
cleaned['cast'].values[0]

In [None]:
## function to get the director's name
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [None]:
## A function that returns the first 3 elements of a list, or the entire list
## if it has fewer than 3 elements.
## Here the list refers to the cast, keywords, and genres.

def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        # Slicing keeps at most the first three names; shorter lists are returned whole.
        return names[:3]

    # Return an empty list in case of missing/malformed data
    return []


In [None]:
# Define new director, cast, genres and keywords features 
## that are in a suitable form.
cleaned['director'] = cleaned['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']

for feature in features:
    cleaned[feature] = cleaned[feature].apply(get_list)

In [None]:
# Print the new features of the first 5 films
cleaned[['original_title', 'cast', 'director', 'keywords', 'genres']].head()

The next step would be to convert the names and keyword instances into lowercase and strip all the spaces between them.

Removing the spaces between words is an important preprocessing step. It is done so that your vectorizer doesn't count the Johnny of "Johnny Depp" and "Johnny Galecki" as the same. After this processing step, the aforementioned actors will be represented as "johnnydepp" and "johnnygalecki" and will be distinct to your vectorizer.
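
The effect of stripping spaces can be checked directly with `CountVectorizer`. This is a small sketch using only the two names from the paragraph above.

```python
from sklearn.feature_extraction.text import CountVectorizer

raw_names = ["Johnny Depp", "Johnny Galecki"]
joined_names = [name.lower().replace(" ", "") for name in raw_names]

# vocabulary_ maps each token to a column; its keys are the learned tokens
raw_vocab = sorted(CountVectorizer().fit(raw_names).vocabulary_)
joined_vocab = sorted(CountVectorizer().fit(joined_names).vocabulary_)

print(raw_vocab)     # ['depp', 'galecki', 'johnny']  -> 'johnny' is shared
print(joined_vocab)  # ['johnnydepp', 'johnnygalecki'] -> two distinct tokens
```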

In [None]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [None]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    cleaned[feature] = cleaned[feature].apply(clean_data)

You are now in a position to create your "metadata", which is a string that contains all the metadata that you want to feed to your vectorizer (namely keywords, cast, director and genres).

The create_metadata function will simply join all the required columns by a space. This is the final preprocessing step, and the output of this function will be fed into the word vector model.

In [None]:
def create_metadata(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])


In [None]:
# Create a new metadata feature
cleaned['metadata'] = cleaned.apply(create_metadata, axis=1)

In [None]:
cleaned[['metadata']].head(2)

The next steps are the same as for the overview-based recommender above.

One key difference is that you use CountVectorizer() instead of TF-IDF. This is because you do not want to down-weight an actor's or director's presence if he or she has acted in or directed relatively more movies. It doesn't make much intuitive sense to down-weight them in this context.

The major difference between CountVectorizer() and TF-IDF is the inverse document frequency (IDF) component, which is present in TF-IDF but absent from CountVectorizer().
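
A toy comparison makes the down-weighting visible. The metadata strings below are made up: a hypothetical `directora` appears in three movies and `directorb` in one. With scikit-learn's default settings, the IDF weight of a term t is ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing t, so frequent terms get a smaller weight.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up metadata strings: "directora" appears in three movies, "directorb" in one
toy_metadata = [
    "directora adventure",
    "directora drama",
    "directora scifi",
    "directorb drama",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_metadata).toarray()

tfidf_vec = TfidfVectorizer().fit(toy_metadata)
# Map each term to its learned IDF weight
idf = {term: tfidf_vec.idf_[col] for term, col in tfidf_vec.vocabulary_.items()}

# Raw counts: each director contributes exactly 1 to a movie they directed
print(counts[0, count_vec.vocabulary_["directora"]],
      counts[3, count_vec.vocabulary_["directorb"]])

# IDF: the prolific director gets a smaller weight than the rare one
print(round(idf["directora"], 3), round(idf["directorb"], 3))
```

With plain counts both directors weigh the same within their own movies, which is the behaviour wanted for cast and crew metadata.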

In [None]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')

count_matrix = count.fit_transform(cleaned['metadata'])


In [None]:
count_matrix.shape

In [None]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [None]:
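# Construct the reverse mapping from title to row position as before.
# 'cleaned' still has the default index produced by the merge, so no reset is needed.
indices = pd.Series(cleaned.index, index=cleaned['original_title'])
# Keep only the first row for each duplicated title so a lookup returns one position
indices = indices[~indices.index.duplicated(keep='first')]
indices[:2]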

In [None]:
## You can now reuse your get_recommendations() function 
## by passing in the new cosine_sim2 matrix as your second argument.

get_recommendations('The Dark Knight Rises', cosine_sim2)

In [None]:
get_recommendations('The Godfather', cosine_sim2)

I would like to humbly and sincerely thank my mentor [Rocky Jagtiani](https://www.linkedin.com/today/author/rocky-jagtiani-3b390649/). He is more of a friend to me than a mentor. The Machine Learning course he teaches, together with the various projects we have done and are still doing, is the best way to learn and build skills in the data science field. See https://datascience.suvenconsultants.com for more.