I, [Akshay Vasala](https://www.linkedin.com/in/akshay-aki-3b03411b1/), am building an ML-based recommendation engine in collaboration with [Mr. Rocky Jagtiani](https://www.linkedin.com/today/author/rocky-jagtiani-3b390649/).
 
> This is a simple data science project: a movie recommendation system that suggests movies similar to one you pick, first from each movie's overview text and then from its cast, director, keywords and genres.

> Dataset: tmdb_5000_credits.csv and tmdb_5000_movies.csv, both from Kaggle

> Tech stack used: Python, pandas, scikit-learn

> Recommended links : 

> https://datascience.suvenconsultants.com  ( For DS / AI / ML )

> https://monster.suvenconsultants.com  ( For Web development )

Recommender systems are among the most popular applications of data science today. They are used to predict the "rating" or "preference" that a user would give to an item. Almost every major tech company has applied them in some form: Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow.

Recommender systems have also been developed to recommend research articles, experts, collaborators, and financial services.

Recommender systems can be classified into two main types:

> **Content-based recommenders**: suggest items similar to a particular item. They use item metadata (for movies: genre, director, description, actors, etc.) to make these recommendations. The general idea is that if a person likes a particular item, he or she will also like items that are similar to it, so the system draws on the metadata of items the user engaged with in the past. A good example is YouTube, which suggests new videos you might watch based on your history.

> **Collaborative filtering engines**: these systems are widely used. They try to predict the rating or preference that a user would give an item based on the past ratings and preferences of other users. Unlike their content-based counterparts, collaborative filters do not require item metadata.

Here we are going to implement **Content Based Filtering**
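
To make the difference between the two families concrete, here is a minimal sketch on made-up numbers. The feature columns, the ratings and the `cosine` helper are illustrative assumptions and are not taken from the TMDB data.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Content-based view: one row per movie, one column per metadata feature.
# Hypothetical columns: action, crime, romance
item_features = np.array([
    [1, 1, 0],   # movie A
    [1, 1, 0],   # movie B: same genres as A
    [0, 0, 1],   # movie C
], dtype=float)

# Collaborative view: one row per user, one column per movie (0 = not rated).
ratings = np.array([
    [5, 0, 4],   # user 1 rated A and C
    [4, 0, 5],   # user 2 rated A and C
    [0, 3, 0],   # user 3 rated only B
], dtype=float)

# Content-based: compare movies by their own metadata rows.
content_sim_ab = cosine(item_features[0], item_features[1])
content_sim_ac = cosine(item_features[0], item_features[2])

# Collaborative: compare movies by the rating columns users gave them.
collab_sim_ab = cosine(ratings[:, 0], ratings[:, 1])
collab_sim_ac = cosine(ratings[:, 0], ratings[:, 2])

print("content:       A~B =", content_sim_ab, " A~C =", content_sim_ac)
print("collaborative: A~B =", collab_sim_ab, " A~C =", round(collab_sim_ac, 3))
```

In this toy setup the content-based view pairs A with B (identical metadata), while the collaborative view pairs A with C (rated by the same users). The rest of this notebook only needs the first kind of input.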

In [3]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/tmdb-5000-credits/tmdb_5000_credits.csv
/kaggle/input/tmdb-movie-metadata/tmdb_5000_movies.csv
/kaggle/input/tmdb-movie-metadata/tmdb_5000_credits.csv


In [4]:
# Import Pandas
import pandas as pd

# Loading Data sets
full_url='/kaggle/input/tmdb-movie-metadata/tmdb_5000_credits.csv'

full_url1='/kaggle/input/tmdb-movie-metadata/tmdb_5000_movies.csv'

credits = pd.read_csv(full_url)
movies=pd.read_csv(full_url1)

In [5]:
# Printing the first 5 rows of the credits dataset
credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [6]:
# Printing the first 2 rows of the movies dataset
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [7]:
# Printing the shapes of both the datasets
print("Credits:",credits.shape)
print("Movies:",movies.shape)

Credits: (4803, 4)
Movies: (4803, 20)


In [8]:
# Rename 'movie_id' to 'id' so the credits table shares a key column with the movies table
credits_renamed = credits.rename(columns={'movie_id': 'id'})
credits_renamed.head()

Unnamed: 0,id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [9]:
# Merging both data sets
merge=movies.merge(credits_renamed,on='id')
print(merge.head())
print(merge.columns)


      budget                                             genres  \
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
1  300000000  [{"id": 12, "name": "Adventure"}, {"id": 14, "...   
2  245000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
3  250000000  [{"id": 28, "name": "Action"}, {"id": 80, "nam...   
4  260000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   

                                       homepage      id  \
0                   http://www.avatarmovie.com/   19995   
1  http://disney.go.com/disneypictures/pirates/     285   
2   http://www.sonypictures.com/movies/spectre/  206647   
3            http://www.thedarkknightrises.com/   49026   
4          http://movies.disney.com/john-carter   49529   

                                            keywords original_language  \
0  [{"id": 1463, "name": "culture clash"}, {"id":...                en   
1  [{"id": 270, "name": "ocean"}, {"id": 726, "na...                en   
2  [{"id": 470, "nam
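
The `title_x` and `title_y` columns dropped in the next cell come from the merge itself: both tables carry a `title` column, and pandas appends the default suffixes `_x` and `_y` to overlapping non-key columns. A minimal sketch on toy frames follows; `movies_toy` and `credits_toy` are hypothetical stand-ins, and `validate="one_to_one"` is an optional safeguard that the notebook does not use.

```python
import pandas as pd

# Toy stand-ins for the two TMDB tables; only the column names mirror the notebook.
movies_toy = pd.DataFrame(
    {"id": [1, 2], "title": ["Movie A", "Movie B"], "budget": [10, 20]}
)
credits_toy = pd.DataFrame(
    {"id": [2, 1], "title": ["Movie B", "Movie A"], "cast": ["[]", "[]"]}
)

# validate="one_to_one" raises pandas.errors.MergeError if an id repeats on either side.
merged_toy = movies_toy.merge(credits_toy, on="id", validate="one_to_one")

print(merged_toy.columns.tolist())
print((merged_toy["title_x"] == merged_toy["title_y"]).all())
```

If the two title columns agree on every row, as they do in this toy case, dropping both and keeping `original_title` loses little information.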

In [10]:
# Dropping unnecessary columns 
cleaned=merge.drop(columns=['homepage','title_x','title_y','status','production_countries'])
cleaned.head()

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,release_date,revenue,runtime,spoken_languages,tagline,vote_average,vote_count,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Enter the World of Pandora.,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]","At the end of the world, the adventure begins.",6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",A Plan No One Escapes,6.3,4466,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",The Legend Ends,7.6,9106,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]","Lost in our world, found in another.",6.1,2124,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [12]:
# Preview the overview text used in the TF-IDF step
cleaned['overview'].head()

In [13]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object: remove English stop words such as 'the' and 'a',
#use word n-grams of length 1 to 3, and ignore terms that occur in fewer than 3 overviews
tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1, 3), min_df=3, analyzer='word')

#Replace NaN with an empty string
cleaned['overview'] = cleaned['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(cleaned['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(4803, 9919)
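
The shape above means 4803 overviews described by 9919 terms that survived the filters. The sketch below shows how `stop_words`, `ngram_range` and `min_df` decide which terms survive, using a three-sentence toy corpus. The sentences are invented, and `min_df=2` replaces the notebook's `min_df=3` only because the corpus is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_overviews = [
    "a dark knight protects the city",
    "the dark knight returns to the city",
    "a space marine explores a distant moon",
]

# Same idea as the notebook's settings, with min_df lowered for the tiny corpus.
toy_tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 3), min_df=2)
toy_matrix = toy_tfidf.fit_transform(toy_overviews)

# Only terms (single words or n-grams) present in at least 2 overviews are kept.
print(sorted(toy_tfidf.vocabulary_))
print(toy_matrix.shape)

# The third overview shares no surviving term, so its row is entirely zero.
print(toy_matrix[2].nnz)
```

A row of zeros, like the third one here, has zero similarity to every other row, which is what happens to a movie whose overview is empty or uses only rare terms.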

In [15]:
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

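# Compute the cosine similarity matrix. TfidfVectorizer L2-normalises each row by default,
# so the plain dot product returned by linear_kernel equals the cosine similarity.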
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
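
Using `linear_kernel` for a cosine similarity works because TF-IDF rows are unit length by default. The following sketch checks this on a toy corpus; the sentences and variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

toy_docs = [
    "batman fights crime in gotham",
    "batman returns to gotham",
    "a love story in paris",
]

# TfidfVectorizer uses norm="l2" by default, so every non-empty row has length 1.
toy_X = TfidfVectorizer().fit_transform(toy_docs)
row_norms = np.linalg.norm(toy_X.toarray(), axis=1)

# For unit-length rows, the dot product and the cosine similarity coincide.
dot_products = linear_kernel(toy_X, toy_X)
cosines = cosine_similarity(toy_X, toy_X)
kernels_agree = np.allclose(dot_products, cosines)

print(row_norms)
print(kernels_agree)
```

`linear_kernel` skips the normalisation step that `cosine_similarity` performs, which is why it is often preferred when the input is already normalised.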

In [16]:
print(cosine_sim.shape)
print(cosine_sim[1])
print(cosine_sim[2])
print(cosine_sim[1341])

(4803, 4803)
[0.         1.         0.         ... 0.02445021 0.         0.        ]
[0.        0.        1.        ... 0.0162416 0.        0.       ]
[0.20925108 0.         0.         ... 0.00716611 0.         0.        ]


In [17]:
#Construct a reverse map from movie titles to row indices
indices = pd.Series(cleaned.index, index=cleaned['original_title']).drop_duplicates()
#Row index of the query movie (named idx to avoid shadowing the built-in id())
idx = indices['Batman Begins']
#Pair every row index with its similarity to the query movie
sim_scores = list(enumerate(cosine_sim[idx]))
#Print the pairs from most similar to least similar
print(sorted(sim_scores, key=lambda x: x[1], reverse=True))


[(119, 1.0), (299, 0.24863491303652346), (1359, 0.2446002820005567), (428, 0.18760818300412008), (3, 0.17636949048624026), (2722, 0.15871639376992866), (4135, 0.1524309829622883), (9, 0.1460330011221615), (210, 0.14388738559383546), (65, 0.14198792528064974), (3819, 0.13105426195678502), (3854, 0.12386191223463644), (423, 0.12271525467938586), (2050, 0.11212569448298254), (2371, 0.10627452702992735), (68, 0.10202843130489589), (200, 0.0998835565415842), (225, 0.0988816412991128), (4407, 0.09676585660314022), (3857, 0.08809253647571455), (2426, 0.08658721844126996), (2240, 0.07733900333049025), (4242, 0.0763000865594609), (157, 0.07601686485706233), (485, 0.07582214539163545), (2246, 0.0743091521851113), (116, 0.07401467579647694), (1242, 0.07010670119995711), (994, 0.06953863306984882), (522, 0.06803893703715244), (1742, 0.06757351488596383), (237, 0.0659250155807118), (651, 0.06539919047776475), (128, 0.0644882995068758), (112, 0.0643331681920142), (506, 0.06400863799417343), (14, 0.0

In [23]:
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping position 0 (normally the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return cleaned['original_title'].iloc[movie_indices]
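
The slice `sim_scores[1:11]` assumes the query movie always sorts into position 0. That holds when its self-similarity of 1.0 is the single largest score, but it may not hold when scores tie, for example for a movie whose TF-IDF row is all zeros. A minimal sketch of an alternative that excludes the query row by index instead; `top_k_similar` is a hypothetical helper and the scores are toy values.

```python
import numpy as np

def top_k_similar(sim_row, own_index, k=10):
    """Hypothetical helper: indices of the k highest scores, excluding own_index."""
    # Sort descending; kind="stable" keeps tied scores in their original order.
    order = np.argsort(-sim_row, kind="stable")
    # Drop the query row explicitly rather than assuming it is first.
    order = order[order != own_index]
    return order[:k]

toy_scores = np.array([0.1, 1.0, 0.7, 0.0, 0.3])
top3 = top_k_similar(toy_scores, own_index=1, k=3)
print(top3)  # -> [2 4 0]
```

The returned positions can be passed to `.iloc` in the same way as `movie_indices` in the function above.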

In [24]:
# Getting the recommendation
get_recommendations('Avatar')

1341                Obitaemyy Ostrov
634                       The Matrix
3604                       Apollo 18
2130                    The American
775                        Supernova
529                 Tears of the Sun
151                          Beowulf
311     The Adventures of Pluto Nash
847                         Semi-Pro
570                           Ransom
Name: original_title, dtype: object

In [25]:
get_recommendations('Batman Begins')

299                         Batman Forever
1359                                Batman
428                         Batman Returns
3                    The Dark Knight Rises
2722                     Seven Psychopaths
4135                            Jerusalema
9       Batman v Superman: Dawn of Justice
210                         Batman & Robin
65                         The Dark Knight
3819                              Defendor
Name: original_title, dtype: object
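
A note on the `indices` lookup used above: `Series.drop_duplicates()` removes repeated values, not repeated index labels. Since the values here are row positions, which are all different, that call leaves the mapping unchanged. If a title occurred more than once, `indices[title]` would return a Series rather than a single integer. The sketch below reproduces this on toy data and shows one possible guard; `toy` is a hypothetical stand-in for `cleaned`.

```python
import pandas as pd

# Toy titles with one repeat.
toy = pd.DataFrame({"original_title": ["Batman", "Heat", "Batman"]})

toy_indices = pd.Series(toy.index, index=toy["original_title"])
repeated_lookup = toy_indices["Batman"]
print(type(repeated_lookup))  # a Series, because the label occurs twice

# Keep only the first row for each title so every lookup returns one integer.
unique_indices = toy_indices[~toy_indices.index.duplicated(keep="first")]
print(unique_indices["Batman"])
```

Keeping the first occurrence is an arbitrary choice; matching on a unique column such as `id` would avoid the ambiguity altogether.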

In [26]:
cleaned.columns

Index(['budget', 'genres', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'release_date', 'revenue', 'runtime', 'spoken_languages', 'tagline',
       'vote_average', 'vote_count', 'cast', 'crew'],
      dtype='object')

In [27]:
## From the new features cast, crew and keywords,
## you need to extract the three most important actors,
## the director and the keywords associated with each movie.

## But first things first: the data is stored in the form of "stringified" lists.
## You need to convert them into a form that is usable.

## Parse the stringified features into their corresponding Python objects.
from ast import literal_eval

features = ["cast", "crew", "keywords", "genres"]

for feature in features:
    cleaned[feature] = cleaned[feature].apply(literal_eval)
    
## about literal_eval()    
## https://stackoverflow.com/questions/15197673/using-pythons-eval-vs-ast-literal-eval
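
As a small illustration of what `literal_eval` does to one cell, the sketch below parses a string shaped like a `genres` value. The wrapper `safe_literal_eval` is a hypothetical helper, not part of the notebook; it is one way to cope with cells that are missing or malformed.

```python
from ast import literal_eval

raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval only evaluates Python literals (lists, dicts, strings, numbers, ...),
# so unlike eval() it cannot run arbitrary code found in the data.
parsed = literal_eval(raw)
print(type(parsed).__name__, parsed[0]["name"])

def safe_literal_eval(value):
    """Hypothetical helper: return [] when the cell cannot be parsed."""
    try:
        return literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return []

print(safe_literal_eval(float("nan")))  # missing cell
print(safe_literal_eval("[1, 2"))       # truncated string
```

Such a wrapper could be passed to `.apply()` in place of `literal_eval` if some rows failed to parse.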



I would like to humbly and sincerely thank my mentor [Rocky Jagtiani](https://www.linkedin.com/today/author/rocky-jagtiani-3b390649/). He is more of a friend to me than a mentor. The Machine Learning course he teaches, and the various projects we did and are still doing, are the best way to learn and build skills in the data science field. See https://datascience.suvenconsultants.com for more.

In [28]:
## Let's look at the data stored for the first movie (row 0).
cleaned['crew'].values[0]

# Notice that it is a list of dict objects

[{'credit_id': '52fe48009251416c750aca23',
  'department': 'Editing',
  'gender': 0,
  'id': 1721,
  'job': 'Editor',
  'name': 'Stephen E. Rivkin'},
 {'credit_id': '539c47ecc3a36810e3001f87',
  'department': 'Art',
  'gender': 2,
  'id': 496,
  'job': 'Production Design',
  'name': 'Rick Carter'},
 {'credit_id': '54491c89c3a3680fb4001cf7',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Sound Designer',
  'name': 'Christopher Boyes'},
 {'credit_id': '54491cb70e0a267480001bd0',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Supervising Sound Editor',
  'name': 'Christopher Boyes'},
 {'credit_id': '539c4a4cc3a36810c9002101',
  'department': 'Production',
  'gender': 1,
  'id': 1262,
  'job': 'Casting',
  'name': 'Mali Finn'},
 {'credit_id': '5544ee3b925141499f0008fc',
  'department': 'Sound',
  'gender': 2,
  'id': 1729,
  'job': 'Original Music Composer',
  'name': 'James Horner'},
 {'credit_id': '52fe48009251416c750ac9c3',
  'department': 'Directing',
  

In [30]:
# np (NumPy) is already imported in the first cell; np.nan marks a missing director

## User-defined function that finds the name of the director in the crew list.

def get_director(x):
    for i in x:
        if i['job']=="Director":
            return i['name']
    return np.nan

In [31]:
## Write a function that returns the first 3 elements or the entire list, whichever is shorter. Here the list refers to the cast, keywords or genres.

def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]

        # Check if more than 3 elements exist. If yes, return only the first three. If no, return the entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    # Return an empty list in case of missing/malformed data
    return []


In [32]:
cleaned['director']= cleaned['crew'].apply(get_director)

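# 'crew' is trimmed to its first three names as well; the director was already extracted above
features = ['cast', 'crew', 'keywords', 'genres']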

for feature in features:
    cleaned[feature]=cleaned[feature].apply(get_list)

In [33]:
# print the new features of the first 3 films

cleaned[['original_title', 'cast', 'director', 'keywords','genres']].head(3)

Unnamed: 0,original_title,cast,director,keywords,genres
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",James Cameron,"[culture clash, future, space war]","[Action, Adventure, Fantasy]"
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley]",Gore Verbinski,"[ocean, drug abuse, exotic island]","[Adventure, Fantasy, Action]"
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",Sam Mendes,"[spy, based on novel, secret agent]","[Action, Adventure, Crime]"
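
To see the two helpers at work outside the DataFrame, here is a minimal sketch on hand-written records. The functions are condensed copies of `get_director` and `get_list` so that the block runs by itself, and the crew and cast entries are invented.

```python
import numpy as np

def get_director(crew):
    """Return the name of the first crew member whose job is 'Director'."""
    for member in crew:
        if member["job"] == "Director":
            return member["name"]
    return np.nan

def get_list(entries):
    """Return up to three names; slicing with [:3] also covers shorter lists."""
    if isinstance(entries, list):
        return [entry["name"] for entry in entries][:3]
    return []

toy_crew = [
    {"job": "Editor", "name": "Editor One"},
    {"job": "Director", "name": "Director One"},
]
toy_cast = [{"name": n} for n in ["Actor A", "Actor B", "Actor C", "Actor D"]]

print(get_director(toy_crew))   # -> Director One
print(get_director([]))         # -> nan (no director listed)
print(get_list(toy_cast))       # -> ['Actor A', 'Actor B', 'Actor C']
print(get_list(np.nan))         # -> [] (missing data)
```

Taking the first three entries treats list order as importance, which matches the idea of keeping the top-billed actors.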


In [40]:
# The vectorizer splits text into separate words, so two different people who share a first name
# would share a token and look partly alike. Removing the spaces inside each name keeps them distinct.
#
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''


In [41]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    cleaned[feature] = cleaned[feature].apply(clean_data)
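
The effect of stripping spaces can be checked directly. In this sketch the two names are illustrative strings only; the point is that a shared first name creates a shared token unless the spaces are removed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

with_spaces = ["johnny depp", "johnny knoxville"]
without_spaces = ["johnnydepp", "johnnyknoxville"]

# With spaces, both rows contain the token "johnny" and look partly alike.
sim_with = cosine_similarity(CountVectorizer().fit_transform(with_spaces))[0, 1]

# Without spaces, each full name is a single token and nothing is shared.
sim_without = cosine_similarity(CountVectorizer().fit_transform(without_spaces))[0, 1]

print(sim_with, sim_without)
```

The same reasoning applies to multi-word keywords and genres, which is why `clean_data` is applied to all four features.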


In [44]:
# This function joins all the metadata, i.e. keywords, cast, director and genres, into one string.

def create_metadata(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])


In [45]:
# Create a new metadata column

cleaned['metadata']=cleaned.apply(create_metadata, axis=1)

In [46]:
cleaned[['cast', 'crew', 'director', 'metadata', 'genres']].head(5)

Unnamed: 0,cast,crew,director,metadata,genres
0,"[samworthington, zoesaldana, sigourneyweaver]","[Stephen E. Rivkin, Rick Carter, Christopher B...",jamescameron,cultureclash future spacewar samworthington zo...,"[action, adventure, fantasy]"
1,"[johnnydepp, orlandobloom, keiraknightley]","[Dariusz Wolski, Gore Verbinski, Jerry Bruckhe...",goreverbinski,ocean drugabuse exoticisland johnnydepp orland...,"[adventure, fantasy, action]"
2,"[danielcraig, christophwaltz, léaseydoux]","[Thomas Newman, Sam Mendes, Anna Pinnock]",sammendes,spy basedonnovel secretagent danielcraig chris...,"[action, adventure, crime]"
3,"[christianbale, michaelcaine, garyoldman]","[Hans Zimmer, Charles Roven, Christopher Nolan]",christophernolan,dccomics crimefighter terrorist christianbale ...,"[action, crime, drama]"
4,"[taylorkitsch, lynncollins, samanthamorton]","[Andrew Stanton, Andrew Stanton, John Lasseter]",andrewstanton,basedonnovel mars medallion taylorkitsch lynnc...,"[action, adventure, sciencefiction]"


In [47]:
cleaned['metadata'].values[0]

'cultureclash future spacewar samworthington zoesaldana sigourneyweaver jamescameron action adventure fantasy'

In [49]:
from sklearn.feature_extraction.text import CountVectorizer

count= CountVectorizer(stop_words='english')

count_matrix=count.fit_transform(cleaned['metadata'])

In [50]:
count_matrix.shape

(4803, 11520)

In [51]:
# Compute the cosine similarity matrix based on the count matrix
from  sklearn.metrics.pairwise import cosine_similarity

cosine_sim2= cosine_similarity(count_matrix, count_matrix)
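
This second matrix is built from raw counts rather than TF-IDF weights. One effect of that choice is that every shared token counts equally, whereas TF-IDF would down-weight tokens that appear in many rows. The sketch below compares the two on invented metadata strings; it illustrates the difference and is not a statement of why the notebook uses counts.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy metadata strings in the same shape as the 'metadata' column.
toy_metadata = [
    "spy danielcraig sammendes action",
    "drama danielcraig sammendes action",
    "comedy adamsandler action",
]

# Raw counts: rows 0 and 1 share 3 of their 4 tokens, so the cosine is 3/4.
count_sim = cosine_similarity(CountVectorizer().fit_transform(toy_metadata))[0, 1]

# TF-IDF: the token shared by every row ("action") gets the lowest weight.
tfidf_sim = cosine_similarity(TfidfVectorizer().fit_transform(toy_metadata))[0, 1]

print(round(count_sim, 2), round(tfidf_sim, 2))
```

Note also that `cosine_similarity` is needed here instead of `linear_kernel`, because count vectors are not normalised to unit length.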

In [53]:
# Reset the index of your main DataFrame (only needed if it is no longer 0..n-1)
# and construct the reverse mapping as before.

# cleaned = cleaned.reset_index()

indices = pd.Series(cleaned.index, index=cleaned['original_title'])
indices[:2]

original_title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
dtype: int64

In [54]:
## You can now reuse your get_recommendations() function 
## by passing in the new cosine_sim2 matrix as your second argument.

get_recommendations('The Dark Knight Rises', cosine_sim2)


65               The Dark Knight
119                Batman Begins
4638    Amidst the Devil's Wings
1196                The Prestige
3073           Romeo Is Bleeding
3326              Black November
1503                      Takers
1986                      Faster
303                     Catwoman
747               Gangster Squad
Name: original_title, dtype: object

In [55]:
get_recommendations('The Godfather', cosine_sim2)

867      The Godfather: Part III
2731      The Godfather: Part II
4638    Amidst the Devil's Wings
2649           The Son of No One
1525              Apocalypse Now
1018             The Cotton Club
1170     The Talented Mr. Ripley
1209               The Rainmaker
1394               Donnie Brasco
1850                    Scarface
Name: original_title, dtype: object

In [None]:
get_recommendations('The Talented Mr. Ripley', cosine_sim2)