![Recommendation System](https://cdn.activestate.com/wp-content/uploads/2019/12/RecommendationEngine-1200x675.png)

> This is a recommendation system that suggests movies similar to a given movie, based on its overview and metadata.

> Dataset used : db_5000_credits.txt, db_5000_movies.txt

> links for the datasets : <br />



> Tech Stack used: Python, pandas, scikit-learn

Recommender systems are among the most popular applications of data science today. They are used to predict the "rating" or "preference" that a user would give to an item. Almost every major tech company has applied them in some form. Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow.

Recommender systems have also been developed to explore research articles and experts, collaborators, and financial services.

**Recommender systems can be broadly classified into two types:**

> **Content-based recommenders**: suggest items similar to a particular item. These systems use item metadata (for movies: genre, director, description, actors, etc.) to make recommendations. The general idea is that if a person likes a particular item, he or she will also like items that are similar to it. To find such items, the system relies on the metadata of the items the user has interacted with in the past. A good example is YouTube, which suggests new videos based on your viewing history.

<img src="https://miro.medium.com/max/1642/1*BME1JjIlBEAI9BV5pOO5Mg.png" heigth='180px' width='220px' />



> **Collaborative filtering engines**: these widely used systems try to predict the rating or preference that a user would give an item, based on the past ratings and preferences of other users. Unlike their content-based counterparts, collaborative filters do not require item metadata.

<img src="https://miro.medium.com/max/4056/1*KBriLd3AYrLuULCqdffxCQ.png" heigth='220px' width='420px' />

Here we are going to implement **Content-Based Filtering**
--

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Loading Data sets
_url_1 = 'db_5000_credits.txt'

_url_2 = 'db_5000_movies.txt'

credits = pd.read_csv(_url_1)

movies = pd.read_csv(_url_2)

In [None]:
# Printing 1st few elements of credits dataset
credits.head(2)
#credits['cast'].values[0]

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [None]:
# Printing 1st few elements of movies dataset
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [None]:
# Printing the shapes of both the datasets

print("Credits: ", credits.shape)

print("Movies: ", movies.shape)

Credits:  (4803, 4)
Movies:  (4803, 20)


In [None]:
# Renaming the column of credits data set

credits_renamed = credits.rename(columns={'movie_id':'id'})
## Learn working of rename function here
## https://note.nkmk.me/en/python-pandas-dataframe-rename/

credits_renamed.head(2)

Unnamed: 0,id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [None]:
# Merging both data sets
mergedf = movies.merge(credits_renamed, on='id')
mergedf.head(2)

## to understand merge, look at the diagram below. merge is very similar to SQL joins

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title_x,vote_average,vote_count,title_y,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [None]:
# Dropping unnecessary columns
cleaned = mergedf.drop(columns=['homepage','title_x','title_y','status','production_countries'])
## cleaning is very important.
## one aspect of cleaning is dropping unnecessary columns which wont help the task at hand

cleaned.head(2)

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,release_date,revenue,runtime,spoken_languages,tagline,vote_average,vote_count,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Enter the World of Pandora.,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]","At the end of the world, the adventure begins.",6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


<img src="https://miro.medium.com/max/1200/1*bRA62jiJm8MbCPzWtXxPsQ.png" />
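
As a minimal sketch of how `merge` resembles SQL joins, the toy frames below (made-up values, not taken from the dataset) are joined on `id`. The default `how='inner'` keeps only the ids present in both frames, while `how='left'` keeps every row of the left frame. The sketch also shows where suffixes such as `title_x` and `title_y` come from: both frames carry a `title` column that is not the join key.

```python
import pandas as pd

# Toy frames (hypothetical data) standing in for the movies and credits tables
left = pd.DataFrame({'id': [1, 2, 3], 'title': ['A', 'B', 'C']})
right = pd.DataFrame({'id': [2, 3, 4], 'title': ['B', 'C', 'D'], 'cast': ['x', 'y', 'z']})

# Inner join (the default): only ids found in both frames survive.
# The shared non-key column 'title' is split into title_x and title_y.
inner = left.merge(right, on='id')

# Left join: every row of `left` is kept; unmatched rows get NaN on the right side
left_join = left.merge(right, on='id', how='left')

print(inner)
print(left_join)
```

Because both datasets here have 4803 rows and the merged frame is used row by row afterwards, an inner join is a reasonable default, but checking the merged shape is a cheap sanity test.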

In [None]:
cleaned['overview'].head()

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

In [None]:
cleaned['overview'].isnull().sum()  ## count the NaN/null overviews (3 are missing)

3

In [None]:
## A few overviews are missing (NaN), so replace them with an empty string
## to make sure the vectorizer receives only strings
cleaned['overview'] = cleaned['overview'].fillna('')

In [None]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1,3), min_df=3, analyzer='word')


#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(cleaned['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(4803, 9919)

We see that the vocabulary contains 9,919 distinct terms (single words plus two- and three-word phrases, each appearing in at least 3 overviews) describing the 4803 movies in our dataset.
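
To make the shape of the TF-IDF matrix more concrete, here is a small sketch on a three-sentence toy corpus (the sentences are invented for illustration). It uses `ngram_range=(1, 2)` instead of `(1, 3)` only to keep the vocabulary short, and it reads the learned terms from the `vocabulary_` attribute.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented overviews, one per "movie"
docs = [
    "a marine explores an alien moon",
    "a pirate explores the ocean",
    "a spy explores a secret",
]

vec = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
matrix = vec.fit_transform(docs)

# One row per document, one column per term (single words and two-word phrases)
terms = sorted(vec.vocabulary_)
print(matrix.shape)
print(terms)

# With the default settings each row is scaled to unit (L2) length
row_norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())
print(row_norms)
```

Stop words are removed before the n-grams are built, which is why phrases such as "alien moon" appear while "a" and "the" do not.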

With this matrix in hand, we can now compute a similarity score. There are several candidates for this, such as the Euclidean, Pearson and cosine similarity scores. No single score is best: different scores work well in different scenarios, and it is often a good idea to experiment with several metrics.

Refer 1 : https://deepai.org/machine-learning-glossary-and-terms/cosine-similarity

Refer 2 : http://image.slidesharecdn.com/datadaytexas2017presentation-180128141255/95/improving-graph-based-entity-resolution-with-data-mining-and-nlp-45-638.jpg?cb=1517148867

We will be using the cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. We use the cosine similarity score since it is independent of magnitude and is relatively easy and fast to calculate.


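**Note :** sklearn's `linear_kernel()` can be used instead of `cosine_similarity()`. `TfidfVectorizer` scales every row to unit (L2) length by default, so the plain dot product of two rows already equals their cosine similarity, and `linear_kernel()` skips the redundant normalization step, which makes it a little faster. `Use it in case of large datasets`.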
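
The note above can be checked with a short sketch. The three sentences are made up; the only assumption is that `TfidfVectorizer` is used with its default `norm='l2'`, as in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["space war on a distant moon", "pirates sail the ocean", "a war at sea"]

# Rows of the TF-IDF matrix have unit length by default
tfidf_demo = TfidfVectorizer().fit_transform(docs)

# Dot product and cosine similarity therefore give the same matrix
same = np.allclose(cosine_similarity(tfidf_demo), linear_kernel(tfidf_demo))
print(same)
```

The equality holds only for unit-length rows. On unnormalized vectors, such as raw counts, the two functions return different numbers.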

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
#cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

## Discuss in class : https://www.machinelearningplus.com/nlp/cosine-similarity/

In [None]:
print(cosine_sim.shape)

## printing the similarity scores for the 0th movie, i.e. Avatar
print(cosine_sim[0])

## printing the similarity scores for the 1st movie, i.e. Pirates of the Caribbean: At World's End
print(cosine_sim[1])

print(cosine_sim[2])

(4803, 4803)
[1. 0. 0. ... 0. 0. 0.]
[0.         1.         0.         ... 0.02445021 0.         0.        ]
[0.        0.        1.        ... 0.0162416 0.        0.       ]


We are going to define a function that takes in a movie title as an input and outputs a list of 10 most similar movies. Firstly, for this, we need a reverse mapping of movie titles and DataFrame indices. In other words, we need a mechanism to identify the index of a movie in our metadata DataFrame, given its title.

In [None]:
#Construct a reverse map of indices and movie titles
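indices = pd.Series(cleaned.index, index=cleaned['original_title'])
## drop_duplicates() compares the values (row positions), which are already unique,
## so repeated titles are removed through the index instead (first occurrence is kept)
indices = indices[~indices.index.duplicated(keep='first')]

indices[ : 5]   ## let's check out the mapping between the movie names and indexes.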

original_title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
dtype: int64
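
One caveat worth checking: `Series.drop_duplicates()` compares values, not index labels, so it does not remove a title that appears twice in a title-to-position map. The sketch below uses an invented three-title list to show the effect and one possible way to keep a single position per title.

```python
import pandas as pd

# Invented titles; 'Batman' appears twice on purpose
titles = pd.Series(['Avatar', 'Batman', 'Batman'], name='original_title')
lookup = pd.Series(titles.index, index=titles)

# A repeated title returns several positions instead of one integer
print(lookup['Batman'])

# Keep the first occurrence of each title so every lookup is a single integer
unique_lookup = lookup[~lookup.index.duplicated(keep='first')]
print(unique_lookup['Batman'])
```

Keeping the first occurrence is an arbitrary choice made for this sketch. Disambiguating by release year would be another option.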

We are now in a good position to define our recommendation function. We'll follow these steps:

* Get the index of the movie given its title.

* Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position and the second is the similarity score.

* Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.

* Get the top 10 elements of this list, ignoring the first element because it refers to the movie itself (the movie most similar to a particular movie is the movie itself).

* Return the titles corresponding to the indices of the top elements.
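
The steps above can be sketched on a tiny, hand-written similarity matrix before applying them to the real one. The titles, the scores and the helper name `top_k` are all hypothetical. As a small variation, the sketch removes the query movie by its index rather than by dropping the first sorted element.

```python
import numpy as np

toy_titles = ['A', 'B', 'C', 'D']   # hypothetical titles
toy_sim = np.array([                # hypothetical pairwise similarity scores
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

def top_k(title, k=2):
    idx = toy_titles.index(title)                               # 1. index of the movie
    scores = list(enumerate(toy_sim[idx]))                      # 2. (position, score) tuples
    scores = sorted(scores, key=lambda s: s[1], reverse=True)   # 3. sort by score
    scores = [s for s in scores if s[0] != idx][:k]             # 4. drop the movie itself, keep k
    return [toy_titles[i] for i, _ in scores]                   # 5. map positions back to titles

print(top_k('A'))   # ['C', 'B']
```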

In [None]:
def get_recommendations(title, sim_matrix):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity score of all movies with that movie
    sim_scores = list(enumerate(sim_matrix[idx]))
    print(sim_scores[:5])  ## just for testing purpose
    print("------------------------")

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies, skipping the first one
    sim_scores = sim_scores[1:11]
    print(sim_scores[:5])  ## just for testing purpose
    print("------------------------")


    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return cleaned['original_title'].iloc[movie_indices]

In [None]:
# Getting the recommendation
get_recommendations('Avatar', cosine_sim)

[(0, 1.0), (1, 0.0), (2, 0.0), (3, 0.022890926990200844), (4, 0.0)]
------------------------
[(1341, 0.20925107609846516), (634, 0.20548453580005477), (3604, 0.18362526160968518), (2130, 0.1742568977678487), (775, 0.16421163130645072)]
------------------------


1341                Obitaemyy Ostrov
634                       The Matrix
3604                       Apollo 18
2130                    The American
775                        Supernova
529                 Tears of the Sun
151                          Beowulf
311     The Adventures of Pluto Nash
847                         Semi-Pro
570                           Ransom
Name: original_title, dtype: object

In [None]:
## check for "The Dark Knight Rises"
# Getting the recommendation
get_recommendations('The Dark Knight Rises', cosine_sim)

[(0, 0.022890926990200844), (1, 0.0), (2, 0.0), (3, 0.9999999999999999), (4, 0.010730456625887959)]
------------------------
[(299, 0.4396046253319627), (65, 0.38197192017195536), (1359, 0.32933413552140023), (428, 0.2631661750805582), (2507, 0.20488574655585495)]
------------------------


299                              Batman Forever
65                              The Dark Knight
1359                                     Batman
428                              Batman Returns
2507                                  Slow Burn
119                               Batman Begins
1181                                        JFK
3854    Batman: The Dark Knight Returns, Part 2
9            Batman v Superman: Dawn of Justice
210                              Batman & Robin
Name: original_title, dtype: object

Enhancements Possible
--

In [None]:
cleaned.columns

Index(['budget', 'genres', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'release_date', 'revenue', 'runtime', 'spoken_languages', 'tagline',
       'vote_average', 'vote_count', 'cast', 'crew'],
      dtype='object')

In [None]:
## have a look at the way data is stored in the original dataframe
cleaned['crew'].values[0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [None]:
## From your new features, cast, crew, and keywords,
## you need to extract the three most important actors,
## the director and the keywords associated with that movie.

## But first things first, your data is present in the form of "stringified" lists.
## You need to convert them into a way that is usable for you.

# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']

for feature in features:
    cleaned[feature] = cleaned[feature].apply(literal_eval)

## about literal_eval()
## https://stackoverflow.com/questions/15197673/using-pythons-eval-vs-ast-literal-eval
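
As a small illustration of what "stringified" means here, the sketch below parses one hand-written string in the same shape as the `genres` column. The string is an example, not a row copied from the dataset.

```python
from ast import literal_eval

# A string that looks like a list of dicts, as stored in the CSV
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval turns it into real Python objects without executing arbitrary code
parsed = literal_eval(raw)
names = [d['name'] for d in parsed]

print(type(raw).__name__, type(parsed).__name__, names)
```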


In [None]:
## lets see the data stored for the 0th movie.
cleaned['crew'].values[0]

## Notice: it's a list of dict objects.

[{'credit_id': '52fe48009251416c750aca23',
  'department': 'Editing',
  'gender': 0,
  'id': 1721,
  'job': 'Editor',
  'name': 'Stephen E. Rivkin'},
 {'credit_id': '539c47ecc3a36810e3001f87',
  'department': 'Art',
  'gender': 2,
  'id': 496,
  'job': 'Production Design',
  'name': 'Rick Carter'},
 {'credit_id': '54491c89c3a3680fb4001cf7',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Sound Designer',
  'name': 'Christopher Boyes'},
 {'credit_id': '54491cb70e0a267480001bd0',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Supervising Sound Editor',
  'name': 'Christopher Boyes'},
 {'credit_id': '539c4a4cc3a36810c9002101',
  'department': 'Production',
  'gender': 1,
  'id': 1262,
  'job': 'Casting',
  'name': 'Mali Finn'},
 {'credit_id': '5544ee3b925141499f0008fc',
  'department': 'Sound',
  'gender': 2,
  'id': 1729,
  'job': 'Original Music Composer',
  'name': 'James Horner'},
 {'credit_id': '52fe48009251416c750ac9c3',
  'department': 'Directing',
  

In [None]:
# Import Numpy
# import numpy as np

## function to get the director's name
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [None]:
## a function that returns the first 3 names, or the entire list if it has 3 or fewer elements.
## Here the list refers to the cast, keywords, and genres.
def get_list(x):

    if isinstance(x, list):
        names = [i['name'] for i in x]

        # Check if more than 3 elements exist. If yes, return only the first three. If no, return the entire list.
        if len(names) > 3:
            names = names[ : 3]
        return names

    # Return an empty list in case of missing/malformed data
    return []

In [None]:
# Define new director, cast, genres and keywords features
## that are in a suitable form.
cleaned['director'] = cleaned['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']

for feature in features:
    cleaned[feature] = cleaned[feature].apply(get_list)

In [None]:
# Print the new features of the first 3 films
cleaned[['original_title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,original_title,cast,director,keywords,genres
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",James Cameron,"[culture clash, future, space war]","[Action, Adventure, Fantasy]"
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley]",Gore Verbinski,"[ocean, drug abuse, exotic island]","[Adventure, Fantasy, Action]"
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",Sam Mendes,"[spy, based on novel, secret agent]","[Action, Adventure, Crime]"


The next step would be to convert the names and keyword instances into lowercase and strip all the spaces between them.

Removing the spaces between words is an important preprocessing step. It is done so that your vectorizer doesn't count the Johnny of "Johnny Depp" and "Johnny Galecki" as the same. After this processing step, the aforementioned actors will be represented as "johnnydepp" and "johnnygalecki" and will be distinct to your vectorizer.
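
A quick sketch of why the spaces matter, using the two names from the paragraph above and `CountVectorizer` with its default tokenizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

with_spaces = ["Johnny Depp", "Johnny Galecki"]
without_spaces = ["johnnydepp", "johnnygalecki"]

# With spaces, the first name becomes its own token shared by both actors
vocab_with_spaces = sorted(CountVectorizer().fit(with_spaces).vocabulary_)

# Without spaces, each actor is a single, distinct token
vocab_without_spaces = sorted(CountVectorizer().fit(without_spaces).vocabulary_)

print(vocab_with_spaces)      # ['depp', 'galecki', 'johnny']
print(vocab_without_spaces)   # ['johnnydepp', 'johnnygalecki']
```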

In [None]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [None]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    cleaned[feature] = cleaned[feature].apply(clean_data)

You are now in a position to create your **`"metadata"`**, which is a string that contains all the metadata that you want to feed to your vectorizer (namely keywords, actors, director and genres).

The create_metadata function simply joins all the required columns with spaces. This is the `final preprocessing` step, and the output of this function will be fed into the vectorizer.

In [None]:
def create_metadata(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])


In [None]:
# Create a new metadata feature
cleaned['metadata'] = cleaned.apply(create_metadata, axis=1)

In [None]:
cleaned[['metadata']].head(2)

Unnamed: 0,metadata
0,cultureclash future spacewar samworthington zo...
1,ocean drugabuse exoticisland johnnydepp orland...


The next steps are the same as for the overview-based `content based recommender` above.

**`One key difference`** is that you use CountVectorizer() instead of TF-IDF. This is because you do not want to down-weight an actor's or director's presence if he or she has acted in or directed relatively more movies. It doesn't make much intuitive sense to down-weight them in this context.

The major difference between CountVectorizer() and TF-IDF is the inverse document frequency (IDF) component, which is present in the latter and not in the former.
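
The effect of the IDF component can be seen on a toy example. The four "metadata" strings below are invented; the director token `nolan` appears in three of them, the actor token `bale` in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["nolan bale", "nolan caine", "nolan hardy", "spielberg hanks"]

# Raw counts: 'nolan' counts as 1 in every document that contains it
count_vec = CountVectorizer().fit(docs)
counts = count_vec.transform(docs).toarray()
nolan_col = counts[:, count_vec.vocabulary_['nolan']]

# TF-IDF: terms that occur in many documents receive a smaller IDF weight
tfidf_vec = TfidfVectorizer().fit(docs)
idf = {term: tfidf_vec.idf_[i] for term, i in tfidf_vec.vocabulary_.items()}

print(nolan_col)
print(idf['nolan'], idf['bale'])
```

With counts, a prolific director contributes as much to the similarity as a rare one. With TF-IDF, the prolific director's token is weighted down, which is the behaviour this section chooses to avoid.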

In [None]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')

count_matrix = count.fit_transform(cleaned['metadata'])

In [None]:
count_matrix.shape

(4803, 11520)

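From the above output, you can see that the metadata you fed to the vectorizer contains 11,520 distinct terms.

Next, you will compute pairwise similarity scores between the count vectors. Note that the cell below calls `linear_kernel`, which returns raw dot products: count vectors are not length-normalized, so the scores are essentially counts of shared terms (values such as 7.0 in the outputs below) rather than cosine similarities between 0 and 1. Call `cosine_similarity(count_matrix, count_matrix)` instead if you want length-normalized scores.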
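
For count vectors, the dot product and the cosine similarity are not interchangeable, because the rows are not scaled to unit length. The sketch below compares the two on three invented metadata strings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["batman gotham nolan bale", "batman gotham nolan", "pirate ocean"]
counts = CountVectorizer().fit_transform(docs)

dot = linear_kernel(counts)       # raw dot products: number of shared tokens here
cos = cosine_similarity(counts)   # dot products divided by the vector lengths

print(dot[0, 1], dot[0, 0])       # shared tokens with doc 1, and with itself
print(cos[0, 1], cos[0, 0])       # normalized scores in [0, 1]
```

Raw dot products tend to favour movies with longer metadata strings, while cosine similarity corrects for length. Which one ranks better is worth testing on the real data.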

In [None]:
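# Compute the similarity matrix based on the count_matrix
# NOTE: linear_kernel returns raw dot products. Count vectors are not L2-normalized,
# so these scores are shared-term counts, not cosine similarities in [0, 1].
# For length-normalized scores use: cosine_similarity(count_matrix, count_matrix)
from sklearn.metrics.pairwise import linear_kernel

cosine_sim2 = linear_kernel(count_matrix, count_matrix)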

In [None]:
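# Construct the reverse mapping as before

## cleaned = cleaned.reset_index()
indices = pd.Series(cleaned.index, index = cleaned['original_title'])
## keep the first row for each repeated title so that indices[title] is a single integer
indices = indices[~indices.index.duplicated(keep='first')]
indices[:2]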

original_title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
dtype: int64

In [None]:
## You can now reuse your get_recommendations() function
## by passing in the new cosine_sim2 matrix as your second argument.

get_recommendations('The Dark Knight Rises', cosine_sim2)

[(0, 1.0), (1, 1.0), (2, 2.0), (3, 10.0), (4, 1.0)]
------------------------
[(65, 7.0), (119, 7.0), (1196, 4.0), (3073, 4.0), (72, 3.0)]
------------------------


65                     The Dark Knight
119                      Batman Begins
1196                      The Prestige
3073                 Romeo Is Bleeding
72                       Suicide Squad
82      Dawn of the Planet of the Apes
157             Exodus: Gods and Kings
210                     Batman & Robin
280                     Public Enemies
299                     Batman Forever
Name: original_title, dtype: object

In [None]:
get_recommendations('The Godfather', cosine_sim2)

[(0, 0.0), (1, 0.0), (2, 1.0), (3, 2.0), (4, 0.0)]
------------------------
[(867, 5.0), (2731, 4.0), (1018, 3.0), (1170, 3.0), (1209, 3.0)]
------------------------


867     The Godfather: Part III
2731     The Godfather: Part II
1018            The Cotton Club
1170    The Talented Mr. Ripley
1209              The Rainmaker
1394              Donnie Brasco
1525             Apocalypse Now
1850                   Scarface
2280                Sea of Love
2649          The Son of No One
Name: original_title, dtype: object

**`Great!`**  You see that your recommender has been successful in capturing more information due to more metadata and has given you better recommendations. *There are, of course, numerous ways of experimenting with this system to improve recommendations*.

Some suggestions:
--

`Introduce a popularity filter` : this recommender would take the 30 most similar movies, calculate the weighted ratings (using the IMDB formula as below), sort movies based on this rating, and return the top 10 movies.

<img src="https://prod-content-care-community-cdn.sprinklr.com/26653d1b-7bb8-47bf-ac21-90f16f2e4b48/RackMultipart2019053111532z9kq-5bb2d25a-f758-49b7-82c4-94c2bcda8009-393433847.JPG1559341578"  heigth='230px'  width='180px'/>
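
A minimal sketch of such a popularity filter, assuming the formula meant is the commonly cited IMDB weighted rating, WR = (v / (v + m)) * R + (m / (v + m)) * C, where v is the movie's vote count, R its average vote, m a minimum-votes threshold and C the mean vote. The candidate rows, the threshold `m = 1000` and the function name `weighted_rating` are hypothetical.

```python
import pandas as pd

def weighted_rating(df, m, C):
    """IMDB-style weighted rating for every row of df."""
    v = df['vote_count']
    R = df['vote_average']
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical shortlist of similar movies (in practice: the 30 most similar ones)
candidates = pd.DataFrame({
    'original_title': ['A', 'B', 'C'],
    'vote_average': [9.5, 8.5, 6.0],
    'vote_count': [10, 5000, 1000],
})

C = candidates['vote_average'].mean()   # in practice: the mean over the whole dataset
m = 1000                                # hypothetical minimum number of votes

candidates['score'] = weighted_rating(candidates, m, C)
ranked = candidates.sort_values('score', ascending=False)
print(ranked[['original_title', 'score']])
```

Movie A has the highest raw average but only 10 votes, so its score is pulled toward the mean and B ranks first.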

`Other crew members`: other crew member names, such as screenwriters and producers, could also be included.

`Increasing weight of the director` : to give more weight to the director, his or her name could be repeated several times in the metadata string, which increases the similarity scores of movies with the same director.
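
One way this could look is sketched below. The function name `create_weighted_metadata` and the `director_weight` parameter are hypothetical, and the example row is hand-written in the same cleaned format as the notebook's columns.

```python
def create_weighted_metadata(row, director_weight=3):
    """Join the metadata columns, repeating the director director_weight times."""
    director_part = ' '.join([row['director']] * director_weight)
    return ' '.join(row['keywords'] + row['cast'] + [director_part] + row['genres'])

# Hand-written example row (already lowercased and stripped of spaces)
row = {
    'keywords': ['spy'],
    'cast': ['danielcraig'],
    'director': 'sammendes',
    'genres': ['action'],
}

weighted = create_weighted_metadata(row)
print(weighted)   # spy danielcraig sammendes sammendes sammendes action
```

With a count-based vectorizer, the repeated token raises the director's count, so two movies sharing a director gain more similarity than two movies sharing a single keyword.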

How would you make a Collaborative Filtering recommender?
--

You have now coded two content-based movie recommender systems. Another extremely popular type of recommender is the collaborative filter.

Collaborative filters can further be classified into two types:

1. User-based Filtering: these systems recommend products to a user that similar users have liked. For example, let's say Alice and Bob have a similar interest in books (that is, they largely like and dislike the same books). Now, let's say a new book has been launched into the market, and Alice has read and loved it. It is, therefore, highly likely that Bob will like it too, and therefore, the system recommends this book to Bob.

2. Item-based Filtering: these systems are similar to the content recommendation engine that you built, but they identify similar items based on how people have rated them in the past rather than on item metadata. For example, if Alice, Bob, and Eve have given 5 stars to The Lord of the Rings and The Hobbit, the system identifies the items as similar. Therefore, if someone buys The Lord of the Rings, the system also recommends The Hobbit to him or her.
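
The item-based idea can be sketched with a tiny, made-up rating matrix. The ratings, the third item and the helper `most_similar_item` are hypothetical, and treating "not rated" as 0 is a simplification that a real system would handle more carefully.

```python
import numpy as np

items = ['The Lord of the Rings', 'The Hobbit', 'Another Movie']

# Rows: Alice, Bob, Eve. Columns: items. Made-up ratings, 0 = not rated.
ratings = np.array([
    [5, 5, 1],
    [5, 5, 0],
    [5, 5, 2],
], dtype=float)

# Cosine similarity between item columns: items rated alike by the same users score high
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

def most_similar_item(name):
    i = items.index(name)
    scores = item_sim[i].copy()
    scores[i] = -np.inf          # exclude the item itself
    return items[int(np.argmax(scores))]

print(most_similar_item('The Lord of the Rings'))   # The Hobbit
```

No item metadata is used here: the similarity comes entirely from the users' ratings, which is the defining difference from the content-based recommenders built above.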

An example of collaborative filtering based on a rating system:
--

<img src="https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1590777873/Collaborative_filtering_vbujt7.gif"  heigth='250px' width='250px' />

Extra Reading
--

Building a machine learning recommendation system tutorial using Python and collaborative filtering for a Netflix use case. <br />
https://medium.com/towards-artificial-intelligence/recommendation-system-in-depth-tutorial-with-python-for-netflix-using-collaborative-filtering-533ff8a0e444


Build a Recommendation Engine With Collaborative Filtering  <br />
https://realpython.com/build-recommendation-engine-collaborative-filtering/

Grouplens Dataset <br />
https://grouplens.org/datasets/movielens/