<h3 style="color: green;">IMPORT PANDAS LIBRARY AND READ DATA</h3>

In [1]:
# Import Pandas
import pandas as pd

# Load Movies Metadata
metadata = pd.read_csv('movies_metadata.csv',low_memory=False)



In [2]:
# Print the first three rows

metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


<h3 style="color: green;"> SIMPLE RECOMMENDER </h3>

Simple recommenders are basic systems that recommend the top items based on a certain metric or score. This section builds a simplified clone of the IMDB Top 250 Movies chart using metadata collected from IMDB.

**The following are the steps involved:** 
<ul>
    <li>Decide on the metric or score to rate movies on.</li>
    <li>Calculate the score for every movie.</li>
    <li>Sort the movies based on the score and output the top results.</li>
</ul>

In [42]:
# Calculate mean of vote average column
C = metadata['vote_average'].mean()
print(C)

5.6117278654770075


In [4]:
# Calculate the minimum number of votes required to be in the chart, m

m = metadata['vote_count'].quantile(0.9)
print(m)

160.0


<div>
    <img src="https://media.geeksforgeeks.org/wp-content/uploads/20201127112813/NORMALDISTRIBUTION-660x362.png">
</div>
<b>Explanation: </b>
<ul>
    <li>The <b>quantile()</b> function is used to calculate the 90th percentile of the <b>vote_count</b> column in the <b>metadata</b> data frame.</li>
    <li>This means that 90% of the movies have a vote count at or below the value returned by this function, so only the top 10% by vote count (here, movies with at least 160 votes) qualify for the chart.</li>
</ul>
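
To make the percentile cut-off concrete, here is a small sketch with made-up vote counts (the numbers are hypothetical and not taken from the dataset). It re-implements, in plain Python, linear interpolation between neighbouring sorted values, which is the default interpolation used by pandas' quantile(). The helper name quantile_linear is an assumption for illustration only.

```python
def quantile_linear(values, q):
    """Quantile using linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q          # fractional index into the sorted data
    lower = int(position)                      # index at or just below the position
    upper = min(lower + 1, len(ordered) - 1)   # neighbour above (clamped at the end)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Hypothetical vote counts for ten movies
vote_counts = [3, 8, 12, 20, 35, 50, 90, 160, 400, 5000]

m_demo = quantile_linear(vote_counts, 0.9)
qualified = [v for v in vote_counts if v >= m_demo]

print(m_demo)      # roughly 860: 10% of the way from 400 to 5000
print(qualified)   # only the most-voted movie clears the cut-off
```

With only ten movies the cut-off falls between the two largest counts, so a single movie qualifies; on the full dataset the same rule keeps roughly the top 10% of movies.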

In [5]:
# Filter all qualified movies into a new DataFrame
q_movies = metadata.copy().loc[metadata['vote_count']>=m]

In [6]:
# Define Function that computes the weighted rating of each movie

def weighted_rating(x, m=m, C=C):
    v = x['vote_count'] # Number of votes for the movie
    R = x['vote_average'] # The average rating of the movie
    # Calculate based on the IMDB formula
    return (v/(v+m)*R) + (m/(m+v)*C)
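
As a quick sanity check of the formula above, the sketch below applies it to two hypothetical movies that share the same average rating but have very different vote counts. The default values of m and C are rounded from the outputs shown earlier and serve only as an illustration; weighted_rating_demo is a made-up name that takes plain numbers instead of a DataFrame row.

```python
def weighted_rating_demo(v, R, m=160.0, C=5.6):
    """IMDB-style weighted rating: blend the movie's own average R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Two hypothetical movies, both with an average rating of 8.0
many_votes = weighted_rating_demo(v=8000, R=8.0)   # v is much larger than m
few_votes = weighted_rating_demo(v=160, R=8.0)     # v equals m

print(round(many_votes, 3))   # stays close to its own average of 8.0
print(round(few_votes, 3))    # pulled halfway towards the global mean C
```

The more votes a movie has relative to m, the more its score is determined by its own average; a movie with exactly m votes lands midway between R and C.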

In [7]:
# Define a new feature 'IMDB_score' and calculate its value with 'weighted_rating'

q_movies['IMDB_score'] = q_movies.apply(weighted_rating,axis = 1)

In [8]:
# Show dimension of q_movies
q_movies.shape

(4555, 25)

In [43]:
# Sort the movies based on score calculated above
q_movies = q_movies.sort_values('IMDB_score', ascending = False)

# Print the top 20 movies
q_movies[['title','vote_count','vote_average','IMDB_score']].head(20)

Unnamed: 0,title,vote_count,vote_average,IMDB_score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23673,Whiplash,4376.0,8.3,8.205404
5481,Spirited Away,3968.0,8.3,8.196055
2211,Life Is Beautiful,3643.0,8.3,8.187171


**CONCLUSION:**
The output above suggests that the simple recommender does a reasonable job: the chart is dominated by widely popular, highly rated movies, which is what we would expect from this metric.

<h3 style="color: green;"> CONTENT-BASED RECOMMENDER </h3>

<h4 style="color: yellow;">PLOT DESCRIPTION BASED RECOMMENDER</h4>
<div>
    In this section, we build a system that recommends movies that are similar to a particular movie. To achieve this, we compute the pairwise <b>cosine</b> similarity scores for all movies based on their plot descriptions and recommend the movies with the highest similarity scores.
</div>

The plot description is available as the overview feature in <b>metadata</b> dataset.
<br>
Let's inspect the plots of a few movies.

In [10]:
# Print plot overviews of the first 5 movies.
metadata[['title','overview']].head(5)

Unnamed: 0,title,overview
0,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,Jumanji,When siblings Judy and Peter discover an encha...
2,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
4,Father of the Bride Part II,Just when George Banks has recovered from his ...


<div>
    <p>
        The problem at hand is a Natural Language Processing problem. Hence, we need to extract some kind of features from the above text data before we can compute the similarity or dissimilarity between them. To put it simply, it is not possible to compute the similarity between any two overviews in their raw forms. To do this, we need to compute the word vectors of each overview, or document, as it will be called from now on.
    </p>
    <p>
        As the name suggests, word vectors are vectorized representations of the words in a document. The vectors carry semantic meaning. For example, man & king will have vector representations close to each other, while man & woman will have representations far from each other.
    </p>
    <p>
        We will compute <b>Term Frequency-Inverse Document Frequency (TF-IDF)</b> vectors for each document. This will give you a matrix where each <b>column</b> represents a word in the overview vocabulary (all the words that appear in at least one document), and each <b>row</b> represents a movie, as before.
    </p>
    <p>
        In essence, the TF-IDF score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This reduces the importance of words that occur frequently across plot overviews and, therefore, their influence on the final similarity score.
    </p>
    <p>
        Fortunately, scikit-learn gives you a built-in <b>TfidfVectorizer</b> class that produces the TF-IDF matrix in a couple of lines.
    </p>
    <ul>
        <li>Import the <b>TfidfVectorizer</b> class from scikit-learn.</li>
        <li>Remove stop words such as 'the' and 'an', since they do not carry any useful information.</li>
        <li>Replace not-a-number values with a blank string.</li>
        <li>Finally, construct the TF-IDF matrix on the data.</li>
    </ul>
</div>

<img src="https://media.licdn.com/dms/image/C4E12AQFUKVijnm1YLw/article-inline_image-shrink_400_744/0/1520583276370?e=1703721600&v=beta&t=kZ0i3oQDCMz1cWzs9Ujnw7vVrqSoDkipEQz3xvyEQOU">

<code>
public static void main(String[] args) {

    List<String> doc1 = Arrays.asList("red", "green", "blue", "yellow", "red", "red");
    List<String> doc2 = Arrays.asList("red", "pink", "white", "dark", "orange", "pink");
    List<String> doc3 = Arrays.asList("green", "yellow", "white", "white", "purpil");
    List<List<String>> documents = Arrays.asList(doc1, doc2, doc3);

    TFIDFCalculator calculator = new TFIDFCalculator();
    double tfidf = calculator.tfIdf(doc1, documents, "red");
    System.out.println("TF-IDF (red) = " + tfidf);
}
</code>

<p><b style="color: red">Red:</b> 0.2027325540540822</p>
<p><b style="color: #800080">Purpil:</b> 0.2197224577 </p>
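
The Java snippet above relies on a TFIDFCalculator class that is not shown. As a rough Python equivalent, the sketch below assumes the textbook definitions: term frequency is the count of the term divided by the document length, and inverse document frequency is the natural logarithm of the number of documents divided by the number of documents containing the term. The function names are hypothetical, and the word lists are copied from the example as written.

```python
import math

def tf(doc, term):
    """Share of the document's words that are `term`."""
    return doc.count(term) / len(doc)

def idf(docs, term):
    """Natural log of (number of documents / documents containing `term`).
    Assumes the term occurs in at least one document."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(doc, docs, term):
    return tf(doc, term) * idf(docs, term)

doc1 = ["red", "green", "blue", "yellow", "red", "red"]
doc2 = ["red", "pink", "white", "dark", "orange", "pink"]
doc3 = ["green", "yellow", "white", "white", "purpil"]  # spelled as in the example above
documents = [doc1, doc2, doc3]

red_score = tf_idf(doc1, documents, "red")        # 3/6 * ln(3/2)
purpil_score = tf_idf(doc3, documents, "purpil")  # 1/5 * ln(3/1)

print(red_score)
print(purpil_score)
```

Under these assumptions the two scores agree with the values listed above. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this textbook version.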

In [28]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF Vectorizer Object. Remove all English stop words such as 'the'
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

# Construct the required TF - IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

(46628, 75827)

<div>
    <ul>
        <li>
            This code is used to create a TF-IDF matrix for a dataset of movie metadata.
        </li>
        <li>
            First, the TfidfVectorizer class is imported from the scikit-learn library.
        </li>
        <li>
            This class is used to convert a collection of raw documents into a matrix of TF-IDF features.
        </li>
        <li>
            Next, a TfidfVectorizer object is defined with the parameter stop_words set to 'english'.
        </li>
        <li>
            This means that common English stop words such as 'the' and 'a' will be removed from the text data during vectorization.
        </li>
        <li>
            Any missing values in the 'overview' column of the metadata dataframe are then replaced with an empty string.
        </li>
        <li>
            The TF-IDF matrix is then constructed by fitting and transforming the 'overview' column of the metadata dataframe using the TfidfVectorizer object.
        </li>
        <li>
            The resulting matrix is stored in the variable tfidf_matrix.
        </li>
        <li>
            Finally, the shape of the tfidf_matrix is outputted to the console to show the number of rows and columns in the matrix.
        </li>
    </ul>
</div>

In [29]:
# Array mapping from feature integer indices to feature name.

tfidf.get_feature_names_out()[5000:5010]

# Print every word - just for fun
# for i in tfidf.get_feature_names_out():
#     print(i)




array(['avails', 'avaks', 'avalanche', 'avalanches', 'avallone', 'avalon',
       'avant', 'avanthika', 'avanti', 'avaracious'], dtype=object)

<b>Explanation: </b>
<div>
    <ul>
        <li>The <b>get_feature_names_out()</b> method of the <b>tfidf</b> object returns an array of feature names in the order they appear in the feature matrix.</li>
        <li> In this code snippet, the <b>[5000:5010]</b> slice is used to get the feature names at indices 5000 through 5009 (the end index 5010 is excluded).</li>
    </ul>
</div>

<div>
    <p>
        From the above output, we observe that 75,827 different words are used to describe the 46,628 movies in the dataset.
    </p>
    <p>
        With this matrix in hand, we can now compute a similarity score. There are several similarity metrics that we can use for this, such as Manhattan, Euclidean, Pearson, and the <b>Cosine similarity scores</b>.
    </p>
    <p>
        We use Cosine similarity here to calculate a numeric quantity that denotes the similarity between two movies. We choose it because it is independent of magnitude and is relatively easy and fast to calculate (especially when used in conjunction with TF-IDF scores, which will be explained later). Mathematically, it is defined as follows:
    </p>
    <img src="https://images.datacamp.com/image/upload/f_auto,q_auto:best/v1590782185/cos_aalkpq.png" />
</div>
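
As a minimal sketch of this definition, the function below computes the cosine similarity of two small vectors in plain Python: the dot product divided by the product of the vector lengths. The vectors and the function name cosine_similarity_demo are made up for illustration.

```python
import math

def cosine_similarity_demo(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]   # same direction as u, twice the length
w = [0.0, 0.0, 3.0]   # shares no non-zero component with u

same_direction = cosine_similarity_demo(u, v)
no_overlap = cosine_similarity_demo(u, w)

print(same_direction)   # approximately 1: magnitude does not matter
print(no_overlap)       # 0.0: the vectors have nothing in common
```

This is what "independent of magnitude" means in practice: a long overview and a short one that use the same words in the same proportions still score close to 1.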

<div>
    <p>
        Since you have used the TF-IDF vectorizer, whose output rows are L2-normalized by default, calculating the dot product between each pair of vectors directly gives you the cosine similarity score. Therefore, you will use <b>sklearn's linear_kernel()</b> instead of <b>cosine_similarity()</b>, since it is faster.
    </p>
    <p>
        This would return a matrix of shape 46628x46628, which means each movie overview has a cosine similarity score with every other movie <b>overview</b>. Hence, each movie will be a 1x46628 row vector where each column is a similarity score with one movie.
    </p>
</div>
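
The reason a plain dot product is enough can be checked on a toy example. The sketch below assumes, as with TfidfVectorizer's default norm='l2' setting, that every vector is scaled to unit length first; after that, the dot product and the cosine similarity coincide. The vectors and helper names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

a = [3.0, 4.0, 0.0]
b = [1.0, 0.0, 2.0]

# Cosine similarity computed from its definition
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Plain dot product of the unit-length versions of the same vectors
dot_after_normalizing = dot(l2_normalize(a), l2_normalize(b))

print(cosine)
print(dot_after_normalizing)
```

Both values agree up to floating-point rounding, which is why skipping the division by the norms saves work without changing the result.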

<img style="height: 400px;" src="https://images.deepai.org/glossary-terms/98c132dc646d49bb8dec45162095e74e/cosinesimilar.png">

In [30]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

<div><b>Explanation:</b></div>
<div>
    <ul>
        <li>
            This code imports the <b>linear_kernel</b> function from the <b>sklearn.metrics.pairwise</b> module.
        </li>
        <li>
            The <b>linear_kernel</b> function is used to compute the dot product of two matrices.
        </li>
        <li>
            In the next line, the <b>cosine_sim</b> variable is assigned the result of calling <b>linear_kernel</b> with the <b>tfidf_matrix</b> as both arguments.
        </li>
        <li>
            This computes the cosine similarity matrix of the <b>tfidf_matrix</b>, because the TF-IDF rows are already normalized to unit length.
        </li>
        <li>
            The <b>tfidf_matrix</b> is a matrix that represents the text data in a numerical form using the term frequency-inverse document frequency (TF-IDF) method.
        </li>
        <li>
            The cosine similarity matrix is a measure of similarity between each pair of documents in the <b>tfidf_matrix</b>.
        </li>
        <li>
            Overall, this code computes the cosine similarity matrix of the <b>tfidf_matrix</b> using the <b>linear_kernel</b> function.
        </li>
    </ul>
</div>

In [31]:
cosine_sim.shape

(46628, 46628)

In [47]:
for i in cosine_sim[0]:
    print(i)

1.0
0.01496864278103375
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.038385938120942834
0.0
0.0
...

In [32]:
# Construct a reverse map of indices and movie titles
indices = pd.Series(metadata.index, index=metadata['title'])

In [33]:
indices[:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64

<div>
    <p>
        We are now in good shape to define the recommendation function. These are the next steps:
    </p>
    <ul>
        <li>
            Get the index of the movie given its title
        </li>
        <li>
            Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position, and the second is the similarity score.</li>
        <li>
            Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.
        </li>
        <li>
            Get the top elements of this list, skipping the first one because it refers to the movie itself (the movie most similar to a particular movie is the movie itself).
        </li>
        <li>
            Return the titles corresponding to the indices of the top elements.
        </li>
    </ul>
</div>
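
The steps above can be sketched on a toy example before applying them to the full matrix. Everything here is hypothetical: four invented titles, a hand-written 4x4 similarity matrix, and a plain dictionary in place of the pandas reverse map. Unlike the notebook's function, this sketch drops the query movie by its index rather than by its position in the sorted list.

```python
titles = ["Alpha", "Beta", "Gamma", "Delta"]  # hypothetical movie titles

# Hypothetical symmetric similarity matrix; sim[i][j] compares movie i with movie j
sim = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]

# Reverse map from title to row index
title_to_index = {title: i for i, title in enumerate(titles)}

def recommend_demo(title, top_n=2):
    idx = title_to_index[title]                          # 1. look up the index
    scored = list(enumerate(sim[idx]))                   # 2. (position, score) tuples
    scored.sort(key=lambda pair: pair[1], reverse=True)  # 3. sort by score, best first
    scored = [p for p in scored if p[0] != idx][:top_n]  # 4. drop the movie itself, keep top_n
    return [titles[i] for i, _ in scored]                # 5. map positions back to titles

print(recommend_demo("Alpha"))   # ['Gamma', 'Beta']
print(recommend_demo("Delta"))   # ['Beta', 'Gamma']
```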

In [38]:
# Function that takes in a movie title as input and outputs the most similar movies

def get_recommendations(title, cosine_sim):
    # Get the index of the movie that matches the title
    # (several indices if the title appears more than once)
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    if idx.size == 1:
        sim_scores = list(enumerate(cosine_sim[idx]))
    else:
        sim_scores = []
        for i in idx:
            sim_scores = sim_scores + list(enumerate(cosine_sim[i]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (the movie itself) and keep the next 29 scores
    sim_scores = sim_scores[1:30]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the titles of the most similar movies
    return metadata['title'].iloc[movie_indices]


In [51]:
recommendations = get_recommendations('Iron Man 3', cosine_sim)
recommendations

29501                     Copperhead
16916    Bungee Jumping of Their Own
24575                 Berlin Babylon
15324                     Iron Man 2
30277                        Holiday
42652     They Knew What They Wanted
17299                      King Lear
12696                       Iron Man
8418                        Scarface
20752                  Dead Man Down
11171                           Fuse
25446         Two Weeks in September
2809                   The Dark Half
6127              Cradle 2 the Grave
13841                     Incendiary
1224                   Touch of Evil
5611            Saturday Night Fever
36970                      Kill Kane
36971                      Kill Kane
13545        Fireflies in the Garden
2241                       The Siege
24914                  Cyborg Cop II
19235                     Cosmopolis
16190                    Road, Movie
10176                         Bataan
1281      Until the End of the World
2156              Married to the Mob
3

<div>
    <p>
        We see that, while the system has done a decent job of finding movies with similar plot descriptions, the quality of recommendations is not that great. For example, "The Dark Knight Rises" returns all Batman movies, while it is more likely that people who liked that movie are more interested in enjoying Christian Bale's method acting.
    </p>
    <p>
        This is something that cannot be captured by the plot-description-based system.
    </p>
</div>

<h4 style="color: yellow;">CREDITS, GENRES, AND KEYWORDS BASED RECOMMENDER</h4>

<div>
    <p>
        The quality of your recommender can be improved by using better metadata and by capturing more of the finer details. That is precisely what you are going to do in this section. You will build a recommender system based on the following metadata: <b>the 3 top actors, the director, related genres, and the movie plot keywords</b>.
    </p>
</div>

In [11]:
# Load keywords and credits
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

# Remove rows with bad IDs.
metadata = metadata.drop([19730,29503,35587])

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')


In [12]:
# Print the first two movies of your newly merged metadata
metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."


<div>
    <p>
        From your new features, cast, crew, and keywords, you need to extract the three most important actors, the director and the keywords associated with that movie.
    </p>
    <p>
        But first things first: your data is present in the form of "stringified" lists. You need to convert them into a form that is usable for you.
    </p>
</div>

In [13]:
# Parse the stringified features into their corresponding python objects

from ast import literal_eval
features = ['cast','crew','keywords','genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)

<div>
    <ul>
        <li>
            This code snippet is parsing stringified features into their corresponding Python objects.
        </li>
        <li>
            It first imports the <b>literal_eval</b> function from the <b>ast</b> module.
        </li>
        <li>
            Then, it defines a list of features to be parsed, which includes 'cast', 'crew', 'keywords', and 'genres'.
        </li>
        <li>
            Next, it loops through each feature in the list and applies the <b>literal_eval</b> function to the corresponding column in the <b>metadata</b> dataframe.
        </li>
        <li>
            This function evaluates a string containing a Python literal or container, such as a list or dictionary, and returns the corresponding Python object.
        </li>
        <li>
            By applying <b>literal_eval</b> to each feature column, the code is converting the stringified data into actual Python objects, which can be more easily manipulated and analyzed.
        </li>
    </ul>
</div>
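
As a small illustration of what literal_eval does, the sketch below parses a shortened sample string written in the same format as the genres column (the sample itself is an assumption, not a row copied from the dataset). It also shows that literal_eval refuses anything that is not a plain literal, which is why it is preferred over eval for this job.

```python
from ast import literal_eval

# A stringified list of dictionaries, in the same format as the 'genres' column
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

parsed_genres = literal_eval(raw_genres)
genre_names = [g['name'] for g in parsed_genres]

print(type(raw_genres).__name__, "->", type(parsed_genres).__name__)  # str -> list
print(genre_names)                                                    # ['Animation', 'Comedy']

# literal_eval only accepts literals; a function call is rejected instead of executed
rejected = False
try:
    literal_eval("__import__('os').getcwd()")
except (ValueError, SyntaxError):
    rejected = True
print("expression rejected:", rejected)
```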

<div>
    <p>
        Next, write functions that will help you to extract the required information from each feature.
    </p>
    <p>
        First, you will import the NumPy package to get access to its <mark>NaN</mark> constant. Next, you can use it to write the <mark>get_director()</mark> function.
    </p>
</div>

In [14]:
# Import Numpy
import numpy as np

<div>
    <p>
        Get the director's name from the crew feature. If the director is not listed, return <mark>NaN</mark>.
    </p>
</div>

In [15]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

<div>
    <p>
        Next, you will write a function that will return the top 3 elements, or the entire list if it has fewer than 3 elements. Here the list refers to the <mark>cast</mark>, <mark>keywords</mark>, and <mark>genres</mark>.
    </p>
</div>

In [16]:
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        # Check if more than 3 elements exist. If yes, return only first three, else return the entire list
        if len(names) > 3:
            names = names[:3]
        return names
    
    # Return empty list in case of missing/malformed data
    return []

In [17]:
# Define new director, cast, genres and keywords features that are in a suitable form
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast','keywords','genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)


In [18]:
# Print the new features of the first 3 films
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[Animation, Comedy, Family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[Adventure, Fantasy, Family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[Romance, Comedy]"


<div>
    <p>
        The next step would be to convert the names and keyword instances into lowercase and strip all the spaces between them.
    </p>
    <p>
        Removing the spaces between words is an important preprocessing step. It is done so that your vectorizer does not count the Johnny of 'Johnny Depp' and 'Johnny Galecki' as the same. After this processing step, the aforementioned actors will be represented as 'johnnydepp' and 'johnnygalecki' and will be distinct to your vectorizer.
    </p>
    <p>
        Another good example where the model might output the same vector representation is 'bread jam' and 'traffic jam'. Hence, it is better to strip off any space that is present. 
    </p>
</div>
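
The effect of stripping spaces can be seen by comparing the tokens two cast lists would produce. This is a simplified sketch: it assumes the vectorizer splits text on whitespace, which only approximates the real tokenizer, and the tokens helper is a hypothetical name.

```python
def tokens(names, strip_spaces):
    """Lower-case each name, then either glue it into one token or split it on spaces."""
    result = set()
    for name in names:
        name = name.lower()
        if strip_spaces:
            result.add(name.replace(" ", ""))   # 'johnny depp' -> 'johnnydepp'
        else:
            result.update(name.split())         # 'johnny depp' -> 'johnny', 'depp'
    return result

movie_a_cast = ["Johnny Depp"]
movie_b_cast = ["Johnny Galecki"]

shared_raw = tokens(movie_a_cast, strip_spaces=False) & tokens(movie_b_cast, strip_spaces=False)
shared_clean = tokens(movie_a_cast, strip_spaces=True) & tokens(movie_b_cast, strip_spaces=True)

print(shared_raw)     # {'johnny'}: the two movies look related
print(shared_clean)   # set(): the actors are now distinct tokens
```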

In [19]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x,list):
        return [str.lower(i.replace(" ","")) for i in x]
    else:
        # Check if director exists. If not, return empty string
        if isinstance(x,str):
            return str.lower(x.replace(" ",""))
        else:
            return ''

In [20]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)

<div>
    <p>
        You are now in a position to create your "metadata soup", which is a string that contains all the metadata that you want to feed to your vectorizer (namely actors, director, keywords, and genres).
    </p>
    <p>
        The <mark>create_soup</mark> function will simply join all the required columns by a space. This is the final preprocessing step, and the output of this function will be fed into the word vector model.
    </p>
</div>

In [21]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [22]:
# Create a new soup feature
metadata['soup'] = metadata.apply(create_soup, axis = 1)

In [23]:
for i in metadata['soup'].head(3):
    print(i)

jealousy toy boy tomhanks timallen donrickles johnlasseter animation comedy family
boardgame disappearance basedonchildren'sbook robinwilliams jonathanhyde kirstendunst joejohnston adventure fantasy family
fishing bestfriend duringcreditsstinger waltermatthau jacklemmon ann-margret howarddeutch romance comedy


<div>
    <p>
        The next steps are the same as what you did with your plot description based recommender. One key difference is that you use the <mark>CountVectorizer()</mark> instead of <mark>TF-IDF</mark>. This is because you do not want to down-weight an actor's or director's presence if he or she has acted in or directed relatively more movies. It does not make much intuitive sense to down-weight them in this context.
    </p>
    <p>
        The major difference between <mark>CountVectorizer()</mark> and <mark>TF-IDF</mark> is the inverse document frequency (IDF) component, which is present in the latter and not in the former.
    </p>
</div>
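
To see the difference in numbers, the sketch below compares raw counts with a textbook TF-IDF weight on four invented soups. The IDF here is the unsmoothed natural log of (documents / documents containing the term), which is an assumption for illustration and not scikit-learn's exact formula; all names in the soups are made up.

```python
import math
from collections import Counter

# Hypothetical soups: one actor appears in three movies, another in only one
soups = [
    "prolificactor directorone comedy",
    "prolificactor directortwo drama",
    "prolificactor directorthree action",
    "rareactor directorfour comedy",
]
docs = [soup.split() for soup in soups]

def count_weight(doc, term):
    """Weight used by a count-based representation: the raw number of occurrences."""
    return Counter(doc)[term]

def idf(term):
    """Unsmoothed inverse document frequency over the toy corpus."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / doc_freq)

def tfidf_weight(doc, term):
    return count_weight(doc, term) * idf(term)

prolific_count = count_weight(docs[0], "prolificactor")
rare_count = count_weight(docs[3], "rareactor")
prolific_tfidf = tfidf_weight(docs[0], "prolificactor")
rare_tfidf = tfidf_weight(docs[3], "rareactor")

print(prolific_count, rare_count)   # 1 1: counts treat both actors equally
print(prolific_tfidf, rare_tfidf)   # the prolific actor receives the smaller weight
```

With counts, sharing the prolific actor contributes as much to similarity as sharing the rare one; with TF-IDF, the prolific actor would matter less simply for having appeared in more movies.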

In [24]:
# Import CountVectorizer and  create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])

In [27]:
count_matrix.shape

(46628, 73881)

<div>
    <p>
        From the above output, you can see that there are 73,881 distinct terms in the vocabulary built from the metadata that you fed to it.
    </p>
    <p>
        Next, you will use <mark>cosine_similarity</mark> to measure the similarity between the count vectors.
    </p>
</div>

In [25]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [26]:
# Reset index of your main DataFrame and construct reverse mapping as before
metadata = metadata.reset_index()
indices = pd.Series(metadata.index, index=metadata['title'])

<p>
    You can now reuse your <mark>get_recommendations()</mark> function by passing in the new <mark>cosine_sim2</mark> matrix as your second argument.
</p>

In [50]:
get_recommendations('Toy Story',cosine_sim2)

3024                                           Toy Story 2
15519                                          Toy Story 3
29198                                      Superstar Goofy
26001                           Toy Story That Time Forgot
22126                                 Toy Story of Terror!
3336                                     Creature Comforts
25999                                      Partysaurus Rex
27606                                                Anina
43071                        Dexter's Laboratory: Ego Trip
28005                                        Radiopiratene
29607                                          Cheburashka
40904                   VeggieTales: Josh and the Big Wall
40913    VeggieTales: Minnesota Cuke and the Search for...
41371                                              Uncle P
15734                     The Bugs Bunny/Road Runner Movie
46299                                               Banana
11209                                        Monster Hou

<h3 style="color: green">Reference:</h3>
<div>
    <ul>
        <li>
            <b>Original project from Datacamp:</b> <a href="https://www.datacamp.com/tutorial/recommender-systems-python">Click here</a>
        </li>
        <li>
            <b>The Movie Dataset:</b> <a href="https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data">Click here</a>
        </li>
    </ul>
</div>