## Hybrid Movie Recommender System
<ol>
   <li>I will implement a hybrid approach that combines collaborative filtering and content-based filtering to recommend movies similar to the ones a user has recently watched.</li>
    <ul>
        <li><b> My methodology is as follows:</b></li>
        <ol>
            <li>First, I will filter the top 100 similar movies using "keywords," "tagline," and "genres".</li>
            <li>Next, after obtaining the top 100 movies, I will apply a filter based on their "vote_count" and select the top 50 movies for further analysis.</li>
            <li>Finally, from those top 50 results, I will select the 5 movies whose "overview" is most similar to that of the watched movie.</li>
            </ol>
        </ul>
    </ol>

<ul>
    <li> Import the necessary libraries.</li>
</ul>
    

In [6]:
import pandas as pd
import numpy as np 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import json
import re
# from wordcloud import WordCloud

<ul>
    <li> Load the datasets.</li>
</ul>

In [7]:
credits = pd.read_csv("tmdb_5000_credits.csv")
movies = pd.read_csv("tmdb_5000_movies.csv")

In [8]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [9]:
movies=movies.drop('homepage',axis=1)

In [10]:
movies.dropna(inplace=True)

<ul>
    <li> Remove all the missing values in the dataset.</li>
</ul>

In [11]:
movies.shape

(3959, 19)

In [12]:
movies = movies.dropna()
credits = credits.dropna()

<ul>
    <li> Prepare the datasets to be used.</li>
</ul>

In [13]:
movies.shape

(3959, 19)

In [14]:
datasets = movies[["id", "original_title", "overview", "genres", "keywords", "tagline", "vote_average", "vote_count"]]

<ul>
    <li> I'll extract the genre and keyword information from the dataset, where it is stored as JSON strings.
I'll create a new dataframe for this.</li>
</ul>
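
As a minimal sketch of what this extraction does, assume each cell holds a JSON list of objects with a "name" field, as in the `movies.head(1)` output above. The helper name `names_from_json` and the sample cell are illustrative only and are not part of the notebook:

```python
import json


def names_from_json(cell: str) -> str:
    """Parse a JSON list of {"id": ..., "name": ...} objects and join the lowercased names."""
    return " ".join(item["name"].lower() for item in json.loads(cell))


# Illustrative cell in the same shape as the `genres` column.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

print(names_from_json(sample))      # action adventure
print(repr(names_from_json("[]")))  # '' for a movie with no genres
```

Note that a multi-word name such as "Science Fiction" becomes two separate tokens once the names are joined with spaces, which is how it appears in the `genres_combined` column.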

In [15]:
# Take an explicit copy so that adding columns below does not write to a slice of `datasets`
# (this avoids pandas' SettingWithCopyWarning).
keywords_taglines_genres = datasets[["id", "original_title"]].copy()

In [16]:
keywords_taglines_genres["genres_combined"] = datasets.genres.apply(lambda row: ' '.join([x['name'].lower() for x in json.loads(row)]))
keywords_taglines_genres["keywords_combined"] = datasets.keywords.apply(lambda row: ' '.join([x['name'].lower() for x in json.loads(row)]))
keywords_taglines_genres["tagline_processed"] = datasets.tagline.apply(lambda row: row.lower())

In [17]:
keywords_taglines_genres.head(2)

Unnamed: 0,id,original_title,genres_combined,keywords_combined,tagline_processed
0,19995,Avatar,action adventure fantasy science fiction,culture clash future space war space colony so...,enter the world of pandora.
1,285,Pirates of the Caribbean: At World's End,adventure fantasy action,ocean drug abuse exotic island east india trad...,"at the end of the world, the adventure begins."


<ul>
    <li> Next, we will merge them to prepare for the encoding process.</li>
</ul>

In [18]:
keywords_taglines_genres["keywords_taglines_genres"] = keywords_taglines_genres.apply(lambda row: ' '.join([row.genres_combined, row.keywords_combined, row.tagline_processed]), axis=1)



In [19]:
keywords_taglines_genres.head(2)

Unnamed: 0,id,original_title,genres_combined,keywords_combined,tagline_processed,keywords_taglines_genres
0,19995,Avatar,action adventure fantasy science fiction,culture clash future space war space colony so...,enter the world of pandora.,action adventure fantasy science fiction cultu...
1,285,Pirates of the Caribbean: At World's End,adventure fantasy action,ocean drug abuse exotic island east india trad...,"at the end of the world, the adventure begins.",adventure fantasy action ocean drug abuse exot...


<ul>
    <li> To compute similarities, we need to encode the text as numbers, since strings cannot be compared numerically. Simply mapping each word to an arbitrary number would carry no information about a movie's content. Instead, we can use CountVectorizer, which turns each movie's text into a vector of word counts over a shared vocabulary, so that movies using the same words get similar vectors.
For this purpose, I will pass stop_words='english' to use the English stop-word list provided by scikit-learn, which excludes words like 'the', 'and', 'a', 'an', 'in', 'of', 'to', etc.</li>
</ul>

In [20]:
keywords_taglines_genres = keywords_taglines_genres.sort_values("id")
keywords_taglines_genres = keywords_taglines_genres.reset_index(drop=True)
keywords_taglines_genres.head()

Unnamed: 0,id,original_title,genres_combined,keywords_combined,tagline_processed,keywords_taglines_genres
0,5,Four Rooms,crime comedy,hotel new year's eve witch bet hotel room sper...,twelve outrageous guests. four scandalous requ...,crime comedy hotel new year's eve witch bet ho...
1,11,Star Wars,adventure action science fiction,android galaxy hermit death star lightsaber je...,"a long time ago in a galaxy far, far away...",adventure action science fiction android galax...
2,12,Finding Nemo,animation family,father son relationship harbor underwater fish...,"there are 3.7 trillion fish in the ocean, they...",animation family father son relationship harbo...
3,13,Forrest Gump,comedy drama romance,vietnam veteran hippie mentally disabled runni...,"the world will never be the same, once you've ...",comedy drama romance vietnam veteran hippie me...
4,14,American Beauty,drama,male nudity female nudity adultery midlife cri...,look closer.,drama male nudity female nudity adultery midli...


In [21]:
keywords_taglines_genres.keywords_taglines_genres

0       crime comedy hotel new year's eve witch bet ho...
1       adventure action science fiction android galax...
2       animation family father son relationship harbo...
3       comedy drama romance vietnam veteran hippie me...
4       drama male nudity female nudity adultery midli...
                              ...                        
3954    history drama war action korea fictionalized h...
3955    action adventure animation family fantasy adve...
3956          horror  the hot spot where satan's waitin'.
3957    comedy family drama  it’s better to stand out ...
3958    thriller drama christian film sex trafficking ...
Name: keywords_taglines_genres, Length: 3959, dtype: object

In [22]:
vectorizer = CountVectorizer(stop_words='english')
vectorized = vectorizer.fit_transform(keywords_taglines_genres.keywords_taglines_genres)

In [23]:
vectorized.toarray().shape

(3959, 8920)

In [24]:
vectorized.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

<ul>
    <li>We now have a count for every word, so we can rank the words by how often they occur across all the movie texts.</li>
</ul>

In [25]:
words = vectorizer.get_feature_names_out()
words

array(['000', '007', '05', ..., '最后的舞者', '绝地奶霸', '超级妈妈'], dtype=object)

In [26]:
freqs = vectorized.sum(axis=0).A1 # sum matrix into a 1D NumPy array
freqs

array([17,  6,  2, ...,  1,  1,  1], dtype=int64)

In [27]:
sorted_indices = freqs.argsort()[::-1] # Sort the words by frequency in descending order
words = [words[idx] for idx in sorted_indices]
freqs = freqs[sorted_indices]

<ul>
    <li>I'll now use cosine similarity to measure how similar the movies are. Cosine similarity is the cosine of the angle between two non-zero vectors of an inner product space. It is often used in natural language processing and information retrieval to measure how similar two documents are based on their content. In general its value ranges from -1 to 1: 1 means the two vectors point in the same direction, 0 means they are orthogonal (they have nothing in common), and -1 means they point in opposite directions. Because word counts are never negative, the values here fall between 0 and 1, and higher values indicate greater similarity.
The formula for cosine similarity between two vectors A and B can be expressed as:

cosine_similarity = (A . B) / (||A|| ||B||)

where A . B represents the dot product of A and B, and ||A|| and ||B|| represent the magnitudes of A and B, respectively.</li>
</ul>
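
As a quick check of the formula, the sketch below computes it by hand with NumPy for three made-up vectors and compares the result with scikit-learn's `cosine_similarity`. The vectors and the helper name `cosine` are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 0.0, 2.0])


def cosine(u, v):
    """(u . v) / (||u|| ||v||), the formula given above."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


print(round(cosine(a, b), 4))  # 0.7071, i.e. a 45 degree angle
print(cosine(a, c))            # 0.0, the vectors share no non-zero component

# scikit-learn computes all pairwise similarities at once.
matrix = cosine_similarity(np.vstack([a, b, c]))
print(np.isclose(matrix[0, 1], cosine(a, b)))  # True
```

Scaling a vector does not change the result: `cosine(a, 3 * a)` is 1 up to floating-point rounding, which is why a long and a short text with the same word proportions count as identical.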

In [28]:
cos_results = cosine_similarity(vectorized.toarray())
pd.DataFrame(cos_results)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3949,3950,3951,3952,3953,3954,3955,3956,3957,3958
0,1.000000,0.000000,0.000000,0.024419,0.050016,0.043581,0.083388,0.024419,0.000000,0.031404,...,0.000000,0.059470,0.038837,0.071611,0.000000,0.000000,0.000000,0.0,0.065372,0.056614
1,0.000000,1.000000,0.000000,0.025777,0.026398,0.046004,0.198053,0.051554,0.054133,0.033150,...,0.084515,0.031388,0.040996,0.075593,0.000000,0.070491,0.122988,0.0,0.000000,0.000000
2,0.000000,0.000000,1.000000,0.096449,0.049386,0.021517,0.000000,0.000000,0.025318,0.031009,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.076696,0.0,0.064550,0.000000
3,0.024419,0.025777,0.096449,1.000000,0.071449,0.062257,0.059561,0.046512,0.000000,0.000000,...,0.076249,0.028318,0.000000,0.000000,0.032513,0.063596,0.036986,0.0,0.186772,0.053916
4,0.050016,0.026398,0.049386,0.071449,1.000000,0.021253,0.020332,0.047633,0.000000,0.030628,...,0.000000,0.029001,0.000000,0.000000,0.033296,0.032564,0.000000,0.0,0.127515,0.110432
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3954,0.000000,0.070491,0.000000,0.063596,0.032564,0.085126,0.027146,0.031798,0.033389,0.040893,...,0.104257,0.077440,0.000000,0.093250,0.044455,1.000000,0.050572,0.0,0.085126,0.073721
3955,0.000000,0.122988,0.076696,0.036986,0.000000,0.033005,0.126302,0.000000,0.155347,0.095130,...,0.121268,0.090075,0.117647,0.108465,0.000000,0.050572,1.000000,0.0,0.099015,0.000000
3956,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.0,0.000000,0.000000
3957,0.065372,0.000000,0.064550,0.186772,0.127515,0.055556,0.000000,0.062257,0.000000,0.000000,...,0.000000,0.075810,0.000000,0.000000,0.087039,0.085126,0.099015,0.0,1.000000,0.144338


<ul>
    <li>Now that we have the cosine similarity table for every pair of movies, we can look up the movie that the user just watched and find its similarity with other movies. Here I search the titles for a keyword to find the exact title to use.</li>
</ul>

In [29]:
[x for x in keywords_taglines_genres["original_title"] if 'iron' in x.lower()]

['Iron Man',
 'The Man in the Iron Mask',
 'Gridiron Gang',
 'Iron Man 2',
 'The Iron Giant',
 'Ironclad',
 'Iron Man 3',
 'The Iron Lady',
 'The Man with the Iron Fists']

<ul>
    <li>Now I'll create a function for my hybrid recommender system. Given only the title of the movie that the user has just watched, the recommender system suggests the top 5 similar movies. </li>
    <li>Once we obtain the movie index, we can use the cosine_similarity matrix to determine the 100 movies most similar to the one that the user watched. </li>
    <li>Next, we will select the top 50 movies from the 100 movies based on their votes.
To achieve this, we will use a weighted rating that balances each movie's vote average against its vote count, calculated using the formula:

 Weighted Rating (WR) = (v / (v + m)) * R + (m / (v + m)) * C
We will now create a new dataframe for this purpose.</li>
    <li>R = Average rating of the movie ("vote_average"). </li>
    <li>v = Number of votes for the movie ("vote_count"). </li>
    <li>C = Mean vote (or rating) across the candidate movies. </li>
    <li>m = Minimum votes (or ratings) required for the movie to be considered, set here to the 95th percentile of the candidates' vote counts. </li>
    <li>Function WR = Calculate the weighted average ratings using the provided formula. </li>
    <li>Sort the values and pick the highest scores to select the top 50 movies for recommendation. </li>
    <li>Filter the movies based on their content similarity, and for this, we'll be using the movie overview. </li>
    <li>To ensure that the user's watched movie is among the top 50, we will add it manually if it is not already there, as we need its overview for comparison. For the example used here, the movie was already included, as indicated by the "true" output of the check. </li>
    <li>To start with, I need to remove any numbers or symbols from the overview sentences and convert them all to lowercase.</li>
    <li>Let's now proceed with extracting their features. The main difference between CountVectorizer and TfidfVectorizer is that CountVectorizer only counts the occurrences of each word in a document, while TfidfVectorizer also down-weights words that appear in many documents. This makes TfidfVectorizer better at highlighting the words that distinguish one document from the rest of the corpus.</li>
    <li>The cosine similarity can now be computed between them.</li>
    <li>Let's now retrieve the movie index that we want to examine from the similarity table.
It's worth noting that each row in the table contains the similarities of that particular row to all the available movies in the table.</li>
</ul>
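
To see what the weighted rating does, here is a small worked sketch with three made-up movies. The titles and numbers are illustrative, and `m` is fixed by hand here rather than taken from a percentile as in the function below.

```python
import pandas as pd

ratings = pd.DataFrame({
    "title": ["A", "B", "C"],
    "vote_average": [9.0, 7.0, 8.0],  # R
    "vote_count": [10, 1000, 500],    # v
})

C = ratings["vote_average"].mean()  # mean rating of the candidates, 8.0 here
m = 500                             # illustrative vote threshold (assumption)

v = ratings["vote_count"]
R = ratings["vote_average"]
ratings["WR"] = (v / (v + m)) * R + (m / (v + m)) * C

ranked = ratings.sort_values("WR", ascending=False)
print(ranked[["title", "WR"]])
# A: (10/510)*9 + (500/510)*8   is about 8.02
# C: (500/1000)*8 + (500/1000)*8 = 8.00
# B: (1000/1500)*7 + (500/1500)*8 is about 7.33
```

Movie A has the highest raw average (9.0) but only 10 votes, so its score is pulled almost all the way back to the mean C, while movies with many votes stay close to their own average.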
    

In [30]:
# Hybrid movie recommender: keywords/tagline/genres similarity -> weighted rating -> overview similarity.

def Hybrid_recommender(Movie):
    movie_title = Movie

    # Stage 1: find the watched movie and its 100 most similar movies
    # in the keywords/tagline/genres cosine-similarity matrix.
    matches = np.flatnonzero((keywords_taglines_genres["original_title"] == movie_title).to_numpy())
    if len(matches) == 0:
        raise ValueError(f"Movie title not found: {movie_title!r}")
    movie_index = matches[0]
    top_100_movies = sorted(enumerate(cos_results[movie_index]), key=lambda x: x[1], reverse=True)[:100]
    top_100_index = [index for index, _ in top_100_movies]

    # Stage 2: keep the top 50 of those 100 by Weighted Rating (WR).
    top_movies = keywords_taglines_genres.iloc[top_100_index]
    second_filters = top_movies.merge(datasets, on="id", how="inner")
    # .copy() avoids pandas' SettingWithCopyWarning when adding columns below.
    weight_averaged_method = second_filters[["id", "original_title_x", "vote_average", "vote_count"]].copy()
    weight_averaged_method["R"] = weight_averaged_method.vote_average
    C = weight_averaged_method.R.mean()  # mean rating of the candidates
    weight_averaged_method["v"] = weight_averaged_method.vote_count
    PERCENTAGE = .95  # m is the 95th percentile of the candidates' vote counts
    m = weight_averaged_method.v.quantile(PERCENTAGE)
    def WR(query):
        R = query.R
        v = query.v
        return (v / (v + m)) * R + (m / (v + m)) * C
    weight_averaged_method["WR"] = weight_averaged_method.apply(WR, axis=1)
    weight_averaged_method = weight_averaged_method.sort_values("WR", ascending=False).reset_index(drop=True)
    top_50_movies = weight_averaged_method.head(50)

    # Stage 3: rank the top 50 by overview similarity to the watched movie.
    contentbased = top_50_movies.merge(datasets[["id", "overview"]], on="id", how="inner")[["id", "original_title_x", "overview"]]
    contentbased.columns = ["id", "original_title", "overview"]
    # The watched movie's overview is needed for comparison: add it if stage 2 dropped it.
    check_movie_exist = (contentbased.original_title == movie_title).any()
    if not check_movie_exist:
        add_title = datasets.loc[datasets.original_title == movie_title, ["id", "original_title", "overview"]]
        contentbased = pd.concat([contentbased, add_title], ignore_index=True)
    def func(row):
        # Lowercase and keep alphabetic words only (drops digits and symbols).
        text = row.overview.lower()
        return ' '.join(re.findall(r"[a-z]+", text))
    contentbased["overview_cleaned"] = contentbased.apply(func, axis=1)

    # Extract TF-IDF features from the cleaned overviews.
    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_features = tfidf.fit_transform(contentbased.overview_cleaned)

    # Cosine similarity between every pair of candidate overviews.
    cosine_similarity_tbls = cosine_similarity(tfidf_features)
    movie_index = np.flatnonzero((contentbased["original_title"] == movie_title).to_numpy())[0]

    # Sort the watched movie's row; position 0 is normally the movie itself (similarity 1).
    top_movies = pd.DataFrame(sorted(enumerate(cosine_similarity_tbls[movie_index]), key=lambda x: x[1], reverse=True),
                              columns=["top_index", "cosine_similarity"])
    top_5_movies = contentbased[["id", "original_title"]].iloc[top_movies.top_index[:6].to_numpy()].reset_index(drop=True)
    print(top_5_movies)
    print(f"The movies related to the {movie_title} movie that the user just watched are ...\n")
    for i in top_5_movies.values[1:]:
        print(i[1])

<ul>
<li>We have obtained the final output of our basic movie recommendation system! There are various ways to tackle this problem. In this approach, I narrow down the movies sequentially, with each stage filtering the results of the previous one.</li>
</ul>

In [31]:
Hybrid_recommender("The Iron Giant")

       id                     original_title
0   10386                     The Iron Giant
1  294272                      Pete's Dragon
2    9023   Spirit: Stallion of the Cimarron
3   22794  Cloudy with a Chance of Meatballs
4      62              2001: A Space Odyssey
5   12244                                  9
The movies related to the The Iron Giant movie that the user just watched are ...

Pete's Dragon
Spirit: Stallion of the Cimarron
Cloudy with a Chance of Meatballs
2001: A Space Odyssey
9




In [32]:
import pickle

In [33]:
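# Use a context manager so each file is flushed and closed after writing.
with open("movies_dict.pkl", "wb") as f:
    pickle.dump(keywords_taglines_genres.to_dict(), f)

In [34]:
with open("similarity.pkl", "wb") as f:
    pickle.dump(cos_results, f)

In [35]:
with open("datasets_dict.pkl", "wb") as f:
    pickle.dump(datasets.to_dict(), f)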
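
For completeness, here is a sketch of how such pickles could be read back, for example in a separate app. It is a self-contained round trip on a tiny stand-in dataframe and matrix written to a temporary directory, so the real files are not touched. The variable names are illustrative.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Tiny stand-ins for keywords_taglines_genres and cos_results.
frame = pd.DataFrame({"id": [5, 11], "original_title": ["Four Rooms", "Star Wars"]})
similarity = np.eye(2)

with tempfile.TemporaryDirectory() as tmp:
    dict_path = Path(tmp) / "movies_dict.pkl"
    sim_path = Path(tmp) / "similarity.pkl"

    # Save in the same way as the cells above.
    with open(dict_path, "wb") as f:
        pickle.dump(frame.to_dict(), f)
    with open(sim_path, "wb") as f:
        pickle.dump(similarity, f)

    # Load: the dict becomes a dataframe again, the matrix comes back as an ndarray.
    with open(dict_path, "rb") as f:
        restored = pd.DataFrame(pickle.load(f))
    with open(sim_path, "rb") as f:
        restored_sim = pickle.load(f)

print(restored)
print(restored_sim.shape)  # (2, 2)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.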