In [50]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity

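pandas is used for data manipulation and analysis; numpy supplies np.nan for missing values.

TfidfVectorizer converts text into numerical TF-IDF vectors.

linear_kernel computes dot products between vectors. Because TfidfVectorizer L2-normalizes each row by default, these dot products equal cosine similarities.

cosine_similarity computes cosine similarity directly and is used later with the count matrix.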

In [2]:
df = pd.read_csv('movies_metadata.csv')
df.head(3)



Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [4]:
df.shape

(45466, 24)

In [5]:
half_df = df.sample(frac=0.5) 

half_df above is a random 50% sample of df; it is used later to keep the similarity matrix small.

The next cell calculates the mean of the 'vote_average' column of df.

The result, C, represents the mean rating across all movies in the dataset.

In [6]:
C =df['vote_average'].mean()
print(C)

5.618207215134185


Calculate m, the minimum number of votes required to be in the chart,
to ensure that only movies with a significant number of votes are considered for the ranking.

In [4]:
m = df['vote_count'].quantile(0.90)
print(m)

160.0


- df['vote_count'].quantile(0.90) : calculates the 90th percentile of the vote_count column, meaning that only the top 10% of movies with the most votes are considered.

- m : represents this threshold value. Only movies with a vote count greater than or equal to m will be included in the ranking.
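As a small illustration of how the percentile cutoff behaves, the sketch below applies the same quantile call to a made-up Series. The values and the names vote_counts, m_toy and qualified are assumptions for the example, not data from movies_metadata.csv.

```python
import pandas as pd

# Toy vote counts (made up for illustration, not taken from the dataset)
vote_counts = pd.Series([0, 5, 10, 20, 40, 80, 160, 320, 640, 1280, 2560])

# 90th percentile: about 90% of the values fall at or below this cutoff
m_toy = vote_counts.quantile(0.90)

# Keep only the entries that reach the cutoff
qualified = vote_counts[vote_counts >= m_toy]
print(m_toy, len(qualified))
```

With 11 values the cutoff falls exactly on the tenth sorted value, so only the two largest counts qualify.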

In [5]:
# Filter out all qualified movies into a new DataFrame
q_movies = df.copy().loc[df['vote_count'] >= m]
q_movies.shape

(4555, 24)

- df.copy() : creates a copy of the original DataFrame to avoid modifying the original dataset.
- loc[df['vote_count'] >= m] : filters the DataFrame to include only those movies where the vote_count is greater than or equal to m.
- q_movies.shape : displays the shape (number of rows and columns) of the filtered DataFrame, indicating how many movies meet the vote count threshold.

In [6]:
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

- v = x['vote_count'] : extracts the number of votes for the movie.
- R = x['vote_average'] : extracts the average rating of the movie.
- The formula (v / (v + m) * R) + (m / (m + v) * C) : calculates the weighted rating.
- The result is a weighted score that balances the movie’s own average rating with the overall average rating, adjusted by the number of votes.
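To make the balancing effect concrete, here is a worked example using the m and C values printed earlier. The helper name weighted_rating_values and the second, hypothetical film are assumptions for illustration; the first call uses the vote count and average shown for The Shawshank Redemption in the chart output.

```python
def weighted_rating_values(v, R, m, C):
    """Weighted rating: pulls R toward the global mean C when v is small."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 160.0                  # 90th-percentile vote count printed earlier
C = 5.618207215134185      # mean vote_average printed earlier

# Many votes: the score stays close to the film's own average R = 8.5
many_votes = weighted_rating_values(v=8358, R=8.5, m=m, C=C)

# Hypothetical film with the same average but exactly m votes:
# the score lands halfway between R and C
few_votes = weighted_rating_values(v=160, R=8.5, m=m, C=C)

print(round(many_votes, 6), round(few_votes, 6))
```

The two weights v/(v+m) and m/(v+m) always sum to 1, so the score is a weighted average of R and C.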

In [7]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

In [8]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

In [9]:
#Print the top 20 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(20)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23673,Whiplash,4376.0,8.3,8.205404
5481,Spirited Away,3968.0,8.3,8.196055
2211,Life Is Beautiful,3643.0,8.3,8.187171


In [10]:
#Print plot overviews of the first 5 movies.
df['overview'].head()


0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

We need to find similarities between movie overviews, so we first compute a TF-IDF vector for each overview.

In [9]:
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df['overview'] = df['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(45466, 75827)

In [12]:
half_df['overview'] = half_df['overview'].fillna('')
half_tfidf_matrix = tfidf.transform(half_df['overview'])
half_tfidf_matrix.shape

(22733, 75827)

- TfidfVectorizer(stop_words='english') : initializes the vectorizer and configures it to remove common English stop words, which carry little meaning for the analysis.
- fillna('') : replaces missing values (NaN) in the overview column with empty strings, since the vectorizer raises an error on NaN input.
- tfidf.fit_transform(df['overview']) : learns the vocabulary and IDF weights from all overviews and transforms them into TF-IDF vectors.
- tfidf.transform(half_df['overview']) : reuses the fitted vocabulary to vectorize the sampled half, which is why both matrices have 75827 columns.
- tfidf_matrix.shape : outputs (number of documents, number of features), i.e. how many movies and how many unique terms are in the matrix.

In [13]:
#Array mapping from feature integer indices to feature name.
tfidf.get_feature_names_out()[5000:5010]

array(['avails', 'avaks', 'avalanche', 'avalanches', 'avallone', 'avalon',
       'avant', 'avanthika', 'avanti', 'avaracious'], dtype=object)

This cell shows a slice of the vocabulary behind the TF-IDF matrix: the feature names (words) at column positions 5000 to 5009.

In [14]:
# Compute the cosine similarity matrix (dot products suffice because TF-IDF rows are L2-normalized)
cosine_sim = linear_kernel(half_tfidf_matrix, half_tfidf_matrix)

In [15]:
cosine_sim.shape

(22733, 22733)
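The cell above uses linear_kernel rather than cosine_similarity. A quick check on a toy corpus (made-up documents, default TfidfVectorizer settings assumed) suggests why the two agree: each TF-IDF row already has unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["space ranger toy", "toy cowboy", "galaxy space"]
X = TfidfVectorizer().fit_transform(docs)   # rows are L2-normalized by default

# Euclidean length of each row
row_norms = np.asarray(np.sqrt(X.multiply(X).sum(axis=1))).ravel()

dot = linear_kernel(X, X)          # plain dot products
cos = cosine_similarity(X, X)      # normalizes rows first, then dot products

print(np.allclose(row_norms, 1.0), np.allclose(dot, cos))
```

Skipping the extra normalization step is the usual reason for preferring linear_kernel here.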

In [16]:
#Construct a reverse map of indices and movie titles
indices = pd.Series(df.index, index=df['title']).drop_duplicates()

In [17]:
indices[:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64

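Constructing a reverse map from movie titles to row positions lets us look up a movie's index from its title, which is needed to read that movie's row of similarity scores.

Note that this map is built from df, while cosine_sim above was computed from half_df, a shuffled 50% sample. Row i of cosine_sim therefore does not correspond to row i of df, so the lookups in get_recommendations do not line up with the similarity matrix, which likely explains the unrelated titles in the example outputs further down. For meaningful results, the map, the similarity matrix, and the title lookup should all be built from the same frame with a 0..n-1 index.

Also, drop_duplicates() here acts on the Series values (the index numbers), which are already unique, so repeated titles remain in the map.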
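A minimal sketch of an aligned setup on toy data follows. It assumes the sample is renumbered with reset_index(drop=True) and that the vectorizer, similarity matrix, and title map are all built from that same frame; the names toy, toy_half, toy_sim and toy_indices are made up for the example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E', 'F'],
    'overview': ['toy cowboy', 'space ranger', 'toy ranger',
                 'mafia family', 'family feud', 'space galaxy'],
})

# Sample, then renumber rows 0..n-1 so labels equal matrix row positions
toy_half = toy.sample(frac=0.5, random_state=0).reset_index(drop=True)

toy_matrix = TfidfVectorizer().fit_transform(toy_half['overview'])
toy_sim = linear_kernel(toy_matrix, toy_matrix)

# Title -> row position in toy_half (and therefore in toy_sim)
toy_indices = pd.Series(toy_half.index, index=toy_half['title'])
# Drop repeated titles by index label, keeping the first occurrence
toy_indices = toy_indices[~toy_indices.index.duplicated(keep='first')]

for title, row in toy_indices.items():
    # Each title maps back to itself and is most similar to itself
    print(title, row, toy_half['title'].iloc[row], round(toy_sim[row, row], 3))
```

Applied to the notebook, the same idea would mean building indices from half_df after resetting its index and returning half_df['title'] from the recommender.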

In [18]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]

The function takes a movie title as input and returns the titles of the 10 most similar movies based on cosine similarity scores.
- Given a movie title, the function first retrieves the index of that movie from the indices Series, which maps movie titles to their corresponding indices in the dataset.
- Using the provided cosine similarity matrix (cosine_sim), the function retrieves the pairwise similarity scores of the input movie with all other movies in the dataset.
- The similarity scores are sorted in descending order to identify the most similar movies.
- The first entry is skipped because it is expected to be the movie itself (similarity 1.0); the next 10 entries are kept and their titles returned.
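The slice sim_scores[1:11] relies on the query movie sorting first. If another movie ties with a score of 1.0 (for example, an identical overview), that assumption may not hold. The sketch below is one possible alternative that excludes the query by position instead; top_k_similar and the small matrix are hypothetical names and data for illustration.

```python
import numpy as np

def top_k_similar(sim_row, idx, k=10):
    """Return positions of the k highest scores in sim_row, excluding idx."""
    order = np.argsort(sim_row)[::-1]        # highest score first
    order = order[order != idx]              # drop the query movie itself
    return order[:k]

# Made-up 4x4 similarity matrix
sim = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

top_two = top_k_similar(sim[0], idx=0, k=2)
print(top_two)
```

For movie 0 the two closest rows are 2 (score 0.7) and then 1 (score 0.2).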

Examples :

In [19]:
get_recommendations('The Dark Knight Rises')

16341                The Life of Reilly
17179                Bis zum Ellenbogen
21327             The Battle for Marjah
5879                Divine Intervention
12615                           Reprise
16891               L: Change the World
1995                                Tex
9924                            Ringu 0
20134    The Cherry Orchard: Blossoming
12467                What Just Happened
Name: title, dtype: object

In [20]:
get_recommendations('The Godfather')

21251    A Limousine the Colour of Midsummer's Eve
15823                               I'm Still Here
11858                              Hostel: Part II
3070                                         Trans
7091                      The Beast of Yucca Flats
19737                                      The Day
4496                                    Millennium
20917                             The Little Thief
15758             The Reincarnation of Peter Proud
12733                             The Unholy Three
Name: title, dtype: object

**Credits, Genres, and Keywords Based Recommender**

- Load the credits.csv and keywords.csv files.
- Drop the rows with specific bad IDs.
- Ensure the IDs are in the correct format for merging.
- Merge the additional data into the main dataframe.
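The steps above can be sketched on two tiny frames. The frames and their contents are made up for illustration; the last step shows what happens to column names if the same merge is applied a second time.

```python
import pandas as pd

# Toy stand-ins for the metadata and credits tables (illustrative values)
toy_movies = pd.DataFrame({'id': ['862', '8844'],
                           'title': ['Toy Story', 'Jumanji']})
toy_credits = pd.DataFrame({'id': [862, 8844],
                            'cast': ["[{'name': 'Tom Hanks'}]",
                                     "[{'name': 'Robin Williams'}]"]})

# The key columns need matching dtypes before merging
toy_movies['id'] = toy_movies['id'].astype('int')

merged = toy_movies.merge(toy_credits, on='id')
print(list(merged.columns))

# Merging the same table again duplicates the shared columns with suffixes
merged_twice = merged.merge(toy_credits, on='id')
print(list(merged_twice.columns))
```

The second merge yields cast_x and cast_y, the same pattern that appears in the merged notebook frame.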

In [25]:
# Load keywords and credits
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

# Remove rows with bad IDs.
df = df.drop([19730, 29503, 35587])

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
df['id'] = df['id'].astype('int')

# Merge keywords and credits into the main dataframe
df = df.merge(credits, on='id')
df = df.merge(keywords, on='id')


After merging, check a few rows of the resulting dataframe to ensure that the merge was successful.

In [26]:
# Print the first two movies of your newly merged metadata
df.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,tagline,title,video,vote_average,vote_count,cast_x,crew_x,cast_y,crew_y,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."


In [28]:
print(df.columns)

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'cast_x', 'crew_x', 'cast_y', 'crew_y',
       'keywords'],
      dtype='object')


In [29]:
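# The _x/_y suffixes appear because credits was merged into a frame that already had
# cast/crew columns (the merge cell appears to have run twice); keep one copy under the plain names
df.rename(columns={'cast_y': 'cast', 'crew_y': 'crew'}, inplace=True)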

The features 'cast', 'crew', 'keywords', and 'genres' are stored as strings that look like Python lists of dictionaries. They need to be converted into real lists and dictionaries before further processing.

Properly parsed data allows for easier extraction of relevant information (e.g., extracting actor names from the cast).

literal_eval: this function from the ast module safely evaluates a string containing a Python literal (e.g., a list, dictionary, string, or number). Unlike eval, it does not execute arbitrary code.
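A short example of what literal_eval does to one such cell. The string below mimics the genres column; the variable names are assumptions for the example.

```python
from ast import literal_eval

raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
parsed = literal_eval(raw)

print(type(raw).__name__, type(parsed).__name__)   # str before, list after
print([g['name'] for g in parsed])

# Only literals are accepted; a function call is rejected instead of executed
rejected = False
try:
    literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True
print(rejected)
```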

In [30]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df[feature] = df[feature].apply(literal_eval)


get_director function : extracts the director's name from the 'crew' column of the dataframe.
- The 'crew' column contains a list of dictionaries, where each dictionary represents a crew member with details such as their job and name.
- The function iterates over these dictionaries and returns the name of the first person whose job is 'Director'. If no director is found, it returns np.nan.

In [31]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

get_list function : extracts names from a list of dictionaries, used here for the 'cast', 'keywords', and 'genres' columns.
- Only the first three names are returned if the list contains more than three elements.
- If the input is not a list (missing or malformed data), it returns an empty list.

In [32]:
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []
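To see both helpers in action, the sketch below restates them in compact form and applies them to a hand-made record. The crew and cast entries are invented for illustration.

```python
import numpy as np

def get_director(crew):
    # Return the first crew member whose job is 'Director', else NaN
    for member in crew:
        if member['job'] == 'Director':
            return member['name']
    return np.nan

def get_list(items):
    # Return up to three names, or an empty list for non-list input
    if isinstance(items, list):
        return [i['name'] for i in items][:3]
    return []

# Hand-made record (illustrative values only)
crew = [{'job': 'Producer', 'name': 'P. Example'},
        {'job': 'Director', 'name': 'D. Example'}]
cast = [{'name': n} for n in ['Actor One', 'Actor Two', 'Actor Three', 'Actor Four']]

print(get_director(crew))   # the director's name
print(get_list(cast))       # only the first three names
print(get_list(np.nan))     # missing data gives an empty list
```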

The goal is to define new features for director, cast, genres, and keywords in a form suitable for the content-based recommender system.

Here, we will:

- Extract the director's name from the crew column using the get_director function.
- Extract the names in the cast, keywords, and genres columns using the get_list function, keeping at most three per movie.

In [33]:
# Define new director, cast, genres and keywords features that are in a suitable form.
df['director'] = df['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    df[feature] = df[feature].apply(get_list)

In [34]:
# Print the new features of the first 3 films
df[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[Animation, Comedy, Family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[Adventure, Fantasy, Family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[Romance, Comedy]"


clean_data function : preprocesses strings by converting them to lowercase and removing spaces. This standardizes the text and turns each multi-word name into a single token, so that the vectorizer does not treat two people who share a first name as partly the same.

- Convert all strings to lowercase to ensure consistency in case sensitivity.
- Remove spaces from strings to eliminate inconsistencies due to varying spacing.
- Handle missing or non-string data types by converting them to empty strings.

In [35]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
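The effect of removing spaces can be seen by comparing token overlap before and after cleaning. This is a small sketch with a hypothetical helper, squash, that mirrors the string branch of clean_data.

```python
def squash(name):
    # Same idea as clean_data for a single string: lowercase, no spaces
    return name.replace(" ", "").lower()

cast_a = ["Tom Hanks", "Tim Allen"]
cast_b = ["Tom Cruise", "Tim Robbins"]

# Splitting on spaces makes the two casts appear to share tokens
raw_overlap = set(" ".join(cast_a).lower().split()) & set(" ".join(cast_b).lower().split())

# After squashing, each person is one token and nothing overlaps
clean_overlap = {squash(n) for n in cast_a} & {squash(n) for n in cast_b}

print(sorted(raw_overlap), sorted(clean_overlap))
```

Without cleaning, the shared first names would count as common terms between two unrelated casts.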

In [36]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    df[feature] = df[feature].apply(clean_data)

create_soup function : creates a combined string representation (a "soup") for each movie from its keywords, cast, director, and genres. This soup is what the vectorizer uses to compute the similarity between movies in the content-based recommender system.

- Join the keywords, cast, and genres lists with spaces and add the director string, giving a single space-separated string per movie.
- Each token in the soup is one cleaned keyword, person, or genre.

In [37]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [38]:
# Create a new soup feature
df['soup'] = df.apply(create_soup, axis=1)

In [39]:
df[['soup']].head(2)

Unnamed: 0,soup
0,jealousy toy boy tomhanks timallen donrickles ...
1,boardgame disappearance basedonchildren'sbook ...


This code segment imports CountVectorizer from scikit-learn and uses it to create a count matrix based on the 'soup' column of df.

The count matrix holds the frequency of each token in each movie's soup and is used to calculate the similarity between movies in the content-based recommender system.

In [41]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df['soup'])

In [42]:
count_matrix.shape

(46934, 73880)
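A toy version of the soup-to-similarity step is sketched below with three hand-written soups in the same cleaned, space-separated style. The strings and the names soups, counts and sim are illustrative assumptions, not rows taken from df.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hand-written soups (illustrative, not copied from the dataframe)
soups = [
    "toy boy tomhanks timallen johnlasseter animation comedy",
    "toy rescue tomhanks timallen johnlasseter animation comedy",
    "mafia family marlonbrando alpacino francisfordcoppola drama crime",
]

counts = CountVectorizer(stop_words='english').fit_transform(soups)
sim = cosine_similarity(counts)

print(counts.shape)
print(sim.round(2))
```

The first two soups share most of their tokens and score high, while the third shares none with them and scores 0. Counts are used instead of TF-IDF here so that a frequent actor or director is not down-weighted.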

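The tutorial computes the full cosine similarity matrix from count_matrix:

from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

That result is a dense 46934 x 46934 array. Assuming 8-byte floats, it needs roughly 17.6 GB of RAM, so the code below is edited to compute similarities for only the first 10,000 movies.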

In [55]:
count_matrix_half1 = count_matrix[:10000]

# Compute cosine similarity of the first 10,000 movies against all movies.
# Only titles among the first 10,000 rows can be used as queries.
cosine_sim_half1 = cosine_similarity(count_matrix_half1, count_matrix)
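One possible way to cover every movie without holding the full matrix is to process rows in chunks and keep only each row's top-k neighbours. The sketch below is an assumption about how that could look, not the tutorial's method; top_k_in_chunks, the chunk size, and the random toy matrix are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_in_chunks(matrix, k=10, chunk_size=1000):
    """For each row, return positions of its k most similar other rows."""
    n = matrix.shape[0]
    top = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Similarity of this chunk against all rows: shape (chunk, n)
        block = cosine_similarity(matrix[start:stop], matrix)
        # Exclude each row's similarity with itself
        block[np.arange(stop - start), np.arange(start, stop)] = -1.0
        # Sort descending and keep the first k columns
        top[start:stop] = np.argsort(-block, axis=1)[:, :k]
    return top

# Toy dense matrix standing in for count_matrix (random, illustrative)
rng = np.random.default_rng(0)
X = rng.random((25, 8))

neighbours = top_k_in_chunks(X, k=3, chunk_size=10)
print(neighbours.shape)
```

With chunks of 1,000 rows against 46934 columns, each block would take about 375 MB as 8-byte floats, and the stored result is only n x k integers.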

In [56]:
# Reset index of your main DataFrame and construct reverse mapping as before
metadata = df.reset_index()
indices = pd.Series(metadata.index, index=metadata['title'])

In [58]:
get_recommendations('Toy Story', cosine_sim_half1)

3056                       Toy Story 2
15745                      Toy Story 3
29499                  Superstar Goofy
26274       Toy Story That Time Forgot
22391             Toy Story of Terror!
3368                 Creature Comforts
26272                  Partysaurus Rex
27907                            Anina
43377    Dexter's Laboratory: Ego Trip
28306                    Radiopiratene
Name: title, dtype: object