## Movie Recommendation System
Using TMDB 5000 Movie Dataset
#### [Link to Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata)

In [1]:
import pandas as pd 
import numpy as np 

In [2]:
df1=pd.read_csv('data/tmdb_5000_credits.csv')
df2=pd.read_csv('data/tmdb_5000_movies.csv')

In [3]:
df1.columns, df2.columns

(Index(['movie_id', 'title', 'cast', 'crew'], dtype='object'),
 Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
        'original_title', 'overview', 'popularity', 'production_companies',
        'production_countries', 'release_date', 'revenue', 'runtime',
        'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
        'vote_count'],
       dtype='object'))

In [4]:
# Merge the dataframes on movie id and title, then store the result as CSV
df1 = df1.rename(columns={'movie_id': 'id'})
merged_df = df2.merge(df1, on=['id', 'title'])
merged_df.to_csv('data/tmdb_5000.csv', index=False)

In [5]:
df = pd.read_csv('data/tmdb_5000.csv')
df = df[['id', 'title', 'genres', 'keywords', 'overview', 'release_date', 'cast', 'crew']]
df.head(3)

|  | id | title | genres | keywords | overview | release_date | cast | crew |
|---|---|---|---|---|---|---|---|---|
| 0 | 19995 | Avatar | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | [{"id": 1463, "name": "culture clash"}, {"id":... | In the 22nd century, a paraplegic Marine is di... | 2009-12-10 | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |
| 1 | 285 | Pirates of the Caribbean: At World's End | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | Captain Barbossa, long believed to be dead, ha... | 2007-05-19 | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de... |
| 2 | 206647 | Spectre | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | [{"id": 470, "name": "spy"}, {"id": 818, "name... | A cryptic message from Bond’s past sends him o... | 2015-10-26 | [{"cast_id": 1, "character": "James Bond", "cr... | [{"credit_id": "54805967c3a36829b5002c41", "de... |


### Credits, Genres and Keywords Based Recommender
We will build a recommender from the following metadata: the first three listed actors, the director, and up to three genres and three plot keywords per movie.

In [6]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df[feature] = df[feature].apply(literal_eval)
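
The `literal_eval` call above turns each stringified cell back into a list of dictionaries. As a minimal sketch of what happens to a single cell, the example below parses one sample string and falls back to an empty list when a value cannot be parsed. The sample string and the `parse_cell` helper are hypothetical and not part of the notebook.

```python
from ast import literal_eval

# Hypothetical sample shaped like one cell of the 'genres' column
raw_cell = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'


def parse_cell(value):
    """Parse a stringified list of dicts; return [] for anything unparseable."""
    try:
        parsed = literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        # Covers malformed strings and non-string values such as NaN
        return []
    return parsed if isinstance(parsed, list) else []


genres = parse_cell(raw_cell)
genre_names = [genre["name"] for genre in genres]
print(genre_names)
```

Swapping this wrapper in for `literal_eval` inside `.apply(...)` would only be appropriate if silently turning bad cells into empty lists is acceptable for the analysis.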

In [7]:
df[features].head(2)

|  | cast | crew | keywords | genres |
|---|---|---|---|---|
| 0 | [{'cast_id': 242, 'character': 'Jake Sully', '... | [{'credit_id': '52fe48009251416c750aca23', 'de... | [{'id': 1463, 'name': 'culture clash'}, {'id':... | [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam... |
| 1 | [{'cast_id': 4, 'character': 'Captain Jack Spa... | [{'credit_id': '52fe4232c3a36847f800b579', 'de... | [{'id': 270, 'name': 'ocean'}, {'id': 726, 'na... | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... |


In [8]:
original_df = df.copy()

In [9]:
# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(crew_list):
    for crew in crew_list:
        if crew['job'] == 'Director':
            return crew['name']
    return np.nan


# Return the names of (at most) the first three entries in the given list
def get_names_from_list(x):
    if isinstance(x, list):
        # Slicing is safe even when the list has fewer than three elements
        return [row['name'] for row in x[:3]]
    return []


# Convert all strings to lower case and strip spaces from names
def clean_data(x):
    if isinstance(x, list):
        return [i.replace(' ', '').lower() for i in x]
    # A missing director is NaN (a float), so return an empty string for non-strings
    if isinstance(x, str):
        return x.replace(' ', '').lower()
    return ''


# Join one row's cast, director, genres and keywords into a single space-separated string
def create_combined_features_data(row):
    return ' '.join(row['cast']) + ' ' + row['director'] + ' ' + ' '.join(row['genres']) + ' ' + ' '.join(row['keywords'])

In [10]:
# Define new director, cast, genres and keywords features in a suitable form
df['director'] = df['crew'].apply(get_director)

for feature in ['cast', 'keywords', 'genres']:
    df[feature] = df[feature].apply(get_names_from_list)

In [11]:
df[['title', 'cast', 'director', 'keywords', 'genres']].head()

|  | title | cast | director | keywords | genres |
|---|---|---|---|---|---|
| 0 | Avatar | [Sam Worthington, Zoe Saldana, Sigourney Weaver] | James Cameron | [culture clash, future, space war] | [Action, Adventure, Fantasy] |
| 1 | Pirates of the Caribbean: At World's End | [Johnny Depp, Orlando Bloom, Keira Knightley] | Gore Verbinski | [ocean, drug abuse, exotic island] | [Adventure, Fantasy, Action] |
| 2 | Spectre | [Daniel Craig, Christoph Waltz, Léa Seydoux] | Sam Mendes | [spy, based on novel, secret agent] | [Action, Adventure, Crime] |
| 3 | The Dark Knight Rises | [Christian Bale, Michael Caine, Gary Oldman] | Christopher Nolan | [dc comics, crime fighter, terrorist] | [Action, Crime, Drama] |
| 4 | John Carter | [Taylor Kitsch, Lynn Collins, Samantha Morton] | Andrew Stanton | [based on novel, mars, medallion] | [Action, Adventure, Science Fiction] |


The next step is to convert the names and keywords to lowercase and strip the spaces within them. This way the vectorizer does not treat the "Tom" in "Tom Hardy" and the "Tom" in "Tom Hanks" as the same token.
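
To see why the spaces matter, here is a small standard-library sketch. It assumes the regular expression below mirrors `CountVectorizer`'s default token pattern (words of two or more characters); the `tokenize` and `squash` helpers are hypothetical stand-ins, not notebook functions.

```python
import re

# Assumption: this regex mirrors CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def squash(name):
    """Remove spaces and lowercase, as clean_data does for a single string."""
    return name.replace(" ", "").lower()


# Without cleaning, the two actors share the token 'tom'
shared_raw = set(tokenize("Tom Hardy")) & set(tokenize("Tom Hanks"))

# After cleaning, each actor becomes one distinct token
shared_clean = set(tokenize(squash("Tom Hardy"))) & set(tokenize(squash("Tom Hanks")))

print(shared_raw, shared_clean)
```

With the raw names, any two movies starring a "Tom" would look slightly similar for no good reason; after squashing, only an exact actor match contributes to similarity.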

In [12]:
# Apply clean_data function to features
features = ['cast', 'keywords', 'director', 'genres']
for feature in features:
    df[feature] = df[feature].apply(clean_data)

In [13]:
df[features].head(2)

|  | cast | keywords | director | genres |
|---|---|---|---|---|
| 0 | [samworthington, zoesaldana, sigourneyweaver] | [cultureclash, future, spacewar] | jamescameron | [action, adventure, fantasy] |
| 1 | [johnnydepp, orlandobloom, keiraknightley] | [ocean, drugabuse, exoticisland] | goreverbinski | [adventure, fantasy, action] |


Now we combine all the metadata into a single string that will be fed to the vectorizer.

In [14]:
df['combined_features_data'] = df.apply(create_combined_features_data, axis=1)

In [15]:
df[['title', 'combined_features_data']].head(3)

|  | title | combined_features_data |
|---|---|---|
| 0 | Avatar | samworthington zoesaldana sigourneyweaver jame... |
| 1 | Pirates of the Caribbean: At World's End | johnnydepp orlandobloom keiraknightley gorever... |
| 2 | Spectre | danielcraig christophwaltz léaseydoux sammende... |


## Vectorizer
We use `CountVectorizer()` instead of TF-IDF because we do not want to down-weight an actor or director simply because they have appeared in or directed relatively many movies; that would make little intuitive sense for a recommender.
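
The effect can be illustrated with a dependency-free sketch. The mini-corpus and token names below are made up, and the smoothed IDF formula is assumed to match the default used by scikit-learn's TF-IDF transformer (`smooth_idf=True`).

```python
import math

# Hypothetical mini-corpus: each string mimics one movie's combined features
docs = [
    "actorone prolificdirector drama",
    "actortwo prolificdirector action",
    "actorthree prolificdirector adventure",
    "actorfour raredirector drama",
]


def document_frequency(term, corpus):
    """Number of documents that contain the term."""
    return sum(term in doc.split() for doc in corpus)


def smoothed_idf(term, corpus):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    # (assumed to match scikit-learn's default behaviour)
    n = len(corpus)
    return math.log((1 + n) / (1 + document_frequency(term, corpus))) + 1


idf_prolific = smoothed_idf("prolificdirector", docs)  # appears in 3 of 4 docs
idf_rare = smoothed_idf("raredirector", docs)          # appears in 1 of 4 docs

# With raw counts, each director contributes exactly 1 per movie.
# With TF-IDF, the prolific director's weight is scaled down relative to the rare one.
print(f"prolific: {idf_prolific:.3f}, rare: {idf_rare:.3f}")
```

Under TF-IDF, sharing the prolific director would count for less than sharing the rare one, which is the behaviour the notebook wants to avoid.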

In [16]:
import pickle

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
count_matrix = vectorizer.fit_transform(df['combined_features_data'])

# Save the vectorizer
with open('data/vectorizer.pkl', 'wb') as vectorizer_file:
    pickle.dump(vectorizer, vectorizer_file)
# Save the count matrix
with open('data/count_matrix.pkl', 'wb') as count_matrix_file:
    pickle.dump(count_matrix, count_matrix_file)

### Testing the recommendations

In [18]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(count_matrix, count_matrix)
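
Cosine similarity is the dot product of two count vectors divided by the product of their lengths. As a rough standard-library sketch of what each entry of `cosine_sim` represents, the hypothetical `cosine_from_text` helper below computes it for two feature strings; the three sample strings are invented.

```python
import math
from collections import Counter


def cosine_from_text(text_a, text_b):
    """Cosine similarity between the token-count vectors of two strings."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    # Only tokens present in both strings contribute to the dot product
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical combined-feature strings
movie_a = "actorone actortwo directorx action adventure"
movie_b = "actorone actorthree directorx action fantasy"
movie_c = "actorfour directorz drama"

sim_ab = cosine_from_text(movie_a, movie_b)  # 3 shared tokens, 5 tokens each
sim_ac = cosine_from_text(movie_a, movie_c)  # no shared tokens
print(sim_ab, sim_ac)
```

Because the score is normalised by vector length, a movie with very few tokens can score highly if most of them match, which is worth keeping in mind when reading the results.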

In [19]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations_from_title(title, cosine_sim):
    # Get the index of the movie with the given title
    movie_index = df[df['title'] == title].index[0]
    
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[movie_index]))
    
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]  # Exclude the movie itself
    
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    # Return the top 10 most similar movies
    return df.iloc[movie_indices]['title']


In [20]:
get_recommendations_from_title('The Avengers', cosine_sim)

7                  Avengers: Age of Ultron
26              Captain America: Civil War
79                              Iron Man 2
169     Captain America: The First Avenger
174                    The Incredible Hulk
85     Captain America: The Winter Soldier
31                              Iron Man 3
33                   X-Men: The Last Stand
68                                Iron Man
94                 Guardians of the Galaxy
Name: title, dtype: object

In [21]:
get_recommendations_from_title('Batman', cosine_sim)

428                        Batman Returns
210                        Batman & Robin
5                            Spider-Man 3
9      Batman v Superman: Dawn of Justice
10                       Superman Returns
14                           Man of Steel
299                        Batman Forever
473                         Mars Attacks!
813                              Superman
870                           Superman II
Name: title, dtype: object

### Finding recommendations from a list of movies liked by the user

In [22]:
titles = ['Warrior', 'The Dark Knight', 'Gone Girl', 'Fight Club', 'The Game']
idxs=[]
for title in titles:
    # Get the index of the movie with the given title
    movie_index = df[df['title'] == title].index[0]    
    idxs.append(movie_index)
idxs

[1920, 65, 693, 662, 946]

In [23]:
def preprocess_df(df):
    # Define new director, cast, genres and keywords features in a suitable form
    df['director'] = df['crew'].apply(get_director)

    for feature in ['cast', 'keywords', 'genres']:
        df[feature] = df[feature].apply(get_names_from_list)

    # Apply clean_data function to features
    features = ['cast', 'keywords', 'director', 'genres']
    for feature in features:
        df[feature] = df[feature].apply(clean_data)

    df['combined_features_data'] = df.apply(create_combined_features_data, axis=1)

In [24]:
def get_recommendations(input_df):
    preprocess_df(input_df)

    # Load the vectorizer
    with open('data/vectorizer.pkl', 'rb') as vectorizer_file:
        vectorizer = pickle.load(vectorizer_file)
    # Load the count matrix
    with open('data/count_matrix.pkl', 'rb') as count_matrix_file:
        count_matrix = pickle.load(count_matrix_file)

    # Transform the combined_features_data of input_df into a count matrix 
    count_matrix_input = vectorizer.transform(input_df['combined_features_data'])

    # Calculate cosine similarity
    cosine_sim_input = cosine_similarity(count_matrix, count_matrix_input)

    # Aggregate similarity scores for all movies in input_df
    total_sim_scores = cosine_sim_input.sum(axis=1)

    # Get the indices of the top recommendations
    top_indices = total_sim_scores.argsort()[::-1]

    # Excluding movies from input_df
    top_indices = [idx for idx in top_indices if idx not in input_df.index]

    # Get the top recommended movies and their corresponding scores
    top_recommendations = df.iloc[top_indices[:10]].copy()
    top_scores = total_sim_scores[top_indices[:10]]

    top_recommendations['Similarity Score'] = top_scores

    return top_recommendations
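
The aggregation step in `get_recommendations` can be pictured with a tiny pure-Python sketch: sum each catalog movie's similarity to every liked movie, drop the liked movies themselves, and keep the highest totals. The score matrix, the `liked_indices` set and the `top_k_from_liked` helper are hypothetical illustrations, not the notebook's data.

```python
# Hypothetical scores: rows are catalog movies, columns are the liked movies
sim_to_liked = [
    [1.0, 0.2],  # movie 0 (one of the liked movies)
    [0.6, 0.5],  # movie 1
    [0.1, 0.9],  # movie 2
    [0.2, 1.0],  # movie 3 (one of the liked movies)
    [0.4, 0.3],  # movie 4
]
liked_indices = {0, 3}


def top_k_from_liked(sim_rows, liked, k):
    """Rank catalog movies by total similarity to the liked movies."""
    totals = [sum(row) for row in sim_rows]
    # Never recommend a movie the user already listed
    candidates = [i for i in range(len(totals)) if i not in liked]
    candidates.sort(key=lambda i: totals[i], reverse=True)
    return [(i, totals[i]) for i in candidates[:k]]


top = top_k_from_liked(sim_to_liked, liked_indices, k=2)
print(top)
```

Summing and averaging give the same ranking here, since every candidate is compared against the same number of liked movies.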

In [25]:
input_df = original_df.iloc[idxs].copy()
recommended_df = get_recommendations(input_df)
recommended_df[['title', 'combined_features_data', 'Similarity Score']]


|  | title | combined_features_data | Similarity Score |
|---|---|---|---|
| 4638 | Amidst the Devil's Wings | drama action crime | 1.482143 |
| 4589 | Fabled | drama mystery thriller independentfilm | 1.441688 |
| 3 | The Dark Knight Rises | christianbale michaelcaine garyoldman christop... | 1.211803 |
| 2915 | Trash | ricksontevez eduardoluis gabrielweinstein step... | 1.208282 |
| 4564 | Straight Out of Brooklyn | mattyrich drama | 1.144427 |
| 100 | The Curious Case of Benjamin Button | cateblanchett bradpitt tildaswinton davidfinch... | 1.13541 |
| 1010 | Panic Room | jodiefoster kristenstewart forestwhitaker davi... | 1.123607 |
| 421 | Zodiac | jakegyllenhaal robertdowneyjr. markruffalo dav... | 1.123607 |
| 1196 | The Prestige | hughjackman christianbale michaelcaine christo... | 1.111803 |
| 119 | Batman Begins | christianbale michaelcaine liamneeson christop... | 1.111803 |
