Movie Recommender System
Here we implement a content-based recommender system using movie metadata: keywords, cast, and crew.
The goal is to suggest movies that share keywords, cast members, or crew members with a given movie.

Importing Data Processing Libraries

In [2]:
import numpy as np
import pandas as pd

Reading the Datasets into Python

In [5]:
cred_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/credits.csv')
cred_df.head()

ParserError: ignored

In [None]:
key_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/keywords.csv')
key_df.head()

Both dataframes have an id column that identifies the movie, so we can merge them on id.

In [None]:
cred_df.shape

In [None]:
key_df.shape

In [None]:
key_df['keywords'][0]

In [None]:
cred_df['cast'][0]

In [None]:
key_df.info()

In [None]:
cred_df.info()

Merging Dataframes


Since the two dataframes have different numbers of rows, we check for duplicate ids.

In [None]:
cred_df['id'].nunique()

In [None]:
key_df['id'].nunique()

In [None]:
key_df.drop_duplicates(subset=['id'], inplace=True)

In [None]:
cred_df.drop_duplicates(subset=['id'], inplace=True)

In [None]:
key_df.shape, cred_df.shape

After dropping duplicates, the row counts of the two dataframes align.

In [None]:
new_df = key_df.merge(cred_df, on='id')
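
A minimal sketch of why duplicate ids matter before a merge. The toy frames `toy_key` and `toy_cred` and their values are hypothetical stand-ins, not the real CSVs:

```python
import pandas as pd

# Toy stand-ins for key_df and cred_df (hypothetical values).
toy_key = pd.DataFrame({"id": [1, 2, 2, 3], "keywords": ["a", "b", "b", "c"]})
toy_cred = pd.DataFrame({"id": [1, 2, 2, 3], "cast": ["w", "x", "x", "y"]})

# Merging with duplicated ids multiplies the matching rows (2 x 2 = 4 rows for id 2).
naive = toy_key.merge(toy_cred, on="id")

# Dropping duplicates first keeps one row per id.
clean = toy_key.drop_duplicates(subset=["id"]).merge(
    toy_cred.drop_duplicates(subset=["id"]), on="id"
)

print(len(naive), len(clean))
```

Passing validate="one_to_one" to merge is another option: pandas then raises an error if either side still contains duplicate keys.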

In [None]:
new_df.head()

Processing Data

In [None]:
new_df['keywords'][0]

We can use ast.literal_eval to convert each value from a string into a Python list.

In [None]:
from ast import literal_eval

In [None]:
new_df['keywords'] = new_df['keywords'].apply(literal_eval)

In [None]:
new_df['keywords'][0][0]['name']

We keep only the keyword names, as a list of lowercase strings.

In [None]:
new_df['keywords'] = new_df['keywords'].apply(lambda x: [i['name'].lower() for i in x] if isinstance(x, list) else [])
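
The parse-then-extract steps above can be sketched as a single helper. The function name `extract_names`, its `limit` parameter, and the sample string `raw` are hypothetical; the sketch assumes each cell is a stringified list of dicts with a 'name' key, as in the keywords column:

```python
from ast import literal_eval

# A raw cell in the style of the keywords column (hypothetical example values).
raw = "[{'id': 1, 'name': 'Jealousy'}, {'id': 2, 'name': 'Toy'}]"

def extract_names(cell, limit=None):
    """Parse a stringified list of dicts and return the lowercase names."""
    try:
        items = literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # malformed or missing cell
    if not isinstance(items, list):
        return []
    # items[:None] is the whole list, so limit=None keeps every entry.
    return [item["name"].lower() for item in items[:limit]]

names = extract_names(raw)
```

Calling extract_names(cell, limit=4) would correspond to the first-four rule used for cast and crew later on.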

In [None]:
new_df['keywords']

In [None]:
new_df['cast'][0]

Since there are many cast members, we keep only the first four, who are typically the leading actors in the movie.

In [None]:
new_df['cast'] = new_df['cast'].apply(literal_eval)

In [None]:
new_df['cast'] = new_df['cast'].apply(lambda x: [i['name'].lower() for i in x[:4]] if isinstance(x, list) else [])

In [None]:
new_df['cast']

In [None]:
new_df['crew'][0]

Similarly, for crew we keep the first four names, on the assumption that these cover the director and the writers (including screenplay writers).

In [None]:
new_df['crew'] = new_df['crew'].apply(literal_eval)

In [None]:
new_df['crew'] = new_df['crew'].apply(lambda x: [i['name'].lower() for i in x[:4]] if isinstance(x, list) else [])

In [None]:
new_df.head()

Converting to string

We convert each extracted list into a comma-separated string, then join the three features into a single feature.

In [None]:
','.join(map(str, new_df['keywords'][0]))

In [None]:
new_df['new_key'] = new_df['keywords'].apply(lambda x: ','.join(map(str, x)))

In [None]:
new_df.head()

In [None]:
new_df['new_cast'] = new_df['cast'].apply(lambda x: ','.join(map(str, x)))

In [None]:
new_df['new_crew'] = new_df['crew'].apply(lambda x: ','.join(map(str, x)))

In [None]:
new_df.head()

In [None]:
def merge_cols(X):
    a = X['new_key']
    b = X['new_cast']
    c = X['new_crew']
    return f'{a}, {b}, {c}'

In [None]:
new_df['movie_details'] = new_df.apply(merge_cols, axis=1)
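
As an alternative sketch, the same string can be built with column-wise concatenation instead of a row-wise apply. The toy frame `toy` and its values are hypothetical:

```python
import pandas as pd

# Toy stand-in with the three joined-string columns (hypothetical values).
toy = pd.DataFrame({"new_key": ["a,b"], "new_cast": ["x"], "new_crew": ["y,z"]})

# Column-wise concatenation gives the same "key, cast, crew" layout as merge_cols.
toy["movie_details"] = toy["new_key"] + ", " + toy["new_cast"] + ", " + toy["new_crew"]
```

This assumes all three columns hold strings with no missing values, which holds here because empty lists are joined into empty strings.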

In [None]:
new_df.head()

In [None]:
new_df['movie_details'][0]

Reading movie data to get names of movies

In [None]:
movies = pd.read_csv('/content/drive/MyDrive/Ml_course/recommender_systems/bootcamp/movies_metadata.csv')
movies.head()

In [None]:
movies.shape

In [None]:
movies.columns

In [None]:
movies['id'].nunique()

In [None]:
movies.drop_duplicates(subset=['id'], inplace=True)

In [None]:
movies.shape

In [None]:
# Copy the slice so the later assignment to movies['id'] does not act on a view
movies = movies.iloc[:20000, :].copy()

In [None]:
def to_num(x):
    # Convert an id to int; ids that cannot be parsed are mapped to 0
    try:
        return int(x)
    except (ValueError, TypeError):
        return 0

In [None]:
movies['id'] = movies['id'].apply(to_num)

In [None]:
movies.loc[movies['id'] == 0]

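Because the process crashed several times while computing cosine similarity, we keep only the first 19,500 rows. The similarity matrix has one entry per pair of movies, so its size grows with the square of the row count.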
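
A rough back-of-the-envelope estimate of that memory cost, assuming a dense matrix of 8-byte floats. The helper name `dense_similarity_gib` and the 45,000-row comparison figure are hypothetical, chosen only for illustration:

```python
def dense_similarity_gib(n_rows, bytes_per_value=8):
    """Approximate size in GiB of a dense n x n similarity matrix."""
    return n_rows * n_rows * bytes_per_value / 2**30

# 19,500 is the row count used here; 45,000 is a hypothetical larger dataset.
for n in (19_500, 45_000):
    print(n, round(dense_similarity_gib(n), 1))
```

Under these assumptions, 19,500 rows already need close to 3 GiB for the similarity matrix alone.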

In [None]:
movies = movies.iloc[:19500,:]

In [None]:
movies.loc[movies['id'] == 0]

In [None]:
movies.shape

In [None]:
movies['id'].loc[~movies['id'].isin(new_df['id'])]

In [None]:
new_df.shape

In [None]:
# Copy the slice so the later column assignment does not act on a view
new_df = new_df.iloc[:19500, :].copy()

Adding Movie Titles to the Working Dataframe

In [None]:
movie_names = movies['title'].to_list()

In [None]:
new_df.shape

In [None]:
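# movie_details always contains the ', ' separators, so strip them before testing for empty rows
new_df.loc[new_df['movie_details'].str.strip(', ') == '']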

In [None]:
# Positional assignment: assumes row i of movies and row i of new_df describe the same movie
new_df['title'] = movie_names
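
The assignment above matches titles to rows purely by position, which the notebook does not verify. A minimal sketch of an id-based join that avoids this assumption, using hypothetical toy frames `toy_details` and `toy_movies`:

```python
import pandas as pd

# Toy stand-ins (hypothetical values); note the rows are in different orders.
toy_details = pd.DataFrame({"id": [10, 20, 30], "movie_details": ["a", "b", "c"]})
toy_movies = pd.DataFrame({"id": [30, 10, 20], "title": ["C", "A", "B"]})

# Join titles on id instead of relying on row position.
with_titles = toy_details.merge(toy_movies[["id", "title"]], on="id", how="inner")
```

An inner join also drops rows whose id has no title, so the resulting frame may be shorter than either input.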

In [None]:
new_df.columns

In [None]:
new_df.drop(columns=['keywords', 'cast', 'crew', 'new_key', 'new_cast', 'new_crew'], inplace=True)

Final Data

In [None]:
new_df.head()

TF-IDF Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf = TfidfVectorizer(stop_words='english')

In [None]:
new_df.isnull().sum()

In [None]:
tfidf_matrix = tfidf.fit_transform(new_df['movie_details'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

In [None]:
# Import linear_kernel to compute the dot product
from sklearn.metrics.pairwise import linear_kernel

In [None]:
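# Compute the cosine similarity matrix. TfidfVectorizer L2-normalizes each row
# by default, so the dot product of two rows equals their cosine similarity.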
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
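
A small sketch checking that, for TF-IDF rows with the default L2 normalization, linear_kernel and cosine_similarity agree. The three-document corpus `docs` is a hypothetical toy example:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy corpus in the same comma-separated style as movie_details (hypothetical values).
docs = ["lion, king, africa", "shark, fish, ocean", "lion, prince, africa"]

# Rows are L2-normalized by default (norm='l2').
matrix = TfidfVectorizer().fit_transform(docs)

dot = linear_kernel(matrix, matrix)      # plain dot products
cos = cosine_similarity(matrix, matrix)  # normalizes again, then dot products

same = np.allclose(dot, cos)
```

Because the rows already have unit length, the plain dot product gives the same result without normalizing a second time.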

In [None]:
indices = pd.Series(new_df.index, index=new_df['title'])
indices = indices[~indices.index.duplicated()]  # keep the first row for repeated titles

In [None]:
indices

In [None]:
# Function that takes a movie title as input and returns recommendations
def content_recommender(title, cosine_sim=cosine_sim, df=new_df, indices=indices):

    # Look up the row index of the requested movie
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    # as a list of (row index, score) tuples
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies by cosine similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the first entry,
    # which is normally the movie itself (similarity 1.0)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the titles of the 10 most similar movies
    return df['title'].iloc[movie_indices]
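
The ranking step can be illustrated on a toy similarity matrix. This is a sketch under stated assumptions, not the notebook's function: `toy_titles`, `toy_sim`, `toy_indices`, and `top_similar` are hypothetical names, and the variant excludes the query movie explicitly instead of dropping the first sorted entry:

```python
import numpy as np
import pandas as pd

# Toy titles and a symmetric toy similarity matrix (hypothetical values).
toy_titles = pd.Series(["A", "B", "C", "D"])
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])
toy_indices = pd.Series(toy_titles.index, index=toy_titles)

def top_similar(title, k=2):
    """Return the k titles most similar to `title`, excluding the title itself."""
    if title not in toy_indices.index:
        raise KeyError(f"unknown title: {title!r}")
    idx = toy_indices[title]
    order = np.argsort(toy_sim[idx])[::-1]  # highest score first
    order = order[order != idx][:k]         # drop the query row explicitly
    return toy_titles.iloc[order].tolist()
```

Excluding the query row by index is a small design choice: if another movie has identical details and also scores 1.0, simply skipping the first sorted entry could drop the wrong row.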

In [None]:
content_recommender('The Lion King')

In [None]:
new_df['movie_details'].loc[new_df['title'] == 'Shark Tale'].values

In [None]:
new_df['movie_details'].loc[new_df['title'] == 'The Lion King'].values

Here we can see some overlap in crew and keywords between the two movies, which suggests that the recommender is behaving as intended.