# Movie Recommendation System

In this notebook, I build a content-based movie recommendation system that ranks movies by their cosine similarity scores.

## Data Loading and Preprocessing

In [1]:
import ast

import numpy as np  # needed for the np.nan fallbacks used in later cells
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Load the data
df = pd.read_csv('data/tmdb_5000_movies.csv')
# Load the credits data
df_credits = pd.read_csv('data/tmdb_5000_credits.csv')


In [3]:
# Display the first few rows of the dataframe
df_credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [4]:
# Display the first few rows of the dataframe
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


For a content-based recommendation system, the useful features from the tmdb_5000_movies dataset could be genres, keywords, original_language, production_companies, and production_countries.

In [5]:
import ast

# Define a function to convert the stringified lists into actual lists
def convert_stringified_lists(df, column):
    df[column] = df[column].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)
    return df

# Convert the stringified lists into actual lists
for column in ['genres', 'keywords', 'production_companies', 'production_countries']:
    df = convert_stringified_lists(df, column)

# Extract the name from the lists of dictionaries and convert them into list of strings
df['genres'] = df['genres'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else np.nan)
df['keywords'] = df['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else np.nan)
df['production_companies'] = df['production_companies'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else np.nan)
df['production_countries'] = df['production_countries'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else np.nan)

# Display the first few rows of the dataframe
df[['title', 'genres', 'keywords', 'production_companies', 'production_countries']].head()


Unnamed: 0,title,genres,keywords,production_companies,production_countries
0,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Ingenious Film Partners, Twentieth Century Fo...","[United States of America, United Kingdom]"
1,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Walt Disney Pictures, Jerry Bruckheimer Films...",[United States of America]
2,Spectre,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Columbia Pictures, Danjaq, B24]","[United Kingdom, United States of America]"
3,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Legendary Pictures, Warner Bros., DC Entertai...",[United States of America]
4,John Carter,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...",[Walt Disney Pictures],[United States of America]


Now we have extracted the required information from the genres, keywords, production_companies, and production_countries columns. Each of these columns now contains a list of strings instead of a stringified list of dictionaries.
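
As a small illustration of the conversion above, the sketch below applies the same two steps to a single hand-written value. The string is a shortened, made-up example in the shape of the raw genres column, not a row copied from the dataset.

```python
import ast

# A shortened, illustrative value in the same shape as the raw `genres` column
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval safely parses the string into a list of dictionaries
parsed = ast.literal_eval(raw)

# Keep only the "name" field from each dictionary
names = [item["name"] for item in parsed]
print(names)  # ['Action', 'Adventure']
```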

In the credits dataset, cast and crew could be important features.
The cast and crew columns contain stringified lists of dictionaries, so we need to extract the relevant information from them, as we did with the movies dataset.

We can merge the movies and credits dataframes on the 'title' column, so that from now on we work with a single dataframe.
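
One caveat with merging on 'title' is that titles are not guaranteed to be unique. The sketch below uses a small made-up table (not the TMDB data) to show how duplicate titles multiply rows, and how merging on the numeric id avoids that. It assumes that `id` in the movies file and `movie_id` in the credits file refer to the same movie, as the previews above suggest.

```python
import pandas as pd

# Toy data (hypothetical): two different films share the title "Batman"
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})
credits = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})

# Merging on title pairs every "Batman" with every "Batman": 2 x 2 + 1 = 5 rows
by_title = pd.merge(movies, credits, on="title")

# Merging on the numeric id keeps exactly one row per movie
by_id = pd.merge(
    movies,
    credits.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
)

print(len(by_title), len(by_id))  # 5 3
```

Whether this matters in practice depends on how many titles repeat in the data; checking `df['title'].duplicated().sum()` before merging is a quick way to find out.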

In [6]:
# Merge the movies and credits dataframes
df = pd.merge(df, df_credits, on='title')

# Convert the stringified lists into actual lists
for column in ['cast', 'crew']:
    df = convert_stringified_lists(df, column)

# Extract the top 3 actors from the cast (most viewers pick a movie based on its lead actors, not the full cast)
df['cast'] = df['cast'].apply(lambda x: [i['name'] for i in x[:3]] if isinstance(x, list) else np.nan)

# Extract the director from the crew
df['director'] = df['crew'].apply(lambda x: [i['name'] for i in x if i['job'] == 'Director'] if isinstance(x, list) else np.nan)

# Display the first few rows of the dataframe
df[['title', 'cast', 'director']].head()


Unnamed: 0,title,cast,director
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,The Dark Knight Rises,"[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,John Carter,"[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


In [7]:
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew,director
0,237000000,"[Action, Adventure, Fantasy, Science Fiction]",http://www.avatarmovie.com/,19995,"[culture clash, future, space war, space colon...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[Ingenious Film Partners, Twentieth Century Fo...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{'credit_id': '52fe48009251416c750aca23', 'de...",[James Cameron]
1,300000000,"[Adventure, Fantasy, Action]",http://disney.go.com/disneypictures/pirates/,285,"[ocean, drug abuse, exotic island, east india ...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[Walt Disney Pictures, Jerry Bruckheimer Films...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[Johnny Depp, Orlando Bloom, Keira Knightley]","[{'credit_id': '52fe4232c3a36847f800b579', 'de...",[Gore Verbinski]
2,245000000,"[Action, Adventure, Crime]",http://www.sonypictures.com/movies/spectre/,206647,"[spy, based on novel, secret agent, sequel, mi...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[Columbia Pictures, Danjaq, B24]",...,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,206647,"[Daniel Craig, Christoph Waltz, Léa Seydoux]","[{'credit_id': '54805967c3a36829b5002c41', 'de...",[Sam Mendes]
3,250000000,"[Action, Crime, Drama, Thriller]",http://www.thedarkknightrises.com/,49026,"[dc comics, crime fighter, terrorist, secret i...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[Legendary Pictures, Warner Bros., DC Entertai...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,49026,"[Christian Bale, Michael Caine, Gary Oldman]","[{'credit_id': '52fe4781c3a36847f81398c3', 'de...",[Christopher Nolan]
4,260000000,"[Action, Adventure, Science Fiction]",http://movies.disney.com/john-carter,49529,"[based on novel, mars, medallion, space travel...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,[Walt Disney Pictures],...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,49529,"[Taylor Kitsch, Lynn Collins, Samantha Morton]","[{'credit_id': '52fe479ac3a36847f813eaa3', 'de...",[Andrew Stanton]


## Feature Selection and Transformation

The next step is to transform these features into a format that can be fed into a machine learning model. One common way to represent text is the "bag of words" model. This model transforms each text into a vector with one element per word in the vocabulary, where the vocabulary is the set of all words seen across all movies. With the CountVectorizer class from scikit-learn, each element holds the number of times the corresponding word occurs in the text, so it is 0 when the word is absent.
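
A minimal sketch of the bag-of-words idea on two made-up feature strings (hypothetical, not taken from the dataset). It shows that CountVectorizer lowercases tokens, orders the vocabulary alphabetically, and stores counts rather than 0/1 flags.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (hypothetical), shaped like the combined feature strings
docs = ["Action Adventure action hero", "Crime Drama hero"]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs).toarray()

# vocabulary_ maps each lowercased token to its column index
vocab = sorted(vectorizer.vocabulary_)
print(vocab)   # ['action', 'adventure', 'crime', 'drama', 'hero']
print(matrix)  # rows: [2 1 0 0 1] and [0 0 1 1 1]
```

Passing `binary=True` to CountVectorizer would cap every count at 1, which gives the presence/absence variant of the model.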

Common preprocessing steps before applying the vectorizer:

* Converting the text to lowercase, so that the algorithm does not treat the same words in different cases as different
* Removing punctuation and other special characters
* Removing stop words (common words that do not contain important meaning and are usually removed from texts)
* Stemming words (reducing inflected or derived words to their stem, base or root form)

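Here these steps are not applied separately: CountVectorizer lowercases the text and splits on punctuation by default, and the features are mostly names and short keywords, so stop-word removal and stemming are skipped.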
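
The sketch below shows which of the preprocessing steps listed above CountVectorizer can handle on its own, using one made-up string. Lowercasing and punctuation handling happen by default, stop-word removal is available through the `stop_words` parameter, and stemming is not built in and would need a separate library.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One made-up feature string with mixed case, punctuation and stop words
text = ["Warner Bros. and DC Entertainment, based on novel"]

# Default settings: lowercase the text and split on punctuation and whitespace
default_vocab = sorted(CountVectorizer().fit(text).vocabulary_)

# Optional: also drop scikit-learn's built-in English stop words
no_stop_vocab = sorted(CountVectorizer(stop_words="english").fit(text).vocabulary_)

print(default_vocab)  # includes 'and' and 'on', all tokens lowercased
print(no_stop_vocab)  # 'and' and 'on' are removed
```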


In [8]:
# Combine the features into a single string
df['combined_features'] = df['genres'] + df['keywords'] + df['production_companies'] + df['production_countries'] + df['cast'] + df['director']
df['combined_features'] = df['combined_features'].apply(lambda x: ' '.join(x) if isinstance(x, list) else np.nan)

# Display the first few rows of the dataframe
df[['title', 'combined_features']].head()

Unnamed: 0,title,combined_features
0,Avatar,Action Adventure Fantasy Science Fiction cultu...
1,Pirates of the Caribbean: At World's End,Adventure Fantasy Action ocean drug abuse exot...
2,Spectre,Action Adventure Crime spy based on novel secr...
3,The Dark Knight Rises,Action Crime Drama Thriller dc comics crime fi...
4,John Carter,Action Adventure Science Fiction based on nove...


## Model Building

In [9]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

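# Apply the vectorizer. Missing values are replaced with '' rather than dropped,
# so that row i of the matrix always corresponds to row i of df.
count_matrix = vectorizer.fit_transform(df['combined_features'].fillna(''))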

# Compute the pairwise cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix)

# Create a mapping from movie title to row index.
# When a title appears more than once, keep the first row so that indices[title] is a single integer.
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]
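
To make the similarity score concrete: for two count vectors a and b, cosine similarity is the dot product of a and b divided by the product of their lengths. The sketch below computes it by hand for two toy vectors (made up for illustration) and checks the result against scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two toy count vectors (hypothetical), one row each
a = np.array([[2, 1, 0, 0, 1]])
b = np.array([[0, 0, 1, 1, 1]])

# Manual formula: dot product divided by the product of the vector norms
manual = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

# The same value from scikit-learn
library = cosine_similarity(a, b)[0, 0]

print(round(manual, 4), round(library, 4))  # 0.2357 0.2357
```

Because count vectors have no negative entries, the score always falls between 0 (no shared words) and 1 (same direction).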


## Get Recommendations

In [10]:
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies (skip the first movie because it's the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]
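
The function above assumes that the queried movie is always the first entry after sorting. A possible variant, sketched here on a toy similarity matrix with a hypothetical helper name, excludes the query by index instead. This stays correct even when another movie has an identical feature string and ties for the top score.

```python
import numpy as np

def top_n_similar(sim_row, idx, n=2):
    """Return the positions of the n most similar items, excluding idx itself.

    Hypothetical helper for illustration; not part of the notebook.
    """
    # Sort positions by descending similarity
    order = np.argsort(-sim_row, kind="stable")
    # Drop the query position explicitly, then keep the first n
    return [int(i) for i in order if i != idx][:n]

# Toy 4 x 4 similarity matrix (made up)
sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

print(top_n_similar(sim[0], idx=0))  # [2, 3]
```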



In [11]:
# Test the function with a few movies
print(get_recommendations('Avatar'))
print("\n")
print(get_recommendations('The Dark Knight Rises'))
print("\n")
print(get_recommendations('John Carter'))

2409                     Aliens
838                      Alien³
3163                      Alien
278          Planet of the Apes
1537                  Moonraker
1658       Dragonball Evolution
1204                  Predators
47      Star Trek Into Darkness
507            Independence Day
373             Mission to Mars
Name: title, dtype: object


65           The Dark Knight
119            Batman Begins
210           Batman & Robin
72             Suicide Squad
1362                  Batman
1199            The Prestige
1363                  Batman
2799    The Killer Inside Me
428           Batman Returns
14              Man of Steel
Name: title, dtype: object


1329                                 The 5th Wave
752                           My Favorite Martian
1071         The Hitchhiker's Guide to the Galaxy
141                               Mars Needs Moms
972                                      The Host
3499    Beastmaster 2: Through the Portal of Time
4407                          Th

In [12]:
import pickle

# Save the similarity matrix; the with-block closes the file after writing
with open('artifacts/similarity.pkl', 'wb') as f:
    pickle.dump(cosine_sim, f)

In [13]:
# Keep the title, the combined features and the TMDB id, with 'id' renamed to 'movie_id'.
# Selecting the columns directly (instead of merging df with itself on 'title') keeps
# exactly one row per row of df, so the rows stay aligned with cosine_sim.
merged_df = df[['title', 'combined_features', 'id']].rename(columns={'id': 'movie_id'})
with open('artifacts/movie_list.pkl', 'wb') as f:
    pickle.dump(merged_df, f)
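
For completeness, a minimal sketch of how the saved artifacts could be read back later, for example from a separate app. It uses a small stand-in array and a temporary directory instead of the real 'artifacts/' files, so the file path here is illustrative only.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for the real similarity matrix
sim = np.array([[1.0, 0.5], [0.5, 1.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "similarity.pkl")

    # Write the object to disk
    with open(path, "wb") as f:
        pickle.dump(sim, f)

    # Read it back
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(np.array_equal(sim, restored))  # True
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.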