# Movie Recommendation System

This notebook builds a content-based movie recommendation system from the TMDB 5000 Movie Dataset. The system recommends movies whose content is similar to a given movie, comparing the overview, genres, keywords, top-billed cast, and director.

## Table of Contents

1.  [Loading the Data](#loading-the-data)
2.  [Merging the DataFrames](#merging-the-dataframes)
3.  [Data Inspection](#data-inspection)
4.  [Filtering Relevant Columns](#filtering-relevant-columns)
5.  [Handling Missing Values](#handling-missing-values)
6.  [Checking for Duplicates](#checking-for-duplicates)
7.  [Extracting Information from JSON Strings](#extracting-information-from-json-strings)
8.  [Cleaning and Combining Tags](#cleaning-and-combining-tags)
9.  [Creating a New DataFrame with Essential Information](#creating-a-new-dataframe-with-essential-information)
10. [Lowercasing and Joining Tags](#lowercasing-and-joining-tags)
11. [Text Stemming](#text-stemming)
12. [Vectorizing Text Data](#vectorizing-text-data)
13. [Calculating Cosine Similarity](#calculating-cosine-similarity)
14. [Creating a Recommendation Function](#creating-a-recommendation-function)
15. [Testing the Recommendation Function](#testing-the-recommendation-function)

## 1. Loading the Data

We begin by loading the two datasets: `tmdb_5000_movies.csv` and `tmdb_5000_credits.csv` into pandas DataFrames. These datasets contain information about movies and their corresponding cast and crew.

In [33]:
import pandas as pd
import numpy as np
import ast
from google.colab import drive



In [34]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

## 2. Merging the DataFrames

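To combine the movie information with the cast and crew details, we merge the `movies` and `credits` DataFrames on the 'title' column. Titles are not guaranteed to be unique, so films that share a title can produce extra rows in the merged result.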

In [35]:
movies = movies.merge(credits, on='title')
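
A join on 'title' pairs every row with every other row that carries the same title, so repeated titles multiply rows. The sketch below uses made-up toy frames (not the TMDB files) to show the effect, and one possible alternative: joining on the numeric ID. The column names `id` and `movie_id` follow the dataset; the rows and the `_credits` suffix are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (made-up rows for illustration).
toy_movies = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Alien"],  # two different films share a title
})
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Alien"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Joining on title pairs each "Batman" row with both "Batman" credit rows.
by_title = toy_movies.merge(toy_credits, on="title")

# Joining on the ID keeps exactly one row per film.
by_id = toy_movies.merge(
    toy_credits, left_on="id", right_on="movie_id", suffixes=("", "_credits")
)

print(len(by_title))  # 5 rows: 2 x 2 for "Batman" plus 1 for "Alien"
print(len(by_id))     # 3 rows
```

Whether an ID-based join suits this notebook depends on the later cells, which expect a single 'title' column.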

## 3. Data Inspection

Before proceeding, we inspect the merged DataFrame to understand its structure and identify any potential issues. We check the shape of the DataFrame and examine the distribution of the 'status' column.

In [36]:
print("Shape of the merged DataFrame:", movies.shape)
print("\nValue counts for 'status' column:")
print(movies['status'].value_counts())
print("\nInformation about the DataFrame:")
print(movies.info())

Shape of the merged DataFrame: (4809, 23)

Value counts for 'status' column:
status
Released           4801
Rumored               5
Post Production       3
Name: count, dtype: int64

Information about the DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_d

## 4. Filtering Relevant Columns

For our content-based recommendation system, we only need a subset of the columns. We select 'movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', and 'crew'.

In [37]:
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]
print("\nDataFrame after filtering columns:")
print(movies.head())


DataFrame after filtering columns:
   movie_id                                     title  \
0     19995                                    Avatar   
1       285  Pirates of the Caribbean: At World's End   
2    206647                                   Spectre   
3     49026                     The Dark Knight Rises   
4     49529                               John Carter   

                                            overview  \
0  In the 22nd century, a paraplegic Marine is di...   
1  Captain Barbossa, long believed to be dead, ha...   
2  A cryptic message from Bond’s past sends him o...   
3  Following the death of District Attorney Harve...   
4  John Carter is a war-weary, former military ca...   

                                              genres  \
0  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
1  [{"id": 12, "name": "Adventure"}, {"id": 14, "...   
2  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
3  [{"id": 28, "name": "Action"}, {"id": 80, "nam...   
4  [

## 5. Handling Missing Values

We check for and remove any rows with missing values in the selected columns to ensure data quality and prevent errors during subsequent processing.

In [38]:
print("Number of missing values before dropping:")
print(movies.isnull().sum())
movies.dropna(inplace=True)
print("\nNumber of missing values after dropping:")
print(movies.isnull().sum())

Number of missing values before dropping:
movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

Number of missing values after dropping:
movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64
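
One side effect of `dropna` is easy to miss: it removes rows but keeps the original index labels, so labels and row positions no longer agree. The toy example below (made-up data) shows the gap and one way to close it with `reset_index(drop=True)`.

```python
import pandas as pd

# Made-up frame with one missing overview.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["first", None, "third"],
})

cleaned = toy.dropna()
print(list(cleaned.index))    # [0, 2]: label 1 is gone

# Renumber the rows so that label == position again.
realigned = cleaned.reset_index(drop=True)
print(list(realigned.index))  # [0, 1]
```

This matters whenever a later step looks rows up by position, for example in a NumPy array built from the DataFrame.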


## 6. Checking for Duplicates

We verify that there are no duplicate rows in the filtered DataFrame. Duplicate rows can skew the recommendation results.

In [39]:
print("Number of duplicate rows:", movies.duplicated().sum())

Number of duplicate rows: 0


## 7. Extracting Information from JSON Strings

Several columns ('genres', 'keywords', 'cast', and 'crew') store their data as JSON-formatted strings. We parse each string into a Python list of dictionaries with `ast.literal_eval` and keep only the fields we need. For 'genres' and 'keywords', we extract every 'name'. For 'cast', we keep the 'name' of the first 3 cast members. For 'crew', we keep the 'name' of each entry whose 'job' is 'Director'.

In [24]:
movies['genres'] = movies['genres'].apply(lambda x: [i['name'] for i in ast.literal_eval(x)])
movies['keywords'] = movies['keywords'].apply(lambda x: [i['name'] for i in ast.literal_eval(x)])
movies['cast'] = movies['cast'].apply(lambda x: [i['name'] for i in ast.literal_eval(x)][:3]) # Extracting only the first 3 cast members
movies['crew'] = movies['crew'].apply(lambda x: [i['name'] for i in ast.literal_eval(x) if i['job'] == 'Director']) # Extracting only the director's name

print("\nDataFrame after extracting information from JSON strings:")
print(movies.head())


DataFrame after extracting information from JSON strings:
   movie_id                                     title  \
0     19995                                    Avatar   
1       285  Pirates of the Caribbean: At World's End   
2    206647                                   Spectre   
3     49026                     The Dark Knight Rises   
4     49529                               John Carter   

                                            overview  \
0  In the 22nd century, a paraplegic Marine is di...   
1  Captain Barbossa, long believed to be dead, ha...   
2  A cryptic message from Bond’s past sends him o...   
3  Following the death of District Attorney Harve...   
4  John Carter is a war-weary, former military ca...   

                                          genres  \
0  [Action, Adventure, Fantasy, Science Fiction]   
1                   [Adventure, Fantasy, Action]   
2                     [Action, Adventure, Crime]   
3               [Action, Crime, Drama, Thriller]   
4
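
The same extraction can be sketched with small named helpers instead of inline lambdas. The version below parses with `json.loads` rather than `ast.literal_eval`, which is a swapped-in alternative and not what the cell above uses. The helper names `extract_names` and `extract_directors` are hypothetical, and the crew sample is made up.

```python
import json

def extract_names(json_text, limit=None):
    """Return the 'name' of each object in a JSON array string."""
    names = [item["name"] for item in json.loads(json_text)]
    return names if limit is None else names[:limit]

def extract_directors(json_text):
    """Return the 'name' of every crew entry whose 'job' is 'Director'."""
    return [item["name"] for item in json.loads(json_text)
            if item.get("job") == "Director"]

sample_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
# Made-up crew entries for illustration.
sample_crew = ('[{"job": "Director", "name": "Jane Doe"}, '
               '{"job": "Editor", "name": "John Roe"}]')

print(extract_names(sample_genres))           # ['Action', 'Adventure']
print(extract_names(sample_genres, limit=1))  # ['Action']
print(extract_directors(sample_crew))         # ['Jane Doe']
```

Applied to the DataFrame, this would read, for instance, `movies['cast'].apply(lambda x: extract_names(x, limit=3))`.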

## 8. Cleaning and Combining Tags

To create a unified representation for each movie, we remove the spaces inside each extracted element, so that a multi-word entry such as 'Science Fiction' becomes the single token 'ScienceFiction'. We then split the 'overview' into words and concatenate it with the 'genres', 'keywords', 'cast', and 'crew' lists into a single 'tags' column.

In [25]:
movies['overview'] = movies['overview'].apply(lambda x: x.split())
movies['genres'] = movies['genres'].apply(lambda x: [i.replace(' ', '') for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [i.replace(' ', '') for i in x])
movies['cast'] = movies['cast'].apply(lambda x: [i.replace(' ', '') for i in x])
movies['crew'] = movies['crew'].apply(lambda x: [i.replace(' ', '') for i in x])

movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

print("\nDataFrame after cleaning and combining tags:")
print(movies.head())


DataFrame after cleaning and combining tags:
   movie_id                                     title  \
0     19995                                    Avatar   
1       285  Pirates of the Caribbean: At World's End   
2    206647                                   Spectre   
3     49026                     The Dark Knight Rises   
4     49529                               John Carter   

                                            overview  \
0  [In, the, 22nd, century,, a, paraplegic, Marin...   
1  [Captain, Barbossa,, long, believed, to, be, d...   
2  [A, cryptic, message, from, Bond’s, past, send...   
3  [Following, the, death, of, District, Attorney...   
4  [John, Carter, is, a, war-weary,, former, mili...   

                                         genres  \
0  [Action, Adventure, Fantasy, ScienceFiction]   
1                  [Adventure, Fantasy, Action]   
2                    [Action, Adventure, Crime]   
3              [Action, Crime, Drama, Thriller]   
4           [Action
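
A short sketch of why the spaces are removed: once the tags are joined into one string and split into words, a multi-word name would otherwise fall apart into separate tokens, and two different people who share a first name would share a token. The names below are made up and `collapse_spaces` is a hypothetical helper.

```python
def collapse_spaces(names):
    """Remove the spaces inside each name so it survives as one token."""
    return [name.replace(" ", "") for name in names]

raw = ["Jane Doe", "Jane Roe"]  # made-up names for illustration

# Without collapsing, both people contribute the token 'Jane'.
tokens_raw = " ".join(raw).split()
print(tokens_raw)        # ['Jane', 'Doe', 'Jane', 'Roe']

# With collapsing, each person is one distinct token.
tokens_collapsed = " ".join(collapse_spaces(raw)).split()
print(tokens_collapsed)  # ['JaneDoe', 'JaneRoe']
```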

## 9. Creating a New DataFrame with Essential Information

We create a new DataFrame `newdf` containing only the 'movie_id', 'title', and the newly created 'tags' column. This DataFrame will be used for vectorization and similarity calculation.

In [26]:
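# .copy() gives an independent DataFrame, which avoids SettingWithCopyWarning on later assignments
newdf = movies[['movie_id', 'title', 'tags']].copy()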

print("\nNew DataFrame with essential information:")
print(newdf.head())


New DataFrame with essential information:
   movie_id                                     title  \
0     19995                                    Avatar   
1       285  Pirates of the Caribbean: At World's End   
2    206647                                   Spectre   
3     49026                     The Dark Knight Rises   
4     49529                               John Carter   

                                                tags  
0  [In, the, 22nd, century,, a, paraplegic, Marin...  
1  [Captain, Barbossa,, long, believed, to, be, d...  
2  [A, cryptic, message, from, Bond’s, past, send...  
3  [Following, the, death, of, District, Attorney...  
4  [John, Carter, is, a, war-weary,, former, mili...  


## 10. Lowercasing and Joining Tags

We convert the 'tags' column to lowercase and join the list of words/tags into a single string separated by spaces. This standardization is important for accurate text vectorization.

In [27]:
newdf['tags'] = newdf['tags'].apply(lambda x: (" ".join(x)).lower())

print("\nDataFrame after lowercasing and joining tags:")
print(newdf.head())


DataFrame after lowercasing and joining tags:
   movie_id                                     title  \
0     19995                                    Avatar   
1       285  Pirates of the Caribbean: At World's End   
2    206647                                   Spectre   
3     49026                     The Dark Knight Rises   
4     49529                               John Carter   

                                                tags  
0  in the 22nd century, a paraplegic marine is di...  
1  captain barbossa, long believed to be dead, ha...  
2  a cryptic message from bond’s past sends him o...  
3  following the death of district attorney harve...  
4  john carter is a war-weary, former military ca...  




## 11. Text Stemming

We apply stemming to the 'tags' column using the `PorterStemmer` from the NLTK library. Stemming strips suffixes so that inflected forms of a word map to the same stem (for example, 'sends' becomes 'send'). This shrinks the vocabulary and lets the vectorizer count those forms as one term. The resulting stems, such as 'believ' from 'believed', are not always dictionary words.

In [28]:
import nltk
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)

newdf['tags'] = newdf['tags'].apply(stem)

print("\nDataFrame after stemming tags:")
print(newdf.head())


DataFrame after stemming tags:
   movie_id                                     title  \
0     19995                                    Avatar   
1       285  Pirates of the Caribbean: At World's End   
2    206647                                   Spectre   
3     49026                     The Dark Knight Rises   
4     49529                               John Carter   

                                                tags  
0  in the 22nd century, a parapleg marin is dispa...  
1  captain barbossa, long believ to be dead, ha c...  
2  a cryptic messag from bond’ past send him on a...  
3  follow the death of district attorney harvey d...  
4  john carter is a war-weary, former militari ca...  
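
The effect of the stemmer is easiest to see on a few words. The sketch below reuses a fragment of the second movie's tags from the output above; it assumes NLTK is installed (the Porter stemmer needs no extra corpus download).

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem(text):
    """Stem every whitespace-separated word and rejoin with spaces."""
    return " ".join(ps.stem(word) for word in text.split())

# Inflected forms collapse onto one stem.
print(ps.stem("sends"), ps.stem("send"))  # send send

# Stems are not always dictionary words, and a word with punctuation
# attached (here "barbossa,") is treated as a token of its own.
print(stem("captain barbossa, long believed to be dead,"))
```

Because the text is split on whitespace only, punctuation stays glued to the neighbouring word at this stage.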




## 12. Vectorizing Text Data

We use `CountVectorizer` from the scikit-learn library to convert the text in the 'tags' column into a matrix of token counts, with one row per movie and one column per term. We keep only the 5000 most frequent terms (`max_features=5000`) and remove English stop words to reduce noise and computational cost.

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(newdf['tags']).toarray()

print("\nShape of the vectors matrix:", vectors.shape)
print("\nFirst 50 features (alphabetical order):")
print(cv.get_feature_names_out()[:50])


Shape of the vectors matrix: (4806, 5000)

First 50 features (alphabetical order):
['000' '007' '10' '100' '11' '12' '13' '14' '15' '16' '17' '17th' '18'
 '18th' '18thcenturi' '19' '1910' '1920' '1930' '1940' '1944' '1950'
 '1950s' '1960' '1960s' '1970' '1970s' '1971' '1974' '1976' '1980' '1985'
 '1990' '1999' '19th' '19thcenturi' '20' '200' '2003' '2009' '20th' '21st'
 '23' '24' '25' '30' '300' '3d' '40' '50']


## 13. Calculating Cosine Similarity

We compute the cosine similarity between every pair of movie vectors. Cosine similarity is the cosine of the angle between two non-zero vectors: their dot product divided by the product of their lengths. For count vectors, which have no negative entries, it ranges from 0 (no terms in common) to 1 (same direction).

In [30]:
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(vectors)

print("\nCosine similarity matrix:")
print(similarity.shape)


Cosine similarity matrix:
(4806, 4806)
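
The formula behind each entry of the matrix can be written out directly. The sketch below is a hand-rolled NumPy version for illustration, not scikit-learn's implementation; the function name `cosine` and the count vectors are made up.

```python
import numpy as np

def cosine(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up count vectors over six terms.
u = [1, 0, 0, 0, 1, 1]
v = [1, 0, 1, 0, 1, 0]
w = [0, 1, 0, 1, 0, 0]

print(cosine(u, v))  # two shared terms: 2 / (sqrt(3) * sqrt(3)), about 0.667
print(cosine(u, w))  # 0.0, no shared terms
```

Because the score depends only on direction, a long tag list and a short one can still be rated as highly similar if they use the same terms in similar proportions.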


## 14. Creating a Recommendation Function

We define a function `recommend` that takes a movie title and prints the 5 most similar movies according to the cosine similarity matrix. The function locates the input movie's row, reads that movie's similarity scores against all movies, sorts them from most to least similar, and prints the titles of the top 5, skipping the first entry, which is normally the input movie itself.

In [31]:
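def recommend(movie):
    # Look up the movie's row *position*: rows were dropped earlier, so index
    # labels no longer line up with the rows of `similarity`.
    matches = np.flatnonzero(newdf['title'].to_numpy() == movie)
    if len(matches) == 0:
        print(f"'{movie}' was not found in the dataset.")
        return
    scores = similarity[matches[0]]
    # Sort by similarity, highest first; the top entry is normally the movie itself, so skip it.
    movie_list = sorted(enumerate(scores), reverse=True, key=lambda x: x[1])[1:6]

    print(f"\nRecommendations for '{movie}':")
    for i in movie_list:
        print(newdf.iloc[i[0]].title)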
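
The ranking step can also be sketched with `np.argsort` on a small made-up similarity matrix. This variant returns row positions instead of printing titles, and it removes the query row by position rather than assuming it sorts first. The name `top_k_similar` is hypothetical.

```python
import numpy as np

def top_k_similar(sim, index, k=5):
    """Return positions of the k rows most similar to `index`, excluding itself."""
    order = np.argsort(-sim[index], kind="stable")  # highest similarity first
    order = order[order != index]                   # drop the query row explicitly
    return order[:k].tolist()

# Made-up symmetric similarity matrix for four items.
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

print(top_k_similar(toy_sim, 0, k=2))  # [2, 3]
print(top_k_similar(toy_sim, 1, k=2))  # [3, 0]
```

Excluding by position is slightly more robust when two rows have identical tags, since both would then score 1.0 against each other.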

## 15. Testing the Recommendation Function

Finally, we test the `recommend` function with an example movie title, "Avatar", to see the recommendations.

In [41]:
recommend('Avatar')


Recommendations for 'Avatar':
Aliens vs Predator: Requiem
Aliens
Falcon Rising
Independence Day
Titan A.E.


In [42]:
import pickle

In [46]:
with open('movie_list.pkl', 'wb') as f:
    pickle.dump(newdf, f)
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)

In [49]:
with open('movie_dict.pkl', 'wb') as f:
    pickle.dump(newdf.to_dict(), f)
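
Files written this way can be read back with `pickle.load`. The sketch below does a round trip on a small made-up dictionary in a temporary directory, so it does not touch the files saved above; `payload` is a toy stand-in for the output of `newdf.to_dict()`.

```python
import os
import pickle
import tempfile

# Toy stand-in for newdf.to_dict() (made-up content).
payload = {"title": {0: "Avatar", 1: "Spectre"}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movie_dict.pkl")
    # Write the object, closing the file as soon as the block ends.
    with open(path, "wb") as f:
        pickle.dump(payload, f)
    # Read it back in binary mode.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == payload)  # True
```

Unpickling executes code embedded in the file, so only load pickle files from a source you trust.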