# Movie Recommender System

This script processes movie metadata to build a content-based recommender system.
It performs the following steps:
1. Load and merge datasets.
2. Preprocess data by converting and cleaning relevant fields.
3. Create a 'tags' column combining key textual information.
4. Convert text data into numerical vectors using TF-IDF.
5. Compute cosine similarity between movie vectors.
6. Define a function to recommend movies based on similarity.
7. Save the processed data and similarity matrix for future use.

In [1]:
import numpy as np
import pandas as pd
import ast
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

In [2]:
#  Load the datasets
movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")

In [3]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [4]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [5]:
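# Merge the datasets on the title column.
# Titles are not guaranteed to be unique: rows that share a title are paired
# with each other, which can add extra rows to the merged frame.
movies = movies.merge(credits, on="title")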
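
A title-based merge behaves differently from an id-based merge when two films share a title. The sketch below uses made-up toy frames (not the TMDB files) to show the effect. The column names `id` and `movie_id` match the ones printed above; everything else is hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): two different films share the title "Hero".
toy_movies = pd.DataFrame({"id": [1, 2, 3], "title": ["Hero", "Hero", "Up"]})
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Hero", "Hero", "Up"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Joining on title pairs every "Hero" with every "Hero": 2 x 2 + 1 = 5 rows.
by_title = toy_movies.merge(toy_credits, on="title")

# Joining on the numeric id keeps exactly one row per film.
by_id = toy_movies.merge(
    toy_credits.drop(columns="title"), left_on="id", right_on="movie_id"
)

print(len(by_title), len(by_id))
```

Whether this matters for the real data depends on how many titles repeat; comparing the row count before and after the merge is a quick check.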

In [6]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [7]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

In [8]:
# Select relevant columns
movies = movies[['movie_id','title',"genres",'overview','keywords','cast','crew']]
movies.head(1)

Unnamed: 0,movie_id,title,genres,overview,keywords,cast,crew
0,19995,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","In the 22nd century, a paraplegic Marine is di...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [9]:
movies.isnull().sum()

movie_id    0
title       0
genres      0
overview    3
keywords    0
cast        0
crew        0
dtype: int64

In [10]:
# Fill missing overview with empty strings instead of dropping
movies['overview'] = movies['overview'].fillna('')

In [11]:
movies.isnull().sum()

movie_id    0
title       0
genres      0
overview    0
keywords    0
cast        0
crew        0
dtype: int64

In [12]:
movies.duplicated().sum()

0

In [13]:
# Function to convert stringified lists of dictionaries to lists of names
def convert(obj):
    return [i["name"] for i in ast.literal_eval(obj)]
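
As a small illustration of what `convert` does, the sketch below parses a shortened sample string shaped like the raw `genres` column. It also compares `ast.literal_eval` with `json.loads`, on the assumption that the column holds JSON text. The `note` field and its `null` value are hypothetical, added only to show where the two parsers differ.

```python
import ast
import json

# A shortened sample in the same shape as the raw `genres` column.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

names_ast = [d["name"] for d in ast.literal_eval(raw)]
names_json = [d["name"] for d in json.loads(raw)]
print(names_ast)

# JSON literals such as null are not Python literals, so literal_eval rejects them.
raw_with_null = '[{"id": 1, "name": "Action", "note": null}]'
try:
    ast.literal_eval(raw_with_null)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False
names_with_null = [d["name"] for d in json.loads(raw_with_null)]
```

If the real columns never contain `null`, `true`, or `false`, both parsers give the same result.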

In [14]:
movies["genres"] = movies['genres'].apply(convert)  

In [15]:
movies["genres"]

0       [Action, Adventure, Fantasy, Science Fiction]
1                        [Adventure, Fantasy, Action]
2                          [Action, Adventure, Crime]
3                    [Action, Crime, Drama, Thriller]
4                [Action, Adventure, Science Fiction]
                            ...                      
4804                        [Action, Crime, Thriller]
4805                                [Comedy, Romance]
4806               [Comedy, Drama, Romance, TV Movie]
4807                                               []
4808                                    [Documentary]
Name: genres, Length: 4809, dtype: object

In [16]:
movies['keywords'] = movies['keywords'].apply(convert)


In [17]:
# Function to get top 3 cast members
def convert3(obj):
    return [i["name"] for i in ast.literal_eval(obj)[:3]]


In [18]:
movies['cast'] = movies['cast'].apply(convert3)

In [19]:
# Function to fetch the director
def fetch_director(obj):
    for i in ast.literal_eval(obj):
        if i["job"] == "Director":
            return [i["name"]]
    return []


In [20]:
movies["crew"] = movies['crew'].apply(fetch_director)
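
The cast and crew extraction can be tried on small strings before running it over the whole frame. The samples below are illustrative and far shorter than the real entries. The use of `.get` and `next` is one possible defensive variant, not the notebook's own code.

```python
import ast

# Illustrative, shortened samples shaped like the raw `cast` and `crew` columns.
raw_cast = ('[{"name": "Sam Worthington"}, {"name": "Zoe Saldana"}, '
            '{"name": "Sigourney Weaver"}, {"name": "Stephen Lang"}]')
raw_crew = ('[{"job": "Editor", "name": "Stephen E. Rivkin"}, '
            '{"job": "Director", "name": "James Cameron"}]')

# Keep only the first three cast entries, as convert3 does.
top_cast = [p["name"] for p in ast.literal_eval(raw_cast)[:3]]

# .get avoids a KeyError if an entry has no "job" key; only the first director is kept.
director = next(
    ([p["name"]] for p in ast.literal_eval(raw_crew) if p.get("job") == "Director"),
    [],
)

print(top_cast, director)
```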

In [21]:
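# Normalize a name for use as a tag: lowercase it and remove spaces so that a
# multi-word name becomes one token ("Sam Worthington" -> "samworthington")
def clean_text(text):
    return text.lower().replace(" ", "")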
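
Removing the spaces keeps a person's name together as one token. A short sketch of why that matters, using two names that appear in the data and a copy of `clean_text`:

```python
def clean_text(text):
    return text.lower().replace(" ", "")

names = ["Sam Worthington", "Sam Mendes"]

# Splitting on whitespace: both people contribute the token "sam".
split_tokens = [set(n.lower().split()) for n in names]
shared_when_split = split_tokens[0] & split_tokens[1]

# Collapsing each name first: the two people no longer share any token.
joined_tokens = [{clean_text(n)} for n in names]
shared_when_joined = joined_tokens[0] & joined_tokens[1]

print(shared_when_split, shared_when_joined)
```

Without this step, two unrelated films could look similar only because a cast member and a director share a first name.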

In [22]:
# Split each overview string into a list of words, and normalize the entries of
# the other list columns, so that every column holds a list of tokens.
movies['overview'] = movies['overview'].apply(lambda x: x.split())
movies['genres'] = movies['genres'].apply(lambda x: [clean_text(i) for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [clean_text(i) for i in x])
movies['cast'] = movies['cast'].apply(lambda x: [clean_text(i) for i in x])
movies['crew'] = movies['crew'].apply(lambda x: [clean_text(i) for i in x])


In [23]:
movies.head()

Unnamed: 0,movie_id,title,genres,overview,keywords,cast,crew
0,19995,Avatar,"[action, adventure, fantasy, sciencefiction]","[In, the, 22nd, century,, a, paraplegic, Marin...","[cultureclash, future, spacewar, spacecolony, ...","[samworthington, zoesaldana, sigourneyweaver]",[jamescameron]
1,285,Pirates of the Caribbean: At World's End,"[adventure, fantasy, action]","[Captain, Barbossa,, long, believed, to, be, d...","[ocean, drugabuse, exoticisland, eastindiatrad...","[johnnydepp, orlandobloom, keiraknightley]",[goreverbinski]
2,206647,Spectre,"[action, adventure, crime]","[A, cryptic, message, from, Bond’s, past, send...","[spy, basedonnovel, secretagent, sequel, mi6, ...","[danielcraig, christophwaltz, léaseydoux]",[sammendes]
3,49026,The Dark Knight Rises,"[action, crime, drama, thriller]","[Following, the, death, of, District, Attorney...","[dccomics, crimefighter, terrorist, secretiden...","[christianbale, michaelcaine, garyoldman]",[christophernolan]
4,49529,John Carter,"[action, adventure, sciencefiction]","[John, Carter, is, a, war-weary,, former, mili...","[basedonnovel, mars, medallion, spacetravel, p...","[taylorkitsch, lynncollins, samanthamorton]",[andrewstanton]


In [24]:
# Combine text data into a single 'tags' column
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
movies['tags'] = movies['tags'].apply(lambda x: " ".join(x))


In [25]:
# Function for stemming
ps = PorterStemmer()
def stem(text):
    return " ".join([ps.stem(i) for i in text.split()])

In [26]:
# Apply stemming
movies['tags'] = movies['tags'].apply(stem)

In [27]:
# Create the tf-idf matrix
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
vectors = tfidf.fit_transform(movies['tags']).toarray()

In [28]:
vectors

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
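
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch on three made-up tag strings. It uses the smoothed IDF formula and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults. It leaves out stop-word removal and `max_features`, so it is an illustration of the idea rather than a reimplementation of the cell above.

```python
import math
from collections import Counter

# Three tiny, made-up tag strings.
docs = ["space alien war", "space colony future", "romance comedy"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]  # L2-normalize each document vector

vectors = [tfidf_vector(doc) for doc in tokenized]
weights = dict(zip(vocab, vectors[0]))

# "space" occurs in two documents, "alien" in one, so "alien" gets more weight.
print(weights["space"] < weights["alien"])
```

Most entries of each vector are zero because a document contains only a few of the vocabulary terms, which is why the printed matrix above shows mostly zeros.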

In [29]:
# Calculate cosine similarity matrix
similarity = cosine_similarity(vectors)

In [30]:
similarity[0].shape

(4809,)
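
Each entry of the similarity matrix is the cosine of the angle between two movie vectors: the dot product divided by the product of the vector lengths. A minimal sketch for two plain Python lists follows. Returning 0.0 for an all-zero vector is a choice made in this sketch.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

same_direction = cosine([1, 2, 0], [2, 4, 0])  # parallel vectors, close to 1
no_overlap = cosine([1, 0, 0], [0, 1, 0])      # no shared terms, exactly 0
print(same_direction, no_overlap)
```

Because TF-IDF weights are never negative, the values for movie vectors fall between 0 and 1.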

In [31]:
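# Function to recommend movies
def recommend(movie):
    if movie in movies['title'].values:
        # Row index of the first movie with this title
        movie_index = movies[movies['title'] == movie].index[0]
        # Despite the name, these are similarity scores (higher = more alike)
        distances = similarity[movie_index]
        # Rank all movies by score, highest first. Position 0 is normally the
        # movie itself, so skip it and keep the next five.
        movie_list = sorted(enumerate(distances), reverse=True, key=lambda x: x[1])[1:6]
        for i in movie_list:
            print(movies.iloc[i[0]].title)
    else:
        print("No movie found")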

In [32]:
# Example recommendation
recommend("Avatar")

Aliens
Falcon Rising
Battle: Los Angeles
Aliens vs Predator: Requiem
Apollo 18
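
A variant that returns the titles instead of printing them can be easier to reuse, for example in an app. The sketch below is one possible shape for it, run on a made-up 4 x 4 similarity matrix. The name `top_similar` and its parameters are hypothetical. It removes the query movie by index rather than assuming it sorts first.

```python
def top_similar(index, similarity_rows, titles, k=5):
    """Return the k titles most similar to titles[index], excluding itself."""
    scores = similarity_rows[index]
    candidates = [j for j in range(len(scores)) if j != index]
    ranked = sorted(candidates, key=lambda j: scores[j], reverse=True)
    return [titles[j] for j in ranked[:k]]

# Made-up titles and a made-up symmetric similarity matrix.
titles = ["A", "B", "C", "D"]
similarity_rows = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]

print(top_similar(0, similarity_rows, titles, k=2))
```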


In [35]:
# Save the data and model
with open("movie_dict_2.pkl", "wb") as f:
    pickle.dump(movies.to_dict(), f)
with open("similarity_2.pkl", "wb") as f:
    pickle.dump(similarity, f)
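
Loading the saved objects back follows the same pattern in reverse. The sketch below does a round trip with small stand-in objects and hypothetical file names in a temporary directory, so it does not touch the files written above.

```python
import os
import pickle
import tempfile

# Stand-ins for movies.to_dict() and the similarity matrix.
movie_dict = {"title": {0: "A", 1: "B"}, "movie_id": {0: 10, 1: 20}}
similarity_rows = [[1.0, 0.3], [0.3, 1.0]]

with tempfile.TemporaryDirectory() as tmp:
    dict_path = os.path.join(tmp, "movie_dict_demo.pkl")  # hypothetical file names
    sim_path = os.path.join(tmp, "similarity_demo.pkl")

    with open(dict_path, "wb") as f:
        pickle.dump(movie_dict, f)
    with open(sim_path, "wb") as f:
        pickle.dump(similarity_rows, f)

    with open(dict_path, "rb") as f:
        loaded_dict = pickle.load(f)
    with open(sim_path, "rb") as f:
        loaded_similarity = pickle.load(f)

print(loaded_dict["title"][0], loaded_similarity[0][1])
```

With the real files, `pd.DataFrame(loaded_dict)` rebuilds the movies frame from the saved dictionary. Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.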

# Conclusion

This script builds a content-based movie recommender system from movie metadata.
The model measures the similarity between movies from their overviews, genres, keywords, top three cast members, and director,
and recommends the movies most similar to a given title.
The processed data and similarity matrix are saved for future use.
You can now use the `recommend` function to get movie recommendations.