# TMDB-Based Movie Recommendation System using NLP and Cosine Similarity

This notebook builds a content-based movie recommendation system from the TMDB 5000 dataset. It combines each movie's genres, keywords, cast, director, and overview into a single text tag, turns the tags into bag-of-words count vectors, and uses cosine similarity between those vectors to recommend the movies most similar to a given input movie.

In [1]:
import pandas as pd
import numpy as np

In [2]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

In [3]:
movies = movies.merge(credits, on = 'title')

In [4]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [5]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

In [6]:
# Candidate columns for the recommender:
# genres, homepage, id, keywords, title, overview, cast, crew

In [7]:
movies = movies[['id','genres','title','overview','keywords','cast','crew']]

In [8]:
movies.head(2)

Unnamed: 0,id,genres,title,overview,keywords,cast,crew
0,19995,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [9]:
movies.isnull().sum()

id          0
genres      0
title       0
overview    3
keywords    0
cast        0
crew        0
dtype: int64

In [10]:
movies.dropna(inplace = True)

In [11]:
import ast

In [12]:
movies['genres'][0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [13]:
newlist = ['genres','keywords','cast','crew']

In [16]:
movies['genres']

0       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
1       [{"id": 12, "name": "Adventure"}, {"id": 14, "...
2       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
3       [{"id": 28, "name": "Action"}, {"id": 80, "nam...
4       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
                              ...                        
4804    [{"id": 28, "name": "Action"}, {"id": 80, "nam...
4805    [{"id": 35, "name": "Comedy"}, {"id": 10749, "...
4806    [{"id": 35, "name": "Comedy"}, {"id": 18, "nam...
4807                                                   []
4808                  [{"id": 99, "name": "Documentary"}]
Name: genres, Length: 4806, dtype: object

In [17]:
def show1(obj):
    # Parse the JSON-like string and collect the 'name' of every entry.
    # A comprehension avoids shadowing the built-in `list`.
    return [x['name'] for x in ast.literal_eval(obj)]
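
The raw `genres` value is a string rather than a list, which is why it has to be parsed before the names can be pulled out. A minimal standalone sketch on the sample value shown earlier; `json.loads` appears only as an alternative that happens to work on this sample because it is valid JSON, which is an assumption about the rest of the column.

```python
import ast
import json

# Sample value copied from the `genres` column output shown earlier.
raw = (
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, '
    '{"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'
)

# ast.literal_eval safely evaluates the string as a Python literal.
parsed = ast.literal_eval(raw)
names = [item["name"] for item in parsed]

# json.loads gives the same result for this sample, since it is valid JSON.
names_json = [item["name"] for item in json.loads(raw)]

print(names)
print(names == names_json)
```

Either parser yields a list of dictionaries, so the name extraction itself is the same in both cases.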

In [18]:
movies['genre_names'] = movies['genres'].apply(show1)

In [19]:
movies['genres'] = movies['genre_names']

In [20]:
movies.drop('genre_names',axis=1,inplace = True)

In [21]:
# movies['keywords'] = movies['keywords'].apply(lambda x: [genre['name'] for genre in x])
movies['keywords'] = movies['keywords'].apply(show1)

In [22]:
# movies['cast'] = movies['cast'].apply(lambda x: [genre['name'] for genre in x])
movies['cast'] = movies['cast'].apply(show1)

In [23]:
movies['crew']

0       [{"credit_id": "52fe48009251416c750aca23", "de...
1       [{"credit_id": "52fe4232c3a36847f800b579", "de...
2       [{"credit_id": "54805967c3a36829b5002c41", "de...
3       [{"credit_id": "52fe4781c3a36847f81398c3", "de...
4       [{"credit_id": "52fe479ac3a36847f813eaa3", "de...
                              ...                        
4804    [{"credit_id": "52fe44eec3a36847f80b280b", "de...
4805    [{"credit_id": "52fe487dc3a368484e0fb013", "de...
4806    [{"credit_id": "52fe4df3c3a36847f8275ecf", "de...
4807    [{"credit_id": "52fe4ad9c3a368484e16a36b", "de...
4808    [{"credit_id": "58ce021b9251415a390165d9", "de...
Name: crew, Length: 4806, dtype: object

In [24]:
def show(obj):
    # Keep only the names of the first three entries.
    names = []
    for x in ast.literal_eval(obj):
        if len(names) == 3:
            break
        names.append(x['name'])
    return names

In [25]:
# movies['crew'] = movies['crew'].apply(show)

In [26]:
movies.head(2)

Unnamed: 0,id,genres,title,overview,keywords,cast,crew
0,19995,"[Action, Adventure, Fantasy, Science Fiction]",Avatar,"In the 22nd century, a paraplegic Marine is di...","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,"[Adventure, Fantasy, Action]",Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [27]:
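def show3(obj):
    # Return the first crew member whose job is 'Director',
    # as a one-element list (empty if no director is listed).
    names = []
    for x in ast.literal_eval(obj):
        if x['job'] == 'Director':
            names.append(x['name'])
            break
    return names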
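
The director lookup can be tried in isolation. The sketch below uses a hypothetical, shortened crew string in the same shape as the `crew` column; the helper name `first_director` and the placeholder crew names are assumptions made for illustration, not part of the notebook.

```python
import ast

# Hypothetical, shortened crew string in the same shape as the `crew` column.
raw_crew = (
    '[{"job": "Editor", "name": "Placeholder Editor"}, '
    '{"job": "Director", "name": "James Cameron"}, '
    '{"job": "Producer", "name": "Placeholder Producer"}]'
)

def first_director(raw):
    """Return a one-element list with the first director, or [] if none is listed."""
    crew = ast.literal_eval(raw)
    name = next((m["name"] for m in crew if m.get("job") == "Director"), None)
    return [name] if name is not None else []

print(first_director(raw_crew))   # ['James Cameron']
print(first_director("[]"))       # []
```

Returning a list, even for a single name, keeps the column compatible with the list concatenation used later to build the tag.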

In [28]:
movies['Director name'] = movies['crew'].apply(show3)

In [29]:
movies.drop('crew',axis = 1,inplace = True)

In [30]:
movies['keywords'] = movies['keywords'].apply(lambda x:[i.replace(" ","") for i in x] )
movies['genres'] = movies['genres'].apply(lambda x:[i.replace(" ","") for i in x] )
movies['cast'] = movies['cast'].apply(lambda x:[i.replace(" ","") for i in x] )
movies['Director name'] = movies['Director name'].apply(lambda x:[i.replace(" ","") for i in x] )
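
Removing the inner spaces turns each multi-word name into one token, so two different people who share a first name do not end up sharing a feature. A small sketch of the effect, using plain whitespace splitting as a stand-in for the vectorizer's tokenizer (an assumption made to keep the example dependency-free); the helper name `tokens` is hypothetical.

```python
def tokens(names, collapse):
    """Lowercase the names and split on whitespace, optionally removing inner spaces first."""
    if collapse:
        names = [n.replace(" ", "") for n in names]
    return set(" ".join(names).lower().split())

people_a = ["Sam Worthington"]
people_b = ["Sam Mendes"]

# Without collapsing, the two lists share the token "sam".
shared_raw = tokens(people_a, collapse=False) & tokens(people_b, collapse=False)

# With collapsing, each full name is its own token and nothing is shared.
shared_collapsed = tokens(people_a, collapse=True) & tokens(people_b, collapse=True)

print(shared_raw)        # {'sam'}
print(shared_collapsed)  # set()
```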

In [31]:
movies.columns

Index(['id', 'genres', 'title', 'overview', 'keywords', 'cast',
       'Director name'],
      dtype='object')

In [32]:
# movies['tag'] = movies['genres'] + movies['Director name'] + movies['cast'] + movies['keywords']+ movies['overview']

In [33]:
list1 = movies['overview'].tolist()

In [34]:
# movies['overview'] = movies['overview'].apply(lambda x:x.split())

In [35]:
# newdf = movies[['id','title','tag']]

In [36]:
movies.head(2)

Unnamed: 0,id,genres,title,overview,keywords,cast,Director name
0,19995,"[Action, Adventure, Fantasy, ScienceFiction]",Avatar,"In the 22nd century, a paraplegic Marine is di...","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron]
1,285,"[Adventure, Fantasy, Action]",Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...",[GoreVerbinski]


In [37]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())

In [38]:
movies['tag'] = movies['genres'] + movies['Director name'] + movies['cast'] + movies['keywords']+ movies['overview']

In [39]:
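# Take an explicit copy so that later assignments to newdf['tag']
# do not trigger pandas' SettingWithCopyWarning.
newdf = movies[['id','title','tag']].copy()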

In [40]:
# newdf.drop('tags',axis = 1,inplace = True)

# Vectorization

In [42]:
from sklearn.feature_extraction.text import CountVectorizer

In [43]:
cv = CountVectorizer(max_features=5000,stop_words='english')
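
Conceptually, a count vectorizer builds a vocabulary from all tags and represents each tag as a vector of token counts. The pure-Python sketch below shows only that idea; it is not scikit-learn's implementation, which additionally lowercases the text, drops English stop words when `stop_words='english'` is set, and keeps only the most frequent terms up to `max_features`. The toy tags are hypothetical.

```python
from collections import Counter

# Hypothetical toy tags, standing in for the joined `tag` strings.
tags = [
    "action adventure space alien",
    "action space war alien alien",
    "romance comedy wedding",
]

# Vocabulary: every distinct token, in sorted order (one column per token).
vocabulary = sorted({tok for tag in tags for tok in tag.split()})

def count_vector(tag):
    """Return the token counts of `tag` as a list aligned with `vocabulary`."""
    counts = Counter(tag.split())
    return [counts[tok] for tok in vocabulary]

vectors = [count_vector(tag) for tag in tags]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Each row has one entry per vocabulary word, so two movies can be compared column by column regardless of how long their tags are.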

In [44]:
# cv.fit_transform(newdf['tag']).toarray()

In [45]:
newdf['tag']

0       [Action, Adventure, Fantasy, ScienceFiction, J...
1       [Adventure, Fantasy, Action, GoreVerbinski, Jo...
2       [Action, Adventure, Crime, SamMendes, DanielCr...
3       [Action, Crime, Drama, Thriller, ChristopherNo...
4       [Action, Adventure, ScienceFiction, AndrewStan...
                              ...                        
4804    [Action, Crime, Thriller, RobertRodriguez, Car...
4805    [Comedy, Romance, EdwardBurns, EdwardBurns, Ke...
4806    [Comedy, Drama, Romance, TVMovie, ScottSmith, ...
4807    [DanielHsia, DanielHenney, ElizaCoupe, BillPax...
4808    [Documentary, BrianHerzlinger, DrewBarrymore, ...
Name: tag, Length: 4806, dtype: object

In [47]:
newdf['tag'] = newdf['tag'].apply(lambda x: " ".join(x))


In [48]:
vector = cv.fit_transform(newdf['tag']).toarray()

# Finding similarity

In [50]:
from sklearn.metrics.pairwise import cosine_similarity

In [51]:
similarity = cosine_similarity(vector)

In [52]:
similarity[0]

array([1.        , 0.07142857, 0.05216405, ..., 0.02326211, 0.02571722,
       0.        ])
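
Each entry `similarity[i][j]` is the cosine of the angle between the count vectors of movies `i` and `j`: the dot product divided by the product of the two vector lengths. A minimal sketch of that calculation on hypothetical vectors; returning 0.0 for an all-zero vector is a choice made here to avoid dividing by zero.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical count vectors over a shared four-word vocabulary.
a = [1, 1, 0, 2]
b = [1, 0, 1, 2]
c = [0, 0, 3, 0]

print(round(cosine(a, a), 6))  # identical vectors
print(round(cosine(a, b), 6))  # partial overlap
print(round(cosine(a, c), 6))  # no shared words
```

A movie compared with itself scores 1, which is why the first entry of `similarity[0]` above is `1.`, and movies with no shared tokens score 0.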

In [53]:
cv.get_feature_names_out()

array(['000', '007', '10', ..., 'zoo', 'zooeydeschanel', 'zoëkravitz'],
      dtype=object)

# Stemming to merge similar word forms

In [55]:
import nltk

In [56]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [57]:
def stem_text(text):
    # Reduce every word to its Porter stem so that variants such as
    # "adventure" and "adventures" map to the same token.
    return " ".join(ps.stem(word) for word in text.split())

In [58]:
# As ordered here, `vector` and `similarity` were computed before stemming;
# re-run those cells after this one for the stems to affect the results.
newdf['tag'] = newdf['tag'].apply(stem_text)

In [59]:
newdf[newdf['title'] == 'Avatar'].index[0]

0

In [60]:
sorted(list(enumerate(similarity[0])),reverse=True,key= lambda x:x[1])[1:6]

[(1916, 0.23473823893078544),
 (1214, 0.23294541397390256),
 (582, 0.23097828906119441),
 (539, 0.2252817784447915),
 (507, 0.2191252450446388)]
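
The `[1:6]` slice relies on the movie itself sorting first with a similarity of 1.0. A hedged alternative is to skip the query position explicitly and take the top entries with `heapq.nlargest`; the helper name `top_k` and the sample row are hypothetical.

```python
import heapq

def top_k(row, query_index, k=5):
    """Return the k (index, score) pairs with the highest scores, skipping the query itself."""
    candidates = ((i, s) for i, s in enumerate(row) if i != query_index)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Hypothetical similarity row for the movie at position 0.
row = [1.0, 0.07, 0.23, 0.05, 0.22, 0.0]

print(top_k(row, query_index=0, k=3))   # [(2, 0.23), (4, 0.22), (1, 0.07)]
```

This also avoids sorting the whole row when only a handful of results is needed.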

In [61]:
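def recoment(movie):
    # Look up the row position, not the index label: rows removed by dropna()
    # leave gaps in the labels, while `similarity` is indexed by position.
    movie_index = newdf['title'].tolist().index(movie)
    distance = similarity[movie_index]
    # Sort by similarity, highest first, and skip the first hit (the movie itself).
    movie_list = sorted(list(enumerate(distance)),reverse=True,key= lambda x:x[1])[1:6]
    for x in movie_list:
        print(newdf.iloc[x[0]].title)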
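
For reuse outside the notebook, it can help to return the titles instead of printing them and to handle titles that are not in the dataset. A minimal sketch on plain Python lists; the function name `recommend`, its parameters, and the toy data are all hypothetical.

```python
def recommend(title, titles, similarity, k=5):
    """Return up to k titles most similar to `title`, or [] if the title is unknown."""
    if title not in titles:
        return []
    pos = titles.index(title)  # position in the list == row in the matrix
    ranked = sorted(
        (i for i in range(len(titles)) if i != pos),
        key=lambda i: similarity[pos][i],
        reverse=True,
    )
    return [titles[i] for i in ranked[:k]]

# Hypothetical titles and similarity scores, for illustration only.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
similarity = [
    [1.0, 0.2, 0.6, 0.1],
    [0.2, 1.0, 0.3, 0.5],
    [0.6, 0.3, 1.0, 0.4],
    [0.1, 0.5, 0.4, 1.0],
]

print(recommend("Movie A", titles, similarity, k=2))   # ['Movie C', 'Movie B']
print(recommend("Unknown", titles, similarity))        # []
```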

In [62]:
recoment('Avatar')

Lifeforce
Aliens vs Predator: Requiem
Battle: Los Angeles
Titan A.E.
Independence Day


In [63]:
newdf.iloc[1916].title

'Lifeforce'

In [64]:
import pickle

In [65]:
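# Use context managers so each file is flushed and closed after writing.
with open('movie_list.pkl', 'wb') as f:
    pickle.dump(newdf, f)
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)

In [66]:
with open('movie_dict.pkl', 'wb') as f:
    pickle.dump(newdf.to_dict(), f)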
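
The saved files can be loaded back with `pickle.load`. The sketch below checks a write and read round trip on small hypothetical stand-ins for the movie dictionary and the similarity matrix, writing to a temporary directory so that nothing in the working directory is touched.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for `newdf.to_dict()` and `similarity`.
artifacts = {
    "movie_dict.pkl": {"title": {0: "Movie A", 1: "Movie B"}},
    "similarity.pkl": [[1.0, 0.2], [0.2, 1.0]],
}

restored = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, obj in artifacts.items():
        path = os.path.join(tmp, name)
        # Write, then read back, closing the file each time.
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        with open(path, "rb") as f:
            restored[name] = pickle.load(f)

print(restored == artifacts)   # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.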

In [144]:
newdf.head()

Unnamed: 0,id,title,tag
0,19995,Avatar,action adventur fantasi sciencefict jamescamer...
1,285,Pirates of the Caribbean: At World's End,adventur fantasi action goreverbinski johnnyde...
2,206647,Spectre,action adventur crime sammend danielcraig chri...
3,49026,The Dark Knight Rises,action crime drama thriller christophernolan c...
4,49529,John Carter,action adventur sciencefict andrewstanton tayl...


In [142]:
recoment('John Carter')

Star Trek: Insurrection
Ghosts of Mars
The Thing
Mission to Mars
The Marine 4: Moving Target


