# Movie Recommendation

In [1]:
import numpy as np
import pandas as pd

## Load, understand, and format dataset

In [3]:
df = pd.read_csv('tmdb_5000_movies.csv.gz', compression='gzip')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [4]:
df.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [8]:
df = df[['genres', 'overview', 'popularity', 'tagline', 'title']]
df.isnull().sum()

genres          0
overview        3
popularity      0
tagline       844
title           0
dtype: int64

In [11]:
df = df.copy()  # explicit copy, so the assignments below don't target a slice of the original frame
df['tagline'] = df['tagline'].fillna('')
df['overview'] = df['overview'].fillna('')

In [19]:
df['description'] = df['tagline'] + ' ' + df['overview']
df.head(2)


Unnamed: 0,genres,overview,popularity,tagline,title,description
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","In the 22nd century, a paraplegic Marine is di...",150.437577,Enter the World of Pandora.,Avatar,Enter the World of Pandora. In the 22nd centur...
1,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","Captain Barbossa, long believed to be dead, ha...",139.082615,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,"At the end of the world, the adventure begins...."


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   genres       4803 non-null   object 
 1   overview     4803 non-null   object 
 2   popularity   4803 non-null   float64
 3   tagline      4803 non-null   object 
 4   title        4803 non-null   object 
 5   description  4803 non-null   object 
dtypes: float64(1), object(5)
memory usage: 225.3+ KB


## Preprocessing textual data

In [31]:
import nltk
import re
nltk.download('punkt')
nltk.download('stopwords', quiet=True)  # corpus needed by stopwords.words() below

stop_words = nltk.corpus.stopwords.words('english')

def normalize_doc(doc):
    # Remove everything except ASCII letters, digits and whitespace.
    # flags must be passed by keyword: the 4th positional argument of re.sub is count.
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    tokens = nltk.word_tokenize(doc)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    doc = ' '.join(filtered_tokens)
    return doc

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\THANG\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
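
One detail worth checking in `normalize_doc`: the signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a fourth positional argument is read as `count`, not `flags`. The sketch below shows the effect on a made-up input; the sample string and the variable names are assumptions for illustration only.

```python
import re

# The fourth positional parameter of re.sub is `count`, so a positional
# `re.I | re.A` would be read as this integer cap on substitutions.
as_count = int(re.I | re.A)

text = "!" * 300 + "ok"  # hypothetical input with 300 punctuation marks

# With the flags value used as a count, only the first 258 matches are removed.
capped = re.sub(r"[^a-zA-Z0-9\s]", "", text, count=as_count)

# With flags passed by keyword, every non-alphanumeric character is removed.
full = re.sub(r"[^a-zA-Z0-9\s]", "", text, flags=re.I | re.A)

print(as_count, len(capped), len(full))
```

Movie descriptions rarely contain that many special characters, so the difference is unlikely to show up in this dataset, but passing `flags=` by keyword avoids the silent cap.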


In [32]:
normalize_corpus = np.vectorize(normalize_doc)

norm_corpus = normalize_corpus(list(df['description']))
norm_corpus[:3]

array(['enter world pandora 22nd century paraplegic marine dispatched moon pandora unique mission becomes torn following orders protecting alien civilization',
       'end world adventure begins captain barbossa long believed dead come back life headed edge earth turner elizabeth swann nothing quite seems',
       'plan one escapes cryptic message bonds past sends trail uncover sinister organization battles political forces keep secret service alive bond peels back layers deceit reveal terrible truth behind spectre'],
      dtype='<U803')

In [33]:
len(norm_corpus)

4803

## Extract TF-IDF Features


In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(ngram_range=(1,2), min_df=2)
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape

(4803, 20667)
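
To make `ngram_range=(1,2)` and `min_df=2` concrete, here is a dependency-free sketch on a toy corpus. It assumes scikit-learn's documented defaults (smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The helper names `ngrams` and `tfidf` and the three sample documents are made up for illustration, and tokenization is a plain whitespace split rather than scikit-learn's token pattern.

```python
import math
from collections import Counter

def ngrams(doc, n_min=1, n_max=2):
    """Return the unigrams and bigrams of a whitespace-tokenized document."""
    tokens = doc.split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(corpus, min_df=2):
    """Return (vocabulary, rows); each row is an L2-normalized {term: weight} dict."""
    counts = [Counter(ngrams(doc)) for doc in corpus]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    # min_df drops terms that occur in fewer than `min_df` documents.
    vocab = sorted(t for t, d in df.items() if d >= min_df)
    n_docs = len(corpus)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for c in counts:
        row = {t: c[t] * idf[t] for t in vocab if t in c}
        norm = math.sqrt(sum(v * v for v in row.values()))
        rows.append({t: v / norm for t, v in row.items()} if norm else {})
    return vocab, rows

toy = ["space crew mission", "space crew rescue", "pirate crew treasure"]
vocab, rows = tfidf(toy)
print(vocab)
print(rows[2])
```

Only terms shared by at least two toy documents survive, which is the same filtering that reduces the real corpus to 20667 features.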

## Compute Pairwise Document Similarity

In [35]:
from sklearn.metrics.pairwise import cosine_similarity

doc_similarity = cosine_similarity(tfidf_matrix)

doc_sim_df = pd.DataFrame(doc_similarity)
doc_sim_df.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4793,4794,4795,4796,4797,4798,4799,4800,4801,4802
0,1.0,0.010704,0.0,0.019031,0.028692,0.024902,0.0,0.026517,0.0,0.007421,...,0.009704,0.0,0.023337,0.03355,0.0,0.0,0.0,0.006894,0.0,0.0
1,0.010704,1.0,0.011893,0.0,0.041627,0.0,0.014566,0.027124,0.034692,0.007616,...,0.009958,0.0,0.004819,0.0,0.0,0.012594,0.0,0.022394,0.013725,0.0
2,0.0,0.011893,1.0,0.0,0.0,0.0,0.0,0.022242,0.015855,0.004892,...,0.042623,0.0,0.0,0.0,0.01652,0.0,0.0,0.011683,0.0,0.004001
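
As a reminder of what each entry of `doc_similarity` represents, the following is a small sketch of cosine similarity between two sparse vectors stored as dicts. The `cosine` helper and the sample vectors are hypothetical; `cosine_similarity` computes the same quantity for every pair of rows at once.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

a = {"space": 3.0, "crew": 4.0}
b = {"crew": 2.0}
c = {"pirate": 1.0}

print(cosine(a, a), cosine(a, b), cosine(a, c))
```

A document compared with itself scores 1.0, which is why the diagonal of the matrix above is all ones, and documents with no shared terms score 0.0.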


## Build a recommender function that returns the top 5 similar movies for a given movie

In [36]:
movies = df['title'].values
movies[:3], movies.shape

(array(['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre'],
       dtype=object),
 (4803,))

#### Test first

In [66]:
movie_idx = np.where(movies=='Minions')[0][0]
movie_idx

546

In [67]:
movie_sim = doc_sim_df.iloc[movie_idx].values
movie_sim

array([0.01045672, 0.01073063, 0.        , ..., 0.00691106, 0.        ,
       0.        ])

In [68]:
movie_sim_idx = np.argsort(-movie_sim)[1:6]
movie_sim_idx 

array([506, 614, 241, 813, 154], dtype=int64)

In [69]:
movies[movie_sim_idx]

array(['Despicable Me 2', 'Despicable Me',
       'Teenage Mutant Ninja Turtles: Out of the Shadows', 'Superman',
       'Rise of the Guardians'], dtype=object)

#### Build function

In [43]:
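# Option 1: look up by row index and print the titles

def recommend(movie_index):
    index = np.where(doc_sim_df.index == movie_index)[0][0]
    # Sort (position, similarity) pairs by similarity, then skip the movie itself.
    similar_titles_idx = sorted(enumerate(doc_similarity[index]), key=lambda x: x[1], reverse=True)[1:6]

    for i in similar_titles_idx:
        print(movies[i[0]])  # print the title rather than the bare row index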

In [74]:
# Option 2

def recommend(movie_title):
    index = np.where(movies==movie_title)[0][0]
    movie_sim = doc_sim_df.iloc[index].values
    movie_sim_idx = np.argsort(-movie_sim)[1:6]
    return movies[movie_sim_idx]

In [75]:
recommend('Minions')

array(['Despicable Me 2', 'Despicable Me',
       'Teenage Mutant Ninja Turtles: Out of the Shadows', 'Superman',
       'Rise of the Guardians'], dtype=object)
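
`recommend` relies on two assumptions: the title exists in `movies`, and the movie itself sorts first so that `[1:6]` skips it (which can fail for an empty description or for two movies with identical descriptions). A hedged sketch of a variant that checks both explicitly is shown below, written with plain lists so it runs on its own; `recommend_safe`, its parameters, and the toy data are hypothetical names for illustration.

```python
def recommend_safe(title, titles, sim_matrix, k=5):
    """Return up to k titles most similar to `title`, never the title itself."""
    if title not in titles:
        raise KeyError(f"unknown title: {title!r}")
    idx = titles.index(title)
    # Rank every other row index by descending similarity to the query row.
    others = [j for j in range(len(titles)) if j != idx]
    others.sort(key=lambda j: sim_matrix[idx][j], reverse=True)
    return [titles[j] for j in others[:k]]

# Toy data: 4 made-up titles and a symmetric similarity matrix.
toy_titles = ["A", "B", "C", "D"]
toy_sim = [
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
]
print(recommend_safe("A", toy_titles, toy_sim, k=2))
```

With the notebook's objects, the equivalent call would pass `list(movies)` and `doc_similarity` in place of the toy data.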

## Get recommendations for popular movies

In [77]:
pop = df.sort_values(by='popularity', ascending=False)
pop.head()

Unnamed: 0,genres,overview,popularity,tagline,title,description
546,"[{""id"": 10751, ""name"": ""Family""}, {""id"": 16, ""...","Minions Stuart, Kevin and Bob are recruited by...",875.581305,"Before Gru, they had a history of bad bosses",Minions,"Before Gru, they had a history of bad bosses M..."
95,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 18, ""...",Interstellar chronicles the adventures of a gr...,724.247784,Mankind was born on Earth. It was never meant ...,Interstellar,Mankind was born on Earth. It was never meant ...
788,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Deadpool tells the origin story of former Spec...,514.569956,Witness the beginning of a happy ending,Deadpool,Witness the beginning of a happy ending Deadpo...
94,"[{""id"": 28, ""name"": ""Action""}, {""id"": 878, ""na...","Light years from Earth, 26 years after being a...",481.098624,All heroes start somewhere.,Guardians of the Galaxy,All heroes start somewhere. Light years from E...
127,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",An apocalyptic story set in the furthest reach...,434.278564,What a Lovely Day.,Mad Max: Fury Road,What a Lovely Day. An apocalyptic story set in...


In [78]:
pop_title = list(pop['title'].values)
pop_title[:3]

['Minions', 'Interstellar', 'Deadpool']

In [82]:
for movie in pop_title[:5]:
    print('Popular Movie:', movie)
    print('Top 5 recommended similar movies:', recommend(movie))
    print()

Popular Movie: Minions
Top 5 recommended similar movies: ['Despicable Me 2' 'Despicable Me'
 'Teenage Mutant Ninja Turtles: Out of the Shadows' 'Superman'
 'Rise of the Guardians']

Popular Movie: Interstellar
Top 5 recommended similar movies: ['Gattaca' 'Space Pirate Captain Harlock' 'Space Cowboys'
 'Starship Troopers' 'Final Destination 2']

Popular Movie: Deadpool
Top 5 recommended similar movies: ['Silent Trigger' 'Underworld: Evolution' 'Bronson' 'Shaft' 'Don Jon']

Popular Movie: Guardians of the Galaxy
Top 5 recommended similar movies: ['Chasing Mavericks' 'E.T. the Extra-Terrestrial' 'American Sniper'
 'The Amazing Spider-Man 2' 'Hoop Dreams']

Popular Movie: Mad Max: Fury Road
Top 5 recommended similar movies: ['The 6th Day' 'Star Trek Beyond' 'Kites' 'The Orphanage'
 'The Water Diviner']

