# TMDB
---
## Scenario
TMDB.org is a crowd-sourced movie information database used by many film-related consoles, sites and apps, such as XBMC, MythTV and Plex. Dozens of media managers, mobile apps and social sites make use of its API.
TMDb lists some 80,000 films at time of writing, which is considerably fewer than IMDb. While not as complete as IMDb, it holds extensive information for most popular/Hollywood films.
This dataset of the 10,000 most popular movies worldwide was fetched through the read API.
TMDb's free API lets developers and their teams programmatically fetch and use TMDb's data.
The API is free to use as long as you attribute TMDb as the source of the data and/or images. TMDb also updates the API from time to time.

The data set was fetched using an exception-handling process, so it contains some null values where fields are missing in the TMDb database.
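
To illustrate how an exception-handling fetch can leave nulls in the data, here is a minimal sketch. It does not call the TMDb API: `raw_records`, `FIELDS` and `extract_row` are hypothetical stand-ins, and the original fetch script may well have worked differently.

```python
# Hypothetical records standing in for API responses;
# the second one has no "overview" field.
raw_records = [
    {"title": "Ad Astra", "overview": "The near future...", "vote_count": 2853},
    {"title": "Untitled", "vote_count": 0},
]

FIELDS = ["title", "overview", "vote_count"]


def extract_row(record):
    """Copy the wanted fields, storing None when a field is missing."""
    row = {}
    for field in FIELDS:
        try:
            row[field] = record[field]
        except KeyError:
            # Missing field: keep the row and record a null instead of failing.
            row[field] = None
    return row


rows = [extract_row(r) for r in raw_records]
```

Loaded into pandas, a `None` in a text column like this shows up as a missing value, which is what `isnull()` counts.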

P.S.: The overview column has 30 missing values. In this analysis, we will remove those rows.


### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
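import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity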

In [2]:
# Load the dataset
df = pd.read_csv('TMDB - Recommendation.csv')

In [3]:
# Check data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         10000 non-null  int64  
 1   title              10000 non-null  object 
 2   overview           9970 non-null   object 
 3   original_language  10000 non-null  object 
 4   vote_count         10000 non-null  int64  
 5   vote_average       10000 non-null  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 468.9+ KB


### Check for missing values 

In [4]:
# Check for missing values
df.isnull().sum()

Unnamed: 0            0
title                 0
overview             30
original_language     0
vote_count            0
vote_average          0
dtype: int64

In [5]:
# Drop the unnamed index column that was saved into the CSV
df = df.drop('Unnamed: 0', axis=1)
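
As an alternative to dropping the column after loading, the saved index can be read as the index in the first place. The sketch below uses a tiny in-memory CSV (`csv_text` is made up for illustration) rather than the real file.

```python
import io

import pandas as pd

# Tiny in-memory CSV that mimics a file saved together with its index column.
csv_text = """,title,vote_count
0,Ad Astra,2853
1,Bloodshot,1349
"""

# index_col=0 reads the first column as the index,
# so no 'Unnamed: 0' column is created.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
columns = list(sample.columns)
```

Either approach ends with the same columns; this one just avoids the extra cleanup step.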

In [6]:
# View the first 10 rows
df.head(10)

,title,overview,original_language,vote_count,vote_average
0,Ad Astra,"The near future, a time when both hope and har...",en,2853,5.9
1,Bloodshot,"After he and his wife are murdered, marine Ray...",en,1349,7.2
2,Bad Boys for Life,Marcus and Mike are forced to confront new thr...,en,2530,7.1
3,Ant-Man,Armed with the astonishing ability to shrink i...,en,13611,7.1
4,Percy Jackson: Sea of Monsters,"In their quest to confront the ultimate evil, ...",en,3542,5.9
5,Birds of Prey (and the Fantabulous Emancipatio...,"Harley Quinn joins forces with a singer, an as...",en,2639,7.1
6,Live Free or Die Hard,"John McClane is back and badder than ever, and...",en,3714,6.5
7,Cold Blood,A legendary but retired hit man lives in peace...,fr,119,5.1
8,Underwater,After an earthquake destroys their underwater ...,en,584,6.5
9,The Platform,"A mysterious place, an indescribable prison, a...",es,1924,7.2


In [7]:
# Check the data types once more
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   title              10000 non-null  object 
 1   overview           9970 non-null   object 
 2   original_language  10000 non-null  object 
 3   vote_count         10000 non-null  int64  
 4   vote_average       10000 non-null  float64
dtypes: float64(1), int64(1), object(3)
memory usage: 390.8+ KB


In [8]:
# Check for missing values once more
df.isnull().sum()

title                 0
overview             30
original_language     0
vote_count            0
vote_average          0
dtype: int64

### Drop the missing rows 

In [9]:
# Drop the 30 rows that contain missing values in overview
df.dropna(inplace=True)
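
Since `overview` is the only column with nulls here, a bare `dropna()` removes exactly those 30 rows. A slightly more explicit variant, sketched on a made-up `toy` frame, restricts the check to `overview` and resets the index so it runs from 0 without gaps.

```python
import numpy as np
import pandas as pd

# Made-up frame: row "B" lacks an overview, row "C" lacks a rating.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["text", np.nan, "more text"],
    "vote_average": [7.0, 6.0, np.nan],
})

# Only rows with a missing overview are dropped;
# row "C" survives despite its missing rating.
cleaned = toy.dropna(subset=["overview"]).reset_index(drop=True)
```

Without `reset_index`, the original labels are kept, which is why the notebook's index still ends at 9999 after the drop.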

In [10]:
# Check the data types once more
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9970 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   title              9970 non-null   object 
 1   overview           9970 non-null   object 
 2   original_language  9970 non-null   object 
 3   vote_count         9970 non-null   int64  
 4   vote_average       9970 non-null   float64
dtypes: float64(1), int64(1), object(3)
memory usage: 467.3+ KB


### Check the missing values 

In [11]:
# Check the missing values again
df.isnull().sum()

title                0
overview             0
original_language    0
vote_count           0
vote_average         0
dtype: int64

In [12]:
# Describe summary statistics
df.describe()

,vote_count,vote_average
count,9970.0,9970.0
mean,1023.820662,6.316028
std,1994.548401,1.331982
min,0.0,0.0
25%,144.0,5.8
50%,334.0,6.5
75%,929.75,7.1
max,25148.0,10.0


### Recommendation

In [14]:
# Vectorizing the 'Overview' column
tfidf = TfidfVectorizer(min_df=4, max_df=0.7)

vectorized_data = tfidf.fit_transform(df['overview'])

In [15]:
# Turn vectorized_data into a DataFrame with feature names as columns and movie titles as the index
tfidf_df = pd.DataFrame(vectorized_data.toarray(), columns=tfidf.get_feature_names_out())  # Use get_feature_names_out() for TF-IDF
tfidf_df.index = df['title']
tfidf_df

title,000,007,10,100,11,11th,12,12th,13,13th,...,zebra,zero,zeus,zoe,zombie,zombies,zone,zones,zoo,zooey
Ad Astra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bloodshot,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bad Boys for Life,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Ant-Man,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Percy Jackson: Sea of Monsters,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Cargo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Good Night,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The World Is Yours,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Grand Seduction,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
# Compute the pairwise cosine similarity between all movie overviews
cosine_similarity_array = cosine_similarity(tfidf_df)

In [17]:
cosine_sim_df = pd.DataFrame(cosine_similarity_array, index = tfidf_df.index, columns = tfidf_df.index)
cosine_sim_df

title,Ad Astra,Bloodshot,Bad Boys for Life,Ant-Man,Percy Jackson: Sea of Monsters,Birds of Prey (and the Fantabulous Emancipation of One Harley Quinn),Live Free or Die Hard,Cold Blood,Underwater,The Platform,...,Attack on Titan,Pokémon: The Rise of Darkrai,Eagle vs Shark,High Flying Bird,Zapped!,Cargo,The Good Night,The World Is Yours,The Grand Seduction,Woochi: The Demon Slayer
Ad Astra,1.000000,0.032822,0.006165,0.008447,0.006024,0.004912,0.032813,0.020560,0.050126,0.054077,...,0.020882,0.057016,0.011806,0.000000,0.008515,0.068984,0.020412,0.018521,0.014346,0.006578
Bloodshot,0.032822,1.000000,0.029857,0.047118,0.020810,0.019567,0.100522,0.120275,0.009876,0.010735,...,0.005413,0.011970,0.053973,0.093831,0.060153,0.015773,0.154289,0.051140,0.019583,0.033893
Bad Boys for Life,0.006165,0.029857,1.000000,0.048757,0.046121,0.023651,0.011685,0.000000,0.004135,0.007587,...,0.000000,0.005700,0.024760,0.000000,0.008817,0.006573,0.000000,0.036263,0.004249,0.004751
Ant-Man,0.008447,0.047118,0.048757,1.000000,0.019959,0.021598,0.015862,0.061623,0.038073,0.007773,...,0.006014,0.028940,0.013177,0.050987,0.012473,0.016969,0.038265,0.030695,0.089500,0.043655
Percy Jackson: Sea of Monsters,0.006024,0.020810,0.046121,0.019959,1.000000,0.015563,0.040686,0.019989,0.020162,0.027631,...,0.002048,0.031944,0.016348,0.018754,0.004547,0.008968,0.028091,0.018597,0.009282,0.006482
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Cargo,0.068984,0.015773,0.006573,0.016969,0.008968,0.025985,0.055242,0.009983,0.046565,0.042709,...,0.024243,0.056612,0.034292,0.000000,0.018119,1.000000,0.023545,0.051968,0.006144,0.024750
The Good Night,0.020412,0.154289,0.000000,0.038265,0.028091,0.053849,0.096785,0.111569,0.026576,0.069315,...,0.013046,0.034897,0.078972,0.083816,0.067363,0.023545,1.000000,0.050389,0.022797,0.011627
The World Is Yours,0.018521,0.051140,0.036263,0.030695,0.018597,0.042879,0.035019,0.015502,0.002651,0.011379,...,0.009083,0.008549,0.009902,0.016353,0.024885,0.051968,0.050389,1.000000,0.000000,0.003046
The Grand Seduction,0.014346,0.019583,0.004249,0.089500,0.009282,0.000000,0.017725,0.025587,0.029325,0.024095,...,0.005938,0.056895,0.011419,0.022504,0.019293,0.006144,0.022797,0.000000,1.000000,0.004441
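
The dense TF-IDF DataFrame is convenient for inspection, but `cosine_similarity` also accepts the sparse matrix returned by `fit_transform` directly. The sketch below, on a made-up three-document corpus, shows this and also that a plain dot product (`linear_kernel`) gives the same values, because `TfidfVectorizer` L2-normalises each row by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up overviews for illustration only.
docs = [
    "batman fights the joker in gotham",
    "batman returns to gotham",
    "astronaut travels to mars",
]

sparse_tfidf = TfidfVectorizer().fit_transform(docs)  # SciPy sparse matrix

# The sparse matrix can be passed directly; no .toarray() is needed.
sim_sparse = cosine_similarity(sparse_tfidf)

# Rows already have unit length (norm='l2' is the default),
# so the dot product equals the cosine similarity.
sim_dot = linear_kernel(sparse_tfidf)
```

On a matrix the size of the notebook's, skipping the dense conversion may save a noticeable amount of memory, though the resulting similarity matrix is dense either way.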


In [40]:
print(f"Movies most similar to the chosen title, based on its overview: {cosine_sim_df.loc['Bad Boys for Life'].sort_values(ascending=False).head().iloc[1:].index}")

Movies most similar to the chosen title, based on its overview: Index(['Bad Boys II', 'Scarface', 'Ride Along 2', 'Bad Boys'], dtype='object', name='title')


In [19]:
cosine_sim_df.loc['Bad Boys for Life'].sort_values(ascending=False).head()

title
Bad Boys for Life    1.000000
Bad Boys II          0.329347
Scarface             0.197642
Ride Along 2         0.183544
Bad Boys             0.175233
Name: Bad Boys for Life, dtype: float64

In [37]:
print(f"Movies most similar to the chosen title, based on its overview: {cosine_sim_df.loc['The Dark Knight'].sort_values(ascending=False).head().iloc[1:].index}")

Movies most similar to the chosen title, based on its overview: Index(['The Dark Knight Rises', 'Batman Returns', 'Batman vs. Two-Face',
       'Batman Forever'],
      dtype='object', name='title')


In [34]:
cosine_sim_df.loc['The Dark Knight'].sort_values(ascending=False).head(20)

title
The Dark Knight                            1.000000
The Dark Knight Rises                      0.303585
Batman Returns                             0.240944
Batman vs. Two-Face                        0.240244
Batman Forever                             0.238760
Batman: The Killing Joke                   0.216723
Batman: Under the Red Hood                 0.210683
Batman                                     0.199441
Batman: Gotham by Gaslight                 0.194339
Batman: Year One                           0.192394
Batman: The Dark Knight Returns, Part 2    0.189867
LEGO DC: Batman - Family Matters           0.183002
The Batman vs. Dracula                     0.177042
Batman: The Dark Knight Returns, Part 1    0.174338
Batman: Assault on Arkham                  0.165647
The Lego Batman Movie                      0.160048
Batman Begins                              0.159692
Batman: Mask of the Phantasm               0.156678
Batman Beyond: Return of the Joker         0.155831
Batman

In [28]:
cosine_sim_df.loc['Ad Astra'].sort_values(ascending=False).head(10)

title
Ad Astra                        1.000000
Jurassic Galaxy                 0.177196
First Man                       0.166499
Deep Impact                     0.164413
Space Battleship Yamato         0.151363
Gravity                         0.141033
Lucy in the Sky                 0.140430
Lost in Space                   0.138648
Mission to Mars                 0.134254
Space Pirate Captain Harlock    0.129462
Name: Ad Astra, dtype: float64
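
The lookups above repeat the same pattern: select a row, sort it, and skip the movie itself. A small helper can wrap that up. This is a sketch on made-up data; `recommend`, `toy` and `toy_sim` are hypothetical names, not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data standing in for the title/overview columns of the real dataset.
toy = pd.DataFrame({
    "title": ["Space One", "Space Two", "Cop Story", "Cop Story 2"],
    "overview": [
        "astronaut crew travels through space to mars",
        "stranded astronaut survives alone in space",
        "two detectives chase a drug cartel in miami",
        "the detectives return to miami to stop a cartel",
    ],
})

toy_vectors = TfidfVectorizer().fit_transform(toy["overview"])
toy_sim = pd.DataFrame(
    cosine_similarity(toy_vectors), index=toy["title"], columns=toy["title"]
)


def recommend(title, sim_df, n=3):
    """Return the n titles most similar to `title`, excluding the title itself."""
    if title not in sim_df.index:
        raise KeyError(f"{title!r} is not in the similarity matrix")
    scores = sim_df.loc[title]
    if isinstance(scores, pd.DataFrame):
        # Duplicate titles yield several rows; use the first as a simple fallback.
        scores = scores.iloc[0]
    return scores.drop(labels=title).sort_values(ascending=False).head(n)


top = recommend("Cop Story", toy_sim, n=1)
```

With the notebook's own matrix, a call such as `recommend('Bad Boys for Life', cosine_sim_df, n=4)` should return the same four titles as the manual lookup. The duplicate-title branch matters because some titles, such as remakes, may appear more than once in the index.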