# Netflix Movie and TV Show Recommendation

In this notebook, we build a recommender system based on the [Netflix](https://www.kaggle.com/datasets/shivamb/netflix-shows) dataset. There are several types of recommender systems, one of which is content-based filtering. That is the focus of this notebook, since the dataset contains no user data such as ratings or reviews. The idea is to extract the features of each item (content) and recommend items to the user based on the similarity between them.

**Let's jump right into the code**




![Content-based filtering vs. collaborative filtering](https://www.researchgate.net/profile/Lionel-Ngoupeyou-Tondji/publication/323726564/figure/fig5/AS:631605009846299@1527597777415/Content-based-filtering-vs-Collaborative-filtering-Source.png)

### Importing Libraries

In [1]:
import string
import numpy as np
import pandas as pd
import sklearn

import warnings
warnings.filterwarnings('ignore')

### Loading and Understanding the Data

In [2]:
df = pd.read_csv('Data/netflix_titles.csv')
df.head()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [3]:
df.describe(include='all')

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
count,8807,8807,8807,6173,7982,7976,8797,8807.0,8803,8804,8807,8807
unique,8807,2,8807,4528,7692,748,1767,,17,220,514,8775
top,s1,Movie,Dick Johnson Is Dead,Rajiv Chilaka,David Attenborough,United States,"January 1, 2020",,TV-MA,1 Season,"Dramas, International Movies","Paranormal activity at a lush, abandoned prope..."
freq,1,6131,1,19,19,2818,109,,3207,1793,362,4
mean,,,,,,,,2014.180198,,,,
std,,,,,,,,8.819312,,,,
min,,,,,,,,1925.0,,,,
25%,,,,,,,,2013.0,,,,
50%,,,,,,,,2017.0,,,,
75%,,,,,,,,2019.0,,,,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [5]:
pd.DataFrame({'Total missing values':df.isna().sum(),
              'Percentage':(df.isna().sum()/len(df))*100})

,Total missing values,Percentage
show_id,0,0.0
type,0,0.0
title,0,0.0
director,2634,29.908028
cast,825,9.367549
country,831,9.435676
date_added,10,0.113546
release_year,0,0.0
rating,4,0.045418
duration,3,0.034064


### Build the Recommender System

We won't use every column in this notebook. The recommendations will only consider the information contained in the following columns:
- Type
- Director
- Cast
- Rating
- Listed_in
- Description

In [6]:
new_df = df[['title', 'type', 'director', 'cast', 'rating', 'listed_in', 'description']].copy()  # copy so later edits don't touch df
new_df.set_index('title', inplace=True)
new_df.head()

title,type,director,cast,rating,listed_in,description
Dick Johnson Is Dead,Movie,Kirsten Johnson,,PG-13,Documentaries,"As her father nears the end of his life, filmm..."
Blood & Water,TV Show,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",TV-MA,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
Ganglands,TV Show,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",TV-MA,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
Jailbirds New Orleans,TV Show,,,TV-MA,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
Kota Factory,TV Show,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",TV-MA,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


If you look at the missing values in this dataset, you will see that the director column has 2634 NaN values, which is almost 30 percent of the rows. We can't simply drop those rows because we would lose many titles that could be recommended. Instead, we fill the NaN values with an empty string.
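
The trade-off between dropping and filling can be checked on a small example. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not part of the Netflix dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the four titles have no director listed
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'director': ['Jane Doe', np.nan, 'John Roe', np.nan],
    'description': ['a heist', 'a romance', 'a documentary', 'a thriller'],
})

# Dropping rows that contain any NaN removes half of this small catalogue
dropped = toy.dropna()

# Filling with an empty string keeps every title available for recommendation
filled = toy.fillna('')

print(len(dropped), len(filled))
```

A title with an empty director simply contributes no director token later on, so it can still be matched on its other features.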

In [7]:
new_df.fillna('', inplace=True)

In [8]:
# director, cast, and listed_in can each hold several comma-separated entries.
# Joining each entry into a single token prevents people who share a first or
# last name from being treated as the same person, and prevents a word that
# appears in many categories (TV, etc.) from being treated as the same category.
def separate(texts):
    t = []
    for text in texts.split(','):
        t.append(text.replace(' ', '').lower())
    return ' '.join(t)

def remove_space(texts):
    return texts.replace(' ', '').lower()

def remove_punc(texts):
    return texts.translate(str.maketrans('','',string.punctuation)).lower()
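
To see why names are collapsed into single tokens, the sketch below compares how scikit-learn's default analyzer tokenizes a raw cast string and its joined form. The cast string is a hypothetical example, and `separate` is re-declared here in a compact form only to keep the snippet self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def separate(texts):
    # Same logic as the notebook helper: one token per comma-separated entry
    return ' '.join(part.replace(' ', '').lower() for part in texts.split(','))

raw = 'Tom Hanks, Tom Cruise'  # hypothetical cast string
joined = separate(raw)

# The default analyzer lowercases and splits on whitespace and punctuation,
# so the raw string produces a shared 'tom' token while the joined form does not
analyzer = TfidfVectorizer().build_analyzer()
raw_tokens = analyzer(raw)
joined_tokens = analyzer(joined)

print(raw_tokens)
print(joined_tokens)
```

With the raw string, two titles starring different people named Tom would look partly similar. After joining, each person becomes a distinct token.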

In [9]:
new_df['type'] = new_df['type'].apply(remove_space)
new_df['director'] = new_df['director'].apply(separate)
new_df['cast'] = new_df['cast'].apply(separate)
new_df['rating'] = new_df['rating'].apply(remove_space)
new_df['listed_in'] = new_df['listed_in'].apply(separate)
new_df['description'] = new_df['description'].apply(remove_punc)

new_df.head()

title,type,director,cast,rating,listed_in,description
Dick Johnson Is Dead,movie,kirstenjohnson,,pg-13,documentaries,as her father nears the end of his life filmma...
Blood & Water,tvshow,,amaqamata khosingema gailmabalane thabangmolab...,tv-ma,internationaltvshows tvdramas tvmysteries,after crossing paths at a party a cape town te...
Ganglands,tvshow,julienleclercq,samibouajila tracygotoas samueljouy nabihaakka...,tv-ma,crimetvshows internationaltvshows tvaction&adv...,to protect his family from a powerful drug lor...
Jailbirds New Orleans,tvshow,,,tv-ma,docuseries realitytv,feuds flirtations and toilet talk go down amon...
Kota Factory,tvshow,,mayurmore jitendrakumar ranjanraj alamkhan ahs...,tv-ma,internationaltvshows romantictvshows tvcomedies,in a city of coaching centers known to train i...


In [10]:
# Combine all the words into 1 column, skipping empty values.
# A row-wise join avoids shadowing the imported `string` module and
# avoids chained assignment into the DataFrame.
feature_cols = ['type', 'director', 'cast', 'rating', 'listed_in', 'description']
new_df['bag_of_words'] = new_df[feature_cols].apply(
    lambda row: ' '.join(value for value in row if value != ''), axis=1)

new_df.drop(new_df.columns[:-1], axis=1, inplace=True)

In [11]:
new_df.head()

title,bag_of_words
Dick Johnson Is Dead,movie kirstenjohnson pg-13 documentaries as he...
Blood & Water,tvshow amaqamata khosingema gailmabalane thaba...
Ganglands,tvshow julienleclercq samibouajila tracygotoas...
Jailbirds New Orleans,tvshow tv-ma docuseries realitytv feuds flirta...
Kota Factory,tvshow mayurmore jitendrakumar ranjanraj alamk...


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

**TF-IDF** stands for Term Frequency-Inverse Document Frequency. It measures how important a word is to a document. In a nutshell, a word that appears in many documents of the corpus is considered less important, so its TF-IDF score is lower; a word that appears in only a few documents scores higher.
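
The effect can be illustrated with a minimal sketch on a made-up three-document corpus (the sentences are hypothetical). With scikit-learn's default settings, the inverse document frequency of a term is computed as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: 'movie' appears in every document, 'heist' in only one
corpus = [
    'movie about a heist',
    'movie about a family',
    'movie about a robot',
]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# Map each vocabulary term to its inverse-document-frequency weight
idf = {term: vectorizer.idf_[i] for term, i in vectorizer.vocabulary_.items()}

# The common word gets a lower weight than the rare one
print(idf['movie'], idf['heist'])
```

Here `movie` receives the minimum weight of 1.0 because it appears in all three documents, while `heist` is weighted more heavily.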

In [13]:
tfid = TfidfVectorizer()
tfid_matrix = tfid.fit_transform(new_df['bag_of_words'])

# tfid.vocabulary_  (the vocabulary lives on the vectorizer, not on the matrix)

In [14]:
cosine_sim = cosine_similarity(tfid_matrix, tfid_matrix)
cosine_sim

array([[1.00000000e+00, 6.51109225e-03, 2.51508724e-02, ...,
        1.22541028e-02, 2.30160299e-02, 3.60339526e-02],
       [6.51109225e-03, 1.00000000e+00, 1.02129814e-02, ...,
        1.37342009e-03, 0.00000000e+00, 7.44741025e-04],
       [2.51508724e-02, 1.02129814e-02, 1.00000000e+00, ...,
        7.23719722e-03, 7.21477380e-03, 4.24499407e-02],
       ...,
       [1.22541028e-02, 1.37342009e-03, 7.23719722e-03, ...,
        1.00000000e+00, 1.99475996e-02, 5.30430706e-03],
       [2.30160299e-02, 0.00000000e+00, 7.21477380e-03, ...,
        1.99475996e-02, 1.00000000e+00, 4.54842501e-03],
       [3.60339526e-02, 7.44741025e-04, 4.24499407e-02, ...,
        5.30430706e-03, 4.54842501e-03, 1.00000000e+00]])

In [15]:
# Later on we will add the similarity scores as a column
final_df = df[['title', 'type']].copy()  # copy so the new column doesn't touch df

In [16]:
def recommendation(title, total_result=5, threshold=0):
    # Get the row index of the requested title
    idx = final_df[final_df['title'] == title].index[0]
    # Create a new column for similarity; the values differ for each title you input
    final_df['similarity'] = cosine_sim[idx]
    # Drop the requested title itself, then keep the most similar titles
    sort_final_df = final_df.drop(idx).sort_values(by='similarity', ascending=False)[:total_result]
    
    # You can set a threshold if you want to narrow the results down
    sort_final_df = sort_final_df[sort_final_df['similarity'] > threshold]
    
    # Is the title a movie or tv show?
    movies = sort_final_df['title'][sort_final_df['type'] == 'Movie']
    tv_shows = sort_final_df['title'][sort_final_df['type'] == 'TV Show']
    
    if len(movies) != 0:
        print('Similar Movie(s) list:')
        for i, movie in enumerate(movies):
            print('{}. {}'.format(i+1, movie))
        print()
    else:
        print('Similar Movie(s) list:')
        print('-\n')
        
    if len(tv_shows) != 0:
        print('Similar TV_show(s) list:')
        for i, tv_show in enumerate(tv_shows):
            print('{}. {}'.format(i+1, tv_show))
    else:
        print('Similar TV_show(s) list:')
        print('-')
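
The lookup-and-rank logic can also be sketched end to end on a tiny catalogue. Everything in this snippet is hypothetical: the titles, the `bag_of_words` strings, and the `top_matches` helper are illustrative stand-ins, and the helper returns a DataFrame instead of printing and raises an error for unknown titles.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-catalogue standing in for the bag_of_words column
toy = pd.DataFrame({
    'title': ['Cartel Kings', 'Cartel Wars', 'Robot Dreams', 'Baking Duel'],
    'bag_of_words': [
        'tvshow crime drug cartel colombia',
        'tvshow crime drug cartel mexico',
        'movie scifi robot future',
        'tvshow reality baking contest',
    ],
})

toy_sim = cosine_similarity(TfidfVectorizer().fit_transform(toy['bag_of_words']))

def top_matches(title, n=2):
    # Hypothetical helper: return the n most similar titles as a DataFrame
    matches = toy.index[toy['title'] == title]
    if len(matches) == 0:
        raise KeyError(f'{title!r} is not in the catalogue')
    idx = matches[0]
    # Exclude the queried title explicitly, then rank the rest by similarity
    scores = pd.Series(toy_sim[idx], index=toy.index).drop(idx)
    best = scores.sort_values(ascending=False).head(n)
    return toy.loc[best.index, ['title']].assign(similarity=best.values)

result = top_matches('Cartel Kings')
print(result)
```

Returning a DataFrame rather than printing makes the result easier to filter or test, and dropping the queried row by index avoids relying on it being sorted first.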

### Recommendation Example

In [26]:
recommendation('Breaking Bad')

Similar Movie(s) list:
1. The Show
2. The Book of Sun

Similar TV_show(s) list:
1. Better Call Saul
2. Marvel's The Punisher
3. Dare Me


In [28]:
recommendation('Narcos', total_result=10)

Similar Movie(s) list:
1. The Congress

Similar TV_show(s) list:
1. Narcos: Mexico
2. Wild District
3. El Cartel
4. Miss Dynamite
5. Cocaine Cowboys: The Kings of Miami
6. The Great Heist
7. El Chapo
8. Apaches
9. Ganglands


In [29]:
recommendation('Chappie')

Similar Movie(s) list:
1. Real Steel
2. District 9
3. 2036 Origin Unknown
4. Singularity
5. AlphaGo

Similar TV_show(s) list:
-


In [30]:
recommendation('Stranger Things', threshold=0.2)

Similar Movie(s) list:
-

Similar TV_show(s) list:
1. Beyond Stranger Things
