# Netflix Movie and TV Show Recommendation

In this notebook, we build a recommender system on the [Netflix](https://www.kaggle.com/datasets/shivamb/netflix-shows) dataset. There are several types of recommender systems, one of which is content-based filtering. That is the focus of this notebook, since the dataset contains no user data such as ratings or reviews. The idea is to extract the features of each item (its content) and recommend items to the user based on the similarity between them.

**Let's jump right into the code**




![Content-based filtering vs. collaborative filtering](https://www.researchgate.net/profile/Lionel-Ngoupeyou-Tondji/publication/323726564/figure/fig5/AS:631605009846299@1527597777415/Content-based-filtering-vs-Collaborative-filtering-Source.png)

### Importing Libraries

In [3]:
import string
import numpy as np
import pandas as pd
import sklearn

import warnings
warnings.filterwarnings('ignore')

### Loading and Understanding the Data

In [4]:
df = pd.read_csv('../input/netflix-shows/netflix_titles.csv')
df.head()

In [5]:
df.describe(include='all')

In [6]:
df.info()

In [7]:
pd.DataFrame({'Total missing values':df.isna().sum(),
              'Percentage':(df.isna().sum()/len(df))*100})

### Build the Recommender System

We won't use all the columns in this notebook. The recommendations consider only the information contained in the following columns:
- Type
- Director
- Cast
- Rating
- Listed_in
- Description

In [8]:
# .copy() gives an independent frame, so the in-place edits below never touch df
new_df = df[['title', 'type', 'director', 'cast', 'rating', 'listed_in', 'description']].copy()
new_df.set_index('title', inplace=True)
new_df.head()

If you look at the missing values in this dataset, you will see that the director column has 2634 NaN values, which is almost 30 percent of the rows. We can't simply drop those rows, because we would lose many titles that could otherwise be recommended. Instead, we fill the NaN values with an empty string.

In [9]:
new_df.fillna('', inplace=True)
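
To make the trade-off concrete, here is a minimal sketch on a made-up three-row frame (the titles and names are placeholders, not rows from the Netflix data). It compares dropping rows with missing values against filling them with an empty string.

```python
import numpy as np
import pandas as pd

# Toy frame with placeholder values; one title has no director
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'director': ['Jane Doe', np.nan, 'John Roe'],
    'description': ['a heist', 'a road trip', 'a courtroom drama'],
})

dropped = toy.dropna()    # removes every row that has any NaN, so title 'B' is lost
filled = toy.fillna('')   # keeps all rows; the missing director becomes ''

print(len(dropped), len(filled))
```

With the empty string in place, title 'B' can still be matched on its description, even though it contributes no director tokens.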

In [10]:
# director, cast, and listed_in can hold several comma-separated people or categories.
# Joining each entry into one token keeps people who merely share a first or last name
# from being treated as the same person, and keeps a word that appears in many
# categories (TV, etc.) from making different categories look alike.
def separate(texts):
    t = []
    for text in texts.split(','):
        t.append(text.replace(' ', '').lower())
    return ' '.join(t)

def remove_space(texts):
    return texts.replace(' ', '').lower()

def remove_punc(texts):
    return texts.translate(str.maketrans('','',string.punctuation)).lower()
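
As a quick illustration of why `separate` joins each name into one token, the sketch below compares it with plain whitespace tokenization. The cast string is an illustrative input, not a row from the dataset, and `separate` is redefined compactly here only to keep the snippet self-contained.

```python
def separate(texts):
    # Same behaviour as the helper above: one lowercase token per comma-separated entry
    return ' '.join(part.replace(' ', '').lower() for part in texts.split(','))

cast = 'Tom Hanks, Tom Holland'   # illustrative input

# Naive tokenization: both actors contribute the token 'tom'
naive_tokens = cast.replace(',', '').lower().split()

# Joined tokenization: each actor becomes a single, distinct token
joined_tokens = separate(cast).split()

print(naive_tokens)   # ['tom', 'hanks', 'tom', 'holland']
print(joined_tokens)  # ['tomhanks', 'tomholland']
```

With the naive tokens, two titles that feature different actors named Tom would share a term and look more similar than they should.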

In [11]:
new_df['type'] = new_df['type'].apply(remove_space)
new_df['director'] = new_df['director'].apply(separate)
new_df['cast'] = new_df['cast'].apply(separate)
new_df['rating'] = new_df['rating'].apply(remove_space)
new_df['listed_in'] = new_df['listed_in'].apply(separate)
new_df['description'] = new_df['description'].apply(remove_punc)

new_df.head()

In [12]:
# Combine all the words into 1 column, skipping empty fields.
# A row-wise join replaces the manual loop, which shadowed the imported
# `string` module and relied on chained assignment.
feature_cols = ['type', 'director', 'cast', 'rating', 'listed_in', 'description']
new_df['bag_of_words'] = new_df[feature_cols].apply(
    lambda row: ' '.join(value for value in row if value != ''), axis=1)

# Keep only the combined column
new_df = new_df[['bag_of_words']]

In [13]:
new_df.head()

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

**TF-IDF** stands for Term Frequency-Inverse Document Frequency. It measures how important a word is to a document. In a nutshell, a word that appears in many documents of the corpus is considered less important, so its TF-IDF score is lower. The opposite holds for a word that appears in only a few documents: it gets a higher score.
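
The effect can be seen on a tiny made-up corpus. The sketch below uses three invented descriptions (not taken from the dataset) and reads the learned inverse document frequency of each term from the fitted vectorizer's `idf_` and `vocabulary_` attributes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented descriptions: 'about' is in all of them, 'crime' in two, 'robot' in one
docs = [
    'crime drama about a chemistry teacher',
    'crime drama about a drug cartel',
    'comedy about a robot',
]

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)

# Map each term to its inverse document frequency weight
idf = {term: vec.idf_[col] for term, col in vec.vocabulary_.items()}

print(idf['about'], idf['crime'], idf['robot'])
```

The weights increase from `about` to `crime` to `robot`: the fewer documents a term appears in, the more it contributes to a title's vector.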

In [15]:
tfid = TfidfVectorizer()
tfid_matrix = tfid.fit_transform(new_df['bag_of_words'])

# tfid.vocabulary_  (the vocabulary lives on the vectorizer, not on the matrix)

In [16]:
cosine_sim = cosine_similarity(tfid_matrix, tfid_matrix)
cosine_sim
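
For intuition about what `cosine_similarity` returns, here is a small sketch on three hand-made vectors (assumed values, chosen so the result is easy to check by hand). Cosine similarity compares direction only, so a vector and a scaled copy of it score 1, while vectors with no shared non-zero component score 0.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hand-made vectors: row 1 is row 0 scaled by 2, row 2 shares no component with them
vectors = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
])

sim = cosine_similarity(vectors)   # 3 x 3 matrix, sim[i, j] compares row i with row j

# The same value computed by hand: dot product divided by the product of the norms
manual = vectors[0] @ vectors[1] / (np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1]))

print(sim.round(2))
```

In the notebook, row `i` of `cosine_sim` plays the same role: it holds the similarity of title `i` to every title in the dataset, including itself.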

In [17]:
# The similarity scores are added to this frame as a column later on.
# .copy() keeps that new column out of the original df.
final_df = df[['title', 'type']].copy()

In [18]:
def recommendation(title, total_result=5, threshold=0.5):
    # Get the index
    idx = final_df[final_df['title'] == title].index[0]
    # Create a new column for similarity, the value is different for each title you input
    final_df['similarity'] = cosine_sim[idx]
    sort_final_df = final_df.sort_values(by='similarity', ascending=False)[1:total_result+1]
    
    # You can set a threshold if you want to narrow the result down
    #sort_final_df = sort_final_df[sort_final_df['similarity'] > threshold]
    
    # Is the title a movie or tv show?
    movies = sort_final_df['title'][sort_final_df['type'] == 'Movie']
    tv_shows = sort_final_df['title'][sort_final_df['type'] == 'TV Show']
    
    if len(movies) != 0:
        print('Similar Movie(s) list:')
        for i, movie in enumerate(movies):
            print('{}. {}'.format(i+1, movie))
        print()
    else:
        print('Similar Movie(s) list:')
        print('-\n')
        
    if len(tv_shows) != 0:
        print('Similar TV_show(s) list:')
        for i, tv_show in enumerate(tv_shows):
            print('{}. {}'.format(i+1, tv_show))
    else:
        print('Similar TV_show(s) list:')
        print('-')
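
The function above prints its results and assumes the queried title sorts first. As an alternative, here is a minimal sketch that returns a DataFrame, excludes the queried title explicitly, and applies the threshold. `top_similar`, its parameters, and the toy data are hypothetical names introduced for illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd

def top_similar(titles, sim_matrix, title, total_result=5, threshold=0.0):
    """Hypothetical helper: return the most similar titles as a DataFrame."""
    titles = pd.Series(titles).reset_index(drop=True)
    matches = titles[titles == title]
    if matches.empty:
        raise KeyError(f'{title!r} not found')
    idx = matches.index[0]
    # Similarity of the queried title to every title, with the query itself removed
    scores = pd.Series(sim_matrix[idx], index=titles.index).drop(idx)
    # Apply the threshold, then keep the highest-scoring titles
    scores = scores[scores > threshold].sort_values(ascending=False).head(total_result)
    return pd.DataFrame({'title': titles[scores.index].values,
                         'similarity': scores.values})

# Toy data with assumed values: three titles and a symmetric similarity matrix
toy_titles = ['A', 'B', 'C']
toy_sim = np.array([[1.0, 0.8, 0.1],
                    [0.8, 1.0, 0.3],
                    [0.1, 0.3, 1.0]])

result = top_similar(toy_titles, toy_sim, 'A', total_result=2, threshold=0.05)
strict = top_similar(toy_titles, toy_sim, 'A', total_result=2, threshold=0.5)
print(result)
```

Under the same assumptions, a call on the notebook's data would look like `top_similar(df['title'], cosine_sim, 'Narcos')`. Returning a frame rather than printing makes the result easier to filter by type or to reuse elsewhere.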

### Recommendation Example

In [19]:
recommendation('Breaking Bad')

In [20]:
recommendation('Narcos')

In [21]:
recommendation('Chappie')

In [22]:
recommendation('Stranger Things')