Netflix is a subscription-based streaming platform that lets users watch movies and TV shows without advertisements. One reason for its popularity is its recommendation system, which suggests movies and TV shows based on each user's interests.


# Netflix Recommendation System using Python

The dataset I am using to build a Netflix recommendation system using Python is downloaded from Kaggle.


In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
data = pd.read_csv('C:/Users/saram/Downloads/data science/Ml/archive/netflixData.csv')

In [3]:
data.head()

Unnamed: 0,Show Id,Title,Description,Director,Genres,Cast,Production Country,Release Date,Rating,Duration,Imdb Score,Content Type,Date Added
0,cc1b6ed9-cf9e-4057-8303-34577fb54477,(Un)Well,This docuseries takes a deep dive into the luc...,,Reality TV,,United States,2020.0,TV-MA,1 Season,6.6/10,TV Show,
1,e2ef4e91-fb25-42ab-b485-be8e3b23dedb,#Alive,"As a grisly virus rampages a city, a lone man ...",Cho Il,"Horror Movies, International Movies, Thrillers","Yoo Ah-in, Park Shin-hye",South Korea,2020.0,TV-MA,99 min,6.2/10,Movie,"September 8, 2020"
2,b01b73b7-81f6-47a7-86d8-acb63080d525,#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...","Sabina Fedeli, Anna Migotto","Documentaries, International Movies","Helen Mirren, Gengher Gatti",Italy,2019.0,TV-14,95 min,6.4/10,Movie,"July 1, 2020"
3,b6611af0-f53c-4a08-9ffa-9716dc57eb9c,#blackAF,Kenya Barris and his family navigate relations...,,TV Comedies,"Kenya Barris, Rashida Jones, Iman Benson, Genn...",United States,2020.0,TV-MA,1 Season,6.6/10,TV Show,
4,7f2d4170-bab8-4d75-adc2-197f7124c070,#cats_the_mewvie,This pawesome documentary explores how our fel...,Michael Margolis,"Documentaries, International Movies",,Canada,2020.0,TV-14,90 min,5.1/10,Movie,"February 5, 2020"


The Title column needs cleaning, because some titles contain characters such as # before the name of the movie or TV show.

In [4]:
data.isnull().sum()

Show Id                  0
Title                    0
Description              0
Director              2064
Genres                   0
Cast                   530
Production Country     559
Release Date             3
Rating                   4
Duration                 3
Imdb Score             608
Content Type             0
Date Added            1335
dtype: int64

The dataset contains null values, but before removing them, let's select the columns that we can use to build a Netflix recommendation system:

In [5]:
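# .copy() gives an independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning
data = data[["Title","Description","Content Type","Genres"]].copy()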
data.head()

Unnamed: 0,Title,Description,Content Type,Genres
0,(Un)Well,This docuseries takes a deep dive into the luc...,TV Show,Reality TV
1,#Alive,"As a grisly virus rampages a city, a lone man ...",Movie,"Horror Movies, International Movies, Thrillers"
2,#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...",Movie,"Documentaries, International Movies"
3,#blackAF,Kenya Barris and his family navigate relations...,TV Show,TV Comedies
4,#cats_the_mewvie,This pawesome documentary explores how our fel...,Movie,"Documentaries, International Movies"


1. The Title column contains the titles of movies and TV shows on Netflix.

2. The Description column describes the plot of the TV show or movie.

3. The Content Type column tells us whether it is a movie or a TV show.

4. The Genres column contains all the genres of the TV show or movie.

### dtypes


In [6]:
data.dtypes

Title           object
Description     object
Content Type    object
Genres          object
dtype: object

# Data preparation

The Title column needs cleaning before it can be used to look up content.

In [7]:
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words('english'))

def clean(text):
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # drop text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'<.*>+', '', text)                  # drop HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # drop punctuation, including '#'
    text = re.sub(r'\n', '', text)                     # drop newlines
    text = re.sub(r'\w*\d\w', '', text)                # drop word parts with a digit followed by another character
    text = [word for word in text.split(' ') if word not in stopword]  # drop stop words
    text = [stemmer.stem(word) for word in text]       # reduce each word to its stem
    return " ".join(text)

data["Title"] = data["Title"].apply(clean)
    

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\saram\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
data.Title.sample(10)

370           angelina ballerina
5498    trigger warn killer mike
5294                        warn
4232                     shooter
5280                       untam
1288                        devd
3738                       poppl
2642                      lavend
44                    3 day kill
2297                     ishqiya
Name: Title, dtype: object

# How Netflix Recommendation System Works

Netflix's recommendation system shows you movies and TV shows according to your interests. Because of its large user base, Netflix has a lot of data to draw on. Its recommendation system builds a personalised catalogue for you based on factors like:

1. Your viewing history
2. The viewing history of other users with tastes and preferences similar to yours
3. Genres, category, description, and other information about the content that you watched in the past

The genre of the content is one of the most valuable factors, because it helps Netflix recommend content even to new users.

I will use the Genres column as the feature to recommend similar content to the user, based on the concept of cosine similarity.


# Cosine similarity

Cosine similarity measures how similar two documents are once they are represented as vectors. The score is the cosine of the angle between the two vectors. In general it ranges from -1 to 1, but for vectors with no negative entries, such as TF-IDF vectors, it ranges from 0 to 1. A score of 1 means the two vectors point in the same direction (maximum similarity), and a score of 0 means the two documents share no terms.
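To make the definition concrete, here is a minimal sketch that computes the score directly from the formula, the dot product divided by the product of the vector lengths. The helper name `cosine` and the three toy vectors are assumptions made for illustration; they are not part of the Netflix dataset.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two non-zero 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors, chosen only to illustrate the extremes
a = np.array([1.0, 1.0, 0.0])
b = np.array([2.0, 2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 0.0, 3.0])   # shares no non-zero component with a
d = np.array([1.0, 0.0, 0.0])   # overlaps with a in one component

same_direction = cosine(a, b)   # close to 1: length does not matter
orthogonal = cosine(a, c)       # 0: nothing in common
partial = cosine(a, d)          # about 0.707, between the two extremes
```

The score depends only on direction, which is why a long genre list and a short one can still be rated as identical if they use the same terms in the same proportions.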

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity




In [10]:


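# Use the "Genres" column of the dataset as the text feature
feature = data["Genres"].tolist()

# Create the TfidfVectorizer: drop English stop words and ignore terms that appear
# in more than 85% of the rows (max_df) or in fewer than 10% of the rows (min_df)
tfidf = TfidfVectorizer(input='content', stop_words="english", max_df=0.85, min_df=0.1)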

# Fit and transform the data
tfidf_matrix = tfidf.fit_transform(feature)

# Calculate cosine similarity
similarity = cosine_similarity(tfidf_matrix)
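
To see what these two steps produce, here is a small sketch on four genre strings taken from the rows shown earlier. The names `toy_genres`, `toy_tfidf`, `toy_matrix` and `toy_similarity` are assumptions for illustration, and the vectorizer is created without `max_df` and `min_df` because those thresholds are not meaningful on four rows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Four genre strings copied from the sample rows (toy data, not the full dataset)
toy_genres = [
    "Horror Movies, International Movies, Thrillers",
    "Documentaries, International Movies",
    "TV Comedies",
    "Documentaries, International Movies",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_genres)   # one row per title, one column per term
toy_similarity = cosine_similarity(toy_matrix)     # square matrix of pairwise scores

vocabulary = sorted(toy_tfidf.vocabulary_)         # the terms that became columns
```

Rows 1 and 3 have the same genres, so their score is 1. Row 2 shares no term with the others, so its scores against them are 0. Row 0 shares only some terms with row 1, so that score falls between 0 and 1.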


Next, I will set the Title column as the index so that we can find similar content by giving the title of a movie or TV show as input:

In [11]:
indices = pd.Series(data.index, index=data['Title'])
indices = indices[~indices.index.duplicated(keep='first')]  # keep one row position per title
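
One detail worth checking is how this lookup behaves when two rows share the same cleaned title. `Series.drop_duplicates()` removes repeated values, not repeated index labels, so it does not help when the values are unique row positions. The sketch below uses a hypothetical three-row frame (`toy`) to show the difference; filtering with `Index.duplicated` is one possible way to keep a single position per title.

```python
import pandas as pd

# Hypothetical frame in which the title "alive" appears twice
toy = pd.DataFrame({"Title": ["alive", "warn", "alive"]})

# The values are the row positions 0, 1, 2, which are all unique,
# so drop_duplicates() removes nothing
by_value = pd.Series(toy.index, index=toy["Title"]).drop_duplicates()

# Filtering on the index labels keeps the first position for each title
lookup = pd.Series(toy.index, index=toy["Title"])
by_label = lookup[~lookup.index.duplicated(keep="first")]

repeated_lookup = by_value["alive"]   # a Series holding two positions
unique_lookup = by_label["alive"]     # a single integer position
```

A single integer is what the similarity matrix needs as a row index, so the second form is the safer input for a recommendation function.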

In [20]:
def netFlix_recommendation(title, similarity=similarity):
    index = indices[title]                                  # row position of the given title
    similarity_scores = list(enumerate(similarity[index]))  # (row, score) pairs
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[0:10]             # the ten highest-scoring rows
    movieindices = [i[0] for i in similarity_scores]
    return data['Title'].iloc[movieindices]

# Titles were cleaned earlier, so the input must be given in the same cleaned form
print(netFlix_recommendation("girlfriend"))

3                          blackaf
285                     washington
417                 arrest develop
434     astronomi club sketch show
451    aunti donna big ol hous fun
656                      big mouth
752                bojack horseman
805                   brew brother
935                       champion
937                  chappell show
Name: Title, dtype: object
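
The function above can return the queried title among its own results, and it raises a `KeyError` for a title that is not in the index. The following is a minimal sketch of a variant that handles both cases, written against a small toy frame so that it runs on its own. The names `toy`, `toy_similarity`, `toy_indices` and `recommend`, and the `top_n` parameter, are assumptions for illustration and not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data: four titles and genres copied from the sample rows
toy = pd.DataFrame({
    "Title": ["#Alive", "#AnneFrank - Parallel Stories", "#blackAF", "#cats_the_mewvie"],
    "Genres": [
        "Horror Movies, International Movies, Thrillers",
        "Documentaries, International Movies",
        "TV Comedies",
        "Documentaries, International Movies",
    ],
})
toy_similarity = cosine_similarity(TfidfVectorizer().fit_transform(toy["Genres"]))
toy_indices = pd.Series(toy.index, index=toy["Title"])

def recommend(title, top_n=2):
    """Return up to top_n titles most similar to `title`, excluding the title itself."""
    if title not in toy_indices.index:
        return pd.Series([], dtype=object, name="Title")   # unknown title: empty result
    position = toy_indices[title]
    # Sort rows by descending score; a stable sort keeps tied rows in their original order
    order = np.argsort(-toy_similarity[position], kind="stable")
    order = [i for i in order if i != position][:top_n]
    return toy["Title"].iloc[order]

result = recommend("#cats_the_mewvie")
missing = recommend("not in the catalogue")
```

Here `result` lists the other documentary first, because its genres are identical, followed by the title that shares only some genre terms. `missing` is an empty Series instead of an exception.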
