# **Project Name**    - End to End Machine Learning

Netflix-Movies-and-TV-Shows-Clustering.

##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual

NAME - AMIT GUHA

Email - aguha535@gmail.com

Video Presentation Link -

# **Project Summary -**

Netflix Inc. is an American media company based in Los Gatos, California. Founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California, it operates the over-the-top subscription video on-demand service Netflix brand, which includes original films and television series commissioned or acquired by the company, and third-party content licensed from other distributors.

This dataset consists of TV shows and movies available on Netflix. The dataset is collected from Flixable, which is a third-party Netflix search engine. Clustering Netflix movies and TV shows is a data analysis and machine learning technique for grouping content into similar categories. It involves analyzing the characteristics of each title, such as genre, cast, and plot, and using algorithms to identify patterns and similarities.

In essence, Netflix's recommendation system is a set of machine learning algorithms that analyze user data and movie ratings. To make it more effective, Netflix has set up 1,300 recommendation clusters based on users' viewing preferences.

Netflix's target market is young, tech-savvy users and anyone with digital connectivity. The audience comes from diverse age groups and demographics, but most are teenagers, college-goers, entrepreneurs, and working professionals. Netflix's target consumers are divided into segments based on demographics, behavioural intent, and psychographics.

Like most licensing agreements, the deal is structured in a traditional form, whereby Netflix pays for each film determined by rate cards on a sliding scale by each title's domestic or worldwide box office receipts. Netflix uses machine learning and algorithms to help break viewers' preconceived notions and find shows that they might not have initially chosen. To do this, it looks at nuanced threads within the content, rather than relying on broad genres to make its predictions.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


This dataset consists of tv shows and movies available on Netflix as of 2019. The dataset is collected from Flixable which is a third-party Netflix search engine.

In 2018, Flixable released an interesting report showing that the number of movies on Netflix has decreased by more than 2,000 titles since 2010, while the number of TV shows has nearly tripled. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings and Rotten Tomatoes can also provide many interesting findings.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each chart the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import matplotlib.cm as cm

from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

#for nlp
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
import scipy.cluster.hierarchy as sch


import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
working_path = "/content/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv"
df = pd.read_csv(working_path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info(memory_usage = 'deep')

In [None]:
df.columns

In [None]:
#  Defining Data Info All
def DataInfoAll(df):
    print(f"Dataset Shape: {df.shape}")
    print("-"*125)
    summary = pd.DataFrame(df.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = df.isnull().sum().values
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.iloc[0].values
    summary['Second Value'] = df.iloc[1].values
    return summary
DataInfoAll(df)

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df_duplicate = df[df.duplicated()]
print("Let's print all the duplicated rows as a dataframe")       # No duplicate values present in this dataset.
df_duplicate

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
NaN_Checker = pd.DataFrame({"No Of Total Values": df.shape[0] , "No of NaN values": df.isnull().sum(),
                    "%age of NaN values" : round((df.isnull().sum()/ df.shape[0])*100 , 2) })
NaN_Checker.sort_values("No of NaN values" , ascending = False)

The director column has the most missing values: 30.7% of its data is missing.

The cast column has about 9% NaN values.

The country, date_added and rating columns also contain missing values.

In [None]:
# Visualizing the missing values
# Ploting the null values present in the dataset

In [None]:
plot_nan = df.isna()
plot_nan.head(2)

In [None]:
plt.figure( figsize = (10 , 5))
sns.heatmap(plot_nan)

In [None]:
# Using barplot to check the no of NaN values present in this dataset
# null value distribution
null_counts = df.isnull().sum()/len(df)
plt.figure(figsize=(10,5))
plt.xticks(np.arange(len(null_counts)),null_counts.index,rotation='vertical')
plt.ylabel('fraction of rows with missing data')
plt.bar(np.arange(len(null_counts)),null_counts)

# director and cast contain a large number of null values, so we will drop them

In [None]:
# Dropping irrelevant features
df.drop(['director','cast'],axis=1, inplace=True)

In [None]:
# NaN values in date_added
data_added_NaN = df[df['date_added'].isna()]
data_added_NaN.head(2)

In [None]:
data_added_NaN.shape

In [None]:
# There are only 10 observations which contain NaN values in the date_added column
print(f"Before dropping the NaN values from date_added the shape was {df.shape}")
df.dropna(subset = [ 'date_added' ], inplace = True)
print(f"After dropping the NaN values from date_added now the shape is {df.shape}")

In [None]:
# looking for unique values
df.nunique()

In [None]:
#Unique values of type column
df['type'].unique()

In [None]:
# Production Growth based on type of the content & release_year
yearly_movies=df[df.type =='Movie']['release_year'].value_counts().sort_index(ascending=False).head(15)
yearly_shows=df[df.type =='TV Show']['release_year'].value_counts().sort_index(ascending=False).head(15)
total_content=df['release_year'].value_counts().sort_index(ascending=False).head(15)
yearly_movies.head()

In [None]:
sns.set(font_scale=1.4)
total_content.plot(figsize=(12, 6), linewidth=2.5, color='green',label="Total Content / year")
yearly_movies.plot(figsize=(12, 6), linewidth=2.5, color='maroon',label="Movies / year",ms=3)
yearly_shows.plot(figsize=(12, 6), linewidth=2.5, color='blue',label="TV Shows / year")
plt.xlabel("Years", labelpad=15)
plt.ylabel("Number", labelpad=15)
plt.legend()
plt.title("Production Growth Yearly", y=1.02, fontsize=22);

In [None]:
# release_year
# all unique values present in release_year
df['release_year'].unique()

In [None]:
# Checking the Datatype of release_year column
type(df['release_year'][0])

In [None]:
# value_count is on release_year
df['release_year'].value_counts().to_frame().T

In [None]:
# Checking outliers on release_year column
sns.boxplot(x= df.release_year)

# As seen earlier, production of Movies & TV Shows was much lower before 2014, which is why release years before about 2009 show up as outliers

In [None]:
type(df.release_year[0])

In [None]:
# Replacing outliers with mean value
release_year_Q1 = df.release_year.quantile(0.25)
release_year_Q3 = df.release_year.quantile(0.75)
release_year_IQR = release_year_Q3 - release_year_Q1
print(f'release_year_Q1 = {release_year_Q1}\nrelease_year_Q3 = {release_year_Q3}\nrelease_year_IQR = {release_year_IQR}')

In [None]:
# we don't have any release_year which is greater than 2018
release_year_outliers = df[(df.release_year < (release_year_Q1 - 1.5 * release_year_IQR)) |
                          ( df.release_year > (release_year_Q3 + 1.5 * release_year_IQR)) ]

In [None]:
release_year_outliers
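
The 1.5 × IQR rule used above can be checked on a small series. This is a minimal sketch: the `years` values are invented for illustration and are not taken from the dataset, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical release years, invented for illustration only
years = pd.Series([1975, 2012, 2014, 2015, 2016, 2017, 2017, 2018, 2018, 2019])

q1 = years.quantile(0.25)
q3 = years.quantile(0.75)
iqr = q3 - q1

# Tukey fences: values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = years[(years < lower_fence) | (years > upper_fence)]

print(f"fences = [{lower_fence}, {upper_fence}]")
print(outliers.tolist())
```

Only values strictly outside the fences are flagged, so a year sitting exactly on a fence is kept.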

In [None]:
# About the 15th percentile of release_year is 2009, so earlier years are replaced with the mean
df["release_year"] = np.where(df["release_year"] <2009, df.release_year.mean(),df['release_year'])

In [None]:
# Boxplot for release_year
# statistical data
df.release_year.describe()

In [None]:
sns.boxplot(x= df.release_year)

In [None]:
print(f"Datatype of release_year = ",type(df.release_year.iloc[0]))
df.release_year = df.release_year.astype("int64")
print(f"Datatype of release_year = ",type(df.release_year.iloc[0]))

# **Title**

In [None]:
# No of unique title present in title column
df.title.nunique()

In [None]:
df.shape[0]

In [None]:
# All the values present in Title are unique
#subsetting df
df_wordcloud = df['title']
text = " ".join(word for word in df_wordcloud)
# Create stopword list:
stopwords = set(STOPWORDS)
# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="black").generate(text)
# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Inference:

It seems like words like "Love", "Man", "World", "Story" , "Christmas" are very common in titles.

I was surprised to see "Christmas" occur so many times. The reason may be that those movies were released in December, but the dataset has no information about the release month of each title, so I cannot check this hypothesis.

In [None]:
# type
# Plotting a pie chart of the type feature
plt.figure(figsize=(10, 5))
# Take the labels from the counts themselves so they cannot get out of order
type_counts = df['type'].value_counts().sort_values()
plt.pie(type_counts, labels=type_counts.index, explode=[0.01,0.01],
        autopct='%1.2f%%', startangle=90)
plt.title('Type of Netflix Content')
plt.axis('equal')
plt.show()

Most of the content is Movies.

Less than ⅓ of the content is TV Shows.

In [None]:
# duration
# Checking NaN values
df.duration.isna().sum()                  # There is no NaN value present.

In [None]:
#  Checking datatype of a single duration value
type(df.duration.iloc[0])

In [None]:
# How many unique values present in duration column ??
df.duration.nunique()

In [None]:
# Using value_counts() method
df.duration.value_counts().to_frame().T

In [None]:
# define convert_seasons_to_min
def convert_seasons_to_min(value):
  """
  This function will calculate no of total mins as per season no.
  Here our assumptions are
    1. on average 5 episodes are there in a season.
    2. each episode avg time is 55 mins.
  """
  no_of_avg_episode = 5
  if "Seasons" in value:
    #containing more than 1 seasons
    value = value.replace("Seasons",'')
    value = value.replace(" ","")
    total_seasons = int(value)
    each_season_mins = ( no_of_avg_episode * 55 )
    total_mins = (total_seasons * each_season_mins)
    return total_mins

  elif "Season" in value:
    # containing only 1 season
    value = value.replace("Season",'')
    value = value.replace(" ","")
    total_mins = (no_of_avg_episode * 55)
    return total_mins

In [None]:
#Checking the function
convert_seasons_to_min("4 Seasons")

"4 Seasons" :

4 Seasons = (4 × 5) or 20 episodes

Each episode avg. time is 55 mins.

Total time (in minutes) = (55 × 20) min
= 1100 mins
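
The arithmetic above can also be written as one small helper. This is a sketch under the notebook's own assumptions (5 episodes per season, 55 minutes per episode); the name `duration_to_minutes` and the regular expression are illustrative and not part of the original code.

```python
import re

EPISODES_PER_SEASON = 5   # assumption stated in the notebook
MINUTES_PER_EPISODE = 55  # assumption stated in the notebook

def duration_to_minutes(value):
    """Convert '90 min' or 'N Season(s)' into an estimated number of minutes."""
    match = re.fullmatch(r"\s*(\d+)\s*(min|Seasons?)\s*", value)
    if match is None:
        # Fail loudly instead of silently returning None for unexpected input
        raise ValueError(f"Unrecognised duration: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "min":
        return amount
    return amount * EPISODES_PER_SEASON * MINUTES_PER_EPISODE

print(duration_to_minutes("4 Seasons"))
print(duration_to_minutes("1 Season"))
print(duration_to_minutes("90 min"))
```

Raising on unrecognised input is a design choice of this sketch; it makes malformed duration strings visible early.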

In [None]:
# define all_the_duration_in_minutes
def all_the_duration_in_minutes():
  """
  This function will convert all the duration
  whether it's in minutes or season format to minute
  """
  # replaced all the min with null string
  df['duration'] = df.duration.str.replace(" min" , "")
  # this time_list will contain all the value
  time_list =[]
  for time in df.duration.values:
    if "Season" in time:
      #time is containing Season
      # calling convert_seasons_to_min function to convert
      # season to total min
      time = convert_seasons_to_min(time)
    else:
      #replacing single space with ""
      time = time.replace(" ","")
    #appending time (it's not containing words like min or seasons)
    time_list.append(time)

  #converting all the time into integer format
  time_list = [ int(Time) for Time in time_list]

  #Assigning time_list to df.duration
  df.duration = time_list

In [None]:
df.duration.value_counts().to_frame().T

In [None]:
all_the_duration_in_minutes()

In [None]:
df.duration.value_counts().to_frame().T

In [None]:
# Analysis on the duration of the movies
sns.set(style="darkgrid")
plt.figure(figsize = (8,5))
sns.kdeplot(data = df.duration[df['type'] == 'Movie'] , fill=True)  # `shade` is deprecated in favour of `fill`

Most movies are about 70 to 120 minutes long.

In [None]:
# Analysis on the duration of the TV-Shows
df['type'].value_counts()

In [None]:
sns.set(style="darkgrid")
plt.figure(figsize = (10,5))
sns.kdeplot(data = df.duration[df['type'] == 'TV Show'] , fill=True)  # `shade` is deprecated in favour of `fill`

In [None]:
# listed_in
# How many unique values present in listed_in ??
df.listed_in.nunique()

In [None]:
# How many NaN values present in listed_in ?
df.listed_in.isna().sum()

In [None]:
# Value_counts()
df.listed_in.value_counts().to_frame().T

In [None]:
# Making Categories
categories = ", ".join(df['listed_in']).split(", ")
categories[:5]

In [None]:
len(categories)

In [None]:
len(set(categories))

There are 42 unique categories in this dataset, and together they occur 17051 times.

Creating a dictionary (category_wise_count)
that stores, for each category, how many times that category occurs.
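
The same kind of tally can also be produced with `collections.Counter` from the standard library. The sketch below uses a few made-up `listed_in` strings rather than the real column, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical listed_in values, invented for illustration
listed_in = [
    "Dramas, International Movies",
    "Comedies, Dramas",
    "Documentaries",
]

# Split each entry on ", " and count every category across all titles
category_counts = Counter(
    category for entry in listed_in for category in entry.split(", ")
)

print(category_counts.most_common(2))
```

`Counter` makes a single pass over the data, whereas calling `list.count` once per category rescans the whole list each time.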

In [None]:
category_wise_count = {}
for category in set(categories):
  category_wise_count[category] = categories.count(category)

category_wise_count

In [None]:
# Sorting category_wise_count by value
sorted_category_wise_count = sorted(category_wise_count.items(), key=lambda x: x[1])
sorted_category_wise_count[:4]

In [None]:
# Top 5 most occurred category
sorted_category_wise_count[-5:]

In [None]:
# Top 10 most occurred categories
top_10_most_occurred_categories = sorted_category_wise_count[-10:]
top_10_most_occurred_categories

In [None]:
top_10_most_occurred_category_name = []
top_10_most_occurred_category_count = []
for tup in top_10_most_occurred_categories:
  top_10_most_occurred_category_name.append(tup[0])
  top_10_most_occurred_category_count.append(tup[1])

top_10_most_occurred_category_name

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.description.iloc[0]

In [None]:
First_des = df.description.iloc[0]
First_des

In [None]:
# Importing necessary libraries
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

In [None]:
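import nltk
# Download only the stopword list; a bare nltk.download() opens an interactive downloader
nltk.download('stopwords')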

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Removing punctuations
def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuations with no space,
    # which in effect deletes the punctuation marks
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)

In [None]:
df['description'] = df['description'].apply(remove_punctuation)
df.head()

In [None]:
# Removing stopwords
import nltk
nltk.download('stopwords')

In [None]:
# extracting the stopwords from nltk library
sw = nltk.corpus.stopwords.words('english')
# displaying the stopwords
for i in sw:
  print(i , end=',  ')

In [None]:
print("Number of stopwords in english : ", len(sw))

In [None]:
def remove_stopwords(text):
    '''a function for removing the stopword'''
    # removing the stop words and lowercasing the selected words
    #Method 1
    text1 = [word.lower() for word in text.split() if word.lower() not in sw]
    # joining the list of words with space separator
    return " ".join(text1)

In [None]:
df['description'] = df['description'].apply( remove_stopwords )
df.head()

Now all the values of description are free of punctuation and stopwords.

In [None]:
# Using CountVectorizer() to build the vocabulary

# Create a count vectorizer object
count_vectorizer = CountVectorizer()
# fit the count vectorizer using the text data
count_vectorizer.fit(df['description'])
# Collect the vocabulary items used in the vectorizer
# (vocabulary_ maps each term to its column index, not to its frequency)
dictionary = count_vectorizer.vocabulary_.items()

In [None]:
dictionary

In [None]:
vocab = [ ]
count_of_vocab = []
for key , value in dictionary:
  vocab.append( key )
  count_of_vocab.append( value )

In [None]:
# Creating a new DataFrame vocab_before_stemming

# Store the count in pandas dataframe with vocab as index
vocab_before_stemming = pd.DataFrame({"Word": vocab ,
                                      "count" :count_of_vocab})
# Sort the dataframe
vocab_before_stemming = vocab_before_stemming.sort_values("count" ,ascending=False)
vocab_before_stemming.head(4)

In [None]:
vocab_before_stemming.head(20).T

In [None]:
vocab_before_stemming.tail(4)

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 - TOP 10 Most Occurred Category By Count
plt.figure( figsize = (10,5))
color=['lightpink', 'violet', 'green', 'blue', 'cyan' , "lightblue" ,'red', 'pink', 'yellow', 'orange']
plt.barh(top_10_most_occurred_category_name , top_10_most_occurred_category_count ,
        color= color)
plt.title("Top 10 Most Popular Categories",fontsize = 19)
plt.xlabel("Count", fontsize = 14 )
plt.ylabel("Category Name" , fontsize = 14 )
plt.show()

In [None]:
# Creating a new column no_of_category
# Datatype of listed_in values
type(df.listed_in.iloc[0])

In [None]:
(df.listed_in.iloc[0])

In [None]:
(df.listed_in.iloc[0]).split(",")

In [None]:
len((df.listed_in.iloc[0]).split(","))

In [None]:
no_of_category = []
for categories in df.listed_in.values:
  len_categories = len(categories.split(","))
  no_of_category.append(len_categories)
df['no_of_category'] = no_of_category

In [None]:
df[['listed_in' , 'no_of_category']].head()

#### Chart - 2

In [None]:
# Chart - 2 - Histogram of no_of_category using listed_in
df.no_of_category.unique()

In [None]:
df.no_of_category.value_counts()

In [None]:
sns.set(style="darkgrid")
plt.figure(figsize = (10,5))
plt.hist(df.no_of_category , bins=[1,2,3,4] , range = (1 ,4) , rwidth = 0.85, color ='lightblue')
plt.xlabel("No of categories")
plt.ylabel("Count")

#### Chart - 3

In [None]:
# Chart - 3 - Most popular TV-Shows Rating
df['type'].unique()

In [None]:
df_tv_show = df[df['type']== 'TV Show' ]
df_tv_show.head(2)

In [None]:
#Pointplot on top tv show ratings
tv_ratings = df_tv_show.groupby(['rating'])['show_id'].count().reset_index(name = 'count').sort_values(by = 'count', ascending = False)
fig_dims = (7,4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.pointplot(x='rating',y='count',data=tv_ratings)
plt.title('Top TV Show Ratings Based On Rating System',size='15')
plt.show()

#### Chart - 4

In [None]:
# Chart - 4 - Most popular Movies Rating
df_movies = df[df['type'] == 'Movie' ]
df_movies.head(2)

In [None]:
#Pointplot on top movie ratings
movie_ratings = df_movies.groupby(['rating'])['show_id'].count().reset_index(name = 'count').sort_values(by = 'count', ascending = False)
fig_dims = (10,5)
fig, ax = plt.subplots(figsize=fig_dims)
sns.pointplot(x='rating',y='count',data=movie_ratings)
plt.title('Top Movie Ratings Based On Rating System',size=20)
plt.show()

Most of the content carries one of these ratings:

*   TV-MA (For Mature Audiences)
*   TV-14 (May be unsuitable for children under 14)
*   TV-PG (Parental Guidance Suggested)
*   NR (Not Rated)

In [None]:
# Creating a new DataFrame vocab_before_stemming
# Store the count in pandas dataframe with vocab as index
vocab_before_stemming = pd.DataFrame({"Word": vocab ,
                                      "count" :count_of_vocab})
# Sort the dataframe
vocab_before_stemming = vocab_before_stemming.sort_values("count" ,ascending=False)

#### Chart - 5

In [None]:
# Chart - 5 - TOP 15 most occurred words
top15_most_ocurred_vacab = vocab_before_stemming.head(15)

In [None]:
top15_most_occurred_words = top15_most_ocurred_vacab.Word.values
top15_most_occurred_words

In [None]:
top15_most_occurred_words_count = top15_most_ocurred_vacab['count'].values
top15_most_occurred_words_count

In [None]:
plt.figure( figsize = ( 10,5 ))
plt.xlim(19550, 19600)
plt.barh(top15_most_occurred_words , top15_most_occurred_words_count )

In [None]:
# Now will use SnowballStemmer( 'english' )
# Create an object of stemming function
stemmer = SnowballStemmer("english")

In [None]:
def Apply_stemming(text):
    '''a function which stems each word in the given text'''
    text = [stemmer.stem(word) for word in text.split()]
    return " ".join(text)

In [None]:
#Stemming for description
df['description'] = df['description'].apply( Apply_stemming )
df.head()

In [None]:
# Now again will use TfidfVectorizer (after stemming)
# Create the object of tfid vectorizer
tfid_vectorizer = TfidfVectorizer()

# Fit the vectorizer using the text data
tfid_vectorizer.fit(df['description'])

# Collect the vocabulary items used in the vectorizer
dictionary = tfid_vectorizer.vocabulary_.items()

In [None]:
# Lists to store the vocab and counts
vocab = []
count_of_vocab = []
# Iterate through each vocab and count append the value to designated lists
for key, value in dictionary:
    vocab.append(key)
    count_of_vocab.append(value)

#### Chart - 6

In [None]:
# Chart - 6 -Creating a new DataFrame vocab_after_stemming
# Store the count in pandas dataframe with vocab as index
vocab_after_stemming = pd.DataFrame({"Word": vocab ,
                                      "count" :count_of_vocab})
# Sort the dataframe
vocab_after_stemming = vocab_after_stemming.sort_values("count" ,ascending=False)

In [None]:
top15_most_ocurred_vocab = vocab_after_stemming.head(15)
top15_most_occurred_words = top15_most_ocurred_vocab.Word.values
top15_most_occurred_words

In [None]:
top15_most_occurred_words_count = top15_most_ocurred_vocab['count'].values
top15_most_occurred_words_count

In [None]:
plt.figure( figsize = ( 10,5 ))
plt.xlim(14225, 14241)
plt.barh(top15_most_occurred_words , top15_most_occurred_words_count)

In [None]:
# Adding a new column length which will contain length of description
df['Length(description)'] = df['description'].apply(lambda x: len(x))
df.head(3)

In [None]:
len(df.description.iloc[0])

In [None]:
# listed_in
# Removing punctuations
df.columns

In [None]:
df['listed_in'] = df['listed_in'].apply(remove_punctuation)
df.head()

In [None]:
# Removing stopwords
#Remove stopwords for listed_in
df['listed_in'] = df['listed_in'].apply( remove_stopwords )
df.head( 2 )

In [None]:
# Using CountVectorizer() to count vocabulary items
# Create a count vectorizer object
count_vectorizer = CountVectorizer()
# fit the count vectorizer using the text data
count_vectorizer.fit(df['listed_in'])
# Collect the vocabulary items used in the vectorizer
dictionary = count_vectorizer.vocabulary_.items()
dictionary

In [None]:
vocab = [ ]
count_of_vocab = []
for key , value in dictionary:
  vocab.append( key )
  count_of_vocab.append( value )

In [None]:
listed_in_vocab_before_stem = pd.DataFrame({"Word": vocab , "count" :count_of_vocab})

listed_in_vocab_before_stem = listed_in_vocab_before_stem.sort_values("count" ,ascending=False)

In [None]:
listed_in_vocab_before_stem.head()

In [None]:
listed_in_vocab_before_stem.tail()

#### Chart - 7

In [None]:
# Chart - 7 - TOP 15 most occurred words in listed_in
top15_most_ocurred_vocab_listed_in = listed_in_vocab_before_stem.head(15)
top15_most_ocurred_words_listed_in = top15_most_ocurred_vocab_listed_in.Word.values
top15_most_ocurred_words_listed_in
top15_most_occurred_words_in_listed_in_count = top15_most_ocurred_vocab_listed_in['count'].values
top15_most_occurred_words_in_listed_in_count

In [None]:
plt.figure( figsize = ( 10,5 ))
plt.xlim(25, 42 )
plt.barh(top15_most_ocurred_words_listed_in , top15_most_occurred_words_in_listed_in_count )

In [None]:
# Now will use SnowballStemmer( 'english' )
# Stemming for listed_in
df['listed_in'] = df['listed_in'].apply( Apply_stemming )
df.head(3)

In [None]:
# Now will use TfidfVectorizer (after stemming)
# Create the object of tfid vectorizer
tfid_vectorizer = TfidfVectorizer()

# Fit the vectorizer using the text data
tfid_vectorizer.fit(df['listed_in'])

# Collect the vocabulary items used in the vectorizer
dictionary = tfid_vectorizer.vocabulary_.items()
dictionary

In [None]:
# Lists to store the vocab and counts
vocab = []
count_of_vocab = []
# Iterate through each vocab and count append the value to designated lists
for key, value in dictionary:
    vocab.append(key)
    count_of_vocab.append(value)

In [None]:
# Creating a new DataFrame vocab_after_stemming_listed_in
vocab_after_stemming_listed_in = pd.DataFrame({"Word": vocab , "count" :count_of_vocab})
# Sort the dataframe by count
vocab_after_stemming_listed_in = vocab_after_stemming_listed_in.sort_values("count" ,ascending=False)

In [None]:
top15_most_ocurred_vocab_lised_in_after_stem = vocab_after_stemming_listed_in.head(15)
top15_most_ocurred_vocab_lised_in_after_stem_word = top15_most_ocurred_vocab_lised_in_after_stem.Word.values
top15_most_ocurred_vocab_lised_in_after_stem_word

In [None]:
top15_most_occurred_words_listed_in_count = top15_most_ocurred_vocab_lised_in_after_stem['count'].values
top15_most_occurred_words_listed_in_count

#### Chart - 8

In [None]:
# Chart - 8 - top vocab present in listed_in (after stemming)
plt.figure( figsize = ( 10,5 ))
plt.xlim(25, 40 )
plt.barh(top15_most_ocurred_vocab_lised_in_after_stem_word , top15_most_occurred_words_listed_in_count )

In [None]:
# Adding a new column length( listed-in ) which will contain length of listed_in
df['Length(listed-in)'] = df['listed_in'].apply(lambda x: len(x))
df.head(3)

In [None]:
df.columns

In [None]:
df[['description', 'Length(description)', 'listed_in' ,'Length(listed-in)' ]].head(3)

In [None]:
X_features_rec = df[['no_of_category' ,'Length(description)','Length(listed-in)']]
stdscaler = preprocessing.StandardScaler()
X_features_rec.describe()

In [None]:
X_rescale=stdscaler.fit_transform(X_features_rec)
X=X_rescale
silhouette_score_ = [  ]
range_n_clusters = [i for i in range(2,16)]

In [None]:
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)  # fixed seed so the scores are reproducible
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    silhouette_score_.append([int(n_clusters) , round(score , 3)])
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

In [None]:
temp = pd.DataFrame(silhouette_score_ , columns = ["n clusters" , "silhouette score"])
temp = temp.sort_values( "silhouette score" , ascending = False )
temp.head(14)

NOTE :-
The silhouette coefficient lies in [-1, 1]. A score near 1 is best: the data point is compact within its own cluster and far from the other clusters. The worst value is -1, and values near 0 indicate overlapping clusters.

In [None]:
range_n_clusters = [i for i in range(2,16)]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

The silhouette score evaluates the quality of clusters produced by algorithms such as K-Means by measuring how well each sample fits its own cluster compared with the other clusters. It is calculated per sample. For each observation, two distances are needed:

1. The mean distance between the observation and all other data points in the same cluster. This is the mean intra-cluster distance, denoted by a.
2. The mean distance between the observation and all data points of the nearest neighbouring cluster. This is the mean nearest-cluster distance, denoted by b.

The Silhouette Coefficient for a sample is $S = (b - a) / \max(a, b)$. It ranges from -1 to 1; values close to 1 mean the sample is much closer to its own cluster than to the nearest other cluster.
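
As a minimal sketch of the formula above, the following computes a, b, and S by hand for a tiny toy dataset. The function name `silhouette_coefficients` and the four toy points are assumptions made for illustration; they are not part of the project data. The sketch uses only the standard library and, like scikit-learn, assigns 0 to a sample that is alone in its cluster.

```python
import math


def silhouette_coefficients(points, labels):
    """Return S = (b - a) / max(a, b) for every point (illustrative helper)."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster, excluding p itself.
        mean_dist = {}
        for c in clusters:
            dists = [math.dist(p, q)
                     for j, q in enumerate(points)
                     if labels[j] == c and j != i]
            mean_dist[c] = sum(dists) / len(dists) if dists else None

        a = mean_dist[labels[i]]  # mean intra-cluster distance
        if a is None:
            # p is the only member of its cluster: the coefficient is set to 0.
            scores.append(0.0)
            continue

        # Mean distance to the nearest other cluster.
        b = min(d for c, d in mean_dist.items() if c != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy data (hypothetical): two well-separated pairs of points.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels = [0, 0, 1, 1]

scores = silhouette_coefficients(points, labels)
print(round(sum(scores) / len(scores), 3))  # prints 0.802
```

For the first point, a = 1 (distance to its only cluster mate) and b is the mean of 5 and √26, so S = (b - 1) / b ≈ 0.802. By symmetry, all four points get the same value.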

#### Chart - 9

In [None]:
# Chart - 9 - Elbow Method

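# Fit K-Means for k = 1..14 and record the inertia
# (within-cluster sum of squared distances) for each k.
sum_of_sq_dist = {}
for k in range(1, 15):
    # random_state is fixed so the curve is reproducible between runs
    km = KMeans(n_clusters=k, init='k-means++', max_iter=1000, random_state=10)
    km.fit(X)
    sum_of_sq_dist[k] = km.inertia_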

#Plot the graph for the sum of square distance values and Number of Clusters
sns.pointplot(x = list(sum_of_sq_dist.keys()), y = list(sum_of_sq_dist.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()
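
The quantity plotted in the elbow chart above, `inertia_`, is the sum of squared distances from each sample to its nearest cluster centre. The sketch below recomputes it by hand on a toy dataset to show why the curve drops sharply and then flattens. The helper `inertia`, the toy points, and the hand-picked centres are assumptions for illustration only; in the notebook the centres come from K-Means itself.

```python
def inertia(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            for c in centers
        )
    return total


# Toy data (hypothetical): two tight pairs of points, far apart from each other.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Hand-picked centres standing in for what K-Means would find.
one_center = [(5.0, 1.0)]                  # k = 1: the overall mean
two_centers = [(0.0, 1.0), (10.0, 1.0)]    # k = 2: one centre per pair

inertia_k1 = inertia(toy_points, one_center)
inertia_k2 = inertia(toy_points, two_centers)
inertia_k4 = inertia(toy_points, toy_points)  # k = 4: every point is its own centre

print(inertia_k1, inertia_k2, inertia_k4)  # prints 104.0 4.0 0.0
```

Going from k = 1 to k = 2 removes almost all of the inertia, while adding further clusters gains very little. That bend in the curve is the "elbow" used to choose k.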

#### Chart - 10

In [None]:
# Chart - 10 - K-Means with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=10)
y_kmeans = kmeans.fit_predict(X)

plt.figure(figsize=(10, 6))
plt.title('description and listed_in')
# Scatter the first two feature dimensions, coloured by cluster label
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='spring')

# Overlay the cluster centres
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
plt.show()

#### Chart - 11

In [None]:
# Chart - 11- DBSCAN

from sklearn.cluster import DBSCAN
from sklearn import metrics
y_pred = DBSCAN(eps=0.5, min_samples=15).fit_predict(X)
plt.figure(figsize=(8, 5))
# DBSCAN labels noise points as -1; with the default colormap they take the darkest colour
plt.scatter(X[:, 1], X[:, 2], c=y_pred)
plt.show()

#### Chart - 12

In [None]:
# Chart - 12 - Dendrogram
import scipy.cluster.hierarchy as sch

plt.figure(figsize=(10, 5))
# Build the hierarchy with single linkage and plot it
dendrogram = sch.dendrogram(sch.linkage(X, method='single'))
plt.title('Dendrogram')
plt.xlabel('Content')
plt.ylabel('Euclidean Distances')
# To pick a threshold, find the largest vertical distance that no horizontal line crosses
plt.show()

The number of clusters is the number of vertical lines intersected by a horizontal line drawn at the chosen threshold.

Number of clusters = 3
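
Cutting a dendrogram at a threshold can also be done in code rather than by eye. For single linkage (the method used in the dendrogram above), cutting at height t groups together all points that can be linked through a chain of neighbours no more than t apart. The sketch below implements that idea with a small union-find structure. The function name `single_linkage_cut`, the toy points, and the thresholds are assumptions for illustration and do not come from the project data.

```python
import math


def single_linkage_cut(points, threshold):
    """Label points by cutting a single-linkage hierarchy at `threshold`."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of points whose distance does not exceed the threshold.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)

    # Relabel the roots as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(points))]


# Toy data (hypothetical): three pairs of points spaced 5 units apart.
toy = [(0, 0), (0, 1), (5, 0), (5, 1), (10, 0), (10, 1)]

print(single_linkage_cut(toy, 2.0))            # prints [0, 0, 1, 1, 2, 2]
print(len(set(single_linkage_cut(toy, 6.0))))  # prints 1
```

With a threshold of 2.0 the cut yields three clusters; raising it to 6.0 merges everything into one. On real data, SciPy's `scipy.cluster.hierarchy.fcluster` with `criterion='distance'` performs the same kind of cut on a linkage matrix and is likely the more practical choice.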

#### Chart - 13

In [None]:
# Chart - 13 - Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering
# Ward linkage always uses Euclidean distance. The `affinity` argument is omitted
# because newer scikit-learn releases no longer accept it (it was renamed `metric`).
hc = AgglomerativeClustering(n_clusters=2, linkage='ward')
y_hc = hc.fit_predict(X)

In [None]:
# Visualizing the clusters (first two dimensions only)
plt.figure(figsize=(10,5))
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = '1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = '2')
# plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = '3')

plt.title('Clusters of content')

plt.legend()
plt.show()

# **Conclusion**


1. The director and cast columns contain a large number of null values, so we drop these 2 columns.

2. The dataset contains two types of content: 30.86% are TV shows and the remaining 69.14% are movies.

3. From the analysis of content added over the years, we conclude that Netflix is focusing on both movies and TV shows (the data from 2016 show an increase of 80% for movies and 73% for TV shows).

4. From the dataset insights we can conclude that the largest number of TV shows were released in 2017 and the largest number of movies in 2020.

5. On Netflix, the USA has the largest amount of content, and most countries produce more movies than TV shows.

6. Most of the movies belong to 3 categories.

7. The top 3 content categories are International Movies, Dramas, and Comedies.

8. For text analysis (NLP), I removed stop words and punctuation, applied stemming, and used a TF-IDF vectorizer along with other NLP functions.

9. We applied several clustering models (K-Means, hierarchical/agglomerative clustering, and DBSCAN) to the data to find the best cluster arrangement.

10. Across the clustering algorithms applied to the dataset, the optimal number of clusters is 3.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***