# **Project Name**    - Netflix Movies and TV Shows Clustering



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -** Dipak Someshwar

# **Project Summary -**

The goal of this project is to analyze the Netflix catalog of movies and TV shows, sourced from the third-party search engine Flixable, and group the titles into relevant clusters. Such clusters can help enhance the user experience and reduce subscriber churn for Netflix, the world's largest online streaming service provider, which had over 220 million subscribers as of 2022-Q2. The dataset, which includes movies and TV shows as of 2019, will be analyzed to uncover insights and trends in the rapidly growing world of streaming entertainment.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Netflix is a streaming service that offers a wide variety of television shows and movies for viewers to watch at their convenience. With a monthly subscription, users have access to a vast library of content, including original series and films produced by Netflix. The platform also allows users to create multiple profiles, making it easy for family members or roommates to have their own personalized viewing experience. Additionally, Netflix allows users to download content to watch offline, making it a great option for those who travel frequently or have limited internet access. Overall, Netflix is a convenient and cost-effective way to access a wide variety of entertainment.**

**As of 2022-Q2, more than 220 million people had signed up for Netflix's online streaming service, making it the largest OTT provider worldwide. To improve the user experience and prevent subscriber churn, they must efficiently cluster the shows hosted on their platform.**

**By creating clusters, we will be able to comprehend the shows that are alike and different from one another. These clusters can be used to provide customers with individualized show recommendations based on their preferences.**

**This project aims to group Netflix shows into clusters such that shows in the same cluster are similar to one another and shows in different clusters are dissimilar.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# Libraries that are used for analysis and visualization
import numpy as np
import pandas as pd
import math  # standard-library math; the numpy.math alias is not available in recent NumPy versions
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from matplotlib.pyplot import figure
import plotly.graph_objects as go
import plotly.offline as py
import plotly.express as px
from datetime import datetime as dt

# Word Cloud library
from wordcloud import WordCloud, STOPWORDS
import re, string, unicodedata

# Libraries used to process textual data
import nltk
from bs4 import BeautifulSoup
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA

# Download the NLTK resources needed for stop-word removal and lemmatization
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

# Libraries used to implement clusters
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc

# Library of warnings would assist in ignoring warnings issued
import warnings
warnings.filterwarnings('ignore')

### Mount the drive & Dataset Loading

In [None]:
# Mount Google Drive to import the dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
netflix_df = pd.read_csv('/content/drive/MyDrive/AlmaBetter/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')
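
Since the guidelines above reward exception handling, the load step could be wrapped in a small helper that fails early with a clear message. This is only a sketch: `load_dataset` is a hypothetical helper name, and the sample file written below is made-up data used purely to show the behaviour.

```python
import os
import tempfile

import pandas as pd


def load_dataset(path):
    """Read a CSV into a DataFrame, failing early with a clear message."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Dataset not found at: {path}")
    df = pd.read_csv(path)
    if df.empty:
        raise ValueError(f"Dataset at {path} contains no rows")
    return df


# Demonstration on a throwaway file (hypothetical two-row sample)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w") as fh:
        fh.write("show_id,type,title\ns1,Movie,A\ns2,TV Show,B\n")
    sample_df = load_dataset(sample_path)

print(sample_df.shape)
```

In the notebook itself, the same helper could be called with the Drive path used in the cell above.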

### Dataset First View

In [None]:
# Dataset First Look
netflix_df.head()

In [None]:
# Dataset last 5 rows
netflix_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(netflix_df.shape)

### Dataset Information

In [None]:
# Dataset Info
netflix_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate or Non Duplicate Value Count
netflix_df.duplicated().value_counts()

In [None]:
# Dataset Duplicate Value Count
len(netflix_df[netflix_df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(netflix_df.isnull().sum())

In [None]:
# Missing Values Percentage
round(netflix_df.isna().sum()/len(netflix_df)*100, 2)

In [None]:
# Visualizing the missing values
sns.heatmap(netflix_df.isnull())

In [None]:
# Visualizing the missing values
import missingno as msno
msno.bar(netflix_df, color='orange',sort='ascending', figsize=(10,4), fontsize=15)

### What did you know about your dataset?

The given dataset is from the online streaming industry. Our task is to examine the dataset, build clustering models, and build a content-based recommendation system.

Clustering is a technique used in machine learning and data mining to group similar data points together. A clustering algorithm is a method or technique used to identify clusters within a dataset. These clusters represent natural groupings of the data, and the goal of clustering is to discover these groupings without any prior knowledge of the groupings.
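
To make the idea concrete, here is a minimal sketch of the assign-and-update loop behind k-means, one common clustering algorithm, on toy one-dimensional data. The function name `kmeans_1d`, the sample durations, and the starting centroids are illustrative assumptions and not part of this project's pipeline.

```python
def kmeans_1d(points, centroids, n_iter=10):
    """Tiny Lloyd-style k-means on 1-D data; `centroids` are the starting guesses."""
    centroids = list(centroids)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = sum(members) / len(members)
    return labels, centroids


# Hypothetical durations in minutes: short titles vs. feature-length titles
durations = [22, 25, 30, 95, 100, 110]
labels, centroids = kmeans_1d(durations, centroids=[20.0, 120.0])
print(labels)
print(centroids)
```

The short and long durations end up in separate groups without any labels being supplied, which is the sense in which clustering is unsupervised.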

There are 7787 rows and 12 columns in the dataset. In the director, cast, country, date_added, and rating columns, there are missing values. The dataset does not contain any duplicate values.

Each row describes a specific movie or TV show, so missing values cannot be reliably inferred from other rows. Because the dataset is small, we also want to avoid discarding rows where possible. After analyzing each column, we therefore fill most missing categorical values with a placeholder, rather than dropping rows, in the steps that follow.
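
As a small illustration of the missing-value checks used above, the counts and percentages can be combined into one summary table. The toy DataFrame below is made-up data, not the Netflix dataset.

```python
import pandas as pd

# Hypothetical four-row sample with gaps in two columns
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "rating": ["TV-MA", "TV-14", None, "TV-MA"],
})

# isna().sum() counts missing cells; isna().mean() gives the missing fraction
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(2),
})
print(summary)
```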

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
netflix_df.columns

In [None]:
# Dataset Describe
netflix_df.describe(include='all').T.style.background_gradient()

### Variables Description

**show_id** : Unique ID for every Movie/Show

**type** : Identifier - Movie/Show

**title** : Title of the Movie/Show

**director** : Director of the Movie/Show

**cast** : Actors involved in the Movie/Show

**country** : Country where the Movie/Show was produced

**date_added** : Date it was added on Netflix

**release_year** : Actual Release year of the Movie/Show

**rating** : TV Rating of the Movie/Show

**duration** : Total Duration - in minutes or number of seasons

**listed_in** : Genre

**description** : The Summary description

### Check Unique Values for each variable.

In [None]:
# Check Number of Unique Values for each variable.
for i in netflix_df.columns:
  print("No. of unique values in ",i,"is",netflix_df[i].nunique(),".")

In [None]:
# Check Unique Values for each variable.
for i in netflix_df.columns:
  print(str(i) + ' : ' + str(netflix_df[i].unique()))
  print('____________________________________________')

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

#total null values
netflix_df.isnull().sum().sum()

In [None]:
# Handling the missing values
netflix_df[['director','cast','country']] = netflix_df[['director','cast','country']].fillna('Unknown')
netflix_df['rating'] = netflix_df['rating'].fillna(netflix_df['rating'].mode()[0])
netflix_df.dropna(axis=0, inplace = True)

In [None]:
# Check again whether any null values remain
netflix_df.isnull().sum()

In [None]:
# Dataset Rows & Columns count
netflix_df.shape

## Feature engineering

In [None]:
# Top countries
netflix_df.country.value_counts()

In [None]:
# Genre of shows
netflix_df.listed_in.value_counts()

In [None]:
# Choosing the primary country and primary genre to simplify the analysis
netflix_df['country'] = netflix_df['country'].apply(lambda x: x.split(',')[0])
netflix_df['listed_in'] = netflix_df['listed_in'].apply(lambda x: x.split(',')[0])
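
The same first-entry extraction can also be written with pandas' vectorized string methods, which additionally makes it easy to strip stray whitespace. The genre strings below are hypothetical examples in the style of the `listed_in` column.

```python
import pandas as pd

# Hypothetical comma-separated genre strings
genres = pd.Series([
    "Dramas, International Movies",
    "Comedies",
    "Kids' TV, TV Comedies",
])

# Split on commas, keep the first entry, and trim surrounding whitespace
primary = genres.str.split(",").str[0].str.strip()
print(primary.tolist())
```

Note that this keeps only the first listed value, so any secondary countries or genres are discarded.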

In [None]:
# Country in which a movie was produced
netflix_df.country.value_counts()

In [None]:
# genre of shows
netflix_df.listed_in.value_counts()

In [None]:
# Splitting the duration column, and changing the datatype to integer
netflix_df['duration'] = netflix_df['duration'].apply(lambda x: int(x.split()[0]))
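
Because `duration` mixes minutes (movies) and seasons (TV shows), it can help to keep the unit alongside the number. The following is a sketch on hypothetical values using `Series.str.extract`; the column names `value` and `unit` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical duration strings in the dataset's style
duration = pd.Series(["90 min", "1 Season", "3 Seasons"])

# Named groups become the column names of the resulting DataFrame
parts = duration.str.extract(r"^(?P<value>\d+)\s+(?P<unit>\w+)$")
parts["value"] = parts["value"].astype(int)
print(parts)
```

Keeping the unit makes it explicit that a value of 3 for a TV show and 90 for a movie are not on the same scale.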

In [None]:
# Number of seasons for tv shows
netflix_df[netflix_df['type']=='TV Show'].duration.value_counts()

In [None]:
# Movie length in minutes
netflix_df[netflix_df['type']=='Movie'].duration.unique()

## Date_added column string to Datetime format

In [None]:
# Typecasting 'date_added' from string to datetime.
# Stripping stray whitespace first and using errors='coerce' turns unparseable values into NaT instead of raising.
netflix_df['date_added'] = pd.to_datetime(netflix_df['date_added'].str.strip(), errors='coerce')

In [None]:
# first and last date on which a show was added on Netflix
netflix_df.date_added.min(),netflix_df.date_added.max()

In [None]:
# Adding new attributes month and year of date added

netflix_df['month_added'] = netflix_df['date_added'].dt.month
netflix_df['year_added'] = netflix_df['date_added'].dt.year
netflix_df.drop('date_added', axis=1, inplace=True)
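
The date handling above can be sketched on a few made-up strings. The explicit `format` is an assumption about how the dates are written (month name, day, year); values that do not match become `NaT` because of `errors="coerce"`.

```python
import pandas as pd

# Hypothetical raw values, including leading whitespace and an invalid entry
raw = pd.Series([" August 4, 2017", "December 15, 2018", "not a date"])

# Strip whitespace, then parse; non-matching values are coerced to NaT
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# Derive month and year features (these are floats when NaT is present)
features = pd.DataFrame({
    "month_added": parsed.dt.month,
    "year_added": parsed.dt.year,
})
print(features)
```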

In [None]:
# Dataset Rows & Columns count
netflix_df.shape

In [None]:
# Dataset Info
netflix_df.info()

### What all manipulations have you done and insights you found?

Missing values are present in the director, cast, country, date_added, and rating columns.

Manipulations performed:

* Filled missing values in the director, cast, and country columns with 'Unknown', filled missing rating values with the mode, and dropped rows with a missing date_added.

* Kept only the first listed country and genre for each title, and converted duration to an integer.

* Typecast 'date_added' from string to datetime.

* Added new features for the month and year of date_added.
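
The missing-value steps listed above can be traced on a toy example. The data below is made up, and `dropna(subset=...)` is used here as a narrower alternative to dropping on all columns, so that only rows lacking `date_added` are removed.

```python
import pandas as pd

# Hypothetical sample with one gap per column
toy = pd.DataFrame({
    "director": ["X", None, "Y", "Z"],
    "rating": ["TV-MA", "TV-MA", None, "TV-14"],
    "date_added": ["January 1, 2020", "May 5, 2019", "June 6, 2018", None],
})

# Placeholder for unknown text values
toy["director"] = toy["director"].fillna("Unknown")
# Most frequent value (mode) for the rating
toy["rating"] = toy["rating"].fillna(toy["rating"].mode()[0])
# Drop only the rows that lack a date
toy = toy.dropna(subset=["date_added"])
print(toy)
```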


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Number of Movies and TV Shows in the dataset
plt.figure(figsize=(7,5))
netflix_df.type.value_counts().plot(kind='pie',autopct='%1.2f%%')
plt.ylabel('')
plt.title('Movies and TV Shows in the dataset')

##### 1. Why did you pick the specific chart?

A pie chart expresses a part-to-whole relationship in the data. It makes percentage comparisons easy to read as colored slices of a circle, so I used it to compare the share of each content type in the dataset.

##### 2. What is/are the insight(s) found from the chart?

From the pie chart above, movies make up 69.14% of the dataset and TV shows 30.86%, so movies outnumber TV shows by more than two to one.
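
The percentages shown on the pie chart can also be computed directly, which is handy when exact figures are needed in text. This sketch uses a made-up series of ten titles, not the real data.

```python
import pandas as pd

# Hypothetical sample: 7 movies and 3 TV shows
types = pd.Series(["Movie"] * 7 + ["TV Show"] * 3)

# normalize=True returns fractions; scale to percentages
share = (types.value_counts(normalize=True) * 100).round(2)
print(share)
```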

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that while the distribution of movies and TV shows in the dataset provides insights, the ultimate impact on business depends on various factors such as content quality, viewer satisfaction, pricing, marketing strategies, and competition within the industry. Continuous analysis, market research, and adapting to viewer preferences are key to maintaining positive business impact in the dynamic streaming industry.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Top 10 directors in the dataset
plt.figure(figsize=(7,5))
netflix_df[~(netflix_df['director']=='Unknown')].director.value_counts().nlargest(10).plot(kind='bar')
plt.title('Top 10 directors by number of shows directed')

##### 1. Why did you pick the specific chart?

A barplot (or barchart) is one of the most common types of graphic. It shows the relationship between a numeric and a categoric variable. Each entity of the categoric variable is represented as a bar. The size of the bar represents its numeric value.

##### 2. What is/are the insight(s) found from the chart?

This bar plot shows the top 10 directors by number of shows directed.

From the bar plot above, Raúl Campos and Jan Suter, who are credited together, rank first, and Ryan Polito ranks tenth.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors such as content quality, viewer preferences, marketing strategies, and competition within the industry.

Netflix should aim for a balance between featuring shows from established directors and fostering a diverse range of content to cater to different viewer interests. Regular analysis, market research, and maintaining a pulse on viewer preferences are crucial for sustaining positive business impact in the dynamic streaming industry.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Top 10 countries with the highest number movies / TV shows in the dataset
plt.figure(figsize=(7,5))
netflix_df[~(netflix_df['country']=='Unknown')].country.value_counts().nlargest(10).plot(kind='bar')
plt.title('Top 10 countries with the highest number of shows')

In [None]:
# % share of movies / tv shows by top 3 countries
netflix_df.country.value_counts().nlargest(3).sum()/len(netflix_df)*100

In [None]:
# % share of movies / tv shows by top 10 countries
netflix_df.country.value_counts().nlargest(10).sum()/len(netflix_df)*100
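
One caveat with the share calculations above: the placeholder 'Unknown' is still counted as if it were a country. The sketch below shows one way to exclude it; `top_n_share` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def top_n_share(values, n, exclude=None):
    """Percentage of rows covered by the n most frequent values.

    `exclude` is an optional placeholder label (e.g. 'Unknown') that is
    removed from both the numerator and the denominator.
    """
    if exclude is not None:
        values = values[values != exclude]
    counts = values.value_counts()
    return counts.nlargest(n).sum() / len(values) * 100


# Hypothetical sample of primary countries
countries = pd.Series([
    "US", "US", "US", "India", "India", "UK",
    "Unknown", "Unknown", "Unknown", "Japan",
])

print(top_n_share(countries, 2))
print(top_n_share(countries, 2, exclude="Unknown"))
```

The same helper would apply unchanged to the genre shares computed for `listed_in`.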

##### 1. Why did you pick the specific chart?

A barplot (or barchart) is one of the most common types of graphic. It shows the relationship between a numeric and a categoric variable. Each entity of the categoric variable is represented as a bar. The size of the bar represents its numeric value.

##### 2. What is/are the insight(s) found from the chart?

The bar graph shows the top 10 countries by number of movies / TV shows in the dataset.

1. The United States has the highest number of movies / TV shows in the dataset.

2. Australia ranks tenth among the top 10 countries.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, cultural relevance, marketing strategies, and competition within each region.

Netflix should use the insight to inform its content acquisition and localization efforts, ensuring a balance between popular regions and the diversification of content offerings. Continuous analysis, market research, and adapting to regional viewer preferences are crucial for sustaining positive business impact in the global streaming industry.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Visualizing the year in which the movie / tv show was released
plt.figure(figsize=(7,5))
sns.histplot(netflix_df['release_year'])
plt.title('Distribution by release year')

##### 1. Why did you pick the specific chart?

In statistics, a histogram is a graphical representation of the distribution of numerical data. It is drawn as a set of adjacent rectangles, where each bar represents the number of observations that fall within a range (bin) of values.

##### 2. What is/are the insight(s) found from the chart?

The histogram above shows that most movies / TV shows in the dataset were released between 2000 and 2020, and far fewer were released earlier.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from the release year distribution to inform content planning, acquisition, and engagement strategies. Regular analysis, market research, and staying attuned to viewer preferences are key to maintaining positive business impact in the dynamic streaming industry.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Length distribution of movies
movie_df = netflix_df[netflix_df['type']=='Movie']

plt.figure(figsize=(7, 5))

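# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable histogram plus density curve
sns.histplot(movie_df['duration'], bins=30, kde=True, stat='density', color='blue').set(ylabel=None)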

plt.title('Length distribution of movies', fontsize=16,fontweight="bold")
plt.xlabel('Duration', fontsize=14)
plt.show()

##### 1. Why did you pick the specific chart?

A distribution plot (a histogram with a density curve) suits a univariate set of observations, so it is used here on a single column of the dataset: the movie duration.

##### 2. What is/are the insight(s) found from the chart?

The plot shows the distribution of movie durations. The number of movies increases with duration from 0 up to about 100 minutes and decreases from about 100 to 200 minutes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from the length distribution to curate a diverse content library, offer personalized recommendations, and ensure viewer satisfaction. Regular analysis, market research, and adapting to viewer preferences are crucial for maintaining positive business impact in the dynamic streaming industry.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Age ratings for shows in the dataset
plt.figure(figsize=(10,5))
sns.countplot(x='rating',data=netflix_df)

##### 1. Why did you pick the specific chart?

The countplot() method shows the count of observations in each categorical bin using bars.

##### 2. What is/are the insight(s) found from the chart?

The plot shows the number of titles for each rating. TV-MA is the most common rating, followed by TV-14 and then TV-PG.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from age ratings to curate a diverse content library, offer personalized recommendations, and ensure viewer satisfaction across various age groups. Regular analysis, market research, and adapting to viewer preferences are crucial for maintaining positive business impact in the dynamic streaming industry.

In [None]:
# Changing the values in the rating column
rating_map = {'TV-MA':'Adults',
              'R':'Adults',
              'PG-13':'Teens',
              'TV-14':'Young Adults',
              'TV-PG':'Older Kids',
              'NR':'Adults',
              'TV-G':'Kids',
              'TV-Y':'Kids',
              'TV-Y7':'Older Kids',
              'PG':'Older Kids',
              'G':'Kids',
              'NC-17':'Adults',
              'TV-Y7-FV':'Older Kids',
              'UR':'Adults'}

netflix_df['rating'] = netflix_df['rating'].replace(rating_map)
netflix_df['rating'].unique()
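
A quick way to confirm that a mapping like the one above covers every rating is to use `Series.map`, which leaves unmapped values as `NaN` so they can be listed. The shortened mapping and the stray value '66 min' below are hypothetical, chosen only to show the check.

```python
import pandas as pd

# Shortened, illustrative version of the rating map
rating_map = {
    "TV-MA": "Adults",
    "TV-14": "Young Adults",
    "PG": "Older Kids",
    "TV-Y": "Kids",
}

# Hypothetical ratings, including one value the map does not cover
ratings = pd.Series(["TV-MA", "PG", "TV-Y", "66 min"])

# map() yields NaN for anything missing from the dictionary
mapped = ratings.map(rating_map)
unmapped = ratings[mapped.isna()].unique().tolist()
print(unmapped)
```

By contrast, `replace` leaves unmapped values untouched, so gaps in the mapping are easier to overlook.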

In [None]:
netflix_df['principal_country'] = netflix_df['country'].apply(lambda x: x.split(",")[0])
netflix_df['principal_country'].head()

country_order = netflix_df['principal_country'].value_counts()[:11].index
content_data = netflix_df[['type', 'principal_country']].groupby('principal_country')['type'].value_counts().unstack().loc[country_order]
content_data['sum'] = content_data.sum(axis=1)
content_data_ratio = (content_data.T / content_data['sum']).T[['Movie', 'TV Show']].sort_values(by='Movie',ascending=False)[::-1]

In [None]:
netflix_df['count'] = 1
data = netflix_df.groupby('principal_country')['count'].sum().sort_values(ascending=False).reset_index()[:10]
data = data['principal_country']

Flix_df_heatmap = netflix_df.loc[netflix_df['principal_country'].isin(data)]
Flix_df_heatmap = pd.crosstab(Flix_df_heatmap['principal_country'], Flix_df_heatmap['rating'],normalize = "index").T
Flix_df_heatmap
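
To see what `normalize="index"` does in the cross-tabulation above, here is a sketch on made-up rows: each country's row is divided by its own total, so the entries are the share of that country's titles in each age group.

```python
import pandas as pd

# Hypothetical sample of (country, age group) pairs
toy = pd.DataFrame({
    "principal_country": ["US", "US", "US", "US", "India", "India"],
    "rating": ["Adults", "Adults", "Kids", "Teens", "Adults", "Teens"],
})

# Row-normalized shares: every row sums to 1
shares = pd.crosstab(toy["principal_country"], toy["rating"], normalize="index")
print(shares)
```

After the transpose used in the notebook, each column (country) sums to 1 instead.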

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Age ratings for shows in the dataset
plt.figure(figsize=(7,5))
sns.countplot(x='rating',data=netflix_df)
plt.title('Number of shows on Netflix for different age groups')

##### 1. Why did you pick the specific chart?

The countplot() method shows the count of observations in each categorical bin using bars.

##### 2. What is/are the insight(s) found from the chart?

The count plot shows the number of shows on Netflix for each age group. Adults have the most shows, followed by Young Adults and then Older Kids.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from age ratings to curate a diverse content library, offer personalized recommendations, and ensure viewer satisfaction across various age groups. Regular analysis, market research, and adapting to viewer preferences are crucial for maintaining positive business impact in the dynamic streaming industry.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Top 10 genres
plt.figure(figsize=(7,5))
netflix_df.listed_in.value_counts().nlargest(10).plot(kind='bar')
plt.title('Top 10 genres')

In [None]:
# Share of top 3 genres
netflix_df.listed_in.value_counts().nlargest(3).sum()/len(netflix_df)*100

In [None]:
# Share of top 10 genres
netflix_df.listed_in.value_counts().nlargest(10).sum()/len(netflix_df)*100

##### 1. Why did you pick the specific chart?

A bar graph is a graphical representation of data in which we can highlight the category with particular shapes like a rectangle. The length and heights of the bar chart represent the data distributed in the dataset. In a bar chart, we have one axis representing a particular category of a column in the dataset and another axis representing the values or counts associated with it. Bar charts can be plotted vertically or horizontally. A vertical bar chart is often called a column chart.

##### 2. What is/are the insight(s) found from the chart?

The bar plot shows that Dramas is the most common genre among shows and movies, and Comedies is the second most common.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from the top genres to curate a diverse content library, offer personalized recommendations, and ensure viewer satisfaction. Regular analysis, market research, and staying attuned to viewer preferences are key to maintaining positive business impact in the dynamic streaming industry.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Number of shows added on different months
plt.figure(figsize = (7,5))
sns.countplot(x = netflix_df['month_added'])
plt.title('Shows added each month over the years')
plt.xlabel('')

##### 1. Why did you pick the specific chart?

The countplot() method shows the count of observations in each categorical bin using bars.

##### 2. What is/are the insight(s) found from the chart?

i. The count plot shows the number of shows added in each month.

ii. December has the highest number of additions and October the second highest. Overall, the differences between months are small.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from the distribution of shows added on different months to plan content releases, optimize marketing efforts, and ensure viewer satisfaction. Regular analysis, market research, and adapting to viewer preferences are crucial for maintaining positive business impact in the dynamic streaming industry.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Number of shows added over the years
plt.figure(figsize = (7,5))
sns.countplot(x = netflix_df['year_added'])
plt.title('Number of shows added each year')
plt.xlabel('')

##### 1. Why did you pick the specific chart?

The countplot() method shows the count of observations in each categorical bin using bars.

##### 2. What is/are the insight(s) found from the chart?

i. The count plot shows the number of shows added in each year.

ii. 2019 has the highest number of additions, followed by 2020 and then 2018.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from the number of shows added over the years to ensure continuous growth, maintain content quality, and cater to evolving viewer preferences. Regular analysis, market research, and adapting to viewer demands are crucial for maintaining positive business impact in the dynamic streaming industry.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Number of movies and TV shows added over the years
plt.figure(figsize=(12,5))
p = sns.countplot(x='year_added',data=netflix_df, hue='type')
plt.title('Number of movies and TV shows added over the years')
plt.xlabel('')
for i in p.patches:
  p.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')

##### 1. Why did you pick the specific chart?

A bivariate plot graphs the relationship between two variables that have been measured on a single sample of subjects. Such a plot permits you to see at a glance the degree and pattern of relation between the two variables.


##### 2. What is/are the insight(s) found from the chart?

The graph shows the number of movies and TV shows added in each year. In 2019, 1497 movies and 656 TV shows were added; in 2020, 1312 movies and 697 TV shows; and in 2018, 1255 movies and 430 TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from the number of movies and TV shows added over the years to ensure continuous growth, maintain content quality, and cater to evolving viewer demands. Regular analysis, market research, and adapting to viewer preferences are crucial for maintaining positive business impact in the dynamic streaming industry.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Number of shows released each year since 2008
order = range(2008,2022)
plt.figure(figsize=(12,5))
p = sns.countplot(x='release_year',data=netflix_df, hue='type',
                  order = order)
plt.title('Number of shows released each year since 2008 that are on Netflix')
plt.xlabel('')
for i in p.patches:
  p.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')

##### 1. Why did you pick the specific chart?

A bivariate plot graphs the relationship between two variables that have been measured on a single sample of subjects. Such a plot permits you to see at a glance the degree and pattern of relation between the two variables.

##### 2. What is/are the insight(s) found from the chart?

The graph shows the number of shows on Netflix released each year since 2008. For release years before 2019, movies outnumber TV shows, but for 2020 and 2021 TV shows outnumber movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from the number of shows released each year to ensure continuous growth, maintain content quality, and cater to evolving viewer demands. Regular analysis, market research, and adapting to viewer preferences are crucial for maintaining positive business impact in the dynamic streaming industry.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Seasons in each TV show
plt.figure(figsize=(12,5))
p = sns.countplot(x='duration',data=netflix_df[netflix_df['type']=='TV Show'])
plt.title('Number of seasons per TV show distribution')

for i in p.patches:
  p.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')

##### 1. Why did you pick the specific chart?

A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable. The basic API and options are identical to those for barplot() , so you can compare counts across nested variables.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how TV shows are distributed by number of seasons: 1,608 shows have one season, 378 have two seasons, and 183 have three seasons.
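
The count plot treats `duration` as text, so its bars are not guaranteed to appear in numeric order. The following is a minimal sketch, assuming the values follow the "N Season(s)" pattern, of extracting the number so the distribution can be sorted numerically. The `toy_shows` frame and the `seasons` column are hypothetical and stand in for the TV-show rows of `netflix_df`.

```python
import pandas as pd

# Hypothetical sample; the real notebook would use the TV-show rows of netflix_df
toy_shows = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "duration": ["1 Season", "2 Seasons", "1 Season", "10 Seasons", "3 Seasons"],
})

# Pull the leading integer out of strings such as "2 Seasons"
toy_shows["seasons"] = (
    toy_shows["duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

# Count shows per season number, ordered numerically rather than alphabetically
season_counts = toy_shows["seasons"].value_counts().sort_index()
print(season_counts)
```

With a numeric column, "10 Seasons" sorts after "3 Seasons" instead of next to "1 Season".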

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from the number of seasons per TV show to plan content acquisitions, production strategies, and viewer engagement initiatives. Regular analysis, market research, and adapting to viewer demands are crucial for maintaining positive business impact in the dynamic streaming industry.

#### Chart - 14

In [None]:
# Chart - 14 visualization code
# Top 10 genres for TV shows
plt.figure(figsize=(7,5))
netflix_df[netflix_df['type']=='TV Show'].listed_in.value_counts().nlargest(10).plot(kind='bar')
plt.title('Top 10 genres for TV shows')

##### 1. Why did you pick the specific chart?

A bar plot shows categorical data as rectangular bars whose heights are proportional to the values they represent. It is often used to compare values across the categories in the data.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the top 10 genres for TV shows, since the plotted rows are filtered to `type == 'TV Show'`. International TV Shows is the most common genre, and Crime TV Shows and kids' shows have approximately the same counts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from the top genres for TV shows to curate a diverse TV show collection, offer personalized recommendations, and ensure viewer satisfaction. Regular analysis, market research, and adapting to viewer preferences are crucial for maintaining positive business impact in the dynamic streaming industry.

#### Chart - 15

In [None]:
# Chart - 15 visualization code
# Top actors for movies
plt.figure(figsize=(7,5))
netflix_df[~(netflix_df['cast']=='Unknown') & (netflix_df['type']=='Movie')].cast.value_counts().nlargest(10).plot(kind='barh')
plt.title('Actors who have appeared in highest number of movies')

##### 1. Why did you pick the specific chart?

A horizontal bar plot is a plot that presents quantitative data with rectangular bars with lengths proportional to the values that they represent. A bar plot shows comparisons among discrete categories. One axis of the plot shows the specific categories being compared, and the other axis represents a measured value.

##### 2. What is/are the insight(s) found from the chart?

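The chart shows the cast entries that appear most often among movies. Samuel West ranks first, Jeff Dunham second, and Craig Sechler and Kevin Hart share third place. Because the counts are taken on the full `cast` string, a name is counted only for titles whose `cast` field contains exactly that name.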
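
Since `value_counts()` runs on the whole `cast` string, an actor who appears in a longer cast list is not counted by the chart. A possible alternative, sketched here on invented rows, is to split each list and count every name separately. The helper `count_individual_values` is hypothetical, and it assumes the names are separated by commas.

```python
import pandas as pd

# Hypothetical sample; column names mirror netflix_df but the rows are invented
toy_movies = pd.DataFrame({
    "title": ["M1", "M2", "M3"],
    "cast": ["Actor A, Actor B", "Actor B", "Actor A, Actor B, Actor C"],
})

def count_individual_values(frame, column, sep=","):
    """Count each separated entry of `column` on its own."""
    # split "A, B" into ["A", " B"], give every entry its own row, trim spaces
    values = frame[column].str.split(sep).explode().str.strip()
    return values.value_counts()

actor_counts = count_individual_values(toy_movies, "cast")
print(actor_counts.head(10))
```

The same helper could be applied to `listed_in` to count individual genres rather than genre combinations.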

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from the top actors for movies to curate a diverse movie collection, promote movies with renowned actors, and ensure viewer satisfaction. Regular analysis, market research, and adapting to viewer preferences are crucial for maintaining positive business impact in the dynamic streaming industry.

#### Chart - 16

In [None]:
# Chart - 16 visualization code
# Top actors for TV shows
plt.figure(figsize=(7,5))
netflix_df[~(netflix_df['cast']=='Unknown') & (netflix_df['type']=='TV Show')].cast.value_counts().nlargest(10).plot(kind='barh')
plt.title('Actors who have appeared in highest number of TV shows')

##### 1. Why did you pick the specific chart?

A horizontal bar plot is a plot that presents quantitative data with rectangular bars with lengths proportional to the values that they represent. A bar plot shows comparisons among discrete categories. One axis of the plot shows the specific categories being compared, and the other axis represents a measured value.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the cast entries that appear in the highest number of TV shows. David Attenborough ranks first, followed by several names tied for second, including Michela Luci, Jamie Watson, Anna Claire Bartlam, Dante Zee, and Eric Peterson.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from the top actors for TV shows to curate a diverse TV show collection, promote shows with renowned actors, and ensure viewer satisfaction. Regular analysis, market research, and adapting to viewer preferences are crucial for maintaining positive business impact in the dynamic streaming industry.

#### Chart - 17

In [None]:
# Chart - 17 visualization code
# Building a wordcloud for the movie descriptions
comment_words = ''
stopwords = set(STOPWORDS)

# iterate through the csv file
for val in netflix_df.description.values:

    # typecaste each val to string
    val = str(val)

    # split the value
    tokens = val.split()

    # Converts each token into lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()

    comment_words += " ".join(tokens)+" "

wordcloud = WordCloud(width = 700, height = 700,
                background_color ='white',
                stopwords = stopwords,
                min_font_size = 10).generate(comment_words)


# plot the WordCloud image
plt.figure(figsize = (7,5), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

##### 1. Why did you pick the specific chart?

Word clouds or tag clouds are graphical representations of word frequency that give greater prominence to words that appear more frequently in a source text. The larger the word in the visual the more common the word was in the document(s).
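
The quantity a word cloud encodes is a plain word count. As a rough sketch on invented descriptions, the counting step can be reproduced with `collections.Counter`. The small `stop_words` set here is only illustrative; the notebook itself uses the `STOPWORDS` set from the `wordcloud` package.

```python
from collections import Counter

# Hypothetical descriptions standing in for netflix_df.description
descriptions = [
    "A family finds new life in a small town",
    "Two friends find love and a new life abroad",
    "A family fights to save the world",
]

# Tiny illustrative stop-word set (an assumption, not the wordcloud STOPWORDS list)
stop_words = {"a", "in", "and", "to", "the", "two"}

word_counts = Counter(
    word
    for text in descriptions
    for word in text.lower().split()
    if word not in stop_words
)

# The most frequent words are the ones a word cloud would draw largest
print(word_counts.most_common(3))
```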

##### 2. What is/are the insight(s) found from the chart?

The word cloud shows that words such as life, family, love, friend, world, new, and find occur most frequently in the descriptions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from the wordcloud for movie descriptions to improve content discovery, enhance personalized recommendations, and ensure viewer satisfaction. Regular analysis, market research, and adapting to viewer preferences are crucial for maintaining positive business impact in the dynamic streaming industry.

#### Chart - 18 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Plotting the heatmap
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
country_order2 = ['United States', 'India', 'United Kingdom', 'Canada', 'Japan', 'France', 'South Korea', 'Spain',
       'Mexico']

age_order = ['Adults', 'Teens', 'Young Adults', 'Older Kids', 'Kids']

sns.heatmap(Flix_df_heatmap.loc[age_order,country_order2],cmap="jet",square=True, linewidth=2.5,cbar=False,
            annot=True,fmt='2.0%',vmax=.6,vmin=0.05,ax=ax,annot_kws={"fontsize":15})

ax.spines['top'].set_visible(True)

fig.text(.76,.765, 'Target ages proportion of total content by country', fontweight='bold', fontfamily='serif', fontsize=15,ha='right')

ax.set_yticklabels(ax.get_yticklabels(), fontfamily='serif', rotation = 0, fontsize=11)
ax.set_xticklabels(ax.get_xticklabels(), fontfamily='serif', rotation=90, fontsize=11)

ax.set_ylabel('')
ax.set_xlabel('')
ax.tick_params(axis=u'both', which=u'both',length=0)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. The range of correlation is [-1,1].

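In this chart the cells do not hold correlation coefficients. Each cell shows the share of a country's content that is aimed at a target age group, so I used a heatmap to make the differences between countries easy to compare at a glance.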

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows, for each country, the proportion of its total content that is aimed at each target age group.
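
The table `Flix_df_heatmap` is built elsewhere in the notebook. One way such a table of per-country shares could be produced is `pd.crosstab` with column normalization, shown below on invented rows. The column name `target_ages` is an assumption, not necessarily the name used in the notebook.

```python
import pandas as pd

# Hypothetical rows; 'target_ages' is an assumed column name for the age group
toy = pd.DataFrame({
    "country": ["India", "India", "India", "Japan", "Japan", "Japan", "Japan"],
    "target_ages": ["Adults", "Teens", "Teens", "Adults", "Adults", "Adults", "Kids"],
})

# Rows are age groups, columns are countries; normalize='columns' makes each
# country's column sum to 1, so a cell is that age group's share of the country
share_by_country = pd.crosstab(toy["target_ages"], toy["country"], normalize="columns")
print(share_by_country.round(2))
```

Normalizing by column keeps countries with very different catalogue sizes comparable.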

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It's important to note that the impact on business depends on various factors, including content quality, viewer preferences, market trends, marketing strategies, and competition within the industry.

Netflix should leverage insights from the correlation heatmap to make data-driven decisions, improve targeted marketing efforts, and enhance personalized recommendations. Regular analysis, market research, and adaptation to viewer preferences are crucial for maintaining positive business impact in the dynamic streaming industry.

In [None]:
# Working on a copy of the dataset for clustering; any remaining missing
# values are replaced with empty strings so the text columns can be concatenated
dataset = netflix_df.copy()
dataset.fillna('',inplace=True)

In [None]:
# Combining all the clustering attributes into a single column

dataset['clustering'] = (dataset['director'] + ' ' +
                                dataset['cast'] +' ' +
                                dataset['country'] +' ' +
                                dataset['listed_in'] +' ' +
                                dataset['description'])

In [None]:
# Inspect the combined clustering text of the row at index 5
dataset['clustering'][5]

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# Missing values are present in the director, cast, country, date_added, and rating columns.

# During data wrangling, missing values in director, cast, and country were filled with 'Unknown', missing rating values were filled with the mode, and rows with a missing date_added were dropped.

#### What all missing value imputation techniques have you used and why did you use those techniques?

During data wrangling, I filled the missing values in the director, cast, and country columns with 'Unknown', because these are free-text fields with no sensible default. I filled the missing rating values with the mode, and I dropped the rows with a missing date_added.
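
The three imputation steps described above can be sketched on a small invented frame as follows. The rows are hypothetical; only the column names and the strategy come from this notebook.

```python
import pandas as pd

# Hypothetical rows with gaps in the same columns the notebook mentions
toy = pd.DataFrame({
    "director": ["Dir A", None, "Dir C", "Dir D"],
    "cast": [None, "Actor B", "Actor C", "Actor D"],
    "country": ["India", None, "Japan", "Japan"],
    "rating": ["TV-MA", "TV-MA", None, "PG"],
    "date_added": ["January 1, 2020", "March 5, 2019", "July 7, 2018", None],
})

# Text columns: replace missing entries with the placeholder 'Unknown'
for column in ["director", "cast", "country"]:
    toy[column] = toy[column].fillna("Unknown")

# rating: fill with the most frequent value (the mode)
toy["rating"] = toy["rating"].fillna(toy["rating"].mode()[0])

# date_added: drop the rows where the value is missing
toy = toy.dropna(subset=["date_added"]).reset_index(drop=True)

print(toy)
```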

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# Except for the release year, almost all of the data is in text format.
# The text columns contain the data needed to build the clustering model, so there is no need to handle outliers.

##### What all outlier treatment techniques have you used and why did you use those techniques?

Except for the release year, almost all of the data is in text format.

The text columns contain the data needed to build the clustering model, so no outlier treatment was applied.

### 3. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Removing Punctuations

In [None]:
# Remove Punctuations
# Function to remove punctuations
def remove_punctuation(text):
    '''a function for removing punctuation'''
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)

In [None]:
# Removing punctuation marks
dataset['clustering'] = dataset['clustering'].apply(remove_punctuation)

In [None]:
# Inspect the clustering text at index 100 after removing punctuation
dataset['clustering'][100]

#### 2. Removing Non-ASCII characters

In [None]:
# Remove Non-Ascii characters
# Function to remove non-ascii characters

def remove_non_ascii(words):
    """Function to remove non-ASCII characters"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

In [None]:
# Remove non-ascii characters
dataset['clustering'] = remove_non_ascii(dataset['clustering'])

In [None]:
# Inspect the clustering text at index 100 after removing non-ASCII characters
dataset['clustering'][100]
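
To make the effect of this step concrete, here is a small standalone sketch of the same NFKD normalization applied to a single invented string. The function name `strip_non_ascii` is hypothetical; it mirrors what `remove_non_ascii` above does to each entry.

```python
import unicodedata

def strip_non_ascii(text):
    """Decompose accented characters (NFKD) and drop what ASCII cannot encode."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

sample = "Pokémon Café Señor"
print(strip_non_ascii(sample))  # accented letters keep their base letter
```

Accented Latin letters keep their base letter, while characters with no ASCII counterpart (for example CJK characters) are removed entirely, which matters for titles and names written in non-Latin scripts.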

#### 3. Removing stopwords

In [None]:
# Extracting the stopwords from nltk library
import nltk
from nltk.corpus import stopwords
sentences = stopwords.words('english')

# displaying the stopwords
np.array(sentences)

In [None]:
# function to remove stop words
def remove_stopwords(text):
    '''a function for removing the stopword'''
    # removing the stop words and lowercasing the selected words
    text = [word.lower() for word in text.split() if word.lower() not in sentences]
    # joining the list of words with space separator
    return " ".join(text)

In [None]:
# Removing stop words
dataset['clustering'] = dataset['clustering'].apply(remove_stopwords)

In [None]:
# Inspect the clustering text at index 100 after removing stop words
dataset['clustering'][100]

#### 4. Lemmatization

In [None]:
# Function to lemmatize the corpus
lemmatizer = WordNetLemmatizer()

def lemmatize_verbs(text):
    """Lemmatize every word of a document string as a verb"""
    # each row holds a whole document, so it is split into words first;
    # passing the full string to the lemmatizer would leave it unchanged
    return " ".join(lemmatizer.lemmatize(word, pos='v') for word in text.split())

In [None]:
# Lemmatization
dataset['clustering'] = dataset['clustering'].apply(lemmatize_verbs)

In [None]:
# Inspect the clustering text at index 100 after lemmatization
print(dataset['clustering'][100])

#### 5. Tokenization

In [None]:
# Create a reference variable for Class TweetTokenizer
tokenizer = TweetTokenizer()

In [None]:
# Create text column based on dataset
dataset['clustering'] = dataset['clustering'].apply(lambda x: tokenizer.tokenize(x))

In [None]:
# Inspect the tokens at index 100 after tokenization
print(dataset['clustering'][100])

#### 6. Text Vectorization

In [None]:
# clustering tokens saved in a variable
clustering_vectorization = dataset['clustering']

In [None]:
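# The text is already tokenized, so the vectorizer gets an identity tokenizer
# (named differently from the TweetTokenizer instance above to avoid shadowing it)
def identity_tokenizer(tokens):
  return tokens

# Using TFIDF vectorizer to vectorize the corpus
# max features = 20000 to prevent system from crashing
tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False, max_features = 20000)
x = tfidf.fit_transform(clustering_vectorization)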

In [None]:
# Dataset Rows & Columns count
x.shape

In [None]:
# convert X into array form for clustering
X = x.toarray()

In [None]:
# Check the matrix
X

##### Which text vectorization technique have you used and why?

I have used TF-IDF vectorization technique in this dataset.

TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer is a commonly used technique in natural language processing (NLP) to represent text data numerically. It converts a collection of documents into a matrix of numerical features, capturing the importance of words within the documents.
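
To show how the weights arise, the following hand-rolled sketch computes TF-IDF for a few invented token lists. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which scikit-learn documents as the default for `TfidfVectorizer`. The documents and the function name `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents
docs = [
    ["crime", "drama", "family"],
    ["crime", "comedy"],
    ["crime", "drama", "drama"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Smoothed TF-IDF with L2 normalization for one tokenized document."""
    term_freq = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights_doc0 = tfidf(docs[0])
print(weights_doc0)
```

In the first document, "crime" occurs in every document and therefore receives the lowest weight, while "family" occurs only once in the corpus and receives the highest.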

### 4. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes. The TF-IDF vectorizer produces up to 20,000 features, so I used PCA to reduce the dimensionality of the dataset. PCA identifies the directions (principal components) along which the data varies the most. These components are ordered by the amount of variance they explain in the data.

In [None]:
# Dimensionality Reduction (If needed)
# Using PCA to reduce dimensionality
pca = PCA(random_state=40)
pca.fit(X)

In [None]:
# Explained variance for different number of components
plt.figure(figsize=(7,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.title('PCA - Cumulative explained variance vs number of components')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

In [None]:
# reducing the dimensions to 4000 using pca
pca = PCA(n_components=4000,random_state=40)
pca.fit(X)

In [None]:
# transformed features
x_pca = pca.transform(X)

In [None]:
# shape of transformed vectors
x_pca.shape

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

PCA can extract the most relevant features from a dataset. It transforms the original features into a new set of uncorrelated variables called principal components. These components are linear combinations of the original features and capture the maximum amount of variation present in the data.
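
The number of components can also be read off the cumulative explained variance programmatically rather than from the plot. Below is a minimal sketch on random data, assuming an 80% variance target. The data, the `scales` vector, and the threshold are illustrative and are not the values behind the 4000 components chosen above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples, 10 features, most variance in the first three
rng = np.random.default_rng(0)
scales = np.array([10, 6, 4, 1, 1, 1, 1, 1, 1, 1])
data = rng.normal(size=(200, 10)) * scales

pca = PCA(random_state=40)
pca.fit(data)

# Smallest number of components whose cumulative explained variance reaches 80%
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.80)) + 1
print(n_components, cumulative[n_components - 1])
```

`np.argmax` on the boolean array returns the first index at which the threshold is met, hence the `+ 1` to turn the index into a count.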

## ***7. ML Model Implementation***

### ML Model - K-Means Clustering

In [None]:
# Elbow method to find the optimal value of k
wcss=[]
for i in range(1,31):
  kmeans = KMeans(n_clusters=i,init='k-means++',random_state=33)
  kmeans.fit(x_pca)
  wcss_iter = kmeans.inertia_
  wcss.append(wcss_iter)

number_clusters = range(1,31)
plt.figure(figsize=(10,5))
plt.plot(number_clusters,wcss)
plt.title('The Elbow Method - KMeans clustering')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')

In [None]:
# Plotting Silhouette score for different number of clusters
range_n_clusters = range(2,31)
silhouette_avg = []
for num_clusters in range_n_clusters:
  # initialize kmeans
  kmeans = KMeans(n_clusters=num_clusters,init='k-means++',random_state=33)
  kmeans.fit(x_pca)
  cluster_labels = kmeans.labels_

  # silhouette score
  silhouette_avg.append(silhouette_score(x_pca, cluster_labels))

plt.figure(figsize=(10,5))
plt.plot(range_n_clusters,silhouette_avg)
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k - KMeans clustering')
plt.show()

In [None]:
# Clustering the data into 6 clusters
kmeans = KMeans(n_clusters=6,init='k-means++',random_state=40)
kmeans.fit(x_pca)

In [None]:
# Adding a kmeans cluster number attribute
dataset['kmeans_cluster'] = kmeans.labels_

In [None]:
# Evaluation metrics - distortion, Silhouette score
kmeans_distortion = kmeans.inertia_
kmeans_silhouette_score = silhouette_score(x_pca, kmeans.labels_)

print((kmeans_distortion,kmeans_silhouette_score))

In [None]:
# Number of movies and tv shows in each cluster
plt.figure(figsize=(7,5))
q = sns.countplot(x='kmeans_cluster',data=dataset, hue='type')
plt.title('Number of movies and TV shows in each cluster - Kmeans Clustering')
for i in q.patches:
  q.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')

In [None]:
# Building a wordcloud for the descriptions of the shows in one k-means cluster
def kmeans_worldcloud(cluster_num):
  comment_words = ''
  stopwords = set(STOPWORDS)

  # iterate through the descriptions of the selected cluster
  for val in dataset[dataset['kmeans_cluster']==cluster_num].description.values:

      # typecast each val to string
      val = str(val)

      # split the value
      tokens = val.split()

      # Converts each token into lowercase
      for i in range(len(tokens)):
          tokens[i] = tokens[i].lower()

      comment_words += " ".join(tokens)+" "

  wordcloud = WordCloud(width = 700, height = 700,
                  background_color ='white',
                  stopwords = stopwords,
                  min_font_size = 10).generate(comment_words)
  # return the generated cloud so that it can be plotted
  return wordcloud

In [None]:
# plot the WordCloud image for one cluster (cluster 0 is used here as an example)
wordcloud = kmeans_worldcloud(0)
plt.figure(figsize = (7,5), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

K-means clustering is a popular unsupervised machine learning algorithm used for clustering analysis. It is primarily used for partitioning a dataset into distinct groups or clusters based on their similarities.

It's important to note that the effectiveness of K-means clustering depends on the data quality, appropriate feature selection, and careful consideration of the number of clusters. Additionally, the algorithm assumes that the clusters have a spherical shape and similar variances, so it may not perform well on datasets with irregular shapes or varying cluster sizes.
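
As an illustration of how the silhouette score can guide the choice of k, here is a small sketch on synthetic 2-D points that form three compact, well-separated groups. The data is invented for this example; on the real TF-IDF/PCA features the curve is usually far less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three tight, well-separated groups of 2-D points
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack(
    [center + rng.normal(scale=0.5, size=(50, 2)) for center in centers]
)

# Fit k-means for several k and record the average silhouette score of each fit
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=33)
    labels = model.fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# The k with the highest average silhouette score is the candidate to keep
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Passing `n_init` explicitly keeps the behaviour the same across scikit-learn versions, whose default for this parameter has changed over time.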

### ML Model - Hierarchical clustering

In [None]:
# Building a dendrogram to decide on the number of clusters
plt.figure(figsize=(12, 7))
dend = shc.dendrogram(shc.linkage(x_pca, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Netflix Shows')
plt.ylabel('Distance')
plt.axhline(y= 3.8, color='r', linestyle='--')

In [None]:
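# Fitting hierarchical clustering model
# (Ward linkage always uses Euclidean distance, so no distance argument is passed;
# the `affinity` parameter was deprecated and later removed in newer scikit-learn releases)
hierarchical = AgglomerativeClustering(n_clusters=12, linkage='ward')
hierarchical.fit_predict(x_pca)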

In [None]:
# Adding a hierarchical cluster number attribute
dataset['hierarchical_cluster'] = hierarchical.labels_

In [None]:
# Number of movies and tv shows in each cluster
plt.figure(figsize=(10,5))
q = sns.countplot(x='hierarchical_cluster',data=dataset, hue='type')
plt.title('Number of movies and tv shows in each cluster - Hierarchical Clustering')
for i in q.patches:
  q.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')

In [None]:
# Building a wordcloud for the descriptions of the shows in one hierarchical cluster
def hierarchical_worldcloud(cluster_num):
  comment_words = ''
  stopwords = set(STOPWORDS)

  # iterate through the descriptions of the selected cluster
  for val in dataset[dataset['hierarchical_cluster']==cluster_num].description.values:

      # typecast each val to string
      val = str(val)

      # split the value
      tokens = val.split()

      # Converts each token into lowercase
      for i in range(len(tokens)):
          tokens[i] = tokens[i].lower()

      comment_words += " ".join(tokens)+" "

  wordcloud = WordCloud(width = 700, height = 700,
                  background_color ='white',
                  stopwords = stopwords,
                  min_font_size = 10).generate(comment_words)
  # return the generated cloud (not the function itself) so that it can be plotted
  return wordcloud

In [None]:
# plot the WordCloud image for one cluster (cluster 0 is used here as an example)
wordcloud = hierarchical_worldcloud(0)
plt.figure(figsize = (7,5), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Hierarchical clustering is a popular unsupervised machine learning algorithm used for clustering analysis. It is primarily used for grouping similar data points into hierarchical structures or dendrograms.

It's important to note that hierarchical clustering can be computationally intensive, especially for large datasets, as it involves calculating distances between all pairs of data points. Additionally, the choice of distance metric and linkage method can impact the clustering results, and different approaches may be more suitable for different types of data. Careful consideration of these factors is necessary to achieve meaningful and accurate clustering results.
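
Drawing a horizontal line across the dendrogram amounts to cutting the tree at a distance threshold. As a sketch on a handful of invented points, SciPy's `fcluster` performs that cut and returns the resulting labels. The points and the threshold of 5.0 are illustrative and unrelated to the 3.8 line drawn above.

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Hypothetical data: two small, clearly separated groups of 2-D points
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

# Ward linkage, the same method used for the dendrogram in this notebook
linkage_matrix = shc.linkage(points, method="ward")

# Cut the tree at a distance threshold: merges above it are not applied
threshold = 5.0
labels = shc.fcluster(linkage_matrix, t=threshold, criterion="distance")
n_clusters = len(set(labels))
print(labels, n_clusters)
```

Counting the distinct labels gives the number of clusters implied by the chosen threshold, which can then be passed to `AgglomerativeClustering` as `n_clusters`.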

## ***8. Using Content Based Recommender System***

In [None]:
# Storing the current row index in the show_id column
dataset['show_id'] = dataset.index

In [None]:
# converting tokens to string
def convert(lst):
  return ' '.join(lst)

dataset['clustering'] = dataset['clustering'].apply(lambda x: convert(x))

In [None]:
# setting title of movies/Tv shows as index
dataset.set_index('title',inplace=True)

In [None]:
# Count vectorizer
CV = CountVectorizer()
converted_matrix = CV.fit_transform(dataset['clustering'])

In [None]:
# Cosine similarity
# (stored under a new name so the imported cosine_similarity function is not overwritten)
cosine_sim_matrix = cosine_similarity(converted_matrix)

In [None]:
# Shape of the similarity matrix (shows x shows)
cosine_sim_matrix.shape

In [None]:
# Developing a function to get 10 recommendations for a show
indices = pd.Series(dataset.index)

def Dataset_10(title, cosine_sim = cosine_sim_matrix):
  try:
    recommend_content = []
    idx = indices[indices == title].index[0]
    series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    # skip the first entry, which is expected to be the show itself
    top10 = list(series.iloc[1:11].index)
    # list with the titles of the best 10 matching movies
    for i in top10:
      recommend_content.append(list(dataset.index)[i])
    print("If you liked '"+title+"', you may also enjoy:\n")
    return recommend_content

  except IndexError:
    # raised when the title is not found in the dataset
    return 'Invalid Entry'

In [None]:
# Recommendations for 'Golmaal: Fun Unlimited'
Dataset_10('Golmaal: Fun Unlimited')

In [None]:
# Recommendations for 'Stranger Things'
Dataset_10('Stranger Things')
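
To make the ranking step easier to follow, here is a dependency-free sketch of the same idea on a four-title invented catalogue: build bag-of-words counts, compute cosine similarity, and return the closest titles. The names `catalogue`, `cosine`, and `recommend` are hypothetical and are not part of the notebook's code.

```python
import math
from collections import Counter

# Hypothetical catalogue: title -> preprocessed text
catalogue = {
    "Show A": "crime drama detective city",
    "Show B": "crime drama detective murder",
    "Show C": "kids cartoon friendship",
    "Show D": "crime thriller city",
}

# Bag-of-words count vector for every title
vectors = {title: Counter(text.split()) for title, text in catalogue.items()}

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[word] * v[word] for word in u if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)

def recommend(title, top_n=2):
    """Return the top_n other titles most similar to `title`."""
    if title not in vectors:
        return []
    others = [t for t in vectors if t != title]
    others.sort(key=lambda t: cosine(vectors[title], vectors[t]), reverse=True)
    return others[:top_n]

print(recommend("Show A"))
```

This version excludes the queried title explicitly instead of relying on it being ranked first, which stays correct even if another title has an identical text.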

## ***9.*** ***Future Work (Optional)***

Integrating this dataset with external sources such as IMDB ratings, book clustering, and plant-based type clustering can lead to numerous intriguing discoveries.

By incorporating additional data, a more comprehensive recommender system could be developed, offering enhanced recommendations to users. This system could then be deployed on the web for widespread usage.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

1. In this project, I worked on a text clustering problem in which the Netflix shows had to be grouped into clusters such that shows within a cluster are similar to each other and shows in different clusters are dissimilar.

2. The dataset contained about 7,787 records and 12 attributes. We began by handling the dataset's missing values and performing exploratory data analysis (EDA).

3. The EDA showed that Netflix hosts more movies than TV shows on its platform, and that the total number of shows added to Netflix is growing exponentially. Also, the majority of the shows were produced in the United States, and the majority of the shows on Netflix were created for the adult and young-adult age groups.

4. After obtaining the required insights from the EDA, we pre-processed the text data by removing punctuation and stop words. The filtered data was passed through a TF-IDF vectorizer, because text-based clustering needs the data to be vectorized before a model can be fitted.

5. It was decided to cluster the data based on the attributes director, cast, country, genre, and description. The values in these attributes were tokenized, preprocessed, and then vectorized using the TF-IDF vectorizer.

6. Through TF-IDF vectorization, we created a total of 20,000 attributes. We used Principal Component Analysis (PCA) to handle the curse of dimensionality. 4,000 components were able to capture more than 80% of the variance, and hence the number of components was restricted to 4,000. We first built clusters using the k-means clustering algorithm, and the optimal number of clusters came out to be 6. This was obtained through the elbow method and silhouette score analysis.

7. Clusters were then built using the agglomerative clustering algorithm, and the optimal number of clusters came out to be 12. This was obtained after visualizing the dendrogram.

8. A content-based recommender system was built using the similarity matrix obtained from cosine similarity. This recommender system makes 10 recommendations to the user based on a show they watched.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***