<a href="https://colab.research.google.com/github/Rmohanty385/Netflix-Movies-and-TV-Shows-Clustering/blob/main/Netflix_Movies_and_TV_Shows_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Netflix Movies and TV Shows Clustering



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**

*   Netflix is an American subscription streaming service and production company. It was founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California.

*   It offers a library of films and television series through distribution deals as well as its own productions, known as Netflix Originals.

*   Our objective is to conduct an exploratory data analysis (EDA) to understand what content is available in different countries and whether Netflix has increasingly focused on TV shows rather than movies in recent years, and then to use these insights to cluster similar content by matching text-based features.

*   After loading the data, we start by observing the first and last five values to understand the dataset. Next, we treat the null values by dropping them if the respective variables contain <1% of null values. This is followed by feature engineering to extract new variables from the datetime variable date_added.

*   This cleaned data is then used to conduct EDA in order to understand it better and identify the underlying trends.

*   Once the required insights are obtained from the EDA, we pre-process the text data by removing punctuation and stop words. The filtered text is then passed through a TF-IDF vectorizer, since text-based clustering requires the data to be represented as numeric vectors.

*   Finally, K-Means clustering is used to form 10 distinct clusters of similar data points.

*   Using the data provided, we also implemented a simple recommender system that returns ten similar movies or TV shows for a given title.
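
The vectorize, cluster, and recommend steps summarized above can be sketched on a toy corpus. This is a minimal illustration, not the notebook's final pipeline: the titles, descriptions, and the `recommend` helper are hypothetical, and 2 clusters are used instead of 10 because the toy data is tiny.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy descriptions standing in for the Netflix text features.
titles = ["Space Wars", "Galaxy Quest", "Love in Paris", "Paris Romance"]
texts = [
    "space battle alien ship galaxy",
    "galaxy alien crew space adventure",
    "romance love paris couple",
    "love story paris romance wedding",
]

# Vectorize the text: English stop words are removed, terms are TF-IDF weighted.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# Cluster the vectors; the project uses 10 clusters, the toy data only needs 2.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)


def recommend(title, top_n=1):
    """Return the top_n titles most similar to `title` by cosine similarity."""
    idx = titles.index(title)
    scores = cosine_similarity(X[idx], X).ravel()
    ranked = scores.argsort()[::-1]  # most similar first
    return [titles[i] for i in ranked if i != idx][:top_n]


print(labels)
print(recommend("Space Wars"))
```

Cosine similarity on TF-IDF vectors is a common choice for this kind of recommender because it compares the direction of the term vectors and is insensitive to description length.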


# **GitHub Link -**

https://github.com/Rmohanty385/Netflix-Movies-and-TV-Shows-Clustering

# **Problem Statement**


This dataset consists of TV shows and movies available on Netflix as of 2021. The dataset is collected from Flixable, which is a third-party Netflix search engine.

In 2018, Flixable released an interesting report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles. It will be interesting to explore what other insights can be obtained from the same dataset.

*   Integrating this dataset with other external datasets, such as IMDb ratings and Rotten Tomatoes, can also provide many interesting findings.

*   In this project, you are required to do Exploratory Data Analysis.

*   Understanding what type of content is available in different countries.

*   Has Netflix been increasingly focusing on TV rather than movies in recent years?

*   Clustering similar content by matching text-based features.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-validation & hyperparameter tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries and modules

# libraries that are used for analysis and visualization
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# libraries used to process textual data
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

# library used to visualize missing values
import missingno as msno

import datetime as dt

# libraries used to implement clusters
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc

# libraries that are used to construct a recommendation system
# (TfidfVectorizer is already imported above)
from sklearn.metrics.pairwise import cosine_similarity

# suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

# statistical functions used for hypothesis testing
from scipy import stats
     

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')
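
The guidelines above ask for exception handling so that the notebook runs in one go. A minimal sketch of a guarded load is shown below; `load_dataset` is a hypothetical helper and the file name is a placeholder, neither is part of the original notebook.

```python
import pandas as pd


def load_dataset(path):
    """Read a CSV file, returning None (with a message) if it cannot be loaded."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        print(f"File not found: {path}")
    except pd.errors.EmptyDataError:
        print(f"File is empty: {path}")
    return None


# Hypothetical path used only to demonstrate the failure branch.
result = load_dataset("no_such_file.csv")
```

Returning `None` keeps the cell from raising, at the cost of requiring a check before later cells use the frame.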

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows=df.shape[0]
columns=df.shape[1]
print(f"The no of rows is {rows} and no of columns is {columns}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(f'The number of duplicate rows are {df.duplicated().sum()}')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(14, 5))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.xlabel("column_name", size=14, weight="bold")
plt.title("missing values in column",fontweight="bold",size=17)
plt.show()

### What did you know about your dataset?

*   There are 7787 rows and 12 columns in the dataset. The director, cast, country, date_added, and rating columns contain missing values. The dataset does not contain any duplicate rows.

*   Every row describes a specific movie or TV show, so missing values cannot be reliably inferred from other rows. Because the dataset is small, we also want to avoid losing data. After analyzing each column, we therefore fill the missing director, cast, and country values with a placeholder, fill the missing ratings with the mode, and drop only the few rows with a missing date_added in the data wrangling step.
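
A short sketch of how the share of missing values per column can be checked before deciding how to treat each column. The toy frame below is hypothetical and only mimics the kind of gaps described above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kind of gaps as the Netflix data.
toy = pd.DataFrame({
    "director": ["A", np.nan, np.nan, "D"],
    "rating": ["TV-MA", "PG", np.nan, "PG"],
    "date_added": ["January 1, 2020", "May 5, 2019", "June 6, 2018", np.nan],
})

# Share of missing values per column, in percent.
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)

# Free-text column: placeholder label; discrete column: mode; otherwise drop the row.
toy["director"] = toy["director"].fillna("Unknown")
toy["rating"] = toy["rating"].fillna(toy["rating"].mode()[0])
toy = toy.dropna(subset=["date_added"])
```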


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description 

**show_id :** Unique ID for every Movie / Tv Show

**type :** Identifier - A Movie or TV Show

**title :** Title of the Movie / Tv Show

**director :** Director of the Movie

**cast :** Actors involved in the movie / show

**country :** Country where the movie / show was produced

**date_added :** Date it was added on Netflix

**release_year :** Actual release year of the movie / show

**rating :** TV Rating of the movie / show

**duration :** Total Duration - in minutes or number of seasons

**listed_in :** Genre

**description :** The Summary description

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in ",i,"is",df[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Top countries
df.country.value_counts()

In [None]:
# Genre of shows
df.listed_in.value_counts()

In [None]:
# Handling the missing values

# The missing values in the director, cast, and country attributes can be replaced with 'Unknown'
df[['director','cast','country']] = df[['director','cast','country']].fillna('Unknown')

# The missing values in rating can be imputed with its mode, since this attribute is discrete.
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])

# Drop the rows with a missing value in the date_added column.
df.dropna(subset=['date_added'], inplace=True)

In [None]:
# Choosing the primary country and primary genre to simplify the analysis
df['country'] = df['country'].apply(lambda x: x.split(',')[0])
df['listed_in'] = df['listed_in'].apply(lambda x: x.split(',')[0])

In [None]:
# country in which a movie was produced
df.country.value_counts()

In [None]:
# genre of shows
df.listed_in.value_counts()

In [None]:
# Splitting the duration column, and changing the datatype to integer
df['duration'] = df['duration'].apply(lambda x: int(x.split()[0]))

In [None]:
# Number of seasons for tv shows
df[df['type']=='TV Show'].duration.value_counts()

In [None]:
# Movie length in minutes
df[df['type']=='Movie'].duration.unique()

In [None]:
# datatype of duration
df.duration.dtype

In [None]:
# Converting 'date_added' from string to datetime
df["date_added"] = pd.to_datetime(df['date_added'])

In [None]:
# first and last date on which a show was added on Netflix
df.date_added.min(),df.date_added.max()

In [None]:
# Adding new attributes month_added & year_added from date_added column
df['month_added'] = df['date_added'].dt.month
df['year_added'] = df['date_added'].dt.year

# Drop date_added column
df.drop('date_added', axis=1, inplace=True)

In [None]:
# New Dataset Rows & Columns 
df.shape

In [None]:
# Changing the values in the rating column
rating_map = {'TV-MA':'Adults',
              'R':'Adults',
              'PG-13':'Teens',
              'TV-14':'Young Adults',
              'TV-PG':'Older Kids',
              'NR':'Adults',
              'TV-G':'Kids',
              'TV-Y':'Kids',
              'TV-Y7':'Older Kids',
              'PG':'Older Kids',
              'G':'Kids',
              'NC-17':'Adults',
              'TV-Y7-FV':'Older Kids',
              'UR':'Adults'}

# Assign the result back instead of replacing in place on a column selection
df['rating'] = df['rating'].replace(rating_map)
df['rating'].unique()

In [None]:
# Dataset Info
df.info()

### What all manipulations have you done and insights you found?

*   Some movies / TV shows were filmed in multiple countries or have multiple genres associated with them.
*   To simplify the analysis, we consider only the primary country where the respective movie / TV show was filmed and only its primary genre. We also converted the duration column to an integer datatype.
*   The shows were added to Netflix between 1st January 2008 and 16th January 2021.
*   Now our dataset is ready for EDA.
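
The manipulations listed above can be illustrated on two made-up rows. This is a sketch under the assumption that the raw columns follow the formats shown; the values are hypothetical.

```python
import pandas as pd

# Hypothetical rows mimicking the raw column formats.
toy = pd.DataFrame({
    "country": ["United States, India", "Japan"],
    "listed_in": ["Dramas, International Movies", "Anime Series"],
    "duration": ["93 min", "2 Seasons"],
    "date_added": ["August 14, 2020", "January 1, 2019"],
})

# Keep only the first (primary) entry of each comma-separated field.
toy["country"] = toy["country"].str.split(",").str[0].str.strip()
toy["listed_in"] = toy["listed_in"].str.split(",").str[0].str.strip()

# Leading number of the duration string: minutes for movies, seasons for shows.
toy["duration"] = toy["duration"].str.split().str[0].astype(int)

# Parse the date and derive the month and year in which the title was added.
added = pd.to_datetime(toy["date_added"].str.strip())
toy["month_added"] = added.dt.month
toy["year_added"] = added.dt.year
print(toy)
```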


## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 : Type (Univariate)

In [None]:
# Chart - 1 visualization code
print(df['type'].value_counts())

# countplot to visualize the number of movies and TV shows in the type column
# (the column is passed by keyword, as required by recent seaborn versions)
sns.countplot(x='type', data=df)
plt.show()

##### 1. Why did you pick the specific chart?

*   A countplot shows the frequency (count) of values for each level of a categorical or nominal variable.

*   To show the number of TV shows and movies, I used a countplot.


##### 2. What is/are the insight(s) found from the chart?

Netflix has 5377 movies and 2400 TV shows, so there are more movies than TV shows on Netflix.
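
As a small worked calculation, the two counts above can be turned into shares of the catalogue. On the full frame, `df['type'].value_counts(normalize=True)` would give the same proportions.

```python
# Counts reported above.
movies, tv_shows = 5377, 2400
total = movies + tv_shows

movie_share = movies / total * 100
tv_share = tv_shows / total * 100
print(f"Movies: {movie_share:.1f}%, TV shows: {tv_share:.1f}%")
# prints "Movies: 69.1%, TV shows: 30.9%"
```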

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*   No, this insight alone does not create a positive business impact.

*   No, there are no such insights that lead to negative growth.


#### Chart - 2 : Director

In [None]:
# Chart - 2 visualization code
# Top 10 directors in the dataset
plt.figure(figsize=(15,6))
df[~(df['director']=='Unknown')].director.value_counts().nlargest(10).plot(kind='bar')
plt.title('Top 10 directors by number of shows directed')

##### 1. Why did you pick the specific chart?

A barplot (or barchart) is one of the most common types of graphic. It shows the relationship between a numeric and a categoric variable.

##### 2. What is/are the insight(s) found from the chart?

*   Raul Campos and Jan Suter together have directed 18 movies / TV shows, more than anyone else among the top 10 directors in the dataset.
*   Among the top 10 directors, Ryan Polito and David Dhawan have directed the fewest movies / TV shows, with 8 and 9 respectively.



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*   No, the gained insights didn't create a positive business impact.
*   No, there are no such insights that lead to negative growth.


#### Chart - 3 : Release year of TV Shows/movies

In [None]:
# Chart - 3 visualization code
# Titles per release year, most recent years first (first 20 shown)
movies_year = df['release_year'].value_counts().sort_index(ascending=False)
print(movies_year[:20])
# Countplot of the 20 release years with the most titles
plt.figure(figsize=(15,5))
sns.countplot(y='release_year', data=df, order=df['release_year'].value_counts().index[0:20])
plt.show()


##### 1. Why did you pick the specific chart?

*   A countplot shows the frequency (count) of values for each level of a categorical or nominal variable.

*   To show the number of TV shows / movies per release year, I used a countplot.


##### 2. What is/are the insight(s) found from the chart?

*   The highest number of TV shows / movies were released in 2017 and 2018.

*   There is a large increase in the number of TV shows / movies released after 2015.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*   No, the gained insights didn't create a positive business impact.
*   No, there are no such insights that lead to negative growth.

#### Chart - 4 :  Country

In [None]:
# Chart - 4 visualization code
# Top 10 countries with the highest number movies / TV shows in the dataset
plt.figure(figsize=(15,5))
df[~(df['country']=='Unknown')].country.value_counts().nlargest(10).plot(kind='barh')
plt.title(' Top 10 countries with the highest number of shows')

In [None]:
# % share of movies / tv shows by top 3 countries
df.country.value_counts().nlargest(3).sum()/len(df)*100

In [None]:
# % share of movies / tv shows by top 10 countries
df.country.value_counts().nlargest(10).sum()/len(df)*100

##### 1. Why did you pick the specific chart?

A barplot (or barchart) is one of the most common types of graphic. It shows the relationship between a numeric and a categoric variable.

##### 2. What is/are the insight(s) found from the chart?

*   The United States has the highest number of movies / TV shows in the dataset, followed by India and the United Kingdom in second and third place.

*   Mexico and Australia are in ninth and tenth position among the top 10 countries with the highest number of movies / TV shows in the dataset.

*   The top 3 countries together account for about 56% of all movies and TV shows in the dataset.

*   The top 10 countries together account for about 78% of all movies and TV shows in the dataset.
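
The share cells above count every row, including titles whose country was filled with 'Unknown', while the bar chart excludes them. A minimal sketch of how the two variants differ, on a hypothetical country column:

```python
import pandas as pd

# Hypothetical country column; "Unknown" marks rows whose country was missing.
country = pd.Series(["United States"] * 5 + ["India"] * 3 + ["Unknown"] * 2)

# Share of the top 2 countries over all rows.
counts = country.value_counts()
share_all = counts.nlargest(2).sum() / len(country) * 100

# Same calculation restricted to titles with a known country.
known = country[country != "Unknown"]
share_known = known.value_counts().nlargest(2).sum() / len(known) * 100

print(share_all, share_known)
```

Which denominator is appropriate depends on whether the question is about the whole catalogue or only about titles with a recorded country.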


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*   No, the gained insights didn't create a positive business impact.

*   No, there are no such insights that lead to negative growth.


#### Chart - 5 : Month added

In [None]:
# Chart - 5 visualization code
# Plotting the Countplot 
plt.figure(figsize=(12,10))
# Pass the column by keyword, as required by recent seaborn versions
ax = sns.countplot(x='month_added', data=df)
plt.show()

##### 1. Why did you pick the specific chart?

*   A countplot shows the frequency (count) of values for each level of a categorical or nominal variable.

*   To show the number of titles added in each month, I used a countplot.


##### 2. What is/are the insight(s) found from the chart?

The maximum number of movies and TV shows were added from October to January.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*   No, the gained insights didn't create a positive business impact.

*   No, there are no such insights that lead to negative growth.


#### Chart - 6 : Age

In [None]:
# Chart - 6 visualization code
# Age ratings for shows in the dataset
plt.figure(figsize=(13,5))
sns.countplot(x='rating',data=df)
plt.xlabel('Different Age Groups',fontsize=12)
plt.ylabel('Number of Shows',fontsize=12)
plt.title('Number of shows on Netflix for different age groups')

##### 1. Why did you pick the specific chart?

A countplot shows the count of observations in each categorical bin using bars.

##### 2. What is/are the insight(s) found from the chart?

*   Adults have the highest number of shows, followed by Young Adults and then Older Kids.
*   Teens and Kids have fewer shows on Netflix compared to the other groups.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*   Yes, the gained insights can help create a positive business impact.

*   No, there are no such insights that lead to negative growth.


#### Chart - 7 : Genre

In [None]:
# Chart - 7 visualization code
# Top 10 genres 
plt.figure(figsize=(13,5))
df.listed_in.value_counts().nlargest(10).plot(kind='bar')
plt.ylabel('Number of Shows',fontsize=12)
plt.title('Top 10 genres')

##### 1. Why did you pick the specific chart?

A barplot (or barchart) is one of the most common types of graphic. It shows the relationship between a numeric and a categoric variable.

##### 2. What is/are the insight(s) found from the chart?

*   Dramas is the most popular genre, followed by comedies in second place and documentaries in third.

*   Stand-up comedy and horror movies are the ninth and tenth most common genres in the dataset.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*   Yes, the gained insights can help create a positive business impact.

*   No, there are no such insights that lead to negative growth.

#### Chart - 8 : Actors

In [None]:
# Chart - 8 visualization code
# seperating actor from cast column
cast =df[~(df['cast']=='Unknown')].cast.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
plt.figure(figsize=(13,5))
cast.value_counts().nlargest(10).plot(kind='bar')
plt.ylabel('Number of Shows',fontsize=12)
plt.title('Top 10 Actors')

##### 1. Why did you pick the specific chart?

A barplot (or barchart) is one of the most common types of graphic. It shows the relationship between a numeric and a categoric variable.

##### 2. What is/are the insight(s) found from the chart?

*   Anupam Kher appears in the highest number of movies / shows, followed by Shah Rukh Khan in second place.
*   Paresh Rawal and Yuki Kaji are in ninth and tenth place.



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*   Yes, the gained insights can help create a positive business impact.

*   No, there are no such insights that lead to negative growth.

#### Chart - 9 : movies and TV shows added over the years (Bivariate)

In [None]:
# Chart - 9 visualization code

# Number of movies and TV shows added over the years
plt.figure(figsize=(20,6))
p = sns.countplot(x='year_added',data=df, hue='type')
plt.xlabel('Adding Years',fontsize=12)
plt.ylabel('Number of Shows',fontsize=12)
plt.title('Number of movies and TV shows added over the years')
for i in p.patches:
  p.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')

##### 1. Why did you pick the specific chart?

A countplot shows the count of observations in each categorical bin using bars.

##### 2. What is/are the insight(s) found from the chart?

*   In most years, more movies than TV shows were added to Netflix.
*   Also, over the last few years, from 2015 to 2021, Netflix has consistently focused on adding more shows to its platform.
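
Raw counts per year make it hard to see whether the mix is shifting towards TV shows. One way to check this, sketched here on hypothetical data, is a row-normalised crosstab that gives the TV-show share of the titles added each year.

```python
import pandas as pd

# Hypothetical miniature of the `year_added` and `type` columns.
toy = pd.DataFrame({
    "year_added": [2016, 2016, 2016, 2016, 2020, 2020, 2020, 2020],
    "type": ["Movie", "Movie", "Movie", "TV Show",
             "Movie", "Movie", "TV Show", "TV Show"],
})

# Row-normalised crosstab: each year's titles split into Movie / TV Show shares.
share = pd.crosstab(toy["year_added"], toy["type"], normalize="index")
tv_share = share["TV Show"]
print(tv_share)
```

A rising `tv_share` over the years would support the claim of an increasing focus on TV shows even when movies still dominate in absolute numbers.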


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*  Yes, the gained insights can help create a positive business impact.

*  No, there are no such insights that lead to negative growth.

#### Chart - 10 : Number of shows since 2008

In [None]:
# Chart - 10 visualization code
# Number of shows released each year since 2008
order = range(2008,2022)
plt.figure(figsize=(15,7))
p = sns.countplot(x='release_year',data=df, hue='type',
                  order = order)
plt.xlabel('Release Years',fontsize=12)
plt.ylabel('Number of Shows',fontsize=12)
plt.title('Number of shows released each year since 2008 that are on Netflix')

for i in p.patches:
  p.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')

##### 1. Why did you pick the specific chart?

A countplot shows the count of observations in each categorical bin using bars.

##### 2. What is/are the insight(s) found from the chart?

*   From 2008 to 2018, the number of movies and TV shows released gradually increases each year.

*   From 2008 to 2019, more movies than TV shows were released each year.

*   2017 and 2018 have the highest number of movie releases, with 744 and 734 respectively.

*   2019 and 2020 have the highest number of TV show releases, with 414 and 457 respectively.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.



*   No, the gained insights didn't create a positive business impact.
*   No, there are no such insights that lead to negative growth.


#### Chart - 11 : Number of seasons per TV show

In [None]:
# Chart - 11 visualization code
# Seasons in each TV show
plt.figure(figsize=(15,7))
p = sns.countplot(x='duration',data=df[df['type']=='TV Show'])
plt.xlabel('Number of Seasons',fontsize=12)
plt.ylabel('Number of Shows',fontsize=12)
plt.title('Number of seasons per TV show distribution')

##### 1. Why did you pick the specific chart?

A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable. The basic API and options are identical to those for barplot() , so you can compare counts across nested variables.

##### 2. What is/are the insight(s) found from the chart?

*   TV shows have up to 16 seasons.

*   Most TV shows have only one season. This might mean that the majority of TV shows have only recently begun and that further seasons are on the way.

*   Very few TV shows have more than 8 seasons.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*   No, the gained insights didn't create a positive business impact.

*   No, there are no such insights that lead to negative growth.

#### Chart - 12 : Length distribution of movies

In [None]:
# Chart - 12 visualization code
# Length distribution of movies
movie_df = df[df['type']=='Movie']

plt.figure(figsize=(15, 7))

# distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it
sns.histplot(movie_df['duration'], bins=30, kde=True, color='blue').set(ylabel=None)

plt.title('Length distribution of movies', fontsize=16,fontweight="bold")
plt.xlabel('Duration', fontsize=14)
plt.show()

##### 1. Why did you pick the specific chart?

We use a distribution plot (a histogram with a density curve) to represent the data. It is a univariate plot: it shows the overall distribution of a single continuous variable, here the movie duration.

##### 2. What is/are the insight(s) found from the chart?

*   The length of a movie ranges from 3 minutes to 312 minutes.
*   Most movies are between 50 and 150 minutes long, and very few movies are longer than 200 minutes, so the distribution is approximately normal.
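
Whether a distribution is approximately normal can also be checked numerically rather than only by eye. The sketch below uses synthetic durations as a stand-in for `movie_df['duration']`; the numbers are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical durations: synthetic stand-in for movie lengths in minutes.
rng = np.random.default_rng(0)
durations = rng.normal(loc=100, scale=25, size=1000)

# Skewness near 0 suggests symmetry; positive values indicate a long right tail.
skewness = stats.skew(durations)

# D'Agostino-Pearson test: a small p-value is evidence against normality.
statistic, p_value = stats.normaltest(durations)
print(f"skewness={skewness:.2f}, p-value={p_value:.3f}")
```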


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*   Yes, the gained insights can help create a positive business impact.

*   No, there are no such insights that lead to negative growth.

#### Chart - 13 : Actors for TV Shows

In [None]:
# Chart - 13 visualization code
# Top casts for TV shows (this counts complete cast strings, not individual actors)
plt.figure(figsize=(15,6))
df[~(df['cast']=='Unknown') & (df['type']=='TV Show')].cast.value_counts().nlargest(10).plot(kind='barh')
plt.xlabel('Number of TV Shows',fontsize=12)
plt.ylabel('Actor Name For TV Shows',fontsize=12)
plt.title('Actors who have appeared in highest number of TV Shows')

##### 1. Why did you pick the specific chart?

A horizontal bar plot is a plot that presents quantitative data with rectangular bars with lengths proportional to the values that they represent. A bar plot shows comparisons among discrete categories. One axis of the plot shows the specific categories being compared, and the other axis represents a measured value.

##### 2. What is/are the insight(s) found from the chart?

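*   The bars represent complete cast listings rather than individual actors, because the cast column is not split before counting. David Attenborough, listed as the sole cast member, appears in 13 TV shows, followed by the cast of Michela Luci, Jamie Watson, Anna Claire Bartlam, Dante Zee, and Eric Peterson with 4 TV shows.
*   The remaining cast listings, from third to tenth position, each appear in 2 TV shows.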
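
To count individual actors instead of identical cast strings, the cast column can be split and exploded before counting. A minimal sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows mimicking the `type` and `cast` columns.
toy = pd.DataFrame({
    "type": ["TV Show", "TV Show", "TV Show", "Movie"],
    "cast": ["Actor A, Actor B", "Actor A", "Unknown", "Actor B, Actor C"],
})

# Keep TV shows with a known cast.
tv = toy[(toy["type"] == "TV Show") & (toy["cast"] != "Unknown")]

# Split each cast string into a list, then give every actor a row of their own.
actors = tv["cast"].str.split(", ").explode()
top_actors = actors.value_counts()
print(top_actors)
```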

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*   No, the gained insights didn't create a positive business impact.

*   No, there are no such insights that lead to negative growth.


#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
#Checking correlation of all the columns using heatmap
plt.figure(figsize=(18,10))
# Restrict the correlation to numeric columns (required by pandas 2.0 and later)
correlation = df.corr(numeric_only=True)
sns.heatmap(abs(correlation),annot=True,linewidth=3,cmap='coolwarm')
plt.title('Pearson correlation of Features',fontsize=20)
plt.show()

In [None]:
# Preparing data for heatmap
df['count'] = 1
# Sum only the numeric 'count' column; the grouping key becomes a column after reset_index
data = df.groupby('country')['count'].sum().sort_values(ascending=False).reset_index()[:10]
data = data['country']


df_heatmap = df.loc[df['country'].isin(data)]
df_heatmap = pd.crosstab(df_heatmap['country'],df_heatmap['rating'],normalize = "index").T
df_heatmap

In [None]:
# Plotting the heatmap
fig, ax = plt.subplots(1, 1, figsize=(12, 12))

country_order2 = ['United States', 'India', 'United Kingdom', 'Canada', 'Japan', 'France', 'South Korea', 'Spain',
       'Mexico']

age_order = ['Adults', 'Teens', 'Older Kids', 'Kids']

sns.heatmap(df_heatmap.loc[age_order,country_order2],cmap="YlGnBu",square=True, linewidth=2.5,cbar=False,
            annot=True,fmt='1.0%',vmax=.6,vmin=0.05,ax=ax,annot_kws={"fontsize":12})
plt.show()

##### 1. Why did you pick the specific chart?

*   A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables.

*   A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. Correlation values lie in the range [-1, 1].

*   Thus, to see the correlation between all the variables along with the correlation coefficients, I used a correlation heatmap.

##### 2. What is/are the insight(s) found from the chart?

*   The first heatmap shows little, because the dataset has only a few numeric columns. So, I plotted a second heatmap of target age groups by country.

*   The US and UK are closely aligned in their Netflix target ages, but differ radically from, for example, India or Japan.

*   Also, Mexico and Spain have a similar distribution of content across age groups.


#### Chart - 15 - Pair Plot 

In [None]:
df.head()

In [None]:
# Pair Plot visualization code
# Left commented out because it takes too long to run.
# The dataset has no 'Response' column, so 'type' is used for the hue instead.
# sns.pairplot(df, hue='type')
# plt.suptitle('Relation between the variables', fontsize=20)
# plt.show()

##### 1. Why did you pick the specific chart?

*   Pair plot is used to understand the best set of features to explain a relationship between two variables or to form the most separated clusters.

*   Thus, I used a pair plot to analyse the patterns in the data and the relationships between the features. It conveys similar information to the correlation map, but as a graphical representation.

##### 2. What is/are the insight(s) found from the chart?

Here, we can see that the release year is left skewed and month is normally distributed.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

*   Movies rated for kids and older kids are at least two hours long.
*   Titles with a duration of more than 90 minutes are movies.

In [None]:
#Calculation of mean and std
def meanandstd(A,B):
  #mean and std. calutation for a and b variables
  M1 = A.mean()
  S1 = A.std()

  M2= B.mean()
  S2 = B.std()

  return (M1,S1,M2,S2)

In [None]:
#Calculation of t-value
def findtvalue(A,B,M1,S1,M2,S2):
  #length of groups and DOF
  n1 = len(A)
  n2= len(B)

  dof = n1+n2-2

  # pooled variance: each sample variance is weighted by its own degrees of freedom
  sp_2 = ((n1-1)*S1**2  + (n2-1)*S2**2) / dof

  sp = np.sqrt(sp_2)

  #tvalue
  t_val = (M1-M2)/(sp * np.sqrt(1/n1 + 1/n2))
  
  return(dof,t_val)

In [None]:
#making copy of df_clean_frame
df_hypothesis=df.copy()
#head of df_hypothesis
df_hypothesis.head()

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H0: Movies rated for kids and older kids are at least two hours long.

H1: Movies rated for kids and older kids are not at least two hours long.


#### 2. Perform an appropriate statistical test.

In [None]:
#filtering movie from Type_of_show column
# .copy() avoids modifying a view of df_hypothesis in the following cells
df_hypothesis1 = df_hypothesis[df_hypothesis["type"] == "Movie"].copy()

In [None]:
#let's see unique target ages 
df_hypothesis1['rating'].unique()

In [None]:
#Treat rating (target ages) as a categorical variable with 4 classes.
df_hypothesis1['rating'] = pd.Categorical(df_hypothesis1['rating'], categories=['Kids', 'Older Kids', 'Teens', 'Adults'])
#extract the numeric part of the duration string, then convert the object type to numeric
df_hypothesis1['duration']= df_hypothesis1['duration'].astype(str).str.extract(r'(\d+)', expand=False)
df_hypothesis1['duration'] = pd.to_numeric(df_hypothesis1['duration'])
#head of df_hypothesis1
df_hypothesis1.head(3)
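
The extraction step can be tried in isolation. The sketch below assumes that durations are stored as strings such as `"90 min"` or `"2 Seasons"`; the sample values are invented for illustration and are not rows from the dataset.

```python
import pandas as pd

# Illustrative values in the style of the duration column (assumed, not real rows).
durations = pd.Series(["90 min", "2 Seasons", "125 min", "1 Season"])

# expand=False returns a Series; the raw string avoids an invalid-escape warning.
numeric = pd.to_numeric(durations.str.extract(r"(\d+)", expand=False))
print(numeric.tolist())
```

Under this assumption the extracted numbers mix units (minutes for movies, seasons for TV shows), which is worth keeping in mind whenever both types are compared on `duration`.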

In [None]:
#group_by duration and rating                 
group_by_= df_hypothesis1[['duration','rating']].groupby(by='rating')
#mean of group_by variable
group=group_by_.mean().reset_index()
group

In [None]:
#A and B hold the duration values of the two groups being compared
A= group_by_.get_group('Kids')['duration']
B= group_by_.get_group('Older Kids')['duration']

In [None]:
# Perform Statistical Test

#Getting mean and std
M1,S1,M2,S2 = meanandstd(A,B)

#Getting t-value
dof,t_value = findtvalue(A,B,M1,S1,M2,S2)
print('-----'*6+f'\nt-value\n'+'-----'*6)
print(t_value)
print('====='*6)


#t-distribution
print('-----'*6+f'\nt-distribution\n'+'-----'*6)
print(stats.t.ppf(0.05,dof))
print(stats.t.ppf(0.95,dof))

##### Which statistical test have you done to obtain P-Value?

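*   A two-sample t-test with pooled variance was performed, and the computed t-value was compared with the critical values of the t-distribution at the 5th and 95th percentiles.

*   Because the t-value falls outside the range between these critical values, the null hypothesis is rejected.

*   As a result, we conclude that movies rated for kids and older kids are not at least two hours long.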
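
The critical-value comparison can equivalently be expressed as a p-value. The following is a minimal sketch, assuming a two-sided test; the values of `t_value` and `dof` are placeholders, and in the notebook they would come from `findtvalue`.

```python
from scipy import stats

# Assumed example values; in the notebook these would come from findtvalue().
t_value = -2.4
dof = 40

# Two-sided p-value: probability of a statistic at least this extreme under H0.
p_value = 2 * stats.t.sf(abs(t_value), dof)

alpha = 0.10  # corresponds to the 5th / 95th percentile critical values
print(p_value, p_value < alpha)
```

Rejecting when `p_value < alpha` gives the same decision as checking whether the t-value lies outside the two critical values.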


##### Why did you choose the specific statistical test?

We use a t-test rather than a z-test because the population standard deviation is unknown and has to be estimated from the samples. A t-test is the usual choice in that situation, particularly for small samples, whereas a z-test is appropriate when the population standard deviation is known or, as an approximation, when the sample size is larger than 30.
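
To illustrate how the two tests relate, the sketch below compares two-sided 5% critical values of the t-distribution with the corresponding standard normal value. The degrees of freedom are arbitrary choices for illustration.

```python
from scipy import stats

# Two-sided 5% critical values: Student's t for several degrees of freedom
# versus the standard normal (z) value.
z_crit = stats.norm.ppf(0.975)
for dof in (5, 30, 1000):
    t_crit = stats.t.ppf(0.975, dof)
    print(dof, round(t_crit, 3), round(z_crit, 3))
```

The t critical value shrinks toward the z value as the degrees of freedom grow, which is why the two tests give nearly identical decisions on large samples.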

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H0: Titles with a duration of more than 90 minutes are not movies.

H1: Titles with a duration of more than 90 minutes are movies.

#### 2. Perform an appropriate statistical test.

In [None]:
#extract the numeric part of the duration string and convert it to a numeric type
df_hypothesis['duration']= df_hypothesis['duration'].astype(str).str.extract(r'(\d+)', expand=False)
df_hypothesis['duration'] = pd.to_numeric(df_hypothesis['duration'])

In [None]:
df_hypothesis['type'] = pd.Categorical(df_hypothesis['type'], categories=['Movie','TV Show'])
#head of df_hypothesis
df_hypothesis.head(3)

In [None]:
#group_by duration and TYPE                 
group_by_= df_hypothesis[['duration','type']].groupby(by='type')
#mean of group_by variable
group=group_by_.mean().reset_index()
group

In [None]:
#A and B hold the duration values of the two groups being compared
A= group_by_.get_group('Movie')['duration']
B= group_by_.get_group('TV Show')['duration']

In [None]:
# Perform Statistical Test

#Getting mean and std
M1,S1,M2,S2 = meanandstd(A,B)

#Getting t-value
dof,t_value = findtvalue(A,B,M1,S1,M2,S2)
print('-----'*6+f'\nt-value\n'+'-----'*6)
print(t_value)
print('====='*6)


#t-distribution
print('-----'*6+f'\nt-distribution\n'+'-----'*6)
print(stats.t.ppf(0.05,dof))
print(stats.t.ppf(0.95,dof))

##### Which statistical test have you done to obtain P-Value?

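*   A two-sample t-test with pooled variance was performed, and the computed t-value was compared with the critical values of the t-distribution at the 5th and 95th percentiles.

*   Because the t-value falls outside the range between these critical values, the null hypothesis is rejected.

*   As a result, we conclude that titles with a duration of more than 90 minutes are movies.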
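
A pooled-variance t-test assumes that both groups have a similar variance. When that assumption is doubtful, Welch's t-test is a common alternative, available in SciPy through `equal_var=False`. The sketch below uses invented samples with very different spreads; it is not the test that was run in this notebook.

```python
import numpy as np
from scipy import stats

# Illustrative samples with very different spreads (assumed values).
movies = np.array([95.0, 110.0, 88.0, 130.0, 102.0, 99.0])
shows = np.array([1.0, 2.0, 1.0, 3.0, 1.0])

# Student's test pools the variances; Welch's test (equal_var=False) does not.
t_pooled, p_pooled = stats.ttest_ind(movies, shows, equal_var=True)
t_welch, p_welch = stats.ttest_ind(movies, shows, equal_var=False)
print(t_pooled, p_pooled)
print(t_welch, p_welch)
```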



##### Why did you choose the specific statistical test?

We use a t-test rather than a z-test because the population standard deviation is unknown and has to be estimated from the samples. A t-test is the usual choice in that situation, particularly for small samples, whereas a z-test is appropriate when the population standard deviation is known or, as an approximation, when the sample size is larger than 30.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing 
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
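
The project summary mentions removing punctuation before vectorization. A minimal sketch of lower-casing and punctuation removal with the standard library is shown here; `clean_text` is a hypothetical helper name and the input sentence is invented.

```python
import string

def clean_text(text: str) -> str:
    """Lower-case the text and drop ASCII punctuation characters."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table)

print(clean_text("A Thrilling, Award-Winning Drama!"))
```

Note that deleting hyphens joins the two halves of a hyphenated word; replacing punctuation with spaces instead is an equally reasonable design choice.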

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why? 

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
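
The project summary describes K-Means clustering on TF-IDF features. The sketch below shows the general fit-and-predict pattern on a toy corpus; the documents, `n_clusters=2`, and `random_state` are assumptions for illustration, whereas the summary reports 10 clusters on the real data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumed) with two obvious themes.
docs = [
    "detective investigates murder",
    "detective hunts killer",
    "comedy special filmed live",
    "stand up comedy live",
]
X = TfidfVectorizer().fit_transform(docs)

# n_clusters=2 suits this toy corpus; the project summary reports 10 clusters.
model = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = model.fit_predict(X)  # one cluster label per document
print(labels)
```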

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
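
A minimal sketch of the save-and-reload round trip with the standard `pickle` module is shown below. The toy data, the small `KMeans` model, and the file name `kmeans_model.pkl` are assumptions for illustration only.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans

# Fit a small model on assumed toy data so there is something to persist.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Hypothetical file name; a temporary directory keeps the example self-cleaning.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kmeans_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Sanity check: the reloaded model assigns the same clusters as the original.
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

Pickled models should only be loaded from trusted sources, and ideally with the same library version that was used to save them.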

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***