<a href="https://colab.research.google.com/github/Amitish/Unsupervised-ML---Netflix-Movies-and-TV-Shows-Clustering/blob/main/Unsupervised_ML_Netflix_Movies_and_TV_Shows_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Unsupervised ML - Netflix Movies and TV Shows Clustering



##### **Project Type**    - EDA/Unsupervised
##### **Contribution**    - Individual

# **Project Summary -**

Netflix is an American subscription video on-demand over-the-top streaming service. The service primarily distributes original and acquired films and television shows from various genres, and it is available internationally in multiple languages.

This project first addresses data quality issues in the raw dataset so that it can be analyzed reliably.

It then uses machine learning to group Netflix's library of 7,000+ movies and TV shows by content, so that similar titles can be surfaced together and users can navigate the platform more easily.

To gain insight into the content offered by Netflix, we analyze a dataset containing details about movies and TV shows. We employ descriptive statistics to understand the distribution of key variables and create visualizations such as scatterplots, histograms, line charts and heatmaps to explore the relationships between them.

This multi-faceted approach helps uncover patterns and trends within the dataset. We also identify the key anomalies and handle them.

The conclusion summarizes the findings so that readers can reuse them in their own projects.


# **GitHub Link -**

https://github.com/Amitish/Unsupervised-ML---Netflix-Movies-and-TV-Shows-Clustering

# **Problem Statement**


Netflix welcomes new members with a personalized onboarding journey, guiding them through account setup, preference selection, and curated recommendations based on their viewing history. Going through such a large volume of textual data manually is impractical and resource-intensive, taking a lot of time and effort.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import numpy as np
import pandas as pd
from scipy.stats import norm

# visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# nltk library
import nltk
nltk.download('stopwords')  # downloads stopwords like "a","an","the"
nltk.download('punkt')      # for tokenization - breaking text into individual sentence

# downloads stopwords from nltk library if corpus is available
from nltk.corpus import stopwords

# used for matching of string ascii, punctuations, digits
import string

# Importing regex library for comparison
import re

# For web scraping
from bs4 import BeautifulSoup

# Method for reducing words to their base forms
from nltk.stem.porter import PorterStemmer

# More accurate than simple stemming
from nltk.stem import WordNetLemmatizer

# Import TfidfVectorizer and CountVectorizer (for turning text into word-frequency features)
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Creating and displaying formatted tables (beautify)
from tabulate import tabulate

# Install third-party packages before importing them
!pip install yellowbrick
!pip install contractions

# KElbowVisualizer and SilhouetteVisualizer
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from yellowbrick.cluster.elbow import kelbow_visualizer

# Importing clustering evaluation metrics
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# For Cross-Validation and Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

# Importing algorithms for building the models
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn.decomposition import LatentDirichletAllocation

# To ignore warnings
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset

from google.colab import drive                                    # Links the Google drive with Colab notebook, to extract the desired dataset
drive.mount("/content/drive")


In [None]:
net = pd.read_csv("/content/drive/MyDrive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv")   # Extracting datafile

### Dataset First View

In [None]:
# Dataset First Look
net

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
no_of_rows = net.shape[0]
no_of_columns = net.shape[1]

print("no_of_rows: ",no_of_rows)
print("no_of_columns: ",no_of_columns)

### Dataset Information

In [None]:
# Dataset Info
net.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicates = net.duplicated().sum()
duplicates

# 0 indicates that no duplicate rows found in entire dataset

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

null = net.isnull().sum().reset_index()
null


In [None]:
# Visualizing the missing values

nulls= pd.DataFrame({'Column Name':net.columns,'Null Values':net.isnull().sum(),'Percentage %':round(net.isnull().sum()*100/len(net),2)})
nulls.set_index('Column Name').sort_values(by='Percentage %', ascending = False)

### What did you know about your dataset?

Findings so far:
1. No duplicate rows were identified in the dataset.
2. The **director, cast** and **country** columns hold the most nulls, whereas **date_added** and **rating** have the fewest.
3. **release_year** holds numeric data, whereas all other columns are categorical.
4. Data cleaning requires handling null values, but replacing them can sometimes distort the dataset.
5. So nulls must be handled cautiously, and only where required.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
net.columns

In [None]:
# Dataset Describe
net.describe()

# describe() summarizes only release_year, since it is the only numeric column

In [None]:
categorical = [i for i in net.describe(include="object")]
print(">>>", categorical)
print(" ")
print("No.of Categorical columns: ", len(categorical))

In [None]:
numeric = [j for j in net.columns if j not in categorical]
print(">>>", numeric)
print(" ")
print("No.of Numeric columns: ", len(numeric))

### Variables Description

We can conclude that:
1. No.of rows x columns =  **7787 x 12**
2. No.of categorical columns = **11**
3. No.of numeric columns = **1**

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for i in net.columns:
  d = net[i].unique()                           # unique() to get values and nunique() to get number
  print("Column ---------", i)
  print(d)
  print("No.of unique values ---------", len(d))
  print("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

net.fillna({"director": "unknown", "cast": "unknown", "country": "unknown"}, inplace=True)

net.dropna(subset=["date_added", "rating"], inplace=True)

# Converting the data types of features date_added and release_year to the appropriate data types
net['date_added'] = net['date_added'].str.strip()
net['date_added'] = pd.to_datetime(net['date_added'], format='%B %d, %Y', errors='coerce')
net.release_year = net.release_year.astype('int64')

# Renaming column listed_in to genre (in place, without overwriting the rename method)
net.rename(columns={'listed_in': 'genre'}, inplace=True)

# Breaking down the date_added column into year, month and day
net['year_added']= net['date_added'].dt.year
net['month_added']= net['date_added'].dt.month
net['day_added']= net['date_added'].dt.day

# Deleting column date_added
net.drop("date_added", axis=1, inplace=True)

#convert int32 into int64
net['year_added'] = net['year_added'].astype(np.int64)
net['month_added']= net['month_added'].astype(np.int64)
net['day_added']= net['day_added'].astype(np.int64)

# Partitioning and creating new dataset based on type of show
tv_shows_data = net[net["type"]=='TV Show']
movie_shows_data = net[net["type"]=='Movie']

In [None]:
net.isnull().sum().reset_index()

### What all manipulations have you done and insights you found?

1. We ran **isnull()** to find null values and found **5** columns with some nulls.
2. We checked for duplicates with **duplicated()** and did not find any duplicate rows in the dataset.
3. The 2 columns **date_added** and **rating** had very few null values, so we dropped the rows containing them.
4. The 3 columns **director, cast** and **country** had a large number of null values, so dropping those rows could not be afforded; we replaced the nulls with **"unknown"** instead.
5. Converted **date_added** and **release_year** to appropriate data types.
6. Renamed the column **listed_in** to **genre**.
7. Split **date_added** into **year_added**, **month_added** and **day_added**, then deleted the original column.
8. We now have **14 columns** instead of 12.
9. Partitioned the data into two new datasets based on the type of show.
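
The steps above can be sketched as one function on a tiny invented frame. This is an illustration under assumptions, not the notebook's exact cell: the helper name `clean_titles` and the sample rows are hypothetical, and only a subset of the real columns is included.

```python
import pandas as pd

def clean_titles(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the wrangling steps listed above to a copy of df."""
    out = df.copy()
    # Keep rows with sparse text fields by labelling the gaps
    out = out.fillna({"director": "unknown", "cast": "unknown", "country": "unknown"})
    # Drop the few rows that lack a date or a rating
    out = out.dropna(subset=["date_added", "rating"])
    # Parse dates such as "August 14, 2020" (surrounding spaces stripped first)
    out["date_added"] = pd.to_datetime(out["date_added"].str.strip(), format="%B %d, %Y")
    out = out.rename(columns={"listed_in": "genre"})
    # Split the date into parts, then drop the original column
    out["year_added"] = out["date_added"].dt.year
    out["month_added"] = out["date_added"].dt.month
    out["day_added"] = out["date_added"].dt.day
    return out.drop(columns="date_added")

# Hypothetical sample rows (not taken from the dataset)
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "director": ["A. Director", None, None],
    "cast": [None, "B. Actor", "C. Actor"],
    "country": ["India", None, "Brazil"],
    "date_added": [" August 14, 2020", "January 1, 2019", None],
    "rating": ["TV-MA", "TV-PG", "TV-14"],
    "listed_in": ["Dramas", "Kids' TV", "Comedies"],
})

cleaned = clean_titles(sample)
print(cleaned)
```

Working on a copy keeps the raw frame available for comparison, and the column count goes up by two: one column removed, three added.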

In [None]:
net.shape


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
ax = sns.countplot(x=net["type"])

for p in ax.patches:
    count = p.get_height()  # Get the count for each bar
    x = p.get_x() + p.get_width() / 2  # Get the center x-coordinate of each bar
    y = count
    ax.annotate(f"{count}", (x, y), ha='center')


##### 1. Why did you pick the specific chart?

- A count plot shows how many titles of each type the catalogue holds, which makes the split between movies and TV shows easy to read at a glance.


##### 2. What is/are the insight(s) found from the chart?

The bar chart gives a count of each show type:
- Movie : **5372**
- TV Show : **2398**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The chart shows that **Movie** is the most common show type in the catalogue. This reflects availability rather than viewership, since the dataset contains no watch data.

- The insight helps users understand what the catalogue offers and also shows directors where the library is concentrated.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

shows_produced = net["director"].value_counts()

# Filter out "unknown" values, keeping remaining top 10
shows_produced = shows_produced.loc[shows_produced.index != "unknown"].head(10).sort_values(ascending=True)

colors = ["red", "blue", "pink", "black", "green"]

plt.barh(shows_produced.index, shows_produced.values, color=colors)

for i, v in enumerate(shows_produced.values):
    plt.text(v + 0.05, i, str(v), va="center")

plt.title("Top 10 Directed Shows")
plt.xlabel("Count")
plt.ylabel("Directors")
plt.show()


##### 1. Why did you pick the specific chart?

- This chart gives an overview of the top 10 directors by number of titles.



##### 2. What is/are the insight(s) found from the chart?

- We can state that:
1. **Raul Campos, Jan Suter** have directed the most titles among the top 10, i.e., **18**.
2. **Robert Rodriguez** has directed **8** titles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Yes, the insight can help create a positive business impact: it shows which directors have the most titles on the platform, so a suitably experienced director can be approached.

- No insight pointing to negative growth can be drawn from this chart.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

avg_year = net["release_year"].mean()

plt.hist(net["release_year"], bins=20, edgecolor="black")

plt.axvline(x=avg_year, color="red", linestyle="dashed", linewidth=1, label="Average Release Year")

plt.title("Number of Shows Released")
plt.xlabel("Release Year")
plt.ylabel("Count")
plt.legend()
plt.show()


##### 1. Why did you pick the specific chart?

- The histogram shows how titles are distributed across release years, with the mean release year marked as a reference line.

##### 2. What is/are the insight(s) found from the chart?

- The following can be read from the chart:
1. The number of titles shows **incremental** growth with release year.
2. The **average** release year is around **2013**.
3. The **maximum** number of titles falls in the most recent histogram bin, which ends in **2021**.
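
The dashed line marks the mean release year, which is not the same thing as the year with the most releases. A small sketch on made-up release years (not the Netflix data) shows how a tail of older titles pulls the mean below the peak:

```python
from collections import Counter
from statistics import mean

# Hypothetical release years: a few old titles and many recent ones
release_years = [1975, 1990, 2005, 2015, 2017, 2018, 2018, 2018, 2019, 2020]

# Position of the dashed reference line
average_year = mean(release_years)

# The single year that occurs most often
peak_year, peak_count = Counter(release_years).most_common(1)[0]

print("mean release year:", average_year)
print("year with most titles:", peak_year, f"({peak_count} titles)")
```

For a skewed distribution like this one, the median or the mode may describe a "typical" release year better than the mean.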

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The insight can, to some extent, support a positive business impact, since the number of titles grows with each passing release year.
- No negative growth is visible in this chart.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

ratings = net["rating"].value_counts()

colors = ["red", "blue", "pink", "black", "green"]
plt.bar(ratings.index, ratings.values, color=colors)

# Add count labels above each bar
for i, v in enumerate(ratings.values):
    plt.text(i, v + 50, str(v), ha="center")

# Customize the chart
plt.title("Shows Rating Distribution")
plt.xlabel("Ratings")
plt.ylabel("Counts")
plt.show()


##### 1. Why did you pick the specific chart?

- A bar chart is used to show how many titles carry each rating.

##### 2. What is/are the insight(s) found from the chart?

- The above graph states that:
1. **TV-MA** is the **most common** rating in the list, with a count of **2861**.
2. **NC-17** is among the **least common** ratings in the list, with a count of **3**.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The analysis can have a positive impact on the business, as it gives a clear picture of how titles are spread across ratings.
- One caveat: `rating` is a maturity classification (for example TV-MA or NC-17), not a viewer score, so a low count means few titles carry that classification rather than low popularity.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

shows_share = net["type"].value_counts()

plt.pie(shows_share.values, autopct='%1.1f%%', labels=shows_share.index)

plt.title("Shows Share Percentage")

plt.show()

##### 1. Why did you pick the specific chart?

- A pie chart gives a clear picture of the show type distribution as shares of the whole.

##### 2. What is/are the insight(s) found from the chart?

- From the above chart we can say that:
1. **Movie** accounts for **69.1%** of all titles.
2. **TV Show** accounts for **30.9%**.
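
The percentages on the pie chart follow directly from the type counts reported under Chart 1. A minimal check of that arithmetic, using the two counts quoted in this notebook (the variable names are illustrative):

```python
# Counts per show type as reported in the Chart 1 insights
type_counts = {"Movie": 5372, "TV Show": 2398}

total = sum(type_counts.values())

# Share of each type as a percentage, rounded to one decimal like autopct='%1.1f%%'
shares = {name: round(100 * n / total, 1) for name, n in type_counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```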

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Yes, the insight can create a positive business impact, as users get to know the share held by each show type.
- It also gives directors an indication of which show type the platform carries more of.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

shows_prod = tv_shows_data["director"].value_counts()

# Filter out "unknown" values, keeping remaining top 10
shows_prod = shows_prod.loc[shows_prod.index != "unknown"].head(10).sort_values(ascending=True)

colors = ["red", "blue", "pink", "black", "green"]

plt.barh(shows_prod.index, shows_prod.values, color=colors)

for i, v in enumerate(shows_prod.values):
    plt.text(v + 0.05, i, str(v), va="center")

plt.title("Top 10 Directed TV Shows")
plt.xlabel("Count")
plt.ylabel("Directors")
plt.show()


##### 1. Why did you pick the specific chart?

- The horizontal bar chart shows the top 10 directors by number of TV shows directed.

##### 2. What is/are the insight(s) found from the chart?

- We can state the following:
1. **Alastair** has directed **3** TV shows, among the top 10.
2. Many directors have directed 2 or fewer TV shows.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- A higher number of TV shows directed gives an idea of how experienced a director is with that show type.
- The data does not indicate any negative growth as such.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

mov_prod = movie_shows_data["director"].value_counts()

# Filter out "unknown" values, keeping remaining top 10
mov_prod = mov_prod.loc[mov_prod.index != "unknown"].head(10).sort_values(ascending=True)

colors = ["black", "green", "orange", "blue", "grey"]

plt.barh(mov_prod.index, mov_prod.values, color=colors)

for i, v in enumerate(mov_prod.values):
    plt.text(v + 0.2, i, str(v), va="center")

plt.title("Top 10 Directed Movies")
plt.xlabel("Count")
plt.ylabel("Directors")
plt.show()


##### 1. Why did you pick the specific chart?

- The horizontal bar chart shows the top 10 directors by number of movies directed.


##### 2. What is/are the insight(s) found from the chart?

1. **Raul** and **Jan** have directed the most movies, i.e., **18**, under a shared credit.
2. **Johnnie** has directed **8** movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The number of films directed by a director can give some indication of their experience with that show type.
- So far we do not find any insights that lead to negative growth.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

mov_con = movie_shows_data["country"].value_counts()

# Filter out "unknown" values, keeping remaining top 10
mov_con = mov_con.loc[mov_con.index != "unknown"].head(10).sort_values(ascending=True)

colors = ["black", "red", "grey"]

plt.barh(mov_con.index, mov_con.values, color=colors)

for i, v in enumerate(mov_con.values):
    plt.text(v + 0.2, i, str(v), va="center")

plt.title("Top 10 Countries with most Movies Production")
plt.xlabel("Count")
plt.ylabel("Countries")
plt.show()


##### 1. Why did you pick the specific chart?

- This visualization highlights the countries with the highest movie counts.

##### 2. What is/are the insight(s) found from the chart?

- We can conclude that:
1. The **United States** has produced the most movies, with a count of **1847**.
2. **India** is the **2nd** largest movie-producing country, with a count of **852**.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The stats show that the United States is the largest movie-producing country in the catalogue, which suggests it is a preferred production base for many directors.
- Apart from the United States and India, the other countries contribute far fewer movies.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

tv_show_con = tv_shows_data["country"].value_counts()

# Filter out "unknown" values, keeping remaining top 10
tv_show_con = tv_show_con.loc[tv_show_con.index != "unknown"].head(10).sort_values(ascending=True)

colors = ["blue", "green", "orange"]

plt.barh(tv_show_con.index, tv_show_con.values, color=colors)

for i, v in enumerate(tv_show_con.values):
    plt.text(v + 0.2, i, str(v), va="center")

plt.title("Top 10 Countries with most TV Shows Production")
plt.xlabel("Count")
plt.ylabel("Countries")
plt.show()


##### 1. Why did you pick the specific chart?

- This graph shows the countries with the highest TV show counts.

##### 2. What is/are the insight(s) found from the chart?

We can deduce the following:
1. The **United States** is the largest TV show-producing country, with a count of **699**.
2. The **United Kingdom** is the **2nd** largest TV show-producing country, with a count of **203**.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The statistics show that the United States leads TV show production in this dataset, suggesting it may be a preferred destination for many directors.
- Every country other than the United States contributes a much smaller number of TV shows.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

from wordcloud import WordCloud
plt.subplots(figsize=(10,8))

casting = net[net['cast'] != 'unknown']

wordcloud = WordCloud(background_color='white').generate(','.join(casting.cast))

plt.imshow(wordcloud, interpolation="bilinear")
plt.show()


##### 1. Why did you pick the specific chart?

- The word cloud gives a visual glimpse of which names appear most often in the cast lists.

##### 2. What is/are the insight(s) found from the chart?

- We can deduce the following from the above visual:
1. Words in a bigger text size, such as Michael and Lee, belong to names that appear in the cast of the most titles.
2. Words in a smaller text size, such as Tim and Yu, belong to names that appear in the cast of fewer titles.
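
`WordCloud.generate` tokenizes the joined text, so the largest entries tend to be name fragments such as "Michael" or "Lee" rather than full performer names. A minimal sketch of counting full names instead, assuming each `cast` value is a comma-separated list of names (the sample strings below are invented):

```python
from collections import Counter

# Invented cast strings in the assumed "Name A, Name B" layout
cast_values = [
    "Alex Rivera, Sam Lee",
    "Sam Lee, Jordan Kim",
    "Alex Rivera, Sam Lee, Priya Nair",
]

# Count each full name once per title it appears in
name_counts = Counter(
    name.strip()
    for value in cast_values
    for name in value.split(",")
    if name.strip()
)

# Performers ordered by the number of titles they appear in
for name, n_titles in name_counts.most_common(3):
    print(name, n_titles)
```

If full names are wanted in the cloud itself, the counts could be passed to `WordCloud.generate_from_frequencies`, which takes a mapping of label to frequency.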

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Yes, the insight can help directors make casting decisions based on how often an actor or actress has appeared.
- No negative insight can be drawn from the above visual.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

tv_shows_yr = tv_shows_data['release_year'].value_counts().sort_index(ascending=False)
movie_shows_yr = movie_shows_data['release_year'].value_counts().sort_index(ascending=False)

movie_shows_yr.plot(figsize=(10, 5), linewidth=2, color='red', label='Movies')
tv_shows_yr.plot(figsize=(10, 5), linewidth=2, color='blue', label='TV Shows')
plt.xlabel("Years", labelpad=5)
plt.ylabel("Count", labelpad=5)
plt.title("Movies vs TV_shows Release Year Analysis")
plt.legend()   # identify which line is which
plt.show()


##### 1. Why did you pick the specific chart?

- The line chart shows the trend in release years for movies versus TV shows.

##### 2. What is/are the insight(s) found from the chart?

- It is clear from the above graph that:
1. Releases of both TV shows and movies have grown over the years.
2. 2020-2021 saw the maximum releases for both types of shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The chart supports a positive business impact by providing a clear picture of release years for both show types.
- Releases of both show types have grown over the years, so no negative growth can be stated.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

fig, ax = plt.subplots(figsize=(10, 5))

palette = sns.color_palette("Set1")
sns.countplot(x="month_added", hue="type", lw=2, data=net, palette=palette)

plt.title("Movies vs TV_shows added every month")
plt.show()


##### 1. Why did you pick the specific chart?

- The grouped bar chart shows how many movies and TV shows were added in each calendar month.

##### 2. What is/are the insight(s) found from the chart?

- We can state from the above that:
1. Movies show more month-on-month additions than TV shows.
2. Additions for both show types fluctuate from month to month.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Month-on-month additions are not stable for either show type, but movies have seen more additions.
- The only near-negative finding is that TV shows have seen fewer additions than movies.


#### Chart - 13

In [None]:
# Chart - 13 visualization code

plt.figure(figsize=(14,6))
plt.title('Top 10 Genre of Shows Type')
palette = sns.color_palette("Set2")
sns.countplot(y='genre', data=net, order=net['genre'].value_counts().index[0:10], palette=palette)
plt.show()


##### 1. Why did you pick the specific chart?

- The horizontal bar chart shows the ten most common genre labels in the catalogue.

##### 2. What is/are the insight(s) found from the chart?

- The above figure states that:
1. Documentaries is the most common genre label among all genre types.
2. The stand-up comedy and dramas genre types are close to each other in count.
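
Note that `value_counts()` on the raw column counts each full genre string, so a title tagged with several genres is counted under its combination rather than under each genre. A minimal sketch of counting individual genres, assuming the values are comma-separated (the sample values are invented):

```python
import pandas as pd

# Invented genre strings in the assumed "Genre A, Genre B" layout
genre = pd.Series([
    "Documentaries",
    "Documentaries",
    "Dramas, International Movies",
    "Comedies, International Movies",
    "Thrillers, International Movies",
])

# Counting whole strings, as the chart above does
combo_counts = genre.value_counts()

# One row per individual genre, then count
individual = genre.str.split(",").explode().str.strip()
genre_counts = individual.value_counts()

print(combo_counts.head(3))
print(genre_counts.head(3))
```

In this toy data the top whole-string label and the top individual genre differ, which is why the two views can lead to different conclusions.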

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The insight can help create a positive business impact from the directors' point of view, as they can focus on the genres that are most common.
- So far no negative insight can be deduced from the above chart.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

sns.heatmap(data=net[["release_year","year_added","month_added", "day_added"]].corr(), annot= True)

plt.title("Overall detailed Analysis")

plt.show()


##### 1. Why did you pick the specific chart?

- A heatmap is chosen to show how strongly each pair of numeric columns in the dataset is correlated.

##### 2. What is/are the insight(s) found from the chart?

- We can say from the above observation that:
1. The cells with the lightest color and the value 1 show the strongest relationship; these are the diagonal cells, where a column is compared with itself.
2. As the color darkens, the correlation weakens, and the annotated values confirm it.
3. The pair of columns with the darkest cell and the lowest value has the weakest relationship.
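
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. A minimal pure-Python sketch of that coefficient on invented values shows why the diagonal is always 1 and what values near 1 or -1 mean:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented toy columns, not taken from the dataset
release_year = [2010, 2012, 2015, 2018, 2020]
year_added = [2016, 2017, 2018, 2019, 2021]
falling = [-y for y in release_year]

r_self = pearson(release_year, release_year)   # a column against itself
r_rising = pearson(release_year, year_added)   # both columns rise together
r_falling = pearson(release_year, falling)     # one rises while the other falls

print(round(r_self, 3), round(r_rising, 3), round(r_falling, 3))
```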

#### Chart - 15 - Pair Plot

In [None]:
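# Pair Plot visualization code

# pairplot draws only the numeric columns and returns a grid of subplots
g = sns.pairplot(net)

# Title the whole grid rather than only the last subplot
g.fig.suptitle("Overall detailed Analysis", y=1.02)

plt.show()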

##### 1. Why did you pick the specific chart?

- A pair plot is chosen to get a summarized view of the relationships between the numeric columns in the dataset.

##### 2. What is/are the insight(s) found from the chart?

- The following can be drawn from the above plot:
1. Only the numeric columns are plotted, as these are the ones that add meaningful data to this analysis.
2. The off-diagonal panels show how each column relates to every other column.
3. The diagonal panels show the distribution of each column on its own; their peaks mark the most frequent values, not a relationship between columns.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1. **Hypothetical Statement 1:** Release Year Impact on Show Volume
2. **Hypothetical Statement 2:** Viewer Rating and Mature Content
3. **Hypothetical Statement 3:** Genre of Movies

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- **Hypothesis:** The number of shows released on Netflix has significantly increased **after 2010** compared to **before 2010**.

- **Null Hypothesis (Ho)**: The number of shows released **before 2010** is the same as or more than the number released after 2010.
- **Alternative Hypothesis (Ha)**: The number of shows released **after 2010** is more than the number released before 2010.

In [None]:
# Shows added to Netflix after 2010 (based on year_added)
net_after_2010 = net[net["year_added"]>2010]
print("Shows_released_after_2010: ", net_after_2010.shape[0])

In [None]:
# Shows added to Netflix in or before 2010 (based on year_added)
net_before_2010 = net[net["year_added"]<=2010]
print("Shows_released_before_2010: ", net_before_2010.shape[0])

In [None]:
mean1 = net_after_2010["year_added"].mean()
print("mean_net_after_2010: ", round(mean1,0))
mean2 = net_before_2010["year_added"].mean()
print("mean_net_before_2010: ", round(mean2,0))
overall_mean = net["year_added"].mean()
print("mean_overall: ",  round(overall_mean,0))

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# We will perform a two-proportion Z test

from statsmodels.stats.proportion import proportions_ztest

ts = net.shape[0]

# Number of titles in each group (row counts, not the DataFrames themselves)
count = np.array([net_after_2010.shape[0], net_before_2010.shape[0]])
print("Array:", count)

nobs = np.array([ts, ts])  # Total observations

stat, pval = proportions_ztest(count, nobs, alternative='larger')

print("-" * 120)
print('p-value:', pval)


##### Which statistical test have you done to obtain P-Value?

- We have chosen a two-proportion **Z Test** to determine the **p value** for the above hypothesis.
- The test is one-sided (`alternative='larger'`), so a small **p value** is evidence that the share of titles added after 2010 exceeds the share added in or before 2010.
- A **p value of 0.5** would mean the z statistic is 0, i.e. both proportions are equal. That result would not support rejecting Ho, and it is not a **type II error** in itself (a type II error is failing to reject a false Ho).
- **Ho** is rejected only when the **p value** is below the chosen significance level, commonly **0.05**.
- Hence, if the printed **p value** is below 0.05, we **reject the Ho hypothesis** in favor of **Ha**.

##### Why did you choose the specific statistical test?

- A **Z test** suits this question because it compares the share of titles added after 2010 with the share added in or before 2010, which is how we **analyze the impact** of the year 2010 on the volume of shows on Netflix.
- The **Z-test** is generally **preferred** when the sample size is large (**typically n > 30**), which holds for this dataset.
- The two-proportion version converts the **difference between** the two **sample proportions** into a Z-score, which allows easy comparison against standard normal distribution values.
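
For reference, the one-sided two-proportion z-test behind `proportions_ztest(..., alternative='larger')` can be written out by hand. The sketch below is a plain-Python version of the pooled-variance formula, not the statsmodels implementation; the function name and the counts are invented for illustration. It also shows that when both groups have the same proportion, the statistic is 0 and the one-sided p-value is exactly 0.5.

```python
from math import erfc, sqrt

def two_proportion_ztest_larger(count1, nobs1, count2, nobs2):
    """One-sided z-test of H0: p1 <= p2 against Ha: p1 > p2 (pooled variance)."""
    p1, p2 = count1 / nobs1, count2 / nobs2
    pooled = (count1 + count2) / (nobs1 + nobs2)
    std_err = sqrt(pooled * (1 - pooled) * (1 / nobs1 + 1 / nobs2))
    z = (p1 - p2) / std_err
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * erfc(z / sqrt(2))
    return z, p_value

# Invented counts: 700 of 1000 versus 300 of 1000
z, p = two_proportion_ztest_larger(700, 1000, 300, 1000)
print(f"different proportions: z = {z:.2f}, p-value = {p:.3g}")

# Identical proportions: the difference is zero, so z is zero
z_equal, p_equal = two_proportion_ztest_larger(400, 1000, 400, 1000)
print(f"equal proportions:     z = {z_equal:.2f}, p-value = {p_equal}")
```

A p-value of exactly 0.5 from this test is therefore a sign that the two inputs were identical, which is worth checking before interpreting the result.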

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- **Hypothesis:** Shows classified as "**TV-MA**" (Mature Audience) make up a **higher** share of the catalogue than shows classified as "**TV-PG**" (Parental Guidance).

- **Null Hypothesis (Ho)**: The proportion of **TV-MA** shows **equals** the proportion of **TV-PG** shows.
- **Alternative Hypothesis (Ha)**: The proportion of **TV-MA** shows is **greater than** the proportion of **TV-PG** shows.

In [None]:
# rating count for TV_MA
net_rating_TV_MA = net[net["rating"] == "TV-MA"]
print("net_rating_TV_MA: ", net_rating_TV_MA.shape[0])

In [None]:
# rating count for TV_PG
net_rating_TV_PG = net[net["rating"] == "TV-PG"]
print("net_rating_TV_PG: ", net_rating_TV_PG.shape[0])

In [None]:
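# Share of all titles carrying each rating, using the counts from the two cells above
# (2861 TV-MA and 804 TV-PG titles in the original run)
net_mean_rating_TV_MA = net_rating_TV_MA.shape[0]/net.shape[0]
print("net_mean_rating_TV_MA: ", round(net_mean_rating_TV_MA, 2))
net_mean_rating_TV_PG = net_rating_TV_PG.shape[0]/net.shape[0]
print("net_mean_rating_TV_PG: ", round(net_mean_rating_TV_PG,2))
mean2 = (net_rating_TV_MA.shape[0]+net_rating_TV_PG.shape[0])/net.shape[0]
print("Overall_mean: ", round(mean2,2))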

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# We will perform a two-proportion Z-test
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

ts = net.shape[0]

# Pass the number of titles per rating, not the filtered DataFrames
count = np.array([net_rating_TV_MA.shape[0], net_rating_TV_PG.shape[0]])
print("Array:", count)

nobs2 = np.array([ts, ts])  # Total observations

# Use nobs2 defined above; alternative='larger' tests TV-MA share > TV-PG share
stat2, pval2 = proportions_ztest(count, nobs2, alternative='larger')

print(" - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -")
print('p-value:', pval2)


##### Which statistical test have you done to obtain P-Value?

- We chose a **Z-test** to calculate the **p-value** for the above test.
- A **p-value of 0.5** is far above the usual **0.05** significance level, so the test gives **no evidence against Ho**. A high p-value is not itself a **type II error**; that term describes failing to reject an Ho that is actually false.
- The descriptive proportions above do sit closer to the **Ha hypothesis**, but that observation is not a formal test.
- Hence, on the reported p-value alone, we **fail to reject the Ho hypothesis**.

##### Why did you choose the specific statistical test?

- A **Z-test** suits **analyzing the rating count** of two specific ratings, i.e., "**TV-MA**" (Mature Audience) and "**TV-PG**" (Parental Guidance).
- The **Z-test** is generally **preferred** when the sample size is large (**typically n > 30**).
- The test converts the **difference between** the two **sample proportions** into a Z-score, which allows easy comparison against standard normal distribution values.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- **Hypothesis:** The share of titles labeled as "**Dramas**" is less than the share of titles labeled as "**Comedies**."

- **Null Hypothesis (Ho)**: The share of **Dramas** is the **same as or greater than** the share of **Comedies**.
- **Alternative Hypothesis (Ha)**: The share of **Dramas** is **less than** the share of **Comedies**.

In [None]:
# Before performing any hypothesis testing we will first explode the genre column and segregate all the genres separately

net2= net.copy()
net2["genre"] = net2["genre"].str.split(",")
net2 = net2.explode('genre')
# Strip the whitespace left around each genre after splitting on commas
net2["genre"] = net2["genre"].str.strip()
net2["genre"]

In [None]:
net_2 = pd.DataFrame(net2)

In [None]:
# Count for genre Dramas
net_2_genre_dramas = net_2[net_2["genre"] == "Dramas"]
print("net_2_genre_dramas: ", net_2_genre_dramas.shape[0])

In [None]:
# Count for genre Comedies
net_2_genre_comedy = net_2[net_2["genre"] == "Comedies"]
print("net_2_genre_comedy: ", net_2_genre_comedy.shape[0])

In [None]:
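# Share of all titles tagged with each genre, using the counts from the two cells above
# (2105 Dramas and 1471 Comedies in the original run)
net_mean_dramas = net_2_genre_dramas.shape[0]/net.shape[0]
print("net_mean_dramas: ", round(net_mean_dramas, 2))
net_mean_comedy = net_2_genre_comedy.shape[0]/net.shape[0]
print("net_mean_comedy: ", round(net_mean_comedy,2))
mean2 = (net_2_genre_dramas.shape[0]+net_2_genre_comedy.shape[0])/net.shape[0]
print("Overall_mean: ", round(mean2,2))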

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# We will perform a two-proportion Z-test

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

ts = net.shape[0]

# Comedies first, so alternative='larger' tests whether the Comedies share exceeds the Dramas share
count = np.array([net_2_genre_comedy.shape[0], net_2_genre_dramas.shape[0]])
print("Array:", count)

nobs3 = np.array([ts, ts])  # Total observations

# Use nobs3 defined above
stat3, pval3 = proportions_ztest(count, nobs3, alternative='larger')

print(" - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -")
print('p-value:', pval3)


##### Which statistical test have you done to obtain P-Value?

- We chose a **Z-test** to determine the **p-value** for the above hypothesis.
- The **p-value** for the above scenario comes out to **0.5**.
- A **p-value of 0.5** is far above the usual **0.05** significance level, so the test gives **no evidence against Ho**. A high p-value is not itself a **type II error**; that term describes failing to reject an Ho that is actually false.
- The descriptive proportions above are also close to the **Ho hypothesis**.
- Hence we conclude that, based on the result, we **fail to reject the Ho hypothesis**.

##### Why did you choose the specific statistical test?

- A **Z-test** suits **analyzing the count** of specific genres such as Comedies and Dramas.
- The **Z-test** is generally **preferred** when the sample size is large (**typically n > 30**).
- The test converts the **difference between** the two **sample proportions** into a Z-score, which allows easy comparison against standard normal distribution values.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
netflix = pd.read_csv("/content/drive/MyDrive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv")

In [None]:
# Handling Missing Values & Missing Value Imputation

netflix.fillna({"director": "unknown", "cast": "unknown", "country": "unknown"}, inplace=True)

netflix.dropna(subset=["date_added", "rating"], inplace=True)

# Converting the data types of features date_added and release_year to the appropriate data types
netflix['date_added'] = netflix['date_added'].str.strip()
netflix['date_added'] = pd.to_datetime(netflix['date_added'], format='%B %d, %Y', errors='coerce')
# errors='coerce' turns unparseable dates into NaT; drop them so the int64 casts below cannot fail
netflix.dropna(subset=['date_added'], inplace=True)
netflix.release_year = netflix.release_year.astype('int64')

# Renaming name of column listed_in to genre
netflix.rename(columns={'listed_in':'genre'}, inplace=True)

# Breaking down the date_added column into year, month and day
netflix['year_added']= netflix['date_added'].dt.year
netflix['month_added']= netflix['date_added'].dt.month
netflix['day_added']= netflix['date_added'].dt.day

# Deleting column date_added
netflix.drop("date_added", axis=1, inplace=True)

#convert int32 into int64
netflix['year_added'] = netflix['year_added'].astype(np.int64)
netflix['month_added']= netflix['month_added'].astype(np.int64)
netflix['day_added']= netflix['day_added'].astype(np.int64)

# Partitioning and creating new dataset based on type of show
tv_shows_data1 = netflix[netflix["type"]=='TV Show']
movie_shows_data1 = netflix[netflix["type"]=='Movie']

In [None]:
netflix.isnull().sum().reset_index()

#### What all missing value imputation techniques have you used and why did you use those techniques?

1. We ran **isnull()** to find null values and found **5** columns with some null values.
2. We ran **drop_duplicates()** but did not find any duplicate rows in the dataset.
3. Two columns, **date_added** and **rating**, had very few null values, so we dropped the rows containing them.
4. Three columns, **director**, **cast** and **country**, had a large number of null values, so dropping them could not be afforded; we replaced the null values with **"unknown"** instead.
5. Converted the data types of features **date_added** and **release_year** to the appropriate data types.
6. Renamed the column **listed_in** to **genre**.
7. Split the **date_added** column into **year_added**, **month_added** and **day_added**, and deleted the original column.
8. We now have **14 columns** instead of **12**.
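
The imputation steps above can be sketched on a toy frame. The rows below are invented for illustration; only the column names and the "fill when many are missing, drop when few are missing" logic follow the notebook.

```python
import pandas as pd

# Toy rows standing in for the Netflix data (values are invented)
toy = pd.DataFrame({
    "director": ["A. Director", None, None],
    "rating": ["TV-MA", None, "TV-PG"],
    "date_added": [" August 14, 2020", "January 1, 2019", "March 5, 2018"],
})

# Many gaps: keep the rows and mark the gap with a placeholder
toy = toy.fillna({"director": "unknown"})
# Few gaps: drop the affected rows
toy = toy.dropna(subset=["date_added", "rating"]).copy()

# Parse the date, then split it into year / month / day columns
parsed = pd.to_datetime(toy["date_added"].str.strip(), format="%B %d, %Y")
toy["year_added"] = parsed.dt.year
toy["month_added"] = parsed.dt.month
toy["day_added"] = parsed.dt.day
toy = toy.drop(columns="date_added")

print(toy)
```

The row with a missing `rating` is dropped, while the row with a missing `director` survives with the placeholder value.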


### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

netflix.describe()

##### What all outlier treatment techniques have you used and why did you use those techniques?

- **Treating outliers** is a task that calls for **caution** in EDA, because removing values that are merely unusual can distort the data for further observation.
- The simplest way to look for outliers in numeric columns is the **describe()** function, which reports the min and max values; comparing the two against the rest of the distribution shows whether any value lies implausibly far out.
- For detecting outliers in categorical columns we relied on the charts produced earlier.
- Based on close analysis, **no outlier** could be identified.
- Hence we can say the dataset is **free from outliers**.
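
A numeric check can complement the visual reading of **describe()**. The sketch below applies the common 1.5 × IQR rule. This rule is not what the notebook runs; it is shown only as one possible way to make the min/max comparison explicit, and the helper name `iqr_outliers` and the sample years are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical release years, not taken from the dataset
years = [2015, 2016, 2017, 2018, 2019, 2020, 1925]
print(iqr_outliers(years))
```

Whether a flagged value is an error or simply an old title remains a judgment call, which is why caution is advised before removing anything.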

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

one_hot_encoded = pd.get_dummies(netflix, columns=['type'])
one_hot_encoded.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

- We have applied the **one-hot encoding** technique to the above dataset.
- One of the most common methods, **one-hot encoding** converts each category value into a new binary column and assigns a **1 or 0 (True/False)**.
- We applied this technique to the **type** column to **distinguish** between **type_Movie** and **type_TV Show**.
- Each show is set to **True** in the column matching its type and **False** in the other.
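
A minimal sketch of the same encoding on a toy frame shows how the new column names are formed. The titles are invented; `pd.get_dummies` builds each name from the original column name, an underscore, and the category value.

```python
import pandas as pd

# Toy frame with the same two categories as the `type` column (titles are invented)
toy = pd.DataFrame({"title": ["A", "B", "C"], "type": ["Movie", "TV Show", "Movie"]})

# The `type` column is replaced by one indicator column per category
encoded = pd.get_dummies(toy, columns=["type"])
print(encoded.columns.tolist())
print(encoded)
```

Depending on the pandas version, the indicator columns hold booleans or 0/1 integers; both carry the same information.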



### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

- Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe, such as "don't" for "do not" or "I'm" for "I am".
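
To make the idea concrete, here is a small sketch of contraction expansion using a lookup table and a regular expression. The three-entry `CONTRACTION_MAP` and the helper `expand_simple` are hypothetical and for illustration only; the `contractions` package used in the next cell covers far more cases.

```python
import re

# Hypothetical mini-mapping; a real library ships a much larger table
CONTRACTION_MAP = {"don't": "do not", "i'm": "I am", "can't": "cannot"}

# One alternation pattern over all keys, matched on word boundaries, ignoring case
_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_simple(text):
    """Replace each known contraction with its expanded form."""
    return _pattern.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

print(expand_simple("I'm sure they don't know"))
```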

In [None]:
# Expand Contraction

import contractions

# Function to expand contractions in a text string
def expand_contractions(text):
    return contractions.fix(text)

# Apply the function to the description column
netflix['expanded_description'] = netflix['description'].apply(expand_contractions)
print(netflix[['description', 'expanded_description']].head())


#### 2. Lower Casing

In [None]:
# Lower Casing

# List of text columns to convert to lowercase
text_columns = ['country', 'description', 'director', 'cast']

# Convert each column to lowercase
for col in text_columns:
    if col in netflix.columns:
        netflix[col] = netflix[col].str.lower()
    else:
        print(f"Column {col} not found in the dataset.")

print(netflix.head())


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

import re
import string

# Regex character class for all punctuation; re.escape keeps ], \ and ^ literal
punctuation_pattern = f"[{re.escape(string.punctuation)}]"

# Columns to clean
text_columns = ['title', 'description', 'director', 'cast']

# Remove punctuation from these columns
for col in text_columns:
    if col in netflix.columns:
        netflix[col] = netflix[col].str.replace(punctuation_pattern, '', regex=True)
    else:
        print(f"Column {col} not found in the dataset.")

print(netflix.head())


#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & remove words that contain digits

import re

def remove_words_with_digits(text):
    # Strip URLs first so that their fragments are not left behind
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Pattern to remove words containing digits
    word_with_digits_pattern = r'\w*\d\w*'
    return re.sub(word_with_digits_pattern, '', text)

text_columns = ['description', 'title']

for col in text_columns:
    if col in netflix.columns:
        netflix[col] = netflix[col].astype(str).apply(remove_words_with_digits)
    else:
        print(f"Column {col} not found in the dataset.")

print(netflix.head())


#### 5. Removing Stopwords & Removing White spaces

In [None]:
nltk.download('stopwords')

In [None]:
# Remove Stopwords

# Set of English stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    filtered_words = [word for word in tokens if word.lower() not in stop_words]
    # Join words back into one string separated by space
    return " ".join(filtered_words)

# Applying the function to the description column
netflix['clean_description'] = netflix['description'].astype(str).apply(remove_stopwords)
print(netflix[['description', 'clean_description']].head())


In [None]:
# Remove White spaces

def remove_whitespace(text):
    # Remove leading and trailing whitespaces
    text = text.strip()
    # Replace internal multiple spaces with a single space
    text = " ".join(text.split())
    return text

# List of text columns you want to clean (adjust as necessary)
text_columns = ['description', 'title', 'director', 'cast']

# Apply the whitespace removal function to these columns
for col in text_columns:
    if col in netflix.columns:
        netflix[col] = netflix[col].astype(str).apply(remove_whitespace)
    else:
        print(f"Column {col} not found in the dataset.")

print(netflix.head())

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

def tokenize_text(text):
    # Use NLTK's word_tokenize to split the text into words
    tokens = word_tokenize(text)
    return tokens

# Apply the tokenization function
netflix['description_tokens'] = netflix['description'].apply(tokenize_text)
print(netflix[['description', 'description_tokens']].head())


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

# Create the stemmer and lemmatizer once instead of on every call
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def stem_text(tokens):
    # Reduce each token to its stem
    return [stemmer.stem(token) for token in tokens]

def lemmatize_text(tokens):
    # Map each token to its WordNet lemma (tokens are treated as nouns by default)
    return [lemmatizer.lemmatize(token) for token in tokens]

# Tokenize text (assuming 'description' is the column containing text)
netflix['tokens'] = netflix['description'].apply(word_tokenize)
netflix['stemmed_tokens'] = netflix['tokens'].apply(stem_text)
netflix['lemmatized_tokens'] = netflix['tokens'].apply(lemmatize_text)

print(netflix[['tokens', 'stemmed_tokens', 'lemmatized_tokens']].head())

##### Which text normalization technique have you used and why?

- Here we have applied both normalization techniques, i.e., **stemming** and **lemmatization**.
- **Stemming** is the process of reducing words to their root or base form by removing suffixes. For example, "running" and "runs" would both be reduced to "run".
- **Lemmatization** is similar to stemming, but it aims to reduce words to their dictionary or lemma form, which is a canonical, meaningful form of the word. For example, "running" is lemmatized to "run" when it is tagged as a verb; with no part-of-speech tag, as in the cell above, WordNetLemmatizer treats tokens as nouns and leaves "running" unchanged.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

nltk.download('averaged_perceptron_tagger')

def pos_tag_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Apply POS tagging
    pos_tags = nltk.pos_tag(tokens)
    return pos_tags

# Ensure there are no missing values
netflix['description'] = netflix['description'].fillna('')

# Apply the function to the description column
netflix['pos_tags'] = netflix['description'].apply(pos_tag_text)
print(netflix['pos_tags'].head())


#### 10. Text Vectorization

In [None]:
# Vectorizing Text

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english', ngram_range=(1, 2))

# Fit and transform the text data
tfidf_matrix = tfidf_vectorizer.fit_transform(netflix['description'])

# View shape of the TF-IDF matrix
print(tfidf_matrix.shape)

# Convert the TF-IDF matrix to a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print(tfidf_df.head())


##### Which text vectorization technique have you used and why?

- We have used the **TF-IDF vectorization** technique.
- **TF-IDF vectorization** converts text into numerical data by weighting each term by how often it appears in a description and down-weighting terms that appear in many descriptions, enabling machine learning algorithms to process text data effectively.
- Hence we have a numeric representation of the descriptions, which gives a clearer picture of the dataset.
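
The weighting can be sketched by hand. The code below assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which is what scikit-learn documents as the `TfidfVectorizer` default. The whitespace tokenisation, the helper name `tfidf`, and the three sample descriptions are simplifications made up for this illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed idf and L2-normalised rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, idf, rows

# Invented descriptions, not taken from the dataset
docs = [
    "the detective hunts the killer",
    "the chef opens the restaurant",
    "the detective opens the case",
]
vocab, idf, rows = tfidf(docs)
print({t: round(idf[t], 3) for t in ("the", "detective", "killer")})
```

A word that occurs in every description ("the") gets the lowest idf, while a word unique to one description ("killer") gets the highest, which is why TF-IDF highlights distinctive vocabulary.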

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# Compute the correlation matrix
corr_matrix = netflix[["release_year","year_added","month_added", "day_added"]].corr()

# Display the correlation matrix
print(corr_matrix)

threshold = 0.8
high_corr = [(i, j) for i in corr_matrix.columns for j in corr_matrix.columns if (i!=j and abs(corr_matrix.loc[i, j]) > threshold)]
print("Highly correlated pairs:", high_corr)


In [None]:
# Length of the description as a new feature
netflix['description_length'] = netflix['description'].apply(lambda x: len(x.split()))
netflix['description_length']

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

import pandas as pd
from sklearn.preprocessing import StandardScaler

# PCA
features = ['year_added', 'month_added', 'day_added']
X = netflix[features]
X_scaled = StandardScaler().fit_transform(X)

from sklearn.decomposition import PCA

# Initialize PCA, choose the number of components such as to cover >85% of the variance
pca = PCA(n_components=0.85)
X_pca = pca.fit_transform(X_scaled)

# Check how many components were selected
print("Number of components selected:", pca.n_components_)
print(" ")

# Look at the explained variance to understand how much info each component holds
print("Explained variance by component: %s" % pca.explained_variance_ratio_)
print(" ")

# Create a DataFrame with the loadings
loadings = pd.DataFrame(pca.components_.T, columns=['PC%d' % (i + 1) for i in range(pca.n_components_)], index=features)

# Display the loadings
print(loadings)
print(" ")

# Select features based on loadings on the first principal component
important_features = loadings.abs().nlargest(5, 'PC1').index.tolist()
print("Important features based on PCA loadings:", important_features)


##### What all feature selection methods have you used  and why?

- We have applied **PCA** to reduce dimensionality in an **unsupervised** manner.
- **PCA** is particularly **useful** in **unsupervised scenarios** where a **target variable** is **not defined**. It **reduces** the **dimensionality** of the data by **transforming** the **original variables** into a **new** set of **variables** (principal components), which are linear combinations of the original variables.
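
The mechanics can be sketched with NumPy alone: standardise the columns, eigen-decompose their covariance matrix, and keep the smallest number of components whose cumulative explained variance reaches the chosen threshold. The random data below is hypothetical and only serves to show the steps; it is not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two strongly related columns plus one independent column
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

# Standardise, then eigen-decompose the covariance matrix
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_std, rowvar=False))
order = np.argsort(eigvals)[::-1]  # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
# Smallest number of components whose cumulative share reaches 85%
n_keep = int(np.searchsorted(np.cumsum(explained), 0.85) + 1)
# Each principal component is a linear combination of the original columns
X_pca = X_std @ eigvecs[:, :n_keep]

print("Explained variance ratio:", explained)
print("Components kept:", n_keep)
```

Because two of the three columns carry nearly the same information, fewer than three components are enough to pass the threshold. When the original columns are largely unrelated, PCA cannot compress them and keeps almost all components.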


##### Which all features you found important and why?

- From the above analysis we found **3 features** that have a good **correlation** between them.
- Since each feature adds important information to the dataset, we will keep **all 3 features** for analysis.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

- Yes, to some extent the data needs to be transformed, and we have applied various transformation techniques to the dataset.
- We applied **one-hot encoding** to the **type column** of the dataset. This step creates 2 new columns, **type_Movie** and **type_TV Show**, and sets **True/False** accordingly.
- We derived a new feature, **description_length**, by counting the words in each description.
- We also ran the **PCA** dimensionality reduction technique over **3 columns** and created **3 new features**. After analysis, though, we found that all **3 features** are **crucial** and add some information.


### 6. Data Scaling

In [None]:
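# Since we have 2 categories of shows, tv_show (duration in seasons) and movie (duration in minutes), we will pick tv_show for analysis
# .copy() gives independent DataFrames, so the column assignments below do not raise SettingWithCopyWarning
tv_shows_data1 = netflix[netflix["type"]=='TV Show'].copy()
movie_shows_data1 = netflix[netflix["type"]=='Movie'].copy()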

In [None]:
# We will then split duration column to extract numeric values
tv_shows_data1[["seasons","text"]] = tv_shows_data1["duration"].str.split(" ", expand=True)
tv_shows_data1["seasons"] = tv_shows_data1["seasons"].astype('int')
tv_shows_data1.head(1)

In [None]:
# Scaling your data

tv_shows_data1[['seasons','description_length']].describe()

In [None]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=(0, 1))

tv_shows_data1[['seasons','description_length']]= min_max_scaler.fit_transform(tv_shows_data1[['seasons','description_length']])

print(tv_shows_data1[['seasons','description_length']].head())
print(" ")
print(tv_shows_data1[['seasons','description_length']].describe())


##### Which method have you used to scale you data and why?

- First we modified the data slightly to extract numeric columns, then applied the scaling technique to them.
- We have chosen the **Min-Max scaling** technique here.
- The purpose of choosing **min-max scaling** is to map the data onto a scale of **0 to 1**, which makes features with different ranges comparable and easier to read.
- It is **useful** for algorithms that require data within **bounded intervals**.
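
The transformation itself is a one-line formula, x' = (x - min) / (max - min), stretched to the chosen range. A minimal pure-Python sketch follows. The helper `min_max_scale`, its handling of constant columns, and the sample values are assumptions made for illustration.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so the minimum maps to lo and the maximum to hi."""
    lo, hi = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant column: map everything to lo (a convention chosen for this sketch)
        return [lo for _ in values]
    return [lo + (v - v_min) / span * (hi - lo) for v in values]

seasons = [1, 2, 5]  # hypothetical season counts
print(min_max_scale(seasons))
```

Because the minimum and maximum define the scale, a single extreme value compresses all other values towards 0, so it helps to inspect the range before scaling.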

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

- **Dimensionality reduction** is a critical technique in data preprocessing, especially **when dealing** with **high-dimensional data**.
- It **involves reducing** the **number of variables** under consideration by obtaining a set of principal variables.
- Since **we don't have** **highly dimensional data**, we can proceed **without** applying any dimensionality reduction technique.
- A **dimensionality reduction** technique like **PCA** replaces the original columns with combinations of them, which can discard information and is **difficult to interpret**. So, for a smaller dataset it is better not to apply it.


In [None]:
# Dimensionality Reduction (If needed)

# Dimensionality reduction is not needed here, but PCA has already been applied above.


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

- Although there is **no need** of **Dimensionality Reduction** but still we have applied **PCA** above.


### 8. Data Splitting

- A **train-test split** is generally **performed** on a **supervised dataset** that has a **target column**. Our problem is unsupervised, but purely for visual analysis **we are going** to **perform a train-test split** on the **above dataset**, treating **seasons** as the target.

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

# Define the features and the target
X = tv_shows_data1.drop('seasons', axis=1)  # Assuming 'seasons' is the target
y = tv_shows_data1['seasons']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Train set size:", X_train.shape)
print("Test set size:", X_test.shape)

# Check the distribution of the target in the training set
sns.histplot(y_train, kde=True, stat="density", linewidth=0)

# Check the distribution of the target in the test set
sns.histplot(y_test, kde=True, stat="density", linewidth=0, color='red')

##### What data splitting ratio have you used and why?

- Here we have used a **70-30** **split**: **70%** for **train data** and **30%** for **test data**.
- A **70-30** ratio is a common choice: it leaves most of the data for training while keeping a test set large enough to be representative.
- The above graph shows that the distribution of **seasons** in the **unseen test data** closely matches the distribution in the **train data**.
- Hence we can state that the **split is representative**.
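
What a random split does can be sketched in a few lines of standard-library Python. This is not scikit-learn's implementation; the helper `shuffle_split` and the row ids are hypothetical, and the sketch only illustrates shuffling indices and cutting them at the chosen ratio.

```python
import random

def shuffle_split(rows, test_size=0.3, seed=42):
    """Shuffle the row indices and cut them into train and test parts."""
    indices = list(range(len(rows)))
    # A seeded generator makes the split reproducible
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))  # hypothetical row ids
train, test = shuffle_split(rows)
print(len(train), len(test))
```

Every row lands in exactly one of the two parts, and fixing the seed plays the same role as `random_state=42` in the cell above.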

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***