<a href="https://colab.research.google.com/github/Amitish/Unsupervised-ML---Netflix-Movies-and-TV-Shows-Clustering/blob/main/Unsupervised_ML_Netflix_Movies_and_TV_Shows_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Unsupervised ML - Netflix Movies and TV Shows Clustering



##### **Project Type**    - EDA/Unsupervised
##### **Contribution**    - Individual

# **Project Summary -**

Netflix is an American subscription video on-demand over-the-top streaming service. The service primarily distributes original and acquired films and television shows from various genres, and it is available internationally in multiple languages.

This project first addresses data quality issues in the raw dataset so that it can be analyzed reliably.

It then uses unsupervised machine learning to group the 7,000+ movies and TV shows in the dataset by content, with the aim of making titles easier to discover and the platform easier to navigate.

To gain insight into the content offered by Netflix, we analyze a dataset containing details about its movies and TV shows. We employ descriptive statistics to understand the distribution of key variables and create visualizations such as scatterplots, histograms, line charts and heatmaps to explore the relationships between them.

This multi-faceted approach helps us uncover patterns and trends within the dataset. We also identify the key anomalies and address them.

The conclusion summarizes our findings so that readers can reuse these insights in their own projects.


# **GitHub Link -**

https://github.com/Amitish/Unsupervised-ML---Netflix-Movies-and-TV-Shows-Clustering

# **Problem Statement**


Netflix welcomes new members with a personalized onboarding journey, guiding them through account setup, preference selection, and curated recommendations based on their viewing history. Going through such a large volume of textual data manually is impractical and resource-intensive, taking a lot of time and effort.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import numpy as np
import pandas as pd
from scipy.stats import norm

# visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# NLTK resources
import nltk
nltk.download('stopwords')  # stop-word lists, e.g. "a", "an", "the"
nltk.download('punkt')      # tokenizer models for splitting text into sentences and words

# Stop-word lists from the NLTK corpus downloaded above
from nltk.corpus import stopwords

# String constants (ASCII letters, punctuation, digits) used when cleaning text
import string

# Regular expressions for pattern matching
import re

# For web scraping / HTML parsing
from bs4 import BeautifulSoup

# Stemming: reduces words to their base forms
from nltk.stem.porter import PorterStemmer

# Lemmatization: usually more accurate than simple stemming
from nltk.stem import WordNetLemmatizer

# TfidfVectorizer (TF-IDF weights) and CountVectorizer (raw word counts)
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Creating and displaying formatted tables (beautify)
from tabulate import tabulate

# Install yellowbrick before importing it
!pip install yellowbrick

# KElbowVisualizer and SilhouetteVisualizer for choosing the number of clusters
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from yellowbrick.cluster.elbow import kelbow_visualizer

# Importing clustering evaluation metrics
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# For Cross-Validation and Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

# Importing algorithms for building models
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn.decomposition import LatentDirichletAllocation

# To ignore warnings
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset

from google.colab import drive                                    # Links the Google drive with Colab notebook, so that we could extract the desired dataset
drive.mount("/content/drive")


In [None]:
net = pd.read_csv("/content/drive/MyDrive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv")   # Extracting datafile

### Dataset First View

In [None]:
# Dataset First Look
net

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
no_of_rows = net.shape[0]
no_of_columns = net.shape[1]

print("no_of_rows: ",no_of_rows)
print("no_of_columns: ",no_of_columns)

### Dataset Information

In [None]:
# Dataset Info
net.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicates = net.duplicated().sum()
duplicates

# 0 indicates that no duplicate rows were found in the entire dataset

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

null = net.isnull().sum().reset_index()
null


In [None]:
# Visualizing the missing values

nulls= pd.DataFrame({'Column Name':net.columns,'Null Values':net.isnull().sum(),'Percentage %':round(net.isnull().sum()*100/len(net),2)})
nulls.set_index('Column Name').sort_values(by='Percentage %', ascending = False)

### What did you know about your dataset?

Findings so far:
1. No duplicate rows were identified in the dataset.
2. The **director, cast and country** columns hold the most nulls, whereas **date_added** and **rating** have the fewest.
3. **release_year** holds numeric data, whereas all other columns are categorical.
4. Data cleaning requires handling null values, but replacing them can sometimes distort the dataset.
5. So nulls must be replaced cautiously, and only where required.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
net.columns

In [None]:
# Dataset Describe
net.describe()

# describe() summarizes only release_year, since it is the only numeric column

In [None]:
categorical = [i for i in net.describe(include="object")]
print(">>>", categorical)
print(" ")
print("No.of Categorical columns: ", len(categorical))

In [None]:
numeric = [j for j in net.columns if j not in categorical]
print(">>>", numeric)
print(" ")
print("No.of Numeric columns: ", len(numeric))

### Variables Description

We can conclude that:
1. No.of rows x columns =  **7787 x 12**
2. No.of categorical columns = **11**
3. No.of numeric columns = **1**
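
The same split can also be obtained directly from the column dtypes. The sketch below is a minimal illustration on a hypothetical three-column frame (`toy` and its values are assumptions, not the real dataset); it uses `select_dtypes`, so the numeric list does not depend on `describe()` output.

```python
import pandas as pd

# Hypothetical miniature frame standing in for `net`
toy = pd.DataFrame({
    "title": ["A", "B"],
    "type": ["Movie", "TV Show"],
    "release_year": [2019, 2020],
})

# Numeric columns are selected by dtype; everything else is treated as categorical
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

On the real data, replacing `toy` with `net` should reproduce the 11 / 1 split reported above.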

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for i in net.columns:
  d = net[i].unique()                           # unique() to get values and nunique() to get number
  print("Column ---------", i)
  print(d)
  print("No.of unique values ---------", len(d))
  print("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

net.fillna({"director": "unknown", "cast": "unknown", "country": "unknown"}, inplace=True)

net.dropna(subset=["date_added", "rating"], inplace=True)

# Converting date_added and release_year to appropriate data types
# (stripping any stray whitespace first so every date string parses consistently)
net['date_added'] = pd.to_datetime(net['date_added'].str.strip())
net['release_year'] = net['release_year'].astype('int64')

# Renaming column listed_in to genre
# (inplace=True returns None, so the result must not be assigned back to net.rename)
net.rename(columns={'listed_in': 'genre'}, inplace=True)

# Breaking down the date_added column into year, month and day
net['year_added'] = net['date_added'].dt.year
net['month_added'] = net['date_added'].dt.month
net['day_added'] = net['date_added'].dt.day

# Deleting column date_added
net.drop("date_added", axis=1, inplace=True)

# Partitioning and creating new dataset based on type of show
tv_shows_data = net[net["type"]=='TV Show']
movie_shows_data = net[net["type"]=='Movie']

In [None]:
net.isnull().sum().reset_index()

### What all manipulations have you done and insights you found?

1. We used **isnull()** to find null values and found that 5 columns contain nulls.
2. We checked for duplicates with **duplicated()** and found no duplicate rows in the dataset.
3. The 2 columns **date_added** and **rating** had very few null values, so we dropped the rows containing them.
4. The 3 columns **director, cast** and **country** had a large number of null values, so dropping them was not affordable; we replaced the nulls with **"unknown"** instead.
5. Converted **date_added** and **release_year** to the appropriate data types.
6. Renamed the column **listed_in** to **genre**.
7. Split the **date_added** column into **year, month and day** columns and deleted the original column.
8. We now have **14 columns** instead of 12.
9. Partitioned the data into separate datasets based on the type of show.
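
The effect of these steps on the column count can be checked on a small example. The sketch below replays the cleaning steps on a hypothetical three-row frame (`toy`, its values and the helper `wrangle` are assumptions for illustration); it is not the notebook's own cell, and it works on a copy instead of using `inplace=True`.

```python
import pandas as pd

# Hypothetical three-row frame with the same 12 column names as the dataset
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "TV Show", "Movie"],
    "title": ["A", "B", "C"],
    "director": ["Jane Doe", None, None],
    "cast": [None, "John Roe", "Ann Poe"],
    "country": ["India", None, "Japan"],
    "date_added": ["August 14, 2020", " September 1, 2019", None],
    "release_year": [2019, 2018, 2017],
    "rating": ["TV-MA", "TV-14", "PG"],
    "duration": ["93 min", "2 Seasons", "101 min"],
    "listed_in": ["Dramas", "TV Comedies", "Thrillers"],
    "description": ["x", "y", "z"],
})

def wrangle(df):
    """Replay the cleaning steps on a copy and return the cleaned frame."""
    out = df.copy()
    # Keep rows with missing director/cast/country by labelling them "unknown"
    out = out.fillna({"director": "unknown", "cast": "unknown", "country": "unknown"})
    # Drop the few rows that lack date_added or rating
    out = out.dropna(subset=["date_added", "rating"])
    # Parse dates after stripping stray whitespace
    out["date_added"] = pd.to_datetime(out["date_added"].str.strip(), format="%B %d, %Y")
    out = out.rename(columns={"listed_in": "genre"})
    # Three new columns are added, then the original date column is removed
    out["year_added"] = out["date_added"].dt.year
    out["month_added"] = out["date_added"].dt.month
    out["day_added"] = out["date_added"].dt.day
    return out.drop(columns="date_added")

clean = wrangle(toy)
print(clean.shape)
```

Twelve columns plus three date parts minus the dropped `date_added` gives 14 columns, while the row with a missing date is removed.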

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Create the bar chart
show_type = net["type"].value_counts()
plt.bar(show_type.index, show_type.values)

# Add count labels above each bar
for i, v in enumerate(show_type.values):
    plt.text(i, v + 50, str(v), ha="center")  # Positions text above the bar

# Customize the chart
plt.title("Show Type Distribution")
plt.xlabel("Show Type")
plt.ylabel("Counts")
plt.show()


##### 1. Why did you pick the specific chart?

This bar chart shows how many titles of each show type are available. It indicates which type dominates the catalogue, which gives users a sense of what is on offer before they decide what to watch.


##### 2. What is/are the insight(s) found from the chart?

The bar chart gives a count of Show Type:
- Movie : 5372
- TV Show : 2398

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- We can derive from the chart above that **Movie** is the most common show type in the catalogue. This gives users an idea of how much content each show type offers and so assists them in making the right choice.

- The insight will benefit not just users in their decision-making; it may also help directors gauge where users' interest lies.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

shows_produced = net["director"].value_counts()

# Filter out "unknown" values, keeping remaining top 10
shows_produced = shows_produced.loc[shows_produced.index != "unknown"].head(10).sort_values(ascending=True)

colors = ["red", "blue", "pink", "black", "green"]

plt.barh(shows_produced.index, shows_produced.values, color=colors)

for i, v in enumerate(shows_produced.values):
    plt.text(v + 0.05, i, str(v), va="center")

plt.title("Top 10 Directed Shows")
plt.xlabel("Count")
plt.ylabel("Directors")
plt.show()


##### 1. Why did you pick the specific chart?

- This chart was selected to give an overview of the top 10 directors by number of shows.



##### 2. What is/are the insight(s) found from the chart?

- We can state that:
1. The **Raul Campos, Jan Suter** entry has the most shows among the top 10, i.e. **18**.
2. **Robert Rodriguez** has directed **8** shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Yes, the gained insights will help create a positive business impact, as they show which directors have directed the most shows. Based on that experience, the right director can be approached.

- No insight pointing to negative growth can be drawn from this chart.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

avg_year = net["release_year"].mean()

plt.hist(net["release_year"], bins=20, edgecolor="black")

plt.axvline(x=avg_year, color="red", linestyle="dashed", linewidth=1, label="Average Release Year")

plt.title("Number of Shows Released")
plt.xlabel("Release Year")
plt.ylabel("Count")
plt.legend()
plt.show()


##### 1. Why did you pick the specific chart?

- This histogram shows how shows are distributed by release year, with the average release year marked by a dashed line.

##### 2. What is/are the insight(s) found from the chart?

- The following information can be drawn from the chart:
1. The number of shows released has **grown** over time.
2. The **average** release year is about **2013**.
3. The **highest** count falls in the most recent bin, which ends at **2021**.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The above insight will, to some extent, help in gaining a positive business impact, as the number of shows released increases with each passing year.
- So far we can't see any negative growth in the data above, as all the figures point in a positive direction.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

ratings = net["rating"].value_counts()

colors = ["red", "blue", "pink", "black", "green"]
plt.bar(ratings.index, ratings.values, color=colors)

# Add count labels above each bar
for i, v in enumerate(ratings.values):
    plt.text(i, v + 50, str(v), ha="center")

# Customize the chart
plt.title("Shows Rating Distribution")
plt.xlabel("Ratings")
plt.ylabel("Counts")
plt.show()


##### 1. Why did you pick the specific chart?

- A bar chart was chosen to show how many titles fall under each rating.

##### 2. What is/are the insight(s) found from the chart?

- The above graph states that:
1. **TV-MA** is the **most frequent** rating in the list, with a count of **2861**.
2. **NC-17** is among the **least frequent** ratings in the list, with a count of **3**.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- In short, the above analysis should have a positive impact on the business, as it gives a clear picture of how many titles carry each rating.
- One possible negative: ratings with very few titles may point to content that is less in demand, although the counts alone cannot confirm popularity.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

shows_share = net["type"].value_counts()

plt.pie(shows_share.values, autopct='%1.1f%%', labels=shows_share.index)

plt.title("Shows Share Percentage")

plt.show()

##### 1. Why did you pick the specific chart?

- This chart was selected to get a clear picture of the show type distribution.

##### 2. What is/are the insight(s) found from the chart?

- From the above chart we can say that:
1. The **Movie** show type accounts for **69.1%** of all titles.
2. The **TV Show** type accounts for **30.9%**.
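
These percentages follow directly from the counts reported for Chart 1 (5372 movies and 2398 TV shows). A short worked check, using only those two numbers:

```python
import pandas as pd

# Counts taken from the Chart 1 insights above
counts = pd.Series({"Movie": 5372, "TV Show": 2398})

# Each type's share of the 7770 titles, in percent, rounded to one decimal
share = (counts / counts.sum() * 100).round(1)
print(share.to_dict())
```

On the full frame, `net["type"].value_counts(normalize=True)` gives the same proportions without the intermediate counts.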

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Yes, the gained insight should create a positive business impact, as users get to know the share held by each show type.
- Moreover, directors will also have a better understanding of public demand.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

shows_prod = tv_shows_data["director"].value_counts()

# Filter out "unknown" values, keeping remaining top 10
shows_prod = shows_prod.loc[shows_prod.index != "unknown"].head(10).sort_values(ascending=True)

colors = ["red", "blue", "pink", "black", "green"]

plt.barh(shows_prod.index, shows_prod.values, color=colors)

for i, v in enumerate(shows_prod.values):
    plt.text(v + 0.05, i, str(v), va="center")

plt.title("Top 10 Directed TV Shows")
plt.xlabel("Count")
plt.ylabel("Directors")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

mov_prod = movie_shows_data["director"].value_counts()

# Filter out "unknown" values, keeping remaining top 10
mov_prod = mov_prod.loc[mov_prod.index != "unknown"].head(10).sort_values(ascending=True)

colors = ["black", "green", "orange", "blue", "grey"]

plt.barh(mov_prod.index, mov_prod.values, color=colors)

for i, v in enumerate(mov_prod.values):
    plt.text(v + 0.2, i, str(v), va="center")

plt.title("Top 10 Directed Movies")
plt.xlabel("Count")
plt.ylabel("Directors")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

mov_con = movie_shows_data["country"].value_counts()

# Filter out "unknown" values, keeping remaining top 10
mov_con = mov_con.loc[mov_con.index != "unknown"].head(10).sort_values(ascending=True)

colors = ["black", "red", "grey"]

plt.barh(mov_con.index, mov_con.values, color=colors)

for i, v in enumerate(mov_con.values):
    plt.text(v + 0.2, i, str(v), va="center")

plt.title("Top 10 Countries with most Movies Production")
plt.xlabel("Count")
plt.ylabel("Countries")
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

tv_show_con = tv_shows_data["country"].value_counts()

# Filter out "unknown" values, keeping remaining top 10
tv_show_con = tv_show_con.loc[tv_show_con.index != "unknown"].head(10).sort_values(ascending=True)

colors = ["blue", "green", "orange"]

plt.barh(tv_show_con.index, tv_show_con.values, color=colors)

for i, v in enumerate(tv_show_con.values):
    plt.text(v + 0.2, i, str(v), va="center")

plt.title("Top 10 Countries with most TV Shows Production")
plt.xlabel("Count")
plt.ylabel("Countries")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
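
A correlation heatmap needs a correlation matrix over the numeric columns first. The sketch below shows one way to build it on a hypothetical frame (`toy` and its values are assumptions); the column names mirror the numeric columns created during data wrangling.

```python
import pandas as pd

# Hypothetical values; only the column names mirror the wrangled dataset
toy = pd.DataFrame({
    "release_year": [2015, 2017, 2018, 2019, 2020],
    "year_added":   [2016, 2018, 2018, 2020, 2021],
    "month_added":  [1, 7, 12, 3, 9],
    "title":        ["A", "B", "C", "D", "E"],
})

# numeric_only=True skips text columns such as title
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

The resulting matrix could then be drawn with `sns.heatmap(corr, annot=True)`.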

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
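
As one illustration of how a p-value could be obtained here, the sketch below runs a chi-square test of independence between two categorical variables. The hypothesis (show type is independent of rating) and the counts are made up for the example; they are not results from this dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up contingency table for illustration only
observed = pd.DataFrame(
    {"TV-MA": [120, 80], "TV-14": [90, 60], "PG": [40, 10]},
    index=["Movie", "TV Show"],
)

# H0: show type and rating are independent
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # conventional significance level
reject_null = p_value < alpha
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}, reject H0: {reject_null}")
```

With the real data, a table of this shape could be built with `pd.crosstab(net["type"], net["rating"])`.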

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & Remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces
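
Steps 2 to 5 above can be combined into a single helper. The sketch below is one possible arrangement using only the standard library; the function name `clean_text` and the tiny `STOP_WORDS` set are assumptions for illustration (the notebook downloads NLTK's full stop-word list instead).

```python
import re
import string

# Tiny illustrative stop-word set (an assumption, not NLTK's list)
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "to", "is"}

def clean_text(text):
    """Lower-case, then drop URLs, punctuation, digit-bearing words, stop words and extra spaces."""
    text = text.lower()
    # URLs go first: once punctuation is stripped they are no longer recognisable
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Any token that contains at least one digit
    text = re.sub(r"\w*\d\w*", " ", text)
    words = [w for w in text.split() if w not in STOP_WORDS]
    # Joining on a single space also collapses repeated whitespace
    return " ".join(words)

print(clean_text("The Heist of 2019: a crew   plans the job, see https://example.com!"))
# → heist crew plans job see
```

Note the ordering: URLs are removed before punctuation, which differs from the step numbering above but avoids leaving URL fragments behind.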

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
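
The imports at the top of the notebook include `KMeans` and `silhouette_score`, which suggests K-Means as one candidate model; the notebook does not say which model is Model 1, so that is an assumption here. A minimal sketch on made-up descriptions:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up descriptions with two loose themes (crime and cooking)
docs = [
    "detective hunts serial killer",
    "police detective solves murder",
    "killer escapes police detective",
    "chef opens new restaurant",
    "young chef wins cooking contest",
    "restaurant chef cooking show",
]

X = TfidfVectorizer().fit_transform(docs)

# Fit the algorithm and assign each description to a cluster
model = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = model.fit_predict(X)

# Silhouette score lies in [-1, 1]; higher means better-separated clusters
score = silhouette_score(X, labels)
print(labels, round(score, 3))
```

`n_clusters=2` is chosen only because the toy corpus has two themes; on the real data the number of clusters would have to be selected separately.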

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
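
For clustering there are no labels to cross-validate against, so one common alternative to a grid search is to loop over candidate cluster counts and compare an internal metric. The sketch below does this with the silhouette score on synthetic data (`make_blobs` stands in for the real feature matrix, which is an assumption).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real feature matrix
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# Score each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The `KElbowVisualizer` imported earlier offers a visual version of the same search.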

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
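
Saving and reloading can be sketched with the standard-library `pickle` module. The model, data and file name below are placeholders (assumptions); the point is only the round trip and the sanity check that the restored model predicts the same clusters.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans

# Placeholder data and model standing in for the best-performing model
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Save the fitted model to a temporary file
path = os.path.join(tempfile.mkdtemp(), "kmeans_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and compare predictions on unseen points
with open(path, "rb") as f:
    restored = pickle.load(f)

unseen = np.array([[0.2, 0.1], [4.8, 5.2]])
same = bool((model.predict(unseen) == restored.predict(unseen)).all())
print(same)
```

A pickled scikit-learn model should be loaded with the same library version that saved it, and pickle files from untrusted sources should not be loaded at all.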

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***