<a href="https://colab.research.google.com/github/Deeraj-sudan/netflix-movies-ml-project/blob/main/Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Netflix Movies and TV Shows Clustering



##### **Project Type**    - Unsupervised




##### **Contribution**    - Individual
##### **Name**   - **Deeraj Sudan**

# **Project Summary -**

**Project Summary: Clustering Netflix Movies and TV Shows**

The aim of this project is to conduct an in-depth analysis of Netflix's vast collection of movies and TV shows as of 2019 using unsupervised machine learning techniques. This dataset, collected from Flixable, provides a snapshot of Netflix's content offering and its evolution over the years. The project will encompass four primary objectives:

**1. Exploratory Data Analysis (EDA):** The project begins with an extensive exploration of the dataset. EDA will involve data cleaning, missing value handling, and statistical analysis to gain a comprehensive understanding of the dataset's characteristics. Visualizations will be employed to reveal trends, patterns, and potential outliers in the data. This phase serves as the foundation for subsequent analyses.

**2. Content Analysis by Country:** Netflix operates in numerous countries, and the type of content available often varies by region. This project aims to investigate the diversity of content offerings across different countries. By segmenting and analyzing content by region, it will be possible to identify regional preferences and trends, potentially informing Netflix's content acquisition strategies.

**3. Focus on TV vs. Movies:** A key aspect of this project is to evaluate whether Netflix's content strategy has shifted over the years. By comparing the number of movies and TV shows available on the platform in 2010 and 2019, it will be possible to determine if Netflix has been increasingly focusing on TV shows as reported by Flixable. This analysis will provide insights into Netflix's evolving content mix.

**4. Clustering Similar Content:** The final phase of the project involves applying unsupervised machine learning techniques, specifically clustering, to group similar movies and TV shows. The goal is to identify patterns and similarities in the content based on text-based features, such as titles, genres, and descriptions. Clustering will enable Netflix to better understand its content library and potentially offer personalized recommendations to users.

Furthermore, the project suggests the possibility of enhancing the analysis by integrating external datasets, such as IMDB ratings and Rotten Tomatoes scores. This will provide additional context and insights into the quality and reception of Netflix's content.

In conclusion, this project explores Netflix's content landscape with the primary goal of offering valuable insights to both the company and its viewers. It will involve thorough data analysis, regional content segmentation, historical trend analysis, and machine learning-driven content clustering. The results will empower Netflix to make data-driven decisions about its content strategy and potentially enhance user experiences by tailoring recommendations based on content similarities. As the digital entertainment landscape continues to evolve, understanding and leveraging data-driven insights are critical for staying competitive, and this project serves as a compelling example of data science's application in the media and entertainment industry.

# **GitHub Link -**

https://github.com/Deeraj-sudan/netflix-movies-ml-project

# **Problem Statement**


**Problem Statement:**

The objective of this project is to analyze Netflix's extensive collection of movies and TV shows using a dataset that captures their offerings as of 2019. This analysis encompasses four key aspects:

**1. Exploratory Data Analysis (EDA):** The initial challenge is to perform a comprehensive exploration of the dataset. This involves data cleansing, handling missing values, and employing statistical and visual analysis to understand the dataset's characteristics and identify any potential anomalies. The EDA phase aims to provide a solid foundation for the subsequent analyses.

**2. Content Diversity Across Countries:** Netflix operates in multiple countries, and the content available often varies by region. This project seeks to investigate the diversity of content offerings in different countries. By segmenting and examining content based on geographical regions, the goal is to discern regional content preferences and trends, which can be instrumental in informing Netflix's content acquisition strategies.

**3. Shift in Focus: TV vs. Movies:** Another vital aspect is to evaluate Netflix's content strategy over the years. This involves comparing the count of movies and TV shows available on the platform in 2010 and 2019. The central question is whether Netflix has increasingly prioritized TV shows over movies, as indicated in a report by Flixable. This analysis aims to reveal trends in Netflix's content mix, potentially aiding strategic decision-making.

**4. Content Clustering:** The final challenge is to leverage unsupervised machine learning techniques to cluster similar movies and TV shows. The primary objective is to uncover patterns and similarities within the content library based on text-based features like titles, genres, and descriptions. Clustering will provide Netflix with insights into its content landscape and enable the potential for personalized recommendations to enhance user experiences.

To address these challenges, the project may consider augmenting the analysis by integrating external datasets, such as IMDB ratings and Rotten Tomatoes scores, to gain insights into content quality and reception. The ultimate goal is to provide Netflix with data-driven insights to refine its content strategy and potentially optimize content recommendations for users, ensuring competitiveness and relevance in the rapidly evolving digital entertainment landscape.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
from datetime import datetime as dt

# Import Data Visualisation Libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Import warnings and suppress them to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
# Read csv file
data_raw = pd.read_csv('/content/drive/My Drive/Global Terrorism/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

# Create a copy of the dataset
# Work on a copy so that data wrangling does not modify the raw data
data = data_raw.copy()

### Dataset First View

In [None]:
# Dataset First Look
data.head()

In [None]:
data.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isna().sum().sort_values(ascending=False)

In [None]:
data.isnull().sum().sum()

In [None]:
# Visualizing the missing values
# null value distribution
null_counts = data.isnull().sum()/len(data)
plt.figure(figsize=(6,5))
plt.xticks(np.arange(len(null_counts)),null_counts.index,rotation='vertical')
plt.ylabel('fraction of rows with missing data')
plt.bar(np.arange(len(null_counts)),null_counts)
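
The null counts and the bar chart above can also be combined into one compact table. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper, and the toy frame uses made-up values in place of the Netflix data.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and percentages, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually contain gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for the Netflix data (values are made up)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "X", None, None],
    "rating": ["TV-MA", None, "PG", "R"],
})
report = missing_summary(toy)
print(report)
```

On the real dataset the same call would be `missing_summary(data)`.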

### What did you know about your dataset?

1. show_id: A unique identifier for each show or movie on Netflix.

2. type: Indicates whether the entry is a "TV Show" or a "Movie."

3. title: The title of the show or movie.

4. director: The director(s) of the show or movie. This variable contains missing values (NaN).

5. cast: The cast or actors in the show or movie. This variable also contains missing values (NaN).

6. country: The country or countries where the show or movie is available. This variable contains some missing values.

7. date_added: The date when the show or movie was added to Netflix. Some missing values present.

8. release_year: The year the show or movie was originally released.

9. rating: The content rating assigned to the show or movie. Some missing values present.

10. duration: The duration of the show or movie, typically in terms of seasons (for TV shows) or minutes (for movies).

11. listed_in: The categories or genres that the show or movie is classified under.

12. description: A brief description or summary of the show or movie.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe(include='all')

### Variables Description

Because `describe` is called with `include='all'`, its output covers both numeric and non-numeric columns. Numeric columns such as 'release_year' get the usual statistics (mean, standard deviation, minimum, quartiles, maximum), while text columns such as 'type', 'title', 'director', and 'cast' get `count`, `unique`, `top` (the most frequent value), and `freq` instead.

Here are some insights you can derive from the provided statistics:

**Release Year:**
The earliest release year in the dataset is 1925, indicating the presence of older content.
The latest release year is 2021, which is the most recent content.
The median release year is around 2017, so at least half of the titles were released in 2017 or later. The catalog skews toward recent content, with a long tail of older titles.

**Rating:**
The most frequent content rating is 'TV-MA,' which suggests a preference for mature content.
There are 14 unique content ratings in the dataset, indicating diversity in content targeting various age groups.

**Duration:**
The 'duration' column is stored as text, so `describe` reports only its count, number of unique values, and most frequent value rather than a mean or quartiles.
This column holds the number of seasons for TV shows and the running time in minutes for movies, so it must be parsed before any numeric analysis.

**Country:**
The 'country' column is non-numeric and has 681 unique values, indicating content from various countries.
The most frequent country in the dataset is the 'United States' (occurring 2,555 times).

**Date Added:**
The 'date_added' column is non-numeric and has 1,565 unique values.
The most frequent addition date is 'January 1, 2020' (occurring 118 times).

Note that the non-numeric columns ('type', 'title', 'director', 'cast', and so on) need column-specific analysis, such as value counts and unique-value checks, to yield further insights. The missing data in 'director', 'cast', and 'country' must also be handled to ensure a comprehensive analysis.
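
The column-specific checks mentioned above (counts and unique values for non-numeric columns) could look roughly like this. It is only a sketch: `categorical_summary` is a hypothetical helper, and the toy frame is made up rather than taken from the dataset.

```python
import pandas as pd

def categorical_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize non-numeric columns: distinct values, most frequent value, gaps."""
    rows = []
    for col in df.select_dtypes(exclude="number").columns:
        counts = df[col].value_counts()  # NaN is excluded by default
        rows.append({
            "column": col,
            "unique": int(counts.size),
            "top": counts.index[0] if counts.size else None,
            "freq": int(counts.iloc[0]) if counts.size else 0,
            "missing": int(df[col].isna().sum()),
        })
    return pd.DataFrame(rows).set_index("column")

# Toy frame with made-up values
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show"],
    "country": ["India", None, "India"],
    "release_year": [2019, 2018, 2020],
})
summary = categorical_summary(toy)
print(summary)
```

Numeric columns such as 'release_year' are skipped on purpose, since `describe` already covers them.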

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
data.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

#### Handling Null Values

* The dataset contains 3,631 null values in total: 2,389 in the 'director' column, 718 in 'cast', 507 in 'country', 10 in 'date_added', and 7 in 'rating'.

We handle these null values in the following steps:

In [None]:
for i in data.columns:
  null_rate = data[i].isnull().sum()/len(data)*100
  if null_rate > 0 :
        print( "{}'s null rate: {}%".format(i, round(null_rate, 3)))

In [None]:
data[data['country'].isnull()].head(30)

In [None]:
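# Fill missing values in the 'cast' column with 'No cast'.
# Assigning the result back avoids in-place modification of a column selection,
# which newer pandas versions warn about or ignore.
data['cast'] = data['cast'].fillna('No cast')

# Fill missing values in the 'country' column with 'Country Unavailable'.
data['country'] = data['country'].fillna('Country Unavailable')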

In [None]:
# Inspect the rows with a missing 'rating'; judging by 'title' and 'cast', most of them appear to be United States titles.
data[data['rating'].isna()].head(10)

In [None]:
# Analyzing 'date_added' columns NaN values
data[data['date_added'].isna()].head(10)

In [None]:
# Dealing with null values in 'rating' feature of dataset
data['rating'].value_counts()

# 'rating' is a categorical variable, so a common approach is to impute its missing values with the mode.
# Calculate the mode of the 'rating' column
mode_rating = data['rating'].mode()[0]  # Extract the mode value

# Impute missing values in the 'rating' column with the mode (assign back instead of using inplace on a column)
data['rating'] = data['rating'].fillna(mode_rating)

In [None]:
# value_counts() alone does not suggest a concrete way to handle the null values in 'date_added'
data['date_added'].value_counts()

# In the NaN inspection of 'date_added' above, every affected row is a 'TV Show' whose 'director' is also missing.
# Dropping these rows therefore removes 20 missing values from the dataset.
data.dropna(subset=['date_added'], inplace=True)

# The 'director' column has a considerable number of null values; replace them with 'Not Mentioned'.
data['director'] = data['director'].fillna('Not Mentioned')

#### Manipulation

In [None]:
# Convert 'date_added' to datetime so that dates can be worked with more effectively.
# Stripping surrounding whitespace first helps keep otherwise valid dates from being coerced to NaT.
data['date_added'] = pd.to_datetime(data['date_added'].str.strip(), errors='coerce')

# Renaming the 'listed_in' column to 'genres'
data.rename(columns = {"listed_in":"genres"},inplace = True)
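
With `errors='coerce'`, any value that cannot be parsed silently becomes `NaT`, so it is worth counting how many rows were affected. The sketch below assumes the dates look like 'January 1, 2020' (the format seen in the `describe` output) and an English locale for month names; the sample values are made up.

```python
import pandas as pd

# Made-up samples: a clean date, one with a leading space, and one invalid value
raw = pd.Series(["August 14, 2020", " November 1, 2019", "not a date"])

# Strip whitespace, then parse with an explicit format (an assumption about the data)
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# Count values that could not be parsed and were coerced to NaT
n_failed = int(parsed.isna().sum())
print(parsed)
print("unparseable dates:", n_failed)
```

If `n_failed` is larger than expected on the real column, the offending rows can be inspected with `raw[parsed.isna()]`.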

In [None]:
# Assigning the ratings into grouped categories
ratings = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}
data['target_ages'] = data['rating'].replace(ratings)
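
One caveat with `replace`: a rating that is missing from the dictionary is left unchanged without any warning. A quick coverage check, sketched here with a deliberately shortened copy of the mapping and made-up ratings, makes such gaps visible.

```python
import pandas as pd

# Shortened copy of the mapping, for illustration only
ratings = {"TV-MA": "Adults", "TV-14": "Teens", "TV-Y": "Kids"}

# Made-up rating values; 'TV-Y7' is intentionally absent from the mapping
rating_col = pd.Series(["TV-MA", "TV-14", "TV-Y", "TV-Y7"])

# Unlike replace, map returns NaN for values without a dictionary entry
mapped = rating_col.map(ratings)
unmapped = sorted(rating_col[mapped.isna()].unique())
print("ratings without a target age group:", unmapped)
```

An empty `unmapped` list on the real 'rating' column would confirm that the dictionary covers every rating in the data.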

In [None]:
# Extract the year and month in which each title was added
data['year_added'] = data['date_added'].dt.year
data['month_added'] = data['date_added'].dt.month


In [None]:
data.head()

In [None]:
data.tail()

### What all manipulations have you done and insights you found?

I performed several data manipulations to prepare the dataset for analysis. Here is a summary of the manipulations and the resulting insights:

**1.Handling Missing Values:**

* Replaced missing values in the 'cast' column with the string 'No cast'. This retains the records while flagging the absence of cast information.
* Replaced missing values in the 'country' column with 'Country Unavailable', which keeps those rows without distorting the distribution of the known countries.
* Imputed missing values in the 'rating' column with the mode (most frequent value), a common approach for a categorical variable.
* Dropped the rows with missing values in the 'date_added' column.
* Replaced missing values in the 'director' column with the string 'Not Mentioned', which flags the absence of director information for those records.

**2.Data Type Conversion:**

* Converted the 'date_added' column to the datetime data type, making it suitable for date-based analysis.
* Renamed the 'listed_in' column to 'genres', which better describes the data in the column.
* Added the 'year_added' and 'month_added' columns for easier visualization, and a 'target_ages' column that groups ratings into audience categories.

**3.Insights:**

* The result is a cleaner, more structured dataset with no null values in the relevant columns.
* Meaningful date-based analysis is now possible on the 'date_added' column.
* The value 'No cast' in the 'cast' column identifies records with incomplete cast information.
* Missing 'country' values are now labeled 'Country Unavailable', so those records stay in the dataset.
* The dataset now has consistent data types for analysis and visualization.

These manipulations have prepared our dataset for further analysis, allowing us to gain valuable insights into Netflix's content, regional availability, and trends over time.
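
The steps summarized above can also be gathered into a single function, so the cleaning can be rerun on a fresh copy of the raw data. This is a sketch under stated assumptions rather than the notebook's actual code: `clean_netflix` is a hypothetical name, the date format `%B %d, %Y` is assumed, and the toy frame is made up.

```python
import pandas as pd

def clean_netflix(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the null handling and type conversions described above to a copy."""
    out = df.copy()
    out["cast"] = out["cast"].fillna("No cast")
    out["country"] = out["country"].fillna("Country Unavailable")
    out["director"] = out["director"].fillna("Not Mentioned")
    # Impute the categorical 'rating' column with its most frequent value
    out["rating"] = out["rating"].fillna(out["rating"].mode()[0])
    # Drop rows without an added date, then parse the remaining dates
    out = out.dropna(subset=["date_added"]).copy()
    out["date_added"] = pd.to_datetime(
        out["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
    )
    out["year_added"] = out["date_added"].dt.year
    out["month_added"] = out["date_added"].dt.month
    return out.rename(columns={"listed_in": "genres"})

# Toy frame with made-up values
toy = pd.DataFrame({
    "cast": ["Actor A", None, "Actor B"],
    "country": [None, "India", "Japan"],
    "director": [None, "Dir X", None],
    "rating": ["TV-MA", None, "TV-MA"],
    "date_added": ["January 1, 2020", "March 5, 2019", None],
    "listed_in": ["Dramas", "Comedies", "Kids' TV"],
})
cleaned = clean_netflix(toy)
print(cleaned)
```

Because the function works on a copy, the input frame stays untouched and the cleaning can be repeated safely.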

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
data.type.value_counts()

# Chart - 1 visualization code
# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot the countplot on the first subplot
sns.countplot(x='type', data=data, ax=axes[0])
axes[0].set_title('Count of Movies and TV Shows')
axes[0].set_xlabel('Type')
axes[0].set_ylabel('Count')

# Plot the pie chart on the second subplot
type_counts = data['type'].value_counts()
axes[1].pie(type_counts, labels=type_counts.index, autopct='%1.1f%%', startangle=140)
axes[1].set_title('Percentage Distribution of Movies and TV Shows')
axes[1].axis('equal')  # Equal aspect ratio ensures that the pie chart is circular.

# Adjust spacing between subplots
plt.tight_layout()

plt.show()

##### 1. Why did you pick the specific chart?

Choice of Chart:

I picked the "Countplot" chart because it's an effective way to visualize the distribution of categorical data. In this case, it allows us to compare the counts of 'Movie' and 'TV Show' in the 'type' column, addressing the objective of understanding the distribution of content types.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Chart:

The chart shows that the number of 'Movie' entries is significantly higher than the number of 'TV Show' entries in the dataset. This indicates that Netflix's content library is dominated by movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:

The insight gained from this chart can have a positive business impact. It provides valuable information about the composition of Netflix's content, helping the company understand the preferences and distribution of its offerings. This insight can guide content acquisition strategies, advertising, and user recommendations.

Negative Growth Insights:

While the chart does not directly indicate negative growth, it does reveal an imbalance in the distribution of content types. If Netflix intended to focus on TV shows but found that movies dominate its library, this could be considered a deviation from the intended content mix. However, it's essential to consider user preferences and whether the movie-dominated library aligns with user demand. It may not necessarily lead to negative growth if it reflects customer preferences and keeps subscribers engaged. The negative impact would depend on the alignment with user expectations and business objectives.
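
Whether the mix has been shifting toward TV shows can be checked by looking at the share of each type per year added, rather than at overall counts. The sketch below assumes the derived 'year_added' column and uses made-up rows; it illustrates the idea and is not a result from the dataset.

```python
import pandas as pd

# Made-up rows standing in for the cleaned data
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "year_added": [2018, 2018, 2018, 2019, 2019, 2019],
})

# Row-normalized crosstab: each row sums to 1, giving the type share per year
share_by_year = pd.crosstab(
    toy["year_added"], toy["type"], normalize="index"
).round(3)
print(share_by_year)
```

A rising 'TV Show' column across years would support the claim that the focus is moving toward TV shows.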

#### Chart - 2

In [None]:
# Group the data by 'country' and 'type', and count the occurrences
content_count = data.groupby(['country', 'type']).size().unstack().fillna(0)

# Sum the counts for 'Movie' and 'TV Show' to get the total content count
content_count['Total'] = content_count['Movie'] + content_count['TV Show']

# Sort the data by the total content count in descending order and select the top 20 countries
top_20_countries = content_count.sort_values(by='Total', ascending=False).head(20)

# Display the top 20 countries with their content counts in a table
top_20_countries.reset_index(inplace=True)
top_20_countries.columns = ['Country', 'Movie', 'TV Show', 'Total']
print(top_20_countries)

In [None]:
# Chart - 2 visualization code
# Data for the table
countries = top_20_countries['Country']
total = top_20_countries['Total']
movies = top_20_countries['Movie']
tv_shows = top_20_countries['TV Show']

# Set the width of the bars
bar_width = 0.2

# Create an array representing the position of each country on the x-axis
x = range(len(countries))

# Create the joint bar graph
plt.figure(figsize=(14, 7))
plt.bar(x, total, width=bar_width, label='Total', align='center')
plt.bar([i + bar_width for i in x], movies, width=bar_width, label='Movies', align='center')
plt.bar([i + 2 * bar_width for i in x], tv_shows, width=bar_width, label='TV Shows', align='center')

# Set the x-axis labels to be the country names
plt.xticks([i + bar_width for i in x], countries, rotation=45, fontsize=10)

# Add labels and a legend
plt.xlabel('Country')
plt.ylabel('Count')
plt.title('Content Distribution by Type in Top 20 Countries')
plt.legend()

# Show the graph
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Choice of Chart:

I picked a grouped bar chart to compare the top 20 countries by content volume. Placing the total, movie, and TV show counts side by side for each country presents the tabular data above concisely.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Chart:

The chart mirrors the table printed above and shows how content (Movies and TV Shows) is distributed among the top 20 countries.

The data reveals that the United States has the highest number of both movies and TV shows, making it the leading content provider. India and the United Kingdom follow, with India having a significant number of movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:

The insights gained from this table can have a positive business impact. It helps Netflix understand which countries contribute the most content to their library and can influence content acquisition and localization strategies.

Negative Growth Insights:

* While the table doesn't directly indicate negative growth, it can suggest opportunities for improvement. If Netflix intended to diversify its content from various countries and found that only a handful are dominating, it might indicate a deviation from the intended content diversity.
* The negative impact would depend on the extent to which this deviation aligns with user preferences and regional goals. For example, if Netflix wanted to expand its library from a more extensive range of countries and found that only a few are contributing significantly, it may need to adjust its content acquisition strategies. The potential negative impact is tied to alignment with business goals.
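
A caveat for per-country counts: if a single 'country' value can list several countries separated by commas (an assumption about the data, which would also explain the large number of unique values), grouping on the raw string treats each combination as its own country. A minimal sketch of counting by individual country, with made-up rows:

```python
import pandas as pd

# Made-up rows; the first title lists two countries in one string
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "type": ["Movie", "TV Show", "Movie"],
    "country": ["United States, India", "India", "United States"],
})

# Split the comma-separated string, create one row per country, trim spaces
per_country = (
    toy.assign(country=toy["country"].str.split(","))
       .explode("country")
       .assign(country=lambda d: d["country"].str.strip())
)

# Count titles per individual country and type
counts = per_country.groupby(["country", "type"]).size().unstack(fill_value=0)
print(counts)
```

Note that a co-produced title is counted once for each listed country, so the per-country totals can add up to more than the number of titles.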

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(14,6))
plt.title('Type Counts wrt Ratings')
# Ratings are ordered by overall frequency; bars are split by content type
sns.countplot(x='rating', hue='type', data=data, order=data['rating'].value_counts().index)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a countplot to visualize the number of Movies and TV Shows per 'rating', with the ratings arranged in descending order of frequency.

##### 2. What is/are the insight(s) found from the chart?

The countplot shows that Movies outnumber TV Shows in almost every rating category except TV-Y and TV-Y7, which may indicate that young children prefer TV Shows to Movies. The R (adult) category has very few TV Shows. For both types, TV-MA and TV-14 are the dominant ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Most rating categories are dominated by Movies, so the focus should be on audience preference with respect to age group. For example, adults and teenagers watch both TV Shows and Movies, while young children watch more TV Shows. The content library should be built with these viewership preferences in mind.



#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Create a bar plot for content ratings
rating_counts = data['rating'].value_counts()
rating_counts = rating_counts.sort_index()  # Sort ratings in alphabetical order

# Define a palette of five colors; matplotlib reuses them in order when there are more bars than colors
colors = ['royalblue', 'forestgreen', 'mediumorchid', 'goldenrod', 'tomato']

plt.figure(figsize=(10, 6))
plt.bar(rating_counts.index, rating_counts.values, color=colors)

# Add labels and title
plt.xlabel('Content Rating')
plt.ylabel('Count')
plt.title('Distribution of Content Ratings')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Display the plot
plt.show()

import plotly.express as px
import plotly.graph_objects as go

# rating column distribution by plotly
fig_donut = go.Figure(go.Pie(
    labels=data['rating'].value_counts().index,
    values=data['rating'].value_counts(),
    hole=0.7,
    marker=dict(colors=['#b20710', '#221f1f']),
    textinfo='percent+label',
    rotation=90,
))

fig_donut.update_traces(
    hovertemplate=None,
    textposition='outside',
)

fig_donut.update_layout(
    height=800,
    width=800,
    title='MOST OF PROGRAMME ON NETFLIX IS TV-14 & TV-MA RATED',
    margin=dict(t=80, b=10, l=0, r=0),
    showlegend=False,
    plot_bgcolor='#333',
    paper_bgcolor='#333',
    title_font=dict(size=25, color='#8a8d93', family="Lato, sans-serif"),
    font=dict(size=17, color='#8a8d93'),
    hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"),
)

fig_donut.show(renderer='colab')

##### 1. Why did you pick the specific chart?

I selected the bar graph and donut chart above because together they show both the distribution of ratings and each rating's percentage share of the Netflix content library.

##### 2. What is/are the insight(s) found from the chart?

The bar graph shows the rating distribution in alphabetical order. TV-MA and TV-14 are the most frequent ratings, so adult and teen content dominates over content for older and younger kids.

The Plotly donut chart gives an interactive view of each rating's percentage share. It suggests that Netflix's content may be skewed toward the choices of a mass audience.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These plots suggest that content listing decisions should take audience viewership into account. User demographics should also be tracked to anticipate viewership, so that a shift in preferences can be predicted before the audience moves to competitors.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(8,5))
plt.title('Type Counts wrt Target Ages')
# Age groups are ordered by overall frequency; bars are split by content type
sns.countplot(x='target_ages', hue='type', data=data, order=data['target_ages'].value_counts().index)
plt.show()

##### 1. Why did you pick the specific chart?

I chose this graph to visualize the number of Movies and TV Shows for each 'target_ages' group.

##### 2. What is/are the insight(s) found from the chart?

The bar plot shows a descending trend in the counts of both Movies and TV Shows across the age groups, which are ordered from the most to the least common.

From a business perspective, Movies should take priority over TV Shows, but age-specific content listing also deserves attention.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

For a positive business impact, Netflix should focus most on Adults content, then Teens, then Older Kids, and least on Kids.

The emphasis on Movies over TV Shows should stay, although Adults and Teens watch TV Shows as well.

For Kids, Movies do not need much focus.
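
The statements above about which age groups lean toward Movies or TV Shows can be backed by shares instead of raw counts. The following sketch assumes the 'target_ages' and 'type' columns from the notebook and uses made-up rows, so the printed numbers are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for the cleaned data
toy = pd.DataFrame({
    "target_ages": ["Kids", "Kids", "Kids", "Adults", "Adults"],
    "type": ["TV Show", "TV Show", "Movie", "Movie", "Movie"],
})

# Share of each content type within every age group (rows sum to 1)
shares = (
    toy.groupby("target_ages")["type"]
       .value_counts(normalize=True)
       .unstack(fill_value=0)
)
print(shares)
```

Comparing the 'TV Show' share across age groups shows directly whether a group's content leans toward shows or movies.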

#### Chart - 6

In [None]:
# Chart - 6 visualization code (Modified for 'country' vs 'rating')
# Analysing Top 10 countries with most content by rating
plt.figure(figsize=(18,6))
sns.countplot(x=data['country'], order=data['country'].value_counts().index[0:10], hue=data['target_ages'])
plt.xticks(rotation=50)
plt.title('Top 10 countries by ratings with target_ages count', fontsize=15, fontweight='bold')
plt.show()

In [None]:
# Chart - 6 visualization code (Modified for 'country' vs 'rating')
# Analysing Top 10 countries with most content by rating
plt.figure(figsize=(18,6))
sns.countplot(x=data['country'], order=data['country'].value_counts().index[0:10], hue=data['rating'])
plt.xticks(rotation=50)
plt.title('Top 10 countries by ratings with rating count', fontsize=15, fontweight='bold')
plt.show()

##### 1. Why did you pick the specific chart?

I picked these two bar charts to address the project objective of examining content diversity across countries.

##### 2. What is/are the insight(s) found from the chart?

Broadly, almost every one of the top 10 content-contributing countries has the most content in the Adults/TV-MA category, but India, the second-largest contributor, gives more weight to the Teens category. Kids content receives almost negligible focus.

The dominance of Adults and Teens content shows that Netflix's content additions mainly target a broad viewing audience.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

From a business point of view, a company focuses on paying customers first and then on their dependents. Here that translates into prioritizing Adults and Teens content over the other two categories.

India's emphasis on Teens content may reflect culture and family structure, with Indian audiences preferring less adult content.


#### Chart - 7

In [None]:
data['genres'].value_counts().head(35)

In [None]:
# Chart - 7 visualization code
# Analysing the top 20 genres of movies and shows
plt.figure(figsize=(12,6))
plt.title('Top 20 Genre of Movies and Shows',fontweight="bold")
sns.countplot(y='genres', data=data, order=data['genres'].value_counts().index[0:20])
plt.show()

##### 1. Why did you pick the specific chart?

I chose this plot to show the counts of the top 20 genres in the content collection.

##### 2. What is/are the insight(s) found from the chart?

The genres are dominated by Documentaries, followed by Stand-Up Comedy and mainstream Movies. Kids' TV comes next; it is likely driven by TV Shows, because the Kids category has little Movie content. Family movies, Dramas, and Family movies with comedy follow.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

To make a positive business impact, we should gauge audience demand by analyzing viewership for each content category, and then make the in-demand content available in the relevant geography and demographic without losing diversity in the content.
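
The ranking above counts each full 'genres' string as one category. If a string can list several genres separated by commas (an assumption about the data), an indicator matrix gives counts per individual genre. The sketch below uses made-up genre strings.

```python
import pandas as pd

# Made-up genre strings in the assumed comma-separated layout
genres = pd.Series([
    "Dramas, International Movies",
    "Documentaries",
    "Dramas, Comedies",
])

# One 0/1 column per individual genre
indicator = genres.str.get_dummies(sep=", ")

# Column sums give the number of titles tagged with each genre
genre_counts = indicator.sum().sort_values(ascending=False)
print(genre_counts)
```

The same indicator matrix could later serve as a simple feature set when clustering titles by genre.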



#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Analysing the top 20 genres, split by content type
plt.figure(figsize=(12,6))
plt.title('Top 20 Genre of Movies/Shows',fontweight="bold")
sns.countplot(y=data['genres'], data=data, hue= 'type', order=data['genres'].value_counts().index[0:20])

##### 1. Why did you pick the specific chart?

I chose this chart to compare the counts of the top 20 genres between TV Shows and Movies in Netflix's content library.

##### 2. What is/are the insight(s) found from the chart?

The top 20 genres are dominated by movies: only three TV show genres appear, of which Kids' TV has the lion's share.

Despite the Gen Z audience and science-minded kids, sci-fi movies and shows have no place in the top 20.

Action and adventure has a huge fan following nowadays, yet it sits at the bottom of the list.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

To make a positive business impact, the company needs to keep adding titles in its dominant categories while also maintaining balance among categories and age groups. Keeping each share in line with its audience base helps avoid customer churn.

#### Chart - 9

In [None]:
!pip install wordcloud

In [None]:
# Chart - 9 visualization code

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Combine all genre strings into a single text
genres_text = " ".join(data['genres'])

# Create a WordCloud object
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(genres_text)

# Display the WordCloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for Genres', fontsize=14)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a word cloud to show how frequently each genre term occurs: the larger the text, the higher the frequency.

##### 2. What is/are the insight(s) found from the chart?

The word cloud shows which genre terms have the strongest presence in the content library, because text size is directly related to count.

TV Shows has the highest count, followed by International Movies, then Dramas International, and so on.
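
By default, `WordCloud.generate` works on words and word pairs found in the text, so a term like "Dramas International" can be a pair that spans two neighbouring labels rather than one genre. As a rough sketch, whole labels can be counted first and the counts handed to `WordCloud.generate_from_frequencies`. The sample strings and the ", " separator below are assumptions for illustration.

```python
from collections import Counter

# Hypothetical sample of genre strings; in the notebook this would be data['genres']
genre_strings = [
    "TV Dramas, International TV Shows",
    "Dramas, International Movies",
    "Documentaries",
    "Dramas, Comedies",
]

# Count whole labels (split on the assumed "," separator) instead of single words
label_frequencies = Counter(
    label.strip()
    for genres in genre_strings
    for label in genres.split(",")
)

print(label_frequencies.most_common(3))
# The resulting dict can then be drawn with
# WordCloud(...).generate_from_frequencies(label_frequencies)
```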

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

A word cloud is a quick way to convey genre counts, but the final goal is to act on them and create a positive business impact. Decisions should take into account both the frequency of each genre and the financial return it brings to the company.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(y="release_year", data=data, palette="Set2", hue = 'type', order=data['release_year'].value_counts().index[0:20])

##### 1. Why did you pick the specific chart?

I used this graph to show the counts for the top 20 release years, split into TV Shows and Movies.

##### 2. What is/are the insight(s) found from the chart?

The graph shows that 2018 had the highest number of releases overall, followed by 2017 and 2019. 2020 shows a fall, possibly because of COVID-19.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The number of titles per release year shows an abundance of content, but to profit from viewership that content must be of good quality to attract more subscribers.

#### Chart - 11

In [None]:
# Chart - 11 visualization code (for 'country' vs 'year_added')
# Analysing the top 10 countries with the most content, by year added
plt.figure(figsize=(18, 8))
sns.countplot(x=data['country'], order=data['country'].value_counts().index[0:10], hue=data['year_added'])
plt.xticks(rotation=50)
plt.title('Top 10 countries with the most content by year_added', fontsize=15, fontweight='bold')
plt.show()

##### 1. Why did you pick the specific chart?

This graph shows how much content was added in each year for each country.

##### 2. What is/are the insight(s) found from the chart?

The graph shows how much content was added in each year, which is likely reflected in the company's subscribers, viewership, profit, and possibly its share price.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This makes the platform an attractive place for investment and advertising, because more diverse content draws in more subscribers looking for quality content from varied backgrounds, demographics, and geographies. And this cycle will continue.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(y="date_added", data=data, palette="Set2", hue = 'type', order=data['date_added'].value_counts().index[0:20])

##### 1. Why did you pick the specific chart?

The graph shows the top 20 content-addition dates, split into the two categories Movies and TV Shows.

##### 2. What is/are the insight(s) found from the chart?

Overall, dates in 2018 and 2019 dominate the top 20 addition dates for both categories of content.

This shows that in recent years Netflix has been adding more and more content to its library to attract subscribers, study their viewing patterns, and offer more content of their choice.
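
Counting individual dates favours a few days on which many titles were added at once. A coarser view, sketched below, groups additions by calendar month instead. It assumes `date_added` is stored as text such as "January 1, 2020"; the `sample` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical sample; assumes date_added holds text like "January 1, 2020"
sample = pd.DataFrame({
    "date_added": [
        "January 1, 2020", "January 15, 2020",
        "March 3, 2019", "March 9, 2019", "July 1, 2019",
    ],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
})

# Parse the strings into timestamps; unparseable values become NaT instead of raising
added = pd.to_datetime(sample["date_added"], errors="coerce")

# Count titles per calendar month ("YYYY-MM") and content type
monthly_counts = (
    sample.assign(month=added.dt.strftime("%Y-%m"))
    .groupby(["month", "type"])
    .size()
    .unstack(fill_value=0)
)
print(monthly_counts)
```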

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

More diverse content will attract many advertisers as well as premium subscribers. The resulting profit can fund further content across genres and geographies, broadening the subscriber base and encouraging viewers to watch content from other countries.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Split the data into two DataFrames, one per content type
Movies = data[data['type'] == 'Movie']
TV_shows = data[data['type'] == 'TV Show']

Movies_year = Movies['release_year'].value_counts().sort_index(ascending=False)
TV_shows_year = TV_shows['release_year'].value_counts().sort_index(ascending=False)

# Visualizing the Movies and TV_shows based on the release_year
plt.figure(figsize=(12, 8))
sns.set(font_scale=1.4)
sns.lineplot(data=Movies_year, label="Movies / year", color='maroon')
sns.lineplot(data=TV_shows_year, label="TV Shows / year", color='blue')
plt.xlabel("Years", labelpad=15)
plt.ylabel("Number", labelpad=15)
plt.title("Yearly Production Stats", y=1.02, fontsize=22)
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a line plot for this visualization because it is well-suited to show trends and changes over a continuous variable, such as years. In this case, we want to analyze the production trends of movies and TV shows over the years. A line plot allows us to observe how the number of movies and TV shows released each year has changed over time. By using two lines for movies and TV shows, we can easily compare their production trends.

##### 2. What is/are the insight(s) found from the chart?

The line plot shows the yearly production statistics for movies and TV shows. Some key insights from the chart are:
* The production of both movies and TV shows has generally increased over the years.
* The number of movies produced has been consistently higher than TV shows, with a few exceptions.
* There was a noticeable surge in TV show production starting from around 2015.
* The year 2020 saw a significant increase in the production of both movies and TV shows, which could be attributed to various factors, including the popularity of streaming platforms.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The gained insights can potentially help create a positive business impact, particularly for streaming platforms and production companies. Understanding the trends in production can guide content creators and distributors in making informed decisions about their content portfolios.
* The positive insights include the overall growth in production, which indicates a growing demand for content. This could be an opportunity for businesses to invest in content creation and distribution.
* The insight about the surge in TV show production since around 2015 is crucial. Streaming platforms, which are major players in the TV show industry, can use this information to continue producing TV shows that align with consumer preferences.
* The significant increase in production in 2020 may be due to the changing entertainment landscape during the COVID-19 pandemic. While it presents an immediate opportunity, it may also result in a more competitive market in the future, which businesses need to be prepared for.
* However, there is no direct insight pointing to negative growth. The consistent growth in production, especially with the rise of streaming platforms, is generally positive for the industry. It's essential for businesses to adapt to this evolving landscape to stay competitive and capitalize on the positive trends.


The insights gained from this visualization can inform business strategies, content planning, and investment decisions in the entertainment industry.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Calculate the correlation matrix over the numeric columns only;
# numeric_only=True avoids errors from text columns in recent pandas versions
correlation_matrix = data.corr(numeric_only=True)

# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Select relevant numeric columns for the pair plot
numeric_columns = ['release_year', 'year_added', 'month_added']

# Create a pair plot
sns.pairplot(data[numeric_columns], diag_kind='kde')
plt.show()

##### 1. Why did you pick the specific chart?

Pair charts (scatter matrix) are used for analyzing relationships between pairs of variables, helping identify patterns, correlations, and potential outliers in multivariate data, enabling a quick visual exploration of data complexity.

##### 2. What is/are the insight(s) found from the chart?

The number of titles grows markedly with release year from around 1950 to 2000, so a steady increase in content is visible on Netflix as well.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Based on the chart experiments above, here are three hypothetical statements:

***Hypothetical Statement 1: Content Type and Region***

Statement: The distribution of content types (TV shows and movies) varies significantly among different regions where Netflix content is available.

***Hypothetical Statement 2: Content Type and Year***

Statement: There is a significant change in the distribution of content types (TV shows and movies) over the years, reflecting Netflix's shifting focus.

***Hypothetical Statement 3: Duration and Content Type***

Statement: The duration (runtime) of content (movies and TV shows) differs significantly between the two types, with one type generally having longer durations.

For each statement, I will provide code and statistical testing to obtain a final conclusion.

### Hypothetical Statement - 1: Content Type Trend: The Distribution of Movie Genres on Netflix Varies Between Countries over the years.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H0 (Null Hypothesis): Netflix has a consistent focus on both TV shows and movies across all regions.

H1 (Alternative Hypothesis): Netflix has been increasingly focusing on TV shows in recent years in most regions.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Uses the 'type' and 'year_added' columns of the Netflix dataset loaded earlier

# Create separate dataframes for TV shows and movies
tv_shows = data[data['type'] == 'TV Show']
movies = data[data['type'] == 'Movie']

# Calculate the annual count of TV shows and movies
tv_show_counts = tv_shows.groupby('year_added').size()
movie_counts = movies.groupby('year_added').size()

# Perform a t-test to compare the counts
t_stat, p_value = ttest_ind(tv_show_counts, movie_counts, equal_var=False)

# Set your significance level (alpha)
alpha = 0.05

# Print the results
print(f'T-statistic: {t_stat}')
print(f'P-value: {p_value}')

if p_value < alpha:
    print('Reject the null hypothesis: The mean annual counts of TV shows and movies added to Netflix differ significantly.')
else:
    print('Fail to reject the null hypothesis: No significant difference was found between the mean annual counts of TV shows and movies.')

##### Which statistical test have you done to obtain P-Value?

I used Welch's two-sample t-test (`ttest_ind` with `equal_var=False`) to obtain the p-value.

##### Why did you choose the specific statistical test?

I chose Welch's t-test because it compares the means of two independent groups without assuming equal variances. Here the two groups are the annual counts of TV shows and of movies, grouped by `year_added`. The null hypothesis (H0) assumes that there is no significant difference between the two mean annual counts. The alternative hypothesis (H1) suggests that there is a significant difference.

If the p-value is less than the chosen significance level (alpha), we reject the null hypothesis and conclude that the mean annual counts of TV shows and movies differ. Note that this test compares overall volumes only: it does not break the data down by region and does not by itself show a trend over time, so it offers only indirect evidence on whether Netflix has been increasingly focusing on TV shows.
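
Since H1 is about TV shows gaining ground over time, the yearly share of TV shows is a natural quantity to look at alongside the test. The sketch below computes it on a made-up `sample` frame that uses the same `type` and `year_added` column names as the notebook. It is an illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical sample with the two columns used by the test above
sample = pd.DataFrame({
    "year_added": [2016, 2016, 2016, 2016, 2019, 2019, 2019, 2019],
    "type": ["Movie", "Movie", "Movie", "TV Show",
             "Movie", "Movie", "TV Show", "TV Show"],
})

# Rows: year added, columns: content type, values: number of titles
counts_by_year = pd.crosstab(sample["year_added"], sample["type"])

# Share of TV shows among all titles added in each year
tv_share = counts_by_year["TV Show"] / counts_by_year.sum(axis=1)
print(tv_share)
```

A share that rises from year to year would point in the direction of H1, although a formal trend test would still be needed to call it significant.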

### Hypothetical Statement - 2: Geographic Variation in Content: Netflix Has Shown a Preference for Producing Original TV Shows Over Movies over geographies.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H0 (Null Hypothesis): The type of content available on Netflix is consistent across different countries.

H1 (Alternative Hypothesis): There are significant differences in the type of content (movies and TV shows) available on Netflix across different countries.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chi2_contingency

# Uses the 'type' and 'country' columns of the Netflix dataset loaded earlier

# Create a contingency table of content type and country
contingency_table = pd.crosstab(data['type'], data['country'])

# Perform a chi-squared test
chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)

# Set your significance level (alpha)
alpha = 0.05

# Print the results
print(f'Chi-squared statistic: {chi2_stat}')
print(f'P-value: {p_value}')

if p_value < alpha:
    print('Reject the null hypothesis: There are significant differences in content type distribution across countries.')
else:
    print('Fail to reject the null hypothesis: The type of content available on Netflix is consistent across different countries.')

##### Which statistical test have you done to obtain P-Value?

I used a chi-squared test to obtain the p-value.

##### Why did you choose the specific statistical test?

I chose the chi-squared test for independence because it's an appropriate statistical test for analyzing the relationship between two categorical variables, in this case, "content type" (TV shows and movies) and "country" (different countries where Netflix content is available).

The null hypothesis (H0) in this test assumes that there is no significant association or difference between content type and country. The alternative hypothesis (H1) suggests that there is a significant relationship or difference. By performing the chi-squared test and examining the p-value, we can determine whether to reject the null hypothesis or not.

In this specific case, the very low p-value (close to 0) indicates that there is a significant difference in content type distribution across different countries. Therefore, we reject the null hypothesis and conclude that there are indeed significant differences in the types of content available on Netflix across various countries.
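
To make the statistic concrete, here is a small worked sketch of what the chi-squared test of independence computes, using plain Python on an invented 2 x 3 table (the counts are not from the Netflix data). For tables larger than 2 x 2, this matches the statistic that `chi2_contingency` reports.

```python
# Hypothetical 2 x 3 contingency table: rows are content types, columns are countries
observed = [
    [30, 20, 10],  # Movie
    [10, 20, 30],  # TV Show
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
print(chi2_stat, dof)
```

The p-value is then the upper-tail probability of a chi-squared distribution with `dof` degrees of freedom at `chi2_stat`.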

### Hypothetical Statement - 3: Content Preferences

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H0 (Null Hypothesis): The number of TV shows and movies available on Netflix has remained relatively stable over the years.

H1 (Alternative Hypothesis): The number of TV shows available on Netflix has significantly increased over the years, while the number of movies has decreased.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Uses the 'type' and 'year_added' columns of the Netflix dataset loaded earlier

# Create separate dataframes for TV shows and movies
tv_shows = data[data['type'] == 'TV Show']
movies = data[data['type'] == 'Movie']

# Calculate the annual count of TV shows and movies
tv_show_counts = tv_shows.groupby('year_added').size()
movie_counts = movies.groupby('year_added').size()

# Perform a t-test to compare the counts
t_stat, p_value = ttest_ind(tv_show_counts, movie_counts, equal_var=False)

# Set your significance level (alpha)
alpha = 0.05

# Print the results
print(f'T-statistic: {t_stat}')
print(f'P-value: {p_value}')

if p_value < alpha:
    print('Reject the null hypothesis: The mean annual counts of TV shows and movies added to Netflix differ significantly.')
else:
    print('Fail to reject the null hypothesis: No significant difference was found between the mean annual counts of TV shows and movies.')

##### Which statistical test have you done to obtain P-Value?

I used a two-sample independent t-test to obtain the p-value.

##### Why did you choose the specific statistical test?

I chose the two-sample independent t-test because it is a suitable statistical test for comparing the means of two independent groups. In this case, the two groups are TV shows and movies, and we are interested in comparing the means of their counts (the number of TV shows and movies) over the years. The t-test helps determine whether the difference in means between these two groups is statistically significant.

The null hypothesis (H0) assumes that there is no significant difference between the mean annual counts of TV shows and movies, while the alternative hypothesis (H1) suggests that there is a significant difference. By conducting the t-test and examining the p-value, we can decide whether to reject the null hypothesis or not. In this specific case, since the p-value is greater than the chosen significance level (alpha), we fail to reject the null hypothesis: the test found no significant difference between the mean annual counts of TV shows and movies. This does not by itself establish that the counts have remained stable over the years.
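
As a worked illustration of what `ttest_ind(..., equal_var=False)` computes, the sketch below evaluates Welch's t statistic and the Welch-Satterthwaite degrees of freedom by hand. The two lists of annual counts are invented for the example.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical annual counts per year_added for each content type
tv_show_counts = [10, 20, 30, 40]
movie_counts = [30, 50, 70, 90]

n1, n2 = len(tv_show_counts), len(movie_counts)
m1, m2 = mean(tv_show_counts), mean(movie_counts)
# statistics.variance is the sample variance (divides by n - 1)
v1, v2 = variance(tv_show_counts), variance(movie_counts)

# Welch's t statistic: difference in means over the unpooled standard error
standard_error = sqrt(v1 / n1 + v2 / n2)
t_stat = (m1 - m2) / standard_error

# Welch-Satterthwaite approximation of the degrees of freedom
dof = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)
print(t_stat, dof)
```

With only a handful of yearly counts per group, the degrees of freedom are small, which is one reason a large-looking difference in means can still fail to reach significance.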

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
data.isnull().any()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Select only the numerical columns
numerical_columns = data.select_dtypes(include='number')

# Create box plots for each numerical column
for column in numerical_columns.columns:
    plt.figure(figsize=(8, 4))  # Adjust the figure size as needed
    # Drop missing values first; plt.boxplot cannot summarise a column containing NaN
    plt.boxplot(numerical_columns[column].dropna(), vert=False)
    plt.title(f'Box Plot for {column}')
    plt.xlabel(column)
    plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
data.info()

In [None]:
# Encode your categorical columns

# Define a function to process the 'country' column
def process_country(country):
    # Split the country string into tokens
    tokens = country.split(', ')

    # Check the number of tokens
    if len(tokens) == 1:
        # Keep the single token as is
        return country
    else:
        # Replace with 'Multi Country' for multiple tokens
        return 'Multi Country'

# Apply the function to the 'country' column
data['country'] = data['country'].apply(process_country)

In [None]:
# Import the necessary library
from sklearn.preprocessing import LabelBinarizer

# Create a LabelBinarizer
lb = LabelBinarizer()

# Encode the 'type' column
# fit_transform returns an (n, 1) array for a two-class column, so flatten it to 1-D
data['type_encoded'] = lb.fit_transform(data['type']).ravel()
data.drop('type', axis=1, inplace=True)
# Now, the 'type_encoded' column contains 1 for 'TV Show' and 0 for 'Movie'

#data.head(2)

In [None]:
# Import the necessary library
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder
le = LabelEncoder()

# Apply label encoding to the 'country' column
data['country_encoded'] = le.fit_transform(data['country'])
data.drop('country', axis=1, inplace=True)

#data.head(2)

In [None]:
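# Extract the number of seasons (TV shows) and the runtime in minutes (movies);
# expand=False returns a Series rather than a one-column DataFrame
data['seasons'] = data['duration'].str.extract(r'(\d+) Seasons?', expand=False)
data['minutes'] = data['duration'].str.extract(r'(\d+) min', expand=False)
# Rows without a match (e.g. minutes for a TV show) become 0
data['minutes'] = data['minutes'].fillna(0).astype(int)
data['seasons'] = data['seasons'].fillna(0).astype(int)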
data.drop('duration', axis=1, inplace=True)

#data.head(2)
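
To show what the two patterns capture, here is a small standalone sketch that parses single `duration` strings. The helper name `parse_duration` and the example strings are hypothetical and assume values of the form "N Season(s)" or "N min".

```python
import re

def parse_duration(duration):
    """Return (seasons, minutes) parsed from strings like '2 Seasons' or '90 min'."""
    seasons = re.search(r"(\d+) Seasons?", duration)
    minutes = re.search(r"(\d+) min", duration)
    # A missing match is reported as 0, mirroring the fillna(0) step above
    return (
        int(seasons.group(1)) if seasons else 0,
        int(minutes.group(1)) if minutes else 0,
    )

for value in ["1 Season", "3 Seasons", "90 min"]:
    print(value, "->", parse_duration(value))
```

Note that 0 here means "not applicable" rather than a real duration, which is worth keeping in mind when these columns feed a distance-based clustering model.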

In [None]:
# Create a LabelEncoder
label_encoder = LabelEncoder()

# Encode 'rating' and capture its mapping now, before the encoder is refitted
data['rating_encoded'] = label_encoder.fit_transform(data['rating'])
rating_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

# Encode 'target_ages' (this refits the encoder) and capture its mapping
data['target_ages_encoded'] = label_encoder.fit_transform(data['target_ages'])
target_ages_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

# Print the mapping
print("Rating Mapping:")
print(rating_mapping)

print("Target Ages Mapping:")
print(target_ages_mapping)

data.drop('rating', axis=1, inplace=True)

data.head()

In [None]:
# Create a new column named "combined_text" by concatenating the values from best suitable text columns
data['combined_text'] = data['genres'] + ' ' + data['description']

# Now, the "combined_text" column contains the concatenated values

# Define the desired order of column names
desired_column_order = [
    'show_id', 'title', 'director', 'cast', 'date_added', 'release_year', 'genres', 'description', 'target_ages', 'combined_text',
    'type_encoded', 'country_encoded', 'rating_encoded', 'target_ages_encoded', 'seasons', 'minutes', 'year_added', 'month_added'
    ]

# Create a new DataFrame with the columns in the desired order
data = data[desired_column_order]

# Now, the columns are reordered as per your desired order in the new DataFrame.

data.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

I have used a combination of categorical encoding techniques in this project to handle different types of categorical data:

**LabelBinarizer for 'type' Column:** I used LabelBinarizer to encode the 'type' column because it contains two distinct categories ('Movie' and 'TV Show'). I chose this technique to create a single binary column that efficiently represents this binary classification. In this encoding, 'TV Show' is represented as 1, and 'Movie' as 0.

**LabelEncoder for 'country' Column:** For the 'country' column, I employed LabelEncoder to encode the country names into numerical values. It assigns a unique integer to each country. Because countries have no natural order, these integers are arbitrary identifiers and their numeric order carries no meaning.

**LabelEncoder for 'rating' Column:** Similarly, I used LabelEncoder for the 'rating' column, which contains different content ratings (e.g., 'TV-MA', 'TV-14'). It assigns a unique integer to each rating, treating the ratings as unordered categories.

**LabelEncoder for 'target_ages' Column:** The 'target_ages' column represents age groups (e.g., '7+', '13+'). I used LabelEncoder for this column as well. Although these categories have a meaningful order, LabelEncoder assigns integers in sorted label order rather than by age, so the codes should be read as category identifiers, not as an age scale.

The choice of encoding technique for each column was made based on the nature of the data and the specific requirements of the analysis and machine learning models. This combination of techniques allowed me to effectively encode and transform the categorical data for further analysis and modeling.
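
If the order of the age groups should be reflected in the codes, an explicit mapping is one option. The sketch below is only an illustration: the labels in `AGE_GROUP_ORDER` are hypothetical and would need to be replaced with the actual values of `data['target_ages']`.

```python
# Hypothetical age-group labels, listed from youngest to oldest audience;
# replace them with the actual values found in data['target_ages']
AGE_GROUP_ORDER = ["Kids", "Older Kids", "Teens", "Adults"]

# Map each label to its position so that a larger code means an older audience
age_group_codes = {label: code for code, label in enumerate(AGE_GROUP_ORDER)}

sample_target_ages = ["Teens", "Kids", "Adults", "Teens"]
encoded = [age_group_codes[label] for label in sample_target_ages]
print(encoded)
```

On the DataFrame, the same dictionary could be applied with `data['target_ages'].map(age_group_codes)`. With sorted label order, by contrast, "Adults" would receive the smallest code in this example.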

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
!pip install contractions

In [None]:
# Expand Contraction
# Define a function to expand common contractions
def expand_contractions(text):
    contractions_dict = {
        "ain't": "is not",
        "aren't": "are not",
        "can't": "cannot",
        "can't've": "cannot have",
        "'cause": "because",
        "could've": "could have",
        "couldn't": "could not",
        "didn't": "did not",
        "doesn't": "does not",
        "don't": "do not",
        "hadn't": "had not",
        "hasn't": "has not",
        "haven't": "have not",
        "he'd": "he would",
        "he'd've": "he would have",
        "he'll": "he will",
        "he'll've": "he will have",
        "he's": "he is",
        "how'd": "how did",
        "how'd'y": "how do you",
        "how'll": "how will",
        "how's": "how is",
        "I'd": "I would",
        "I'd've": "I would have",
        "I'll": "I will",
        "I'll've": "I will have",
        "I'm": "I am",
        "I've": "I have",
        "isn't": "is not",
        "it'd": "it would",
        "it'd've": "it would have",
        "it'll": "it will",
        "it'll've": "it will have",
        "it's": "it is",
        "let's": "let us",
        "ma'am": "madam",
        "mayn't": "may not",
        "might've": "might have",
        "mightn't": "might not",
        "mightn't've": "might not have",
        "must've": "must have",
        "mustn't": "must not",
        "mustn't've": "must not have",
        "needn't": "need not",
        "needn't've": "need not have",
        "o'clock": "of the clock",
        "oughtn't": "ought not",
        "oughtn't've": "ought not have",
        "shan't": "shall not",
        "sha'n't": "shall not",
        "shan't've": "shall not have",
        "she'd": "she would",
        "she'd've": "she would have",
        "she'll": "she will",
        "she'll've": "she will have",
        "she's": "she is",
        "should've": "should have",
        "shouldn't": "should not",
        "shouldn't've": "should not have",
        "so've": "so have",
        "so's": "so is",
        "that'd": "that would",
        "that'd've": "that would have",
        "that's": "that is",
        "there'd": "there would",
        "there'd've": "there would have",
        "there's": "there is",
        "they'd": "they would",
        "they'd've": "they would have",
        "they'll": "they will",
        "they'll've": "they will have",
        "they're": "they are",
        "they've": "they have",
        "to've": "to have",
        "wasn't": "was not",
        "we'd": "we would",
        "we'd've": "we would have",
        "we'll": "we will",
        "we'll've": "we will have",
        "we're": "we are",
        "we've": "we have",
        "weren't": "were not",
        "what'll": "what will",
        "what'll've": "what will have",
        "what're": "what are",
        "what's": "what is",
        "what've": "what have",
        "when's": "when is",
        "when've": "when have",
        "where'd": "where did",
        "where's": "where is",
        "where've": "where have",
        "who'll": "who will",
        "who'll've": "who will have",
        "who's": "who is",
        "who've": "who have",
        "why's": "why is",
        "why've": "why have",
        "will've": "will have",
        "won't": "will not",
        "won't've": "will not have",
        "would've": "would have",
        "wouldn't": "would not",
        "wouldn't've": "would not have",
        "y'all": "you all",
        "y'all'd": "you all would",
        "y'all'd've": "you all would have",
        "y'all're": "you all are",
        "y'all've": "you all have",
        "you'd": "you would",
        "you'd've": "you would have",
        "you'll": "you will",
        "you'll've": "you will have",
        "you're": "you are",
        "you've": "you have"
    }

    # Replace longer contractions first so that "can't've" is not
    # partially rewritten by the shorter "can't" rule
    for key in sorted(contractions_dict, key=len, reverse=True):
        text = text.replace(key, contractions_dict[key])

    return text

# Apply the function to the "combined_text" column
data['combined_text'] = data['combined_text'].apply(expand_contractions)

# The contractions in the DataFrame have been expanded
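
Plain `str.replace` is case-sensitive and only matches the straight apostrophe, so forms such as "It’s" at the start of a sentence pass through unchanged. A regex-based variant is sketched below. The function name `expand_contractions_regex` and the four-entry mapping are illustrative only, and the expansion is always emitted in lower case.

```python
import re

# A small subset of the mapping above, used here only for illustration
CONTRACTIONS = {
    "can't": "cannot",
    "can't've": "cannot have",
    "it's": "it is",
    "won't": "will not",
}

# Longest keys first so "can't've" wins over "can't"; \b keeps matches on word boundaries
_pattern = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_regex(text):
    # Normalise typographic apostrophes so that "It’s" matches "it's"
    text = text.replace("\u2019", "'")
    # Look up the lower-cased match in the mapping
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions_regex("It\u2019s late and she can't've left"))
```

Keys that begin with an apostrophe, such as "'cause", would need separate handling because `\b` does not match in front of them.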

#### 2. Lower Casing

In [None]:
# Lower Casing
# Apply lower casing to the "combined_text" column
data['combined_text'] = data['combined_text'].str.lower()

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string

# Define a function to remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Apply the function to the "combined_text" column
data['combined_text'] = data['combined_text'].apply(remove_punctuation)

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & remove words that contain digits
import re

# Define a function to remove URLs.
# Punctuation was already stripped in the previous step, so a link now looks like
# "httpsexamplecom"; the pattern removes any run of characters starting with "http" or "www"
def remove_urls(text):
    return re.sub(r'http\S+|www.\S+', '', text)

# Define a function to remove words that contain digits
def remove_words_with_digits(text):
    return ' '.join(word for word in text.split() if not any(char.isdigit() for char in word))

# Apply the functions to the "combined_text" column
data['combined_text'] = data['combined_text'].apply(remove_urls)
data['combined_text'] = data['combined_text'].apply(remove_words_with_digits)
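
Because punctuation has already been removed by this point, a URL pattern can no longer rely on "://" or "." to recognise links. One alternative, sketched here under the assumption that the order of the steps may be changed, is to strip URLs first and punctuation afterwards. The names `URL_PATTERN` and `strip_urls_then_punctuation` are hypothetical.

```python
import re
import string

# Matches http(s):// links and bare www. hosts; the dot after "www" is escaped
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_urls_then_punctuation(text):
    # Remove URLs first, while "://" and "." are still present to anchor the match
    text = URL_PATTERN.sub("", text)
    # Then drop punctuation, as in the earlier step
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())

print(strip_urls_then_punctuation(
    "Watch now at https://example.com/title, or www.example.org!"
))
```

With the escaped dot, an ordinary word that merely starts with "www" is left in place.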

#### 5. Removing Stopwords & Removing White spaces

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
# Remove Stopwords & White spaces
from nltk.corpus import stopwords

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words("english"))

# Define a function to remove stopwords and white spaces
def remove_stopwords_and_whitespace(text):
    # Tokenize the text on whitespace
    words = text.split()

    # Remove stopwords
    words = [word for word in words if word.lower() not in STOP_WORDS]

    # Join the remaining words with single spaces, which also removes extra white space
    return ' '.join(words)

# Apply the function to the "combined_text" column
data['combined_text'] = data['combined_text'].apply(remove_stopwords_and_whitespace)

#### 6. Rephrase Text

In [None]:
# Rephrase Text
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Define a function to rephrase text
def rephrase_text(text):
    # Tokenize the text into words
    words = text.split()

    # Rephrase each word using its synonyms
    rephrased_words = []
    for word in words:
        synsets = wordnet.synsets(word)
        if synsets:
            # Use the first synonym if available
            synonym = synsets[0].lemmas()[0].name()
            rephrased_words.append(synonym)
        else:
            rephrased_words.append(word)

    # Join the rephrased words to form the rephrased text
    return ' '.join(rephrased_words)

# Apply the rephrasing function to the "combined_text" column
data['combined_text'] = data['combined_text'].apply(rephrase_text)

#### 7. Tokenization

In [None]:
# Tokenization
import nltk
from nltk.tokenize import word_tokenize

# Download NLTK data (if not already downloaded)
nltk.download('punkt')

# Define a function to tokenize text
def tokenize_text(text):
    # Use NLTK's word_tokenize to tokenize the text
    tokens = word_tokenize(text)
    return tokens

# Apply the tokenization function to the "combined_text" column
data['combined_text'] = data['combined_text'].apply(tokenize_text)

#data.head()

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
import nltk
from nltk.stem import WordNetLemmatizer

# Download NLTK data (if not already downloaded)
nltk.download('wordnet')

# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

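# Define a function to lemmatize a list of tokens using NLTK
def lemmatize_text(tokens):
    # lemmatize() treats every token as a noun unless a part-of-speech tag is passed,
    # so verb and adjective forms such as "running" are left unchanged here
    lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
    return lemmatized_words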

# Apply the function to lemmatize text to the "combined_text" column
data['combined_text'] = data['combined_text'].apply(lemmatize_text)

#data.head()

##### Which text normalization technique have you used and why?

In this project, I employed the **text normalization technique of Lemmatization**. Lemmatization is the process of reducing words to their base or root form, known as a lemma. I chose Lemmatization for the following reasons:

**Semantic Accuracy:** Lemmatization maps words to their dictionary form (the lemma), which helps preserve meaning. With the correct part-of-speech tag, it reduces inflected forms to a common base, such as 'running' to 'run' or 'better' to 'good'. Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless a tag is supplied, so the call used above mainly normalizes plural nouns (for example, 'movies' to 'movie').

**Improved Information Retrieval:** By reducing words to their lemmas, Lemmatization can help improve information retrieval and text understanding. It ensures that different forms of a word are recognized as the same word, leading to better search and analysis results.

**Reduced Dimensionality:** Lemmatization can contribute to reduced dimensionality in the text data. By mapping similar words to a common lemma, the vocabulary size decreases, making the data more manageable for analysis.

**Interpretability:** Lemmatized text is more interpretable because it represents the underlying meaning of words in the document. This can be especially valuable for tasks that require model interpretability or understanding the significance of certain terms in the analysis.

**Compatibility with Text Vectorization:** Lemmatization is compatible with text vectorization techniques like TF-IDF or Word Embeddings. It ensures that the vectors represent the meaningful aspects of the text data, not just variations of the same word.

**Improved Model Performance:** In some NLP tasks, Lemmatization can lead to improved model performance. It helps models focus on the essence of words, rather than being distracted by numerous inflected forms.

In summary, I chose Lemmatization as the text normalization technique for this project to enhance the quality and interpretability of the text data. It is a valuable method for capturing the semantic meaning of words and ensuring that the text is in a more analytically useful and structured form for subsequent NLP analysis.

I used lemmatization rather than stemming. Stemming is computationally cheaper, but it does not always produce meaningful words. Lemmatization is slower, but it returns valid base words, which is why I chose it.
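
The difference between the two approaches can be illustrated without any NLP library. The sketch below is a deliberately crude toy: `toy_stem` strips a few suffixes by rule and `toy_lemmatize` looks words up in a four-entry dictionary. Both functions and the word list are assumptions made for illustration; real stemmers and lemmatizers (such as those in NLTK) are far more elaborate.

```python
# Suffixes are checked in this order; the first match is stripped
SUFFIXES = ("ies", "ing", "ed", "es", "s")

def toy_stem(word):
    # Rule-based: chop a known suffix if at least three characters remain
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Dictionary-based: every entry maps an inflected form to a real word
TOY_LEMMAS = {"studies": "study", "running": "run", "movies": "movie", "better": "good"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

words = ["studies", "running", "movies", "better"]
stems = [toy_stem(word) for word in words]
lemmas = [toy_lemmatize(word) for word in words]

for word, stem, lemma in zip(words, stems, lemmas):
    print(f"{word:>8} -> stem: {stem:<7} lemma: {lemma}")
```

The rule-based output contains fragments such as "stud" and "runn", while the dictionary lookup returns valid words, which mirrors the trade-off described above.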

#### 9. Part of speech tagging

In [None]:
# POS Tagging (not applied in this project; kept commented out for reference)
# import nltk
# from nltk import pos_tag
# from nltk.tokenize import word_tokenize

# # Download NLTK data (if not already downloaded)
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

# # Define a function to perform POS tagging on text
# def pos_tag_text(text):
#     # If text is already a string, tokenize and tag it
#     if isinstance(text, str):
#         words = word_tokenize(text)
#         pos_tags = pos_tag(words)
#         return pos_tags
#     else:
#         # Handle non-string values, e.g., NaN
#         return []

# # Apply the function to the "combined_text" column
# data['combined_text'] = data['combined_text'].apply(pos_tag_text)

# data.head()

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert the list of tokens in the "combined_text" column back to strings
data['combined_text'] = data['combined_text'].apply(lambda tokens: ' '.join(tokens))

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # You can adjust the number of features as needed

# Fit and transform the vectorizer on the combined text
tfidf_matrix = tfidf_vectorizer.fit_transform(data['combined_text'])

# Create a DataFrame from the TF-IDF matrix, reusing the index of 'data' so the rows stay aligned
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out(), index=data.index)

# Concatenate the TF-IDF DataFrame with the original DataFrame
new_data = pd.concat([data, tfidf_df], axis=1)

# Now, the DataFrame contains TF-IDF vectors for the "combined_text" column
new_data.head()

In [None]:
new_data.isna().sum()

In [None]:
new_data.isna().sum().sum()
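
The two `isna()` checks are worth running because `pd.concat(..., axis=1)` aligns rows on index labels, not on position. If rows were dropped from `data` earlier without resetting the index, a freshly built TF-IDF frame (indexed 0 to n-1) no longer lines up with it and NaN values appear. The sketch below reproduces the effect with a made-up frame; all column names and values are hypothetical.

```python
import pandas as pd

# Four titles, one of which is dropped, leaving index labels 0, 2, 3
data_toy = pd.DataFrame({"title": ["a", "b", "c", "d"]}).drop(index=1)

# TF-IDF-like values for the three remaining rows
values = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.0]]

# Default index 0, 1, 2: rows are matched by label, so the frames misalign
misaligned = pd.concat(
    [data_toy, pd.DataFrame(values, columns=["drama", "comedy"])], axis=1
)

# Reusing the index of the original frame keeps every row in place
aligned = pd.concat(
    [data_toy, pd.DataFrame(values, columns=["drama", "comedy"], index=data_toy.index)],
    axis=1,
)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

A non-zero total from `new_data.isna().sum().sum()` would therefore point to an index mismatch rather than to missing source data.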

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Assuming 'new_data' is your dataset with a 'combined_text' column
combined_text = " ".join(new_data['combined_text'])  # Combine all text from the 'combined_text' column

# Create a WordCloud object
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(combined_text)

# Display the word cloud using matplotlib
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

##### Which text vectorization technique have you used and why?

I have used the **TfidfVectorizer** (Term Frequency-Inverse Document Frequency Vectorizer) for text vectorization in this project. TfidfVectorizer is a popular technique for converting textual data into numerical vectors, and I chose it for the following reasons:

**Term Frequency-Inverse Document Frequency (TF-IDF):** TfidfVectorizer leverages the TF-IDF scheme, which is a valuable statistical measure in NLP. It takes into account the frequency of a term in a document (Term Frequency) and the inverse document frequency (IDF), which measures how important a term is relative to the entire corpus of documents. This allows TfidfVectorizer to give higher weights to terms that are more discriminative and less common.

**Feature Scaling:** TF-IDF weighting emphasizes terms that are specific to individual documents while downplaying terms that are common across all documents. By default, TfidfVectorizer also L2-normalizes each document vector, so long and short descriptions become comparable.

**Sparsity Handling:** Text data typically produces a sparse document-term matrix because of the large vocabulary. TfidfVectorizer returns a SciPy sparse matrix, and the `max_features=5000` setting caps the vocabulary at the 5,000 most frequent terms, which keeps the dimensionality manageable.

**Suitable for Various NLP Tasks:** TfidfVectorizer is versatile and can be applied to a wide range of NLP tasks, including text classification, clustering, and information retrieval. It is widely used in both traditional machine learning models and deep learning models.

**Interpretability:** TfidfVectorizer provides interpretable features. You can understand the importance of each term within a document, making it useful for model interpretability.

In summary, I chose TfidfVectorizer because it is a robust and well-established text vectorization technique that highlights the distinctive terms of each title while keeping the feature space bounded through `max_features`. This gives a meaningful numerical representation of the textual content for the subsequent clustering.
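
To make the weighting concrete, the sketch below computes TF-IDF by hand for three made-up documents. It follows the default settings of scikit-learn's vectorizer as documented (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization per document); the documents and variable names are illustrative assumptions.

```python
import math
from collections import Counter

docs = [
    "crime drama police",
    "romantic comedy drama",
    "crime thriller police chase",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

def tfidf_vector(tokens):
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]  # unit-length row

tfidf_rows = [tfidf_vector(tokens) for tokens in tokenized]

# In the second document, the rarer term "comedy" outweighs the more common "drama"
second = dict(zip(vocabulary, tfidf_rows[1]))
print(round(second["comedy"], 3), round(second["drama"], 3))
```

Terms that occur in fewer documents receive a larger IDF and therefore a larger weight, which is the behaviour the explanation above relies on.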

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
new_data.head(2)

#### 2. Feature Selection

In [None]:
# Keep only the numerical columns: everything from column position 10 onward
no_transformed_data = new_data.iloc[:,10:]
no_transformed_data

##### What all feature selection methods have you used  and why?

1. Feature Extraction with TF-IDF (Text Data):
* After applying the TF-IDF (Term Frequency-Inverse Document Frequency) technique to the textual data, we have a matrix of numerical vectors as features. TF-IDF transforms the text data into a numerical format that machine learning models can understand.

* We used TF-IDF because it is a common and effective technique for converting text data into a numerical representation while giving more weight to distinctive words and reducing the impact of words that appear in most documents. Strictly speaking, this is feature extraction rather than feature selection, although the `max_features=5000` setting does restrict the vocabulary to the most frequent terms.

2. Feature Subset Selection with DataFrame Slicing:
* The code `no_transformed_data = new_data.iloc[:, 10:]` does not apply a statistical feature selection method; it selects a subset of columns by position.
* It drops the first ten columns and keeps the remaining numerical columns as the features for clustering.
* Restricting the analysis to these columns keeps the feature set numerical and informative and leaves out columns that would only add noise.

In summary, we used TF-IDF as a feature extraction technique for the text data and positional slicing to choose the subset of columns used for the analysis. Together, these steps ensure that the clustering works with relevant and informative numerical features.
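
Positional slicing depends on the column order staying exactly as it is. As a hedged alternative, the sketch below selects columns by dtype on a made-up frame; the column names are hypothetical. Note that this is not a drop-in replacement for `iloc[:, 10:]`: it would also keep any numeric column that happens to sit among the first ten.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B"],        # text column, not usable for clustering
    "type_encoded": [0, 1],     # hypothetical encoded feature
    "drama": [0.7, 0.0],        # hypothetical TF-IDF columns
    "comedy": [0.0, 0.9],
})

# Selection by position, as in the notebook (here: skip the first column)
by_position = toy.iloc[:, 1:]

# Selection by dtype: keeps every numeric column regardless of where it sits
by_dtype = toy.select_dtypes(include="number")

print(list(by_dtype.columns))
```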

##### Which all features you found important and why?

In our analysis, we identified several features as important for our project. These features were selected based on their relevance and potential impact on our modeling and analysis. Here's a summary of the important features and their significance:

**1. Encoded Features (Type, Country, Rating, Target Ages):** We found these encoded features to be crucial for our analysis because they provide categorical information that can significantly influence the content's characteristics and audience reception. For example, 'Type' helps distinguish between movies and TV shows, 'Country' can indicate content origin and language, 'Rating' offers insights into audience suitability, and 'Target Ages' helps categorize content by appropriate age groups. These features enable us to perform categorical encoding and include them as valuable inputs for our models.

**2. TfIdf Vectorization (Combined Text):** The 'Combined Text' feature, created by concatenating 'Genres' and 'Description,' was important for our project as it serves as the basis for text-based analysis. We employed TfIdf (Term Frequency-Inverse Document Frequency) vectorization on this text data. TfIdf helps us understand the importance of specific words and phrases within the text and enables us to capture the content's textual nuances. It plays a key role in natural language processing (NLP) and allows us to analyze content descriptions and genres efficiently.

These features are considered important because they provide a rich source of information, both categorical and textual. They determine how similar two titles look to the clustering algorithms and help us gain insights into content characteristics and trends. Combining the encoded features with the TF-IDF representation lets us cluster titles on metadata and description together.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Apply log transformation to all columns
log_transformed_data = np.log1p(no_transformed_data)
log_transformed_data.head()

**In our analysis, we observed that the data exhibited significant variation in scale across the columns**, which could affect the performance of machine learning algorithms. To address this, we applied **a log transformation** to all the columns in our dataset. The **np.log1p** function computes log(1 + x), so it is defined at zero and compresses large values more strongly than small ones. This is most useful for features with a wide range of non-negative values; for the TF-IDF columns, whose values already lie between 0 and 1, the effect is mild.

**By performing the log transformation**, we aimed for the following:

**Reduced Skew:** Compressing large values reduces right skew and brings strongly skewed features closer to a symmetric distribution, which suits many machine learning algorithms.

**Stabilization:** Extreme values and outliers in the data are dampened, reducing their influence on the models. This leads to more stable and robust modeling.

**Improved Model Performance:** Log-transformed data can improve model performance when features are heavily skewed, because distance-based algorithms are then less dominated by a few large values.

In summary, the log transformation was applied to address issues related to the scale and distribution of the data, making it more suitable for building and training machine learning models effectively.
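
The size of the effect depends on the range of the input. The sketch below applies `np.log1p` to two made-up arrays, one bounded by 1 like TF-IDF weights and one spanning several orders of magnitude; both arrays are assumptions chosen for illustration.

```python
import numpy as np

tfidf_like = np.array([0.0, 0.1, 0.5, 1.0])             # values bounded by 1
wide_range = np.array([0.0, 10.0, 1_000.0, 100_000.0])  # several orders of magnitude

log_tfidf = np.log1p(tfidf_like)
log_wide = np.log1p(wide_range)

# Ratio between the largest value and the value 10, before and after the transform
ratio_before = wide_range.max() / wide_range[1]
ratio_after = log_wide.max() / log_wide[1]

print(log_tfidf)
print(ratio_before, ratio_after)
```

For the bounded array the output stays between 0 and ln(2), so the shape barely changes, while the wide-range array is compressed substantially.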

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import MinMaxScaler

# Initialize the Min-Max scaler
scaler = MinMaxScaler()

# Apply scaling to the selected columns
scaled_data = scaler.fit_transform(log_transformed_data)

In [None]:
# MinMaxScaler returns a NumPy array; wrap it in a DataFrame
# (the column names are replaced by integer positions)
scaled_data = pd.DataFrame(scaled_data)

# Now, "scaled_data" is a Pandas DataFrame containing our data.
scaled_data.head()

##### Which method have you used to scale your data and why?

We used the **Min-Max scaling method** to scale our data. The Min-Max scaling, also known as normalization, transforms the data in such a way that it falls within a specific range, typically between **0 and 1.** This method was chosen for the following reasons:

**Interpretability:** Min-Max scaling preserves the relative differences in values, making it intuitive and easy to interpret. All features are transformed to a common scale, ensuring that no feature dominates others merely because of its initial scale.

**Maintains Data Distribution:** Min-Max scaling retains the distribution and relationships within the data while scaling it to a common range. This is particularly important when the original data distribution is non-Gaussian or when you want to preserve the shape of the data.

**Compatibility with Many Algorithms:** Many machine learning algorithms and models, especially those based on distance metrics (e.g., k-means clustering), perform better when the data is scaled within a specific range. Min-Max scaling is a popular choice for preprocessing data for these algorithms.

**Handling of Outliers:** Min-Max scaling does not remove outliers, and it is sensitive to them: a single extreme value sets the minimum or maximum and squeezes the remaining values into a narrow part of the [0, 1] range. The log transformation applied in the previous step dampens extreme values before scaling, which limits this effect.

By applying Min-Max scaling, we ensured that all features share a common [0, 1] range and are ready for use in distance-based algorithms.
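
Min-Max scaling computes `(x - min) / (max - min)` per feature. The short sketch below applies that formula with NumPy to a made-up feature, once without and once with an extreme value, to show how a single outlier changes the result. The helper name `min_max_scale` and the numbers are illustrative assumptions.

```python
import numpy as np

def min_max_scale(x):
    # Same formula MinMaxScaler applies to each column with the default (0, 1) range
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

values = np.array([1.0, 2.0, 3.0, 4.0])
with_outlier = np.append(values, 100.0)

scaled = min_max_scale(values)
scaled_outlier = min_max_scale(with_outlier)

print(scaled)
print(scaled_outlier)
```

Without the outlier the four values spread evenly over [0, 1]; with it, they are compressed into roughly the lowest 3% of the range.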

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction is necessary for our dataset, which has 5008 Min-Max scaled (normalized) numerical features, for the following reasons:

**Curse of Dimensionality:** Handling high-dimensional data can lead to the "curse of dimensionality." In high-dimensional spaces, data points become sparse, and the computational and memory requirements for many machine learning algorithms increase significantly. This can result in longer training times, reduced model performance, and increased complexity.

**Overfitting:** High-dimensional data increases the risk of overfitting. Models can become overly complex, fitting to noise in the data rather than capturing the underlying patterns. Dimensionality reduction helps mitigate overfitting by simplifying the data representation.

**Improved Model Efficiency:** Reducing the dimensionality of the data can improve the efficiency and speed of many machine learning algorithms. Models trained on lower-dimensional data often require less time for training and prediction.

**Enhanced Interpretability:** High-dimensional data can be challenging to interpret and visualize. Dimensionality reduction can help in creating lower-dimensional representations that are easier to understand and visualize.
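
One practical consequence of the curse of dimensionality for clustering is that distances concentrate: the nearest and the farthest neighbour of a point end up almost equally far away. The sketch below measures this on uniformly random points; the sample size, the dimensions, and the helper name `relative_contrast` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=200):
    # Distances from the first point to all others, in the unit hypercube
    points = rng.random((n_points, n_dims))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    # How much farther the farthest point is than the nearest one
    return (dists.max() - dists.min()) / dists.min()

contrast_low = relative_contrast(2)
contrast_high = relative_contrast(5000)

print(contrast_low, contrast_high)
```

In two dimensions the farthest point is many times farther away than the nearest one, whereas in 5000 dimensions the difference shrinks to a small fraction, which makes distance-based cluster assignments less reliable.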

In [None]:
# Dimensionality Reduction (If needed)
from sklearn.decomposition import PCA

# Initialize a PCA instance without specifying the number of components
pca = PCA()

# Fit the PCA model to the scaled data
pca.fit(scaled_data)

# Calculate the cumulative explained variance
cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)

# Create an elbow plot to visualize the explained variance
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, marker='o', linestyle='--')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA Elbow Plot')
plt.grid()
plt.show()

In [None]:
# Create a PCA instance with the number of components chosen from the elbow plot
n_components = 2500
pca = PCA(n_components=n_components)

# Fit the PCA model to the scaled data and transform it
transformed_data_pca = pca.fit_transform(scaled_data)

# 'transformed_data_pca' now holds the data in the reduced space with 'n_components' principal components.

# Share of the total variance explained by each retained component
explained_variance = pca.explained_variance_ratio_
explained_variance

In [None]:
# Total explained variance of the retained components; the target is more than 90%
explained_variance.sum()

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

We used Principal Component Analysis (PCA). The best dimensionality reduction method depends on the objectives and the nature of the data, but given the large number of features and our previous preprocessing steps, PCA is a natural choice. It is an unsupervised technique that projects the data onto the directions of maximum variance, so most of the variance is retained with fewer features.

PCA is widely used, computationally efficient, and can effectively reduce dimensionality while preserving essential information. It is a good starting point for dimensionality reduction in our project. We can assess its impact on model performance and explore other techniques like t-SNE or LLE if needed.
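
Instead of reading the number of components off the elbow plot, it can be derived from the cumulative explained variance. The sketch below does this with NumPy on made-up variance ratios; the values and the 90% target are illustrative assumptions, not results from this dataset.

```python
import numpy as np

# Hypothetical explained-variance ratios, in decreasing order as PCA returns them
explained_variance_ratio = np.array([0.50, 0.30, 0.15, 0.05])
cumulative = np.cumsum(explained_variance_ratio)

target = 0.90
# Position of the first cumulative value that reaches the target, converted to a count
n_components_needed = int(np.searchsorted(cumulative, target) + 1)

print(cumulative, n_components_needed)
```

With the fitted `pca` object from the elbow-plot cell, the same two lines could be applied to `pca.explained_variance_ratio_`. scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, in which case it keeps just enough components to reach that share of the variance.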

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
X = transformed_data_pca

X.shape

##### What data splitting ratio have you used and why?

Because this is an unsupervised learning problem with no target variable, no train/test split is needed; the clustering models are fit on the full dataset.


## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Data on which we are fitting model
X = transformed_data_pca

# Import necessary libraries
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram
from sklearn.metrics import silhouette_score, calinski_harabasz_score

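# Step 1: Distance Metric and Hierarchical Clustering
# Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the 'affinity' parameter was removed from recent scikit-learn releases)
agg_clustering = AgglomerativeClustering(n_clusters=4, linkage='ward')
agg_clustering.fit(X)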

In [None]:
# Import necessary libraries
from scipy.cluster.hierarchy import dendrogram, linkage

# Step 2: Dendrogram
# Generate linkage matrix 'Z' and then plot the dendrogram
linkage_matrix = linkage(X, method='ward')
dendrogram(linkage_matrix)
plt.show()

In [None]:
# Step 3: Assign Labels
# The dendrogram above guides the choice of the number of clusters; the labels
# come from the model already fitted in Step 1 with n_clusters=4.
labels = agg_clustering.labels_
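
To actually cut a dendrogram at a chosen height, SciPy's `fcluster` can be applied to the linkage matrix. The sketch below uses six made-up 2D points forming two obvious groups; the points and the cut height of 5.0 are assumptions for illustration and are unrelated to the height that would suit this dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated groups of three points each (toy data)
points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)

Z = linkage(points, method='ward')

# Cut by height: merges above t=5.0 are undone
labels_by_height = fcluster(Z, t=5.0, criterion='distance')

# Cut by count: ask directly for at most two clusters
labels_by_count = fcluster(Z, t=2, criterion='maxclust')

print(labels_by_height)
print(labels_by_count)
```

Both criteria recover the same two groups here. On the notebook's data, `linkage_matrix` from Step 2 would take the place of `Z`.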

In [None]:
# Assuming PC1 and PC2 are the first two principal components
PC1 = X[:, 0]
PC2 = X[:, 1]

plt.scatter(PC1, PC2, c=labels, cmap='rainbow')
plt.title('Hierarchical Clustering Result (2D)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

In [None]:
from mpl_toolkits.mplot3d import Axes3D

# Assuming you have three principal components PC1, PC2, and PC3
PC1 = X[:, 0]
PC2 = X[:, 1]
PC3 = X[:, 2]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(PC1, PC2, PC3, c=labels, cmap='rainbow')

ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
ax.set_title('Hierarchical Clustering Result (3D)')
plt.show()

In [None]:
# Step 4: Evaluate the Clustering
silhouette_score_value = silhouette_score(X, labels)
calinski_harabasz_score_value = calinski_harabasz_score(X, labels)
print("Silhouette Score:", silhouette_score_value)
print("Calinski-Harabasz Score:", calinski_harabasz_score_value)

The Silhouette Score and Calinski-Harabasz Score are both used to evaluate the quality of a clustering solution:

**1. Silhouette Score:** The Silhouette Score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 to 1, where a higher score indicates that the object is better matched to its own cluster and worse matched to neighboring clusters.

In our case, the Silhouette Score is approximately 0.154. A value this close to 0 suggests that the clusters overlap: many points lie near the boundary between two clusters, or the chosen number of clusters does not match the structure of the data.

**2. Calinski-Harabasz Score (Variance Ratio Criterion):** The Calinski-Harabasz Score measures the ratio of between-cluster variance to within-cluster variance. A higher score indicates better-defined clusters.

In our case, the Calinski-Harabasz Score is approximately 150.31. The score has no fixed scale, so a single value is hard to judge in isolation; it is most useful for comparing different clusterings of the same data, where a higher score indicates more distinct and well-separated clusters.

Based on the low Silhouette Score, the clustering is not very well-defined. Different clustering techniques, preprocessing steps, or numbers of clusters may improve the results. Domain knowledge is also valuable for interpreting the clustering quality.
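
For a single point, the silhouette value is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points of its own cluster and `b` is the mean distance to the points of the nearest other cluster. The sketch below implements this with NumPy on four made-up 1D points; the function name `silhouette_samples_np` and the data are illustrative assumptions, and the notebook itself relies on scikit-learn's `silhouette_score`.

```python
import numpy as np

def silhouette_samples_np(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a singleton cluster gets a silhouette of 0
        # Mean distance to the other members of the own cluster (self-distance is 0)
        a = dists[i, same].sum() / (n_same - 1)
        # Smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])

good_mean = silhouette_samples_np(X_toy, [0, 0, 1, 1]).mean()
bad_mean = silhouette_samples_np(X_toy, [0, 1, 0, 1]).mean()

print(good_mean, bad_mean)
```

The sensible grouping scores close to 1, while the grouping that mixes near and far points scores below 0, which gives a feel for where a value such as 0.154 sits on the scale.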

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Score agglomerative clustering (Ward linkage) for several cluster counts

# Initialize empty lists to store metric scores
silhouette_scores = []
calinski_harabasz_scores = []

# Try different numbers of clusters
for n_clusters in range(2, 6):
    clustering = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
    labels = clustering.fit_predict(X)

    # Calculate silhouette score
    silhouette = silhouette_score(X, labels)
    silhouette_scores.append(silhouette)

    # Calculate Calinski-Harabasz score
    calinski_harabasz = calinski_harabasz_score(X, labels)
    calinski_harabasz_scores.append(calinski_harabasz)

In [None]:
# Plot the evaluation metric scores
plt.figure(figsize=(10, 5))

# Silhouette Score Plot
plt.subplot(1, 2, 1)
plt.plot(range(2, 6), silhouette_scores, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs. Number of Clusters')
plt.grid()

# Calinski-Harabasz Score Plot
plt.subplot(1, 2, 2)
plt.plot(range(2, 6), calinski_harabasz_scores, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Calinski-Harabasz Score')
plt.title('Calinski-Harabasz Score vs. Number of Clusters')
plt.grid()

plt.tight_layout()
plt.show()

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization (exhaustive grid search)

from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import silhouette_score

# Define the linkage methods and numbers of clusters to explore
param_grid = {
    'linkage': ['ward', 'complete'],  # List of linkage methods
    'n_clusters': [4, 5, 6]  # Numbers of clusters to try
}

# Step 2: Hyperparameter Tuning with Silhouette Score
# GridSearchCV cannot score AgglomerativeClustering: the estimator has no
# predict() method for held-out folds and there are no true labels to compare with.
# Instead, fit every parameter combination on the full data and keep the best one.
best_params, best_estimator, labels = None, None, None
silhouette_score_value = float('-inf')
for params in ParameterGrid(param_grid):
    candidate = AgglomerativeClustering(**params)
    candidate_labels = candidate.fit_predict(X)
    candidate_score = silhouette_score(X, candidate_labels)
    if candidate_score > silhouette_score_value:
        best_params, best_estimator = params, candidate
        labels, silhouette_score_value = candidate_labels, candidate_score

# Print the best parameters
print("Best Parameters:")
print(best_params)
print('\n')

# Step 3: Cluster labels of the best model
print(labels)
print('\n')

# Step 4: Evaluate the Clustering
print("Best Silhouette Score:", silhouette_score_value)

In [None]:
# Step 5: 2D Plot
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Hierarchical Clustering Result (2D)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

In [None]:
# Step 6: 3D Plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels, cmap='viridis')
ax.set_title('Hierarchical Clustering Result (3D)')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
plt.show()

In [None]:
new_data.head()

# Keep the original (non TF-IDF) columns; copy to avoid modifying a slice of 'new_data'
data_with_clusters_label = new_data.iloc[:, :18].copy()

# Add the cluster assignments as a new column
data_with_clusters_label['cluster_label'] = labels

# Display the updated dataset to verify
data_with_clusters_label

In [None]:
from IPython.display import display

# List the unique cluster labels
unique_labels = data_with_clusters_label['cluster_label'].unique()
print(unique_labels)
print('\n')

# Split the dataset by cluster label
cluster_0_data = data_with_clusters_label[data_with_clusters_label['cluster_label'] == 0]
cluster_1_data = data_with_clusters_label[data_with_clusters_label['cluster_label'] == 1]
cluster_2_data = data_with_clusters_label[data_with_clusters_label['cluster_label'] == 2]
cluster_3_data = data_with_clusters_label[data_with_clusters_label['cluster_label'] == 3]

# Only the last expression of a cell is displayed automatically,
# so show the first rows of every cluster explicitly
for cluster_data in (cluster_0_data, cluster_1_data, cluster_2_data, cluster_3_data):
    display(cluster_data.head(3))

In [None]:
print(cluster_0_data.shape)
print(cluster_1_data.shape)
print(cluster_2_data.shape)
print(cluster_3_data.shape)
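
Cluster sizes and a first profile of each cluster can also be obtained without creating one variable per cluster. The sketch below uses a made-up frame with a `type` column and a `cluster_label` column; the values are hypothetical, and it assumes the real dataset keeps a comparable `type` column.

```python
import pandas as pd

toy_clusters = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "cluster_label": [0, 0, 1, 1, 1],
})

# Number of titles per cluster, ordered by cluster label
cluster_sizes = toy_clusters["cluster_label"].value_counts().sort_index()

# How movies and TV shows are distributed across the clusters
type_by_cluster = pd.crosstab(toy_clusters["cluster_label"], toy_clusters["type"])

print(cluster_sizes)
print(type_by_cluster)
```

This form keeps working if the tuned model ends up with five or six clusters instead of four.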

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

# Data on which we are fitting model
X = transformed_data_pca

from sklearn.cluster import KMeans

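# Create a KMeans model with n_clusters = 4
# n_init is set explicitly so the number of restarts does not depend on the scikit-learn version
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)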

# Fit the model to your data (X is your data)
kmeans.fit(X)

# Get the cluster labels for each data point
kmeans_labels = kmeans.labels_

from sklearn.metrics import silhouette_score, calinski_harabasz_score

# 'kmeans_labels' holds the K-Means cluster label of each data point

# Calculate the Silhouette Score
silhouette_score_value = silhouette_score(X, kmeans_labels)

# Calculate the Calinski-Harabasz Score
calinski_harabasz_score_value = calinski_harabasz_score(X, kmeans_labels)

# Print the evaluation metrics
print("Silhouette Score:", silhouette_score_value)
print("Calinski-Harabasz Score:", calinski_harabasz_score_value)

In [None]:
# Create a scatter plot of your data points colored by cluster
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='rainbow')
plt.title('K-Means Clustering (2D Plot)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

In [None]:
import plotly.express as px

# 'X' is a NumPy array of principal components and 'kmeans_labels' holds the K-Means cluster labels
# The first three principal components (indices 0, 1, 2) are plotted

# Create a 3D scatter plot of the data points colored by cluster
# (labels are cast to strings so Plotly uses discrete colors instead of a continuous scale)
fig = px.scatter_3d(x=X[:, 0], y=X[:, 1], z=X[:, 2], color=kmeans_labels.astype(str))
fig.update_traces(marker=dict(size=5))
fig.update_layout(title='K-Means Clustering (3D Plot)')
fig.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Define a range of hyperparameter values to explore
param_grid = {
    'n_clusters': [2, 3, 4, 5, 6]  # Adjust the number of clusters as needed
}

# Create a KMeans model
kmeans = KMeans(random_state=0)

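# Create GridSearchCV with Silhouette Score as the scoring metric.
# make_scorer(silhouette_score) would call the metric as (y_true, y_pred), but
# clustering has no y_true, so a callable with the (estimator, X, y) signature is used.
def silhouette_scorer(estimator, X, y=None):
    labels = estimator.predict(X)  # cluster assignments for the held-out fold
    return silhouette_score(X, labels)

grid_search = GridSearchCV(kmeans, param_grid, scoring=silhouette_scorer, cv=5, n_jobs=-1)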

# Fit the model to your data (X is your data)
grid_search.fit(X)

# Get the best parameters and estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

# Get the cluster labels for each data point using the best estimator
best_kmeans_labels = best_estimator.fit_predict(X)

# Calculate the Silhouette Score with the best estimator
best_silhouette_score = silhouette_score(X, best_kmeans_labels)

# Calculate the Calinski-Harabasz Score with the best estimator
best_calinski_harabasz_score = calinski_harabasz_score(X, best_kmeans_labels)

# Print the best parameters and evaluation metrics
print("Best Parameters:", best_params)
print("Best Silhouette Score:", best_silhouette_score)
print("Best Calinski-Harabasz Score:", best_calinski_harabasz_score)

##### Which hyperparameter optimization technique have you used and why?

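GridSearchCV was used to search over `n_clusters` (2 to 6) with the Silhouette Score as the scoring metric, since a grid this small can be evaluated exhaustively. The elbow method and a per-k silhouette analysis, shown in the cells below, serve as a visual cross-check.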

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.cluster import KMeans
# Assuming 'X' is your dataset

# Define a range of values for k (number of clusters) to explore
k_values = range(1,11)  # You can adjust the range as needed

# Calculate the inertia (sum of squared distances) for each value of k
# Fit the Algorithm
inertia_values = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(X)
    inertia_values.append(kmeans.inertia_)

# Plot the elbow curve
plt.plot(k_values, inertia_values, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()

In [None]:
# Predict on the model
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# 'X' is our dataset
# Define the range of values for k (number of clusters) to explore
k_values = range(2, 9)  # We can adjust the range as needed

# Create an array to store the Silhouette Scores for each k
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=0)
    cluster_labels = kmeans.fit_predict(X)

    # Calculate the Silhouette Score for the entire dataset
    silhouette_avg = silhouette_score(X, cluster_labels)

    # Calculate the Silhouette Score for each data point
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    # Store the average Silhouette Score for this k
    silhouette_scores.append(silhouette_avg)

    # 2D Plot
    plt.figure(figsize=(12, 4))
    plt.subplot(121)
    plt.title("2D Plot for K-Means Clustering with k = %d" % k)
    plt.scatter(X[:, 0], X[:, 1], c=cluster_labels)

    # Create a bar chart to visualize the Silhouette Score for each cluster
    plt.subplot(122)
    plt.title("Silhouette plot for K-Means clustering with k = %d" % k)
    plt.xlim([0, 0.3])  # Set the limits to 0 to 0.3

    y_lower = 10

    for i in range(k):
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = plt.cm.nipy_spectral(float(i) / k)
        plt.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, facecolor=color, edgecolor=color, alpha=0.7)

        y_lower = y_upper + 10

    # The vertical line for average Silhouette Score
    plt.axvline(x=silhouette_avg, color="red", linestyle="--")

    plt.show()

    # 3D Plot
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=cluster_labels)
    ax.set_title("3D Plot for K-Means Clustering with k = %d" % k)

# Plot the Silhouette Scores for different values of k
plt.figure()
plt.plot(k_values, silhouette_scores, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Average Silhouette Score')
plt.title('Average Silhouette Score for Different k')
plt.show()

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, both evaluation metrics improved after hyperparameter tuning. The scores before and after tuning compare as follows:

**Before Tuning (n_clusters = 4):**
- Silhouette Score: 0.025719200603464592
- Calinski-Harabasz Score: 203.86886775221626

**After Tuning:**
- Best Parameters: {'n_clusters': 2}
- Best Silhouette Score: 0.054540703397331185
- Best Calinski-Harabasz Score: 393.04059052681237

The Silhouette Score rose from 0.0257 to 0.0545, and the Calinski-Harabasz Score rose from 203.87 to 393.04, which means the ratio of between-cluster variance to within-cluster variance roughly doubled.

The improvement is relative, however. A Silhouette Score of about 0.05 is still close to 0, which indicates that the clusters overlap considerably even with the tuned number of clusters (2 in this case).
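
The comparison above can be repeated for any range of k with a small loop. The sketch below is only an illustration: it assumes synthetic data from `make_blobs` as a stand-in for `transformed_data_pca`, and the helper name `score_k_range` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score


def score_k_range(X, k_values, random_state=0):
    """Return {k: (silhouette, calinski_harabasz)} for each candidate k."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = (silhouette_score(X, labels), calinski_harabasz_score(X, labels))
    return scores


# Synthetic stand-in for the PCA-transformed data (assumption for illustration)
X_demo, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]], cluster_std=1.0, random_state=0
)

scores = score_k_range(X_demo, range(2, 7))
best_k = max(scores, key=lambda k: scores[k][0])  # k with the highest Silhouette Score

for k, (sil, ch) in scores.items():
    print(f"k={k}: silhouette={sil:.3f}, calinski_harabasz={ch:.1f}")
print("Best k by Silhouette Score:", best_k)
```

Printing both metrics side by side makes it easier to see whether they agree on the same k before settling on a value.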

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

A K-Means clustering model is used to cluster the data into 4 clusters. After fitting the model, two evaluation metrics, the Silhouette Score and the Calinski-Harabasz Score, are calculated to assess the clustering:

1. **K-Means Model (Clustering Algorithm):** K-Means is an unsupervised machine learning algorithm that partitions data into a predefined number of clusters. In this case, the model is set to create 4 clusters.

2. **Silhouette Score:** The Silhouette Score measures the quality of the clusters. It quantifies how similar each data point is to its own cluster compared to other clusters. The score ranges from -1 to 1, where a higher value indicates better-defined clusters. A Silhouette Score of 0.0237 is close to 0, which indicates that the clusters overlap and are only weakly separated.

3. **Calinski-Harabasz Score:** The Calinski-Harabasz Score measures the ratio of between-cluster variance to within-cluster variance. A higher score indicates better separation between clusters. The score has no fixed upper bound, so a value of 185.02 is mainly useful for comparison with other configurations fitted to the same data.

Overall, the K-Means model with 4 clusters produces weakly separated clusters: the Silhouette Score is positive but close to 0. These scores should be interpreted in the context of the data and the problem. Tuning the number of clusters, trying other clustering algorithms, or further data preprocessing may improve cluster separation. Visualizations such as scatter plots of the clusters give further insight into clustering quality and the distribution of data points within clusters.
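
To make the two metrics concrete, the sketch below computes both by hand with NumPy on a four-point toy dataset. The function names and the toy data are assumptions made for illustration; in the notebook itself the scikit-learn functions `silhouette_score` and `calinski_harabasz_score` are used.

```python
import numpy as np


def silhouette_by_hand(X, labels):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # singleton cluster: silhouette is defined as 0
            continue
        a = dists[i, same].mean()  # mean distance to the point's own cluster
        b = min(
            dists[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )  # mean distance to the nearest other cluster
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


def calinski_harabasz_by_hand(X, labels):
    """[between-cluster dispersion / (k - 1)] / [within-cluster dispersion / (n - k)]."""
    n, k = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)
    between = within = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return float((between / (k - 1)) / (within / (n - k)))


# Toy data (assumption): two tight pairs of points, far apart from each other
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good_labels = np.array([0, 0, 1, 1])  # matches the visible grouping
bad_labels = np.array([0, 1, 0, 1])   # mixes the two groups

good_silhouette = silhouette_by_hand(X_toy, good_labels)
bad_silhouette = silhouette_by_hand(X_toy, bad_labels)
good_ch = calinski_harabasz_by_hand(X_toy, good_labels)

print("Silhouette (good labels):", round(good_silhouette, 4))
print("Silhouette (bad labels):", round(bad_silhouette, 4))
print("Calinski-Harabasz (good labels):", good_ch)
```

The labeling that matches the visible grouping scores close to 1, while the mixed labeling scores below 0, which is the sense in which a value near 0 signals overlapping clusters.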

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Data on which we are fitting model
X = transformed_data_pca

from sklearn.cluster import DBSCAN

# Create a DBSCAN model
dbscan = DBSCAN(eps=0.5, min_samples=5)  # We can adjust 'eps' and 'min_samples' as needed

# Fit the Algorithm
cluster_labels = dbscan.fit_predict(X)  # X is your dataset

# Visualize the clusters
# We have 2D data for visualization
plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.show()

In [None]:
# Predict on the model
from sklearn.cluster import DBSCAN
from mpl_toolkits.mplot3d import Axes3D

# Create a DBSCAN model
dbscan = DBSCAN(eps=0.5, min_samples=5)  # Adjust 'eps' and 'min_samples' as needed

# Fit the model to your 3D data
cluster_labels = dbscan.fit_predict(X)  # X is your 3D dataset

# Visualize the clusters in a 3D scatter plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Assuming X has three features (X[:, 0], X[:, 1], X[:, 2])
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=cluster_labels, cmap='viridis')

ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_zlabel('Feature 3')

plt.title('DBSCAN Clustering (3D Plot)')
plt.show()

In [None]:
# We have already applied DBSCAN and obtained cluster_labels

# Print the cluster labels
print("Cluster Labels:")
print(cluster_labels)
print('\n')

# Get the unique cluster labels
unique_labels = np.unique(cluster_labels)

# Print the unique cluster labels
print("Unique Cluster Labels:")
print(unique_labels)

If `[-1]` is the only unique cluster label, DBSCAN has labeled every data point as noise: with the given `eps` and `min_samples`, it found no dense regions in the dataset.

To address this, adjust the DBSCAN hyperparameters `eps` (the maximum distance between two samples for one to be considered as in the neighborhood of the other) and `min_samples` (the number of samples in a neighborhood for a point to be considered a core point).

Possible steps:

1. **Adjust Hyperparameters:** When every point is noise, `eps` is probably too small or `min_samples` too large for the density of the data. Larger `eps` values produce fewer, larger clusters, while smaller values produce more fine-grained clusters and more noise points.

2. **Scale the Data:** DBSCAN is sensitive to the scale of the features because `eps` is a single distance threshold. Standardizing or normalizing the data can make a difference.

3. **Data Inspection:** Review the data to understand its distribution and characteristics. DBSCAN may not perform well on uniformly distributed data or data with varying densities.

4. **Consider Other Clustering Algorithms:** If DBSCAN is not suitable for the data, consider other clustering algorithms such as K-Means, Agglomerative Hierarchical Clustering, or Gaussian Mixture Models.

5. **Dimension Reduction:** Distances become less informative in high-dimensional spaces. `X` is already PCA-transformed here; if it still has many components, keeping fewer of them before clustering may help.

By experimenting with these steps and adjusting the hyperparameters, it may be possible to identify meaningful clusters in the data using DBSCAN or another suitable clustering algorithm.
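
One common way to choose a starting value for `eps` is to look at the distance from each point to its `min_samples`-th nearest neighbor. The sketch below illustrates that heuristic on synthetic `make_blobs` data (an assumption, standing in for `transformed_data_pca`); the helper `k_distance_curve` and the 90th-percentile cutoff are illustrative choices, not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors


def k_distance_curve(X, min_samples):
    """Sorted distance from each point to its min_samples-th nearest point, itself included."""
    neighbors = NearestNeighbors(n_neighbors=min_samples).fit(X)
    distances, _ = neighbors.kneighbors(X)  # column 0 is the point itself (distance 0)
    return np.sort(distances[:, -1])


# Synthetic stand-in for the PCA-transformed data (assumption for illustration)
X_demo, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]], cluster_std=1.0, random_state=0
)

min_samples = 5
curve = k_distance_curve(X_demo, min_samples)

# Heuristic (assumption): pick eps so that about 90% of points qualify as core points
eps = float(np.percentile(curve, 90))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo)
n_clusters = len(set(labels) - {-1})
noise_fraction = float(np.mean(labels == -1))

print(f"eps={eps:.3f}, clusters={n_clusters}, noise fraction={noise_fraction:.2%}")
```

Plotting `curve` and looking for the point where it bends sharply upward is the usual visual version of this heuristic; the fixed percentile here is just a quick substitute.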

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that can identify clusters of varying shapes and sizes in a dataset. It works by defining clusters as dense regions of data points separated by sparser regions, and it can identify noise points as well. The two main hyperparameters for DBSCAN are `eps` (the maximum distance between two samples for one to be considered as in the neighborhood of the other) and `min_samples` (the number of samples in a neighborhood for a point to be considered as a core point).

In the code above, DBSCAN is applied with `eps=0.5` and `min_samples=5`. If all cluster labels are -1, DBSCAN found no core points for these hyperparameters: it labeled every data point as noise and identified no clusters in the dataset.

To address this issue and assess the performance of DBSCAN, you can consider the following steps:

1. **Hyperparameter Tuning:** Experiment with different values for `eps` and `min_samples`. Smaller `eps` values may lead to more fine-grained clusters, while larger values may result in fewer but larger clusters. Adjusting these hyperparameters may help DBSCAN identify meaningful clusters in your data.

2. **Scaling:** Ensure that your data is properly scaled. DBSCAN is sensitive to the scale of features. Standardize or normalize your data before applying DBSCAN.

3. **Data Inspection:** Review your data to understand its distribution and characteristics. DBSCAN may not perform well on uniformly distributed data or data with varying densities.

4. **Dimensionality Reduction:** If your dataset has a high dimensionality, consider reducing the dimensionality through techniques like PCA (Principal Component Analysis) before applying DBSCAN.

5. **Evaluation Metrics:** The Silhouette Score, Calinski-Harabasz Score, and Davies-Bouldin Index all require at least two distinct cluster labels, so none of them can be computed while every label is -1. Once DBSCAN finds two or more clusters, these metrics can be used to assess cluster separation and cohesion, typically on the non-noise points.

By experimenting with different settings, you may be able to identify meaningful clusters in your data using DBSCAN or other suitable clustering algorithms. Visualizations, such as scatter plots and cluster plots, can also help you understand the distribution of data points and the clustering results.
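
Because the internal metrics are undefined when DBSCAN finds fewer than two clusters, it helps to guard the evaluation. The sketch below is one possible way to do that; the helper `summarize_dbscan` is hypothetical and the data is synthetic (`make_blobs`), not the notebook's `transformed_data_pca`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def summarize_dbscan(X, eps, min_samples):
    """Fit DBSCAN and report cluster count, noise fraction, and silhouette (if defined)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    clustered = labels != -1  # mask of non-noise points
    n_clusters = len(set(labels[clustered]))
    summary = {
        "n_clusters": n_clusters,
        "noise_fraction": float(np.mean(~clustered)),
        "silhouette": None,  # stays None when fewer than two clusters are found
    }
    if n_clusters >= 2:
        # Score only the points that were assigned to a cluster
        summary["silhouette"] = float(silhouette_score(X[clustered], labels[clustered]))
    return summary


# Synthetic stand-in data (assumption for illustration)
X_demo, _ = make_blobs(
    n_samples=200, centers=[[0, 0], [10, 10]], cluster_std=1.0, random_state=0
)

all_noise = summarize_dbscan(X_demo, eps=0.01, min_samples=5)  # eps far too small
found = summarize_dbscan(X_demo, eps=1.5, min_samples=5)

print("eps=0.01:", all_noise)
print("eps=1.5:", found)
```

Reporting the noise fraction alongside the score matters because a high silhouette computed on only a few clustered points can be misleading.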

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.cluster import DBSCAN
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import calinski_harabasz_score
import numpy as np

# Create a DBSCAN model
dbscan = DBSCAN()

# Define a range of hyperparameter values to explore
param_grid = {
    'eps': [0.1,0.25,0.50, 1.0],  # Adjust the range of 'eps'
    'min_samples': [5, 10]  # Adjust the values for 'min_samples'
}

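# Create GridSearchCV with Calinski-Harabasz Score as the scoring metric.
# DBSCAN has no predict() and the data has no ground-truth labels, so a supervised
# scorer such as 'adjusted_rand_score' cannot be used. Each fit is scored on its non-noise points.
def dbscan_ch_scorer(estimator, X, y=None):
    labels = estimator.labels_
    mask = labels != -1  # drop noise points
    if len(set(labels[mask])) < 2:
        return -1.0  # fewer than two clusters: the metric is undefined
    return calinski_harabasz_score(X[mask], labels[mask])

# Use the full dataset as both the "train" and "test" split so that labels_ aligns with X
full_split = [(np.arange(len(X)), np.arange(len(X)))]
grid_search = GridSearchCV(dbscan, param_grid, scoring=dbscan_ch_scorer, cv=full_split, n_jobs=-1)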

# Fit the model to your data
grid_search.fit(X)  # X is your dataset

# Get the best parameters and estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

# Print the best parameters
print("Best Parameters:")
print(best_params)
# Fit the Algorithm

# Predict on the model

In [None]:
# Use the cluster labels from the tuned DBSCAN model (best_estimator comes from the grid search above)
cluster_labels = best_estimator.labels_

# Create a 2D scatter plot of your data points colored by cluster
plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='viridis')  # Adjust the features (X[:, 0], X[:, 1]) as needed

# Add labels and a colorbar
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('DBSCAN Clustering (2D Plot)')
plt.colorbar()

# Show the plot
plt.show()

In [None]:
from mpl_toolkits.mplot3d import Axes3D

# Use the cluster labels from the tuned DBSCAN model (best_estimator comes from the grid search above)
cluster_labels = best_estimator.labels_

# Create a 3D scatter plot of your data points colored by cluster
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Adjust the features (X[:, 0], X[:, 1], X[:, 2]) as needed
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=cluster_labels, cmap='viridis')

# Add labels
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_zlabel('Feature 3')
ax.set_title('DBSCAN Clustering (3D Plot)')

# Show the plot
plt.show()

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV to optimize the hyperparameters of DBSCAN. GridSearchCV exhaustively evaluates every combination of the listed `eps` and `min_samples` values and keeps the best-scoring configuration, which is practical for a grid this small (eight combinations). The scoring metric needs care: `adjusted_rand_score` compares predicted labels against ground-truth labels, which this dataset does not have, so a label-free metric such as the Calinski-Harabasz Score is the appropriate score for the search.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

**Improvement:** No. Hyperparameter optimization did not produce meaningful clusters: every cluster label is still -1, so DBSCAN continues to treat all data points as noise.

In such cases, consider the following:

- **Data Exploration:** Carefully review your data to understand its distribution, characteristics, and any potential issues that might hinder clustering.

- **Scaling:** Ensure that your data is properly scaled. DBSCAN is sensitive to feature scales, and standardizing or normalizing your data can make a difference.

- **Outlier Detection:** Consider the possibility that your dataset may contain a significant amount of noise or outliers that are affecting the clustering results. You can use outlier detection techniques like Isolation Forest or Local Outlier Factor (LOF) to identify and potentially remove outliers before clustering.

- **Density Structure:** DBSCAN defines clusters as dense regions separated by sparser regions, so it needs data with that structure. Uniformly distributed data, or clusters whose densities differ widely, are poorly suited to a single `eps` value.

- **Alternative Algorithms:** If DBSCAN continues to produce unsatisfactory results, consider trying other clustering algorithms, such as K-Means, Agglomerative Hierarchical Clustering, or Gaussian Mixture Models. Different algorithms may perform better on your specific dataset.

Additionally, it's important to visualize your data and clustering results to gain insights into the distribution of data points. Cluster plots, scatter plots, and density plots can help you assess the quality of clustering.
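
The "alternative algorithms" option above can be checked quickly by fitting several models on the same data and comparing one metric. The following is a minimal sketch under stated assumptions: synthetic `make_blobs` data replaces `transformed_data_pca`, and the cluster count of 3 is chosen only for the demonstration.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (assumption for illustration)
X_demo, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [8, 8], [-8, 8]], cluster_std=1.0, random_state=0
)

# Candidate models, all asked for the same number of clusters
models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "GaussianMixture": GaussianMixture(n_components=3, random_state=0),
}

# Fit each model and score its labels with the Silhouette Score
results = {
    name: float(silhouette_score(X_demo, model.fit_predict(X_demo)))
    for name, model in models.items()
}

for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: silhouette={score:.3f}")
```

Using one metric on identical data keeps the comparison fair; on real data the models would usually disagree more than they do on well-separated blobs.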

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For evaluating the clustering models and their impact on the business, I considered the following evaluation metrics:

1. **Silhouette Score:** I considered the Silhouette Score because it measures the quality of clusters in terms of their separation and cohesion. A higher Silhouette Score indicates well-defined, distinct clusters. It is a valuable metric for assessing the effectiveness of clustering algorithms in creating meaningful groups within the data.

2. **Calinski-Harabasz Score:** The Calinski-Harabasz Score measures the ratio of between-cluster variance to within-cluster variance, so it complements the Silhouette Score as a second view of cluster separation and cohesion. A higher score indicates more compact, better-separated clusters, which makes any cluster-based business segmentation more reliable.

These metrics were chosen because they help in quantifying the quality of clustering results, which is crucial for determining whether the clustering models can lead to actionable insights and business impact. High-quality clusters can facilitate personalized recommendations, content categorization, and targeted marketing, among other business applications.


**K-Means clustering** produced the most usable clusters on the Netflix dataset according to the Silhouette Score and the Calinski-Harabasz Score. The absolute values are modest (a Silhouette Score of about 0.05 after tuning), so the clusters overlap and should be read as broad groupings rather than sharply separated segments.

The use of K-Means clustering aligns with the project's goal of understanding content trends and user preferences on Netflix. This clustering approach can positively impact the business by facilitating personalized recommendations, content categorization, and targeted marketing strategies. Therefore, K-Means clustering is a suitable choice for deriving actionable insights and potentially driving business improvements based on content analysis.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The choice of the final prediction model depends on the goals and requirements of the project. Each clustering model, whether Hierarchical Clustering, K-Means, or DBSCAN, has its strengths and weaknesses.

In this project, K-Means gave the best result based on the Silhouette Score and the Calinski-Harabasz Score, and it was chosen for the reasons below.

**Choice of K-Means as the Final Prediction Model:**

K-Means clustering has been selected as the final prediction model for several reasons:

1. **Best-Scoring Clusters:** K-Means gave the best Silhouette Score and Calinski-Harabasz Score in this analysis. The absolute scores are low (a Silhouette Score of about 0.05 after tuning), so the clusters overlap, but they are the most usable grouping obtained.

2. **Interpretability:** K-Means provides highly interpretable results. It assigns each data point to one of the K clusters, making it easy to understand and analyze the groupings.

3. **Scalability:** K-Means is computationally efficient and can handle large datasets, which is crucial for real-world applications.

4. **Business Impact:** The quality of the clusters directly influences the business impact of the analysis. Well-separated clusters can lead to personalized recommendations, content categorization, and targeted marketing strategies, enhancing the user experience and potentially driving business improvements.

5. **Alignment with Project Goals:** K-Means aligns with the project's objectives of understanding content trends and user preferences on Netflix. It offers a practical approach for segmentation and content analysis.

While K-Means is chosen as the final prediction model, model selection should always rest on the characteristics of the dataset, the goals of the analysis, and the quality of the results achieved. Here, the choice of K-Means follows from its performance metrics relative to the other models and its alignment with the project's objectives.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In the context of clustering models like K-Means, traditional feature importance techniques such as those used in supervised learning models (e.g., Random Forest, Gradient Boosting) do not directly apply. Clustering models are unsupervised, meaning they don't have target variables to rank features by importance in the same way.

However, you can gain insights into the feature importance in a clustering context by considering the centroids of clusters and the distribution of data points within each cluster. Here's how you can do it:

1. **Centroid Analysis:** In K-Means clustering, each cluster is represented by a centroid, which is the mean of all data points in that cluster. You can interpret the features of the centroids as indicative of the cluster's characteristics. Features with significantly different values between clusters may be considered important for distinguishing those clusters.

2. **Visual Inspection:** Visualization tools like t-SNE (t-distributed Stochastic Neighbor Embedding) or PCA (Principal Component Analysis) can help you explore the distribution of data points within clusters. By visualizing the data in a reduced-dimensional space, you can identify which features contribute to the separation of clusters.

3. **Dimension Reduction:** You can apply dimension reduction techniques like PCA to identify which principal components explain most of the variance in the data. The original features that contribute the most to these principal components can be considered important for clustering.

4. **Feature Scaling:** Proper feature scaling is crucial in K-Means clustering. Features that are not scaled appropriately may have an unequal impact on cluster assignments. Therefore, scaling features can indirectly help identify their importance in the clustering process.

5. **Silhouette Analysis:** Silhouette analysis, used above, shows how well separated the clusters are. Recomputing the silhouette score after removing a feature indicates how much that feature contributes to cluster separation.

6. **Interpretation:** Interpretability is a critical aspect of understanding feature importance in clustering. Reviewing the actual data points within clusters and exploring how features differ across clusters can provide valuable insights.

In summary, feature importance in clustering models like K-Means is derived from the differences in feature values between clusters and how these differences contribute to the separation and formation of clusters. While there are no direct feature importance scores as in supervised learning models, these techniques and tools can help you gain insights into the role of features in the clustering process.
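
The centroid analysis described above can be sketched for PCA-transformed data by mapping the K-Means centroids back to the original feature space with `PCA.inverse_transform`. Everything in this example is an assumption for illustration: the data is randomly generated, the feature names are hypothetical, and ranking features by the spread of centroid values is one simple proxy, not a formal importance score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]  # hypothetical names

# Synthetic data (assumption): two groups that differ only in feat_a
group_0 = rng.normal(0.0, 1.0, size=(100, 4))
group_1 = rng.normal(0.0, 1.0, size=(100, 4))
group_1[:, 0] += 10.0
X_raw = np.vstack([group_0, group_1])

# Reduce with PCA, then cluster in the reduced space
pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X_raw)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_pca)

# Map centroids back to the original feature space
centroids_original = pca.inverse_transform(kmeans.cluster_centers_)

# Features whose centroid values differ most between clusters separate them most
spread = centroids_original.max(axis=0) - centroids_original.min(axis=0)
ranking = [feature_names[i] for i in np.argsort(spread)[::-1]]

for name in ranking:
    print(f"{name}: centroid spread = {spread[feature_names.index(name)]:.2f}")
```

If the original features are on different scales, the spreads should be compared on standardized features, otherwise large-valued features dominate the ranking.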

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
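
The two steps above (save, then reload and predict) could look like the sketch below. It is a minimal illustration, not the project's deployment code: the model is fitted on synthetic `make_blobs` data, and the file name `kmeans_model.joblib` and the unseen points are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data and model (assumption for illustration)
X_demo, _ = make_blobs(
    n_samples=200, centers=[[0, 0], [10, 10]], cluster_std=1.0, random_state=0
)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# Save the fitted model to a temporary directory (hypothetical file name)
model_path = os.path.join(tempfile.mkdtemp(), "kmeans_model.joblib")
joblib.dump(model, model_path)

# Load the model back and predict clusters for unseen points as a sanity check
loaded_model = joblib.load(model_path)
X_unseen = np.array([[0.5, -0.5], [9.5, 10.5]])
unseen_labels = loaded_model.predict(X_unseen)
same_as_original = np.array_equal(unseen_labels, model.predict(X_unseen))

print("Predicted clusters for unseen data:", unseen_labels)
print("Matches the in-memory model:", same_as_original)
```

For the actual notebook, unseen titles would first need the same preprocessing as the training data, so any fitted vectorizer, scaler, and PCA object would have to be saved alongside the clustering model.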

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

The conclusion below draws on the analysis using Hierarchical Clustering, K-Means, and DBSCAN.

**Project Conclusion:**

The project aimed to analyze a dataset of TV shows and movies available on Netflix as of 2019 to extract insights through exploratory data analysis (EDA) and clustering techniques using Hierarchical Clustering, K-Means, and DBSCAN. Several hypotheses were formulated and tested to uncover trends and patterns in the content available on Netflix.

**Hypothesis 1: Content Type Trend**
- The analysis showed that the content type trend on Netflix has remained relatively stable over the years, with no significant shift toward TV shows or movies. This conclusion was consistent across all clustering models.

**Hypothesis 2: Geographic Variation in Content**
- The analysis using Hierarchical Clustering, K-Means, and DBSCAN indicated significant variations in content availability across different countries. The content distribution differed by region, confirming the hypothesis of geographic variations in content.

**Hypothesis 3: Shift in Netflix Focus**
- The analysis using Hierarchical Clustering, K-Means, and DBSCAN consistently supported the alternative hypothesis that Netflix has been increasingly focusing on TV shows in recent years. This shift in focus was evident across all clustering models.

**Clustering Models:**
- Hierarchical Clustering revealed hierarchical relationships among data points but didn't provide clear clusters.
- K-Means identified clusters and provided the most interpretable results of the three models.
- DBSCAN struggled to identify meaningful clusters and labeled the data points as noise for the parameter values tested, indicating potential data or hyperparameter issues.

**Overall:**
- The project provided insights into content trends on Netflix, geographic variations, and shifts in focus.
- While clustering models revealed variations in content distribution, DBSCAN faced challenges in finding meaningful clusters.
- Further investigations and improvements in data preprocessing and clustering techniques are recommended to enhance clustering performance.

In summary, the project offered valuable insights into the Netflix dataset, but further refinements in clustering or consideration of alternative clustering algorithms may be necessary to uncover more detailed insights about the content distribution.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***