<a href="https://colab.research.google.com/github/Kuldeep25112000/M6-Capstone_Project/blob/main/Unsupervised_ML_Netflix_Movies_and_TV_Shows_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Unsupervised_ML_Netflix_Movies_and_TV_Shows_Clustering



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**

This project applies unsupervised machine learning techniques to cluster Netflix movies and TV shows using the Netflix dataset containing attributes such as type, title, director, cast, country, date added, release year, rating, duration, listed genres, and description. Since no target labels are available, clustering is used to identify natural groupings within the content library.

The data preprocessing stage includes handling missing values (such as absent director names), converting categorical features (type, country, rating, and listed genres) into numerical representations, and extracting useful information from text-based columns like description and listed_in using TF-IDF vectorization. Duration is standardized by separating movies (minutes) and TV shows (number of seasons), and date-related features are cleaned for consistency.

Clustering algorithms such as K-Means are implemented to group similar titles based on genre combinations, content type (Movie or TV Show), release year, maturity rating, and textual similarity in descriptions. The Elbow Method and Silhouette Score are used to determine the optimal number of clusters, while PCA helps visualize the clustered data in lower dimensions.

The clustering results reveal meaningful patterns, such as groups of international TV dramas, action and sci-fi movies, crime and mystery series, and short-duration horror or thriller films. These insights can support content recommendation systems, audience segmentation, and strategic content planning for streaming platforms.

Overall, this project demonstrates how unsupervised learning can effectively analyze unlabelled entertainment data to uncover hidden structures and improve understanding of content diversity on Netflix.

# **GitHub Link -**

https://github.com/Kuldeep25112000/M6-Capstone_Project

# **Problem Statement**


This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine. In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset, in particular why the movie catalogue has shrunk while the TV show catalogue has grown.
This shift may have a significant impact on the business if it is not understood soon.

#### **Define Your Business Objective?**

***To increase user engagement, retention, and revenue by delivering the right content to the right audience.***

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
#Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
#Data Preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
#Text Processing (NLP)
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack
#Dimensionality Reduction
from sklearn.decomposition import PCA
#Clustering Algorithms
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
#Clustering Evaluation
from sklearn.metrics import silhouette_score
#Handling Warnings
import warnings
warnings.filterwarnings('ignore')
#Date Handling
from datetime import datetime


### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
data = pd.read_csv('/content/drive/MyDrive/M6: Model/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# Visualizing the missing values
data.isnull().sum().plot.bar()

### What did you know about your dataset?

The dataset contains Netflix movies and TV shows, with 7787 rows and 12 columns. There are no duplicate rows. Some columns contain missing values: director has the most, with 2389 missing out of 7787 (about 31%), and rating has the fewest among the affected columns, with 7.
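
The counts above can be turned into percentages per column. The snippet below is a minimal sketch on a small made-up frame (the `toy` values are illustrative, not taken from the Netflix data); on the real data, `toy` would be replaced by `data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Netflix data (values are illustrative only).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["TV-MA", "PG", np.nan, "R"],
})

# Count and percentage of missing values per column, largest first.
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(2),
}).sort_values("missing_count", ascending=False)

print(missing_summary)
```

Sorting by count makes it easy to see which columns need a fill strategy and which have so few gaps that dropping rows is harmless.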

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe()

### Variables Description

**show_id:** Unique ID for every movie and TV show

**type:** Identifier - a Movie or a TV Show

**title:** Title of the movie or TV show

**director:** Director of the show

**cast:** Actors involved

**country:** Country of production

**date_added:** Date it was added on Netflix

**release_year:** Actual release year of the show

**rating:** Maturity rating of the title (e.g., TV-MA, PG-13)

**duration:** Total duration in minutes (movies) or number of seasons (TV shows)

**listed_in:** Genre

**description:** The summary description

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
data.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

**Step 1: Exploring Exceptional Cases**

In [None]:
# Explore exceptional cases
# Unique content types (e.g., 'Movie', 'TV Show')
print("\nUnique values in 'type':", data['type'].unique())
# Unique maturity ratings (e.g., 'TV-MA', 'R', 'PG-13'), including NaN if any ratings are missing
print("Unique values in 'rating':", data['rating'].unique())
# First rows where director OR cast is missing (| is the element-wise OR operator),
# to inspect the nature of the missingness before choosing a cleaning strategy
print("Some rows with missing director or cast:\n", data[data['director'].isnull() | data['cast'].isnull()].head())

**Step 2: Preprocessing**

In [None]:
# Fill missing director names with 'Unknown'
data['director'] = data['director'].fillna('Unknown')
# Fill missing cast names with 'Unknown'
data['cast'] = data['cast'].fillna('Unknown')
# Fill missing country with 'Unknown'
data['country'] = data['country'].fillna('Unknown')

In [None]:
data.dropna(subset=['date_added'], inplace=True)

In [None]:
data.dropna(subset=['rating'], inplace=True)

In [None]:
data['date_added'] = data['date_added'].str.strip()
data['date_added'] = pd.to_datetime(data['date_added'], format="%B %d, %Y", errors='coerce')
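
As a small illustration of the date cleaning above, the sketch below parses a few hypothetical strings (not rows from the dataset) with the same `%B %d, %Y` format. It also derives `year_added` and `month_added`, which are assumed column names and not part of the notebook.

```python
import pandas as pd

# Hypothetical sample strings; the leading space mimics untidy raw values.
raw_dates = pd.Series([" September 9, 2019", "January 1, 2016", "not a date"])

# Strip whitespace first, then parse; errors='coerce' turns bad values into NaT.
parsed = pd.to_datetime(raw_dates.str.strip(), format="%B %d, %Y", errors="coerce")

# Derived calendar features; unparseable entries stay missing.
date_features = pd.DataFrame({
    "date_added": parsed,
    "year_added": parsed.dt.year,
    "month_added": parsed.dt.month,
})

print(date_features)
print("Unparseable rows:", int(parsed.isna().sum()))
```

Because `errors='coerce'` silently produces `NaT`, counting the unparseable rows after conversion is a cheap way to confirm the format string matches the data.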

In [None]:
# Extract duration
def convert_duration(value):
    """Return the leading number of a duration string such as '93 min' or '4 Seasons'."""
    if pd.isnull(value):
        return np.nan
    # Both formats start with the number: minutes for movies, seasons for TV shows
    return int(str(value).split(' ')[0])

data['duration_clean'] = data['duration'].apply(convert_duration)
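
Since `duration_clean` mixes two units in one column, one option is to keep minutes and seasons in separate columns. The following is a sketch only, run on a toy frame; `movie_minutes` and `tv_seasons` are hypothetical column names that the notebook does not create.

```python
import pandas as pd

# Toy sample in the same "<number> min" / "<number> Season(s)" format.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "duration": ["93 min", "4 Seasons", "1 Season", None],
})

# Pull out the leading integer; rows without a match become NaN.
number = toy["duration"].str.extract(r"(\d+)", expand=False).astype(float)

# Keep the two units apart so minutes and seasons are never compared directly.
toy["movie_minutes"] = number.where(toy["type"] == "Movie")
toy["tv_seasons"] = number.where(toy["type"] == "TV Show")

print(toy)
```

With separate columns, each unit can be scaled on its own, at the cost of having to decide how to fill the values that do not apply to a given title.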

In [None]:
# Encode categorical columns
label_cols = ['type', 'country', 'rating']  # categorical features to convert to integer codes
le = LabelEncoder()  # maps each category to an integer between 0 and n_classes - 1
for col in label_cols:
    # LabelEncoder sorts the categories, so codes follow alphabetical order
    # (e.g., 'Movie' -> 0, 'TV Show' -> 1), not the order of first appearance.
    # Note: this overwrites the original text values in each column.
    data[col] = le.fit_transform(data[col])
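
To make the encoding behaviour concrete, here is a minimal sketch on a toy frame. It keeps one fitted encoder per column in a dictionary (`encoders` is a hypothetical helper, not something the notebook defines) and writes the codes to new `*_code` columns, so the original labels stay available for charts.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "type": ["TV Show", "Movie", "TV Show"],
    "rating": ["TV-MA", "PG-13", "TV-MA"],
})

encoders = {}  # hypothetical helper: column name -> fitted encoder
for col in ["type", "rating"]:
    enc = LabelEncoder()
    toy[col + "_code"] = enc.fit_transform(toy[col])  # original column is kept
    encoders[col] = enc

# classes_ is sorted, so the position of a label is its integer code.
print(list(encoders["type"].classes_))
print(toy)

# Codes can be mapped back to labels with the matching encoder.
decoded = encoders["rating"].inverse_transform(toy["rating_code"])
print(list(decoded))
```

Note that the integer codes carry no real ordering for nominal columns such as country, so distance-based methods may read more structure into them than is actually there.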

**Step 3: TF-IDF / Bag of Words for Text Features**

In [None]:
# Combine description and listed_in for text analysis
data['text_features'] = data['description'].fillna('') + ' ' + data['listed_in'].fillna('')

# TF-IDF Vectorization
tfidf = TfidfVectorizer(stop_words='english', max_features=500)  # Bag of Words using TF-IDF
text_matrix = tfidf.fit_transform(data['text_features'])


**Step 4: Combine Numeric + Text Features**

In [None]:
# Scale numerical features
# Columns holding numeric (or label-encoded) values to be scaled
numerical_features = ['release_year', 'duration_clean', 'type', 'country', 'rating']
# StandardScaler removes the mean and scales to unit variance, so features with
# larger numeric ranges do not dominate the distance calculations
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data[numerical_features])  # NumPy array of standardized values

# Combine numeric and text features
# hstack joins the dense scaled features and the sparse TF-IDF matrix column-wise
X = hstack([scaled_features, text_matrix])
print("Final feature matrix shape:", X.shape)  # (number of titles, number of features)
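
The column-wise stacking can be checked on a tiny example. This is a sketch with made-up numbers; it requests CSR output explicitly (`format="csr"`), which is an optional choice that allows row slicing and is not something the notebook does.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse

# Made-up stand-ins: 2 scaled numeric columns and 3 TF-IDF columns for 3 titles.
dense_part = np.array([[0.5, -1.0], [1.5, 0.0], [-2.0, 1.0]])
sparse_part = csr_matrix(np.array([[0.0, 0.7, 0.0],
                                   [0.2, 0.0, 0.0],
                                   [0.0, 0.0, 0.9]]))

# Stack side by side; the result stays sparse.
combined = hstack([csr_matrix(dense_part), sparse_part], format="csr")

print(combined.shape, issparse(combined))
print(combined[0].toarray())  # first row: numeric columns followed by text columns
```

One thing that may be worth checking on the real matrix: `TfidfVectorizer` L2-normalizes each row by default, so the text block contributes a bounded amount to Euclidean distances, whereas every standardized numeric column has unit variance. The few numeric columns can therefore carry more weight than their count suggests.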

**Step 5: Selecting the Approach & Algorithms**

We’ll use unsupervised learning:

K-Means

Agglomerative Hierarchical Clustering

We’ll also check optimal number of clusters using the Elbow Method.

**Step 6: Modeling – K-Means**

In [None]:
# Elbow Method to find optimal K
inertia = []       # inertia value for each k
K = range(2, 10)   # candidate numbers of clusters: 2 to 9
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)  # random_state=42 ensures reproducibility
    kmeans.fit(X)  # train on the combined feature matrix
    # inertia_ is the sum of squared distances of samples to their closest cluster center
    inertia.append(kmeans.inertia_)

# Plot inertia against the number of clusters; the "elbow", where the decrease in
# inertia slows markedly, is often used as an indicator of a suitable number of clusters
plt.plot(K, inertia, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for K-Means')
plt.show()

In [None]:
# Fit K-Means with chosen K (e.g., K=4)
# K=4 is chosen from visual inspection of the elbow plot
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans_labels = kmeans.fit_predict(X)  # fit the model and assign a cluster label to each title
# Silhouette score compares how similar a point is to its own cluster versus other clusters;
# it ranges from -1 to 1, and values closer to 1 indicate better-defined clusters
print("Silhouette Score for K-Means:", silhouette_score(X, kmeans_labels))
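
The elbow plot is read by eye, so it can help to compute the silhouette score for every candidate K as well. The sketch below does this on synthetic blobs with four known centers (an assumption made so the example is self-contained); it is not a result for the Netflix data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (illustrative only).
centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.6, random_state=42)

# Silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k by silhouette:", best_k)
```

Applied to the real feature matrix `X`, the same loop would show whether the silhouette score agrees with the K suggested by the elbow.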

**Step 7: Modeling – Agglomerative Hierarchical Clustering**

In [None]:
# Agglomerative Clustering with same number of clusters
agglo = AgglomerativeClustering(n_clusters=4, metric='euclidean', linkage='ward')
agglo_labels = agglo.fit_predict(X.toarray())  # Convert sparse matrix to dense
print("Silhouette Score for Agglomerative Clustering:", silhouette_score(X, agglo_labels))
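
Beyond comparing silhouette scores, the agreement between the two algorithms can be measured directly with the adjusted Rand index. This is a sketch on synthetic blobs (an assumption for self-containment); on the notebook's data the inputs would be `kmeans_labels` and `agglo_labels`.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with four well-separated groups (illustrative only).
centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X_demo, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.6, random_state=0)

km_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_demo)
ag_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X_demo)

# 1.0 means identical partitions (up to label names); values near 0 mean chance-level agreement.
ari = adjusted_rand_score(km_labels, ag_labels)
print(f"Adjusted Rand index between K-Means and Agglomerative: {ari:.3f}")
```

A high index suggests both methods find the same structure; a low one suggests the grouping depends strongly on the algorithm and should be interpreted with care.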

### What all manipulations have you done and insights you found?

**Data Manipulations Performed**

To make the Netflix Movies and TV Shows dataset suitable for unsupervised machine learning, the following data manipulations were carried out:

* Handling Missing Values

Filled missing values in columns such as director, cast and country using placeholders like “Unknown”.

* Date Transformation

Converted the date_added column from string format to datetime using the format %B %d, %Y.

Enabled time-based analysis and ensured consistency.

* Dropped null values

Dropped the rows with null values in the rating and date_added columns, because there were so few of them that removing them does not materially affect the data.

* Duration Cleaning

Transformed the duration column (e.g., “93 min”, “4 Seasons”) into a numerical column duration_clean.

Movies were represented by minutes and TV shows by number of seasons.

* Categorical Encoding

Encoded categorical columns such as type, country, and rating using Label Encoding to convert them into numeric form suitable for machine learning models.

* Text Preprocessing

Combined description and listed_in columns to capture both storyline and genre information.

Applied TF-IDF vectorization to convert text into numerical features while reducing the impact of common words.

* Feature Scaling

Applied StandardScaler to normalize numerical features so that no single feature dominated the distance-based clustering algorithms.

* Feature Combination

Combined scaled numerical features and TF-IDF text features into a single feature matrix using sparse matrix stacking.

**Insights Found from the Analysis**

* Clear Content Grouping

The clustering algorithms successfully grouped content into meaningful categories such as international TV dramas, action and sci-fi movies, crime and mystery series, and horror or thriller movies.

* Genre and Description Are Strong Drivers

Text-based features (genres and descriptions) had a major influence on cluster formation, showing that content similarity is driven more by storyline and genre than by metadata alone.

* Separation Between Movies and TV Shows

Movies and TV shows tended to form separate clusters due to differences in duration and structure.

* International Content Patterns

A significant number of clusters were dominated by international content, indicating strong regional and language-based similarities.

* Rating-Based Segmentation

Content ratings (such as TV-MA, PG-13) contributed to distinct clusters, suggesting age-based audience segmentation.

* Model Comparison Insight

K-Means provided compact and well-separated clusters, while Agglomerative Clustering revealed hierarchical relationships between content groups.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Content Type Distribution (Movies vs TV Shows)

In [None]:
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

In [None]:
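# Chart - 1 visualization code
# 'type' was label-encoded during wrangling (alphabetical order: Movie -> 0, TV Show -> 1),
# so map the codes back to readable names for the axis
type_names = data['type'].replace({0: 'Movie', 1: 'TV Show'})
sns.countplot(x=type_names, palette='Set2')
plt.title('Distribution of Content Type on Netflix')
plt.xlabel('Type')
plt.ylabel('Count')
plt.show()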

##### 1. Why did you pick the specific chart?

* To compare how much content is Movies vs TV Shows

* Bar charts are best for category comparison

##### 2. What is/are the insight(s) found from the chart?

* Netflix has more movies than TV shows

* TV shows are fewer but usually longer-term content

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**

✅ Helps Netflix:

* Decide whether to invest more in series (higher engagement)

* Balance content library strategically

**Negative growth insight?**

⚠️ Too many movies may reduce subscriber retention, because TV shows keep users longer.

#### Chart - 2 Content Growth Over Years

In [None]:
# Chart - 2 visualization code
year_count = data['release_year'].value_counts().sort_index()

plt.plot(year_count.index, year_count.values)
plt.title('Netflix Content Growth Over the Years')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.show()

##### 1. Why did you pick the specific chart?

* Line chart is ideal for trend analysis over time

##### 2. What is/are the insight(s) found from the chart?

* Content production increased rapidly after 2015

* Shows Netflix’s aggressive expansion

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**

✅ Indicates:

* Strong growth strategy

* More content → more audience reach

**Negative growth insight?**

⚠️ Rapid growth may lead to:

* Quality dilution

* Increased cost without guaranteed engagement

#### Chart - 3 Top 10 Genres Distribution

In [None]:
# Chart - 3 visualization code
genres = data['listed_in'].str.split(', ').explode()
top_genres = genres.value_counts().head(10)

sns.barplot(x=top_genres.values, y=top_genres.index)
plt.title('Top 10 Genres on Netflix')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.show()

##### 1. Why did you pick the specific chart?

* To identify most popular genres

* Horizontal bar chart improves readability

##### 2. What is/are the insight(s) found from the chart?

* Drama and International content dominate

* Indicates audience preference for story-driven content

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**

✅ Helps Netflix:

* Focus on high-demand genres

* Improve recommendation accuracy

**Negative growth insight?**

⚠️ Over-dependence on few genres can:

* Reduce variety

* Lose niche audience segments

#### Chart - 4 Country-wise Content Production

In [None]:
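# Chart - 4 visualization code
# Note: 'country' was label-encoded during wrangling, so the y-axis shows integer codes
# rather than country names unless the original names are kept in a separate column
# before encoding. Missing countries were filled with 'Unknown' and are counted too.
top_countries = data['country'].value_counts().head(10)

sns.barplot(x=top_countries.values, y=top_countries.index, orient='h')
plt.title('Top 10 Content Producing Countries')
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.show()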

##### 1. Why did you pick the specific chart?

* To understand geographical content contribution

##### 2. What is/are the insight(s) found from the chart?

* USA dominates, followed by India and other countries

* Strong international presence

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**

✅ Supports:

* Global audience expansion

* Localized content strategy

**Negative growth insight?**

⚠️ Heavy dominance of one country may:

* Limit cultural diversity

* Reduce appeal in underrepresented regions

#### Chart - 5 PCA Cluster Visualization

In [None]:
# Chart - 5 visualization code
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X.toarray())

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans_labels, cmap='viridis')
plt.title('PCA Visualization of Clusters')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster')
plt.show()
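
A 2-D projection only tells part of the story, so it is worth checking how much variance the two components retain. The sketch below uses `TruncatedSVD`, a different technique from the PCA used above: it works directly on sparse input without calling `.toarray()`, but it does not center the data. The random sparse matrix is a stand-in, not the notebook's `X`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Random sparse stand-in for a TF-IDF-like matrix (illustrative only).
rng = np.random.default_rng(42)
dense = rng.random((200, 50))
dense[dense < 0.9] = 0.0          # keep roughly 10% of entries non-zero
X_sparse_demo = csr_matrix(dense)

# Reduce to two components without densifying the matrix.
svd = TruncatedSVD(n_components=2, random_state=42)
X_2d = svd.fit_transform(X_sparse_demo)

retained = svd.explained_variance_ratio_.sum()
print("2-D shape:", X_2d.shape)
print(f"Variance retained by 2 components: {retained:.1%}")
```

If the retained share is small, apparent overlap or separation in the scatter plot may not reflect the structure in the full feature space. `PCA` exposes the same `explained_variance_ratio_` attribute.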

##### 1. Why did you pick the specific chart?

* PCA helps visualize high-dimensional clustering

* Shows how content groups naturally form

##### 2. What is/are the insight(s) found from the chart?

* Clear separation between clusters

* Confirms clustering worked well

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**

✅ Enables:

* Better recommendations

* Accurate audience segmentation

**Negative growth insight?**

⚠️ Overlapping clusters mean:

* Some content may not fit clear categories

* Can confuse recommendations

#### Chart - 6 Duration Distribution

In [None]:
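# Chart - 6 visualization code
# Minutes and seasons are different units, so plot movies and TV shows on separate axes
is_tv = data['duration'].str.contains('Season', na=False)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.histplot(data.loc[~is_tv, 'duration_clean'], bins=30, kde=True, ax=axes[0])
axes[0].set(title='Distribution of Movie Duration', xlabel='Duration (Minutes)', ylabel='Frequency')
sns.histplot(data.loc[is_tv, 'duration_clean'], discrete=True, ax=axes[1])
axes[1].set(title='Distribution of TV Show Duration', xlabel='Duration (Seasons)', ylabel='Frequency')
plt.tight_layout()
plt.show()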

##### 1. Why did you pick the specific chart?

* To understand how long movies and TV shows are

* Histogram is best for numerical data distribution

##### 2. What is/are the insight(s) found from the chart?

* Most movies are between 80–120 minutes

* TV shows mostly have 1–3 seasons

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**

✅ Helps Netflix:

* Optimize content length based on viewer attention span

* Design better binge-worthy content

**Negative growth insight?**

⚠️ Very long content may:

* Reduce completion rates

* Increase viewer drop-off

#### Chart - 7 Rating vs Release Year

In [None]:
# Chart - 7 visualization code
sns.scatterplot(data=data, x='release_year', y='rating', alpha=0.5)
plt.title('Rating vs Release Year')
plt.xlabel('Release Year')
plt.ylabel('Rating (Encoded)')
plt.show()

##### 1. Why did you pick the specific chart?

* To analyze how content ratings changed over time

* Scatter plots show relationships between two variables

##### 2. What is/are the insight(s) found from the chart?

* Increase in mature-rated content in recent years

* Netflix is targeting adult audiences more

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**

✅ Supports:

* Audience-specific content planning

* Growth in adult subscriber base

**Negative growth insight?**

⚠️ Too much mature content may:

* Reduce family-friendly appeal

* Limit younger audience engagement

#### Chart - 8 Rating vs Content Type

In [None]:
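# Chart - 8 visualization code
# Both columns were label-encoded during wrangling, so decode them for readable axes.
# `le` was last fitted on 'rating', so it can map the rating codes back to labels.
rating_names = pd.Series(le.inverse_transform(data['rating']), index=data.index, name='rating')
type_names = data['type'].replace({0: 'Movie', 1: 'TV Show'})
sns.countplot(x=rating_names, hue=type_names)
plt.title('Rating Distribution by Content Type')
plt.xticks(rotation=45)
plt.show()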

##### 1. Why did you pick the specific chart?

* To compare ratings between movies and TV shows

* Grouped bars clearly show category differences

##### 2. What is/are the insight(s) found from the chart?

* TV shows are more often TV-MA

* Movies are more evenly distributed across ratings

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**

✅ Helps Netflix:

* Match content to audience maturity levels

* Improve recommendation filtering

**Negative growth insight?**

⚠️ Excess TV-MA content may:

* Restrict younger viewers

* Create content imbalance

#### Chart - 9 Cluster-wise Content Count

In [None]:
# Chart - 9 visualization code
cluster_counts = pd.Series(kmeans_labels).value_counts().sort_index()

sns.barplot(x=cluster_counts.index, y=cluster_counts.values)
plt.title('Number of Titles per Cluster')
plt.xlabel('Cluster')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

* To understand how content is distributed across clusters

* Bar chart shows cluster dominance

##### 2. What is/are the insight(s) found from the chart?

* Some clusters are much larger than others

* Indicates dominant content categories

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact**

✅ Helps Netflix:

* Identify high-demand content categories

* Allocate resources efficiently

**Negative growth insight?**

⚠️ Very small clusters indicate:

* Niche content

* Risk of underinvestment or neglect

#### Chart - 10 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
numeric_cols = ['release_year', 'duration_clean', 'type', 'country', 'rating']
corr_matrix = data[numeric_cols].corr()

sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

##### 1. Why did you pick the specific chart?

* To understand relationships between numerical variables

* Heatmap gives quick visual correlation strength

##### 2. What is/are the insight(s) found from the chart?

* Weak correlation between most variables

* Indicates diverse content characteristics

#### Chart - 11 - Pair Plot

In [None]:
# Pair Plot visualization code
# Select only numerical features for pair plot
pairplot_cols = ['release_year', 'duration_clean', 'type', 'rating']

sns.pairplot(
    data[pairplot_cols],
    diag_kind='kde'
)

plt.suptitle('Pair Plot of Key Numerical Features', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

* To visualize relationships between multiple variables at once

* Helps detect:

→Trends

→Patterns

→Clusters

→Outliers

* Very useful for exploratory data analysis (EDA)

##### 2. What is/are the insight(s) found from the chart?

* release_year vs duration_clean

→ Newer content tends to have varied durations

* type vs duration_clean

→ Clear separation between movies and TV shows

* rating vs release_year

→ Increase in mature-rated content in recent years

* Diagonal KDE plots show:

→ Distribution of each variable

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To increase user engagement, retention, and revenue by delivering the right content to the right audience using data-driven insights.

**Key Business Recommendations**

1️⃣ **Strengthen Personalized Recommendations**

**Suggestion:**
Use the content clusters to recommend movies and TV shows with similar genres, descriptions, and viewing patterns.

**Why:**
Clusters group similar content naturally, improving recommendation relevance.

**Business Impact:**
✅ Higher watch time
✅ Improved user satisfaction
✅ Reduced churn
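
As a rough sketch of how clusters could feed a recommender, the example below restricts candidates to the same cluster as the query title and ranks them by cosine similarity of their TF-IDF vectors. The titles, texts, cluster labels, and the `recommend` helper are all hypothetical; in the notebook the labels would come from `kmeans_labels`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue: titles, short texts, and precomputed cluster labels.
titles = ["Crime Story", "Cop Drama", "Court Case", "Alien Raid", "Robot Battle", "Space War"]
texts = [
    "detective crime murder city",
    "detective crime police",
    "lawyer courtroom crime",
    "alien space invasion",
    "space robots battle",
    "alien robots space war",
]
cluster_labels = np.array([0, 0, 0, 1, 1, 1])  # stand-in for kmeans_labels

features = TfidfVectorizer().fit_transform(texts)


def recommend(title, top_n=2):
    """Return up to top_n titles from the same cluster, most similar first."""
    idx = titles.index(title)
    same_cluster = np.flatnonzero(cluster_labels == cluster_labels[idx])
    candidates = same_cluster[same_cluster != idx]  # exclude the query itself
    sims = cosine_similarity(features[idx], features[candidates]).ravel()
    ranked = candidates[np.argsort(sims)[::-1][:top_n]]
    return [titles[i] for i in ranked]


print(recommend("Crime Story"))
```

Restricting the search to one cluster keeps recommendations thematically consistent and reduces the number of similarity computations, though it also means titles near a cluster boundary never see close neighbours from the adjacent cluster.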

2️⃣ **Invest More in High-Performing Content Clusters**

**Suggestion:**
Identify clusters with the highest engagement (e.g., international dramas, sci-fi, thrillers) and invest in producing or acquiring similar content.

**Why:**
Data shows certain clusters are dominant and popular.

**Business Impact:**
✅ Better ROI on content spending
✅ Faster audience growth

3️⃣ **Balance Movies and TV Shows Strategically**

**Suggestion:**
Increase investment in TV shows and limited series, especially in popular genres.

**Why:**
TV shows drive longer engagement compared to movies.

**Business Impact:**
✅ Increased subscription retention
✅ More binge-watching behavior

4️⃣ **Expand Local & Regional Content**

**Suggestion:**
Create more region-specific content in underrepresented countries.

**Why:**
Country-based clusters highlight strong interest in international content.

**Business Impact:**
✅ Growth in global markets
✅ Stronger regional subscriber base

5️⃣ **Maintain Genre Diversity**

**Suggestion:**
Avoid over-focusing on a few genres (e.g., drama only). Invest in niche clusters like documentaries, kids, and experimental content.

**Why:**
Over-concentration risks audience fatigue.

**Business Impact:**
✅ Broader audience coverage
✅ Reduced subscriber drop-off

6️⃣ **Optimize Content Duration**

**Suggestion:**
Keep movie durations mostly within the 80–120 minute range and produce TV shows with 1–3 seasons initially.

**Why:**
Duration analysis shows these formats are most common and likely well-received.

**Business Impact:**
✅ Higher completion rates
✅ Better user satisfaction

7️⃣ **Balance Content Ratings**

**Suggestion:**
Maintain a healthy mix of mature and family-friendly content.

**Why:**
Excess mature content may limit family subscriptions.

**Business Impact:**
✅ Wider audience reach
✅ Reduced churn from families

8️⃣ **Use Clustering for Smarter Marketing**

**Suggestion:**
Run cluster-based marketing campaigns (e.g., thriller lovers, family viewers, regional audiences).

**Why:**
Audience segmentation improves targeting accuracy.

**Business Impact:**
✅ Higher campaign conversion
✅ Lower marketing cost

# **Conclusion**

The Netflix Movies and TV Shows dataset was analyzed and prepared using a combination of data cleaning, feature engineering, and text processing. Both numerical features (release year, duration, rating, content type, country) and textual features (description, genres) were transformed and combined to form a feature-rich dataset for unsupervised learning.

Using K-Means and Agglomerative Clustering, the content was grouped into meaningful clusters, revealing natural patterns in the dataset:

* **Content type separation:** Movies and TV shows form distinct clusters due to differences in duration and structure.

* **Genre-based grouping:** Similar genres and storylines cluster together, indicating strong audience preferences.

* **Regional patterns:** Clusters often highlight country-specific content trends.

* **Rating trends:** Mature-rated content is more prevalent in certain clusters, guiding audience targeting.

Data visualizations further supported these insights, showing content distribution, temporal trends, feature relationships, and cluster sizes, which can inform business decisions.

**Business Implications**

* Clusters can be used for personalized recommendations, improving user engagement and retention.

* Insights help prioritize content acquisition in popular genres or regions.

* Duration, rating, and type patterns can guide content production strategy.

* Maintaining genre diversity ensures a broad audience appeal while minimizing churn.

Overall, this project demonstrates that unsupervised machine learning and exploratory data analysis can effectively reveal hidden patterns in Netflix content, supporting data-driven decisions for content strategy, marketing, and subscriber satisfaction.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***