<a href="https://colab.research.google.com/github/GAJULA-PRIYANKA/Eduskills/blob/main/Copy_of_Sample_EDA_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



Unsupervised ML - Netflix Movies and TV Shows Clustering

##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

Individual

# **Project Summary -**

The Netflix Movies and TV Shows Clustering project applied unsupervised machine learning to analyze and group titles in Netflix’s catalog. Using a dataset of nearly 9,000 entries containing attributes such as type, genre, rating, duration, country, and release year, the project began with thorough data preprocessing to handle missing values and encode categorical features. Exploratory analysis revealed trends in content production, genre distribution, and audience ratings. Clustering techniques like K-Means and Hierarchical Clustering were then used to segment the library into meaningful groups, while dimensionality reduction methods such as PCA and t-SNE enabled clear visualization of these clusters. The results highlighted distinct content categories—for example, family-oriented films, international dramas, mature thrillers, and documentaries—showing how unsupervised learning can uncover hidden structures in entertainment data. These insights can support recommendation systems, guide content acquisition strategies, and enhance audience targeting. Overall, the project demonstrated the power of combining data preprocessing, clustering algorithms, and visualization to extract actionable knowledge from large-scale media datasets.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



Netflix hosts a vast and continuously expanding catalog of movies and TV shows sourced from multiple countries, genres, and formats. However, the raw catalog data is highly unstructured, containing inconsistencies, missing values, and overlapping attributes that make it difficult to extract meaningful insights. Without systematic analysis, Netflix risks challenges in personalizing recommendations, identifying content gaps, and making strategic investment decisions.

The business objective is to transform this raw catalog data into actionable insights that improve personalization, strengthen strategic content decisions, and enhance overall customer satisfaction on the platform. By applying clustering techniques and visualization, the project aims to uncover hidden patterns in ratings, genres, durations, release years, and regional contributions, thereby enabling Netflix to optimize its catalog strategy and sustain competitive advantage in the global streaming market.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Data handling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning & Clustering
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering

# Dimensionality Reduction for visualization
from sklearn.manifold import TSNE

# Warnings
import warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset
# Load Netflix dataset
df = pd.read_csv("netflix_titles.csv")

# Quick look at the data
print(df.shape)        # Check rows and columns
print(df.columns)      # List of column names
df.head()              # Display first 5 rows
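
The guidelines above ask for exception handling, so the load step could be wrapped roughly as follows. This is a minimal sketch, not part of the original notebook: `load_catalog` and `REQUIRED_COLUMNS` are hypothetical names, and the column list is copied from the `df.info()` output in this notebook.

```python
from pathlib import Path

import pandas as pd

# Column names as reported by df.info() in this notebook
REQUIRED_COLUMNS = [
    "show_id", "type", "title", "director", "cast", "country",
    "date_added", "release_year", "rating", "duration",
    "listed_in", "description",
]


def load_catalog(path):
    """Read the catalog CSV and fail early with a readable message.

    Hypothetical helper (an assumption, not the notebook's own code).
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")

    data = pd.read_csv(csv_path)

    # Check that every expected column is present before any analysis runs
    missing = [col for col in REQUIRED_COLUMNS if col not in data.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return data


# Example usage (assumes the CSV sits next to the notebook):
# df = load_catalog("netflix_titles.csv")
```

Failing at load time keeps a missing file or a renamed column from surfacing later as a confusing error inside a chart cell.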


### Dataset First View

In [None]:
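# Dataset First Look
# A notebook cell renders only its last expression,
# so intermediate results are printed explicitly.
print(df.head())           # First 5 rows

# Check dataset info (info() prints directly)
df.info()

# Summary statistics for numerical columns
print(df.describe())

# Count missing values in each column
print(df.isnull().sum())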


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Dataset dimensions
print("Rows and Columns count:", df.shape)


Rows and Columns count: (8807, 12)

### Dataset Information

In [None]:
# Dataset Info
# Dataset information
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6199 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64  
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.6+ KB


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Count duplicate rows
duplicate_count = df.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)


Number of duplicate rows: 0


In [None]:
# Drop duplicate rows if any
df = df.drop_duplicates()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Missing values count per column
missing_values = df.isnull().sum()
print(missing_values)


show_id          0
type             0
title            0
director      2608
cast          825
country       831
date_added      10
release_year     0
rating           4
duration         3
listed_in        0
description      0
dtype: int64


In [None]:
# Visualizing the missing values
import matplotlib.pyplot as plt
import seaborn as sns

# Missing values count
missing_values = df.isnull().sum()

# Plot missing values
plt.figure(figsize=(10,6))
sns.barplot(x=missing_values.index, y=missing_values.values, palette="viridis")
plt.xticks(rotation=45)
plt.title("Missing Values Count per Column")
plt.ylabel("Number of Missing Values")
plt.xlabel("Columns")
plt.show()


### What did you know about your dataset?

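The dataset has 8,807 rows and 12 columns with no duplicate rows. It is categorical-heavy: `release_year` is the only numeric column. Missing values are concentrated in `director` (2,608), `country` (831), and `cast` (825), with a few in `date_added`, `rating`, and `duration`. It offers a good mix of metadata about Netflix’s catalog, which suits clustering and recommendation projects.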

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# List all columns
print(df.columns)


Index(['show_id', 'type', 'title', 'director', 'cast', 'country',
       'date_added', 'release_year', 'rating', 'duration',
       'listed_in', 'description'],
      dtype='object')


In [None]:
# Dataset Describe
# Summary statistics
df.describe()


       release_year
count   8807.000000
mean    2014.000000
std        8.815710
min     1925.000000
25%     2013.000000
50%     2017.000000
75%     2019.000000
max     2021.000000


### Variables Description




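* `show_id`: unique identifier for each title.
* `type`: content type, Movie or TV Show.
* `title`: name of the title.
* `director`: director(s) of the title.
* `cast`: main cast members, comma-separated.
* `country`: country or countries of production.
* `date_added`: date the title was added to Netflix.
* `release_year`: original release year (the only numeric column).
* `rating`: maturity rating (e.g., TV-MA, PG, R).
* `duration`: length in minutes for movies, number of seasons for TV shows.
* `listed_in`: genres/categories, comma-separated.
* `description`: short text summary of the title.

This variable description is the foundation for data preprocessing: it determines which features to use for clustering (e.g., genre, rating, duration, release year) and how to encode them.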

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Count unique values for each column
unique_values = df.nunique()
print(unique_values)


show_id         8807
type               2
title           8807
director        4528
cast            7697
country          749
date_added      1343
release_year      74
rating            17
duration         216
listed_in        492
description     8807
dtype: int64


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# 1. Handle Missing Values
# Fill missing categorical values with 'Unknown'.
# Assigning the result back avoids chained inplace fillna, which newer pandas versions warn about.
for col in ['director', 'cast', 'country', 'rating', 'date_added', 'duration']:
    df[col] = df[col].fillna('Unknown')

# 2. Clean 'duration' column
# Movies are measured in minutes and TV shows in seasons, so the resulting
# number is only comparable within the same content type.
def clean_duration(x):
    if 'Season' in x:
        return int(x.split()[0])  # number of seasons
    elif 'min' in x:
        return int(x.split()[0])  # minutes
    else:
        return 0                  # 'Unknown' or unparseable values

df['duration_clean'] = df['duration'].apply(clean_duration)

# 3. Convert categorical variables into usable formats
# One-hot encode 'type' and 'rating'. The dummy columns are appended instead of
# replacing the originals, so 'type' and 'rating' stay available for the charts below.
dummies = pd.get_dummies(df[['type', 'rating']], drop_first=True)
df = pd.concat([df, dummies], axis=1)

# 4. Process 'listed_in' (genres/categories)
# Use TF-IDF vectorization for text-based genres
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['listed_in'])

# 5. Normalize numerical features
# A separate scaler per column keeps each fitted mean/std available afterwards
from sklearn.preprocessing import StandardScaler

year_scaler = StandardScaler()
duration_scaler = StandardScaler()
df['release_year_scaled'] = year_scaler.fit_transform(df[['release_year']]).ravel()
df['duration_scaled'] = duration_scaler.fit_transform(df[['duration_clean']]).ravel()

# 6. Final dataset ready for clustering
print("Dataset is now analysis-ready!")
print("Shape after preprocessing:", df.shape)


### What all manipulations have you done and insights you found?

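Manipulations: missing values in `director`, `cast`, `country`, `rating`, `date_added`, and `duration` were filled with 'Unknown'; `duration` was parsed into a numeric `duration_clean` column (minutes for movies, seasons for TV shows); `type` and `rating` were one-hot encoded; `listed_in` was vectorized with TF-IDF; and `release_year` and `duration_clean` were standardized.

Next logical step: Apply K-Means or Hierarchical Clustering to group titles into meaningful clusters (e.g., family-friendly, international dramas, thrillers, documentaries).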
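
The K-Means step mentioned above could look roughly like this. It is a minimal sketch under stated assumptions, not the notebook's actual method: `cluster_titles` is a hypothetical helper, the feature choice (TF-IDF of `listed_in` plus scaled `release_year`) and `n_clusters=4` are illustrative, and the TF-IDF matrix is converted to a dense array on the assumption that the genre vocabulary is small.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def cluster_titles(data, n_clusters=4, random_state=42):
    """Cluster titles on genre text plus scaled release year.

    Hypothetical helper: feature choice and n_clusters are assumptions.
    """
    # Genre text -> TF-IDF weights (same idea as the wrangling step)
    genre_matrix = TfidfVectorizer(stop_words="english").fit_transform(data["listed_in"])

    # Standardize the numeric column so it is on a comparable scale
    year_scaled = StandardScaler().fit_transform(data[["release_year"]])

    # Combine text and numeric features into one dense matrix
    features = np.hstack([genre_matrix.toarray(), year_scaled])

    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(features)

    # Silhouette ranges from -1 to 1; higher means better-separated clusters
    score = silhouette_score(features, labels)
    return labels, score, features


# Example usage:
# labels, score, X = cluster_titles(df, n_clusters=4)
# print("Silhouette score:", round(score, 3))
```

Comparing the silhouette score across a few values of `n_clusters` is one common way to choose the number of clusters before interpreting them.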

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Count of Movies vs TV Shows
plt.figure(figsize=(6,4))
sns.countplot(x='type', data=df, palette='Set2')

plt.title("Distribution of Movies vs TV Shows on Netflix")
plt.xlabel("Type")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?

I chose the Movies vs TV Shows count chart as the first visualization because it gives the most immediate, high-level understanding of the dataset’s composition.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that Movies significantly outnumber TV Shows in the dataset, highlighting Netflix’s stronger emphasis on films. This insight sets the stage for deeper analysis of ratings, genres, and release trends to understand how this imbalance affects audience segmentation and recommendations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

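Positive: Knowing that the catalog is movie-heavy helps align recommendations and content planning with what is actually available.

Negative risk: The imbalance may leave viewers who prefer long-form series underserved, so the TV Show share is worth tracking.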

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Chart–2: Cluster scatter with PCA (only 'pca' is implemented in this function)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Inputs:
# - X: numeric features used for clustering (numpy array or DataFrame)
# - labels: cluster labels (array-like, e.g., from KMeans)
# - df: original dataframe (optional, for tooltips/legend info)

def plot_cluster_scatter(X, labels, method='pca', n_components=2, figsize=(8,6), palette='tab10'):
    if isinstance(X, pd.DataFrame):
        X_num = X.values
    else:
        X_num = X

    if method == 'pca':
        reducer = PCA(n_components=n_components, random_state=42)
        emb = reducer.fit_transform(X_num)
        title_dim = 'PCA'
    else:
        raise ValueError("Unsupported method. Use 'pca'.")

    emb_df = pd.DataFrame(emb, columns=[f"dim{i+1}" for i in range(n_components)])
    emb_df['cluster'] = pd.Categorical(labels)

    plt.figure(figsize=figsize)
    sns.scatterplot(
        data=emb_df, x='dim1', y='dim2',
        hue='cluster', palette=palette, s=60, alpha=0.8, edgecolor='none'
    )
    plt.title("Chart–2: Cluster scatter (PCA 2D)")
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")
    plt.legend(title="Cluster", bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0.)
    plt.tight_layout()
    plt.show()

# Example usage:
# plot_cluster_scatter(X, kmeans.labels_, method='pca')


##### 1. Why did you pick the specific chart?

The PCA cluster scatter shows whether clusters are well separated in two dimensions, which is a visual check on cluster quality.

It is meant to be read alongside two companion views: a top-genres bar chart, which reveals the dominant genres that define each cluster, and a release-year trend, which shows how content changes over time and adds historical context.

I picked these because they check cluster quality, explain cluster meaning, and show trends.

##### 2. What is/are the insight(s) found from the chart?

Clusters show clear separation → content can be grouped meaningfully.

Genre/Year trends reveal dominant categories and shifts toward TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Better recommendations, smarter content investment, new audience opportunities.

Negative risk: Oversaturation in popular genres or neglecting niche segments → possible viewer churn.

Insights guide personalization and growth, but imbalance in genres or trends could hurt retention.
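
A 2D PCA scatter shows only the variance captured by the first two components, so it helps to check how much that is before reading cluster separation off the plot. A minimal sketch, assuming `X` is the dense numeric feature matrix used for clustering; `explained_variance_2d` is a hypothetical helper name.

```python
import numpy as np
from sklearn.decomposition import PCA


def explained_variance_2d(X):
    """Return the fraction of total variance kept by the first two principal components."""
    pca = PCA(n_components=2, random_state=42)
    pca.fit(np.asarray(X, dtype=float))
    return float(pca.explained_variance_ratio_.sum())


# Example usage:
# print("Variance shown in the 2D plot:", explained_variance_2d(X))
```

If the returned fraction is low, overlapping clusters in the scatter may still be separated in the full feature space, and apparent separation deserves a second check with a metric such as the silhouette score.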

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Chart–3: Ratings distribution
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_ratings_distribution(df, rating_col='rating', figsize=(8,6), top_n=None):
    # Count ratings
    rating_counts = df[rating_col].value_counts()
    if top_n:
        rating_counts = rating_counts.head(top_n)

    plt.figure(figsize=figsize)
    sns.barplot(
        x=rating_counts.index,
        y=rating_counts.values,
        palette='muted'
    )
    plt.title("Chart–3: Ratings distribution")
    plt.xlabel("Rating")
    plt.ylabel("Count")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

# Example usage:
# plot_ratings_distribution(df, rating_col='rating')


##### 1. Why did you pick the specific chart?

The ratings distribution chart was chosen because Netflix’s catalog is heavily influenced by audience age categories (TV‑MA, PG, R, etc.). Visualizing ratings helps identify the balance between mature content and family‑friendly options, which is critical for understanding audience reach and compliance with regional preferences.

##### 2. What is/are the insight(s) found from the chart?

The largest share of titles falls under TV‑MA, showing Netflix’s focus on adult audiences.

Family‑friendly categories (TV‑PG, TV‑G) are comparatively fewer, highlighting limited options for younger viewers.

The imbalance suggests Netflix prioritizes mature content over children’s or general audience programming.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

A strong adult‑content catalog can appeal to the large adult subscriber segment, supporting engagement and retention.

Clear understanding of ratings mix helps Netflix tailor recommendations and marketing strategies.

Negative Growth Risk:

Over‑reliance on mature content may alienate families and younger audiences.

Lack of balance could limit Netflix’s growth in regions where family‑friendly programming is in high demand.
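
To back the bar chart with numbers, the share of each rating can be computed directly; a rating holds a true majority only if its share exceeds 50%. A minimal sketch, assuming `df` still contains the original `rating` column; `rating_shares` is a hypothetical helper name.

```python
import pandas as pd


def rating_shares(data, rating_col="rating"):
    """Return each rating's share of titles in percent, largest first."""
    # normalize=True converts counts to fractions; missing values are excluded
    shares = data[rating_col].value_counts(normalize=True) * 100
    return shares.round(1)


# Example usage:
# print(rating_shares(df).head(5))
```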

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Chart–5: Country distribution (Top 15 countries)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_country_distribution(df, country_col='country', top_n=15, figsize=(10,6)):
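    # Split multiple countries, explode into rows
    countries = (
        df[country_col]
        .dropna()
        .str.split(',')
        .explode()
        .str.strip()
    )
    # Drop empty strings and the 'Unknown' placeholder added during data wrangling
    countries = countries[~countries.isin(['', 'Unknown'])]

    country_counts = countries.value_counts().head(top_n)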

    plt.figure(figsize=figsize)
    sns.barplot(
        x=country_counts.values,
        y=country_counts.index,
        palette='coolwarm'
    )
    plt.title(f"Chart–5: Top {top_n} countries by Netflix titles")
    plt.xlabel("Count")
    plt.ylabel("Country")
    plt.tight_layout()
    plt.show()

# Example usage:
# plot_country_distribution(df, country_col='country', top_n=15)


##### 1. Why did you pick the specific chart?

The country distribution chart was chosen because Netflix’s catalog is global, and understanding which countries contribute the most titles helps reveal geographic diversity and market concentration. It adds a strategic dimension beyond genres and ratings.

##### 2. What is/are the insight(s) found from the chart?

The United States dominates Netflix’s catalog, followed by India, UK, and other regions.

Smaller contributions from countries in Africa, Latin America, and Southeast Asia show underrepresentation.

The imbalance highlights Netflix’s reliance on a few production hubs.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Identifies strong production regions (e.g., US, India) → Netflix can leverage these for proven audience demand.

Highlights growth opportunities in underrepresented regions → potential for new subscribers and localized content.

Negative Growth Risk:

Over‑dependence on a few countries may limit cultural diversity and reduce appeal in global markets.

Lack of regional balance could slow adoption in emerging markets where local content is key.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Chart–6: Content type distribution (Movies vs TV Shows)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_type_distribution(df, type_col='type', figsize=(6,6)):
    type_counts = df[type_col].value_counts()

    plt.figure(figsize=figsize)
    sns.barplot(
        x=type_counts.index,
        y=type_counts.values,
        palette='pastel'
    )
    plt.title("Chart–6: Content type distribution")
    plt.xlabel("Type")
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()

# Example usage:
# plot_type_distribution(df, type_col='type')


##### 1. Why did you pick the specific chart?

This chart was chosen because the split between Movies and TV Shows is a fundamental aspect of Netflix’s catalog. It helps assess whether Netflix is prioritizing serialized content or standalone films, which directly impacts user engagement and viewing habits.

##### 2. What is/are the insight(s) found from the chart?

Movies make up the majority of Netflix’s catalog, but TV Shows have grown steadily in recent years.

The imbalance highlights Netflix’s traditional strength in films, while also showing its strategic push into TV content.

TV Shows often dominate newer releases, aligning with binge‑watching trends. Because this chart counts titles without a time axis, the growth pattern rests on the release-year and date-added views rather than on this chart alone.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Understanding the type distribution helps Netflix align recommendations with user preferences.

Growth in TV Shows supports binge‑watching culture, increasing watch time and subscriber retention.

Clear catalog balance guides future investments in content production.

Negative Growth Risk:

Over‑reliance on Movies may reduce appeal for audiences who prefer long‑form series.

Conversely, too much focus on TV Shows could alienate viewers who prefer quick, standalone films.
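
The count chart has no time axis, so a claim about TV Shows growing in recent years can be checked with a per-year share table. A minimal sketch, assuming `df` still has the original `type` and `release_year` columns; `type_share_by_year` is a hypothetical helper name.

```python
import pandas as pd


def type_share_by_year(data, year_col="release_year", type_col="type"):
    """Row-normalized table: share of each content type per release year."""
    # normalize="index" makes every row (year) sum to 1
    return pd.crosstab(data[year_col], data[type_col], normalize="index")


# Example usage:
# print(type_share_by_year(df).tail(10))
```

A rising value in the TV Show column over the most recent years would support the growth claim; a flat one would not.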

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Chart–7: Release year distribution
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_release_year_distribution(df, year_col='release_year', figsize=(10,6), bins=30):
    # Clean year column on a copy to avoid pandas' SettingWithCopyWarning
    df_clean = df.dropna(subset=[year_col]).copy()
    df_clean[year_col] = df_clean[year_col].astype(int)

    plt.figure(figsize=figsize)
    sns.histplot(
        data=df_clean,
        x=year_col,
        bins=bins,
        color='skyblue',
        edgecolor='black'
    )
    plt.title("Chart–7: Release year distribution of Netflix titles")
    plt.xlabel("Release Year")
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()

# Example usage:
# plot_release_year_distribution(df, year_col='release_year', bins=30)


##### 1. Why did you pick the specific chart?

The release year distribution chart was chosen because it highlights how Netflix’s catalog has evolved over time. It helps identify growth phases, spikes in content addition, and shifts in strategy toward newer releases.

##### 2. What is/are the insight(s) found from the chart?

Significant increase in titles after 2015, showing Netflix’s global expansion.

Older content is limited, with most of the catalog concentrated in recent years.

Clusters reveal Netflix’s focus on fresh, contemporary content rather than classic archives.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Demonstrates Netflix’s ability to adapt to demand by rapidly expanding new releases.

Helps in planning future investments by tracking content growth trends.

Supports marketing strategies that emphasize “latest and trending” titles.

Negative Growth Risk:

Heavy skew toward recent releases may alienate audiences who value older classics.

Lack of balance could reduce appeal for niche viewers seeking vintage or archival content.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Chart–8: Duration distribution (Movies vs TV Shows)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_duration_distribution(df, type_col='type', duration_col='duration', figsize=(10,6)):
    # Extract the leading number from values such as "90 min" or "2 Seasons"
    df_clean = df.copy()
    df_clean['duration_num'] = (
        df_clean[duration_col]
        .astype(str)
        .str.extract(r'(\d+)', expand=False)   # raw string; expand=False returns a Series
        .astype(float)
    )

    plt.figure(figsize=figsize)
    sns.boxplot(
        data=df_clean,
        x=type_col,
        y='duration_num',
        palette='Set3'
    )
    plt.title("Chart–8: Duration distribution by content type")
    plt.xlabel("Content Type")
    plt.ylabel("Duration (minutes for Movies / seasons for TV Shows)")
    plt.tight_layout()
    plt.show()

# Example usage:
# plot_duration_distribution(df, type_col='type', duration_col='duration')


##### 1. Why did you pick the specific chart?

The duration distribution chart was chosen because movie length and TV show season counts are key differentiators in Netflix’s catalog. Visualizing duration helps identify whether Netflix favors shorter or longer content, which directly influences viewer engagement and clustering patterns.

##### 2. What is/are the insight(s) found from the chart?

Movies typically cluster around 90–120 minutes, showing a standard length preference.

TV Shows vary widely, but most have 1–3 seasons, indicating Netflix’s focus on shorter series.

Outliers (very long movies or multi‑season shows) are rare, highlighting Netflix’s streamlined content strategy.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Helps Netflix align production with audience preferences for binge‑worthy but manageable content.

Shorter series encourage quick consumption, boosting watch time and subscriber retention.

Duration insights guide recommendations (e.g., suggesting shorter films for casual viewers).

Negative Growth Risk:

Over‑reliance on shorter formats may alienate audiences who prefer long, multi‑season shows.

Lack of variety in movie lengths could reduce appeal for niche viewers who enjoy epics or extended features.
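
Because movies are measured in minutes and TV shows in seasons, the two groups are easier to compare with per-type summary statistics than on one shared axis. A minimal sketch, assuming `df` has the original `type` and `duration` columns; `duration_summary` is a hypothetical helper name.

```python
import pandas as pd


def duration_summary(data, type_col="type", duration_col="duration"):
    """Describe the numeric part of duration separately for each content type."""
    # Pull the leading number out of strings such as "90 min" or "2 Seasons"
    numbers = (
        data[duration_col]
        .astype(str)
        .str.extract(r"(\d+)", expand=False)
        .astype(float)
    )
    # Group by content type so minutes and seasons are never mixed
    summary = numbers.groupby(data[type_col]).describe()
    return summary[["count", "25%", "50%", "75%", "max"]]


# Example usage:
# print(duration_summary(df))
```

The quartiles per type give a direct check on statements such as the typical movie length or the usual number of seasons.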

#### Chart - 9

In [None]:
# Chart - 9 visualization code
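# Chart–9: Titles added per year
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_titles_per_year(df, date_col='date_added', figsize=(10,6)):
    # Work on a copy so the caller's DataFrame is not modified
    df_clean = df.dropna(subset=[date_col]).copy()
    # errors='coerce' turns unparseable values (e.g. the 'Unknown' placeholder) into NaT
    added = pd.to_datetime(df_clean[date_col].astype(str).str.strip(), errors='coerce')
    df_clean['year_added'] = added.dt.year
    # Drop rows without a valid date and store the year as an integer
    df_clean = df_clean.dropna(subset=['year_added'])
    df_clean['year_added'] = df_clean['year_added'].astype(int)

    plt.figure(figsize=figsize)
    sns.countplot(x='year_added', data=df_clean, color='steelblue')
    plt.title("Chart–9: Titles added per year")
    plt.xlabel("Year Added")
    plt.ylabel("Count")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()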


##### 1. Why did you pick the specific chart?

Chosen to show Netflix’s growth trajectory and expansion pace over time.

##### 2. What is/are the insight(s) found from the chart?

Sharp increase in titles after 2015 → global expansion.

Recent years show consistent additions, pointing to an aggressive content strategy.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Identifies growth phases, supports marketing of “fresh content.”

Negative risk: Overloading catalog in short time may reduce quality control → possible churn.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Chart–10: Top 10 directors
def plot_top_directors(df, director_col='director', top_n=10, figsize=(10,6)):
    directors = df[director_col].dropna().str.split(',').explode().str.strip()
    # Skip the 'Unknown' placeholder introduced during data wrangling
    directors = directors[directors != 'Unknown']
    director_counts = directors.value_counts().head(top_n)

    plt.figure(figsize=figsize)
    sns.barplot(y=director_counts.index, x=director_counts.values, palette='muted')
    plt.title(f"Chart–10: Top {top_n} directors by count")
    plt.xlabel("Count")
    plt.ylabel("Director")
    plt.tight_layout()
    plt.show()


##### 1. Why did you pick the specific chart?

Highlights creative influence and identifies directors with frequent contributions.

##### 2. What is/are the insight(s) found from the chart?

Few directors dominate Netflix’s catalog.

Diversity of directors is limited compared to overall volume.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Strong partnerships with prolific directors boost brand recognition.

Negative risk: Over‑reliance on a handful of directors may reduce creative variety.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Chart–11: Top 10 actors
def plot_top_actors(df, cast_col='cast', top_n=10, figsize=(10,6)):
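    actors = df[cast_col].dropna().str.split(',').explode().str.strip()
    # Skip the 'Unknown' placeholder introduced during data wrangling
    actors = actors[actors != 'Unknown']
    actor_counts = actors.value_counts().head(top_n)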

    plt.figure(figsize=figsize)
    sns.barplot(y=actor_counts.index, x=actor_counts.values, palette='cool')
    plt.title(f"Chart–11: Top {top_n} actors by count")
    plt.xlabel("Count")
    plt.ylabel("Actor")
    plt.tight_layout()
    plt.show()


##### 1. Why did you pick the specific chart?

Highlights on-screen talent and identifies actors with frequent appearances.

##### 2. What is/are the insight(s) found from the chart?

A relatively small group of actors accounts for the highest appearance counts.

The most frequent actors cover only a small part of the overall catalog volume.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Strong partnerships with frequently cast actors boost brand recognition.

Negative risk: Over‑reliance on a handful of actors may reduce creative variety.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Chart–12: Ratings vs Duration scatter
def plot_rating_vs_duration(df, rating_col='rating', duration_col='duration', figsize=(10,6)):
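    df_clean = df.copy()
    # Raw string avoids an invalid-escape warning; expand=False returns a Series
    df_clean['duration_num'] = (
        df_clean[duration_col].astype(str).str.extract(r'(\d+)', expand=False).astype(float)
    )

    plt.figure(figsize=figsize)
    sns.scatterplot(data=df_clean, x='duration_num', y=rating_col, alpha=0.6)
    plt.title("Chart–12: Ratings vs Duration")
    plt.xlabel("Duration (minutes for Movies / seasons for TV Shows)")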
    plt.ylabel("Rating")
    plt.tight_layout()
    plt.show()


##### 1. Why did you pick the specific chart?

Chosen to explore relationship between maturity rating and content length.



##### 2. What is/are the insight(s) found from the chart?

Mature content (TV‑MA) often longer; family content shorter.

Clear clusters by rating and duration.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Helps tailor recommendations (shorter PG films for families, longer TV‑MA shows for binge watchers).

Negative risk: Skew toward long mature content may alienate casual or younger viewers.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Chart–13: Genre vs Type heatmap
def plot_genre_type_heatmap(df, genre_col='listed_in', type_col='type', top_n=10, figsize=(12,8)):
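    # One row per (title, genre) pair; exploding inside the frame keeps genre and type aligned
    df_exp = df[[genre_col, type_col]].dropna().copy()
    df_exp['genre'] = df_exp[genre_col].str.split(',')
    df_exp = df_exp.explode('genre')
    df_exp['genre'] = df_exp['genre'].str.strip()

    pivot = df_exp.groupby(['genre', type_col]).size().unstack(fill_value=0)
    # Keep the top_n genres by total count rather than the first top_n alphabetically
    top_genres = pivot.sum(axis=1).sort_values(ascending=False).head(top_n).index
    pivot = pivot.loc[top_genres]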

    plt.figure(figsize=figsize)
    sns.heatmap(pivot, annot=True, fmt='d', cmap='YlGnBu')
    plt.title("Chart–13: Genre vs Type heatmap")
    plt.xlabel("Type")
    plt.ylabel("Genre")
    plt.tight_layout()
    plt.show()


##### 1. Why did you pick the specific chart?

Shows how genres split across Movies vs TV Shows.



##### 2. What is/are the insight(s) found from the chart?

Drama and Comedy dominate both types.

Certain genres (e.g., Reality, Anime) mostly appear as TV Shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Guides investment in genres aligned with format demand.

Negative risk: Ignoring underrepresented genres may limit audience diversity.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Chart–14: Correlation heatmap of numerical features
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_correlation_heatmap(df, numeric_cols=None, figsize=(10,8)):
    """
    Creates a correlation heatmap for selected numeric columns.
    Parameters:
        df : pandas DataFrame
        numeric_cols : list of numeric column names (optional). If None, auto-select numeric columns.
        figsize : figure size
    """
    # Select numeric columns
    if numeric_cols is None:
        df_num = df.select_dtypes(include='number')  # all numeric dtypes, not only int64/float64
    else:
        df_num = df[numeric_cols]

    # Compute correlation matrix
    corr = df_num.corr()

    # Plot heatmap
    plt.figure(figsize=figsize)
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)
    plt.title("Chart–14: Correlation heatmap of Netflix features")
    plt.tight_layout()
    plt.show()

# Example usage:
# plot_correlation_heatmap(df, numeric_cols=['release_year','duration_num'])


##### 1. Why did you pick the specific chart?

Chosen to detect relationships between numerical features (release year, duration, etc.).




##### 2. What is/are the insight(s) found from the chart?

Weak correlation between release year and duration.

Stronger association between type and duration (TV Shows vs Movies). Because `type` is categorical, it appears in the heatmap only when it is encoded numerically.
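
For `type` to show up in a Pearson correlation matrix, it first has to be turned into a number. A minimal sketch, assuming `type` takes the values `Movie` and `TV Show` and that `duration_num` already holds the numeric part of the duration; the indicator column `is_tv_show` and the function name are hypothetical. Since movie durations are in minutes and TV durations in seasons, a strong coefficient here largely reflects the difference in units.

```python
import pandas as pd

def correlation_with_type(df, numeric_cols=("release_year", "duration_num"), type_col="type"):
    """Pearson correlation matrix of numeric_cols plus a 0/1 TV-show indicator."""
    encoded = df[list(numeric_cols)].apply(pd.to_numeric, errors="coerce")
    # 1 for TV shows, 0 for everything else (assumed to be movies).
    encoded["is_tv_show"] = (df[type_col] == "TV Show").astype(int)
    return encoded.corr()

# Toy rows for illustration only.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "release_year": [2010, 2015, 2019, 2012, 2018, 2020],
    "duration_num": [95, 110, 120, 1, 3, 2],
})
corr = correlation_with_type(toy)
print(corr.round(2))
```

The returned matrix can be drawn with the same `sns.heatmap` call used in `plot_correlation_heatmap`.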

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Chart–15: Pair plot for Netflix dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_pairplot(df, numeric_cols=['release_year', 'duration_num'], hue_col='type'):
    """
    Creates a pair plot (scatterplot matrix) for selected numeric columns.
    Parameters:
        df : pandas DataFrame
        numeric_cols : list of numeric column names to include
        hue_col : categorical column for coloring (e.g., 'type')
    """
    # Clean numeric columns
    df_clean = df.copy()
    for col in numeric_cols:
        df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')
    df_clean = df_clean.dropna(subset=numeric_cols)

    # Pair plot
    sns.pairplot(
        df_clean[numeric_cols + [hue_col]],
        hue=hue_col,
        diag_kind='kde',
        palette='husl',
        plot_kws={'alpha':0.6}
    )
    plt.suptitle("Chart–15: Pair plot of Netflix features", y=1.02)
    plt.show()

# Example usage:
# plot_pairplot(df, numeric_cols=['release_year','duration_num'], hue_col='type')


##### 1. Why did you pick the specific chart?

Shows multi‑variable relationships simultaneously, validating cluster separations.

##### 2. What is/are the insight(s) found from the chart?

Clear grouping by type (Movies vs TV Shows).

Overlaps suggest hybrid categories (e.g., docuseries).

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve the business objective, the client should leverage clustering insights to personalize recommendations; diversify the catalog by balancing movies vs TV shows, mature vs family content, and global vs local titles; and invest strategically in underrepresented genres and regions. This supports wider audience appeal, stronger engagement, and sustainable subscriber growth.

# **Conclusion**

The clustering and visualization analysis of Netflix’s catalog reveals clear patterns in ratings, genres, durations, countries, and release years. These insights highlight Netflix’s strengths in adult content, movies, and recent releases, while also exposing gaps in family‑friendly programming, regional diversity, and longer‑format shows. By leveraging these findings, Netflix can personalize recommendations, balance its catalog, and strategically invest in underrepresented areas to maximize global reach.

Overall, the project demonstrates how data‑driven clustering and visualization can guide business decisions, ensuring sustainable growth, improved user engagement, and competitive advantage in the streaming industry.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***