<a href="https://colab.research.google.com/github/Debjit-2005-ML/Netflix-Content-Distribution-using-ml/blob/main/NETFLIX_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -
Netflix Movies & TV Shows – Unsupervised Learning (EDA + Clustering Foundation)


##### **Project Type**    - Unsupervised Learning (Exploratory Data Analysis + Clustering Preparation)
##### **Contribution**    - Individual


# **Project Summary -**

This project analyzes the Netflix Movies and TV Shows dataset to extract insights about content distribution, growth trends, genre dominance, and regional production patterns. The objective is to explore the dataset in depth using EDA techniques and prepare it for Unsupervised Machine Learning (Clustering).

Through this project, we analyze:

- Content growth over time
- Movie vs TV Show trends
- Country-wise distribution
- Genre frequency
- Duration patterns
- Rating distribution
- Director and cast contributions

After thorough data cleaning and feature engineering, the dataset is prepared for clustering using TF-IDF and segmentation techniques. The goal is to help Netflix understand content segmentation and optimize business strategy.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Netflix wants to understand how its content is distributed across various genres, countries, durations, and ratings. The dataset contains both Movies and TV Shows, and the objective is to analyze patterns and prepare the data for clustering similar content together using Unsupervised Learning.

#### **Define Your Business Objective?**

The main business objectives are:

- To understand content distribution patterns.
- To analyze growth trends over time.
- To identify dominant genres and regions.
- To prepare structured data for clustering.
- To support Netflix in strategic content planning.

# **General Guidelines** : -  

Perform detailed EDA.

Handle missing values logically.

Engineer meaningful features.

Extract actionable insights.

Prepare dataset for clustering.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
df = pd.read_csv("NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv")
df.head()

### Dataset First View

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
print("Rows:", df.shape[0])
print("Columns:", df.shape[1])

### Dataset Information

In [None]:
df.info()

#### Duplicate Values

In [None]:
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
df.isnull().sum()

In [None]:
(df.isnull().sum()/len(df))*100

### What did you know about your dataset?

The dataset contains 7,787 Netflix titles including Movies and TV Shows. Most features are categorical. Significant missing values exist in director, cast, and country columns. Duration column contains mixed formats and requires cleaning. Description column is crucial for clustering.
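The two missing-value checks above can be combined into one table. The following is a minimal sketch on a made-up toy frame; the helper name `missing_summary` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages, most affected columns first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps.
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )

# Toy stand-in for the Netflix data (values are invented for illustration).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["India", "United States", np.nan, "Japan"],
})

report = missing_summary(toy)
print(report)
```

Calling `missing_summary(df)` on the real DataFrame would give the same view for `director`, `cast`, and `country`.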

## ***2. Understanding Your Variables***

In [None]:
df.describe()

### Variables Description

- `show_id`: Unique identifier
- `type`: Movie or TV Show
- `title`: Title name
- `director`: Director of the content
- `cast`: Actors involved
- `country`: Production country
- `date_added`: Date added to Netflix
- `release_year`: Year of release
- `rating`: Age certification
- `duration`: Runtime in minutes (Movies) or number of seasons (TV Shows)
- `listed_in`: Genre categories
- `description`: Content summary

### Check Unique Values for each variable.

In [None]:
for col in df.columns:
    print(col, ":", df[col].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert date_added to datetime; strip stray whitespace first, unparseable values become NaT
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')

# Extract year & month added
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month

# Clean duration: pull out the number (raw string avoids an invalid-escape warning)
df['duration_int'] = df['duration'].str.extract(r'(\d+)', expand=False).astype(float)

# Label the unit; str(x) keeps the check from failing on missing values
df['duration_type'] = df['duration'].apply(lambda x: "Season" if "Season" in str(x) else "Minute")

# Fill missing categorical values by reassignment
# (inplace=True on a selected column is unreliable in recent pandas versions)
df['director'] = df['director'].fillna('Unknown')
df['cast'] = df['cast'].fillna('Unknown')
df['country'] = df['country'].fillna('Unknown')
df['rating'] = df['rating'].fillna('Not Rated')

### What all manipulations have you done and insights you found?

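- Converted `date_added` to datetime (unparseable entries become `NaT`).
- Extracted the time-based features `year_added` and `month_added`.
- Split `duration` into a numeric `duration_int` and a unit label `duration_type`.
- Filled missing `director`, `cast`, and `country` with "Unknown" and missing `rating` with "Not Rated".
- Prepared the dataset for analysis.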
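The duration cleaning can also be written without `apply`. This is a minimal sketch on toy values, assuming `duration` looks like `"90 min"` or `"2 Seasons"`; the toy frame is illustrative only.

```python
import numpy as np
import pandas as pd

# Toy values, including one missing entry.
toy = pd.DataFrame({"duration": ["90 min", "2 Seasons", "1 Season", np.nan]})

# Extract the first run of digits; rows without digits (or NaN) stay NaN.
toy["duration_int"] = (
    toy["duration"].str.extract(r"(\d+)", expand=False).astype(float)
)

# Vectorized unit label; na=False treats missing values as "no match".
is_season = toy["duration"].str.contains("Season", na=False)
toy["duration_type"] = np.where(is_season, "Season", "Minute")

# Leave the unit undefined where the duration itself is missing.
toy.loc[toy["duration"].isna(), "duration_type"] = np.nan

print(toy)
```

Unlike the `apply` version, this keeps the unit label missing when the duration is missing, rather than defaulting to "Minute".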

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
sns.countplot(data=df, x='type')
plt.title("Distribution of Movies vs TV Shows")
plt.show()

##### 1. Why did you pick the specific chart?

To understand content composition.

##### 2. What is/are the insight(s) found from the chart?

Movies dominate Netflix library.

##### 3. Will the gained insights help creating a positive business impact?


Netflix can evaluate investment ratio.

#### Chart - 2

In [None]:
df['year_added'].value_counts().sort_index().plot(kind='line')
plt.title("Content Added Over Years")
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Massive growth after 2015.

##### 3. Will the gained insights help creating a positive business impact?


Platform expansion strategy.

#### Chart - 3

In [None]:
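# A title can list several countries; split them so each country is counted separately
df['country'].str.split(', ').explode().value_counts().head(10).plot(kind='bar')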
plt.title("Top 10 Content Producing Countries")
plt.show()

Insight:
USA dominates.

Impact:
Regional diversification opportunity.

#### Chart - 4

In [None]:
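# listed_in holds comma-separated genres; split them so individual genres are counted
df['listed_in'].str.split(', ').explode().value_counts().head(10).plot(kind='bar')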
plt.title("Top Genres")
plt.show()

Insight:
Drama and Comedy dominate.

Impact:
Content investment guidance.

#### Chart - 5

In [None]:
sns.countplot(data=df, x='rating')
plt.xticks(rotation=90)
plt.show()

Insight:
TV-MA common.

Impact:
Adult audience focus.

#### Chart - 6

In [None]:
df['release_year'].value_counts().sort_index().plot()
plt.show()

Insight:
The number of titles per release year rises sharply after 2010.

#### Chart - 7

In [None]:
sns.histplot(df[df['type']=="Movie"]['duration_int'])
plt.show()

#### Chart - 8

In [None]:
sns.histplot(df[df['type']=="TV Show"]['duration_int'])
plt.show()

#### Chart - 9

In [None]:
sns.countplot(data=df, x='year_added', hue='type')
plt.xticks(rotation=90)
plt.show()

#### Chart - 10

In [None]:
df['month_added'].value_counts().sort_index().plot(kind='bar')
plt.show()

#### Chart - 11

In [None]:
df['director'].value_counts().head(10).plot(kind='bar')
plt.show()

#### Chart - 12

In [None]:
pd.crosstab(df['year_added'], df['type']).plot()
plt.show()

#### Chart - 13

In [None]:
sns.scatterplot(data=df, x='release_year', y='duration_int')
plt.show()

#### Chart - 14 - Correlation Heatmap

In [None]:
# Select only numeric columns
numeric_df = df.select_dtypes(include='number')  # covers every integer and float width

# Plot heatmap
plt.figure(figsize=(8,6))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(numeric_df.corr(), annot=True, fmt=".2f", cmap="YlGnBu", linewidths=0.5)
plt.title("Correlation Heatmap of Numeric Features")
plt.show()

##### 1. Why did you pick the specific chart?

Correlation heatmap helps in understanding relationships between numeric variables.

##### 2. What is/are the insight(s) found from the chart?

The numeric variables do not show strong linear relationships. Release year and duration have weak correlation, indicating that movie length does not strongly depend on release year.

#### Chart - 15 - Pair Plot

In [None]:
sns.pairplot(df[['release_year','duration_int']])
plt.show()

##### 2. What is/are the insight(s) found from the chart?

No strong linear relationships.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Netflix should:

Invest in high-demand genres.

Expand regional content.

Continue TV Show production growth.

Segment content for personalized recommendations.

# **Conclusion**

The EDA reveals strong growth trends post-2015, dominance of US content, and popularity of drama-based genres. The cleaned dataset is now ready for clustering, which will group similar content and help Netflix improve recommendation systems and content strategy.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***

# **Netflix Machine Learning**

# Hypothesis 1
Statement:

Movies have significantly different average duration compared to TV Shows

Null Hypothesis (H0):

There is no significant difference in average duration between Movies and TV Shows.

Alternate Hypothesis (H1):

There is a significant difference in average duration between Movies and TV Shows.

Statistical Test: Independent T-Test

Because:

Comparing mean of two groups

Numeric variable (duration)

Two independent categories

In [None]:
from scipy.stats import ttest_ind

movie_duration = df[df['type'] == 'Movie']['duration_int'].dropna()
tv_duration = df[df['type'] == 'TV Show']['duration_int'].dropna()

t_stat, p_value = ttest_ind(movie_duration, tv_duration)

print("T-Statistic:", t_stat)
print("P-Value:", p_value)

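Interpretation:

If p < 0.05, reject H0: the mean of `duration_int` differs significantly between the two groups.
Note that `duration_int` is measured in minutes for Movies and in seasons for TV Shows, so this test compares values in different units and the result should be read with that caveat.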
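The two groups have very different spreads, so Welch's variant of the t-test (`equal_var=False`) is a reasonable alternative. This is a minimal sketch of the decision rule on synthetic data; the group means, sizes, and the 0.05 threshold are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic stand-ins: movie runtimes in minutes vs. TV season counts.
movie_like = rng.normal(loc=100, scale=25, size=200)
tv_like = rng.normal(loc=2, scale=1, size=80)

# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = ttest_ind(movie_like, tv_like, equal_var=False)

alpha = 0.05  # significance level (assumed)
decision = "Reject H0" if p_value < alpha else "Fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.3g} -> {decision}")
```

On the real data the same call would take `movie_duration` and `tv_duration` as inputs.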

# Hypothesis 2
Statement:

There is an association between Content Type and Rating.

Null Hypothesis (H0):

Content type and rating are independent.

Alternate Hypothesis (H1):

Content type and rating are dependent.

Statistical Test: Chi-Square Test

Because:

Both variables categorical

In [None]:
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(df['type'], df['rating'])

chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Chi-Square Statistic:", chi2)
print("P-Value:", p)
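A significant chi-square result says nothing about how strong the association is. This sketch adds Cramér's V as an effect size and checks the expected counts. The contingency table is made up for illustration and does not reflect the Netflix data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Invented counts, for illustration only.
table = pd.DataFrame(
    {"TV-MA": [120, 60], "TV-14": [80, 50], "PG-13": [40, 2]},
    index=["Movie", "TV Show"],
)

chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V rescales chi-square to the range [0, 1].
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Rule of thumb: the chi-square approximation is weaker when expected counts are below 5.
small_cells = int((expected < 5).sum())

print(f"chi2 = {chi2:.2f}, p = {p:.3g}, dof = {dof}")
print(f"Cramér's V = {cramers_v:.3f}, cells with expected < 5: {small_cells}")
```

If p < 0.05, H0 (independence) is rejected, and Cramér's V indicates whether the dependence is weak or strong.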

# Hypothesis 3
Statement:

Content release year affects duration.

Null Hypothesis (H0):

No correlation between release_year and duration.

Alternate Hypothesis (H1):

There is correlation.

Statistical Test: Pearson Correlation

In [None]:
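from scipy.stats import pearsonr

# Drop rows where either value is missing so both inputs stay aligned and equal in length
pair = df[['release_year', 'duration_int']].dropna()
corr, p_val = pearsonr(pair['release_year'], pair['duration_int'])

print("Correlation:", corr)
print("P-Value:", p_val)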

# **Feature Engineering & Data Preprocessing**

# Handling Missing Values
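Missing values were already handled in the wrangling step:

- Filled missing `director`, `cast`, and `country` with "Unknown" and missing `rating` with "Not Rated".
- Coerced unparseable `date_added` values to `NaT` (the rows themselves were kept).

Why?
Categorical columns have no mean or median, so a placeholder category is used instead.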

# Handling Outliers

Use IQR method for duration.

In [None]:
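# Bounds are computed on the pooled values, which mix minutes (Movies) and seasons (TV Shows)
Q1 = df['duration_int'].quantile(0.25)
Q3 = df['duration_int'].quantile(0.75)
IQR = Q3 - Q1

# Rows with a missing duration_int fail both comparisons and are dropped as well;
# .copy() avoids SettingWithCopyWarning when new columns are added later
df = df[(df['duration_int'] >= Q1 - 1.5*IQR) &
        (df['duration_int'] <= Q3 + 1.5*IQR)].copy()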
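Movie minutes and TV season counts sit on different scales, so one option is to compute the IQR bounds separately for each content type. This is a sketch of that variant, not the notebook's method; the helper name `iqr_filter` and the toy values are assumptions.

```python
import pandas as pd

def iqr_filter(frame, value_col, group_col, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR] of their own group."""
    grouped = frame.groupby(group_col)[value_col]
    # transform broadcasts each group's quantile back onto its rows.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (frame[value_col] >= lower) & (frame[value_col] <= upper)
    return frame[mask].copy()

# Toy data: one extreme movie runtime and one extreme season count.
toy = pd.DataFrame({
    "type": ["Movie"] * 6 + ["TV Show"] * 6,
    "duration_int": [90, 95, 100, 105, 110, 400, 1, 1, 2, 2, 3, 15],
})

cleaned = iqr_filter(toy, "duration_int", "type")
print(cleaned)
```

On the real data this would be called as `iqr_filter(df, 'duration_int', 'type')`.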

# Categorical Encoding

For clustering, we use text-based encoding.
No label encoding required.

# **Textual Data Preprocessing**

# Text Cleaning & Feature Combination

In [None]:
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z ]", "", text)
    return text

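# Combine the text fields; fillna('') keeps a missing value from turning the whole row into NaN
text_cols = ['listed_in', 'description', 'director', 'cast']
df['combined_text'] = df[text_cols].fillna('').agg(' '.join, axis=1)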

df['combined_text'] = df['combined_text'].apply(clean_text)

# Remove Stopwords + Tokenization

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def remove_stopwords(text):
    words = text.split()
    words = [word for word in words if word not in ENGLISH_STOP_WORDS]
    return " ".join(words)

df['combined_text'] = df['combined_text'].apply(remove_stopwords)

# **Text Vectorization**

We use TF-IDF.

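Why?

- It captures how important a word is to a title relative to the whole catalog.
- It down-weights words that appear in most titles.
- It produces sparse, length-normalized vectors that work well with distance-based clustering.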

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df['combined_text'])
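To make the weighting concrete, here is a small sketch on three invented descriptions. It relies on `TfidfVectorizer` defaults (L2-normalized rows); the documents are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus; "drama" appears in two documents, "crime" in one.
docs = [
    "crime drama detective murder",
    "romantic comedy wedding drama",
    "stand up comedy special",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)
dense = matrix.toarray()  # fine for a toy corpus; keep real data sparse

vocab = vectorizer.vocabulary_  # maps term -> column index
row_norms = np.linalg.norm(dense, axis=1)

print("Shape:", matrix.shape)
print("Row norms:", row_norms.round(3))
print("Weight of 'drama' in doc 0:", round(dense[0, vocab["drama"]], 3))
print("Weight of 'crime' in doc 0:", round(dense[0, vocab["crime"]], 3))
```

Within the first document, the rarer term "crime" gets a higher weight than "drama", which appears in two documents.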

# Dimensionality Reduction

Use PCA for visualization.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X.toarray())
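PCA needs a dense array, which is why `X.toarray()` is called above. If memory becomes a concern, `TruncatedSVD` works directly on the sparse TF-IDF matrix. This is a sketch of that alternative on invented documents, not the notebook's method.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented documents standing in for combined_text.
docs = [
    "crime drama detective investigation",
    "crime thriller detective murder",
    "romantic comedy wedding love",
    "romantic drama love story",
    "stand up comedy special",
    "nature documentary wildlife ocean",
]

X_sparse = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD accepts sparse input, so no .toarray() call is needed.
svd = TruncatedSVD(n_components=2, random_state=42)
X_2d = svd.fit_transform(X_sparse)

print("Reduced shape:", X_2d.shape)
print("Explained variance ratio:", svd.explained_variance_ratio_.round(3))
```

Unlike PCA, `TruncatedSVD` does not center the data, so the two projections will not be identical.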

# ***ML Model Implementation***

Since this is Unsupervised:

We will use:

Model 1 → KMeans

Model 2 → Hierarchical

Model 3 → DBSCAN

# ML Model 1 – KMeans

**Elbow Method**

In [None]:
from sklearn.cluster import KMeans

inertia = []

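for k in range(2, 11):
    # n_init is set explicitly so results do not depend on the installed scikit-learn default
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    inertia.append(model.inertia_)

plt.plot(range(2, 11), inertia, marker='o')
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()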
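The elbow can be hard to read by eye, so a silhouette sweep over the same range of k is a useful cross-check. This is a minimal sketch on synthetic, well-separated blobs; the data and the k range are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic groups (not the Netflix data).
points, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Pick the k with the highest mean silhouette.
best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("Best k:", best_k)
```

On the TF-IDF matrix `X`, the same loop would run with `X` in place of `points`. Silhouette values for sparse text are usually much lower than on this toy data.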

**Apply KMeans**

In [None]:
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df['kmeans_cluster'] = kmeans.fit_predict(X)

**Silhouette Score**

In [None]:
from sklearn.metrics import silhouette_score

score = silhouette_score(X, df['kmeans_cluster'])
print("Silhouette Score:", score)

# **ML Model 2 – Hierarchical Clustering**

In [None]:
from sklearn.cluster import AgglomerativeClustering

hierarchical = AgglomerativeClustering(n_clusters=5)
df['hier_cluster'] = hierarchical.fit_predict(X.toarray())
print("Hierarchical Cluster Distribution:")
print(df['hier_cluster'].value_counts())

**Silhouette Score**

In [None]:
from sklearn.metrics import silhouette_score

hier_score = silhouette_score(X, df['hier_cluster'])
print("Hierarchical Silhouette Score:", hier_score)

# Visualize Clusters

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=df['hier_cluster'], cmap='viridis')
plt.title("Hierarchical Clustering Visualization")
plt.show()

# **ML Model 3 – DBSCAN**

In [None]:
from sklearn.cluster import DBSCAN

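# TF-IDF rows are L2-normalized by default, so Euclidean distances lie between 0 and about 1.41.
# With eps=0.5 many points may be labelled as noise (-1); tune eps and min_samples as needed.
dbscan = DBSCAN(eps=0.5, min_samples=5)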
df['dbscan_cluster'] = dbscan.fit_predict(X)
print("DBSCAN Cluster Distribution:")
print(df['dbscan_cluster'].value_counts())

**Silhouette Score**

In [None]:
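labels = df['dbscan_cluster'].to_numpy()
clustered = labels != -1                       # -1 marks noise, not a cluster
n_clusters = len(set(labels[clustered]))

if n_clusters > 1:
    # Score only the clustered points so noise is not treated as a cluster
    db_score = silhouette_score(X[clustered], labels[clustered])
    print("DBSCAN Silhouette Score:", db_score)
else:
    print("DBSCAN found fewer than two clusters; silhouette score is undefined.")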
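A common heuristic for choosing `eps` is to look at each point's distance to its `min_samples`-th nearest neighbour. This is a rough sketch using cosine distance on invented documents. Taking the median as `eps_guess` is a crude placeholder for reading the knee of the sorted curve; it is an assumption, not a recommended rule.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Invented documents standing in for combined_text.
docs = [
    "crime drama detective investigation",
    "crime thriller detective murder",
    "romantic comedy wedding love",
    "romantic drama love story",
    "stand up comedy special",
    "nature documentary wildlife ocean",
]
X_toy = TfidfVectorizer().fit_transform(docs)

min_samples = 2
# kneighbors on the training data returns each point itself first,
# so the last column is the distance to the nearest other point.
nn = NearestNeighbors(n_neighbors=min_samples, metric="cosine").fit(X_toy)
distances, _ = nn.kneighbors(X_toy)
k_distances = np.sort(distances[:, -1])

# Placeholder choice: median k-distance (normally read from the curve's knee).
eps_guess = float(np.quantile(k_distances, 0.5))

labels = DBSCAN(eps=eps_guess, min_samples=min_samples, metric="cosine").fit_predict(X_toy)
print("Sorted k-distances:", k_distances.round(3))
print("eps guess:", round(eps_guess, 3))
print("Labels:", labels)
```

Plotting `k_distances` and picking the point where the curve bends sharply is the usual way to refine the guess.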

# Visualization

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=df['dbscan_cluster'], cmap='plasma')
plt.title("DBSCAN Clustering Visualization")
plt.show()

# **Evaluation Metric Explanation**

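For clustering:

- Silhouette Score → Measures how well separated the clusters are, on a scale from -1 to 1.
- Inertia → Within-cluster sum of squared distances to the centroid (lower means tighter clusters).
- Cluster distribution → Shows whether cluster sizes are balanced enough for business segmentation.

Higher silhouette = better defined clusters.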
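For reference, the standard definitions behind these two metrics are written out below. For a point $i$, $a(i)$ is its mean distance to the other points in its own cluster and $b(i)$ is the smallest mean distance to the points of any other cluster. $\mu_j$ denotes the centroid of cluster $j$.

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
$$

$$
\text{inertia} = \sum_{i=1}^{n} \min_{j} \lVert x_i - \mu_j \rVert^2
$$

The value reported by `silhouette_score` is the mean of $s(i)$ over all points.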

# **Final Model Selection**

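Choose the model with the highest silhouette score, provided its cluster sizes are usable for segmentation.

KMeans is a common baseline for TF-IDF text clustering, but the scores computed above should decide the final choice.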

# Feature Importance

In [None]:
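terms = tfidf.get_feature_names_out()
centroids = kmeans.cluster_centers_

# For each cluster, list the 10 terms with the largest centroid weight (highest first)
for i in range(kmeans.n_clusters):
    top_idx = centroids[i].argsort()[::-1][:10]
    print(f"Cluster {i} Top Words:")
    print([terms[ind] for ind in top_idx])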

# **Future Work**

**Save model:**

In [None]:
import joblib

joblib.dump(kmeans, "netflix_clustering_model.pkl")

**Load model:**

In [None]:
model = joblib.load("netflix_clustering_model.pkl")
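A saved KMeans model can only label new titles if the fitted TF-IDF vectorizer is available too. One way to keep them together is a scikit-learn `Pipeline`. This is a sketch on invented documents; the file name and the two-cluster setup are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Invented documents standing in for combined_text.
docs = [
    "crime drama detective investigation",
    "crime thriller detective murder",
    "romantic comedy wedding love",
    "romantic drama love story",
]

# Vectorizer and clusterer are fitted and saved as one object.
pipeline = make_pipeline(
    TfidfVectorizer(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
pipeline.fit(docs)
original_labels = pipeline.predict(docs)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_clustering_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw text directly.
restored_labels = restored.predict(docs)
new_label = restored.predict(["a detective investigates a murder"])[0]
print("Labels match after reload:", (restored_labels == original_labels).all())
print("Cluster for new text:", new_label)
```

In the notebook, the equivalent step would be to dump `tfidf` alongside `kmeans`, or to refit both inside a pipeline like the one above.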

# ***Conclusion***

- Netflix content can be segmented into clusters based on its text features.
- Drama and action clusters dominate.
- TV Shows form a growing segment.
- Clustering can support the recommendation engine.
- The business can target content investment accordingly.

# **Hurrah !**

# ***You have successfully completed your Machine Learning Capstone Project !!!***