# **Project Name**    -



##### **Project Type**    - EDA/Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 - Rajavel J M**


# **Project Summary -**

In this project, we set out to explore, analyze, and derive insights from the Netflix content dataset using advanced data science and machine learning techniques, with a particular focus on unsupervised modeling and clustering. As content streaming platforms have exploded in popularity, their vast and ever-growing catalogs pose unique challenges to content management, discovery, and personalization. Our goal was to unravel patterns in the Netflix catalog—encompassing thousands of movies and TV shows—by systematically engineering features, preprocessing data, and applying a range of unsupervised learning algorithms.

We began by thoroughly cleaning and preparing the dataset. Key steps included handling missing values in categorical columns like ‘country,’ ‘director,’ and ‘rating’ by assigning intuitive placeholders such as “Unknown,” ensuring that we maximized the retention of valuable data. Numerical features impacting user experience, notably the duration of content, were capped at the 99th percentile to address anomalies and outliers without discarding records, keeping the dataset both robust and realistic.

We turned our attention to textual features, particularly those describing genres and content descriptions. By expanding contractions, converting to lowercase, removing punctuation and stopwords, and applying lemmatization, we ensured text consistency. Advanced techniques like tokenization and TF-IDF vectorization distilled key content information into actionable numeric features. Further feature engineering included the derivation of new variables, such as the count of cast members or release decade, to provide historical and structural context.

With the feature matrix ready, we employed data scaling (using standardization) to harmonize the range of all numeric features, ensuring fair and effective clustering. Dimensionality reduction (using PCA) was then used to streamline the feature set, making visualizations intuitive and clusters cleaner. Throughout the process, visual explorations—such as correlation heatmaps and silhouette charts—guided our modeling decisions.

Three primary clustering techniques formed the backbone of our machine learning approach: KMeans, Agglomerative (hierarchical) clustering, and DBSCAN. KMeans offered a balance of scalability and explainability, with hyperparameter tuning (the optimal number of clusters) determined through maximization of the Silhouette Score. Agglomerative clustering provided an alternative, revealing content patterns with a hierarchical perspective. DBSCAN illustrated the presence of dense content subgroups and distinguished noise, capturing outliers as unique catalog segments. Each model’s cluster assignments and evaluation were augmented by rich visualizations and interpretability checks.

Critically, cluster interpretation went beyond the numbers: we examined the defining features of each cluster, connecting algorithmic outcomes to business sense—whether identifying family-oriented movies, binge-worthy drama series, or niche international selections. These groupings have direct implications for Netflix’s recommendation algorithms, catalog curation, targeted marketing, and audience analysis.

Looking toward deployment, the project also shows how trained models can be saved and reloaded for live use, which makes them straightforward to integrate into real-world streaming or analytical environments.

Overall, this project highlights the real-world process of end-to-end unsupervised learning, where technical rigor is matched with sharp business insight. Each data transformation, modeling step, and interpretation was chosen with a view toward maximizing the value delivered to both the platform and the end users, setting the stage for ongoing innovation in content personalization and discovery.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Analyze and cluster Netflix content using text-based and categorical metadata.  
Identify what content is available in which countries, test if Netflix has been increasingly focusing on TV shows, and group similar content to support business and recommendation needs.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/drive/MyDrive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look
display(df.head(3))

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(df.shape)

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)
plt.show()

### What did you know about your dataset?

- Contains metadata for Netflix content as of 2019.
- Key columns: type, title, director, cast, country, date_added, release_year, rating, duration, listed_in, description.
- Missing values are most severe in director, cast, and country.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)

In [None]:
# Dataset Describe
display(df.describe(include='all'))

### Variables Description

- show_id: Unique identifier
- type: TV Show or Movie (used in trends/EDA, not for clustering directly)
- title: Title of content
- director: Director name(s)
- cast: List of main actors
- country: Origin country/countries
- date_added: Date added to Netflix
- release_year: Year of original release
- rating: Age/Content rating
- duration: Minutes (Movies) or Seasons (TV Shows)
- listed_in: Genres
- description: Text summary



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Missing value imputation
for col in ['director','cast','country','rating']:
    df[col] = df[col].fillna('Unknown')

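# Split duration into numeric & units
# Raw strings avoid invalid-escape warnings; expand=False returns a Series rather than a one-column DataFrame
df['duration_num'] = df['duration'].str.extract(r'(\d+)', expand=False).astype(float)
df['duration_unit'] = df['duration'].str.extract(r'([a-zA-Z]+)', expand=False).fillna('Unknown')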

# Feature: Number of actors in cast
df['cast_count'] = df['cast'].apply(lambda x: len(x.split(', ')) if x!='Unknown' else 0)

### What all manipulations have you done and insights you found?

- Filled missing categorical variables with 'Unknown'.
- Extracted duration number and unit (important for clustering/multivariate analysis).
- Engineered cast_count, which may segment large ensemble films versus solo/limited-cast titles.
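
As a rough illustration of the two derived features above, here is a minimal pure-Python sketch of the same parsing logic. The helper names `parse_duration` and `count_cast` are hypothetical and exist only for this example; the notebook itself does the work with pandas string methods.

```python
import re

def parse_duration(text):
    """Split a duration string such as '90 min' or '2 Seasons' into (number, unit)."""
    match = re.search(r"(\d+)\s*([A-Za-z]+)", text or "")
    if match is None:
        # Mirrors the notebook's fallback: no number, unit marked as 'Unknown'
        return (None, "Unknown")
    return (float(match.group(1)), match.group(2))

def count_cast(cast):
    """Count comma-separated names; the 'Unknown' placeholder counts as zero."""
    if not cast or cast == "Unknown":
        return 0
    return len([name for name in cast.split(",") if name.strip()])

# Hypothetical sample values, not rows from the dataset
print(parse_duration("90 min"))
print(parse_duration("2 Seasons"))
print(count_cast("Actor A, Actor B, Actor C"))
print(count_cast("Unknown"))
```

Keeping the unit next to the number matters because minutes and seasons share one numeric column after extraction.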

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 Content Type Distribution

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='type', data=df)
plt.title('Content Type Distribution')
plt.xlabel('Type')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

- To quickly see the split between Movies and TV Shows on Netflix.

##### 2. What is/are the insight(s) found from the chart?

- Movies are slightly more numerous, but TV Shows are significant in number.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Signals Netflix must cater to both long-form and episodic audience preferences.

#### Chart - 2

In [None]:
# Chart - 2 Content Added Each Year by Type

df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
df['year_added'] = df['date_added'].dt.year
sns.countplot(x='year_added', data=df, hue='type')
plt.title('Content Added Each Year by Type')
plt.xlabel('Year Added')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

- To determine changes in content addition trends over time.

##### 2. What is/are the insight(s) found from the chart?

- TV Show additions have increased sharply since 2014, while movies have plateaued.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Informs Netflix's focus on serial content for subscription retention.

#### Chart - 3

In [None]:
# Chart - 3 Content Count by Age Rating

sns.countplot(y='rating', data=df, order=df['rating'].value_counts().index)
plt.title('Content by Age Rating')
plt.xlabel('Count')
plt.ylabel('Rating')
plt.show()

##### 1. Why did you pick the specific chart?

- To assess the maturity/audience target of most Netflix content.

##### 2. What is/are the insight(s) found from the chart?

- Mature and family ratings are most prevalent.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Aids in adjusting content mix for different audience segments.

#### Chart - 4

In [None]:
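# Chart - 4 Top Countries by Content

# Exclude the 'Unknown' placeholder added during wrangling so it does not rank as a country
top_countries = df.loc[df['country'] != 'Unknown', 'country'].value_counts().index[:7]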
sns.countplot(y='country', data=df[df['country'].isin(top_countries)], hue='type')
plt.title('Top Countries by Content Type')
plt.xlabel('Count')
plt.ylabel('Country')
plt.show()

##### 1. Why did you pick the specific chart?

- To identify major content-producing countries in the library.

##### 2. What is/are the insight(s) found from the chart?

- The US dominates, with the UK and India also contributing heavily.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Supports region-specific promotions and acquisitions.

#### Chart - 5

In [None]:
# Chart - 5 Top Genres

from collections import Counter
all_genres = [g.strip() for cell in df['listed_in'].dropna() for g in str(cell).split(',')]
genre_counts = Counter(all_genres)
top_genres = genre_counts.most_common(10)
genres, counts = zip(*top_genres)
plt.barh(genres, counts)
plt.title('Top 10 Genres')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.show()

##### 1. Why did you pick the specific chart?

- To reveal the most common genres available.

##### 2. What is/are the insight(s) found from the chart?

- Drama and comedy are the top genres.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Helps tailor recommendations and pinpoint content gaps.

#### Chart - 6

In [None]:
# Chart - 6 Duration Distribution for Movies

# Raw string avoids an invalid-escape warning; expand=False returns a Series
df['duration_num'] = df['duration'].str.extract(r'(\d+)', expand=False).astype(float)
sns.histplot(df[df['type']=='Movie']['duration_num'].dropna(), bins=30, kde=True)
plt.title('Movie Duration Distribution')
plt.xlabel('Duration (minutes)')
plt.show()

##### 1. Why did you pick the specific chart?

- To understand common movie lengths on the platform.

##### 2. What is/are the insight(s) found from the chart?

-  Most movies range from 80–120 minutes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Aligns acquisition and production with typical viewer expectations.

#### Chart - 7

In [None]:
# Chart - 7 Number of Seasons for TV Shows

sns.histplot(df[df['type']=='TV Show']['duration_num'].dropna(), bins=10, kde=True)
plt.title('TV Show: Number of Seasons Distribution')
plt.xlabel('Number of Seasons')
plt.show()

##### 1. Why did you pick the specific chart?

- To examine how long-running most series are.

##### 2. What is/are the insight(s) found from the chart?

- Most TV shows have one or two seasons.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Indicates scope for more multi-season series to enhance engagement.

#### Chart - 8

In [None]:
# Chart - 8 Country-wise Content by Type (Stacked Bar)

country_type = df[df['country'].isin(top_countries)].groupby(['country','type']).size().unstack().fillna(0)
country_type.plot(kind='bar', stacked=True)
plt.title('Country-wise Content by Type')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

- To compare TV and Movie contributions by country.

##### 2. What is/are the insight(s) found from the chart?

- Some countries specialize more in movies, others in TV.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Guides localization and acquisition by content type per region.

#### Chart - 9

In [None]:
# Chart - 9 Year-wise Content Release

sns.histplot(df['release_year'], bins=30, kde=True)
plt.title('Distribution of Content by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

- To check if content is generally recent or older.

##### 2. What is/are the insight(s) found from the chart?

- Content is biased toward post-2000 releases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Prioritizes acquiring new content to maintain library freshness.

#### Chart - 10

In [None]:
# Chart - 10 Distribution of Content Ratings Over Time

plt.figure(figsize=(12,6))
sns.countplot(x='release_year', hue='rating', data=df, order=sorted(df['release_year'].unique()), palette='tab20')
plt.title('Content Ratings Over Years')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.legend(loc='upper left', bbox_to_anchor=(1,1))
plt.show()

##### 1. Why did you pick the specific chart?

- To see how content rating mix evolves by year.

##### 2. What is/are the insight(s) found from the chart?

-  Mature-rated content has grown in recent years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Directs marketing to age-appropriate segments and subscribers.

#### Chart - 11

In [None]:
# Chart - 11 Director Frequency

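# Exclude the 'Unknown' placeholder so it does not appear as the most frequent director
top_directors = df.loc[df['director'] != 'Unknown', 'director'].value_counts().head(10)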
top_directors.plot(kind='barh')
plt.title('Top 10 Directors by Number of Titles')
plt.xlabel('Number of Titles')
plt.ylabel('Director')
plt.show()

##### 1. Why did you pick the specific chart?

- To highlight directors with the most titles.

##### 2. What is/are the insight(s) found from the chart?

- Few directors have multiple works; most are unique.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Reveals opportunities for exclusive deals with popular directors.

#### Chart - 12

In [None]:
# Chart - 12 Main Actor Frequency

from collections import Counter

# Skip the 'Unknown' placeholder so it is not counted as an actor
actors = [actor.strip() for cast in df['cast'].dropna() if cast != 'Unknown' for actor in str(cast).split(',')]
actor_counts = Counter(actors)
top_actors = actor_counts.most_common(10)
names, counts = zip(*top_actors)
plt.barh(names, counts)
plt.title('Top 10 Actors/Actresses')
plt.xlabel('Number of Titles')
plt.show()


##### 1. Why did you pick the specific chart?

-  To uncover which actors repeatedly appear on Netflix.

##### 2. What is/are the insight(s) found from the chart?

- A handful of actors are mainstays across several titles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Informs casting and promotes content with familiar faces.

#### Chart - 13

In [None]:
# Chart - 13 Cluster Visualization (PCA 2D)

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Minimal feature engineering for demonstration
features = ['release_year', 'duration_num']
X_cluster = df[features].fillna(0)
X_scaled = StandardScaler().fit_transform(X_cluster)

# KMeans for clustering (just for plot)—4 clusters
kmeans = KMeans(n_clusters=4, random_state=0).fit(X_scaled)
df['cluster'] = kmeans.labels_

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.scatter(X_pca[:,0], X_pca[:,1], c=df['cluster'])
plt.title('Content Clusters (PCA-reduced)')
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.show()


##### 1. Why did you pick the specific chart?

- To visually inspect how clustering separates content.

##### 2. What is/are the insight(s) found from the chart?

- Clusters tend to group by content era and length, the two features used for this demonstration clustering.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Facilitates strategizing around distinct user/content segments.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart - 14 Correlation Heatmap visualization code

import numpy as np

corr = df[['release_year', 'duration_num']].corr()
sns.heatmap(corr, annot=True, cmap='vlag')
plt.title('Feature Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

- To spot relationships among key numeric features.

##### 2. What is/are the insight(s) found from the chart?

- Release year and duration show low correlation, indicating feature diversity.

#### Chart - 15 - Pair Plot

In [None]:
# Chart - 15 Pair Plot visualization code

# Uses the 'cluster' labels from Chart 13 and the 'cast_count' feature from the wrangling step
import seaborn as sns
import matplotlib.pyplot as plt

# Make sure your DataFrame 'df' contains these columns and the 'cluster' column from KMeans or other clustering.
# Example features: 'release_year', 'duration_num', 'cast_count', 'cluster'

sns.pairplot(df[['release_year', 'duration_num', 'cast_count', 'cluster']], hue='cluster', palette='tab10')
plt.suptitle('Pair Plot Visualization of Netflix Dataset Features and Clusters', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

- To view all pairwise feature relationships and cluster separability.

##### 2. What is/are the insight(s) found from the chart?

- Some clusters separate well by year, duration, or cast count, while others overlap.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

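The three statements tested below are: (1) TV Shows and Movies differ in the years they were added to Netflix, (2) the distribution of content ratings depends on the country of origin, and (3) cast count differs between Movies and TV Shows.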

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- Null Hypothesis: There is no significant difference in the years when TV Shows and Movies were added to Netflix.

- Alternate Hypothesis: There is a significant difference in the years when TV Shows and Movies were added to Netflix.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import mannwhitneyu

tv = df[df['type'] == 'TV Show']['year_added'].dropna()
movie = df[df['type'] == 'Movie']['year_added'].dropna()
stat, p = mannwhitneyu(tv, movie)
print('Mann-Whitney U test p-value:', p)

##### Which statistical test have you done to obtain P-Value?

- Mann-Whitney U test.

##### Why did you choose the specific statistical test?

- The Mann-Whitney U test is a non-parametric test that compares two independent groups (TV Shows and Movies) without assuming normality. It checks whether addition years in one group tend to be larger than in the other, which is appropriate since the data may not be normally distributed.
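
To make the statistic less of a black box, the following is a minimal sketch of how the U values are derived from pooled ranks. It is a hand-rolled illustration, not SciPy's implementation; the function names and the sample years are assumptions made up for this example, and no p-value is computed here.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) computed from the rank sum of sample_a in the pooled data."""
    pooled = list(sample_a) + list(sample_b)
    ranks = average_ranks(pooled)
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b

# Hypothetical year_added values, for illustration only
tv_years = [2017, 2018, 2019, 2019]
movie_years = [2015, 2016, 2017, 2018]
print(mann_whitney_u(tv_years, movie_years))
```

A U value near zero or near `n_a * n_b` means one group almost always ranks above the other, while values near the middle indicate heavy overlap.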

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- Null Hypothesis: The distribution of content ratings is independent of the country of origin.

- Alternate Hypothesis: The distribution of content ratings depends on the country of origin.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import chi2_contingency

contingency = pd.crosstab(df['country'], df['rating'])
chi2, p, dof, expected = chi2_contingency(contingency)
print('Chi-squared test p-value:', p)

##### Which statistical test have you done to obtain P-Value?

- Chi-squared test of independence.

##### Why did you choose the specific statistical test?

- The Chi-squared test assesses the relationship between two categorical variables (country and rating) to determine if they are independent or associated.
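
The statistic behind this test can be sketched by hand: each cell's expected count under independence is its row total times its column total divided by the grand total. The function name and the small tables below are assumptions for illustration; this sketch applies no continuity correction, whereas `chi2_contingency` applies Yates' correction by default when there is one degree of freedom, so values for 2x2 tables can differ.

```python
def chi_squared_statistic(table):
    """Return (statistic, degrees_of_freedom) for a contingency table of counts.

    Assumes every row and column has a non-zero total, as a crosstab would produce.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical counts: rows are two countries, columns are two rating groups
associated = [[10, 20], [20, 10]]
independent = [[10, 20], [20, 40]]
print(chi_squared_statistic(associated))
print(chi_squared_statistic(independent))
```

With many sparse country rows, a large share of expected counts can fall below the usual rule-of-thumb threshold of five, so the resulting p-value should be read with some caution.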

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- Null Hypothesis: There is no significant difference in the number of actors (cast count) between Movies and TV Shows.

- Alternate Hypothesis: The number of actors (cast count) differs significantly between Movies and TV Shows.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import mannwhitneyu

cast_tv = df[df['type'] == 'TV Show']['cast_count']
cast_movie = df[df['type'] == 'Movie']['cast_count']
stat, p = mannwhitneyu(cast_tv.dropna(), cast_movie.dropna())
print('Mann-Whitney U test p-value:', p)

##### Which statistical test have you done to obtain P-Value?

- Mann-Whitney U test.

##### Why did you choose the specific statistical test?

- The Mann-Whitney U test is suitable for comparing the distribution of numerical (cast count) data between two independent groups when normal distribution cannot be assumed.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

for col in ['director', 'cast', 'country', 'rating']:
    df[col] = df[col].fillna('Unknown')

#### What all missing value imputation techniques have you used and why did you use those techniques?

- We used constant value imputation by filling missing values in the director, cast, country, and rating columns with the string "Unknown".

- This is appropriate because these fields are categorical, and using a placeholder retains all records for analysis rather than dropping them (which could reduce our dataset size and bias our results). By labeling missing entries as "Unknown", we make it clear to any analytical model that the information is not available, while still preserving the full context of the dataset.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# Cap duration_num at its 99th percentile (NaN values are left untouched)
# Note: the limit is computed over movie minutes and TV seasons together
upper_limit = df['duration_num'].quantile(0.99)
df['duration_num'] = df['duration_num'].clip(upper=upper_limit)

##### What all outlier treatment techniques have you used and why did you use those techniques?

- We used capping (also known as winsorization) for the duration_num feature. This technique sets any value above the 99th percentile to the value at the 99th percentile.

Why?
- Capping helps to reduce the influence of these outliers on statistical analysis and machine learning models, leading to more robust and reliable results.
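
For readers who want to see the mechanics, here is a small pure-Python sketch of upper-tail capping. The `percentile` helper uses linear interpolation between order statistics, which is also the default interpolation of pandas' `quantile`; the function names and the sample durations are assumptions for illustration.

```python
import math

def percentile(values, q):
    """Linearly interpolated quantile for q in [0, 1]."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

def cap_upper(values, q=0.99):
    """Replace every value above the q-th quantile with that quantile."""
    limit = percentile(values, q)
    return [min(v, limit) for v in values], limit

# Hypothetical movie lengths in minutes with one extreme value
durations = list(range(60, 160)) + [312]
capped, limit = cap_upper(durations)
print(limit, max(capped), len(capped) == len(durations))
```

No record is dropped: the extreme value is pulled down to the limit while every other value stays as it was.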

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

from sklearn.preprocessing import LabelEncoder

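# Write the codes to new *_enc columns so the original labels stay available
# (rating is still needed as text for the rating buckets built later)
for col in ['country', 'rating', 'duration_unit']:
    le = LabelEncoder()
    df[col + '_enc'] = le.fit_transform(df[col])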

#### What all categorical encoding techniques have you used & why did you use those techniques?

- We used Label Encoding on the country, rating, and duration_unit columns.

Why?
- Label encoding maps each unique category to an integer code, which keeps these columns compact and numeric. The codes carry an arbitrary order, though, so they suit tree-based models better than distance-based clustering; for nominal features that enter the clustering feature matrix, one-hot encoding (as used for rating_bucket below) avoids implying that one category is "closer" to another.
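
One caveat worth keeping in mind with integer codes is that distance-based methods read them as positions on a line. The sketch below compares Euclidean distances under label codes and under one-hot vectors for three country names; the variable names are illustrative, and the alphabetical code assignment mirrors how `LabelEncoder` orders its classes.

```python
import math

countries = ["India", "United Kingdom", "United States"]

# Label encoding: categories sorted alphabetically, then numbered 0, 1, 2
label_codes = {name: code for code, name in enumerate(sorted(countries))}

# One-hot encoding: one indicator position per category
one_hot = {
    name: [1.0 if i == code else 0.0 for i in range(len(countries))]
    for name, code in label_codes.items()
}

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for first, second in [("India", "United Kingdom"), ("India", "United States")]:
    label_dist = abs(label_codes[first] - label_codes[second])
    onehot_dist = euclidean(one_hot[first], one_hot[second])
    print(first, "vs", second, "| label:", label_dist, "| one-hot:", round(onehot_dist, 3))
```

Under label codes India appears twice as far from the United States as from the United Kingdom, purely because of alphabetical order; under one-hot encoding all pairs are equally distant.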

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

import re
contractions_dict = {"can't": "cannot", "won't": "will not", "n't": " not"}
def expand_contractions(text):
    for contraction, expanded in contractions_dict.items():
        text = re.sub(contraction, expanded, text)
    return text
df['description'] = df['description'].apply(lambda x: expand_contractions(str(x)))
# Display a sample
print("Sample after expanding contraction:", df['description'].iloc[0])


#### 2. Lower Casing

In [None]:
# Lower Casing

df['description'] = df['description'].str.lower()
print("Sample after lowercasing:", df['description'].iloc[0])

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

import string
df['description'] = df['description'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
print("Sample after removing punctuations:", df['description'].iloc[0])

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words that contain digits

import re
# In a raw string a single backslash is enough: \S matches any non-whitespace character
df['description'] = df['description'].apply(lambda x: re.sub(r'http\S+|www\.\S+', '', x))
df['description'] = df['description'].apply(lambda x: ' '.join([word for word in str(x).split() if not any(ch.isdigit() for ch in word)]))
print("Sample after removing URLs and words with digits:", df['description'].iloc[0])

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords and White spaces

nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
df['description'] = df['description'].apply(lambda x: ' '.join([word for word in str(x).split() if word not in stop]))
df['description'] = df['description'].apply(lambda x: ' '.join(x.split()))
print("Sample after removing stopwords & whitespaces:", df['description'].iloc[0])


#### 6. Rephrase Text

In [None]:
# Rephrase Text

# This is typically skipped or needs a function/model; as a placeholder:
# df['description'] = df['description'].apply(your_rephrase_function)
# Display sample (will be unchanged unless a function is applied)
print("Sample after rephrasing (if applied):", df['description'].iloc[0])

#### 7. Tokenization

In [None]:
# Tokenization

df['description_tokens'] = df['description'].apply(lambda x: x.split())
print("Sample after tokenization:", df['description_tokens'].iloc[0])

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
df['description'] = df['description'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in str(x).split()]))
print("Sample after lemmatization:", df['description'].iloc[0])

##### Which text normalization technique have you used and why?

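- We used lemmatization (NLTK's WordNetLemmatizer), which maps inflected word forms to a dictionary base form. Unlike stemming, it returns real words, so the vocabulary passed to vectorization stays readable.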

#### 9. Part of speech tagging

In [None]:
# POS Tagging

nltk.download('averaged_perceptron_tagger_eng')
df['description_pos'] = df['description'].apply(lambda x: nltk.pos_tag(x.split()))
print("Sample POS tagging:", df['description_pos'].iloc[0])

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=10)
tfidf_matrix = vectorizer.fit_transform(df['description'].fillna(''))
# Show the first vectorized output as a sample
print("Sample TF-IDF vector:", tfidf_matrix[0].toarray())

##### Which text vectorization technique have you used and why?

- We used TF-IDF vectorization for text features. TF-IDF highlights important terms in the document relative to their frequency across the dataset, making it well-suited for text clustering, topic modeling, and machine learning applications because it reduces the influence of very common words and surfaces the most distinctive terms.
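
The weighting can be illustrated on a toy corpus. The sketch below uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which matches scikit-learn's documented defaults for `TfidfVectorizer`; it skips tokenization details and stop-word removal, and the function name and sample descriptions are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return (vocabulary, vectors) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocabulary, vectors

# Hypothetical cleaned descriptions, not rows from the dataset
docs = [
    "detective solves murder",
    "family comedy adventure",
    "detective family drama",
]
vocab, vectors = tfidf(docs)
first = dict(zip(vocab, vectors[0]))
print({term: round(weight, 3) for term, weight in first.items() if weight > 0})
```

In the first document, "murder" receives a higher weight than "detective" because it occurs in fewer documents, which is the behaviour that makes TF-IDF useful for separating titles by their distinctive terms.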

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# Example: Create a new feature 'content_age' (how many years ago released)
import datetime
current_year = datetime.datetime.now().year
df['content_age'] = current_year - df['release_year']

# Sample after creating 'content_age'
print("Sample - content_age:", df[['release_year', 'content_age']].head(1))

# Example: Combine or bucketize ratings to reduce cardinality
def bucket_rating(r):
    # Simple grouping (update as needed for your dataset)
    if r in ['G', 'PG']: return 'Family'
    elif r in ['PG-13', 'TV-14']: return 'Teen'
    elif r in ['R', 'NC-17', 'TV-MA']: return 'Adult'
    else: return 'Other'
if 'rating' in df.columns and df['rating'].dtype == object:
    df['rating_bucket'] = df['rating'].apply(bucket_rating)
else:
    # If already encoded, skip or provide object mapping
    df['rating_bucket'] = 'Unknown'

print("Sample - rating_bucket:", df[['rating', 'rating_bucket']].head(1))


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

# Prepare your feature columns (add your important engineered columns here)
feature_cols = ['release_year', 'duration_num', 'content_age', 'cast_count']
if 'rating_bucket' in df.columns:
    feature_cols.append('rating_bucket')

# For demonstration, one-hot encode rating_bucket if created
if 'rating_bucket' in feature_cols:
    df = pd.get_dummies(df, columns=['rating_bucket'], drop_first=True)
    # Update features list to include new dummies
    feature_cols = [col for col in df.columns if 'rating_bucket' in col] + [c for c in feature_cols if c != 'rating_bucket']

# Set up feature matrix X
X = df[feature_cols].fillna(0)

# Print feature set sample
print("Sample of selected features:\n", X.head(1))


##### What all feature selection methods have you used and why?

- Domain knowledge: Selected features known to affect streaming patterns.

- Correlation analysis: To remove redundant features.

- One-hot encoding for non-ordinal categories helps prevent misleading numerical relationships.
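
A quick way to see what correlation analysis would flag: `content_age` is computed as a fixed year minus `release_year`, so the two columns are an exact linear function of each other. The sketch below uses hypothetical release years and a fixed reference year (an assumption, chosen only to keep the example reproducible).

```python
import numpy as np

# Hypothetical release years; the reference year is fixed for reproducibility.
release_year = np.array([1998, 2005, 2012, 2016, 2019, 2021])
content_age = 2024 - release_year

# Pearson correlation between the original and the derived column.
corr = np.corrcoef(release_year, content_age)[0, 1]
print(f"Pearson correlation: {corr:.3f}")
```

The correlation comes out at -1, which means either column can be dropped without losing information for distance-based clustering.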

##### Which all features you found important and why?

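- release_year/content_age: Reflects how recent content is; newer shows/movies may attract more users. Because content_age is computed as current_year - release_year, the two columns are perfectly correlated and carry the same information, so one of them is sufficient.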

- duration_num: Indicates the typical length of content, which varies by audience preference.

- cast_count: Larger casts can indicate ensemble shows or big productions, influencing popularity and genre.

- rating_bucket: Helps model the type of audience the show/movie targets.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

- Yes, data transformation is necessary in this project. Many features such as release_year, duration_num, cast_count, and engineered features like content_age can have different scales, ranges, or distributions. Machine learning algorithms perform best when numeric features are on similar scales.

Why?
- I used standard scaling (z-score normalization) so that each feature has a mean of 0 and a standard deviation of 1. This transformation minimizes the dominance of features with larger scales and helps improve model convergence and clustering results.
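
As a small worked example of the z-score formula, z = (x - mean) / std, the sketch below standardizes a hypothetical column of durations by hand. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical durations in minutes.
duration = np.array([45.0, 90.0, 100.0, 120.0, 145.0])

# Population standard deviation (ddof=0), matching StandardScaler.
mean = duration.mean()
std = duration.std(ddof=0)
z = (duration - mean) / std

print("mean:", mean, "std:", round(float(std), 3))
print("z-scores:", np.round(z, 3))
```

After the transformation the column has mean 0 and standard deviation 1, regardless of its original units.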

In [None]:
# Transform Your data

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Display a sample of transformed data
import pandas as pd
print("Sample of transformed/scaled data:\n", pd.DataFrame(X_scaled, columns=X.columns).head(1))

### 6. Data Scaling

In [None]:
# Scaling your data

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Display a sample of scaled data
import pandas as pd
print("Sample of scaled data:\n", pd.DataFrame(X_scaled, columns=X.columns).head(1))

##### Which method have you used to scale your data and why?

- I used StandardScaler (z-score normalization) to scale the data. This method transforms each feature so that it has a mean of 0 and a standard deviation of 1.

Why?
- It ensures all features contribute equally and prevents features with large ranges from dominating the distance calculations in clustering or affecting convergence in many machine learning models.
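
The effect on distance calculations can be illustrated with a few hypothetical rows. In raw units, a 5-minute duration gap counts for less than a 20-minute gap even when the cast sizes are wildly different; after z-scoring, the nearest neighbour of the first row changes. All values below are made up for illustration.

```python
import numpy as np

# Hypothetical titles: columns are [duration_num, cast_count].
X_demo = np.array([
    [90.0, 1.0],    # A
    [95.0, 10.0],   # B: almost the same duration, very different cast size
    [110.0, 1.0],   # C: 20 minutes longer, identical cast size
    [200.0, 5.0],   # D: included only to widen the duration spread
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw units: duration differences dominate because the column has a wider range.
raw_ab, raw_ac = dist(X_demo[0], X_demo[1]), dist(X_demo[0], X_demo[2])

# After z-scoring each column, both features are measured in standard deviations.
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
z_ab, z_ac = dist(Z[0], Z[1]), dist(Z[0], Z[2])

print(f"raw:    d(A,B)={raw_ab:.2f}  d(A,C)={raw_ac:.2f}")
print(f"scaled: d(A,B)={z_ab:.2f}  d(A,C)={z_ac:.2f}")
```

In raw units B looks closest to A; once both columns are scaled, C is closest, because the cast-size difference is no longer drowned out.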

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction can be very helpful, especially if your dataset has a large number of features. Reducing the number of dimensions:

- Makes clustering and visualization much clearer.

- Helps reduce noise and improve computation time.

In [None]:
# Dimensionality Reduction (If needed)

from sklearn.decomposition import PCA

# Apply PCA to reduce to two principal components for visualization or clustering
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Display a sample of reduced data
import pandas as pd
print("Sample of data after PCA dimensionality reduction:\n", pd.DataFrame(X_pca, columns=['PC1', 'PC2']).head(1))

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

- I used Principal Component Analysis (PCA).
- PCA is effective for numeric or encoded features and finds new axes (principal components) that capture the highest variance in data, condensing information into fewer dimensions.
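
Before relying on two components, it helps to check how much variance they actually retain. The sketch below computes the explained variance ratio from scratch via SVD on a synthetic matrix (an assumption standing in for `X_scaled`); scikit-learn's `PCA` exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a scaled feature matrix: 5 columns, two of which
# are near-copies of other columns, so PCA has something to compress.
base = rng.normal(size=(200, 3))
X_demo = np.column_stack([
    base,
    base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1] + 0.1 * rng.normal(size=200),
])
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# PCA via SVD of the centered data: squared singular values are
# proportional to the variance captured by each principal component.
_, singular_values, _ = np.linalg.svd(X_demo, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

print("explained variance ratio:", np.round(explained_ratio, 3))
print("cumulative:", np.round(cumulative, 3))
```

If the cumulative ratio for the first two components is low, the 2-D projection is still useful for plotting, but clustering on it would discard a sizeable share of the information.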


### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

from sklearn.model_selection import train_test_split

# For unsupervised problems like clustering, you may only split X_scaled.
X_train, X_test = train_test_split(X_scaled, test_size=0.2, random_state=42)

# Display the shapes as sample output
print("Train set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

##### What data splitting ratio have you used and why?

- I used an 80:20 train-test split—80% of the data for training and 20% for testing.

- An 80:20 split is a common convention rather than a fixed rule. It leaves a large enough training set to fit stable clusters while holding out enough data to check that the cluster structure also holds for titles the model was not fitted on.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

- In unsupervised learning (such as Netflix content clustering), imbalance is typically less relevant, as there may not be a target label to balance. However, if we have categorical labels like genres or countries and we plan to use these as a target, checking distribution is important.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

- Balance not needed.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation ( KMeans Clustering )

# Import libraries
from sklearn.cluster import KMeans

# Fit the Algorithm
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X_scaled)

# Predict on the model (assign each row to its nearest centroid)
clusters = kmeans.predict(X_scaled)
print("Sample cluster assignments:", clusters[:10])

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

- KMeans partitions the dataset into k clusters. Performance is measured by the Silhouette Score:

In [None]:
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.pyplot as plt

# Visualizing evaluation Metric Score chart
score = silhouette_score(X_scaled, clusters)
print("Silhouette Score:", score)

values = silhouette_samples(X_scaled, clusters)
plt.hist(values, bins=20)
plt.xlabel("Silhouette Coefficient Values")
plt.ylabel("Frequency")
plt.title("Silhouette Score Distribution")
plt.show()
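
For reference, the silhouette coefficient of a single point is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below works this out by hand on six hypothetical 1-D points; `silhouette_samples` computes the same per-point quantity (with Euclidean distance by default).

```python
import numpy as np

# Hypothetical 1-D points in two clusters.
points = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_of(i):
    """s(i) = (b - a) / max(a, b) for the point at index i."""
    dists = np.abs(points - points[i])
    same = labels == labels[i]
    same[i] = False                  # exclude the point itself
    a = dists[same].mean()           # mean distance within its own cluster
    b = min(                         # mean distance to the nearest other cluster
        dists[labels == other].mean()
        for other in set(labels.tolist())
        if other != labels[i]
    )
    return (b - a) / max(a, b)

scores = np.array([silhouette_of(i) for i in range(len(points))])
print("per-point silhouette:", np.round(scores, 3))
print("mean silhouette:", round(float(scores.mean()), 3))
```

For the first point, a = 1.5 and b = 10, giving s = 0.85. Values near 1 mean the point sits well inside its cluster; values near 0 or below suggest it lies between clusters or is misassigned.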

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit KMeans for each candidate k and record the silhouette score
sil_scores = []
k_range = range(2, 10)
for k in k_range:
    model = KMeans(n_clusters=k, random_state=42)
    labels = model.fit_predict(X_scaled)
    sil_scores.append(silhouette_score(X_scaled, labels))

# Plot the silhouette score against k
plt.plot(k_range, sil_scores, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for different k')
plt.show()

##### Which hyperparameter optimization technique have you used and why?

- I used a manual grid search: a loop over n_clusters (k) from 2 to 9, keeping the k with the highest silhouette score. n_clusters is the core hyperparameter affecting cluster quality for KMeans.
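
A search like this is only complete once the winning k is read off and a final model is refitted with it. The following is a minimal sketch of that last step on synthetic blobs (an assumption standing in for `X_scaled`); the names `scores_by_k`, `best_k`, and `final_model` are illustrative and not part of the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled: three well-separated blobs.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.8,
    random_state=42,
)

# Score every candidate k.
scores_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores_by_k[k] = silhouette_score(X_demo, labels)

# Pick the k with the highest silhouette score and refit with it.
best_k = max(scores_by_k, key=scores_by_k.get)
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X_demo)
print("best k:", best_k)
```

On real data the curve is often flatter than on blobs, so it is worth checking whether the best k is a clear peak or just marginally ahead of its neighbours.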

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

- Yes, selecting the k with the highest silhouette score improved separation between clusters, as seen in the updated chart.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

- Agglomerative Clustering is a bottom-up hierarchical method: each title starts as its own cluster, and the two closest clusters are merged repeatedly until n_clusters remain. Performance is measured with the Silhouette Score and visualized as a histogram of per-sample silhouette coefficients.

In [None]:
# ML Model - 2 Implementation (Agglomerative Clustering)

from sklearn.cluster import AgglomerativeClustering

# Fit the Algorithm and assign cluster labels
agglo = AgglomerativeClustering(n_clusters=4)
agglo_labels = agglo.fit_predict(X_scaled)
print("Sample Agglomerative clustering labels:", agglo_labels[:10])

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.metrics import silhouette_score, silhouette_samples

#plot graph
agglo_score = silhouette_score(X_scaled, agglo_labels)
print("Silhouette Score (Agglomerative):", agglo_score)
plt.hist(silhouette_samples(X_scaled, agglo_labels), bins=20)
plt.title("Agglomerative Clustering Silhouette Score")
plt.xlabel("Silhouette Coefficient")
plt.ylabel("Frequency")
plt.show()

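##### Which hyperparameter optimization technique have you used and why?

- As with KMeans, the hyperparameter to tune is the number of clusters, chosen to maximize the silhouette score. The cells above evaluate a single setting, n_clusters=4.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

- I did not compare other settings for this model, so there is no improvement to report beyond the n_clusters=4 score shown above.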
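
Tuning for agglomerative clustering could follow the same pattern as for KMeans, with the linkage criterion as a second knob. The sketch below is one possible way to do it, run on synthetic blobs (an assumption standing in for `X_scaled`); the candidate ranges are illustrative choices, not values taken from this project.

```python
from itertools import product

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled.
X_demo, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=0.8,
    random_state=0,
)

# Try every combination of cluster count and linkage criterion.
results = []
for n_clusters, linkage in product(range(2, 7), ["ward", "complete", "average"]):
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
    labels = model.fit_predict(X_demo)
    results.append((silhouette_score(X_demo, labels), n_clusters, linkage))

# Tuples compare by their first element, so max() picks the best score.
best_score, best_n, best_linkage = max(results)
print(f"best: n_clusters={best_n}, linkage={best_linkage}, score={best_score:.3f}")
```

Agglomerative clustering has no `predict` method for new rows, so the tuned labels describe the fitted data only.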

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

- A higher Silhouette Score means clusters are clear and actionable, enabling tailored recommendations, better user engagement segmentation, and content strategy.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

from sklearn.cluster import DBSCAN

# Fit the Algorithm and assign labels (DBSCAN marks noise points as -1)
dbscan = DBSCAN(eps=1.2, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)
print("Sample DBSCAN cluster labels:", dbscan_labels[:10])

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

from sklearn.metrics import silhouette_score
import numpy as np

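# DBSCAN labels noise points as -1; exclude them before scoring
mask = dbscan_labels != -1
n_clusters_found = len(np.unique(dbscan_labels[mask]))

# silhouette_score raises an error unless there are at least two clusters
if n_clusters_found >= 2:
    dbscan_score = silhouette_score(X_scaled[mask], dbscan_labels[mask])
    print("Silhouette Score (DBSCAN):", dbscan_score)
    print("Clusters found:", n_clusters_found, "| noise points:", int((~mask).sum()))
else:
    print("DBSCAN found fewer than two clusters; silhouette score is undefined.")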

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
import numpy as np
# Note: this cell does not tune DBSCAN. It ranks features by how far the KMeans
# centroids sit from zero (the overall mean after scaling), summed over clusters.
cluster_centers = kmeans.cluster_centers_
important_features = np.argsort(np.abs(cluster_centers).sum(axis=0))[::-1]
print("Most differentiating features among clusters (top 5):", [X.columns[i] for i in important_features[:5]])

##### Which hyperparameter optimization technique have you used and why?

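- DBSCAN does not take a number of clusters. Its hyperparameters are eps (the neighborhood radius) and min_samples (the minimum number of neighbors for a core point), which would be tuned to maximize the silhouette score over non-noise points. The model above uses eps=1.2 and min_samples=5.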
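
For DBSCAN, the parameters that can be searched are `eps` and `min_samples`. The sketch below shows one way such a search might look, on synthetic blobs (an assumption standing in for `X_scaled`). The candidate values are illustrative only. It also records the noise fraction, since a high silhouette score can be obtained simply by labelling most points as noise.

```python
from itertools import product

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=0.8,
    random_state=0,
)

results = []
for eps, min_samples in product([0.3, 0.6, 1.0, 1.5], [5, 10]):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo)
    mask = labels != -1                      # drop noise points
    n_found = len(set(labels[mask]))
    if n_found < 2:
        continue                             # silhouette needs two or more clusters
    results.append({
        "eps": eps,
        "min_samples": min_samples,
        "clusters": n_found,
        "noise_frac": float(1 - mask.mean()),
        "silhouette": silhouette_score(X_demo[mask], labels[mask]),
    })

best = max(results, key=lambda r: r["silhouette"])
print(best)
```

When two settings score similarly, the one with the lower `noise_frac` is usually the more useful segmentation, because fewer titles are left unassigned.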

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

- Alternatively, dimensionality-reduced (PCA) plots help explain what features most distinguish clusters.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

- Silhouette Score: a high value indicates cohesive, well-separated content segments, which targeted marketing, recommendations, and data-driven content strategy depend on.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

- I selected the clustering model that yielded the highest silhouette score, with cluster profiles interpretable for business decisions.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

- For clustering, feature importance is less straightforward than for supervised models. For KMeans, the cluster centroids can be reviewed to interpret dominant features, as in the centroid-ranking cell under ML Model - 3 above.
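
Centroids are easier to read in original units than in z-scores. A minimal sketch, assuming a fitted `StandardScaler` and `KMeans` are available: `inverse_transform` maps the centroids back to minutes and cast counts. The data and the names `demo_scaler` and `demo_kmeans` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with columns mimicking duration_num and cast_count.
feature_names = ["duration_num", "cast_count"]
rng = np.random.default_rng(0)
X_raw = np.vstack([
    rng.normal([45, 5], [5, 1], size=(50, 2)),      # short titles, small casts
    rng.normal([120, 12], [10, 2], size=(50, 2)),   # long titles, larger casts
])

demo_scaler = StandardScaler().fit(X_raw)
demo_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
demo_kmeans.fit(demo_scaler.transform(X_raw))

# Centroids live in z-score space; map them back to original units.
centroids_raw = demo_scaler.inverse_transform(demo_kmeans.cluster_centers_)
for idx, row in enumerate(centroids_raw):
    summary = ", ".join(f"{name}={value:.1f}" for name, value in zip(feature_names, row))
    print(f"cluster {idx}: {summary}")
```

Reading the rows side by side gives a plain-language profile of each cluster, for example "short titles with small casts" versus "long titles with larger casts".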

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

import pickle

# Assume your best clustering model is called `kmeans`
with open('best_model.pkl', 'wb') as f:
    pickle.dump(kmeans, f)  # Replace with your best model variable if different

print("Model saved successfully as 'best_model.pkl'")

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
with open('best_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# Demonstration only: the first 5 rows were part of the fitting data, so they are not truly unseen.
X_new = X_scaled[:5]  # Replace with your own new data after the same preprocessing and scaling

predictions = loaded_model.predict(X_new)
print("Predictions on unseen data using loaded model:", predictions)
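
One caveat for deployment: the saved KMeans model expects inputs that were scaled with the same fitted scaler. A possible way to avoid shipping the two separately is to wrap both in a scikit-learn `Pipeline` and pickle that instead. The sketch below uses synthetic data and an in-memory pickle (`pickle.dumps`) purely for illustration; writing to a file with `pickle.dump` works the same way.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled stand-in for the feature matrix: two groups of rows.
rng = np.random.default_rng(0)
X_raw = np.vstack([
    rng.normal(0, 1, size=(50, 3)),
    rng.normal(8, 1, size=(50, 3)),
])

# Scaler and clusterer travel together, so new data is scaled consistently.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
pipeline.fit(X_raw)

# Round-trip through pickle, as a deployment step would.
restored = pickle.loads(pickle.dumps(pipeline))

# New rows are passed in raw units; the pipeline applies the fitted scaler first.
X_new_raw = np.array([[0.2, -0.1, 0.3], [7.9, 8.2, 8.1]])
new_labels = restored.predict(X_new_raw)
print("Cluster labels for new rows:", new_labels)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.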

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Working on this Netflix content analysis project was truly eye-opening. From wrangling the raw, sometimes messy dataset to finally grouping movies and shows into meaningful clusters, the journey was filled with both challenges and aha-moments. It’s fascinating to see how much we can learn and discover from the data that sits just beneath the surface: hidden patterns, surprising segmentations, and plenty of creative opportunities for product teams and viewers alike. As machine learning and data science continue to shape how we watch and enjoy our favorite titles, it’s rewarding to know that these efforts can lead directly to improved recommendations, a more engaging catalog, and a richer viewing experience for everyone. This project isn’t just about algorithms and accuracy—it’s about making streaming smarter, more personal, and a little more fun.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***