<a href="https://colab.research.google.com/github/Nandani-Pachori-20/Netflix-Movies-And-TV-Shows-Clustering/blob/main/Netflix_Movies_and_TV_Shows_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Netflix Movies and TV Shows Clustering



##### **Project Type**    - EDA/Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -** Nandani Pachori


# **Project Summary -**

The primary objective of this project was to explore, analyze, and apply machine learning techniques to the Netflix Movies and TV Shows dataset to identify hidden patterns, categorize content accurately, and assist Netflix in enhancing its recommendation and classification systems. This project also aimed to perform clustering and classification to uncover meaningful structures within the data that can help Netflix deliver more personalized content to its users.

📌 Dataset Overview
The dataset provided crucial metadata about Netflix’s content library, including features such as:

Title, Type (Movie or TV Show)

Director, Cast, Country

Date Added, Release Year, Duration

Genres (Listed In), Ratings, and Content Descriptions

The dataset initially contained several challenges, including missing values, inconsistent duration formats, and unstructured text in the description and genre columns, which required thorough cleaning and preprocessing.

🔍 Exploratory Data Analysis (EDA)
We began with a detailed exploratory data analysis to understand the data distribution:

Count and percentage of Movies vs. TV Shows

Country-wise distribution of content

Year-wise trend of content release

Popularity of genres over time

Distribution of different ratings like TV-MA, PG, TV-14, etc.

We visualized the data using bar charts, pie charts, and heatmaps to identify trends and correlations, which provided valuable insights into Netflix’s content strategy and global reach.

✨ Text Preprocessing and Feature Engineering
One of the most critical steps was handling the textual data:

Cleaned descriptions by lowercasing, removing punctuation, numbers, and special characters.

Applied tokenization, lemmatization, and POS tagging for meaningful text processing.

Removed stopwords to reduce noise and improve feature quality.

We created new features such as:

Duration in minutes for consistent numerical analysis.

Year groups to capture release trends.

Categorical encodings for genres, ratings, and type.

We handled missing data carefully by filling with meaningful placeholders or using appropriate transformation techniques.

🔄 Dimensionality Reduction
Due to the high-dimensional feature space created after encoding and TF-IDF vectorization, we applied Principal Component Analysis (PCA) to reduce the dimensions and make the dataset computationally efficient while retaining most of the variance.

🤖 Machine Learning Model Implementation
We implemented three supervised classification models to predict content type (Movie or TV Show):

Logistic Regression

Decision Tree Classifier

Random Forest Classifier

Each model was evaluated using accuracy, precision, recall, and F1-Score.

Random Forest Classifier emerged as the best-performing model with:

Accuracy: 87%

Precision: 89%

Recall: 85%

F1-Score: 87%

The Random Forest model’s performance was visualized using confusion matrices and metric score charts.

🔍 Hyperparameter Tuning
We further optimized the models using GridSearchCV, which significantly improved the Random Forest’s precision and recall. Hyperparameter tuning helped in building a more generalized and reliable model by avoiding overfitting.

🎯 Business Impact & Final Model Selection
The final model can automate Netflix’s content classification, reducing manual effort and increasing scalability.

It can assist in personalized recommendations by effectively categorizing user preferences.

Understanding feature importance from the Random Forest helps Netflix focus on key drivers like duration, genres, and content description for targeting and recommendation systems.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The aim of this project is to apply unsupervised machine learning techniques to cluster Netflix Movies and TV Shows based on their genres, types, and other important attributes. This will help Netflix and similar streaming platforms in content recommendation, categorization, and improving user experience through content discovery.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset


from google.colab import drive
drive.mount('/content/drive')



In [None]:
# Load dataset
df = pd.read_csv('/content/drive/My Drive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look

df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

print(f"Total Rows: {df.shape[0]}")
print(f"Total Columns: {df.shape[1]}")


### Dataset Information

In [None]:
# Dataset Info

df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

df.isnull().sum()

In [None]:
# Visualizing the missing values

plt.figure(figsize=(10,6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()


### What did you know about your dataset?

The dataset contains information about Netflix shows and movies including title, genre, type, release year, and ratings. It has some missing values and duplicates which need to be cleaned before moving to analysis.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

df.columns

In [None]:
# Dataset Describe

df.describe(include = 'all')

### Variables Description

- show_id: Unique ID of each show
- type: Movie or TV Show
- title: Name of the show
- director: Director's name
- cast: Cast members
- country: Country of production
- date_added: When it was added to Netflix
- release_year: Release year
- rating: Audience rating (like TV-MA, PG, etc.)
- duration: Duration of the show/movie
- listed_in: Genres
- description: Short description of the content


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for col in df.columns:
    print(f'{col}: {df[col].nunique()} unique values')


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Dropping duplicates
df.drop_duplicates(inplace=True)

# Handling missing values
df['director'].fillna('Unknown', inplace=True)
df['cast'].fillna('Unknown', inplace=True)
df['country'].fillna('Unknown', inplace=True)
df['rating'].fillna(df['rating'].mode()[0], inplace=True)
df['duration'].fillna('Unknown', inplace=True)
df['date_added'].fillna('Unknown', inplace=True)


### What all manipulations have you done and insights you found?

- Removed duplicate rows
- Replaced missing director, cast, and country with 'Unknown'
- Filled missing ratings with the most common rating
- Dataset is now clean and ready for analysis.


## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 Content Count by Type (Movie/TV Show)

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(6,4))
sns.countplot(x='type', data=df, palette='Set2')
plt.title('Count of Movies and TV Shows on Netflix')
plt.xlabel('Type')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

To understand the distribution of content types on Netflix

##### 2. What is/are the insight(s) found from the chart?

Netflix has more Movies than TV Shows

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, focusing on movies can attract more users, but increasing TV Shows variety may improve user retention.

#### Chart - 2

In [None]:
# Chart - 2 Content Addition Over Years

df['date_added'] = pd.to_datetime(df['date_added'], errors = 'coerce')
df['year_added'] = df['date_added'].dt.year

plt.figure(figsize=(12,6))
sns.countplot(x='year_added', data=df, palette='coolwarm', order=sorted(df['year_added'].dropna().unique()))
plt.title('Number of Contents Added Each Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

To analyze Netflix's growth over time.

##### 2. What is/are the insight(s) found from the chart?

Peak content addition happened in recent years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Netflix has rapidly increased its content library to retain and attract users.

#### Chart - 3

In [None]:
# Chart - 3 Top 10 Countries Producing Netflix Content

top_countries = df['country'].value_counts().head(10)

plt.figure(figsize=(12,6))
sns.barplot(x=top_countries.index, y=top_countries.values, palette='Set3')
plt.title('Top 10 Countries by Content Production')
plt.xlabel('Country')
plt.ylabel('Number of Contents')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

To identify key content-producing countries.

##### 2. What is/are the insight(s) found from the chart?

USA produces the most Netflix content, followed by India.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Focus on regional content may increase local subscriptions.

#### Chart - 4

In [None]:
# Chart - 4 Genre Distribution (Univariate)

plt.figure(figsize=(10,6))
df['listed_in'].value_counts().head(10).plot(kind='barh', color='skyblue')
plt.title('Top 10 Genres on Netflix')
plt.xlabel('Count')
plt.ylabel('Genres')
plt.show()


##### 1. Why did you pick the specific chart?

To see what genres dominate Netflix.

##### 2. What is/are the insight(s) found from the chart?

Dramas and International Movies are the most common.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Suggests high user interest in these genres.

#### Chart - 5

In [None]:
# Chart - 5 Movie Duration Distribution

movies_df = df[df['type'] == 'Movie']
movies_df['duration_num'] = movies_df['duration'].str.replace(' min', '').astype(float)

plt.figure(figsize=(10,6))
sns.histplot(movies_df['duration_num'], bins=30, color='purple')
plt.title('Movie Duration Distribution')
plt.xlabel('Duration (minutes)')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

To understand typical movie lengths.


##### 2. What is/are the insight(s) found from the chart?

Most Movies are between 80 to 100 minutes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Producing content in this range may align with viewer preferences.

#### Chart - 6

In [None]:
# Chart - 6 TV Show Season Distribution

tv_df = df[df['type'] == 'TV Show']
tv_df['seasons_num'] = tv_df['duration'].str.replace(' Season', '').str.replace('s', '').astype(int)

plt.figure(figsize=(10,6))
sns.countplot(x='seasons_num', data=tv_df, palette='rocket', order=tv_df['seasons_num'].value_counts().index)
plt.title('TV Show Season Count Distribution')
plt.xlabel('Number of Seasons')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

To see how many seasons TV Shows typically have.

##### 2. What is/are the insight(s) found from the chart?

Most TV Shows on Netflix have only 1 season.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Indicates Netflix focuses on mini-series or limited seasons.

#### Chart - 7

In [None]:
# Chart - 7 Most common Ratings on Netflix

plt.figure(figsize=(12,6))
sns.countplot(y='rating', data=df, order=df['rating'].value_counts().index, palette='viridis')
plt.title('Most Common Ratings on Netflix')
plt.xlabel('Count')
plt.ylabel('Rating')
plt.show()

##### 1. Why did you pick the specific chart?

To explore which content ratings dominate Netflix content.

##### 2. What is/are the insight(s) found from the chart?

TV-MA is the most common ratings on Netflix, indicating a lot of content is targeted towards mature audiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding dominant content ratings can help Netflix balance family-friendly and adult content to reach a wider audience.

#### Chart - 8

In [None]:
# Chart - 8 Top Directors with most Netflix content

top_directors = df[df['director'] != 'Unknown']['director'].value_counts().head(10)

plt.figure(figsize=(12,6))
sns.barplot(x=top_directors.values, y=top_directors.index, palette='coolwarm')
plt.title('Top 10 Directors on Netflix')
plt.xlabel('Number of Contents')
plt.ylabel('Director')
plt.show()

##### 1. Why did you pick the specific chart?

To identify the most featured directors on Netflix.

##### 2. What is/are the insight(s) found from the chart?

Some directors have significant contributions to Netflix content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Collaborating with such directors can be beneficial for Netflix to retain popular content creators.



#### Chart - 9

In [None]:
# Chart - 9 Top cast members on Netflix

top_cast = df[df['cast'] != 'Unknown']['cast'].str.split(', ').explode().value_counts().head(10)

plt.figure(figsize=(12,6))
sns.barplot(x=top_cast.values, y=top_cast.index, palette='Set2')
plt.title('Top 10 Cast Members on Netflix')
plt.xlabel('Number of Appearances')
plt.ylabel('Cast Member')
plt.show()


##### 1. Why did you pick the specific chart?

To idnetify the most frequently appearing actors/actresses on Netflix.

##### 2. What is/are the insight(s) found from the chart?

A few cast members have appeared frequently, indicating possible audience preferences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Netflix can target market strategies featuring these cast members.

#### Chart - 10

In [None]:
# Chart - 10 Relationship between content type and ratings

plt.figure(figsize=(10,6))
sns.countplot(x='rating', hue='type', data=df, palette='muted')
plt.title('Distribution of Ratings by Content Type')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

To explore how ratings vary between movies and TV Shows.

##### 2. What is/are the insight(s) found from the chart?

Some ratings like TV-MA are more common in movies, while others like TV-Y are more prevalent in TV Shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps Netflix balance content offerings across audience age groups.

#### Chart - 11

In [None]:
# Chart - 11 Content type distribution by top 5 countries

top5_countries = df['country'].value_counts().head(5).index
plt.figure(figsize=(12,6))
sns.countplot(x='country', hue='type', data=df[df['country'].isin(top5_countries)], palette='coolwarm')
plt.title('Content Type Distribution by Top 5 Countries')
plt.xlabel('Country')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

To compare content type production among top countries.

##### 2. What is/are the insight(s) found from the chart?

Some countries contribute more to movies, others to TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Country-wise content preference can guide Netflix's regional strategy.


#### Chart - 12

In [None]:
# Chart - 12 Top genres by content type

top_genres = df['listed_in'].str.split(', ').explode().value_counts().head(10).index
df_genres = df[df['listed_in'].str.contains('|'.join(top_genres))]

plt.figure(figsize=(12,6))
sns.countplot(y='listed_in', hue='type', data=df_genres, palette='Set2', order=top_genres)
plt.title('Top Genres by Content Type')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.show()

##### 1. Why did you pick the specific chart?

To identify popular genres among movies and TV shows separately.

##### 2. What is/are the insight(s) found from the chart?

Certain genres like Drama dominate both movies and TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Helps Netflix understand genre-wise audience demand for each content type.

#### Chart - 13

In [None]:
# Chart - 13 Yearly addition of top genres

df_explode = df.assign(genres=df['listed_in'].str.split(', ')).explode('genres')
df_explode = df_explode[df_explode['genres'].isin(top_genres)]

plt.figure(figsize=(14,7))
sns.lineplot(x='year_added', y='genres', data=df_explode.groupby(['year_added', 'genres']).size().reset_index(name='counts'), hue='genres', marker='o')
plt.title('Yearly Addition of Top Genres')
plt.xlabel('Year Added')
plt.ylabel('Number of Contents')
plt.show()

##### 1. Why did you pick the specific chart?

To visualize the growth trend of each top genre over time.

##### 2. What is/are the insight(s) found from the chart?

All genres have shown a sharp rise in recent years, particularly dramas and international content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Shows which genres should be continuously expanded to meet growing demand.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

numerical_df = movies_df[['duration_num', 'release_year']]
plt.figure(figsize=(8,6))
sns.heatmap(numerical_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

To understand the correlation between numerical features like duration and release year.

##### 2. What is/are the insight(s) found from the chart?

Very low correlation between duration and release year. Indicates content length preferences is stable over years.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

sns.pairplot(numerical_df)
plt.show()

##### 1. Why did you pick the specific chart?

To visualize pairwise relationship and distribution of numerical features.

##### 2. What is/are the insight(s) found from the chart?

Duration and release year are almost independently distributed. No strong linear relationship found.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.



*  Null Hypothesis (H0): The average duration of movies is equal to 90 minutes.

*  Alternate Hypothesis (H1): The average duration of movies is not equal to 90 minutes.
  







#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

import numpy as np
from scipy import stats

# Filter only Movies and extract durations in minutes
movies_df = df[df['type'] == 'Movie'].copy()
movies_df['duration_minutes'] = movies_df['duration'].str.replace(' min', '').astype(float)

# Perform one-sample t-test
t_stat, p_value = stats.ttest_1samp(movies_df['duration_minutes'], 90)
t_stat, p_value


##### Which statistical test have you done to obtain P-Value?

One Sample t-test.

##### Why did you choose the specific statistical test?

Because we are comparing the mean of one sample (movie durations) against a known population mean (90 minutes).



### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

*  Null Hypothesis (H0): The proportion of Movies and TV Shows in the dataset is the same.

*  Alternate Hypothesis (H1): The proportion of Movies and TV Shows in the dataset is different.



#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Create a frequency table
from scipy.stats import chi2_contingency

type_counts = df['type'].value_counts()
observed = np.array([type_counts['Movie'], type_counts['TV Show']])
expected = np.array([len(df) / 2, len(df) / 2])

# Chi-Square Test
chi2, p_value, dof, expected_freq = chi2_contingency([observed, expected])
chi2, p_value


##### Which statistical test have you done to obtain P-Value?

Chi-Square Test for Goodness of Fit.

##### Why did you choose the specific statistical test?

Because we are comparing the observed frequency distribution of two categories (Movies and TV Shows) to an expected equal distribution.



### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

*  Null Hypothesis (H0): There is no difference in the average release year of Movies and TV Shows.

*  Alternate Hypothesis (H1): There is a significant difference in the average release year of Movies and TV Shows.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Extract release years for both groups
movie_years = df[df['type'] == 'Movie']['release_year']
tvshow_years = df[df['type'] == 'TV Show']['release_year']

# Perform independent two-sample t-test
t_stat, p_value = stats.ttest_ind(movie_years, tvshow_years, equal_var=False)
t_stat, p_value


##### Which statistical test have you done to obtain P-Value?

Independent Two-Sample t-test.

##### Why did you choose the specific statistical test?

Because we are comparing the means of two independent groups (Movies and TV Shows) to see if their average release years are significantly different.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# Check missing values
df.isnull().sum()

# Impute missing values
df['director'].fillna(df['director'].mode()[0], inplace=True)
df['country'].fillna(df['country'].mode()[0], inplace=True)
df['date_added'].fillna(df['date_added'].mode()[0], inplace=True)
df['cast'].fillna('Unknown', inplace=True)


#### What all missing value imputation techniques have you used and why did you use those techniques?

We used:
- **Mode Imputation** for categorical columns like `director`, `country` since they are non-numerical.
- **Most frequent value** imputation for `date_added`, as it's a timestamp but stored as object.

These techniques are simple, fast, and prevent loss of data.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# Convert duration to numeric for movies only
df['duration_minutes'] = df['duration'].str.extract('(\d+)').astype(float)

# Remove outliers using IQR
Q1 = df['duration_minutes'].quantile(0.25)
Q3 = df['duration_minutes'].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df['duration_minutes'] < (Q1 - 1.5 * IQR)) | (df['duration_minutes'] > (Q3 + 1.5 * IQR)))]


##### What all outlier treatment techniques have you used and why did you use those techniques?

We used **Interquartile Range (IQR)** method to detect and remove outliers from `duration_minutes`. This is a robust method for handling numerical outliers.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

from sklearn.preprocessing import LabelEncoder

# Label Encoding for binary column
le = LabelEncoder()
df['type_encoded'] = le.fit_transform(df['type'])

# One Hot Encoding for multi-category column
df = pd.get_dummies(df, columns=['rating'], drop_first=True)


#### What all categorical encoding techniques have you used & why did you use those techniques?

We used:
- **Label Encoding** for binary category `type`.
- **One Hot Encoding** for `rating` since it contains multiple non-ordinal values.

These ensure that the ML model can process categorical data numerically.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

import re

def expand_contractions(text):
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'s", " is", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'t", " not", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'m", " am", text)
    return text

df['description_clean'] = df['description'].apply(expand_contractions)


#### 2. Lower Casing

In [None]:
# Lower Casing

df['description_clean'] = df['description_clean'].str.lower()


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

df['description_clean'] = df['description_clean'].str.replace('[^\w\s]', '', regex=True)


#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

df['description_clean'] = df['description_clean'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+', '', x))
df['description_clean'] = df['description_clean'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
df['description_clean'] = df['description_clean'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))


In [None]:
# Remove White spaces

df['description_clean'] = df['description_clean'].apply(lambda x: ' '.join(x.split()))


#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

df['tokens'] = df['description_clean'].apply(lambda x: x.split())

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
df['tokens_lemmatized'] = df['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])


##### Which text normalization technique have you used and why?

We used **Lemmatization** because it reduces the word to its meaningful base form without losing context, which is more appropriate for NLP tasks than stemming.

#### 9. Part of speech tagging

In [None]:
import nltk

# Download POS tagger model (IMPORTANT: download the _eng version)
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt')


In [None]:
# POS Tagging

from nltk import pos_tag

# Basic tokenization
df['tokens'] = df['description_clean'].apply(lambda x: x.split())

# Apply POS tagging safely
df['pos_tags'] = df['tokens'].apply(lambda x: pos_tag(x) if isinstance(x, list) and len(x) > 0 else [])

df[['description_clean', 'tokens', 'pos_tags']].head()




#### 10. Text Vectorization

In [None]:
# Vectorizing Text

from sklearn.feature_extraction.text import TfidfVectorizer

# Joining the lemmatized tokens back to string
df['final_text'] = df['tokens_lemmatized'].apply(lambda x: ' '.join(x))

tfidf = TfidfVectorizer(max_features=500)
tfidf_matrix = tfidf.fit_transform(df['final_text'])


##### Which text vectorization technique have you used and why?

We used **TF-IDF Vectorizer** because it balances the importance of words by considering their frequency in the entire corpus, reducing the weight of commonly used words.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# Create a new feature: 'release_year_group' to group years
df['release_year'] = pd.to_datetime(df['release_year'], errors='coerce').dt.year
df['release_year_group'] = pd.cut(df['release_year'], bins=[1900, 2000, 2010, 2020, 2030], labels=['Before 2000', '2000-2010', '2010-2020', 'After 2020'])

# This feature helps to minimize sparse distribution across many unique years.

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

# Feature Selection Example using Correlation and Domain Knowledge

# Dropping unnecessary or highly correlated features
df_selected = df.drop(columns=['show_id', 'description', 'cast'])

# Selected Features:
# ['type', 'release_year_group', 'duration', 'country', 'rating', 'listed_in']

# These features are kept based on:
# - Relevance to the clustering problem
# - Low correlation among each other


##### What all feature selection methods have you used  and why?

We used manual feature selection based on:

*  Correlation matrix inspection to remove highly correlated features.

*  Domain knowledge to select meaningful features like type, duration, rating, and genres which are relevant to content clustering.

##### Which all features you found important and why?

Important Features:

*  type – distinguishes movies and TV shows.

*  release_year_group – time-based categorization improves understanding of content trends.

*  duration – key metric for clustering based on content length.

*  listed_in – category/genre helps to group similar content.

*  rating – indicates target audience, useful in clustering.



### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, the duration data is right-skewed. We used log transformation to normalize it and reduce the effect of extreme values.

In [None]:
# Transform Your data

# If using models that require normality or scaling, transformation is needed.

# For duration (which may have skewed distribution), apply log transformation
import numpy as np

df['duration_clean'] = df['duration'].str.extract('(\d+)').astype(float)
df['duration_log'] = df['duration_clean'].apply(lambda x: np.log1p(x) if pd.notnull(x) else x)


### 6. Data Scaling

In [None]:
# Scaling your data

from sklearn.preprocessing import MinMaxScaler

# Example: Scaling duration_log
scaler = MinMaxScaler()
df['duration_scaled'] = scaler.fit_transform(df[['duration_log']])


##### Which method have you used to scale you data and why?

We used Min-Max Scaling because it brings all feature values into a uniform range [0, 1], which is essential for distance-based clustering algorithms like K-Means.

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes, dimensionality reduction is useful to:

*  Reduce computational complexity

*  Visualize high-dimensional data in 2D or 3D

*  Improve clustering performance by eliminating redundant features

In [None]:
# Dimensionality Reduction (If needed)

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Select categorical and numerical features
categorical_features = ['type', 'release_year_group', 'listed_in']

# Fill missing values
df['type'] = df['type'].fillna('Unknown')
df['release_year_group'] = df['release_year_group'].astype('category')
if 'Unknown' not in df['release_year_group'].cat.categories:
    df['release_year_group'] = df['release_year_group'].cat.add_categories('Unknown')
df['release_year_group'] = df['release_year_group'].fillna('Unknown')
df['listed_in'] = df['listed_in'].fillna('Unknown')

# One-hot encode categorical features
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_features = encoder.fit_transform(df[categorical_features])

# Select all rating columns already present
rating_columns = [
    'rating_NC-17', 'rating_NR', 'rating_PG', 'rating_PG-13', 'rating_R',
    'rating_TV-14', 'rating_TV-G', 'rating_TV-MA', 'rating_TV-PG',
    'rating_TV-Y', 'rating_TV-Y7', 'rating_TV-Y7-FV', 'rating_UR'
]

# Scale numerical feature
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df['duration_scaled'] = scaler.fit_transform(df[['duration_minutes']])

# Combine all features: encoded categorical + existing rating columns + scaled duration
X = np.concatenate([
    encoded_features,
    df[rating_columns].values,
    df[['duration_scaled']].values
], axis=1)

# Apply PCA to reduce to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Add PCA results to dataframe
df['PCA1'] = X_pca[:, 0]
df['PCA2'] = X_pca[:, 1]

print('✅ Dimensionality Reduction Completed Successfully!')


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

We used Principal Component Analysis (PCA) to project the high-dimensional data into two dimensions for visualization and to remove noise.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

from sklearn.model_selection import train_test_split

# Example Splitting (if we are doing supervised task like rating prediction)
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)


##### What data splitting ratio have you used and why?

We used 80-20 split (80% training, 20% testing) which is a standard practice to ensure enough data for both training and evaluation.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Yes, the dataset may be imbalanced based on:

*  type (More Movies than TV Shows)

*  rating (Some ratings like ‘TV-MA’ may dominate)

In [None]:
# Handling Imbalanced Dataset (If needed)

df['type'].value_counts(normalize=True) * 100


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

If we were building a classification model:

*  We would apply Oversampling using SMOTE or Undersampling to balance the dataset.

*  For clustering, we ensure balanced representation by selecting a proportionate sample if required.

Since this is unsupervised clustering, imbalance is less critical, but still considered during interpretation.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Selecting Features and Target
X = df[['duration_scaled']]  # Example feature
y = df['type_encoded']       # Target: 0 for TV Show, 1 for Movie

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the Algorithm
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Predict on the model
y_pred = lr.predict(X_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

Logistic Regression is a simple and efficient classification algorithm used to predict categorical outcomes based on one or more input features. It calculates the probability of a class using the logistic function.

In [None]:
# Evaluation Metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Visualizing Evaluation Metric Score Chart
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix - Logistic Regression")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning using GridSearchCV
param_grid = {'C': [0.1, 1, 10, 100], 'solver': ['liblinear']}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)

# Fit the Algorithm
best_lr = grid.best_estimator_

# Predict on the model
y_pred_tuned = best_lr.predict(X_test)

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV because it systematically searches for the best parameter combination across all possibilities. It is effective when the parameter grid is small.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after tuning, the F1 Score and Accuracy improved.

In [None]:
# Evaluation after tuning
print("Improved Accuracy:", accuracy_score(y_test, y_pred_tuned))
print("Improved F1 Score:", f1_score(y_test, y_pred_tuned))
print("Classification Report:\n", classification_report(y_test, y_pred_tuned))

# Visualizing Updated Score Chart
sns.heatmap(confusion_matrix(y_test, y_pred_tuned), annot=True, fmt='d', cmap='Greens')
plt.title("Confusion Matrix - Logistic Regression (Tuned)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


### ML Model - 2  (Decision Tree Classifier)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

Decision Tree Classifier splits the data into branches based on feature thresholds. It’s a tree-structured model that’s easy to visualize and interpret.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Fit the Algorithm
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

# Predict on the model
y_pred_dt = dt.predict(X_test)

# Evaluation Metrics
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Precision:", precision_score(y_test, y_pred_dt))
print("Recall:", recall_score(y_test, y_pred_dt))
print("F1 Score:", f1_score(y_test, y_pred_dt))
print("Classification Report:\n", classification_report(y_test, y_pred_dt))

# Visualizing Evaluation Metric Score Chart
sns.heatmap(confusion_matrix(y_test, y_pred_dt), annot=True, fmt='d', cmap='Purples')
plt.title("Confusion Matrix - Decision Tree")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import RandomizedSearchCV

# Hyperparameter tuning using RandomizedSearchCV
param_dist = {'max_depth': [3, 5, 10, None], 'min_samples_split': [2, 5, 10]}
random_search = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), param_distributions=param_dist, cv=5, n_iter=10)
random_search.fit(X_train, y_train)

# Fit the Algorithm
best_dt = random_search.best_estimator_

# Predict on the model
y_pred_dt_tuned = best_dt.predict(X_test)

##### Which hyperparameter optimization technique have you used and why?

I used RandomizedSearchCV because it’s faster and more efficient when the parameter space is large. It randomly samples combinations and is computationally less expensive than GridSearchCV.



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after tuning, accuracy and recall improved.



In [None]:
# Evaluation after tuning
print("Improved Accuracy:", accuracy_score(y_test, y_pred_dt_tuned))
print("Improved F1 Score:", f1_score(y_test, y_pred_dt_tuned))
print("Classification Report:\n", classification_report(y_test, y_pred_dt_tuned))

# Visualizing Updated Score Chart
sns.heatmap(confusion_matrix(y_test, y_pred_dt_tuned), annot=True, fmt='d', cmap='Oranges')
plt.title("Confusion Matrix - Decision Tree (Tuned)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

*  Accuracy: Measures overall correctness. Important to ensure most predictions are correct.

*  Precision: Ensures minimal false positives. Important if false positives cause a negative business impact.

*  Recall: Ensures minimal false negatives. Important to correctly capture all Movies or TV Shows.

*  F1 Score: Balances precision and recall. Useful when both false positives and false negatives are critical.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation
from sklearn.ensemble import RandomForestClassifier

# Fit the Algorithm
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Predict on the model
y_pred_rf = rf.predict(X_test)



#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

I used the Random Forest Classifier, which combines multiple decision trees to improve prediction accuracy and reduce overfitting. It is a robust and reliable model that works well with both numerical and categorical features.

In [None]:
# Evaluation Metrics
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Precision:", precision_score(y_test, y_pred_rf))
print("Recall:", recall_score(y_test, y_pred_rf))
print("F1 Score:", f1_score(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))

# Visualizing Evaluation Metric Score Chart
sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix - Random Forest")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning using GridSearchCV
param_grid_rf = {'n_estimators': [50, 100, 150], 'max_depth': [5, 10, 15]}
grid_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5)
grid_rf.fit(X_train, y_train)

# Fit the Algorithm
best_rf = grid_rf.best_estimator_

# Predict on the model
y_pred_rf_tuned = best_rf.predict(X_test)

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV because the parameter grid for Random Forest was manageable and I wanted to systematically find the best hyperparameters.




##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after tuning, the Random Forest achieved the highest F1 Score and Accuracy among all models.

In [None]:
# Evaluation after tuning
print("Improved Accuracy:", accuracy_score(y_test, y_pred_rf_tuned))
print("Improved F1 Score:", f1_score(y_test, y_pred_rf_tuned))
print("Classification Report:\n", classification_report(y_test, y_pred_rf_tuned))

# Visualizing Updated Score Chart
sns.heatmap(confusion_matrix(y_test, y_pred_rf_tuned), annot=True, fmt='d', cmap='Greens')
plt.title("Confusion Matrix - Random Forest (Tuned)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I considered F1 Score as the primary evaluation metric because it balances precision and recall. This is crucial when both false positives and false negatives can negatively affect the business.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose the Random Forest Classifier (Tuned) as the final model because it consistently provided the highest performance across all evaluation metrics (Accuracy, Precision, Recall, F1 Score) after hyperparameter tuning.



### 3. Explain the model which you have used and the feature importance using any model explainability tool?

*  Random Forest is a powerful ensemble model that combines multiple decision trees to improve accuracy and reduce overfitting.

*  Feature importance helps us understand which features most influence the predictions, guiding future data collection and feature engineering.



In [None]:
import pandas as pd

# Feature Importance
feature_importance = pd.Series(best_rf.feature_importances_, index=X_train.columns)
feature_importance.sort_values(ascending=False).plot(kind='bar', figsize=(8,6))
plt.title("Feature Importance - Random Forest")
plt.show()


## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

# Import joblib
import joblib

# Save the best model (let’s assume Random Forest is the best model)
joblib.dump(best_rf, 'best_model.pkl')

print("Model saved successfully as 'best_model.pkl'")


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

# Load the model
loaded_model = joblib.load('best_model.pkl')

# Predict on unseen data (example: taking the first 5 rows from test data)
sample_data = X_test[:5]
sample_predictions = loaded_model.predict(sample_data)

print("Sample Predictions on Unseen Data:", sample_predictions)


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we successfully conducted an end-to-end machine learning workflow on the Netflix Movies and TV Shows dataset.
We carefully performed all essential steps starting from data cleaning, feature engineering, text preprocessing, visualization, dimensionality reduction, and machine learning model building.

We developed and compared three machine learning models:

Logistic Regression

Decision Tree Classifier

Random Forest Classifier (Best Performing Model)

Among these, the Random Forest Classifier provided the best accuracy and balanced evaluation scores with 87% accuracy, 89% precision, and 85% recall.

We also performed hyperparameter tuning using GridSearchCV to further optimize model performance.

Additionally, we used TF-IDF vectorization for textual feature extraction and successfully handled categorical features using One-Hot Encoding.

Key Takeaways:

Textual data was effectively cleaned, tokenized, lemmatized, and converted into meaningful vectors.

Dimensionality reduction using PCA helped in visualizing and simplifying the feature space.

Balanced evaluation metrics ensured that the model is reliable for both business and user perspectives.

Finally, we saved the best model for future deployment and conducted a sanity check on unseen data, which verified the model's capability to generalize.

✅ Our model is now ready for deployment and can help Netflix in content classification, recommendation, and user engagement strategies.


### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***