<a href="https://colab.research.google.com/github/BaraShowCode/Text-Clustering-using-K-Means-and-TF-IDF/blob/main/Text_Clustering_ML_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name** - Netflix Movies and TV Shows Clustering

##### **Project Type** - Unsupervised ML (Clustering)
##### **Contribution** - Individual

# **Project Summary -**

This project focuses on applying unsupervised machine learning, specifically clustering, to the Netflix Movies and TV Shows dataset. The primary objective is to segment the vast content library into distinct groups based on textual features like descriptions, genres, and cast. By doing so, we aim to uncover underlying patterns in the content catalog that can be used to enhance recommendation systems and inform content strategy.

The project begins with a rigorous data wrangling phase, where missing values are handled and features are engineered for analysis. A key step in feature engineering is the creation of a 'content soup' by combining multiple text-based columns into a single representative feature for each title. This text data is then converted into a numerical format using the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization technique. To manage the high dimensionality of the resulting data and improve clustering performance, TruncatedSVD is applied for dimensionality reduction.

The core of the project is the implementation of the K-Means clustering algorithm. The optimal number of clusters is determined using the Elbow Method. After fitting the model, the resulting clusters are analyzed by examining the top genres and terms within each group to understand their thematic composition. This qualitative analysis reveals distinct and meaningful content segments, such as 'US TV Dramas & Comedies', 'International Dramas & Thrillers', and 'Kids & Family Content'. The project successfully demonstrates how clustering can be used to create a structured, data-driven understanding of a large content library, providing valuable insights for personalization and strategic planning.

# **Problem Statement**

Netflix's recommendation engine is a cornerstone of its user experience, but it relies on understanding the nuanced similarities between titles. The business problem is to create a content-based grouping of movies and TV shows using unsupervised machine learning. The goal is to segment the entire Netflix catalog into a predefined number of distinct clusters based on features like genre, description, director, and cast. These clusters can then be used to power a 'similar content' recommendation feature, improve content discovery for users, and provide a high-level overview of the thematic composition of the content library for strategic analysis. The challenge lies in converting the unstructured text data into a meaningful numerical representation and applying a clustering algorithm that can effectively group the titles into coherent and interpretable segments.

# ***Let's Begin!***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries for Data Handling and Visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from collections import Counter
from scipy.stats import ttest_ind, chi2_contingency

# Import Libraries for Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# Set default styles for plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 7)

print("All libraries imported successfully!")

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

## ***3. Data Wrangling***

In [None]:
# Work on a copy so the original DataFrame is left unchanged.
df_cleaned = df.copy()
print("Starting data wrangling...")

# Fill missing categorical data
df_cleaned[['director', 'cast', 'country']] = df_cleaned[['director', 'cast', 'country']].fillna('Unknown')

# Drop rows with missing date_added or rating (very few)
df_cleaned.dropna(subset=['date_added', 'rating'], inplace=True)

# Correct date format issue by stripping whitespace
df_cleaned['date_added'] = pd.to_datetime(df_cleaned['date_added'].str.strip())

# Create the 'content_soup' feature for NLP by combining relevant text columns
df_cleaned['content_soup'] = df_cleaned['director'] + ' ' + df_cleaned['cast'] + ' ' + df_cleaned['listed_in'] + ' ' + df_cleaned['description']
print("Created 'content_soup' feature for clustering.")
print("Data wrangling complete.")

## ***4. Data Visualization (Condensed for ML Focus)***

#### Chart - 1: Distribution of Content Type

In [None]:
type_counts = df_cleaned['type'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%', startangle=140, colors=['#b20710', '#221f1f'])
plt.title('Netflix Content Distribution: Movies vs. TV Shows', fontsize=16)
plt.ylabel('')
plt.show()

##### 1. Why did you pick the specific chart?
A **pie chart** is used here as a quick, high-level overview of the content mix. It effectively shows the proportion of Movies vs. TV Shows.

##### 2. What is/are the insight(s) found from the chart?
The library consists of approximately 69% Movies and 31% TV Shows, providing essential context for our clustering model.

##### 3. Will the gained insights help creating a positive business impact?
Yes, it sets the stage for understanding the different types of content we are about to cluster.

#### Chart - 2: Top 10 Genres on Netflix

In [None]:
# Split the comma-separated genre strings and flatten them to one entry per genre
genre_list = df_cleaned['listed_in'].str.split(', ').explode()
top_10_genres = Counter(genre_list).most_common(10)
genres_df = pd.DataFrame(top_10_genres, columns=['Genre', 'Count'])

plt.figure(figsize=(14, 8))
sns.barplot(data=genres_df, y='Genre', x='Count', palette='mako')
plt.title('Top 10 Genres on Netflix', fontsize=16)
plt.xlabel('Number of Titles', fontsize=12)
plt.ylabel('Genre', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?
A **horizontal bar chart** is used to rank the top 10 genres. This is a key visualization because the `listed_in` column is a primary component of our text-based features for clustering.

##### 2. What is/are the insight(s) found from the chart?
**'International Movies'**, **'Dramas'**, and **'Comedies'** are the most prevalent genres. This suggests that any effective clustering model should be able to separate content based on these dominant themes.

##### 3. Will the gained insights help creating a positive business impact?
Yes. Understanding the most common genres helps validate our clustering results later. If our clusters align with these top genres, it indicates the model is capturing meaningful patterns.

#### Chart - 3: Distribution of Content Ratings

In [None]:
plt.figure(figsize=(14, 8))
sns.countplot(data=df_cleaned, x='rating', order=df_cleaned['rating'].value_counts().index, palette='plasma')
plt.title('Distribution of Content Ratings on Netflix', fontsize=16)
plt.xlabel('Rating', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?
A **count plot** provides a clear view of the distribution of content ratings. Note that the `rating` column is not part of `content_soup`, so it serves as context for interpreting the clusters rather than as a model input.

##### 2. What is/are the insight(s) found from the chart?
The most common rating is **'TV-MA'** (for mature audiences), followed by **'TV-14'**. This helps characterize the overall tone of the content library.

##### 3. Will the gained insights help creating a positive business impact?
Yes, this helps in understanding the target audience of the content we are clustering. Because `rating` is not a model input, any alignment between clusters and maturity levels would have to emerge from the genres, cast, and descriptions.

## ***6. Feature Engineering & Data Pre-processing***

### 10. Text Vectorization

In [None]:
# Vectorizing Text
tfidf = TfidfVectorizer(stop_words='english', max_features=25000)
tfidf_matrix = tfidf.fit_transform(df_cleaned['content_soup'])

print(f"Shape of the TF-IDF matrix: {tfidf_matrix.shape}")

##### Which text vectorization technique have you used and why?
**TF-IDF (Term Frequency-Inverse Document Frequency)** was used because it is a powerful technique for converting text into a meaningful numerical representation for machine learning. Unlike simple word counts, TF-IDF gives higher weight to words that are frequent in a single document but rare across all documents, effectively highlighting the terms that best define a specific movie or TV show.
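The weighting described above can be sketched by hand. The snippet below is a minimal illustration, not the notebook's pipeline: it uses a hypothetical three-document corpus and assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector).

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the Netflix dataset).
docs = [
    "spy thriller action spy",
    "romantic comedy drama",
    "family comedy animation",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_weights(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    term_freq = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = [tfidf_weights(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, "->", {t: round(v, 3) for t, v in sorted(w.items())})
```

In the second document, "romantic" receives a higher weight than "comedy" because "comedy" also appears in the third document and is therefore less distinctive.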

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes, dimensionality reduction is crucial for this project. The TF-IDF matrix has up to 25,000 columns (the `max_features` cap set above), one dimension per term. Distance-based clustering algorithms such as K-Means tend to perform poorly in such high-dimensional spaces due to the 'curse of dimensionality,' where distances between points become less meaningful. By reducing the dimensions, we can capture the most significant patterns in the data while making the clustering process more computationally efficient and effective.
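The loss of distance contrast can be seen in a small experiment. This is a sketch under simplifying assumptions: it uses uniformly random dense points rather than sparse TF-IDF vectors, and the helper `relative_contrast` is a hypothetical name introduced only for this illustration.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min of the distances from one query point
    to n_points random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# The contrast between the nearest and farthest point shrinks as dimensions grow.
for dims in (2, 20, 200, 2000):
    print(f"{dims:>5} dims: relative contrast = {relative_contrast(200, dims):.3f}")
```

As the number of dimensions grows, the nearest and farthest points end up at nearly the same distance from the query, which is what makes distance-based cluster assignment less reliable.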

In [None]:
# Dimensionality Reduction
svd = TruncatedSVD(n_components=200, random_state=42)
reduced_matrix = svd.fit_transform(tfidf_matrix)

print(f"Shape of the reduced matrix: {reduced_matrix.shape}")

##### Which dimensionality reduction technique have you used and why?
**TruncatedSVD** was used for dimensionality reduction. It is similar to Principal Component Analysis (PCA), but it does not center the data first, so it can operate directly on the large, sparse matrices generated by TF-IDF without converting them to dense form. Applied to TF-IDF features, this approach is also known as latent semantic analysis (LSA). It reduces the number of features to 200 components while retaining as much of the original matrix's structure as that many components can capture.
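One way to check how much information a given number of components keeps is the `explained_variance_ratio_` attribute of the fitted `TruncatedSVD`. The sketch below runs on a hypothetical six-document corpus with 3 components; in the notebook, the same check would presumably be applied to `svd` after fitting on `tfidf_matrix`.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the notebook would use tfidf_matrix instead.
docs = [
    "spy thriller with action and suspense",
    "undercover spy mission in europe",
    "stand-up comedy special",
    "comedy special recorded live",
    "nature documentary about oceans",
    "documentary about ocean wildlife",
]

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Reduce the sparse TF-IDF matrix to 3 dense components.
toy_svd = TruncatedSVD(n_components=3, random_state=42)
toy_reduced = toy_svd.fit_transform(toy_tfidf)

# Sum of per-component ratios = share of total variance kept after reduction.
retained = toy_svd.explained_variance_ratio_.sum()
print(f"Reduced shape: {toy_reduced.shape}")
print(f"Share of variance retained by 3 components: {retained:.2%}")
```

If the retained share is low for the chosen `n_components`, increasing the number of components trades extra computation for less information loss.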

## ***7. ML Model Implementation***

### ML Model - 1: K-Means Clustering

In [None]:
# First, we determine the optimal number of clusters using the Elbow Method.
inertia = []
k_range = range(2, 16)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(reduced_matrix)
    inertia.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(14, 8))
plt.plot(k_range, inertia, marker='o', linestyle='--')
plt.title('Elbow Method for Optimal k', fontsize=16)
plt.xlabel('Number of Clusters (k)', fontsize=12)
plt.ylabel('Inertia', fontsize=12)
plt.xticks(k_range)
plt.grid(True)
plt.show()

In [None]:
# ML Model - 1 Implementation
optimal_k = 10 # Determined from the elbow plot
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)

# Fit the Algorithm
kmeans.fit(reduced_matrix)

# Predict on the model (assign cluster labels)
df_cleaned['cluster'] = kmeans.labels_

print(f"Successfully clustered the data into {optimal_k} clusters.")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Evaluation metric: Silhouette Score (ranges from -1 to 1; higher means better-separated clusters)
score = silhouette_score(reduced_matrix, kmeans.labels_)
print(f"Silhouette Score: {score:.4f}")

# Qualitative analysis: show the top 5 genres in every cluster
for i in range(optimal_k):
    cluster_df = df_cleaned[df_cleaned['cluster'] == i]
    # Flatten the comma-separated genre strings to one entry per genre
    cluster_genres = cluster_df['listed_in'].str.split(', ').explode()
    top_genres = Counter(cluster_genres).most_common(5)
    print(f"\n--- Cluster {i} (Size: {len(cluster_df)}) Top Genres ---")
    print(top_genres)

#### Explanation of Performance:
**K-Means Clustering** was used to partition the data into 10 groups. Since this is an unsupervised task, performance is judged on interpretability and cluster separation.

1. **Silhouette Score:** The model achieved a low positive score. The score ranges from -1 to 1, and a low positive value means that titles are, on average, only slightly closer to their own cluster than to the nearest neighboring cluster. In high-dimensional text data, low scores are common and do not necessarily mean the clusters are bad; some overlap is expected with complex data like movie descriptions.

2. **Qualitative Analysis (Most Important):** The true performance is seen by inspecting the clusters. The analysis of top genres reveals distinct and meaningful themes:
   - **Cluster 0:** Dominated by US TV Shows, particularly Dramas and Comedies.
   - **Cluster 1:** Clearly represents Stand-Up Comedy specials.
   - **Cluster 2 & 3:** Focus on International Movies, especially Dramas.
   - **Cluster 4:** Appears to be a large cluster for Documentaries.
   - **Cluster 7:** Strong theme of Kids' TV and children's content.

This clear thematic separation shows that the model has successfully learned the underlying structure of the content library, which is a strong indicator of good performance for this business problem.
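To make the Silhouette Score discussed above more concrete, here is a minimal pure-Python sketch of the calculation. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the mean distance to the nearest other cluster; the point's score is `(b - a) / max(a, b)`. The points and labels are hypothetical and chosen by hand; the notebook itself relies on `sklearn.metrics.silhouette_score`.

```python
import math

# Hypothetical 2-D points, chosen by hand for illustration.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes every cluster has at least 2 points."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(math.dist(p, q))
        a = sum(by_cluster[own]) / len(by_cluster[own])  # cohesion
        b = min(
            sum(d) / len(d) for lab, d in by_cluster.items() if lab != own
        )  # separation from the nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


good = silhouette(points, [0, 0, 0, 1, 1, 1])  # labels follow the two visible groups
bad = silhouette(points, [0, 1, 0, 1, 0, 1])   # labels cut across the groups
print(f"Labels matching the groups: {good:.3f}")
print(f"Labels cutting across them: {bad:.3f}")
```

Labels that follow the two visible groups score close to 1, while labels that cut across them score much lower, which is the sense in which the metric rewards cohesive, well-separated clusters.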

### ML Model - 2

For this unsupervised learning project, the focus was on implementing and thoroughly analyzing one robust clustering algorithm, K-Means. Other algorithms could be used for comparison in future work:

* **Agglomerative Clustering:** A hierarchical approach that builds a tree of clusters. It can be useful for visualizing relationships but is more computationally expensive.
* **DBSCAN:** A density-based algorithm that is excellent at finding arbitrarily shaped clusters and identifying outliers. It does not require specifying the number of clusters beforehand but is sensitive to its parameters.
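As a rough sketch of how these two alternatives behave, the snippet below runs both on hypothetical 2-D data: three tight groups of points plus one distant outlier. The `eps` and `min_samples` values are illustrative assumptions tuned to this toy data; on the notebook's `reduced_matrix` they would need to be chosen separately.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN

# Hypothetical 2-D data: three tight 3x3 grids of points plus one far-away outlier.
offsets = [(dx, dy) for dx in (-0.2, 0.0, 0.2) for dy in (-0.2, 0.0, 0.2)]
centers = [(0.0, 0.0), (10.0, 10.0), (-10.0, 10.0)]
groups = np.array([(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets])
with_outlier = np.vstack([groups, [[50.0, 50.0]]])

# Agglomerative clustering is told how many clusters to produce.
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(groups)

# DBSCAN infers the number of clusters from density and labels noise points as -1.
db_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(with_outlier)

print("Agglomerative clusters:", sorted(set(agg_labels.tolist())))
print("DBSCAN clusters:", sorted(set(db_labels.tolist()) - {-1}))
print("DBSCAN label for the outlier:", int(db_labels[-1]))
```

The outlier receives the label `-1` from DBSCAN instead of being forced into a cluster, which is the outlier-handling behavior mentioned above.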

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

The most important evaluation metric for a positive business impact is **Qualitative Cluster Analysis**. While the **Silhouette Score** provides a useful quantitative measure of how cohesive and well separated the clusters are, its absolute value is often difficult to interpret in a business context. The real value comes from inspecting the clusters themselves.

By examining the top genres, keywords, and representative titles within each cluster, we can assign a meaningful theme to each group (e.g., 'US TV Dramas', 'International Thrillers', 'Stand-up Comedy'). This **interpretability** is what allows the business to act on the results. These themed clusters can be directly used to build recommendation carousels ('Because you watched X, you might like these other titles in the *International Thrillers* group') and to analyze content gaps in the library.
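A cluster-based carousel of this kind could be sketched as a simple lookup. The titles, cluster numbers, and the helper `similar_titles` below are hypothetical placeholders; with the real data, the pairs would presumably come from the `cluster` column of `df_cleaned` together with a title column, which is assumed here and not shown in this notebook.

```python
from collections import defaultdict

# Hypothetical (title, cluster) pairs standing in for the clustered catalog.
catalog = [
    ("Title A", 3), ("Title B", 3), ("Title C", 7),
    ("Title D", 3), ("Title E", 7), ("Title F", 1),
]

# Index the catalog both ways: cluster -> titles and title -> cluster.
titles_by_cluster = defaultdict(list)
cluster_of = {}
for title, cluster in catalog:
    titles_by_cluster[cluster].append(title)
    cluster_of[title] = cluster


def similar_titles(watched, limit=5):
    """Return up to `limit` other titles from the same cluster as `watched`."""
    peers = titles_by_cluster[cluster_of[watched]]
    return [t for t in peers if t != watched][:limit]


print(similar_titles("Title A"))  # ['Title B', 'Title D']
```

This returns cluster peers in catalog order; a fuller version might rank them, for example by similarity of their reduced TF-IDF vectors.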

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for the deployment process.


In [None]:
# Save the fitted models and vectorizer
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')
joblib.dump(svd, 'svd_model.pkl')
joblib.dump(kmeans, 'kmeans_model.pkl')
print("Models and vectorizer saved successfully.")

### 2. Load the saved model files again and predict on unseen data as a sanity check.


In [None]:
# Load the File and predict unseen data.
loaded_tfidf = joblib.load('tfidf_vectorizer.pkl')
loaded_svd = joblib.load('svd_model.pkl')
loaded_kmeans = joblib.load('kmeans_model.pkl')

# Create a new, unseen movie description
unseen_movie = "An American spy goes on a secret mission in Europe to stop a global conspiracy. Full of action and suspense."

# Follow the same transformation pipeline
unseen_tfidf = loaded_tfidf.transform([unseen_movie])
unseen_reduced = loaded_svd.transform(unseen_tfidf)
prediction = loaded_kmeans.predict(unseen_reduced)

print("--- Sanity Check ---")
print(f"The unseen movie was assigned to Cluster: {prediction[0]}")

# **Conclusion**

This project successfully demonstrated the power of unsupervised machine learning to bring structure and insight to a large, unstructured content library. By combining NLP techniques like TF-IDF with K-Means clustering, we were able to segment the entire Netflix catalog into 10 distinct and interpretable groups. The analysis of these clusters revealed clear thematic patterns, such as the separation of international dramas, US-based TV shows, documentaries, and children's content.

The primary business value of this work lies in its direct application to content recommendation and strategic analysis. The created clusters can be immediately used to power a 'content-based' recommendation engine, suggesting similar titles to users based on the thematic group of what they just watched. Furthermore, by analyzing the size and composition of these clusters, Netflix can gain a high-level understanding of its content portfolio, identify potential gaps, and make more data-driven decisions about which content areas to invest in next.

In conclusion, this project provides not just a model, but a repeatable framework for understanding and segmenting a content library, turning unstructured text data into a valuable strategic asset.