

# News Article Clustering using Hierarchical Clustering

## Business Problem
A media company publishes thousands of news articles daily.

### Problems:
- Articles are not consistently tagged  
- Manual categorization is expensive  
- New topics emerge frequently  

### Objective:
- Automatically group similar news articles  
- Discover hidden themes without predefined categories  
- Lay the groundwork for a content recommendation system (articles in the same cluster can be suggested together)  

We will use **Hierarchical Clustering (Agglomerative)** to discover natural groupings.


In [None]:

import pandas as pd

# Load dataset (Replace with actual file path)
# df = pd.read_csv("news_dataset.csv")

# Example dummy dataset
data = {
    "text": [
        "AI and machine learning are transforming technology",
        "Deep learning improves artificial intelligence systems",
        "Stock markets fluctuate due to economic changes",
        "Investors analyze financial market trends",
        "Healthy diet and exercise improve physical health",
        "Yoga and meditation improve mental well being"
    ]
}

df = pd.DataFrame(data)
df.head()


In [None]:

from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text to numerical form using TF-IDF
tfidf = TfidfVectorizer(max_features=800, stop_words='english')

X = tfidf.fit_transform(df["text"])

print("Shape of TF-IDF matrix:", X.shape)


In [None]:

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

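# Ward linkage needs a dense array; for a large dataset, cluster a
# sample of the rows instead, since the dense matrix may not fit in memory
X_dense = X.toarray()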

Z = linkage(X_dense, method='ward')

plt.figure()
dendrogram(Z)
plt.title("Dendrogram of News Articles")
plt.xlabel("Articles")
plt.ylabel("Distance")
plt.show()
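
Reading a cut height off the dendrogram by eye can also be done programmatically. The sketch below is one possible way to do that with SciPy's `fcluster`, which turns a linkage matrix into flat cluster labels. It rebuilds the dummy dataset so it runs on its own; the choice of `t=3` with `criterion="maxclust"` is an assumption made for illustration, not something the dendrogram dictates.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Same dummy articles as in the cells above
texts = [
    "AI and machine learning are transforming technology",
    "Deep learning improves artificial intelligence systems",
    "Stock markets fluctuate due to economic changes",
    "Investors analyze financial market trends",
    "Healthy diet and exercise improve physical health",
    "Yoga and meditation improve mental well being",
]

X_dense = TfidfVectorizer(stop_words="english").fit_transform(texts).toarray()
Z = linkage(X_dense, method="ward")

# Cut the tree so that at most 3 flat clusters remain (assumed value).
# fcluster labels start at 1, unlike scikit-learn's 0-based labels.
flat_labels = fcluster(Z, t=3, criterion="maxclust")

for label, text in zip(flat_labels, texts):
    print(label, text)
```

Using `criterion="distance"` with a height read from the y-axis is the alternative when a distance threshold is more natural than a cluster count.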


In [None]:

from sklearn.cluster import AgglomerativeClustering

# Choose number of clusters based on dendrogram observation
model = AgglomerativeClustering(n_clusters=3, linkage='ward')

labels = model.fit_predict(X_dense)

df["Cluster"] = labels
df


In [None]:

from sklearn.metrics import silhouette_score

score = silhouette_score(X_dense, labels)

print("Silhouette Score:", score)



## Validation Without Labels

Since this is unsupervised learning, we evaluate clustering quality using:

### Silhouette Score
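- Value ranges from -1 to 1  
- Close to 1 → samples sit well inside their own cluster, far from neighboring clusters  
- Around 0 → clusters overlap  
- Negative → samples are, on average, closer to a neighboring cluster than to their own  

For each sample, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. `silhouette_score` reports the average over all samples, so a higher score indicates better-separated clusters.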
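
Because the silhouette score needs no labels, it can also be used to compare candidate cluster counts instead of relying on the dendrogram alone. The sketch below tries each feasible value of `n_clusters` on the dummy dataset; the names `scores` and `best_k` are hypothetical, and picking the maximum is one simple heuristic rather than a definitive selection rule.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Same dummy articles as in the cells above
texts = [
    "AI and machine learning are transforming technology",
    "Deep learning improves artificial intelligence systems",
    "Stock markets fluctuate due to economic changes",
    "Investors analyze financial market trends",
    "Healthy diet and exercise improve physical health",
    "Yoga and meditation improve mental well being",
]

X_dense = TfidfVectorizer(stop_words="english").fit_transform(texts).toarray()

scores = {}
# silhouette_score requires between 2 and n_samples - 1 clusters
for k in range(2, len(texts)):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X_dense)
    scores[k] = silhouette_score(X_dense, labels)
    print(f"n_clusters={k}: silhouette={scores[k]:.3f}")

# Heuristic: keep the cluster count with the highest average silhouette
best_k = max(scores, key=scores.get)
print("Best n_clusters by silhouette:", best_k)
```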
