# Dataset preparation for news clustering

# Introduction

The project aims to provide the dataset for clustering news together with same topic.

In [295]:
# Reading the news data
import pandas as pd

file_path = "/Users/amosmbp14/Jupyter notebook/News_classifier/dataset/rel_model_news.xlsx"
df_news = pd.read_excel(file_path)

In [296]:
# showing the dimension of the dataset
df_news.shape

(818, 24)

In [297]:
# showing the columns name
df_news.columns

Index(['date', 'title', 'body', 'url', 'nb_model_preds',
       'relevancy (yes =1, no=0)', 'scrap fully (Y/N)', 'uri', 'lang',
       'isDuplicate', 'time', 'dateTime', 'dateTimePub', 'dataType', 'sim',
       'source', 'authors', 'image', 'eventUri', 'sentiment', 'wgt',
       'relevance', 'sourceTitle', 'sourceUri'],
      dtype='object')

In [298]:
# Extracting necessary column for clustering
dataset = df_news[["date","title","body", "isDuplicate"]].copy()

# Filtering out any duplicated news in the dataset
duplicate_count = dataset['isDuplicate'].value_counts()
duplicate_count

isDuplicate
False    818
Name: count, dtype: int64

Comment: No duplicated news. Next, it is going to create some functions to extract different types of keywords for clustering.

# Generating news summary

This part aims to generate summary to extract the main meaning of the news. It could reduce the noise by using the original body of news for clustering.

In [299]:
sum_df = dataset.copy()
sum_df1 = sum_df[['date','title','body']].copy()

In [300]:
from transformers import pipeline

summarizer = pipeline("summarization", 
                      model = "/Users/amosmbp14/Jupyter notebook/Text_sum/t5_base_models")

  from .autonotebook import tqdm as notebook_tqdm


In [304]:
# summarize the news body
sum_df1['summary'] = sum_df1['body'].apply(summarizer)

Your max_length is set to 200, but your input_length is only 72. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=36)
Your max_length is set to 200, but your input_length is only 70. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=35)
Your max_length is set to 200, but your input_length is only 100. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)
Your max_length is set to 200, but your input_length is only 126. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=63)
Yo

In [305]:
# Define a function to extract the summary from the dictionary

def extract_summary(summary):
    # Check if the input is a list and not empty
    if isinstance(summary, list) and summary:
        # Extract the 'summary_text' from the 1st dictionary of the list
        entry_1st = summary[0]
        if isinstance(entry_1st, dict) and 'summary_text' in entry_1st:
            return entry_1st['summary_text']
        else:
            return "Invalid data structure: Missing 'summary_text' key."

    else:
        return 'Invalid data sturcture: not a non-empty list.'

In [306]:
# Define a function to clean the extracted summary
import re

def clean_summary_text(summary_text):
    # remove extra spaces before periods
    cleaned_text = re.sub(r'\s+\.', '.', summary_text)

    # Add a period at the end if the last sentences dosen't have one
    if not cleaned_text.endswith('.'):
        cleaned_text += '.'

    # Correct spacing after periods that end with a capital letter
    cleaned_text = re.sub(r'(?<=[a-zA-Z])\.([A-Z])', r'. \1', cleaned_text)

    return cleaned_text

In [307]:
# extract summary from dictionary and clean it
sum_df1['summary'] = sum_df1['summary'].apply(extract_summary)
sum_df1['summary'] = sum_df1['summary'].apply(clean_summary_text)

In [308]:
sum_df1['summary'].loc[0]

"Ingunn is the strongest storm since Norway's 1992 New Year's hurricane. Storm Jocelyn follows hot on the heels of Storm Isha with wind speeds of up to 156 kph. Climate change is making storms more impactful in various ways, experts say."

# Generate keywords by using 'KeyBERT'

In [311]:
# Define the pre-trained keybert model for keywords extraction
from keybert import KeyBERT

kw = KeyBERT()
kw_model = KeyBERT('all-mpnet-base-v2')

In [312]:
# Test the function of keyBERT model to extract keywords
testing_text = sum_df1['summary'].loc[2]
print('1. Testing sentence:')
print(testing_text)

keywords = kw_model.extract_keywords(testing_text, 
                                     top_n = 10)
print('')
print('2. Output from our KeyBERT model:')
keywords

1. Testing sentence:
farmers are intensifying their pressure on EU leaders gathered in Brussels today. Belgian farmers took a prominent stance by blocking the city center with their tractors. EU offered to shield farmers from cheap Ukrainian imports and proposed a delay.

2. Output from our KeyBERT model:


[('farmers', 0.51),
 ('tractors', 0.3321),
 ('leaders', 0.2705),
 ('brussels', 0.2484),
 ('eu', 0.2062),
 ('pressure', 0.1967),
 ('belgian', 0.1836),
 ('blocking', 0.1644),
 ('stance', 0.1148),
 ('intensifying', 0.1104)]

In [313]:
# Test the function of keyBERT model to extract keywords
testing_text_2 = sum_df1['summary'].loc[78]
print('1. Testing sentence:')
print(testing_text_2)

keywords_2 = kw_model.extract_keywords(testing_text_2, 
                                     top_n = 10)
print('')
print('2. Output from our KeyBERT model:')
keywords_2

1. Testing sentence:
Switzerland agrees to host a global peace summit on Ukraine at request of Ukrainian President Volodymyr Zelenskiy. Neutral Switzerland has previously acted as a broker to resolve conflicts and could now help find a resolution for the war that began when Russia sent troops into Ukraine on Feb. 24, 2022. Ukraine has said it was vital that Beijing was involved in talks to end the conflict.

2. Output from our KeyBERT model:


[('ukraine', 0.4972),
 ('switzerland', 0.4012),
 ('ukrainian', 0.364),
 ('russia', 0.3457),
 ('war', 0.3255),
 ('peace', 0.2825),
 ('conflicts', 0.2408),
 ('conflict', 0.2209),
 ('summit', 0.2041),
 ('talks', 0.199)]

In [314]:
# Test the function of keyBERT model to extract keywords
text = sum_df1['summary']

kds = kw_model.extract_keywords(text, 
                                top_n = 10)

In [315]:
# add a new column to stored the extracted the keywords
sum_df1['extract_kds'] = kds
sum_df1.head(3)

Unnamed: 0,date,title,body,summary,extract_kds
0,2024-02-01,Norway hit by worst storm in 30 years - is cli...,Storm Ingunn follows back to back storms that ...,Ingunn is the strongest storm since Norway's 1...,"[(storm, 0.5237), (storms, 0.4863), (hurricane..."
1,2024-02-01,Victims left with 'life-changing' injuries aft...,Superintendent Gabriel Cameron said none of th...,a 31-year-old woman and her daughter seriously...,"[(acid, 0.3922), (corrosive, 0.3589), (attack,..."
2,2024-02-01,Angry farmers invade the streets of Brussels,Farmers are intensifying their pressure on EU ...,farmers are intensifying their pressure on EU ...,"[(farmers, 0.51), (tractors, 0.3321), (leaders..."


In [316]:
# Extract all keywords from 'extract_kds' without the scores
extracted_kds = [[item[0] for item in row] for row in sum_df1['extract_kds']]

In [317]:
# add a new column 'keywords' to df
sum_df1['keywords'] = extracted_kds
sum_df1.head(3)

Unnamed: 0,date,title,body,summary,extract_kds,keywords
0,2024-02-01,Norway hit by worst storm in 30 years - is cli...,Storm Ingunn follows back to back storms that ...,Ingunn is the strongest storm since Norway's 1...,"[(storm, 0.5237), (storms, 0.4863), (hurricane...","[storm, storms, hurricane, ingunn, strongest, ..."
1,2024-02-01,Victims left with 'life-changing' injuries aft...,Superintendent Gabriel Cameron said none of th...,a 31-year-old woman and her daughter seriously...,"[(acid, 0.3922), (corrosive, 0.3589), (attack,...","[acid, corrosive, attack, injured, clapham, ch..."
2,2024-02-01,Angry farmers invade the streets of Brussels,Farmers are intensifying their pressure on EU ...,farmers are intensifying their pressure on EU ...,"[(farmers, 0.51), (tractors, 0.3321), (leaders...","[farmers, tractors, leaders, brussels, eu, pre..."


In [318]:
# Join the keywords by comma for the column 'keywords'
sum_df1['keywords'] = sum_df1['keywords'].apply(lambda x: ', '.join(x))
sum_df1.head(3)

Unnamed: 0,date,title,body,summary,extract_kds,keywords
0,2024-02-01,Norway hit by worst storm in 30 years - is cli...,Storm Ingunn follows back to back storms that ...,Ingunn is the strongest storm since Norway's 1...,"[(storm, 0.5237), (storms, 0.4863), (hurricane...","storm, storms, hurricane, ingunn, strongest, w..."
1,2024-02-01,Victims left with 'life-changing' injuries aft...,Superintendent Gabriel Cameron said none of th...,a 31-year-old woman and her daughter seriously...,"[(acid, 0.3922), (corrosive, 0.3589), (attack,...","acid, corrosive, attack, injured, clapham, chi..."
2,2024-02-01,Angry farmers invade the streets of Brussels,Farmers are intensifying their pressure on EU ...,farmers are intensifying their pressure on EU ...,"[(farmers, 0.51), (tractors, 0.3321), (leaders...","farmers, tractors, leaders, brussels, eu, pres..."


# HDBSCAN Clustering 

In [319]:
from sentence_transformers import SentenceTransformer

# Download the sentence embedding model
embedder = SentenceTransformer('all-mpnet-base-v2')

# embed article body
corpus_embeddings = embedder.encode(sum_df1['keywords'].values)
corpus_embeddings.shape

(818, 768)

Reducing the dimension by UMAP, here is the instruction of parameter:

'n_neighbors': Constructing the high-dimensional representation of the data. More global structure could be preserved if increased.

'min_dist': Controlling the minimumn distance between points in the low-dimensional embedding space. More packed embeddings if decreased.

In [320]:
import umap

# Reduce the dimensions of embeddings using UMAP
reduced_embeddings = umap.UMAP(n_components=2,
                              n_neighbors= 20,
                              min_dist=0.02).fit_transform(corpus_embeddings)



HDBSCAN clustering parameter: 'min_cluster_size': Specifying the minimum number of points required to group of points to be a cluster.

In [321]:
import hdbscan
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Define a range of min_cluster_size values to search over
min_cluster_size_range = list(range(2,9))  # Adjust as needed

# Initialize variables to store the best cluster and its evaluation scores
best_min_cluster_size = None
best_silhouette_score = float('-inf')
best_calinski_harabasz_score = float('-inf')
best_davies_bouldin_score = float('inf')

# Perform clustering for each min_cluster_size value
for min_cluster_size in min_cluster_size_range:
    # Clustering with HDBSCAN
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    labels = clusterer.fit_predict(reduced_embeddings)
    
    # Compute evaluation scores
    silhouette = silhouette_score(reduced_embeddings, labels)
    calinski_harabasz = calinski_harabasz_score(reduced_embeddings, labels)
    davies_bouldin = davies_bouldin_score(reduced_embeddings, labels)
    
    # Update best cluster if current cluster has better scores
    if silhouette > best_silhouette_score and calinski_harabasz > best_calinski_harabasz_score and davies_bouldin < best_davies_bouldin_score:
        best_min_cluster_size = min_cluster_size
        best_silhouette_score = silhouette
        best_calinski_harabasz_score = calinski_harabasz
        best_davies_bouldin_score = davies_bouldin

# Print the best cluster and its evaluation scores
print(f"Best Min Cluster Size: {best_min_cluster_size}")
print(f"Best Silhouette Score: {best_silhouette_score}")
print(f"Best Calinski-Harabasz Index: {best_calinski_harabasz_score}")
print(f"Best Davies-Bouldin Index: {best_davies_bouldin_score}")


Best Min Cluster Size: 2
Best Silhouette Score: 0.3325442969799042
Best Calinski-Harabasz Index: 34.246652175759685
Best Davies-Bouldin Index: 1.5284217064690295


In [322]:
import hdbscan

# clustering with HDBSCAN
clusterer = hdbscan.HDBSCAN(min_cluster_size = best_min_cluster_size)
labels = clusterer.fit_predict(reduced_embeddings)
sum_df1["label"] = [str(label) for label in labels]
print(f"Num of clusters: {labels.max()}")

Num of clusters: 116


In [323]:
# number of outliers
num_outliers = len(sum_df1[sum_df1["label"] == "-1"])
print(f"Num of outliers: {num_outliers} ({num_outliers / len(sum_df1) * 100:.2f} % of total)")

Num of outliers: 133 (16.26 % of total)


In [324]:
sum_df1['label'].value_counts()

label
-1     133
41      18
91      17
31      16
50      16
      ... 
67       2
38       2
58       2
12       2
107      2
Name: count, Length: 118, dtype: int64

In [326]:
# save for record
sum_df1.to_csv('clustered_HDBSCAN.csv', index = False)

# Nested clustering

In [327]:
# Subset the dataframe 
nested_df = sum_df1[['date','title','body','summary','keywords','label']].copy()
nested_df.columns

# pre-process text
docs = nested_df['summary'].astype(str)  # Convert to string type if necessary
docs = docs.str.lower()  # Convert all strings to lowercase
docs = docs.str.replace(r'[^\w\s]', '')  # Remove non-alphanumeric characters using regex

In [328]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Cosine similarity 
cosine_sim = cosine_similarity(tfidf_matrix)

# no.of cluster
n_clusters = int(len(df)/2)

# H-Clustering
H_clustering = AgglomerativeClustering(n_clusters = n_clusters,
                                      linkage = 'complete')
H_clustering.fit(cosine_sim)

# Assign nested_label to the dataframe
nested_df['nested_label'] = H_clustering.labels_.astype(str)

In [329]:
# Combine label and nested_label into group_no
nested_df['group_no'] = nested_df.apply(lambda x: f"{x['label']}{x['nested_label']}" if x['nested_label'] != -1 else -1, axis=1)

# Find unique group_no values
unique_group_nos = nested_df['group_no'].value_counts()[nested_df['group_no'].value_counts() == 1].index.tolist()

# Set 'group_no' to -1 for rows with unique group_no values
nested_df.loc[nested_df['group_no'].isin(unique_group_nos), 'group_no'] = -1

# Set 'group_no' to -1 for rows with negative group_no values
nested_df.loc[nested_df['group_no'].astype(int) < 0, 'group_no'] = -1

In [330]:
nested_df['group_no'].value_counts()

group_no
-1        559
9541        8
1129        8
6557        5
4171        5
         ... 
86157       2
104209      2
3445        2
24102       2
9932        2
Name: count, Length: 102, dtype: int64

In [334]:
groupno_9541 = nested_df[nested_df['group_no'] == '9541']
groupno_9541 = groupno_9541[['title','summary','group_no']]
groupno_9541

Unnamed: 0,title,summary,group_no
175,Monkey caught in Cairngorms after five days on...,Japanese macaque escaped from the Highland Wil...,9541
236,Monkey searchers absolutely spank it by findin...,Japanese macaque that escaped from a Scottish ...,9541
332,Escaped Highland monkey 'Kinghussie Kong' is r...,"Japanese macaque, nicknamed Kinghussie Kong, t...",9541
357,Escaped monkey found snacking on peanuts in Hi...,Japanese macaque found snacking on peanuts aft...,9541
440,Escaped monkey captured eating from garden bir...,Japanese macaque found a way out of its enclos...,9541
538,Scotland's missing monkey has been found after...,Japanese Macaque found a way out of its enclos...,9541
585,Escaped monkey captured after roaming Highland...,Japanese macaque broke out of the Highland Wil...,9541
589,Escaped monkey remains on the loose as search ...,Japanese macaque remains on the loose in Kincr...,9541


In [335]:
groupno_1129 = nested_df[nested_df['group_no'] == '1129']
groupno_1129 = groupno_1129[['title','summary','group_no']]
groupno_1129

Unnamed: 0,title,summary,group_no
191,Martin Brundle leads Lewis Hamilton to Ferrari...,"Martin Brundle admits he is ""so looking forwar...",1129
198,Carlos Sainz and eight other potential replace...,Lewis Hamilton has not won an F1 race for over...,1129
211,Lewis Hamilton to Ferrari LIVE: Latest F1 updates,Lewis Hamilton is set to leave Mercedes for Fe...,1129
336,Lewis Hamilton to Ferrari live: Mercedes drive...,Lewis Hamilton is expected to make a shock swi...,1129
339,Lewis Hamilton in talks to join Ferrari from M...,Seven-time world champion Lewis Hamilton is ex...,1129
400,Lewis Hamilton: F1 star set for sensational sw...,Lewis Hamilton is set to leave Mercedes and ma...,1129
461,Lewis Hamilton set to join Ferrari for 2025 Fo...,Lewis Hamilton is on the brink of a shock move...,1129
542,Lewis Hamilton set to leave Mercedes and move ...,Lewis Hamilton is expected to be leaving Merce...,1129


In [336]:
groupno_6557 = nested_df[nested_df['group_no'] == '6557']
groupno_6557 = groupno_6557[['title','summary','group_no']]
groupno_6557

Unnamed: 0,title,summary,group_no
193,Labour government will cap corporation tax for...,Labour will cap corporation tax at 25 per cent...,6557
218,Rachel Reeves defends decision to allow unlimi...,Labour's holder of the purse strings says she ...,6557
276,Rachel Reeves defends decision not to restore ...,Rachel Reeves has defended her decision not to...,6557
379,Labour rules out raising corporation tax above...,Shadow chancellor Rachel Reeves unveils raft o...,6557
401,Rachel Reeves: Labour will cap corporation tax...,Rachel Reeves tells bosses a Labour government...,6557


In [344]:
groupno_3445 = nested_df[nested_df['group_no'] == '3445']
groupno_3445 = groupno_3445[['title','summary','group_no']]
groupno_3445

Unnamed: 0,title,summary,group_no
265,Justin Timberlake makes 'disgusting' comment o...,Britney apologised to Justin Timberlake just 4...,3445
675,Janet Jackson and Justin Timberlake's 'Nippleg...,"""Nipplegate"" occurred at Super Bowl XXXVIII's ...",3445


In [338]:
nested_df.to_csv('clustering_news.csv', index = False)