### Project Description:

The main goal of the assignment is to perform topic modeling on a dataset of news articles and cluster them into different topics. 

This kind of analysis can be helpful for various purposes, such as understanding public sentiment, identifying significant events, and organizing the navigation of news articles by topic.

In [None]:
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import yaml 

#### load & prepare data

To begin, I loaded the data and did an initial inspection: its shape, the types of the columns (topics), and a sample of the rows. This helps to check whether the data is suitable for the rest of the investigation and matches the expected format.

In [None]:
#load the data

configPath = 'config.yaml'

# Read the yaml data from the file
with open(configPath, 'r') as file:
    configData = yaml.safe_load(file)

data = pd.read_csv(configData["World_News_path"])

data.head()

In [None]:
print(data.shape)

print(data.dtypes)

Based on the above results, the first column will not be included in the rest of the assignment, because it holds the date of the topics, which is not relevant to the aim of this assignment.

##### Preprocessing data

The text data from the columns "Top1" to "Top25" is extracted and then cleaned using the clean_text function. The cleaning replaces every non-letter character with a space, collapses repeated spaces, and converts all text to lowercase. This makes the texts more uniform in format, so the model can better recognize the different contexts.

In [None]:
# Extract text from columns
text_data = data.loc[:, "Top1":"Top25"].values.flatten().tolist()

print(text_data)

In [None]:
def clean_text(text):
    text = re.sub(r"[^a-zA-Z]", " ", text)  # Replace every non-letter character with a space
    text = re.sub(r"\s{2,}", " ", text)  # Collapse runs of two or more spaces into one
    text = text.strip()  # Remove leading and trailing spaces
    text = text.lower()  # Convert to lowercase
    
    return text


In [None]:
cleaned_text = [clean_text(text) if isinstance(text, str) else "" for text in text_data]

cleaned_text

As shown, everything except letters and single spaces between words has been removed.
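
A quick check of what the cleaning does to a single string can make this concrete. The snippet below repeats the same clean_text logic so that it runs on its own; the sample headline is made up for illustration and is not taken from the dataset.

```python
import re

def clean_text(text):
    text = re.sub(r"[^a-zA-Z]", " ", text)  # non-letters become spaces
    text = re.sub(r"\s{2,}", " ", text)     # collapse repeated spaces
    return text.strip().lower()

# Hypothetical headline, used only to illustrate the behaviour
sample = "U.S. signs 3 new trade deals!"
cleaned_sample = clean_text(sample)
print(cleaned_sample)
```

Note that digits are dropped and abbreviations such as "U.S." are split into single letters, which may or may not be desirable for topic modeling.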

#### convert data to Document-Term Matrix (DTM)

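In this step, the cleaned text is converted into a Document-Term Matrix (DTM) using TF-IDF vectorization. Each row represents one text entry and each column one term. The values are TF-IDF weights rather than raw counts: a word's frequency in an entry is scaled by how rare the word is across all entries. The vocabulary is limited to the 1000 most frequent terms (max_features=1000), with English stop words removed.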
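
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand for three made-up documents. It assumes the formula that scikit-learn documents for its default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the documents and all variable names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the dataset
docs = ["oil prices rise", "oil talks stall", "election talks begin"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word appears
df = Counter(w for doc in tokenized for w in set(doc))
# Smoothed inverse document frequency
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]  # L2-normalized row

toy_dtm = [tfidf_row(doc) for doc in tokenized]
for word, weight in zip(vocab, toy_dtm[0]):
    print(f"{word:10s} {weight:.3f}")
```

In the first document, "prices" receives a higher weight than "oil" because "oil" also occurs in another document.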

In [None]:
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")

# Fit and transform the text data
dtm = vectorizer.fit_transform(cleaned_text)

# Convert the DTM to a dense array
dtm_array = dtm.toarray()

dtm

#### applying Non-Negative matrix factorization (NMF)

The Non-Negative Matrix Factorization (NMF) algorithm is applied to the DTM. NMF is an unsupervised machine learning algorithm commonly used for text clustering. It decomposes the DTM into two non-negative matrices, one representing the document-topic weights and the other representing the topic-term weights.
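
To make the decomposition concrete, here is a minimal NumPy sketch that factorizes a small random non-negative matrix `V` into `W` (document-topic) and `H` (topic-term) using multiplicative updates. This is an illustration only: scikit-learn's NMF uses a different solver by default, and the matrix sizes, iteration count, and names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 5))   # toy non-negative "DTM": 6 documents, 5 terms
k = 2                    # number of topics (assumed for the example)
W = rng.random((6, k))   # document-topic weights
H = rng.random((k, 5))   # topic-term weights
eps = 1e-10              # avoids division by zero

errors = []
for _ in range(200):
    # Multiplicative updates keep W and H non-negative
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    errors.append(np.linalg.norm(V - W @ H))  # Frobenius norm of the residual

print(f"error at start: {errors[0]:.3f}, at end: {errors[-1]:.3f}")
```

The Frobenius norm tracked here is the same kind of quantity that scikit-learn exposes as `reconstruction_err_`.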

NMF requires the number of clusters to be set in advance. To find a suitable number, the elbow method is used.

In [None]:
import matplotlib.pyplot as plt

# Calculating the NMF reconstruction error (used here in place of WCSS) for different numbers of clusters
wcss = []
max_cluster = 10 # try 1 to 10 clusters
for num_cluster in range(1, max_cluster + 1):
    nmf = NMF(n_components = num_cluster, random_state = 42)  # use the loop variable, not max_cluster
    nmf.fit(dtm_array)
    wcss.append(nmf.reconstruction_err_)

In [None]:
# Plotting the Elbow Curve
plt.figure(figsize=(8, 6))
plt.plot(range(1, max_cluster + 1), wcss, marker='o')
plt.xlabel("Number of clusters")
plt.ylabel("Within-cluster Sum of Squares (WCSS)")
plt.title("Elbow method to find proper number of clusters")
plt.xticks(range(1, max_cluster + 1))
plt.grid()
plt.show()

The elbow plot above is close to a straight line with a roughly constant slope (the plotted error stays between 201 and 207), so there is no clear elbow. Since I did not want more than 10 clusters, this suggests that the data may not contain distinct clusters within that range.

Because the elbow method alone does not give a well-defined number of clusters, the silhouette score is used as a second check on clustering quality, again for up to max_cluster = 10 clusters.
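
For intuition about what the silhouette score measures, the sketch below computes it by hand for four one-dimensional points. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the mean distance to the nearest other cluster, and the score is `(b - a) / max(a, b)`. The points, labels, and helper names are made up, and the code assumes every cluster has at least two points.

```python
def mean_distance(point, others):
    return sum(abs(point - q) for q in others) / len(others)

def silhouette(points, labels):
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same cluster
        own = [q for j, q in enumerate(points) if labels[j] == lab and j != i]
        a = mean_distance(p, own)
        # b: mean distance to the closest other cluster
        b = min(
            mean_distance(p, [q for j, q in enumerate(points) if labels[j] == other])
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

toy_points = [0.0, 1.0, 10.0, 11.0]
good = silhouette(toy_points, [0, 0, 1, 1])  # neighbours grouped together
bad = silhouette(toy_points, [0, 1, 0, 1])   # neighbours split apart
print(f"good split: {good:.2f}, bad split: {bad:.2f}")
```

A score near 1 means tight, well-separated clusters; a score near or below 0 means the clusters overlap.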

In [None]:
from sklearn.metrics import silhouette_score

max_cluster = 10
silhouette_scores = []
for num_cluster in range(2, max_cluster + 1):
    nmf = NMF(n_components = num_cluster, random_state = 42)
    nmf.fit(dtm_array)  # reuse the dense array instead of converting it again
    cluster_labels = nmf.transform(dtm_array).argmax(axis=1)
    silhouette_scores.append(silhouette_score(dtm_array, cluster_labels))

In [None]:
import numpy as np

# Finding the proper number of clusters based on the highest silhouette score
optimal_num_clusters = np.argmax(silhouette_scores) + 2 # + 2 because the loop starts at 2 clusters

print("Proper number of clusters:", optimal_num_clusters)

Based on the silhouette scores, I picked 3 clusters for the rest of the analysis. (Other selection methods exist and might suggest a different number.)

After fitting NMF, the clusters and their corresponding topics are obtained. The document-topic weights come from nmf.transform(dtm_array): each document gets a vector of non-negative weights, one per cluster. These weights are not normalized probabilities, but the largest one indicates the dominant topic, so each document is assigned to whichever of the 3 clusters has the highest weight.
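
The assignment rule can be illustrated on a small hand-written matrix. The values below are invented and only stand in for the output of nmf.transform. One caveat worth noting: a row of all zeros (for example, an empty document) has no dominant topic, yet `argmax` still returns index 0 for it.

```python
import numpy as np

# Hypothetical document-topic weights: 4 documents, 3 topics
toy_doc_topic = np.array([
    [0.02, 0.30, 0.00],
    [0.10, 0.01, 0.05],
    [0.00, 0.00, 0.40],
    [0.00, 0.00, 0.00],  # e.g. an empty document
])

toy_labels = toy_doc_topic.argmax(axis=1)   # index of the largest weight per row
is_empty = toy_doc_topic.sum(axis=1) == 0   # rows without any topic weight

for i, (label, empty) in enumerate(zip(toy_labels, is_empty)):
    note = " (all-zero row, assignment is arbitrary)" if empty else ""
    print(f"Document {i + 1} -> Cluster {label + 1}{note}")
```

If many headlines end up as empty strings after cleaning, they would all land in the first cluster this way, which may be worth checking when cluster sizes look very uneven.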

In [None]:
num_clusters = 3

nmf = NMF(n_components=num_clusters, random_state=42)
nmf.fit(dtm_array)

#### presenting results

The results are investigated in these steps:

1. Getting the top 10 words from each cluster: the words are ranked by the weight NMF gives them. 

2. Assigning documents to clusters: each document is assigned to the cluster with the highest weight in the document-topic matrix.

3. Visualization: the document-topic matrix is projected into 2D space with t-SNE.

In [None]:
# 1. Getting the 10 top words from each cluster
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

num_top_words = 10  # Define the number of top words to display

# Get the top words for each topic
for i, component in enumerate(nmf.components_):
    top_words_indices = component.argsort()[:-num_top_words-1:-1]
    top_words = [feature_names[idx] for idx in top_words_indices]
    print(f"Cluster/Topic {i+1}: {', '.join(top_words)}")


The above result shows the top 10 words in each cluster.
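
The slice `argsort()[:-num_top_words-1:-1]` used to pick the top words is compact but easy to misread. The toy example below, with made-up words and weights, shows that it returns the indices of the largest values in descending order.

```python
import numpy as np

# Hypothetical vocabulary and topic weights, for illustration only
toy_words = ["oil", "war", "tax", "vote", "bank"]
toy_component = np.array([0.1, 0.7, 0.0, 0.4, 0.9])
toy_num_top = 2

# argsort() sorts ascending; the reversed slice takes the last
# toy_num_top indices, i.e. the largest weights, biggest first
toy_top_indices = toy_component.argsort()[:-toy_num_top - 1:-1]
toy_top_words = [toy_words[idx] for idx in toy_top_indices]
print(toy_top_words)
```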

Next, each document (a headline in the main dataset) is assigned to one of the clusters based on its highest weight.

In [None]:
# 2. get the document-topic matrix
doc_topic_matrix = nmf.transform(dtm_array)

# Assign documents to clusters based on the highest probability
cluster_labels = np.argmax(doc_topic_matrix, axis=1)

# Print the assigned cluster for each document
for i, cluster_label in enumerate(cluster_labels):
    print(f"Document {i+1} belongs to Cluster {cluster_label+1}")

The dataset has 1859 rows and 25 topic columns (excluding the date column, as explained earlier), so there are 46475 (= 25 * 1859) documents. In the output above, each of the 46475 documents is assigned to one cluster.

In [None]:
#Visualization with t-sne in 2D.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Apply t-SNE to reduce dimensionality to 2D
tsne = TSNE(n_components=2, random_state=42)
doc_topic_tsne = tsne.fit_transform(doc_topic_matrix)

# Plot the clusters
plt.scatter(doc_topic_tsne[:, 0], doc_topic_tsne[:, 1], c=cluster_labels)
plt.title('Clustering of Documents')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.show()


The 2D scatter plot displays the results of text clustering using NMF on the news headlines dataset. Each point represents a news headline, and the color of a point corresponds to its assigned cluster. Points of the same cluster tend to lie close together, suggesting that documents within a cluster are similar. However, the clusters are not completely separated from each other and there is some noise between them. The plot therefore suggests that the clustering could be improved.

Also, the documents are distributed very unevenly across the clusters. To confirm this, the following code counts the documents in each cluster, using the cluster_labels computed in the "Assigning documents to clusters" step.

In [None]:
# Count the occurrences of each cluster label
cluster_counts = np.bincount(cluster_labels)

# Print the number of documents in each cluster
for cluster_num, doc_count in enumerate(cluster_counts):
    print(f"Cluster {cluster_num + 1}: {doc_count} documents")

The document counts per cluster are printed above and agree with the t-SNE plot: the distribution is very uneven. For instance, the smallest cluster has 393 documents, whereas the largest has 40837.

In conclusion, the data needs further analysis and investigation. The clustering of news topics could be compared using other methods, because the results of this assignment indicate that there is some similarity between headlines.