Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Bongani Fortunate Ncube"
StudentID = "25503848"

---

## COS 760 Homework 2 - Part 2 2025
## Exploratory Analysis using unsupervised approaches.

* Year: **2025**
* Author: Prof **Vukosi Marivate**, Fiskani Banda
* Contact: vukosi.marivate@cs.up.ac.za, fiskani.banda@tuks.co.za

**You will learn how to:**
- Revisit k-means clustering, but now for text data
- Visualise documents using dimensionality reduction with PCA and t-SNE

**Note:** You can experiment by adding additional cells, but they must be removed from the final solution. Only the cells originally in the notebook, plus the ones you have filled in with your solution, are required.

## 1 Packages ##

Scikit-learn for text analysis and Matplotlib for plotting:
- [sklearn](https://scikit-learn.org/stable/auto_examples/text/index.html): scikit-learn tools for working with text documents.
- [matplotlib](http://matplotlib.org) is a library for plotting graphs in Python.

In [None]:
%pylab inline

In [3]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
import string

## Fetch Nigerian news articles from AriseTV

In [2]:
url='https://raw.githubusercontent.com/chimaobi-okite/NLP-Projects-Competitions/refs/heads/main/NewsCategorization/Data/train.csv'
df = pd.read_csv(url)
print("Number of records (news articles): ",df.shape[0])
df.head()

# Questions

In this homework, our focus shifts away from categories. We aim to explore the data **without** labels. Let's save the extracted text in *documents* and proceed with converting our data using TF-IDF.

In [None]:
documents = df.Excerpt

vectorizer_tfidf = TfidfVectorizer(min_df = 5,
                                   max_df = 0.95,
                                   max_features = 5000)
vectorizer_tfidf.fit(documents)
X_tfidf = vectorizer_tfidf.transform(documents)
print("Transformed Data Size: {}".format(X_tfidf.shape))
tfidf_feature_names = vectorizer_tfidf.get_feature_names_out()

In [None]:
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans

## Q1.1: Clustering using k-means

**Resources:**  
* K-Means Clustering - An Introduction - [URL](https://towardsdatascience.com/k-means-clustering-an-introduction-9825ea998d1e/)  
* Why Mini-Batch KMeans? [URL](https://upcommons.upc.edu/bitstream/handle/2117/23414/R13-8.pdf)  

**Task:**  

* Investigate the optimal number of clusters for the given data. Use the sklearn Mini-Batch K-means with a *batch_size* of 2048. [4 points]  

**Outputs:**  

* Save the trained model as **clf**  
* Store the inertia after fitting the K-means model in **within_cluster**  

**Note:** Training might take some time. Please be patient.
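
As a rough illustration of the scikit-learn API involved, the sketch below fits `MiniBatchKMeans` on synthetic 2-D points and reads the inertia from the fitted model. The data and the names `toy_points`, `toy_model`, and `toy_inertia` are assumptions made for this example only; they are not the expected solution for the homework data.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Two well-separated synthetic blobs (hypothetical data).
rng = np.random.default_rng(0)
toy_points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 2)),
])

# batch_size is the number of samples drawn per mini-batch update.
toy_model = MiniBatchKMeans(n_clusters=2, batch_size=2048, n_init=3, random_state=0)
toy_model.fit(toy_points)

# inertia_ is the within-cluster sum of squared distances to the closest centre.
toy_inertia = toy_model.inertia_
print(toy_model.n_clusters, toy_inertia)
```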

In [None]:
# A 1.1 [4 points]
def train_kmeans(X, k):
    '''
    Trains kmeans clustering using MiniBatchKMeans.

            Parameters:
                    X (numpy array): input data
                    k (int): Number of clusters

            Returns:
                    clf (model): KMeans Model
                    within_cluster: Kmeans within cluster sum of squares
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
    return clf, within_cluster

In [None]:
k = 1 # Single cluster
clf, within_cluster = train_kmeans(X_tfidf, k)
assert clf.n_clusters == 1
assert clf.inertia_ > 100.0

In [None]:
repetitions = 5
x_range = range(20,421,20)
within_cluster = []
for k in x_range:
    temp_wc = []
    print('Fit {} clusters'.format(k))
    for i in range(repetitions):
        clf, within_cluster_single = train_kmeans(X_tfidf, k)
        temp_wc.append(within_cluster_single )
    within_cluster.append(temp_wc)
within_cluster = np.array(within_cluster)

## Let's plot the inertia (sum of squared errors)

In [None]:
figsize(16,9)
y = np.mean(within_cluster,axis=1)
plt.plot(x_range,y, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (sum of squared errors)')
plt.title('Number of clusters vs. inertia')
plt.grid()

## Q1.2 Discussion  

* a) What do you observe about the pattern in the Inertia vs. Number of Clusters ($k$) plot? [2 points]  
* b) Based on the elbow method, what would be your chosen $k$? [1 point]
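
For intuition about the elbow method, here is a small sketch on synthetic data with three obvious groups. It uses one possible heuristic, the largest second difference of the inertia curve, to locate the bend. That heuristic, the data, and all names (`toy_X`, `ks`, `inertias`, `elbow_k`) are assumptions for illustration; on real text data the elbow is usually far less sharp and is often judged by eye.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: three well-separated blobs.
toy_X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)

ks = list(range(1, 7))
inertias = np.array([
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_X).inertia_
    for k in ks
])

# The second difference measures how sharply the curve bends at each interior k.
second_diff = np.diff(inertias, n=2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]
print(elbow_k)
```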

YOUR ANSWER HERE

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

## Visualizing the Clusters  

For easier visualization, we will set $k = 20$.  

In this section, you will use **PCA** and **t-SNE** to reduce the dimensionality of the original TF-IDF data to 2 dimensions.  

**Resources:**  
* PCA Documentation - [URL](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) | [Video Explanation](https://www.youtube.com/watch?v=kw9R0nD69OU)  
* t-SNE Documentation - [URL](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) | [Video Explanation](https://www.youtube.com/watch?v=NEaUSP4YerM)  

**Note:** We do not cover **UMAP** in this assignment, but you may explore it for your own curiosity.  
[Video Explanation](https://www.youtube.com/watch?v=eN0wFzBA4Sc)
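
The sketch below shows the general shape of both reductions on a small random dense matrix. The data and the names `toy_features`, `toy_pca`, and `toy_tsne` are hypothetical, and the `perplexity` value is chosen only to suit the tiny sample. Note that TF-IDF output is a sparse matrix; depending on the installed scikit-learn version, `PCA` may require a dense array (for example via `.toarray()`).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical dense data: 60 samples with 10 features each.
rng = np.random.default_rng(0)
toy_features = rng.random((60, 10))

# PCA: linear projection onto the 2 directions of largest variance.
toy_pca = PCA(n_components=2, random_state=0).fit_transform(toy_features)

# t-SNE: non-linear embedding that tries to preserve local neighbourhoods.
# perplexity must be smaller than the number of samples.
toy_tsne = TSNE(
    n_components=2, perplexity=10, init="random", random_state=0
).fit_transform(toy_features)

print(toy_pca.shape, toy_tsne.shape)
```

PCA axes have a direct interpretation as directions of variance, whereas t-SNE axes and inter-cluster distances should not be read quantitatively.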

In [None]:
k = 20
clf, wc = train_kmeans(X_tfidf, k)
cluster_labels = clf.predict(X_tfidf)

In [None]:
random_sample = np.random.choice(range(X_tfidf.shape[0]), size=2000, replace=False)
X_tfidf_sample = X_tfidf[random_sample,:]
cluster_labels_sample = cluster_labels[random_sample]

In [None]:
df_samples = df.iloc[random_sample].copy() # Copy so that adding columns does not trigger pandas' SettingWithCopyWarning
df_samples["cluster_labels"] = cluster_labels_sample.astype(str).tolist()
news_titles = df.Title.iloc[random_sample]

## Q2.1 Dimensionality Reduction  

Fit **PCA** and **t-SNE** with 2 dimensions (`n_components=2`). Apply these transformations to **X_tfidf_sample**. [4 points]

In [None]:
X_tfidf_sample.shape

In [None]:
# A 2.1 [4 points]
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert X_pca.shape[0] == X_tfidf_sample.shape[0]
assert X_pca.shape[1] == 2
assert X_tsne.shape[0] == X_tfidf_sample.shape[0]
assert X_tsne.shape[1] == 2

In [None]:
figsize(16,9)
scatter = plt.scatter(X_pca[:,0],X_pca[:,1], c=cluster_labels_sample,cmap='jet_r') # cmap is the color map for the plot
plt.grid()
plt.title("Cluster visualisation using PCA", fontsize=18)
plt.xlabel("PCA_0",fontsize=14)
plt.ylabel("PCA_1",fontsize=14)
plt.legend(*scatter.legend_elements(),
                    loc="upper left", title="Cluster")

In [None]:
figsize(16,9)
scatter = plt.scatter(X_tsne[:,0],X_tsne[:,1], c=cluster_labels_sample,cmap='jet_r') # cmap is the color map for the plot
plt.grid()
plt.title("Cluster visualisation using TSNE", fontsize=18)
plt.xlabel("tsne_0",fontsize=14)
plt.ylabel("tsne_1",fontsize=14)
plt.legend(*scatter.legend_elements(),
                    loc="upper left", title="Cluster")

In [None]:
df_samples["tsne_0"] = X_tsne[:,0]
df_samples["tsne_1"] = X_tsne[:,1]

## Interactive Plot  

To help explore the t-SNE plot, we provide an interactive **Plotly** plot.  

**About [Plotly](https://plotly.com/python/):** Plotly's Python graphing library enables interactive, publication-quality visualizations.  

You can hover over the scatter plot to view details of individual points.

In [None]:
import plotly.express as px

In [None]:
np.sort(df_samples["cluster_labels"].unique())

In [None]:
fig = px.scatter(df_samples,x="tsne_0",y="tsne_1",
                 color="cluster_labels", hover_data=["Title"], height=720,
                title="Cluster visualisation using TSNE - plotly")
fig.update_layout(legend_traceorder="reversed")
fig.show()

In [None]:
## Helper function to get the top words per cluster
def get_top_keywords(X, clusters, labels, n_terms):
    df = pd.DataFrame(X.toarray()) # Convert the sparse TF-IDF data to a dataframe
    df = df.groupby(clusters).mean() # Group by cluster and calculate the mean per feature (word)

    for i,r in df.iterrows():
        print('\nCluster {}'.format(i))
        # Show the n_terms words with the highest mean TF-IDF weight (ascending order)
        print(','.join([labels[t] for t in np.argsort(r.to_numpy())[-n_terms:]]))

get_top_keywords(X_tfidf_sample, cluster_labels_sample, vectorizer_tfidf.get_feature_names_out(), 10)
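
To see what the helper computes, here is a minimal sketch of the same idea on a hand-made 4x3 weight matrix: average the weights per cluster, then rank the terms by that average. All names (`toy_weights`, `toy_clusters`, `toy_terms`, `top_terms`) are hypothetical, and unlike the helper above this variant collects the terms in a dictionary, highest weight first.

```python
import numpy as np
import pandas as pd

# Hypothetical TF-IDF-like weights: 4 documents x 3 terms.
toy_weights = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 0.7, 0.3],
    [0.1, 0.9, 0.0],
])
toy_clusters = np.array([0, 0, 1, 1])
toy_terms = np.array(["election", "football", "today"])

# Mean weight of every term within each cluster.
cluster_means = pd.DataFrame(toy_weights).groupby(toy_clusters).mean()

# For each cluster, take the 2 terms with the highest mean weight.
top_terms = {
    int(cluster_id): toy_terms[np.argsort(row.to_numpy())[::-1][:2]].tolist()
    for cluster_id, row in cluster_means.iterrows()
}
print(top_terms)
```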

## Q2.2 Observations  

* **a)** What patterns do you notice in the cluster visualizations? Compare and contrast the results from **PCA** and **t-SNE** in terms of how they separate the data. [2 points]  
* **b)** Based on the words within each cluster, can you explain the possible themes or topics represented? [1 point]  
* **c)** Refer to the [NewsCodes List](https://github.com/dsfsi/za-isizulu-siswati-news-2022/blob/main/data/news-categories-iptc-newscodes.csv). Select **five clusters** and match each with an appropriate news category. (**Note:** Copy the cluster words from the output above and append a relevant news category name.) [1 point]

YOUR ANSWER HERE