
# HW 5: Clustering and Topic Modeling

<div class="alert alert-block alert-warning">Each assignment needs to be completed independently. Never ever copy others' work (even with minor modification, e.g. changing variable names). Anti-Plagiarism software will be used to check all submissions. </div>

In this assignment, you'll need to use the following datasets:
- hw5_train.csv: This file contains a list of documents. It's used for training models.
- hw5_test.csv: This file contains a list of documents and their ground-truth labels. It's used for testing performance. This file is in the format shown below. Note that a document may have multiple labels.


**Note: due to randomness, every time you run your clustering models, you may get different results. To ease the grading process, once you get satisfactory results, please save your notebook as a pdf file (Jupyter notebook menu File -> Print -> Save as pdf), and submit this pdf along with your .py code.**

In [None]:
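# Necessary imports for Q1
import re
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_fscore_support


# Helpers called by cluster_kmean below. The notebook never defines them, so these
# are assumed minimal versions, not the original author's implementations.
def preprocess_documents(docs, stopwords=None):
    """Lowercase each document, keep alphabetic tokens, and drop listed stopwords."""
    stop = set() if stopwords is None or isinstance(stopwords, str) else set(stopwords)
    return [" ".join(t for t in re.findall(r"[a-z]+", str(d).lower()) if t not in stop)
            for d in docs]


def get_top_keywords(vectorizer, centroid, n):
    """Return the n vocabulary terms with the largest weights in a cluster centroid."""
    terms = vectorizer.get_feature_names_out()
    return [terms[i] for i in np.argsort(centroid)[::-1][:n]]


# Add your import statement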

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
train_data = pd.read_csv("hw5_train.csv")
train_data.head()

test_data = pd.read_csv("hw5_test.csv")
test_data.head()


## Q1: K-Means Clustering (5 points)

Define a function `cluster_kmean(train_data, test_data, num_clusters, min_df = 1, stopwords = None, metric = 'cosine')` as follows: 
- Take two dataframes as inputs: `train_data` is the dataframe loaded from `hw5_train.csv`, and `test_data` is the dataframe loaded from `hw5_test.csv`
- Use **KMeans** to cluster documents in `train_data` into 3 clusters by the distance metric specified. Tune the following parameters carefully:
    - `min_df` and `stopwords` options in generating the TFIDF matrix. You may need to remove corpus-specific stopwords in addition to the standard stopwords.
    - distance metric: `cosine` or `Euclidean` distance
    - sufficient iterations with different initial centroids to make sure clustering converges
- Test the clustering model performance using `test_data`: 
    - Predict the cluster ID for each document in `test_data`.
    - Apply `majority vote` rule to dynamically map each cluster to a ground-truth label in `test_data`. 
        - Note a small percentage of documents have multiple labels. For these cases, you can randomly pick a label during the match
        - Be sure `not to hardcode the mapping`, because a cluster may correspond to a different topic in each run. (hint: if you use pandas, look for the `idxmax` function)
    - Calculate `precision/recall/f-score` for each label. Your best F1 score on the test dataset should be around `80%`.
- Assign a meaningful name to each cluster based on the `top keywords` in each cluster. You can print out the keywords and write the cluster names as markdown comments.
- This function has no return. Print out confusion matrix, precision/recall/f-score. 
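
The majority-vote mapping described above can be sketched with `pd.crosstab` and `idxmax`. This is only an illustration on made-up data: the helper name `map_clusters_to_labels` and the toy `cluster_ids` and `labels` lists are assumptions, not part of the assignment files.

```python
import pandas as pd


def map_clusters_to_labels(cluster_ids, labels):
    """Map each cluster ID to the ground-truth label that occurs most often in it."""
    counts = pd.crosstab(pd.Series(cluster_ids, name="cluster"),
                         pd.Series(labels, name="label"))
    # idxmax(axis=1) returns, for each cluster (row), the label column with the highest count
    return counts.idxmax(axis=1).to_dict()


# Toy data (hypothetical): eight documents, three clusters, one label per document
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
labels = ["T2", "T2", "T1", "T1", "T1", "T3", "T3", "T2"]

mapping = map_clusters_to_labels(cluster_ids, labels)
predicted = [mapping[c] for c in cluster_ids]
print(mapping)
```

Because the mapping is recomputed from the counts on every run, it stays valid even when the cluster IDs are permuted between runs.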


**Analysis**:
- Comparing the clustering with cosine distance and that with Euclidean distance, do you notice any difference? Which metric works better here?
- How would the stopwords and min_df options affect your clustering results?
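
One fact that may help with the first question: for two unit-length vectors, squared Euclidean distance is exactly twice the cosine distance. `TfidfVectorizer` scales each row to unit length by default (`norm='l2'`), so between documents the two metrics rank neighbours identically unless that normalization is turned off. Cluster assignments can still differ, because centroids are generally not unit length. A small NumPy check on made-up vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical non-negative "TF-IDF-like" vectors, scaled to unit length
a = rng.random(5)
b = rng.random(5)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

squared_euclidean = np.sum((a - b) ** 2)
cosine_distance = 1.0 - np.dot(a, b)

# On unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
print(squared_euclidean, 2 * cosine_distance)
```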

In [None]:
def cluster_kmean(train_data, test_data, num_clusters, min_df=1, stopwords=None, metric='cosine'):
    # Preprocess train_data
    train_docs = preprocess_documents(train_data['text'], stopwords)

    # Create TF-IDF matrix
    tfidf_vectorizer = TfidfVectorizer(min_df=min_df, stop_words=stopwords)
    train_tfidf_matrix = tfidf_vectorizer.fit_transform(train_docs)

    # Cluster documents using KMeans
    kmeans = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, max_iter=300, random_state=0)
    kmeans.fit(train_tfidf_matrix)

    # Preprocess test_data
    test_docs = preprocess_documents(test_data['text'], stopwords)

    # Transform test_data to TF-IDF matrix
    test_tfidf_matrix = tfidf_vectorizer.transform(test_docs)

    # Predict cluster IDs for test_data
    test_cluster_ids = kmeans.predict(test_tfidf_matrix)

    # Majority vote to map cluster to ground-truth label
    cluster_labels = {}
    for cluster_id in range(num_clusters):
        cluster_docs_indices = np.where(test_cluster_ids == cluster_id)[0]
        cluster_docs = test_data.iloc[cluster_docs_indices]
        cluster_label_counts = cluster_docs['labels'].str.split(';').apply(pd.Series).stack().value_counts()
        cluster_label = cluster_label_counts.idxmax()
        cluster_labels[cluster_id] = cluster_label

    # Calculate performance metrics
    mlb = MultiLabelBinarizer()
    ground_truth = mlb.fit_transform(test_data['labels'].str.split(';'))
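    # Wrap each predicted label in a list and reuse the fitted binarizer:
    # refitting on bare strings would split every label into characters.
    predicted_labels = mlb.transform([[cluster_labels[cluster_id]] for cluster_id in test_cluster_ids])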

    precision, recall, fscore, _ = precision_recall_fscore_support(ground_truth, predicted_labels, average='weighted')

    # Print performance metrics
    print("Performance Metrics:")
    print("Precision: ", precision)
    print("Recall: ", recall)
    print("F-score: ", fscore)

    # Assign cluster names based on top keywords
    print("\nCluster Names:")
    for i in range(num_clusters):
        cluster_keywords = get_top_keywords(tfidf_vectorizer, kmeans.cluster_centers_[i], 5)
        cluster_name = "Cluster " + str(i + 1) + ": " + ", ".join(cluster_keywords)
        print(cluster_name)

    # Plot confusion matrix
    confusion_mat = confusion_matrix(ground_truth.argmax(axis=1), predicted_labels.argmax(axis=1))
    plt.figure(figsize=(8, 6))
    sns.heatmap(confusion_mat, annot=True, cmap="YlGnBu", cbar=False, xticklabels=mlb.classes_, yticklabels=mlb.classes_)
    plt.xlabel('Predicted Labels')
    plt.ylabel('Ground Truth Labels')
    plt.title('Confusion Matrix')
    plt.show()


In [None]:
# Clustering by cosine distance


def cluster_kmean(train_data, test_data, num_clusters, min_df=1, stopwords=None, metric='cosine'):
    # Preprocess train_data
    train_docs = preprocess_documents(train_data['text'], stopwords)

    # Create TF-IDF matrix
    tfidf_vectorizer = TfidfVectorizer(min_df=min_df, stop_words=stopwords)
    train_tfidf_matrix = tfidf_vectorizer.fit_transform(train_docs)

    # Cluster documents using KMeans. scikit-learn's KMeans has no `metric` or
    # `precompute_distances` argument and always minimizes Euclidean distance.
    # TfidfVectorizer L2-normalizes each row by default, and for unit-length
    # vectors squared Euclidean distance equals 2 * (1 - cosine similarity), so
    # this only approximates cosine clustering (centroids are not re-normalized).
    if metric not in ('cosine', 'euclidean'):
        raise ValueError("metric must be 'cosine' or 'euclidean'")
    kmeans = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, max_iter=300, random_state=0,
                    verbose=0)
    
    kmeans.fit(train_tfidf_matrix)

    # Preprocess test_data
    test_docs = preprocess_documents(test_data['text'], stopwords)

    # Transform test_data to TF-IDF matrix
    test_tfidf_matrix = tfidf_vectorizer.transform(test_docs)

    # Predict cluster IDs for test_data
    test_cluster_ids = kmeans.predict(test_tfidf_matrix)

    # Majority vote to map cluster to ground-truth label
    cluster_labels = {}
    for cluster_id in range(num_clusters):
        cluster_docs_indices = np.where(test_cluster_ids == cluster_id)[0]
        cluster_docs = test_data.iloc[cluster_docs_indices]
        cluster_label_counts = cluster_docs['labels'].str.split(';').apply(pd.Series).stack().value_counts()
        cluster_label = cluster_label_counts.idxmax()
        cluster_labels[cluster_id] = cluster_label

    # Calculate performance metrics
    mlb = MultiLabelBinarizer()
    ground_truth = mlb.fit_transform(test_data['labels'].str.split(';'))
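    # Wrap each predicted label in a list and reuse the fitted binarizer:
    # refitting on bare strings would split every label into characters.
    predicted_labels = mlb.transform([[cluster_labels[cluster_id]] for cluster_id in test_cluster_ids])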

    precision, recall, fscore, _ = precision_recall_fscore_support(ground_truth, predicted_labels, average='weighted')

    # Print performance metrics
    print("Performance Metrics:")
    print("Precision: ", precision)
    print("Recall: ", recall)
    print("F-score: ", fscore)

    # Assign cluster names based on top keywords
    print("\nCluster Names:")
    for i in range(num_clusters):
        cluster_keywords = get_top_keywords(tfidf_vectorizer, kmeans.cluster_centers_[i], 5)
        cluster_name = "Cluster " + str(i + 1) + ": " + ", ".join(cluster_keywords)
        print(cluster_name)

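    # Print confusion matrix (rows: ground truth, columns: predicted) and per-label report.
    # For multi-label documents, argmax keeps the first matching class in mlb.classes_ order.
    class_ids = list(range(len(mlb.classes_)))
    y_true = ground_truth.argmax(axis=1)
    y_pred = predicted_labels.argmax(axis=1)
    confusion_mat = confusion_matrix(y_true, y_pred, labels=class_ids)
    print(pd.DataFrame(confusion_mat, index=mlb.classes_, columns=mlb.classes_))
    print(classification_report(y_true, y_pred, labels=class_ids, target_names=mlb.classes_, zero_division=0))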


In [None]:
# Clustering by Euclidean distance

def cluster_kmean(train_data, test_data, num_clusters, min_df=1, stopwords=None, metric='euclidean'):
    # Preprocess train_data
    train_docs = preprocess_documents(train_data['text'], stopwords)

    # Create TF-IDF matrix
    tfidf_vectorizer = TfidfVectorizer(min_df=min_df, stop_words=stopwords)
    train_tfidf_matrix = tfidf_vectorizer.fit_transform(train_docs)

    # Cluster documents using KMeans with Euclidean distance. scikit-learn's KMeans
    # has no `metric` or `precompute_distances` argument and always minimizes
    # Euclidean distance, so both settings share one estimator. Because
    # TfidfVectorizer L2-normalizes rows by default, the result also approximates
    # cosine clustering (centroids are not re-normalized).
    if metric not in ('cosine', 'euclidean'):
        raise ValueError("metric must be 'cosine' or 'euclidean'")
    kmeans = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, max_iter=300, random_state=0,
                    verbose=0)
    
    kmeans.fit(train_tfidf_matrix)

    # Preprocess test_data
    test_docs = preprocess_documents(test_data['text'], stopwords)

    # Transform test_data to TF-IDF matrix
    test_tfidf_matrix = tfidf_vectorizer.transform(test_docs)

    # Predict cluster IDs for test_data
    test_cluster_ids = kmeans.predict(test_tfidf_matrix)

    # Majority vote to map cluster to ground-truth label
    cluster_labels = {}
    for cluster_id in range(num_clusters):
        cluster_docs_indices = np.where(test_cluster_ids == cluster_id)[0]
        cluster_docs = test_data.iloc[cluster_docs_indices]
        cluster_label_counts = cluster_docs['labels'].str.split(';').apply(pd.Series).stack().value_counts()
        cluster_label = cluster_label_counts.idxmax()
        cluster_labels[cluster_id] = cluster_label

    # Calculate performance metrics
    mlb = MultiLabelBinarizer()
    ground_truth = mlb.fit_transform(test_data['labels'].str.split(';'))
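    # Wrap each predicted label in a list and reuse the fitted binarizer:
    # refitting on bare strings would split every label into characters.
    predicted_labels = mlb.transform([[cluster_labels[cluster_id]] for cluster_id in test_cluster_ids])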

    precision, recall, fscore, _ = precision_recall_fscore_support(ground_truth, predicted_labels, average='weighted')

    # Print performance metrics
    print("Performance Metrics:")
    print("Precision: ", precision)
    print("Recall: ", recall)
    print("F-score: ", fscore)

    # Assign cluster names based on top keywords
    print("\nCluster Names:")
    for i in range(num_clusters):
        cluster_keywords = get_top_keywords(tfidf_vectorizer, kmeans.cluster_centers_[i], 5)
        cluster_name = "Cluster " + str(i + 1) + ": " + ", ".join(cluster_keywords)
        print(cluster_name)

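    # Print confusion matrix; for multi-label documents, compare against the first listed label
    # so that the actual and predicted arrays have one entry per document
    actual = test_data['labels'].str.split(';').str[0].to_numpy()
    predicted = np.array([cluster_labels[cluster_id] for cluster_id in test_cluster_ids])
    confusion_mat = pd.crosstab(actual, predicted, rownames=['Actual'], colnames=['Predicted'])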
    print("\nConfusion Matrix:")
    print(confusion_mat)



## Q2: GMM Clustering (5 points)

Define a function `cluster_gmm(train_data, test_data, num_clusters, min_df = 10, stopwords = stopwords)`  to redo Q1 using the Gaussian mixture model. 

**Requirements**:

- To save time, you can specify the covariance type as `diag`.
- Be sure to run the clustering with different initializations to get stable clustering results.
- Your F1 score on the test set should be around `70%` or higher.
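
The two requirements on covariance type and repeated initialization map onto the `covariance_type` and `n_init` arguments of scikit-learn's `GaussianMixture`. The sketch below runs on a synthetic feature matrix `X` (an assumption for illustration, not the assignment data). With TF-IDF features, the sparse matrix would first need to be converted to a dense array, e.g. with `.toarray()`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical dense feature matrix: two well-separated groups of 50 points each
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 4)),
               rng.normal(5.0, 0.5, size=(50, 4))])

# covariance_type="diag" fits one variance per feature per component;
# n_init=5 reruns EM from five initializations and keeps the best fit
gmm = GaussianMixture(n_components=2, covariance_type="diag", n_init=5, random_state=0)
cluster_ids = gmm.fit_predict(X)
print(np.bincount(cluster_ids))
```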

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture
from sklearn.metrics import classification_report, confusion_matrix
import random

def cluster_gmm(train_data, test_data, num_clusters, min_df=10, stopwords=None):
    """
    Cluster documents in train_data using Gaussian Mixture Model (GMM) and evaluate the clustering performance using test_data.

    Args:
        train_data (pd.DataFrame): DataFrame containing the training data.
        test_data (pd.DataFrame): DataFrame containing the test data.
        num_clusters (int): Number of clusters to form.
        min_df (int, optional): Minimum document frequency threshold for TfidfVectorizer. Defaults to 10.
        stopwords (list, optional): List of stopwords to be removed during text preprocessing. Defaults to None.

    Returns:
        None
    """

    # Preprocess text data
    vectorizer = TfidfVectorizer(min_df=min_df, stop_words=stopwords)
    train_docs = vectorizer.fit_transform(train_data['text'])
    test_docs = vectorizer.transform(test_data['text'])
    
    # Cluster using Gaussian Mixture Model (GMM); n_init reruns EM from several
    # initializations and keeps the best fit, which stabilizes the result
    gmm = GaussianMixture(n_components=num_clusters, covariance_type='diag', n_init=5, random_state=0)
    gmm.fit(train_docs.toarray())
    test_preds = gmm.predict(test_docs.toarray())

    # Map each cluster to a ground-truth label by majority vote on the test set
    # (idxmax picks the most frequent label; the input dataframes are not modified)
    test_labels = test_data['label'].to_numpy()
    cluster_labels = pd.Series(test_labels).groupby(test_preds).agg(lambda x: x.value_counts().idxmax())
    predicted = pd.Series(test_preds).map(cluster_labels).to_numpy()

    # Print the confusion matrix and per-label precision, recall, and F1-score
    print(confusion_matrix(test_labels, predicted))
    print(classification_report(test_labels, predicted, zero_division=0))


In [None]:
# Test GMM model
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture
from sklearn.metrics import classification_report, confusion_matrix
import random

# Load the 20 newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame({'text': newsgroups.data, 'label': newsgroups.target})

# Split the dataset into train and test sets
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

# Define list of stopwords
stopwords = ['the', 'and', 'is', 'in', 'it', 'of', 'to', 'for', 'with', 'on']

# Call the cluster_gmm function
cluster_gmm(train_data, test_data, num_clusters=3, min_df=10, stopwords=stopwords)


[[ 22   0   0   0   0   0   0   0   0   0   0 129   0   0   0   0   0   0
    0   0]
 [189   0   0   0   0   0   0   0   0   0   0  13   0   0   0   0   0   0
    0   0]
 [187   0   0   0   0   0   0   0   0   0   0   8   0   0   0   0   0   0
    0   0]
 [180   0   0   0   0   0   0   0   0   0   0   3   0   0   0   0   0   0
    0   0]
 [198   0   0   0   0   0   0   0   0   0   0   7   0   0   0   0   0   0
    0   0]
 [205   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0   0   0
    0   0]
 [185   0   0   0   0   0   0   0   0   0   0   8   0   0   0   0   0   0
    0   0]
 [149   0   0   0   0   0   0   0   0   0   0  47   0   0   0   0   0   0
    0   0]
 [117   0   0   0   0   0   0   0   0   0   0  51   0   0   0   0   0   0
    0   0]
 [156   0   0   0   0   0   0   0   0   0   0  55   0   0   0   0   0   0
    0   0]
 [133   0   0   0   0   0   0   0   0   0   0  65   0   0   0   0   0   0
    0   0]
 [ 69   0   0   0   0   0   0   0   0   0   0 132   0   0   0   0



## Q3: LDA Clustering (5 points)

**Q3.1.** Define a function `cluster_lda(train_data, test_data, num_clusters, min_df = 5, stopwords = stopwords)`  to redo Q1 using the LDA model. Note, for LDA, you need to use `CountVectorizer` instead of `TfidfVectorizer`. 

**Requirements**:
- Your F1 score on the test set should be around `80%` or higher
- Print out top-10 words in each topic
- Return the topic mixture per document matrix for the test set (denoted as `doc_topics`) and the trained LDA model.

**Q3.2**. Find similar documents

- Define a function `find_similar(doc_id, doc_topics)` to find the `top 3 documents` that are the most thematically similar to the document with `doc_id` using the `doc_topics`. (1 point)
- Return the IDs of these similar documents.
- Print the text of these documents to check their thematic similarity.
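
One possible way to compare documents is cosine similarity between rows of the topic-mixture matrix. The sketch below is a minimal version under that assumption. It is named `find_similar` to match the test cell, the `top_n` parameter is added for illustration, and the `doc_topics` values are made up.

```python
import numpy as np


def find_similar(doc_id, doc_topics, top_n=3):
    """Return indices of the top_n rows most similar to row doc_id by cosine similarity."""
    doc_topics = np.asarray(doc_topics, dtype=float)
    norms = np.linalg.norm(doc_topics, axis=1)
    sims = doc_topics @ doc_topics[doc_id] / (norms * norms[doc_id])
    sims[doc_id] = -np.inf  # exclude the query document itself
    return np.argsort(sims)[::-1][:top_n].tolist()


# Hypothetical topic mixtures for five documents over three topics
doc_topics = np.array([
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.70, 0.20, 0.10],
    [0.05, 0.05, 0.90],
])
print(find_similar(0, doc_topics))
```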


**Analysis**:

You already learned how to find similar documents by using TFIDF weights. Can you comment on the difference between the approach you just implemented and the one based on TFIDF weights?

In [None]:
def cluster_lda(train_data, test_data, num_clusters, min_df = 5, stopwords = stopwords):
    
    model, doc_topic = None, None
    
    # add your code
    
    
    return model, doc_topic

In [None]:
# Test LDA model

In [None]:
def find_similar(doc_id, doc_topics):
   
    docs = None  # placeholder so the stub runs before it is implemented

    # Add your code
    
    return docs

In [None]:
doc_topics[10:15]

doc_id = 11
idx = find_similar(doc_id, doc_topics)

print(test_data.text.iloc[doc_id])
print("Similar documents: \n")
for i in idx:
    print(i, test_data.iloc[i].text)


## Q4 (Bonus): Find the most significant topics in a document

A small portion of documents in our dataset have multiple topics. For instance, consider the following document, which has topics T2 and T3. The LDA model returns two significant topics with probabilities 0.355 and 0.644. Can you describe a way to find the most significant topics in a document while ignoring the insignificant ones? In this example, you should ignore the first topic but keep the last two.

- Implement your ideas
- Test your ideas with the test set
- Recalculate the precision/recall/f1 score for each label.
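
One simple idea, offered only as a starting point, is to keep every topic whose probability reaches a fixed threshold. The helper name `significant_topics` and the cutoff of 0.3 are assumptions, and the cutoff would need tuning on the test set.

```python
import numpy as np


def significant_topics(doc_topics, threshold=0.3):
    """Return, for each document, the indices of topics with probability >= threshold."""
    doc_topics = np.asarray(doc_topics, dtype=float)
    return [np.flatnonzero(row >= threshold).tolist() for row in doc_topics]


# Hypothetical mixtures; the second row mirrors the example above (0.355 and 0.644)
doc_topics = np.array([
    [0.98, 0.01, 0.01],
    [0.001, 0.355, 0.644],
])
print(significant_topics(doc_topics))
```

Each list of kept topic indices can then be mapped to labels and compared with the multi-label ground truth.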



In [None]:
(test_data.reset_index()).iloc[12:13]
doc_topics[12]

In [None]:
if __name__ == "__main__":  
    
    # Due to randomness, you won't get the exact result
    # as shown here, but your result should be close
    # if you tune the parameters carefully
    
    # Q1
   
            
    # Q2
    
    
    