# Assignment 6: Clustering and Topic Modeling

In this assignment, you'll need the following dataset files:
- text_train.json: This file contains a list of documents. It's used for training models.
- text_test.json: This file contains a list of documents and their ground-truth labels. It's used for testing performance. This file is in the format shown below. Note that each document has a list of labels.

You can load these files using json.load().

|Text| Labels|
|----|-------|
|paraglider collides with hot air balloon ... | ['Disaster and Accident', 'Travel & Transportation']|
|faa issues fire warning for lithium ... | ['Travel & Transportation'] |
| .... |...|
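
As a minimal sketch of the file layout, the snippet below parses a hypothetical two-document sample and keeps only the first label of each document. It assumes the test file is a JSON list of `[text, labels]` pairs, which is what the `zip(*test)` call in the solution code implies; the shortened texts and the `sample` variable are illustrative, not the real data.

```python
import json

# Hypothetical sample mimicking the assumed layout of the test file:
# a JSON list of [text, labels] pairs.
sample = json.loads('''[
  ["paraglider collides with hot air balloon",
   ["Disaster and Accident", "Travel & Transportation"]],
  ["faa issues fire warning for lithium",
   ["Travel & Transportation"]]
]''')

# Split the pairs into parallel tuples of texts and label lists.
texts, labels = zip(*sample)

# Keep only the first ground-truth label of each document.
first_labels = [doc_labels[0] for doc_labels in labels]
print(first_labels)
```

With a real file, `json.loads(...)` would be replaced by `json.load(f)` on an opened file handle.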

## Q1: K-Means Clustering (5 points)

Define a function **cluster_kmean()** as follows: (overall 1 point)
- Take two file name strings as inputs: $train\_file$ is the file path of text_train.json, and $test\_file$ is the file path of text_test.json
- Use **KMeans** to cluster documents in $train\_file$ into 3 clusters by **cosine similarity** (1 point)
- Test the clustering model performance using $test\_file$: (1 point)
  * Predict the cluster ID for each document in $test\_file$.
  * Let's only use the **first label** in the ground-truth label list of each test document, e.g. for the first document in the table above, you set the ground_truth label to "Disaster and Accident" only.
  * Apply the **majority vote** rule to dynamically map the predicted cluster IDs to the ground-truth labels in $test\_file$. **Be sure not to hardcode the mapping** (e.g. writing code like {0: "Disaster and Accident"}), because a cluster may correspond to a different topic in each run. (1 point)
  * Calculate **precision/recall/f-score** for each label (1 point. The f1 score must be above 70%)
- This function returns nothing. Print the confusion matrix and precision/recall/f-score.
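
The majority vote rule above can be sketched without any library support. This is an illustrative version, not the required implementation: the helper name `majority_vote_mapping` and the toy labels "A", "B", "C" are assumptions made for the example.

```python
from collections import Counter, defaultdict

def majority_vote_mapping(cluster_ids, true_labels):
    """Map each cluster ID to the ground-truth label that occurs most often in it."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        votes[cluster][label] += 1
    # most_common(1) returns [(label, count)] for the top-voted label.
    return {cluster: counter.most_common(1)[0][0]
            for cluster, counter in votes.items()}

# Toy data: predicted cluster IDs and the first ground-truth label per document.
clusters = [0, 0, 1, 1, 1, 2]
truth = ["A", "A", "B", "B", "A", "C"]

mapping = majority_vote_mapping(clusters, truth)
predicted = [mapping[c] for c in clusters]
print(mapping)
print(predicted)
```

Because the mapping is derived from the data on every run, it stays correct even when the cluster IDs are permuted between runs.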

## Q2: LDA Clustering (5 points)

Define a function **cluster_lda()** as follows: 
- Take two file name strings as inputs: $train\_file$ is the file path of text_train.json, and $test\_file$ is the file path of text_test.json
- Use **LDA** to train a topic model with documents in $train\_file$ and the number of topics $K$ = 3  (1 point)
- Predict the topic distribution of each document in  $test\_file$, and select **only the topic with highest probability** as the predicted topic (1 point)
- Evaluate the topic model performance as follows:
  * Similar to Q1, let's use the **first label** in the label list of $test\_file$ as the ground_truth label.
  * Apply **majority vote rule** to map the topics to the labels. (1 point)
  * Calculate **precision/recall/f-score** for each label and print out precision/recall/f-score. (1 point; must be above 70%)
- Return topic distribution and the original ground-truth labels of each document in $test\_file$ 
- Also, provide a document which contains: (1 point)
  - a performance comparison between Q1 and Q2
  - a description of how you tuned the model parameters, e.g. min_df, alpha, max_iter etc.
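
To make the prediction and evaluation steps concrete, here is a small sketch that picks the most probable topic per document and computes precision, recall and F1 for one label by hand. The topic distribution, the topic-to-label `mapping`, and both helper names are made up for illustration; in practice `sklearn.metrics.classification_report` reports the same quantities.

```python
def argmax_topics(topic_dist):
    """Return the index of the most probable topic for each document."""
    return [max(range(len(row)), key=row.__getitem__) for row in topic_dist]

def precision_recall_f1(truth, predicted, label):
    """Compute precision, recall and F1 for a single label."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical topic distributions for four test documents (K = 3).
topic_dist = [[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4],
              [0.6, 0.3, 0.1]]
topics = argmax_topics(topic_dist)

# Assumed topic-to-label mapping, e.g. obtained by majority vote.
mapping = {0: "A", 1: "B", 2: "C"}
predicted = [mapping[t] for t in topics]
truth = ["A", "B", "C", "B"]

p, r, f1 = precision_recall_f1(truth, predicted, "A")
print(topics, p, r, round(f1, 3))
```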

## Q3 (Bonus): Overlapping Clustering (3 points)

In Q2, you predict one label for each document in $test\_file$. In this question, try to discover multiple labels if appropriate. Define a function **overlapping_cluster** as follows:
- Take the outputs of Q2 (i.e. topic distribution and the labels of each document in $test\_file$) as inputs
- Set a threshold for each topic (i.e. $TH = [th_0, th_1, th_2]$). A document is predicted to belong to topic $i$ only if its probability for that topic is greater than $th_i$, for $i \in \{0,1,2\}$. (1 point)
- The threshold is determined as follows:
  * Vary the threshold for each topic from 0.05 to 0.95 in increments of 0.05, and evaluate the topic model performance in each round:
      * Apply **majority vote rule** to map the predicted topics to the ground-truth labels in $test\_file$ (1 point)
      * Calculate **f1-score** for each label
  * For each label, pick the threshold value which maximizes the f1-score (1 point)
- Return the threshold and f1-score of each label
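
The threshold search can be illustrated for a single topic/label pair. The sketch below assumes the topic has already been mapped to its label, and uses made-up probabilities and ground truth; `f1_binary` and `best_threshold` are hypothetical helper names, not part of the assignment.

```python
def f1_binary(truth, predicted):
    """F1 score for one label given boolean ground truth and predictions."""
    tp = sum(t and p for t, p in zip(truth, predicted))
    fp = sum((not t) and p for t, p in zip(truth, predicted))
    fn = sum(t and (not p) for t, p in zip(truth, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(probs, truth):
    """Sweep thresholds 0.05, 0.10, ..., 0.95 and keep the first one with the best F1."""
    best_th, best_score = None, -1.0
    for step in range(1, 20):
        th = round(step * 0.05, 2)  # avoid accumulating floating-point drift
        predicted = [p > th for p in probs]
        score = f1_binary(truth, predicted)
        if score > best_score:
            best_th, best_score = th, score
    return best_th, best_score

# Hypothetical probabilities of one topic and whether each document
# truly carries the label mapped to that topic.
probs = [0.9, 0.6, 0.4, 0.2, 0.1]
truth = [True, True, True, False, False]

th, score = best_threshold(probs, truth)
print(th, score)
```

Running the same sweep independently for each topic yields one threshold per label, which is what the function is asked to return.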

In [145]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.cluster import KMeansClusterer, cosine_distance
from sklearn.decomposition import LatentDirichletAllocation

import pandas as pd
from sklearn import metrics
import numpy as np
import json, time
from matplotlib import pyplot as plt
from sklearn.preprocessing import MultiLabelBinarizer

In [146]:
def cluster_kmean(train_file, test_file):
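    # Load the training documents and the (text, labels) test pairs,
    # closing each file as soon as it has been read
    with open(train_file, 'r') as f:
        train = json.load(f)
    with open(test_file, 'r') as f:
        test = json.load(f)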
    test_text, labels = zip(*test)
    first_label=[item[0] for item in labels]
    
    tfidf_vect = TfidfVectorizer(stop_words="english",\
                             min_df=5) 
    dtm_train= tfidf_vect.fit_transform(train)
    dtm_test= tfidf_vect.transform(test_text)
    
    num_clusters=3

    clusterer = KMeansClusterer(num_clusters, \
                            cosine_distance, \
                            repeats=20)

    clusters = clusterer.cluster(dtm_train.toarray(), \
                             assign_clusters=True)
    
    predict = [clusterer.classify(v) for v in dtm_test.toarray()]
    
    df=pd.DataFrame(list(zip(first_label, predict)), \
                columns=['actual_class','cluster'])
 
    confusion = pd.crosstab( index=df.cluster, columns=df.actual_class)
    print(confusion)
    
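    # Majority vote: map each cluster to its most frequent ground-truth label
    mapping = confusion.idxmax(axis=1)
    # Iterate over (cluster ID, label) pairs so the printed ID is the real
    # cluster ID even if some cluster receives no test documents
    for idx, t in mapping.items():
        print("Cluster {}: Topic {}".format(idx, t))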
    
    predicted_target=[mapping[i] for i in predict]

    print(metrics.classification_report(first_label, predicted_target))


In [148]:
def cluster_lda(train_file, test_file):
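    # Load the training documents and the (text, labels) test pairs,
    # closing each file as soon as it has been read
    with open(train_file, 'r') as f:
        train = json.load(f)
    with open(test_file, 'r') as f:
        test = json.load(f)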
    test_text, labels=zip(*test)
    first_label=[item[0] for item in labels]
    
    # LDA models raw term counts, so use CountVectorizer rather than TF-IDF
    count_vect = CountVectorizer(min_df=5, stop_words='english')
    
    dtm_train= count_vect.fit_transform(train)
    dtm_test= count_vect.transform(test_text)
 
    num_clusters=3

    lda = LatentDirichletAllocation(n_components=num_clusters, learning_method='batch',\
                                max_iter=25,verbose=1, n_jobs=1,
                                random_state=0).fit(dtm_train)
    
    topic_assign=lda.transform(dtm_test)
    
    predict=topic_assign.argmax(axis=1)
    
    df=pd.DataFrame(list(zip(first_label, predict)), \
                columns=['actual_class','cluster'])

    confusion = pd.crosstab( index=df.cluster, \
                            columns=df.actual_class)
    print(confusion.head())
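    # Majority vote: map each topic to its most frequent ground-truth label
    mapping = confusion.idxmax(axis=1)
    # Iterate over (topic ID, label) pairs so the printed ID is the real
    # topic ID even if some topic receives no test documents
    for idx, t in mapping.items():
        print("Cluster {}: Topic {}".format(idx, t))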
    
    predicted_target=[mapping[i] for i in predict]

    print(metrics.classification_report(first_label, \
                                        predicted_target))

    return topic_assign, labels

In [None]:
def overlapping_cluster(topic_assign, labels):
    mlb = MultiLabelBinarizer()
    Y=mlb.fit_transform(labels)
    result = []
    cluster_ids = list(range(topic_assign.shape[1]))
    for thresh in np.arange(0.05, 1, 0.05):
        # A document is assigned to a topic when its probability exceeds the threshold
        predict=np.where(topic_assign>thresh, 1, 0)
        mapping={}
        df = pd.DataFrame(np.hstack([Y, predict]), \
                          columns=mlb.classes_.tolist()+cluster_ids)
        for l in mlb.classes_:
            # majority vote: pick the topic predicted most often for label l
            mapping[l] = df[df[l]==1][cluster_ids].sum(axis=0).idxmax()
            
        #print(mapping)
        
        # reorder the predicted to be consistent with truth 
        predict_reordered=predict[:, [mapping[l] for l in mapping]]
        f1 = metrics.f1_score(Y, predict_reordered, average=None)
        result.append((thresh, f1))
        
    thresh, f1 = zip(*result)
    thresh_df = pd.DataFrame(list(f1), columns = mlb.classes_, \
                             index = list(thresh))
    
    #print(thresh_df)
    
    final_thresh = thresh_df.idxmax(axis = 0)
    f1 = thresh_df.max(axis = 0)
    #print(final_thresh, f1)
    
    return final_thresh, f1

In [150]:
if __name__ == "__main__":  
    
    # Due to randomness, you won't get the exact result
    # as shown here, but your result should be close
    # if you tune the parameters carefully
    
    # Q1
    cluster_kmean('../../dataset/train_text.json', \
                  '../../dataset/test_text.json')
            
    # Q2
    topic_assign, labels =cluster_lda('../../dataset/train_text.json', \
                          '../../dataset/test_text.json')
    
    # Q3
    threshold, f1 = overlapping_cluster(topic_assign, labels)
    print(threshold)
    print(f1)

actual_class  Disaster and Accident  News and Economy  Travel & Transportation
cluster                                                                       
0                                70                 0                      135
1                               130                 7                        8
2                                10               199                       41
Cluster 0: Topic Travel & Transportation
Cluster 1: Topic Disaster and Accident
Cluster 2: Topic News and Economy
                         precision    recall  f1-score   support

  Disaster and Accident       0.90      0.62      0.73       210
       News and Economy       0.80      0.97      0.87       206
Travel & Transportation       0.66      0.73      0.69       184

              micro avg       0.77      0.77      0.77       600
              macro avg       0.78      0.77      0.77       600
           weighted avg       0.79      0.77      0.77       600

iteration: 1 of max_iter: 25
iter