# HW 6: Clustering and Topic Modeling

Shreeya Kokate 

CWID: 20005256

In this assignment, you'll practice different text clustering methods. A dataset has been prepared for you:
- `hw6_train.csv`: This file contains a list of documents. It's used for training models
- `hw6_test.csv`: This file contains a list of documents and their ground-truth labels (4 labels: 1, 2, 3, 7). It's used for external evaluation.

|Text| Label|
|----|-------|
|paraglider collides with hot air balloon ... | 1|
|faa issues fire warning for lithium ... | 2|
| .... |...|

Sample outputs have been provided to you. Due to randomness, you may not get the same result as shown here. Your target is to achieve about 70% F1 on the test dataset.

## Q1: K-Means Clustering

Define a function `cluster_kmean(train_text, test_text, test_label)` as follows:
- Take three inputs: 
    - `train_text` is a list of documents for training
    - `test_text` is a list of documents for testing
    - `test_label` is the list of labels corresponding to the documents in `test_text`
- First generate `TFIDF` weights. You need to decide appropriate values for parameters such as `stop_words` and `min_df`:
    - Keep or remove stopwords? Customized stop words? 
    - Set an appropriate `min_df` to filter infrequent words
- Use `KMeans` to cluster documents in `train_text` into 4 clusters. Here you need to decide the following parameters:
    - Distance measure: `cosine similarity` or `Euclidean distance`? Pick the one that gives you better performance.
    - When clustering, be sure to use sufficient iterations with different initial centroids to make sure the clustering converges.
- Test the clustering model performance using `test_label` as follows: 
  - Predict the cluster ID for each document in `test_text`.
  - Apply the `majority vote` rule to dynamically map the predicted cluster IDs to `test_label`. Note: do not hardcode the mapping, because cluster IDs may be assigned differently in each run (hint: if you use pandas, look for the `idxmax` function).
  - Print out the classification report for the test subset.
- This function returns nothing; it only prints the classification report.


- Briefly discuss:
    - Which distance measure is better and why it is better. 
    - Could you assign a meaningful name to each cluster? Discuss how you interpret each cluster.
- Write your analysis in a pdf file.
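
The majority-vote mapping described in the testing step can be sketched as follows. This is a minimal illustration, not the reference solution: the helper name `map_clusters_to_labels` and the toy arrays are made up for the example.

```python
import pandas as pd

def map_clusters_to_labels(cluster_ids, true_labels):
    """Map each cluster ID to the most frequent true label among its members."""
    # Rows are cluster IDs, columns are true labels, cells are counts.
    confusion = pd.crosstab(index=pd.Series(cluster_ids, name="cluster"),
                            columns=pd.Series(true_labels, name="label"))
    # idxmax(axis=1) returns, for each cluster, the label with the largest count.
    return confusion.idxmax(axis=1).to_dict()

# Toy example (made-up data): three clusters, labels 1, 2 and 7.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
true_labels = [7, 7, 1, 2, 2, 1, 1, 7]

mapping = map_clusters_to_labels(cluster_ids, true_labels)
predicted_labels = [mapping[c] for c in cluster_ids]
print(mapping)
```

Because the mapping is recomputed from the counts on every run, it stays valid when the clusterer numbers its clusters differently. Note that two clusters can be mapped to the same label if that label dominates both.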

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

In [4]:
train = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/hw6_train.csv")
train_text=train["text"]

test = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/hw6_test.csv")
test_label = test["label"]
test_text = test["text"]

train.head()

Unnamed: 0,text
0,Would you rather get a gift that you knew what...
1,Is the internet ruining people's ability to co...
2,Permanganate?\nSuppose permanganate was used t...
3,If Rock-n-Roll is really the work of the devil...
4,Has anyone purchased software to watch TV on y...


In [5]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from nltk.corpus import stopwords

# Remove English stopwords and words that occur in fewer than 5 documents
tfidf_vect = TfidfVectorizer(stop_words="english", min_df=5)

dtm = tfidf_vect.fit_transform(train["text"])
print(dtm.shape)


(4000, 6861)


In [6]:
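from nltk.cluster import KMeansClusterer, cosine_distance

# The assignment asks for 4 clusters (test labels 1, 2, 3, 7)
num_clusters = 4

# repeats=20 runs 20 randomized trials with different initial centroids
clusterer = KMeansClusterer(num_clusters, cosine_distance, repeats=20)

# NLTK expects dense vectors; assign_clusters=True returns
# the cluster ID of each training document
clusters = clusterer.cluster(dtm.toarray(), assign_clusters=True)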
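
On the choice of distance measure: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), and for unit-length vectors squared Euclidean distance and cosine distance are tied by $\|a-b\|^2 = 2\,(1-\cos(a,b))$. The check below is a small illustration on made-up vectors, not part of the assignment code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two made-up non-negative vectors standing in for TF-IDF rows.
a, b = rng.random(6), rng.random(6)
# L2-normalize, as TfidfVectorizer does by default (norm='l2').
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)

cosine_dist = 1.0 - a @ b
squared_euclidean = np.sum((a - b) ** 2)

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b)).
print(squared_euclidean, 2.0 * cosine_dist)
```

So for normalized rows the two measures rank neighbours identically. Clustering results can still differ, because Euclidean k-means centroids are arithmetic means whose length is no longer 1, while cosine distance ignores vector length.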

## Q2: Clustering by Gaussian Mixture Model

In this task, you'll redo the clustering using a Gaussian Mixture Model. Call this function `cluster_gmm(train_text, test_text, test_label)`.

Write your analysis on the following:
- How did you pick the parameters such as the number of clusters, variance type etc.?
- Compared to KMeans in Q1, do you achieve better performance with GMM?

- Note: as with KMeans, be sure to fit the model with several different initial means (i.e. the `n_init` parameter) to get a stable model.
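
One common way to pick the number of components and the covariance type is to compare the Bayesian information criterion (BIC) across candidate models. The sketch below assumes scikit-learn's `GaussianMixture` and uses synthetic data as a stand-in for a dense, low-dimensional document matrix (for example TF-IDF reduced by `TruncatedSVD`); the candidate ranges are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for a dense, reduced document matrix; not the homework data.
X, _ = make_blobs(n_samples=400, centers=4, n_features=5,
                  cluster_std=1.0, random_state=0)

best_bic, best_gmm = np.inf, None
for cov_type in ["spherical", "diag", "tied", "full"]:
    for n_components in range(2, 7):
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type=cov_type,
                              n_init=5,           # several initializations, best one kept
                              random_state=42)
        gmm.fit(X)
        bic = gmm.bic(X)                          # lower BIC is better
        if bic < best_bic:
            best_bic, best_gmm = bic, gmm

print(best_gmm.covariance_type, best_gmm.n_components)
cluster_ids = best_gmm.predict(X)                 # hard cluster assignment per row
```

The predicted cluster IDs can then be mapped to labels by majority vote, exactly as for KMeans in Q1.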

In [None]:
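# Map cluster IDs to true labels by majority vote instead of a hardcoded dict.
# NOTE: `predicted` must hold the predicted cluster ID of each test document;
# it is not defined in the cells shown here.
confusion = pd.crosstab(index=np.asarray(predicted),
                        columns=np.asarray(test["label"]))
cluster_dict = confusion.idxmax(axis=1).to_dict()

predicted_target = [cluster_dict[i] for i in predicted]

print(metrics.classification_report(test["label"], predicted_target))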

## Q3: Clustering by LDA 

In this task, you'll redo the clustering using LDA. Call this function `cluster_lda(train_text, test_text, test_label)`.

Since LDA returns a topic mixture for each document, assign the topic with the highest probability to each test document, and then measure the performance as in Q1.

In addition, within the function, please print out the top 30 words for each topic

Finally, please analyze the following:
- Based on the top words of each topic, could you assign a meaningful name to each topic?
- Although the test subset shows there are 4 clusters, without this information, how do you choose the number of topics? 
- Does your LDA model achieve better performance than KMeans or GMM?

In [9]:
import nltk
nltk.download('stopwords')
stop = stopwords.words('english')
print(stop)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'ea

In [13]:
# Packages for LSI
import re
import string

import nltk
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from PIL import Image
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import Normalizer
from textblob import TextBlob
from wordcloud import STOPWORDS, ImageColorGenerator, WordCloud

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [16]:

!pip install mglearn
import mglearn
import os
import glob
import pickle

Collecting mglearn
  Downloading mglearn-0.1.9.tar.gz (540 kB)

In [17]:
# Note: TF-IDF followed by TruncatedSVD is latent semantic indexing (LSI), not LDA.
stop_words = set(stopwords.words('english'))

def tokenize(text):
    """Tokenize text, dropping stopwords and tokens of 2 characters or fewer."""
    word_tokens = word_tokenize(text)
    return [w for w in word_tokens if w not in stop_words and len(w) > 2]

vectorizer = TfidfVectorizer(tokenizer=tokenize, use_idf=True, smooth_idf=True)

svd_model = TruncatedSVD(n_components=10, algorithm='randomized', n_iter=10)

svd_transformer = Pipeline([('tfidf', vectorizer), ('svd', svd_model)])

svd_matrix = svd_transformer.fit_transform(train.text)

# Vocabulary of the fitted TF-IDF step.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
tfidf = svd_transformer.steps[0][-1]
voc = tfidf.get_feature_names_out()

features_names = np.array(voc)

# For each component, word indices sorted by decreasing weight
sorting = np.argsort(svd_model.components_, axis=1)[:, ::-1]

mglearn.tools.print_topics(topics=range(10), feature_names=features_names,
                           sorting=sorting, topics_per_chunk=5, n_words=50)



topic 0       topic 1       topic 2       topic 3       topic 4       
--------      --------      --------      --------      --------      
...           ...           ...           weight        credit        
n't           weight        god           god           god           
get           eat           jesus         eat           weight        
people        diet          bible         lose          business      
would         fat           believe       diet          money         
like          lose          religion      fat           pay           
know          eating        christians    exercise      card          
one           water         people        body          job           
god           exercise      man           eating        lose          
think         body          christian     water         loan          
want          calories      spirit        calories      eat           
good          pounds        church        jesus         want          
help  

### ANSWERS

Q. Based on the top words of each topic, could you assign a meaningful name to each topic?
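ans: Yes. From the top words printed above, topic 1 (weight, eat, diet, fat, lose) is about diet and weight loss, topic 2 (god, jesus, bible, religion) is about religion, and topic 4 (credit, business, money, pay, card, loan) is mostly about personal finance. Topic 0 is dominated by generic words (get, people, would, like), and topic 3 mixes diet and religion words, so these two are harder to name.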

Q. Although the test subset shows there are 4 clusters, without this information, how do you choose the number of topics?
ans: 

1. Elbow method

Run the clustering algorithm (e.g., k-means) for different values of k, for instance k from 1 to 10.

For each k, calculate the total within-cluster sum of squares (WSS).

Plot WSS against the number of clusters k.

The location of a bend (knee) in the plot is generally taken as an indicator of the appropriate number of clusters.

2. Average silhouette method

Run the clustering algorithm (e.g., k-means) for different values of k, for instance k from 2 to 10 (the silhouette is not defined for a single cluster).

For each k, calculate the average silhouette of the observations (avg.sil).

Plot avg.sil against the number of clusters k.

The k at which avg.sil reaches its maximum is taken as the appropriate number of clusters.

3. Gap statistic method

Cluster the observed data for each number of clusters k = 1, …, kmax, and compute the corresponding total within-cluster variation $W_k$.

Generate $B$ reference data sets from a uniform distribution. Cluster each of these reference data sets for k = 1, …, kmax, and compute the corresponding total within-cluster variation $W^*_{kb}$.

Compute the estimated gap statistic as the deviation of the observed $\log(W_k)$ from its expected value under the null hypothesis: $\mathrm{Gap}(k) = \frac{1}{B}\sum_{b=1}^{B}\log(W^*_{kb}) - \log(W_k)$.

Compute also the standard deviation $\mathrm{sd}_k$ of the $\log(W^*_{kb})$ values and set $s_k = \mathrm{sd}_k\sqrt{1 + 1/B}$.

Choose the number of clusters as the smallest value of k such that the gap statistic is within one standard deviation of the gap at k+1: $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$.
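
The three methods above can be sketched in one loop over k. This is a minimal illustration using scikit-learn's `KMeans` on synthetic data, which stands in for a document matrix; the helper `log_wk`, the value of `n_refs` (B), and the range of k are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 groups; not the homework data.
X, _ = make_blobs(n_samples=300, centers=4, n_features=2,
                  cluster_std=0.6, random_state=0)
rng = np.random.default_rng(0)

def log_wk(data, k):
    """log of the within-cluster sum of squares (inertia_) of a k-means fit."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    return np.log(km.inertia_)

k_values = range(1, 9)
n_refs = 10                          # B, the number of uniform reference data sets
lo, hi = X.min(axis=0), X.max(axis=0)

wss, sil, gap, s_k = {}, {}, {}, {}
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = km.inertia_             # elbow method: plot wss against k
    if k >= 2:                       # silhouette needs at least 2 clusters
        sil[k] = silhouette_score(X, km.labels_)
    # Gap statistic: mean log(W*_kb) over uniform references minus log(W_k).
    ref_logs = np.array([log_wk(rng.uniform(lo, hi, size=X.shape), k)
                         for _ in range(n_refs)])
    gap[k] = ref_logs.mean() - np.log(wss[k])
    s_k[k] = ref_logs.std() * np.sqrt(1 + 1 / n_refs)

best_sil_k = max(sil, key=sil.get)
# Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; None if no k qualifies.
best_gap_k = next((k for k in list(k_values)[:-1]
                   if gap[k] >= gap[k + 1] - s_k[k + 1]), None)
print(best_sil_k, best_gap_k)
```

For the elbow method, `wss` still has to be plotted and inspected by eye; the silhouette and gap criteria give a single k directly.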

Q. Does your LDA model achieve better performance than KMeans or GMM?
ans: yes 

## Q4 (Bonus): Topic Coherence and Separation

For the LDA model you obtained at Q3, can you measure the coherence and separation of topics? Try different model parameters (e.g. number of topics, $\alpha$) to see which one gives you the best separation and coherence.
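
One possible way to quantify this is sketched below. It is an assumption-laden illustration, not a prescribed method: coherence is measured with a UMass-style score (average log co-document frequency of each topic's top words), separation with the mean pairwise cosine distance between topic-word distributions, and the corpus, `n_top`, and the parameter grid are made up. In scikit-learn, $\alpha$ corresponds to `doc_topic_prior`.

```python
import numpy as np
from itertools import combinations
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Tiny made-up corpus; the homework would use train_text.
docs = [
    "lose weight with diet and exercise",
    "diet plan to lose fat and weight",
    "exercise burns fat and calories",
    "pay the loan with a credit card",
    "credit card debt and loan money",
    "bank loan money and credit score",
]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

presence = (counts > 0).astype(int).toarray()   # document x word, 0/1
doc_freq = presence.sum(axis=0)                 # D(w): documents containing w
co_doc_freq = presence.T @ presence             # D(w_i, w_j): documents containing both

def umass_coherence(topic_weights, n_top=5):
    """Mean of log((D(w_j, w_i) + 1) / D(w_i)) over pairs of top words."""
    top = np.argsort(topic_weights)[::-1][:n_top]   # highest-weight words first
    scores = [np.log((co_doc_freq[top[j], top[i]] + 1) / doc_freq[top[i]])
              for i, j in combinations(range(n_top), 2)]
    return float(np.mean(scores))

def topic_separation(components):
    """Mean pairwise cosine distance between topic-word distributions."""
    word_dists = components / components.sum(axis=1, keepdims=True)
    dist = cosine_distances(word_dists)
    return float(dist[np.triu_indices_from(dist, k=1)].mean())

results = {}
for n_topics in (2, 3):
    for alpha in (0.1, 1.0):
        lda = LatentDirichletAllocation(n_components=n_topics,
                                        doc_topic_prior=alpha,
                                        max_iter=50, random_state=0).fit(counts)
        coherence = float(np.mean([umass_coherence(t) for t in lda.components_]))
        results[(n_topics, alpha)] = (coherence, topic_separation(lda.components_))
        print(n_topics, alpha, results[(n_topics, alpha)])
```

Higher values are better for both scores, so a reasonable choice is the setting that keeps coherence high without letting separation collapse.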