# HW 6: Clustering, Topic Modeling, Word Vectors

In this assignment, you'll practice different text clustering methods. A dataset has been prepared for you:
- `hw6_train.csv`: This file contains a list of documents. It's used for training models
- `hw6_test`: This file contains a list of documents and their ground-truth labels (4 lables: 1,2,3,7). It's used for external evaluation. 

|Text| Label|
|----|-------|
|paraglider collides with hot air balloon ... | 1|
|faa issues fire warning for lithium ... | 2|
| .... |...|

Sample outputs have been provided to you. Due to randomness, you may not get the same result as shown here. Your taget is to achieve about 70% F1 for the test dataset

## Q1: K-Mean Clustering 

Define a function `cluster_kmean(train_text, test_text, text_label)` as follows:
- Take three inputs: 
    - `train_text` is a list of documents for traing 
    - `test_text` is a list of documents for test
    - `test_label` is the labels corresponding to documents in `test_text` 
- First generate `TFIDF` weights. You need to decide appropriate values for parameters such as `stopwords` and `min_df`:
    - Keep or remove stopwords? Customized stop words? 
    - Set appropriate `min_df` to filter infrequent words
- Use `KMeans` to cluster documents in `train_text` into 4 clusters. Here you need to decide the following parameters:
    
    - Distance measure: `cosine similarity`  or `Euclidean distance`? Pick the one which gives you better performance.  
    - When clustering, be sure to  use sufficient iterations with different initial centroids to make sure clustering converge.
- Test the clustering model performance using `test_label` as follows: 
  - Predict the cluster ID for each document in `test_text`.
  - Apply `majority vote` rule to dynamically map the predicted cluster IDs to `test_label`. Note, you'd better not hardcode the mapping, because cluster IDs may be assigned differently in each run. (hint: if you use pandas, look for `idxmax` function).
  - print out the classification report for the test subset 
  
  
- This function has no return. Print out the classification report. 


- Briefly discuss:
    - Which distance measure is better and why it is better. 
    - Could you assign a meaningful name to each cluster? Discuss how you interpret each cluster.
- Write your analysis.

In [1]:
# Add your import statement

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from nltk.corpus import stopwords
from nltk.cluster import KMeansClusterer,cosine_distance,euclidean_distance
from sklearn import mixture
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


In [4]:
train = pd.read_csv("hw6_train.csv")
train_text=train["text"]

test = pd.read_csv("hw6_test.csv")
test_label = test["label"]
test_text = test["text"]

train.head()

Unnamed: 0,text
0,Would you rather get a gift that you knew what...
1,Is the internet ruining people's ability to co...
2,Permanganate?\nSuppose permanganate was used t...
3,If Rock-n-Roll is really the work of the devil...
4,Has anyone purchased software to watch TV on y...


In [7]:
def cluster_kmean(train_text, test_text, test_label):
    
    
    tfidf_v = TfidfVectorizer(stop_words="english",min_df=3)
    
    dtm = tfidf_v.fit_transform(list(train_text)+list(test_text))
    
    num_clusters=4
    
    clust = KMeansClusterer(num_clusters, cosine_distance, repeats=20)
    
    clusters = clust.cluster(dtm.toarray(), assign_clusters=True)
    
    data = pd.DataFrame({"Actual":test_label,"Cluster":clusters[len(list(train_text)):]})

    conf_matrix = pd.crosstab( index = data.Cluster, columns = data.Actual)
    
    matrix = conf_matrix.idxmax(axis=1)
    
    predicted_target = [matrix[i] for i in clusters[len(list(train_text)):]]
    
    best_report = metrics.classification_report(test_label, predicted_target)                      
                    
    print(best_report)

    

In [8]:
cluster_kmean(train_text, test_text, test_label)

              precision    recall  f1-score   support

           1       0.76      0.77      0.76       332
           2       0.89      0.70      0.78       314
           3       0.74      0.83      0.79       355
           7       0.69      0.74      0.72       273

    accuracy                           0.76      1274
   macro avg       0.77      0.76      0.76      1274
weighted avg       0.77      0.76      0.76      1274



## Q2: Clustering by Gaussian Mixture Model

In this task, you'll re-do the clustering using a Gaussian Mixture Model. Call this function  `cluster_gmm(train_text, test_text, text_label)`. 

You may take a subset from the data to do GMM because it can take a lot of time. 

Write your analysis on the following:
- How did you pick the parameters such as the number of clusters, variance type etc.?
- Compare to Kmeans in Q1, do you achieve better preformance by GMM? 

- Note, like KMean, be sure to use different initial means (i.e. `n_init` parameter) when fitting the model to achieve the model stability 

In [9]:
def cluster_gmm(train_text, test_text, test_label):
    
    trial=train[0:1500]
    
    tfidf_vect = TfidfVectorizer(stop_words="english",min_df=5)
    
    dtm = tfidf_vect.fit_transform(list(trial)+list(test_text))
    
    num_clusters=4
    lowest_bic = np.infty   
    
    best_gmm = None
    
    num_components = 4
    
    cv_types = ['spherical', 'tied', 'diag'] 
    
    for cv_type in cv_types:
    
        for n_components in range(1,num_components+1):
        
            gmm = mixture.GaussianMixture(n_components=n_components,
                                      n_init=3,
                                      covariance_type=cv_type, random_state=5)
            gmm.fit(dtm.toarray())
            
            bic = gmm.bic(dtm.toarray())  
        
            if bic < lowest_bic:  
                lowest_bic = bic
                best_gmm = gmm
        
    predicted = best_gmm.predict(dtm.toarray())
    
    data2 = pd.DataFrame({"Actual":test_label,"Predicted":predicted[len(list(trial)):]})
    
    confusion_matrix1 = pd.crosstab(index=data2.Predicted, columns=data2.Actual)
    
    matrix2 = confusion_matrix1.idxmax(axis=1)
    
    predicted_targ = [matrix2[i] for i in predicted[len(list(trial)):]]
    
    best_report = metrics.classification_report(test_label, predicted_targ)
                    
    print(best_report)

    

In [10]:
cluster_gmm(train_text, test_text, test_label)

              precision    recall  f1-score   support

           1       0.78      0.66      0.72       332
           2       0.65      0.84      0.73       314
           3       0.81      0.68      0.74       355
           7       0.72      0.75      0.74       273

    accuracy                           0.73      1274
   macro avg       0.74      0.73      0.73      1274
weighted avg       0.74      0.73      0.73      1274



## Q3: Clustering by LDA 

In this task, you'll re-do the clustering using LDA. Call this function `cluster_lda(train_text, test_text, text_label)`. 

However, since LDA returns topic mixture for each document, you `assign the topic with highest probability to each test document`, and then measure the performance as in Q1

In addition, within the function, please print out the top 30 words for each topic

Finally, please analyze the following:
- Based on the top words of each topic, could you assign a meaningful name to each topic?
- Although the test subset shows there are 4 clusters, without this information, how do you choose the number of topics? 
- Does your LDA model achieve better performance than KMeans or GMM?

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords


In [14]:
def cluster_lda(train, test_text, test_label):
    
    stop = list(stopwords.words('english')) + ['said']
    
    tf_vec = CountVectorizer(stop_words = stop,min_df = 5)
    
    tf = tf_vec.fit_transform(list(train) + list(test_text))
    
    tf_feature = tf_vec.get_feature_names()
    
    num_clusters = 4
    
    lda = LatentDirichletAllocation(n_components = num_clusters, evaluate_every = 1, max_iter = 40, verbose = 1, n_jobs = 1, random_state = 5).fit(tf[0:len(train)])

    topic_assgn = lda.transform(tf[len(train):])
    
    topic = topic_assgn.argmax(axis = 1)
    
    data = pd.DataFrame({"Actual":test_label,"Topic":topic})
    
    confusion_matrix = pd.crosstab(index = data.Topic, columns = data.Actual)
    
    matrix3 = confusion_matrix.idxmax(axis = 1)
    
    predicted_target1 = [matrix3[i] for i in topic]
    
    topic_top_words = 30
    for topic_idx, topic in enumerate(lda.components_):
        print ("Topic %d:" % (topic_idx))
        words = [(tf_feature[i],'%.2f'%topic[i]) \
           for i in topic.argsort()[::-1][0:topic_top_words]]
        print(words)
     
                
    print(metrics.classification_report(test_label, predicted_target1,labels=np.unique(predicted_target1)))

In [15]:
cluster_lda(train_text, test_text, test_label)



iteration: 1 of max_iter: 40, perplexity: 3479.8162
iteration: 2 of max_iter: 40, perplexity: 3278.1892
iteration: 3 of max_iter: 40, perplexity: 3137.3244
iteration: 4 of max_iter: 40, perplexity: 3005.9137
iteration: 5 of max_iter: 40, perplexity: 2891.8078
iteration: 6 of max_iter: 40, perplexity: 2808.4982
iteration: 7 of max_iter: 40, perplexity: 2749.4345
iteration: 8 of max_iter: 40, perplexity: 2707.6550
iteration: 9 of max_iter: 40, perplexity: 2677.8232
iteration: 10 of max_iter: 40, perplexity: 2656.1646
iteration: 11 of max_iter: 40, perplexity: 2639.6492
iteration: 12 of max_iter: 40, perplexity: 2626.3126
iteration: 13 of max_iter: 40, perplexity: 2615.4187
iteration: 14 of max_iter: 40, perplexity: 2607.2094
iteration: 15 of max_iter: 40, perplexity: 2600.8870
iteration: 16 of max_iter: 40, perplexity: 2595.9996
iteration: 17 of max_iter: 40, perplexity: 2591.5784
iteration: 18 of max_iter: 40, perplexity: 2587.6067
iteration: 19 of max_iter: 40, perplexity: 2584.5958
it

## Q4 (Bonus): Word vectors

Write a function `train_wordvec(docs, vector_size)` as follows:
- Take two inputs:
    - `docs`: a list of documents
    - `vector_size`: the dimension of word vectors
- First tokenize `docs` into tokens
- Use `gensim` package to train word vectors. Set the `vector size` and also carefully set other parameters such as `window`, `min_count` etc.
- return the trained word vector model

In [158]:
import pandas as pd
import nltk,string
from gensim.models import word2vec
from sklearn.metrics import classification_report

In [159]:
# Here we use a different dataset
train = pd.read_csv("hw7_train.csv")
test = pd.read_csv("hw7_test.csv")
# let's just use a sample for testing
test=test[0:500]

In [160]:
def train_wordvec(docs, vector_size):
    
   


    
    # add your code here
    
    return wv_model

In [161]:
# call yoor function
wv_model = train_wordvec(train["text"], vector_size = 300)

Next, write a function `generate_doc_vector(docs, wv_model)` as follows:
- Take two inputs:
    - `docs`: a list of documents, 
    - `wv_model`: trained word vector model. 
- First tokenize each document `doc` in `docs` into tokens
- For each token in `doc`, look up for its word vector in `wv_model`. Then the document vector (denoted as `d`) of `doc` can be calculated as the `mean of the word vectors of its tokens`, i.e. $d = \frac{\sum_{i \in doc}{v_i}}{|doc|}$, where $v_i$ is the word vector of the i-th token.
- Return the vector representations `vectors` of all `docs` as a numpy array of shape `(n, vector_size)`, where `n` is the number of documents in `docs` and `vector_size` is the dimension of word vectors.


In [162]:
def generate_doc_vector(docs, wv_model):
    
    vectors = None
    
    # add your code here
    
    
    

    return vectors

In [163]:
#get vectors

train_X = generate_doc_vector(train["text"], wv_model)
test_X = generate_doc_vector(test["text"], wv_model)

In [164]:
#fit a svm model

from sklearn import svm
clf = svm.LinearSVC().fit(train_X, train["label"])
predicted=clf.predict(test_X)
print(classification_report\
      (test["label"], predicted))

              precision    recall  f1-score   support

           0       0.74      0.76      0.75       237
           1       0.78      0.75      0.77       263

    accuracy                           0.76       500
   macro avg       0.76      0.76      0.76       500
weighted avg       0.76      0.76      0.76       500



