# HW 6: Clustering, Topic Modeling, Word Vectors

In this assignment, you'll practice different text clustering methods. A dataset has been prepared for you:
- `hw6_train.csv`: This file contains a list of documents. It is used for training models.
- `hw6_test.csv`: This file contains a list of documents and their ground-truth labels (4 labels: 1, 2, 3, 7). It is used for external evaluation.

|Text| Label|
|----|-------|
|paraglider collides with hot air balloon ... | 1|
|faa issues fire warning for lithium ... | 2|
| .... |...|

Sample outputs have been provided to you. Due to randomness, you may not get the same results as shown here. Your target is to achieve about 70% F1 on the test dataset.

## Q1: K-Means Clustering

Define a function `cluster_kmean(train_text, test_text, test_label)` as follows:
- Take three inputs: 
    - `train_text` is a list of documents for training
    - `test_text` is a list of documents for testing
    - `test_label` is the labels corresponding to documents in `test_text` 
- First generate `TFIDF` weights. You need to decide appropriate values for parameters such as `stop_words` and `min_df`:
    - Keep or remove stop words? Use customized stop words?
    - Set an appropriate `min_df` to filter out infrequent words.
- Use `KMeans` to cluster documents in `train_text` into 4 clusters. Here you need to decide the following parameters:
    - Distance measure: `cosine similarity` or `Euclidean distance`? Pick the one that gives you better performance.
    - When clustering, be sure to use sufficient iterations with different initial centroids to make sure the clustering converges.
- Test the clustering model performance using `test_label` as follows: 
  - Predict the cluster ID for each document in `test_text`.
  - Apply the `majority vote` rule to dynamically map the predicted cluster IDs to `test_label`. Do not hardcode the mapping, because cluster IDs may be assigned differently in each run (hint: if you use pandas, look at the `idxmax` function).
  - Print out the classification report for the test subset.
- This function has no return value; it only prints the classification report.
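
The majority-vote mapping can be sketched as follows. This is a minimal illustration on made-up cluster IDs and labels; the helper name `map_clusters_to_labels` is an assumption for this example, not part of the assignment.

```python
import pandas as pd

def map_clusters_to_labels(cluster_ids, true_labels):
    """Map each cluster ID to the most frequent true label among its members."""
    df = pd.DataFrame({"cluster": list(cluster_ids), "label": list(true_labels)})
    # Rows are cluster IDs, columns are true labels, cells are counts
    confusion = pd.crosstab(df["cluster"], df["label"])
    # idxmax(axis=1) picks, for each cluster, the label with the highest count
    mapping = confusion.idxmax(axis=1).to_dict()
    predicted = [mapping[c] for c in df["cluster"]]
    return mapping, predicted

# Toy data (made up): cluster 0 is mostly label 7, cluster 1 is mostly label 2
clusters = [0, 0, 0, 1, 1, 1]
labels = [7, 7, 2, 2, 2, 1]
mapping, predicted = map_clusters_to_labels(clusters, labels)
print(mapping)
print(predicted)
```

Because the mapping is recomputed from the data on every run, it stays valid when the cluster IDs are permuted. Note that two clusters can map to the same label, in which case some labels receive no predictions.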


- Briefly discuss:
    - Which distance measure is better and why it is better. 
    - Could you assign a meaningful name to each cluster? Discuss how you interpret each cluster.
- Write your analysis.
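
When discussing the distance measure, it may help to recall that `TfidfVectorizer` L2-normalizes each document vector by default. For unit-length vectors $a$ and $b$, $\|a-b\|^2 = 2\,(1 - \cos(a, b))$, so both measures rank neighbours of a document identically. The sketch below checks this identity on random vectors (made-up data, not the homework dataset).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random non-negative vectors, scaled to unit length like TF-IDF rows
a = rng.random(5)
b = rng.random(5)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine_dist = 1.0 - float(a @ b)                 # cosine distance
sq_euclidean = float(np.sum((a - b) ** 2))       # squared Euclidean distance

print(cosine_dist, sq_euclidean)
```

Differences between the two measures in K-means can still arise because cluster centroids, being averages of unit vectors, are generally not unit length themselves.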

In [3]:
# Add your import statement

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from nltk.corpus import stopwords
from nltk.cluster import KMeansClusterer,cosine_distance,euclidean_distance
from sklearn import mixture
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


In [4]:
train = pd.read_csv("hw6_train.csv")
train_text=train["text"]

test = pd.read_csv("hw6_test.csv")
test_label = test["label"]
test_text = test["text"]

train.head()

Unnamed: 0,text
0,Would you rather get a gift that you knew what...
1,Is the internet ruining people's ability to co...
2,Permanganate?\nSuppose permanganate was used t...
3,If Rock-n-Roll is really the work of the devil...
4,Has anyone purchased software to watch TV on y...


In [23]:
def cluster_kmean(train_text, test_text, test_label):
    
    # TF-IDF features: drop English stop words and words seen in fewer than 2 documents
    vector = TfidfVectorizer(stop_words='english', min_df=2)
    train_vectors = vector.fit_transform(train_text)
    test_vectors = vector.transform(test_text)
    print(train_vectors.shape, test_vectors.shape)

    # NLTK's KMeansClusterer supports cosine distance; repeats reruns the
    # clustering from different random initial centroids
    kmeans = KMeansClusterer(num_means=4, distance=cosine_distance,
                             repeats=5, avoid_empty_clusters=True)
    kmeans.cluster(train_vectors.toarray())
    print(kmeans.means())
    
    
    # NLTK's classify() takes a single vector, so classify the test documents one by one
    test_clusters = [kmeans.classify(v) for v in test_vectors.toarray()]

    # Majority vote: map each cluster ID to its most frequent ground-truth label
    confusion = pd.crosstab(np.array(test_clusters), np.array(test_label))
    cluster_to_label = confusion.idxmax(axis=1)
    pred = [cluster_to_label[c] for c in test_clusters]

    print(metrics.classification_report(test_label, pred))

In [24]:
cluster_kmean(train_text, test_text, test_label)

(4000, 14709) (1274, 14709)
[array([0.00175326, 0.00511773, 0.0001599 , ..., 0.        , 0.        ,
       0.00053882]), array([2.45409047e-04, 9.11195473e-04, 0.00000000e+00, ...,
       5.26917509e-05, 5.14957724e-05, 0.00000000e+00]), array([0.0022887 , 0.00640755, 0.00013225, ..., 0.        , 0.00064292,
       0.00013056]), array([5.88512787e-04, 9.77183741e-04, 0.00000000e+00, ...,
       6.09372505e-05, 3.22767464e-04, 0.00000000e+00])]

## Q2: Clustering by Gaussian Mixture Model

In this task, you'll redo the clustering using a Gaussian Mixture Model. Call this function `cluster_gmm(train_text, test_text, test_label)`.

You may take a subset from the data to do GMM because it can take a lot of time. 

Write your analysis on the following:
- How did you pick parameters such as the number of clusters and the covariance type?
- Compared to KMeans in Q1, do you achieve better performance with GMM?
- Note: as with KMeans, be sure to try several initializations (i.e. the `n_init` parameter) when fitting the model so that the result is stable.
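
One possible way to pick the number of components and the covariance type is to compare the Bayesian Information Criterion (BIC) across candidate settings, where lower is better. The sketch below does this on made-up 2-D data, not on the homework dataset; the candidate grid and `n_init=5` are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy data (made up): two well-separated blobs in 2-D
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(100, 2)),
])

results = {}
for cov_type in ["spherical", "diag", "full"]:
    for k in [1, 2, 3]:
        # n_init reruns EM from several initializations and keeps the best fit
        gmm = GaussianMixture(n_components=k, covariance_type=cov_type,
                              n_init=5, random_state=0)
        gmm.fit(X)
        results[(cov_type, k)] = gmm.bic(X)

# Lower BIC means a better trade-off between fit and model complexity
best = min(results, key=results.get)
print("best setting:", best)
```

On high-dimensional TF-IDF vectors, `full` covariances need a number of parameters quadratic in the vocabulary size, so `diag` or `spherical` (possibly after dimensionality reduction) are usually the only practical choices.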

In [25]:
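from sklearn.mixture import GaussianMixture
def cluster_gmm(train_text, test_text, test_label):
    
    vector = TfidfVectorizer(stop_words='english', min_df=2)
    train_vectors = vector.fit_transform(train_text)
    test_vectors = vector.transform(test_text)
    # Optionally fit on a subset (e.g. train_vectors[:1000]) if fitting is too slow

    # Diagonal covariances keep the parameter count linear in the vocabulary size;
    # n_init reruns EM from several initializations and keeps the best fit
    gmm = GaussianMixture(n_components=4, covariance_type='diag', n_init=5)
    gmm.fit(train_vectors.toarray())

    test_clusters = gmm.predict(test_vectors.toarray())

    # Majority vote: map each cluster ID to its most frequent ground-truth label
    confusion = pd.crosstab(test_clusters, np.array(test_label))
    cluster_to_label = confusion.idxmax(axis=1)
    pred = [cluster_to_label[c] for c in test_clusters]

    print(metrics.classification_report(test_label, pred))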

    

In [None]:
cluster_gmm(train_text, test_text, test_label)

## Q3: Clustering by LDA 

In this task, you'll redo the clustering using LDA. Call this function `cluster_lda(train_text, test_text, test_label)`.

Since LDA returns a topic mixture for each document, assign the topic with the highest probability to each test document, and then measure the performance as in Q1.

In addition, within the function, print out the top 30 words for each topic.

Finally, please analyze the following:
- Based on the top words of each topic, could you assign a meaningful name to each topic?
- Although the test subset shows there are 4 clusters, without this information, how do you choose the number of topics? 
- Does your LDA model achieve better performance than KMeans or GMM?
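
For the question about choosing the number of topics without labels, one common heuristic is to compare perplexity on held-out documents across candidate values, where lower is better. The sketch below uses a tiny made-up corpus and an arbitrary candidate range; topic coherence or manual inspection of the top words are reasonable alternatives.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up corpus, split into training and held-out documents
train_docs = [
    "plane engine fails before flight",
    "pilot lands plane at airport",
    "airport delays flight after warning",
    "chef bakes bread with butter",
    "recipe mixes flour butter sugar",
    "baker sells bread and cake",
    "team wins match with late goal",
    "coach praises team after match",
]
valid_docs = [
    "pilot delays flight at airport",
    "baker mixes flour and butter",
]

vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
X_valid = vectorizer.transform(valid_docs)   # reuse the training vocabulary

scores = {}
for k in [2, 3, 4]:
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    # Perplexity of the held-out documents under the fitted model
    scores[k] = lda.perplexity(X_valid)

best_k = min(scores, key=scores.get)
print(scores)
print("lowest held-out perplexity at k =", best_k)
```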

In [21]:
def cluster_lda(train_text, test_text, test_label):
    
    # LDA models word counts, so use raw term counts rather than TF-IDF weights
    vector = CountVectorizer(stop_words='english', min_df=2)
    train_vectors = vector.fit_transform(train_text)
    test_vectors = vector.transform(test_text)

    # Column j of the document-term matrix corresponds to feature_names[j]
    # (scikit-learn >= 1.0; older versions use get_feature_names())
    feature_names = vector.get_feature_names_out()

    lda = LatentDirichletAllocation(n_components=4)
    lda.fit(train_vectors)

    # Print the 30 highest-weighted words of each topic, most important first
    for i, topic in enumerate(lda.components_):
        print("Topic #{}:".format(i))
        top_words = [feature_names[index] for index in topic.argsort()[::-1][:30]]
        print(" ".join(top_words))

    # Assign each test document to its most probable topic
    test_topic_distributions = lda.transform(test_vectors)
    test_topics = test_topic_distributions.argmax(axis=1)

    # Majority vote: map each topic ID to its most frequent ground-truth label
    confusion = pd.crosstab(test_topics, np.array(test_label))
    topic_to_label = confusion.idxmax(axis=1)
    predicted_target = [topic_to_label[t] for t in test_topics]

    print(metrics.classification_report(test_label, predicted_target))

In [22]:
cluster_lda(train_text, test_text, test_label)

## Q4 (Bonus): Word vectors

Write a function `train_wordvec(docs, vector_size)` as follows:
- Take two inputs:
    - `docs`: a list of documents
    - `vector_size`: the dimension of word vectors
- First tokenize `docs` into tokens
- Use the `gensim` package to train word vectors. Set the `vector_size` and also carefully set other parameters such as `window`, `min_count`, etc.
- Return the trained word vector model.

In [158]:
import pandas as pd
import nltk,string
from gensim.models import word2vec
from sklearn.metrics import classification_report

In [159]:
# Here we use a different dataset
train = pd.read_csv("hw7_train.csv")
test = pd.read_csv("hw7_test.csv")
# let's just use a sample for testing
test=test[0:500]

In [160]:
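def train_wordvec(docs, vector_size):
    
    # Lowercase and tokenize each document into a list of tokens
    sentences = [nltk.word_tokenize(doc.lower()) for doc in docs]

    # window and min_count are starting values to tune, not prescribed settings
    wv_model = word2vec.Word2Vec(sentences, vector_size=vector_size,
                                 window=5, min_count=5, workers=4)
    
    return wv_model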

In [161]:
# call your function
wv_model = train_wordvec(train["text"], vector_size = 300)

Next, write a function `generate_doc_vector(docs, wv_model)` as follows:
- Take two inputs:
    - `docs`: a list of documents, 
    - `wv_model`: trained word vector model. 
- First tokenize each document `doc` in `docs` into tokens.
- For each token in `doc`, look up its word vector in `wv_model`. Then the document vector (denoted as `d`) of `doc` can be calculated as the `mean of the word vectors of its tokens`, i.e. $d = \frac{\sum_{i \in doc}{v_i}}{|doc|}$, where $v_i$ is the word vector of the i-th token.
- Return the vector representations `vectors` of all `docs` as a numpy array of shape `(n, vector_size)`, where `n` is the number of documents in `docs` and `vector_size` is the dimension of word vectors.
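
As a hand-checkable illustration of the averaging formula, the sketch below uses a plain dictionary of made-up 2-D vectors in place of a trained model. The names `toy_vectors` and `mean_vector` are hypothetical, and skipping tokens without a vector is one possible convention, not something the assignment prescribes.

```python
import numpy as np

# Made-up 2-D "word vectors" standing in for a trained model
toy_vectors = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
    "fish": np.array([1.0, 1.0]),
}

def mean_vector(tokens, vectors, vector_size):
    """Average the vectors of the tokens that have one; zeros if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(vector_size)
    return np.mean(known, axis=0)

# "unicorn" has no vector and is skipped, so d = ([1, 0] + [0, 1]) / 2
d = mean_vector(["cat", "dog", "unicorn"], toy_vectors, vector_size=2)
print(d)

# Stack one row per document to get an array of shape (n, vector_size)
docs_tokens = [["cat", "dog"], ["fish"], ["unicorn"]]
vectors = np.vstack([mean_vector(t, toy_vectors, 2) for t in docs_tokens])
print(vectors.shape)
```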


In [162]:
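def generate_doc_vector(docs, wv_model):
    
    wv = wv_model.wv
    vectors = np.zeros((len(docs), wv.vector_size))
    
    for i, doc in enumerate(docs):
        # Keep only tokens that have a trained vector (rare words are dropped by min_count)
        tokens = [t for t in nltk.word_tokenize(doc.lower()) if t in wv]
        if tokens:
            vectors[i] = np.mean([wv[t] for t in tokens], axis=0)

    return vectors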

In [163]:
#get vectors

train_X = generate_doc_vector(train["text"], wv_model)
test_X = generate_doc_vector(test["text"], wv_model)

In [164]:
#fit a svm model

from sklearn import svm
clf = svm.LinearSVC().fit(train_X, train["label"])
predicted=clf.predict(test_X)
print(classification_report(test["label"], predicted))

              precision    recall  f1-score   support

           0       0.74      0.76      0.75       237
           1       0.78      0.75      0.77       263

    accuracy                           0.76       500
   macro avg       0.76      0.76      0.76       500
weighted avg       0.76      0.76      0.76       500



