# Lab 6: Clustering and Classification of Headlines

## Due Wednesday, October 26 @ 11:59pm EDT

In the directory where you found this notebook, you will also find these files:
<ol>
<li><code>train_clickbait.txt</code>: 15000 clickbait headlines, one per line</li>
<li><code>train_non_clickbait.txt</code>: 15000 news headlines, one per line</li>
<li><code>test.txt</code>: 1000 clickbait headlines and 1000 news headlines, one per line</li>
</ol>


Add, commit, and push this notebook by the deadline with:
<ol>
    <li> code for all required components</li>
    <li> answers to the boldfaced Q questions in the notebook</li>
</ol>

## 1. Preliminaries

In this problem set, you will be working with a corpus of headlines, some of which are from typical clickbait sources and others which are from standard news sources. You can learn more about the corpus and some experiments that have been run with the corpus in the article that accompanied its release:

Chakraborty, A., Paranjape, B., Kakarla, S., and Ganguly, N. 2016. Stop Clickbait: Detecting and preventing clickbaits in online news media. In <i>Proceedings of the IEEE International Conference on Advances in Social Networks Analysis and Mining (ASONAM)</i>, pp. 9-16.

First we will import some libraries. Then we read in the news (non-clickbait) headlines, remove stop words and digits, downcase, and save both the normalized lists of tokens and the original strings to two lists. You can just execute the entire cell below without modifying it.

In [None]:
import gensim
from nltk.corpus import stopwords
import numpy as np
import scipy as sp
import re
from sklearn.cluster import KMeans


## Here I am just customizing the nltk English stop list
stoplist = stopwords.words('english')
stoplist.extend(["ever", "one", "do","does","make", "go", "us", "to", "get", "about", "may", "s", ".", ",", "!", "i", "I", '\"', "?", ";", "--", "--", "would", "could", "”", "Mr.", "Miss", "Mrs.", "don’t", "said", "can't", "didn't", "aren't", "I'm", "you're", "they're", "'s"])
stoplist.remove("which")


## Here I am reading in the news (non-clickbait headlines)
newsheadlines = []     # this will store the original headline strings
newsheadlinetoks = []  # this will store the lists of tokens in those headlines

with open("train_non_clickbait.txt") as f:
    for line in f:
        line = line.rstrip()
        newsheadlines.append(line)
        line = re.sub(r"(^| )[0-9]+($| )", r" ", line)  # remove digits
        addme = [t.lower() for t in line.split() if t.lower() not in stoplist]
        newsheadlinetoks.append(addme)

## Now, just printing out an example line from the original headline strings
print(newsheadlines[50])

## And printing out the normalized list of tokens for that string
print(newsheadlinetoks[50])
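
To make the normalization step concrete, here is a small sketch that applies the same regular expression and stop-word filter to a single made-up headline. The function name `normalize`, the headline, and the four-word `toy_stoplist` are all assumptions for illustration; the real cell above uses the much longer customized nltk stop list.

```python
import re

def normalize(headline, stoplist):
    """Mirror the cell above: drop standalone numbers, lowercase, filter stop words."""
    headline = re.sub(r"(^| )[0-9]+($| )", r" ", headline.rstrip())
    return [t.lower() for t in headline.split() if t.lower() not in stoplist]

# Hypothetical stop list, far shorter than the real one
toy_stoplist = {"the", "in", "to", "of"}

toy_tokens = normalize("Storm Kills 12 in the Philippines\n", toy_stoplist)
print(toy_tokens)   # ['storm', 'kills', 'philippines']
```

Note that the regular expression only removes numbers that stand alone as a token; digits attached to other characters (for example `2016:`) are left in place.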

## 2. Identifying topics in news headlines

In class, we learned a little bit about topic modeling with LDA. Here, we are going to try to identify topics in a different way using word2vec word embeddings. So first we need to load the word embedding model from Lab 5. <b>Move that model to this directory or change the path in the command below to your lab5 directory so that you can execute the command below and load the model.</b>

In [None]:
bigmodel = gensim.models.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300-SLIM.bin", binary=True)

The code below will read in the normalized tokens for each headline, look up their vectors in the word2vec model, and sum them all up into a single vector per headline. As we discussed in class, word embeddings are especially cool because you can add them up and capture the semantics of an entire text.

You can just execute the cell below.

In [None]:
newsvectors = []   # this list will contain one 300-dimensional vector per headline

for h in newsheadlinetoks:
    totvec = np.zeros(300)
    for w in h:
        if w.lower() in bigmodel:
            totvec = totvec + bigmodel[w.lower()]
    newsvectors.append(totvec)

print(len(newsvectors))
print(len(newsheadlines))
print(len(newsvectors[10]))
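
To see what the summation does on a small scale, here is a toy sketch. It assumes a plain dictionary of made-up 3-dimensional vectors standing in for `bigmodel` (the real vectors are 300-dimensional); the names `toy_model` and `sum_vectors` are hypothetical. Words missing from the model are skipped, just as in the cell above.

```python
import numpy as np

# Made-up 3-dimensional "embeddings"; the real model has 300 dimensions
toy_model = {
    "storm": np.array([1.0, 0.0, 2.0]),
    "kills": np.array([0.5, 1.0, 0.0]),
}

def sum_vectors(tokens, model, dim):
    """Add up the vectors of all tokens found in the model."""
    total = np.zeros(dim)
    for w in tokens:
        if w in model:          # out-of-vocabulary words contribute nothing
            total = total + model[w]
    return total

toy_vec = sum_vectors(["storm", "kills", "philippines"], toy_model, 3)
print(toy_vec)   # [1.5 1.  2. ]
```

One consequence worth keeping in mind: a headline whose tokens are all out of vocabulary ends up as the all-zeros vector.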

In class, we discussed a clustering method called <b>k-means clustering</b>. This technique takes vectors and tries to group them together based on how far apart those vectors are from each other. As with LDA, you need to say how many clusters you want in advance. We're going to say 50.

Again, you can just execute the code below. <b>Note: It will take a few minutes! Wait until the asterisk in the square brackets is replaced by a number to continue</b>.


In [None]:
kmnews = KMeans(n_clusters=50, random_state=0)
newsclusters = kmnews.fit_predict(newsvectors)
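
If you want to get a feel for k-means before looking at the real output, the following sketch clusters four made-up 2-dimensional points into two groups. The data and the `toy_` names are assumptions for illustration, and `n_init=10` is set explicitly only so the toy run does not depend on the library's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups: points near (0, 0) and points near (5, 5)
toy_points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_km.fit_predict(toy_points)   # one cluster ID per input point

print(toy_labels)
```

The first two points share one cluster ID and the last two share the other. Which group is called 0 and which is called 1 is arbitrary, so cluster IDs carry no meaning beyond grouping.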

Now let's see what the clusters look like. The k-means <code>fit_predict()</code> function in scikit-learn returns a NumPy array containing one integer for every input vector: the ID of the cluster that vector was assigned to. We can simply iterate through that array of cluster assignments and print out all the headlines that belong to one of the clusters.

Execute the code below to see all the headlines in cluster 35.

In [None]:
for i in range(len(newsclusters)):
    if newsclusters[i] == 35:
        print(newsheadlines[i])
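
When deciding which clusters to inspect, it can help to know how many headlines each one contains. The sketch below is one possible way to do that, shown on a made-up assignment array with three clusters; `toy_clusters` and `toy_headlines` are hypothetical stand-ins for `newsclusters` and `newsheadlines`.

```python
import numpy as np

# Hypothetical cluster assignments for six headlines
toy_clusters = np.array([0, 2, 2, 1, 2, 0])
toy_headlines = ["h0", "h1", "h2", "h3", "h4", "h5"]

# Number of headlines assigned to each cluster ID
toy_sizes = np.bincount(toy_clusters, minlength=3)
print(toy_sizes)     # [2 1 3]

# All headlines in cluster 2, equivalent to the loop above
toy_members = [h for h, c in zip(toy_headlines, toy_clusters) if c == 2]
print(toy_members)   # ['h1', 'h2', 'h4']
```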

### Q1: There are 50 clusters, numbered 0 through 49. Examine many different clusters of headlines by changing 35 in the code above to another integer between 0 and 49. Find <u>5</u> clusters of headlines that you think are good clusters. Report on these 5 clusters as follows: Make a table with three columns: (1) cluster ID; (2) your description of that cluster's topic; (3) 3 headlines from the cluster exemplifying that topic.

### Some clusters might not immediately make a lot of sense, and there will probably be some headlines that don't perfectly fit with the rest in a particular cluster. If you can't figure out the topic of a cluster, just try a different one. It should be very easy to get 5 that are easy to categorize.

### You can learn how to make a table in markdown [here](https://www.markdownguide.org/extended-syntax/) or just by doing a web search for "table markdown".

*Your answer to Q1 goes here*

## 3. Identifying topics in clickbait headlines

Repeat the above procedure on the file of clickbait headlines, which is called <code>train_clickbait.txt</code>. 

Any variables that have "news" in their names in the above code should have "click" in their name in your code. I've given you the variable names and descriptions in the cells below where you should put your code.


In [None]:
## READ IN THE DATA
clickheadlines = []     # this will store the original clickbait headline strings
clickheadlinetoks = []  # this will store the lists of tokens in those clickbait headlines


## CREATE YOUR VECTORS
clickvectors = []       # this list will contain one 300-dimensional vector per clickbait headline



In [None]:
## RUN KMEANS TO CLUSTER YOUR DATA

# Once you've built clickvectors in the cell above,
# you can run this below to cluster the data.

kmclick = KMeans(n_clusters=50, random_state=0)  
clickclusters = kmclick.fit_predict(clickvectors)  

In [None]:
## PRINT OUT HEADLINES IN A PARTICULAR CLUSTER

# Once you have clustered your clickbait data
# you can run this code to print out the headlines
# in a particular cluster.

for i in range(len(clickclusters)):
    if clickclusters[i] == 35:
        print(clickheadlines[i])

### Q2: There are 50 clusters of clickbait headlines, numbered 0 through 49. Examine many different clusters of headlines by changing 35 in the code above to another integer between 0 and 49. Find <u>5</u> clusters of headlines that you think are good clusters. Report on these 5 clusters as follows: Make a table with three columns: (1) cluster ID; (2) your description of that cluster's topic; (3) 3 headlines from the cluster exemplifying that topic.

*Your answer to Q2 goes here*

### Q3: How do these clusters compare with the clusters we saw using LDA in class last Wednesday?

*Your answer to Q3 goes here*

### Q4: How would you design an experiment to evaluate the quality of these clusters?

*Your answer to Q4 goes here*

## 4. Nearest neighbor classification

Examine the file <code>test.txt</code>. The very first character of each line is either 1 or 0. A 1 indicates that the headline is clickbait. A 0 indicates that the headline is a news headline. The 1 or 0 is followed by a tab and then by the text of the headline itself.

You are going to read in the data from <code>test.txt</code> and try to classify each headline as either clickbait or news. The first method you will use is the k-nearest neighbors algorithm discussed a few weeks ago in class. To simplify things, we'll set k to be equal to 1.

Here's how the algorithm works:
<ol>
    <li> Take an incoming headline, and sum the word embedding vectors of its component words, downcasing and removing stopwords, <u>exactly</u> as you did above with the training data. (Reuse that code!)</li>
    <li> Compare that vector to every vector in <code>clickvectors</code> using <code>scipy.spatial.distance.cdist</code>. (See the code for details.)</li>
    <li> Compare your vector to every vector in <code>newsvectors</code> with <code>scipy.spatial.distance.cdist</code></li>
    <li> Find the minimum of each row using <code>min</code>, as described in the code.</li>
    <li> For each headline, if the news min is smaller than the clickbait min then classify that headline as 0 (news). Otherwise classify that headline as 1 (clickbait).</li>
</ol>


Given this information, you must also write code to keep track of and report the performance of your k-nearest neighbors classifier. Report percent correct, precision, and recall.
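
As a reminder of what these three numbers mean, here is a small worked sketch on made-up labels, treating clickbait (1) as the positive class. The lists `toy_true` and `toy_pred` are assumptions for illustration only. There are 2 true positives, 1 false positive, and 1 false negative, so precision is 2/3 and recall is 2/3; 4 of the 6 predictions are correct.

```python
from sklearn import metrics

toy_true = [1, 1, 1, 0, 0, 0]   # hypothetical gold labels
toy_pred = [1, 1, 0, 1, 0, 0]   # hypothetical classifier output

toy_accuracy = metrics.accuracy_score(toy_true, toy_pred)     # correct / total
toy_precision = metrics.precision_score(toy_true, toy_pred)   # TP / (TP + FP)
toy_recall = metrics.recall_score(toy_true, toy_pred)         # TP / (TP + FN)

print(round(toy_accuracy, 3), round(toy_precision, 3), round(toy_recall, 3))
```

`metrics.classification_report`, used in the cell below, reports precision and recall for each class separately along with overall accuracy.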

Note: <code>scipy.spatial.distance.cdist</code> takes the following arguments:
* the 2D array containing your test vectors <code>testvectors</code>
* the 2D array containing either the clickbait vectors or the news vectors

And it returns a 2D array containing, for each row, the distance from that test vector to each of the clickbait or news vectors.
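
The shape of the <code>cdist</code> output and the row-wise minimum can be sketched on a few made-up 2-dimensional points. All of the `toy_` arrays below are assumptions for illustration; by default <code>cdist</code> computes Euclidean distance.

```python
import numpy as np
from scipy.spatial.distance import cdist

toy_test = np.array([[0.0, 0.0], [3.0, 4.0]])                 # 2 test vectors
toy_click = np.array([[0.0, 1.0], [6.0, 8.0]])                # 2 "clickbait" vectors
toy_news = np.array([[3.0, 4.0], [10.0, 10.0], [0.0, 5.0]])   # 3 "news" vectors

toy_clickdist = cdist(toy_test, toy_click)   # shape (2, 2): test rows x clickbait columns
toy_newsdist = cdist(toy_test, toy_news)     # shape (2, 3): test rows x news columns

toy_clickmins = toy_clickdist.min(axis=1)    # nearest clickbait vector per test vector
toy_newsmins = toy_newsdist.min(axis=1)      # nearest news vector per test vector

# 0 (news) if the nearest news vector is closer, otherwise 1 (clickbait)
toy_pred = [0 if n < c else 1 for n, c in zip(toy_newsmins, toy_clickmins)]
print(toy_pred)   # [1, 0]
```

The first test point is 1 unit from its nearest clickbait vector and 5 units from its nearest news vector, so it is labeled 1; the second coincides with a news vector, so it is labeled 0.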

<b>Do not use the built-in KNN function in scikit-learn.</b>


In [None]:
from scipy.spatial.distance import cdist
from sklearn import metrics

## WRITE YOUR K-NEAREST NEIGHBORS CODE HERE
## COMMENT YOUR CODE CLEARLY

testtargets = []  # where to store whether a test headline is 0 or 1
testvectors = []  # where to store the vector for each headline

# while you read in test.txt...
# ...keep track of whether each headline is 1 (clickbait) or 0 (news) in the list testtargets[]
# ...AND get the summed word embedding vector for each headline and append it to list testvectors[]



## SANITY CHECKING
# len(testvectors) should equal 2000 and should be a list of lists
# len(testvectors[100]) should equal 300
# len(testtargets) should equal 2000 and should be a list of 1s and 0s


## GET THE DISTANCES
# get the distance between each test vector and each of the clickbait vectors
# use scipy.spatial.distance.cdist(testvectors, clickvectors)
# save the output of cdist to a 2D array called clickdistances
# each row will correspond to one test vector
# each value in the row will correspond to the distance between that vector
# and one of the clickbait vectors


# get the distance between each test vector and each of the news vectors
# use scipy.spatial.distance.cdist(testvectors, newsvectors)
# save the output of cdist to a 2D array called newsdistances
# each row will correspond to one test vector
# each value in the row will correspond to the distance between that vector
# and one of the news vectors


## GET THE MIN DISTANCES
# get the min of each row in clickdistances using clickdistances.min(axis=1)
# save out to a list or vector called clickmins

# get the min of each row in newsdistances using newsdistances.min(axis=1)
# save out to a list or vector called newsmins





## GET YOUR PREDICTIONS
predictedknn = []  # where to store your KNN predictions 

# loop through the mins in newsmins and clickmins
# if the news min is smaller than the click min, append 0 to predictedknn
# otherwise append 1 to predictedknn


## EVALUATE YOUR PREDICTIONS
# print the classification report
print(metrics.classification_report(testtargets, predictedknn))




### Q5: What would be the random baseline for this dataset? 


*Your answer to Q5 goes here*

### Q6: What percent of the headlines did you correctly classify with your KNN implementation? What were your precision and recall? How does this compare to the random baseline?

*Your answer to Q6 goes here*

## 5. Classification with word embedding features

Now we are going to use some of the classifiers in scikit-learn. First, we're going to create a classifier that will take, as features, the summed word embedding vector for a headline.

We already have all our input for the classifier. We just need to put it in the right format by executing the cell below.

In [None]:
# alltargets will just be a list of 1s and 0s indicating
# which class each headline belongs to (clickbait or news)
alltargets = list(np.ones(len(clickvectors)))
alltargets.extend(np.zeros(len(newsvectors)))
alltargets = np.array(alltargets)

# allvectors is just the full set of word embedding vectors
# for both clickbait and news headlines
allvectors = clickvectors + newsvectors
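
The cell above relies on `clickvectors` and `newsvectors` being Python lists, so that `+` concatenates them. If either were a NumPy array, `+` would attempt element-wise addition instead. The sketch below illustrates the list case on made-up 2-dimensional vectors; the `toy_` names are hypothetical.

```python
import numpy as np

toy_click = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]   # 2 "clickbait" vectors
toy_news = [np.array([5.0, 6.0])]                          # 1 "news" vector

# List concatenation: clickbait vectors first, then news vectors
toy_all = toy_click + toy_news

# Labels in the same order: 1 for each clickbait vector, 0 for each news vector
toy_targets = np.concatenate([np.ones(len(toy_click)), np.zeros(len(toy_news))])

# Stacking into a 2D array shows the layout a classifier will see
toy_matrix = np.vstack(toy_all)
print(toy_matrix.shape)   # (3, 2)
print(toy_targets)        # [1. 1. 0.]
```

The important point is that row i of the feature matrix and entry i of the target array must describe the same headline.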

You've seen the code below in previous labs. All we are doing is training a model on the summed word embedding vector for each headline. First we try our old friend naive Bayes, then a linear SVM, and then logistic regression. I've written the code for naive Bayes. You write the code for the other two classifiers. (Refer to previous labs to remember how to do this. It's easy!)

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression



# NAIVE BAYES

# Initialize the NB model
model = GaussianNB()

# Fit model to the training data
model.fit(allvectors, alltargets)

# Apply model to test set using predict()
expected = testtargets
predicted = model.predict(testvectors)

# Print a classification report
print(metrics.classification_report(expected, predicted))


### SVM

# Initialize SVM model


# Fit it to the training data


# Apply model to test set using predict()


# Print a classification report


### Logistic regression

# Initialize the model


# Fit it to the training data


# Apply model to test set using predict()


# Print a classification report
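
Every scikit-learn classifier follows the same initialize, fit, predict pattern that the naive Bayes code uses. The sketch below illustrates that pattern on synthetic, well-separated 2-dimensional data rather than on the headline vectors; the data, the `toy_` names, and the choice to score on the training points are assumptions made purely to show the API.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Two synthetic blobs: class 0 around (-3, -3), class 1 around (3, 3)
rng = np.random.default_rng(0)
toy_X = np.vstack([rng.normal(-3.0, 0.5, size=(20, 2)),
                   rng.normal(3.0, 0.5, size=(20, 2))])
toy_y = np.array([0] * 20 + [1] * 20)

toy_scores = {}
for name, clf in [("naive Bayes", GaussianNB()),
                  ("linear SVM", LinearSVC()),
                  ("logistic regression", LogisticRegression())]:
    clf.fit(toy_X, toy_y)            # train
    toy_out = clf.predict(toy_X)     # predict (here on the training data itself)
    toy_scores[name] = metrics.accuracy_score(toy_y, toy_out)

print(toy_scores)
```

Scoring on the training data is only acceptable in a toy demonstration like this one; in the lab, predictions must be made on `testvectors` and compared with `testtargets`.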




### Q7: Which performs best? How do they compare to your k-nearest neighbors classifier? Why do you think this might be?

*Your answer to Q7 goes here*

## 6. Classification with cluster features

Finally, we will explore whether the distances to each of the clusters can be used as features for classification of clickbait vs. news. The code in the cell below will build the necessary input for you. 


In [None]:
testclickdistances = kmclick.transform(testvectors)
testnewsdistances = kmnews.transform(testvectors)

clickdistances = kmclick.transform(allvectors)
newsdistances = kmnews.transform(allvectors)

# this array tells you, for each training headline, how far away
# it is from each of the 100 cluster centers (50 clickbait and 50 news)
allclusterdistances = np.column_stack( [clickdistances,newsdistances])

# this array tells you, for each test headline, how far away
# it is from each of the 100 cluster centers (50 clickbait and 50 news)
testclusterdistances = np.column_stack( [testclickdistances,testnewsdistances])
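
To see what <code>transform</code> produces, here is a toy sketch with two small k-means models of two clusters each, fit on made-up 2-dimensional points. The arrays and the `toy_` names are assumptions for illustration. For a fitted `KMeans` model, `transform` returns one column per cluster containing the distance from each input row to that cluster's center.

```python
import numpy as np
from sklearn.cluster import KMeans

toy_a = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
toy_b = np.array([[5.0, 5.0], [5.0, 7.0], [-5.0, 5.0], [-5.0, 7.0]])

toy_km_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_a)
toy_km_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_b)

toy_point = np.array([[0.0, 1.0]])   # one new vector to featurize

# 2 distances from the first model + 2 from the second = 4 features
toy_features = np.column_stack([toy_km_a.transform(toy_point),
                                toy_km_b.transform(toy_point)])
print(toy_features.shape)   # (1, 4)
print(toy_features)
```

The new point sits exactly on one center of the first model, so one of its first two features is 0 and the other is 10. In the lab, the same construction yields 100 features per headline.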

Finally, in the cell below, train and test a naive Bayes classifier, a linear SVM, and a logistic regression classifier using <code>allclusterdistances</code> instead of <code>allvectors</code> and <code>testclusterdistances</code> instead of <code>testvectors</code>.

In [None]:
# Train (fit) your NB with allclusterdistances

# Test your NB with testclusterdistances

# Print classification report


# Train (fit) your SVM with allclusterdistances

# Test your SVM with testclusterdistances

# Print classification report


# Train (fit) your Logistic Regression with allclusterdistances

# Test your LR with testclusterdistances

# Print classification report





### Q8: Which performs best? How do they compare to the models trained just on word embedding vectors in part 5? Why do you think this might be?

*Your answer to Q8 goes here*

### Q9: Create a table with all of the results from parts 4, 5, and 6. Discuss what you believe would be the best approach of these three for identifying clickbait. Then describe another experiment you would like to run to improve clickbait classification (e.g., using different features, using a different classifier).

*Your answer to Q9 goes here*

## 7. Verifying and submitting your work
Go up to the Kernel menu and select Restart and Run All. This will run all of the code you've written. Make sure there are no errors.

**Add, commit, and push this notebook to your repo with both your code and your answers to the questions.**
