# Mini-Project 2: Clustering
CSC177-01 

##### Jimmy Sisenglath, Eric Pham, Justin Eugenio

 

# 1. Clustering of Twitter Users
- In the first problem we will look into the clustering of users. To apply clustering, we first need to represent each user as a numeric vector that records how often the user has used each hashtag/handle. Use **TfidfVectorizer()** in scikit-learn to create this representation, so that each frequency is weighted by its tf-idf score.
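
As a minimal sketch of this representation, the toy example below vectorizes three made-up users. The strings in `user_tags` and every `toy_*` name are hypothetical and are not part of the project data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: one string of hashtags/handles per user.
user_tags = [
    "#maga @foxnews #maga",
    "#imwithher @cnn",
    "@cnn @foxnews #news",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(user_tags)  # sparse matrix, users x terms

# The default tokenizer lowercases the text and drops '#' and '@',
# so "#maga" becomes the feature "maga".
# Sorting the vocabulary by column index recovers the column order.
toy_terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_terms)
print(toy_matrix.toarray().round(2))
```

Each row is one user and each column is one hashtag/handle. With the default settings every row is scaled to unit Euclidean length, which is why the values are fractions rather than raw counts.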

In [1]:
import pandas as pd
import numpy as np

# Import the cleaned data from the previous project.
# This dataframe keeps only the popular users, i.e. those with
# more than 20 tweets.
df_tweets = pd.read_csv('data/clean_tweets.csv')

# Print top 5 records.
df_tweets.head()

Unnamed: 0,Name,ScreenName,UserID,Location,TopHash
0,David Walling,davidjwalling,106568768,"Dallas, TX",#infosec
1,Harry Binford,CaptainNormal,2447279666,USA,@WayneDupreeShow #climatechange
2,My Info,chrisf91,2424856262,"San Antonio, TX",@ochocinco
3,EmmaPeel,Emma__Peel,180304807,emma w/2 underscores,@KrisBryant_23 #FlyTheW #WorldSeries
4,Laura,Ciocia,17235735,"London, Ontario, Canada",@NaNoWriMo


In [2]:
import sklearn.feature_extraction.text as sk_text

# Concatenate each user's hashtags/handles into one space-separated string.
df_hashtag_agg = df_tweets.groupby('UserID')['TopHash'].apply(lambda x: x.str.cat(sep=' '))
df_ready_for_sklearn = pd.DataFrame({'User_id': df_hashtag_agg.index,
'All_hashtags': df_hashtag_agg.values})

vectorizer = sk_text.TfidfVectorizer(stop_words = 'english',
                             max_features = 100,
                             min_df=2, 
                             max_df=500,
                             )

In [3]:
matrix = vectorizer.fit_transform(df_ready_for_sklearn.All_hashtags.values)
tfidf_data = matrix.toarray()

df_tfidf = pd.DataFrame(tfidf_data, 
                        index=df_hashtag_agg.index, 
                        columns=vectorizer.get_feature_names())

# Print tfidf score for each record.
print(tfidf_data)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.12694774 0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [4]:
# View all the features.
print(vectorizer.get_feature_names())
print("\nNumber of features:", len(vectorizer.get_feature_names()))

['0hour1__', 'abcpolitics', 'amjoyshow', 'ananavarro', 'ap', 'barackobama', 'borisjohnson', 'breakingnews', 'breitbartnews', 'bretbaier', 'brianefallon', 'brianstelter', 'brithume', 'cbsnews', 'cernovich', 'chrislhayes', 'clinton', 'clothing', 'cnnpolitics', 'crookedhillary', 'cspanwj', 'cubs', 'dailycaller', 'danaperino', 'danscavino', 'dcexaminer', 'dineshdsouza', 'donaldjtrumpjr', 'donnabrazile', 'draintheswamp', 'drudge_report', 'ebatterson', 'erictrump', 'evan_mcmullin', 'fahrenthold', 'flotus', 'foxandfriends', 'frankluntz', 'free', 'gohillary', 'gop', 'guardian', 'hello', 'hillary', 'huffpostpol', 'imwithher', 'ingrahamangle', 'jackposobiec', 'jamesokeefeiii', 'jaredwyand', 'jobs', 'joenbc', 'judgejeanine', 'keitholbermann', 'lgbt', 'lindasuhler', 'lorettalynch', 'loudobbs', 'maddow', 'maga', 'mattyglesias', 'mittromney', 'mmflint', 'morning_joe', 'mtracey', 'music', 'nbcnews', 'neverhillary', 'nevertrump', 'news', 'noltenc', 'nowplaying', 'nyc', 'nypost', 'oreillyfactor', 'paul

### Task 1.1
- Let’s apply clustering and compare the clustering result against a known ground truth. **In the file “clinton_trump_user_classes.txt”, we have the ground truth “class” membership for each user id in the data. Class 0 corresponds to Trump followers, while class 1 corresponds to Clinton followers.**

In [5]:
ground_cols = ['UserID', 'Class']
ground_truth = pd.read_table('data/clinton_trump_user_classes.txt', encoding ="ISO-8859-1", names= ground_cols)

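# Inner join on UserID (the index name of df_tfidf) to attach each user's class.
df_tfidf = pd.merge(df_tfidf, ground_truth, on = 'UserID')

# Ground truth. The rows line up with tfidf_data only if every user
# appears exactly once in the ground-truth file.
true = df_tfidf['Class'].values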

df_tfidf

Unnamed: 0,UserID,0hour1__,abcpolitics,amjoyshow,ananavarro,ap,barackobama,borisjohnson,breakingnews,breitbartnews,...,thehill,therickwilson,timkaine,trump,trumppence16,twitter,usatoday,walshfreedom,wsj,Class
0,2426,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1
1,1134681,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0
2,1329901,0.126948,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0
3,1354481,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,...,0.056229,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0
4,1366631,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,...,0.158567,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0
5,1386571,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0
6,1428731,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1
7,1449381,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1
8,1787311,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1
9,1790991,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1


In [6]:
from scipy.cluster.hierarchy import dendrogram, linkage  
import sklearn.cluster as sk_cluster

# Get 2 clusters.
kmeans = sk_cluster.KMeans(n_clusters=2, max_iter=100, n_init=10) 
kmeans_labels = kmeans.fit_predict(tfidf_data)

print("Cluster labels")
print(kmeans_labels, "\n")

centroids = kmeans.cluster_centers_
print ("Cluster Centroids") # Center points for clusters.
print(centroids, "\n")

error = kmeans.inertia_ 
# Sum of Squared Error (SSE): the sum of squared distances from each
# record to its closest cluster centroid.
print ("The total error of the clustering is: ", error)

Cluster labels
[0 0 0 ... 0 0 1] 

Cluster Centroids
[[ 1.55269813e-02  1.39614605e-02  1.54393519e-02  1.94792320e-02
   1.65025250e-02  1.74301548e-02  2.94004982e-03  3.61081329e-03
   2.39181879e-02  1.53706528e-02  1.15019019e-02  1.77010762e-02
   1.73260105e-02  2.54519509e-02  2.20145123e-02  2.58092556e-02
   2.93946951e-03  3.42083362e-04  3.21360635e-02  7.99915175e-03
   5.16376200e-03  2.82812631e-02  1.44369884e-02  8.12772015e-03
   1.91175988e-02  1.77136146e-02  1.49533317e-02  2.14009568e-02
   1.64332295e-02  1.42261210e-02  1.93183518e-02  3.59559133e-04
   1.43041541e-02  2.13267474e-02  1.65708283e-02  2.05801633e-02
   1.49438098e-02  1.52468641e-02  2.03810760e-03  9.43035781e-03
   2.22755876e-02  2.16877563e-02  2.73107289e-04  3.60520585e-03
   1.59804178e-02  1.75877694e-02  2.83272435e-02  1.44182761e-02
   1.47331022e-02  2.07633809e-02  1.48182725e-03  1.35908065e-02
   1.89431791e-02  1.78840418e-02  2.08134588e-04  1.43405129e-02
   1.02016629e-02  1.87

### K-Means

In [7]:
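# Load metric functions and compute the evaluation scores.
import sklearn.metrics as metrics 

# Cluster ids from k-means are arbitrary: cluster 0 is not guaranteed to
# correspond to class 0, so these scores depend on how the ids happened to fall.
confusion = metrics.confusion_matrix(true, kmeans.labels_, labels=[0, 1])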
print("Confusion Matrix: \n", confusion)
print('Precision Score:', metrics.precision_score(true, kmeans.labels_, average='weighted'))
print('Recall Score:', metrics.recall_score(true, kmeans.labels_, average='weighted'))
print('F1 Score:', metrics.f1_score(true, kmeans.labels_, average= 'weighted'))


Confusion Matrix: 
 [[2364  293]
 [1894  107]]
Precision Score: 0.4316032506487422
Recall Score: 0.5304851867754401
F1 Score: 0.4283000229984781
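
Because k-means numbers its clusters arbitrarily, one common precaution is to map cluster ids onto class ids before scoring. The sketch below illustrates the idea for the two-cluster case on made-up labels. The helper `align_binary_labels` and the `toy_*` arrays are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np

def align_binary_labels(true_labels, cluster_labels):
    """Flip 0/1 cluster ids if flipping increases agreement with the true labels."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    agreement = np.mean(true_labels == cluster_labels)
    if agreement < 0.5:
        return 1 - cluster_labels  # swap the meaning of cluster 0 and cluster 1
    return cluster_labels

# Hypothetical labels: the clustering mostly separates the classes,
# but with the ids swapped relative to the ground truth.
toy_true = np.array([0, 0, 0, 1, 1, 1])
toy_clusters = np.array([1, 1, 0, 0, 0, 0])

aligned = align_binary_labels(toy_true, toy_clusters)
print(aligned)
```

Applied to the notebook, the aligned labels would replace `kmeans.labels_` in the metric calls. This only handles two clusters; with more clusters a matching step over the full confusion matrix would be needed.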


In [None]:
# print(kmeans.cluster_centers_)

### Task 1.2
- For k-means, look at the two centers (centroids) and print the top-30 hashtags/handles with the highest tfidf values.

In [8]:
asc_order_centroids = kmeans.cluster_centers_.argsort() # argsort() returns the indices that would sort an array.
# print(asc_order_centroids)
des_order_centroids = asc_order_centroids[:,::-1] # Get the indices that sort array in descending order.
# print(des_order_centroids)   

In [11]:
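terms = vectorizer.get_feature_names()
#print(terms, '\n')

# NOTE: asc_order_centroids is sorted in ascending order, so [:30] lists the
# 30 lowest-weighted terms of each centroid; index des_order_centroids
# instead to get the top 30. The number printed next to each term is its
# column index in the vocabulary, not its tfidf value.
for i in range(2):
    print ("Cluster:", i)
    for ind in asc_order_centroids[i, :30]:
        print (terms[ind], vectorizer.vocabulary_[terms[ind]])
    print()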

Cluster: 0
lgbt 54
hello 42
clothing 17
ebatterson 31
music 65
jobs 50
nyc 72
spotify 85
free 38
nowplaying 71
tcot 86
news 69
clinton 16
borisjohnson 6
hillary 43
breakingnews 7
trump 94
cspanwj 20
trumppence16 95
mittromney 61
maga 59
crookedhillary 19
danaperino 23
neverhillary 67
realalexjones 78
gohillary 39
lorettalynch 56
brianefallon 10
therickwilson 92
twitter 96

Cluster: 1
ebatterson 31
hello 42
clothing 17
spotify 85
nowplaying 71
cspanwj 20
borisjohnson 6
free 38
nyc 72
jobs 50
lgbt 54
dineshdsouza 26
therickwilson 92
music 65
mtracey 64
mattyglesias 60
paulkrugman 75
joenbc 51
thefix 90
breakingnews 7
news 69
thedailybeast 89
ap 4
brianstelter 11
guardian 41
amjoyshow 2
reuters 80
sopandeb 83
cubs 21
ananavarro 3
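
The task asks for the terms with the highest tfidf values in each centroid, paired with those values. A minimal sketch of that selection on made-up data follows; the helper `top_terms` and the `toy_*` values are hypothetical and stand in for `kmeans.cluster_centers_` and the vectorizer's feature names.

```python
import numpy as np

def top_terms(centers, terms, k):
    """For each centroid, return the k (term, weight) pairs with the largest weights."""
    # argsort is ascending, so reverse each row to put the highest weight first.
    order = np.argsort(centers, axis=1)[:, ::-1]
    return [[(terms[j], float(centers[i, j])) for j in order[i, :k]]
            for i in range(centers.shape[0])]

# Hypothetical centroids over four terms.
toy_terms = ["maga", "imwithher", "news", "cubs"]
toy_centers = np.array([[0.40, 0.01, 0.10, 0.00],
                        [0.02, 0.35, 0.20, 0.05]])

for i, pairs in enumerate(top_terms(toy_centers, toy_terms, k=2)):
    print("Cluster:", i, pairs)
```

With the notebook's objects, the call would presumably be `top_terms(kmeans.cluster_centers_, terms, k=30)`, run while `kmeans` is still the two-cluster model.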



Compute the confusion matrix, precision, recall, and F-measure for (1) k-means, (2) MAX-based agglomerative clustering, and (3) SSE-based agglomerative clustering.

### MAX-based Agglomerative Clustering

In [None]:
ag = sk_cluster.AgglomerativeClustering(linkage = 'complete', n_clusters = 2)   

# linkage: Determines which distance to use between sets of observations.

# complete: (MAX) Proximity between two clusters is the proximity between
#           their two most distant objects.
# n_clusters: The number of clusters to find.

ag_labels = ag.fit_predict(tfidf_data)

# print ('\nPrinting cluster assignment:')
# ag_labels 

confusion = metrics.confusion_matrix(true, ag_labels, labels=[0, 1])
print("Confusion Matrix: \n", confusion)
print('Precision Score:', metrics.precision_score(true, ag_labels, average='weighted'))
print('Recall Score:', metrics.recall_score(true, ag_labels, average='weighted'))
print('F1 Score:', metrics.f1_score(true, ag_labels, average='weighted'))

### SSE-based Agglomerative Clustering

In [None]:
ag = sk_cluster.AgglomerativeClustering(linkage = 'ward', n_clusters = 2)   

# ward: Merges the pair of clusters that gives the smallest increase in
#       within-cluster variance (SSE).
# n_clusters: The number of clusters to find.

# tfidf data.
ag_labels = ag.fit_predict(tfidf_data)

# print ('\nPrinting cluster assignment:')
# ag_labels 

confusion = metrics.confusion_matrix(true, ag_labels, labels=[0, 1])
print("Confusion Matrix: \n", confusion)
print('Precision Score:', metrics.precision_score(true, ag_labels, average='weighted'))
print('Recall Score:', metrics.recall_score(true, ag_labels, average='weighted'))
print('F1 Score:', metrics.f1_score(true, ag_labels, average='weighted'))

### Task 1.3
- Show the **two respective word clouds** of the two centers (centroids) by using hashtags/handles and their tfidf values. **Hint: Use function fit_words() that comes with wordcloud**

In [None]:
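from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

terms = vectorizer.get_feature_names()

# One word cloud per centroid, with each hashtag/handle sized by its tfidf
# weight in that centroid. fit_words() expects a {word: weight} mapping;
# vectorizer.vocabulary_ maps words to column indices, so it is not used here.
# `kmeans` must still be the 2-cluster model fitted in Task 1.1.
fig, axes = plt.subplots(1, 2, figsize=(20, 5))
for i, ax in enumerate(axes):
    weights = dict(zip(terms, kmeans.cluster_centers_[i]))
    wordcloud = WordCloud().fit_words(weights)
    # Display the generated image.
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.set_title("Cluster %d" % i)
    ax.axis("off")
plt.show()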

# 2. Clustering of Hashtags/handles
- In the second problem, we will look into clustering of hashtags/handles. Represent each hashtag/handle as a vector of integers with the frequency of the hashtag/handle for each user. **Hint: You may transpose the matrix you achieved in the first problem to construct this representation.**

## Task 2.1 
- First, you apply the k-means algorithm. Create a plot of the SSE error of the k-means algorithm as a function of the number of clusters, for k up to 20, in order to determine the optimal number of clusters.
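
As a rough illustration of the transpose-then-sweep idea, the sketch below builds a small made-up users-by-terms matrix, transposes it so that each row is a term, and records the SSE for several values of k. All `toy_*` names and the data itself are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical users-by-terms matrix: 30 users, 12 terms.
# Terms 0-5 are used mostly by the first 15 users, terms 6-11 by the last 15.
toy_users_by_terms = rng.random((30, 12)) * 0.05
toy_users_by_terms[:15, :6] += 1.0
toy_users_by_terms[15:, 6:] += 1.0

# Transpose: one row per term, one column per user.
toy_terms_by_users = toy_users_by_terms.T

ks = range(1, 7)
sse = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_terms_by_users)
    sse.append(model.inertia_)  # sum of squared distances to the closest centroid

for k, err in zip(ks, sse):
    print(k, round(err, 3))
```

On data like this the SSE drops sharply from k=1 to k=2 and then flattens, which is the "elbow" the plot is meant to reveal.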

In [None]:
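# Cluster hashtags/handles: transpose so that each row is one
# hashtag/handle and each column is a user.
hashtag_data = tfidf_data.T

ks = range(1, 21)  # k = 1..20
error = []
for k in ks:
    kmeans = sk_cluster.KMeans(n_clusters=k, n_init=10, max_iter=300)
    kmeans.fit(hashtag_data)
    error.append(kmeans.inertia_)  # SSE for this k

plt.plot(ks, error)
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()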

## Task 2.2
- Run the k-means algorithm for the optimal number of clusters you identified in the last task. Print some hashtags/handles in each cluster. From the hashtags/handles in each cluster, try to deduce what is the topic it concerns.

In [None]:
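terms = np.array(vectorizer.get_feature_names())
#print(terms, '\n')

# Each row of the transposed matrix is one hashtag/handle.
hashtag_data = tfidf_data.T

# Number of clusters (set this to the elbow found in Task 2.1).
n_clusters = 20
kmeans = sk_cluster.KMeans(n_clusters=n_clusters, n_init=10, max_iter=300)
hashtag_labels = kmeans.fit_predict(hashtag_data)

# Print up to 30 hashtags/handles assigned to each cluster.
for j in range(n_clusters):
    print ("Cluster:", j)
    for term in terms[hashtag_labels == j][:30]:
        print (term)
    print()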