IBM Machine Learning Professional Certificate<br>
__Unsupervised Machine Learning__

# Clustering of Corona Tweets
***

__Author__: Chawit Kaewnuratchadasorn<br>
__Date__: 15th Jan 2022<br>

This notebook was created for the Unsupervised Machine Learning course of the IBM Machine Learning Professional Certificate. The dataset was published on Kaggle by Aman Miglani; the link is given below. In this notebook, clustering methods are applied to group the tweets, and the resulting clusters are compared with the labelled sentiment (negative, positive, or neutral). The project aims to practise k-means clustering and non-negative matrix factorisation (NMF).

Data source: [Corona Tweets Dataset](https://www.kaggle.com/datatattle/covid-19-nlp-text-classification)

The contents include:
> 1. Overview of Dataset
> 2. Data Processing
> 3. Five-class Clustering
>>  K-means clustering<br>
>>  Non-negative Matrix Factorisation<br>
> 4. Two-class Clustering
>>  K-means clustering<br>
>>  Non-negative Matrix Factorisation<br>
> 5. Summary and Future Plan

## 1. Overview of Dataset

The corona tweets dataset has 3798 rows and 6 columns: user name, screen name, location, tweet date, tweet text, and labelled sentiment. Some of the locations are null, but the other columns have no null values. In this notebook, we use only unsupervised machine learning and compare the resulting clusters with the labelled sentiment. We first check whether the tweets can be grouped into the 5 sentiments: extremely negative, negative, neutral, positive, and extremely positive. We then reduce the problem to 2 sentiments by dropping the neutral tweets and merging each extreme level into its non-extreme counterpart.

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv("Corona_tweets.csv")
data.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,1,44953,NYC,02-03-2020,TRENDING: New Yorkers encounter empty supermar...,Extremely Negative
1,2,44954,"Seattle, WA",02-03-2020,When I couldn't find hand sanitizer at Fred Me...,Positive
2,3,44955,,02-03-2020,Find out how you can protect yourself and love...,Extremely Positive
3,4,44956,Chicagoland,02-03-2020,#Panic buying hits #NewYork City as anxious sh...,Negative
4,5,44957,"Melbourne, Victoria",03-03-2020,#toiletpaper #dunnypaper #coronavirus #coronav...,Neutral


In [3]:
tweets = data[["OriginalTweet", "Sentiment"]].copy()
print("Number of rows in the data:", data.shape[0])
print("Number of columns in the data:", data.shape[1])
tweets.head()

Number of rows in the data: 3798
Number of columns in the data: 6


Unnamed: 0,OriginalTweet,Sentiment
0,TRENDING: New Yorkers encounter empty supermar...,Extremely Negative
1,When I couldn't find hand sanitizer at Fred Me...,Positive
2,Find out how you can protect yourself and love...,Extremely Positive
3,#Panic buying hits #NewYork City as anxious sh...,Negative
4,#toiletpaper #dunnypaper #coronavirus #coronav...,Neutral


### Notes:

Note that only letters and whitespace are kept in each tweet. Digits and symbols are removed because they add noise. Website links are not removed as a whole, so their letters remain in the text. In other words, the vocabulary is not restricted to dictionary words and also includes abbreviations and other shortened forms.
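As a small illustration of this rule, the sketch below applies the same character filter to a made-up string with Python's `re` module. The sample text and the helper name `keep_letters` are assumptions for demonstration only; they are not part of the dataset or of the notebook's own code.

```python
import re


def keep_letters(text: str) -> str:
    """Drop every character that is not an ASCII letter or whitespace."""
    return re.sub(r"[^a-zA-Z\s]+", "", text)


# Hypothetical tweet-like string, not taken from the dataset
sample = "#Panic buying (2020): see https://t.co/abc123 now!"
cleaned = keep_letters(sample)

# The link is not dropped as a unit: its letters survive as one token
print(cleaned.split())
```

The link collapses into a single letter-only token, which is why the vocabulary ends up containing many strings that are not dictionary words.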

In [4]:
# Keep only letters and whitespace; this pattern also strips "(" and ")".
# regex=True is needed because newer pandas versions default to regex=False.
tweets["OriginalTweet"] = tweets["OriginalTweet"].str.replace(r"[^a-zA-Z\s]+", "", regex=True)



In [5]:
tweets.head()

Unnamed: 0,OriginalTweet,Sentiment
0,TRENDING New Yorkers encounter empty supermark...,Extremely Negative
1,When I couldnt find hand sanitizer at Fred Mey...,Positive
2,Find out how you can protect yourself and love...,Extremely Positive
3,Panic buying hits NewYork City as anxious shop...,Negative
4,toiletpaper dunnypaper coronavirus coronavirus...,Neutral


## 2. Data Processing

In this section, we transform the tweets into a large matrix in which each column is a word. The process is as follows:

> 1. Put all words in a list called `all_words`
> 2. Create a list for each tweet that counts how often each word in `all_words` appears in the tweet.
> 3. Create a 2-D list that contains the lists of all tweets

The 2-D list is then used as the input for k-means clustering and non-negative matrix factorisation.

In [6]:
all_words = []

for i in range(len(tweets)):
    all_words += tweets["OriginalTweet"][i].split()

In [7]:
all_words = list(set(all_words))
all_words.sort()

In [8]:
len(all_words)

16568

In [9]:
text_data = []

for i in range(len(tweets)):
    temp = []
    for word in all_words:
        # Note: str.count counts substring matches, so "A" is also counted inside "ABC"
        temp.append(tweets["OriginalTweet"][i].count(word))
    text_data.append(temp)
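One detail worth checking in this step is that `str.count` counts substring matches, so a short word such as `A` is also counted inside longer words such as `ABC`. The sketch below, on two made-up tweets, contrasts that with counting whole whitespace-separated tokens using `collections.Counter`. The toy tweets and the helper `bow_row` are hypothetical, and token counting is shown as an alternative rule, not as the one that produced the results in this notebook.

```python
from collections import Counter

# Two made-up, already-cleaned tweets (not from the dataset)
docs = ["A panic at ABC", "stock up and stay calm"]

# Vocabulary built the same way as all_words: unique tokens, sorted
vocab = sorted({word for doc in docs for word in doc.split()})


def bow_row(doc, vocab):
    """Count whole tokens of `doc` for every word in `vocab`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


token_matrix = [bow_row(doc, vocab) for doc in docs]
substring_matrix = [[doc.count(word) for word in vocab] for doc in docs]

col_a = vocab.index("A")
print("token count of 'A':    ", token_matrix[0][col_a])
print("substring count of 'A':", substring_matrix[0][col_a])
```

With token counting, each column measures how often that exact word occurs, which keeps short words from dominating the matrix.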

### Illustrations of new table

The table was also saved to a CSV file for future use; the code for that step has been removed from this notebook.

In [10]:
table = pd.DataFrame(text_data, columns = all_words)
table.head()

Unnamed: 0,A,AAPL,ABANDONING,ABC,ABOUT,ABPNews,ABSCBNNews,ABTAs,AC,ACROSS,...,zero,ziploc,zirnelle,zombie,zombies,zoo,zoomus,zsobovny,zstupce,zypisfy
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 3. Five-class Clustering

Each model in this section uses 5 clusters to see whether unsupervised learning can recover the sentiments of the tweets.


### 3.1 K-means Clustering

After fitting, we tabulate the number of tweets of each sentiment in each cluster.

In [12]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=5, random_state=0)
km = km.fit(table)

tweets['kmeans_5'] = km.predict(table)

(tweets[['Sentiment','kmeans_5']]
 .groupby(['kmeans_5','Sentiment'])
 .size()
 .to_frame()
 .rename(columns={0:'number'}))

kmeans_5,Sentiment,number
0,Extremely Negative,149
0,Extremely Positive,162
0,Negative,250
0,Neutral,99
0,Positive,237
1,Extremely Negative,162
1,Extremely Positive,194
1,Negative,220
1,Neutral,49
1,Positive,211
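A contingency table can make this kind of comparison easier to read. The following is a minimal sketch with `pandas.crosstab` on made-up labels; the `toy` frame and its values are hypothetical and are not results from this dataset. Passing `normalize="index"` converts each cluster's row into shares, so clusters of different sizes can be compared directly.

```python
import pandas as pd

# Hypothetical cluster assignments and sentiments, for illustration only
toy = pd.DataFrame({
    "Sentiment": ["Negative", "Negative", "Positive", "Positive", "Neutral", "Positive"],
    "cluster":   [0, 1, 0, 1, 0, 0],
})

# Rows are clusters, columns are sentiments
counts = pd.crosstab(toy["cluster"], toy["Sentiment"])

# Same table, but each row is divided by its cluster size
shares = pd.crosstab(toy["cluster"], toy["Sentiment"], normalize="index")

print(counts)
print(shares.round(2))
```

If a cluster captured one sentiment, its row of shares would be dominated by a single column.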


### 3.2 Non-negative Matrix Factorisation

After fitting, we tabulate the number of tweets of each sentiment in each cluster.

In [13]:
from sklearn.decomposition import NMF

model = NMF(n_components=5, init='random', random_state=818)
Sentiment = model.fit_transform(text_data)

tweets["NMF_5"] = Sentiment.argmax(axis=1)

(tweets[['Sentiment','NMF_5']]
 .groupby(['NMF_5','Sentiment'])
 .size()
 .to_frame()
 .rename(columns={0:'number'}))



NMF_5,Sentiment,number
0,Extremely Negative,22
0,Extremely Positive,24
0,Negative,27
0,Neutral,18
0,Positive,32
1,Extremely Negative,150
1,Extremely Positive,131
1,Negative,232
1,Neutral,96
1,Positive,212
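To make the `argmax` step concrete: NMF approximates the non-negative count matrix X by a product W·H, where row i of W holds the weight of tweet i on each component, and the component with the largest weight is taken as that tweet's cluster. The sketch below uses a tiny made-up matrix with two obvious word groups. The data, the `init="nndsvda"` setting, and `max_iter=500` are assumptions chosen for this toy example; they are not the settings used for the results above.

```python
import numpy as np
from sklearn.decomposition import NMF

# Made-up counts: rows are documents, columns are words.
# Documents 0-1 use only the first two words, documents 2-3 only the last two.
X = np.array([
    [2, 1, 0, 0],
    [3, 1, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 3],
], dtype=float)

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)   # document-by-component weights
H = model.components_        # component-by-word weights

# Hard cluster label = component with the largest weight for each document
labels = W.argmax(axis=1)
print(labels)
```

Which component receives index 0 or 1 is arbitrary, so only the grouping of documents is meaningful, not the label values themselves.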


In [14]:
tweets.head()

Unnamed: 0,OriginalTweet,Sentiment,kmeans_5,NMF_5
0,TRENDING New Yorkers encounter empty supermark...,Extremely Negative,0,3
1,When I couldnt find hand sanitizer at Fred Mey...,Positive,3,3
2,Find out how you can protect yourself and love...,Extremely Positive,2,3
3,Panic buying hits NewYork City as anxious shop...,Negative,4,3
4,toiletpaper dunnypaper coronavirus coronavirus...,Neutral,0,3


## 4. Two-class Clustering

Each model in this section uses 2 clusters to see whether unsupervised learning can recover the sentiments of the tweets. First, we merge the extreme levels into Negative and Positive and remove the neutral tweets.

In [15]:
tweets["Sentiment"] = tweets["Sentiment"].replace("Extremely Negative", "Negative")
tweets["Sentiment"] = tweets["Sentiment"].replace("Extremely Positive", "Positive")
tweets = tweets[~tweets["Sentiment"].str.contains("Neutral")].reset_index(drop=True)

In [16]:
tweets.head()

Unnamed: 0,OriginalTweet,Sentiment,kmeans_5,NMF_5
0,TRENDING New Yorkers encounter empty supermark...,Negative,0,3
1,When I couldnt find hand sanitizer at Fred Mey...,Positive,3,3
2,Find out how you can protect yourself and love...,Positive,2,3
3,Panic buying hits NewYork City as anxious shop...,Negative,4,3
4,Voting in the age of coronavirus hand sanitiz...,Positive,2,3


In [17]:
all_words = []

for i in range(len(tweets)):
    all_words += tweets["OriginalTweet"][i].split()

all_words = list(set(all_words))
all_words.sort()
len(all_words)

14935

In [18]:
text_data = []

for i in range(len(tweets)):
    temp = []
    for word in all_words:
        # Note: str.count counts substring matches, so "A" is also counted inside "ABC"
        temp.append(tweets["OriginalTweet"][i].count(word))
    text_data.append(temp)

In [19]:
table = pd.DataFrame(text_data, columns = all_words)
table.head()

Unnamed: 0,A,AAPL,ABANDONING,ABC,ABOUT,ABTAs,AC,ACROSS,ACS,ACT,...,yvonnetn,yzf,zen,zero,ziploc,zirnelle,zombie,zombies,zoo,zoomus
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We are now ready for unsupervised learning.

### 4.1 K-means Clustering

After fitting, we tabulate the number of tweets of each sentiment in each cluster.

In [20]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2, random_state=0)
km = km.fit(table)

tweets['kmeans_2'] = km.predict(table)

(tweets[['Sentiment','kmeans_2']]
 .groupby(['kmeans_2','Sentiment'])
 .size()
 .to_frame()
 .rename(columns={0:'number'}))

kmeans_2,Sentiment,number
0,Negative,1002
0,Positive,1007
1,Negative,631
1,Positive,539


### 4.2 Non-negative Matrix Factorisation

After fitting, we tabulate the number of tweets of each sentiment in each cluster.

In [21]:
from sklearn.decomposition import NMF

model = NMF(n_components=2, init='random', random_state=818)
Sentiment = model.fit_transform(text_data)
tweets["NMF_2"] = Sentiment.argmax(axis=1)

(tweets[['Sentiment','NMF_2']]
 .groupby(['NMF_2','Sentiment'])
 .size()
 .to_frame()
 .rename(columns={0:'number'}))



NMF_2,Sentiment,number
0,Negative,622
0,Positive,616
1,Negative,1011
1,Positive,930
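One way to put a number on the two-cluster results is cluster purity: assign each cluster its majority sentiment and compute the share of tweets that match. The sketch below uses the counts reported in the two-cluster k-means and NMF tables of this section. The helper names `purity` and `majority_baseline` are introduced here for illustration, and purity is only one of several possible agreement measures.

```python
# Counts copied from the two-cluster tables: {cluster: {sentiment: count}}
kmeans_2 = {0: {"Negative": 1002, "Positive": 1007},
            1: {"Negative": 631, "Positive": 539}}
nmf_2 = {0: {"Negative": 622, "Positive": 616},
         1: {"Negative": 1011, "Positive": 930}}


def purity(table):
    """Share of tweets that carry the majority sentiment of their cluster."""
    total = sum(sum(row.values()) for row in table.values())
    majority = sum(max(row.values()) for row in table.values())
    return majority / total


def majority_baseline(table):
    """Purity obtained by putting every tweet into a single cluster."""
    totals = {}
    for row in table.values():
        for label, n in row.items():
            totals[label] = totals.get(label, 0) + n
    return max(totals.values()) / sum(totals.values())


print("k-means purity:", round(purity(kmeans_2), 3))
print("NMF purity:    ", round(purity(nmf_2), 3))
print("baseline:      ", round(majority_baseline(kmeans_2), 3))
```

Both purities stay close to the single-cluster baseline, which is what one would expect if the clusters carry almost no information about sentiment.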


## 5. Summary and Future Plan

From the five-class and two-class experiments, we can conclude the following points.

> <li>Unfortunately, the unsupervised methods do not group the tweets by sentiment.</li>
> <li>The 5 clusters do not correspond to the 5 sentiments, and the 2 clusters do not correspond to the 2 sentiments.</li>
> <li>Interestingly, each cluster contains a similar number of tweets from each sentiment. This could imply that some negative and positive tweets have something in common.</li>
> <li>Sentiment may not be a feature that clustering on word counts picks up readily.</li>

Future work and suggestions:

> <li>The clusters can be explored in more depth. For example, the most frequent words in each cluster might reveal what the clusters represent.</li>
> <li>Sentiment should work better with supervised learning, because sentiments have to be labelled and taught to a model. This leads to natural language processing classification.</li>
> <li>Deep learning, which is covered in the next course, can reuse the prepared data.</li>
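The first suggestion could start from the k-means centroids: each centroid is the average word-count vector of its cluster, so its largest entries point to the words that characterise the cluster. The sketch below shows the idea on a made-up vocabulary and count matrix; the data, the helper `top_words`, and the `n_init=10` setting are assumptions for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical vocabulary and word counts (rows are tweets, columns are words)
vocab = ["empty", "shelves", "panic", "thank", "workers", "safe"]
X = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 1, 1, 2],
], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)


def top_words(centroid, vocab, k=3):
    """Return the k words with the largest centroid weight."""
    order = np.argsort(centroid)[::-1][:k]
    return [vocab[i] for i in order]


top = [top_words(centroid, vocab) for centroid in km.cluster_centers_]
for cluster_id, words in enumerate(top):
    print(cluster_id, words)
```

On the real data, the same loop over `km.cluster_centers_` with `all_words` as the vocabulary would list candidate topic words for each cluster.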