# Corona Tweet Sentiment Analysis

The goal of this project is to classify tweets about COVID-19 as expressing either positive or negative sentiment. It also serves to make me more familiar with state-of-the-art NLP models.

In [47]:
import numpy as np 
import pandas as pd 
pd.set_option('display.max_colwidth', 500)

from sklearn.model_selection import train_test_split

import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn

from transformers import XLMRobertaConfig, XLMRobertaModel, XLMRobertaTokenizer, XLMRobertaForSequenceClassification

The data for this project comes from Kaggle ([COVID-19 NLP Text Classification](https://www.kaggle.com/datatattle/covid-19-nlp-text-classification)) and contains approx. 45,000 tweets concerning COVID-19. 

In [3]:
train = pd.read_csv('../input/covid-19-nlp-text-classification/Corona_NLP_train.csv', encoding='latin1')
test = pd.read_csv('../input/covid-19-nlp-text-classification/Corona_NLP_test.csv', encoding='latin1')

In [5]:
train.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order,Positive
2,3801,48753,Vagabonds,16-03-2020,"Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P",Positive
3,3802,48754,,16-03-2020,"My food stock is not the only one which is empty...\r\r\n\r\r\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \r\r\nStay calm, stay safe.\r\r\n\r\r\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j",Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COVID19 outbreak.\r\r\n\r\r\nNot because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage...\r\r\n\r\r\n#CoronavirusFrance #restezchezvous #StayAtHome #confinement https://t.co/usmuaLq72n",Extremely Negative


As we can see, the data includes the Twitter username, the user's location, the date the tweet was posted, the text of the tweet, and the corresponding label (extremely negative, negative, neutral, positive, or extremely positive).

In [6]:
train.Sentiment.value_counts()

Positive              11422
Negative               9917
Neutral                7713
Extremely Positive     6624
Extremely Negative     5481
Name: Sentiment, dtype: int64

### Data Preprocessing

Since I do not want the project to be too complex, I work only with the tweet text and its label and omit the rest of the columns.

In [7]:
train = train[['OriginalTweet', 'Sentiment']]

In [8]:
train.head()

Unnamed: 0,OriginalTweet,Sentiment
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8,Neutral
1,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order,Positive
2,"Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P",Positive
3,"My food stock is not the only one which is empty...\r\r\n\r\r\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \r\r\nStay calm, stay safe.\r\r\n\r\r\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j",Positive
4,"Me, ready to go at supermarket during the #COVID19 outbreak.\r\r\n\r\r\nNot because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage...\r\r\n\r\r\n#CoronavirusFrance #restezchezvous #StayAtHome #confinement https://t.co/usmuaLq72n",Extremely Negative


In [9]:
train.shape

(41157, 2)

In [10]:
test = test[['OriginalTweet', 'Sentiment']]

In [11]:
test.head()

Unnamed: 0,OriginalTweet,Sentiment
0,"TRENDING: New Yorkers encounter empty supermarket shelves (pictured, Wegmans in Brooklyn), sold-out online grocers (FoodKick, MaxDelivery) as #coronavirus-fearing shoppers stock up https://t.co/Gr76pcrLWh https://t.co/ivMKMsqdT1",Extremely Negative
1,"When I couldn't find hand sanitizer at Fred Meyer, I turned to #Amazon. But $114.97 for a 2 pack of Purell??!!Check out how #coronavirus concerns are driving up prices. https://t.co/ygbipBflMY",Positive
2,Find out how you can protect yourself and loved ones from #coronavirus. ?,Extremely Positive
3,#Panic buying hits #NewYork City as anxious shoppers stock up on food&amp;medical supplies after #healthcare worker in her 30s becomes #BigApple 1st confirmed #coronavirus patient OR a #Bloomberg staged event?\r\r\n\r\r\nhttps://t.co/IASiReGPC4\r\r\n\r\r\n#QAnon #QAnon2018 #QAnon2020 \r\r\n#Election2020 #CDC https://t.co/29isZOewxu,Negative
4,#toiletpaper #dunnypaper #coronavirus #coronavirusaustralia #CoronaVirusUpdate #Covid_19 #9News #Corvid19 #7NewsMelb #dunnypapergate #Costco One week everyone buying baby milk powder the next everyone buying up toilet paper. https://t.co/ScZryVvsIh,Neutral


In [12]:
test.shape

(3798, 2)

To further simplify the task at hand, I turn the problem from a multiclass classification problem into a binary one. To do so, I drop the tweets labeled as neutral, merge 'Positive' with 'Extremely Positive', and merge 'Negative' with 'Extremely Negative'.

I acknowledge that dropping all neutral tweets would be problematic if the model were actually deployed: it would certainly encounter neutral tweets but be forced to classify them as either positive or negative. However, since the main purpose of this project is to become more familiar with different NLP models, this simplification is acceptable to me.

In [13]:
train = train[train.Sentiment != 'Neutral']
test = test[test.Sentiment != 'Neutral']

For the model to be able to process the labels, I turn them into numerical labels (0 for negative, 1 for positive).

In [14]:
label2idx = {'Extremely Negative': 0, 
            'Negative': 0,
            'Positive': 1,
            'Extremely Positive': 1}

train['Sentiment'] = train['Sentiment'].replace(label2idx)
test['Sentiment'] = test['Sentiment'].replace(label2idx)

In [15]:
train.head()

Unnamed: 0,OriginalTweet,Sentiment
1,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order,1
2,"Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P",1
3,"My food stock is not the only one which is empty...\r\r\n\r\r\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \r\r\nStay calm, stay safe.\r\r\n\r\r\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j",1
4,"Me, ready to go at supermarket during the #COVID19 outbreak.\r\r\n\r\r\nNot because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage...\r\r\n\r\r\n#CoronavirusFrance #restezchezvous #StayAtHome #confinement https://t.co/usmuaLq72n",0
5,"As news of the region's first confirmed COVID-19 case came out of Sullivan County last week, people flocked to area stores to purchase cleaning supplies, hand sanitizer, food, toilet paper and other goods, @Tim_Dodson reports https://t.co/cfXch7a2lU",1


In [18]:
train.Sentiment.value_counts()

1    18046
0    15398
Name: Sentiment, dtype: int64

In [17]:
test.Sentiment.value_counts()

0    1633
1    1546
Name: Sentiment, dtype: int64
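
The label handling in this section (dropping neutral tweets, then merging the four remaining labels into 0 and 1) can also be sketched without pandas. The helper `binarize_labels` and the toy rows below are hypothetical and only serve as an illustration; they are not part of the notebook's pipeline.

```python
# Illustrative sketch only: `binarize_labels` and the toy rows are hypothetical.
LABEL2IDX = {
    "Extremely Negative": 0,
    "Negative": 0,
    "Positive": 1,
    "Extremely Positive": 1,
}


def binarize_labels(rows, label2idx=LABEL2IDX, drop="Neutral"):
    """Drop rows labeled `drop` and map the remaining labels to 0/1.

    `rows` is an iterable of (text, label) pairs. An unexpected label
    raises a KeyError instead of passing through silently.
    """
    return [(text, label2idx[label]) for text, label in rows if label != drop]


rows = [
    ("stay calm, stay safe", "Positive"),
    ("shelves are empty", "Extremely Negative"),
    ("link only", "Neutral"),
    ("great news today", "Extremely Positive"),
]
binary_rows = binarize_labels(rows)
print(binary_rows)
```

One design point worth noting: a plain dictionary lookup fails loudly on a label that is missing from the mapping, whereas `Series.replace` leaves unmapped values unchanged, so a typo in a label name could go unnoticed there.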

### Tokenization and Building the Dataset

In order to feed the data to my model, I build a simple dataset class that also tokenizes the text data. For this, I import the XLM-RoBERTa tokenizer from the transformers library. This tokenizer is specifically built for the XLM-RoBERTa model.

The pre-trained model I use for this project is XLM-RoBERTa, which was built by Facebook and is a multilingual model building on XLM-100 and RoBERTa. I chose this model in case the data also includes non-English tweets.

In order to choose a reasonable value for `max_length` (which defines the maximum number of tokens in a sequence), I first tokenize all tweets with `max_length` set to 1000 so as to guarantee that no sequence is truncated. I then look at the distribution of tweet lengths and take the length corresponding to the 99th percentile (which turns out to be 105 tokens).

In [19]:
class CoronaTweetDataset_nontokenized(Dataset):
    
    def __init__(self, text, labels):
        self.text = text.values
        self.labels = labels.values
        
    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, idx):
        return (self.text[idx], self.labels[idx])

In [20]:
train_ds = CoronaTweetDataset_nontokenized(train['OriginalTweet'], train['Sentiment'])

In [22]:
backbone = 'xlm-roberta-base'
tokenizer = XLMRobertaTokenizer.from_pretrained(backbone)

In [25]:
len_tweet = []
for tweet in range(0, len(train_ds)):
    tokenized_tweet = tokenizer.encode_plus(train_ds[tweet][0], max_length=1000)
    len_tweet.append(len(tokenized_tweet['input_ids']))

In [26]:
np.percentile(len_tweet, [75, 90, 99]), max(len_tweet)

(array([ 75.,  86., 105.]), 143)
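
To sanity-check a cutoff chosen this way, it can help to look at the share of sequences that fit under it without truncation. The following is a minimal sketch; the function `coverage` and the token counts in `toy_lengths` are made up for illustration and are not the notebook's data.

```python
def coverage(lengths, max_length):
    """Return the fraction of sequences with at most `max_length` tokens."""
    return sum(n <= max_length for n in lengths) / len(lengths)


# Made-up token counts, for illustration only (not taken from the dataset).
toy_lengths = [12, 30, 44, 51, 60, 75, 86, 90, 105, 143]

for limit in (75, 105, 143):
    print(limit, coverage(toy_lengths, limit))
```

Applied to the real `len_tweet` list, the same function would show how many tweets a given `max_length` leaves untruncated.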

In [28]:
tokenizer.encode_plus(train_ds[10][0], max_length=105)

{'input_ids': [0, 1215, 12, 87, 2301, 25, 18, 3871, 47, 31837, 1257, 98, 15381, 4, 87, 25, 1181, 1660, 765, 15075, 75060, 89778, 87, 3871, 468, 50886, 2477, 79600, 223, 15075, 12, 3975, 696, 18, 5, 587, 23538, 1723, 22489, 118058, 170, 19279, 441, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

I now call the tokenizer with `max_length=105` so that it truncates longer tweets to that length and pads shorter ones up to it.

In [37]:
tokenizer.encode_plus(train_ds[10][0], 
                      max_length=105, 
                      truncation=True, # truncates to max_length
                      padding='max_length') # pads to max_length

# in contrast to `tokenizer.encode`, `tokenizer.encode_plus` also returns the attention mask

{'input_ids': [0, 1215, 12, 87, 2301, 25, 18, 3871, 47, 31837, 1257, 98, 15381, 4, 87, 25, 1181, 1660, 765, 15075, 75060, 89778, 87, 3871, 468, 50886, 2477, 79600, 223, 15075, 12, 3975, 696, 18, 5, 587, 23538, 1723, 22489, 118058, 170, 19279, 441, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
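
To make the relationship between `input_ids` and `attention_mask` in the output above explicit, here is a minimal pure-Python sketch of padding and truncation. The function `pad_or_truncate` is hypothetical and only mimics the visible behaviour; it assumes the padding ID 1 and the closing-token ID 2 seen in the output, and that the input already starts and ends with the special tokens.

```python
def pad_or_truncate(input_ids, max_length, pad_id=1, eos_id=2):
    """Return (input_ids, attention_mask), both of length `max_length`.

    Assumes `input_ids` already ends with the closing token `eos_id`.
    """
    if len(input_ids) > max_length:
        # Cut the sequence but keep a closing token in the last position.
        input_ids = input_ids[: max_length - 1] + [eos_id]
    n_real = len(input_ids)
    n_pad = max_length - n_real
    # Real tokens get mask value 1, padding positions get 0.
    return input_ids + [pad_id] * n_pad, [1] * n_real + [0] * n_pad


ids = [0, 1215, 12, 87, 2]
padded_ids, padded_mask = pad_or_truncate(ids, max_length=8)
short_ids, short_mask = pad_or_truncate(ids, max_length=4)
print(padded_ids, padded_mask)
print(short_ids, short_mask)
```

The attention mask is what lets the model ignore the padding positions, which is why both lists always have the same length.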

Seeing that the tokenizer call above works, I build an updated Dataset class that includes tokenization.

In [39]:
class CoronaTweetDataset(Dataset):
    
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts.values
        self.labels = labels.values
        self.tokenizer = tokenizer
        self.max_length = max_length
        
    def __len__(self):
        return len(self.texts)
    
    def get_tokens(self, text):
        encoded = self.tokenizer.encode_plus(
                    text,
                    add_special_tokens=True, 
                    truncation=True,
                    max_length=self.max_length, 
                    padding='max_length')  # pads to max_length
        
        return encoded['input_ids'], encoded['attention_mask']
    
    def __getitem__(self, idx):
        tokens, attention_mask = self.get_tokens(str(self.texts[idx]))   
        tokens, attention_mask = torch.tensor(tokens), torch.tensor(attention_mask)
        
        return self.labels[idx], tokens, attention_mask

### Splitting into Training and Validation Set

I use the scikit-learn library to split the `train` dataframe into a training and a validation set. To make sure that the distribution of labels is the same in both sets, I pass `stratify=train['Sentiment']`.

In [40]:
X_train, X_val, y_train, y_val = train_test_split(train['OriginalTweet'], train['Sentiment'], test_size=0.1, random_state=42, stratify=train['Sentiment'])

In [41]:
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((30099,), (3345,), (30099,), (3345,))

In [43]:
y_train.value_counts(normalize=True), y_val.value_counts(normalize=True)

(1    0.539586
 0    0.460414
 Name: Sentiment, dtype: float64,
 1    0.539611
 0    0.460389
 Name: Sentiment, dtype: float64)
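
To illustrate what stratification does, here is a minimal pure-Python sketch that splits each class separately so that both parts keep roughly the same label ratio. The helper `stratified_split` and the toy labels are hypothetical; this is not scikit-learn's implementation, only the underlying idea.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.1, seed=42):
    """Return (train_idx, val_idx) with per-class proportions preserved.

    Hypothetical helper for illustration; the notebook itself relies on
    scikit-learn's `train_test_split(..., stratify=...)`.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same share from every class for the validation set.
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels: 54 positive and 46 negative, roughly the notebook's class ratio.
toy_labels = [1] * 54 + [0] * 46
train_idx, val_idx = stratified_split(toy_labels, test_size=0.1)
val_labels = [toy_labels[i] for i in val_idx]
print(len(train_idx), len(val_idx), sum(val_labels))
```

Because each class is sampled on its own, small classes cannot end up over- or under-represented in the validation set by chance, which is the property the `value_counts(normalize=True)` output above confirms for the real split.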

With my training and validation set, I now build two separate tokenized datasets.

In [54]:
train_ds = CoronaTweetDataset(X_train, y_train, tokenizer, max_length=105)
val_ds = CoronaTweetDataset(X_val, y_val, tokenizer, max_length=105)

The data is now ready for modelling.

To be continued...