## Sentiment Analysis with transformers

NLP (Natural Language Processing) has gained momentum since the advent of transformers. Transformers such as BERT, the GPT family and others have been used successfully in NLP for various tasks, including text classification. Text classification covers several types of problems, such as sentiment analysis and author identification. In this notebook we will work through the use of transfer learning for sentiment analysis. In this specific case: can our model predict whether a particular tweet is depression positive or not? If we succeed in this use case, organizations could use such a model to respond quickly to these cases. We will be using data found on Kaggle.

### Load data, do EDA and split dataset

In [1]:
#Import necessary libraries
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import random
import numpy as np

In [2]:
import warnings,transformers,logging,torch
warnings.simplefilter('ignore')
logging.disable(logging.WARNING)

In [3]:
#Load data

df = pd.read_csv("Tweets.csv")
df.head()

Unnamed: 0,textID,text,selected_text,sentiment,label
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,0.0
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,0.0
2,088c60f138,my boss is bullying me...,bullying me,negative,0.0
3,9642c003ef,what interview! leave me alone,leave me alone,negative,0.0
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,0.0


In [4]:
df.shape

(27481, 5)

In [5]:
df['sentiment'].value_counts(normalize=True)

neutral     0.404570
positive    0.312288
negative    0.283141
Name: sentiment, dtype: float64

From the value counts above, we see that a naive model that always predicts the most frequent class (neutral) would have about 40% accuracy. Thus, if we use accuracy as our metric for evaluating the model, we must produce a model with accuracy greater than that baseline. Better still, we could use other metrics such as recall or precision. Before we continue, let us set aside a final test set that we will not touch throughout the whole process.
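
As a quick sanity check, the majority-class baseline can be read straight off the class proportions. The sketch below hard-codes the proportions printed above; the variable names are illustrative and not part of the notebook.

```python
# Class proportions copied from the value_counts(normalize=True) output above.
proportions = {"neutral": 0.404570, "positive": 0.312288, "negative": 0.283141}

# A naive model that always predicts one class is right exactly as often
# as that class occurs, so its accuracy equals the class proportion.
majority_class = max(proportions, key=proportions.get)
baseline_accuracy = proportions[majority_class]

print(majority_class, round(baseline_accuracy, 3))
```

Any trained model should beat this number before its accuracy means anything.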

In [6]:
pd.options.display.max_colwidth = 5000
pd.set_option('display.max_rows', 500)

In [7]:
#Let us see further what makes a depression tweet, before we go ahead with the split

df['selected_text'][df['sentiment']=="negative"][:50]

1                                                     Sooo SAD
2                                                  bullying me
3                                               leave me alone
4                                                Sons of ****,
12                                                 DANGERously
13                                                        lost
15                                       Uh oh, I am sunburned
16                                                      *sigh*
17                                                        sick
18                                                        onna
26                                                  I`m sorry.
27                                                .no internet
29                               Power back up not working too
32          well so much for being unhappy for about 10 minute
36                                                        miss
38                                        soooooo sleee

From the above, common words found in depressive tweets are "depression" and "anxiety"; sometimes words like "meds" or "medication" are also seen. However, since these are common themes associated with depression tweets, let us also look at non-depression tweets that contain the same themes.

In [8]:
#create copy of df
df1 = df.copy()

In [9]:
mask = (df['text'].str.contains("[Dd]epression")) & (df['sentiment'] == "negative")
df[mask]

Unnamed: 0,textID,text,selected_text,sentiment,label
19877,42bba2feb9,Time for my ritualistic Friday night depression,Time for my ritualistic Friday night depression,negative,
20978,2e4836d951,Having a light depression. Just payed an extra bill from last years taxes... Must find a country with a tax that is lower than 56%,depression.,negative,


In [10]:
mask = (df['text'].str.lower().str.contains(r"depression")) & ((df['sentiment'] == "positive")|(df['sentiment'] == "neutral"))
df[mask]#.head()
df.columns
df.columns = df.columns.str.strip()

In [11]:
non_idx = df[mask].textID
non_idx

19599    841817c75a
Name: textID, dtype: object

From this we see something interesting: a tweet can contain the word "depression" without actually pointing to depression, and such tweets are not tagged negative. Let us clean this up and use this information to derive special tokens that our model can use.

In [12]:
df = df[~df["textID"].isin(non_idx)]
df.shape

(27480, 5)

In [13]:
dep_anx = (df['text'].str.contains("depression")) & (df['text'].str.contains("anxiety"))

In [14]:
df["spectok"] = "[" + dep_anx.astype(str) + "]"
df.head()

Unnamed: 0,textID,text,selected_text,sentiment,label,spectok
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,0.0,[False]
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,0.0,[False]
2,088c60f138,my boss is bullying me...,bullying me,negative,0.0,[False]
3,9642c003ef,what interview! leave me alone,leave me alone,negative,0.0,[False]
4,358bd9e861,"Sons of ****, why couldn`t they put them on the releases we already bought","Sons of ****,",negative,0.0,[False]


So we have added a column called spectok that shows whether a given tweet contains both the words "depression" and "anxiety".
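
To make the flag concrete, here is a minimal plain-Python sketch of the same rule applied to single strings. The helper name `spectok` is hypothetical; it mirrors the case-sensitive "contains both words" check used for the column.

```python
def spectok(text: str) -> str:
    """Return "[True]" if the tweet mentions both words, else "[False]"."""
    # Case-sensitive substring checks, like str.contains with a plain pattern.
    has_both = ("depression" in text) and ("anxiety" in text)
    return "[" + str(has_both) + "]"

print(spectok("my depression and anxiety are back"))               # [True]
print(spectok("Time for my ritualistic Friday night depression"))  # [False]
```

Because both words are required, a tweet mentioning only one of them is still flagged `[False]`.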

#### Split dataframe

In [15]:
#Reset index
df = df.reset_index().drop(columns="index")#.tail()

In [16]:
idx = list(df.index)
random.shuffle(idx)

In [17]:
shuffle_df = df.iloc[idx, :]
shuffle_df.head()

Unnamed: 0,textID,text,selected_text,sentiment,label,spectok
20506,fa991fc78e,"Sorry for the triple twitter post, was having trouble w/Stocktwits account. I try not to clutter up the Twittersphere!","Sorry for the triple twitter post, was having trouble w/Stocktwits account. I try not to clutter up the Twittersphere!",negative,,[False]
12462,0e0fdad9a6,"oo. and studied today outside after having a ben+jerry`s.. wearing a sundress, hopefully didn`t get an awkward tan line.. haha!",hopefully,positive,,[False]
11685,751d70d6be,RIP Omar Edwards - Killed by friendly fire in NYC http://bit.ly/jrM6v,Killed,negative,,[False]
1461,4117fe27e1,I dont think he`s ganna text me.,I dont think he`s ganna text me.,neutral,,[False]
6456,45ba098ecb,"I know! I can barely believe it`s almost over! Thanks for the review, lovely!","Thanks for the review,",positive,,[False]


### Pop up EDA
What if we replace the hashtags with an empty string? That would reduce the number of tweets identified as depression tweets only because a hashtag contains the word "depression". This also means we will have to redo our special token column.
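
A small sketch of hashtag removal on plain strings, using the standard `re` module. The pattern `#\w+\s*` is an assumption about what counts as a hashtag, and `strip_hashtags` is a hypothetical helper.

```python
import re

# Matches a '#' followed by word characters, plus any trailing whitespace.
HASHTAG = re.compile(r"#\w+\s*")

def strip_hashtags(text: str) -> str:
    """Remove hashtags and tidy up leftover whitespace at the ends."""
    return HASHTAG.sub("", text).strip()

print(strip_hashtags("feeling fine today #depression #mood"))

# For comparison, a greedy pattern swallows ordinary words as well:
print(re.sub(r"(#.* )+", "", "a #b c d"))
```

The second call shows why the pattern matters: `#.* ` is greedy, so it deletes everything from the first `#` up to the last space rather than just the hashtag.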

In [18]:
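#remove hashtags only; regex=True is needed because newer pandas treats the pattern literally by default
df['text'] = df['text'].str.replace(r"#\w+\s*", "", regex=True)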
df.head()

Unnamed: 0,textID,text,selected_text,sentiment,label,spectok
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,0.0,[False]
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,0.0,[False]
2,088c60f138,my boss is bullying me...,bullying me,negative,0.0,[False]
3,9642c003ef,what interview! leave me alone,leave me alone,negative,0.0,[False]
4,358bd9e861,"Sons of ****, why couldn`t they put them on the releases we already bought","Sons of ****,",negative,0.0,[False]


In [19]:
mask = (df['text'].str.contains("[Dd]epression")) & (df['sentiment'] == "negative")
df[mask].head()

Unnamed: 0,textID,text,selected_text,sentiment,label,spectok
19876,42bba2feb9,Time for my ritualistic Friday night depression,Time for my ritualistic Friday night depression,negative,,[False]
20977,2e4836d951,Having a light depression. Just payed an extra bill from last years taxes... Must find a country with a tax that is lower than 56%,depression.,negative,,[False]


In [20]:
#Reset index
df = df.reset_index().drop(columns="index")#.tail()

In [21]:
idx = list(df.index)
random.shuffle(idx)

In [22]:
shuffle_df = df.iloc[idx, :]
shuffle_df.head()

Unnamed: 0,textID,text,selected_text,sentiment,label,spectok
18604,7f65be55b7,I`m gonna wear my new purple converse today,I`m gonna wear my new purple converse today,neutral,,[False]
1901,01297393b5,oh gaha no of course i wasn`t offended why would i be? i`d love to play for you some day,d love,positive,,[False]
26878,aaaa3e4e66,Arrr. Exam is on next week.. Im dead.. bwahahaha.. I love you btw!!,Arrr. Exam is on next week.. Im dead.. bwahahaha.. I love you btw!!,neutral,,[False]
2041,914e57e54b,**** it! Must be Morrisons then,**** it!,negative,,[False]
4324,fae97f60d2,"Season 1 of Lie To Me was serious. ****, now here comes the wait.","Season 1 of Lie To Me was serious. ****, now here comes the wait.",neutral,,[False]


ADDENDUM: Redo and add the special token column

In [23]:
#Split dataset into train and test
test_len = int(len(idx) * 0.2)

test_df = shuffle_df[:test_len]
train_df = shuffle_df[test_len:]

In [24]:
print("Addition of the rows of both train and test df gives", train_df.shape[0]+test_df.shape[0], "rows")

Addition of the rows of both train and test df gives 27480 rows


In [25]:
train_df['sentiment'].value_counts(normalize=True)

neutral     0.405022
positive    0.312364
negative    0.282615
Name: sentiment, dtype: float64

In [26]:
test_df['sentiment'].value_counts(normalize=True)

neutral     0.402656
positive    0.312045
negative    0.285298
Name: sentiment, dtype: float64

In [27]:
train_df.head(1)

Unnamed: 0,textID,text,selected_text,sentiment,label,spectok
6815,87fca7f208,apparently you are not getting on anymore... sad,sad,negative,,[False]


### Preprocessing data: Datasets and Tokenization.

Just as fastai uses dataloaders to feed inputs into a model, the Hugging Face ecosystem provides a Dataset class (from the datasets library) that carries the data into the model in the appropriate format, making it possible to build the model and perform the classification.

However, our input is still in the form of English sentences. In NLP, for these sentences to be processed and used as a signal by the model, we need to convert them into numbers through tokenization and numericalization. Since we will be using a pretrained model, we must tokenize our inputs in exactly the way the model's vocabulary expects in order to avoid messing up the process. To do this, we import AutoTokenizer from the transformers library and choose our model.

In [28]:
model_name = "microsoft/deberta-v3-small"

In [29]:
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [30]:
tokz = AutoTokenizer.from_pretrained('microsoft/deberta-v3-small')

After instantiating our tokenizer, it is important that we add the special tokens that we derived in the spectok column.

In [31]:
spec_tok = list(df.spectok.unique())
tokz.add_special_tokens({"additional_special_tokens": spec_tok})

1

In [32]:
#tokz.add_special_tokens({'pad_token': '[PAD]'})

The tokz (tokenizer) object has a vocab attribute and a tokenize method. The vocab attribute is a dictionary that maps each token (a word or subword, as the case may be) to a fixed number, and the tokenize method splits a document (one or more sentences) into individual tokens as they appear in the vocab.
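
The two steps can be illustrated with a toy example. This is only a sketch: the vocabulary below is invented, and the greedy longest-match rule is a simplification rather than the subword algorithm the real tokenizer uses.

```python
# Toy vocabulary: token -> id. These entries are invented for illustration.
vocab = {"▁trunk": 0, "s": 1, "▁wind": 2, "▁the": 3, "[UNK]": 4}

def tokenize(sentence: str) -> list[str]:
    """Greedy longest-match split of each word into tokens from `vocab`."""
    tokens = []
    for word in sentence.lower().split():
        piece = "▁" + word  # "▁" marks the start of a word
        while piece:
            # Find the longest prefix of `piece` that is in the vocabulary.
            for end in range(len(piece), 0, -1):
                if piece[:end] in vocab:
                    tokens.append(piece[:end])
                    piece = piece[end:]
                    break
            else:
                tokens.append("[UNK]")  # no prefix matched: give up on this word
                piece = ""
    return tokens

tokens = tokenize("the wind trunks")       # tokenization
ids = [vocab[t] for t in tokens]           # numericalization
print(tokens)  # ['▁the', '▁wind', '▁trunk', 's']
print(ids)     # [3, 2, 0, 1]
```

Note how "trunks" is split into a word-start piece and a suffix, much like `▁Trunk` and `s` in the real output.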

In [33]:
tokz.tokenize("Merry go round, do the trees move when a bolstrous wind strike their Trunks [False]")

['▁Merry',
 '▁go',
 '▁round',
 ',',
 '▁do',
 '▁the',
 '▁trees',
 '▁move',
 '▁when',
 '▁a',
 '▁bol',
 'st',
 'rous',
 '▁wind',
 '▁strike',
 '▁their',
 '▁Trunk',
 's',
 '[False]']

In [34]:
#let us also rename the input and output columns to easier names that follow transformers conventions
train_df.rename(columns={"text": "inputs", "sentiment": "labels"},
               inplace=True)
test_df.rename(columns={"text": "inputs", "sentiment": "labels"},
               inplace=True)

In [35]:
train_df.head(2)

Unnamed: 0,textID,inputs,selected_text,labels,label,spectok
6815,87fca7f208,apparently you are not getting on anymore... sad,sad,negative,,[False]
19588,6f955b0236,"- Yeah I know they are **** annoying with that... But it,s such good promo... I lost some contacts for business in there","- Yeah I know they are **** annoying with that... But it,s such good promo... I lost some contacts for business in there",positive,,[False]


In [36]:
#Preprocessing, convert labels in train and test to float
train_df["labels"] = train_df.labels.astype(float)
test_df["labels"] = test_df.labels.astype(float)

ValueError: could not convert string to float: 'negative'
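
The `ValueError` above occurs because the `labels` column still holds strings such as `'negative'`, which cannot be cast to float directly. One possible way forward is to map each string to a number first. The mapping below is an assumption (the notebook does not state one): it treats negative tweets as the class of interest.

```python
# Hypothetical mapping: treat "negative" as the class of interest (1.0).
LABEL_MAP = {"negative": 1.0, "neutral": 0.0, "positive": 0.0}

def encode_labels(sentiments):
    """Convert sentiment strings to floats, failing loudly on unknown values."""
    return [LABEL_MAP[s] for s in sentiments]

print(encode_labels(["negative", "neutral", "positive"]))  # [1.0, 0.0, 0.0]
```

With pandas, the same idea can be written as `train_df["labels"].map(LABEL_MAP)`.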

In [38]:
test_df.head(1)

Unnamed: 0,textID,inputs,selected_text,labels,label,spectok
18604,7f65be55b7,I`m gonna wear my new purple converse today,I`m gonna wear my new purple converse today,neutral,,[False]


In [42]:
#convert train_df into a dataset

ds = Dataset.from_pandas(train_df)
ds

Dataset({
    features: ['textID', 'inputs', 'selected_text', 'labels', 'label', 'spectok', '__index_level_0__'],
    num_rows: 21984
})

In [43]:
#Define function to perform tokenization processing

def token(x): return tokz(x["inputs"])

In [44]:
#map ds to function to obtain new ds

ds_tok = ds.map(token, batched=True)
ds_tok

Map:   0%|          | 0/21984 [00:00<?, ? examples/s]

Dataset({
    features: ['textID', 'inputs', 'selected_text', 'labels', 'label', 'spectok', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 21984
})

New columns, including input_ids, have been added to our dataset. The input_ids column contains the tokenized and numericalized array for each tweet.

In [45]:
sample_tok = ds_tok[0]
sample_tok["inputs"], sample_tok["input_ids"]

('apparently you are not getting on anymore... sad',
 [1, 3900, 274, 281, 298, 646, 277, 3731, 260, 260, 260, 3756, 2])

In [46]:
#let us look up the vocab entry for a single word -- love
tokz.vocab["▁love"] #the character before love tells us that it is the beginning of a word

472

### Splitting the dataset into train and validation sets

In [47]:
dds = ds_tok.train_test_split(0.20, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['textID', 'inputs', 'selected_text', 'labels', 'label', 'spectok', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 17587
    })
    test: Dataset({
        features: ['textID', 'inputs', 'selected_text', 'labels', 'label', 'spectok', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4397
    })
})

In [48]:
#treat test set to be as standard dataset

eval_ds = Dataset.from_pandas(test_df.drop(columns="labels")).map(token, batched=True)
eval_ds

Map:   0%|          | 0/5496 [00:00<?, ? examples/s]

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
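
This error generally means the tokenizer received something that is not a string. A likely cause, though this is an assumption since the full traceback is not shown, is a missing value (NaN) in the `inputs` column of the test split. The sketch below shows the idea of filtering such rows on plain Python data; `is_valid_text` is a hypothetical helper.

```python
import math

def is_valid_text(value) -> bool:
    """True only for real strings; NaN (a float) and None are rejected."""
    return isinstance(value, str)

rows = [
    {"inputs": "my boss is bullying me..."},
    {"inputs": math.nan},   # how a missing text value often appears
    {"inputs": None},
]
clean_rows = [r for r in rows if is_valid_text(r["inputs"])]
print(len(clean_rows))  # 1
```

In pandas, calling `test_df.dropna(subset=["inputs"])` before `Dataset.from_pandas` has the same effect for missing values.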

### Loss function and Error Metrics

A loss function is a mathematical function that measures the error of a given model and that the model seeks to minimize during training. Examples of loss functions are cross entropy and mean absolute error. It is often hard to interpret their values and make meaningful assertions from them. An error metric, on the other hand, is used to evaluate the performance of a model for human comprehension. Metrics are often not useful as loss functions because they lack a smooth gradient with respect to changes in the independent variables. Examples of metrics in classification include accuracy, recall and precision. Let us define the following metrics below.
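
The difference can be seen on a single example with true label 1. In the sketch below (function names are illustrative), the cross-entropy loss keeps shrinking as the predicted probability improves, while accuracy stays flat once the prediction crosses the 0.5 threshold.

```python
import math

def cross_entropy(p: float, y: int) -> float:
    """Binary cross-entropy for one example with predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def accuracy(p: float, y: int) -> float:
    """1.0 if thresholding p at 0.5 recovers the label y, else 0.0."""
    return float((p > 0.5) == bool(y))

# True label is 1; the prediction gets steadily more confident.
for p in (0.6, 0.7, 0.9):
    print(p, round(cross_entropy(p, 1), 3), accuracy(p, 1))
```

Accuracy gives the optimizer no signal between 0.6 and 0.9, which is why it works as a metric but not as a loss.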

In [49]:
def sig(x): return 1 / (1 + np.exp (-x))

def acc(x, y):
    x = x > 0.5
    return accuracy_score(x,y)

In [50]:
sig(np.array([0.5, 0.6]))>0.5#.astype(int)
accuracy_score(sig(np.array([0.5, 0.6]))>0.5, [1, 0])

0.5

In [51]:
def precision(x, y):
    x = x > 0.5
    tp = confusion_matrix(x, y)[1][1]
    fp = confusion_matrix(x, y)[1][0]
    denom = tp + fp
    return tp/denom

In [52]:
def recall(x, y):
    x = x > 0.5  #using the threshold at 0.5
    tp = float(confusion_matrix(x, y)[1][1])
    fn = float(confusion_matrix(x, y)[0][1])
    denom = tp + fn
    return tp/denom

In [53]:
def err(eval_preds): return {"accuracy": acc(*eval_preds),
                            "recall": recall(*eval_preds),
                            "precision": precision(*eval_preds)}

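To check the precision and recall definitions by hand, here is a minimal sketch in plain Python on a tiny made-up example. The helper `confusion_counts` and the sample data are hypothetical.

```python
def confusion_counts(preds, labels):
    """Count true positives, false positives and false negatives."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return tp, fp, fn

preds  = [1, 1, 0, 0, 1]   # thresholded model outputs
labels = [1, 0, 1, 0, 1]   # ground truth

tp, fp, fn = confusion_counts(preds, labels)
precision_value = tp / (tp + fp)   # of the predicted positives, how many are right
recall_value = tp / (tp + fn)      # of the actual positives, how many were found
print(tp, fp, fn)  # 2 1 1
```

Here both precision and recall come out to 2/3, since there is one false positive and one false negative.
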
### Training our transformer

In [54]:
from transformers import Trainer, TrainingArguments

In [55]:
#instantiating hyperparameters
bs = 64
lr = 3e-6
epochs = 8

In [57]:
#Boilerplate TrainingArguments
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine',
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

In [58]:
#Instantiate model and Trainer
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=err)

Downloading pytorch_model.bin:   0%|          | 0.00/286M [00:00<?, ?B/s]

KeyboardInterrupt: 

In [None]:
trainer.train()

A model of this quality can be evaluated on our test set and even deployed. To obtain this accuracy, we applied the 0.5 threshold directly to the model output without using a sigmoid, and we also tripled the learning rate.
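
Thresholding the raw output at 0.5 and thresholding `sig(output - 0.5)` at 0.5 give the same decisions, because the sigmoid is monotonic and equals 0.5 at zero. A quick check on a few made-up raw outputs:

```python
import math

def sig(x: float) -> float:
    """Logistic sigmoid."""
    return 1 / (1 + math.exp(-x))

raw_outputs = [-1.0, 0.2, 0.5, 0.8, 2.0]   # illustrative values only

via_sigmoid = [sig(x - 0.5) > 0.5 for x in raw_outputs]
direct = [x > 0.5 for x in raw_outputs]
print(via_sigmoid == direct)  # True
```

So the shifted sigmoid only rescales the scores; it does not change which tweets are labelled positive.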

In [None]:
preds = trainer.predict(eval_ds).predictions
preds = sig(preds - 0.5)
preds

In [None]:
preds

In [None]:
eval_ds

In [None]:
#eval_ds was built from test_df, so the rows are already in the same order
ans_df = test_df[["inputs", "labels"]].copy()
ans_df.head()

In [None]:
ans_df["predictions"] = preds
ans_df.head()

In [None]:
preds > 0.5

In [None]:
ans_df["pred_label"] = (preds > 0.5).astype(int)
ans_df.head()

In [None]:
#Check accuracy score
accuracy_score(ans_df["labels"], ans_df["pred_label"])

In [None]:
print(classification_report(ans_df["labels"], ans_df["pred_label"]))

In [None]:
ans_df[ans_df["labels"] == 1].head()

In order to see if our model is picking up on the right things, we should see places where our model predicted 0 for 1 and vice versa.

In [None]:
#model predicts negative while actual label says positive
mask = (ans_df["labels"] == 1) & (ans_df["pred_label"] == 0)
ans_df[mask]

In [None]:
#model predicts positive while actual label is negative
mask = (ans_df["labels"] == 0) & (ans_df["pred_label"] == 1)
ans_df[mask]

From the above performance, we see that our model is doing a fine job. Where it makes a false negative, it is often not clear even to us that the tweets are depression tweets, and many of them are not clear at all. Where the model flags depression tweets against actual labels that say non-depression, we also see that many of these tweets are worth raising an alarm over, as they express pain or the possibility of depression, which is cool as our model is working "overtime" (winks).

Since we also know that "depression" is a very important word in determining whether a tweet is depressive, let us check whether there are places where our model predicts a positive label and the input does not contain the word depression.

In [None]:
mask = (~ans_df["inputs"].str.contains(r"[Dd]epression")) & (ans_df["pred_label"] == 1)
ans_df[mask]#.shape

In [None]:
dir(trainer)

In [None]:
trainer.save_model("model")

### CAVEAT

A little caveat to what we just did: when you go through the EDA process properly, you realize that most of the positive depression tweets could have been identified with a simple Python function that returns True if a tweet contains the word "depression" or "anxiety" and False otherwise. If we were building a model of real value and importance, it would be relevant to obtain more data points, especially depression-positive tweets that do not mention depression or anxiety, so that the model can pick up on more words.
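
Such a keyword baseline could look like the sketch below. The function name `keyword_flag` is hypothetical, and matching case-insensitively is a design choice made here rather than something the notebook specifies.

```python
def keyword_flag(tweet: str) -> bool:
    """True if the tweet mentions "depression" or "anxiety" (case-insensitive)."""
    text = tweet.lower()
    return "depression" in text or "anxiety" in text

print(keyword_flag("Time for my ritualistic Friday night depression"))  # True
print(keyword_flag("my boss is bullying me..."))                        # False
```

Comparing the model against this baseline on the test set would show how much the transformer adds beyond simple keyword matching.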