### Code for Kaggle competition: text emotion classification
* This notebook contains both the training and the testing stages.
* I use a powerful pretrained language model to extract meaningful sentence features, then fine-tune the pretrained LM together with a classifier on the classification task.
* To run this notebook, install the transformers package provided by Hugging Face: ```pip install transformers```.
* The code is based on PyTorch.

In [11]:
import pandas as pd

##### Read the prepared training and testing data in CSV format.

In [12]:
train_data = pd.read_csv("train_data.csv")
test_data = pd.read_csv("test_data.csv")

In [13]:
train_data

Unnamed: 0.1,Unnamed: 0,tweet_id,identification,emotion,text
0,0,0x29e452,train,joy,Huge Respect🖒 @JohnnyVegasReal talking about l...
1,1,0x2b3819,train,joy,Yoooo we hit all our monthly goals with the ne...
2,2,0x2a2acc,train,trust,@KIDSNTS @PICU_BCH @uhbcomms @BWCHBoss Well do...
3,3,0x2a8830,train,joy,Come join @ambushman27 on #PUBG while he striv...
4,4,0x20b21d,train,anticipation,@fanshixieen2014 Blessings!My #strength little...
...,...,...,...,...,...
1455558,1455558,0x227e25,train,disgust,@BBCBreaking Such an inspirational talented pe...
1455559,1455559,0x293813,train,sadness,And still #libtards won't get off the guy's ba...
1455560,1455560,0x1e1a7e,train,joy,When you sow #seeds of service or hospitality ...
1455561,1455561,0x2156a5,train,trust,@lorettalrose Will you be displaying some <LH>...


##### Map the string emotion labels to integer labels.

In [14]:
Emotion_label = {
        'sadness': 0,
        'disgust': 1,
        'anticipation': 2,
        'joy': 3,
        'trust': 4,
        'anger': 5,
        'fear': 6,
        'surprise': 7
}

In [15]:
def emotion_to_label(emotion):
    label = Emotion_label[emotion]
    return label

In [16]:
train_data["emotion"] = train_data["emotion"].apply(emotion_to_label)

##### Check that the conversion succeeded.

In [17]:
train_data["emotion"].unique()

array([3, 4, 2, 0, 1, 6, 7, 5])

#### [Note]
##### I tried converting the emojis back to words with the "emoji" package and training the model again, so that the tokenizer could recognize the word each emoji represents and use this additional information during training. However, the public leaderboard score on Kaggle only improved from 0.57297 to 0.57338, which was less than I expected.

In [None]:
import emoji
def demoji(text):
    return emoji.demojize(text, delimiters=("", "")) 

In [None]:
train_data["text"] = train_data["text"].apply(demoji)
test_data["text"] = test_data["text"].apply(demoji)
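To make the effect of this conversion concrete, here is a minimal sketch that uses only the standard library `unicodedata` module as a stand-in for the `emoji` package. The helper name `demojize_stdlib` is hypothetical, and the Unicode character names it produces may differ from the aliases that `emoji.demojize` returns, so treat it as an illustration of the idea rather than a replacement.

```python
import unicodedata


def demojize_stdlib(text: str) -> str:
    """Replace symbol characters with their Unicode names (hypothetical helper)."""
    pieces = []
    for ch in text:
        # Category "So" (Symbol, other) covers most pictographic emoji.
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            if name:
                # "PILE OF POO" becomes " pile_of_poo ".
                ch = " " + name.lower().replace(" ", "_") + " "
        pieces.append(ch)
    # Collapse the extra spaces introduced around the inserted names.
    return " ".join("".join(pieces).split())


print(demojize_stdlib("Felt like total dog \U0001F4A9 going into open gym"))
# Felt like total dog pile_of_poo going into open gym
```

The tokenizer then sees ordinary words in place of the emoji, which is the additional information the note above refers to.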

In [18]:
train_texts = train_data["text"].values
train_labels = train_data["emotion"].values
test_texts = test_data["text"].values

##### This part splits the training set into training and validation data. After tuning the hyperparameters, I trained on the full training set to get better performance on Kaggle, so the split is commented out.

In [19]:
#from sklearn.model_selection import train_test_split
#train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.15)
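As an illustration of what such a hold-out split does, the sketch below implements a stratified variant with the standard library only. Stratifying by label is an assumption added here (the commented-out call above is a plain random split), and `stratified_split` is a hypothetical helper name.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, val_fraction=0.15, seed=42):
    """Hold out val_fraction of every class (hypothetical helper)."""
    # Group example indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_val = int(round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    def pick(idx_list, seq):
        return [seq[i] for i in idx_list]

    return (pick(train_idx, texts), pick(val_idx, texts),
            pick(train_idx, labels), pick(val_idx, labels))


# Toy data: 40 examples, two balanced classes.
toy_texts = [f"tweet {i}" for i in range(40)]
toy_labels = [i % 2 for i in range(40)]
tr_x, va_x, tr_y, va_y = stratified_split(toy_texts, toy_labels)
print(len(tr_x), len(va_x))  # 34 6
```

Keeping the class proportions equal in both parts can matter when some emotions are much rarer than others.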

In [20]:
print("# of training data: ", len(train_texts))
#print("# of validation data: ", len(val_texts))

# of training data:  1455563


##### I use the Auto classes in transformers because they support various pretrained language models, such as standard BERT and RoBERTa, and they are simple to use.
* In my trials, I tried bert-base-uncased and roberta-base.

In [21]:
from transformers import Trainer, TrainingArguments, AutoTokenizer, AutoModelForSequenceClassification

In [22]:
models = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=8)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifie

##### The tokenizer expects a list of strings, so convert the NumPy arrays to lists.

In [23]:
train_texts = list(train_texts)
#val_texts = list(val_texts)
test_texts = list(test_texts)

##### Use the PyTorch Dataset class to define a custom dataset called TweetsDataset, which the Trainer will use later.

In [24]:
import torch

class TweetsDataset(torch.utils.data.Dataset):
    def __init__(self, tokenizer, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=64, verbose = True)
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [25]:
print("Whether using GPU: ", torch.cuda.is_available())
# Only query the device name when a GPU is actually present.
if torch.cuda.is_available():
    print("GPU name: ", torch.cuda.get_device_name(0))

Whether using GPU:  True
GPU name:  Tesla V100-SXM2-32GB


##### This example shows what the tokenizer returns for a batch of sentences. The output includes "input_ids" and "attention_mask".

In [28]:
tokenizer(["Huge Respect🖒 @JohnnyVegasReal talking about", "Huge Respect🖒 @JohnnyVegasReal talking about"],
          truncation=True, padding=True, max_length=64, verbose = True)

{'input_ids': [[0, 725, 11797, 33106, 6569, 25448, 10659, 787, 39249, 846, 3733, 281, 17105, 1686, 59, 2], [0, 725, 11797, 33106, 6569, 25448, 10659, 787, 39249, 846, 3733, 281, 17105, 1686, 59, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
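The two sentences above are identical, so no padding is visible in the output. The sketch below shows, in plain Python, what `padding=True` and `max_length` do when sequences differ in length. `pad_batch` is a hypothetical helper that works on already-encoded id lists. The pad id of 1 is the `<pad>` id of roberta-base, and the truncation here is deliberately simplified.

```python
PAD_ID = 1  # roberta-base uses id 1 for <pad>


def pad_batch(batch_ids, max_length=64, pad_id=PAD_ID):
    """Pad encoded id lists to the longest sequence in the batch (sketch)."""
    # Simplified truncation: keep the first max_length ids. The real tokenizer
    # shortens the text part and still keeps the closing special token.
    clipped = [ids[:max_length] for ids in batch_ids]
    width = max(len(ids) for ids in clipped)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in clipped]
    # 1 marks a real token, 0 marks padding that attention should ignore.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in clipped]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[0, 725, 11797, 2], [0, 1686, 2]])
print(batch["input_ids"])       # [[0, 725, 11797, 2], [0, 1686, 2, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Because the dataset classes in this notebook tokenize all texts in one call, `padding=True` pads every example to the longest sequence in the whole dataset, capped at `max_length=64`.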

##### Build the training dataset with the TweetsDataset class defined above.

In [29]:
train_dataset = TweetsDataset(tokenizer, train_texts, list(train_labels))
#val_dataset = TweetsDataset(tokenizer, val_texts, list(val_labels))

##### Use "TrainingArguments" and "Trainer" to train the downstream classification task.

In [30]:
training_args = TrainingArguments(
    output_dir='./results_roberta',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=256,  # batch size per device during training
    #per_device_eval_batch_size=128,   # batch size for evaluation
    warmup_steps=2000,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs_roberta',            # directory for storing logs
    logging_steps=1000, 
    save_steps=10000
)

trainer = Trainer(
    model=models,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    #eval_dataset=val_dataset           # evaluation dataset
)

trainer.train()

./results_roberta


wandb: Currently logged in as: daniel091444 (use `wandb login --relogin` to force relogin)


Step,Training Loss
1000,1.390715
2000,1.075018
3000,1.013678
4000,0.978613
5000,0.951293
6000,0.916831
7000,0.878918
8000,0.867628
9000,0.863633
10000,0.8563




TrainOutput(global_step=17058, training_loss=0.8968088038432555)
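The reported `global_step` can be checked against the training set size and the hyperparameters. The arithmetic below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. These assumptions are consistent with the output above but are not stated in the notebook.

```python
import math

num_examples = 1_455_563  # size of the training set printed earlier
batch_size = 256          # per_device_train_batch_size
num_epochs = 3            # num_train_epochs
warmup_steps = 2000       # warmup_steps

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_share = round(warmup_steps / total_steps, 3)

print(steps_per_epoch)  # 5686
print(total_steps)      # 17058
print(warmup_share)     # 0.117
```

So the 2000 warmup steps cover roughly the first 12% of training, and `save_steps=10000` writes a single intermediate checkpoint.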

In [31]:
trainer.save_model('./DM_roberta')

###  Testing stage
* In this phase, I load the trained model and predict labels for the testing data.

In [33]:
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModelForSequenceClassification.from_pretrained('DM_roberta', num_labels=8)

##### Note that this custom Dataset differs from the training Dataset, because the testing data does not contain emotion labels.

In [34]:
class TweetsTestDataset(torch.utils.data.Dataset):
    def __init__(self, tokenizer, texts):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=64, verbose = True)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

In [35]:
test_dataset = TweetsTestDataset(tokenizer, test_texts)

##### Use a DataLoader to feed the testing dataset to the model in batches and collect the predictions.

In [36]:
from torch.utils.data import DataLoader

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model.to(device)
model.eval()
cpu = torch.device('cpu')

test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False)

pred_labels = []
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
    
        outputs = model(input_ids, attention_mask=attention_mask)
        pred_labels.append(outputs[0].argmax(dim=1).to(cpu))
        torch.cuda.empty_cache()
        
pred_labels = torch.cat(pred_labels, dim=0)
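The prediction loop boils down to two steps: walk through the data in order, one batch at a time, and take the argmax of the scores in each row. The plain-Python sketch below mirrors that logic without PyTorch. `toy_score_fn` is a hypothetical stand-in for the model, and the other names are illustrative only.

```python
def batches(items, batch_size):
    """Yield consecutive slices, preserving order (like shuffle=False)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def argmax_rows(logits):
    """Return the index of the largest score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def predict(score_fn, items, batch_size=256):
    """Score every batch and collect one predicted class id per item."""
    preds = []
    for batch in batches(items, batch_size):
        preds.extend(argmax_rows(score_fn(batch)))
    return preds


def toy_score_fn(batch):
    # Hypothetical model: 3 classes, the highest score sits at len(text) % 3.
    return [[1.0 if c == len(text) % 3 else 0.0 for c in range(3)]
            for text in batch]


preds = predict(toy_score_fn, ["a", "bb", "ccc", "dddd", "eeeee"], batch_size=2)
print(preds)  # [1, 2, 0, 1, 2]
```

Keeping the original order matters because the predictions are later matched to the test rows by position.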

##### This dictionary converts the integer class ids back to string labels, to match the submission format.

In [37]:
Emotion_label = {
        0: 'sadness',
        1: 'disgust',
        2: 'anticipation',
        3: 'joy',
        4: 'trust',
        5: 'anger',
        6: 'fear',
        7: 'surprise'
}

In [38]:
emotions = [Emotion_label[int(label)] for label in pred_labels]
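One way to keep the two dictionaries from drifting apart is to derive the inverse mapping from the forward one instead of typing it twice. This is a minimal sketch, and the names `EMOTION_TO_ID` and `ID_TO_EMOTION` are hypothetical and not used elsewhere in the notebook.

```python
EMOTION_TO_ID = {
    'sadness': 0,
    'disgust': 1,
    'anticipation': 2,
    'joy': 3,
    'trust': 4,
    'anger': 5,
    'fear': 6,
    'surprise': 7,
}

# Invert the mapping; this is only safe because every id is unique.
ID_TO_EMOTION = {idx: name for name, idx in EMOTION_TO_ID.items()}
assert len(ID_TO_EMOTION) == len(EMOTION_TO_ID)

predicted_ids = [3, 0, 7]
decoded = [ID_TO_EMOTION[i] for i in predicted_ids]
print(decoded)  # ['joy', 'sadness', 'surprise']
```

Using separate names also avoids overwriting the training-time dictionary when the notebook is run from top to bottom.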

In [39]:
pred = pd.DataFrame(data = emotions, columns = ['emotion'])

In [40]:
pred

Unnamed: 0,emotion
0,sadness
1,sadness
2,sadness
3,sadness
4,joy
...,...
411967,surprise
411968,anticipation
411969,sadness
411970,joy


In [42]:
test_data

Unnamed: 0.1,Unnamed: 0,tweet_id,identification,text
0,0,0x28cc61,test,@Habbo I've seen two separate colours of the e...
1,1,0x2db41f,test,@FoxNews @KellyannePolls No serious self respe...
2,2,0x2466f6,test,"Looking for a new car, and it says 1 lady owne..."
3,3,0x23f9e9,test,@cineworld “only the brave” just out and fount...
4,4,0x1fb4e1,test,Felt like total dog 💩 going into open gym and ...
...,...,...,...,...
411967,411967,0x2c4dc2,test,6 year old walks in astounded. Mum! Look how b...
411968,411968,0x31be7c,test,Only one week to go until the #inspiringvolunt...
411969,411969,0x1ca58e,test,"I just got caught up with the manga for ""My He..."
411970,411970,0x35c8ba,test,Speak only when spoken to and make hot ass mus...


In [43]:
test_data["id"] = test_data["tweet_id"].copy()

In [44]:
test_data

Unnamed: 0.1,Unnamed: 0,tweet_id,identification,text,id
0,0,0x28cc61,test,@Habbo I've seen two separate colours of the e...,0x28cc61
1,1,0x2db41f,test,@FoxNews @KellyannePolls No serious self respe...,0x2db41f
2,2,0x2466f6,test,"Looking for a new car, and it says 1 lady owne...",0x2466f6
3,3,0x23f9e9,test,@cineworld “only the brave” just out and fount...,0x23f9e9
4,4,0x1fb4e1,test,Felt like total dog 💩 going into open gym and ...,0x1fb4e1
...,...,...,...,...,...
411967,411967,0x2c4dc2,test,6 year old walks in astounded. Mum! Look how b...,0x2c4dc2
411968,411968,0x31be7c,test,Only one week to go until the #inspiringvolunt...,0x31be7c
411969,411969,0x1ca58e,test,"I just got caught up with the manga for ""My He...",0x1ca58e
411970,411970,0x35c8ba,test,Speak only when spoken to and make hot ass mus...,0x35c8ba


##### Concatenate the "id" column and the predicted emotions, then write them to a CSV file for final submission.

In [45]:
pd.concat([test_data["id"], pred], axis = 1).to_csv("submission2.csv", index = False)
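`pd.concat(..., axis=1)` pairs rows by index, which works here because both frames carry the default 0..N-1 index. The sketch below shows the same pairing with the standard library `csv` module and an explicit length check. `write_submission` is a hypothetical helper, and the two sample rows are taken from the tables printed above.

```python
import csv
import io


def write_submission(ids, emotions, handle):
    """Write id,emotion rows, refusing mismatched lengths (sketch)."""
    if len(ids) != len(emotions):
        raise ValueError("ids and emotions must have the same length")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "emotion"])
    writer.writerows(zip(ids, emotions))


buffer = io.StringIO()
write_submission(["0x23f9e9", "0x1fb4e1"], ["sadness", "joy"], buffer)
print(buffer.getvalue())
# id,emotion
# 0x23f9e9,sadness
# 0x1fb4e1,joy
```

A check like this catches the case where the number of predictions and test rows differ before a file is uploaded.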

### Conclusion
1. I did not do much feature engineering in this competition, because the pretrained BERT tokenizer is already fixed and BERT is very good at extracting sentence features through its attention mechanism.
2. The one feature engineering method I tried was mapping the emojis back to words, so that the BERT tokenizer would not tokenize emojis to <UNK>. However, the score barely improved.
3. Comparing BERT with RoBERTa, RoBERTa performs better under the same hyperparameter settings.