<a href="https://colab.research.google.com/github/Lexian-6/Sentiment-Analysis-towards-COVID-19-on-Twitter/blob/main/Model4_XLnet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Model 4: XLNet**

@author - Janhavi Jain z5431064

### **Why XLNet?**

XLNet is a transformer-based model used for natural language processing tasks. Its strength lies in understanding contextual information, because it can capture dependencies between all tokens in a sequence. These dependencies are learned by training over multiple permutations of the order in which the tokens are predicted. This means that the model does not interpret relationships from left to right only, but also considers other orderings of the words.<br><br>

This may be helpful for analysing COVID-related tweets because:

*   Tweets express a person's sentiments, which require deep contextual understanding to interpret
*   The true meaning of a sequence of words is not always apparent if we consider unidirectional dependencies alone, so we need a mechanism that extracts useful cues from different orderings of the words
*   The model is known for strong performance in text classification, so we attempt to apply it to sentiment analysis<br><br>
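
The permutation idea described above can be illustrated with a small, self-contained sketch. It is only a toy: the sentence and the target position are made up for illustration, and the real model realises the different orders through attention masks rather than by physically reordering the input. The sketch enumerates every prediction order of a four-token sentence and collects the context a chosen token would be allowed to see in each one.

```python
from itertools import permutations

tokens = ["cases", "are", "not", "rising"]  # hypothetical example sentence
target = 2  # position of "not"

def context_positions(order, target):
    """Positions that come before `target` in one prediction order."""
    return order[: order.index(target)]

# A plain left-to-right order only exposes the words to the left of the target.
left_to_right = tuple(range(len(tokens)))
print([tokens[i] for i in context_positions(left_to_right, target)])
# ['cases', 'are']

# Across all permutations, the target is paired with every subset of the other words.
contexts = {
    frozenset(context_positions(order, target))
    for order in permutations(range(len(tokens)))
}
print(len(contexts))  # 8
print(frozenset({0, 1, 3}) in contexts)  # True: "rising" (to the right) can be context
```

With three other tokens there are 2³ = 8 possible context sets, including ones that contain words to the right of the target, which a strictly left-to-right model never sees.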

Let's execute the model step by step.
We have referenced the code from [here](https://www.analyticsvidhya.com/blog/2024/05/xlnet-pre-trained-model/).





In [None]:
# install necessary libraries
!pip install datasets
!pip install accelerate -U

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.5.0,>=2023.1.0 (from fsspec[http]<=2024.5.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.5.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-2.20.0-py3-none-any.whl (547 kB)

In [None]:
# import functions from libraries
import pandas as pd
import re
from transformers import XLNetTokenizer, XLNetForSequenceClassification, Trainer, TrainingArguments, XLNetModel
from datasets import Dataset
import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import torch.nn as nn
from transformers.modeling_outputs import SequenceClassifierOutput

### Dataset Preparation

1. Remove dataset skewness

About 75% of the tweets in the original dataset are neutral. Training on this data would bias the model towards predicting the neutral label and make its measured performance misleading. To remove this skew, we create a shuffled dataset with an equal number of tweets (3,300) from each class (positive, negative and neutral).


In [None]:
def prepareDataset(path, filename):
  dataset = pd.read_csv(path)

  # column 0 holds the tweet text, column 1 the numeric label
  tweets = dataset.iloc[:, 0]
  labels = dataset.iloc[:, 1]

  # separate positive (2), negative (0) and neutral (any other label) tweets
  is_pos = labels == 2
  is_neg = labels == 0
  groups = {
      'pos': tweets[is_pos],
      'neg': tweets[is_neg],
      'neu': tweets[~(is_pos | is_neg)],
  }

  # choose random 3300 samples of each type of tweet
  samples = []
  for label, group in groups.items():
      group_df = pd.DataFrame({'tweet': group.tolist(), 'label': label})
      samples.append(group_df.sample(n=3300, random_state=30))

  # concatenate all the samples
  combined_df = pd.concat(samples, ignore_index=True)

  # shuffle all samples before saving to csv file
  shuffled_combined_df = combined_df.sample(frac=1, random_state=30).reset_index(drop=True)

  # saving new dataset
  shuffled_combined_df.to_csv(filename, index=False)
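
The undersampling idea behind this step can also be shown on toy data. The sketch below is dependency-free so it can be run outside the notebook; the function name `balance_by_undersampling` and the toy rows are hypothetical and not part of the original code, and it is not the pandas routine used above.

```python
import random
from collections import Counter, defaultdict

def balance_by_undersampling(rows, n_per_class, seed=30):
    """rows: list of (tweet, label) pairs. Returns a shuffled, class-balanced list."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for tweet, label in rows:
        by_label[label].append((tweet, label))
    balanced = []
    for items in by_label.values():
        # raises ValueError if a class has fewer than n_per_class rows
        balanced.extend(rng.sample(items, n_per_class))
    rng.shuffle(balanced)
    return balanced

# Toy data with the same kind of skew: 75% neutral.
rows = (
    [(f"tweet {i}", "neu") for i in range(75)]
    + [(f"tweet {i}", "pos") for i in range(75, 90)]
    + [(f"tweet {i}", "neg") for i in range(90, 100)]
)
balanced = balance_by_undersampling(rows, n_per_class=10)
counts = Counter(label for _, label in balanced)
print(sorted(counts.items()))  # [('neg', 10), ('neu', 10), ('pos', 10)]
```

Undersampling discards most of the majority class, which is the price paid for a balanced label distribution.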

2. Preprocess Dataset

The tweets contain some inconsistencies and undesired information which need to be handled.

*  Remove all hashtags, usernames (@mentions) and links.
*  Convert all characters into lowercase.
*  Change labels from 'pos', 'neu', 'neg' to 2, 1, 0 respectively.

In [None]:
def preprocessData(dataset):
  # compile the patterns once instead of on every row
  # matches links such as https://abc or www.abc
  url = re.compile(r'https?://\S+|www\.\S+')
  # matches hashtags (#abc) and usernames (@abc)
  hashtags_usernames = re.compile(r'[@#]\w+')

  # the tweet text is in the first column; work on the whole column and
  # assign it back so that the changes are stored in the dataframe
  tweet_column = dataset.columns[0]
  # convert all characters to lowercase for consistency
  tweets = dataset[tweet_column].str.lower()
  # remove links, then hashtags and usernames
  tweets = tweets.str.replace(url, '', regex=True)
  tweets = tweets.str.replace(hashtags_usernames, '', regex=True)
  dataset[tweet_column] = tweets

  # convert labels into numerical values
  label_to_number = {'neg': 0, 'neu': 1, 'pos': 2}
  dataset['label'] = dataset['label'].map(label_to_number)

  return dataset
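
To see what the two regular expressions do, the sketch below applies them to a single made-up tweet. The patterns are the ones used in the notebook; the helper name `clean_tweet`, the sample text and the final whitespace collapse are illustrative additions.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
TAG_PATTERN = re.compile(r'[@#]\w+')

def clean_tweet(text):
    """Lowercase the text, then strip links, hashtags and @usernames."""
    text = text.lower()
    text = URL_PATTERN.sub('', text)
    text = TAG_PATTERN.sub('', text)
    return text

sample = "Stay SAFE everyone! #COVID19 @WHO https://example.com/info"
cleaned = clean_tweet(sample)

# The removed items leave their surrounding spaces behind, so an optional
# extra step (not in the notebook) is to collapse the leftover whitespace.
normalised = " ".join(cleaned.split())
print(normalised)  # stay safe everyone!
```

Note that the whole hashtag is removed, including its word, so any sentiment carried by the hashtag text itself is lost.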


3. Tokenize Dataset

The tweets need to be tokenized using an XLNet tokenizer from the transformers library. Each tokenized tweet is padded or truncated to a fixed length of 200 tokens so that all sequences have the same length, which allows the model to process the data in batches for better efficiency.

In [None]:
# referenced https://huggingface.co/xlnet/xlnet-base-cased
# Tokenize dataset
def tokenizeDataset(examples):
    return tokenizer(examples['tweet'], truncation=True, padding='max_length', max_length=200)
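
What `truncation=True` and `padding='max_length'` achieve can be sketched without the library. The helper below is hypothetical and simplified: the pad id of 0 is an assumption, special tokens are ignored, and padding on the left is an assumption about the XLNet tokenizer's default rather than something the notebook states.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0, side="left"):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]  # truncation
    mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_length - len(ids)
    if side == "left":
        return [pad_id] * n_pad + ids, [0] * n_pad + mask
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

ids, mask = pad_or_truncate([11, 12, 13], max_length=5)
print(ids)   # [0, 0, 11, 12, 13]
print(mask)  # [0, 0, 1, 1, 1]

long_ids, long_mask = pad_or_truncate(range(10), max_length=5)
print(long_ids)  # [0, 1, 2, 3, 4]
```

The attention mask is what lets the model ignore the padding positions during attention.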

4. Split dataset

We need to use the dataset for the purpose of training, validation and testing. Hence, the data is split in the ratio of 80:10:10.

In [None]:
def splitDataset(encoded_dataset):
  # Split dataset into train, validation and test sets
  train_valid_split = encoded_dataset.train_test_split(test_size=0.2, shuffle=True)
  train_dataset = train_valid_split['train']
  valid_test_split = train_valid_split['test'].train_test_split(test_size=0.5, shuffle=True)
  valid_dataset = valid_test_split['train']
  test_dataset = valid_test_split['test']

  print(train_dataset.shape)
  print(valid_dataset.shape)
  print(test_dataset.shape)

  return train_dataset, valid_dataset, test_dataset
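
The two-stage split first holds out 20% of the rows and then halves that hold-out set. For the balanced dataset of 9,900 tweets (3 × 3,300), the resulting sizes can be checked with a little arithmetic. The helper `split_sizes` is a hypothetical name, and it uses integer division on the assumption that the sizes divide evenly, as they do here.

```python
def split_sizes(n_rows):
    """Sizes of an 80:10:10 split done as 80/20 followed by 50/50."""
    holdout = n_rows * 2 // 10   # 20% held out in the first split
    test = holdout // 2          # second split halves the hold-out set
    valid = holdout - test
    train = n_rows - holdout
    return train, valid, test

print(split_sizes(9900))  # (7920, 990, 990)
```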

### Define Trainer

To run the model, we need to define training arguments along with a trainer. We use the `TrainingArguments` and `Trainer` classes provided by the transformers library for this purpose.

The arguments are refined progressively in the experiments conducted in order to improve model performance.

In [None]:
def createTrainer(learning_rate, num_train_epochs, batch_size, model, train_dataset, valid_dataset):
  # Define training arguments
  training_args = TrainingArguments(
      output_dir='./results',
      evaluation_strategy="epoch",
      learning_rate=learning_rate,
      per_device_train_batch_size=batch_size,
      per_device_eval_batch_size=batch_size,
      num_train_epochs=num_train_epochs,
      weight_decay=0.01
  )

  # Define Trainer
  trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      eval_dataset=valid_dataset,
  )

  return trainer

### Perform Experiments

* Experiment 1

Baseline run with a learning rate of 0.01, 2 epochs and a batch size of 32.

In [None]:
prepareDataset('/content/COVIDSenti.csv', 'SmallCovidSenti.csv')

dataset = pd.read_csv('/content/SmallCovidSenti.csv')
dataset = preprocessData(dataset)

df = Dataset.from_pandas(dataset)

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=3)

encoded_dataset = df.map(tokenizeDataset, batched=True)

train_dataset, valid_dataset, test_dataset = splitDataset(encoded_dataset)

trainer = createTrainer(0.01, 2, 32, model, train_dataset, valid_dataset)

# Train the model
trainer.train()


Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/9900 [00:00<?, ? examples/s]

(7920, 5)
(990, 5)
(990, 5)




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 1.181921        |
| 2     | No log        | 1.101543        |


TrainOutput(global_step=496, training_loss=1.3401100404800907, metrics={'train_runtime': 883.6753, 'train_samples_per_second': 17.925, 'train_steps_per_second': 0.561, 'total_flos': 1762711346880000.0, 'train_loss': 1.3401100404800907, 'epoch': 2.0})
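
The `global_step` value reported in `TrainOutput` can be reproduced from the training-set size, the batch size and the number of epochs. The sketch below assumes a single device and no gradient accumulation; the function name is hypothetical.

```python
import math

def total_optimizer_steps(n_train, batch_size, epochs):
    """Number of optimizer updates; the last batch of an epoch may be smaller."""
    return math.ceil(n_train / batch_size) * epochs

print(total_optimizer_steps(7920, 32, 2))  # 496
print(total_optimizer_steps(7920, 2, 4))   # 15840
```

A batch size of 2 therefore yields 3,960 updates per epoch on the 7,920 training tweets, compared with 248 for a batch size of 32.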

In [None]:
# make predictions
predictions = trainer.predict(test_dataset=test_dataset)

In [None]:
# Evaluate the model
predicted_labels = predictions.predictions.argmax(axis=1)
true_labels = test_dataset['label']
accuracy = accuracy_score(true_labels, predicted_labels)
print(accuracy)

0.33636363636363636
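
As a reference point for reading the accuracy and loss values in these experiments, it helps to know what a classifier with no useful signal would score. With three balanced classes, guessing gives an accuracy of 1/3, and a model that assigns equal probability to every class has a cross-entropy loss of ln 3. The snippet below simply computes both numbers.

```python
import math

num_classes = 3

# Accuracy of always predicting one class (or guessing) on balanced data.
chance_accuracy = 1 / num_classes

# Cross-entropy of a uniform prediction: -ln(1/3) = ln 3.
chance_loss = math.log(num_classes)

print(round(chance_accuracy, 4))  # 0.3333
print(round(chance_loss, 4))      # 1.0986
```

Values close to these suggest that the model has not yet learned to separate the classes.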


* Experiment 2

Reduce the batch size to 2 and the learning rate to 0.001. The model object from Experiment 1 is reused, so training continues from its current weights.

In [None]:
trainer = createTrainer(0.001, 2, 2, model, train_dataset, valid_dataset)

# Train the model
trainer.train()

In [None]:
# make predictions
predictions = trainer.predict(test_dataset=test_dataset)

In [None]:
# Evaluate the model
predicted_labels = predictions.predictions.argmax(axis=1)
true_labels = test_dataset['label']
accuracy = accuracy_score(true_labels, predicted_labels)
print(accuracy)

0.3333333333333333


* Experiment 3

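Reduce the learning rate to 0.00001 and increase the number of epochs to 4. The model object from the previous experiments is reused, so training continues from its current weights.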

In [None]:
trainer = createTrainer(0.00001, 4, 2, model, train_dataset, valid_dataset)

# Train the model
trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.1319        | 1.098439        |
| 2     | 1.1461        | 1.098781        |
| 3     | 1.1137        | 1.098406        |
| 4     | 1.1259        | 1.098375        |


TrainOutput(global_step=15840, training_loss=1.1281788642960366, metrics={'train_runtime': 3111.7008, 'train_samples_per_second': 10.181, 'train_steps_per_second': 5.09, 'total_flos': 3525422693760000.0, 'train_loss': 1.1281788642960366, 'epoch': 4.0})

In [None]:
# make predictions
predictions = trainer.predict(test_dataset=test_dataset)

In [None]:
# Evaluate the model
predicted_labels = predictions.predictions.argmax(axis=1)
true_labels = test_dataset['label']
accuracy = accuracy_score(true_labels, predicted_labels)
print(accuracy)

0.8525252525252526


* Experiment 4

Create a customised XLNet model with additional dropout and linear layers. Use the ReLU activation function.

In [None]:
class CustomXLNet(nn.Module):
    def __init__(self, num_labels=3):
        super(CustomXLNet, self).__init__()
        self.num_labels = num_labels
        self.xlnet = XLNetModel.from_pretrained('xlnet-base-cased')
        # additional dropout layer
        self.dropout = nn.Dropout(0.3)
        # additional linear layer
        self.additional_layer = nn.Linear(self.xlnet.config.hidden_size, 256)
        self.classifier = nn.Linear(256, num_labels)

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None):
        outputs = self.xlnet(input_ids=input_ids,
                             attention_mask=attention_mask,
                             token_type_ids=token_type_ids)

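        # token-level representations, shape (batch, sequence length, hidden size)
        last_hidden_state = outputs[0]
        # mean-pool over the sequence dimension (padding positions are included)
        pooled_output = torch.mean(last_hidden_state, 1)
        pooled_output = self.additional_layer(pooled_output)
        # use relu activation for pooled output from additional layer
        pooled_output = torch.relu(pooled_output)
        # note: self.dropout is defined in __init__ but is not applied in this forward pass
        logits = self.classifier(pooled_output)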

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        return SequenceClassifierOutput(loss=loss, logits=logits)
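
To get a feel for how small the added head is, the number of trainable parameters in the two linear layers can be counted by hand. The sketch assumes a hidden size of 768 for `xlnet-base-cased`; in the notebook the value is read from `self.xlnet.config.hidden_size`, so treat 768 as an assumption.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

hidden_size = 768  # assumed hidden size of xlnet-base-cased

additional_layer = linear_params(hidden_size, 256)
classifier = linear_params(256, 3)

print(additional_layer)               # 196864
print(classifier)                     # 771
print(additional_layer + classifier)  # 197635
```

Under that assumption the head adds roughly 0.2 million parameters on top of the pretrained encoder.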

In [None]:
# use custom model
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = CustomXLNet(num_labels=3)

In [None]:
# encode dataset
encoded_dataset = df.map(tokenizeDataset, batched=True)

# split into train, valid and test sets
train_dataset, valid_dataset, test_dataset = splitDataset(encoded_dataset)

# initialise trainer
trainer = createTrainer(0.00001, 4, 2, model, train_dataset, valid_dataset)

# Train the model
trainer.train()

Map:   0%|          | 0/9900 [00:00<?, ? examples/s]

(7920, 5)
(990, 5)
(990, 5)




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.9845        | 1.030127        |
| 2     | 0.7526        | 1.136729        |
| 3     | 0.5848        | 1.029463        |
| 4     | 0.385         | 1.015856        |


TrainOutput(global_step=15840, training_loss=0.6980367207767988, metrics={'train_runtime': 2965.3608, 'train_samples_per_second': 10.683, 'train_steps_per_second': 5.342, 'total_flos': 0.0, 'train_loss': 0.6980367207767988, 'epoch': 4.0})

In [None]:
# make predictions
predictions = trainer.predict(test_dataset=test_dataset)

In [None]:
# Evaluate the model
predicted_labels = predictions.predictions.argmax(axis=1)
true_labels = test_dataset['label']
accuracy = accuracy_score(true_labels, predicted_labels)
print(accuracy)

0.8757575757575757


* Experiment 5

1. Take the re-labelled dataset<br>
2. Remove skewness by creating a smaller dataset of re-labelled tweets<br>
3. Run the customised model

In [None]:
# create smaller dataset of relabelled tweets
#prepareDataset('/content/Newest-COVIDSenti_A.csv', 'ReSmallCovidSenti.csv')

# load smaller relabelled dataset and pre process it
dataset = pd.read_csv('/content/ReSmallCovidSenti2.csv')
dataset = preprocessData(dataset)

# convert to a Hugging Face Dataset
df = Dataset.from_pandas(dataset)

# initialise tokenizer and model with custom class
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = CustomXLNet(num_labels=3)

# encode dataset
encoded_dataset = df.map(tokenizeDataset, batched=True)

# split into train, valid and test sets
train_dataset, valid_dataset, test_dataset = splitDataset(encoded_dataset)

# initialise trainer class
trainer = createTrainer(0.00001, 4, 2, model, train_dataset, valid_dataset)

# Train the model
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.1818        | 1.358613        |
| 2     | 1.0688        | 1.449693        |
| 3     | 0.7409        | 1.545016        |

In [None]:
# make predictions
predictions = trainer.predict(test_dataset=test_dataset)


In [None]:
# Evaluate the model
predicted_labels = predictions.predictions.argmax(axis=1)
true_labels = test_dataset['label']
accuracy = accuracy_score(true_labels, predicted_labels)
print(accuracy)

0.775438596491228


### Discussion

Based on the above experiments, we observed the following about XLNet:

1.   Requires smaller batch size

     In our experiments, the model did not train well with a large batch size, unlike many other models. The runs that reached high accuracy used a batch size of 2. We attribute this to the larger number of gradient updates per epoch that a small batch size provides, which gives the model more opportunities to adjust its weights and generalise.
2.   Requires very small learning rate

     This is a very complex model with a vast number of parameters. It cannot learn if the weights change too drastically, which is what happens with large learning rates: learning rates of 0.01 and 0.001 gave accuracies close to 33%, whereas 0.00001 worked well. Based on this, we consider a suitable learning rate to lie between 0.000001 and 0.00001.

3.   Requires high quality dataset

     This model has deep contextual understanding and is able to pick up the hidden semantic meaning of words. Its predictions are based on those hidden meanings, and where they did not match the original labels, we attribute this to the original labels being incorrect. Hence, the quality of the dataset must not be compromised if higher accuracy is to be achieved.

4.   Better performance on re-labelled dataset

     XLNet was less accurate on the original dataset than the other models, which we attribute to the incorrect original labels. It was able to capture the true sentiments of tweets better than the other models, indicating the strength of the network. On the re-labelled dataset, it achieved better accuracy than most models, showcasing its contextual strength.
    

