<a href="https://colab.research.google.com/github/NishaMDev/Tweets-Classification/blob/main/Fine_Tuning_Bert_for_Tweets_Classification_ft_Hugging_Face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning Bert for Tweets Classification ft. Hugging Face

Bidirectional Encoder Representations from Transformers (BERT) is a state of the art model based on transformers developed by google. It can be pre-trained and later fine-tuned for a specific task. we will see fine-tuning in action in this post.

We will fine-tune bert on a classification task. The task is to classify the sentiment of covid related tweets.

Here we are using the Hugging face library to fine-tune the model. Hugging face makes the whole process easy from text preprocessing to training.

**Bert**

Bert was pre-trained on the BooksCorpus dataset and English Wikipedia. It obtained state-of-the-art results on eleven natural language processing tasks.

Bert was trained on two tasks simultaneously
 
1.   Masked language modelling (MLM) — 15% of the tokens were masked and was trained to predict the masked word
2.   Next Sentence Prediction(NSP) — Given two sentences A and B, predict whether B follows A

BERT is designed to pre-train deep bidirectional representations from an unlabeled text by jointly conditioning on both left and right context in all layers.

As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/MyDrive/Kaggle"

In [None]:
%cd /content/gdrive/MyDrive/Kaggle

/content/gdrive/MyDrive/Kaggle


In [None]:
!kaggle datasets download -d datatattle/covid-19-nlp-text-classification

Downloading covid-19-nlp-text-classification.zip to /content/gdrive/MyDrive/Kaggle
100% 4.38M/4.38M [00:00<00:00, 26.0MB/s]
100% 4.38M/4.38M [00:00<00:00, 25.7MB/s]


In [None]:
!unzip \*.zip  && rm *.zip

Archive:  covid-19-nlp-text-classification.zip
replace Corona_NLP_test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace Corona_NLP_train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [None]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
[K     |████████████████████████████████| 4.2 MB 4.3 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
[K     |████████████████████████████████| 86 kB 6.1 MB/s 
[?25hCollecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 61.7 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
[K     |████████████████████████████████| 6.6 MB 57.8 MB/s 
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstallin

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset('csv', data_files={'train': 'Corona_NLP_train.csv', 'test': 'Corona_NLP_test.csv'}, encoding = "ISO-8859-1")

Using custom data configuration default-642ab2b66d02caad


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-642ab2b66d02caad/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-642ab2b66d02caad/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['UserName', 'ScreenName', 'Location', 'TweetAt', 'OriginalTweet', 'Sentiment'],
        num_rows: 41157
    })
    test: Dataset({
        features: ['UserName', 'ScreenName', 'Location', 'TweetAt', 'OriginalTweet', 'Sentiment'],
        num_rows: 3798
    })
})

## Preprocessing Data

We will keep it simple and only do 2 pre-processing steps i.e tokenization and converting labels into integers.

Hugging Face AutoTokenizertakes care of the tokenization part. we can download the tokenizer corresponding to our model, which is bert in this case.



In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/426k [00:00<?, ?B/s]

bert tokenizer automatically convert sentences into tokens, numbers and attention_masks in the form which the bert model expects.

e.g: here is an example sentence that is passed through a tokenizer

In [None]:
tokenizer("Attention is all you need")

{'input_ids': [101, 1335, 5208, 2116, 1110, 1155, 1128, 1444, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Now as part of the preprocessing steps, we will perform two steps:


1.   Convert Sentiment into an integer
2.   Tokenize the tweets

We will be using map function of the dataset which is similar to apply function of the pandas data frame. It takes a function as an argument and applies to the entire dataset.

In [None]:
def transform_labels(label):

    label = label['Sentiment']
    num = 0
    if label == 'Positive':
        num = 0
    elif label == 'Negative':
        num = 1
    elif label == 'Neutral':
        num = 2
    elif label == 'Extremely Positive':
        num = 3
    elif label == 'Extremely Negative':
        num = 4

    return {'labels': num}

def tokenize_data(example):
    return tokenizer(example['OriginalTweet'], padding='max_length')

dataset = dataset.map(tokenize_data, batched=True)

remove_columns = ['UserName', 'ScreenName', 'Location', 'TweetAt', 'OriginalTweet', 'Sentiment']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)

  0%|          | 0/42 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/41157 [00:00<?, ?ex/s]

  0%|          | 0/3798 [00:00<?, ?ex/s]

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 41157
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3798
    })
})

In the above code, we defined a method to convert labels into integers and tokenized the tweets also dropped the unwanted columns.

Now we are all set for the training part.

**Training**
There are two ways to train the data, either we write our own training loop or we can use trainer from the hugging face library.

In this case, we will use trainer from the library. To use trainer, first we need to define the training arguments like name, num_epochs, batch_size etc.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer", num_train_epochs=3)

Let’s download the bert model now, which is very simple using the AutoModelForSequenceClassificatio class.

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Downloading:   0%|          | 0.00/416M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

The classification model downloaded also expects an argument num_labels which is the number of classes in our data. A linear layer is attached at the end of the bert model to give output equal to the number of classes.



(classifier): Linear(in_features=768, out_features=5, bias=True)



The above linear layer is automatically added as the last layer. Since the bert output size is 768 and our data has 5 classes so a linear layer with in_features=768 and out_features as 5 is added.

Before starting the training, we will split our training data into train and evaluation sets. We have 40k in training and 1k in eval set.

In [None]:
train_dataset = dataset['train'].shuffle(seed=10).select(range(40000))
eval_dataset = dataset['train'].shuffle(seed=10).select(range(40000, 41000))

Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/csv/default-642ab2b66d02caad/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-55ee5ee877a00e4f.arrow


If we are using a hugging face trainer we need to import the module Trainer and pass model, dataset and training arguments to it.

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 40000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 15000


Step,Training Loss


Once training is done we can run trainer.evalute() to check the accuracy, but before that, we need to import metrics.

In [None]:
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer.evaluate()

Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


{'epoch': 3.0,
 'eval_loss': 0.626483142375946,
 'eval_runtime': 9.6991,
 'eval_samples_per_second': 103.102,
 'eval_steps_per_second': 12.888}

On our data, we got an accuracy of 83% by training for only 3 epochs.

Accuracy can be further increased by training for some more time or doing some more pre-processing of data like removing mentions from tweets and unwanted clutter, but that’s for some other time.

