### `K-Fold Cross-Validation with BERT`

The text has already been cleaned. It now needs to be tokenized so that BERT can compute embeddings for it during K-fold cross-validation with the `BERT Sequence Classification` model.

Let's import the modules required for the training process.

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler, Dataset
from transformers import BertModel, BertTokenizer, BertForSequenceClassification
from pytorch_pretrained_bert import BertAdam
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from keras.preprocessing.sequence import pad_sequences
from google.colab import runtime
from tqdm import tqdm, trange
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df_tweets = pd.read_csv('../data/preprocessed_tweets.csv')
df_tweets.shape

(22830, 9)

In [None]:
df_tweets = df_tweets.drop(['Unnamed: 0'], axis=1)
df_tweets.sample(10)

### Inputs
**BERT** requires specifically formatted inputs. For each tokenized input sentence, we need to create:

- **input ids**: a sequence of integers mapping each input token to its index in the BERT tokenizer vocabulary
- **segment mask**: (optional) a sequence of 1s and 0s that marks which sentence each token belongs to. For one-sentence inputs, this is simply a sequence of 0s. For two-sentence inputs, there is a 0 for each token of the first sentence, followed by a 1 for each token of the second sentence. It is not used in this project.
- **attention mask**: (optional) a sequence of 1s and 0s, with 1s for all input tokens and 0s for all padding tokens
- **labels**: a single value of 1 or 0. In this project the label is taken from the `vader_sentiment_label` column.
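
As a rough illustration of how input ids and the attention mask line up after padding, here is a minimal pure-Python sketch. The tiny vocabulary, its id values, and the `encode` helper are hypothetical and only mimic the shape of the output; the real `BertTokenizer` splits text into WordPiece tokens and uses its own vocabulary ids.

```python
# Hypothetical toy vocabulary; real BERT ids come from the tokenizer's vocab file.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "good": 4, "morning": 5, "twitter": 6}


def encode(text, max_len):
    """Whitespace-tokenize, add [CLS]/[SEP], then truncate and pad to max_len."""
    # Leave room for the two special tokens when truncating.
    tokens = text.lower().split()[: max_len - 2]
    ids = [TOY_VOCAB["[CLS]"]]
    ids += [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    ids.append(TOY_VOCAB["[SEP]"])

    # 1 for every real token (including the special tokens), 0 for padding.
    attention_mask = [1] * len(ids)
    pad = max_len - len(ids)
    ids += [TOY_VOCAB["[PAD]"]] * pad
    attention_mask += [0] * pad
    return ids, attention_mask


input_ids, attention_mask = encode("Good morning Twitter friends", max_len=8)
print(input_ids)       # [2, 4, 5, 6, 1, 3, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The word "friends" is not in the toy vocabulary, so it maps to the `[UNK]` id, and the two trailing padding positions are the only ones masked out.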

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.cuda()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

'NVIDIA A100-SXM4-40GB'

In [None]:
K_FOLDS = 5
MAX_LEN = 128
PADDING = 'post'
TRUNCATING = 'post'
DTYPE = 'long'
BATCH_SIZE = 32

In [None]:
class TextDataset(Dataset):
  def __init__(self, texts, labels):
    self.texts = texts
    self.labels = labels

  def __len__(self):
    return len(self.texts)

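  def __getitem__(self, idx):
    text = self.texts[idx]
    label = self.labels[idx]

    # Tokenize with the global `tokenizer`, padding/truncating every text to a fixed length.
    # NOTE: max_length is hard-coded to 510 here; the MAX_LEN constant is not used.
    encoding = tokenizer(text, padding='max_length', truncation=True, max_length=510, return_tensors='pt')
    # return_tensors='pt' adds a leading batch dimension of size 1; drop it so the
    # DataLoader can stack individual items into a batch.
    input_ids = encoding['input_ids'].squeeze(0)
    attention_mask = encoding['attention_mask'].squeeze(0)
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': torch.tensor(label)}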


In [None]:
LEARNING_RATE = 2e-5
EPOCHS = 3

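param_optimizer = list(model.named_parameters())
# Biases and LayerNorm weights are conventionally excluded from weight decay.
# The Hugging Face BERT model names its LayerNorm parameters `LayerNorm.weight` and
# `LayerNorm.bias` (not `gamma` / `beta`), so match on those names.
no_decay = ['bias', 'LayerNorm.weight']
# BertAdam reads the per-group key `weight_decay`; a `weight_decay_rate` key is
# not picked up, and the optimizer default would then apply to both groups.
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]

optimizer = BertAdam(optimizer_grouped_parameters, lr=LEARNING_RATE, warmup=.1)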
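
Which parameters end up exempt from weight decay depends entirely on the substrings in `no_decay`, and this can be checked without loading the model. The sketch below applies the same name filter to a handful of parameter names written in the Hugging Face naming style (module path plus `.weight` / `.bias`, following the module names in the model printout above). The short name list and the `split_by_decay` helper are illustrative assumptions, not part of the notebook.

```python
# A few parameter names in the style returned by model.named_parameters().
# This short list is illustrative; the real model has many more entries.
param_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.embeddings.LayerNorm.weight",
    "bert.embeddings.LayerNorm.bias",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "classifier.weight",
    "classifier.bias",
]


def split_by_decay(names, no_decay):
    """Partition names into (decayed, exempt) using substring matching."""
    decayed = [n for n in names if not any(nd in n for nd in no_decay)]
    exempt = [n for n in names if any(nd in n for nd in no_decay)]
    return decayed, exempt


# Filter using the older gamma/beta naming.
decay_a, skip_a = split_by_decay(param_names, ["bias", "gamma", "beta"])
# Filter using the LayerNorm.weight naming.
decay_b, skip_b = split_by_decay(param_names, ["bias", "LayerNorm.weight"])

print("LayerNorm.weight decayed with gamma/beta filter:",
      "bert.embeddings.LayerNorm.weight" in decay_a)
print("LayerNorm.weight exempt with LayerNorm.weight filter:",
      "bert.embeddings.LayerNorm.weight" in skip_b)
```

With the `gamma` / `beta` filter only the bias terms are exempt, because no name in this list contains either substring; the LayerNorm weight is exempt only when `LayerNorm.weight` is listed explicitly.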



### K-Fold Cross-Validation on BERT

In [None]:
def crossvalidation(df, device, optimizer, epochs=EPOCHS, k_folds=K_FOLDS, batch_size=BATCH_SIZE):
    dataset = TextDataset(df['tweet'].tolist(), df['vader_sentiment_label'].tolist())

    # Define k-fold cross-validation
    skf = StratifiedKFold(n_splits=k_folds, shuffle=True, random_state=42)

    # Initialize lists to store accuracies for each fold
    fold_accuracies = []

    # Perform k-fold cross-validation
    for fold, (train_indices, val_indices) in enumerate(skf.split(df['tweet'], df['vader_sentiment_label'])):
        print(f"Training fold: {fold+1}/{k_folds}")

        # Split dataset into train and validation sets for the current fold
        train_dataset = torch.utils.data.Subset(dataset, train_indices)
        val_dataset =  torch.utils.data.Subset(dataset, val_indices)

        # Create data loaders
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

        # Training loop
        # NOTE: the global `model` and the `optimizer` passed in are reused for every
        # fold; neither is reinitialized here, so each fold continues training from
        # the weights left by the previous fold.
        model.to(device)
        model.train()
        for _ in trange(epochs, desc="Epoch"):
            for batch in train_loader:
                # Clear out the gradients (by default they accumulate)
                optimizer.zero_grad()

                # Move the batch tensors to the target device
                b_input_ids = batch['input_ids'].to(device)
                b_input_mask = batch['attention_mask'].to(device)
                b_labels = batch['labels'].to(device)

                # Forward pass; passing labels makes the model return the loss
                outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
                loss = outputs.loss
                # Backward pass
                loss.backward()
                # Update parameters and take a step using the computed gradient
                optimizer.step()

        # Validation
        # Switch to evaluation mode (disables dropout) before scoring the validation fold
        model.eval()

        # Collect predictions and true labels for this fold
        val_predictions, val_labels = [], []
        with torch.no_grad():
            # One pass over the validation fold, without gradient tracking
            for batch in val_loader:
                # Move the batch tensors to the target device
                b_input_ids = batch['input_ids'].to(device)
                b_input_mask = batch['attention_mask'].to(device)
                b_labels = batch['labels'].to(device)

                outputs = model(b_input_ids, attention_mask=b_input_mask)
                _, predicted_labels = torch.max(outputs.logits, dim=1)
                val_predictions.extend(predicted_labels.tolist())
                val_labels.extend(b_labels.tolist())

        fold_accuracy = accuracy_score(val_labels, val_predictions)
        fold_accuracies.append(fold_accuracy)
        print(f"Fold {fold+1}:")
        print(f"Val. Acc:{accuracy_score(val_labels, val_predictions)}, Prec:{precision_score(val_labels, val_predictions)}, Rec:{recall_score(val_labels, val_predictions)}, F1:{f1_score(val_labels, val_predictions)}, F1-micro:{f1_score(val_labels, val_predictions, average='micro')}, F1-macro:{f1_score(val_labels, val_predictions, average='macro')}\n")


    # Calculate average accuracy across all folds
    average_accuracy = sum(fold_accuracies) / len(fold_accuracies)
    print(f"Average Accuracy: {average_accuracy}")

    # NOTE: only the labels and predictions of the last fold are returned
    return val_labels, val_predictions
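
In `crossvalidation` above, the same `model` and `optimizer` objects are used for every fold. Because each fold's validation samples were part of the training split of earlier folds, a model that is not reset can score well on them simply because it has already seen them. The toy sketch below illustrates that effect under stated assumptions: `MemorizingModel` is a hypothetical stand-in that just remembers its training pairs, and `make_folds` is a simplified, non-stratified split. It is not the notebook's training code.

```python
class MemorizingModel:
    """Toy stand-in for a fine-tuned network: remembers every (x, y) it was trained on."""

    def __init__(self):
        self.seen = {}

    def fit(self, xs, ys):
        self.seen.update(zip(xs, ys))

    def predict(self, x):
        # Inputs never seen during training default to class 0.
        return self.seen.get(x, 0)


def make_folds(n, k):
    """Split indices 0..n-1 into k interleaved validation folds."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validate(xs, ys, k, fresh_model_per_fold):
    shared = MemorizingModel()
    accuracies = []
    for val_idx in make_folds(len(xs), k):
        val_set = set(val_idx)
        train_idx = [i for i in range(len(xs)) if i not in val_set]
        # Either start from scratch for this fold or keep training the shared model.
        model = MemorizingModel() if fresh_model_per_fold else shared
        model.fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum(model.predict(xs[i]) == ys[i] for i in val_idx)
        accuracies.append(correct / len(val_idx))
    return accuracies


xs = list(range(20))
ys = [x % 2 for x in xs]

leaky = cross_validate(xs, ys, k=5, fresh_model_per_fold=False)
clean = cross_validate(xs, ys, k=5, fresh_model_per_fold=True)
print("shared model:", leaky)  # [0.5, 1.0, 1.0, 1.0, 1.0]
print("fresh model: ", clean)  # [0.5, 0.5, 0.5, 0.5, 0.5]
```

For the BERT setup, the equivalent change would be to call `BertForSequenceClassification.from_pretrained(...)` and build a new optimizer at the start of each fold. That variant was not run here, so the scores in this notebook come from the shared-model loop.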

In [None]:
val_labels, val_predictions = crossvalidation(df_tweets, device, optimizer)

Training fold: 1/5


Epoch: 100%|██████████| 3/3 [21:21<00:00, 427.06s/it]


Fold 1:
Val. Acc:0.891371003066141, Prec:0.7734487734487735, Rec:0.6125714285714285, F1:0.6836734693877551, F1-micro:0.891371003066141, F1-macro:0.809049849447976

Training fold: 2/5


Epoch: 100%|██████████| 3/3 [21:19<00:00, 426.60s/it]


Fold 2:
Val. Acc:0.9774419623302671, Prec:0.9673123486682809, Rec:0.9131428571428571, F1:0.9394473838918284, F1-micro:0.9774419623302671, F1-macro:0.9627932653546074

Training fold: 3/5


Epoch: 100%|██████████| 3/3 [21:19<00:00, 426.57s/it]


Fold 3:
Val. Acc:0.9945247481384144, Prec:0.9885057471264368, Rec:0.9828571428571429, F1:0.98567335243553, F1-micro:0.9945247481384144, F1-macro:0.9911445143117138

Training fold: 4/5


Epoch: 100%|██████████| 3/3 [21:19<00:00, 426.51s/it]


Fold 4:
Val. Acc:0.9967148488830486, Prec:0.9953970080552359, Rec:0.9874429223744292, F1:0.9914040114613181, F1-micro:0.9967148488830486, F1-macro:0.9946867085870283

Training fold: 5/5


Epoch: 100%|██████████| 3/3 [21:19<00:00, 426.57s/it]


Fold 5:
Val. Acc:0.9962768287341218, Prec:0.9919816723940436, Rec:0.9885844748858448, F1:0.9902801600914809, F1-micro:0.9962768287341218, F1-macro:0.9939887865336181

Average Accuracy: 0.9712658782303987
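
The average alone hides how much the folds differ from each other. A quick check with the standard library, using the five validation accuracies printed above, gives the mean together with the spread:

```python
from statistics import mean, stdev

# Validation accuracies copied from the fold outputs above.
fold_accuracies = [
    0.891371003066141,
    0.9774419623302671,
    0.9945247481384144,
    0.9967148488830486,
    0.9962768287341218,
]

avg = mean(fold_accuracies)
spread = stdev(fold_accuracies)  # sample standard deviation across folds
gap = max(fold_accuracies) - min(fold_accuracies)

print(f"mean={avg:.4f}  std={spread:.4f}  max-min={gap:.4f}")
```

Most of the spread comes from fold 1, which is roughly ten points below the best fold.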


In [None]:
print(f"\nVal. Acc:{accuracy_score(val_labels, val_predictions)}, Prec:{precision_score(val_labels, val_predictions)}, Rec:{recall_score(val_labels, val_predictions)}, F1:{f1_score(val_labels, val_predictions)}, F1-micro:{f1_score(val_labels, val_predictions, average='micro')}, F1-macro:{f1_score(val_labels, val_predictions, average='macro')}")



Val. Acc:0.9962768287341218, Prec:0.9919816723940436, Rec:0.9885844748858448, F1:0.9902801600914809, F1-micro:0.9962768287341218, F1-macro:0.9939887865336181
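
In the output above, F1-micro is identical to the accuracy while F1 and F1-macro differ. For single-label classification this is expected: micro-averaging pools true positives, false positives and false negatives over all classes, which reduces to the fraction of correct predictions. The plain `f1_score` call uses scikit-learn's default binary averaging and scores class 1 only, while macro is the unweighted mean of the per-class F1 scores. The sketch below recomputes the three variants by hand; the `y_true` and `y_pred` lists are made up for illustration and are not taken from the dataset.

```python
def counts_for_class(y_true, y_pred, cls):
    """Return (tp, fp, fn) when `cls` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative labels only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
classes = sorted(set(y_true) | set(y_pred))

per_class = {c: counts_for_class(y_true, y_pred, c) for c in classes}

binary_f1 = f1_from_counts(*per_class[1])  # F1 of class 1 only
macro_f1 = sum(f1_from_counts(*per_class[c]) for c in classes) / len(classes)
# Micro: pool the counts over all classes before computing F1.
pooled = [sum(per_class[c][i] for c in classes) for i in range(3)]
micro_f1 = f1_from_counts(*pooled)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy={accuracy:.3f} micro={micro_f1:.3f} "
      f"binary={binary_f1:.3f} macro={macro_f1:.3f}")
```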


In [None]:
# Save the fine-tuned model (its weights after the final fold) to ./5-fold
model.save_pretrained("./5-fold")

In [None]:
runtime.unassign()