# Classification Selection

This notebook applies two resampling methods, SVM-SMOTE and oversampling by duplication of the minority class, to the BERT+SVM and BERT+BERT models. Each technique is evaluated with a holdout split and with 5-fold cross-validation.

### Summary of the aforementioned techniques:
- Oversampling by duplication of minority class - (BERT+BERT, BERT+SVM)
  - tweets => Resample minority class by duplication (RandomOversampler) => resampled tweets
  - BERT+BERT:
    - resampled tweets => BERT(tokenizer, encoder layers) => results
  - BERT+SVM
    - resampled tweets => BERT(tokenizer, encoder layers) => resampled embedding vectors
    - re-sampled embedding vectors => SVM => results


- SVM-SMOTE - (BERT+SVM)
  - tweets => BERT (tokenizer, encoder layers) => embedding vectors (768)
  - embedding vectors => SVM-SMOTE => re-sampled embedding vectors
  - re-sampled embedding vectors => SVM => results

The SVM-SMOTE method was already applied to the BERT+SVM model in the ```Data Balance Methods.ipynb``` notebook, so this notebook does not repeat that experiment. Its results are listed below alongside the results obtained here.

- BERT+SVM (SVM-SMOTE):
  - holdout:
    - Acc: 0.81, Prec: 0.81, Rec: 0.81, F1: 0.81, F1-micro: 0.81, F1-macro: 0.81, F1-weighted: 0.81, G-mean: 0.81
  - 5-fold:
    - Acc: 0.87, Prec: 0.80, Rec: 0.97, F1: 0.88, F1-micro: 0.87, F1-macro: 0.87, F1-weighted: 0.87, G-mean: 0.86
- BERT+SVM (Oversampling):
  - holdout:
    - Acc: 0.75, Prec: 0.75, Rec: 0.76, F1: 0.76, F1-micro: 0.75, F1-macro: 0.75, F1-weighted: 0.75, G-mean: 0.75
  - 5-fold:
    - Acc: 0.75, Prec: 0.74, Rec: 0.75, F1: 0.75, F1-micro: 0.75, F1-macro: 0.75, F1-weighted: 0.75, G-mean: 0.75
- BERT+BERT (Oversampling):
  - holdout:
    - Acc: 0.95, Prec: 0.92, Rec: 0.97, F1: 0.95, F1-micro: 0.95, F1-macro: 0.95, F1-weighted: 0.95, G-mean: 0.95
  - 5-fold:
    - Acc: 1.00, Prec: 1.00, Rec: 0.99, F1: 1.00, F1-micro: 1.00, F1-macro: 1.00, F1-weighted: 1.00, G-mean: 1.00

<br>



In [None]:
import os

import pandas as pd
import numpy as np
from google.colab import runtime
import zipfile

In [None]:
# unzipping the zip file
with zipfile.ZipFile("nst_preprocessed_tweets.zip", 'r') as zip_ref:
    zip_ref.extractall(os.getcwd())

In [None]:
df_tweets = pd.read_csv('nst_preprocessed_tweets.csv')
df_tweets.shape

(22830, 9)

In [None]:
df_tweets.sample(10)

Unnamed: 0.1,Unnamed: 0,vader_sentiment_label,vader_score,tweet,tweet_length,url_link,pos_emoji,neg_emoji,profanity_word
7905,7929,0,-0.5859,surprising habit iss people hi woulding depres...,49,1,0,0,0
494,494,1,0.3825,one fucking disgusting hoe type people think d...,274,0,0,0,1
13731,13770,0,-0.8478,yes daily anger depression fear trump got go t...,102,0,0,0,0
8570,8594,0,-0.5719,seeing new doctor first visit clinical depress...,165,0,0,0,0
6563,6585,0,-0.7351,depression sucks,16,0,0,0,0
8221,8245,1,0.6486,depression behind back convinced capable happy,87,1,0,0,0
13113,13152,0,-0.128,study show diagnosing depression may able done...,103,1,0,0,0
4278,4289,0,-0.9136,list long rough eat emotions suffer anxiety de...,103,0,0,0,1
14439,14480,0,-0.7184,sorry depression tough suici woulde never tast...,129,0,0,0,0
22178,22301,0,-0.8126,ghosts past traumatic history constant failure...,225,0,0,0,0


# Resample minority class by duplication (RandomOversampler)

In [None]:
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

print(f"Before oversampling: {Counter(df_tweets['vader_sentiment_label'].tolist())}")

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(df_tweets[['tweet']], df_tweets['vader_sentiment_label'])

print(f"After oversampling: {Counter(y_res)}")

Before oversampling: Counter({0: 18453, 1: 4377})
After oversampling: Counter({0: 18453, 1: 18453})


In [None]:
X_res, y_res = np.asarray(X_res['tweet']), np.asarray(y_res)
X_res.shape, y_res.shape

((36906,), (36906,))
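
To make the duplication step concrete, here is a minimal pure-Python sketch of the idea: minority-class items are drawn again, with replacement, until both classes have the same count. The helper name `oversample_by_duplication` and the toy tweets are illustrative assumptions; this is not the internal code of `RandomOverSampler`, only the principle behind it.

```python
import random
from collections import Counter

def oversample_by_duplication(items, labels, seed=42):
    """Duplicate randomly chosen items of the smaller classes until every
    class is as large as the biggest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_items, out_labels = list(items), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(items, labels) if y == cls]
        # Draw the missing items with replacement from the same class.
        for _ in range(target - n):
            out_items.append(rng.choice(pool))
            out_labels.append(cls)
    return out_items, out_labels

# Toy data: four majority-class tweets (0) and one minority-class tweet (1).
tweets = ["a", "b", "c", "d", "e"]
labels = [0, 0, 0, 0, 1]
res_tweets, res_labels = oversample_by_duplication(tweets, labels)
print(Counter(res_labels))
```

With the class counts printed above, balancing adds 18453 - 4377 = 14076 duplicated minority tweets, which gives the 36906 rows of the resampled arrays.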

## Tokenize & Encode - TODO

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [None]:
import torch

max_len = 128

def tokenization(tweets, labels, maxlen=max_len, tokenizer=tokenizer):
    input_ids = []
    attention_masks = []

    for tweet in tweets:
        encoded_dict = tokenizer.encode_plus(
                        tweet,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = maxlen,            # Pad & truncate all sentences.
                        truncation = True,
                        padding = 'max_length',
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                )

        # Add the encoded sentence to the list.
        input_ids.append(encoded_dict['input_ids'])

        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])

    # Convert the lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels = torch.tensor(labels)

    return input_ids, attention_masks, labels
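
What `padding='max_length'` and `truncation=True` do can be illustrated without a tokenizer. The sketch below assumes a list of already-computed token ids; `pad_and_mask` is a hypothetical helper, not a `transformers` function, and the short id lists are made up for illustration.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids used by bert-base-uncased

def pad_and_mask(token_ids, max_len):
    """Add [CLS]/[SEP], truncate to max_len, pad with PAD_ID, and build the
    attention mask (1 = real token, 0 = padding)."""
    body = token_ids[: max_len - 2]            # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad

# Hypothetical ids for a short and a long tweet.
short_ids, short_mask = pad_and_mask([7, 8, 9], max_len=8)
long_ids, long_mask = pad_and_mask(list(range(1000, 1020)), max_len=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every row ends up with exactly `max_len` entries, which is what allows `torch.cat` to stack the per-tweet tensors into one matrix.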

In [None]:
input_ids, attention_masks, labels = tokenization(X_res, y_res)

# Oversampling on BERT+BERT

In [None]:
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 32
training_split = .75

def get_dataloader(input_ids, attention_masks, labels, training_split=training_split, batch_size=batch_size):
    # Split ids, masks, and labels in a single call so the three stay aligned.
    (train_inputs, validation_inputs,
     train_masks, validation_masks,
     train_labels, validation_labels) = train_test_split(input_ids, attention_masks, labels,
                                                         random_state=2018, train_size=training_split)

    # Create an iterator of our data with torch DataLoader. This helps save on memory during training because, unlike a for loop,
    # with an iterator the entire dataset does not need to be loaded into memory
    train_data = TensorDataset(train_inputs, train_masks, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

    validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
    validation_sampler = SequentialSampler(validation_data)
    validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

    return train_dataloader, validation_dataloader

In [None]:
train_dataloader, validation_dataloader = get_dataloader(input_ids, attention_masks, labels)
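
Because the tweets were duplicated before this split, copies of one minority tweet can end up in both the training and the validation set, which tends to make validation scores optimistic. The sketch below shows one way to measure that overlap on a plain list of texts. `count_leaked` is a hypothetical helper and the toy corpus is made up for illustration.

```python
import random

def count_leaked(train_texts, val_texts):
    """Number of validation texts that also occur verbatim in training."""
    seen = set(train_texts)
    return sum(1 for t in val_texts if t in seen)

# Toy corpus: 8 unique majority tweets and 2 unique minority tweets,
# with the minority tweets duplicated until both classes have 8 items.
majority = [f"maj-{i}" for i in range(8)]
minority = ["min-0", "min-1"] * 4
texts = majority + minority

rng = random.Random(0)
rng.shuffle(texts)
split = int(0.75 * len(texts))          # same 75/25 ratio as above
train_texts, val_texts = texts[:split], texts[split:]

print(count_leaked(train_texts, val_texts), "of", len(val_texts),
      "validation texts also appear in training")
```

A common way to avoid the overlap is to split first and oversample only the training portion.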

In [None]:
from transformers import BertModel, BertTokenizer, BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.cuda()

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.get_device_name(0)

'Tesla T4'

In [None]:
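from torch.optim import AdamW  # the AdamW class in transformers is deprecated

batch_size = 16 # used by the 5-fold loaders below; the holdout loaders above were built with 32
learning_rate = 2e-5 # try 3e-3 later
epochs = 4


optimizer = AdamW(model.parameters(),
                  lr = learning_rate,
                  eps = 1e-8, # a very small number to prevent any division by zero in the implementation
                  weight_decay = 0.0 # keeps the default of the deprecated transformers.AdamW
                  )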



In [None]:
from transformers import get_linear_schedule_with_warmup

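# Total number of training steps is [number of batches] x [number of epochs].
# (Note that this is not the same as the number of training samples!)
# For the holdout split this would be len(train_dataloader) * epochs. The
# hard-coded 1846 appears to be the number of batches in one 5-fold training
# split (about 29,525 tweets at batch_size=16).
#total_steps = len(train_dataloader) * epochs
total_steps = 1846 * epochs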

"""
  Create a schedule with a learning rate that decreases linearly from the
  initial lr set in the optimizer to 0, after a warmup period during which it
  increases linearly from 0 to the initial lr set in the optimizer.
"""
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
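
The schedule described in the docstring can be written out as a small function. This is a minimal sketch of the linear warmup and decay rule, not the library code itself; `linear_schedule_lr` is a hypothetical name.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given optimizer step for linear warmup followed
    by linear decay to zero."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor

total_steps = 1846 * 4          # same product as in the cell above
for step in (0, total_steps // 2, total_steps, total_steps + 100):
    print(step, linear_schedule_lr(step, 2e-5, 0, total_steps))
```

Once the step count passes `num_training_steps` the rate stays at zero, so a scheduler that is reused beyond its planned number of steps no longer lets the optimizer change the weights.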

In [None]:
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

## BERT holdout
Below is our training loop. There's a lot going on, but fundamentally each pass in our loop has a training phase and a validation phase. At each pass we need to:

**Training loop:**
- Put the model in train mode, which enables training-only behaviour such as dropout
- Unpack our data inputs and labels
- Load data onto the GPU for acceleration
- Clear out the gradients calculated in the previous pass.
  - In pytorch the gradients accumulate by default (useful for things like RNNs) unless you explicitly clear them out
- Forward pass (feed input data through the network)
- Backward pass (backpropagation)
- Tell the network to update parameters with optimizer.step()
- Track variables for monitoring progress

**Evaluation loop:**
- Put the model in evaluation mode, which disables dropout, and run the forward pass under `torch.no_grad()` so that no gradients are computed
- Unpack our data inputs and labels
- Load data onto the GPU for acceleration
- Forward pass (feed input data through the network)
- Collect the predictions on our validation data and compute the metrics for monitoring progress

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.metrics import geometric_mean_score
from tqdm import tqdm, trange

for _ in trange(epochs, desc="Epoch"):

    # Training

    # Put the model into training mode. Don't be misled--the call to
    # `train` just changes the *mode*, it doesn't *perform* the training.
    model.train()

    # tracking variables
    total_train_loss, num_train_steps = 0, 0

    for step, batch in enumerate(train_dataloader):
        # Unpack this training batch from our dataloader.
        # As we unpack the batch, we'll also copy each tensor to the GPU using
        # the 'to' method
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Clear prior gradients
        model.zero_grad()

        result = model(b_input_ids,
                           token_type_ids=None,
                           attention_mask=b_input_mask,
                           labels=b_labels)

        # Get the loss and "logits" output by the model. The "logits" are the
        # output values prior to applying an activation function like the
        # softmax.
        loss = result['loss']
        logits = result['logits']

        total_train_loss += loss.item()
        num_train_steps += 1
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    print(f"\nTrain loss: {total_train_loss / num_train_steps}")

    # After the completion of each training epoch, measure our performance on
    # our validation set.

    model.eval()

    # Tracking variables
    total_eval_loss = 0
    total_eval_accuracy = 0

    # Tracking variables for performance evaluation
    predictions , true_labels = [], []

    for batch in validation_dataloader:
        # Unpack this validation batch from our dataloader.
        # As we unpack the batch, we'll also copy each tensor to the GPU using
        # the 'to' method

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        with torch.no_grad():
            result = model(b_input_ids,
                           token_type_ids=None,
                           attention_mask=b_input_mask,
                           labels=b_labels)

        # The "logits" are the output values prior to applying an activation
        # function like the softmax; the predicted class is their argmax.
        true_labels.extend(b_labels.tolist())
        _, predicted_labels = torch.max(result["logits"], dim=1)
        predictions.extend(predicted_labels.tolist())

    print(f"\nAcc:{(accuracy_score(true_labels, predictions)).round(2)}," \
            f" Prec:{precision_score(true_labels, predictions).round(2)}," \
            f" Rec:{recall_score(true_labels, predictions).round(2)}," \
            f" F1:{f1_score(true_labels, predictions).round(2)}," \
            f" F1-micro:{f1_score(true_labels, predictions, average='micro').round(2)}," \
            f" F1-macro:{f1_score(true_labels, predictions, average='macro').round(2)}," \
            f" F1-weighted:{f1_score(true_labels, predictions, average='weighted').round(2)}," \
            f" G-mean:{geometric_mean_score(true_labels, predictions).round(2)}")


Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Train loss: 0.39173539267971336


Epoch:  25%|██▌       | 1/4 [09:36<28:50, 576.87s/it]


Acc:0.89, Prec:0.88, Rec:0.91, F1:0.89, F1-micro:0.89, F1-macro:0.89, F1-weighted:0.89, G-mean:0.89
Train loss: 0.19448785168371793


Epoch:  50%|█████     | 2/4 [19:19<19:20, 580.35s/it]


Acc:0.94, Prec:0.95, Rec:0.94, F1:0.94, F1-micro:0.94, F1-macro:0.94, F1-weighted:0.94, G-mean:0.94
Train loss: 0.09946017345989586


Epoch:  75%|███████▌  | 3/4 [29:02<09:41, 581.33s/it]


Acc:0.94, Prec:0.91, Rec:0.97, F1:0.94, F1-micro:0.94, F1-macro:0.94, F1-weighted:0.94, G-mean:0.94
Train loss: 0.05821884610485609


Epoch: 100%|██████████| 4/4 [38:44<00:00, 581.07s/it]


Acc:0.95, Prec:0.92, Rec:0.97, F1:0.95, F1-micro:0.95, F1-macro:0.95, F1-weighted:0.95, G-mean:0.95





In [None]:
print(f"\nAcc:{(accuracy_score(true_labels, predictions))}," \
            f" Prec:{precision_score(true_labels, predictions)}," \
            f" Rec:{recall_score(true_labels, predictions)}," \
            f" F1:{f1_score(true_labels, predictions)}," \
            f" F1-micro:{f1_score(true_labels, predictions, average='micro')}," \
            f" F1-macro:{f1_score(true_labels, predictions, average='macro')}," \
            f" F1-weighted:{f1_score(true_labels, predictions, average='weighted')}," \
            f" G-mean:{geometric_mean_score(true_labels, predictions)}")


Acc:0.9468949821176981, Prec:0.9236154004529545, Rec:0.9741585233441911, F1:0.9482139082646375, F1-micro:0.9468949821176981, F1-macro:0.9468605128511799, F1-weighted:0.9468580193297722, G-mean:0.9465540091057032
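
For reference, all of these scores can be derived from the four confusion-matrix counts. The sketch below assumes binary labels with 1 as the positive class. The G-mean here is the binary form, the square root of sensitivity times specificity, which to our understanding is what `geometric_mean_score` returns for two classes with its default arguments. `binary_metrics` is a hypothetical helper and the label lists are toy data.

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and G-mean from confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
    }

# Toy labels, not taken from the experiment.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))

# F1 recomputed from the precision and recall printed above.
prec, rec = 0.9236154004529545, 0.9741585233441911
print(2 * prec * rec / (prec + rec))
```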


## BERT 5-fold cross-validation

In [None]:
from torch.utils.data import Dataset

# Tokenizes one tweet on access and returns its input ids, attention mask, and label
class TextDataset(Dataset):
  def __init__(self, texts, labels):
    self.texts = texts
    self.labels = labels

  def __len__(self):
    return len(self.texts)

  def __getitem__(self, idx):
    text = self.texts[idx]
    label = self.labels[idx]

    encoding = tokenizer(text, padding='max_length', truncation=True, max_length=510, return_tensors='pt')
    input_ids = encoding['input_ids'].squeeze()
    attention_masks = encoding['attention_mask'].squeeze()
    return {'input_ids': input_ids, 'attention_mask': attention_masks, 'labels': torch.tensor(label)}

In [None]:
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

from tqdm import tqdm, trange
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.metrics import geometric_mean_score

dataset = TextDataset(X_res, y_res)

# Define k-fold cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform k-fold cross-validation
for fold, (train_indices, val_indices) in enumerate(skf.split(X_res, y_res)):

    # Split dataset into train and validation sets for the current fold
    train_dataset = torch.utils.data.Subset(dataset, train_indices)
    val_dataset =  torch.utils.data.Subset(dataset, val_indices)

    # Create data loaders
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    validation_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

    # Training loop
    total_train_loss, num_train_steps = 0, 0
    print(f"Training fold: {fold+1}/{5}")
    model.to(device)
    model.train()
    for _ in trange(epochs, desc="Epoch"):
        for batch in train_loader:
            b_input_ids = batch['input_ids'].to(device)
            b_input_mask = batch['attention_mask'].to(device)
            b_labels = batch['labels'].to(device)

            model.zero_grad()
            result = model(b_input_ids,
                              token_type_ids=None,
                              attention_mask=b_input_mask,
                              labels=b_labels)

            loss = result['loss']
            logits = result['logits']

            total_train_loss += loss.item()
            num_train_steps += 1

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()

    print(f"\nTrain loss: {total_train_loss / num_train_steps}")

    # Validation
    # Put model on evaluation mode to evaluate loss on the validation set
    model.eval()
    print(f"Evaluation fold: {fold+1}/{5}")
    # Tracking variables for performance evaluation
    predictions , true_labels = [], []
    for batch in validation_dataloader:
        # Unpack this validation batch from our dataloader.
        # As we unpack the batch, we'll also copy each tensor to the GPU using
        # the 'to' method

        b_input_ids = batch['input_ids'].to(device)
        b_input_mask = batch['attention_mask'].to(device)
        b_labels = batch['labels'].to(device)

        with torch.no_grad():
            result = model(b_input_ids,
                           token_type_ids=None,
                           attention_mask=b_input_mask,
                           labels=b_labels)

        # The "logits" are the output values prior to applying an activation
        # function like the softmax; the predicted class is their argmax.
        true_labels.extend(b_labels.tolist())
        _, predicted_labels = torch.max(result["logits"], dim=1)
        predictions.extend(predicted_labels.tolist())

    print(f"\nAcc:{(accuracy_score(true_labels, predictions)).round(2)}," \
            f" Prec:{precision_score(true_labels, predictions).round(2)}," \
            f" Rec:{recall_score(true_labels, predictions).round(2)}," \
            f" F1:{f1_score(true_labels, predictions).round(2)}," \
            f" F1-micro:{f1_score(true_labels, predictions, average='micro').round(2)}," \
            f" F1-macro:{f1_score(true_labels, predictions, average='macro').round(2)}," \
            f" F1-weighted:{f1_score(true_labels, predictions, average='weighted').round(2)}," \
            f" G-mean:{geometric_mean_score(true_labels, predictions).round(2)}")

In [None]:
print(f"\nAcc:{(accuracy_score(true_labels, predictions))}," \
            f" Prec:{precision_score(true_labels, predictions)}," \
            f" Rec:{recall_score(true_labels, predictions)}," \
            f" F1:{f1_score(true_labels, predictions)}," \
            f" F1-micro:{f1_score(true_labels, predictions, average='micro')}," \
            f" F1-macro:{f1_score(true_labels, predictions, average='macro')}," \
            f" F1-weighted:{f1_score(true_labels, predictions, average='weighted')}," \
            f" G-mean:{geometric_mean_score(true_labels, predictions)}")
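
The cross-validation loop above keeps using the same `model`, `optimizer`, and `scheduler` objects in every fold, so from the second fold on the model has already been trained on tweets that now sit in the validation split. The toy sketch below illustrates why that matters. `MemorizingModel` and `run_k_fold` are hypothetical stand-ins, not part of the experiment, and the memorizing "model" is deliberately extreme.

```python
class MemorizingModel:
    """Toy stand-in for a fine-tuned classifier: it remembers the label of
    every text it was trained on and predicts 0 for anything unseen."""
    def __init__(self):
        self.seen = {}

    def fit(self, texts, labels):
        self.seen.update(zip(texts, labels))

    def predict(self, texts):
        return [self.seen.get(t, 0) for t in texts]

def run_k_fold(texts, labels, k, fresh_model_per_fold):
    n = len(texts)
    fold_size = n // k
    model = MemorizingModel()
    accuracies = []
    for fold in range(k):
        if fresh_model_per_fold:
            model = MemorizingModel()      # discard what earlier folds learned
        val_idx = list(range(fold * fold_size, (fold + 1) * fold_size))
        train_idx = [i for i in range(n) if i not in set(val_idx)]
        model.fit([texts[i] for i in train_idx], [labels[i] for i in train_idx])
        preds = model.predict([texts[i] for i in val_idx])
        truth = [labels[i] for i in val_idx]
        accuracies.append(sum(p == t for p, t in zip(preds, truth)) / len(truth))
    return accuracies

texts = [f"tweet-{i}" for i in range(20)]   # 20 unique toy texts
labels = [i % 2 for i in range(20)]         # alternating classes

print("shared model:", run_k_fold(texts, labels, 5, fresh_model_per_fold=False))
print("fresh model: ", run_k_fold(texts, labels, 5, fresh_model_per_fold=True))
```

In the BERT loop, the analogous change would be to reload `BertForSequenceClassification.from_pretrained(...)` and rebuild the optimizer and scheduler at the start of every fold.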

# Oversampling on BERT+SVM

In [None]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 64
training_split = .75

def get_dataloader(input_ids, attention_masks, labels, training_split=training_split, batch_size=batch_size):
    # Create an iterator of our data with torch DataLoader. This helps save on memory during training because, unlike a for loop,
    # with an iterator the entire dataset does not need to be loaded into memory

    data = TensorDataset(input_ids, attention_masks, labels)
    sampler = SequentialSampler(data)
    dataloader = DataLoader(data, sampler=sampler, batch_size=batch_size)

    return dataloader

In [None]:
dataloader = get_dataloader(input_ids, attention_masks, labels)

## Custom Classes
In order to implement our model, we need to define our own BERT class based on
`BertForSequenceClassification`. \
We named our custom class `BertEmbeddingVectors`. \
The aim of our custom model is to get the BERT embeddings of the tweets. Since the tweets were already oversampled by duplication above, these vectors are passed directly to the SVM classifier.

In [None]:
import math
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss, MSELoss
from sklearn.svm import SVC


from transformers import BertForSequenceClassification

class BertEmbeddingVectors(BertForSequenceClassification):
    """
        A model that extracts BERT embeddings of the oversampled tweets
        for SVM classification.

        The constructor expects a transformers.BertConfig object.
    """

    def __init__(self, config):

      #BERT set-up

      # Call the constructor for the huggingface 'BertForSequenceClassification'
      # class, which will do all of the BERT-related setup. The resulting BERT
      # model is stored in 'self.bert'.
      super().__init__(config)

      # Feature combination set-up

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        class_weights=None,
        output_attentions=None,
        output_hidden_states=None):
        # BERT

        # Run the text through the BERT model. Invoking 'self.bert' returns
        # outputs from the encoding layers, and not from the final classifier.

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states)

        # outputs[0] - All of the outputs embeddings from BERT
        # outputs[1] - The [CLS] token embedding, with some additional "pooling"
        #              done.
        cls = outputs[1]

        # Apply dropout to the CLS embedding. This is a no-op while the model
        # is in evaluation mode, the state in which `from_pretrained` returns it.
        cls = self.dropout(cls)

        # Return the embeddings as a NumPy array for scikit-learn.
        cls = cls.detach().cpu().numpy()
        return cls

### Load Model

In this section, we'll use our custom BERT class and Google's pretrained BERT model.

First, select the GPU as the PyTorch device.

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#device = torch.device("cpu")
torch.cuda.get_device_name(0)

'Tesla T4'

In [None]:
from transformers import BertConfig

# We'll need to use a "BertConfig" object from the transformers library
# to specify our parameters.
config = BertConfig.from_pretrained(
          'bert-base-uncased',
          num_labels=2)

model = BertEmbeddingVectors.from_pretrained(
        'bert-base-uncased',
        config=config)

# Tell pytorch to run this model on the GPU
desc = model.cuda()



model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertEmbeddingVectors were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## BERT Embeddings for SVM

In [None]:
def get_embeddings(dataloader):
    X, y = [], []

    # Inference only: keep dropout disabled and skip gradient bookkeeping.
    model.eval()
    for step, batch in enumerate(dataloader):
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        with torch.no_grad():
            cls_head = model(b_input_ids,
                             token_type_ids=None,
                             attention_mask=b_input_mask,
                             labels=b_labels)

        labels = b_labels.to('cpu').numpy()

        X.extend(cls_head)
        y.extend(labels)

    return X, y

In [None]:
X, y = get_embeddings(dataloader)

In [None]:
X, y = np.asarray(X), np.asarray(y)
X.shape, y.shape

((36906, 768), (36906,))

## SVM Holdout Training

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
from sklearn.svm import SVC

svm_model = SVC(kernel='linear', verbose=True)
svm_model.fit(X_train, y_train)

[LibSVM]

In [None]:
X_pred = svm_model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.metrics import geometric_mean_score

print(f"Acc:{(accuracy_score(y_test, X_pred)).round(2)}," \
            f" Prec:{precision_score(y_test, X_pred).round(2)}," \
            f" Rec:{recall_score(y_test, X_pred).round(2)}," \
            f" F1:{f1_score(y_test, X_pred).round(2)}," \
            f" F1-micro:{f1_score(y_test, X_pred, average='micro').round(2)}," \
            f" F1-macro:{f1_score(y_test, X_pred, average='macro').round(2)}," \
            f" F1-weighted:{f1_score(y_test, X_pred, average='weighted').round(2)}," \
            f" G-mean:{geometric_mean_score(y_test, X_pred).round(2)}")

Acc:0.75, Prec:0.75, Rec:0.76, F1:0.76, F1-micro:0.75, F1-macro:0.75, F1-weighted:0.75, G-mean:0.75


In [None]:
print(f"\nVal. Acc:{(accuracy_score(y_test, X_pred))}," \
            f" Prec:{precision_score(y_test, X_pred)}," \
            f" Rec:{recall_score(y_test, X_pred)}," \
            f" F1:{f1_score(y_test, X_pred)}," \
            f" F1-micro:{f1_score(y_test, X_pred, average='micro')}," \
            f" F1-macro:{f1_score(y_test, X_pred, average='macro')}," \
            f" F1-weighted:{f1_score(y_test, X_pred, average='weighted')}," \
            f" G-mean:{geometric_mean_score(y_test, X_pred)}")


Val. Acc:0.7504064159531809, Prec:0.7492083597213426, Rec:0.760934819897084, F1:0.7550260610573343, F1-micro:0.7504064159531809, F1-macro:0.750317625690492, F1-weighted:0.7503691648659608, G-mean:0.750214377594774


## SVM 5-fold Cross-validation

In [None]:
X.shape, y.shape

((36906, 768), (36906,))

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from imblearn.metrics import geometric_mean_score

gm_scorer = make_scorer(geometric_mean_score, greater_is_better=True)

scoring = {'accuracy': 'accuracy', 'precision': 'precision', 'recall': 'recall', 'f1': 'f1', 'f1_micro': 'f1_micro', 'f1_macro': 'f1_macro', 'f1_weighted': 'f1_weighted', 'g-mean': gm_scorer}

svm_cross_validation = SVC(kernel='linear')
cv_results = cross_validate(svm_cross_validation, X, y, scoring=scoring, cv=5, verbose=3)

[CV] END  accuracy: (test=0.750) f1: (test=0.754) f1_macro: (test=0.750) f1_micro: (test=0.750) f1_weighted: (test=0.750) g-mean: (test=0.750) precision: (test=0.743) recall: (test=0.765) total time=11.7min
[CV] END  accuracy: (test=0.765) f1: (test=0.768) f1_macro: (test=0.765) f1_micro: (test=0.765) f1_weighted: (test=0.765) g-mean: (test=0.765) precision: (test=0.760) recall: (test=0.777) total time=12.1min
[CV] END  accuracy: (test=0.749) f1: (test=0.750) f1_macro: (test=0.749) f1_micro: (test=0.749) f1_weighted: (test=0.749) g-mean: (test=0.749) precision: (test=0.749) recall: (test=0.751) total time=12.0min
[CV] END  accuracy: (test=0.761) f1: (test=0.763) f1_macro: (test=0.761) f1_micro: (test=0.761) f1_weighted: (test=0.761) g-mean: (test=0.761) precision: (test=0.756) recall: (test=0.770) total time=12.1min
[CV] END  accuracy: (test=0.746) f1: (test=0.748) f1_macro: (test=0.746) f1_micro: (test=0.746) f1_weighted: (test=0.746) g-mean: (test=0.746) precision: (test=0.742) recal

In [None]:
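# Note: index 4 selects the scores of the fifth (last) fold only,
# not the average over the five folds.
for x in cv_results:
    print(f"{x}: {cv_results[x][4].round(2)}", end='\n')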

fit_time: 643.76
score_time: 62.76
test_accuracy: 0.75
test_precision: 0.74
test_recall: 0.75
test_f1: 0.75
test_f1_micro: 0.75
test_f1_macro: 0.75
test_f1_weighted: 0.75
test_g-mean: 0.75
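
The loop above prints `cv_results[x][4]`, which is the score of the fifth fold alone. Cross-validation results are usually summarized as the mean over all folds. The sketch below uses the per-fold values visible in the log above for the three metrics whose five values are fully shown; the dictionary name `fold_scores` is an assumption made for this example.

```python
from statistics import mean

# Per-fold test scores copied from the cross_validate log above.
fold_scores = {
    "accuracy":  [0.750, 0.765, 0.749, 0.761, 0.746],
    "precision": [0.743, 0.760, 0.749, 0.756, 0.742],
    "f1":        [0.754, 0.768, 0.750, 0.763, 0.748],
}

for name, scores in fold_scores.items():
    print(f"{name}: last fold={scores[4]:.2f}, mean over folds={mean(scores):.2f}")
```

With the real `cv_results` dictionary the same summary is available as `cv_results[x].mean()`, since each entry is a NumPy array with one value per fold.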
