# BERT+SVM Experiment

In this notebook I combine a BERT model with a traditional classifier, an SVM, to identify depression from tweets. The SVM is trained and evaluated with both a holdout split and 5-fold cross-validation.

### Summary of the ensemble model:
  - BERT+SVM
    - tweets => BERT(tokenizer, encoder layers) => embedding vectors
    - embedding vectors => SVM => results


In [None]:
import os

import pandas as pd
import numpy as np
from google.colab import runtime
import zipfile

In [None]:
df_tweets = pd.read_csv('nst_preprocessed_tweets.csv')
df_tweets.shape

(22830, 9)

In [None]:
df_tweets.sample(10)

Unnamed: 0.1,Unnamed: 0,vader_sentiment_label,vader_score,tweet,tweet_length,url_link,pos_emoji,neg_emoji,profanity_word
10290,10320,1,0.128,depression always gets best,37,0,0,0,0
6722,6744,0,-0.5095,whole thread twdepression bc today fucking sucked,60,0,0,0,1
18382,18468,0,-0.4019,tropical depression barry small chance spinoff...,94,1,0,0,0
16315,16371,0,-0.5583,anyone suffering depression ok though warrior ...,148,1,0,0,0
8206,8230,0,-0.5329,bonus mindful someone elses needs important al...,250,0,0,0,0
13867,13906,0,-0.7506,gua tertawa miris baca ini dapetin ipk diatas ...,255,1,0,0,0
12197,12233,0,-0.4724,world filled depression choose kind right kind,73,0,0,0,0
20774,20883,0,-0.9162,depression kinda worse week body getting used ...,266,0,0,0,0
8082,8106,0,-0.0258,listened ki would cudi high school chances dep...,91,0,0,0,0
10303,10333,1,0.3415,unapologetically happy hard society romanticiz...,84,0,0,0,0


## Tokenize & Encode

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [None]:
import torch

max_len = 128

def tokenization(tweets, labels, maxlen=max_len, tokenizer=tokenizer):
    """Encode tweets into padded input-id and attention-mask tensors."""
    input_ids = []
    attention_masks = []

    for tweet in tweets:
        encoded_dict = tokenizer.encode_plus(
            tweet,                        # Sentence to encode.
            add_special_tokens=True,      # Add '[CLS]' and '[SEP]'.
            max_length=maxlen,            # Target length for padding and truncation.
            truncation=True,              # Cut sentences longer than max_length.
            padding='max_length',         # Pad shorter sentences to max_length.
            return_attention_mask=True,   # Construct attention masks.
            return_tensors='pt',          # Return PyTorch tensors.
        )

        # Add the encoded sentence to the list.
        input_ids.append(encoded_dict['input_ids'])

        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])

    # Convert the lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels = torch.tensor(labels)

    return input_ids, attention_masks, labels

In [None]:
tweets = df_tweets.tweet.values
labels = df_tweets.vader_sentiment_label.values

In [None]:
input_ids, attention_masks, labels = tokenization(tweets, labels)
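To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a minimal pure-Python sketch of what happens to a single tweet. The special-token ids (`[PAD]`=0, `[CLS]`=101, `[SEP]`=102) are those of `bert-base-uncased`; the helper `pad_and_mask` and the word-piece ids 2001 to 2003 are hypothetical and only serve the illustration.

```python
# Special-token ids of bert-base-uncased.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def pad_and_mask(token_ids, max_len):
    """Mimic add_special_tokens + truncation + padding='max_length'."""
    # Reserve two positions for [CLS] and [SEP], then truncate the content.
    content = token_ids[: max_len - 2]
    ids = [CLS_ID] + content + [SEP_ID]
    # 1 marks a real token, 0 marks padding.
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad

# Hypothetical word-piece ids for a three-token tweet.
ids, mask = pad_and_mask([2001, 2002, 2003], max_len=8)
print(ids)   # [101, 2001, 2002, 2003, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is what lets BERT ignore the padded positions, which is why it is passed along with the input ids in every batch.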

## Create the DataLoader

In [None]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 64
training_split = .75

def get_dataloader(input_ids, attention_masks, labels, training_split=training_split, batch_size=batch_size):
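    # Wrap the tensors in a TensorDataset and iterate over them in fixed-size
    # batches. SequentialSampler keeps the original row order, so the extracted
    # embeddings stay aligned with the rows of df_tweets.
    # Note: training_split is currently unused; the holdout split is done
    # later with train_test_split.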

    data = TensorDataset(input_ids, attention_masks, labels)
    sampler = SequentialSampler(data)
    dataloader = DataLoader(data, sampler=sampler, batch_size=batch_size)

    return dataloader

In [None]:
dataloader = get_dataloader(input_ids, attention_masks, labels)

## Custom Classes
To implement the model, we define our own BERT class based on
`BertForSequenceClassification`, named `BertEmbeddingVectors`. \
Its purpose is to return the BERT embedding of each tweet instead of class logits. \
We then feed these vectors to the SVM model for binary classification.

In [None]:
from transformers import BertForSequenceClassification

class BertEmbeddingVectors(BertForSequenceClassification):
    """
        A model that extracts BERT embeddings for oversampling and SVM
        classification.

        The constructor expects a transformers.BertConfig object.
    """

    def __init__(self, config):
        # Call the constructor of the Hugging Face
        # 'BertForSequenceClassification' class, which does all of the
        # BERT-related setup. The resulting BERT model is stored in 'self.bert'.
        super().__init__(config)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,          # Accepted for API compatibility; unused here.
        class_weights=None,   # Unused.
        output_attentions=None,
        output_hidden_states=None):

        # Run the text through the BERT encoder. Invoking 'self.bert' returns
        # the outputs of the encoding layers, not of the final classifier.
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states)

        # outputs[0] - The output embeddings of all tokens.
        # outputs[1] - The [CLS] token embedding after BERT's additional
        #              "pooling" layer.
        cls = outputs[1]

        # Apply the classification-head dropout to the pooled embedding.
        # Dropout is the identity while the model is in eval mode.
        cls = self.dropout(cls)

        # Return the embeddings as a NumPy array for scikit-learn.
        return cls.detach().cpu().numpy()
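The "pooling" behind `outputs[1]` is a dense layer followed by `tanh`, applied to the final hidden state of the `[CLS]` token. The sketch below writes that step out with plain Python lists. The function name `bert_pooler` and the toy 3-dimensional weights are assumptions for illustration only; the real model uses learned 768-dimensional parameters.

```python
import math

def bert_pooler(cls_hidden, weight, bias):
    """pooled = tanh(W @ h_cls + b), written out with plain lists."""
    return [
        math.tanh(sum(w * h for w, h in zip(row, cls_hidden)) + b)
        for row, b in zip(weight, bias)
    ]

# Toy 3-dimensional stand-ins; the real model uses 768 dimensions.
h_cls = [0.5, -1.0, 2.0]
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
b = [0.0, 0.0, 0.0]

pooled = bert_pooler(h_cls, W, b)
print(pooled)  # every value lies strictly between -1 and 1
```

Because of the `tanh`, every feature handed to the SVM is bounded in (-1, 1), which is convenient for a linear kernel that is sensitive to feature scale.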

### Load Model

In this section, we load Google's pretrained `bert-base-uncased` weights into our custom BERT class.

First, select the GPU as the PyTorch device if one is available.

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#device = torch.device("cpu")
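# Report the GPU name; fall back to "cpu" when no GPU is present.
torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu"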

'Tesla T4'

In [None]:
from transformers import BertConfig

# We'll need to use a "BertConfig" object from the transformers library
# to specify our parameters.
config = BertConfig.from_pretrained(
          'bert-base-uncased',
          num_labels=2)

model = BertEmbeddingVectors.from_pretrained(
        'bert-base-uncased',
        config=config)

# Move the model to the selected device (the GPU when one is available)
desc = model.to(device)

Some weights of BertEmbeddingVectors were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## BERT Embeddings for SVM

In [None]:
def get_embeddings(dataloader):
    X, y = [], []

    # Inference only: disable dropout and gradient tracking.
    model.eval()
    with torch.no_grad():
        for batch in dataloader:
            b_input_ids = batch[0].to(device)
            b_input_mask = batch[1].to(device)

            # The custom forward() returns the pooled [CLS] embeddings
            # as a NumPy array of shape (batch_size, 768).
            cls_head = model(b_input_ids,
                             token_type_ids=None,
                             attention_mask=b_input_mask)

            X.extend(cls_head)
            # The labels are not needed on the GPU, so they stay on the CPU.
            y.extend(batch[2].numpy())

    return X, y

In [None]:
X, y = get_embeddings(dataloader)

In [None]:
X, y = np.asarray(X), np.asarray(y)
X.shape, y.shape

((22830, 768), (22830,))

## SVM Holdout Training

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
from sklearn.svm import SVC

svm_model = SVC(kernel='linear', verbose=True)
svm_model.fit(X_train, y_train)

[LibSVM]

In [None]:
y_pred = svm_model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.metrics import geometric_mean_score

# Some of these metrics come back as plain Python floats, which have no
# .round() method; use the built-in round(value, 2) to shorten them.
print(f"\nVal. Acc:{accuracy_score(y_test, y_pred)}," \
            f" Prec:{precision_score(y_test, y_pred)}," \
            f" Rec:{recall_score(y_test, y_pred)}," \
            f" F1:{f1_score(y_test, y_pred)}," \
            f" F1-micro:{f1_score(y_test, y_pred, average='micro')}," \
            f" F1-macro:{f1_score(y_test, y_pred, average='macro')}," \
            f" F1-weighted:{f1_score(y_test, y_pred, average='weighted')}," \
            f" G-mean:{geometric_mean_score(y_test, y_pred)}")


Val. Acc:0.8332165381920112, Prec:0.7485207100591716, Rec:0.22589285714285715, F1:0.34705075445816186, F1-micro:0.8332165381920112, F1-macro:0.625724614023618, F1-weighted:0.7950380241450268, G-mean:0.4708586120530516
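The gap between the high accuracy and the low recall is easier to see from the underlying confusion counts. The notebook does not print them; the counts below are inferred from the reported precision, recall, and accuracy, assuming a 5,708-row test split (25% of 22,830 rows, rounded up). Treat this as a worked check rather than as notebook output.

```python
import math

# Counts inferred from the reported ratios (an assumption, not printed output).
tp, fp, fn, tn = 253, 85, 867, 4503

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)          # sensitivity: recall of class 1
specificity = tn / (tn + fp)     # recall of class 0
f1 = 2 * precision * recall / (precision + recall)
g_mean = math.sqrt(recall * specificity)

print(f"Acc:{accuracy:.4f} Prec:{precision:.4f} Rec:{recall:.4f} "
      f"F1:{f1:.4f} G-mean:{g_mean:.4f}")
# Acc:0.8332 Prec:0.7485 Rec:0.2259 F1:0.3471 G-mean:0.4709
```

Under these counts the SVM finds only 253 of the 1,120 positive tweets, while the large negative class (specificity of about 0.98) keeps accuracy above 0.83. This is why F1 and G-mean describe the holdout result better than accuracy does.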


## SVM 5-fold Cross-validation

In [None]:
X.shape, y.shape

((22830, 768), (22830,))

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from imblearn.metrics import geometric_mean_score

gm_scorer = make_scorer(geometric_mean_score, greater_is_better=True)

scoring = {'accuracy': 'accuracy', 'precision': 'precision', 'recall': 'recall', 'f1': 'f1', 'f1_micro': 'f1_micro', 'f1_macro': 'f1_macro', 'f1_weighted': 'f1_weighted', 'g-mean': gm_scorer}

svm_cross_validation = SVC(kernel='linear')
cv_results = cross_validate(svm_cross_validation, X, y, scoring=scoring, cv=5, verbose=3)

[CV] END  accuracy: (test=0.837) f1: (test=0.351) f1_macro: (test=0.629) f1_micro: (test=0.837) f1_weighted: (test=0.800) g-mean: (test=0.476) precision: (test=0.735) recall: (test=0.231) total time= 2.5min
[CV] END  accuracy: (test=0.835) f1: (test=0.329) f1_macro: (test=0.618) f1_micro: (test=0.835) f1_weighted: (test=0.796) g-mean: (test=0.456) precision: (test=0.746) recall: (test=0.211) total time= 2.4min
[CV] END  accuracy: (test=0.834) f1: (test=0.320) f1_macro: (test=0.613) f1_micro: (test=0.834) f1_weighted: (test=0.793) g-mean: (test=0.447) precision: (test=0.748) recall: (test=0.203) total time= 2.4min
[CV] END  accuracy: (test=0.836) f1: (test=0.337) f1_macro: (test=0.622) f1_micro: (test=0.836) f1_weighted: (test=0.797) g-mean: (test=0.463) precision: (test=0.746) recall: (test=0.218) total time= 2.4min
[CV] END  accuracy: (test=0.837) f1: (test=0.348) f1_macro: (test=0.628) f1_micro: (test=0.837) f1_weighted: (test=0.800) g-mean: (test=0.471) precision: (test=0.756) recal
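As an illustration of what 5-fold splitting does, the sketch below partitions the 22,830 sample indices into five contiguous folds, each used once as the test set. Note the assumption: with `cv=5` and a classifier, scikit-learn uses a stratified split that also preserves the class proportions in each fold. The hypothetical helper `kfold_indices` shows only the simpler unstratified partitioning.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop

for fold, (train_idx, test_idx) in enumerate(kfold_indices(22830, 5), start=1):
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
# every fold: train=18264 test=4566
```

Each fold therefore trains the SVM on 18,264 embeddings and scores it on the remaining 4,566, which matches the roughly equal per-fold run times in the log.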

In [None]:
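# Index 4 selects the fifth (last) fold only; use cv_results[x].mean()
# to average a metric across all five folds.
for x in cv_results:
    print(f"{x}: {cv_results[x][4].round(2)}")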

fit_time: 643.76
score_time: 62.76
test_accuracy: 0.75
test_precision: 0.74
test_recall: 0.75
test_f1: 0.75
test_f1_micro: 0.75
test_f1_macro: 0.75
test_f1_weighted: 0.75
test_g-mean: 0.75
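Cross-validation results are usually summarized as a mean and spread over all folds rather than read from a single fold. The sketch below does this for three metrics, using the per-fold test scores copied from the `[CV] END` log lines. The dictionary `fold_scores` is a hand-built stand-in for illustration; with the real `cv_results` the equivalent would be, for example, `cv_results['test_f1'].mean()`.

```python
from statistics import mean, stdev

# Per-fold test scores copied from the `[CV] END` log lines.
fold_scores = {
    "accuracy": [0.837, 0.835, 0.834, 0.836, 0.837],
    "f1":       [0.351, 0.329, 0.320, 0.337, 0.348],
    "g-mean":   [0.476, 0.456, 0.447, 0.463, 0.471],
}

# Report the mean and the sample standard deviation across the five folds.
for name, scores in fold_scores.items():
    print(f"{name}: {mean(scores):.3f} +/- {stdev(scores):.3f}")
```

A small standard deviation across folds suggests the holdout figures are not an artifact of one particular split.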
