<a href="https://colab.research.google.com/github/SumitNawathe/HateSpeechModel/blob/main/Final_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Accessing Data**

Imports Needed For Modeling

In [None]:
# pytorch, transformers, upgraded pandas, numpy, and pickle
!pip install torch
!pip install pytorch-lightning
!pip install transformers
!pip install --upgrade pandas
!pip3 install pickle5
!pip install -U sentence-transformers
import pickle5 as pickle
import pandas as pd
import numpy as np

from tqdm.auto import tqdm

# neural networks library (to use linear layers)
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# to wrap data in pytorch lightning datasets while also logging data from model
import pytorch_lightning as pl
from torchmetrics.functional import accuracy, f1, auroc
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
from pytorch_lightning.loggers import TensorBoardLogger
from transformers import AdamW

# scikit-learn to split the data and report classification metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, multilabel_confusion_matrix

# plotting library
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc

%matplotlib inline
%config InlineBackend.figure_format='retina'

RANDOM_SEED = 42

sns.set(style='whitegrid', palette='muted', font_scale=1.2)
HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]
sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))
rcParams['figure.figsize'] = 12, 8

pl.seed_everything(RANDOM_SEED)

# to configure LaBSE model from HuggingFace
import gc
from sentence_transformers import SentenceTransformer

Accessing Saved Preprocessed File

In [None]:
with open('hatespeech_df_encodedv2.pickle', 'rb') as hatespeech_df_encodedv2_file:
  hatespeech_df = pickle.load(hatespeech_df_encodedv2_file)

In [None]:
# Static Variables
MAX_TOKEN_COUNT = 768
MODEL_NAME = 'sentence-transformers/LaBSE'
LABEL_COLUMNS = ['race', 'asian', 'black', 'immigrant', 'other_race', 'religion', 
                 'jew', 'muslim', 'gender', 'women', 'lgbt', 'disability', 'not_hate']

In [None]:
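# Splitting the data into training (90%) and validation (10%) sets;
# random_state pins the split so it does not depend on cell execution order
train_df, val_df = train_test_split(hatespeech_df, test_size=0.10, random_state=RANDOM_SEED)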
train_df.shape, val_df.shape

((20749, 16), (2306, 16))
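
The two row counts above follow from how a fractional `test_size` is rounded. The sketch below is only an arithmetic check, assuming scikit-learn's behaviour of rounding the test share up and giving the remainder to training; the helper name `split_sizes` is ours and not part of any library.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size (hypothetical helper)."""
    n_test = math.ceil(test_size * n_samples)  # the test share is rounded up
    n_train = n_samples - n_test               # training receives the remainder
    return n_train, n_test


# total row count taken from the two shapes printed above
n_total = 20749 + 2306
n_train, n_test = split_sizes(n_total, 0.10)
print(n_train, n_test)
```

With 23,055 rows, 10% is 2,305.5, so the validation set gets 2,306 rows and the training set the remaining 20,749.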

# **Boilerplate for Model**

Wrapping the data loading in a PyTorch Dataset, along with converting the labels to tensors:

In [None]:
class ToxicCommentsDataset(Dataset):
  def __init__(
      self,
      data: pd.DataFrame,
      max_token_len: int = MAX_TOKEN_COUNT
  ):
    self.data = data
    self.max_token_len = max_token_len
  
  def __len__(self):
    return len(self.data)
  
  # called whenever the dataset is indexed with [], e.g. dataset[3]
  def __getitem__(self, index: int):
    data_row = self.data.iloc[index]
    text = data_row.text
    encoded = data_row.encoding
    labels = data_row[LABEL_COLUMNS]

    # returns the text, its precomputed encoding, and its labels together as a dict
    return dict(
        text = text,
        encoded = encoded,
        labels = torch.FloatTensor(labels)
    )
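
To make the indexing protocol concrete, here is a minimal sketch of the same pattern without pandas or PyTorch. `TinyDataset` and its rows are made up for illustration, and plain lists stand in for the tensors the real class returns.

```python
class TinyDataset:
    """Hypothetical stand-in for a map-style dataset: only __len__ and __getitem__."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        # number of samples; a data loader uses this to know the valid index range
        return len(self.rows)

    def __getitem__(self, index):
        # invoked by dataset[index]; returns one sample as a dict
        row = self.rows[index]
        return dict(
            text=row["text"],
            encoded=row["encoding"],
            labels=[float(x) for x in row["labels"]],
        )


# made-up rows: a short "encoding" and two labels each
rows = [
    {"text": "first", "encoding": [0.1, 0.2], "labels": [1, 0]},
    {"text": "second", "encoding": [0.3, 0.4], "labels": [0, 1]},
]
dataset = TinyDataset(rows)
sample = dataset[1]  # same as dataset.__getitem__(1)
print(len(dataset), sample["text"], sample["labels"])
```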

We'll wrap our custom dataset into a LightningDataModule.

ToxicCommentDataModule encapsulates all data loading logic and returns the necessary data loaders.

In [None]:
class ToxicCommentDataModule(pl.LightningDataModule):
  def __init__(self, train_df, test_df, batch_size=10):
    super().__init__()
    self.batch_size = batch_size
    self.train_df = train_df
    self.test_df = test_df
  
  # sets up datasets from raw data
  def setup(self, stage=None):
    self.train_dataset = ToxicCommentsDataset(self.train_df)
    self.test_dataset = ToxicCommentsDataset(self.test_df)
  
  # each method below returns a DataLoader, an iterable that yields the dataset in batches
  def train_dataloader(self):
    return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True, num_workers=0, drop_last=True)
  
  def val_dataloader(self):
    return DataLoader(self.test_dataset, batch_size=self.batch_size, num_workers=0, drop_last=True)
  
  def test_dataloader(self):
    return DataLoader(self.test_dataset, batch_size=self.batch_size, num_workers=0, drop_last=True)

Creating an instance of our data module:

In [None]:
# Train for at most 15 epochs (early stopping may end sooner), with 10 samples per batch
N_EPOCHS = 15
BATCH_SIZE = 10

# Create a data module (to return dataloaders)
data_module = ToxicCommentDataModule(
  train_df,
  val_df,
  batch_size=BATCH_SIZE,
)
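
Because every loader in the data module sets `drop_last=True`, an incomplete final batch is discarded. The sketch below works out what that means for the split sizes reported earlier; `batches_per_epoch` is a hypothetical helper, not a PyTorch function.

```python
def batches_per_epoch(n_samples, batch_size, drop_last):
    """Number of batches a loader yields per pass (hypothetical helper)."""
    full, remainder = divmod(n_samples, batch_size)
    if drop_last or remainder == 0:
        return full
    return full + 1  # keep the final, smaller batch


BATCH = 10
train_batches = batches_per_epoch(20749, BATCH, drop_last=True)
val_batches = batches_per_epoch(2306, BATCH, drop_last=True)

# samples that never reach the model in a given pass
skipped_train = 20749 - train_batches * BATCH
skipped_val = 2306 - val_batches * BATCH
print(train_batches, skipped_train, val_batches, skipped_val)
```

The training loader shuffles, so a different handful of rows is skipped each epoch. The validation loader does not shuffle, so the same trailing rows are left out of every validation pass.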

# **Model**

Our model feeds the precomputed LaBSE embeddings through five linear layers that map each 768-dimensional representation to one score per label. We'll pack everything in a LightningModule:

In [None]:
class ToxicCommentTagger(pl.LightningModule):
  def __init__(self, input_dim, n_classes):
    super().__init__()
    
    # Five linear layers
    self.classifier = nn.Sequential(
        nn.Linear(input_dim, 8192),  # input_dim is the embedding size (768 for LaBSE)
        nn.ReLU(inplace=True),
        nn.Linear(8192, 4096),
        nn.ReLU(inplace=True),
        nn.Linear(4096, 2048),
        nn.ReLU(inplace=True),
        nn.Linear(2048, 1024),
        nn.ReLU(inplace=True),
        nn.Linear(1024, n_classes),
    )

    # Binary Cross Entropy Loss used for model
    self.criterion = nn.BCELoss()
  
  # Defining how the model will be run
  def forward(self, encoded, labels=None):
    output = self.classifier(encoded)
    output = torch.sigmoid(output)    
    loss = 0
    if labels is not None:
        loss = self.criterion(output, labels)
    return loss, output
  
  # each step below runs one batch (training, validation, or test) through the model and logs its loss

  def training_step(self, batch, batch_idx):
    encoded = batch["encoded"]
    labels = batch["labels"]
    loss, outputs = self(encoded, labels)
    self.log("train_loss", loss, prog_bar=True, logger=True)
    return {"loss": loss, "predictions": outputs, "labels": labels}
  
  def validation_step(self, batch, batch_idx):
    encoded = batch["encoded"]
    labels = batch["labels"]
    loss, outputs = self(encoded, labels)
    self.log("val_loss", loss, prog_bar=True, logger=True)
    return loss
  
  def test_step(self, batch, batch_idx):
    encoded = batch["encoded"]
    labels = batch["labels"]
    loss, outputs = self(encoded, labels)
    self.log("test_loss", loss, prog_bar=True, logger=True)
    return loss
  
  # evaluates results of model at end of epoch
  def training_epoch_end(self, outputs):
    # boilerplate to get labels and predictions out of model pipeline
    labels = []
    predictions = []
    for output in outputs:
      # detach() removes the tensor from the autograd graph, cpu() moves it to the CPU
      for out_labels in output["labels"].detach().cpu():
        labels.append(out_labels)
      for out_predictions in output["predictions"].detach().cpu():
        predictions.append(out_predictions)
    
    labels = torch.stack(labels).int()
    predictions = torch.stack(predictions)

    for i, name in enumerate(LABEL_COLUMNS):
      if name not in ['race', 'religion', 'gender', 'disability', 'not_hate']:
        continue
      # auroc = area under the receiver operating characteristic curve,
      # metric used to evaluate classification models
      class_roc_auc = auroc(predictions[:, i], labels[:, i])
      # logs results
      self.logger.experiment.add_scalar(f"{name}_roc_auc/Train", class_roc_auc, self.current_epoch)
  
  # uses the AdamW optimizer with a fixed learning rate (no learning-rate scheduler is configured)
  def configure_optimizers(self):
    optimizer = AdamW(self.parameters(), lr=5e-5)
    return dict(
        optimizer=optimizer
    )
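
The `forward` method applies a sigmoid to each output and scores the result with binary cross-entropy, treating every label as an independent yes/no decision. The sketch below reproduces that calculation in plain Python for one sample with three made-up logits and targets. It assumes the default mean reduction of `nn.BCELoss`; the helper names are ours.

```python
import math


def sigmoid(z):
    """Squash a raw score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(probs, targets):
    """Mean binary cross-entropy over all label positions."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)


# made-up raw scores and targets for three labels
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]

probs = [sigmoid(z) for z in logits]
loss = bce_loss(probs, targets)
print(round(loss, 4))
```

Confident correct predictions (the first two labels) contribute little to the loss, while the undecided third label (probability 0.5) contributes the most.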

Creating an instance of our model:

In [None]:
model = ToxicCommentTagger(
  input_dim=768,
  n_classes=len(LABEL_COLUMNS),
)
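
As a sanity check on the size of this classifier, the parameter count can be worked out by hand: each linear layer holds `n_in * n_out` weights plus `n_out` biases. The sketch below assumes 13 output classes (the length of `LABEL_COLUMNS`) and 4 bytes per float32 parameter.

```python
# layer widths of the classifier: input, four hidden layers, 13 outputs
layer_sizes = [768, 8192, 4096, 2048, 1024, 13]


def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


total_params = sum(
    linear_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)
size_mb = total_params * 4 / 1e6  # float32 = 4 bytes per parameter

print(total_params, round(size_mb, 3))
```

This comes to roughly 50.4 million parameters and about 201 MB, which matches the model summary printed when training starts. About two thirds of the parameters sit in the single 8192-to-4096 layer.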

# **Training the Model**

Clearing current loggers and checkpoints to log new data

In [None]:
!rm -rf lightning_logs/
!rm -rf checkpoints/
%load_ext tensorboard
%tensorboard --logdir ./lightning_logs

Setting up a checkpoint callback that keeps the model with the lowest validation loss, a TensorBoard logger, and early stopping

In [None]:
# checkpointing saves best model based on validation loss
checkpoint_callback = ModelCheckpoint(
  dirpath="checkpoints",
  filename="best-checkpoint",
  save_top_k=1,
  verbose=True,
  monitor="val_loss",
  mode="min"
)

# progress logged in TensorBoard
logger = TensorBoardLogger("lightning_logs", name="toxic-comments")

# if the validation loss doesn't improve for 2 consecutive epochs, stop early
early_stopping_callback = EarlyStopping(monitor='val_loss', patience=2)
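
The interaction between the two callbacks can be sketched in a few lines: training stops once the validation loss has failed to improve for `patience` consecutive epochs, and the checkpoint kept is the one from the epoch with the lowest loss. This is a simplified illustration, not Lightning's implementation (it ignores options such as `min_delta`), and the loss values are made up.

```python
def simulate_training(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # new best: this is the epoch whose checkpoint would be kept
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # early stop
    return len(val_losses) - 1, best_epoch  # ran through every epoch


# made-up validation losses, one per epoch
made_up_losses = [0.30, 0.25, 0.24, 0.26, 0.27, 0.20]
stop_epoch, best_epoch = simulate_training(made_up_losses, patience=2)
print(stop_epoch, best_epoch)
```

In this example the run stops at epoch 4 and keeps the checkpoint from epoch 2, so the lower loss at epoch 5 is never reached. That is the trade-off of a small `patience`.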

Starting Training Process

In [None]:
trainer = pl.Trainer(
  logger=logger,
  # pass the ModelCheckpoint through `callbacks`; the `checkpoint_callback=`
  # argument is deprecated as of v1.5
  callbacks=[checkpoint_callback, early_stopping_callback],
  max_epochs=N_EPOCHS,
  gpus=1,
  progress_bar_refresh_rate=30
)

  f"Setting `Trainer(checkpoint_callback={checkpoint_callback})` is deprecated in v1.5 and will "
  f"Setting `Trainer(progress_bar_refresh_rate={progress_bar_refresh_rate})` is deprecated in v1.5 and"
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs


In [None]:
trainer.fit(model, data_module)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Missing logger folder: lightning_logs/toxic-comments

  | Name       | Type       | Params
------------------------------------------
0 | classifier | Sequential | 50.4 M
1 | criterion  | BCELoss    | 0     
------------------------------------------
50.4 M    Trainable params
0         Non-trainable params
50.4 M    Total params
201.441   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

  "Trying to infer the `batch_size` from an ambiguous collection. The batch size we"
Global seed set to 42


Training: 0it [00:00, ?it/s]

  f"One of the returned values {set(extra.keys())} has a `grad_fn`. We will detach it automatically"


Validating: 0it [00:00, ?it/s]

Loading the model from the best checkpoint (the one with the lowest validation loss)

In [None]:
trained_model = ToxicCommentTagger.load_from_checkpoint(
  trainer.checkpoint_callback.best_model_path,
  input_dim=768,
  n_classes=len(LABEL_COLUMNS)
)
trained_model.eval()
trained_model.freeze()

Saving best model and continuing into evaluation

In [None]:
with open('trained_model.pickle', 'wb') as trained_model_file:
  pickle.dump(trained_model, trained_model_file)