<a href="https://colab.research.google.com/github/Jordy-VL/document-classification-exps/blob/master/BERT_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT for document classification
*Welcome to the BERT tutorial*


## Installation
*First, install the necessary packages in our Colab environment.*

In [None]:
!pip3 install tqdm matplotlib scikit-learn keras
!pip3 install torch  # Colab normally ships a CUDA-enabled build already
!pip3 install transformers #pytorch_transformers
# https://colab.research.google.com/drive/1uvHuizCBqFgvbCwEhK7FvU8JW0AfxgJw install custom module

### Enter free GPU
Let's fire up a free GPU by going to "Edit > Notebook settings" and choosing "GPU" as the hardware accelerator. The command below prints some statistics on the device.

In [None]:
!nvidia-smi
#for TPUs: https://colab.research.google.com/drive/1M8uYeHHQjmomsSEZJ6NNtfpEL_hPzcpq#scrollTo=AoJ4XQWoHbIB 

### Mount google drive 
(required for saving model)

In [None]:
#https://medium.com/@ml_kid/how-to-save-our-model-to-google-drive-and-reuse-it-2c1028058cb2 
from google.colab import drive
drive.mount('/content/gdrive')
path = '/content/gdrive/My Drive'
# IPython substitutes the Python variable `path` for "$path" in shell commands
!mkdir -p "$path"
!ls "$path"

#bonus: import module from Google Drive
"""
import sys
sys.path.insert(0, '/content/gdrive/My Drive/Colab Notebooks/my_modules')
from woef import main as waf
"""

In [None]:
# helper imports
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler, Dataset
from tqdm import tqdm, trange
import torch.nn.functional as F
import torch
#from evaluate import *

Let's start with some hardcoded values so that every run follows the same strategy and gives reproducible results.

In [None]:
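SEED = 100
np.random.seed(SEED)
torch.manual_seed(SEED)  # also seed PyTorch (classifier head init, dropout, RandomSampler)
sample = 0 # set to XXX in order to perform input sampling on train/val/test (enabling dryrun mode)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    # get_device_name raises when no GPU is attached, so only call it when one is present
    print(torch.cuda.get_device_name(0))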
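
A notebook like this one draws random numbers from several libraries (Python's `random`, NumPy, PyTorch), so it can help to seed them all in one place. The helper below is a minimal sketch; `set_seed` is a hypothetical name that is not part of the original notebook, and NumPy and PyTorch are only seeded when they are installed.

```python
import random


def set_seed(seed: int) -> None:
    """Seed every random number generator that is available in this runtime."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed: nothing to seed
    try:
        import torch
        torch.manual_seed(seed)           # CPU generator
        torch.cuda.manual_seed_all(seed)  # all GPUs; silently ignored without CUDA
    except ImportError:
        pass  # PyTorch not installed: nothing to seed


set_seed(100)
first = random.random()
set_seed(100)
second = random.random()
print(first == second)  # True
```

Seeding makes runs repeatable on the same hardware and library versions; some GPU kernels can still introduce small run-to-run differences.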

## Loading train/test documents
### 20-News Corpus 


Below we fetch the 20 Newsgroups corpus as a `Bunch` object from sklearn and set up X (features) and y (labels) for train and test. To make sure every split goes through the same preprocessing steps, we store the splits in a dictionary keyed by identifier. From experience, this helps to reduce code duplication.

In [None]:
identifiers = ["train", "val", "test"]
data = {identifier: {} for identifier in identifiers}
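
To illustrate why the per-identifier dictionary reduces duplication, here is a small sketch of one helper that applies the same step to every split that is present. The name `apply_to_splits` and the toy data are assumptions made for illustration only.

```python
def apply_to_splits(data, source_key, target_key, fn, splits=("train", "val", "test")):
    """Apply fn to every item of data[split][source_key] and store the result under target_key.

    Splits that do not (yet) contain source_key are skipped.
    """
    for split in splits:
        if source_key in data.get(split, {}):
            data[split][target_key] = [fn(item) for item in data[split][source_key]]
    return data


# Toy data in the same shape as the `data` dictionary used in this notebook
toy = {"train": {"X": ["Hello World", "BERT"]}, "val": {}, "test": {"X": ["Test Doc"]}}
apply_to_splits(toy, "X", "lower", str.lower)
print(toy["train"]["lower"])  # ['hello world', 'bert']
print(toy["test"]["lower"])   # ['test doc']
```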

In [None]:
##sample dataset
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train', categories=None,
                                shuffle=True, random_state=42)
data_test = fetch_20newsgroups(subset='test', categories=None,
                               shuffle=True, random_state=42)

# order of labels in `target_names` can be different from `categories`
labels = data_train.target_names
num_labels = len(labels)

data["train"]["X"], data["test"]["X"] = data_train.data, data_test.data
data["train"]["y"], data["test"]["y"] = data_train.target, data_test.target

We can also load one of our proprietary datasets and feed it through the same preprocessing.

In [None]:
## real dataset
# https://github.com/ctberthiaume/gdcp
from google.colab import files
#files.upload()

#optionally move it to gdrive
datapath = "XXX.csv"
import pandas as pd
df = pd.read_csv(datapath)

Only run this cell if you use a custom dataset; adjust as required.

In [None]:
#print(df.head())
#print(df.columns.tolist())
tag = "Doctype"
num_labels = len(df[tag].unique())
labels = sorted(df[tag].unique().tolist())
mapping = {k:i for i,k in enumerate(sorted(df[tag].unique().tolist()))}
df[tag] = df[tag].apply(lambda x: mapping[x])
data["train"]["X"], data["test"]["X"], data["train"]["y"], data["test"]["y"] = train_test_split(df["unprep"].values, df[tag].values,
                                                                                                        random_state=SEED, test_size=0.3)
# data["train"]["X"], data["val"]["X"], data["train"]["y"], data["val"]["y"] = train_test_split(data["train"]["X"], data["train"]["y"],
#                                                                                                         random_state=SEED, test_size=0.1)
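
The label mapping above turns class names into integer ids. Keeping the inverse mapping around makes it easy to turn predicted ids back into readable names later. A minimal sketch with toy labels; `label2id` and `id2label` are illustrative names, not variables from the original cell.

```python
raw_labels = ["invoice", "contract", "invoice", "letter"]  # toy values

# Sort the class names so that the id assignment is stable across runs
classes = sorted(set(raw_labels))
label2id = {name: idx for idx, name in enumerate(classes)}
id2label = {idx: name for name, idx in label2id.items()}

encoded = [label2id[name] for name in raw_labels]
decoded = [id2label[idx] for idx in encoded]

print(encoded)                # [1, 0, 1, 2]
print(decoded == raw_labels)  # True
```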

In [None]:
if sample:
    for l in ["train", "test"]:
        for k in ["X", "y"]:
            data[l][k] = data[l][k][:sample+1]

# Set the maximum sequence length. 
# In the original paper, the authors used a length of 512.
MAX_LEN = 256
# Optional function to find MAX_LEN: 
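
One possible way to choose `MAX_LEN` is to look at the distribution of token counts and pick the smallest length that covers most documents, capped at the 512 positions that `bert-base-uncased` supports. The function below is a sketch under that assumption; `suggest_max_len` and its parameters are hypothetical, and the counts would in practice come from something like `len(tokenizer.tokenize(text))` per document.

```python
import math


def suggest_max_len(token_counts, coverage=0.95, hard_limit=512, special_tokens=2):
    """Return the smallest length that fits `coverage` of the documents, capped at hard_limit.

    special_tokens reserves room for [CLS] and [SEP].
    """
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    ordered = sorted(token_counts)
    # Index of the document that sits at the requested coverage quantile
    index = max(0, math.ceil(coverage * len(ordered)) - 1)
    return min(ordered[index] + special_tokens, hard_limit)


counts = [12, 40, 85, 130, 300, 900]  # toy token counts
half = suggest_max_len(counts, coverage=0.5)
full = suggest_max_len(counts, coverage=1.0)
print(half)  # 87
print(full)  # 512
```

A longer `MAX_LEN` keeps more of each document but increases memory use and training time, so the coverage value is a trade-off rather than a fixed rule.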

## Loading pre-trained BERT weights and tokenizer

In [None]:
import torch
from transformers import BertForSequenceClassification, BertTokenizer,AdamW, get_linear_schedule_with_warmup, AutoTokenizer, AutoConfig, AutoModelForSequenceClassification
model_class, tokenizer_class, pretrained_weights = BertForSequenceClassification, BertTokenizer, "bert-base-uncased" #'bert-base-multilingual-cased' #allenai/longformer-base-4096
#tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model_class = "bert-base-uncased"  # from here on `model_class` holds the checkpoint name (a string), e.g. 'allenai/longformer-base-4096'
tokenizer = AutoTokenizer.from_pretrained(model_class)

## Tokenization - Max Length Trim/Pad + Encode - BatchLoader -> train/val/test  

In [None]:
for l in ["train", "test"]:
    print("l: ", l)
    # BERT expects [CLS] ... [SEP] around every sequence, so reserve two positions for them
    data[l]["tokenized"] = [[tokenizer.cls_token] + tokenizer.tokenize(text)[:MAX_LEN - 2] + [tokenizer.sep_token]
                            for text in tqdm(data[l]["X"])]
    # Map tokens to vocabulary ids and pad every sequence with 0s up to MAX_LEN
    data[l]["input_ids"] = pad_sequences([tokenizer.convert_tokens_to_ids(tokenized) for tokenized in data[l]["tokenized"]],
                                         maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
    # Create attention masks
    data[l]["masks"] = []
    # Create a mask of 1s for each token followed by 0s for padding (the [PAD] id is 0)
    for seq in data[l]["input_ids"]:
        seq_mask = [float(i > 0) for i in seq]
        data[l]["masks"].append(seq_mask)
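
The trim/pad and masking steps can be hard to picture from the library calls alone. The sketch below reproduces the idea in plain Python for a single sequence; `pad_and_mask` is a hypothetical helper, and the ids are toy values (101 and 102 are `[CLS]` and `[SEP]` in `bert-base-uncased`, 0 is `[PAD]`).

```python
def pad_and_mask(ids, max_len, pad_id=0):
    """Truncate or right-pad a list of token ids and build the matching attention mask."""
    ids = ids[:max_len]                               # "post" truncation
    n_pad = max_len - len(ids)
    input_ids = ids + [pad_id] * n_pad                # "post" padding
    attention_mask = [1] * len(ids) + [0] * n_pad     # 1 = real token, 0 = padding
    return input_ids, attention_mask


ids, mask = pad_and_mask([101, 7592, 2088, 102], max_len=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The mask tells the model which positions to ignore in self-attention, so padding does not influence the prediction.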

In [None]:
# Split ids, masks and labels in a single call so that the three stay aligned
(data["train"]["inputs"], data["val"]["inputs"],
 data["train"]["masks"], data["val"]["masks"],
 data["train"]["y"], data["val"]["y"]) = train_test_split(
    data["train"]["input_ids"], data["train"]["masks"], data["train"]["y"],
    random_state=SEED, test_size=0.1)
data["test"]["inputs"] = data["test"]["input_ids"]

# TORCHify arrays and matrices
for l in ["train", "val", "test"]:
    print(l)
    for k in ["inputs", "y", "masks"]:
        print(k)
        data[l][k] = torch.tensor(data[l][k])

In [None]:
# LOADERs
# Select a batch size for training. For fine-tuning BERT on a specific task, the authors recommend a batch size of 16 or 32;
# we use 8 here, which lowers GPU memory use.
BATCH_SIZE = 8

# Wrap our tensors in a torch DataLoader. The tensors themselves already sit in CPU memory; the loader yields one
# mini-batch at a time, so only the current batch has to be moved to the GPU during training
train_data = TensorDataset(data["train"]["inputs"], data["train"]["masks"], data["train"]["y"])
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

val_data = TensorDataset(data["val"]["inputs"], data["val"]["masks"], data["val"]["y"])
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(
    val_data, sampler=val_sampler, batch_size=BATCH_SIZE)

test_data = TensorDataset(data["test"]["inputs"], data["test"]["masks"], data["test"]["y"])
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(
    test_data, sampler=test_sampler, batch_size=BATCH_SIZE)

## Training & Optimization params

In [None]:
# Number of training epochs (authors recommend between 2 and 4)
EPOCHS = 2
if sample: EPOCHS = 1
lr = 3e-5 #1e-6
max_grad_norm = 1.0
warmup_proportion = 0.1
num_total_steps = len(train_dataloader) * EPOCHS  # the scheduler is stepped once per batch, in every epoch
num_warmup_steps = int(num_total_steps * warmup_proportion)
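
To see what the warmup and total-step parameters do, the sketch below computes the learning-rate multiplier of a linear warmup followed by a linear decay. It mirrors my understanding of the factor that `get_linear_schedule_with_warmup` applies, but it is an illustration, not the library code; `linear_warmup_factor` is a hypothetical name and the step counts are toy values.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Then decay linearly to 0 at num_training_steps and stay there
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-5
total, warmup = 1000, 100
factors = {step: linear_warmup_factor(step, warmup, total) for step in (0, 50, 100, 550, 1000, 1500)}
for step, factor in factors.items():
    print(step, base_lr * factor)
```

Under this schedule the multiplier is 0 for every step at or beyond `num_training_steps`, so the total step count should cover all epochs; otherwise the later updates are taken with a learning rate of 0.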

## Fire up the model on the GPU

In [None]:
#model = model_class.from_pretrained(pretrained_weights, num_labels=num_labels)
config = AutoConfig.from_pretrained(model_class,num_labels=num_labels)
model = AutoModelForSequenceClassification.from_pretrained(model_class,config=config)
model.to(device)  # uses the GPU when available, otherwise falls back to the CPU
# To reproduce BertAdam specific behavior set correct_bias=False
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps = num_total_steps)

#### Evaluation code

In [None]:
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, brier_score_loss, precision_recall_curve,roc_curve,auc,roc_auc_score
from sklearn.metrics import matthews_corrcoef

def evaluation_measures(gold, predicted, target_names=None):
    acc = accuracy_score(gold,predicted)
    print('Accuracy [:( metric ]:', acc)
    if target_names:
        report = classification_report(gold, predicted, target_names=target_names)
        repdict = classification_report(gold, predicted, target_names=target_names, output_dict=True)
    else:
        report = classification_report(gold, predicted)
        repdict = classification_report(gold, predicted, output_dict=True)
    print('Classification report:')
    print(report)
    matrix = confusion_matrix(gold, predicted)
    print('Confusion matrix:')
    print(matrix)
    return repdict, matrix

def calc_uof_fp(points, thresh):
    stats = {}
    stats["pos_over"], stats["pos_under"], stats["neg_over"], stats["neg_under"] = 0, 0, 0, 0

    multiply = True
    for i, p, value, status, group in points:
        if value > 1:
            multiply = False
            break
    if multiply:
        points = [(i, p, 100*value, status, group) for i, p, value, status, group in points]

    for i, p, value, status, group in points:
        if status == True:
            if value >= thresh:
                stats["pos_over"] += 1
            else:
                stats["pos_under"] += 1
        else:
            if value >= thresh:
                stats["neg_over"] += 1
            else:
                stats["neg_under"] += 1
    return round((stats["pos_over"] + stats["neg_over"])/len(points), 4), round(stats["neg_over"]/max(1, ((stats["pos_over"] + stats["neg_over"]))), 2), round((stats["pos_over"])/len(points), 4)

def easy_calc_uof_fp(predict, probs, gold, thresh):
    """
    LIST all
    """
    total = len(gold)
    boolean = np.array([True if predict[i] == gold[i] else False for i in range(0, len(gold))])
    unique = sorted(list(set(sorted([int(x) for x in probs]))))
    points = list(zip(list(range(0, total)), predict, probs, boolean.tolist(),
                      [unique.index(int(prob)) for prob in probs]))
    return calc_uof_fp(points, thresh)

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
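
The three numbers returned by the thresholding helpers above are easier to read with a small worked example. The sketch below recomputes them in plain Python for a non-empty toy input: the share of predictions at or above the confidence threshold, the error rate among those, and the share of all predictions that are both confident and correct. `confidence_stats` is a hypothetical simplification, and confidences are assumed to be percentages.

```python
def confidence_stats(predicted, confidence, gold, threshold):
    """Return (coverage, error rate among kept predictions, confident-and-correct share)."""
    n = len(gold)
    # Keep only predictions whose confidence reaches the threshold; store whether each is correct
    kept = [p == g for p, c, g in zip(predicted, confidence, gold) if c >= threshold]
    kept_correct = sum(kept)
    coverage = len(kept) / n
    error_rate = (len(kept) - kept_correct) / max(1, len(kept))
    correct_share = kept_correct / n
    return round(coverage, 4), round(error_rate, 2), round(correct_share, 4)


pred = [0, 1, 2, 1]
conf = [95.0, 60.0, 85.0, 90.0]
gold = [0, 1, 1, 1]
stats_at_80 = confidence_stats(pred, conf, gold, threshold=80)
print(stats_at_80)  # (0.75, 0.33, 0.5)
```

Raising the threshold usually lowers the coverage and the error rate together, which is the trade-off these statistics are meant to expose.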

## Train - Validate Model & Save

In [None]:
# Store our loss and accuracy for plotting
train_loss_set = []

# trange is a tqdm wrapper around the normal python range
for _ in trange(EPOCHS, desc="Epoch"):

    # Training

    # Set our model to training mode (as opposed to evaluation mode)
    model.train()

    # Tracking variables
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0

    #from pdb import set_trace; set_trace()
    # Train the data for one epoch
    for step, batch in enumerate(train_dataloader):
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch

        # Forward pass
        outputs = model(b_input_ids, token_type_ids=None,
                        attention_mask=b_input_mask, labels=b_labels)
        loss = outputs[0]
        # print(loss)

        #from pdb import set_trace; set_trace()
        train_loss_set.append(loss.item())

        # Backward pass
        loss.backward()
        # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        # Update parameters and take a step using the computed gradient
        optimizer.step()
        # Update learning rate for next steps
        scheduler.step()
        # Clear out the gradients (by default they accumulate)
        optimizer.zero_grad()

        # Update tracking variables
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1

    print("Train loss: {}".format(tr_loss/nb_tr_steps))

    # Validation

    # Put model in evaluation mode to evaluate loss on the val set
    model.eval()

    # Tracking variables
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Evaluate data for one epoch
    for batch in val_dataloader:
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Telling the model not to compute or store gradients, saving memory and speeding up val
        with torch.no_grad():
            # Forward pass, calculate logit predictions
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
            logits = outputs[0] # PER BATCH!

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        tmp_eval_accuracy = flat_accuracy(logits, label_ids)

        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1

    print("val Accuracy: {}".format(eval_accuracy/nb_eval_steps))


# Now let's save our model and tokenizer to a directory
model.save_pretrained(path+'/models/')
tokenizer.save_pretrained(path+'/models/')
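
The gradient-clipping call in the training loop rescales all gradients together when their combined norm exceeds `max_grad_norm`. The plain-Python sketch below shows that idea on a flat list of numbers; it follows my reading of what `clip_grad_norm_` does and is not the PyTorch implementation. `clip_by_global_norm` is a hypothetical name.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a flat list of gradient values so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only ever scale down: the factor is capped at 1.0
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before)           # 5.0
print(round(norm_after, 3))  # 1.0

unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(unchanged)             # [0.3, 0.4]
```

Because every value is multiplied by the same factor, clipping changes the length of the update but not its direction.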

## Test - evaluate Model

In [None]:
# Prediction on test set
# Put model in evaluation mode
model.eval()

# Tracking variables
predictions, probs, true_labels = [], [], []

# Predict
for batch in test_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    # Telling the model not to compute or store gradients, saving memory and speeding up prediction
    with torch.no_grad():
        # Forward pass, calculate logit predictions
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
        logits = outputs[0]
        prob = F.softmax(logits, dim=1)
    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    prob = prob.detach().cpu().numpy()

    # Store predictions and true labels
    predictions.append(logits)
    true_labels.append(label_ids)
    probs.append(prob)

### Evaluation & Statistics

In [None]:
# Evaluate each test batch using the Matthews correlation coefficient
matthews_set = []
for i in range(len(true_labels)):
    matthews = matthews_corrcoef(true_labels[i],
                                np.argmax(predictions[i], axis=1).flatten())
    matthews_set.append(matthews)
#print(matthews_set)

#from pdb import set_trace; set_trace()
# Flatten the predictions and true values for an aggregate Matthews evaluation on the whole dataset
flat_true_labels = [label for batch in true_labels for label in batch]
flat_logits = [logits for batch in predictions for logits in batch]
flat_probs = [prob for batch in probs for prob in batch]
flat_predictions = np.argmax(flat_logits, axis=1).flatten()
flat_argmax_probs = [100*flat_probs[i][flat_predictions[i]] for i in range(len(flat_predictions))]

print("Matthews coefficient: ", matthews_corrcoef(flat_true_labels, flat_predictions))

evaluation_measures(flat_true_labels, flat_predictions)#,target_names=labels)

for thresh in np.arange(10,100,10):
    stats = easy_calc_uof_fp(flat_predictions, flat_argmax_probs, flat_true_labels, thresh)
    print(thresh, ":", stats)
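
For readers who want to see what the Matthews correlation coefficient measures, here is a plain-Python sketch of the multiclass form, computed from label counts. It is meant as an illustration of the formula rather than a replacement for `matthews_corrcoef`; `multiclass_mcc` is a hypothetical name, and returning 0.0 for a zero denominator is a choice made here.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Matthews correlation coefficient for any number of classes."""
    s = len(y_true)                                   # number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly predicted samples
    true_counts = Counter(y_true)                     # how often each class truly occurs
    pred_counts = Counter(y_pred)                     # how often each class is predicted
    classes = set(true_counts) | set(pred_counts)
    numerator = c * s - sum(true_counts[k] * pred_counts[k] for k in classes)
    var_true = s * s - sum(true_counts[k] ** 2 for k in classes)
    var_pred = s * s - sum(pred_counts[k] ** 2 for k in classes)
    denominator = math.sqrt(var_true * var_pred)
    return numerator / denominator if denominator else 0.0


perfect = multiclass_mcc([0, 1, 2], [0, 1, 2])
mixed = multiclass_mcc([1, 1, 1, 0], [1, 0, 1, 1])
print(perfect)          # 1.0
print(round(mixed, 2))  # -0.33
```

A value of 1 means perfect agreement, 0 is what chance-level predictions give, and negative values mean the predictions disagree with the labels more often than chance would.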



# Congratulations, you have now been converted to BERTology! ![alt text](https://i.ytimg.com/vi/odVtLluew-8/maxresdefault.jpg)

In [None]:
#Bonus: how to get your model from google drive to local disk
# zip folder with model -> download zip (works BUT slow for large models)
# https://stackoverflow.com/questions/53581023/google-colab-file-download-failed-to-fetch-error => make sure to enable third-party cookies + possible refresh

from google.colab import files
!ls "$path"
!zip -r "$path"/models.zip "$path"'/models/'  # create the archive first, otherwise there is nothing to download
files.download(path+'/models.zip')
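
As an alternative to the shell `zip` command, the archive can also be created from Python with the standard library. The sketch below demonstrates this on a throwaway directory; `zip_directory` is a hypothetical helper and the file it zips is a stand-in for the saved model files.

```python
import os
import shutil
import tempfile


def zip_directory(source_dir, archive_base):
    """Zip the contents of source_dir into archive_base + '.zip' and return the archive path."""
    return shutil.make_archive(archive_base, "zip", root_dir=source_dir)


# Demo on a temporary directory that stands in for the Google Drive folder
workdir = tempfile.mkdtemp()
model_dir = os.path.join(workdir, "models")
os.makedirs(model_dir)
with open(os.path.join(model_dir, "config.json"), "w") as fh:
    fh.write("{}")

archive_path = zip_directory(model_dir, os.path.join(workdir, "models"))
print(os.path.basename(archive_path))  # models.zip
```

In this notebook the equivalent call would presumably be `zip_directory(path + '/models', path + '/models')`, followed by `files.download(path + '/models.zip')`.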

This code and tutorial borrow insights from the sources below:

# Tutorials #

https://mccormickml.com/2019/07/22/BERT-fine-tuning/
https://medium.com/dsnet/running-pytorch-transformers-on-custom-datasets-717fd9e10fe2
https://engineering.wootric.com/when-bert-meets-pytorch
https://towardsdatascience.com/distilling-bert-models-with-spacy-277c7edc426c
https://github.com/huggingface/pytorch-transformers#quick-tour-of-the-fine-tuningusage-scripts
https://github.com/explosion/spacy-pytorch-transformers
https://github.com/huggingface/pytorch-transformers
https://github.com/fredriko/bert-tensorflow-pytorch-spacy-conversion
https://arxiv.org/pdf/1905.05583.pdf #how to finetune
https://www.kaggle.com/sharmilaupadhyaya/20newsgroup-classification-using-keras-bert-in-gpu 
https://colab.research.google.com/drive/1YSfscbb-g92m1vkYxY4IOVMWMfgfLLJD #tensorflow version on TPU
https://colab.research.google.com/drive/1pS-eegmUz9EqXJw22VbVIHlHoXjNaYuc#scrollTo=JggjeDC9m2MH #BertViz repo
https://www.kaggle.com/criscastromaya/cnn-for-nlp-in-keras [compare with CNN]

https://github.com/sugi-chan/custom_bert_pipeline/blob/master/bert_pipeline.ipynb #phased method

https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/discussion/100661 #faster batch training
