We recommend using Google Colab, as it offers a P100 GPU that is fast and has enough VRAM to train our model. Training and inference are very slow on a CPU because of the size of the model.

### Google Colab drive mount

If you are running this notebook on Google Colab, run the first cell below to mount your Google Drive folder in the notebook.

Then change the path in the `%cd` cell to the folder in your Google Drive where you placed the files from our GitHub repository.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd /content/drive/My\ Drive/ML_proj2/GITHUB_SUBMISSION
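
The mount cell above only works inside Colab. A minimal sketch of a guard that skips the mount elsewhere is shown here; the helper names `running_in_colab` and `mount_drive_if_colab` are hypothetical and not part of this repository, and the check assumes that the Colab runtime has already loaded the `google.colab` package.

```python
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab package is already loaded (assumed to hold in Colab)."""
    return "google.colab" in sys.modules


def mount_drive_if_colab(mount_point: str = "/content/drive") -> bool:
    """Mount Google Drive only inside Colab; do nothing and return False elsewhere."""
    if not running_in_colab():
        return False
    from google.colab import drive  # only importable inside a Colab runtime
    drive.mount(mount_point)
    return True


print("Drive mounted:", mount_drive_if_colab())
```

On a personal computer this prints `Drive mounted: False` and leaves the working directory untouched.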

The following cell must be run if you are using Google Colab:

In [1]:
!pip install transformers



If you run this notebook on your personal computer, we recommend creating a virtual environment (using Anaconda) and installing all the required packages with the commands below, which fulfill the version requirements listed in the README:  
- conda create -y -n ml python=3.7 scipy pandas numpy matplotlib  
- conda activate ml
- conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch  
- pip install -U scikit-learn  
- conda install jupyterlab nb_conda_kernels  
- pip install --user -U nltk  
- conda install -c conda-forge ipywidgets
- pip install transformers

(Note that the virtual environment above is named ml. It must be activated before installing the remaining dependencies and every time before running this notebook.)

Python 3.7 was used because Google Colab used it at the time of writing.
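
To confirm that the environment is set up before running the rest of the notebook, a quick check along these lines may help. This is a sketch only: `missing_packages` is a hypothetical helper, and the list of import names is our reading of the install commands above.

```python
import importlib.util
import sys

# Top-level import names for the packages installed by the commands above
# (note that scikit-learn is imported as "sklearn" and pytorch as "torch").
REQUIRED = ["torch", "transformers", "sklearn", "pandas", "numpy", "scipy", "matplotlib", "nltk"]


def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python", sys.version.split()[0])
print("Missing packages:", missing_packages(REQUIRED) or "none")
```

If any package is reported as missing, activate the ml environment and rerun the corresponding install command.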

### Imports

In [1]:
from transformers import BertForSequenceClassification, AdamW, BertConfig
from transformers import get_linear_schedule_with_warmup
from transformers import AutoTokenizer

from sklearn.model_selection import train_test_split

import pandas as pd
import time
import datetime
import random
import numpy as np
import csv

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import RandomSampler, DataLoader, Subset
from torch.utils.data import TensorDataset, random_split, SequentialSampler

#from tqdm.notebook import tqdm
from tqdm import tqdm
# https://stackoverflow.com/questions/42212810/tqdm-in-jupyter-notebook-prints-new-progress-bars-repeatedly

tqdm.pandas()

The following cell automatically reloads .py files into the Jupyter notebook, even when they are edited and saved outside of it.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from preprocessing_bert import *
from helpers_bert import *
from models_bert import *
from train_bert import *
from helpers import *

### run.py

If you are on Google Colab, you can run run.py directly from here:

In [9]:
!python run.py

^C


### Paths & data loading

In [11]:
PATH_DATA = './data/'
PATH_PREPROCESSING = PATH_DATA + 'preprocessing/'

Loading the positive and negative tweets into dataframes. Set the option small_dataset to 1 to load the small dataset, or to 0 to load the full dataset.

In [None]:
train_pos, train_neg = load_tweets(PATH_DATA, small_dataset=1)

In [6]:
device = gpu_cpu_setup()

No GPU available, using the CPU instead.


Below, the input dataframes are formatted and labeled, then tokenized.

In [None]:
df = create_input_df(train_pos,train_neg)

input_ids, attention_masks, labels = tokenize_with_autoencoder(df, max_len=40)
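
The `max_len` argument fixes the length of every tokenized tweet. The internals of `tokenize_with_autoencoder` are not shown here, so the following is only an illustrative sketch of the usual padding and attention-mask idea; `pad_and_mask` and the padding id `0` are assumptions, not the repository's code.

```python
from typing import List, Tuple


def pad_and_mask(token_ids: List[int], max_len: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Truncate or right-pad token ids to max_len and build the matching attention mask.

    The mask is 1 for real tokens and 0 for padding, so the model can ignore the padding.
    """
    ids = token_ids[:max_len]               # drop anything beyond max_len
    n_pad = max_len - len(ids)              # number of padding positions needed
    mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, mask


ids, mask = pad_and_mask([101, 7592, 2088, 102], max_len=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Unlike this sketch, a real tokenizer typically keeps the closing special token when it truncates a long sequence.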

After that, we create DataLoader objects for training and validation, to be used by the training functions later.

If you want to split the training data into a training set and a validation set, use the following cell:

In [None]:
# Train/validation split: 90% of the samples for training, 10% for validation
full_dataset = TensorDataset(input_ids, attention_masks, labels)
train_ds, val_ds = train_val_split(full_dataset, proportion = 0.9)
train_dataloader = DataLoader(train_ds, shuffle = True, batch_size = 32)
val_dataloader = as_dataloader(val_ds, random = False)
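
For intuition, a split with `proportion = 0.9` can be thought of as shuffling the sample indices and cutting them at 90%. The sketch below shows that idea in plain Python; `split_indices` and its seed are hypothetical, and the repository's `train_val_split` may be implemented differently.

```python
import random
from typing import List, Tuple


def split_indices(n: int, proportion: float = 0.9, seed: int = 42) -> Tuple[List[int], List[int]]:
    """Shuffle the indices 0..n-1 reproducibly and cut them into train and validation parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)    # local generator, leaves global random state alone
    n_train = int(n * proportion)
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = split_indices(100)
print(len(train_idx), len(val_idx))  # 90 10
```

Using a fixed seed keeps the validation set identical between runs, which makes validation scores comparable.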

If you want to load only a subset of the dataset to check that the functions run, use the following cell:

In [None]:
# Subset of the train/validation split, to test that the training function works
full_dataset = TensorDataset(input_ids, attention_masks, labels)
train_ds, val_ds = train_val_split(full_dataset)
train_dataloader = DataLoader(Subset(train_ds,np.arange(64*300)), shuffle = True, batch_size = 32)
val_dataloader = as_dataloader(Subset(val_ds,np.arange(32)), random = False)
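
The size of this smoke test follows from simple arithmetic: a DataLoader that keeps the last partial batch yields the number of samples divided by the batch size, rounded up. A small sketch (the helper `num_batches` is hypothetical):

```python
import math


def num_batches(num_samples: int, batch_size: int) -> int:
    """Number of batches a DataLoader yields when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


# The training subset above has 64 * 300 = 19200 samples with batch_size = 32.
print(num_batches(64 * 300, 32))  # 600
```

So one epoch on this subset is 600 training steps, which is short enough to check the training loop end to end.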

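If you want to train on the full dataset without a validation set, use the following cell:

In [None]:
# Train set only, no validation
full_dataset = TensorDataset(input_ids, attention_masks, labels)
train_dataloader = DataLoader(full_dataset, shuffle=True, batch_size = 16)
val_dataloader = None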

# Training

Instantiating the model (you can also instantiate the modified model 'BertWithCustomClassifier', which has a custom classifier):

In [None]:
model =  load_model(device, model_name = 'BertForSequenceClassification')

Loading the optimizer and scheduler needed for training (with the parameters of our best model, BertForSequenceClassification):

In [None]:
epochs = 2
total_steps = len(train_dataloader) * epochs # = number of batches times epochs

optimizer = AdamW(model.parameters(), lr = 3e-5, eps = 1e-8) # trying lr = 1e-5

scheduler = get_linear_schedule_with_warmup(optimizer,  num_warmup_steps = round(total_steps*0.10), num_training_steps = total_steps)
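
With this scheduler the learning rate rises linearly from 0 to `lr` over the first 10% of the steps, then decays linearly back to 0 by the last step. The sketch below computes the multiplier applied to `lr` at a given step in plain Python; the function name `linear_warmup_factor` is ours and is only meant to illustrate the shape of the schedule.

```python
def linear_warmup_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp up from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: ramp down from 1 to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Example with 100 total steps and 10 warmup steps (10% of the total).
base_lr = 3e-5
for step in (0, 5, 10, 55, 100):
    print(step, base_lr * linear_warmup_factor(step, 10, 100))
```

The peak learning rate of 3e-5 is therefore only reached at the end of the warmup, here at step 10.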

Training the model (note that if using BertForSequenceClassification, do not use the option freezing=True):

In [None]:
training_stats = train_bert_class_with_params(train_dataloader,val_dataloader,
                                              model, optimizer, scheduler,
                                              epochs, random_seed=42,
                                              device=device,
                                              PATH_DATA=PATH_DATA,
                                              save_N_steps=599,
                                              save_epoch=True,
                                              save_path='./data/models/BERT/BERT_model_TEST',
                                              step_print=100,
                                              validate=True,
                                              freezing=False,
                                              freez_steps=100,
                                              frozen_epochs=1)

In [None]:
training_stats

# Submission


In [None]:
device = gpu_cpu_setup()

In [None]:
# Our best model : 
path_model = PATH_DATA + 'models/BERT/best_submission_bert.pkl'
model = load_model_disk(device, path_model, model_name = 'BertForSequenceClassification')

# Our second best model : 
# path_model = PATH_DATA + 'models/BERT/best_submission_bert_custom.pkl'
# model = load_model_disk(device, path_model, model_name = 'BertWithCustomClassifier')

In [None]:
path_test_data = PATH_DATA + 'twitter-datasets/test_data.txt'

In [None]:
# use max_len=140 to reproduce our best results
test_dataloader = load_test_data(path_test_data, max_len=140)

In [None]:
y_pred, ids = make_prediction(model, test_dataloader, device)

In [None]:
pred_sanity_checks(y_pred)

In [None]:
path_submission = PATH_DATA + 'submissions/output.csv'
create_csv_submission(ids, y_pred, path_submission )
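
As a rough illustration of what the submission step produces, the sketch below writes ids and predictions to CSV with the standard library. The column names `Id` and `Prediction`, the labels `1` and `-1`, and the helper `write_submission` are assumptions made for the example; the actual format is defined by `create_csv_submission` in the repository.

```python
import csv
import io


def write_submission(ids, predictions, handle) -> None:
    """Write one (id, prediction) row per sample to an open text handle."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(handle)
    writer.writerow(["Id", "Prediction"])  # assumed header, check the expected format
    for sample_id, label in zip(ids, predictions):
        writer.writerow([int(sample_id), int(label)])


# Write to an in-memory buffer instead of a file to keep the example side-effect free.
buffer = io.StringIO()
write_submission([1, 2, 3], [1, -1, 1], buffer)
print(buffer.getvalue())
```

To write a real file, open it with `open(path, "w", newline="")` and pass the handle instead of the buffer.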

# Cross-validation

In [None]:
# Only use the small dataset for this, on a good GPU such as a P100; otherwise it will probably take days to finish.
# The P100 also has approximately 16 GB of VRAM.
cv_bert(input_ids, attention_masks, labels, device, PATH_DATA, model_name = 'BertWithCustomClassifier')
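
Cross-validation trains the model once per fold, each time holding out a different slice of the data for validation, which is why it is so much more expensive than a single training run. The sketch below generates the index sets for k folds in plain Python; `kfold_indices` and the choice of `k` are hypothetical, and `cv_bert` may split (and shuffle) the data differently.

```python
from typing import Iterator, List, Tuple


def kfold_indices(n: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, val_indices) for k contiguous folds over n samples."""
    # Spread the remainder over the first folds so that fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


for fold, (train, val) in enumerate(kfold_indices(10, k=3)):
    print(fold, len(train), val)
```

Every sample appears in exactly one validation fold, so the averaged validation score uses the whole dataset.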

In [None]:
# PATH_DATA already ends with '/', so use forward slashes for the rest of the path
test1 = pd.read_csv(PATH_DATA + 'submissions/output_run_py.csv')
test2 = pd.read_csv(PATH_DATA + 'submissions/output_run_py_BEST_140len.csv')