In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import random
import torch
import pickle
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)
torch.cuda.manual_seed(1)

import sys
sys.path.append('../smaberta')
from smaberta import TransformerModel

### Loading Data

Load train data stored in CSV format using Pandas. Pretty much any format is acceptable, just some form of text and accompanying labels. Modify according to your task. For the purpose of this tutorial, we are using a sample from New York Times Front Page Dataset (Boydstun, 2014).

In [2]:
train_df = pd.read_csv("../data/tutorial_train.csv")

Loading test data

In [3]:
test_df = pd.read_csv("../data/tutorial_test.csv")

Just to get an idea of what this dataset looks like

Paired data consisting of freeform text accompanied by their supervised labels towards the particular task. Here the text is headlines of news stories and the label categorizes them into the subjects. We have a total of 25 possible labels here, each represented by a separate number.

In [None]:
print(len(train_df.label.values))

In [4]:
train_df.head()

Unnamed: 0,text,label
0,"AIDS in prison, treatment costs overwhelm pris...",12
1,olympics security,19
2,police brutality,12
3,Iranian nuclear program; deal with European Un...,16
4,terror alert raised,16


In [5]:
print(train_df.text[:10].tolist(), train_df.label[:10].tolist())

['AIDS in prison, treatment costs overwhelm prison budgets', 'olympics security', 'police brutality', 'Iranian nuclear program; deal with European Union and its leaving of Iran free to develop plutonium.', 'terror alert raised', 'Job report shows unexpected vigor for US economy', "Clinton proposes West Bank Plan to Isreal's Prime Minister Netanyahu", 'Senators debate Iraq War policy', 'Myrtle Beach', 'china visit'] [12, 19, 12, 16, 16, 5, 19, 16, 14, 19]


### Learning Parameters
These are training arguments that you would use to train the classifier. For the purposes of the tutorial we set some sample values. Presumably in a different case you would perform a grid search or random search CV

In [6]:
lr = 1e-3
epochs = 2
print("Learning Rate ", lr)
print("Train Epochs ", epochs)

Learning Rate  0.001
Train Epochs  2


### Initialise model
1. First argument is indicative to use the Roberta architecture (alternatives - Bert, XLNet... as provided by Huggingface). Used to specify the right tokenizer and classification head as well 
2. Second argument provides intialisation point as provided by Huggingface [here](https://huggingface.co/transformers/pretrained_models.html). Examples - roberta-base, roberta-large, gpt2-large...
3. The tokenizer accepts the freeform text input and tansforms it into a sequence of tokens suitable for input to the transformer. The transformer architecture processes these before passing it on to the classifier head which transforms this representation into the label space.  
4. Number of labels is specified below to initialise the classification head appropriately. As per the classification task you would change this.
5. You can see the training args set above were used in the model initiation below.. 
6. Pass in training arguments as initialised, especially note the output directory where the model is to be saved and also training logs will be output. The overwrite output directory parameter is a safeguard in case you're rerunning the experiment. Similarly if you're rerunning the same experiment with different parameters, you might not want to reprocess the input every time - the first time it's done, it is cached so you might be able to just reuse the same. fp16 refers to floating point precision which you set according to the GPUs available to you, it shouldn't affect the classification result just the performance.

In [7]:
model = TransformerModel('roberta', 'roberta-base', num_labels=25, reprocess_input_data=True, num_train_epochs=epochs, learning_rate=lr, 
                  output_dir='./saved_model/', overwrite_output_dir=True, fp16=False)

### Run training

In [8]:
model.train(train_df['text'], test_df['label'])

Starting Epoch:  0
Starting Epoch:  1
Training of roberta model complete. Saved to ./saved_model/.


To see more in depth logs, set flag show_running_loss=True on the function call of train_model

### Inference from model

At training time the model is saved to the output directory that was passed in at initialization. We can either continue retaining the same model object, or load from the directory it was previously saved at. In this example we show the loading to illustrate how you would do the same. This is helpful when you want to train and save a classifier and use the same sporadically. For example in an online setting where you have some labelled training data you would train and save a model, and then load and use it to classify tweets as your collection pipeline progresses.

In [9]:
model = TransformerModel('roberta', 'roberta-base',  num_labels=25, location="./saved_model/")

### Evaluate on test set

At inference time we have access to the model outputs which we can use to make predictions as shown below. Similarly you could perform any emprical analysis on the output before/after saving the same. Typically you would save the results for replication purposes. You can use the model outputs as you would on a normal Pytorch model, here we just show label predictions and accuracy. In this tutorial we only used a fraction of the available data, hence why the actual accuracy is not great. For full results that we conducted on the experiments, check out our paper.

In [10]:
result, model_outputs, wrong_predictions = model.evaluate(test_df['text'], test_df['label'])
preds = np.argmax(model_outputs, axis = 1)

{'mcc': 0.0}


In [12]:
len(test_df), len(preds)

(998, 998)

In [14]:
correct = 0
labels = test_df['label'].tolist()
for i in range(len(labels)):
    if preds[i] == labels[i]:
        correct+=1

accuracy = correct/len(labels)
print("Accuracy: ", accuracy)

Accuracy:  0.23947895791583165


In [15]:
pickle.dump(model_outputs, open("../model_outputs.pkl", "wb"))

### Run inference 

This is the use case when you only have a new set of documents and no labels. For example if we just want to make predictions on a set of new text documents without loading a pandas datafram i.e. if you just have a list of texts, it can be predicted as shown below. Note that here you have the predictions and model outputs.

In [17]:
texts = test_df['text'].tolist()

In [18]:
preds, model_outputs = model.predict(texts)

In [19]:
correct = 0
for i in range(len(labels)):
    if preds[i] == labels[i]:
        correct+=1

accuracy = correct/len(labels)
print("Accuracy: ", accuracy)

Accuracy:  0.23947895791583165


### References

Boydstun, Amber E. (2014). New York Times Front Page Dataset. www.comparativeagendas.net. Accessed April 26, 2019.



