# Using SMaBERTa for text classification
To find this demo: https://github.com/SMAPPNYU/krn_tools_demo


Megan Brown

Center for Social Media and Politics at NYU

October 13, 2021

## Agenda
Today we will discuss:

1. What SMaBERTa is
2. How to install the package
3. A brief look at how to use the package


## SMaBERTa

SMaBERTa is a wrapper for the Hugging Face Transformers library.
It simplifies the use of various transformer models, requiring fewer lines of code to create a model.

## How to Install

   The software is on PyPI, so you can install it with `pip`:
   
   
   `pip install smaberta`

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import random
import torch
import pickle
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)
torch.cuda.manual_seed(1)

In [None]:
from smaberta import TransformerModel

### Loading Data

Load the training data, stored in CSV format, using pandas. Almost any format works, as long as it provides text and accompanying labels; modify this step to suit your task. For this tutorial, we use a sample from the New York Times Front Page Dataset (Boydstun, 2014), labelled using categories from the Comparative Agendas Project.

In [2]:
train_df = pd.read_csv("../data/tutorial_train.csv")

Load the test data:

In [3]:
test_df = pd.read_csv("../data/tutorial_test.csv")

The dataset consists of headline text and the corresponding Comparative Agendas Project (CAP) label.

In [4]:
train_df.head()

| Unnamed: 0 | text | label |
|---|---|---|
| 0 | AIDS in prison, treatment costs overwhelm pris... | 12 |
| 1 | olympics security | 19 |
| 2 | police brutality | 12 |
| 3 | Iranian nuclear program; deal with European Un... | 16 |
| 4 | terror alert raised | 16 |


### Learning Parameters
These are the training arguments used to train the classifier. For this tutorial we set sample values; in practice, you would tune them with a grid search or random search combined with cross-validation.

In [6]:
lr = 1e-3
epochs = 2
print("Learning Rate ", lr)
print("Train Epochs ", epochs)

Learning Rate  0.001
Train Epochs  2
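A grid search over these arguments could be sketched as follows. This is a minimal illustration, not part of SMaBERTa: `grid_search` and `dummy_score` are hypothetical names, the candidate values are arbitrary, and `dummy_score` stands in for "train on the training folds, then score on the held-out fold".

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return (best_params, best_score).

    score_fn is a hypothetical callable that takes the parameters as keyword
    arguments and returns a validation score where higher is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Illustrative candidate values only, not recommendations.
param_grid = {"lr": [1e-5, 3e-5, 1e-4], "epochs": [2, 4]}

def dummy_score(lr, epochs):
    # Stand-in for training a model and scoring it on held-out data.
    return -abs(lr - 3e-5) + 0.001 * epochs

best_params, best_score = grid_search(param_grid, dummy_score)
print(best_params)
```

In a real run, the scoring function would train a fresh model with the given arguments and return a validation metric averaged over the cross-validation folds.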


### Initialise model
1. The first argument selects the RoBERTa architecture (alternatives include BERT, XLNet, and others provided by Hugging Face). It also determines the tokenizer and classification head.
2. The second argument names the pretrained checkpoint to initialise from, as listed by Hugging Face [here](https://huggingface.co/transformers/pretrained_models.html). Examples: roberta-base, roberta-large, gpt2-large.
3. The tokenizer accepts free-form text and transforms it into a sequence of tokens suitable for input to the transformer. The transformer processes these tokens and passes the resulting representation to the classifier head, which maps it into the label space.
4. The number of labels is specified so that the classification head is initialised with the right size. Change it to match your classification task.
5. The training arguments set above are passed in when the model is initialised below.
6. Note in particular the output directory, where the model is saved and the training logs are written. `overwrite_output_dir` is a safeguard in case you are rerunning the experiment. Similarly, if you rerun the same experiment with different parameters, you may not want to reprocess the input every time (`reprocess_input_data`): it is cached the first time it is processed, so you may be able to reuse it. `fp16` controls floating-point precision, which you set according to the GPUs available to you; it should not affect the classification result, only performance.
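Because the classification head produces one score per label index, integer labels are generally expected to lie between 0 and `num_labels - 1`. A minimal sketch of a sanity check, assuming integer-coded labels as in this dataset; `check_labels` is a hypothetical helper, not part of SMaBERTa:

```python
def check_labels(labels, num_labels):
    """Return the sorted label values that fall outside [0, num_labels)."""
    return sorted({int(y) for y in labels if not 0 <= int(y) < num_labels})

sample_labels = [12, 19, 12, 16, 16]  # labels from the preview above

print(check_labels(sample_labels, num_labels=25))  # [] : every label fits
print(check_labels(sample_labels, num_labels=15))  # [16, 19] : head too small
```

Running a check like this on the full label column before training can catch a `num_labels` value that is too small for the data.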

Once we have loaded the data, we instantiate the model. 

In [7]:
model = TransformerModel('roberta', 
                         'roberta-base', 
                         num_labels=25, 
                         reprocess_input_data=True, 
                         num_train_epochs=epochs, 
                         learning_rate=lr, 
                         output_dir='./saved_model/', 
                         overwrite_output_dir=True, 
                         fp16=False)

### Run training

In [8]:
model.train(train_df['text'], train_df['label'])

Starting Epoch:  0
Starting Epoch:  1
Training of roberta model complete. Saved to ./saved_model/.


To see more detailed logs, set the flag `show_running_loss=True` on the `train_model` function call.

### Inference from model

During training, the model is saved to the output directory passed in at initialisation. We can either keep using the same model object or load the model from the directory where it was saved. This example loads it from disk to illustrate how. This is helpful when you want to train and save a classifier once and then use it intermittently. For example, in an online setting where you have some labelled training data, you would train and save a model, then load it to classify tweets as your collection pipeline progresses.

You can also instantiate a model using a previously trained model. Here we are loading the model we just trained.

In [9]:
model = TransformerModel('roberta', 'roberta-base',  num_labels=25, location="./saved_model/")
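For the online setting described above, one way to apply a loaded model to incoming texts is to classify them in batches. The sketch below is an illustration under stated assumptions: `batched`, `classify_stream`, and `fake_predict` are hypothetical helpers, and `predict_fn` is assumed to behave like `model.predict` in this tutorial, taking a list of texts and returning `(preds, model_outputs)`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def classify_stream(texts, predict_fn, batch_size=32):
    """Run predict_fn over texts batch by batch and collect the predicted labels."""
    all_preds = []
    for batch in batched(texts, batch_size):
        preds, _ = predict_fn(batch)  # model outputs are ignored in this sketch
        all_preds.extend(preds)
    return all_preds

def fake_predict(batch):
    # Stand-in so the sketch runs without a trained model.
    return [len(text) % 2 for text in batch], None

new_texts = ["olympics security", "police brutality", "terror alert raised"]
print(classify_stream(new_texts, fake_predict, batch_size=2))
```

With a trained model, `fake_predict` would be replaced by the model's own prediction method.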

### Evaluate on test set

At inference time we have access to the model outputs, which we can use to make predictions as shown below. You can likewise run any empirical analysis on the outputs before or after saving them; typically you would save the results for replication purposes. You can use the model outputs as you would those of a normal PyTorch model; here we show only label predictions and accuracy. This tutorial uses only a fraction of the available data, which is why the accuracy is not great. For the full experimental results, see our paper.

We then evaluate the model using the test set.

In [10]:
result, model_outputs, wrong_predictions = model.evaluate(test_df['text'], test_df['label'])
preds = np.argmax(model_outputs, axis = 1)

{'mcc': 0.0}


Note: The model has low accuracy because we chose a large learning rate and a small number of epochs. In practice, we recommend performing a grid search with cross-validation to find the best model for your task.

In [14]:
correct = 0
labels = test_df['label'].tolist()
for i in range(len(labels)):
    if preds[i] == labels[i]:
        correct+=1

accuracy = correct/len(labels)
print("Accuracy: ", accuracy)

Accuracy:  0.23947895791583165
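An MCC of 0.0 together with a nonzero accuracy can occur when a model predicts the same label for every example. It may therefore be worth checking how many distinct labels appear in the predictions and how the accuracy compares with a majority-class baseline. A minimal sketch on toy data (the values below are illustrative and not taken from the tutorial's test set; both helpers are hypothetical):

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

def prediction_summary(preds):
    """Count how often each label is predicted."""
    return dict(Counter(preds))

# Toy data for illustration only.
toy_labels = [12, 19, 12, 16, 16, 12]
toy_preds = [12, 12, 12, 12, 12, 12]

print(majority_baseline(toy_labels))   # (12, 0.5)
print(prediction_summary(toy_preds))   # {12: 6}
```

If the summary contains a single label and the accuracy matches the baseline, the classifier has likely collapsed to one class, and the learning rate or number of epochs is a reasonable place to start tuning.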


### Run inference 

This is the use case when you have a new set of documents and no labels. For example, if you want to make predictions on new text documents without loading a pandas DataFrame, i.e. you just have a list of texts, you can predict as shown below. Note that you get both the predictions and the model outputs.

We can then use our trained model to predict the category for new texts.

In [17]:
texts = test_df['text'].tolist()
preds, model_outputs = model.predict(texts)
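If you want a confidence value alongside each predicted label, the per-label scores can be converted into probabilities with a softmax. This sketch assumes that each row of the model outputs holds raw, unnormalised scores (one per label), which is consistent with the `argmax` used earlier but is not confirmed here; `softmax` and `top_prediction` are hypothetical helpers written in plain Python.

```python
import math

def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(scores):
    """Return the index of the highest-scoring label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

toy_output = [0.2, 2.0, -1.0]  # illustrative scores for a 3-label problem
label, confidence = top_prediction(toy_output)
print(label, round(confidence, 3))
```

Low-confidence predictions found this way are natural candidates for manual review.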

Thank you!

You can find out more about the package at https://github.com/SMAPPNYU/SMaBERTa.git