# Introduction

In this Colab notebook, I perform sentiment analysis with XLNet on the Twitter US Airline Sentiment dataset and apply EDA (Easy Data Augmentation) to the text. I then report the performance metrics (before and after augmentation).

## Install and Import

We'll train a neural network with a GPU provided by Google. 

Add the GPU by going to the menu and selecting:

Edit -> Notebook Settings -> Add accelerator (GPU)

Run the following cell to confirm that the GPU is detected.

In [None]:
import tensorflow as tf

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Next, install the Hugging Face transformers library, which provides XLNet. (The library also contains interfaces for other pretrained language models such as BERT, OpenAI's GPT, and GPT-2.)

At the moment, the Hugging Face library seems to be the most widely accepted and powerful PyTorch interface for transfer learning models.


In [None]:
!pip install transformers

In [None]:
#xlnet requires the sentencepiece module
!pip install sentencepiece

In [4]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

from transformers import XLNetModel, XLNetTokenizer, XLNetForSequenceClassification
# The AdamW bundled with transformers is deprecated in favor of the PyTorch implementation
from torch.optim import AdamW

import sentencepiece

from tqdm import tqdm, trange
import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In order for torch to use the GPU, we need to identify and specify the GPU as the device. Later, in our training loop, we will load data onto the device. 

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

## Load Dataset


We use the Twitter Airline Sentiment dataset.  It's a set of Tweets labeled as 'positive', 'negative', or 'neutral'. 

The dataset can be found at the following location: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

In [None]:
# Upload the train file from your local drive
from google.colab import files
uploaded = files.upload()

In [6]:
# Read the datafile (in .csv format)
df = pd.read_csv('/content/Tweets.csv')

In [None]:
# Review the dimensions of the dataframe
df.shape

In [None]:
# Familiarize yourself with the dataframe by visual inspection
df.sample(10)

# Format Dataframe


In [9]:
# Drop columns not needed for the analysis
df = df.drop(columns=[
    'tweet_id', 'airline_sentiment_confidence', 'negativereason',
    'negativereason_confidence', 'airline', 'airline_sentiment_gold',
    'retweet_count', 'tweet_coord', 'tweet_created', 'tweet_location',
    'user_timezone', 'name', 'negativereason_gold',
])

In [None]:
# Drop rows with empty cells and renumber the index
df = df.dropna().reset_index(drop=True)

In [11]:
# Create a dictionary that maps each sentiment label to an integer
sentiment = {'neutral': 0, 'positive': 1,'negative': 2} 
  
# Traverse through dataframe  
# Assign values where key matches 
df.airline_sentiment = [sentiment[item] for item in df.airline_sentiment] 

In [None]:
df.sample(10)

In [13]:
# Format the 'airline_sentiment' column as a categorical variable
df['airline_sentiment'] = df.airline_sentiment.astype('category')

In [None]:
df['airline_sentiment'].describe()

In [None]:
df.shape

# Data Augmentation (EDA)

In [None]:
# Import EDA
try:
    from textaugment import EDA
except ModuleNotFoundError:
    !pip install textaugment
    from textaugment import EDA

In [None]:
# Import NLTK and download modules needed for EDA
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

In [21]:
t = EDA(random_state=1)

# Synonym Replacement

In [None]:
# Visually inspect one row of the 'text' column (index 1) for verification
df['text'][1]

In [None]:
# Apply EDA synonym replacement to every tweet
# (assigning the whole column avoids chained-assignment problems in pandas)
df['text'] = df['text'].apply(t.synonym_replacement)

In [None]:
# Visually inspect the same row of the 'text' column (index 1) to verify
# a word has been replaced by its synonym.
df['text'][1]

# Random Insertion

In [None]:
# Apply EDA random insertion to every tweet
df['text'] = df['text'].apply(t.random_insertion)

In [None]:
# Visually inspect the same row of the 'text' column (index 1) to verify
# a word has been randomly inserted.
df['text'][1]

# Random Swap

In [None]:
# Apply EDA random swap to every tweet
df['text'] = df['text'].apply(t.random_swap)

In [None]:
# Visually inspect the same row of the 'text' column (index 1) to verify
# two words have been randomly swapped.
df['text'][1]

# Random Deletion

In [None]:
# Apply EDA random deletion to every tweet
df['text'] = df['text'].apply(t.random_deletion)

In [None]:
# Visually inspect the same row of the 'text' column (index 1) to verify
# a word has been randomly deleted.
df['text'][1]
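
To make the last two operations concrete, here is a minimal pure-Python sketch of random swap and random deletion. It is not the `textaugment` implementation: the function names, the `rng` argument, and the example tweet are assumptions chosen for illustration, and the library's tokenization and edge-case handling may differ.

```python
import random

def random_swap(sentence, n=1, rng=None):
    """Swap two randomly chosen word positions, n times."""
    rng = rng or random.Random()
    words = sentence.split()
    if len(words) < 2:
        return sentence
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def random_deletion(sentence, p=0.1, rng=None):
    """Drop each word with probability p, always keeping at least one word."""
    rng = rng or random.Random()
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    kept = [w for w in words if rng.random() > p]
    if not kept:
        kept = [rng.choice(words)]
    return " ".join(kept)

rng = random.Random(1)  # seeded so the example is repeatable
tweet = "the flight was delayed for three hours"  # hypothetical tweet
swapped = random_swap(tweet, n=1, rng=rng)
shortened = random_deletion(tweet, p=0.3, rng=rng)
print(swapped)
print(shortened)
```

Random swap keeps every word and changes only the order, while random deletion can only shorten the text.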

# Create Holdout Dataset

In [None]:
import random
# Create a new column named 'carve_assignment' and
# assign each row a random number between 0 and 1
df['carve_assignment'] = [random.random() for _ in range(len(df))]

df.sample(10)

In [26]:
# Assign rows from column 'carve_assignment' with value equal or greater than 0.8 to holdout dataset
holdout = df[df['carve_assignment'] >= 0.8]

# Assign rows from column 'carve_assignment' with value less than 0.8 to 'df' dataframe
df = df[df['carve_assignment'] < 0.8]

In [None]:
# Verify proportions of data in each dataset
print(df.shape)
print(holdout.shape)
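
Because every row receives its own independent random draw, the holdout share is only approximately 20%. The following sketch shows the same idea on a plain list; the 1,000 stand-in rows and the seed are assumptions for illustration only.

```python
import random

rng = random.Random(0)  # seeded so the split is reproducible
rows = [f"tweet {i}" for i in range(1000)]  # stand-in for the dataframe rows

# One uniform draw in [0, 1) per row, like the 'carve_assignment' column
draws = [rng.random() for _ in rows]

# Rows with a draw of at least 0.8 go to the holdout set, the rest stay for training
holdout_rows = [r for r, d in zip(rows, draws) if d >= 0.8]
train_rows = [r for r, d in zip(rows, draws) if d < 0.8]

print(len(train_rows), len(holdout_rows))
```

If an exact 80/20 split is needed, shuffling the rows once and slicing at a fixed position is an alternative.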

In [None]:
# Drop column named 'carve_assignment' from both dataframes
df = df.drop(columns='carve_assignment')
holdout = holdout.drop(columns='carve_assignment')

In [31]:
# Save 'holdout' dataset to CSV for later analysis
holdout.to_csv('holdout.csv', index=False)

In [32]:
# Delete 'holdout' dataset and release from memory
del holdout

# Format Sentences for XLNet 

In [33]:
# Create sentence and label lists
sentences = df.text.values

We need to add the special tokens "[SEP]" and "[CLS]" to each sentence for XLNet to work properly. Unlike BERT, XLNet expects them at the end of the sequence.

For XLNet the token pattern looks like this:

    Sentence_A + [SEP] + Sentence_B + [SEP] + [CLS]
    
For single sentence inputs (such as we have here), we just need to add [SEP] and [CLS] to the end:

In [34]:
sentences = [sentence + " [SEP] [CLS]" for sentence in sentences]
labels = df.airline_sentiment
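
The token pattern above can be written as a small helper that covers both the single-sentence and the sentence-pair case. This is a sketch: `format_for_xlnet` is a hypothetical name, and the marker strings are parameters that default to the ones used in this notebook.

```python
def format_for_xlnet(sentence_a, sentence_b=None, sep="[SEP]", cls="[CLS]"):
    """Append the separator and classification markers at the END of the input."""
    parts = [sentence_a, sep]
    if sentence_b is not None:
        parts += [sentence_b, sep]
    parts.append(cls)
    return " ".join(parts)

single = format_for_xlnet("great flight")
pair = format_for_xlnet("was it late?", "yes, two hours")
print(single)
print(pair)
```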

In [35]:
# Delete the 'df' dataframe and release from memory; it is no longer needed
# The subsequent analysis will be conducted on the 'sentences' and 'labels' lists
# A separate dataframe named 'df' will be created later from the 'holdout' dataset
del df

## Tokenize Sentences

Next, import the XLNet tokenizer, and convert the text into tokens that correspond to XLNet's vocabulary.

In [None]:
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased', do_lower_case=True)

tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print("Tokenize the first sentence:")
print(tokenized_texts[0])

XLNet requires specifically formatted inputs. For each tokenized input sentence, we need to create:

- **input ids**: a sequence of integers identifying each input token to its index number in the XLNet tokenizer vocabulary
- **segment mask**: (optional) a sequence of 1s and 0s used to identify whether the input is one sentence or two sentences long. For one sentence inputs, this is simply a sequence of 0s. For two sentence inputs, there is a 0 for each token of the first sentence, followed by a 1 for each token of the second sentence
- **attention mask**: (optional) a sequence of 1s and 0s, with 1s for all input tokens and 0s for all padding tokens (we'll detail this in the next paragraph)
- **labels**: a single value of 0, 1, or 2 to characterize the sentiment of the sentence (here, Tweet).  In our task 0 means 'neutral', 1 means 'positive', and 2 means 'negative'. 
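
As a worked example of these four inputs, the sketch below builds them by hand for one short sentence. The vocabulary and its id numbers are made up for illustration; real ids come from the XLNet tokenizer.

```python
# Toy vocabulary (hypothetical ids, not XLNet's)
toy_vocab = {"thanks": 11, "for": 12, "the": 13, "upgrade": 14}
PAD_ID = 0    # this notebook pads with 0
max_len = 6   # small value so the result is easy to read

tokens = ["thanks", "for", "the", "upgrade"]
n_pad = max_len - len(tokens)

input_ids = [toy_vocab[tok] for tok in tokens] + [PAD_ID] * n_pad
segment_ids = [0] * max_len                      # one sentence, so all zeros
attention_mask = [1] * len(tokens) + [0] * n_pad  # 1 = real token, 0 = padding
label = 1                                         # 1 means 'positive' in this notebook

print(input_ids)
print(segment_ids)
print(attention_mask)
```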

Although we can have variable length input sentences, XLNet requires the input arrays to be the same size. We address this by choosing a maximum sentence length, and then padding and truncating the inputs until each input sequence is the same length. 

To "pad" inputs means that if a sentence is shorter than the maximum sentence length, we add 0s to the end of the sequence until it is the maximum sentence length. 

If a sentence is longer than the maximum sentence length, we truncate the end of the sequence, discarding anything that does not fit into the maximum sentence length.

We pad and truncate our sequences so that they are all of length MAX_LEN ("post" indicates that we want to pad and truncate at the end of the sequence, as opposed to the beginning). `pad_sequences` is a utility function from Keras. It handles the truncating and padding of Python lists.
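
For a single sequence, "post" padding and truncation amount to the following. This is a pure-Python sketch with a hypothetical helper name; it mirrors the behavior described above rather than the Keras implementation.

```python
def pad_and_truncate(seq, max_len, pad_value=0):
    """Cut the sequence at max_len, then fill the end with pad_value."""
    seq = list(seq)[:max_len]                        # truncate at the end
    return seq + [pad_value] * (max_len - len(seq))  # pad at the end

short = pad_and_truncate([7, 8, 9], max_len=5)
long = pad_and_truncate([1, 2, 3, 4, 5, 6, 7], max_len=5)
print(short)
print(long)
```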

In [37]:
# Set the maximum sequence length.  This value leaves plenty of 'room'. 
MAX_LEN = 140

In [38]:
# Use the XLNet tokenizer to convert the tokens to their index numbers in the XLNet vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

In [39]:
# Pad our input tokens
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

Create the attention masks 

In [40]:
# Create attention masks
attention_masks = []

# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask)

In [41]:
# Use train_test_split to split our data into train and validation sets for training.
# Inputs, masks, and labels are split in a single call so that their rows stay aligned.

train_inputs, validation_inputs, train_masks, validation_masks, train_labels, validation_labels = train_test_split(
    input_ids, attention_masks, labels,
    random_state=2018, test_size=0.2, stratify=labels)

In [None]:
# Identify Data Types
print(type(train_inputs))
print(type(train_labels))
print(type(train_masks))

print(type(validation_inputs))
print(type(validation_labels))
print(type(validation_masks))

In [43]:
# Convert Series to numpy array 
train_labels = np.asarray(train_labels)
validation_labels = np.asarray(validation_labels)

In [44]:
# Convert list to numpy array
train_masks = np.asarray(train_masks)
validation_masks = np.asarray(validation_masks)

In [None]:
# Verify Data Types
print(type(train_inputs))
print(type(train_labels))
print(type(train_masks))

print(type(validation_inputs))
print(type(validation_labels))
print(type(validation_masks))

In [46]:
# Convert all of our data into torch tensors, the required datatype for our model

train_inputs = torch.from_numpy(train_inputs)
validation_inputs = torch.from_numpy(validation_inputs)

train_labels = torch.from_numpy(train_labels)
validation_labels = torch.from_numpy(validation_labels)

train_masks = torch.from_numpy(train_masks)
validation_masks = torch.from_numpy(validation_masks)

In [None]:
# Verify the tensor shapes (all variables should now be torch tensors)
print(train_inputs.size())
print(validation_inputs.size())

print(train_labels.size())
print(validation_labels.size())

print(train_masks.size())
print(validation_masks.size())

In [48]:
# Select a batch size for training. For fine-tuning with XLNet, the authors recommend a batch size of 32, 48, or 128. We will use 32 here to avoid memory issues.
batch_size = 32

# Create an iterator of our data with torch DataLoader. This helps save on memory during training because, unlike a for loop, 
# with an iterator the entire dataset does not need to be loaded into memory

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
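
Conceptually, the two samplers differ only in the order in which examples are visited. The sketch below imitates that batching with a plain generator; `iterate_batches` is a hypothetical helper, not part of torch, and it leaves out everything else a `DataLoader` does.

```python
import random

def iterate_batches(examples, batch_size, shuffle=False, seed=None):
    """Yield lists of up to batch_size examples, optionally in random order."""
    order = list(range(len(examples)))
    if shuffle:
        random.Random(seed).shuffle(order)  # random order, as for training
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]

examples = list(range(10))
sequential_batches = list(iterate_batches(examples, batch_size=4))
shuffled_batches = list(iterate_batches(examples, batch_size=4, shuffle=True, seed=0))
print(sequential_batches)
print(shuffled_batches)
```

The last batch is smaller when the number of examples is not a multiple of the batch size.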


## Train Model

Now that the input data is properly formatted, we fine tune the XLNet model. 

For this task, we first modify the pre-trained model to give outputs for classification, and then continue training the model on our dataset until the entire model, end-to-end, is well-suited for our task.   

Load [XLNetForSequenceClassification](https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/modeling_xlnet.py#L1076). This is the baseline XLNet model with a single linear layer added on top, which we will use as a sentence classifier. As we feed in input data, the pre-trained XLNet model and the additional untrained classification layer are trained together on our specific task.

### The Fine-Tuning Process

Because the pre-trained model layers already encode a lot of information about the language, training the classifier is relatively inexpensive. Rather than training every layer in a large model from scratch, it's as if we have already trained the bottom layers 95% of where they need to be, and only really need to train the top layer, with a bit of tweaking in the lower levels to accommodate our task.

There are a few different pre-trained XLNet models available. "xlnet-base-cased" means the version that has both upper and lowercase letters ("cased") and is the smaller version of the two ("base" vs "large").

In [None]:
# Load XLNetForSequenceClassification, the pretrained XLNet model with a single linear classification layer on top.

model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=3)
model.cuda()

Now that the model is loaded, we need to collect its named parameters and group them for the optimizer, so that weight decay is applied to some parameters and not to others.

For the purpose of fine-tuning, the authors recommend the following hyperparameters (broken down by which NLP dataset they are applied to):

![alt text](https://i.imgur.com/AhirErN.png)

In [50]:

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
# AdamW reads the per-group setting from the 'weight_decay' key
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]
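
The grouping rule is a plain substring match on parameter names. The sketch below applies the same rule to a list of names; the names and the helper `split_by_decay` are made up for illustration and are not taken from the XLNet model.

```python
def split_by_decay(param_names, no_decay=("bias", "gamma", "beta")):
    """Return (names that get weight decay, names that are exempt)."""
    with_decay, without_decay = [], []
    for name in param_names:
        if any(nd in name for nd in no_decay):  # substring match, as in the cell above
            without_decay.append(name)
        else:
            with_decay.append(name)
    return with_decay, without_decay

# Hypothetical parameter names
names = ["layer.0.ff.weight", "layer.0.ff.bias", "classifier.weight", "classifier.bias"]
with_decay, without_decay = split_by_decay(names)
print(with_decay)
print(without_decay)
```

Printing the real names from `model.named_parameters()` is a quick way to check which parameters each substring actually matches.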

In [51]:
# Create the optimizer; it holds the parameter groups and the learning rate our training loop needs
optimizer = AdamW(optimizer_grouped_parameters,
                     lr=2e-5)

Below is our training loop. For each pass in our loop we have a training phase and a validation phase. At each pass we need to:

Training loop:
- Tell the model to compute gradients by setting the model in train mode
- Unpack our data inputs and labels
- Load data onto the GPU for acceleration
- Clear out the gradients calculated in the previous pass. In PyTorch the gradients accumulate by default (useful for things like RNNs) unless you explicitly clear them out
- Forward pass (feed input data through the network)
- Backward pass (backpropagation)
- Tell the network to update parameters with optimizer.step()
- Track variables for monitoring progress

Evaluation loop:
- Put the model in evaluation mode and run the forward pass under `torch.no_grad()` so that gradients are not computed
- Unpack our data inputs and labels
- Load data onto the GPU for acceleration
- Forward pass (feed input data through the network)
- Compute loss on our validation data and track variables for monitoring progress


In [52]:
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

In [None]:
# Store our loss and accuracy for plotting
train_loss_set = []

# Number of training epochs (authors recommend between 2 and 4)
epochs = 4

# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):
  
  
  # Training
  
  # Set our model to training mode (as opposed to evaluation mode)
  model.train()
  
  # Tracking variables
  tr_loss = 0
  nb_tr_examples, nb_tr_steps = 0, 0
  
  # Train the data for one epoch
  for step, batch in enumerate(train_dataloader):
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    # Clear out the gradients (by default they accumulate)
    optimizer.zero_grad()
    # Forward pass
    outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
    loss = outputs[0]
    logits = outputs[1]
    train_loss_set.append(loss.item())    
    # Backward pass
    loss.backward()
    # Update parameters and take a step using the computed gradient
    optimizer.step()
    
    
    # Update tracking variables
    tr_loss += loss.item()
    nb_tr_examples += b_input_ids.size(0)
    nb_tr_steps += 1

  print("Train loss: {}".format(tr_loss/nb_tr_steps))
    
    
  # Validation

  # Put model in evaluation mode to evaluate loss on the validation set
  model.eval()

  # Tracking variables 
  eval_loss, eval_accuracy = 0, 0
  nb_eval_steps, nb_eval_examples = 0, 0

  # Evaluate data for one epoch
  for batch in validation_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    # Telling the model not to compute or store gradients, saving memory and speeding up validation
    with torch.no_grad():
      # Forward pass, calculate logit predictions
      output = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
      logits = output[0]
    
    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1

  print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))

## Training Evaluation

Plot and review training loss over all batches:

In [None]:
plt.figure(figsize=(15,8))
plt.title("Training loss")
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.plot(train_loss_set)
plt.show()

## Predict and Evaluate on Holdout Set

Now we'll load the holdout dataset and prepare inputs just as we did with the training set. Then we'll evaluate predictions using the [Matthews correlation coefficient](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html). This metric is widely used in the NLP community, for example to evaluate performance on CoLA. With this metric, +1 is the best score and -1 is the worst score. It gives us a single summary score for our three-class predictions.
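
To show what the metric measures, here is a minimal pure-Python sketch of the multiclass form of the Matthews correlation coefficient, computed from label counts. The function name and the toy labels are assumptions for illustration; in the notebook itself the scikit-learn function linked above is used, and this sketch is intended to agree with it on inputs like these.

```python
import math
from collections import Counter

def matthews_corrcoef_multiclass(y_true, y_pred):
    """Multiclass MCC from the number of correct predictions and per-class counts."""
    s = len(y_true)                                        # total samples
    c = sum(1 for t, p in zip(y_true, y_pred) if t == p)   # correct predictions
    t_k = Counter(y_true)                                  # how often each class truly occurs
    p_k = Counter(y_pred)                                  # how often each class is predicted
    classes = set(t_k) | set(p_k)
    cov_tp = c * s - sum(t_k[k] * p_k[k] for k in classes)
    cov_tt = s * s - sum(t_k[k] ** 2 for k in classes)
    cov_pp = s * s - sum(p_k[k] ** 2 for k in classes)
    denom = math.sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0               # 0.0 when undefined

perfect = matthews_corrcoef_multiclass([0, 1, 2, 2], [0, 1, 2, 2])
partial = matthews_corrcoef_multiclass([0, 0, 1, 1], [0, 0, 1, 0])
print(perfect, partial)
```

A constant prediction makes the denominator zero, and the sketch returns 0.0 in that case.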

In [None]:
# Upload the test file from your local drive
from google.colab import files
uploaded = files.upload()

In [55]:
# Or, read the holdout file directly
df = pd.read_csv("/content/holdout.csv")

In [56]:
# Create sentence and label lists
sentences = df.text.values

# We need to add special tokens at the end of each sentence for XLNet to work properly
sentences = [sentence + " [SEP] [CLS]" for sentence in sentences]
labels = df.airline_sentiment

tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]


MAX_LEN = 140

# Use the XLNet tokenizer to convert the tokens to their index numbers in the XLNet vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
# Pad our input tokens
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
# Create attention masks
attention_masks = []

# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask) 

In [None]:
# Review data type of input variables
print(type(input_ids))
print(type(attention_masks))
print(type(labels))

In [58]:
# Convert list to numpy array 
attention_masks = np.asarray(attention_masks)

# Convert Series to numpy array 
labels = np.asarray(labels)

In [None]:
# Review data type of input variables (they should be numpy arrays at this point)
print(type(input_ids))
print(type(attention_masks))
print(type(labels))

In [60]:
prediction_inputs = torch.from_numpy(input_ids)
prediction_masks = torch.from_numpy(attention_masks)
prediction_labels = torch.from_numpy(labels)
  
batch_size = 32  

prediction_data = TensorDataset(prediction_inputs, prediction_masks, prediction_labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

In [61]:
# Prediction on test set

# Put model in evaluation mode
model.eval()

# Tracking variables 
predictions , true_labels = [], []

# Predict 
for batch in prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  # Telling the model not to compute or store gradients, saving memory and speeding up prediction
  with torch.no_grad():
    # Forward pass, calculate logit predictions
    outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    logits = outputs[0]

  # Move logits and labels to CPU
  logits = logits.detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()
  
  # Store predictions and true labels
  predictions.append(logits)
  true_labels.append(label_ids)

In [62]:
# Import and evaluate each test batch using the Matthews correlation coefficient
from sklearn.metrics import matthews_corrcoef
matthews_set = []

for i in range(len(true_labels)):
  matthews = matthews_corrcoef(true_labels[i],
                 np.argmax(predictions[i], axis=1).flatten())
  matthews_set.append(matthews)

The final score will be based on the entire test set, but let's take a look at the scores on the individual batches to get a sense of the variability in the metric between batches.


In [None]:
matthews_set

In [64]:
# Flatten the predictions and true values for an aggregate Matthews evaluation on the whole dataset
flat_predictions = [item for sublist in predictions for item in sublist]
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
flat_true_labels = [item for sublist in true_labels for item in sublist]

In [None]:
# Print the Matthews correlation coefficient
matthews_corrcoef(flat_true_labels, flat_predictions)

At this point, we've fine-tuned XLNet.

To improve this score, we can experiment with hyperparameter tuning (adjusting the learning rate, epochs, batch size, optimizer properties, etc.). 

## Conclusion

This notebook shows how to fine-tune an XLNet model with the Hugging Face PyTorch interface.