<a href="https://colab.research.google.com/github/EdwardDixon/trainings_in_ml/blob/master/language_modelling_for_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transfer Learning for Text
*Using a pre-trained language model to play with NLP*


---


First, let's make sure you have a GPU attached to your Colab notebook.  When you run the next cell, the output should tell you which GPU is attached to this notebook (for example, a Tesla K80).  If no GPU is listed, go to the "Edit" menu, choose "Notebook Settings", and select "GPU" from the "Hardware Accelerator" dropdown.

In [0]:
!nvidia-smi

## Installing Pre-requisites
Next, we need to install the libraries we will rely on.  The `pytorch-pretrained-bert` library comes from [the nice people at Hugging Face](https://huggingface.co/) and includes pre-trained PyTorch ports of the latest language models, plus some nice helper code to get your text ready to feed into them.  Notice the `!` prefix: these are commands we are running in the shell, not Python code.  The `spaCy` library is becoming something of an industry standard for NLP.

In [0]:
!pip install pytorch-pretrained-bert
!pip install spacy ftfy==4.4.3
!python -m spacy download en

## Preparing the text
We need to do a little work before we can get our text into the model.  [BERT is a word-piece model](https://github.com/google-research/bert), which means that it operates on words and parts of words.  The version that we will be using, `bert-base-uncased`, has 12 layers, 110M parameters, and a vocabulary of about 30K word pieces.  Other BERT variants have double the number of layers, distinguish between lower and upper case, or have been trained on multilingual datasets.
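
To make "words and parts of words" concrete, here is a toy sketch of greedy longest-match-first splitting, the general idea behind word-piece tokenization.  The function name `wordpiece_tokenize` and the tiny `toy_vocab` are made up for illustration; the real tokenizer works against the full 30K-entry vocabulary and also handles punctuation and casing.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lower-cased word into word pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word are marked with a '##' prefix
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched, so the whole word becomes the unknown token
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical mini-vocabulary, just large enough for the demo
toy_vocab = {"who", "was", "jim", "henson", "a", "puppet", "##eer"}
print(wordpiece_tokenize("puppeteer", toy_vocab))
print(wordpiece_tokenize("zebra", toy_vocab))
```

With this toy vocabulary, "puppeteer" is split into a whole-word piece and a `##` continuation piece, while a word with no matching pieces falls back to `[UNK]`.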

In [0]:
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenized input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
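
The segment ids above were written out by hand.  As a small sketch, they could also be derived from the token list by bumping a counter after each `[SEP]`.  The helper name `segment_ids_from_tokens` is hypothetical and not part of `pytorch-pretrained-bert`.

```python
def segment_ids_from_tokens(tokens, sep_token="[SEP]"):
    """Assign 0 to the first sentence, 1 to the second, and so on."""
    ids = []
    current = 0
    for tok in tokens:
        ids.append(current)
        # The separator belongs to the sentence it closes
        if tok == sep_token:
            current += 1
    return ids


tokens = ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
          'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
print(segment_ids_from_tokens(tokens))
```

For the token list used in this notebook, this reproduces the hand-written `segments_ids`.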

## Text to Vectors
Now that our text is in tensors, we can push it into the model.  This will get us hidden states that we can use for our NLP tasks...

In [0]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)

# Predict hidden states features for each layer
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)
# We have one hidden-state tensor for each of the 12 layers in bert-base-uncased
assert len(encoded_layers) == 12
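
One common way to use hidden states for a downstream task is to average the per-token vectors into a single fixed-size vector for the whole input (mean pooling).  The sketch below shows the arithmetic in plain Python on made-up numbers; `mean_pool` is a hypothetical helper, and the toy vectors are not real model output.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]


# Two toy "token" vectors with two dimensions each (real BERT-base vectors have 768)
toy_hidden_states = [[1.0, 2.0],
                     [3.0, 4.0]]
print(mean_pool(toy_hidden_states))
```

With the tensors from the cell above, the same idea would be something like `encoded_layers[-1].mean(dim=1)`, which averages the last layer over the token dimension.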

Let's see if the model can predict the missing token...

In [0]:
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)

# Predict all tokens
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)

print("Prediction has shape " + str(predictions.shape) + "\n")
    
# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print("Predicted token = " + predicted_token)

# ...and the least likely token?
not_predicted_index = torch.argmin(predictions[0, masked_index]).item()
not_predicted_token = tokenizer.convert_ids_to_tokens([not_predicted_index])[0]
print("Least likely value: " + not_predicted_token)

## Near misses?
What about the words that didn't quite make it?  Our prediction vector holds a score for every token in the vocabulary, so we can take a look at the next-highest-scoring words.  I've taken the top 5.  Note that we need to bring the results back from the GPU's memory to the CPU before converting them to tokens.

In [0]:
# Top 5 guesses?
from torch import topk
token_indices = topk(predictions[0, masked_index],5)[1]
print(token_indices)
topk_tokens = tokenizer.convert_ids_to_tokens(token_indices.cpu().numpy())
print(topk_tokens)
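
If you want probabilities rather than raw scores, a softmax rescales a score vector so that its entries are positive and sum to one.  Here is a minimal sketch in plain Python; the token names and scores are made up for illustration and are not real model output.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidates and scores, purely for the demo
candidates = ["token_a", "token_b", "token_c"]
scores = [5.0, 2.0, 1.0]
probs = softmax(scores)

# Rank candidates from most to least likely
ranked = sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True)
for token, p in ranked:
    print(f"{token}: {p:.3f}")
```

In the notebook itself, `torch.softmax(predictions[0, masked_index], dim=-1)` should do the same job on the GPU tensor.  Softmax preserves ordering, so the top 5 tokens stay the same either way.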

## Time for classification!
Let's start by getting a dataset with two classes.  We can use [this repository I created for my chapter in Online Harassment](https://github.com/EdwardDixon/Automation-and-Harassment-Detection).

In [13]:
!git clone https://github.com/EdwardDixon/Automation-and-Harassment-Detection

Cloning into 'Automation-and-Harassment-Detection'...
remote: Enumerating objects: 39, done.
remote: Total 39 (delta 0), reused 0 (delta 0), pack-reused 39
Unpacking objects: 100% (39/39), done.


Now we can load the data and view a sample.

In [16]:
import pandas as pd
df_train = pd.read_csv("Automation-and-Harassment-Detection/data/attack_train.csv")
df_train.tail()

| Unnamed: 0 | rev_id | comment | year | logged_in | ns | sample | split | attack |
|---|---|---|---|---|---|---|---|---|
| 69521 | 699756185 | " The lead itself is original research. Wher... | 2016 | True | article | blocked | train | 0.1 |
| 69522 | 699813325 | " ::I'm talking about you making unjustified m... | 2016 | True | article | blocked | train | 0.157895 |
| 69523 | 699848324 | " These sources don't exactly exude a sense ... | 2016 | True | article | blocked | train | 0.111111 |
| 69524 | 699857133 | :The way you're trying to describe it in this... | 2016 | True | article | blocked | train | 0.0 |
| 69525 | 699897151 | Alternate option=== Is there perhaps enough ne... | 2016 | True | article | blocked | train | 0.0 |
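
The `attack` column in this sample holds fractional values rather than a 0/1 class, so a two-class setup needs some rule for turning scores into labels.  The sketch below assumes a cut-off of 0.5, which is my own choice for illustration and not something the dataset prescribes; `to_label` is a hypothetical helper.

```python
THRESHOLD = 0.5  # assumed cut-off, not taken from the dataset


def to_label(attack_score, threshold=THRESHOLD):
    """Return 1 for 'attack' and 0 for 'not attack'."""
    return 1 if attack_score >= threshold else 0


# Attack scores copied from the five rows shown above
attack_scores = [0.1, 0.157895, 0.111111, 0.0, 0.0]
labels = [to_label(score) for score in attack_scores]
print(labels)
```

On the full dataframe, the equivalent under the same assumption would be `(df_train["attack"] >= 0.5).astype(int)`.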
