**Purpose of this Notebook**

As stated in chapter 4.1.3, **Transfer Learning** has set a new state of the art in Natural Language Processing (incl. NLU and NLG, cf. chapter 2.2). The idea behind Transfer Learning, which originated in Computer Vision, has led to major breakthroughs in NLP over the last two years. Powerful language models were created by training them on large corpora of unlabeled text. The results are neural networks that capture general aspects of language in their internal word and character representations.<br>
The developers of the NLP library spaCy have incorporated state-of-the-art transformer architectures such as BERT and XLNet into their model pipeline. **In this notebook, the pipeline is used to illustrate how XLNet ("Generalized Autoregressive Pretraining for Language Understanding") can be used for text classification in the Quora case.** 

* XLNet Paper: https://arxiv.org/abs/1906.08237


Further resources:
*   https://github.com/explosion/spacy-transformers
*   https://github.com/huggingface/transformers (spacy developers wrapped Hugging Face's transformers)





Inspecting available RAM


*   Sometimes Colab does not offer enough RAM (the batches of Quora questions do not fit into RAM)
*   The topic was discussed here: https://stackoverflow.com/questions/48750199/google-colaboratory-misleading-information-about-its-gpu-only-5-ram-available
*   The code below checks the available GPU RAM



---



In [1]:
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
gpu = GPUs[0]

Collecting gputil
  Downloading https://files.pythonhosted.org/packages/ed/0e/5c61eedde9f6c87713e89d794f01e378cfd9565847d4576fa627d758c554/GPUtil-1.4.0.tar.gz
Building wheels for collected packages: gputil
  Building wheel for gputil (setup.py) ... done
  Created wheel for gputil: filename=GPUtil-1.4.0-cp36-none-any.whl size=7410 sha256=c9e4f9a2c049fbe48f6857420b37ec2d9488e0b0ad1662efd31c3900dcf5c265
  Stored in directory: /root/.cache/pip/wheels/3d/77/07/80562de4bb0786e5ea186911a2c831fdd0018bda69beab71fd
Successfully built gputil
Installing collected packages: gputil
Successfully installed gputil-1.4.0


In [2]:
def printm():
    # Report free system RAM, the size of this process, and GPU memory statistics
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))

printm()

Gen RAM Free: 12.8 GB  | Proc size: 156.5 MB
GPU RAM Free: 11441MB | Used: 0MB | Util   0% | Total 11441MB


Expected result: GPU RAM Free: 11441MB

### **Step 1: Install State-of-the-Art language model XLNet**

*   https://explosion.ai/blog/spacy-transformers?ref=Welcome.AI



In [0]:
# !pip install --upgrade spacy

In [2]:
import spacy
print("Spacy Version: ", spacy.__version__) # Version 2.2.2 needed

Spacy Version:  2.2.2


In [0]:
#!pip install torch==1.1.0
#!pip install spacy-pytorch-transformers[cuda100]==0.5.1
#!python -m spacy download en_trf_xlnetbasecased_lg # only XLNet is used in this Notebook
#!python -m spacy download en_trf_bertbaseuncased_lg 

### **Step 2: Load necessary libraries**

In [0]:
import thinc
import random
import spacy
import GPUtil
import torch
from spacy.util import minibatch
from tqdm.auto import tqdm
import unicodedata
import wasabi
import numpy
from collections import Counter
import pandas as pd
import numpy as np
from sklearn.utils import shuffle

In [3]:
print("Spacy Version: ", spacy.__version__)
print("Torch version: ", torch.__version__)

Spacy Version:  2.2.2
Torch version:  1.1.0


In [4]:
# Ensure that GPU is used 
spacy.util.fix_random_seed(0)
is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    print("GPU ON!\n")
    torch.set_default_tensor_type("torch.cuda.FloatTensor")
    print("GPU Usage")
    GPUtil.showUtilization()

GPU ON!

GPU Usage
| ID | GPU | MEM |
------------------
|  0 |  2% |  1% |


### **Step 3: Load Quora dataset**

In [9]:
from google.colab import drive
drive.mount('/content/drive') # The training set is stored in Google Drive, which has to be mounted first.
df = pd.read_csv("/content/drive/My Drive/Colab Notebooks/train.csv", nrows = 50000) 
print(df.shape)
pd.set_option('display.max_colwidth', 1500)
df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
(50000, 3)


|   | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province as a nation in the 1960s? | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you encourage people to adopt and not shop? | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity affect space geometry? | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg hemispheres? | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain bike by just changing the tyres? | 0 |


If the inputs to the transformer don't match how it was pretrained, it will have to rely much more on your small labelled training corpus, leading to lower accuracies.

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
def _prepare_partition(text_label_tuples, *, preprocess=False):
    texts, labels = zip(*text_label_tuples)
    # POSITIVE = 1.0 for target 1 (insincere)
    # NEGATIVE = 1.0 for target 0 (sincere)
    cats = [{"POSITIVE": float(bool(y)), "NEGATIVE": float(not bool(y))} for y in labels]
    return texts, cats

def load_data(df):
  # Transform the data into the format expected by the spaCy model
  # Truncate each question to max_seq_length characters to save memory
  max_seq_length = 130

  # Shuffle df
  df = shuffle(df)

  # Extract relevant data for classification
  df_texts = df['question_text'].tolist()
  df_texts_padded = [question[:max_seq_length] for question in df_texts]
  df_labels = df['target'].tolist()

  # Create train-validation split on the truncated texts
  train_texts_padded, dev_texts_padded, train_labels, dev_labels = train_test_split(df_texts_padded, df_labels, test_size=0.2,
                                                                                    stratify = df_labels, random_state=90)

  # Create tuples
  train_data= zip(train_texts_padded, train_labels)
  dev_data = zip(dev_texts_padded, dev_labels)

  train_texts, train_labels = _prepare_partition(train_data, preprocess=False)
  dev_texts, dev_labels = _prepare_partition(dev_data, preprocess=False)
  return (train_texts, train_labels), (dev_texts, dev_labels)

In [0]:
(train_texts, train_cats), (eval_texts, eval_cats) = load_data(df)

In [13]:
print(eval_texts[0])
print(eval_cats[0]) # NEGATIVE = 1.0: not toxic, i.e. sincere

Is it alright to share our feeling of love to opposite sex?
{'POSITIVE': 0.0, 'NEGATIVE': 1.0}
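
The label encoding used in `_prepare_partition` can be checked in isolation. The following is a minimal sketch; the helper name `target_to_cats` is hypothetical and not part of the notebook. It only reproduces the mapping from the Quora `target` column (1 = insincere, 0 = sincere) to the category dictionary.

```python
def target_to_cats(target):
    """Map a Quora target (1 = insincere, 0 = sincere) to a cats dictionary.

    Hypothetical helper for illustration; mirrors the list comprehension
    in _prepare_partition.
    """
    is_insincere = bool(target)
    return {"POSITIVE": float(is_insincere), "NEGATIVE": float(not is_insincere)}


sincere = target_to_cats(0)
insincere = target_to_cats(1)
print(sincere)    # {'POSITIVE': 0.0, 'NEGATIVE': 1.0}
print(insincere)  # {'POSITIVE': 1.0, 'NEGATIVE': 0.0}
```

A sincere question therefore carries `NEGATIVE: 1.0`, which matches the validation example printed above.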


### **Step 4: Load Spacy Model and prepare Pipeline**

In [14]:
model_choice = "en_trf_xlnetbasecased_lg" #  ["en_trf_bertbaseuncased_lg", "en_trf_xlnetbasecased_lg"]
max_wpb = 1000 # maximum number of words per batch

nlp = spacy.load(model_choice)
print(f"Loaded model '{model_choice}'")
print(nlp.pipe_names)
if model_choice == "en_trf_xlnetbasecased_lg":
  textcat = nlp.create_pipe(
      "trf_textcat",  # alternative config: {"exclusive_classes": True}
      config={"architecture": "softmax_last_hidden", "words_per_batch": max_wpb}
  )
elif model_choice == "en_trf_bertbaseuncased_lg":
  textcat = nlp.create_pipe(
      "trf_textcat", config={"architecture": "softmax_last_hidden", "words_per_batch": max_wpb}
  )
else:
  # Fail early: without a text classifier the following cells cannot run
  raise ValueError("Choose a supported transformer model")

Loaded model 'en_trf_xlnetbasecased_lg'
['sentencizer', 'trf_wordpiecer', 'trf_tok2vec']


![alt text](https://d33wubrfki0l68.cloudfront.net/39251284c89675c9f1db57a109d804077e06620e/9ecfb/blog/img/spacy-trf_pipeline.svg)

A spaCy pipeline with the following components was created:

*  **sentencizer:** splits sentences on punctuation like ., ! or ? [https://spacy.io/usage/linguistic-features#sbd-component]
*  **trf_wordpiecer:** performs the model's wordpiece pre-processing
*  **trf_tok2vec:** runs the transformer over the doc and saves the results into the built-in doc.tensor attribute and several extension attributes
*  More info: https://explosion.ai/blog/spacy-transformers

One component is still missing, the component for text categorization: *trf_textcat*

*  **trf_textcat** is based on spaCy's [TextCategorizer](https://spacy.io/api/textcategorizer).
*  The last component in the current pipeline (trf_tok2vec) translates the tokens of a sentence into contextual token representations (vectors).
*  These vectors are then used by trf_textcat to perform the binary classification task for Quora.
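
To make the step from contextual vectors to class probabilities more concrete, here is a minimal NumPy sketch. It is not spaCy's implementation: it assumes that the token vectors of the last hidden layer are mean-pooled and passed through a single linear layer followed by a softmax. The pooling choice, the random weights and the token count are assumptions for illustration only; 768 is the hidden width of XLNet-base.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


rng = np.random.default_rng(0)
labels = ["POSITIVE", "NEGATIVE"]
hidden_size = 768   # hidden width of XLNet-base
n_tokens = 12       # hypothetical number of wordpieces in one question

# Stand-in for the transformer output: one contextual vector per token
last_hidden_state = rng.normal(size=(n_tokens, hidden_size))

# Assumption: mean-pool the token vectors into one document vector
doc_vector = last_hidden_state.mean(axis=0)

# Randomly initialised linear layer (would be learned during training)
W = rng.normal(scale=0.02, size=(len(labels), hidden_size))
b = np.zeros(len(labels))

probs = softmax(W @ doc_vector + b)
cats = dict(zip(labels, probs.tolist()))
print(cats)
```

The resulting dictionary has the same shape as `doc.cats`: one probability per label, summing to 1.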

In [15]:
# trf_textcat was already initialized above

# add label to text classifier
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# Add trf_textcat as last pipeline component
nlp.add_pipe(textcat, last=True)
print(nlp.pipe_names) # pipeline looks like this now

['sentencizer', 'trf_wordpiecer', 'trf_tok2vec', 'trf_textcat']


### **Step 5: Setting up model hyperparameters**

In [0]:
n_iter= 10 # number of epochs
# n_texts=75 # Changed number of texts to 75 to relieve pressure on GPU memory
batch_size= 128 # batch size; a smaller value (e.g. 4) relieves pressure on GPU memory
learn_rate=1e-5
pos_label="NEGATIVE"

### **Step 6: Create Evaluation function to monitor learning process**
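
The evaluation function itself is not shown in this notebook. The following is a minimal sketch of what such a function could look like, assuming predictions and gold labels are both available as category dictionaries (as in `doc.cats` and `eval_cats`). The function name `evaluate`, its signature and the toy data are hypothetical.

```python
def evaluate(predicted_cats, gold_cats, pos_label="NEGATIVE", threshold=0.5):
    """Compute precision, recall and F1 for one label from cats dictionaries."""
    tp = fp = fn = tn = 0
    for pred, gold in zip(predicted_cats, gold_cats):
        predicted_pos = pred[pos_label] >= threshold
        is_pos = gold[pos_label] >= 0.5
        if predicted_pos and is_pos:
            tp += 1
        elif predicted_pos and not is_pos:
            fp += 1
        elif not predicted_pos and is_pos:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data (made up for illustration, not model output)
toy_predictions = [
    {"POSITIVE": 0.1, "NEGATIVE": 0.9},
    {"POSITIVE": 0.7, "NEGATIVE": 0.3},
    {"POSITIVE": 0.4, "NEGATIVE": 0.6},
]
toy_gold = [
    {"POSITIVE": 0.0, "NEGATIVE": 1.0},
    {"POSITIVE": 1.0, "NEGATIVE": 0.0},
    {"POSITIVE": 1.0, "NEGATIVE": 0.0},
]
scores = evaluate(toy_predictions, toy_gold)
print(scores)
```

In the notebook, the predictions could be collected with `[doc.cats for doc in nlp.pipe(eval_texts)]` and scored after each epoch of the training loop.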

### **Step 6: Model training**

In [17]:
# Model input
print(f"Using {len(train_texts)} training docs, {len(eval_texts)} evaluation \n")

# Preparing training data input
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
print(train_data[0])

# Inspecting Validation data
print("Validation data text: ", eval_texts[0])
print("Validation data label: ", eval_cats[0])

Using 40000 training docs, 10000 evaluation 

("Why is durian so disgusting to those who don't like it?", {'cats': {'POSITIVE': 0.0, 'NEGATIVE': 1.0}})
Validation data text:  Is it alright to share our feeling of love to opposite sex?
Validation data label:  {'POSITIVE': 0.0, 'NEGATIVE': 1.0}


In [18]:
is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    torch.set_default_tensor_type("torch.cuda.FloatTensor")

#nlp = spacy.load("en_trf_bertbaseuncased_lg")
print(nlp.pipe_names) # ["sentencizer", "trf_wordpiecer", "trf_tok2vec"]
#textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": True})
#for label in ("POSITIVE", "NEGATIVE"):
#    textcat.add_label(label)
#nlp.add_pipe(textcat)
print("Final_pipeline: ", nlp.pipe_names)

# Note: learn_rate from Step 5 is not passed to the optimizer in this cell
optimizer = nlp.resume_training()
for i in range(n_iter):
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data, size=batch_size):       
        texts, cats = zip(*batch)
        nlp.update(texts, cats, sgd=optimizer, losses=losses)
    print(i, losses)

['sentencizer', 'trf_wordpiecer', 'trf_tok2vec', 'trf_textcat']
Final_pipeline:  ['sentencizer', 'trf_wordpiecer', 'trf_tok2vec', 'trf_textcat']
0 {'trf_textcat': 0.0024444458842936}
1 {'trf_textcat': 0.0022726016477463418}
2 {'trf_textcat': 0.0023061066879108694}
3 {'trf_textcat': 0.0022859878758936247}
4 {'trf_textcat': 0.0022666700258469064}
5 {'trf_textcat': 0.0022981258065328802}
6 {'trf_textcat': 0.002265798757775883}
7 {'trf_textcat': 0.0022597706896476666}
8 {'trf_textcat': 0.0022798393233642855}
9 {'trf_textcat': 0.0022726109815494056}


The loss of the text categorizer decreases after the first epoch and then stagnates, oscillating around 0.0022. Such a low loss looks strange, but the implementation of the spaCy pipeline seems correct. A possible reason is that the model was fit on only a small subsample of the original data (because of runtime and resource constraints).
Other experiments with a different learning rate even resulted in an increasing loss. The reason could have been a learning rate that was too high (cf. image below).

![alt text](https://miro.medium.com/max/1106/1*An4tZEyQAYgPAZl396JzWg.png)

However, even using [Cyclical Learning Rates](https://towardsdatascience.com/adaptive-and-cyclical-learning-rates-using-pytorch-2bf904d18dee) did not give the expected results.
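
For reference, a cyclical learning rate can be sketched as a simple triangular schedule: the rate rises linearly from a lower bound to an upper bound and falls back again. This is only an illustration of the general idea, not the schedule used in the experiments above. The function name and the values of `base_lr`, `max_lr` and `step_size` are hypothetical, and how the rate is handed to the optimizer depends on the library and is not shown.

```python
def triangular_lr(step, base_lr=1e-6, max_lr=1e-5, step_size=100):
    """Triangular cyclical learning rate.

    The rate climbs from base_lr to max_lr over step_size updates and
    descends back to base_lr over the next step_size updates.
    """
    cycle = step // (2 * step_size)               # index of the current cycle
    x = abs(step / step_size - 2 * cycle - 1)     # 1 at the cycle edges, 0 at the peak
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


schedule = [triangular_lr(step) for step in (0, 50, 100, 150, 200)]
print(schedule)  # rises from base_lr to max_lr at step 100, then falls back
```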

### **Step 7: Model prediction**

In [19]:
# Test the trained model
test_text = eval_texts[0]
doc = nlp(test_text)
print("Sentence to perform Classification on: \n", test_text)
print("Prediction returned by Softmax Function: ", doc.cats)

Sentence to perform Classification on: 
 Is it alright to share our feeling of love to opposite sex?
Prediction returned by Softmax Function:  {'POSITIVE': 0.05213458463549614, 'NEGATIVE': 0.9478654265403748}


**Prediction is correct:   
Model labels question with high confidence as not toxic, which is correct**   
("NEGATIVE" indicates class that is NOT TOXIC)

### **Step 8: Evaluation**

*   spaCy provides an easy-to-use interface for implementing state-of-the-art NLP architectures. A positive aspect is that spaCy already incorporates XLNet, which was released only a few months ago. However, the documentation of [spacy-transformers](https://github.com/explosion/spacy-transformers) is still insufficient for the following reasons: 
  * First, it is not clear which config to use for the trf_textcat component. The demo recommends {"architecture": "softmax_last_hidden"}; however, this architecture is not described in the [spaCy TextCategorizer docs](https://spacy.io/api/textcategorizer#architectures). Since the Quora case is a binary classification task, config={"exclusive_classes": True} seems to be possible as well.
  * Second, using raw XLNet without fine-tuning does not lead to different predictions for the Quora questions. Normally, the internal representations of XLNet should lead to different softmax outputs.
  * Third, at the time of writing it is unclear why it is not possible to train only the trf_textcat component without modifying the XLNet vectors (trf_tok2vec component).
  * Furthermore, the loss of the trf_textcat component above indicates that something seems to be wrong with either the spaCy implementation of pytorch-transformers or the current adaptation of the [spaCy demo for text classification](https://github.com/explosion/spacy-transformers/blob/master/examples/train_textcat.py) to the Quora case.
  * A problem in the Quora case is that the dataset is quite imbalanced. For this reason, a search for an optimal threshold was conducted in the final prototype. However, the spaCy demo fixes the threshold at 0.5, which cannot be done in the Quora case (the optimal threshold for the language model predictions is unknown). Because the spaCy pipeline processes documents as a stream (generator object), the experimental threshold search used in the custom Keras model cannot be applied.
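
One possible way to approach the threshold problem is sketched below. It assumes that the streamed predictions are first collected into a list, for example with `[doc.cats["POSITIVE"] for doc in nlp.pipe(eval_texts)]`, so that they can be scanned repeatedly. The function names, the candidate grid and the toy scores are hypothetical; this is not the threshold search from the Keras prototype.

```python
import numpy as np


def f1_at_threshold(scores, labels, threshold):
    """F1 of the insincere class when predicting 1 for scores >= threshold."""
    preds = scores >= threshold
    tp = np.sum(preds & (labels == 1))
    fp = np.sum(preds & (labels == 0))
    fn = np.sum(~preds & (labels == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def best_threshold(scores, labels, candidates):
    """Return (threshold, f1) for the candidate with the highest F1."""
    results = [(float(t), f1_at_threshold(scores, labels, t)) for t in candidates]
    return max(results, key=lambda pair: pair[1])


# Toy scores for the insincere class and gold targets (made up for illustration)
toy_scores = np.array([0.05, 0.12, 0.22, 0.33, 0.37, 0.64, 0.81])
toy_labels = np.array([0, 0, 0, 1, 0, 1, 1])

best_t, best_f1 = best_threshold(toy_scores, toy_labels, np.linspace(0.1, 0.9, 9))
print(f"best threshold: {best_t:.1f}, F1: {best_f1:.3f}")
```

On an imbalanced dataset the best threshold found this way is often well below 0.5, which is why a fixed cut-off can understate the model's usable performance.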
