A.S. Lundervold, v.280223

# Introduction

This is a quick tour of some techniques and ideas from natural language processing (NLP), including approaches based on deep learning. The goal is to introduce some of what is going on in this field and to help you better understand some recent ideas and developments in deep learning.

This and the following two notebooks serve as companions to the fastai-based material covered in Module 4 of the course (Lesson 5 and Chapter 9 of the textbook). 

> NLP is an exciting area these days. Recent breakthroughs in deep learning for language processing started a revolution in NLP, and we're still in it. Perhaps the best place to start exploring is the HuggingFace community and library, at least if you want to start experimenting with state-of-the-art NLP models right away: https://huggingface.co/. <br> <br><a href="https://huggingface.co/"><img width=20% src="https://luxcapital-website-media.s3.amazonaws.com/wp-content/uploads/2019/12/23115642/Logo-600x554.png"></a>

# Setup

In [1]:
# This is a quick check of whether the notebook is currently running on Google Colaboratory
# or on Kaggle, as that makes some difference for the code below.
# We'll do this in every notebook of the course.
try:
    import google.colab
    colab = True
except ImportError:
    colab = False

import os
kaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

In [2]:
import numpy as np

In [None]:
if (colab or kaggle):
  %pip install datasets
  %pip install transformers
  %pip install evaluate
  %pip install gradio

In [3]:
#import os
#os.environ['CUDA_VISIBLE_DEVICES'] = "0"

We'll use the excellent HuggingFace Transformers library, which covers all our natural language processing needs:

<img src="https://camo.githubusercontent.com/b253a30b83a0724f3f74f3f58236fb49ced8d7b27cb15835c9978b54e444ab08/68747470733a2f2f68756767696e67666163652e636f2f64617461736574732f68756767696e67666163652f646f63756d656e746174696f6e2d696d616765732f7265736f6c76652f6d61696e2f7472616e73666f726d6572735f6c6f676f5f6e616d652e706e67">


We will not cover the library in any detail. If you're interested, take a look at the [HuggingFace course](https://huggingface.co/course/chapter1/1) and its documentation over at https://huggingface.co/transformers.

# Load data

We'll use the [IMDB dataset](https://huggingface.co/datasets/imdb) containing 50,000 movie reviews from IMDB, each labeled as either negative (0) or positive (1). It is split into 25,000 reviews for training and 25,000 reviews for testing.

The dataset is available via HuggingFace `datasets`:

In [4]:
from datasets import load_dataset

In [5]:
dataset = load_dataset("imdb")

Found cached dataset imdb (/home/alex/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)


  0%|          | 0/3 [00:00<?, ?it/s]

In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

## Make a sample dataset

As the training process takes a long time, we create a small sample dataset:

In [7]:
sample = True

In [8]:
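if sample:
    # Keep 20% of the 25,000 training reviews (5,000 reviews) ...
    dataset = dataset['train']
    dataset = dataset.train_test_split(train_size=0.2, shuffle=True, seed=42)['train']
    # ... and split those 80/20 into 4,000 training and 1,000 test reviews
    dataset = dataset.train_test_split(test_size=0.2)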

Loading cached split indices for dataset at /home/alex/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-db2211c4cbcbe9c5.arrow and /home/alex/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-728f846228169c23.arrow


In [9]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
})

# Explore the data

The training data is stored under `train` and the test data under `test`. Here are two training instances:

In [10]:
dataset['train'][10:12]

{'text': ["this is a great movie. I love the series on tv and so I loved the movie. One of the best things in the movie is that Helga finally admits her deepest darkest secret to Arnold!!! that was great. i loved it it was pretty funny too. It's a great movie! Doy!!!",
  'I\'m sorry to say that there isn\'t really any way, in my opinion, that an Enzo would really be able to keep up with a Saleen S7 Twin Turbo. The power to weight advantage possessed by the S7 would just be too great. The S7 has a power:weight ratio of 3.93 lbs/hp while the Enzo has 4.61 lbs/hp. The S7s low end is much better too. Sorry Ferrari fans but the Saleen just gets it done so much better.<br /><br />As for other parts of this film, I just have to say it\'s so substandard as to be pathetic. The story is way too weak. The acting in this lemon is worse than daytime soaps.<br /><br />I can say that as far as it being a treatise on negative psychology its kind of a gem. This film is nothing if not a glaring definiti

We can print them in a more readable form:

In [11]:
dataset['train'][10]['text']

"this is a great movie. I love the series on tv and so I loved the movie. One of the best things in the movie is that Helga finally admits her deepest darkest secret to Arnold!!! that was great. i loved it it was pretty funny too. It's a great movie! Doy!!!"

In [12]:
dataset['train'][10]['label']

0

> **How do we represent the text for consumption by a machine learning model?**

> **How can a computer read??**

<img src="https://camo.githubusercontent.com/7d5ed540c87d660cae46ca0d2055d760f786bea36513bb1a0b0784d47cef45b1/687474703a2f2f322e62702e626c6f6773706f742e636f6d2f5f2d2d75564865746b5549512f54446165356a476e6138492f4141414141414141414b302f734253704c7564576d63772f73313630302f72656164696e672e676966">

# Prepare the data: tokenization and numericalization

For a computer, everything is numbers. We have to convert the text to a series of numbers and then feed those to the computer.

This can be done in two widely used steps in natural language processing: **tokenization** and **numericalization**.

## Tokenization

In tokenization, the text is split into single components or units called tokens. In the context of deep learning, tokenization aims to convert a sequence of characters into a sequence of tokens in a way that enables accurate and efficient processing by deep learning models. 

Multiple tokenization strategies exist (word-, character- and subword-based), each trading off vocabulary size, sequence length and the handling of unseen words. Examples include **rule-based splitting of sentences** (used by ULMFiT and Transformer XL and others), **WordPiece** (used by BERT and others), **SentencePiece** (used by XLNet, T5 and others), and **Byte-Pair encoding** (used by GPT models (including ChatGPT) and others).

Let's take a look at some of the ideas.

In [13]:
example_sentence = "Here's a sentence to be tokenized by a tokenizer, and it includes the non-existent word graffalacticus"

### Character tokenization

Perhaps the simplest tokenization strategy is to split the text into characters: 

In [14]:
characters = [c.lower() for c in example_sentence]
print(characters)

['h', 'e', 'r', 'e', "'", 's', ' ', 'a', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', ' ', 't', 'o', ' ', 'b', 'e', ' ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 'd', ' ', 'b', 'y', ' ', 'a', ' ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 'r', ',', ' ', 'a', 'n', 'd', ' ', 'i', 't', ' ', 'i', 'n', 'c', 'l', 'u', 'd', 'e', 's', ' ', 't', 'h', 'e', ' ', 'n', 'o', 'n', '-', 'e', 'x', 'i', 's', 't', 'e', 'n', 't', ' ', 'w', 'o', 'r', 'd', ' ', 'g', 'r', 'a', 'f', 'f', 'a', 'l', 'a', 'c', 't', 'i', 'c', 'u', 's']


Unicode and ASCII are well-known examples of character encodings.

In [15]:
example_sentence.encode('ascii')

b"Here's a sentence to be tokenized by a tokenizer, and it includes the non-existent word graffalacticus"

In [16]:
# Unicode values for each character in the example sentence
print([ord(c) for c in example_sentence])

[72, 101, 114, 101, 39, 115, 32, 97, 32, 115, 101, 110, 116, 101, 110, 99, 101, 32, 116, 111, 32, 98, 101, 32, 116, 111, 107, 101, 110, 105, 122, 101, 100, 32, 98, 121, 32, 97, 32, 116, 111, 107, 101, 110, 105, 122, 101, 114, 44, 32, 97, 110, 100, 32, 105, 116, 32, 105, 110, 99, 108, 117, 100, 101, 115, 32, 116, 104, 101, 32, 110, 111, 110, 45, 101, 120, 105, 115, 116, 101, 110, 116, 32, 119, 111, 114, 100, 32, 103, 114, 97, 102, 102, 97, 108, 97, 99, 116, 105, 99, 117, 115]


This is typically not a useful strategy for deep learning, as it is too granular. It is, however, useful for some applications, such as spelling correction.

### Splitting text into words

A simple way to tokenize is to split the text into words at the spaces:

In [17]:
words = example_sentence.split(" ")
tokens = {v: k for k, v in enumerate(words)}

In [18]:
tokens

{"Here's": 0,
 'a': 7,
 'sentence': 2,
 'to': 3,
 'be': 4,
 'tokenized': 5,
 'by': 6,
 'tokenizer,': 8,
 'and': 9,
 'it': 10,
 'includes': 11,
 'the': 12,
 'non-existent': 13,
 'word': 14,
 'graffalacticus': 15}
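Note that `'a'` ends up with id 7 and no word has id 1: the sentence contains "a" twice, and the dictionary comprehension overwrites the first occurrence with the second. Here is a minimal sketch of a vocabulary with consecutive ids instead. The `[UNK]` placeholder for unseen words and the helper `numericalize` are assumptions made for illustration, not part of the notebook.

```python
example_sentence = "Here's a sentence to be tokenized by a tokenizer, and it includes the non-existent word graffalacticus"
words = example_sentence.split(" ")

# Assign ids in order of first appearance. "[UNK]" is a hypothetical
# placeholder for words that are not in the vocabulary.
vocab = {"[UNK]": 0}
for word in words:
    if word not in vocab:
        vocab[word] = len(vocab)

def numericalize(text):
    """Map each space-separated word to its id, falling back to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split(" ")]

# "robot" is not in the vocabulary, so it is mapped to the [UNK] id
ids = numericalize("a sentence by a robot")
print(ids)
```

Every repeated word now shares a single id, and unseen words no longer raise a `KeyError`.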

One would never use this in practice, as it's very inefficient and uses no features of language except that words tend to, in some languages, be separated by spaces.

Among other things, punctuation stays attached to the neighbouring word (note the token `tokenizer,` above), and we ignore the fact that some words are contractions of multiple words (for example "here's", "isn't", and "don't"). By specifying a set of rules, we can do better.
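As a rough illustration of what "a set of rules" can mean, here is a sketch with a single regular-expression rule that separates punctuation from words. The function name `simple_rule_tokenize` is made up for this example, and the rule is far cruder than what a real tokenizer applies.

```python
import re

def simple_rule_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

rule_tokens = simple_rule_tokenize("Here's a tokenizer, isn't it?")
print(rule_tokens)
```

The comma is now its own token, but contractions are split at the apostrophe without any linguistic knowledge ("isn't" becomes `isn`, `'`, `t`). Handling such cases well is what a curated rule set provides.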

<img src="https://spacy.io/images/tokenization.svg">

### Rule-based splitting of sentences into words

Here's a better approach, using the NLP library `spaCy`. We install spaCy and download a set of rules for tokenizing English text:

In [19]:
%%capture
try:
    import spacy
except ImportError:
    %pip install spacy
    import spacy

In [20]:
try:
    nlp = spacy.load("en_core_web_sm")
    print("Spacy model loaded")
except OSError:
    # The model is not installed yet: download it, then load it
    import sys
    !{sys.executable} -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

Spacy model loaded


In [21]:
doc = nlp(example_sentence)
for token in doc:
    print(token.text)

Here
's
a
sentence
to
be
tokenized
by
a
tokenizer
,
and
it
includes
the
non
-
existent
word
graffalacticus


### Subword tokenization

With word-based tokenization, we typically need a very large vocabulary to encode all possible words. Subword tokenization is a middle ground between character encoding and full-word encoding, based on splitting words into word pieces. Common words can be assigned their own token, while rare words are split into pieces.

Modern subword tokenizers tend to be _trained_ on the text corpus you're interested in (or pre-trained on a large corpus, alongside a model that you can later fine-tune).

Here's an example of a subword tokenizer: WordPiece, the token-splitting algorithm used by BERT.

In [22]:
from transformers import BertTokenizer

In [23]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [24]:
tokenizer.tokenize(example_sentence)

['here',
 "'",
 's',
 'a',
 'sentence',
 'to',
 'be',
 'token',
 '##ized',
 'by',
 'a',
 'token',
 '##izer',
 ',',
 'and',
 'it',
 'includes',
 'the',
 'non',
 '-',
 'existent',
 'word',
 'graf',
 '##fa',
 '##la',
 '##ctic',
 '##us']
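To see how a word like "graffalacticus" ends up as several pieces, here is a minimal sketch of greedy longest-match-first splitting, the strategy WordPiece applies to each word once a vocabulary is fixed. The toy vocabulary is an assumption built from the pieces in the output above; the real `bert-base-uncased` vocabulary has roughly 30,000 entries, and the function `wordpiece_split` is written for illustration only.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first splitting of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"token", "##ized", "##izer", "graf", "##fa", "##la", "##ctic", "##us"}

for word in ["tokenized", "tokenizer", "graffalacticus", "xyz"]:
    print(word, "->", wordpiece_split(word, toy_vocab))
```

The `##` prefix marks pieces that continue a word, which is what allows the original words to be reassembled from the token sequence.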

### Byte-Pair encoding: an example of training an encoder

When faced with a particular text corpus, the rule-based tokenizers above can be both wasteful (with superfluous tokens for words that don't appear in your corpus) and inefficient (for example, lacking tokens for important and frequently used words in your corpus).

Tokenizers that are _trained_, i.e. that identify important words or word pieces in a corpus, can therefore be useful, and training is part of most modern tokenizers. An example is the **byte pair encoding** used by, among others, GPT models.

The Byte Pair Encoding (BPE) algorithm was introduced by Philip Gage in 1994 for data compression (_"a simple general-purpose data compression algorithm"_), based on identifying common byte pairs. Here's a copy of the original article http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM. See also Wikipedia for a simple example of BPE used for data compression: https://en.wikipedia.org/wiki/Byte_pair_encoding

The procedure is roughly the following: 

```
1. Add identifiers marking the end of each word
2. Calculate the word frequencies in the text corpus
3. Split the words into characters and calculate the character frequencies
4. From character tokens, count the frequency of consecutive byte pairs and merge the most frequent byte pair
5. Continue until a manually defined iteration limit is reached, or the token limit is reached. 
```

> This is a greedy algorithm. Non-greedy variants exist and other tweaks to BPE are in use.
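The procedure can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation used by the `tokenizers` library: it works on characters rather than bytes, uses `</w>` as the end-of-word marker, and the toy corpus and function names are made up for the example.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    # Steps 1-3: word frequencies, words split into characters + end marker
    word_counts = Counter(corpus.split())
    word_freqs = {tuple(word) + ("</w>",): freq for word, freq in word_counts.items()}
    merges = []
    # Steps 4-5: repeatedly merge the most frequent pair
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        word_freqs = merge_pair(best, word_freqs)
        merges.append(best)
    return merges, word_freqs

# Hypothetical toy corpus: "low" x5, "lower" x2, "newest" x6, "widest" x3
toy_corpus = " ".join(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3)
merges, segmented = train_bpe(toy_corpus, num_merges=3)
print(merges)
print(segmented)
```

The list of learned merges, applied in order, is what a trained BPE tokenizer later uses to split new text.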

In [25]:
from tokenizers import Tokenizer
from tokenizers.models import BPE

In [26]:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]"])

In [27]:
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

In [28]:
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

In [29]:
tokenizer.train_from_iterator(dataset['train']['text'],trainer=trainer)






In [30]:
example_sentence_bpe = tokenizer.encode(example_sentence)

In [31]:
example_sentence_bpe

Encoding(num_tokens=27, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [32]:
example_sentence_bpe.tokens

['Here',
 "'",
 's',
 'a',
 'sentence',
 'to',
 'be',
 'token',
 'ized',
 'by',
 'a',
 'token',
 'izer',
 ',',
 'and',
 'it',
 'includes',
 'the',
 'non',
 '-',
 'existent',
 'word',
 'gr',
 'aff',
 'al',
 'actic',
 'us']

In [33]:
example_sentence_bpe.ids[:15]

[2160, 7, 83, 65, 9932, 151, 174, 13161, 1445, 273, 65, 13161, 14468, 12, 157]

## Numericalization

We convert tokens to numbers by building a vocabulary, a list of all the tokens in use, and assigning each token a number (its index in the list). The trained tokenizer has already taken care of this for us:

In [34]:
example_sentence_bpe.ids[:15]

[2160, 7, 83, 65, 9932, 151, 174, 13161, 1445, 273, 65, 13161, 14468, 12, 157]

# Fine-tuning pre-trained models

The advent of **Transformer models** has revolutionized the field of natural language processing. When faced with an NLP task for which deep learning is applicable, practitioners therefore tend to turn to Transformer models. Furthermore, one typically uses _pre-trained models_, that is, models that have already been trained on large-scale NLP tasks and thus contain representations that provide useful starting points for new tasks.

## Text representation for pre-trained models

When using pre-trained models, we must pre-process the text exactly as the model expects. In other words, we must use the tokenization, numericalization, padding, and truncation strategies that the model expects.
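Padding and truncation exist because a batch must be a rectangular array of ids. The following sketch shows the idea in plain Python. The token ids, the `pad_id` of 0 and the function `pad_and_truncate` are assumptions for illustration; a real tokenizer also takes care to keep special tokens such as the final separator when truncating, which this sketch does not.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Return fixed-length id lists plus masks marking the real tokens."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]                     # truncation
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)   # padding on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

# Two made-up id sequences of different lengths
padded, mask = pad_and_truncate(
    [[101, 7, 8, 102], [101, 5, 6, 7, 8, 9, 102]], max_length=5
)
print(padded)
print(mask)
```

The attention mask tells the model which positions hold real tokens (1) and which are padding (0).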

In [35]:
from transformers import AutoTokenizer

In [36]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [37]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [38]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [39]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
})

## Fine-tune a model

We'll fine-tune a BERT model on our IMDB dataset. (Note that this is where it's best to use a sample of the dataset. Otherwise the training process will take a long time.)

In [40]:
from transformers import AutoModelForSequenceClassification

**Define the model and its preprocessing steps**

In [41]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initi

In [42]:
#trainer.model

**Set up our evaluation metric**

In [43]:
import evaluate
metric = evaluate.load("accuracy")

In [44]:
def compute_metrics(eval_pred):

    logits, labels = eval_pred

    predictions = np.argmax(logits, axis=-1)

    return metric.compute(predictions=predictions, references=labels)
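To make explicit what `compute_metrics` calculates, here is a plain-Python sketch of the same idea: pick the class with the largest logit for each example and report the fraction of correct predictions. The logits and labels are made-up numbers, and `accuracy_from_logits` is a hypothetical helper, not part of `evaluate`.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Made-up logits for four reviews (column 0: negative, column 1: positive)
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.8]]
toy_labels = [0, 1, 0, 0]
result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```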

**Configure the training process**

In [45]:
from transformers import TrainingArguments, Trainer

In [46]:
#?TrainingArguments

In [None]:
# Increase this to improve performance (at the cost of computational time), 
# especially if you're training on the full data set.
num_train_epochs = 1

In [47]:
training_args = TrainingArguments(output_dir=".", num_train_epochs=num_train_epochs, 
                                  evaluation_strategy="epoch", report_to='all',
                                  )

In [48]:
trainer = Trainer(

    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,

)

**Train and evaluate the model**

In [None]:
if kaggle:
    import os
    os.environ["WANDB_DISABLED"] = "true"

In [49]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 4000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 500
  Number of trainable parameters = 108311810


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3766,0.398279,0.874


Saving model checkpoint to ./checkpoint-500
Configuration saved in ./checkpoint-500/config.json
Model weights saved in ./checkpoint-500/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=500, training_loss=0.3765945129394531, metrics={'train_runtime': 112.3994, 'train_samples_per_second': 35.587, 'train_steps_per_second': 4.448, 'total_flos': 1052444221440000.0, 'train_loss': 0.3765945129394531, 'epoch': 1.0})
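The numbers in the log above fit together in a simple way. A quick check, assuming a single device and no gradient accumulation (as the log reports):

```python
import math

num_examples = 4000        # "Num examples" in the training log
batch_size = 8             # "Instantaneous batch size per device"
num_epochs = 1             # "Num Epochs"
train_runtime = 112.3994   # seconds, from TrainOutput

# One optimization step per batch, repeated for every epoch
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps, round(samples_per_second, 3), round(steps_per_second, 3))
```

This also shows why training on the full 25,000 reviews, or for more epochs, scales the number of steps and therefore the runtime proportionally.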

### Use the model on new data

In [54]:
test_data = ["This movie was not pretty good.", "You should miss it!"]

In [55]:
test_data = tokenizer(test_data, return_tensors="pt", padding=True).to(model.device)

In [56]:
# Pass input_ids together with the attention mask so padding tokens are ignored
outputs = model(**test_data)

In [57]:
# Predictions
outputs.logits.argmax(-1)

tensor([0, 1], device='cuda:0')
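The `argmax` only tells us which class wins. To see how confident the model is, the logits can be turned into probabilities with a softmax. A minimal sketch in plain Python, using made-up logits rather than the actual model output:

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [1.2, -0.4]   # hypothetical logits: [negative, positive]
probs = softmax(toy_logits)
print(probs)
```

With PyTorch tensors, the same operation is available as `outputs.logits.softmax(-1)`.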

### Create an app

In [58]:
import gradio as gr

def classify_review(review):
    # Tokenize the review (truncating overly long texts) and move the
    # tensors to the same device as the model
    inputs = tokenizer(review, return_tensors="pt", truncation=True).to(model.device)
    response = model(**inputs)
    sentiment = int(response.logits.argmax(-1))
    return "Positive" if sentiment else "Negative"


demo = gr.Interface(classify_review, inputs="text", outputs=["text"])

In [59]:
demo.launch(share=True)

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.




# Embeddings and using pre-trained text encoders

## Some key concepts that were mentioned

In the lecture, I told a short story about the following key concepts, widely used in modern deep learning:

* Embeddings and representations
* Word2Vec
* Language Models
* Training language models
* Reusing representations for other tasks
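As a small reminder of what an embedding lets us do: once words are represented as vectors, similarity between words can be measured, for example with the cosine of the angle between their vectors. The three-dimensional vectors below are invented for illustration; real embeddings such as Word2Vec are learned from data and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, not taken from any trained model
toy_embeddings = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.3, 0.1],
}

sim_good_great = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_good_awful = cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"])
print(sim_good_great, sim_good_awful)
```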

## TensorFlow Embedding Projector

These concepts were introduced in the lecture using the TensorFlow Embedding Projector: http://projector.tensorflow.org/

<a href="http://projector.tensorflow.org/"><img src="https://raw.githubusercontent.com/HVL-ML/DAT255/main/3-NLP/assets/TensorFlowProjector.png"></a>

<img src="https://github.com/HVL-ML/DAT255/raw/main/3-NLP/assets/TensorFlowProjector.gif">