ASL, v.010323

# NLP using fastai & ULMFiT: a small case-study

This is an example of using [ULMFiT](https://arxiv.org/abs/1801.06146) from [fastai](https://docs.fast.ai/tutorial.text.html) for natural language processing (NLP).

# Setup

In [None]:
# This is a quick check of whether the notebook is currently running on Google Colaboratory
# or on Kaggle, as that makes some difference for the code below.
# We'll do this in every notebook of the course.
try:
    import google.colab  # only importable when running on Google Colaboratory
    colab = True
except ImportError:
    colab = False

import os
kaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

In [None]:
if (colab or kaggle):
    import sys
    # Install fastai etc
    !pip install -Uqq fastbook
    !{sys.executable} -m spacy download en_core_web_sm
    from fastbook import *
else:
    print("WARNING: To run this notebook locally you will have to install fastai")

We'll use the [spaCy](https://spacy.io) library for NLP and the [fastai](https://docs.fast.ai) library for deep learning.

In [None]:
import spacy

In [None]:
from fastai.text.all import *
from pprint import pprint as pp

# Load data

We use a data set collected in the work of Wakamiya et al., _Tweet Classification Toward Twitter-Based Disease Surveillance: New Data, Methods, and Evaluations_, 2019: https://www.jmir.org/2019/2/e12783/

![medweb-paper](https://github.com/MMIV-ML/ELMED219-2022/raw/main/Lab2-NLP/assets/medweb-paper.png)

The data is supposed to represent tweets that discuss one or more of eight symptoms.

Some examples from the original paper:<br><br>
<img src="https://github.com/MMIV-ML/ELMED219-2022/raw/main/Lab2-NLP/assets/medweb_examples.png">

We'll only look at the English language tweets:

In [None]:
#df = pd.read_csv('https://github.com/HVL-ML/DAT255/raw/main/3-NLP/data/medwebdata.csv')
df = pd.read_csv('data/medwebdata.csv')

In [None]:
df.head()

In [None]:
pp(df['Tweet'][10])

From this text the goal is to determine whether the person is talking about one or more of the eight symptoms or conditions listed above:

In [None]:
list(df.columns[2:-2])

# Prepare the data

As we know, for a computer, everything is numbers. We have to convert the text to a series of numbers, and then feed those to the computer. 

In the previous notebook, we saw how this can be done in two steps: **tokenization** and **numericalization**

## Tokenization

In tokenization the text is split into components, called tokens. 

Here are some results of the tokenization procedure used by ULMFiT:

In [None]:
data_lm = TextDataLoaders.from_df(df, text_col='Tweet', is_lm=True, valid_pct=0.1)

data_lm.show_batch(max_n=2)

Tokens starting with "xx" are special. `xxbos` means the beginning of the text, `xxmaj` means that the following word is capitalized, `xxup` means that the following word is in all caps, and so on.

The token `xxunk` replaces words that are rare in the text corpus. We keep only words that appear at least three times (fastai's default, `min_freq=3`), up to a set maximum number of different words (60,000 in our case). The set of tokens we keep is called our **vocabulary**.
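
To make the rules above concrete, here is a minimal plain-Python sketch of tokenization with special tokens and a frequency-based vocabulary. It is only an illustration: the helpers `toy_tokenize`, `build_vocab` and `replace_rare` are made up for this example and are much cruder than the spaCy-based tokenizer fastai actually uses, and the frequency threshold is chosen to suit the two-sentence toy corpus.

```python
import re
from collections import Counter

# Special tokens that are always kept in the vocabulary (illustrative subset)
SPECIALS = ["xxunk", "xxbos", "xxmaj", "xxup"]

def toy_tokenize(text):
    """Rough imitation of the special-token rules described above (not fastai's tokenizer)."""
    tokens = ["xxbos"]  # mark the beginning of the text
    for word in re.findall(r"\w+|[^\w\s]", text):
        if len(word) > 1 and word.isupper():
            tokens += ["xxup", word.lower()]   # word written in all caps
        elif word[0].isupper():
            tokens += ["xxmaj", word.lower()]  # capitalized word
        else:
            tokens.append(word)
    return tokens

def build_vocab(token_lists, min_freq, max_vocab=60000):
    """Keep the most frequent tokens that occur at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = [tok for tok, n in counts.most_common(max_vocab) if n >= min_freq]
    return SPECIALS + [tok for tok in kept if tok not in SPECIALS]

def replace_rare(tokens, vocab):
    """Replace every out-of-vocabulary token with xxunk."""
    known = set(vocab)
    return [tok if tok in known else "xxunk" for tok in tokens]

docs = [toy_tokenize("My head hurts"), toy_tokenize("My nose is RUNNY")]
vocab = build_vocab(docs, min_freq=2)  # threshold picked for this toy corpus
print(docs[1])
print(replace_rare(docs[1], vocab))
```

Only `my` occurs often enough to enter the toy vocabulary, so every other ordinary word is replaced by `xxunk`.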

## Numericalization

We convert tokens to numbers by making a list of all the tokens in the vocabulary and assigning each token its position in that list.

The above text is replaced by numbers, as in this example:

In [None]:
data_lm.train_ds[0][0]
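
The same idea in miniature, as a plain-Python sketch. The helper names and the tiny vocabulary are made up for illustration; in the notebook itself fastai takes care of this step.

```python
def numericalize(tokens, vocab):
    """Map each token to its index in vocab; unknown tokens get the index of xxunk."""
    stoi = {tok: i for i, tok in enumerate(vocab)}  # string-to-index lookup
    unk = stoi["xxunk"]
    return [stoi.get(tok, unk) for tok in tokens]

def denumericalize(ids, vocab):
    """Map indices back to tokens."""
    return [vocab[i] for i in ids]

vocab = ["xxunk", "xxbos", "xxmaj", "my", "head", "hurts"]  # toy vocabulary
ids = numericalize(["xxbos", "xxmaj", "my", "head", "hurts", "badly"], vocab)
print(ids)                         # [1, 2, 3, 4, 5, 0]
print(denumericalize(ids, vocab))
```

Note that the mapping is not fully reversible: `badly` is not in the toy vocabulary, so it comes back as `xxunk`.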

> **We are now in a position where the computer can compute on the text.**

# "Classical" versus deep learning-based NLP

In [None]:
try: 
    nlp = spacy.load("en_core_web_sm")
    print("Spacy model loaded")
except OSError:  # raised by spacy.load when the model is not installed
    import sys
    !{sys.executable} -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

### Sentence Boundary Detection: splitting into sentences

Example sentence:
> _"Patient presents for initial evaluation of cough. Cough is reported to have developed acutely and has been present for 4 days. Symptom severity is moderate. Will return next week."_

In [None]:
sentence = "Patient presents for initial evaluation of cough. Cough is reported to have developed acutely and has been present for 4 days. Symptom severity is moderate. Will return next week."
doc = nlp(sentence)
 
for sent in doc.sents:
    print(sent)

### Named Entity Recognition

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

### Dependency parsing

In [None]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

> There's a lot more to natural language processing, of course! Have a look at [spaCy 101: Everything you need to know](https://spacy.io/usage/spacy-101) for some examples.

In general, data preparation and feature engineering are huge and challenging undertakings when using machine learning to analyze text.

However, in what's called _deep learning_ most of this work is done by the computer! That's because deep learning does feature extraction _and_ prediction in the same model. 

This results in much less work and, often, _in much better models_!

![MLvsDL](https://aylien.com/images/uploads/general/tumblr_inline_oabas5sThb1sleek4_540.png)

# Deep learning language model

We now come to a relatively new and very powerful idea in deep learning for NLP, one that created a small revolution in the field around 2018 ([1](https://blog.openai.com/language-unsupervised/), [2](http://ruder.io/nlp-imagenet/)).

We want to create a system that can classify text into one or more categories. This is a complex problem as the computer must somehow implicitly learn to "read." 

Idea: why not _first_ teach the computer to "read" and _then_ let it loose on the classification task?

We can teach the computer to "understand" language by training it to guess the next word of a sentence, using as much training data as we can get hold of. This is called ***language modelling*** in NLP. 

Guessing the next word is a difficult task: our own ability to do this is based on our knowledge of the language and of the world.

> What word fits here? _"The light turned green, and Per crossed the ___"_

Luckily, obtaining large amounts of training data for language models is simple: any text can be used. The labels are simply the next word of a subpart of the text. 
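
A minimal sketch of how such "free" labels arise from raw text. The function names are made up for illustration; the second helper shows the equivalent shift-by-one view that is typically used when training on whole sequences.

```python
def next_word_pairs(tokens):
    """Return (context, target) pairs: each target is the token that follows its context."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def shifted_sequences(tokens):
    """Same labels seen as two aligned sequences: the targets are the inputs shifted by one."""
    return tokens[:-1], tokens[1:]

tokens = "the light turned green".split()
for context, target in next_word_pairs(tokens):
    print(context, "->", target)

inputs, targets = shifted_sequences(tokens)
print(inputs, targets)
```

No human annotation is needed: every position in the text provides one training example.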

We can, for example, use Wikipedia. Then, once the model performs reasonably well at predicting the next word of Wikipedia text, we can fine-tune it on text that's closer to the classification task we're after.

> This is often called ***transfer learning***.

We can use the tweet text to fine-tune a model that's already been pretrained on Wikipedia:

In [None]:
data_lm = TextDataLoaders.from_df(df, text_col='Tweet', is_lm=True, valid_pct=0.1)

data_lm.show_batch(max_n=3)

In [None]:
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=True, drop_mult=0.3,
                               metrics=[accuracy, Perplexity()], wd=0.1, model_dir='.').to_fp16()

Let's start training:

In [None]:
learn.fit_one_cycle(1, 5e-3)

In [None]:
learn.unfreeze()
learn.fit_one_cycle(3, 5e-3)

...and save the parts of the model that we can reuse for classification later:

In [None]:
learn.save_encoder('medweb_finetuned')

## Test the language model

We can test the language model by giving it a starting text and having it guess a given number of words that follow:

In [None]:
def make_text(seed_text, nb_words):
    """
    Use the trained language model to produce text. 
    Input:
        seed_text: some text to get the model started
        nb_words: number of words to produce
    """
    pred = learn.predict(seed_text, nb_words, temperature=0.75)
    pp(pred)

In [None]:
make_text("I'm not feeling too good as my", 10)

In [None]:
make_text("No, that's a", 40)
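
The `temperature=0.75` argument in `make_text` controls how adventurous the sampling is. Below is a minimal sketch of one common formulation, assuming the next-word probabilities are raised to the power `1/temperature` and then renormalized. The helper `apply_temperature` and the example probabilities are made up for illustration and are not part of fastai.

```python
def apply_temperature(probs, temperature):
    """Raise probabilities to 1/temperature and renormalize so they sum to 1."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.5, 0.3, 0.2]                  # made-up next-word probabilities
sharper = apply_temperature(probs, 0.75)  # below 1: favors the likeliest word more
flatter = apply_temperature(probs, 1.5)   # above 1: spreads probability more evenly
print(sharper)
print(flatter)
```

A temperature below 1 makes the generated text more predictable, while a temperature above 1 makes it more varied (and more error-prone).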

Now we have something that seems to produce text that resembles the text to be classified. 

> **Note:** It's interesting to see that the model can come up with text that makes some sense (mostly thanks to training on Wikipedia), and that the text resembles the medical tweets (thanks to the fine-tuning). 

> **Note** also that an accuracy of 30-40% when predicting the next word of a sentence is pretty impressive, as the number of possibilities is very large (equal to the size of the vocabulary).

> **Also note** that this is not the task we care about: it's a pretext task before the tweet classification. 
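
To put the accuracy figure in perspective, here is a small sketch of what a model that guesses uniformly at random would achieve. The vocabulary size of 5000 is a made-up number for illustration; perplexity is computed as the exponential of the cross-entropy loss, which is the usual definition.

```python
import math

def uniform_baseline(vocab_size):
    """Accuracy and perplexity of a model that assigns equal probability to every token."""
    accuracy = 1.0 / vocab_size
    cross_entropy = -math.log(1.0 / vocab_size)  # loss of a uniform guess
    perplexity = math.exp(cross_entropy)         # equals vocab_size for a uniform guess
    return accuracy, perplexity

acc, ppl = uniform_baseline(5000)  # 5000 is a hypothetical vocabulary size
print(f"chance accuracy: {acc:.4%}, chance perplexity: {ppl:.0f}")
```

Against a chance level far below one percent, getting the next word right in roughly a third of the cases is a large improvement.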

# Classifier

In [None]:
medweb = DataBlock(blocks=(TextBlock.from_df(text_cols='Tweet', seq_len=12, vocab=data_lm.vocab), MultiCategoryBlock), 
                  get_x = ColReader(cols='text'), 
                  get_y = ColReader(cols='labels', label_delim=";"),
                  splitter = ColSplitter(col='is_test'))

data = medweb.dataloaders(df, bs=8)

Now our task is to predict the possible classes the tweets can be assigned to:

In [None]:
data.show_batch()
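
Because a tweet can mention several symptoms at once, each target is a set of labels rather than a single class. Roughly speaking, the `;`-separated label strings are turned into multi-hot vectors with one position per class. A minimal sketch of that encoding, where the helper `multi_hot` and the class names are illustrative and not taken from the data set:

```python
def multi_hot(label_string, classes, delim=";"):
    """Encode a delimiter-separated label string as a 0/1 vector over classes."""
    present = set(label_string.split(delim)) if label_string else set()
    return [1 if c in present else 0 for c in classes]

classes = ["Cold", "Cough", "Fever", "Headache"]  # hypothetical class names
print(multi_hot("Cough;Headache", classes))  # [0, 1, 0, 1]
print(multi_hot("", classes))                # [0, 0, 0, 0], i.e. no symptom mentioned
```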

In [None]:
learn_clf = text_classifier_learner(data, AWD_LSTM, seq_len=16, pretrained=True, 
                                    drop_mult=0.6, 
                                    metrics=[accuracy_multi, 
                                             F1ScoreMulti(average='micro'),
                                             F1ScoreMulti(average='macro'),
                                             PrecisionMulti(average='micro'),
                                             PrecisionMulti(average='macro'),
                                             RecallMulti(average='micro'),
                                             RecallMulti(average='macro'),
                                             HammingLossMulti(),
                                             ], 
                                    model_dir='.').to_fp16()

In [None]:
learn_clf = learn_clf.load_encoder('medweb_finetuned')

In [None]:
lr = learn_clf.lr_find(suggest_funcs=(minimum, steep, valley, slide))

In [None]:
base_lr = (lr.valley + lr.steep)/2
print(base_lr)

In [None]:
learn_clf.fine_tune(8, base_lr=base_lr)
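
The metrics above come in `micro` and `macro` variants. As a rough guide: micro-averaging pools the true positives, false positives and false negatives of all classes before computing the score, whereas macro-averaging computes the score per class and then takes the plain mean, so rare classes weigh as much as common ones. A small plain-Python sketch with made-up predictions (the helper names are hypothetical):

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Micro- and macro-averaged F1 for multi-hot targets and predictions."""
    n_classes = len(y_true[0])
    counts = []
    for j in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    macro = sum(f1(*c) for c in counts) / n_classes           # mean of per-class F1
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))  # F1 of pooled counts
    return micro, macro

# Made-up example: class 0 is common and always right, class 1 is rare and missed
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1: {micro:.3f}, macro F1: {macro:.3f}")
```

A large gap between the two, as in this toy example, is a hint that the model does poorly on the less frequent classes.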

## Is it a good classifier?

We can test it out on some example text:

In [None]:
learn_clf.predict("I'm feeling really bad. My head hurts. My nose is runny. I've felt like this for days.")

It seems to produce reasonable results. _But remember that this is a small data set._ One cannot expect great things when asking the model to make predictions on text that differs from the small data set it was trained on.

Go ahead and try to have the model predict symptoms for a few example sentences, and you'll see.
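
When reading the output of `predict`, it helps to know how a multi-label decision is typically made: each class gets its own score, the score is squashed to a probability with a sigmoid, and every class whose probability exceeds a threshold is reported. A minimal sketch, assuming a threshold of 0.5; the scores and class names below are made up for illustration.

```python
import math

def sigmoid(x):
    """Squash a raw score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(scores, classes, thresh=0.5):
    """Return the classes whose sigmoid probability exceeds thresh, plus all probabilities."""
    probs = [sigmoid(s) for s in scores]
    labels = [c for c, p in zip(classes, probs) if p > thresh]
    return labels, probs

classes = ["Headache", "Fever", "Runnynose"]  # hypothetical class names
scores = [2.0, -1.0, 0.3]                     # made-up raw model outputs
labels, probs = predict_labels(scores, classes)
print(labels)
print([round(p, 2) for p in probs])
```

Since each class is decided independently, the model can report several symptoms at once, or none at all.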

### How does it compare to other approaches?

From the [original article](https://www.jmir.org/2019/2/e12783/) from 2019 that presented the data set:

<img src="https://github.com/MMIV-ML/ELMED219-2022/raw/main/Lab2-NLP/assets/medweb_results.png">

The "NAIST-en" models are _"ensembles of hierarchical attention network and deep character-level convolutional neural network with loss functions (negative loss function, hinge, and hinge squared)"_. In other words, they are also deep learning-based models.