_HELIKT620-2021_. Alexander S. Lundervold, 03.02.2021.

# Natural language processing and machine learning: a small case-study

This is a quick example of some techniques and ideas from natural language processing (NLP) and some modern approaches to NLP based on _deep learning_.

# Setup

We'll use the [spaCy](https://spacy.io) library for NLP and the [fastai](https://docs.fast.ai) library for deep learning.

In [None]:
import spacy

In [None]:
from fastai.text.all import *
from pprint import pprint as pp

In [None]:
!nvidia-smi

# Load data

We use a data set collected by Wakamiya et al., _Tweet Classification Toward Twitter-Based Disease Surveillance: New Data, Methods, and Evaluations_, 2019: https://www.jmir.org/2019/2/e12783/

![medweb-paper](assets/medweb-paper.png)

The data is supposed to represent tweets that discuss one or more of eight symptoms.

From the original paper:
<img src="assets/medweb_examples.png">

We'll only look at the English language tweets:

In [None]:
df = pd.read_csv('data/medweb/medwebdata.csv')

In [None]:
df.head()

In [None]:
pp(df['Tweet'][10])

From this text the goal is to determine whether the person is talking about one or more of the eight symptoms or conditions listed above:

In [None]:
list(df.columns[2:-2])

> **BUT:** How can a computer read??

<img src="http://2.bp.blogspot.com/_--uVHetkUIQ/TDae5jGna8I/AAAAAAAAAK0/sBSpLudWmcw/s1600/reading.gif">

# Prepare the data

For a computer, everything is numbers. We have to convert the text to a series of numbers, and then feed those to the computer. 

This can be done in two widely used steps in natural language processing: **tokenization** and **numericalization**:

## Tokenization

In tokenization the text is split into smaller units called tokens, typically words and punctuation marks. A simple way to achieve this is to split on the spaces in the text. But then punctuation stays attached to the neighbouring words, and we miss the fact that some words are contractions of multiple words (for example _isn't_ and _don't_).
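
To make the difference concrete, here is a minimal sketch in plain Python that compares a whitespace split with a simple regular-expression tokenizer. The helper `simple_tokenize` and its pattern are illustrative assumptions for English text only. They are not how spaCy works internally, which relies on language-specific rules and exception lists.

```python
import re

text = "I don't feel well, and my head isn't getting better!"

# Naive approach: split on whitespace. Punctuation stays glued to the words.
whitespace_tokens = text.split()

# Slightly better: split off the "n't" of contractions and other apostrophe
# endings, and treat every punctuation mark as a token of its own.
TOKEN_PATTERN = re.compile(r"n't|'\w+|\w+(?=n't)|\w+|[^\w\s]")

def simple_tokenize(text):
    """Hypothetical helper: a rough, English-only tokenizer."""
    return TOKEN_PATTERN.findall(text)

print(whitespace_tokens)
print(simple_tokenize(text))
```

With the whitespace split, `well,` and `better!` end up as single tokens, while the regex version separates the punctuation and splits _don't_ into `do` and `n't`.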

<img src="https://spacy.io/tokenization-57e618bd79d933c4ccd308b5739062d6.svg">

Here are some results after tokenization:

In [None]:
data_lm = TextDataLoaders.from_df(df, text_col='Tweet', is_lm=True, valid_pct=0.1)

data_lm.show_batch(max_n=2)

Tokens starting with "xx" are special. `xxbos` means the beginning of the text, `xxmaj` means that the following word is capitalized, `xxup` means that the following word is in all caps, and so on.
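
The idea behind these markers can be sketched in a few lines of plain Python. The helper `mark_case` below is a hypothetical simplification written for this example. fastai's own preprocessing rules differ in detail and cover more cases.

```python
def mark_case(words):
    """Hypothetical helper: lowercase words, but record their casing with marker tokens."""
    out = []
    for word in words:
        if len(word) > 1 and word.isupper():
            out += ["xxup", word.lower()]    # the whole word was in capitals
        elif word[:1].isupper():
            out += ["xxmaj", word.lower()]   # only the first letter was capitalized
        else:
            out.append(word)
    return out

# "xxbos" marks the beginning of the text
tokens = ["xxbos"] + mark_case("My head hurts SO much".split())
print(tokens)
# ['xxbos', 'xxmaj', 'my', 'head', 'hurts', 'xxup', 'so', 'much']
```

One benefit of this design is that _My_, _MY_ and _my_ all share the single vocabulary entry `my`, while the casing information is still available to the model.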

The token `xxunk` replaces words that are rare in the text corpus. We keep only words that appear at least twice (with a set maximum number of different words, 60,000 in our case). The set of words we keep is called our **vocabulary**.
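
As a rough sketch of how such a vocabulary could be built, the following plain-Python example counts tokens, keeps the frequent ones, and replaces everything else with `xxunk`. The function names, the parameter names `min_freq` and `max_vocab`, and the tiny corpus are assumptions made for this illustration. This is not fastai's implementation.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2, max_vocab=60000):
    """Hypothetical helper: keep the most common tokens seen at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, n in counts.most_common(max_vocab) if n >= min_freq]
    return ["xxunk"] + frequent

def replace_rare(tokens, vocab):
    """Replace every token that is not in the vocabulary with 'xxunk'."""
    known = set(vocab)
    return [tok if tok in known else "xxunk" for tok in tokens]

# Made-up, already tokenized example texts
corpus = [
    ["my", "head", "hurts"],
    ["my", "throat", "hurts"],
    ["my", "cat", "sneezed"],
]
vocab = build_vocab(corpus)
print(vocab)                            # ['xxunk', 'my', 'hurts']
print(replace_rare(corpus[2], vocab))   # ['my', 'xxunk', 'xxunk']
```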

## Numericalization

We convert tokens to numbers by making a list of all the tokens that have been used and assigning each of them a number.

The above text is replaced by numbers, as in this example:

In [None]:
data_lm.train_ds[0][0]
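
The mapping itself is simple. Here is a minimal sketch with a made-up six-token vocabulary, where each token's number is its position in the list. The helper names `numericalize` and `textify` are chosen for this example.

```python
# Made-up vocabulary: the index in the list is the token's number
vocab = ["xxunk", "xxbos", "xxmaj", "my", "head", "hurts"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def numericalize(tokens):
    """Map tokens to integers; unknown tokens get the number of 'xxunk'."""
    unk = token_to_id["xxunk"]
    return [token_to_id.get(tok, unk) for tok in tokens]

def textify(ids):
    """Map integers back to tokens."""
    return [vocab[i] for i in ids]

ids = numericalize(["xxbos", "xxmaj", "my", "knee", "hurts"])
print(ids)            # [1, 2, 3, 0, 5]
print(textify(ids))   # ['xxbos', 'xxmaj', 'my', 'xxunk', 'hurts']
```

Note that the round trip is lossy: _knee_ is not in the vocabulary, so it comes back as `xxunk`.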

> **We are now in a position where the computer can compute on the text.**

# "Classical" versus deep learning-based NLP

In [None]:
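# Run once to download the small English model:
#import sys
#!{sys.executable} -m spacy download en_core_web_sm

In [None]:
# The full model name works in spaCy 2 and 3; the 'en' shortcut was removed in spaCy 3
nlp = spacy.load('en_core_web_sm')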

Let's have a look at some standard tools of NLP:

### Sentence Boundary Detection: splitting into sentences

Example sentence:
> _"Patient presents for initial evaluation of cough. Cough is reported to have developed acutely and has been present for 4 days. Symptom severity is moderate. Will return next week."_

In [None]:
sentence = "Patient presents for initial evaluation of cough. Cough is reported to have developed acutely and has been present for 4 days. Symptom severity is moderate. Will return next week."
doc = nlp(sentence)
 
for sent in doc.sents:
    print(sent)
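
Splitting into sentences may look trivial, but a simple rule breaks quickly. The sketch below uses a hypothetical helper, `naive_sentences`, that splits after `.`, `!` or `?` followed by whitespace. The second example sentence is made up to show a typical failure.

```python
import re

def naive_sentences(text):
    """Hypothetical helper: split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_sentences("Symptom severity is moderate. Will return next week."))
# ['Symptom severity is moderate.', 'Will return next week.']

print(naive_sentences("Seen by Dr. Hansen today. Will return next week."))
# ['Seen by Dr.', 'Hansen today.', 'Will return next week.']
```

The abbreviation _Dr._ is wrongly treated as the end of a sentence, which is one reason libraries like spaCy rely on more than a punctuation rule.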

### Named Entity Recognition

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

### Dependency parsing

In [None]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

> There's a lot more to natural language processing, of course! Have a look at [spaCy 101: Everything you need to know](https://spacy.io/usage/spacy-101) for some examples.

In general, data preparation and feature engineering are huge and difficult undertakings when using machine learning to analyse text.
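
To give a feeling for what hand-made features can look like, here is a small sketch. The keyword list, the negation list and the three features are hypothetical choices made for this illustration. A real system would need many more, and choosing them well is where much of the effort goes.

```python
from collections import Counter

# Hypothetical, hand-picked word lists
SYMPTOM_KEYWORDS = {"cough", "fever", "headache", "runny"}
NEGATIONS = {"no", "not", "never"}

def handcrafted_features(text):
    """Turn a text into a few numbers that a classical classifier could use."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    counts = Counter(words)
    return {
        "n_words": len(words),
        "n_symptom_words": sum(counts[w] for w in SYMPTOM_KEYWORDS),
        "has_negation": int(any(w in NEGATIONS for w in words)),
    }

print(handcrafted_features("No fever today, but the cough is back."))
# {'n_words': 8, 'n_symptom_words': 2, 'has_negation': 1}
```

Even this tiny example shows the difficulty: the features record that a negation is present, but not which symptom it refers to.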

However, in what's called _deep learning_ (discussed in detail tomorrow) most of this work is done by the computer! That's because deep learning does feature extraction _and_ prediction in the same model. 

This results in much less work and, often, _in much better models_!

![MLvsDL](https://aylien.com/images/uploads/general/tumblr_inline_oabas5sThb1sleek4_540.png)

# Deep learning language model

We now come to a relatively new and very powerful idea for deep learning and NLP, one that created a small revolution in NLP a couple of years ago ([1](https://blog.openai.com/language-unsupervised/), [2](http://ruder.io/nlp-imagenet/)).

We want to create a system that can classify text into one or more categories. This is a difficult problem as the computer must somehow implicitly learn to "read". 

Idea: why not _first_ teach the computer to "read" and _then_ let it loose on the classification task?

We can teach the computer to "understand" language by training it to predict the next word of a sentence, using as much training data as we can get hold of. This is called ***language modelling*** in NLP.

This is a difficult task: to guess the next word of a sentence one has to know a lot about language, and also a lot about the world.

> What word fits here? _"The light turned green and Per crossed the ___"_

Luckily, obtaining large amounts of training data for language models is simple: any text can be used. The labels are simply the next word of a subpart of the text. 
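
The following sketch shows how training examples can be cut out of raw text without any manual labelling. The helper `next_word_pairs` and its `context_size` parameter are assumptions made for this illustration. fastai prepares its language-model batches differently (as long token sequences shifted by one position), but the principle is the same.

```python
def next_word_pairs(tokens, context_size=3):
    """Hypothetical helper: pair each word with the (up to) context_size words before it."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))   # input: context, label: the next word
    return pairs

tokens = "my head hurts and my nose is runny".split()
pairs = next_word_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

A single eight-word sentence already gives seven labelled examples, which is why large text collections are such a rich source of training data.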

We can for example use Wikipedia. After the model performs alright at predicting the next word of Wikipedia text, we can fine-tune it on text that's closer to the classification task we're after. 

> This is often called ***transfer learning***.

We can use the tweet text to fine-tune a model that's already been pretrained on Wikipedia:

In [None]:
data_lm = TextDataLoaders.from_df(df, text_col='Tweet', is_lm=True, valid_pct=0.1)

data_lm.show_batch(max_n=3)

In [None]:
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=True, 
                               metrics=[accuracy, Perplexity()], wd=0.1).to_fp16()

Let's start training:

In [None]:
learn.fine_tune(50)

In [None]:
#learn.save('medweb_finetuned_lm')

learn.load('medweb_finetuned_lm')

...and save the parts of the model that we can reuse for classification later:

In [None]:
#learn.save_encoder('medweb_finetuned_encoder')

## Test the language model

We can test the language model by having it continue a starting text with a given number of words:

In [None]:
def make_text(seed_text, nb_words):
    """
    Use the trained language model to produce text. 
    Input:
        seed_text: some text to get the model started
        nb_words: number of words to produce
    """
    pred = learn.predict(seed_text, nb_words, temperature=0.75)
    pp(pred)
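
The `temperature` argument used above controls how adventurous the sampling of the next word is. A common way to implement it, sketched here in plain Python, is to divide the model's scores by the temperature before turning them into probabilities. The words and scores below are made up, and fastai's internals may differ in detail.

```python
import math
import random

def softmax_with_temperature(scores, temperature=1.0):
    """Turn raw scores into probabilities; a low temperature sharpens the distribution."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_word(words, scores, temperature=1.0):
    """Draw one word at random according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(scores, temperature)
    return random.choices(words, weights=probs, k=1)[0]

words = ["head", "throat", "stomach", "cat"]
scores = [2.0, 1.5, 1.0, -1.0]   # made-up model scores for the next word

for t in (1.0, 0.75, 0.1):
    print(t, [round(p, 2) for p in softmax_with_temperature(scores, t)])
print(sample_word(words, scores, temperature=0.75))
```

With a temperature below 1 the most likely words are favoured even more strongly, so the text becomes more predictable. With a temperature above 1 the text becomes more varied, but also more error-prone.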

In [None]:
make_text("I'm not feeling too good as my", 10)

In [None]:
make_text("No, that's a", 40)

Now we have something that seems to produce text that resembles the text to be classified. 

> **Note:** It's interesting to see that the model can come up with text that makes some sense (mostly thanks to training on Wikipedia), and that the text resembles the medical tweets (thanks to the fine-tuning). 

> **Note** also that an accuracy of 30-40% when predicting the next word of a sentence is pretty impressive, as the number of possibilities is very large (equal to the size of the vocabulary).

> **Also note** that this is not the task we care about: it's a pretext task before the tweet classification. 
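
To put the accuracy in perspective, the sketch below computes what pure guessing would achieve, together with the perplexity metric that was tracked during training. Perplexity is the exponential of the average negative log-probability the model gives to the correct next words. The vocabulary size and the probabilities are hypothetical numbers chosen for the example.

```python
import math

vocab_size = 5000   # hypothetical vocabulary size

# Guessing uniformly at random is right once in vocab_size tries
chance_accuracy = 1 / vocab_size

def perplexity(probs_of_correct_words):
    """exp of the average negative log-probability assigned to the correct next words."""
    nll = [-math.log(p) for p in probs_of_correct_words]
    return math.exp(sum(nll) / len(nll))

print(f"chance accuracy: {chance_accuracy:.4%}")
print(perplexity([1 / vocab_size] * 5))     # uniform guessing: perplexity is about vocab_size
print(perplexity([0.5, 0.25, 0.5, 0.25]))   # a much more confident (made-up) model
```

A model that guesses uniformly has a perplexity equal to the vocabulary size, so a lower perplexity means the model has effectively narrowed down the number of plausible next words.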

# Classifier

In [None]:
medweb = DataBlock(blocks=(TextBlock.from_df(text_cols='Tweet', seq_len=12, vocab=data_lm.vocab), MultiCategoryBlock), 
                  get_x = ColReader(cols='text'), 
                  get_y = ColReader(cols='labels', label_delim=";"),
                  splitter = ColSplitter(col='is_test'))

data = medweb.dataloaders(df, bs=8)
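
Because a tweet can mention several symptoms at once, the target for each tweet is not a single class but a vector with one 0/1 entry per class. The sketch below shows that encoding for a label string using `;` as delimiter, as in the `label_delim` argument above. The helper `multi_hot` and the four class names are hypothetical and only serve the illustration.

```python
def multi_hot(label_string, classes, delim=";"):
    """Hypothetical helper: encode 'A;B' as a 0/1 vector over the list of classes."""
    present = set(label_string.split(delim)) if label_string else set()
    return [1 if c in present else 0 for c in classes]

classes = ["Cold", "Cough", "Fever", "Headache"]   # hypothetical subset of the labels

print(multi_hot("Cold;Headache", classes))   # [1, 0, 0, 1]
print(multi_hot("", classes))                # [0, 0, 0, 0]
```

A tweet without any symptom simply gets a vector of zeros.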

Now our task is to predict the possible classes the tweets can be assigned to:

In [None]:
data.show_batch()

In [None]:
learn_clf = text_classifier_learner(data, AWD_LSTM, seq_len=16, pretrained=True, 
                                    drop_mult=0.5, metrics=accuracy_multi).to_fp16()
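
The `accuracy_multi` metric treats every (tweet, class) pair as a separate yes/no decision. The following plain-Python sketch mirrors that idea with a sigmoid and a threshold of 0.5. The function name and the numbers are made up for this example, and the real metric operates on tensors.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def accuracy_multi_sketch(logits, targets, thresh=0.5):
    """Fraction of individual 0/1 label decisions that match the targets."""
    correct = total = 0
    for row_logits, row_targets in zip(logits, targets):
        for z, t in zip(row_logits, row_targets):
            predicted = sigmoid(z) > thresh
            correct += int(predicted == bool(t))
            total += 1
    return correct / total

# Two made-up tweets, three classes each
logits = [[2.0, -1.0, 0.3], [-2.0, -0.5, 1.5]]
targets = [[1, 0, 0], [0, 0, 1]]

print(accuracy_multi_sketch(logits, targets))   # 5 of 6 decisions are correct
```

Keep in mind that when most labels are 0 for most tweets, this number can look high even for a weak model, so it is worth comparing it with a trivial baseline.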

In [None]:
learn_clf = learn_clf.load_encoder('medweb_finetuned_encoder')

In [None]:
learn_clf.fine_tune(12, base_lr=1e-2)

In [None]:
#learn_clf.save('medweb_classifier')

learn_clf.load('medweb_classifier')

## Is it a good classifier?

We can test it out on some example text:

In [None]:
learn_clf.predict("I'm feeling really bad. My head hurts. My nose is runny. I've felt like this for days.")

It seems to produce reasonable results. _But remember that this is a very small data set._ One cannot expect great results when the model is asked to make predictions on text that differs from the small collection it was trained on. This illustrates the need for "big data" in deep learning.

### How does it compare to other approaches?

From the [original article](https://www.jmir.org/2019/2/e12783/) that presented the data set:

<img src="assets/medweb_results.png">

# End notes

* This of course only scratches the surface of NLP and deep learning applied to NLP. The goal was to "lift the curtain" and show some of the ideas behind modern text analysis software.
* If you're interested in digging into deep learning for NLP you should check out `fastai` (used above) and also `Hugging Face`: https://huggingface.co. 