_HELIKT620-2021_. Alexander S. Lundervold, 03.02.2021.

# Natural language processing and machine learning: a small case-study

This is a quick example of some techniques and ideas from natural language processing (NLP) and some modern approaches to NLP based on _deep learning_.

# Setup

We'll use the [spaCy](https://spacy.io) library for NLP and the [fastai](https://docs.fast.ai) library for deep learning.

In [1]:
import spacy

In [2]:
from fastai.text.all import *
from pprint import pprint as pp

In [3]:
!nvidia-smi

Wed Feb  3 19:07:10 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  TITAN V             Off  | 00000000:02:00.0 Off |                  N/A |
| 33%   48C    P8    30W / 250W |    786MiB / 12065MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3090    Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   50C    P8    25W / 350W |      3MiB / 24268MiB |      0%      Defaul

# Load data

We use a data set collected in the work of Wakamiya et al., _Tweet Classification Toward Twitter-Based Disease Surveillance: New Data, Methods, and Evaluations_, 2019: https://www.jmir.org/2019/2/e12783/

![medweb-paper](assets/medweb-paper.png)

The data is supposed to represent tweets that discuss one or more of eight symptoms.

From the original paper:
<img src="assets/medweb_examples.png">

We'll only look at the English language tweets:

In [4]:
df = pd.read_csv('data/medweb/medwebdata.csv')

In [5]:
df.head()

Unnamed: 0,ID,Tweet,Influenza,Diarrhea,Hayfever,Cough,Headache,Fever,Runnynose,Cold,labels,is_test
0,1en,The cold makes my whole body weak.,0,0,0,0,0,0,0,1,Cold,False
1,2en,It's been a while since I've had allergy symptoms.,0,0,1,0,0,0,1,0,Hayfever;Runnynose,False
2,3en,I'm so feverish and out of it because of my allergies. I'm so sleepy.,0,0,1,0,0,1,1,0,Hayfever;Fever;Runnynose,False
3,4en,"I took some medicine for my runny nose, but it won't stop.",0,0,0,0,0,0,1,0,Runnynose,False
4,5en,I had a bad case of diarrhea when I traveled to Nepal.,0,0,0,0,0,0,0,0,sober,False


In [6]:
pp(df['Tweet'][10])

("They say we will have less pollen next spring, but it doesn't really matter "
 'to me, since my allergy gets severe in the autumn.')


From this text the goal is to determine whether the person is talking about one or more of the eight symptoms or conditions listed above:

In [7]:
list(df.columns[2:-2])

['Influenza',
 'Diarrhea',
 'Hayfever',
 'Cough',
 'Headache',
 'Fever',
 'Runnynose',
 'Cold']

> **BUT:** How can a computer read??

<img src="http://2.bp.blogspot.com/_--uVHetkUIQ/TDae5jGna8I/AAAAAAAAAK0/sBSpLudWmcw/s1600/reading.gif">

# Prepare the data

For a computer, everything is numbers. We have to convert the text to a series of numbers, and then feed those to the computer. 

This can be done in two widely used steps in natural language processing: **tokenization** and **numericalization**:

## Tokenization

In tokenization the text is split into smaller units, called tokens (typically words and punctuation marks). A simple way to achieve this is to split according to spaces in the text. But then, among other things, punctuation stays glued to the neighbouring words, and we miss the fact that some words are contractions of multiple words (for example _isn't_ and _don't_).

<img src="https://spacy.io/tokenization-57e618bd79d933c4ccd308b5739062d6.svg">
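To make the difference concrete, here is a minimal sketch in plain Python that compares a whitespace split with a small regular-expression tokenizer. The `simple_tokenize` helper and its pattern are illustrative assumptions, not what spaCy or fastai do internally; real tokenizers use many more rules and exceptions.

```python
import re

text = "It isn't easy, but I don't mind!"

# Naive approach: split on whitespace. Punctuation stays glued to the words.
whitespace_tokens = text.split()

# Hypothetical pattern: split off contraction parts ("n't", "'ve", ...)
# and give each punctuation mark its own token.
TOKEN_PATTERN = re.compile(r"n't|'\w+|\w+(?=n't)|\w+|[^\w\s]")

def simple_tokenize(s):
    return TOKEN_PATTERN.findall(s)

print(whitespace_tokens)
print(simple_tokenize(text))
```

The second list separates `is` from `n't` and keeps `,` and `!` as tokens of their own, which is closer to the tokenized tweets shown in this notebook.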

Here are some results after tokenization:

In [8]:
data_lm = TextDataLoaders.from_df(df, text_col='Tweet', is_lm=True, valid_pct=0.1)

data_lm.show_batch(max_n=2)



Unnamed: 0,text,text_
0,xxbos xxmaj i 've recovered from my cold ! xxmaj i 'm in perfect shape now ! xxbos xxmaj my allergies are already acting up this year . xxbos i cough too much and my chest hurts . xxbos xxmaj i 've had a bad headache since this morning . i wonder if that 's because i drank too much yesterday . xxbos xxmaj my lymph nodes are swollen and i have,xxmaj i 've recovered from my cold ! xxmaj i 'm in perfect shape now ! xxbos xxmaj my allergies are already acting up this year . xxbos i cough too much and my chest hurts . xxbos xxmaj i 've had a bad headache since this morning . i wonder if that 's because i drank too much yesterday . xxbos xxmaj my lymph nodes are swollen and i have a
1,"so runny . xxbos xxmaj how bad is the mumps ? xxmaj if an adult xxunk it . xxmaj please , somebody , tell me . xxbos i went to the xxup xxunk and was diagnosed with xxunk . xxmaj that makes me wonder if a runny nose is because of xxunk ? i have no idea , but i got a tremendous xxunk of medicine . xxbos xxmaj nepali xxunk is","runny . xxbos xxmaj how bad is the mumps ? xxmaj if an adult xxunk it . xxmaj please , somebody , tell me . xxbos i went to the xxup xxunk and was diagnosed with xxunk . xxmaj that makes me wonder if a runny nose is because of xxunk ? i have no idea , but i got a tremendous xxunk of medicine . xxbos xxmaj nepali xxunk is too"


Tokens starting with "xx" are special. `xxbos` means the beginning of the text, `xxmaj` means that the following word is capitalized, `xxup` means that the following word is in all caps, and so on.
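As a rough illustration of the idea behind these marker tokens, the sketch below lowercases every token and records the lost capitalisation with `xxmaj` and `xxup`. The function `add_special_tokens` is hypothetical and its rules are simplified; fastai's own rules differ in the details.

```python
def add_special_tokens(tokens):
    """Lowercase tokens, recording capitalisation with marker tokens."""
    out = ["xxbos"]                          # marks the beginning of a text
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out += ["xxup", tok.lower()]     # word written in all caps
        elif tok[0].isupper():
            out += ["xxmaj", tok.lower()]    # word starting with a capital
        else:
            out.append(tok)                  # already lowercase, or punctuation
    return out

print(add_special_tokens(["My", "doctor", "saw", "it", "on", "TV", "."]))
```

The benefit of this design is that `My` and `my` share one vocabulary entry, while the information about capitalisation is still available to the model.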

The token `xxunk` replaces words that are rare in the text corpus. We keep only words that appear often enough (at least `min_freq` times, which defaults to 3 in fastai v2), up to a set maximum number of different words (60,000 in our case). The words we keep make up our **vocabulary**.

## Numericalization

We convert tokens to numbers by making a list of all the tokens in the vocabulary and replacing each token with its position in that list.

The text above is then replaced by numbers, as in this example:

In [9]:
data_lm.train_ds[0][0]

TensorText([  2,   8,  67,  98,  63,  11, 118, 188,  38, 187,  10])
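The two steps, building a vocabulary and mapping tokens to indices, can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: `build_vocab` and `numericalize` are hypothetical helpers, the corpus is made up, and the thresholds are parameters rather than fastai's settings.

```python
from collections import Counter

SPECIALS = ["xxunk", "xxbos"]   # reserved tokens, placed first in the vocabulary

def build_vocab(token_lists, min_freq=2, max_vocab=60000):
    """Keep the most common tokens that occur at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, c in counts.most_common(max_vocab)
                if c >= min_freq and tok not in SPECIALS]
    return SPECIALS + frequent

def numericalize(tokens, vocab):
    """Replace each token by its index; unknown tokens map to xxunk."""
    index = {tok: i for i, tok in enumerate(vocab)}
    unk_id = index["xxunk"]
    return [index.get(tok, unk_id) for tok in tokens]

corpus = [
    ["xxbos", "i", "have", "a", "cold"],
    ["xxbos", "i", "have", "a", "fever"],
]
vocab = build_vocab(corpus)
ids = numericalize(["xxbos", "i", "have", "a", "headache"], vocab)
print(vocab)
print(ids)
```

Here `cold`, `fever` and `headache` are too rare to enter the vocabulary, so they all end up as the index of `xxunk`.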

> **We are now in a position where the computer can compute on the text.**

# "Classical" versus deep learning-based NLP

In [10]:
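#import sys
#!{sys.executable} -m spacy download en_core_web_sm

In [11]:
# The 'en' shortcut was removed in spaCy 3, so we load the model by its full name
nlp = spacy.load('en_core_web_sm')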

Let's have a look at some standard tools of NLP:

### Sentence Boundary Detection: splitting into sentences

Example sentence:
> _"Patient presents for initial evaluation of cough. Cough is reported to have developed acutely and has been present for 4 days. Symptom severity is moderate. Will return next week."_

In [12]:
sentence = "Patient presents for initial evaluation of cough. Cough is reported to have developed acutely and has been present for 4 days. Symptom severity is moderate. Will return next week."
doc = nlp(sentence)
 
for sent in doc.sents:
    print(sent)

Patient presents for initial evaluation of cough.
Cough is reported to have developed acutely and has been present for 4 days.
Symptom severity is moderate.
Will return next week.
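For comparison, here is a purely rule-based baseline written with a regular expression. It is only a sketch to show why sentence splitting is harder than it looks, not how spaCy does it; `naive_sentence_split` is a hypothetical helper.

```python
import re

def naive_sentence_split(text):
    # Split wherever ., ! or ? is followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

note = ("Patient presents for initial evaluation of cough. Cough is reported "
        "to have developed acutely and has been present for 4 days. "
        "Symptom severity is moderate. Will return next week.")
for s in naive_sentence_split(note):
    print(s)

# The rule is easily fooled, for example by abbreviations:
print(naive_sentence_split("Dr. Hansen saw the patient."))
```

The rule handles the clinical note, but it wrongly splits after `Dr.`, which is one reason NLP libraries rely on richer rules or trained models for this step.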


### Named Entity Recognition

In [13]:
for ent in doc.ents:
    print(ent.text, ent.label_)

4 days DATE
next week DATE


In [14]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

### Dependency parsing

In [15]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

> There's a lot more to natural language processing, of course! Have a look at [spaCy 101: Everything you need to know](https://spacy.io/usage/spacy-101) for some examples.

In general, data preparation and feature engineering are a huge and difficult undertaking when using machine learning to analyse text.

However, in what's called _deep learning_ (discussed in detail tomorrow) most of this work is done by the computer! That's because deep learning does feature extraction _and_ prediction in the same model. 

This results in much less work and, often, _in much better models_!

![MLvsDL](https://aylien.com/images/uploads/general/tumblr_inline_oabas5sThb1sleek4_540.png)

# Deep learning language model

We now come to a relatively new and very powerful idea for deep learning and NLP, one that created a small revolution in NLP a couple of years ago ([1](https://blog.openai.com/language-unsupervised/), [2](http://ruder.io/nlp-imagenet/)).

We want to create a system that can classify text into one or more categories. This is a difficult problem as the computer must somehow implicitly learn to "read". 

Idea: why not _first_ teach the computer to "read" and _then_ let it loose on the classification task?

We can teach the computer to "understand" language by training it to predict the next word of a sentence, using as much training data as we can get hold of. This is called ***language modelling*** in NLP.

This is a difficult task: to guess the next word of a sentence one has to know a lot about language, and also a lot about the world.

> What word fits here? _"The light turned green and Per crossed the ___"_

Luckily, obtaining large amounts of training data for language models is simple: any text can be used. The labels are simply the next word of a subpart of the text. 
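A small sketch of how such training pairs can be cut out of raw text, with no manual labelling involved. The helper `next_word_pairs` and the context length are assumptions made for illustration.

```python
def next_word_pairs(tokens, context_len=3):
    """Return (context, target) pairs where target is the token after context."""
    pairs = []
    for i in range(context_len, len(tokens)):
        pairs.append((tokens[i - context_len:i], tokens[i]))
    return pairs

tokens = "xxbos i still have the flu !".split()
pairs = next_word_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)

# For a whole sequence, the targets are simply the inputs shifted one position
inputs, targets = tokens[:-1], tokens[1:]
print(inputs)
print(targets)
```

The shifted version is what the `text` and `text_` columns in the `show_batch` output display: the second column is the first one moved one token ahead.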

We can for example use Wikipedia. Once the model performs reasonably well at predicting the next word of Wikipedia text, we can fine-tune it on text that's closer to the classification task we're after.

> This is often called ***transfer learning***.

We can use the tweet text to fine-tune a model that's already been pretrained on Wikipedia:

In [16]:
data_lm = TextDataLoaders.from_df(df, text_col='Tweet', is_lm=True, valid_pct=0.1)

data_lm.show_batch(max_n=3)



Unnamed: 0,text,text_
0,"xxbos i still have the flu ! xxbos xxmaj looks like i have a xxunk cold . xxmaj this sucks . xxbos i ca n't take this stuffy nose . i ca n't feel anything anymore . xxbos xxmaj i 've come down with the flu even though it 's not the season for it , i have a fever . xxbos i love the xxunk of her voice when she has","i still have the flu ! xxbos xxmaj looks like i have a xxunk cold . xxmaj this sucks . xxbos i ca n't take this stuffy nose . i ca n't feel anything anymore . xxbos xxmaj i 've come down with the flu even though it 's not the season for it , i have a fever . xxbos i love the xxunk of her voice when she has a"
1,"! xxbos xxmaj how do i say diarrhea in xxmaj english ? xxbos xxmaj way too many people in my club have the flu xxbos i thought it was allergies , but i got a fever and it turned out to be a cold ! i did n't think i would make a mistake like that . xxbos xxmaj it 's rough working when i have a headache , so i need","xxbos xxmaj how do i say diarrhea in xxmaj english ? xxbos xxmaj way too many people in my club have the flu xxbos i thought it was allergies , but i got a fever and it turned out to be a cold ! i did n't think i would make a mistake like that . xxbos xxmaj it 's rough working when i have a headache , so i need to"
2,"something bad . xxbos i never thought xxmaj i 'd get allergies . xxbos i wish i had a wife who would n't do anything with a fever . xxbos xxmaj it sucks that i have a cold . xxmaj my head feels xxunk , too . xxmaj scary . xxbos xxmaj diarrhea like xxmaj i 've never xxunk , all i can do is laugh . i wonder if it was","bad . xxbos i never thought xxmaj i 'd get allergies . xxbos i wish i had a wife who would n't do anything with a fever . xxbos xxmaj it sucks that i have a cold . xxmaj my head feels xxunk , too . xxmaj scary . xxbos xxmaj diarrhea like xxmaj i 've never xxunk , all i can do is laugh . i wonder if it was something"


In [17]:
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=True, 
                               metrics=[accuracy, Perplexity()], wd=0.1).to_fp16()

Let's start training:

In [18]:
learn.fine_tune(50)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,5.318592,4.92261,0.223661,137.360626,00:03


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.985348,4.690354,0.23192,108.891762,00:02
1,4.879693,4.464879,0.235714,86.910515,00:03
2,4.770054,4.242317,0.242411,69.568871,00:02
3,4.65295,4.027235,0.249107,56.105541,00:02
4,4.527309,3.830331,0.258259,46.077793,00:02
5,4.388569,3.652602,0.270982,38.574898,00:02
6,4.26169,3.51807,0.291295,33.719273,00:02
7,4.13482,3.399183,0.314509,29.939615,00:02
8,4.017929,3.30426,0.325893,27.228392,00:03
9,3.907169,3.219349,0.339955,25.011829,00:02
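A note on reading the table: perplexity is the exponential of the cross-entropy loss, so the `perplexity` column can be recomputed from `valid_loss`. The short check below uses three rows copied from the log; nothing else is assumed.

```python
import math

# (valid_loss, perplexity) pairs copied from the training log
logged = [(4.92261, 137.360626), (4.690354, 108.891762), (3.219349, 25.011829)]

for valid_loss, perplexity in logged:
    print(f"exp({valid_loss}) = {math.exp(valid_loss):.3f}  (logged: {perplexity})")
```

Loosely speaking, a perplexity of about 25 means the model is as uncertain about the next token as if it were choosing uniformly among roughly 25 candidates.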


In [None]:
#learn.save('medweb_finetuned_lm')

learn.load('medweb_finetuned_lm')

...and save the parts of the model that we can reuse for classification later:

In [None]:
#learn.save_encoder('medweb_finetuned_encoder')

## Test the language model

We can test the language model by having it generate a given number of words that continue a starting text:

In [19]:
def make_text(seed_text, nb_words):
    """
    Use the trained language model to produce text. 
    Input:
        seed_text: some text to get the model started
        nb_words: number of words to produce
    """
    pred = learn.predict(seed_text, nb_words, temperature=0.75)
    pp(pred)

In [20]:
make_text("I'm not feeling too good as my", 10)

"I 'm not feeling too good as my wife has a fever , and she 's so weak"


In [21]:
make_text("No, that's a", 40)

("No , that 's a cold and I 've got phlegm all day . i wonder if I 'm going to "
 "take medicine for my allergies . i have a headache , so I 'm going to go to "
 'school')
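The `temperature=0.75` argument used in `make_text` controls how adventurous the sampling is. The sketch below shows one common way to apply a temperature, by dividing the model's scores before turning them into probabilities. The words and scores are made up, and fastai's implementation may differ in detail.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_word(words, logits, temperature=1.0):
    probs = softmax_with_temperature(logits, temperature)
    return random.choices(words, weights=probs, k=1)[0]

words = ["wife", "head", "nose", "dog"]
logits = [2.0, 1.5, 1.0, -1.0]               # made-up scores for illustration

for t in (1.0, 0.75, 0.1):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
print(sample_next_word(words, logits, temperature=0.75))
```

Lower temperatures concentrate the probability on the top-scoring word, giving safer but more repetitive text; higher temperatures give more varied but less coherent text.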


Now we have something that seems to produce text that resembles the text to be classified. 

> **Note:** It's interesting to see that the model can come up with text that makes some sense (mostly thanks to training on Wikipedia), and that the text resembles the medical tweets (thanks to the fine-tuning). 

> **Note** also that an accuracy of 30-40% when predicting the next word of a sentence is pretty impressive, as the number of possibilities is very large (equal to the size of the vocabulary).

> **Also note** that this is not the task we care about: it's a pretext task before the tweet classification. 

# Classifier

In [22]:
medweb = DataBlock(blocks=(TextBlock.from_df(text_cols='Tweet', seq_len=12, vocab=data_lm.vocab), MultiCategoryBlock), 
                  get_x = ColReader(cols='text'), 
                  get_y = ColReader(cols='labels', label_delim=";"),
                  splitter = ColSplitter(col='is_test'))

data = medweb.dataloaders(df, bs=8)



Now our task is to predict which of the classes each tweet belongs to:

In [23]:
data.show_batch()

Unnamed: 0,text,None
0,"xxbos i heard that in xxmaj china , someone died of a new type of flu xxunk from the bird flu . xxmaj that got me wondering if there is any xxunk like some xxunk to check to see if i have the flu or not . xxmaj they talk about the flu a lot on xxup tv , but i have no idea how they xxunk the outbreak .",sober
1,"xxbos xxmaj this woman i work with called the office , coughing xxunk while saying that she had to take a day off because of a high fever . xxmaj then xxunk i found out on her blog that she had actually gone to an xxunk xxunk , which made me laugh .",sober
2,"xxbos i read some of the how - to xxunk on nasal xxunk , but xxunk of them was xxunk xxunk . i also see so many xxunk personal remedies on xxunk . i think i should just go to the doctor 's and have them take a look .",Runnynose
3,"xxbos xxmaj they told me at the doctor that i have allergies , i ca n't believe it ! xxmaj my throat hurts , and xxmaj i 've had a slight fever , so i thought it was an ordinary cold . xxmaj allergies . xxmaj who knew .",Cold;Cough;Fever;Hayfever;Runnynose
4,"xxbos i was like , xxmaj i 'm so lightheaded , i must be coming down with a cold . xxmaj now that i think about it , i feel weak , so xxmaj i 'm going to get some rest before it gets worse .",Cold
5,"xxbos xxmaj when it gets hot out , we xxunk out cool xxunk and xxunk to xxunk with the a / xxup xxunk , but it 's actually bad because the cold air xxunk can make you sick and end up with a fever .",Fever
6,"xxbos xxmaj there was a xxmaj tokyo scene in the xxmaj xxunk xxunk i saw yesterday , and the xxunk was so xxunk i blew snot out my nose . xxmaj there was no way i could stop myself from laughing at that .",sober
7,"xxbos i got to see my dog in one of the pictures i xxunk from my xxunk in xxmaj tokyo . i had been feeling sick from a cold and a runny nose , but it gave me a little xxunk .",Cold;Runnynose
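Since a tweet can carry several labels at once, the `labels` strings are split on `;` (the `label_delim` above) and each tweet gets one 0/1 target per class. The sketch below shows this encoding in plain Python; `multi_hot` is a hypothetical helper, and `sober` appears to be the label used for tweets with none of the eight symptoms.

```python
def multi_hot(label_string, classes, delim=";"):
    """Encode a delimited label string as a 0/1 vector over classes."""
    present = set(label_string.split(delim))
    return [1 if c in present else 0 for c in classes]

rows = ["Cold", "Hayfever;Runnynose", "Hayfever;Fever;Runnynose", "sober"]

# The class list is collected from the data and sorted
classes = sorted({label for row in rows for label in row.split(";")})
print(classes)
for row in rows:
    print(row, "->", multi_hot(row, classes))
```

This is what makes the problem a multi-label one: the model outputs one probability per class instead of picking a single winner.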


In [24]:
learn_clf = text_classifier_learner(data, AWD_LSTM, seq_len=16, pretrained=True, 
                                    drop_mult=0.5, metrics=accuracy_multi).to_fp16()

In [25]:
learn_clf = learn_clf.load_encoder('medweb_finetuned_encoder')

In [26]:
learn_clf.fine_tune(12, base_lr=1e-2)

epoch,train_loss,valid_loss,accuracy_multi,time
0,0.356136,0.292944,0.884201,00:08


epoch,train_loss,valid_loss,accuracy_multi,time
0,0.260411,0.196204,0.925347,00:12
1,0.205894,0.166761,0.938368,00:11
2,0.180932,0.151657,0.940104,00:12
3,0.153667,0.144264,0.947222,00:11
4,0.135734,0.139834,0.948958,00:11
5,0.119323,0.129324,0.955382,00:11
6,0.093922,0.130301,0.954166,00:11
7,0.069797,0.139377,0.955903,00:11
8,0.054521,0.141066,0.956423,00:11
9,0.051655,0.145058,0.959028,00:11
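When reading these numbers, keep in mind what a multi-label accuracy measures. Assuming it counts every individual tweet/class decision separately, which is our understanding of fastai's `accuracy_multi` with its default threshold of 0.5, the metric can be sketched as follows. The predictions and targets are made up.

```python
def accuracy_multi_sketch(probs, targets, thresh=0.5):
    """Fraction of individual label decisions (tweet x class) that are correct."""
    correct = total = 0
    for p_row, t_row in zip(probs, targets):
        for p, t in zip(p_row, t_row):
            correct += int((p > thresh) == bool(t))
            total += 1
    return correct / total

# Made-up predictions for two tweets and four classes
probs   = [[0.9, 0.2, 0.1, 0.7],
           [0.1, 0.6, 0.2, 0.1]]
targets = [[1,   0,   0,   0],
           [0,   1,   0,   0]]

print(accuracy_multi_sketch(probs, targets))            # 7 of 8 decisions correct
print(accuracy_multi_sketch([[0.0] * 4] * 2, targets))  # predicting nothing at all
```

Because most classes are absent from most tweets, even a model that never predicts anything scores well on this measure (0.75 in the toy example), so an accuracy in the nineties is less impressive than it would be for single-label classification.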


In [None]:
#learn_clf.save('medweb_classifier')

learn_clf.load('medweb_classifier')

## Is it a good classifier?

We can test it out on some example text:

In [27]:
learn_clf.predict("I'm feeling really bad. My head hurts. My nose is runny. I've felt like this for days.")

((#2) ['Headache','Runnynose'],
 tensor([False, False, False, False, False,  True, False,  True, False]),
 tensor([4.5626e-01, 4.7552e-03, 1.0975e-04, 1.1502e-04, 6.0471e-04, 9.7146e-01,
         2.9312e-05, 9.8693e-01, 4.8674e-03]))
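The three parts of this output are the predicted labels, a True/False mask and the per-class probabilities. The sketch below reproduces the decision from the probabilities by applying a threshold. The class order is an assumption (alphabetical, which is consistent with the mask pointing at `Headache` and `Runnynose`), and `decode` is a hypothetical helper.

```python
# Assumed class order (alphabetical, with 'sober' last)
classes = ['Cold', 'Cough', 'Diarrhea', 'Fever', 'Hayfever',
           'Headache', 'Influenza', 'Runnynose', 'sober']

# Probabilities copied from the prediction output
probs = [4.5626e-01, 4.7552e-03, 1.0975e-04, 1.1502e-04, 6.0471e-04,
         9.7146e-01, 2.9312e-05, 9.8693e-01, 4.8674e-03]

def decode(probs, classes, thresh=0.5):
    """Return every class whose probability exceeds the threshold."""
    return [c for c, p in zip(classes, probs) if p > thresh]

print(decode(probs, classes))
print(decode(probs, classes, thresh=0.4))
```

With the threshold at 0.5 we get the two labels shown above. Under the assumed class order, `Cold` sits at about 0.46, so a slightly lower threshold would have added it as a third label.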

It seems to produce reasonable results. _But remember that this is a very small data set._ We cannot expect the model to do well on text that differs much from the small collection it was trained on. This illustrates the need for "big data" in deep learning.

### How does it compare to other approaches?

From the [original article](https://www.jmir.org/2019/2/e12783/) that presented the data set:

<img src="assets/medweb_results.png">

# End notes

* This of course only scratches the surface of NLP and deep learning applied to NLP. The goal was to "lift the curtain" and show some of the ideas behind modern text analysis software.
* If you're interested in digging into deep learning for NLP you should check out `fastai` (used above) and also `Hugging Face`: https://huggingface.co. 