_ELMED219-2022_. Alexander S. Lundervold, 09.01.2022.

# Natural language processing and machine learning: a small case-study

This is a quick example of some techniques and ideas from natural language processing (NLP) and some approaches to NLP based on _deep learning_.

> Note: we'll take a close look at what deep learning is in tomorrow's lecture and lab.

> Note: If you want to run this notebook on your own computer, ask Alexander for assistance. The software requirements are different from the other ELMED219 notebooks (and also slightly more tricky to install, depending on your setup). 

> **NB:** If you're running the notebook on Colab, you should attach a GPU to the session by clicking "Runtime" --> "Change runtime type" and then selecting a GPU hardware accelerator. 

# Setup

In [1]:
# This is a quick check of whether the notebook is currently running 
# on Google Colaboratory
if 'google.colab' in str(get_ipython()):
    print('The notebook is running on Colab. colab=True.')
    colab=True
else:
    print('The notebook is not running on Colab. colab=False.')
    colab=False

The notebook is not running on Colab. colab=False.


In [2]:
if colab:
    import sys
    # Install fastai etc
    !pip install -Uqq fastbook
    !{sys.executable} -m spacy download en_core_web_sm
    from fastbook import *
if not colab:
    print("WARNING: To run this notebook locally you will have to install fastai")
    print("See the course repo for details")

See the course repo for details


We'll use the [spacy library]() for NLP and the [fastai]() library for deep learning.

In [3]:
import spacy

In [4]:
from fastai.text.all import *
from pprint import pprint as pp

# Load data

We use a data set collected in the work of Wakamiya et al., _Tweet Classification Toward Twitter-Based Disease Surveillance: New Data, Methods, and Evaluations_, 2019: https://www.jmir.org/2019/2/e12783/

![medweb-paper](https://github.com/MMIV-ML/ELMED219-2022/raw/main/Lab2-NLP/assets/medweb-paper.png)

The data is supposed to represent tweets that discuss one or more of eight symptoms.

From the original paper:
<img src="https://github.com/MMIV-ML/ELMED219-2022/raw/main/Lab2-NLP/assets/medweb_examples.png">

We'll only look at the English language tweets:

In [5]:
df = pd.read_csv('https://github.com/MMIV-ML/ELMED219-2022/raw/main/Lab2-NLP/data/medwebdata.csv')

In [6]:
df.head()

Unnamed: 0,ID,Tweet,Influenza,Diarrhea,Hayfever,Cough,Headache,Fever,Runnynose,Cold,labels,is_test
0,1en,The cold makes my whole body weak.,0,0,0,0,0,0,0,1,Cold,False
1,2en,It's been a while since I've had allergy symptoms.,0,0,1,0,0,0,1,0,Hayfever;Runnynose,False
2,3en,I'm so feverish and out of it because of my allergies. I'm so sleepy.,0,0,1,0,0,1,1,0,Hayfever;Fever;Runnynose,False
3,4en,"I took some medicine for my runny nose, but it won't stop.",0,0,0,0,0,0,1,0,Runnynose,False
4,5en,I had a bad case of diarrhea when I traveled to Nepal.,0,0,0,0,0,0,0,0,sober,False


In [7]:
pp(df['Tweet'][10])

("They say we will have less pollen next spring, but it doesn't really matter "
 'to me, since my allergy gets severe in the autumn.')


From this text the goal is to determine whether the person is talking about one or more of the eight symptoms or conditions listed above:

In [8]:
list(df.columns[2:-2])

['Influenza',
 'Diarrhea',
 'Hayfever',
 'Cough',
 'Headache',
 'Fever',
 'Runnynose',
 'Cold']

> **BUT:** How can a computer read??

<img src="http://2.bp.blogspot.com/_--uVHetkUIQ/TDae5jGna8I/AAAAAAAAAK0/sBSpLudWmcw/s1600/reading.gif">

# Prepare the data

For a computer, everything is numbers. We have to convert the text to a series of numbers, and then feed those to the computer. 

This can be done in two steps that are widely used in natural language processing: **tokenization** and **numericalization**.

## Tokenization

In tokenization the text is split into single words, called tokens. A simple way to achieve this is to split on the spaces in the text. But then, among other things, punctuation stays glued to the neighbouring words, and we miss the fact that some words are contractions of multiple words (for example _isn't_ and _don't_).

<img src="https://spacy.io/tokenization-57e618bd79d933c4ccd308b5739062d6.svg">
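
To make the difference concrete, here is a minimal sketch that compares splitting on spaces with a small rule-based tokenizer. The regular expression and the helper name `simple_tokenize` are made up for this illustration; real tokenizers such as spaCy's use many more rules and exceptions.

```python
import re

text = "I don't feel well, isn't that obvious?"

# Naive approach: split on whitespace. Punctuation stays glued to the words.
naive_tokens = text.split()

# A few hand-written rules (illustrative only):
#   n't          -> the negation part of a contraction
#   '\w+         -> endings such as 'm, 's, 've
#   \w+(?=n't)   -> the word in front of n't (do|n't, is|n't)
#   \w+          -> ordinary words
#   [^\w\s]      -> any single punctuation mark
TOKEN_PATTERN = re.compile(r"n't|'\w+|\w+(?=n't)|\w+|[^\w\s]")

def simple_tokenize(s):
    """Split a string into word, contraction and punctuation tokens."""
    return TOKEN_PATTERN.findall(s)

print(naive_tokens)
print(simple_tokenize(text))
```

The second list separates the comma and the question mark from the words and splits _don't_ into `do` and `n't`, which is the same kind of split you can see in the fastai output in this notebook (for example `was n't`).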

Here are some results after tokenization:

In [9]:
data_lm = TextDataLoaders.from_df(df, text_col='Tweet', is_lm=True, valid_pct=0.1)

data_lm.show_batch(max_n=2)



Unnamed: 0,text,text_
0,"xxbos xxmaj i 'm so tired from coughing so much . xxbos xxmaj the fever medicine xxmaj xxunk gave me works crazy well , i got over my cold in no time xxbos i put on a mask and am xxunk with headache medicine . xxbos i have to get xxunk for allergy season ! xxbos xxmaj i 've always had all xxunk of xxunk , like allergies and eczema . i","xxmaj i 'm so tired from coughing so much . xxbos xxmaj the fever medicine xxmaj xxunk gave me works crazy well , i got over my cold in no time xxbos i put on a mask and am xxunk with headache medicine . xxbos i have to get xxunk for allergy season ! xxbos xxmaj i 've always had all xxunk of xxunk , like allergies and eczema . i have"
1,"headache . xxmaj this is bad . xxbos i wonder if i caught a bug since i have a high fever . xxbos i got check out at the hospital because i was n't feeling well , and it turned out to be the flu . xxbos xxmaj they say a cold is not a disease . xxbos xxmaj this is bad , xxmaj xxunk has a cold . xxbos xxmaj by",". xxmaj this is bad . xxbos i wonder if i caught a bug since i have a high fever . xxbos i got check out at the hospital because i was n't feeling well , and it turned out to be the flu . xxbos xxmaj they say a cold is not a disease . xxbos xxmaj this is bad , xxmaj xxunk has a cold . xxbos xxmaj by the"


Tokens starting with "xx" are special. `xxbos` means the beginning of the text, `xxmaj` means that the following word is capitalized, `xxup` means that the following word is in all caps, and so on.
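
The idea behind these special tokens can be sketched in a few lines. This is a simplified illustration, not fastai's actual rules (which, among other things, treat single-letter words such as "I" differently); the function name `add_case_tokens` is hypothetical.

```python
def add_case_tokens(tokens):
    """Lowercase tokens and record their capitalization with special tokens."""
    out = ["xxbos"]  # mark the beginning of the text
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():
            out += ["xxup", tok.lower()]    # word written in all caps
        elif tok[0].isupper():
            out += ["xxmaj", tok.lower()]   # capitalized word
        else:
            out.append(tok)                 # nothing to record
    return out

print(add_case_tokens(["They", "talk", "about", "the", "flu", "on", "TV", "."]))
```

The point of this trick is that `They` and `they` end up sharing one vocabulary entry, while the information about capitalization is kept in a separate token.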

The token `xxunk` replaces words that are rare in the text corpus. We keep only words that appear at least three times (fastai's default, `min_freq=3`), up to a set maximum number of different words (60,000 in our case). The words we keep make up our **vocabulary**.
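
A minimal sketch of how such a vocabulary could be built from token counts. The helper names `build_vocab` and `replace_rare` are made up for this illustration, and the frequency threshold is a parameter: the tiny toy corpus below uses a threshold of 2 only so that some words survive.

```python
from collections import Counter

def build_vocab(token_lists, min_freq, max_vocab):
    """Keep the most common tokens that occur at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = ["xxunk"]  # reserved entry for everything we throw away
    for tok, count in counts.most_common(max_vocab):
        if count >= min_freq and tok != "xxunk":
            vocab.append(tok)
    return vocab

def replace_rare(tokens, vocab):
    """Replace tokens outside the vocabulary with xxunk."""
    known = set(vocab)
    return [tok if tok in known else "xxunk" for tok in tokens]

corpus = [
    ["i", "have", "a", "cold"],
    ["i", "have", "a", "fever"],
    ["my", "cold", "is", "bad"],
]

vocab = build_vocab(corpus, min_freq=2, max_vocab=60000)
print(vocab)
print(replace_rare(corpus[1], vocab))
```

Here `fever` occurs only once in the toy corpus, so it is replaced by `xxunk`.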

## Numericalization

We convert tokens to numbers by making a list of all the tokens that have been used and assigning a number to each of them.

The above text is replaced by numbers, as in this example:

In [10]:
data_lm.train_ds[0][0]

TensorText([  2,   8,  13, 140,  19, 187,  10,   8, 121,  53,  11, 228,  28,  17,
         35,  10])
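
The mapping itself is just a lookup table. Here is a minimal sketch with a tiny, made-up vocabulary (the real one is built from the tweets); `numericalize` and `textify` are illustrative helper names.

```python
# A tiny, made-up vocabulary: the position in the list is the token's number.
vocab = ["xxunk", "xxbos", "xxmaj", "i", "have", "a", "cold", "."]
stoi = {tok: idx for idx, tok in enumerate(vocab)}  # "string to integer"

def numericalize(tokens):
    """Map tokens to numbers; unknown tokens get the number of xxunk."""
    return [stoi.get(tok, stoi["xxunk"]) for tok in tokens]

def textify(ids):
    """Map numbers back to tokens."""
    return [vocab[i] for i in ids]

tokens = ["xxbos", "xxmaj", "i", "have", "a", "fever", "."]
ids = numericalize(tokens)
print(ids)
print(textify(ids))
```

Note that the round trip is not lossless: `fever` is not in this toy vocabulary, so it comes back as `xxunk`.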

> **We are now in a position where the computer can compute on the text.**

# "Classical" versus deep learning-based NLP

In [11]:
nlp = spacy.load('en_core_web_sm')

### Sentence Boundary Detection: splitting into sentences

Example sentence:
> _"Patient presents for initial evaluation of cough. Cough is reported to have developed acutely and has been present for 4 days. Symptom severity is moderate. Will return next week."_

In [12]:
sentence = "Patient presents for initial evaluation of cough. Cough is reported to have developed acutely and has been present for 4 days. Symptom severity is moderate. Will return next week."
doc = nlp(sentence)
 
for sent in doc.sents:
    print(sent)

Patient presents for initial evaluation of cough.
Cough is reported to have developed acutely and has been present for 4 days.
Symptom severity is moderate.
Will return next week.
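
For comparison, here is what a naive rule-based splitter could look like. It is only a sketch (the function `naive_split` is made up for this illustration): it works on the example above, but breaks as soon as a period does not end a sentence.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

note = ("Patient presents for initial evaluation of cough. Cough is reported "
        "to have developed acutely and has been present for 4 days. "
        "Symptom severity is moderate. Will return next week.")

for sent in naive_split(note):
    print(sent)

# An abbreviation is enough to fool the rule:
print(naive_split("Dr. Smith saw the patient. She has a cough."))
```

The last call splits after "Dr.", which is one reason libraries like spaCy rely on more than a single punctuation rule.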


### Named Entity Recognition

In [13]:
for ent in doc.ents:
    print(ent.text, ent.label_)

4 days DATE
Symptom PERSON
next week DATE


In [14]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

### Dependency parsing

In [15]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

> There's a lot more to natural language processing, of course! Have a look at [spaCy 101: Everything you need to know](https://spacy.io/usage/spacy-101) for some examples.

In general, data preparation and feature engineering are a huge and difficult undertaking when using machine learning to analyse text.

However, in what's called _deep learning_ (discussed in detail tomorrow) most of this work is done by the computer! That's because deep learning does feature extraction _and_ prediction in the same model. 

This results in much less work and, often, _in much better models_!

![MLvsDL](https://aylien.com/images/uploads/general/tumblr_inline_oabas5sThb1sleek4_540.png)

# Deep learning language model

We now come to a relatively new and very powerful idea for deep learning and NLP, one that created a small revolution in NLP a couple of years ago ([1](https://blog.openai.com/language-unsupervised/), [2](http://ruder.io/nlp-imagenet/)).

We want to create a system that can classify text into one or more categories. This is a difficult problem as the computer must somehow implicitly learn to "read". 

Idea: why not _first_ teach the computer to "read" and _then_ let it loose on the classification task?

We can teach the computer to "understand" language by training it to predict the next word of a sentence, using as much training data as we can get hold of. This is called ***language modelling*** in NLP. 

This is a difficult task: to guess the next word of a sentence one has to know a lot about language, and also a lot about the world.

> What word fits here? _"The light turned green and Per crossed the ___"_

Luckily, obtaining large amounts of training data for language models is simple: any text can be used. The labels are simply the next word of a subpart of the text. 
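
A minimal sketch of how the training examples come for free from raw text: every prefix of a token sequence is an input, and the token that follows it is the label. The helper `next_word_pairs` is made up for this illustration; the tokens are taken from one of the tweets.

```python
def next_word_pairs(tokens):
    """Return (context, next token) pairs for every position in the text."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["xxbos", "xxmaj", "i", "'m", "taking", "medicine",
          "for", "this", "headache", "."]

pairs = next_word_pairs(tokens)
for context, target in pairs[-3:]:
    print(" ".join(context), "->", target)

# Equivalent view: the targets are the inputs shifted by one position.
inputs, targets = tokens[:-1], tokens[1:]
print(inputs)
print(targets)
```

The shifted view is what the `text` and `text_` columns in the `show_batch` output of this notebook display: the second column is the first one moved one token ahead.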

We can for example use Wikipedia. After the model performs alright at predicting the next word of Wikipedia text, we can fine-tune it on text that's closer to the classification task we're after. 

> This is often called ***transfer learning***.

We can use the tweet text to fine-tune a model that's already been pretrained on Wikipedia:

In [16]:
data_lm = TextDataLoaders.from_df(df, text_col='Tweet', is_lm=True, valid_pct=0.1)

data_lm.show_batch(max_n=3)



Unnamed: 0,text,text_
0,"xxbos xxmaj xxunk , i was n't feeling well last night , turns out it 's the flu . xxbos xxmaj i 'm taking medicine for this headache . xxbos xxmaj maybe not a lot of people know this , but there are fall allergies too ! xxbos xxmaj it was just a cold and a runny nose , but now my head is starting to hurt , so xxmaj i 'm","xxmaj xxunk , i was n't feeling well last night , turns out it 's the flu . xxbos xxmaj i 'm taking medicine for this headache . xxbos xxmaj maybe not a lot of people know this , but there are fall allergies too ! xxbos xxmaj it was just a cold and a runny nose , but now my head is starting to hurt , so xxmaj i 'm gon"
1,"i was coughing up , and it turned out that i had a high fever . xxbos xxmaj lately i have xxunk mild headaches . xxbos i wear a mask as a xxunk against the flu when i go out . xxbos xxmaj my abs hurt a lot from coughing xxunk . xxbos i have school but xxmaj i 'll stay in bed because i have a slight headache . xxbos i","was coughing up , and it turned out that i had a high fever . xxbos xxmaj lately i have xxunk mild headaches . xxbos i wear a mask as a xxunk against the flu when i go out . xxbos xxmaj my abs hurt a lot from coughing xxunk . xxbos i have school but xxmaj i 'll stay in bed because i have a slight headache . xxbos i have"
2,"thought it was a cold , but it turned out to be allergies . i had no idea it could cause a fever as well . xxbos i never thought i would suffer from allergies . xxbos i walk so xxunk when i have a headache , so that way , it does n't hurt as much . xxbos xxmaj my allergies are finally back . xxmaj it 's been quite a","it was a cold , but it turned out to be allergies . i had no idea it could cause a fever as well . xxbos i never thought i would suffer from allergies . xxbos i walk so xxunk when i have a headache , so that way , it does n't hurt as much . xxbos xxmaj my allergies are finally back . xxmaj it 's been quite a while"


In [17]:
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=True, drop_mult=0.3,
                               metrics=[accuracy, Perplexity()], wd=0.1, model_dir='.').to_fp16()

Let's start training:

In [18]:
learn.fit_one_cycle(1, 5e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.856825,4.581619,0.24683,97.672424,00:01


In [19]:
learn.unfreeze()
learn.fit_one_cycle(3, 5e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.63757,2.974067,0.372962,19.571363,00:01
1,3.097656,2.602711,0.422554,13.500284,00:01
2,2.737985,2.581139,0.425272,13.212176,00:01
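
A short aside on the `perplexity` column in the tables above: it is the exponential of the validation loss (the average cross-entropy per predicted token). Roughly speaking, a perplexity of 13 means the model is about as uncertain as if it had to choose uniformly among 13 words. The check below simply recomputes the column from the `valid_loss` values printed above.

```python
import math

# valid_loss values copied from the two training tables above
valid_losses = [4.581619, 2.974067, 2.602711, 2.581139]

perplexities = [math.exp(loss) for loss in valid_losses]
for loss, ppl in zip(valid_losses, perplexities):
    print(f"valid_loss={loss:.4f}  perplexity={ppl:.2f}")
```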


...and save the parts of the model that we can reuse for classification later:

In [20]:
learn.save_encoder('medweb_finetuned')

## Test the language model

We can test the language model by having it generate a given number of words that continue a starting text:

In [21]:
def make_text(seed_text, nb_words):
    """
    Use the trained language model to produce text. 
    Input:
        seed_text: some text to get the model started
        nb_words: number of words to produce
    """
    pred = learn.predict(seed_text, nb_words, temperature=0.75)
    pp(pred)

In [22]:
make_text("I'm not feeling too good as my", 10)

"I 'm not feeling too good as my nose has hurt , so i do n't want to"


In [23]:
make_text("No, that's a", 40)

("No , that 's a bug getting it out of my mouth , it 's really bad , and my "
 "nose is so feverish . It 's so bad . My wife has a fever and is not sure if "
 "it 's")


Now we have something that seems to produce text that resembles the text to be classified. 
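
The `temperature=0.75` argument used in `make_text` controls how adventurous the sampling of the next word is. Below is a sketch of the general idea, not fastai's exact code: the model's scores are divided by the temperature before being turned into probabilities, so a low temperature favours the most likely word and a high temperature gives unlikely words more of a chance. The candidate words and scores are made up.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature -> more peaked."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_word(words, logits, temperature, rng):
    """Draw one word at random according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(words, weights=probs, k=1)[0]

words = ["nose", "head", "throat", "cat"]  # made-up candidate words
logits = [2.0, 1.5, 1.0, -1.0]             # made-up model scores

rng = random.Random(0)
for t in (0.25, 0.75, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 2) for p in probs], sample_next_word(words, logits, t, rng))
```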

> **Note:** It's interesting to see that the model can come up with text that makes some sense (mostly thanks to training on Wikipedia), and that the text resembles the medical tweets (thanks to the fine-tuning). 

> **Note** also that an accuracy of around 40% when predicting the next word of a sentence is pretty impressive, as the number of possibilities is very large (equal to the size of the vocabulary).

> **Also note** that this is not the task we care about: it's a pretext task before the tweet classification. 

# Classifier

In [24]:
medweb = DataBlock(blocks=(TextBlock.from_df(text_cols='Tweet', seq_len=12, vocab=data_lm.vocab), MultiCategoryBlock), 
                  get_x = ColReader(cols='text'), 
                  get_y = ColReader(cols='labels', label_delim=";"),
                  splitter = ColSplitter(col='is_test'))

data = medweb.dataloaders(df, bs=8)



Now our task is to predict the possible classes the tweets can be assigned to:

In [25]:
data.show_batch()

Unnamed: 0,text,None
0,"xxbos i heard that in xxmaj china , someone died of a new type of flu xxunk from the bird flu . xxmaj that got me wondering if there is any xxunk like some xxunk to check to see if i have the flu or not . xxmaj they talk about the flu a lot on xxup tv , but i have no idea how they xxunk the outbreak .",sober
1,"xxbos xxmaj not only do i have a runny nose today , but i also have a stuffy nose . xxmaj i 've had to breathe through my mouth , and now my mouth is crazy dry . i have an important meeting tomorrow that i ca n't miss . xxmaj this is so , so bad .",Runnynose
2,"xxbos xxmaj you and i both have colds . i do n't have any fever medicine . xxmaj there 's a meeting at the xxunk tomorrow . i want to go to the doctor , but even with all the scientific xxunk , there 's no medicine that will cure you in an xxunk .",Cold;Fever
3,"xxbos i read some of the how - to xxunk on nasal xxunk , but xxunk of them was xxunk xxunk . i also see so many xxunk personal remedies on xxunk . i think i should just go to the doctor 's and have them take a look .",Runnynose
4,"xxbos xxmaj my stuffy nose is finally gone , so i went for a xxmaj tokyo xxunk . xxmaj it 's great how you can see the xxunk you love while out on a walk . xxmaj like pretty xxunk , or a xxunk talking with their child .",sober
5,"xxbos xxmaj tokyo is cool these days - in xxunk , xxmaj i 'd say it 's cold . xxmaj it 's weird , it 's the xxunk . xxmaj people i know are catching colds and getting fevers , maybe because of this strange weather .",sober
6,xxbos xxmaj god . xxmaj it 's been a while since xxmaj i 've had such a xxunk headache . xxmaj it might be the kind that you 're not supposed to take a bath with . xxmaj it made my headache worse .,Headache
7,"xxbos xxmaj my nose wo n't stop running and my stomach feels xxunk . xxmaj if i get a fever to xxunk , i do n't know if xxmaj i 'll be able to go on the trip this weekend .",Diarrhea;Runnynose
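
The `labels` column stores all classes of a tweet in one string, separated by `;`, which is why `label_delim=";"` is passed to `ColReader` above. For training, such a string is turned into a vector with one 0/1 entry per class. Here is a plain-Python sketch of that encoding; the helper `multi_hot` is made up, and the sorted class order is an assumption. Note that `sober` (used in the data for tweets without any positive label) counts as a class of its own, giving nine classes in total.

```python
# The eight symptom columns plus the extra "sober" label, in sorted order
# (the order is an assumption made for this illustration).
classes = sorted(["Influenza", "Diarrhea", "Hayfever", "Cough", "Headache",
                  "Fever", "Runnynose", "Cold", "sober"])

def multi_hot(label_string, classes, delim=";"):
    """Encode a delimited label string as a list of 0/1 flags, one per class."""
    present = set(label_string.split(delim))
    return [1 if c in present else 0 for c in classes]

print(classes)
print(multi_hot("Hayfever;Runnynose", classes))
print(multi_hot("sober", classes))
```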


In [26]:
learn_clf = text_classifier_learner(data, AWD_LSTM, seq_len=16, pretrained=True, 
                                    drop_mult=0.5, 
                                    metrics=[accuracy_multi, 
                                             F1ScoreMulti(average='micro'),
                                             F1ScoreMulti(average='macro'),
                                             PrecisionMulti(average='micro'),
                                             PrecisionMulti(average='macro'),
                                             RecallMulti(average='micro'),
                                             RecallMulti(average='macro'),
                                             HammingLossMulti(),
                                             ], 
                                    model_dir='.').to_fp16()

In [27]:
learn_clf = learn_clf.load_encoder('medweb_finetuned')

In [28]:
learn_clf.fine_tune(12, base_lr=1e-2)

epoch,train_loss,valid_loss,accuracy_multi,f1_score,f1_score.1,precision_score,precision_score.1,recall_score,recall_score.1,hamming_loss,time
0,0.375475,0.29422,0.890625,0.466102,0.41416,0.708763,0.783236,0.347222,0.307227,0.109375,00:03


epoch,train_loss,valid_loss,accuracy_multi,f1_score,f1_score.1,precision_score,precision_score.1,recall_score,recall_score.1,hamming_loss,time
0,0.27757,0.218796,0.915451,0.633007,0.626768,0.785047,0.774737,0.530303,0.544901,0.084549,00:04
1,0.212202,0.185644,0.930208,0.726158,0.74358,0.788462,0.787439,0.67298,0.757399,0.069792,00:04
2,0.179836,0.137869,0.946527,0.795485,0.800636,0.838936,0.823156,0.756313,0.793021,0.053472,00:04
3,0.165241,0.133466,0.952604,0.821452,0.822977,0.852103,0.844329,0.792929,0.809654,0.047396,00:04
4,0.146274,0.141117,0.950694,0.819797,0.820073,0.82398,0.803811,0.815657,0.853039,0.049306,00:04
5,0.105486,0.126308,0.955382,0.832355,0.828422,0.860999,0.845736,0.805556,0.826338,0.044618,00:04
6,0.095217,0.13661,0.957465,0.841629,0.844146,0.862252,0.849713,0.82197,0.84505,0.042535,00:04
7,0.075972,0.128434,0.957292,0.84251,0.84298,0.854545,0.839478,0.830808,0.859031,0.042708,00:04
8,0.058423,0.130946,0.961285,0.858771,0.859236,0.861499,0.844787,0.856061,0.884294,0.038715,00:04
9,0.050776,0.12711,0.963541,0.866071,0.863495,0.875,0.858302,0.857323,0.873717,0.036458,00:04
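
A note on how to read the metric columns. With several labels per tweet, "micro" averaging pools the true/false positives of all classes before computing a score, while "macro" averaging computes the score per class and then takes the mean, so rare classes weigh as much as common ones. The Hamming loss is the fraction of individual label decisions that are wrong; in the table above it is one minus `accuracy_multi`. The sketch below computes these quantities by hand on made-up predictions (the function names are hypothetical, and this is not how fastai implements them internally).

```python
def f1(tp, fp, fn):
    """F1 score from counts; defined as 0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def multilabel_metrics(y_true, y_pred):
    """Micro/macro F1 and Hamming loss for lists of 0/1 label vectors."""
    n_samples, n_labels = len(y_true), len(y_true[0])
    counts = []  # (tp, fp, fn) for every label
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    micro = f1(*[sum(c[k] for c in counts) for k in range(3)])
    macro = sum(f1(*c) for c in counts) / n_labels
    wrong = sum(t[j] != p[j] for t, p in zip(y_true, y_pred) for j in range(n_labels))
    hamming = wrong / (n_samples * n_labels)
    return {"micro_f1": micro, "macro_f1": macro,
            "hamming_loss": hamming, "accuracy_multi": 1 - hamming}

# Made-up example: 4 tweets, 3 labels
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 1, 1]]

metrics = multilabel_metrics(y_true, y_pred)
print(metrics)
```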


## Is it a good classifier?

We can test it out on some example text:

In [29]:
learn_clf.predict("I'm feeling really bad. My head hurts. My nose is runny. I've felt like this for days.")

((#2) ['Headache','Runnynose'],
 tensor([False, False, False, False, False,  True, False,  True, False]),
 tensor([1.1476e-01, 6.4132e-03, 1.1879e-03, 4.8030e-04, 1.7376e-02, 9.9395e-01,
         1.5937e-05, 9.4892e-01, 2.6214e-03]))
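
To read this output: the last tensor holds one probability per class, the middle tensor marks which of them pass a threshold, and the first element lists the corresponding class names. The sketch below reproduces that last step by hand. It assumes a threshold of 0.5 and that the classes are stored in sorted order; both assumptions agree with the output above, but they are not read from the learner here, and `decode` is a made-up helper.

```python
# Assumed class order (sorted; uppercase letters sort before lowercase)
classes = sorted(["Influenza", "Diarrhea", "Hayfever", "Cough", "Headache",
                  "Fever", "Runnynose", "Cold", "sober"])

# Probabilities copied from the prediction above
probs = [1.1476e-01, 6.4132e-03, 1.1879e-03, 4.8030e-04, 1.7376e-02,
         9.9395e-01, 1.5937e-05, 9.4892e-01, 2.6214e-03]

def decode(probs, classes, thresh=0.5):
    """Return the classes whose probability is at least the threshold."""
    return [c for c, p in zip(classes, probs) if p >= thresh]

print(decode(probs, classes))              # default threshold
print(decode(probs, classes, thresh=0.1))  # a more permissive threshold
```

Lowering the threshold trades precision for recall: with a threshold of 0.1 the model would also report `Cold` for this sentence.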

It seems to produce reasonable results. _But remember that this is a very small data set._ One cannot expect the model to do well on text that differs much from the small amount of material it has been trained on. 

Go ahead and try to have the model predict symptoms for a few example sentences and you'll see.

This illustrates the need for "big data" in deep learning.

### How does it compare to other approaches?

From the [original article](https://www.jmir.org/2019/2/e12783/) from 2019 that presented the data set:

<img src="https://github.com/MMIV-ML/ELMED219-2022/raw/main/Lab2-NLP/assets/medweb_results.png">

The "NAIST-en" models are _"ensembles of hierarchical attention network and deep character-level convolutional neural network with loss functions (negative loss function, hinge, and hinge squared)"_. I.e. also deep learning-based models.

# End notes

* This of course only scratches the surface of NLP and deep learning applied to NLP. The goal was to "lift the curtain" and show some of the ideas behind modern text analysis software.
* If you're interested in digging into deep learning for NLP, you should check out [Hugging Face](https://huggingface.co) and also `fastai` (used above).


> The approach taken above is not really the state-of-the-art for NLP anymore. Over the past two to three years, so-called **Transformer** models have taken over in NLP. They are, for example, behind NLP breakthroughs such as [GPT-3](https://en.wikipedia.org/wiki/GPT-3). [Hugging Face](https://huggingface.co) is a great starting point to learn more and get started using Transformer models.