_ELMED219-2021_. Alexander S. Lundervold, 10.01.2021.

# Natural language processing and machine learning: a small case-study

This is a quick example of some techniques and ideas from natural language processing (NLP) and some modern approaches to NLP based on _deep learning_.

> Note: we'll take a close look at what deep learning is in tomorrow's lecture and lab.

> Note: If you want to run this notebook on your own computer, ask Alexander for assistance. The software requirements are different from the other ELMED219 notebooks (and also slightly more tricky to install, depending on your setup). 

# Setup

We'll use the [spaCy library](https://spacy.io) for NLP and the [fastai](https://docs.fast.ai) library for deep learning.

In [1]:
import spacy

In [2]:
from fastai.text.all import *
from pprint import pprint as pp

# Load data

We use a data set collected in the work of Wakamiya et al., _Tweet Classification Toward Twitter-Based Disease Surveillance: New Data, Methods, and Evaluations_, 2019: https://www.jmir.org/2019/2/e12783/

![medweb-paper](assets/medweb-paper.png)

The data is supposed to represent tweets that discuss one or more of eight symptoms.

From the original paper:
<img src="assets/medweb_examples.png">

We'll only look at the English language tweets:

In [3]:
df = pd.read_csv('data/medweb/medwebdata.csv')

In [4]:
df.head()

Unnamed: 0,ID,Tweet,Influenza,Diarrhea,Hayfever,Cough,Headache,Fever,Runnynose,Cold,labels,is_test
0,1en,The cold makes my whole body weak.,0,0,0,0,0,0,0,1,Cold,False
1,2en,It's been a while since I've had allergy symptoms.,0,0,1,0,0,0,1,0,Hayfever;Runnynose,False
2,3en,I'm so feverish and out of it because of my allergies. I'm so sleepy.,0,0,1,0,0,1,1,0,Hayfever;Fever;Runnynose,False
3,4en,"I took some medicine for my runny nose, but it won't stop.",0,0,0,0,0,0,1,0,Runnynose,False
4,5en,I had a bad case of diarrhea when I traveled to Nepal.,0,0,0,0,0,0,0,0,sober,False


In [5]:
pp(df['Tweet'][10])

("They say we will have less pollen next spring, but it doesn't really matter "
 'to me, since my allergy gets severe in the autumn.')


From this text the goal is to determine whether the person is talking about one or more of the eight symptoms or conditions listed above:

In [6]:
list(df.columns[2:-2])

['Influenza',
 'Diarrhea',
 'Hayfever',
 'Cough',
 'Headache',
 'Fever',
 'Runnynose',
 'Cold']

> **BUT:** How can a computer read??

<img src="http://2.bp.blogspot.com/_--uVHetkUIQ/TDae5jGna8I/AAAAAAAAAK0/sBSpLudWmcw/s1600/reading.gif">

# Prepare the data

For a computer, everything is numbers. We have to convert the text to a series of numbers, and then feed those to the computer. 

This can be done in two widely used steps in natural language processing: **tokenization** and **numericalization**:

## Tokenization

In tokenization the text is split into single words, called tokens. A simple way to achieve this is to split according to spaces in the text. But then we, among other things, lose punctuation, and also the fact that some words are contractions of multiple words (for example _isn't_ and _don't_). 
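To make the difference concrete, here is a minimal sketch that compares plain whitespace splitting with a small rule-based tokenizer. The regular expression and the function names are illustrative assumptions, not how spaCy or fastai tokenize internally; they only mimic the `do` / `n't` style of output that appears in this notebook.

```python
import re

# Hypothetical pattern, for illustration only: real tokenizers use many more rules.
# Order matters: contraction pieces are tried before plain words.
TOKEN_RE = re.compile(r"n't|'\w+|\w+(?=n't)|\w+|[^\w\s]")

def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays glued to the words."""
    return text.split()

def simple_tokenize(text):
    """Split off punctuation and contraction suffixes such as n't and 've."""
    return TOKEN_RE.findall(text)

example = "It isn't raining, so I don't care."
print(whitespace_tokenize(example))
# ['It', "isn't", 'raining,', 'so', 'I', "don't", 'care.']
print(simple_tokenize(example))
# ['It', 'is', "n't", 'raining', ',', 'so', 'I', 'do', "n't", 'care', '.']
```

With whitespace splitting, `raining,` and `raining` would count as two different words. The rule-based version keeps punctuation as separate tokens and splits contractions into their parts.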

<img src="https://spacy.io/tokenization-57e618bd79d933c4ccd308b5739062d6.svg">

Here are some results after tokenization:

In [7]:
data_lm = TextDataLoaders.from_df(df, text_col='Tweet', is_lm=True, valid_pct=0.1)

data_lm.show_batch(max_n=2)


Unnamed: 0,text,text_
0,"xxbos xxmaj i 've had a slight fever for some time . xxmaj is that because of allergies ? xxbos xxmaj every time i see a miko , i get so excited i get a fever . xxbos xxmaj since i also have phlegm now , xxmaj i 'm going to see the doctor . xxbos i caught a cold and my nose wo n't stop running . xxmaj my xxunk xxunk","xxmaj i 've had a slight fever for some time . xxmaj is that because of allergies ? xxbos xxmaj every time i see a miko , i get so excited i get a fever . xxbos xxmaj since i also have phlegm now , xxmaj i 'm going to see the doctor . xxbos i caught a cold and my nose wo n't stop running . xxmaj my xxunk xxunk today"
1,"do n't want to go to my part time job , i want to have a fever as well . xxbos xxmaj it 's common courtesy to not face another person when you cough xxbos i think xxmaj i 'm getting better now , xxmaj i 've stopped coughing up phlegm ! xxbos xxmaj it 's a pain when there 's phlegm xxbos i thought i was over my cold , but","n't want to go to my part time job , i want to have a fever as well . xxbos xxmaj it 's common courtesy to not face another person when you cough xxbos i think xxmaj i 'm getting better now , xxmaj i 've stopped coughing up phlegm ! xxbos xxmaj it 's a pain when there 's phlegm xxbos i thought i was over my cold , but i"


Tokens starting with "xx" are special. `xxbos` means the beginning of the text, `xxmaj` means that the following word is capitalized, `xxup` means that the following word is in all caps, and so on.
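The idea behind these markers can be sketched in a few lines. This is a simplified illustration under our own assumptions (the function name `add_special_tokens` is made up, and fastai applies more rules than these two); it only shows how capitalization can be turned into separate tokens so that `They`, `they` and `THEY` share one vocabulary entry.

```python
def add_special_tokens(tokens):
    """Prefix xxbos and replace capitalization with xxmaj / xxup markers."""
    out = ["xxbos"]  # marks the beginning of a text
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():
            out += ["xxup", tok.lower()]    # all caps, e.g. TV
        elif tok[:1].isupper():
            out += ["xxmaj", tok.lower()]   # capitalized, e.g. They
        else:
            out.append(tok)                 # lower case or punctuation
    return out

print(add_special_tokens(["They", "talk", "about", "the", "flu", "on", "TV", "."]))
# ['xxbos', 'xxmaj', 'they', 'talk', 'about', 'the', 'flu', 'on', 'xxup', 'tv', '.']
```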

The token `xxunk` replaces words that are rare in the text corpus. We keep only words that appear at least a minimum number of times (fastai's default is `min_freq=3`), up to a set maximum number of different words (60,000 in our case). This is called our **vocabulary**.

## Numericalization

We convert tokens to numbers by making a list of all the tokens that have been used and assigning each one a number.

The above text is replaced by numbers, as in this example:

In [8]:
data_lm.train_ds[0][0]

TensorText([  2,   8,   9,  46,  48,  29,  39,  11, 112,  79,  10,   8,  27,  19,
         61,  12,  19,  21,  16,  37])
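Building the vocabulary and looking up the numbers can be sketched with the standard library alone. The helper names `build_vocab` and `numericalize`, the toy tweets and the toy threshold are assumptions for illustration; fastai does this for us with its own transforms.

```python
from collections import Counter

SPECIALS = ["xxunk", "xxpad", "xxbos"]  # a small, assumed subset of special tokens

def build_vocab(token_lists, min_freq=2, max_vocab=60000):
    """Keep the most common tokens that occur at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks
                     if tok not in SPECIALS)
    frequent = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    return SPECIALS + frequent

def numericalize(tokens, vocab):
    """Map each token to its position in the vocabulary; unknown tokens map to xxunk."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, index["xxunk"]) for tok in tokens]

tweets = [["xxbos", "i", "have", "a", "cold"],
          ["xxbos", "i", "have", "a", "fever"]]
vocab = build_vocab(tweets, min_freq=2)   # toy threshold for this tiny corpus
ids = numericalize(["xxbos", "i", "have", "a", "headache"], vocab)
print(vocab)
print(ids)
```

Here `cold` and `fever` occur only once, so they fall below the threshold, and the unseen word `headache` is mapped to the number of `xxunk`.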

> **We are now in a position where the computer can compute on the text.**

# "Classical" versus deep learning-based NLP

In [9]:
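#import sys
#!{sys.executable} -m spacy download en_core_web_sm

In [10]:
# The 'en' shortcut was removed in spaCy v3; load the model by its full name
nlp = spacy.load('en_core_web_sm')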

### Sentence Boundary Detection: splitting into sentences

Example sentence:
> _"Patient presents for initial evaluation of cough. Cough is reported to have developed acutely and has been present for 4 days. Symptom severity is moderate. Will return next week."_

In [11]:
sentence = "Patient presents for initial evaluation of cough. Cough is reported to have developed acutely and has been present for 4 days. Symptom severity is moderate. Will return next week."
doc = nlp(sentence)
 
for sent in doc.sents:
    print(sent)

Patient presents for initial evaluation of cough.
Cough is reported to have developed acutely and has been present for 4 days.
Symptom severity is moderate.
Will return next week.
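For comparison, here is a naive rule-based splitter. It is a sketch under our own assumptions (the function name is made up), and it is not what spaCy does: spaCy's default pipeline derives sentence boundaries from its trained dependency parser. The rule works on the clinical note above but breaks on abbreviations.

```python
import re

def naive_sentence_split(text):
    """Split at whitespace that follows a period, exclamation or question mark."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

note = ("Patient presents for initial evaluation of cough. Cough is reported to "
        "have developed acutely and has been present for 4 days. Symptom severity "
        "is moderate. Will return next week.")

for sent in naive_sentence_split(note):
    print(sent)

# The same rule wrongly splits after an abbreviation:
print(naive_sentence_split("Dr. Smith saw the patient. He was fine."))
# ['Dr.', 'Smith saw the patient.', 'He was fine.']
```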


### Named Entity Recognition

In [12]:
for ent in doc.ents:
    print(ent.text, ent.label_)

4 days DATE
next week DATE


In [13]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

### Dependency parsing

In [14]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

> There's a lot more to natural language processing, of course! Have a look at [spaCy 101: Everything you need to know](https://spacy.io/usage/spacy-101) for some examples.

In general, data preparation and feature engineering are a huge and difficult undertaking when using machine learning to analyse text.

However, in what's called _deep learning_ (discussed in detail tomorrow) most of this work is done by the computer! That's because deep learning does feature extraction _and_ prediction in the same model. 

This results in much less work and, often, _in much better models_!

![MLvsDL](https://aylien.com/images/uploads/general/tumblr_inline_oabas5sThb1sleek4_540.png)

# Deep learning language model

We now come to a relatively new and very powerful idea for deep learning and NLP, one that created a small revolution in the field a couple of years ago ([1](https://blog.openai.com/language-unsupervised/), [2](http://ruder.io/nlp-imagenet/)).

We want to create a system that can classify text into one or more categories. This is a difficult problem as the computer must somehow implicitly learn to "read". 

Idea: why not _first_ teach the computer to "read" and _then_ let it loose on the classification task?

We can teach the computer to "understand" language by training it to predict the next word of a sentence, using as much training data as we can get hold of. This is called ***language modelling*** in NLP.

This is a difficult task: to guess the next word of a sentence one has to know a lot about language, and also a lot about the world.

> What word fits here? _"The light turned green and Per crossed the ___"_

Luckily, obtaining large amounts of training data for language models is simple: any text can be used. The labels are simply the next word of a subpart of the text. 
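A minimal sketch of how the labels come for free: the target sequence is the input sequence shifted by one token. This is the same relationship as between the `text` and `text_` columns shown by `show_batch`. The function name `next_word_pairs` is an assumption for illustration.

```python
def next_word_pairs(tokens):
    """Return (context, target) pairs: each target is the token that follows its context."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["xxbos", "i", "caught", "a", "cold"]

# Input and label sequences for a language model: the labels are the inputs shifted by one
inputs, targets = tokens[:-1], tokens[1:]
print(inputs)    # ['xxbos', 'i', 'caught', 'a']
print(targets)   # ['i', 'caught', 'a', 'cold']

for context, target in next_word_pairs(tokens):
    print(context, "->", target)
```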

We can for example use Wikipedia. After the model performs alright at predicting the next word of Wikipedia text, we can fine-tune it on text that's closer to the classification task we're after. 

> This is often called ***transfer learning***.

We can use the tweet text to fine-tune a model that's already been pretrained on Wikipedia:

In [15]:
data_lm = TextDataLoaders.from_df(df, text_col='Tweet', is_lm=True, valid_pct=0.1)

data_lm.show_batch(max_n=3)


Unnamed: 0,text,text_
0,"xxbos xxmaj you were all , xxunk , this intense xxunk pain and exhaustion are not normal , so i went with you to the doctor and it turned out to be an ordinary cold . xxmaj xxunk that xxunk look on your face is gon na make me laugh . xxbos xxmaj when i have a fever , i can sleep xxunk . xxbos i did my job as a miko","xxmaj you were all , xxunk , this intense xxunk pain and exhaustion are not normal , so i went with you to the doctor and it turned out to be an ordinary cold . xxmaj xxunk that xxunk look on your face is gon na make me laugh . xxbos xxmaj when i have a fever , i can sleep xxunk . xxbos i did my job as a miko and"
1,"i might stay home today , since my nose is way too runny . xxbos i was suffering from diarrhea the whole time i was traveling in xxmaj nepal for a soccer match . xxbos xxmaj at last , there 's blood xxunk with my phlegm … xxbos xxmaj this medicine is good for allergies . xxbos xxmaj it seems like i can stop myself from coughing by xxunk hot water with","might stay home today , since my nose is way too runny . xxbos i was suffering from diarrhea the whole time i was traveling in xxmaj nepal for a soccer match . xxbos xxmaj at last , there 's blood xxunk with my phlegm … xxbos xxmaj this medicine is good for allergies . xxbos xxmaj it seems like i can stop myself from coughing by xxunk hot water with xxunk"
2,". xxmaj the xxunk xxunk is so xxunk . xxbos xxmaj i 'm so happy ! xxmaj i 'm gon na xxunk snot out of my nose for xxunk that my xxunk showed me . xxbos i do n't have a fever anymore , but my nose is still runny . i guess i should get rest at home . xxbos i was playing with a dog and now i have a","xxmaj the xxunk xxunk is so xxunk . xxbos xxmaj i 'm so happy ! xxmaj i 'm gon na xxunk snot out of my nose for xxunk that my xxunk showed me . xxbos i do n't have a fever anymore , but my nose is still runny . i guess i should get rest at home . xxbos i was playing with a dog and now i have a runny"


In [16]:
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=True, 
                               metrics=[accuracy, Perplexity()], wd=0.1).to_fp16()

Let's start training:

In [17]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.860197,4.271955,0.259795,71.66156,00:02


In [18]:
learn.unfreeze()
learn.fit_one_cycle(10, 1e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.159355,3.639123,0.288946,38.05846,00:01
1,3.803387,3.193312,0.351213,24.369015,00:01
2,3.55328,2.964089,0.370336,19.377037,00:01
3,3.347135,2.782259,0.391558,16.155468,00:01
4,3.174557,2.673444,0.419776,14.489787,00:01
5,3.039278,2.635424,0.421642,13.949227,00:01
6,2.925674,2.596819,0.428172,13.420974,00:01
7,2.831434,2.578371,0.430504,13.175652,00:01
8,2.76264,2.574084,0.432836,13.119292,00:01
9,2.696316,2.573173,0.433069,13.107352,00:01
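The `perplexity` column is the exponential of the validation loss (the average cross-entropy per token), so lower is better. A perplexity of about 13 can be read, loosely, as the model being as uncertain as if it had to choose uniformly among 13 words. The short check below reproduces two values from the tables above; the helper name is our own.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is exp(average cross-entropy loss per token)."""
    return math.exp(cross_entropy_loss)

# Validation losses copied from the training tables above
for valid_loss in (4.271955, 2.573173):
    print(f"valid_loss={valid_loss:.6f} -> perplexity={perplexity(valid_loss):.2f}")
```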


...and save the parts of the model that we can reuse for classification later:

In [19]:
learn.save_encoder('medweb_finetuned')

## Test the language model

We can test the language model by having it predict a given number of words that follow a starting text:

In [20]:
def make_text(seed_text, nb_words):
    """
    Use the trained language model to produce text. 
    Input:
        seed_text: some text to get the model started
        nb_words: number of words to produce
    """
    pred = learn.predict(seed_text, nb_words, temperature=0.75)
    pp(pred)

In [21]:
make_text("I'm not feeling too good as my", 10)

"I 'm not feeling too good as my wife , i have diarrhea . My wife has"


In [22]:
make_text("No, that's a", 40)

("No , that 's a cold or a cold . It seems like a Spanish flu is going around "
 "! It 's been a while since i had a headache . i think it 's been a while "
 'since i had')
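The `temperature` argument used in `make_text` controls how adventurous the sampling is. One common way to think about it is sketched below: the model's scores are divided by the temperature before being turned into probabilities. This is an illustration with made-up scores, not fastai's exact code. Values below 1 make the most likely word even more likely; values above 1 flatten the distribution.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate next words

for t in (0.75, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```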


Now we have something that seems to produce text that resembles the text to be classified. 

> **Note:** It's interesting to see that the model can come up with text that makes some sense (mostly thanks to training on Wikipedia), and that the text resembles the medical tweets (thanks to the fine-tuning). 

> **Note** also that an accuracy of roughly 40% when predicting the next word of a sentence is pretty impressive, as the number of possibilities is very large (equal to the size of the vocabulary).

> **Also note** that this is not the task we care about: it's a pretext task before the tweet classification. 

# Classifier

In [23]:
medweb = DataBlock(blocks=(TextBlock.from_df(text_cols='Tweet', seq_len=12, vocab=data_lm.vocab), MultiCategoryBlock), 
                  get_x = ColReader(cols='text'), 
                  get_y = ColReader(cols='labels', label_delim=";"),
                  splitter = ColSplitter(col='is_test'))

data = medweb.dataloaders(df, bs=8)


Now our task is to predict the possible classes the tweets can be assigned to:

In [24]:
data.show_batch()

Unnamed: 0,text,None
0,"xxbos i heard that in xxmaj china , someone died of a new type of flu xxunk from the bird flu . xxmaj that got me wondering if there is any xxunk like some xxunk to check to see if i have the flu or not . xxmaj they talk about the flu a lot on xxup tv , but i have no idea how they xxunk the outbreak .",sober
1,"xxbos i saw on my phone that they 're xxunk a part - time miko at a shrine in xxmaj tokyo . xxmaj there are fewer than there were last year , though . i wo n't be able to xxunk if i have a fever , so i have to take care of myself .",sober
2,"xxbos xxmaj you and i both have colds . i do n't have any fever medicine . xxmaj there 's a meeting at the xxunk tomorrow . i want to go to the doctor , but even with all the scientific xxunk , there 's no medicine that will cure you in an xxunk .",Cold;Fever
3,"xxbos xxmaj my wife said she thought she was coming down with a cold . xxmaj so xxmaj she was worried and went to the doctor . xxmaj it turns out it 's allergies , so my wife and i are having their first allergies at the same time .",Cold;Hayfever;Runnynose
4,"xxbos xxmaj tokyo is cool these days - in xxunk , xxmaj i 'd say it 's cold . xxmaj it 's weird , it 's the xxunk . xxmaj people i know are catching colds and getting fevers , maybe because of this strange weather .",sober
5,"xxbos xxmaj so sleepy , so sleepy , so sleepy , so sleepy , so sleepy , so sleepy , so sleepy , so sleepy , xxmaj i 'm xxunk calling in sick , saying i have a cold .",sober
6,"xxbos xxmaj my big brother 's been staying home from his club xxunk because he has the flu , but even though last night he said his back hurt , he 's all better today and seems super xxunk .",sober
7,"xxbos xxmaj my little brother said his nose was running like crazy and his stomach hurt , so i thought he had a cold , but now he has a xxunk . xxmaj it 's a mystery .",Cold;Diarrhea;Fever;Runnynose
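Since a tweet can have several labels at once, the `labels` column stores them joined by `;`, and `label_delim=";"` tells fastai to split them. Internally, each tweet's labels become a vector with one 0/1 entry per class. A minimal sketch of that encoding follows; the helper name and the alphabetical class order are assumptions for illustration.

```python
# Assumed class order: the eight symptoms plus "sober", sorted alphabetically
CLASSES = ["Cold", "Cough", "Diarrhea", "Fever", "Hayfever",
           "Headache", "Influenza", "Runnynose", "sober"]

def multi_hot(label_string, classes=CLASSES, delim=";"):
    """Encode a delimited label string as a list of 0/1 values, one per class."""
    present = set(label_string.split(delim))
    return [1 if c in present else 0 for c in classes]

print(multi_hot("Hayfever;Runnynose"))  # [0, 0, 0, 0, 1, 0, 0, 1, 0]
print(multi_hot("sober"))               # [0, 0, 0, 0, 0, 0, 0, 0, 1]
```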


In [25]:
learn_clf = text_classifier_learner(data, AWD_LSTM, seq_len=16, pretrained=True, 
                                    drop_mult=0.5, metrics=accuracy_multi).to_fp16()

In [26]:
learn_clf = learn_clf.load_encoder('medweb_finetuned')

In [27]:
learn_clf.fine_tune(10, base_lr=1e-2)

epoch,train_loss,valid_loss,accuracy_multi,time
0,0.363378,0.297283,0.886632,00:04


epoch,train_loss,valid_loss,accuracy_multi,time
0,0.26433,0.200015,0.924479,00:07
1,0.214057,0.155513,0.941666,00:06
2,0.190555,0.142081,0.945139,00:07
3,0.165468,0.127787,0.951909,00:07
4,0.14644,0.131911,0.955208,00:07
5,0.113103,0.125289,0.956771,00:06
6,0.092998,0.123545,0.959028,00:06
7,0.074361,0.140137,0.954167,00:07
8,0.06647,0.156207,0.95625,00:07
9,0.049043,0.136072,0.958333,00:07
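A word of caution when reading `accuracy_multi`: it counts every yes/no decision for every class, so a tweet with nine classes contributes nine decisions. Because most tweets have only a few labels, even a model that always answers "no" scores well. The sketch below assumes the predictions are already probabilities (fastai applies a sigmoid to the raw outputs by default) and uses made-up toy data.

```python
def accuracy_multi_sketch(probs, targets, thresh=0.5):
    """Fraction of per-class decisions (prob > thresh) that match the 0/1 targets."""
    correct = total = 0
    for p_row, t_row in zip(probs, targets):
        for p, t in zip(p_row, t_row):
            correct += int((p > thresh) == bool(t))
            total += 1
    return correct / total

# Toy targets: three tweets, four classes, mostly zeros
targets = [[1, 0, 0, 0],
           [0, 0, 1, 0],
           [0, 0, 0, 0]]

always_no = [[0.0] * 4 for _ in targets]   # a "model" that never predicts any label
print(accuracy_multi_sketch(always_no, targets))  # 10 of 12 decisions are correct
```

So a high `accuracy_multi` should always be compared with such a trivial baseline before drawing conclusions.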


## Is it a good classifier?

We can test it out on some example text:

In [28]:
learn_clf.predict("I'm feeling really bad. My head hurts. My nose is runny. I've felt like this for days.")

((#2) ['Headache','Runnynose'],
 tensor([False, False, False, False, False,  True, False,  True, False]),
 tensor([7.4231e-02, 4.2316e-03, 5.5062e-04, 7.0390e-03, 4.0386e-03, 9.4777e-01,
         4.3659e-05, 9.4911e-01, 9.5965e-03]))
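The three parts of this output are the decoded labels, one True/False decision per class, and one probability per class. The labels are simply the classes whose probability exceeds a threshold. The sketch below reproduces that step with the probabilities printed above; the threshold of 0.5 and the alphabetical class order are assumptions that happen to be consistent with this output.

```python
# Assumed class order: the eight symptoms plus "sober", sorted alphabetically
CLASSES = ["Cold", "Cough", "Diarrhea", "Fever", "Hayfever",
           "Headache", "Influenza", "Runnynose", "sober"]

# Probabilities copied from the prediction above
probs = [7.4231e-02, 4.2316e-03, 5.5062e-04, 7.0390e-03, 4.0386e-03,
         9.4777e-01, 4.3659e-05, 9.4911e-01, 9.5965e-03]

def decode(probs, classes=CLASSES, thresh=0.5):
    """Return the classes whose probability exceeds the threshold."""
    return [c for c, p in zip(classes, probs) if p > thresh]

print(decode(probs))  # ['Headache', 'Runnynose']
```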

It seems to produce reasonable results. _But remember that this is a very small data set._ One cannot expect great things when asking the model to make predictions on text that differs from the small data set it was trained on. This illustrates the need for "big data" in deep learning.

### How does it compare to other approaches?

From the [original article](https://www.jmir.org/2019/2/e12783/) that presented the data set:

<img src="assets/medweb_results.png">

# End notes

* This of course only scratches the surface of NLP and deep learning applied to NLP. The goal was to "lift the curtain" and show some of the ideas behind modern text analysis software.
* If you're interested in digging into deep learning for NLP you should check out `fastai` (used above) and also `Hugging Face`: https://huggingface.co. 