This notebook is a slightly modified version of `lesson3-imdb.ipynb` from <a href="http://fast.ai">fastai</a>.

# Deep learning for text analysis: a small case study

To demystify deep learning and artificial intelligence a little, we walk through a short example of **state-of-the-art** deep learning for natural language processing. You will meet quite a few new concepts and also some unexplained code. Even though much of it is unfamiliar, you will hopefully follow the main thread of the story and so see that this is not magic. You will also see that, with modern tools, producing world-class results is not that hard...

The goal of the machine learning system we are going to build is to predict, as accurately as possible, whether movie reviews written by IMDB users are positive or not.

> «I can't believe they got the actors and actresses of that caliber to do this movie. That's all I've got to say - the movie speaks for itself!!»

**With a little imagination it is clear that if this can be done, a lot can also be achieved in the analysis of medical or other professional texts.**

> We will take a cutting-edge approach, using techniques at the very research frontier of deep learning. Techniques that had not been invented just a few months ago...

# Setup

We are now inside a **[Jupyter Notebook](https://jupyter.org)**: a useful tool for combining text and code in the same document (a framework that, incidentally, has been predicted to become [the future format for scientific publications](https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676)). All code is written in [Python](https://python.org), the most popular programming language in machine learning.

We use the Python library [fastai](https://docs.fast.ai), built on top of [PyTorch](https://pytorch.org), to set up and train our deep learning models.

In [1]:
from fastai.text import *
from pprint import pprint as pp

In [2]:
torch.cuda.set_device(0)

# Load the data

We use a dataset of 100,000 movie reviews from IMDB, of which 50,000 are labelled as positive or negative.

In [3]:
df = pd.read_csv('data/eksempler.csv')

Here are the first five reviews in our dataset:

In [4]:
df.head()

Unnamed: 0,label,text
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...
1,positive,This is a extremely well-made film. The acting...
2,negative,Every once in a long while a movie will come a...
3,positive,Name just says it all. I watched this movie wi...
4,negative,This movie succeeds at being one of the most u...


We print one of the reviews:

In [5]:
pp(df['text'][1])

('This is a extremely well-made film. The acting, script and camera-work are '
 'all first-rate. The music is good, too, though it is mostly early in the '
 'film, when things are still relatively cheery. There are no really '
 'superstars in the cast, though several faces will be familiar. The entire '
 'cast does an excellent job with the script.<br /><br />But it is hard to '
 'watch, because there is no good end to a situation like the one presented. '
 'It is now fashionable to blame the British for setting Hindus and Muslims '
 'against each other, and then cruelly separating them into two countries. '
 "There is some merit in this view, but it's also true that no one forced "
 'Hindus and Muslims in the region to mistreat each other as they did around '
 'the time of partition. It seems more likely that the British simply saw the '
 'tensions between the religions and were clever enough to exploit them to '
 'their own ends.<br /><br />The result is that there is much cruelty an

From this text, the machine should be able to decide whether this is a positive or a negative review.

> **But**: How can a machine *read*??

<img src="http://2.bp.blogspot.com/_--uVHetkUIQ/TDae5jGna8I/AAAAAAAAAK0/sBSpLudWmcw/s1600/reading.gif">

# Prepare the data

To a computer, everything is numbers. We have to convert the text into a sequence of numbers and then feed those to the machine.

In natural language processing this is done in two (important and widely used!) steps: **tokenization** and **numericalization**.

## Tokenization

In **tokenization** the text is split into individual words, called **tokens**. A simple way to do this is to split on whitespace. But then we lose, among other things, punctuation as separate tokens, and the fact that some words are contractions of several words (isn't and don't, for example). We use the [spaCy](https://spacy.io/usage/spacy-101) tokenizer through fastai.

### A bit more about tokenization (if you are curious)

<img src="https://spacy.io/assets/img/tokenization.svg">
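To make the difference between splitting on whitespace and "real" tokenization concrete, here is a minimal sketch in plain Python. The regular expression and the helper `simple_tokenize` are hypothetical and written only for illustration; spaCy's tokenizer uses a much richer set of rules and exceptions.

```python
import re

def whitespace_tokenize(text):
    """Naive tokenization: split on whitespace only."""
    return text.split()

# Hypothetical rule-based pattern, for illustration only (not spaCy's algorithm).
# The alternatives are tried from top to bottom at each position.
TOKEN_PATTERN = re.compile(
    r"n't"                # negation contraction, as in is|n't
    r"|'\w+"              # other clitics such as 's, 're, 've
    r"|\w+?(?=n't\b)"     # the word stem in front of n't
    r"|\w+"               # ordinary words and numbers
    r"|[^\w\s]"           # any single punctuation character
)

def simple_tokenize(text):
    """Split text into words, contraction parts and punctuation marks."""
    return TOKEN_PATTERN.findall(text)

text = "It isn't bad, and I don't mind!"
print(whitespace_tokenize(text))
print(simple_tokenize(text))
```

With whitespace splitting, `bad,` and `mind!` keep their punctuation attached and `isn't` stays a single token. The second function separates them in the same spirit as the `is n't` and `do n't` tokens that show up in the fastai output.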

In [6]:
data = TextDataBunch.from_csv('data', 'sample_texts.csv')

Here is the result after tokenization:

In [7]:
data.show_batch()

text,target
"xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj",negative
"xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of "" xxmaj at xxmaj the xxmaj movies "" in taking xxmaj steven xxmaj soderbergh to task . \n\n xxmaj it 's usually satisfying to watch a film director change his style / subject ,",negative
"xxbos \n\n i 'm sure things did n't exactly go the same way in the real life of xxmaj homer xxmaj hickam as they did in the film adaptation of his book , xxmaj rocket xxmaj boys , but the movie "" xxmaj october xxmaj sky "" ( an xxunk of the book 's title ) is good enough to stand alone . i have not read xxmaj hickam 's",positive
"xxbos xxmaj to review this movie , i without any doubt would have to quote that memorable scene in xxmaj tarantino 's "" xxmaj pulp xxmaj fiction "" ( xxunk ) when xxmaj jules and xxmaj vincent are talking about xxmaj xxunk xxmaj wallace and what she does for a living . xxmaj jules tells xxmaj vincent that the "" xxmaj only thing she did worthwhile was pilot "" .",negative
"xxbos xxmaj how viewers react to this new "" adaption "" of xxmaj shirley xxmaj jackson 's book , which was promoted as xxup not being a remake of the original 1963 movie ( true enough ) , will be based , i suspect , on the following : those who were big fans of either the book or original movie are not going to think much of this one",negative


Tokens that start with "xx" are special. `xxbos` means "beginning of text", `xxmaj` means that the next word started with a capital letter in the original text, `xxup` means that the next word was written entirely in capital letters, and so on.
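The idea behind these markers can be sketched in a few lines. The helper `add_case_markers` below is hypothetical: it only mimics the markers seen in the output above and is not fastai's actual implementation.

```python
def add_case_markers(tokens):
    """Lowercase all tokens and record the original casing with marker tokens."""
    out = ["xxbos"]  # marks the beginning of a text
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():
            out += ["xxup", tok.lower()]    # the whole word was in capitals
        elif len(tok) > 1 and tok[0].isupper():
            out += ["xxmaj", tok.lower()]   # the word started with a capital letter
        else:
            out.append(tok.lower())         # single characters are just lowercased
    return out

print(add_case_markers(["This", "movie", "is", "NOT", "a", "remake", "."]))
# ['xxbos', 'xxmaj', 'this', 'movie', 'is', 'xxup', 'not', 'a', 'remake', '.']
```

The benefit of this design is that `This`, `this` and `THIS` all map to the same token `this`, so the vocabulary stays small, while the information about capitalization is kept in a separate token.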

## Numericalization

We convert tokens to numbers by building a list of all tokens in use and assigning an integer to each token that appears at least twice, keeping at most the 60,000 most frequent tokens (this list is called the *vocabulary*). Every other token is replaced by "unknown" (UNK, shown as `xxunk` in the output).

This text:

In [8]:
data.train_ds[0][0]

Text xxbos xxmaj this film struck me as a project that had a lot of the right xxunk , but somewhere along the way they did n't quite come together . i do n't know who made it , but it has a slightly xxmaj disney - xxunk feel . xxmaj while parts of it are improbable ( like when a pre - teen runs for a public office ) and tend to prevent the story from being taken seriously , there is a healthy dose of xxunk ( whatever that is ) to keep things xxunk and in perspective . xxmaj the acting is alright . xxmaj strangely , the relationship between xxmaj frankie and her grandmother is convincing , but the relationship between xxmaj hazel and xxmaj frankie is a bit ... off . xxmaj it 's interesting to see how she has to work hard to keep a balance between her best friend , her grandmother , and her two xxunk : ballet and baseball . xxmaj being a baseball player myself , it was quite painful to watch xxmaj frankie try to hold her own on a team of boys , but it does a good job of showing the strug

is replaced by this list of numbers:

In [9]:
print(data.train_ds[0][0].data.tolist())

[2, 4, 19, 31, 4640, 88, 27, 12, 868, 20, 81, 12, 229, 13, 8, 240, 0, 10, 30, 1159, 446, 8, 117, 43, 87, 34, 171, 245, 278, 9, 18, 61, 34, 141, 51, 113, 16, 10, 30, 16, 63, 12, 1016, 4, 1465, 23, 0, 248, 9, 4, 152, 439, 13, 16, 37, 2745, 36, 49, 71, 12, 2429, 23, 1837, 1017, 28, 12, 910, 938, 33, 11, 1377, 14, 2746, 8, 83, 48, 139, 672, 581, 10, 54, 15, 12, 3168, 3169, 13, 0, 36, 774, 20, 15, 33, 14, 406, 209, 0, 11, 17, 1378, 9, 4, 8, 122, 15, 2430, 9, 4, 4641, 10, 8, 628, 175, 4, 4642, 11, 60, 1838, 15, 708, 10, 30, 8, 628, 175, 4, 4643, 11, 4, 4642, 15, 12, 216, 108, 154, 9, 4, 16, 22, 264, 14, 86, 114, 77, 63, 14, 194, 246, 14, 406, 12, 3170, 175, 60, 149, 474, 10, 60, 1838, 10, 11, 60, 115, 0, 89, 1839, 11, 2166, 9, 4, 139, 12, 2166, 2167, 566, 10, 16, 24, 171, 1295, 14, 127, 4, 4642, 383, 14, 939, 60, 218, 35, 12, 838, 13, 693, 10, 30, 16, 95, 12, 66, 314, 13, 1105, 8, 1379, 77, 1569, 9, 18, 273, 1159, 20, 77, 15, 34, 80, 0, 10, 30, 8, 775, 17, 19, 31, 87, 12, 69, 66, 314, 13, 24

> **We are now in a position where the machine can do computations on the text!**
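The numericalization step described above can be sketched with the standard library. The function names and the parameters `max_vocab` and `min_freq` are assumptions chosen for this illustration; the sketch puts the unknown token at index 0, which matches the zeros that appear where `xxunk` occurs in the output above.

```python
from collections import Counter

UNK = "xxunk"  # the unknown token

def build_vocab(token_lists, max_vocab=60000, min_freq=2):
    """Return the vocabulary as a list (index -> token); index 0 is the unknown token."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = [UNK]
    # Keep only the most frequent tokens, and only those seen at least min_freq times.
    for tok, count in counts.most_common(max_vocab):
        if count >= min_freq and tok != UNK:
            itos.append(tok)
    return itos

def numericalize(tokens, itos):
    """Replace each token by its index in the vocabulary (0 if it is not in it)."""
    stoi = {tok: i for i, tok in enumerate(itos)}
    return [stoi.get(tok, 0) for tok in tokens]

texts = [
    ["the", "movie", "was", "good"],
    ["the", "movie", "was", "bad"],
]
itos = build_vocab(texts)
print(itos)
print(numericalize(["the", "movie", "was", "terrible"], itos))
```

Here `good` and `bad` occur only once each, so they fall outside the vocabulary, and the unseen word `terrible` is mapped to index 0 just like any other unknown token.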

## More
Take a look at [spaCy 101: Everything you need to know](https://spacy.io/usage/spacy-101) for more information.

### Extra: Examples of other standard text processing tasks

spaCy can be used for a range of standard language processing tasks. Here are some examples:

In [10]:
import spacy

In [11]:
#!/home/ubuntu/anaconda3/envs/fastai/bin/python -m spacy download en

In [12]:
nlp = spacy.load('en')

#### Splitting text into sentences (Sentence Boundary Detection)

In [13]:
sentence = "Patient presents for initial evaluation of cough. Cough is reported to have developed acutely and has been present for 4 days. Symptom severity is moderate. Will return next week."
doc = nlp(sentence)
 
for sent in doc.sents:
    print(sent)


Patient presents for initial evaluation of cough.
Cough is reported to have developed acutely and has been present for 4 days.
Symptom severity is moderate.
Will return next week.


#### Named Entity Recognition

In [14]:
for ent in doc.ents:
    print(ent.text, ent.label_)

4 days DATE
next week DATE


In [15]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

#### Dependency Parsing

In [16]:
from spacy import displacy

displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

# Language model

Now we come to the new and extremely powerful idea. An idea that last year sparked a small revolution in NLP ([1](https://blog.openai.com/language-unsupervised/), [2](http://ruder.io/nlp-imagenet/)).

We want to build a system that can classify text into two categories: positive and negative. This is a very hard problem, since the machine has to learn to "read" along the way.

Idea: Why not teach the machine to read first, before we let it loose on our task?

We can teach the machine to "understand" language by training it to guess the next word in a sentence, using as much text as we can get hold of (this is called "language modelling" in NLP). This is a very hard task: to guess the next word you need to know a lot about the language, and you also need to understand a lot about the world.

Which word fits best here?

> "The light turned green and Per crossed ___________"

We can train the machine to perform this task by letting it loose on Wikipedia. Once it predicts reasonably well there, we can fine-tune it to predict words in our IMDB dataset. After that we can use what the machine has learned to classify our movie reviews.

> This is often called **transfer learning**.
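To make "guessing the next word" concrete, here is a toy language model that simply counts which word most often follows another. This is a deliberately simple stand-in (a bigram count model), not the neural network used in this notebook, and the three-sentence corpus is made up for the example.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count, for each word, which words follow it in the training text."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = [
    "per crossed the street",
    "anna crossed the street",
    "per crossed the bridge",
]
model = train_bigram_model(corpus)
print(predict_next(model, "crossed"))  # the
print(predict_next(model, "the"))      # street
```

A model like this only looks one word back and knows nothing about traffic lights. A deep learning language model is trained on the same kind of task, but uses the whole preceding context, which is why it has to pick up so much about both language and the world.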

In [17]:
bs=48
data_lm = TextLMDataBunch.load('.', f'tmp_lm', bs=bs)

In [18]:
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103_1, drop_mult=0.3)

Now we can start the training:

> It takes about **6 hours** to train this model to an OK accuracy. That is too long for us to wait, so we load an already trained model below.

In [19]:
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))  # First step of the training, shown for reference

epoch,train_loss,valid_loss,accuracy
1,4.207361,4.057117,0.293408


In [20]:
learn.load(f'fine_tuned_bs{bs}');

## Let's test the model

In [21]:
TEXT = "i liked this movie because"

In [22]:
print(learn.predict(TEXT, 40, temperature=0.75))

i liked this movie because it was definitely not for kids . xxmaj it 's more of a family film , if it has a somewhat the lead , though it is very hard to get a good laughs out of it . xxmaj it


We can be satisfied with this and move on to building our classifier.
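The text above was generated with `temperature=0.75`. In general, a temperature rescales the model's probabilities for the next word before one word is drawn at random: values below 1 favour the most likely words, values above 1 make the choice more random. The sketch below shows one common way to do this; the function names are hypothetical, and whether fastai does it in exactly this way is an assumption.

```python
import random

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: p_i ** (1 / T), then renormalize."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_index(probs, temperature=1.0, rng=random):
    """Draw one index at random according to the rescaled probabilities."""
    weights = apply_temperature(probs, temperature)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

probs = [0.6, 0.3, 0.1]               # made-up probabilities for three candidate words
print(apply_temperature(probs, 1.0))   # unchanged
print(apply_temperature(probs, 0.75))  # sharper: the top candidate gains probability
print(apply_temperature(probs, 2.0))   # flatter: closer to a uniform choice
print(sample_index(probs, temperature=0.75))
```

A temperature slightly below 1 is a compromise: the text stays reasonably coherent, but the model does not pick the single most likely word every time.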

# Classification

We load the dataset we want to build a classifier for:

In [23]:
data_clas = TextClasDataBunch.load('.', 'tmp_clas', bs=bs)

Here are some reviews labelled with whether they are positive or not:

In [24]:
data_clas.show_batch()

text,target
xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules,pos
"xxbos xxmaj this movie was recently released on xxup dvd in the xxup us and i finally got the chance to see this hard - to - find gem . xxmaj it even came with original theatrical previews of other xxmaj italian horror classics like "" xxup xxunk "" and "" xxup beyond xxup the xxup darkness "" . xxmaj unfortunately , the previews were the best thing about this",neg
"xxbos i 've rented and watched this movie for the 1st time on xxup dvd without reading any reviews about it . xxmaj so , after 15 minutes of watching i 've noticed that something is wrong with this movie ; it 's xxup terrible ! i mean , in the trailers it looked scary and serious ! \n\n i think that xxmaj eli xxmaj roth ( xxmaj mr. xxmaj",neg
"xxbos xxmaj it is not as great a film as many people believe ( including my late aunt , who said it was her favorite movie ) . xxmaj but due to the better sections of this film noir , particularly that justifiably famous "" fun house "" finale , xxup the xxup lady xxup from xxup shanghai has gained a position of importance beyond it 's actual worth as",pos
"xxbos xxmaj within the realm of xxmaj science xxmaj fiction , two particular themes consistently elicit interest , were initially explored in the literature of a pre - cinematic era , and have since been periodically revisited by filmmakers and writers alike , with varying degrees of success . xxmaj the first theme , that of time travel , has held an unwavering fascination for fans of film , as",neg


We then load the model trained on Wikipedia that we fine-tuned on the IMDB reviews earlier:

In [25]:
learn = text_classifier_learner(data_clas, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')

...and start the training:

In [26]:
# First step of the training, for illustration:
learn.freeze()
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy
1,0.325724,0.213311,0.919160


After about 25 minutes of training we have a model that is about **94.5% accurate**.

Here are some predictions:

In [27]:
learn.load('third');
learn.show_results(data_clas.valid_ds, rows=10)

text,target,prediction
xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules,pos,pos
"xxbos * * xxmaj attention xxmaj spoilers * * \n\n xxmaj first of all , let me say that xxmaj rob xxmaj roy is one of the best films of the 90 's . xxmaj it was an amazing achievement for all those involved , especially the acting of xxmaj liam xxmaj neeson , xxmaj jessica xxmaj lange , xxmaj john xxmaj hurt , xxmaj brian xxmaj cox , and",pos,pos
"xxbos * ! ! - xxup spoilers - ! ! * \n\n xxmaj before i begin this , let me say that i have had both the advantages of seeing this movie on the big screen and of having seen the "" xxmaj authorized xxmaj version "" of this movie , remade by xxmaj stephen xxmaj king , himself , in 1997 . \n\n xxmaj both advantages made me appreciate",pos,pos
"xxbos xxmaj by now you 've probably heard a bit about the new xxmaj disney dub of xxmaj miyazaki 's classic film , xxmaj laputa : xxmaj castle xxmaj in xxmaj the xxmaj sky . xxmaj during late summer of 1998 , xxmaj disney released "" xxmaj kiki 's xxmaj delivery xxmaj service "" on video which included a preview of the xxmaj laputa dub saying it was due out",pos,pos
"xxbos xxmaj titanic directed by xxmaj james xxmaj cameron presents a fictional love story on the historical setting of the xxmaj titanic . xxmaj the plot is simple , xxunk , or not for those who love plots that twist and turn and keep you in suspense . xxmaj the end of the movie can be figured out within minutes of the start of the film , but the love",pos,pos
"xxbos xxmaj some have praised _ xxunk _ as a xxmaj disney adventure for adults . i do n't think so -- at least not for thinking adults . \n\n xxmaj this script suggests a beginning as a live - action movie , that struck someone as the type of crap you can not sell to adults anymore . xxmaj the "" crack staff "" of many older adventure movies",neg,neg
"xxbos xxmaj some have praised xxunk :- xxmaj the xxmaj lost xxmaj xxunk as a xxmaj disney adventure for adults . i do n't think so -- at least not for thinking adults . \n\n xxmaj this script suggests a beginning as a live - action movie , that struck someone as the type of crap you can not sell to adults anymore . xxmaj the "" crack staff """,neg,neg
"xxbos xxmaj warning : xxmaj does contain spoilers . \n\n xxmaj open xxmaj your xxmaj eyes \n\n xxmaj if you have not seen this film and plan on doing so , just stop reading here and take my word for it . xxmaj you have to see this film . i have seen it four times so far and i still have n't made up my mind as to what",pos,pos
"xxbos * * * xxmaj warning - this review contains "" plot spoilers , "" though nothing could "" spoil "" this movie any more than it already is . xxmaj it really xxup is that bad . * * * \n\n xxmaj before i begin , i 'd like to let everyone know that this definitely is one of those so - incredibly - bad - that - you",neg,neg
"xxbos xxmaj this movie was recently released on xxup dvd in the xxup us and i finally got the chance to see this hard - to - find gem . xxmaj it even came with original theatrical previews of other xxmaj italian horror classics like "" xxup xxunk "" and "" xxup beyond xxup the xxup darkness "" . xxmaj unfortunately , the previews were the best thing about this",neg,neg


We can write our own reviews and check what the model thinks:

In [28]:
learn.predict("I loved that movie!")

(Category pos, tensor(1), tensor([0.0334, 0.9666]))

In [29]:
learn.predict("An interesting movie.")

(Category pos, tensor(1), tensor([0.0100, 0.9900]))
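Each prediction above consists of three parts: the predicted category, its index, and one probability per class. The small sketch below shows how the three fit together. The helper `interpret_prediction` is hypothetical, and the class order `("neg", "pos")` is an assumption inferred from the output, where index 1 corresponds to `pos`.

```python
def interpret_prediction(probabilities, classes=("neg", "pos")):
    """Return (predicted class, class index, confidence) for a list of class probabilities."""
    # The predicted class is the one with the highest probability.
    index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return classes[index], index, probabilities[index]

# Probabilities copied from the first prediction above
print(interpret_prediction([0.0334, 0.9666]))  # ('pos', 1, 0.9666)
```

The probabilities sum to 1, so for "I loved that movie!" the model is about 97% sure that the review is positive.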

# Is this a good result?

Here is what was **state-of-the-art** a short while ago:

<img width=30% src="sota-imdb.png">

And with a little more tuning, the approach we used could have reached 95.4%, which is the best result reported on this dataset at the time of writing. See [ref](https://github.com/sebastianruder/NLP-progress/blob/master/english/sentiment_analysis.md).