In [1]:
!pwd

/home/jupyter/fastai-d1/tutorials/fastai/course-v3/nbs/dl1


In [3]:
! cp /home/jupyter/fastai-d1/tutorials/fastai/course-v3/nbs/dl1/lesson3-imdb.ipynb /home/jupyter/fastai-d1/mendes/notebooks/

# IMDB

In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [3]:
from fastai.text import *

## Preparing the data

First let's download the dataset we are going to study. The [dataset](http://ai.stanford.edu/~amaas/data/sentiment/) was curated by Andrew Maas et al. and contains a total of 100,000 reviews from IMDB. 25,000 of them are labelled as positive or negative for training, and another 25,000 are labelled for testing (in both cases they are highly polarized). The remaining 50,000 are additional unlabelled data (but we will find a use for them nonetheless).

We'll begin with a sample we've prepared for you, so that things run quickly before going over the full dataset.

In [4]:
path = untar_data(URLs.IMDB_SAMPLE)
path.ls()

[PosixPath('/home/jupyter/.fastai/data/imdb_sample/texts.csv'),
 PosixPath('/home/jupyter/.fastai/data/imdb_sample/data_save.pkl')]

It only contains one csv file, let's have a look at it.

In [5]:
df = pd.read_csv(path/'texts.csv')
df.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


In [6]:
# df['text'][1]

It contains one line per review, with the label ('negative' or 'positive'), the text and a flag to determine if it should be part of the validation set or the training set. If we ignore this flag, we can create a DataBunch containing this data in one line of code:

In [7]:
# data_lm = TextDataBunch.from_csv(path, 'texts.csv')

By executing this line a process was launched that took a bit of time. Let's dig a bit into it. Images could be fed (almost) directly into a model because they're just a big array of pixel values that are floats between 0 and 1. A text is composed of words, and we can't apply mathematical functions to them directly. We first have to convert them to numbers. This is done in two different steps: tokenization and numericalization. A `TextDataBunch` does all of that behind the scenes for you.

Before we delve into the explanations, let's take the time to save the things that were calculated.

In [8]:
# data_lm.save()

Next time we launch this notebook, we can skip the cell above that took a bit of time (and that will take a lot more when you get to the full dataset) and load those results like this:

In [9]:
# data = load_data(path)

### Tokenization

The first step of processing we make the texts go through is to split the raw sentences into words, or more exactly tokens. The easiest way to do this would be to split the string on spaces, but we can be smarter:

- we need to take care of punctuation
- some words are contractions of two different words, like isn't or don't
- we may need to clean some parts of our texts, if there's HTML code for instance

To see what the tokenizer has done behind the scenes, let's have a look at a few texts in a batch.

In [10]:
# data = TextClasDataBunch.from_csv(path, 'texts.csv')
# data.show_batch()

The texts are truncated at 100 tokens for more readability. We can see that it did more than just split on space and punctuation symbols: 
- the "'s" are grouped together in one token
- the contractions are separated like this: "did", "n't"
- content has been cleaned of any HTML symbols and lowercased
- there are several special tokens (all those that begin with xx), to replace unknown tokens (see below) or to introduce different text fields (here we only have one).
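
The rules above can be approximated with a few regular expressions. The following is a minimal sketch, not fastai's actual tokenizer (which is considerably more thorough). The function name `simple_tokenize` and its regexes are illustrative assumptions; the `xxbos`, `xxmaj` and `xxup` markers imitate the special tokens visible in the batches.

```python
import re

def simple_tokenize(text):
    """Toy tokenizer: cleans HTML breaks, splits contractions and punctuation,
    lowercases everything and marks capitalization with special tokens."""
    text = re.sub(r"<br\s*/?>", " ", text)           # turn HTML line breaks into spaces
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)    # "didn't" -> "did n't"
    text = re.sub(r"(\w)('s)\b", r"\1 \2", text)     # "It's"   -> "It 's"
    raw = re.findall(r"n't|'s|\w+|[^\w\s]", text)    # words, contractions, punctuation
    tokens = ["xxbos"]                               # marks the beginning of a text
    for tok in raw:
        if tok.isupper() and len(tok) > 1:
            tokens += ["xxup", tok.lower()]          # word written in all caps
        elif tok[0].isupper() and len(tok) > 1 and tok[1:].islower():
            tokens += ["xxmaj", tok.lower()]         # capitalized word
        else:
            tokens.append(tok.lower())
    return tokens

print(simple_tokenize("Meg Ryan didn't like it.<br />It's BAD!"))
```

Keeping the capitalization as a separate token lets the vocabulary stay lowercase without losing that information.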

### Numericalization

Once we have extracted tokens from our texts, we convert them to integers by creating a list of all the words used. We only keep the ones that appear at least twice, with a maximum vocabulary size of 60,000 (by default), and replace the ones that don't make the cut with the unknown token `UNK`.

The correspondence from ids to tokens is stored in the `vocab` attribute of our datasets, in a list called `itos` (for int to string).

In [11]:
# data.vocab.itos[:10]

And if we look at what's in our datasets, we'll see the tokenized text as a representation:

In [12]:
# data.train_ds[0]

But the underlying data is all numbers:

In [13]:
# data.train_ds[0][0].data[:10]
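
To make the numericalization step concrete, here is a minimal sketch in plain Python. The helper names `build_vocab` and `numericalize` are hypothetical, and the special tokens and their positions are simplified assumptions; only the thresholds (at least two occurrences, at most 60,000 words) come from the description above.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_vocab=60000, min_freq=2):
    """Build itos (list: id -> token) and stoi (dict: token -> id)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    specials = ["xxunk", "xxpad"]                    # assumed special tokens
    itos = specials + [tok for tok, freq in counts.most_common(max_vocab)
                       if freq >= min_freq and tok not in specials]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Replace each token by its id; unseen or rare tokens map to xxunk."""
    return [stoi.get(tok, stoi["xxunk"]) for tok in tokens]

texts = [["i", "liked", "this", "movie"], ["i", "hated", "this", "movie"]]
itos, stoi = build_vocab(texts)
ids = numericalize(["i", "loved", "this", "movie"], stoi)
print(ids, [itos[i] for i in ids])
```

Here "liked" and "hated" appear only once, so they are dropped from the vocabulary, and the unseen word "loved" is mapped to the unknown token.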

### With the data block API

We can use the data block API with NLP and have a lot more flexibility than what the default factory methods offer. In the previous example for instance, the data was randomly split between train and validation instead of reading the third column of the csv.

With the data block API though, we have to manually call the tokenize and numericalize steps. This allows more flexibility, and if you're not using the defaults from fastai, the various arguments to pass will appear in the step where they're relevant, so it'll be more readable.

In [15]:
df.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


In [16]:
# data = (TextList.from_csv(path, 'texts.csv', cols='text')
#                 .split_from_df(col=2)
#                 .label_from_df(cols=0)
#                 .databunch())
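
Conceptually, splitting from the `is_valid` column and labelling from the `label` column amounts to the following. This is a minimal sketch using only the standard library, with a made-up three-row excerpt standing in for `texts.csv`; `split_by_flag` is a hypothetical helper, not part of fastai.

```python
import csv
import io

# Hypothetical excerpt with the same columns as texts.csv
SAMPLE_CSV = """label,text,is_valid
negative,A dull film.,False
positive,A lovely film.,False
positive,Great acting.,True
"""

def split_by_flag(csv_text, flag_col="is_valid"):
    """Return (train, valid) lists of (text, label) pairs based on a boolean column."""
    train, valid = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = (row["text"], row["label"])
        (valid if row[flag_col] == "True" else train).append(item)
    return train, valid

train, valid = split_by_flag(SAMPLE_CSV)
print(len(train), "training rows,", len(valid), "validation row")
```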

## Language model

Note that language models can use a lot of GPU memory, so you may need to decrease the batch size here.

In [17]:
bs=48

Now let's grab the full dataset for what follows.

In [18]:
path = untar_data(URLs.IMDB)

The reviews are split into a training set and a test set following an ImageNet-style folder structure. The only difference is that there is an `unsup` folder alongside `train` and `test` that contains the unlabelled data.

We're not going to train a model that classifies the reviews from scratch. Like in computer vision, we'll use a model pretrained on a bigger dataset (a cleaned subset of Wikipedia called [wikitext-103](https://einstein.ai/research/blog/the-wikitext-long-term-dependency-language-modeling-dataset)). That model has been trained to guess the next word, its input being all the previous words. It has a recurrent structure and a hidden state that is updated each time it sees a new word. This hidden state thus contains information about the sentence up to that point.

We are going to use that 'knowledge' of the English language to build our classifier, but first, like for computer vision, we need to fine-tune the pretrained model to our particular dataset. Because the English of the reviews left by people on IMDB isn't the same as the English of Wikipedia, we'll need to adjust the parameters of our model by a little bit. Plus there might be some words that would be extremely common in the reviews dataset but would be barely present in Wikipedia, and therefore might not be part of the vocabulary the model was trained on.

This is where the unlabelled data is going to be useful to us, as we can use it to fine-tune our model. Let's create our data object with the data block API (next line takes a few minutes).

In [19]:
# data_lm = (TextList.from_folder(path)
#            #Inputs: all the text files in path
#             .filter_by_folder(include=['train', 'test', 'unsup']) 
#            #We may have other temp folders that contain text files so we only keep what's in train, test and unsup
#             .split_by_rand_pct(0.1)
#            #We randomly split and keep 10% (10,000 reviews) for validation
#             .label_for_lm()           
#            #We want to do a language model so we label accordingly
#             .databunch(bs=bs))
# data_lm.save('data_lm.pkl')

We have to use a special kind of `TextDataBunch` for the language model, that ignores the labels (that's why we put 0 everywhere), will shuffle the texts at each epoch before concatenating them all together (only for training, we don't shuffle for the validation set) and will send batches that read that text in order with targets that are the next word in the sentence.

Since the data block cell above takes a while to run, we saved its result; the following cell quickly reloads the final ids.

In [20]:
data_lm = load_data(path, 'data_lm.pkl', bs=bs)

In [24]:
type(data_lm)

fastai.text.data.TextLMDataBunch

In [21]:
data_lm.show_batch()

idx,text
0,". xxmaj and to make things worse , they have n't got any customers ( as their former boss predicted ) . xxmaj when the man pays them a visit in their shop , he challenges them to provide the meat for a dinner party he organizes . \n \n xxmaj than a tragic accident happens . xxmaj one of the butchers locks the electrician into the freezing chamber"
1,"he has knit his ideas - foibles and all - into a meticulously paced arc . \n \n xxmaj inside this does indeed sit the central performance of xxmaj bogarde 's xxmaj aschenbach . xxmaj rather than a simpering , xxmaj johnny - come - lately gay , he manages to give a pathetic composer beaten by tragedy and misunderstood integrity who sees salvation in xxmaj tadzio . xxmaj"
2,never understand everything from the subtitles i was able to enjoy the film . \n \n xxmaj can you really hate a film where a staff turns into a flock of birds that defecate over the enemy ? xxmaj what does character development matter when faced with a lesbian alien princess whose people built the pyramids ? xxmaj why does xxmaj buddha wear seriously xxunk diamond earrings ? xxmaj
3,"conflict he had was forced to resolve . \n \n i highly recommend this film xxbos xxmaj george xxmaj scott gave the performance of a lifetime in xxmaj paddy xxmaj chayefsky 's xxup the xxup hospital , a very dark drama about an aging big city hospital and a middle - aged physician on the verge of suicide . xxmaj along comes xxmaj diana xxmaj rigg as a free"
4,"who 's xxunk is anything but ordinary . xxmaj before long , xxmaj sharky 's crimefighting xxmaj machine uncovers a conspiracy of the highest order , which threatens to corrupt the inner body of xxmaj atlanta . xxmaj as a resident of xxmaj metro xxmaj atlanta , i recall the excitement in town during the movie 's production . xxmaj sharky 's xxmaj machine goes to great lengths to"
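
The way a language model reads the concatenated text, with targets that are simply the inputs shifted by one token, can be sketched as follows. This is a simplified illustration in plain Python, not fastai's loader: `lm_batches` is a hypothetical helper, and `bptt` (the number of tokens read per step) is a parameter name assumed here.

```python
def lm_batches(token_ids, bs, bptt):
    """Yield (inputs, targets) batches; each target is the token that follows its input."""
    n = (len(token_ids) - 1) // bs                   # tokens per stream, keeping one for the last target
    streams_x = [token_ids[i * n:(i + 1) * n] for i in range(bs)]
    streams_y = [token_ids[i * n + 1:(i + 1) * n + 1] for i in range(bs)]
    for start in range(0, n, bptt):                  # read each stream in order, bptt tokens at a time
        x = [s[start:start + bptt] for s in streams_x]
        y = [s[start:start + bptt] for s in streams_y]
        yield x, y

ids = list(range(13))                                # stand-in for the concatenated, numericalized reviews
batches = list(lm_batches(ids, bs=2, bptt=3))
for x, y in batches:
    print("x:", x, "y:", y)
```

Each row of a batch continues where the same row of the previous batch stopped, which is what lets the hidden state carry over from one batch to the next.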


We can then put this in a learner object very easily with a model loaded with the pretrained weights. They'll be downloaded the first time you execute the following line and stored in `~/.fastai/models/` (or elsewhere if you specified different paths in your config file).

In [23]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

In [None]:
# learn.lr_find()

In [None]:
# learn.recorder.plot(skip_end=15)

In [None]:
# learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

In [None]:
# learn.save('fit_head')

In [None]:
# learn.load('fit_head');

To complete the fine-tuning, we can then unfreeze and launch a new training.

In [None]:
# learn.unfreeze()

In [None]:
# learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7))

In [None]:
# learn.save('fine_tuned')

How good is our model? Let's see what it predicts after a few given words.

In [None]:
learn.load('fine_tuned');

In [None]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2

In [None]:
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
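
The `temperature` argument controls how adventurous the sampling of each next word is. A common formulation, shown here as a minimal sketch in plain Python, divides the model's scores by the temperature before turning them into probabilities. The helper names and the example scores are assumptions, and fastai's internal implementation may differ in its details.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a temperature below 1 sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, rng=random):
    """Draw the index of the next token according to the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                             # made-up scores for three candidate words
plain = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.75)
print(plain, sharp, sample_next(logits, 0.75))
```

With a temperature of 0.75 the most likely word gets a larger share of the probability than with 1.0, so the generated text is a little more conservative.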

We not only have to save the model, but also its encoder, the part that's responsible for creating and updating the hidden state. For the next part, we don't care about the part that tries to guess the next word.

In [25]:
# learn.save_encoder('fine_tuned_enc')

## Classifier

Now, we'll create a new data object that only grabs the labelled data and keeps those labels. Again, this line takes a bit of time.

In [26]:
path = untar_data(URLs.IMDB)

In [27]:
# data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
#              #grab all the text files in path
#              .split_by_folder(valid='test')
#              #split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)
#              .label_from_folder(classes=['neg', 'pos'])
#              #label them all with their folders
#              .databunch(bs=bs))

# data_clas.save('data_clas.pkl')

In [28]:
data_clas = load_data(path, 'data_clas.pkl', bs=bs)

In [29]:
data_clas.show_batch()

text,target
xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules,pos
"xxbos * * * xxmaj warning - this review contains "" plot spoilers , "" though nothing could "" spoil "" this movie any more than it already is . xxmaj it really xxup is that bad . * * * \n \n xxmaj before i begin , i 'd like to let everyone know that this definitely is one of those so - incredibly - bad - that",neg
"xxbos "" a xxmaj damsel in xxmaj distress "" is definitely not one of xxmaj fred xxmaj astaire 's better musicals . xxmaj but even xxmaj astaire 's bad films always had some good moments . \n \n xxmaj in "" xxmaj damsel , "" xxmaj astaire is xxmaj jerry xxmaj halliday , an xxmaj american musical star who is in xxmaj london on a personal appearance tour .",neg
"xxbos xxmaj who knew ? xxmaj dowdy xxmaj queen xxmaj victoria , the plump xxmaj monarch who was a virtual recluse for 40 years after the death of her husband , xxmaj prince xxmaj albert , actually led a life fraught with drama and intrigue in her younger days . ' xxmaj the xxmaj young xxmaj victoria ' not only chronicles the young xxmaj queen 's romance with her husband",pos
"xxbos * xxmaj some spoilers * \n \n xxmaj this movie is sometimes subtitled "" xxmaj life xxmaj everlasting . "" xxmaj that 's often taken as reference to the final scene , but more accurately describes how dead and buried this once - estimable series is after this sloppy and illogical send - off . \n \n xxmaj there 's a "" hey kids , let 's",neg


We can then create a model to classify those reviews and load the encoder we saved before.

In [None]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')

In [None]:
# learn.lr_find()

In [None]:
# learn.recorder.plot()

In [None]:
# learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7))

In [None]:
# learn.save('first')

In [None]:
# learn.load('first');

In [None]:
# learn.freeze_to(-2)
# learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))

In [None]:
# learn.save('second')

In [None]:
learn.load('second');

In [None]:
# learn.freeze_to(-3)
# learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))

In [None]:
# learn.save('third')

In [None]:
# learn.load('third');

In [None]:
# learn.unfreeze()
# learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))
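
The `slice(lr/(2.6**4), lr)` arguments in the cells above give earlier layer groups a smaller learning rate than later ones. Here is a minimal sketch of that spread, assuming the rates are spaced geometrically and that the classifier has five layer groups. Both are assumptions here, and `geometric_lrs` is a hypothetical helper rather than a fastai function.

```python
def geometric_lrs(start, stop, n_groups):
    """Return n_groups learning rates from start to stop, each a constant multiple of the previous one."""
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

lr = 1e-2
lrs = geometric_lrs(lr / (2.6 ** 4), lr, n_groups=5)  # five groups is an assumption
for i, group_lr in enumerate(lrs):
    print(f"layer group {i}: lr = {group_lr:.2e}")
```

Under these assumptions each layer group trains with a learning rate 2.6 times larger than the group before it, with the last group using `lr` itself.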

In [None]:
learn.predict("I really loved that movie, it was awesome!")

In [None]:
learn.predict("This movie was ok. I enjoyed the characters but the pace was too slow!")

In [None]:
learn.predict("I really liked the slow pace of the movie!")

In [None]:
learn.predict("This movie was ok. the characters and the slow pace")