# IMDB

In [3]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [4]:
from fastai import *
from fastai.text import *

## Preparing the data

First let's download the dataset we are going to study. The [dataset](http://ai.stanford.edu/~amaas/data/sentiment/) has been curated by Andrew Maas et al. and contains a total of 100,000 reviews on IMDB. 25,000 of them are labelled as positive or negative for training, and another 25,000 are labelled for testing (in both cases they are highly polarized). The remaining 50,000 are additional unlabelled reviews (but we will find a use for them nonetheless).

We'll begin with a sample we've prepared for you, so that things run quickly before going over the full dataset.

In [5]:
path = untar_data(URLs.IMDB_SAMPLE)
path.ls()

[PosixPath('/home/jupyter/.fastai/data/imdb_sample/tmp'),
 PosixPath('/home/jupyter/.fastai/data/imdb_sample/texts.csv')]

It only contains one csv file, let's have a look at it.

In [6]:
df = pd.read_csv(path/'texts.csv')
df.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


In [7]:
df['text'][1]

'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is some merit in this view, but it\'s also true that no one forced Hindus and Muslims in the region to mistreat each other as they did around the time of partition. It seems more likely that the British simply saw the tensions between the religions and were clever enough to exploit them to their own ends.<br /><br />The result is that there is much cruelty and inhumanity in the situation and this is very u

It contains one line per review, with the label ('negative' or 'positive'), the text and a flag to determine if it should be part of the validation set or the training set. If we ignore this flag, we can create a DataBunch containing this data in one line of code:

In [8]:
data_lm = TextDataBunch.from_csv(path, 'texts.csv')

Executing this line launched a process that took a bit of time. Let's dig into it. Images could be fed (almost) directly into a model because they're just a big array of pixel values that are floats between 0 and 1. A text is composed of words, and we can't apply mathematical functions to them directly. We first have to convert them to numbers. This is done in two different steps: tokenization and numericalization. A `TextDataBunch` does all of that behind the scenes for you.

Before we delve into the explanations, let's take the time to save the things that were calculated.

In [9]:
data_lm.save()

Next time we launch this notebook, we can skip the cell above that took a bit of time (and that will take a lot more when you get to the full dataset) and load those results like this:

In [10]:
data = TextDataBunch.load(path)

### Tokenization

The first step of processing we make texts go through is to split the raw sentences into words, or more exactly tokens. The easiest way to do this would be to split the string on spaces, but we can be smarter:

- we need to take care of punctuation
- some words are contractions of two different words, like isn't or don't
- we may need to clean some parts of our texts, if there's HTML code for instance

To see what the tokenizer has done behind the scenes, let's have a look at a few texts in a batch.

In [11]:
data = TextClasDataBunch.load(path)
data.show_batch()

text,label
"xxbos xxfld 1 xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure",negative
"xxbos xxfld 1 xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of "" xxmaj at xxmaj the xxmaj movies "" in taking xxmaj steven",negative
xxbos xxfld 1 xxmaj this film sat on my xxmaj xxunk for weeks before i watched it . i xxunk a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the xxunk - up xxunk of xxmaj new xxmaj,positive
"xxbos xxfld 1 i really wanted to love this show . i truly , honestly did . \n\n xxmaj for the first time , gay viewers get their own version of the "" xxmaj the xxmaj bachelor "" . xxmaj with the help of his obligatory "" hag "" xxmaj",negative
"xxbos xxfld 1 xxmaj many xxunk that this is n't just a classic due to the fact that it 's the first 3d game , or even the first xxunk - up . xxmaj it 's also one of the first xxunk games , one of the xxunk definitely the",positive


The texts are truncated at 100 tokens for readability. We can see that it did more than just split on spaces and punctuation symbols: 
- the "'s" are grouped together in one token
- the contractions are separated like this: "did", "n't"
- the content has been cleaned of any HTML symbols and lowercased
- there are several special tokens (all those that begin with xx), to replace unknown tokens (see below) or to introduce different text fields (here we only have one).
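
To make these rules concrete, here is a minimal sketch of a tokenizer written with the standard library only. It is not the tokenizer fastai uses: the function name `toy_tokenize` and the exact regular expressions are assumptions made for illustration, and it ignores details such as the `xxfld` marker and vocabulary-based `xxunk` replacement.

```python
import html
import re

def toy_tokenize(text):
    """Rough imitation of the rules listed above (illustrative only)."""
    # Clean HTML: replace <br /> tags with a space and unescape entities like &amp;
    text = re.sub(r"<\s*br\s*/?\s*>", " ", text)
    text = html.unescape(text)
    # Separate contractions ("didn't" -> "did n't") and group "'s" as one token
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    text = re.sub(r"(\w)('s|'re|'ve|'ll|'d|'m)\b", r"\1 \2", text)
    # A token is "n't", an apostrophe suffix, a word, or one punctuation mark
    raw_tokens = re.findall(r"n't|'\w+|\w+|[^\w\s]", text)
    tokens = ["xxbos"]  # beginning-of-text marker
    for tok in raw_tokens:
        # Flag capitalized words with xxmaj before lower-casing them
        if len(tok) > 1 and tok[0].isupper():
            tokens.append("xxmaj")
        tokens.append(tok.lower())
    return tokens

print(toy_tokenize("It's warm.<br />I didn't like it!"))
```

Splitting on spaces alone would have produced tokens such as `warm.<br` and `didn't`, which is why the extra rules matter.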

### Numericalization

Once we have extracted tokens from our texts, we convert them to integers by creating a list of all the words used. We only keep the ones that appear at least twice, with a maximum vocabulary size of 60,000 (by default), and replace the ones that don't make the cut with the unknown token `UNK`.

The correspondence from ids to tokens is stored in the `vocab` attribute of our datasets, in a list called `itos` (for int to string).

In [12]:
data.vocab.itos[:10]

['xxunk', 'xxpad', 'xxmaj', 'the', ',', '.', 'and', 'a', 'of', 'to']

And if we look at what's in our datasets, we'll see the tokenized text as a representation:

In [13]:
data.train_ds[0][0]

Text xxbos xxfld 1 xxmaj he now has a name , an identity , some memories and a a lost girlfriend . xxmaj all he wanted was to disappear , but still , they xxunk him and destroyed the world he hardly built . xxmaj now he wants some explanation , and to get ride of the people how made him what he is . xxmaj yeah , xxmaj jason xxmaj bourne is back , and this time , he 's here with a vengeance . 

 ok , this movie does n't have the most xxunk script in the world , but its thematics are very clever and ask some serious questions about our society . xxmaj of course , like every xxmaj xxunk movie since the end of the 90 's , " xxmaj the xxmaj bourne xxmaj xxunk " is a super - heroes story . xxmaj jason xxmaj bourne is a xxmaj captain - xxmaj america project - like , who 's gone completely wrong . xxmaj in the first movie , the hero discovered his abilities and he accepted them in the second one . xxmaj he now fights against what he considers like evil , after a person close to him has been ki

But the underlying data is all numbers:

In [14]:
data.train_ds[0][0].data[:10]

array([ 44,  45,  41,   2,  35, 172,  63,   7, 353,   4])
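
The numericalization step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `build_vocab` and `numericalize` are hypothetical, and the real library code handles more special tokens than the two reserved here.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_vocab=60000, min_freq=2):
    """Return (itos, stoi) keeping frequent tokens only."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    specials = ["xxunk", "xxpad"]  # reserved ids 0 and 1
    itos = list(specials)
    for tok, freq in counts.most_common(max_vocab):
        # Drop rare tokens; they will map to xxunk later
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Replace each token by its id, falling back to the id of xxunk."""
    return [stoi.get(tok, stoi["xxunk"]) for tok in tokens]

texts = [["the", "movie", "was", "great"], ["the", "movie", "was", "dull"]]
itos, stoi = build_vocab(texts)
ids = numericalize(["the", "movie", "was", "awful"], stoi)
print(ids, [itos[i] for i in ids])
```

Here `great`, `dull` and `awful` each appear fewer than twice, so they all end up as `xxunk`.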

### With the data block API

We can use the data block API with NLP and have a lot more flexibility than what the default factory methods offer. In the previous example for instance, the data was randomly split between train and validation instead of reading the third column of the csv.

With the data block API though, we have to manually call the tokenize and numericalize steps. This allows more flexibility, and if you're not using the defaults from fastai, the various arguments to pass will appear in the step where they're relevant, so it'll be more readable.

In [15]:
data = (TextList.from_csv(path, 'texts.csv', cols='text')
                .split_from_df(col=2)
                .label_from_df(cols=0)
                .databunch())
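
Conceptually, splitting on the `is_valid` column amounts to routing each row to the training or validation set according to that flag. The sketch below shows the idea with the standard `csv` module on a made-up three-row file; the rows are invented for illustration and this is not how fastai implements `split_from_df`.

```python
import csv
import io

# A made-up miniature of texts.csv (same column names as the sample file)
sample_csv = io.StringIO(
    "label,text,is_valid\n"
    "negative,Terrible plot.,False\n"
    "positive,Loved every minute.,False\n"
    "positive,A pleasant surprise.,True\n"
)

train, valid = [], []
for row in csv.DictReader(sample_csv):
    example = (row["text"], row["label"])
    # The flag decides which set the review belongs to
    if row["is_valid"] == "True":
        valid.append(example)
    else:
        train.append(example)

print(len(train), "training rows,", len(valid), "validation rows")
```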

Now let's grab the full dataset for what follows.

In [16]:
path = untar_data(URLs.IMDB)
path.ls()

[PosixPath('/home/jupyter/.fastai/data/imdb/imdb.vocab'),
 PosixPath('/home/jupyter/.fastai/data/imdb/README'),
 PosixPath('/home/jupyter/.fastai/data/imdb/test'),
 PosixPath('/home/jupyter/.fastai/data/imdb/tmp_clas'),
 PosixPath('/home/jupyter/.fastai/data/imdb/tmp_lm'),
 PosixPath('/home/jupyter/.fastai/data/imdb/train'),
 PosixPath('/home/jupyter/.fastai/data/imdb/models')]

In [17]:
(path/'train').ls()

[PosixPath('/home/jupyter/.fastai/data/imdb/train/labeledBow.feat'),
 PosixPath('/home/jupyter/.fastai/data/imdb/train/pos'),
 PosixPath('/home/jupyter/.fastai/data/imdb/train/unsupBow.feat'),
 PosixPath('/home/jupyter/.fastai/data/imdb/train/neg'),
 PosixPath('/home/jupyter/.fastai/data/imdb/train/unsup')]

The reviews are in a training and test set following an ImageNet-style structure: one folder per split, with one subfolder per class. The only difference is that there is an `unsup` folder in `train` that contains the unlabelled data.
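
With this kind of layout, both the split and the label can be read off a file's path. Here is a small sketch using `pathlib`; the file names are hypothetical and only the folder names (`train`, `test`, `pos`, `neg`, `unsup`) come from the listing above.

```python
from pathlib import PurePosixPath

# Hypothetical file names; only the folder names match the dataset
files = [
    PurePosixPath("imdb/train/pos/review_a.txt"),
    PurePosixPath("imdb/train/neg/review_b.txt"),
    PurePosixPath("imdb/train/unsup/review_c.txt"),
    PurePosixPath("imdb/test/pos/review_d.txt"),
]

def split_and_label(path):
    """Grandparent folder is the split, parent folder is the label."""
    split, folder = path.parts[-3], path.parts[-2]
    label = folder if folder in ("pos", "neg") else None  # unsup has no label
    return split, label

for f in files:
    print(f.name, split_and_label(f))
```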

## Language model

We're not going to train a model that classifies the reviews from scratch. Like in computer vision, we'll use a model pretrained on a bigger dataset (a cleaned subset of Wikipedia called [wikitext-103](https://einstein.ai/research/blog/the-wikitext-long-term-dependency-language-modeling-dataset)). That model has been trained to guess the next word, its input being all the previous words. It has a recurrent structure and a hidden state that is updated each time it sees a new word. This hidden state thus contains information about the sentence up to that point.

We are going to use that 'knowledge' of the English language to build our classifier, but first, like for computer vision, we need to fine-tune the pretrained model to our particular dataset. Because the English of the reviews left by people on IMDB isn't the same as the English of Wikipedia, we'll need to adjust the parameters of our model a little bit. Plus there might be some words extremely common in that dataset that were barely present in Wikipedia, and therefore might not be part of the vocabulary the model was trained on.

This is where the unlabelled data is going to be useful to us, as we can use it to fine-tune our model. Let's create our data object with the data block API (next line takes a few minutes).

In [18]:
data_lm = (TextList.from_folder(path)                           
           #Inputs: all the text files in path
            .filter_by_folder(include=['train', 'test']) 
           #We may have other temp folders that contain text files so we only keep what's in train and test
            .random_split_by_pct(0.1)
           #We randomly split and keep 10% (10,000 reviews) for validation
            .label_for_lm()           
           #We want to do a language model so we label accordingly
            .databunch())
data_lm.save('tmp_lm')

We have to use a special kind of `TextDataBunch` for the language model, that ignores the labels (that's why we put 0 everywhere), will shuffle the texts at each epoch before concatenating them all together (only for training, we don't shuffle for the validation set) and will send batches that read that text in order with targets that are the next word in the sentence.

Since building `data_lm` takes a while, we can quickly reload the saved ids with the following cell.

In [26]:
data_lm = TextLMDataBunch.load(path, 'tmp_lm', bs=30)

In [27]:
data_lm.show_batch()

idx,text
0,"xxbos xxmaj this is one of xxmaj chaplin 's xxmaj first xxmaj national films from the period between his glorious xxmaj mutual shorts and the more mature xxmaj united xxmaj artists features . xxmaj more opulent than the xxmaj mutual films , it continues xxmaj chaplin 's quest for perfecting his comic expression . xxmaj most people forget that the film is actually a dream that xxmaj charlie has while awaiting being sent off to the front . \n\n xxmaj there is plenty of slapstick via the xxmaj limburger cheese being used to gas the"
1,"shown as heroines who are greater than everyone else because of how different they are and how they do n't let themselves controlled by society . xxmaj on the contrary . xxmaj this is a good point for the movie , since the realism is n't too overwhelming . xxmaj however , the depressing atmosphere of the movie can be if you are n't into this . xxmaj as the movie keeps going , it gets more and more depressing and the ending just leaves you with a "" bitter taste in your mouth """
2,"attack the xxmaj gas xxmaj station is the wacky story of a "" gang "" of 4 xxmaj korean youths who have a bad case of boredom , so what better to do then rob the local gas station ? ! xxmaj after the high of the robbery wears off , the 4 teenagers find themselves right where they were before the robbery , bored . xxmaj the only solution they can come up with is to rob the gas station again ! xxmaj as the gang attends to the customers to keep all quiet"
3,": "" a monkey is funny , anytime , anywhere . "" xxmaj there is one exception to this : xxup going xxup bananas . xxmaj it is quite simply the xxup worst xxup movie i have ever seen . xxmaj it 's worse than xxup plan 9 , worse than xxup the xxup beast of xxup yucca xxup flats . xxmaj it is xxup terrible . xxmaj the talking monkey gag gets old after about three minutes , and believe me that 's all there is . xxmaj make sure you have a bunch"
4,"blames her negligence for the accident . xxmaj the seventeen years old xxmaj evie xxmaj brighton ( xxmaj lauren xxmaj ambrose ) loves her sister and reads poems and stories for xxmaj emily . xxmaj their father xxmaj harry xxmaj brighton ( xxmaj john xxmaj savage ) , a bank investor , lives in the basement with his models of trains and railroads . xxmaj evie mysteriously sabotages her interviews for different universities being rejected , and teaches the xxunk of her own to xxmaj emily . xxmaj when xxmaj martha hears xxmaj emily repeating"
5,"steady cams , crane camera movement is brilliant , the smooth cuts and soft transitions boosts the romantic dimension of the storyline . xxmaj as expected , the music score by xxmaj khaled xxmaj xxunk is expressive , romantic and adds a lot of depth to many scenes . xxmaj the tempo of the movie is just right , giving enough time for actors to express their feelings through longer than usual shots that will never leave you bored . xxmaj action scenes are also very well done . \n\n a must see piece of"
6,"the actors were particularly talented especially the very handsome xxmaj mr. xxmaj schnaas as the killer xxmaj karl , xxmaj they could not save this movie . xxmaj even the castration scene was boring . xxmaj mr. xxmaj schnaas , xxmaj make us a better film ! xxbos xxmaj this film is a masterpiece to put it simply . xxmaj especially the double exposure made by the cameraman xxmaj julius xxmaj jaenzon . xxmaj it is skillfully made even with the standards we are used to today seventy eight years later . xxmaj viktor xxmaj"
7,"falls apart towards the end . xxmaj all and all this is some great xxmaj woody xxmaj allen work and i certainly enjoyed it . xxbos xxmaj liv xxmaj xxunk in her sexiest movie ! \n\n xxmaj she incorporates the "" xxmaj femme xxmaj fatale "" role in an astonishing way , while in the same time she manages to appear a super sexy woman while keeping the "" sweet girl "" stand and not being over - wicked like other similar movies ( e.g. "" xxmaj femme xxmaj fatale "" with xxmaj rebecca xxmaj"
8,"who the one turning sixteen is . i bet half of 'em does n't know it is a sweet sixteen party . xxmaj never mind . xxmaj even i would go to that party . xxmaj and i would be one of the last mentioned . \n\n xxmaj still , i watch the show because it 's so bratty that it keeps me entertained for some half hour . xxmaj does it make sense ? xxmaj not for me , either . xxmaj but it 's true . i deal with it . \n\n xxmaj"
9,". boards and rant hopelessly , and this comment list is filled with the exact same thing ( in an opposite fashion ) , so forget it . \n\n xxmaj my recommendation to the xxmaj christian parent : xxmaj check it out and pick it up ( i 'm sure it 's cheap ) . xxmaj as other parents stated before , it does have the values you 're looking for . xxmaj nonetheless , do n't expect anything fantastic , like more mainstream religious entertainment . \n\n xxmaj my recommendation to the rest of"
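
To see how a language model batch relates inputs to targets, here is a simplified sketch in plain Python. It assumes the texts have already been concatenated into one list of token ids, and it leaves out the per-epoch shuffling; the function name `lm_batches` is hypothetical and the real data loader is more elaborate.

```python
def lm_batches(token_ids, bs, bptt):
    """Yield (inputs, targets) where targets are inputs shifted by one token."""
    # Cut the long token stream into bs parallel streams of equal length
    stream_len = len(token_ids) // bs
    streams = [token_ids[i * stream_len:(i + 1) * stream_len] for i in range(bs)]
    # Read each stream in order, bptt tokens at a time
    for start in range(0, stream_len - 1, bptt):
        end = min(start + bptt, stream_len - 1)
        x = [s[start:end] for s in streams]
        y = [s[start + 1:end + 1] for s in streams]  # next token for each position
        yield x, y

ids = list(range(20))  # stand-in for numericalized, concatenated reviews
batches = list(lm_batches(ids, bs=2, bptt=4))
for x, y in batches:
    print(x, "->", y)
```

Each row of a batch continues where the same row of the previous batch stopped, which is what lets the hidden state carry over from one batch to the next.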


We can then put this in a learner object very easily with a model loaded with the pretrained weights. They'll be downloaded the first time you execute the following line and stored in '~/.fastai/models/' (or elsewhere if you specified different paths in your config file).

In [30]:
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.3, bptt = 30)

In [31]:
learn.lr_find()

LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.


RuntimeError: CUDA error: out of memory

In [None]:
learn.recorder.plot(skip_end=15)

In [None]:
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

In [None]:
learn.save('fit_head')

In [None]:
learn.load('fit_head');

To complete the fine-tuning, we can then unfreeze and launch a new training.

In [None]:
learn.unfreeze()
learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7))

In [None]:
learn.save('fine_tuned')

How good is our model? Well let's try to see what it predicts after a few given words.

In [None]:
learn.load('fine_tuned');

In [None]:
learn.predict('I liked this movie because ', 100, temperature=1.1, min_p=0.001)
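
The `temperature` and `min_p` arguments control how the next word is drawn from the predicted probabilities. The sketch below shows one common way such parameters are applied, which we assume (without having checked the library source here) is close in spirit to what `predict` does: probabilities below `min_p` are discarded, and the rest are raised to the power `1/temperature` before sampling. The function name `sample_next` is hypothetical.

```python
import random

def sample_next(probs, temperature=1.0, min_p=None, rng=random):
    """Sample an index from probs after min_p filtering and temperature scaling."""
    if min_p is not None:
        # Never pick tokens the model considers too unlikely
        probs = [p if p >= min_p else 0.0 for p in probs]
    # temperature > 1 flattens the distribution, temperature < 1 sharpens it
    weights = [p ** (1.0 / temperature) for p in probs]
    total = sum(weights)  # assumes at least one probability survived the filter
    weights = [w / total for w in weights]
    idx = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return idx, weights

probs = [0.7, 0.2, 0.0995, 0.0005]
idx, weights = sample_next(probs, temperature=2.0, min_p=0.001, rng=random.Random(0))
print(idx, [round(w, 3) for w in weights])
```

A temperature slightly above 1, as in the cell above, makes the generated text a bit more varied than always following the most likely words.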

We have to save not only the model but also its encoder, the part that's responsible for creating and updating the hidden state. For the next part, we don't care about the part that tries to guess the next word.

In [None]:
learn.save_encoder('fine_tuned_enc')

## Classifier

Now, we'll create a new data object that only grabs the labelled data and keeps those labels. Again, this line takes a bit of time.

In [23]:
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             #grab all the text files in path
            .split_by_folder(valid='test')
             #split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)
            .label_from_folder(classes=['neg', 'pos'])
             #label them all with their folders
            .databunch(bs=50))
data_clas.save('tmp_clas')

In [24]:
data_clas = TextClasDataBunch.load(path, 'tmp_clas', bs=50)
data_clas.show_batch()

text,label
"xxbos * * * xxup spoilers * * * * * * xxup spoilers * * * xxmaj the first xxmaj godzilla movie in the third movie series , whereas xxup godzilla vs . xxup xxunk , the previous entry , aptly ended the second series . xxmaj what else",[['neg' 'pos']]
"xxbos xxmaj the problem with trying to describe this movie is coming up with the right adjectives . xxmaj words like flashy , colorful , gaudy and flat keep coming to mind ; but the essential fault with ` xxmaj there 's xxmaj no xxmaj business xxmaj like xxmaj show",[['neg' 'pos']]
"xxbos xxmaj the first thing to note about xxmaj wes xxmaj anderson 's new film ( featuring xxmaj owen xxmaj wilson , xxmaj jason xxmaj schwartzman , and xxmaj adrien xxmaj brody , as the xxmaj whitman brothers , xxmaj francis , xxmaj jack , and xxmaj peter respectively )",[['neg' 'pos']]
"xxbos xxmaj ashutosh xxmaj gowariker 's "" xxmaj jodhaa xxmaj akbar "" is the most ambitious film to emerge from xxmaj bollywood 's stables in quite a while . xxmaj based on the historical alliance between xxmaj india 's greatest xxmaj mughal emperor and a xxmaj rajput xxmaj hindu princess",[['neg' 'pos']]
xxbos 10 . xxmaj the script \n\n xxmaj uncredited as a scriptwriter is novelist f. xxmaj scott xxmaj fitzgerald . xxmaj his love scenes are extremely elaborate and exquisitely structured . xxmaj they also introduce innovations that have since become clichés and the hallmark of ' women pictures ' everywhere,[['neg' 'pos']]


We can then create a model to classify those reviews and load the encoder we saved before.

In [25]:
learn = text_classifier_learner(data_clas, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')
learn.freeze()

FileNotFoundError: [Errno 2] No such file or directory: '/home/jupyter/.fastai/data/imdb/models/fine_tuned_enc.pth'

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

In [2]:
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7))

NameError: name 'learn' is not defined

In [None]:
learn.save('first')

In [None]:
learn.load('first');

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))
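
The `slice(1e-2/(2.6**4), 1e-2)` argument gives a range of learning rates rather than a single value. As a worked example, assume the range is spread geometrically over the model's layer groups and that there are five of them (both are assumptions for this sketch, and `geometric_lrs` is a hypothetical helper): dividing by `2.6**4` then means each group trains with a rate 2.6 times larger than the group before it.

```python
def geometric_lrs(start, stop, n_groups):
    """Return n_groups learning rates evenly spaced on a log scale."""
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

# Assumed: five layer groups, earliest layers first
lrs = geometric_lrs(1e-2 / 2.6 ** 4, 1e-2, 5)
for group, lr in enumerate(lrs):
    print(f"group {group}: lr = {lr:.2e}")
```

The earliest layers, which hold the most general knowledge, thus move the least, while the freshly added classifier head gets the full learning rate.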

In [None]:
learn.save('second')

In [None]:
learn.load('second');

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))

In [None]:
learn.save('third')

In [None]:
learn.load('third');

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))

In [None]:
learn.predict("I really loved that movie, it was awesome!")