In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [0]:
from fastai.text import *

In [3]:
path = untar_data(URLs.IMDB_SAMPLE)
path.ls()

Downloading http://files.fast.ai/data/examples/imdb_sample


[PosixPath('/root/.fastai/data/imdb_sample/texts.csv')]

In [4]:
df = pd.read_csv(path/'texts.csv')
df.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


In [5]:
df['text'][1]

'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is some merit in this view, but it\'s also true that no one forced Hindus and Muslims in the region to mistreat each other as they did around the time of partition. It seems more likely that the British simply saw the tensions between the religions and were clever enough to exploit them to their own ends.<br /><br />The result is that there is much cruelty and inhumanity in the situation and this is very u

A text is composed of words, and we can't apply mathematical functions to them directly. We first have to convert them to numbers. This is done in two different steps: tokenization and numericalization. A TextDataBunch does all of that behind the scenes for you.

In [6]:
data_lm = TextDataBunch.from_csv(path, 'texts.csv')

In [0]:
data_lm.save()

In [0]:
data = load_data(path)

**Tokenization**

To see what the tokenizer has done behind the scenes, let's look at a few texts in a batch.

In [9]:
data = TextClasDataBunch.from_csv(path, 'texts.csv')
data.show_batch()

text,target
"xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n \n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , steaming bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj",negative
"xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with",positive
"xxbos xxmaj this film sat on my xxmaj tivo for weeks before i watched it . i dreaded a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj xxunk . \n \n xxmaj the format is the same as xxmaj xxunk xxmaj xxunk ' "" xxmaj la xxmaj ronde",positive
"xxbos xxmaj many neglect that this is n't just a classic due to the fact that it 's the first xxup 3d game , or even the first xxunk - up . xxmaj it 's also one of the first xxunk games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - rounded gaming experience in general . xxmaj with graphics",positive
"xxbos i really wanted to love this show . i truly , honestly did . \n \n xxmaj for the first time , gay viewers get their own version of the "" xxmaj the xxmaj bachelor "" . xxmaj with the help of his obligatory "" hag "" xxmaj xxunk , xxmaj james , a good looking , well - to - do thirty - something has the chance",negative

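The special tokens in the batch above (xxbos, xxmaj, xxup) follow a few simple rules. Here is a minimal sketch of those rules in plain Python. It is only an approximation: the function name `simple_tokenize` and the regular expression are assumptions made for illustration, and the real tokenizer does considerably more (for example, it splits contractions such as "it 's").

```python
import re

BOS, MAJ, UP = "xxbos", "xxmaj", "xxup"

def simple_tokenize(text):
    """Toy approximation of the capitalization rules visible in the batch."""
    # Split into alphanumeric words and single punctuation marks.
    raw = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)
    tokens = [BOS]  # every text starts with a beginning-of-stream marker
    for tok in raw:
        if tok.isupper() and len(tok) > 1:
            # Word written entirely in capitals: flag it, then lowercase it.
            tokens += [UP, tok.lower()]
        elif tok[0].isupper() and len(tok) > 1 and tok[1:].islower():
            # Capitalized word: flag it, then lowercase it.
            tokens += [MAJ, tok.lower()]
        else:
            tokens.append(tok.lower())
    return tokens

tokens = simple_tokenize("This film sat on my Tivo for weeks before I watched it.")
print(" ".join(tokens))
```

Lowercasing every word and recording the capitalization in a separate token lets "This" and "this" share one vocabulary entry without losing the information.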

**Numericalization**

Once we have extracted tokens from our texts, we convert them to integers by creating a list of all the words used. By default, we only keep the words that appear at least twice, up to a maximum vocabulary size of 60,000, and replace the ones that don't make the cut with the unknown token xxunk.

In [10]:
data.vocab.itos[:10]

['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 'the']

In [11]:
data.train_ds[0][0]

Text xxbos xxmaj the direction struck me as poor man 's xxmaj xxunk xxmaj bergman . xxmaj the xxunk dialogue was annoying . xxmaj the xxunk xxunk that all characters except xxmaj xxunk ' showed made me think they were drugged . i think the director ruined it for me .

In [12]:
data.train_ds[0][0].data[:10]

array([   2,    5,    9,  513, 3118,   88,   29,  322,  144,   23])
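The numericalization step can be sketched in plain Python as follows. This is a toy version under stated assumptions: `build_vocab` and `numericalize` are hypothetical helper names, the special tokens are copied from the vocabulary listing above, and the library's own handling of ties and truncation may differ.

```python
from collections import Counter

# Special tokens, in the order shown by data.vocab.itos[:10] above.
SPECIALS = ["xxunk", "xxpad", "xxbos", "xxeos", "xxfld",
            "xxmaj", "xxup", "xxrep", "xxwrep"]

def build_vocab(tokenized_texts, max_vocab=60000, min_freq=2):
    """Return (itos, stoi): index-to-token list and token-to-index dict."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(SPECIALS)
    for tok, freq in counts.most_common(max_vocab):
        # Keep only tokens that are frequent enough and not already special.
        if freq >= min_freq and tok not in SPECIALS:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to ids; anything outside the vocabulary becomes xxunk."""
    return [stoi.get(tok, stoi["xxunk"]) for tok in tokens]

texts = [["xxbos", "the", "movie", "was", "great"],
         ["xxbos", "the", "movie", "was", "dull"]]
itos, stoi = build_vocab(texts)
ids = numericalize(texts[0], stoi)
print(ids)
```

In this toy corpus "great" occurs only once, so it falls below `min_freq` and is mapped to the id of xxunk.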

**With the data block API**

With the data block API though, we have to manually call the tokenize and numericalize steps.

In [13]:
data = (TextList.from_csv(path, 'texts.csv', cols='text')
                .split_from_df(col=2)
                .label_from_df(cols=0)
                .databunch())

**Language Model**

In [0]:
bs=48

In [15]:
path = untar_data(URLs.IMDB)
path.ls()

Downloading https://s3.amazonaws.com/fast-ai-nlp/imdb


[PosixPath('/root/.fastai/data/imdb/test'),
 PosixPath('/root/.fastai/data/imdb/tmp_lm'),
 PosixPath('/root/.fastai/data/imdb/tmp_clas'),
 PosixPath('/root/.fastai/data/imdb/train'),
 PosixPath('/root/.fastai/data/imdb/imdb.vocab'),
 PosixPath('/root/.fastai/data/imdb/unsup'),
 PosixPath('/root/.fastai/data/imdb/README')]

In [16]:
(path/'train').ls()

[PosixPath('/root/.fastai/data/imdb/train/unsupBow.feat'),
 PosixPath('/root/.fastai/data/imdb/train/neg'),
 PosixPath('/root/.fastai/data/imdb/train/pos'),
 PosixPath('/root/.fastai/data/imdb/train/labeledBow.feat')]

We first fine-tune the language model, which was pretrained on WikiText-103 (wt103), on the IMDB reviews without their labels, so that it adapts to our data.

We build this unlabelled dataset with the data block API:

In [17]:
data_lm = (TextList.from_folder(path)
            .filter_by_folder(include=['train', 'test', 'unsup'])
            .split_by_rand_pct(0.1)
            .label_for_lm()
            .databunch(bs=bs))
data_lm.save('data_lm.pkl')

In [0]:
data_lm = load_data(path, 'data_lm.pkl', bs=bs)

In [19]:
data_lm.show_batch()

idx,text
0,"masterpiece and that all those who in such big words condemned xxmaj harlin 's version and praised xxmaj xxunk , even if no one had ever seen it , would have been right . \n \n xxmaj but they were n't . \n \n xxmaj to put it in a nutshell : xxmaj schrader has no idea what a horror film should be , and it shows in"
1,"xxmaj bleah . xxbos i had very high hopes walking into this movie . xxmaj after all , xxmaj ocean 's 11 was a truly great xxmaj hollywood product . xxmaj its rapid - fire jokes , incredible star power and tight script made it one of the most fun caper films i have ever seen . xxmaj of course , with all the money it made , a sequel"
2,"up more . \n \n xxmaj oh and xxup ps : i do n't know what 's this guy "" xxmaj uwe "" capable of creating ... i certainly do n't think this movie is bad just because he 's the director . i checked the list of his work , and this is his xxunk probably the last ) creation that i 've seen . \n \n"
3,"me . xxmaj my chief complaint is that it 's needlessly exploitative of xxmaj jillian mcwhirter 's nudity , i 'm no prude but these nude scenes just drag on and on and on ... only to culminate ( virtually every time ) in a tawdry * wink , nudge * insinuation of sexual violence . xxmaj the scene where she attempts a coat hanger abortion after several minutes of"
4,"the incidental music sound like a pig snorting ? xxmaj what i mean by that is where we hear this baritone saxophone being played with drums accompanying it , but the melodies are basically tuneless ! xxbos xxmaj this movie was obviously made with a very low budget , but did they have to make it so obvious ? xxmaj it looked like they made no effort to make the"
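Notice that rows in the batch above run across review boundaries (an xxbos can appear mid-row). For language modelling, the texts are treated as one long token stream, and the target at each position is simply the next token. A minimal sketch of that idea in plain Python follows. The function `lm_batches` and its `bptt` (sequence length) parameter are illustrative assumptions; the real loader also handles shuffling and other details omitted here.

```python
def lm_batches(ids, bs, bptt):
    """Yield (inputs, targets) pairs from one long stream of token ids."""
    # Tokens per row, leaving one extra token for the last target.
    n = (len(ids) - 1) // bs
    # Each row is a contiguous slice of the stream.
    rows_x = [ids[r * n:(r + 1) * n] for r in range(bs)]
    # Targets are the same slices shifted one token to the right.
    rows_y = [ids[r * n + 1:(r + 1) * n + 1] for r in range(bs)]
    for start in range(0, n, bptt):
        x = [row[start:start + bptt] for row in rows_x]
        y = [row[start:start + bptt] for row in rows_y]
        yield x, y

stream = list(range(10, 23))  # 13 made-up token ids: 10..22
batches = list(lm_batches(stream, bs=2, bptt=3))
for x, y in batches:
    print(x, y)
```

Because the labels are derived from the text itself, no sentiment labels are required at this stage.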


In [20]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

Downloading https://s3.amazonaws.com/fast-ai-modelzoo/wt103-fwd


In [22]:
learn.lr_find()
learn.recorder.plot(skip_end=15)

epoch,train_loss,valid_loss,accuracy,time


LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.


KeyboardInterrupt: ignored

In [0]:
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time


In [0]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
dest = Path("/content/gdrive/My Drive/")

In [0]:
learn.save(dest/'fit_head')

In [0]:
learn.load(dest/'fit_head');

In [0]:
learn.unfreeze()

In [0]:
learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7))

In [0]:
learn.save(dest/'fine_tuned')

Test our language model on sample data:

In [0]:
learn.load(dest/'fine_tuned');

In [0]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2

In [0]:
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
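The `temperature` argument controls how adventurous the sampling is. The sketch below shows the usual meaning of temperature in plain Python: the scores are divided by the temperature before being turned into probabilities, so values below 1 sharpen the distribution. The function name and the example scores are made up for illustration, and this is an assumption about the general technique rather than a copy of the library's code.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate words
_, probs_t1 = sample_with_temperature(logits, temperature=1.0)
idx, probs_t075 = sample_with_temperature(logits, temperature=0.75)
print(probs_t1)
print(probs_t075)
```

With a temperature of 0.75 the most likely word receives a larger share of the probability than at 1.0, so the generated text is less random but also less varied.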

We have to save not only the model, but also its encoder, the part that's responsible for creating and updating the hidden state. 

For the classifier, we only need the encoder, so we save it separately:

In [0]:
learn.save_encoder('fine_tuned_enc')

**Classifier to classify the reviews**