# IMDB NLP

The [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains 50,000 movie reviews from [IMDB](http://www.imdb.com/). 25,000 of the reviews are positive, with a score ≥ 7 out of 10, and the other 25,000 are negative, with a score ≤ 4. Our task is to train a predictive model that can read a review and decide whether it is positive or negative.

This model will be built in two stages:

1. Train a [language model](https://en.wikipedia.org/wiki/Language_model) over the movie reviews. A language model is a system that predicts the next word given the previous word(s). Training the language model also produces embeddings, tuned to the movie-review domain, for each token (usually a word) in the vocabulary (the set of tokens in the dataset).
2. Fine-tune the language model into the classification model.
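
To make "predicting the next word" concrete, here is a minimal sketch of a count-based bigram model in plain Python. It is not the neural model trained below, only the simplest thing that fits the definition. The toy corpus and the helper names `train_bigram` and `predict_next` are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return the most frequent follower of `prev`, or None if unseen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy corpus (hypothetical), already lower-cased and tokenized
corpus = "the film was great . the film was long . the acting was great .".split()
counts = train_bigram(corpus)
print(predict_next(counts, "film"))   # was
print(predict_next(counts, "was"))    # great
```

The neural language model plays the same role as `predict_next`, but it conditions on the whole preceding sequence rather than on a single word.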

This notebook follows [the fastai lesson 4 IMDB notebook](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson4-imdb.ipynb) for language modeling and text sentiment classification, with my own commentary and thoughts added.

## Imports

* fastai provides specialized techniques for language modeling and fine-tuning.
* torchtext is PyTorch's NLP helper library, used here for data processing and datasets.
* dill is a pickle replacement, used for storing processed data.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle

## Data
Set up various paths to load data and save weights / temporary files.

In [2]:
PATH='data/aclImdb/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

imdbEr.txt  imdb.vocab  models/  README  test/  tmp/  train/


In [3]:
!tree -d {PATH}

data/aclImdb/
├── models
├── test
│   ├── all
│   ├── neg
│   └── pos
├── tmp
└── train
    ├── all
    ├── neg
    ├── pos
    └── unsup

11 directories


In [4]:
trn_files = !ls {TRN}
trn_files[:10]

['0_0.txt',
 '0_3.txt',
 '0_9.txt',
 '10000_0.txt',
 '10000_4.txt',
 '10000_8.txt',
 '1000_0.txt',
 '10001_0.txt',
 '10001_10.txt',
 '10001_4.txt']

In [5]:
review = !cat {TRN}{trn_files[6]}
review[0]

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop-socky fung-ku, but what I got instead was a comedy. So, it wasn't quite was I was expecting, but I really liked it anyway! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them!! I was laughing my ass off. I mean, the cops were just so bad! And when I say bad, I mean The Shield Vic Macky bad. But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose...man, oh man. What can you say about that hottie. She was great and put those other actresses to shame. She should work more often!!!!! I also really liked the fight scene outside of the building. That was done really well. Lots of fighting and people getting their heads banged up. FUN! Last, but not least Joe Estevez and William Smith were great as the...well, I wasn't sure what they were, but they seemed to be having fun and throwing out 

In [6]:
!find {TRN} -name '*.txt' | xargs cat | wc -w

17486581


In [7]:
!find {VAL} -name '*.txt' | xargs cat | wc -w

5686719


In [8]:
' '.join(spacy_tok(review[0]))[:20]

'I have to say when a'

In [9]:
TEXT = data.Field(lower=True, tokenize=spacy_tok)

In [10]:
bs=64; bptt=65

In [11]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

In [12]:
# Use a context manager so the file is flushed and closed
with open(f'{PATH}models/TEXT.pkl','wb') as f:
    pickle.dump(TEXT, f)

In [13]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(4957, 34933, 1, 20626674)
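
These numbers fit together if the token stream is cut into `bs` parallel columns and each column is then read in chunks of roughly `bptt` tokens. The sketch below redoes that arithmetic in plain Python. The final `- 1` is my assumption about how the loader counts batches (the last chunk has no following tokens to use as targets); I chose it because it reproduces the reported 4957. `batches_per_epoch` is a hypothetical helper, not a fastai function.

```python
def batches_per_epoch(n_tokens, bs, bptt):
    """Estimate the number of mini-batches in one pass over the token stream."""
    # Split the stream into `bs` equal columns, dropping the remainder
    tokens_per_column = n_tokens // bs
    # Each batch consumes about `bptt` rows; assume the last chunk is
    # dropped because it has no following tokens to predict
    return tokens_per_column // bptt - 1

n_tokens = 20626674   # len(md.trn_ds[0].text) from the output above
print(n_tokens // 64)                        # 322291 tokens per column
print(batches_per_epoch(n_tokens, 64, 65))   # 4957
```

The other two values are the vocabulary size (`md.nt`, 34933 tokens after applying `min_freq=10`) and the dataset length of 1, since all training reviews are concatenated into a single long text.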

In [14]:
# 'itos': 'int-to-string'
TEXT.vocab.itos[:12]

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'it', 'in']

In [15]:
# 'stoi': 'string to int'
TEXT.vocab.stoi['the']

2
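
As a rough picture of what `itos` and `stoi` hold, here is a minimal sketch of building both maps from a token list with a `min_freq` cutoff. It assumes special tokens come first and the rest are ordered by descending frequency, which matches the output above; the alphabetical tie-break and the helper name `build_vocab` are my own choices, not torchtext's.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=("<unk>", "<pad>")):
    """Build index<->token maps, most frequent tokens first."""
    freqs = Counter(tokens)
    # Sort by descending count, breaking ties alphabetically
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, c in ordered if c >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

# Toy token stream (hypothetical)
tokens = "the film was great . the film was long . the end".split()
itos, stoi = build_vocab(tokens, min_freq=2)
print(itos)   # ['<unk>', '<pad>', 'the', '.', 'film', 'was']

# Tokens below min_freq fall back to the <unk> index
ids = [stoi.get(tok, stoi["<unk>"]) for tok in ["the", "film", "end"]]
print(ids)    # [2, 4, 0]
```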

In [16]:
md.trn_ds[0].text[:12]

['this',
 'short',
 'film',
 'that',
 'inspired',
 'the',
 'soon',
 '-',
 'to',
 '-',
 'be',
 'full']

In [17]:
next(iter(md.trn_dl))

(Variable containing:
     13     57     17  ...      34    116     11
    366      8     15  ...     334     57     72
     25    146     59  ...       4   1036    306
         ...            ⋱           ...         
      7     11      3  ...      63     19      5
      2     13     24  ...    3354   2266     11
    818     25    173  ...     912     11     72
 [torch.cuda.LongTensor of size 76x64 (GPU 0)], Variable containing:
    366
      8
     15
   ⋮   
     11
      6
    675
 [torch.cuda.LongTensor of size 4864 (GPU 0)])
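
The batch is a pair: a 76x64 matrix of input token ids and a flat vector of 4864 = 76 x 64 targets. The targets start with 366, 8, 15, which is the second row of the input, so the target appears to be the input shifted down by one token and then flattened. The row count is 76 rather than `bptt=65`, which is consistent with the loader varying the sequence length between batches. A minimal sketch of that shift in plain Python, with a made-up 4x3 layout and a hypothetical helper `make_batch`:

```python
def make_batch(rows, start, seq_len):
    """Return (inputs, flat_targets) for rows start..start+seq_len-1.

    `rows` is a list of rows, each row holding one token id per column.
    """
    inputs = rows[start:start + seq_len]
    shifted = rows[start + 1:start + 1 + seq_len]
    # Flatten targets row by row so they line up with flattened predictions
    targets = [tok for row in shifted for tok in row]
    return inputs, targets

# Toy stream of 12 token ids laid out as 4 rows x 3 columns (hypothetical);
# each column is a contiguous slice of the stream, read top to bottom
rows = [[0, 4, 8],
        [1, 5, 9],
        [2, 6, 10],
        [3, 7, 11]]
x, y = make_batch(rows, start=0, seq_len=2)
print(x)   # [[0, 4, 8], [1, 5, 9]]
print(y)   # [1, 5, 9, 2, 6, 10]
```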

## Language model

In [18]:
em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

In [19]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

In [20]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)

In [21]:
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3
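
`learner.clip` limits how large a gradient update can get. Below is a minimal sketch of norm-based gradient clipping in plain Python, assuming `clip` is a maximum joint L2 norm in the spirit of PyTorch's `clip_grad_norm_`; I have not checked how fastai applies the value internally. `clip_by_global_norm` is a hypothetical helper.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients down so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]                      # joint norm is 5.0
clipped, norm = clip_by_global_norm(grads, max_norm=0.3)
print(norm)      # 5.0
print(clipped)   # same direction, norm reduced to about 0.3
```

Clipping rescales the whole gradient rather than truncating individual entries, so the update direction is preserved.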

In [22]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)

epoch      trn_loss   val_loss                                
    0      4.861833   4.727922  
    1      4.665112   4.519968                                
    2      4.550101   4.43793                                 
    3      4.579395   4.460163                                
    4      4.514097   4.385937                                
    5      4.436209   4.329441                                
    6      4.39791    4.313221                                
    7      4.55677    4.407767                                
    8      4.50389    4.376834                                
    9      4.473746   4.345392                                
    10     4.420918   4.313469                                
    11     4.392581   4.283422                                
    12     4.356932   4.259883                                
    13     4.328219   4.247171                                
    14     4.316426   4.244043                                



[4.244043]
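
The 15 epochs in the log follow from the arguments to `fit`: 4 cycles, starting at `cycle_len=1` epoch and doubling each time (`cycle_mult=2`), give 1 + 2 + 4 + 8 = 15. The small jumps in training loss at epochs 3 and 7 line up with the starts of new cycles, when the learning rate is reset to its maximum. The sketch below works this out in plain Python. The cosine shape of the decay is my assumption (the usual choice for learning rate schedules with warm restarts), and both helper names are hypothetical.

```python
import math

def cycle_lengths(n_cycles, cycle_len, cycle_mult):
    """Epochs per cycle: each cycle is cycle_mult times longer than the last."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycles)]

def cosine_lr(max_lr, progress):
    """Learning rate at `progress` (0..1) through a cycle, annealed to 0."""
    return max_lr * (1 + math.cos(math.pi * progress)) / 2

lengths = cycle_lengths(4, cycle_len=1, cycle_mult=2)
print(lengths, sum(lengths))        # [1, 2, 4, 8] 15

# Epochs at which the learning rate jumps back up to max_lr
restarts, epoch = [], 0
for n in lengths:
    restarts.append(epoch)
    epoch += n
print(restarts)                     # [0, 1, 3, 7]

print(cosine_lr(3e-3, 0.0))   # max_lr at the start of a cycle
print(cosine_lr(3e-3, 0.5))   # about half of max_lr mid-cycle
```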

In [None]:
# learner.save_encoder('adam1_enc')
learner.load_encoder('adam1_enc')

In [None]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)

epoch      trn_loss   val_loss                                
    0      4.485255   4.359861  
                                                              

In [22]:
# learner.save_encoder('adam2_enc')
learner.load_encoder('adam2_enc')

In [23]:
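m=learner.model
# Two candidate prompts; the second assignment overrides the first
ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
ss="""film festival"""
# Tokenize the prompt and map its tokens to vocabulary indices
s = [spacy_tok(ss)]
t=TEXT.numericalize(s)
' '.join(s[0])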

'film festival'

In [24]:
t

Variable containing:
   25
 1330
[torch.cuda.LongTensor of size 2x1 (GPU 0)]

In [25]:
# Set batch size to 1
m[0].bs=1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
# Put the batch size back to what it was
m[0].bs=bs

In [26]:
nexts = torch.topk(res[-1], 10)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]

['.', ',', 'and', 'in', 'that', 'at', 'on', 'for', 'is', '(']

In [27]:
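print(ss,"\n")
for i in range(50):
    # Indices of the two most likely next tokens
    n=res[-1].topk(2)[1]
    # If the top choice is index 0 (<unk>), use the runner-up instead
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    # Feed the chosen token back in to predict the one after it
    res,*_ = m(n[0].unsqueeze(0))
print('...')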

film festival 

. 

 the film is a bit of a mess , but it 's not a bad one . it 's a shame that the film is n't so much a film as a comedy . it 's a shame that the film is n't so much a comedy as ...
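
The loop above always takes the single most likely token, except that it never emits `<unk>` (index 0). Here is a minimal sketch of that selection rule in plain Python, with made-up scores, a toy vocabulary, and a hypothetical helper `pick_next`:

```python
def pick_next(scores, unk_idx=0):
    """Return the index of the highest score, never choosing unk_idx."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top, runner_up = ranked[0], ranked[1]
    return runner_up if top == unk_idx else top

itos = ["<unk>", "<pad>", "the", ",", "."]   # toy vocabulary (hypothetical)
print(itos[pick_next([0.1, 0.0, 0.6, 0.2, 0.1])])   # the
print(itos[pick_next([0.7, 0.0, 0.1, 0.2, 0.0])])   # ,
```

Always taking the top token is greedy decoding. It is deterministic, which may be why the sample above falls into repeating "it 's a shame that the film is n't so much ..."; sampling from the predicted distribution would be one way to get more varied text.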


## Sentiment

In [28]:
# Reload the field (and its vocabulary) saved during language modeling
with open(f'{PATH}models/TEXT.pkl','rb') as f:
    TEXT = pickle.load(f)

In [29]:
IMDB_LABEL = data.Field(sequential=False)
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

downloading aclImdb_v1.tar.gz


In [30]:
t = splits[0].examples[0]

In [31]:
t.label, ' '.join(t.text[:16])

('pos',
 'this short film that inspired the soon - to - be full length feature - spatula')

In [32]:
md2 = TextData.from_splits(PATH, splits, bs)

In [34]:
m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl, 
           dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
m3.load_encoder('adam2_enc')

In [35]:
m3.clip=25.
lrs=np.array([1e-4,1e-3,1e-2])

In [36]:
m3.freeze_to(-1)
m3.fit(lrs/2, 1, metrics=[accuracy])
m3.unfreeze()
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)

epoch      trn_loss   val_loss   accuracy                   
    0      1.096075   1.044184   0.436899  



epoch      trn_loss   val_loss   accuracy                    
    0      0.476723   0.35695    0.91895   



[0.35694996, 0.9189503205128206]

In [37]:
m3.fit(lrs, 7, metrics=[accuracy], cycle_len=2, cycle_save_name='imdb2')

epoch      trn_loss   val_loss   accuracy                    
    0      0.419083   0.342301   0.91895   
    1      0.386147   0.31562    0.926643                    
    2      0.389862   0.313194   0.928686                    
    3      0.364697   0.294177   0.931731                    
    4      0.357255   0.305301   0.929567                    
    5      0.334539   0.294033   0.933293                    
    6      0.355661   0.292309   0.931931                    
    7      0.324982   0.280241   0.934776                    
    8      0.33962    0.29087    0.934776                    
    9      0.325175   0.273182   0.936378                    
    10     0.330323   0.278216   0.933694                    
    11     0.323876   0.275894   0.934696                    
    12     0.317398   0.273455   0.934095                    
    13     0.317698   0.281065   0.934575                    



[0.28106526, 0.9345753205128206]

In [38]:
m3.load_cycle('imdb2', 4)

In [39]:
accuracy_np(*m3.predict_with_targs())

0.9360176282051282
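
For reference, the accuracy figure is just the fraction of reviews whose predicted class matches the label. A minimal sketch in plain Python, assuming the prediction is the class with the highest score; I have not verified that `accuracy_np` works exactly this way, and the scores, targets, and helper name below are made up.

```python
def accuracy_from_scores(scores, targets):
    """Fraction of rows whose highest-scoring class equals the target."""
    correct = 0
    for row, target in zip(scores, targets):
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += predicted == target
    return correct / len(targets)

scores = [[0.1, 2.0], [1.5, -0.3], [0.2, 0.4], [0.9, 0.1]]   # made-up data
targets = [1, 0, 0, 0]
print(accuracy_from_scores(scores, targets))   # 0.75
```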