# NLP

The [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains 50,000 labeled movie reviews from IMDB. 25,000 of the reviews are positive (a score ≥ 7 out of 10), and the other 25,000 are negative (a score ≤ 4). Our task is to train a predictive model that can read a review and decide whether it is positive or negative.

This model will be built in two stages:

1. Train a [language model](https://en.wikipedia.org/wiki/Language_model) over the movie reviews. A language model is a system that predicts the next word given the preceding word(s). Training the language model also produces embeddings specific to movie reviews for each token (usually a word) in the vocabulary (the set of tokens found in the dataset).
2. Fine-tune the language model into the classification model.
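
To make "predicts the next word given the preceding word(s)" concrete, here is a minimal sketch of the simplest possible language model: a bigram counter that predicts the most frequent follower of a single previous token. This is not the model trained in this notebook (that one is a recurrent network). The function names and the toy sentence are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count, for every token, which tokens follow it in the training text."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following

def predict_next(model, prev):
    """Return the most frequent follower of `prev`, or None if `prev` was never seen."""
    if prev not in model:
        return None
    return model[prev].most_common(1)[0][0]

# Hypothetical toy corpus, already tokenized on whitespace.
toy_tokens = "the movie was great . the movie was long . the plot was great .".split()
toy_model = train_bigram_model(toy_tokens)

print(predict_next(toy_model, "movie"))  # the only follower of "movie" is "was"
print(predict_next(toy_model, "was"))    # "great" follows "was" twice, "long" once
```

A neural language model replaces the count table with learned embeddings and a recurrent state, so it can condition on much more than one previous word.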

This notebook follows [the fastai lesson 4 IMDB notebook](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson4-imdb.ipynb) for language modeling and text sentiment classification, with my own commentary and thoughts added.

## Imports

* fastai provides specialized techniques for language modeling and fine-tuning.
* torchtext is PyTorch's NLP helper library, which is useful for data processing and datasets.
* dill is a pickle replacement for storing processed data.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle

## Data

In [2]:
PATH='data/aclImdb/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

imdbEr.txt  imdb.vocab  models/  README  test/  tmp/  train/


In [3]:
!tree -d {PATH}

data/aclImdb/
├── models
├── test
│   ├── all
│   ├── neg
│   └── pos
├── tmp
└── train
    ├── all
    ├── neg
    ├── pos
    └── unsup

11 directories


In [4]:
trn_files = !ls {TRN}
trn_files[:10]

['0_0.txt',
 '0_3.txt',
 '0_9.txt',
 '10000_0.txt',
 '10000_4.txt',
 '10000_8.txt',
 '1000_0.txt',
 '10001_0.txt',
 '10001_10.txt',
 '10001_4.txt']
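
The file names above appear to follow an `<id>_<rating>.txt` pattern. A minimal sketch of turning a file name into a sentiment label, using the ≥ 7 / ≤ 4 thresholds from the introduction, could look like this. Treating a rating of 0 as "unlabeled" (the `unsup` reviews) is an assumption about the dataset's naming convention, and both helper names are hypothetical.

```python
def parse_review_filename(name):
    """Split a name like '10001_10.txt' into (review_id, rating) as integers."""
    stem = name.rsplit('.', 1)[0]
    review_id, rating = stem.split('_')
    return int(review_id), int(rating)

def label_from_rating(rating):
    """Map a 1-10 rating to 'pos' / 'neg'; return None for anything else.

    Assumption: rating 0 marks an unlabeled review, and 5-6 never occur
    in the labeled splits.
    """
    if rating >= 7:
        return 'pos'
    if 1 <= rating <= 4:
        return 'neg'
    return None

for name in ['0_0.txt', '0_3.txt', '10001_10.txt']:
    review_id, rating = parse_review_filename(name)
    print(name, review_id, rating, label_from_rating(rating))
```

The language model itself does not need these labels; they only matter for the classification stage.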

In [5]:
review = !cat {TRN}{trn_files[6]}
review[0]

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop-socky fung-ku, but what I got instead was a comedy. So, it wasn't quite was I was expecting, but I really liked it anyway! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them!! I was laughing my ass off. I mean, the cops were just so bad! And when I say bad, I mean The Shield Vic Macky bad. But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose...man, oh man. What can you say about that hottie. She was great and put those other actresses to shame. She should work more often!!!!! I also really liked the fight scene outside of the building. That was done really well. Lots of fighting and people getting their heads banged up. FUN! Last, but not least Joe Estevez and William Smith were great as the...well, I wasn't sure what they were, but they seemed to be having fun and throwing out 

In [6]:
!find {TRN} -name '*.txt' | xargs cat | wc -w

17486581


In [7]:
!find {VAL} -name '*.txt' | xargs cat | wc -w

5686719


In [8]:
' '.join(spacy_tok(review[0]))

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop - socky fung - ku , but what I got instead was a comedy . So , it was n't quite was I was expecting , but I really liked it anyway ! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them ! ! I was laughing my ass off . I mean , the cops were just so bad ! And when I say bad , I mean The Shield Vic Macky bad . But unlike that show I was laughing when they shot people and smoked dope . \n\n Felissa Rose ... man , oh man . What can you say about that hottie . She was great and put those other actresses to shame . She should work more often ! ! ! ! ! I also really liked the fight scene outside of the building . That was done really well . Lots of fighting and people getting their heads banged up . FUN ! Last , but not least Joe Estevez and William Smith were great as the ... well , I was n't sure what they were , but they seemed t

In [9]:
TEXT = data.Field(lower=True, tokenize=spacy_tok)
TEXT

<torchtext.data.field.Field at 0x7f0f37bfdb70>

In [10]:
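bs = 64    # batch size: number of parallel text streams per batch
bptt = 65  # backprop-through-time length: approximate tokens per sequence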

In [11]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

In [12]:
# Use a context manager so the file handle is flushed and closed.
with open(f'{PATH}models/TEXT.pkl', 'wb') as f:
    pickle.dump(TEXT, f)

In [13]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(4957, 34933, 1, 20626674)
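
The four numbers are: batches per epoch, vocabulary size, number of "documents" in the training set (everything is concatenated into one), and total tokens in that document. As a rough sanity check, the batch count can be estimated from the token count, `bs`, and `bptt`. This is back-of-the-envelope arithmetic, not the loader's actual formula, so it is only expected to land near the reported value.

```python
n_tokens = 20626674  # len(md.trn_ds[0].text) from the output above
bs = 64              # batch size
bptt = 65            # nominal sequence length

# Each of the bs parallel streams gets an equal share of the token stream.
tokens_per_stream = n_tokens // bs

# Each batch consumes roughly bptt time steps from every stream.
approx_batches = tokens_per_stream / bptt

print(tokens_per_stream)         # 322291
print(round(approx_batches, 1))  # 4958.3
```

The estimate is within about one batch of the reported `len(md.trn_dl)` of 4957.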

In [14]:
# 'itos': 'int-to-string'
TEXT.vocab.itos[:12]

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'it', 'in']

In [15]:
# 'stoi': 'string to int'
TEXT.vocab.stoi['the']

2
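
The `itos` / `stoi` pair is just a list and its inverse dictionary. Below is a minimal sketch of how such a vocabulary could be built with a minimum-frequency cutoff, similar in spirit to `min_freq=10` above. It is not torchtext's implementation. The helper names, the toy sentence, and the alphabetical tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=('<unk>', '<pad>')):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    counts = Counter(tokens)
    frequent = sorted(
        (tok for tok, c in counts.items() if c >= min_freq),
        key=lambda tok: (-counts[tok], tok),  # ties broken alphabetically
    )
    itos = list(specials) + frequent
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to ids, sending out-of-vocabulary tokens to '<unk>'."""
    unk = stoi['<unk>']
    return [stoi.get(tok, unk) for tok in tokens]

toy = "the film , the plot and the cast".split()
itos, stoi = build_vocab(toy, min_freq=2)

print(itos)                                 # ['<unk>', '<pad>', 'the']
print(numericalize(['the', 'film'], stoi))  # [2, 0]
```

Rare tokens fall below the cutoff and collapse into `<unk>`, which keeps the vocabulary (and the embedding matrix) small.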

In [16]:
md.trn_ds[0].text[:12]

['this',
 'short',
 'film',
 'that',
 'inspired',
 'the',
 'soon',
 '-',
 'to',
 '-',
 'be',
 'full']

In [17]:
next(iter(md.trn_dl))

(Variable containing:
     13     57     17  ...      34    116     11
    366      8     15  ...     334     57     72
     25    146     59  ...       4   1036    306
         ...            ⋱           ...         
      7     11      3  ...      63     19      5
      2     13     24  ...    3354   2266     11
    818     25    173  ...     912     11     72
 [torch.cuda.LongTensor of size 76x64 (GPU 0)], Variable containing:
    366
      8
     15
   ⋮   
     11
      6
    675
 [torch.cuda.LongTensor of size 4864 (GPU 0)])
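
In the output above, the second row of the input (366, 8, 15, ...) reappears as the first entries of the target, and 76 × 64 = 4864. This suggests that the target is the input shifted forward by one time step and flattened. The sequence length here is 76 rather than 65, presumably because the loader varies it around `bptt`. The sketch below reproduces that layout with plain Python lists. It illustrates the idea rather than fastai's loader, and `batchify` and `get_batch` are hypothetical names.

```python
def batchify(ids, bs):
    """Arrange a flat id stream as rows of time steps with `bs` columns.

    Column j holds the j-th contiguous chunk of the stream; leftover ids
    that do not fill a whole row are dropped.
    """
    n_rows = len(ids) // bs
    ids = ids[:n_rows * bs]
    columns = [ids[j * n_rows:(j + 1) * n_rows] for j in range(bs)]
    return [[columns[j][i] for j in range(bs)] for i in range(n_rows)]

def get_batch(rows, start, seq_len):
    """Return (x, y): x is seq_len rows, y is the next-step rows, flattened."""
    seq_len = min(seq_len, len(rows) - 1 - start)
    x = rows[start:start + seq_len]
    y = [tok for row in rows[start + 1:start + 1 + seq_len] for tok in row]
    return x, y

stream = list(range(20))       # stand-in for numericalized tokens
rows = batchify(stream, bs=4)  # 5 time steps x 4 columns
x, y = get_batch(rows, start=0, seq_len=3)

print(x)  # [[0, 5, 10, 15], [1, 6, 11, 16], [2, 7, 12, 17]]
print(y)  # [1, 6, 11, 16, 2, 7, 12, 17, 3, 8, 13, 18]
```

Every position in `x` is paired with the token that follows it in its own column, which is exactly the next-word prediction target.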

In [18]:
em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

In [19]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

In [20]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)

In [None]:
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3
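
Setting `clip` caps the size of the gradients, which helps keep recurrent networks from diverging. Assuming it refers to clipping by global gradient norm (the usual technique, and an assumption here), the idea can be sketched in plain Python as follows. The helper name is hypothetical, and a real implementation would operate on tensors.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale `grads` so their L2 norm is at most `max_norm`.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=0.3)
print(norm)     # 5.0
print(clipped)  # roughly [0.18, 0.24]; direction kept, length reduced to 0.3
```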

In [None]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)

epoch      trn_loss   val_loss                                
    0      4.861833   4.727922  
 64%|██████▎   | 3160/4957 [05:25<03:05,  9.71it/s, loss=4.71]
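
The `fit` call above asks for 4 cycles, starting at 1 epoch and doubling each time. Assuming these arguments follow the SGDR (warm restarts) scheme taught in the fastai course, where the learning rate is cosine-annealed within each cycle and then reset, the schedule can be sketched as below. The function names are hypothetical and the formula is the textbook cosine annealing, not code taken from fastai.

```python
import math

def cycle_lengths(n_cycle, cycle_len, cycle_mult=1):
    """Epochs in each cycle: cycle_len, cycle_len*cycle_mult, ..."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycle)]

def cosine_lr(lr_max, step, steps_in_cycle):
    """Cosine-annealed learning rate: lr_max at step 0, near 0 at the cycle end."""
    return lr_max * 0.5 * (1 + math.cos(math.pi * step / steps_in_cycle))

lengths = cycle_lengths(n_cycle=4, cycle_len=1, cycle_mult=2)
total_epochs = sum(lengths)

print(lengths)       # [1, 2, 4, 8]
print(total_epochs)  # 15
print(cosine_lr(3e-3, 0, 100), cosine_lr(3e-3, 50, 100))
```

Under this reading, the call trains for 15 epochs in total, which explains why the run shown here was still in progress when the output was captured.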

In [None]:
learner.save_encoder('adam1_enc')

In [None]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)