# Sentiment Classification 🎟🎬📽

![Bokeh](https://images.unsplash.com/photo-1529941779042-87d8d0bdc403?ixlib=rb-0.3.5&ixid=eyJhcHBfaWQiOjEyMDd9&s=cf1566f201a7b54ad7de02457eb56309&auto=format&fit=crop&w=634&q=80)

In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle
import spacy


import os






We need to create a model that understands language, then adapt it to determine whether a given movie review is positive or negative. The model is a recurrent neural network (RNN).

We need to set up paths for the model and the training/validation data.

In [3]:
PATH='/home/ubuntu/data/aclImdb/'


TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'

TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

imdbEr.txt  imdb.vocab  models/  README  test/  tmp/  train/


Let's see what is inside the training folder:

In [3]:
trn_files = !ls {TRN}
trn_files[:10]

['0_0.txt',
 '0_3.txt',
 '0_9.txt',
 '10000_0.txt',
 '10000_4.txt',
 '10000_8.txt',
 '1000_0.txt',
 '10001_0.txt',
 '10001_10.txt',
 '10001_4.txt']

Let's peek into a review:

In [6]:
review = !cat {TRN}{trn_files[10]}
review[0]

"cat: '{TRN}{trn_files[10]}': No such file or directory"

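This person has mixed feelings towards this film! Note that the output saved above is a `cat` error, not the review text: the literal string `{TRN}{trn_files[10]}` was passed to the shell instead of a file path.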

Let us see how many words are in our training and validation sets, respectively:

In [0]:
#!find {TRN} -name '*.txt' | xargs cat | wc -w

17486581


In [0]:
#!find {VAL} -name '*.txt' | xargs cat | wc -w

5686719


TRN : 17486581

VAL: 5686719

In [9]:
#python -m spacy download en


The text needs to be tokenized before we can analyse it. Tokenization is the process of splitting text into an array of tokens (words and punctuation marks).
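To make the idea concrete, here is a minimal sketch of tokenization using only a regular expression. It is not what spaCy does internally (spaCy handles contractions, abbreviations, and much more); the `simple_tokenize` helper is a hypothetical name used for illustration only.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    A crude stand-in for a real tokenizer: runs of word characters
    become one token, and every other non-space character becomes
    its own token.
    """
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I liked it, really!")
print(tokens)
```

A real tokenizer is preferable in practice because it splits contractions and special cases in a linguistically sensible way, which this regular expression does not attempt.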

In [4]:
spacy_tok = spacy.load('en')

In [7]:
' '.join([sent.string.strip() for sent in spacy_tok(review[0])])

"cat : ' { TRN}{trn_files[10 ] } ' : No such file or directory"

Above, we called spaCy directly to tokenize the text. From here on, PyTorch's torchtext library preprocesses our data and uses spaCy to handle tokenization.

We create a torchtext `Field` that describes how to preprocess a piece of text: it makes everything lowercase and tokenizes it with spaCy.

In [8]:
TEXT = data.Field(lower=True, tokenize="spacy")

Below we set the batch size `bs` and the `bptt` (backprop through time) parameter, which defines how many tokens are processed at a time in each row of the mini-batch. It also determines how many time steps the gradients are backpropagated through. A higher `bptt` increases time and memory requirements, but lets the model handle longer sentences.

In [9]:
bs=64; bptt=70

Next we will create a ModelData object via `LanguageModelData`. We pass it the torchtext field object and the paths to our datasets (the validation set doubles as the test set here).

In [10]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

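After the object is built, it fills the `TEXT` object with a `TEXT.vocab` attribute. This vocab stores the tokens that have been seen in the text, each mapped to a unique integer ID. With `min_freq=10`, tokens that occur fewer than 10 times do not get their own ID and are mapped to `<unk>`.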

In [11]:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl', 'wb'))

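Below we have the number of batches; the number of unique tokens in the vocab; the number of items in the training dataset (one, because all the text is joined together); and the number of tokens in the training set.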

In [12]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(4583, 37392, 1, 20540756)
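The batch count can be reproduced from the other numbers. The sketch below assumes the loader splits the token stream into `bs` columns, cuts each column into chunks of `bptt` rows, and reserves one chunk so that every input has a following label; that counting rule is an assumption, although it does match the figure reported above.

```python
n_tokens = 20540756   # tokens in the training set (from the output above)
bs, bptt = 64, 70     # batch size and backprop-through-time length

# Each of the bs columns receives an equal share of the token stream;
# the few leftover tokens are dropped.
rows_per_column = n_tokens // bs

# Assumed counting rule: whole bptt-sized chunks, minus one.
n_batches = rows_per_column // bptt - 1
print(rows_per_column, n_batches)
```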

The vocab maps integer IDs to unique tokens (`itos`) and tokens back to integer IDs (`stoi`):

In [13]:
TEXT.vocab.itos[:12] #'itos' : 'int-to-string'

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'in', 'it']

In [14]:
TEXT.vocab.stoi['a'] # 'stoi' : 'string to int'

6
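The two mappings can be sketched in plain Python. This is an illustration of the idea rather than torchtext's implementation; `build_vocab` and `numericalize` are hypothetical helper names, and the special tokens are placed first to mirror the `itos` list shown above.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi) for tokens seen at least min_freq times."""
    counts = Counter(tokens)
    # Most frequent tokens first; ties are broken alphabetically.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi, unk_index=0):
    """Map tokens to IDs; unseen or rare tokens fall back to <unk>."""
    return [stoi.get(tok, unk_index) for tok in tokens]

tokens = "the movie was good , the plot was thin".split()
itos, stoi = build_vocab(tokens, min_freq=2)
ids = numericalize(tokens, stoi)
print(itos)
print(ids)
```

With `min_freq=2`, only `the` and `was` earn their own IDs here, and every other token collapses to the `<unk>` ID.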

The `LanguageModelData` object has only one item in each dataset: all the words of the text are joined together into a single sequence.

In [15]:
md.trn_ds[0].text[:12]

['picking',
 'up',
 'the',
 'jacket',
 'of',
 'this',
 'dvd',
 'in',
 'the',
 'video',
 'store',
 'i']

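The `LanguageModelData` object creates batches with 64 columns (the batch size) and roughly 70 rows (`bptt`). The exact number of rows varies slightly from batch to batch; it is 77 in the output below. The labels are the same data shifted forward by one token and flattened into a single vector.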

In [16]:
next(iter(md.trn_dl))

(Variable containing:
   3701    297      9  ...    1667     12     14
     68  13351    359  ...      43    173     33
      2     51     24  ...     472      6    324
         ...            ⋱           ...         
   5529     19      4  ...     151     17      3
     68   2915     58  ...      82   8619     24
     22     21     26  ...     340    116     14
 [torch.cuda.LongTensor of size 77x64 (GPU 0)], Variable containing:
     68
  13351
    359
   ⋮   
   2793
     28
     37
 [torch.cuda.LongTensor of size 4928 (GPU 0)])
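In the output above, the inputs have shape 77x64 and the labels are a flat vector of 4928 = 77 × 64 values, which begins with the second row of the inputs. A minimal sketch of how such batches can be cut from one long token stream follows; `batchify` and `get_batch` are hypothetical helpers written for illustration, not fastai's code.

```python
def batchify(ids, bs):
    """Arrange one long stream into rows of bs parallel columns."""
    n = len(ids) // bs                      # tokens per column; remainder dropped
    cols = [ids[i * n:(i + 1) * n] for i in range(bs)]
    # Transpose: each row holds one time step across all columns.
    return [list(row) for row in zip(*cols)]

def get_batch(rows, start, bptt):
    """Return (inputs, labels): labels are the inputs shifted by one row."""
    seq_len = min(bptt, len(rows) - 1 - start)
    x = rows[start:start + seq_len]
    # Labels are flattened row by row into a single list.
    y = [tok for row in rows[start + 1:start + 1 + seq_len] for tok in row]
    return x, y

rows = batchify(list(range(20)), bs=4)
x, y = get_batch(rows, start=0, bptt=3)
print(x)
print(y)
```

Each column is a contiguous slice of the original text, so the model reads `bs` independent passages in parallel and predicts the next token at every position.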

## Training Time!

Parameters to set:

In [17]:
em_sz = 200 # size of each embedding vector
nh = 500 # hidden activations per layer
nl = 3 # layers

We create a version of the Adam optimizer with `betas=(0.7, 0.99)` in place of PyTorch's defaults of `(0.9, 0.999)`.

In [18]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
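`partial` pre-fills some arguments of a callable and returns a new callable that accepts the rest later. The sketch below shows the mechanism with a hypothetical `make_optimizer` function standing in for an optimizer constructor, so that it runs without PyTorch.

```python
from functools import partial

def make_optimizer(params, lr=1e-3, betas=(0.9, 0.999)):
    """Hypothetical stand-in for an optimizer constructor."""
    return {"params": params, "lr": lr, "betas": betas}

# Fix betas now; params and lr are supplied later by whoever calls opt_fn.
opt_fn = partial(make_optimizer, betas=(0.7, 0.99))

opt = opt_fn(["w1", "w2"], lr=3e-3)
print(opt)
```

This is why the learner can be handed `opt_fn` before the model parameters exist: it calls `opt_fn` itself once they do.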

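The patch below lets `get_model` forward learner arguments such as `tmp_name` and `models_name`, so that we can control where temporary files and models are written. This enables us to use Kaggle's file system: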

In [0]:
LEARNER_KWARGS = [
    'tmp_name', 'models_name', 'metrics', 'clip', 'crit',
]

def get_model(self, opt_fn, emb_sz, n_hid, n_layers, **kwargs):
    # Arguments not meant for the learner go to the language model itself
    lm_kwargs = {k:v for k,v in kwargs.items() if k not in LEARNER_KWARGS}
    m = get_language_model(self.nt, emb_sz, n_hid, n_layers, self.pad_idx, **lm_kwargs)
    model = SingleModel(to_gpu(m))
    # The remaining arguments (tmp_name, models_name, ...) configure the learner
    learner_kwargs = {k:v for k,v in kwargs.items() if k in LEARNER_KWARGS}
    return RNN_Learner(self, model, opt_fn=opt_fn, **learner_kwargs)

# Replace the library's method with the patched version
LanguageModelData.get_model = get_model

In [19]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
                      dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3
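The `clip` value set above limits how large the gradients may get during training. Assuming it caps the overall gradient norm (the behaviour of PyTorch's `clip_grad_norm_`), the idea can be sketched in plain Python; `clip_by_norm` is a hypothetical helper, and the gradients are a flat list of floats for simplicity.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale grads so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)          # already small enough: unchanged
    scale = max_norm / total_norm   # shrink every component equally
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=0.3)
print(clipped)
```

The direction of the gradient is preserved; only its length is reduced, which helps keep RNN training stable when gradients spike.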

In [24]:
learner

SequentialRNN(
  (0): RNN_Encoder(
    (encoder): Embedding(37392, 200, padding_idx=1)
    (encoder_with_dropout): EmbeddingDropout(
      (embed): Embedding(37392, 200, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDrop(
        (module): LSTM(200, 500, dropout=0.05)
      )
      (1): WeightDrop(
        (module): LSTM(500, 500, dropout=0.05)
      )
      (2): WeightDrop(
        (module): LSTM(500, 200, dropout=0.05)
      )
    )
    (dropouti): LockedDropout(
    )
    (dropouths): ModuleList(
      (0): LockedDropout(
      )
      (1): LockedDropout(
      )
      (2): LockedDropout(
      )
    )
  )
  (1): LinearDecoder(
    (decoder): Linear(in_features=200, out_features=37392, bias=False)
    (dropout): LockedDropout(
    )
  )
)

In [31]:
md.nt = 34945

In [32]:
learner.load('imdb_adam3_c1_cl10_cyc_0')

RuntimeError: While copying the parameter named 0.encoder.weight, whose dimensions in the model are torch.Size([37392, 200]) and whose dimensions in the checkpoint are torch.Size([34945, 200]).

In [22]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=1, cycle_mult=2)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss                                
    0      4.871698   4.754862  



[array([4.75486])]

In [28]:
math.exp(4.75486)

116.14739139130678
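The cell above exponentiates the validation loss. Assuming the loss is the average cross-entropy per token (in nats), its exponential is the perplexity of the language model: roughly, the number of equally likely choices the model is hesitating between at each step. A small sketch, with a hypothetical `perplexity` helper and made-up probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model gave to the true tokens."""
    # Average negative log-likelihood per token (cross-entropy in nats).
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that always gives the correct token probability 1/4 behaves
# like a uniform guess among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# The validation loss reported above, converted the same way.
print(math.exp(4.75486))
```

Lower is better: a perplexity near 116 over a vocabulary of 37392 tokens means the model has narrowed its guesses considerably.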

In [46]:
learner.save_encoder('adam3_10_enc')

In [47]:
learner.load_encoder('adam3_10_enc')

In [29]:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

TypeError: no default __reduce__ due to non-trivial __cinit__

In [30]:
m=learner.model
ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
s = [TEXT.preprocess(ss)]
t=TEXT.numericalize(s)
' '.join(s[0])

". so , it was n't quite was i was expecting , but i really liked it anyway ! the best"

In [31]:
m[0].bs=1
m.eval()
m.reset()
res,*_=m(t)
m[0].bs=bs

Let's get the top 10 predictions for the next word.

In [32]:
nexts = torch.topk(res[-1],10)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]

['part',
 'thing',
 'scene',
 'of',
 'i',
 'character',
 'movie',
 'aspect',
 'one',
 'is']
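The top-k step picks the indices of the highest scores and then looks each index up in `itos`. A sketch in plain Python, with a hypothetical five-word vocabulary and made-up scores (`topk_indices` is an illustrative helper, not a torch function):

```python
def topk_indices(scores, k):
    """Indices of the k largest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

itos = ["<unk>", "part", "thing", "scene", "movie"]   # hypothetical vocab
scores = [0.1, 2.3, 1.9, 1.2, 0.4]                    # made-up model outputs

top = topk_indices(scores, 3)
words = [itos[i] for i in top]
print(words)
```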

Let's see if the model can generate more text by itself:

In [35]:
print(ss, "\n")
for i in range(50):
    n=res[-1].topk(2)[1]
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    res,*_ = m(n[0].unsqueeze(0))
print('...')

. So, it wasn't quite was I was expecting, but I really liked it anyway! The best 

part of the film , the film is a bit of a comedy . it 's a very good movie , and it 's a very good movie . <eos> i was n't expecting a movie about a man who was a little bit of a man . i was ...
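The loop above is greedy generation: at each step it takes the two best candidates, falls back to the second one if the best is index 0 (`<unk>`), and feeds the chosen token back into the model. The sketch below reproduces that control flow with a toy lookup table in place of the network. The table, vocabulary, and `generate` helper are all hypothetical, and unlike the real RNN the toy model looks only at the last token and keeps no hidden state.

```python
def generate(next_scores, start, steps, unk=0):
    """Greedy decoding that never emits the <unk> token."""
    out, tok = [], start
    for _ in range(steps):
        scores = next_scores(tok)
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        best, second = order[0], order[1]
        tok = second if best == unk else best   # skip <unk> if it wins
        out.append(tok)
    return out

itos = ["<unk>", "the", "film", "is", "good"]
table = {                       # made-up scores for the next token
    1: [0.9, 0.0, 0.8, 0.1, 0.2],
    2: [0.0, 0.1, 0.0, 0.9, 0.3],
    3: [0.2, 0.1, 0.0, 0.0, 0.7],
    4: [0.0, 0.6, 0.1, 0.2, 0.0],
}

ids = generate(lambda t: table[t], start=1, steps=4)
print(" ".join(itos[i] for i in ids))
```

Always taking the single best token tends to produce repetitive text, which is visible in the generated sample above.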


# Sentiment 

We need the saved vocab from the language model, so that the classifier uses the same token-to-ID mapping.

In [36]:
TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl','rb'))

EOFError: Ran out of input

In [37]:
IMDB_LABEL = data.Field(sequential=False)
splits = torchtext.datasets.IMDB.splits(TEXT,IMDB_LABEL,'/home/ubuntu/data/aclImdb/')

downloading aclImdb_v1.tar.gz


In [38]:
t = splits[0].examples[0]

In [41]:
p = splits[0].examples[5]

In [39]:
t.label, ' '.join(t.text[:16])

('pos',
 'this modern film noir with its off beat humour and dizzying succession of plot twists delivers')

In [40]:
t.label, ' '.join(t.text[:23])

('pos',
 'this modern film noir with its off beat humour and dizzying succession of plot twists delivers a story full of surprises , dangerous')

In [42]:
p.label, ' '.join(p.text[:23])

('pos',
 "it 's not too bad a b movie , with sanders , barrie , hale , cowen , hamilton , gargan , fitzgerald")

We create a `ModelData` object from the torchtext splits:

In [43]:
md2 = TextData.from_splits(PATH, splits, bs)

In [48]:
m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl,
                  dropout=0.1,dropouti=0.4, wdrop=0.5, dropoute=0.05,dropouth=0.3)
m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
m3.load_encoder(f'adam3_10_enc')

We will use differential learning rates to fine-tune the model. We also increase the maximum gradient used for clipping so that SGDR works better.

In [49]:
m3.clip=25.
lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2])
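Differential learning rates assign one rate per layer group, with small rates for the early, pretrained layers and larger rates for the later ones. The sketch below pairs each group with its rate in the list-of-dicts format that PyTorch optimizers accept for parameter groups. The group names are hypothetical placeholders, and `make_param_groups` is an illustrative helper rather than fastai's API.

```python
def make_param_groups(layer_groups, lrs):
    """Pair each layer group with its own learning rate."""
    if len(layer_groups) != len(lrs):
        raise ValueError("need exactly one learning rate per layer group")
    return [{"params": group, "lr": lr} for group, lr in zip(layer_groups, lrs)]

# Hypothetical layer groups, ordered from input to output.
layer_groups = [["embedding"], ["rnn_0"], ["rnn_1"], ["rnn_2"], ["classifier"]]
lrs = [1e-4, 1e-4, 1e-4, 1e-3, 1e-2]

groups = make_param_groups(layer_groups, lrs)
for g in groups:
    print(g["params"], g["lr"])
```

The pretrained encoder layers already hold useful language knowledge, so they are only nudged, while the freshly initialised classifier layers learn at a rate a hundred times higher.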

In [50]:
m3.freeze_to(-1)
m3.fit(lrs/2, 1, metrics=[accuracy])
m3.unfreeze()
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss   accuracy                    
    0      0.6021     0.654455   0.734544  



HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss   accuracy                    
    0      0.513876   0.593159   0.795688  



[array([0.59316]), 0.7956881495339909]