For sentiment classification we first need a model that understands the language. So we first train a language model that predicts the next word, and then fine-tune it to do sentiment classification.

In [0]:
!pip install Pillow==4.1.1
!pip install "fastai==0.7.0"
!pip install torchtext==0.2.3


In [0]:
from fastai.learner import *
import torchtext
from torchtext import vocab,data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *
import dill as pickle
from os import path

import spacy

In [15]:
!wget http://files.fast.ai/data/aclImdb.tgz

--2019-01-09 10:43:12--  http://files.fast.ai/data/aclImdb.tgz
Resolving files.fast.ai (files.fast.ai)... 67.205.15.147
Connecting to files.fast.ai (files.fast.ai)|67.205.15.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 145982645 (139M) [text/plain]
Saving to: ‘aclImdb.tgz’


2019-01-09 10:43:13 (103 MB/s) - ‘aclImdb.tgz’ saved [145982645/145982645]



In [0]:
!mkdir data/
!tar -xvzf aclImdb.tgz -C data/


In [27]:
PATH='data/aclImdb/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

imdbEr.txt  imdb.vocab  README  test/  train/


In [28]:
trn_files = !ls {TRN}
trn_files[:10]

['0_0.txt       1562_10.txt  24997_0.txt\t34371_0.txt  43748_0.txt  6248_7.txt',
 '0_3.txt       15621_0.txt  24998_0.txt\t3437_1.txt   43749_0.txt  6249_0.txt',
 '0_9.txt       1562_1.txt   24999_0.txt\t34372_0.txt  437_4.txt\t  6249_2.txt',
 '10000_0.txt   15622_0.txt  25000_0.txt\t34373_0.txt  43750_0.txt  6249_7.txt',
 '10000_4.txt   15623_0.txt  2500_0.txt\t34374_0.txt  4375_0.txt   624_9.txt',
 '10000_8.txt   15624_0.txt  25001_0.txt\t34375_0.txt  43751_0.txt  6250_0.txt',
 '1000_0.txt    15625_0.txt  2500_1.txt\t34376_0.txt  4375_1.txt   6250_10.txt',
 '10001_0.txt   15626_0.txt  25002_0.txt\t34377_0.txt  43752_0.txt  6250_1.txt',
 '10001_10.txt  15627_0.txt  25003_0.txt\t34378_0.txt  43753_0.txt  625_0.txt',
 '10001_4.txt   15628_0.txt  25004_0.txt\t3437_8.txt   43754_0.txt  625_10.txt']

In [29]:
# count the words in the training set
!find {TRN} -name '*.txt' | xargs cat| wc -w

17486581


In [0]:
#loading the tokenizer
spacy_tok = spacy.load('en')

In [31]:
review = !cat {TRN}{trn_files[6]}
review[0]

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop-socky fung-ku, but what I got instead was a comedy. So, it wasn't quite was I was expecting, but I really liked it anyway! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them!! I was laughing my ass off. I mean, the cops were just so bad! And when I say bad, I mean The Shield Vic Macky bad. But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose...man, oh man. What can you say about that hottie. She was great and put those other actresses to shame. She should work more often!!!!! I also really liked the fight scene outside of the building. That was done really well. Lots of fighting and people getting their heads banged up. FUN! Last, but not least Joe Estevez and William Smith were great as the...well, I wasn't sure what they were, but they seemed to be having fun and throwing out 

In [32]:
' '.join([sent.string.strip() for sent in spacy_tok(review[0])])

"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop - socky fung - ku , but what I got instead was a comedy . So , it was n't quite was I was expecting , but I really liked it anyway ! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them ! ! I was laughing my ass off . I mean , the cops were just so bad ! And when I say bad , I mean The Shield Vic Macky bad . But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose ... man , oh man . What can you say about that hottie . She was great and put those other actresses to shame . She should work more often ! ! ! ! ! I also really liked the fight scene outside of the building . That was done really well . Lots of fighting and people getting their heads banged up . FUN ! Last , but not least Joe Estevez and William Smith were great as the ... well , I was n't sure what they were , but they see

In [0]:
# First, we create a torchtext field, which describes how to preprocess a piece of text - in this case, 
# we tell torchtext to make everything lowercase, and tokenize it with spacy.

TEXT = data.Field(lower=True, tokenize="spacy")

In [0]:
# bptt (back-propagation through time) defines how many tokens are processed at a time in each row of the mini-batch.
# More importantly, it defines how many 'layers' we backprop through. Making this number higher increases time
# and memory requirements, but improves the model's ability to handle long sentences.
# In other words, it is how long a sequence we put on the GPU at once.

bs = 64
bptt = 70

In [35]:
%%time
# %%time is a cell magic, so it has to be the first line of the cell.

# Create a ModelData object for language modeling with LanguageModelData, passing it our torchtext field object
# and the paths to our training, test, and validation sets.
# We don't have a separate test set here, so we just use VAL_PATH for that too.

# min_freq=10: in a moment we are going to replace words with integers (a unique index for every word).
# Any word that occurs fewer than 10 times is simply mapped to unknown.

FILES = dict(train=TRN_PATH, test=VAL_PATH, validation=VAL_PATH)
model_data = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

# After building our ModelData object, it automatically fills the TEXT object with a very important attribute: TEXT.vocab. 
# This is a vocabulary, which stores which words (or tokens) have been seen in the text, 
# and how each word will be mapped to a unique integer id. We'll need to use this information again later, so we save it.



CPU times: user 4min 52s, sys: 9.36 s, total: 5min 1s
Wall time: 5min 1s


In [0]:
# save the info

!mkdir {PATH}models/
pickle.dump(TEXT,open(f'{PATH}models/TEXT.pkl','wb'))

# (Technical note: python's standard Pickle library can't handle this correctly, 
# so at the top of this notebook we used the dill library instead and imported it as pickle)

In [37]:
# number of batches; number of unique tokens in the vocab; number of items in the training dataset
# (just 1: the whole corpus joined together); number of tokens in that single item
len(model_data.trn_dl), model_data.nt, len(model_data.trn_ds), len(model_data.trn_ds[0].text)

(4583, 37392, 1, 20540756)

In [38]:
# itos: integer-to-string mapping (part of torchtext), sorted by frequency,
# except for the first two entries, which are the special tokens <unk> and <pad>
TEXT.vocab.itos[:12]

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'in', 'it']

In [39]:
# stoi: string-to-integer mapping (the reverse of itos)
TEXT.vocab.stoi['the']

2

In [40]:
# Note that in a LanguageModelData object there is only one item in each dataset: all the words of the text joined together.

model_data.trn_ds[0].text[:12]


['again',
 ',',
 'we',
 'see',
 'what',
 'could',
 'be',
 'a',
 'really',
 'good',
 'movie',
 'fail']

In [0]:
# map the tokens to their integer ids
TEXT.numericalize([model_data.trn_ds[0].text[:12]])
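
To make the vocab and the numericalize step more concrete, here is a minimal pure-Python sketch of the idea. It is not torchtext's actual implementation: `build_vocab`, `numericalize` and the toy sentence are hypothetical names and data made up for illustration, and the tie-breaking rule is an assumption.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi); tokens seen fewer than min_freq times are left out."""
    counts = Counter(tokens)
    # Most frequent first; ties broken alphabetically (an assumption, for determinism).
    frequent = sorted(
        (tok for tok, n in counts.items() if n >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    itos = list(specials) + frequent          # integer -> string
    stoi = {tok: i for i, tok in enumerate(itos)}  # string -> integer
    return itos, stoi

def numericalize(tokens, stoi, unk="<unk>"):
    """Map tokens to ids; anything outside the vocab gets the unknown id."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

toy_tokens = "the movie was good , the plot was thin , the end".split()
itos, stoi = build_vocab(toy_tokens, min_freq=2)
ids = numericalize("the movie was bad".split(), stoi)
```

Here 'movie' and 'bad' both end up with the id of '<unk>': the first is too rare for min_freq=2 and the second was never seen at all.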


## BPTT and Batch Size

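The whole corpus is one long stream of tokens. It is split into bs (64) equal segments, and the segments are laid side by side as columns, so the resulting matrix has 64 columns and roughly (number of tokens / bs) rows.

Each mini-batch is then a block of about bptt (70) consecutive rows of that matrix:

  rows 0-69    x 64 columns  ->  batch 1
  rows 70-139  x 64 columns  ->  batch 2
  .
  .
  and so on, down to the last row (about number of tokens / bs rows in total)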
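
The layout can be sketched in a few lines of plain Python. This is only an illustration of the idea under simple assumptions (fixed bptt, leftover tokens dropped); `batchify` and `iter_batches` are hypothetical helpers, not fastai or torchtext functions.

```python
def batchify(ids, bs):
    """Split a flat list of ids into bs equal columns, dropping the remainder."""
    n_rows = len(ids) // bs
    ids = ids[: n_rows * bs]
    # Column j is the j-th contiguous segment of the stream.
    columns = [ids[j * n_rows : (j + 1) * n_rows] for j in range(bs)]
    # Transpose: each row now holds one time step across all columns.
    return [list(row) for row in zip(*columns)]

def iter_batches(matrix, bptt):
    """Yield consecutive blocks of at most bptt rows."""
    for start in range(0, len(matrix), bptt):
        yield matrix[start : start + bptt]

stream = list(range(26))         # stand-in for 26 token ids
matrix = batchify(stream, bs=4)  # 6 rows x 4 columns; ids 24 and 25 are dropped
batches = list(iter_batches(matrix, bptt=3))
```

With the real numbers (about 20.5 million tokens, bs = 64, bptt = 70) the same arithmetic gives roughly 320,000 rows and a bit under 4,600 batches, which is in line with the 4583 batches reported above.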

In [0]:
# get batch of data 
next(iter(model_data.trn_dl))

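# The loader randomly varies the sequence length around bptt from batch to batch. We can't shuffle
# the words the way we shuffle images, so this is how each epoch sees slightly different chunks of text.

# The first tensor holds the inputs: the 1st column is the first 75 tokens of the 1st segment
# (this particular batch is 75 rows long rather than exactly bptt),
# the 2nd column is the first 75 tokens of the 2nd segment, and so on.

# The second tensor, printed below the first, holds the targets. It is flattened
# for technical reasons, but the tokens keep the same order.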
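
A language-model batch pairs each input token with the token that follows it. The sketch below shows that pairing on a tiny hand-made matrix; `get_batch` is a hypothetical helper written for this note, and the exact shapes fastai returns may differ.

```python
def get_batch(matrix, start, seq_len):
    """Return (inputs, flat_targets) for seq_len rows starting at row `start`."""
    # Keep one row spare so every input row has a "next token" row.
    seq_len = min(seq_len, len(matrix) - 1 - start)
    inputs = matrix[start : start + seq_len]
    shifted = matrix[start + 1 : start + 1 + seq_len]
    # Flatten the targets row by row; the token order is unchanged.
    targets = [tok for row in shifted for tok in row]
    return inputs, targets

# 4 rows x 2 columns: column 0 is the stream 0,1,2,3 and column 1 is 10,11,12,13.
matrix = [[0, 10], [1, 11], [2, 12], [3, 13]]
inputs, targets = get_batch(matrix, start=0, seq_len=2)
```

Reading down a column of inputs and the matching positions in targets, every target is simply the next token of the same segment.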


### Now that we have a ModelData object that can feed us batches, we can build our model

In [0]:
# Generally, an embedding size for a word will be somewhere between 50 and 600.


em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

In [0]:
# Researchers have found that large amounts of momentum (which we’ll learn about later) don’t work well with these kinds
# of RNN models, so we create a version of the Adam optimizer with less momentum than its default of 0.9.
# Any time you are doing NLP, you should probably include this line:

opt_fn = partial(optim.Adam, betas=(0.7, 0.99)) #optimizer function
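
In Adam, the first beta is the decay rate of a running average of past gradients, which is what plays the role of momentum. The sketch below is a simplified illustration (no bias correction, no second-moment term, and `ema` and `steps_to_flip` are hypothetical helpers): it counts how many steps the running average needs to change sign after the gradient itself flips.

```python
def ema(values, beta):
    """Exponential moving average with decay rate beta, starting from 0."""
    avg, out = 0.0, []
    for v in values:
        avg = beta * avg + (1 - beta) * v
        out.append(avg)
    return out

def steps_to_flip(beta, n_before=10, n_after=20):
    """Steps after the gradient flips from +1 to -1 until the average turns negative."""
    grads = [1.0] * n_before + [-1.0] * n_after
    averaged = ema(grads, beta)
    for step, value in enumerate(averaged[n_before:], start=1):
        if value < 0:
            return step
    return None  # did not flip within n_after steps

fast = steps_to_flip(beta=0.7)  # the value used in this notebook
slow = steps_to_flip(beta=0.9)  # Adam's default
```

The lower beta forgets old gradients sooner, so `fast` comes out smaller than `slow`.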


In [0]:
# fastai uses a variant of the state of the art AWD LSTM Language Model developed by Stephen Merity. 
# A key feature of this model is that it provides excellent regularization through Dropout. 
# There is no simple way known (yet!) to find the best values of the dropout parameters below - 
# you just have to experiment...

# However, the other parameters (alpha, beta, and clip) shouldn't generally need tuning.

learner = model_data.get_model(opt_fn, em_sz, nh, nl, dropouti=0.05, dropout=0.05, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip = 0.3

# If you build an NLP model and it is under-fitting, decrease all of these dropouts; if it is over-fitting,
# increase them all, keeping roughly this ratio between them.
# learner.clip = 0.3 turns on gradient clipping: the gradients are capped at 0.3 before they are multiplied
# by the learning rate, so a single step cannot change the weights by an arbitrarily large amount.
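
One common form of gradient clipping rescales the whole gradient vector whenever its norm exceeds a threshold. The sketch below assumes that norm-based variant; the notebook itself does not show which variant fastai applies, and `clip_by_norm` is a hypothetical helper.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)       # small gradients pass through untouched
    scale = max_norm / total
    return [g * scale for g in grads]

big = clip_by_norm([3.0, 4.0], max_norm=0.3)    # norm 5.0 is scaled down to 0.3
small = clip_by_norm([0.1, 0.2], max_norm=0.3)  # norm is already below 0.3
```

The direction of the gradient is preserved; only its length is capped.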

In [0]:
%%time
learner.lr_find()

In [0]:
learner.sched.plot_lr()

In [0]:
# wds = weight decay; the second positional argument is the number of cycles
# original settings: learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)
learner.fit(3e-1, 2, wds=1e-2, cycle_len=1, cycle_mult=2)
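
With cycle_len and cycle_mult, each cycle is cycle_mult times longer than the previous one, so the total number of epochs is not obvious at a glance. A small sketch of the arithmetic (`total_epochs` is a hypothetical helper, not a fastai function):

```python
def total_epochs(n_cycle, cycle_len, cycle_mult=1):
    """Sum the lengths of n_cycle cycles, each cycle_mult times the previous one."""
    return sum(cycle_len * cycle_mult ** i for i in range(n_cycle))

this_run = total_epochs(n_cycle=2, cycle_len=1, cycle_mult=2)  # cycles of 1 and 2 epochs
original = total_epochs(n_cycle=4, cycle_len=1, cycle_mult=2)  # cycles of 1, 2, 4 and 8 epochs
```

So the call above trains for 3 epochs, while the original settings would have trained for 15.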


In [0]:
learner.save_encoder('adam1_enc')

In [0]:
learner.load_encoder('adam1_enc')

In [0]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)

In [0]:
# In the sentiment analysis section, we'll just need half of the language model - the encoder, so we save that part.

learner.save_encoder('adam3_10_enc')

In [0]:
learner.load_encoder('adam3_10_enc')

In [0]:
# Language-model quality is usually reported as perplexity, which is simply exp() of the loss function we used.
# Our loss here is about 4.165, so the perplexity is:
math.exp(4.165)
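
To get a feel for what perplexity means, it helps to compute it from per-token probabilities. The sketch below assumes the loss is the average negative log-probability of the correct next token; `perplexity` is a hypothetical helper and the probabilities are made up.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability given to the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that always spreads its guess evenly over 8 words has perplexity 8.
uniform = perplexity([1 / 8] * 5)

# The loss from this notebook, converted the same way.
from_loss = math.exp(4.165)
```

A loss of 4.165 therefore corresponds to a perplexity of about 64: on average the model is as uncertain as if it were choosing evenly among roughly 64 words.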


In [0]:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))