[View in Colaboratory](https://colab.research.google.com/github/AmoDinho/fastai/blob/master/%20%20courses/dl1/%20kernel.ipynb)

# Sentiment Classification 🎟🎬📽

![Bokeh](https://images.unsplash.com/photo-1529941779042-87d8d0bdc403?ixlib=rb-0.3.5&ixid=eyJhcHBfaWQiOjEyMDd9&s=cf1566f201a7b54ad7de02457eb56309&auto=format&fit=crop&w=634&q=80)

In [1]:
!wget -NS --content-disposition "https://console.clouderizer.com/givemeinitsh/vkZGDOvx" && bash ./clouderizer_init.sh

MessageError: ignored

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle
import spacy


import os




We need to create a model that understands language, then configure it to determine whether a given movie review is positive or negative. The model is a recurrent neural network (RNN).

We need to set up paths for the model and the training/validation data:

In [1]:
PATH=''


TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'

TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

clouderizer_init.sh  install_clouderizer_service.sh  wget-log    wget-log.5
clouderizer.jar      __pycache__/                    wget-log.1  wget-log.6
clouderizer.service  sample_data/                    wget-log.2  wget-log.7
colab_init.sh        updatestatus                    wget-log.3
colab.py             updatestatus.1                  wget-log.4


Let's see what is inside the training folder:

In [0]:
trn_files = !ls {TRN}
trn_files[:10]

['0_0.txt',
 '0_3.txt',
 '0_9.txt',
 '10000_0.txt',
 '10000_4.txt',
 '10000_8.txt',
 '10001_0.txt',
 '10001_10.txt',
 '10001_4.txt',
 '10002_0.txt']

Let's peek into a review:

In [0]:
review = !cat {TRN}{trn_files[10]}
review[0]

'Sorry everyone,,, I know this is supposed to be an "art" film,, but wow, they should have handed out guns at the screening so people could blow their brains out and not watch. Although the scene design and photographic direction was excellent, this story is too painful to watch. The absence of a sound track was brutal. The loooonnnnng shots were too long. How long can you watch two people just sitting there and talking? Especially when the dialogue is two people complaining. I really had a hard time just getting through this film. The performances were excellent, but how much of that dark, sombre, uninspired, stuff can you take? The only thing i liked was Maureen Stapleton and her red dress and dancing scene. Otherwise this was a ripoff of Bergman. And i\'m no fan f his either. I think anyone who says they enjoyed 1 1/2 hours of this is,, well, lying.'

This person has mixed feelings towards this film!

Let us see how many words are in our training and validation sets, respectively:

In [0]:
#!find {TRN} -name '*.txt' | xargs cat | wc -w

17486581


In [0]:
#!find {VAL} -name '*.txt' | xargs cat | wc -w

5686719


TRN : 17486581

VAL: 5686719
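For readers without a Unix shell, the `find ... | xargs cat | wc -w` pipeline above can be approximated in pure Python. This is only a sketch: `count_words` is a hypothetical helper, and it is demonstrated on a throwaway directory rather than on the review data, so it does not reproduce the counts above.

```python
from pathlib import Path
import tempfile

def count_words(folder):
    """Count whitespace-separated words in every .txt file under `folder`."""
    total = 0
    for path in Path(folder).rglob('*.txt'):
        # split() with no arguments splits on any run of whitespace, like `wc -w`
        total += len(path.read_text(encoding='utf-8', errors='ignore').split())
    return total

# Tiny demonstration on a temporary directory (not the review data).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'a.txt').write_text('a great film', encoding='utf-8')
    Path(tmp, 'b.txt').write_text('not my favourite', encoding='utf-8')
    demo_total = count_words(tmp)

print(demo_total)
```

Calling `count_words(TRN)` and `count_words(VAL)` should give figures comparable to the shell commands, assuming the files are plain text.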

The text needs to be tokenized before we can analyse it. Tokenization is the process of splitting a sentence into an array of tokens: words and punctuation marks.
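To make the idea concrete, here is a minimal regex-based sketch of tokenization. It is not how spaCy works internally; `simple_tokenize` is a hypothetical helper that lowercases the text and separates words from punctuation. Unlike spaCy, it keeps contractions such as "i'm" as a single token.

```python
import re

def simple_tokenize(text):
    # Lowercase, then emit either a word (optionally with an internal
    # apostrophe, e.g. "i'm") or a single punctuation character.
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[^\sa-z0-9]", text.lower())

tokens = simple_tokenize('Sorry everyone,,, I know this is supposed to be an "art" film')
print(tokens)
```

Each comma and quotation mark becomes its own token, which is the same behaviour visible in the spaCy output.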

In [0]:
spacy_tok = spacy.load('en')

In [0]:
' '.join([sent.string.strip() for sent in spacy_tok(review[0])])

'Sorry everyone , , , I know this is supposed to be an " art " film , , but wow , they should have handed out guns at the screening so people could blow their brains out and not watch . Although the scene design and photographic direction was excellent , this story is too painful to watch . The absence of a sound track was brutal . The loooonnnnng shots were too long . How long can you watch two people just sitting there and talking ? Especially when the dialogue is two people complaining . I really had a hard time just getting through this film . The performances were excellent , but how much of that dark , sombre , uninspired , stuff can you take ? The only thing i liked was Maureen Stapleton and her red dress and dancing scene . Otherwise this was a ripoff of Bergman . And i \'m no fan f his either . I think anyone who says they enjoyed 1 1/2 hours of this is , , well , lying .'

Above, we called spaCy directly to tokenize the review. PyTorch's torchtext library can also preprocess our data, using spaCy to handle tokenization.

We are going to have torchtext create a field that describes how to preprocess a piece of text: it will make everything lowercase and tokenize it with spaCy!

In [0]:
TEXT = data.Field(lower=True, tokenize="spacy")

Below we have the batch size parameter `bs` and the `bptt` (backprop through time) parameter, which defines the number of tokens processed at a time in each row of the mini-batch. It also specifies how many time steps (unrolled "layers") the gradients are backpropagated through. A higher `bptt` increases time and memory requirements, but lets the model handle longer sentences.

In [0]:
bs=64; bptt=70

Next we will create a ModelData object via `LanguageModelData`. We pass it the torchtext field object and the paths to our datasets.

In [0]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

After the object is built, it fills the `TEXT` object with a `TEXT.vocab` attribute. This vocab stores the tokens that have been seen in the text and maps each one to a unique integer id.

In [0]:
with open(f'{PATH_WRITE}models/TEXT.pkl', 'wb') as f:  # PATH_WRITE must already point to a writable directory
    pickle.dump(TEXT, f)

Below we have the number of batches, the number of unique tokens in the vocab, the number of items in the training dataset (just 1, as explained below), and the number of tokens in that item:

In [0]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(4583, 37392, 1, 20540756)
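As a rough sanity check, the number of batches should be close to the token count divided by `bs * bptt`. The arithmetic below is only an approximation under that assumption; the exact count reported above (4583) depends on fastai's loader internals, which are not shown here.

```python
n_tokens = 20540756   # len(md.trn_ds[0].text) from the output above
bs, bptt = 64, 70

# The token stream is split into `bs` parallel columns...
tokens_per_column = n_tokens // bs
# ...and each batch consumes roughly `bptt` rows of those columns.
approx_batches = tokens_per_column // bptt

print(tokens_per_column, approx_batches)
```

The estimate lands within one batch of the value printed by `len(md.trn_dl)`.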

We can now map integer ids to tokens, and tokens back to integer ids:

In [0]:
TEXT.vocab.itos[:12] #'itos' : 'int-to-string'

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'in', 'it']

In [0]:
TEXT.vocab.stoi['a'] # 'stoi' : 'string to int'

6
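The two lookups can be illustrated with a small hand-rolled vocabulary. This is a sketch of the general idea, not torchtext's implementation: `build_vocab` is a hypothetical helper, and the ordering rule (most frequent first, ties broken alphabetically) is an assumption made for the example.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=('<unk>', '<pad>')):
    counts = Counter(tokens)
    # Most frequent tokens first; ties broken alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # itos: list index -> token; rare tokens (below min_freq) are left out.
    itos = list(specials) + [tok for tok, c in ordered if c >= min_freq]
    # stoi: token -> list index.
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

tokens = 'the film is the best film of the year'.split()
itos, stoi = build_vocab(tokens, min_freq=2)

# Tokens missing from the vocab fall back to the '<unk>' id.
ids = [stoi.get(t, stoi['<unk>']) for t in tokens]
print(itos, ids)
```

Here only `the` and `film` occur at least twice, so every other word is mapped to `<unk>`, which is the role `min_freq=10` plays in the `LanguageModelData` call.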

The `LanguageModelData` object has only one item in each dataset: all the words of the text joined together.

In [0]:
md.trn_ds[0].text[:12]

['this',
 'movie',
 'is',
 'as',
 'unique',
 'as',
 'it',
 'is',
 'overlooked',
 '......',
 'a',
 'different']

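The `LanguageModelData` object creates batches with 64 columns (the batch size) and a sequence length that varies around `bptt`; the batch below has 80 rows. The labels are the same data shifted forward by one token and flattened, so the model learns to predict the next word.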

In [0]:
next(iter(md.trn_dl))

(Variable containing:
     13     24     77  ...      19     96      2
     23     50      9  ...       2     47     63
      9     16    340  ...     242   1745      7
         ...            ⋱           ...         
     23     20    340  ...     573   3263     20
     19      2      3  ...    1045     20    246
   1477     25     12  ...       3    216     20
 [torch.cuda.LongTensor of size 80x64 (GPU 0)], Variable containing:
     23
     50
      9
   ⋮   
      5
    413
     13
 [torch.cuda.LongTensor of size 5120 (GPU 0)])
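The layout of the batch above can be reproduced on a toy stream of token ids. The sketch below uses plain Python lists and two hypothetical helpers, `batchify` and `get_batch`; it illustrates the column layout and the one-token shift, not fastai's actual loader.

```python
def batchify(token_ids, bs):
    """Split one long token stream into `bs` parallel columns, returned row by row."""
    n_rows = len(token_ids) // bs
    trimmed = token_ids[: n_rows * bs]          # drop the leftover tail
    columns = [trimmed[i * n_rows:(i + 1) * n_rows] for i in range(bs)]
    # rows[r][c] is the r-th token of column c
    return [[col[r] for col in columns] for r in range(n_rows)]

def get_batch(rows, start, seq_len):
    """Return inputs (seq_len x bs) and flattened targets shifted by one row."""
    seq_len = min(seq_len, len(rows) - 1 - start)
    x = rows[start:start + seq_len]
    y_rows = rows[start + 1:start + 1 + seq_len]
    y = [tok for row in y_rows for tok in row]  # flatten, as in the output above
    return x, y

stream = list(range(20))          # stand-in for 20 token ids
rows = batchify(stream, bs=4)     # 5 rows x 4 columns
x, y = get_batch(rows, start=0, seq_len=3)
print(x)
print(y)
```

The first values of `y` equal the second row of `x`, mirroring how the label tensor above starts with 23, 50, 9, the second row of the input tensor.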

## Training Time!

Parameters to set:

In [0]:
em_sz = 200 # size of each embedding vector
nh = 500 # hidden activations per layer
nl = 3 # layers

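We need to create a version of the Adam optimizer with custom `betas`; the first value (0.7) is lower than PyTorch's default of 0.9, which reduces the momentum.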

In [0]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
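`partial` fixes some arguments of a callable ahead of time and returns a new callable that accepts the rest. The sketch below shows the mechanism with a hypothetical stand-in, `make_optimizer`, instead of `optim.Adam`, so that it runs without PyTorch.

```python
from functools import partial

def make_optimizer(params, lr=1e-3, betas=(0.9, 0.999)):
    # Hypothetical stand-in for optim.Adam: it only records its configuration.
    return {'params': params, 'lr': lr, 'betas': betas}

# Pre-fill `betas`; the learner can later supply params and lr.
opt_fn = partial(make_optimizer, betas=(0.7, 0.99))
opt = opt_fn([1.0, 2.0], lr=3e-3)
print(opt)
```

This is why `opt_fn` can be handed to the learner before any model parameters exist.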

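The code below overrides `get_model` so that learner-specific keyword arguments, such as `tmp_name` and `models_name`, are passed to the learner rather than to the language model. This enables us to use Kaggle's file system: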

In [0]:
LEARNER_KWARGS = [
    'tmp_name', 'models_name', 'metrics', 'clip', 'crit',
]

def get_model(self, opt_fn, emb_sz, n_hid, n_layers, **kwargs):
    lm_kwargs = {k:v for k,v in kwargs.items() if k not in LEARNER_KWARGS}
    m = get_language_model(self.nt, emb_sz, n_hid, n_layers, self.pad_idx, **lm_kwargs)
    model = SingleModel(to_gpu(m))
    learner_kwargs = {k:v for k,v in kwargs.items() if k in LEARNER_KWARGS}
    return RNN_Learner(self, model, opt_fn=opt_fn, **learner_kwargs)

LanguageModelData.get_model = get_model
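The core of `get_model` is the split of one `**kwargs` dictionary into two groups. The following sketch isolates that step; `split_kwargs` and the example values are hypothetical and exist only for illustration.

```python
LEARNER_KEYS = ['tmp_name', 'models_name', 'metrics', 'clip', 'crit']

def split_kwargs(kwargs, learner_keys):
    """Separate learner options from language-model options."""
    learner = {k: v for k, v in kwargs.items() if k in learner_keys}
    model = {k: v for k, v in kwargs.items() if k not in learner_keys}
    return model, learner

# Example values are placeholders, not the notebook's real paths.
model_kw, learner_kw = split_kwargs(
    {'dropout': 0.05, 'wdrop': 0.1, 'tmp_name': 'tmp', 'models_name': 'models'},
    LEARNER_KEYS,
)
print(model_kw, learner_kw)
```

Dropout settings end up with the model, while the output locations end up with the learner.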

In [0]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05,
                       tmp_name=TMP_PATH, models_name=MODELS_PATH)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

In [0]:
learner

SequentialRNN(
  (0): RNN_Encoder(
    (encoder): Embedding(37392, 200, padding_idx=1)
    (encoder_with_dropout): EmbeddingDropout(
      (embed): Embedding(37392, 200, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDrop(
        (module): LSTM(200, 500, dropout=0.05)
      )
      (1): WeightDrop(
        (module): LSTM(500, 500, dropout=0.05)
      )
      (2): WeightDrop(
        (module): LSTM(500, 200, dropout=0.05)
      )
    )
    (dropouti): LockedDropout(
    )
    (dropouths): ModuleList(
      (0): LockedDropout(
      )
      (1): LockedDropout(
      )
      (2): LockedDropout(
      )
    )
  )
  (1): LinearDecoder(
    (decoder): Linear(in_features=200, out_features=37392, bias=False)
    (dropout): LockedDropout(
    )
  )
)

In [0]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)

  0%|          | 0/4583 [00:00<?, ?it/s]


AssertionError: 