<a href="https://colab.research.google.com/github/Aravinda89/fastai_bootcamp/blob/main/Gayan_DL201_10_nlp_own_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Deep Dive - Own Code

Own refactored code and notes for *Chapter 10: NLP Deep Dive: RNNs* ([`10_nlp.ipynb`](https://colab.research.google.com/github/vtecftwy/fastbook/blob/master/10_nlp.ipynb)).

## Instructions

It is recommended that you work in two steps:
1. Copy the code from the fastbook notebook and make sure it works.
2. Refactor (i.e. rewrite the code in your own style) by
    - regrouping things in a way that makes sense to you
    - adding text cells that explain what the code does in your own words, with references to any documentation you consulted
    - deleting code that was only there to explain things and is not required once you run the models end to end

When you have done that, you have a customized reference notebook that you can consult later when you have forgotten the details, without having to read the full notebook from fastbook.

## Your code

In [None]:
!pip install -Uqq fastbook
import fastbook
# fastbook.setup_book(bind_drive=False)

from fastbook import *
from IPython.display import display,HTML


In [None]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)

In [None]:
path

Path('/root/.fastai/data/imdb')

In [None]:
path.absolute()

Path('/root/.fastai/data/imdb')

In [None]:
# 1. List all the folders under path (using the path.iterdir() method)
print(f"path to dataset: {path.absolute()}")
[f"{'file:  ' if p.is_file() else 'folder:'} {p.name}" for p in path.iterdir()]

path to dataset: /root/.fastai/data/imdb


['file:   README',
 'folder: train',
 'folder: tmp_clas',
 'folder: unsup',
 'folder: tmp_lm',
 'folder: test',
 'file:   imdb.vocab']

In [None]:
# 2. Get the full text of README
with open(path/'README', mode='r') as f:
    txt = f.readlines()
print(''.join(txt))

Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided. 

Dataset 

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning. 

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their associated with observed labels.  In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a scor

In [None]:
# List the folders and list the files
print('Folders:')
display([p.name for p in path.iterdir() if p.is_dir()])
print('Files:')
display([p.name for p in path.iterdir() if p.is_file()])

Folders:


['train', 'tmp_clas', 'unsup', 'tmp_lm', 'test']

Files:


['README', 'imdb.vocab']

In [None]:
# Content of the training set (in train folder), test/validation set (in test folder) and in unsupervised (excluding text files)
[p.name for p in (path/'train').iterdir()], [p.name for p in (path/'test').iterdir()], [p.name for p in (path/'unsup').iterdir() if 'txt' not in p.suffix]

(['neg', 'pos', 'labeledBow.feat', 'unsupBow.feat'],
 ['neg', 'pos', 'labeledBow.feat'],
 [])

In [None]:
# First five training files in the positive (pos) and negative (neg) review folders. As mentioned in the README, file names follow the format id_rating.txt
[p.name for p in (path/'train/pos').iterdir()][:5], [p.name for p in (path/'train/neg').iterdir()][:5]

(['11302_8.txt', '7374_10.txt', '1971_9.txt', '10207_10.txt', '4016_10.txt'],
 ['7594_2.txt', '4828_1.txt', '4232_1.txt', '9133_2.txt', '10783_2.txt'])

In [None]:
# First five test files in the positive (pos) and negative (neg) review folders. As mentioned in the README, file names follow the format id_rating.txt
[p.name for p in (path/'test/pos').iterdir()][:5], [p.name for p in (path/'test/neg').iterdir()][:5]

(['9667_8.txt', '8396_10.txt', '7150_10.txt', '1561_10.txt', '10207_10.txt'],
 ['4676_1.txt', '7594_2.txt', '6474_1.txt', '2282_2.txt', '9061_4.txt'])

In [None]:
# First five files in the unsup folder (unlabeled reviews). As mentioned in the README, file names follow the format id_rating.txt, where rating is always 0
[p.name for p in (path/'unsup').iterdir()][:5]

['5123_0.txt', '24961_0.txt', '7179_0.txt', '37452_0.txt', '27840_0.txt']
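The `id_rating.txt` naming convention seen in the listings above can be unpacked with plain `pathlib`. This is a minimal sketch; `parse_review_name` is a hypothetical helper written for these notes, not part of fastai.

```python
from pathlib import Path

def parse_review_name(file_path):
    """Split a file name such as '11302_8.txt' into (review_id, rating)."""
    # .stem drops the '.txt' suffix, leaving 'id_rating'
    review_id, rating = Path(file_path).stem.split("_")
    return int(review_id), int(rating)

print(parse_review_name("train/pos/11302_8.txt"))  # (11302, 8)
print(parse_review_name("unsup/5123_0.txt"))       # (5123, 0)
```

The rating is not needed for the models trained later in this notebook, since the label comes from the `pos`/`neg` folder name, but it is handy for sanity-checking the data.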

In [None]:
files = get_text_files(path, folders = ['train', 'test', 'unsup'])
files

(#100000) [Path('/root/.fastai/data/imdb/train/neg/7594_2.txt'),Path('/root/.fastai/data/imdb/train/neg/4828_1.txt'),Path('/root/.fastai/data/imdb/train/neg/4232_1.txt'),Path('/root/.fastai/data/imdb/train/neg/9133_2.txt'),Path('/root/.fastai/data/imdb/train/neg/10783_2.txt'),Path('/root/.fastai/data/imdb/train/neg/9061_4.txt'),Path('/root/.fastai/data/imdb/train/neg/5695_1.txt'),Path('/root/.fastai/data/imdb/train/neg/8920_1.txt'),Path('/root/.fastai/data/imdb/train/neg/6956_2.txt'),Path('/root/.fastai/data/imdb/train/neg/10118_4.txt')...]

In [None]:
len(files)

100000

### Tokenization

In [None]:
txt = files[1].open().read()
txt[:150]

'I picked up TRAN SCAN from the library and brought it home. We have considered taking a trip out east and thought it would give us a feel of what it w'

In [None]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

(#257) ['I','picked','up','TRAN','SCAN','from','the','library','and','brought','it','home','.','We','have','considered','taking','a','trip','out','east','and','thought','it','would','give','us','a','feel','of'...]


In [None]:
len(toks)

257

In [None]:
first(spacy(['The U.S. dollar $1 is $1.00.']))

(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']

In [None]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn('The U.S. dollar $1 is $1.00.'),20))

(#13) ['xxbos','xxmaj','the','xxup','u.s','.','dollar','$','1','is','$','1.00','.']


In [None]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

(#288) ['xxbos','i','picked','up','xxup','tran','xxup','scan','from','the','library','and','brought','it','home','.','xxmaj','we','have','considered','taking','a','trip','out','east','and','thought','it','would','give','us'...]


In [None]:
defaults.text_proc_rules

[<function fastai.text.core.fix_html>,
 <function fastai.text.core.replace_rep>,
 <function fastai.text.core.replace_wrep>,
 <function fastai.text.core.spec_add_spaces>,
 <function fastai.text.core.rm_useless_spaces>,
 <function fastai.text.core.replace_all_caps>,
 <function fastai.text.core.replace_maj>,
 <function fastai.text.core.lowercase>]

In [None]:
replace_rep??
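To remember what this rule does without opening the source each time, here is a simplified re-implementation. It is a sketch of the idea only: `replace_rep_sketch` and its regular expression are my own assumptions, not a copy of fastai's code. A character repeated three or more times is replaced by the special token `xxrep`, the repeat count, and the character, which is what produces the `xxrep 3 w` tokens in the `www.fast.ai` example in this notebook.

```python
import re

# One non-space character followed by at least two more copies of itself
_REPEATED_CHAR = re.compile(r"(\S)(\1{2,})")

def replace_rep_sketch(text):
    """Replace runs of 3+ identical characters with ' xxrep <count> <char> '."""
    def _replace(match):
        char, rest = match.groups()
        return f" xxrep {len(rest) + 1} {char} "
    return _REPEATED_CHAR.sub(_replace, text)

print(replace_rep_sketch("www.fast.ai"))  # ' xxrep 3 w .fast.ai'
print(replace_rep_sketch("so coool"))     # 'so c xxrep 3 o l'
```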

In [None]:
coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31)

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"

In [None]:
spacy([txt])

<generator object SpacyTokenizer.__call__.<locals>.<genexpr> at 0x7fa572dad850>

In [None]:
tokens = spacy([txt])
tokens

<generator object SpacyTokenizer.__call__.<locals>.<genexpr> at 0x7fa572dade50>

In [None]:
next(tokens, None)

(#257) ['I','picked','up','TRAN','SCAN','from','the','library','and','brought'...]

In [None]:
tokens = spacy([txt])
first(tokens)

(#257) ['I','picked','up','TRAN','SCAN','from','the','library','and','brought'...]

In [None]:
tokens = spacy([txt])
display(tokens)

tokens = spacy([txt])
display(next(tokens, None))

tokens = spacy([txt])
display(first(tokens))

txt0 = files[0].open().read()
print(1, ': ', txt0[0:90])

txt1 = files[1].open().read()
print(2, ': ', txt1[0:90])

txt2 = files[2].open().read()
print(3, ': ', txt2[0:90])

txt_collection = [txt0, txt1, txt2]
toks_collection = spacy(txt_collection)

print("")
display(first(toks_collection))
display(first(toks_collection))
display(first(toks_collection))

<generator object SpacyTokenizer.__call__.<locals>.<genexpr> at 0x7fa572dadbd0>

(#257) ['I','picked','up','TRAN','SCAN','from','the','library','and','brought'...]

(#257) ['I','picked','up','TRAN','SCAN','from','the','library','and','brought'...]

1 :  Given the history of the director of this movie, it is hard to believe that this was such 
2 :  I picked up TRAN SCAN from the library and brought it home. We have considered taking a tr
3 :  Karl Jr and his dad are now running an army on a remote island. They capture a trio of guy



(#310) ['Given','the','history','of','the','director','of','this','movie',','...]

(#257) ['I','picked','up','TRAN','SCAN','from','the','library','and','brought'...]

(#197) ['Karl','Jr','and','his','dad','are','now','running','an','army'...]
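The outputs above show that calling the tokenizer returns a generator: each call to `first` or `next` consumes one document, and the generator has to be recreated to start over. The same behaviour can be reproduced without spaCy. In this sketch `whitespace_tokenizer` is a hypothetical stand-in that simply splits on whitespace.

```python
def whitespace_tokenizer(texts):
    """Return a lazy generator of token lists, one per input text."""
    # Nothing is tokenized until a value is requested
    return (text.split() for text in texts)

toks = whitespace_tokenizer(["a b c", "d e", "f"])
print(next(toks))        # ['a', 'b', 'c']
print(next(toks))        # ['d', 'e']
print(next(toks))        # ['f']
print(next(toks, None))  # None: the generator is exhausted
```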

### Subword Tokenization

In [None]:
txts = L(o.open().read() for o in files[:2000])

In [None]:
txts[0]

'Given the history of the director of this movie, it is hard to believe that this was such a painfully bad movie to sit through. I was at the European premiere last night and one of the Executive Producers was there. He was yet to see the movie and, boy, was he in for a surprise. I have not read the book that this is based upon, nor do I know if it highly rated or appreciated, but I have read "Captain Correlli\'s Mandolin" and given how poorly that was adapted for screen and how bad this movie was, I can only presume that something similar has happened here. The acting wasn\'t bad albeit that there were a couple-too-many raised eyebrows from Farrell. Honestly, I can\'t believed how little I cared for any character in this movie. Situations play out on the screen in an empty sequence of nothingness. Donald Sutherland\'s part comprises a few scenes where he opens a door, says something and closes it again. I kept looking at my watch when I wasn\'t cringing at the dialogue on the screen. 

In [None]:
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])

In [None]:
subword(1000)

'▁I ▁p ick ed ▁up ▁T R AN ▁S C AN ▁from ▁the ▁li br ary ▁and ▁ br ough t ▁it ▁home . ▁W e ▁have ▁consider ed ▁ta k ing ▁a ▁tri p ▁out ▁e as t ▁and'

In [None]:
subword(200)

'▁I ▁p i ck ed ▁ u p ▁ T R A N ▁S C A N ▁f ro m ▁the ▁ li br ar y ▁and ▁b ro u g h t ▁it ▁h o m e . ▁'

In [None]:
subword(10000)

'▁I ▁pick ed ▁up ▁TRAN ▁SC AN ▁from ▁the ▁library ▁and ▁brough t ▁it ▁home . ▁We ▁have ▁considered ▁ taking ▁a ▁trip ▁out ▁ east ▁and ▁thought ▁it ▁would ▁give ▁us ▁a ▁feel ▁of ▁what ▁it ▁was ▁like .'

### Numericalization with fastai

In [None]:
toks = tkn(txt)
print(coll_repr(tkn(txt), 31))

(#288) ['xxbos','i','picked','up','xxup','tran','xxup','scan','from','the','library','and','brought','it','home','.','xxmaj','we','have','considered','taking','a','trip','out','east','and','thought','it','would','give','us'...]


In [None]:
toks200 = txts[:200].map(tkn)
toks200[0]

(#328) ['xxbos','xxmaj','given','the','history','of','the','director','of','this'...]

In [None]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)

"(#1960) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','of','to','it','is','i','in'...]"

In [None]:
nums = num(toks)[:20]
nums

TensorText([   2,   18,  947,   87,    7,    0,    7,    0,   62,    9,    0,   13, 1446,   16,  380,   10,    8,  100,   39, 1447])

In [None]:
' '.join(num.vocab[o] for o in nums)

'xxbos i picked up xxup xxunk xxup xxunk from the xxunk and brought it home . xxmaj we have considered'
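The `xxunk` entries above appear because words that are too rare in the 200 sample reviews do not make it into the vocabulary. A pure-Python sketch of the idea follows. `build_vocab` and `numericalize` are hypothetical helpers, and the `min_freq=3` and `max_vocab=60000` values are assumed to mirror the `Numericalize` defaults. The special tokens are copied from the vocab printed above.

```python
from collections import Counter

SPECIALS = ["xxunk", "xxpad", "xxbos", "xxeos", "xxfld",
            "xxrep", "xxwrep", "xxup", "xxmaj"]

def build_vocab(token_lists, min_freq=3, max_vocab=60000):
    """Special tokens first, then frequent tokens in decreasing frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, n in counts.most_common(max_vocab)
                if n >= min_freq and tok not in SPECIALS]
    return SPECIALS + frequent

def numericalize(tokens, vocab):
    """Map tokens to vocab indices; unknown tokens map to 0 (xxunk)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

docs = [["xxbos", "the", "movie", "was", "good"],
        ["xxbos", "the", "movie", "was", "bad"],
        ["xxbos", "the", "movie", "was", "long"]]
vocab = build_vocab(docs)
print(vocab[9:])  # ['the', 'movie', 'was']
print(numericalize(["xxbos", "the", "movie", "was", "great"], vocab))  # [2, 9, 10, 11, 0]
```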

### Putting Our Texts into Batches for a Language Model

In [None]:
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)

In [None]:
tokens

(#90) ['xxbos','xxmaj','in','this','chapter',',','we','will','go','back'...]

In [None]:
bs, seq_len = 6, 15

In [None]:
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])

In [None]:
d_tokens

array([['xxbos', 'xxmaj', 'in', 'this', 'chapter', ',', 'we', 'will', 'go', 'back', 'over', 'the', 'example', 'of', 'classifying'],
       ['movie', 'reviews', 'we', 'studied', 'in', 'chapter', '1', 'and', 'dig', 'deeper', 'under', 'the', 'surface', '.', 'xxmaj'],
       ['first', 'we', 'will', 'look', 'at', 'the', 'processing', 'steps', 'necessary', 'to', 'convert', 'text', 'into', 'numbers', 'and'],
       ['how', 'to', 'customize', 'it', '.', 'xxmaj', 'by', 'doing', 'this', ',', 'we', "'ll", 'have', 'another', 'example'],
       ['of', 'the', 'preprocessor', 'used', 'in', 'the', 'data', 'block', 'xxup', 'api', '.', '\n', 'xxmaj', 'then', 'we'],
       ['will', 'study', 'how', 'we', 'build', 'a', 'language', 'model', 'and', 'train', 'it', 'for', 'a', 'while', '.']], dtype='<U12')

In [None]:
df = pd.DataFrame(d_tokens)

display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
xxbos,xxmaj,in,this,chapter,",",we,will,go,back,over,the,example,of,classifying
movie,reviews,we,studied,in,chapter,1,and,dig,deeper,under,the,surface,.,xxmaj
first,we,will,look,at,the,processing,steps,necessary,to,convert,text,into,numbers,and
how,to,customize,it,.,xxmaj,by,doing,this,",",we,'ll,have,another,example
of,the,preprocessor,used,in,the,data,block,xxup,api,.,\n,xxmaj,then,we
will,study,how,we,build,a,language,model,and,train,it,for,a,while,.


In [None]:
bs,seq_len = 6, 5
d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
xxbos,xxmaj,in,this,chapter
movie,reviews,we,studied,in
first,we,will,look,at
how,to,customize,it,.
of,the,preprocessor,used,in
will,study,how,we,build


In [None]:
bs, seq_len = 6, 5
d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
",",we,will,go,back
chapter,1,and,dig,deeper
the,processing,steps,necessary,to
xxmaj,by,doing,this,","
the,data,block,xxup,api
a,language,model,and,train


In [None]:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
over,the,example,of,classifying
under,the,surface,.,xxmaj
convert,text,into,numbers,and
we,'ll,have,another,example
.,\n,xxmaj,then,we
it,for,a,while,.
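The three tables above are the three consecutive mini-batches obtained by cutting the stream into `bs` contiguous rows and reading `seq_len` columns at a time. The sketch below generalises the slicing and also builds the targets, which are the inputs shifted by one token. `lm_batches` is a hypothetical helper for these notes and integers stand in for tokens. It is not how `LMDataLoader` is implemented.

```python
def lm_batches(tokens, bs, seq_len):
    """Yield (inputs, targets) batches from one long token stream."""
    # -1 leaves room for the target of the very last input token
    stream_len = (len(tokens) - 1) // bs
    for start in range(0, stream_len, seq_len):
        xs, ys = [], []
        for row in range(bs):
            lo = row * stream_len + start
            hi = min(lo + seq_len, (row + 1) * stream_len)
            xs.append(tokens[lo:hi])
            ys.append(tokens[lo + 1:hi + 1])  # same window, shifted by one
        yield xs, ys

tokens = list(range(91))  # stand-in for a numericalized stream
batches = list(lm_batches(tokens, bs=6, seq_len=5))
print(len(batches))       # 3
print(batches[0][0][0])   # [0, 1, 2, 3, 4]
print(batches[0][1][0])   # [1, 2, 3, 4, 5]
print(batches[1][0][0])   # [5, 6, 7, 8, 9]
```

Row 0 of the second batch continues exactly where row 0 of the first batch stopped, which is what lets a recurrent model carry its hidden state from one batch to the next.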


In [None]:
toks200[:5]



In [None]:
num

Numericalize:
encodes: (object,object) -> encodes
decodes: (object,object) -> decodes

In [None]:
nums200 = toks200.map(num)

In [None]:
nums200[0]

TensorText([   2,    8,  444,    9,  622,   14,    9,  286,   14,   20,   25,   11,   16,   17,  227,   15,  228,   21,   20,   22,  172,   12,    0,   82,   25,   15,  710,  150,   10,   18,   22,   50,
           9,    8,  623,    0,  273,  240,   13,   46,   14,    9,    8,    0,    8,  711,   22,   56,   10,    8,   42,   22,  262,   15,   96,    9,   25,   13,   11,  569,   11,   22,   42,   19,
          30,   12,  512,   10,   18,   39,   37,  341,    9,  379,   21,   20,   17,  357,  570,   11,  943,   48,   18,  124,   49,   16,  808,  624,   52,    0,   11,   27,   18,   39,  341,   23,
           0,    8,    0,   24,    8,    0,   23,   13,  444,   95,  513,   21,   22,    0,   30,  241,   13,   95,   82,   20,   25,   22,   11,   18,   83,   86,    0,   21,  154,  712,   75,  445,
         181,   10,    8,    9,  106,   22,   28,   82,    0,   21,   56,   80,   12,  446,   26,  110,   26,  136,    0,    0,   62,    8,    0,   10,    8,  944,   11,   18,  155,   28,    0,   

In [None]:
len(nums200[0])

328

In [None]:
dl = LMDataLoader(nums200)

In [None]:
dl

<fastai.text.data.LMDataLoader at 0x7fa5727dae50>

In [None]:
x, y = first(dl)
x.shape, y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))

In [None]:
x

LMTensorText([[   2,    8,  444,  ...,   39,   37,  341],
        [  38,  576,  954,  ...,  158, 1164,   71],
        [   0,   11,  970,  ...,  288,   11,   27],
        ...,
        [  28,  209,   63,  ..., 1794,   19,    8],
        [ 287,   19,  986,  ...,   38,  504,   62],
        [   0,  490,   14,  ...,   33,    8,  709]])

In [None]:
y

TensorText([[   8,  444,    9,  ...,   37,  341,    9],
        [ 576,  954, 1461,  ..., 1164,   71,  718],
        [  11,  970,   11,  ...,   11,   27,    0],
        ...,
        [ 209,   63,  213,  ...,   19,    8,    0],
        [  19,  986,   10,  ...,  504,   62,    0],
        [ 490,   14,    8,  ...,    8,  709,   10]])

In [None]:
' '.join(num.vocab[o] for o in x[0][:20])

'xxbos xxmaj given the history of the director of this movie , it is hard to believe that this was'

In [None]:
' '.join(num.vocab[o] for o in y[0][:20])

'xxmaj given the history of the director of this movie , it is hard to believe that this was such'

## Training a Text Classifier


### Language Model Using DataBlock

In [None]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

In [None]:
path

Path('/root/.fastai/data/imdb')

In [None]:
dls_lm = DataBlock(blocks=TextBlock.from_folder(path, is_lm=True),
                   get_items=get_imdb, splitter=RandomSplitter(0.1)
                   ).dataloaders(path, path=path, bs=128, seq_len=80)

In [None]:
dls_lm.show_batch(max_n=5)

Unnamed: 0,text,text_
0,"xxbos xxmaj this movie reminds me of "" irréversible ( 2002 ) "" , another art - work movie with is a violent and radical approach of human nature . i did not like the movie but i can not say that it is a bad movie , it is just special . i reminds me also of "" camping xxmaj cosmos ( 1996 ) "" where a bunch of low - class figures are residents of a camp at","xxmaj this movie reminds me of "" irréversible ( 2002 ) "" , another art - work movie with is a violent and radical approach of human nature . i did not like the movie but i can not say that it is a bad movie , it is just special . i reminds me also of "" camping xxmaj cosmos ( 1996 ) "" where a bunch of low - class figures are residents of a camp at the"
1,"xxmaj santini , a girl claiming to be the most popular girl at her school , a title that xxmaj lola must have no matter what . xxmaj after trying to nab the lead role in the school play , the competition between the two girls culminates at a sold - out concert by xxmaj lola 's favorite band that xxmaj carla conveniently has tickets to see . \n\n xxmaj the previews made the film seem boring and for the","santini , a girl claiming to be the most popular girl at her school , a title that xxmaj lola must have no matter what . xxmaj after trying to nab the lead role in the school play , the competition between the two girls culminates at a sold - out concert by xxmaj lola 's favorite band that xxmaj carla conveniently has tickets to see . \n\n xxmaj the previews made the film seem boring and for the most"
2,the mini - series was shown only a couple of times on xxup pbs at the beginning of the 1980s and then apparently vanished into oblivion . \n\n ' oppenheimer ' compares favorably to the more recent ' fat xxmaj man & xxmaj little xxmaj boy ' feature film with xxmaj paul xxmaj newman as xxmaj leslie xxmaj groves ( the chronically overweight and rather homely xxmaj general would be thoroughly flattered ) and xxmaj dwight xxmaj schultz ( alumnus,mini - series was shown only a couple of times on xxup pbs at the beginning of the 1980s and then apparently vanished into oblivion . \n\n ' oppenheimer ' compares favorably to the more recent ' fat xxmaj man & xxmaj little xxmaj boy ' feature film with xxmaj paul xxmaj newman as xxmaj leslie xxmaj groves ( the chronically overweight and rather homely xxmaj general would be thoroughly flattered ) and xxmaj dwight xxmaj schultz ( alumnus of
3,"weak in places : xxmaj jay 's explanation of why he had introduced xxmaj max to xxmaj sam provoked for me the biggest guffaw of the film ( one of the very few ) . xxmaj best part of the film ? xxmaj the xxmaj harry xxmaj connick xxmaj jr . song over the opening credits . \n\n xxmaj overall , it gets a 3 ; a waste of my time and money - it was i who was the","in places : xxmaj jay 's explanation of why he had introduced xxmaj max to xxmaj sam provoked for me the biggest guffaw of the film ( one of the very few ) . xxmaj best part of the film ? xxmaj the xxmaj harry xxmaj connick xxmaj jr . song over the opening credits . \n\n xxmaj overall , it gets a 3 ; a waste of my time and money - it was i who was the xxup"
4,", trying to hold together the disparate subplots to the point of xxmaj keystone xxmaj cop tactics . \n\n xxmaj jimi ( chris xxmaj xxunk ) is a medical school student who is gay and has a lover xxmaj jack ( peter xxmaj ash ) and they live with xxmaj jack 's obese , alcoholic , loose morals aunt xxmaj vanessa ( sally xxmaj xxunk ) and xxmaj sally 's chubby daughter xxmaj hannah ( katy xxmaj clayton ) .","trying to hold together the disparate subplots to the point of xxmaj keystone xxmaj cop tactics . \n\n xxmaj jimi ( chris xxmaj xxunk ) is a medical school student who is gay and has a lover xxmaj jack ( peter xxmaj ash ) and they live with xxmaj jack 's obese , alcoholic , loose morals aunt xxmaj vanessa ( sally xxmaj xxunk ) and xxmaj sally 's chubby daughter xxmaj hannah ( katy xxmaj clayton ) . xxmaj"


### Fine-Tuning the Language Model

In [None]:
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, 
                               metrics=[accuracy, Perplexity()]
                               ).to_fp16()

In [None]:
learn.summary()

SequentialRNN (Input shape: 128)
Layer (type)         Output Shape         Param #    Trainable 
                     []                  
LSTM                                                           
LSTM                                                           
LSTM                                                           
RNNDropout                                                     
RNNDropout                                                     
RNNDropout                                                     
____________________________________________________________________________
                     128 x 80 x 60008    
Linear                                    24063208   True      
RNNDropout                                                     
____________________________________________________________________________

Total params: 24,063,208
Total trainable params: 24,063,208
Total non-trainable params: 0

Optimizer used: <function Adam at 0x7fa5ba5fb9e0>
Loss functi

In [None]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.015022,3.90567,0.300373,49.683365,12:47


In [None]:
from google.colab import drive

drive.mount("/content/gdrive", force_remount=True)

Mounted at /content/gdrive


In [None]:
cd models

/content/gdrive/MyDrive/models


### Saving and Loading Models

In [None]:
learn.save('/content/gdrive/MyDrive/models/1epoch_nlp_class')

Path('/content/gdrive/MyDrive/models/1epoch_nlp_class.pth')

In [None]:
learn = learn.load('/content/gdrive/MyDrive/models/1epoch_nlp_class')

In [None]:
learn.unfreeze()
learn.fit_one_cycle(1, 2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.681549,3.677847,0.32689,39.561123,13:22
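A note to self on the `perplexity` column: it is the exponential of the cross-entropy validation loss. The check below uses the two `valid_loss` values from the training outputs above. The `perplexity` function is a hypothetical helper, not fastai's `Perplexity` metric class.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is exp(loss) when the loss is cross-entropy in nats."""
    return math.exp(cross_entropy_loss)

print(round(perplexity(3.90567), 2))   # 49.68, frozen epoch
print(round(perplexity(3.677847), 2))  # 39.56, unfrozen epoch
```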


In [None]:
learn.save_encoder('/content/gdrive/MyDrive/models/1epoch_nlp_class_finetuned')

### Text Generation

In [None]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2

preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]

In [None]:
print("\n".join(preds))

i liked this movie because it was a bit of a horror movie . It 's a typical Universal horror movie that goes from show to movie . It 's fun to watch , such as the villain - in - a
i liked this movie because it was a classic . The acting was great , the story was great . The story was very interesting , the special effects were very good , and the blood of the game was great .
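The `temperature=0.75` argument controls how adventurous the sampling is. One common way to apply a temperature to a probability distribution is sketched below: raise each probability to the power `1/temperature` and renormalise. This is an assumption about the general technique, not a transcription of what `learn.predict` does internally. `apply_temperature` and `sample_index` are hypothetical names.

```python
import random

def apply_temperature(probs, temperature):
    """Sharpen (T < 1) or flatten (T > 1) a probability distribution."""
    scaled = [p ** (1 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_index(probs, temperature=1.0):
    """Draw one index at random from the temperature-adjusted distribution."""
    weights = apply_temperature(probs, temperature)
    return random.choices(range(len(weights)), weights=weights, k=1)[0]

probs = [0.5, 0.3, 0.2]
print(apply_temperature(probs, 0.75))  # the most likely word gains probability
print(apply_temperature(probs, 2.0))   # the distribution gets flatter
print(sample_index(probs, temperature=0.75))
```

With a temperature below 1 the likeliest words are picked even more often, so the generated text is safer and more repetitive. Above 1 it becomes more varied and less coherent.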


### Creating the Classifier DataLoaders

In [None]:
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
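In this `DataBlock`, both the label and the train/validation split come from the folder structure: `parent_label` reads the folder directly above the file, and `GrandparentSplitter(valid_name='test')` looks one level higher. The idea can be sketched with `pathlib` alone. `label_from_parent` and `is_validation` are hypothetical helpers for these notes, not the fastai functions.

```python
from pathlib import Path

def label_from_parent(file_path):
    """The label is the name of the folder that contains the file."""
    return Path(file_path).parent.name

def is_validation(file_path, valid_name="test"):
    """A file is in the validation set if its grandparent folder is valid_name."""
    return Path(file_path).parent.parent.name == valid_name

for p in ["imdb/train/neg/7594_2.txt", "imdb/test/pos/9667_8.txt"]:
    print(p, label_from_parent(p), is_validation(p))
```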

In [None]:
dls_clas.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos xxmaj by now you 've probably heard a bit about the new xxmaj disney dub of xxmaj miyazaki 's classic film , xxmaj laputa : xxmaj castle xxmaj in xxmaj the xxmaj sky . xxmaj during late summer of 1998 , xxmaj disney released "" kiki 's xxmaj delivery xxmaj service "" on video which included a preview of the xxmaj laputa dub saying it was due out in "" 1 xxrep 3 9 "" . xxmaj it 's obviously way past that year now , but the dub has been finally completed . xxmaj and it 's not "" laputa : xxmaj castle xxmaj in xxmaj the xxmaj sky "" , just "" castle xxmaj in xxmaj the xxmaj sky "" for the dub , since xxmaj laputa is not such a nice word in xxmaj spanish ( even though they use the word xxmaj laputa many times",pos
2,"xxbos xxmaj warning : xxmaj does contain spoilers . \n\n xxmaj open xxmaj your xxmaj eyes \n\n xxmaj if you have not seen this film and plan on doing so , just stop reading here and take my word for it . xxmaj you have to see this film . i have seen it four times so far and i still have n't made up my mind as to what exactly happened in the film . xxmaj that is all i am going to say because if you have not seen this film , then stop reading right now . \n\n xxmaj if you are still reading then i am going to pose some questions to you and maybe if anyone has any answers you can email me and let me know what you think . \n\n i remember my xxmaj grade 11 xxmaj english teacher quite well . xxmaj",pos


In [None]:
nums_samp = toks200[:10].map(num)

In [None]:
nums_samp.map(len)

(#10) [328,288,215,309,72,172,146,165,197,92]
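The lengths above all differ, yet a batch must be a rectangular tensor, so shorter reviews have to be padded with the `xxpad` token (index 1 in the vocab printed earlier). A minimal sketch of the idea follows. `pad_batch` is a hypothetical helper that pads on the right for simplicity. fastai's own collation may pad on the other side and also groups reviews of similar length to limit wasted padding.

```python
PAD_IDX = 1  # position of 'xxpad' in the vocab shown earlier

def pad_batch(sequences, pad_idx=PAD_IDX):
    """Pad every sequence with pad_idx up to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_idx] * (longest - len(seq)) for seq in sequences]

batch = pad_batch([[2, 18, 947, 87], [2, 8], [2, 9, 62]])
for row in batch:
    print(row)
```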

In [None]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()

In [None]:
learn.summary()

SequentialRNN (Input shape: 128)
Layer (type)         Output Shape         Param #    Trainable 
                     []                  
LSTM                                                           
LSTM                                                           
LSTM                                                           
RNNDropout                                                     
RNNDropout                                                     
RNNDropout                                                     
BatchNorm1d                               2400       True      
Dropout                                                        
____________________________________________________________________________
                     128 x 50            
Linear                                    60000      True      
ReLU                                                           
BatchNorm1d                               100        True      
Dropout                               

In [None]:
learn = learn.load_encoder('/content/gdrive/MyDrive/models/1epoch_nlp_class_finetuned')

### Fine-Tuning the Classifier

In [None]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.286729,0.225543,0.90944,01:02


In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.247361,0.197575,0.9228,01:04


In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.20267,0.166143,0.9374,01:09


In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.164385,0.163494,0.937,01:16
1,0.15197,0.165391,0.93772,01:18
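The `slice(lr/(2.6**4), lr)` arguments above request discriminative learning rates: the earliest layers get the smallest rate and the final layers the largest. My understanding is that the rates in between are spaced geometrically across the parameter groups. The sketch below assumes five groups, which would make each group's rate 2.6 times the previous one. `spread_learning_rates` is a hypothetical helper, and the group count is an assumption, not something verified in this notebook.

```python
def spread_learning_rates(lowest, highest, n_groups):
    """Geometrically spaced learning rates from lowest to highest."""
    if n_groups == 1:
        return [highest]
    step = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * step ** i for i in range(n_groups)]

# n_groups=5 is an assumption about the classifier's layer groups
lrs = spread_learning_rates(1e-2 / 2.6**4, 1e-2, n_groups=5)
print([f"{lr:.2e}" for lr in lrs])
```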


In [None]:
learn.save('/content/gdrive/MyDrive/models/final_1epoch_nlp_class')

Path('/content/gdrive/MyDrive/models/final_1epoch_nlp_class.pth')