<a href="https://colab.research.google.com/github/Aravinda89/fastai_bootcamp/blob/main/Gayan_DL201_10_nlp_own_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Deep Dive - Own Code

Own refactored code and notes for *Chapter 10: NLP Deep Dive: RNNs* ([`10_nlp.ipynb`](https://colab.research.google.com/github/vtecftwy/fastbook/blob/master/10_nlp.ipynb)).

## Instructions

It is recommended that you work in two steps:
1. Copy the code from the fastbook notebook and make sure it works
2. Refactor (i.e. rewrite the code in your own style) by:
    - regrouping things in a way that makes sense to you
    - adding text cells to explain what the code does in your own words, with references to any documentation you consulted
    - deleting code that was only there to explain things and is not required once you run models end to end

When you have done that, you have a customized reference notebook that you can consult later when you have forgotten the details, without having to read the full fastbook notebook.

## Your code

In [1]:
!pip install -Uqq fastbook
import fastbook
# fastbook.setup_book(bind_drive=False)

from fastbook import *
from IPython.display import display,HTML

In [2]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)

In [3]:
path

Path('/root/.fastai/data/imdb')

In [4]:
path.absolute()

Path('/root/.fastai/data/imdb')

In [5]:
# 1. List all the folders under path (using the path.iterdir() method)
print(f"path to dataset: {path.absolute()}")
[f"{'file:  ' if p.is_file() else 'folder:'} {p.name}" for p in path.iterdir()]

path to dataset: /root/.fastai/data/imdb


['folder: unsup',
 'folder: tmp_lm',
 'folder: test',
 'folder: train',
 'folder: tmp_clas',
 'file:   imdb.vocab',
 'file:   README']

In [6]:
# 2. Get the full text of README
with open(path/'README', mode='r') as f:
    txt = f.readlines()
print(''.join(txt))

Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided. 

Dataset 

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning. 

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their associated with observed labels.  In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a scor

In [7]:
# List the folders and list the files
print('Folders:')
display([p.name for p in path.iterdir() if p.is_dir()])
print('Files:')
display([p.name for p in path.iterdir() if p.is_file()])

Folders:


['unsup', 'tmp_lm', 'test', 'train', 'tmp_clas']

Files:


['imdb.vocab', 'README']

In [8]:
# Content of the training set (in train folder), test/validation set (in test folder) and in unsupervised (excluding text files)
[p.name for p in (path/'train').iterdir()], [p.name for p in (path/'test').iterdir()], [p.name for p in (path/'unsup').iterdir() if 'txt' not in p.suffix]

(['unsupBow.feat', 'neg', 'pos', 'labeledBow.feat'],
 ['neg', 'pos', 'labeledBow.feat'],
 [])

In [9]:
# First files for training in the positive review folder (pos) and the negative review folder (neg). As mentioned in the README, the format is id_rating.txt
[p.name for p in (path/'train/pos').iterdir()][:5], [p.name for p in (path/'train/neg').iterdir()][:5]

(['11625_7.txt', '3371_8.txt', '2767_8.txt', '12409_7.txt', '2694_10.txt'],
 ['5733_2.txt', '10743_2.txt', '2380_1.txt', '6850_3.txt', '7259_4.txt'])

In [10]:
# First files for testing in the positive review folder (pos) and the negative review folder (neg). As mentioned in the README, the format is id_rating.txt
[p.name for p in (path/'test/pos').iterdir()][:5], [p.name for p in (path/'test/neg').iterdir()][:5]

(['6770_8.txt', '7643_8.txt', '7169_10.txt', '546_7.txt', '6855_7.txt'],
 ['2380_1.txt', '5907_1.txt', '4026_2.txt', '8691_1.txt', '9146_3.txt'])

In [11]:
# First files in the unsup folder (unlabeled reviews). The format is still id_rating.txt, but the rating is always 0
[p.name for p in (path/'unsup').iterdir()][:5]

['31370_0.txt', '14077_0.txt', '12655_0.txt', '42005_0.txt', '11246_0.txt']
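
The `id_rating.txt` naming convention seen in the listings above can be unpacked with a few lines of plain Python. This is only a sketch: `parse_review_name` is a hypothetical helper written for these notes, not part of fastai.

```python
from pathlib import Path

def parse_review_name(p):
    """Split an IMDB file name such as '2694_10.txt' into (review_id, rating)."""
    review_id, rating = Path(p).stem.split('_')
    return int(review_id), int(rating)

# A few names taken from the listings above
for name in ['11625_7.txt', '2380_1.txt', '31370_0.txt']:
    print(name, '->', parse_review_name(name))
```

The helper also accepts full paths, because `Path(p).stem` drops both the parent folders and the `.txt` suffix.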

In [12]:
files = get_text_files(path, folders = ['train', 'test', 'unsup'])
files

(#100000) [Path('/root/.fastai/data/imdb/unsup/31370_0.txt'),Path('/root/.fastai/data/imdb/unsup/14077_0.txt'),Path('/root/.fastai/data/imdb/unsup/12655_0.txt'),Path('/root/.fastai/data/imdb/unsup/42005_0.txt'),Path('/root/.fastai/data/imdb/unsup/11246_0.txt'),Path('/root/.fastai/data/imdb/unsup/29343_0.txt'),Path('/root/.fastai/data/imdb/unsup/12214_0.txt'),Path('/root/.fastai/data/imdb/unsup/10198_0.txt'),Path('/root/.fastai/data/imdb/unsup/33536_0.txt'),Path('/root/.fastai/data/imdb/unsup/4347_0.txt')...]

In [13]:
len(files)

100000

In [14]:
txt = files[1].open().read()
txt[:150]

"I can't help but wonder, after reading so many negative reviews, if people really got this movie. Yes, it is a commentary on a depraved culture. But, "

In [16]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

(#647) ['I','ca',"n't",'help','but','wonder',',','after','reading','so','many','negative','reviews',',','if','people','really','got','this','movie','.','Yes',',','it','is','a','commentary','on','a','depraved'...]


In [17]:
first(spacy(['The U.S. dollar $1 is $1.00.']))

(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']

In [18]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn('The U.S. dollar $1 is $1.00.'),20))

(#13) ['xxbos','xxmaj','the','xxup','u.s','.','dollar','$','1','is','$','1.00','.']
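
To make the special tokens less mysterious, here is a toy re-creation of the idea behind `xxbos`, `xxmaj` and `xxup`. It is an assumption-laden sketch: `mark_case` is a made-up function that works on already split tokens, whereas fastai applies its rules to the raw string before tokenizing, so the details of the real output differ.

```python
def mark_case(tokens):
    """Toy capitalisation rules: lower-case each token and put a marker
    token in front of it so the original casing is not lost."""
    out = ['xxbos']                       # beginning-of-stream marker
    for t in tokens:
        if len(t) > 1 and t.isupper():    # all caps, e.g. 'INDEX'
            out += ['xxup', t.lower()]
        elif t[:1].isupper():             # leading capital, e.g. 'The'
            out += ['xxmaj', t.lower()]
        else:                             # nothing to record
            out.append(t)
    return out

print(mark_case(['The', 'INDEX', 'is', 'here', '.']))
```

The point of the markers is that the vocabulary only needs the lower-case form of each word, while the model can still learn what capitalisation means from a single shared token.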


In [19]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

(#709) ['xxbos','i','ca',"n't",'help','but','wonder',',','after','reading','so','many','negative','reviews',',','if','people','really','got','this','movie','.','xxmaj','yes',',','it','is','a','commentary','on','a'...]


In [20]:
defaults.text_proc_rules

[<function fastai.text.core.fix_html>,
 <function fastai.text.core.replace_rep>,
 <function fastai.text.core.replace_wrep>,
 <function fastai.text.core.spec_add_spaces>,
 <function fastai.text.core.rm_useless_spaces>,
 <function fastai.text.core.replace_all_caps>,
 <function fastai.text.core.replace_maj>,
 <function fastai.text.core.lowercase>]

In [21]:
replace_rep??
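
As a reminder of what a repeated-character rule does, the sketch below replaces a run of three or more identical characters with a marker, a count and the character. `mark_repeats` is a hypothetical stand-in written for these notes; it approximates the behaviour of `replace_rep` but is not a copy of fastai's source.

```python
import re

# One non-space character followed by at least two more copies of itself
_rep = re.compile(r'(\S)\1{2,}')

def mark_repeats(text):
    """Toy repeated-character rule: 'www' becomes ' xxrep 3 w '."""
    def _sub(m):
        return f' xxrep {len(m.group(0))} {m.group(1)} '
    return _rep.sub(_sub, text)

print(mark_repeats('www.fast.ai').split())
print(mark_repeats('sooooo good!!!').split())
```

Encoding the run as a count keeps the information that something was repeated without adding a separate vocabulary entry for every possible length.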

In [22]:
coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31)

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"

In [23]:
spacy([txt])

<generator object SpacyTokenizer.__call__.<locals>.<genexpr> at 0x7f4bbebd33d0>

In [28]:
tokens = spacy([txt])
tokens

<generator object SpacyTokenizer.__call__.<locals>.<genexpr> at 0x7f4bbebd3050>

In [29]:
next(tokens, None)

(#647) ['I','ca',"n't",'help','but','wonder',',','after','reading','so'...]

In [30]:
tokens = spacy([txt])
first(tokens)

(#647) ['I','ca',"n't",'help','but','wonder',',','after','reading','so'...]

In [31]:
tokens = spacy([txt])
display(tokens)

tokens = spacy([txt])
display(next(tokens, None))

tokens = spacy([txt])
display(first(tokens))

txt0 = files[0].open().read()
print(1, ': ', txt0[0:90])
txt1 = files[1].open().read()
print(2, ': ', txt1[0:90])
txt2 = files[2].open().read()
print(3, ': ', txt2[0:90])

txt_collection = [txt0, txt1, txt2]
toks_collection = spacy(txt_collection)
display(first(toks_collection))
display(first(toks_collection))
display(first(toks_collection))

<generator object SpacyTokenizer.__call__.<locals>.<genexpr> at 0x7f4bbcfbbc50>

(#647) ['I','ca',"n't",'help','but','wonder',',','after','reading','so'...]

(#647) ['I','ca',"n't",'help','but','wonder',',','after','reading','so'...]

1 :  If you liked Freaks and Geeks this is a must-see. Judd Apatow once again created a hysteri
2 :  I can't help but wonder, after reading so many negative reviews, if people really got this
3 :  First off Ron Perlman needs to get plastic sugary. When I wasn't trying to throw up every 


(#130) ['If','you','liked','Freaks','and','Geeks','this','is','a','must'...]

(#647) ['I','ca',"n't",'help','but','wonder',',','after','reading','so'...]

(#147) ['First','off','Ron','Perlman','needs','to','get','plastic','sugary','.'...]
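
The three different results from three identical `first(toks_collection)` calls come from the tokenizer returning a generator: each call consumes one item. The plain-Python sketch below reproduces that behaviour. `toy_tokenizer` and `first_item` are hypothetical stand-ins (for the word tokenizer and for fastcore's `first`), kept deliberately minimal.

```python
def toy_tokenizer(texts):
    """Stand-in tokenizer: lazily yields one token list per input text."""
    return (t.split() for t in texts)

def first_item(iterable):
    """Return the next item of an iterable, or None if it is exhausted."""
    return next(iter(iterable), None)

gen = toy_tokenizer(['a b c', 'd e', 'f'])
print(first_item(gen))   # tokens of the first text
print(first_item(gen))   # the generator has moved on to the second text
print(first_item(gen))   # third text
print(first_item(gen))   # exhausted: None
```

Calling the tokenizer again creates a fresh generator, which is why re-running `spacy([txt])` before each `first(...)` always gives the same review.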

In [32]:
txts = L(o.open().read() for o in files[:2000])

In [33]:
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])
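
The vocabulary size controls how finely words get split: a small vocabulary forces words into many short pieces, a large one keeps common words whole. The toy below illustrates that trade-off with a greedy longest-match splitter. It is not the algorithm SentencePiece uses, and `segment` and both vocabularies are invented for this illustration.

```python
def segment(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.
    Single characters are always allowed as a fallback."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):       # try the longest candidate first
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

small_vocab = {'ne', 'ing', 're'}
large_vocab = small_vocab | {'negative', 'reading', 'review', 's'}

for w in ['negative', 'reading', 'reviews']:
    print(w, segment(w, small_vocab), segment(w, large_vocab))
```

A larger vocabulary gives shorter sequences but a bigger embedding matrix and rarer tokens; a smaller one does the opposite.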

In [34]:
subword(1000)

"▁I ▁can ' t ▁help ▁but ▁wonder , ▁after ▁read ing ▁so ▁many ▁ ne g at ive ▁review s , ▁if ▁people ▁really ▁got ▁this ▁movie . ▁Y es , ▁it ▁is ▁a ▁comment ary ▁on ▁a ▁de p"

In [35]:
subword(200)

"▁I ▁c an ' t ▁he l p ▁but ▁w on d er , ▁a f ter ▁re a d ing ▁s o ▁ma n y ▁ ne g a t i ve ▁re v i e w s ,"

In [36]:
subword(10000)

"▁I ▁can ' t ▁help ▁but ▁wonder , ▁after ▁reading ▁so ▁many ▁negative ▁reviews , ▁if ▁people ▁really ▁got ▁this ▁movie . ▁Yes , ▁it ▁is ▁a ▁commentary ▁on ▁a ▁d epraved ▁culture . ▁But , ▁as ▁the ▁narration ▁points"

In [37]:
toks = tkn(txt)
print(coll_repr(toks, 31))

(#709) ['xxbos','i','ca',"n't",'help','but','wonder',',','after','reading','so','many','negative','reviews',',','if','people','really','got','this','movie','.','xxmaj','yes',',','it','is','a','commentary','on','a'...]


In [38]:
toks200 = txts[:200].map(tkn)
toks200[0]

(#144) ['xxbos','xxmaj','if','you','liked','xxmaj','freaks','and','xxmaj','geeks'...]

In [39]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)

"(#2096) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','of','to','is','it','in','i'...]"

In [40]:
nums = num(toks)[:20]
nums

TensorText([   2,   19,  191,   38,  306,   30, 1534,   11,  152,  649,   51,  133, 1212, 1004,   11,   63,   83,  112,  211,   22])
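
Numericalization is essentially a vocabulary lookup. The sketch below rebuilds the idea in plain Python: special tokens first, then words by decreasing frequency, with unknown words mapped to `xxunk`. `build_vocab` and `numericalize` are hypothetical helpers, and the `min_freq` default here is an assumption chosen for the tiny example, not fastai's default.

```python
from collections import Counter

SPECIALS = ['xxunk', 'xxpad', 'xxbos']   # a subset of the special tokens

def build_vocab(token_lists, min_freq=1):
    """Specials first, then the remaining tokens by decreasing frequency."""
    counts = Counter(t for toks in token_lists for t in toks)
    words = [t for t, c in counts.most_common()
             if c >= min_freq and t not in SPECIALS]
    return SPECIALS + words

def numericalize(tokens, vocab):
    """Map tokens to their vocab index; unknown tokens get index 0 ('xxunk')."""
    o2i = {t: i for i, t in enumerate(vocab)}
    return [o2i.get(t, 0) for t in tokens]

docs = [['xxbos', 'the', 'movie', 'is', 'good'],
        ['xxbos', 'the', 'movie', 'is', 'bad', '.']]
vocab = build_vocab(docs)
ids = numericalize(['xxbos', 'the', 'plot', 'is', 'good'], vocab)
print(vocab)
print(ids)
print(' '.join(vocab[i] for i in ids))   # decode back to tokens
```

Decoding shows the information loss: `plot` never appeared in the toy corpus, so it comes back as `xxunk`.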

In [41]:
' '.join(num.vocab[o] for o in nums)

"xxbos i ca n't help but wonder , after reading so many negative reviews , if people really got this"