## Tokenization

Word Tokenization

Update the fastai library on Colab:


In [None]:
! [ -e /content ] && pip install -Uqq fastai

[?25l[K     |█▊                              | 10 kB 17.6 MB/s eta 0:00:01[K     |███▌                            | 20 kB 21.8 MB/s eta 0:00:01[K     |█████▏                          | 30 kB 25.4 MB/s eta 0:00:01[K     |███████                         | 40 kB 27.3 MB/s eta 0:00:01[K     |████████▋                       | 51 kB 29.2 MB/s eta 0:00:01[K     |██████████▍                     | 61 kB 31.9 MB/s eta 0:00:01[K     |████████████                    | 71 kB 27.5 MB/s eta 0:00:01[K     |█████████████▉                  | 81 kB 26.9 MB/s eta 0:00:01[K     |███████████████▋                | 92 kB 26.2 MB/s eta 0:00:01[K     |█████████████████▎              | 102 kB 27.5 MB/s eta 0:00:01[K     |███████████████████             | 112 kB 27.5 MB/s eta 0:00:01[K     |████████████████████▊           | 122 kB 27.5 MB/s eta 0:00:01[K     |██████████████████████▌         | 133 kB 27.5 MB/s eta 0:00:01[K     |████████████████████████▏       | 143 kB 27.5 MB/s eta 0:

Here we download the IMDb dataset.

In [None]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)

To see what was downloaded: 

In [None]:
path.ls()

(#7) [Path('/root/.fastai/data/imdb/unsup'),Path('/root/.fastai/data/imdb/test'),Path('/root/.fastai/data/imdb/tmp_lm'),Path('/root/.fastai/data/imdb/train'),Path('/root/.fastai/data/imdb/README'),Path('/root/.fastai/data/imdb/tmp_clas'),Path('/root/.fastai/data/imdb/imdb.vocab')]

Get all the text files from the three folders: `train`, `test`, and `unsup` (unsupervised).

In [None]:
files = get_text_files(path, folders = ['train', 'test', 'unsup'])

print(files[0:5])

[Path('/root/.fastai/data/imdb/unsup/38932_0.txt'), Path('/root/.fastai/data/imdb/unsup/896_0.txt'), Path('/root/.fastai/data/imdb/unsup/16152_0.txt'), Path('/root/.fastai/data/imdb/unsup/41284_0.txt'), Path('/root/.fastai/data/imdb/unsup/8306_0.txt')]


An example text:

In [None]:
txt = files[0].open().read(); txt[:78]

"I don't think I've ever seen a show suck so hard! She might be a single mother"

Here we tokenize the first file. spaCy doesn't just split on spaces; it also does things like splitting "don't" into "do" and "n't", since these are two separate words.

Reference for spaCy: [spaCy](https://spacy.io/)

In [None]:
spacy = WordTokenizer()
tokWord = first(spacy([txt]))
print(coll_repr(tokWord, 30))

<class 'generator'>


It is possible to go even further using fastai's `Tokenizer` wrapper. It adds features on top of spaCy, such as removing unnecessary spaces and adding special tokens for things like the beginning of a stream and uppercase letters, and much more.

Here is a brief summary of the rules it applies:

- `fix_html`: Replaces special HTML characters with a readable version (IMDb reviews have quite a few of these)
- `replace_rep`: Replaces any character repeated three times or more with a special token for repetition (`xxrep`), the number of times it's repeated, then the character
- `replace_wrep`: Replaces any word repeated three times or more with a special token for word repetition (`xxwrep`), the number of times it's repeated, then the word
- `spec_add_spaces`: Adds spaces around / and #
- `rm_useless_spaces`: Removes all repetitions of the space character
- `replace_all_caps`: Lowercases a word written in all caps and adds a special token for all caps (`xxup`) in front of it
- `replace_maj`: Lowercases a capitalized word and adds a special token for capitalized (`xxmaj`) in front of it
- `lowercase`: Lowercases all text and adds a special token at the beginning (`xxbos`) and/or the end (`xxeos`)
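
To get a feel for what rules like these do, here is a minimal pure-Python sketch of two of them. The function names `replace_rep_sketch` and `mark_case_sketch` are hypothetical and the logic is a simplified approximation, not fastai's actual implementation.

```python
import re

# Special tokens, named after the ones in the list above.
TK_REP, TK_UP, TK_MAJ = "xxrep", "xxup", "xxmaj"

def replace_rep_sketch(text):
    """Replace a character repeated 3+ times with 'xxrep <count> <char>'."""
    def _repl(match):
        char, run = match.group(1), match.group(0)
        return f" {TK_REP} {len(run)} {char} "
    return re.sub(r"(\S)\1{2,}", _repl, text)

def mark_case_sketch(tokens):
    """Lowercase tokens, marking all-caps words with xxup and capitalized words with xxmaj."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out += [TK_UP, tok.lower()]
        elif tok[:1].isupper():
            out += [TK_MAJ, tok.lower()]
        else:
            out.append(tok)
    return out

text = replace_rep_sketch("This movie was SO goooood")
tokens = mark_case_sketch(text.split())
print(tokens)
# ['xxmaj', 'this', 'movie', 'was', 'xxup', 'so', 'g', 'xxrep', '5', 'o', 'd']
```

The point of the special tokens is that information such as capitalization or repetition is kept as a separate token instead of multiplying the number of distinct words in the vocabulary.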

In [None]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

(#164) ['xxbos','i','do',"n't",'think','xxmaj','i',"'ve",'ever','seen','a','show','suck','so','hard','!','xxmaj','she','might','be','a','single','mother',',','but','a','mother','with','a','lot','of'...]



### Subword Level

Subword tokenization splits words into smaller, frequently occurring pieces. It is useful for languages such as Chinese, where no spaces separate words.

It also helps reduce the size of our vocabulary.



*   We use the first 2000 reviews for our corpus




In [None]:
corpus2000 = L(o.open().read() for o in files[:2000])

Next we define a `subword` function that takes a vocabulary size `sz`.

The subword tokenizer first analyzes the whole corpus to find the most commonly occurring groups of letters and keeps the top `sz` of them as its vocabulary.

The function then tokenizes the example review `txt` with that vocabulary and returns its first 40 tokens.
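
The two steps (count frequent pieces, then tokenize with them) can be illustrated with a toy sketch. This is only an approximation of the idea under simple assumptions: it counts character n-grams and segments greedily, whereas the real tokenizer trains a SentencePiece model, which is more sophisticated. The names `build_subword_vocab` and `tokenize` and the `max_len` parameter are hypothetical.

```python
from collections import Counter

def build_subword_vocab(corpus, vocab_sz, max_len=4):
    """Keep the `vocab_sz` most frequent character n-grams (length 1..max_len)."""
    counts = Counter()
    for text in corpus:
        for word in text.split():
            word = "▁" + word  # mark the start of each word, standing in for the space
            for n in range(1, max_len + 1):
                for i in range(len(word) - n + 1):
                    counts[word[i:i + n]] += 1
    return {piece for piece, _ in counts.most_common(vocab_sz)}

def tokenize(text, vocab, max_len=4):
    """Greedy longest-match segmentation; unknown single characters are kept as-is."""
    tokens = []
    for word in text.split():
        word = "▁" + word
        i = 0
        while i < len(word):
            for n in range(min(max_len, len(word) - i), 0, -1):
                piece = word[i:i + n]
                if piece in vocab or n == 1:
                    tokens.append(piece)
                    i += n
                    break
    return tokens

toy_corpus = ["the show was good", "the mother was good", "the show was bad"]
toy_vocab = build_subword_vocab(toy_corpus, vocab_sz=30)
toy_tokens = tokenize("the good show", toy_vocab)
print(toy_tokens)
```

Joining the tokens and turning `▁` back into spaces recovers the original text, which is the property that makes subword tokenization lossless.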



In [None]:
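def subword(sz):
    # Train a subword tokenizer with vocabulary size `sz` on the 2000-review corpus
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(corpus2000)
    # Tokenize the example review and show its first 40 tokens
    return ' '.join(first(sp([txt]))[:40])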

Try it with a vocabulary size of 2000:

In [None]:
subword(2000)

A larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember. On the downside, it means larger embedding matrices, which require more data to learn.

The special character `▁` in a token represents a space in the original text.
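
To make the embedding-matrix side of the vocabulary-size trade-off concrete, the sketch below counts the weights in an embedding layer for a few vocabulary sizes. The embedding width of 400 and the helper name `embedding_params` are illustrative assumptions, not values taken from this notebook.

```python
def embedding_params(vocab_sz, emb_dim):
    """Number of weights in an embedding matrix of shape (vocab_sz, emb_dim)."""
    return vocab_sz * emb_dim

# emb_dim=400 is an assumed, illustrative embedding width
for vocab_sz in (200, 2000, 10000):
    print(vocab_sz, embedding_params(vocab_sz, emb_dim=400))
# 200 80000
# 2000 800000
# 10000 4000000
```

The matrix grows linearly with the vocabulary, so a tenfold larger vocabulary needs ten times as many embedding weights to be learned.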

## Numericalization



In numericalization, we map the tokens to integers based on their index in the vocabulary.
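
As a rough illustration of the idea, here is a minimal pure-Python sketch: build a vocabulary from tokens that occur at least `min_freq` times, then look each token up, falling back to the unknown token. The helpers `build_vocab` and `numericalize` are hypothetical and much simpler than fastai's `Numericalize`.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=3, specials=("xxunk", "xxpad")):
    """Special tokens first, then tokens seen at least `min_freq` times, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, c in counts.most_common()
                if c >= min_freq and tok not in specials]
    return list(specials) + frequent

def numericalize(tokens, vocab):
    """Map each token to its index in `vocab`; unseen tokens map to the index of 'xxunk'."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, index["xxunk"]) for tok in tokens]

docs = [["the", "movie", "was", "good"],
        ["the", "movie", "was", "bad"],
        ["the", "film", "was", "good"]]
vocab = build_vocab(docs, min_freq=2)
print(vocab)                                                 # ['xxunk', 'xxpad', 'the', 'was', 'movie', 'good']
print(numericalize(["the", "film", "was", "great"], vocab))  # [2, 0, 3, 0]
```

Tokens below the frequency threshold ("film") and tokens never seen at all ("great") both end up as index 0, the unknown token.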



In [None]:
toks200 = corpus2000[:2000].map(tkn)
num = Numericalize(min_freq=3)
num.setup(toks200)
coll_repr(num.vocab,20)

"(#10592) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','and','a','of','to','is','it','in','i'...]"

In [None]:
toks = tokWord  # raw spaCy tokens of the example review, without fastai's extra rules
nums = num(toks)[:20]; nums


TensorText([   0,   60,   34,  122,    0,  168,  137,  125,   13,  150, 1937,   53,  290,   55,    0,  266,   44,   13,  676,  379])

In [None]:
' '.join(num.vocab[o] for o in nums)

"xxunk do n't think xxunk 've ever seen a show suck so hard ! xxunk might be a single mother"

In [None]:
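def makeDataset(dt):
    # Convert each tensor element to a plain Python int
    tokennum = [int(t) for t in dt]
    x, y = [], []
    # Slide a window over the ids: four consecutive tokens predict the next one
    for i in range(len(tokennum) - 4):
        x.append(tokennum[i:i+4])
        y.append(tokennum[i+4])
    return x, y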

In [None]:
four,one = makeDataset(nums)


In [None]:
print(four)

[[0, 65, 39, 143], [65, 39, 143, 0], [39, 143, 0, 201], [143, 0, 201, 130], [0, 201, 130, 116], [201, 130, 116, 12], [130, 116, 12, 136], [116, 12, 136, 0], [12, 136, 0, 47], [136, 0, 47, 322], [0, 47, 322, 48], [47, 322, 48, 0], [322, 48, 0, 291], [48, 0, 291, 41], [0, 291, 41, 12], [291, 41, 12, 616]]


In [None]:
print(one)

[0, 201, 130, 116, 12, 136, 0, 47, 322, 48, 0, 291, 41, 12, 616, 553]
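
The pairs above form a next-token prediction dataset: each input holds four consecutive token ids and the target is the id that follows. A small self-contained sketch of the same sliding window, using a hypothetical helper `sliding_windows` with a configurable context length and made-up ids, makes the alignment easy to check by hand:

```python
def sliding_windows(ids, context=4):
    """Return (inputs, targets): each input is `context` consecutive ids, the target is the next id."""
    xs = [ids[i:i + context] for i in range(len(ids) - context)]
    ys = ids[context:]
    return xs, ys

# Made-up ids, chosen so the alignment is easy to read
xs, ys = sliding_windows([10, 11, 12, 13, 14, 15, 16])
print(xs)  # [[10, 11, 12, 13], [11, 12, 13, 14], [12, 13, 14, 15]]
print(ys)  # [14, 15, 16]
```

A sequence of `n` ids yields `n - context` pairs, which is why the 20 ids above give 16 examples.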
