### **INITIALIZATION:**
- I put these three lines of code at the top of each of my notebooks. The first two enable the autoreload extension so that edited modules are reloaded automatically, which helps prevent problems when re-running the same project. The third line renders visualizations within the notebook.

In [1]:
#@ INITIALIZATION: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

### **LIBRARIES AND DEPENDENCIES:**
- I install (commented out by default) and import all the libraries and dependencies required for the project in one place.

In [3]:
#@ INSTALLING DEPENDENCIES: UNCOMMENT BELOW: 
# !pip install -Uqq fastbook
# import fastbook
# fastbook.setup_book()

In [4]:
#@ IMPORTING LIBRARIES AND DEPENDENCIES: 
from fastbook import *                              # Getting all the Libraries. 
from fastai.callback.fp16 import *
from fastai.text.all import *                       # Getting all the Libraries.
from IPython.display import display, HTML

### **GETTING THE DATASET:**
- I will get the **IMDB Dataset** here. 

In [5]:
#@ GETTING THE DATASET: 
path = untar_data(URLs.IMDB)                       # Getting Path to the Dataset. 
path.ls()                                          # Inspecting the Path. 

(#7) [Path('/root/.fastai/data/imdb/tmp_lm'),Path('/root/.fastai/data/imdb/train'),Path('/root/.fastai/data/imdb/tmp_clas'),Path('/root/.fastai/data/imdb/README'),Path('/root/.fastai/data/imdb/unsup'),Path('/root/.fastai/data/imdb/imdb.vocab'),Path('/root/.fastai/data/imdb/test')]

In [6]:
#@ GETTING TEXT FILES: 
files = get_text_files(path, folders=["train", "test", "unsup"])        # Getting Text Files. 
txt = files[0].open().read()                                            # Getting a Text. 
txt[:75]                                                                # Inspecting Text. 

"i don't know what they were thinking.by they,i mean anybody even remotely c"

### **WORD TOKENIZATION:**
- **Word Tokenization** splits a sentence on spaces and also applies language-specific rules to try to separate parts of meaning even when there are no spaces. Punctuation marks are generally split into separate tokens as well. A **Token** is an element of the list created by the **Tokenization** process. It can be a word, a part of a word (a subword), or a single character. 
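
The splitting rules described above can be approximated with a single regular expression. This is only a minimal sketch for illustration: the pattern and the `simple_word_tokenize` helper are my own assumptions and are much cruder than the spaCy rules that `WordTokenizer` relies on.

```python
import re

# Hypothetical pattern, tried left to right at each position:
#   1. a word stem directly before "n't" (so "don't" -> "do" + "n't")
#   2. the contraction "n't" itself
#   3. a number, optionally with a decimal part
#   4. any other run of word characters
#   5. any single non-space symbol (punctuation, "$", ...)
TOKEN_RE = re.compile(r"\w+(?=n't)|n't|\d+(?:\.\d+)?|\w+|[^\w\s]")

def simple_word_tokenize(text):
    """Split text into word-like tokens using the pattern above."""
    return TOKEN_RE.findall(text)

print(simple_word_tokenize("I don't know, really."))
print(simple_word_tokenize("It costs $1.00."))
```

Unlike spaCy, this sketch knows nothing about abbreviations such as "U.S.", so it would split them at every period.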

In [7]:
#@ INITIALIZING WORD TOKENIZATION: 
spacy = WordTokenizer()                                  # Initializing Tokenizer. 
toks = first(spacy([txt]))                               # Getting Tokens of Words. 
print(coll_repr(toks, 30))                               # Inspecting Tokens. 

(#172) ['i','do',"n't",'know','what','they','were','thinking.by','they',',','i','mean','anybody','even','remotely','connected','to','this',"disaster.i've",'seen','so','bad','movies',',',"i've",'seen','so','really','bad','movies'...]


In [8]:
#@ INSPECTING TOKENIZATION: EXAMPLE:
first(spacy(['The U.S. dollar $1 is $1.00.']))           # Inspecting Tokens. 

(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']

In [9]:
#@ INITIALIZING WORD TOKENIZATION WITH FASTAI: 
tkn = Tokenizer(spacy)                                   # Initializing Tokenizer. 
print(coll_repr(tkn(txt), 31))                           # Inspecting Tokens. 

(#175) ['xxbos','i','do',"n't",'know','what','they','were','thinking.by','they',',','i','mean','anybody','even','remotely','connected','to','this',"disaster.i've",'seen','so','bad','movies',',',"i've",'seen','so','really','bad','movies'...]


**Note:**
- **xxbos** : Indicates the beginning of a text. 
- **xxmaj** : Indicates the next word begins with a capital. 
- **xxunk** : Indicates that the word is unknown, i.e. not in the vocab.  
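
To make the idea behind these special tokens concrete, here is a minimal sketch that lowercases words and inserts marker tokens in front of them. The `add_case_tokens` helper is hypothetical and is not fastai's implementation, which applies a longer list of rules. It also handles **xxup**, which marks a word written entirely in capitals.

```python
def add_case_tokens(words):
    """Lowercase words and record the lost casing with marker tokens."""
    out = ["xxbos"]                       # Mark the beginning of the text.
    for w in words:
        if len(w) > 1 and w.isupper():    # Whole word in capitals.
            out += ["xxup", w.lower()]
        elif w[:1].isupper():             # Only the first letter is a capital.
            out += ["xxmaj", w.lower()]
        else:                             # Already lowercase or not a letter.
            out.append(w)
    return out

print(add_case_tokens(["Fast", "ai", "INDEX"]))
```

Keeping the casing as separate tokens lets the vocab store a single lowercase entry per word instead of one entry per capitalization.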

In [10]:
#@ INSPECTING TOKENIZATION: EXAMPLE:
coll_repr(tkn('&copy; Fast.ai www.fast.ai/INDEX'), 30)   # Inspecting Tokens. 

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"

### **SUBWORD TOKENIZATION:**
- **Word Tokenization** relies on the assumption that spaces provide a useful separation of components of meaning in a sentence, which is not always appropriate. Languages such as Chinese and Japanese don't use spaces, and in such cases **Subword Tokenization** generally works best. **Subword Tokenization** splits words into smaller parts based on the most commonly occurring substrings. 
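
The idea of building subwords from commonly occurring substrings can be sketched with a toy byte-pair-encoding-style procedure: start from single characters and repeatedly merge the most frequent adjacent pair. This is an assumption made for illustration. The tokenizer used in this notebook wraps SentencePiece, whose algorithm may differ, and the helper names and the tiny word list below are hypothetical.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    # Each word starts as a tuple of characters, with "▁" marking the preceding space.
    corpus = Counter(tuple("▁" + w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:                      # Every word is already a single symbol.
            break
        best = max(pairs, key=pairs.get)   # Most frequent adjacent pair.
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[apply_merge(symbols, best)] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple("▁" + word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

words = ["thinking", "thinking", "seeing", "being", "think", "thing"]
merges = learn_merges(words, 3)
print(encode("thinking", merges))
```

Learning more merges plays the same role as asking for a larger vocab: frequent words end up as fewer and longer pieces.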

In [11]:
#@ INITIALIZING SUBWORD TOKENIZATION: EXAMPLE:
txts = L(o.open().read() for o in files[:2000])                # Getting List of Reviews. 

#@ INITIALIZING SUBWORD TOKENIZER: 
def subword(sz):                                               # Defining Function.      
    sp = SubwordTokenizer(vocab_sz=sz)                         # Initializing Subword Tokenizer. 
    sp.setup(txts)                                             # Training the Tokenizer on the Reviews. 
    return " ".join(first(sp([txt]))[:40])                     # Getting the first 40 Subword Tokens. 

#@ IMPLEMENTATION: 
subword(1000)                                                  # Inspecting Subword Tokenization. 

"▁i ▁don ' t ▁know ▁what ▁they ▁were ▁think ing . b y ▁they , i ▁mean ▁any bo dy ▁even ▁re mo te ly ▁con n ect ed ▁to ▁this ▁dis a ster . i ' ve ▁seen ▁so"

**Notes:**
- Here **setup** is a special fastai method that is called automatically in the usual data processing pipelines. It reads the documents and finds the common sequences of characters to create the vocab. Similarly, [**L**](https://fastcore.fast.ai/#L) is fastcore's list class, often described as a superpowered list. The special character '▁' represents a space character in the original text. 

In [13]:
#@ IMPLEMENTATION OF SUBWORD TOKENIZATION: 
subword(200)                                                  # Small Vocab: Output not displayed. 
subword(10000)                                                # Large Vocab: Output shown below. 

"▁i ▁don ' t ▁know ▁what ▁they ▁were ▁thinking . by ▁they , i ▁mean ▁anybody ▁even ▁remote ly ▁connect ed ▁to ▁this ▁disaster . i ' ve ▁seen ▁so ▁bad ▁movies , i ' ve ▁seen ▁so ▁really ▁bad"

**Note:**
- A larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember. The downside is larger embedding matrices, which require more data to learn. **Subword Tokenization** provides a way to easily scale between character tokenization (a small subword vocab) and word tokenization (a large subword vocab), and it handles every human language without needing language-specific algorithms to be developed. 
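
The trade-off above can be put into rough numbers. The sketch below compares the sequence length of one sentence at the two extremes (characters versus whole words) and counts the embedding weights for the three vocab sizes tried earlier. The embedding width of 400 and the `embedding_params` helper are assumptions chosen for illustration only.

```python
sentence = "i don't know what they were thinking"

# Two extremes of tokenization for the same sentence.
char_tokens = list(sentence.replace(" ", "▁"))   # One token per character.
word_tokens = sentence.split()                   # One token per word.
print(len(char_tokens), len(word_tokens))

EMB_DIM = 400  # Assumed embedding width, not taken from this notebook.

def embedding_params(vocab_sz, emb_dim=EMB_DIM):
    """Number of weights in an embedding matrix of shape (vocab_sz, emb_dim)."""
    return vocab_sz * emb_dim

for vocab_sz in (200, 1000, 10000):
    print(vocab_sz, embedding_params(vocab_sz))
```

The sequence gets shorter as the tokens get longer, while the embedding matrix grows linearly with the vocab size.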

### **NUMERICALIZATION:**
- **Numericalization** is the process of mapping tokens to integers. It involves making a list of all possible tokens, called the vocab, and replacing each token with its index in the vocab. This is the same as treating the tokens as the levels of a categorical variable.
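
The two steps can be sketched in plain Python: build a vocab ordered by frequency, then look up each token's index. The `build_vocab` and `numericalize` helpers, the `min_freq=2` cutoff, and the short list of special tokens are assumptions for illustration. fastai's `Numericalize` also accepts a `min_freq` argument, and rare tokens that fall below such a cutoff are mapped to **xxunk**.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2, specials=("xxunk", "xxpad", "xxbos")):
    """Return special tokens followed by frequent tokens, most common first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, c in counts.most_common()
                if c >= min_freq and tok not in specials]
    return list(specials) + frequent

def numericalize(tokens, vocab):
    """Map each token to its index in the vocab, or to the index of xxunk."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, index["xxunk"]) for tok in tokens]

docs = [["xxbos", "i", "like", "movies"],
        ["xxbos", "i", "like", "films"]]
vocab = build_vocab(docs)
ids = numericalize(docs[1], vocab)
print(vocab)
print(ids)
print(" ".join(vocab[i] for i in ids))
```

Here "films" occurs only once, so it is left out of the vocab and decodes back to xxunk.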

In [14]:
#@ INITIALIZING TOKENS: 
toks = tkn(txt)                                              # Getting Tokens. 
print(coll_repr(toks, 31))                                   # Inspecting Tokens. 

(#175) ['xxbos','i','do',"n't",'know','what','they','were','thinking.by','they',',','i','mean','anybody','even','remotely','connected','to','this',"disaster.i've",'seen','so','bad','movies',',',"i've",'seen','so','really','bad','movies'...]


In [15]:
#@ INITIALIZING TOKENS: 
toks200 = txts[:200].map(tkn)                                # Getting Tokens. 
toks200[0]                                                   # Inspecting Tokens. 

(#175) ['xxbos','i','do',"n't",'know','what','they','were','thinking.by','they'...]

In [21]:
#@ NUMERICALIZATION USING FASTAI: 
num = Numericalize()                                         # Initializing Numericalization. 
num.setup(toks200)                                           # Getting Integers. 
coll_repr(num.vocab, 20)                                     # Inspecting Vocabulary. 

"(#2112) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','of','and','to','is','i','in','it'...]"

In [23]:
#@ INITIALIZING NUMERICALIZATION: 
nums = num(toks)[:20]; nums                                  # Inspection. 
" ".join(num.vocab[o] for o in nums)                         # Mapping Integers back to Tokens. 

"xxbos i do n't know what they were xxunk they , i mean anybody even remotely xxunk to this xxunk"