### **INITIALIZATION:**
- I put these three lines of code at the top of every notebook. The first two enable automatic reloading of imported modules, which helps to prevent problems when reloading the same project, and the third makes visualizations render within the notebook.

In [1]:
#@ INITIALIZATION: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**LIBRARIES AND DEPENDENCIES:**
- All the libraries and dependencies required for the project are installed and imported in one place.

In [3]:
#@ INSTALLING DEPENDENCIES: UNCOMMENT BELOW: 
# !pip install -Uqq fastbook
# import fastbook
# fastbook.setup_book()

In [4]:
#@ IMPORTING LIBRARIES AND DEPENDENCIES: 
from fastbook import *                              # Getting all the Libraries. 
from fastai.callback.fp16 import *
from fastai.text.all import *                       # Getting all the Libraries.
from IPython.display import display, HTML

### **GETTING THE DATASET:**
- I will get the **IMDB Dataset** here. 

In [5]:
#@ GETTING THE DATASET: 
path = untar_data(URLs.IMDB)                       # Getting Path to the Dataset. 
path.ls()                                          # Inspecting the Path. 

(#7) [Path('/root/.fastai/data/imdb/tmp_lm'),Path('/root/.fastai/data/imdb/README'),Path('/root/.fastai/data/imdb/tmp_clas'),Path('/root/.fastai/data/imdb/unsup'),Path('/root/.fastai/data/imdb/imdb.vocab'),Path('/root/.fastai/data/imdb/train'),Path('/root/.fastai/data/imdb/test')]

In [6]:
#@ GETTING TEXT FILES: 
files = get_text_files(path, folders=["train", "test", "unsup"])        # Getting Text Files. 
txt = files[0].open().read()                                            # Getting a Text. 
txt[:75]                                                                # Inspecting Text. 

"*** May contain spoilers*** Wow. This movie is really bad. It's so bad that"

### **WORD TOKENIZATION:**
- **Word Tokenization** splits a sentence on spaces and also applies language-specific rules to try to separate parts of meaning even when there are no spaces. Generally, punctuation marks are also split into separate tokens. A **Token** is an element of the list created by the **Tokenization** process; it can be a word, a part of a word (a subword), or a single character.
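
To make the idea concrete, here is a minimal sketch in plain Python that splits text into words and punctuation marks with a regular expression. It is only an illustration: `simple_word_tokenize` is a hypothetical helper, and the spaCy tokenizer applies far richer language-specific rules (for instance, it keeps `'s` and `U.S.` together, which this sketch does not).

```python
import re

def simple_word_tokenize(text):
    """Naive word tokenizer: runs of word characters, or single punctuation marks."""
    # \w+ matches a run of letters/digits/underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("Wow. This movie is really bad.")
print(tokens)
```

Whitespace never becomes a token here, while each punctuation mark becomes its own token.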

In [7]:
#@ INITIALIZING WORD TOKENIZATION: 
spacy = WordTokenizer()                                  # Initializing Tokenizer. 
toks = first(spacy([txt]))                               # Getting Tokens of Words. 
print(coll_repr(toks, 30))                               # Inspecting Tokens. 

(#477) ['*','*','*','May','contain','spoilers','*','*','*','Wow','.','This','movie','is','really','bad','.','It',"'s",'so','bad','that','the','first','word','of','the','title','is','misspelled'...]


In [8]:
#@ INSPECTING TOKENIZATION: EXAMPLE:
first(spacy(['The U.S. dollar $1 is $1.00.']))           # Inspecting Tokens. 

(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']

In [9]:
#@ INITIALIZING WORD TOKENIZATION WITH FASTAI: 
tkn = Tokenizer(spacy)                                   # Initializing Tokenizer. 
print(coll_repr(tkn(txt), 31))                           # Inspecting Tokens. 

(#505) ['xxbos','xxrep','3','*','xxmaj','may','contain','spoilers','xxrep','3','*','xxmaj','wow','.','xxmaj','this','movie','is','really','bad','.','xxmaj','it',"'s",'so','bad','that','the','first','word','of'...]


**Note:**
- **xxbos** : Indicates the beginning of a text. 
- **xxmaj** : Indicates the next word begins with a capital. 
- **xxunk** : Stands in for a word that is unknown, i.e. not present in the vocab.  
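
The special tokens can be imitated with a few lines of plain Python. The sketch below is my own simplified approximation, not fastai's actual rules: it marks capitalized words with `xxmaj`, fully upper-case words with `xxup`, and characters repeated three or more times with `xxrep` (the last two also show up in the outputs in this section). The helper names `mark_repeats`, `mark_case` and `add_special_tokens` are hypothetical, and the text is split on spaces only.

```python
import re

BOS, MAJ, UP, REP = "xxbos", "xxmaj", "xxup", "xxrep"

def mark_repeats(text):
    """Replace a character repeated 3+ times with ' xxrep <count> <char> '."""
    def _repl(match):
        return f" {REP} {len(match.group(0))} {match.group(1)} "
    return re.sub(r"(\S)\1{2,}", _repl, text)

def mark_case(tokens):
    """Lower-case every token, keeping the lost case information as a marker token."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:     # whole word in capitals
            out += [UP, tok.lower()]
        elif tok[0].isupper():                 # only the first letter is a capital
            out += [MAJ, tok.lower()]
        else:
            out.append(tok)
    return out

def add_special_tokens(text):
    """Simplified pipeline: repeats first, then a whitespace split, then case markers."""
    return [BOS] + mark_case(mark_repeats(text).split())

print(add_special_tokens("*** May contain spoilers*** Wow"))
```

The point of these markers is that the vocab only needs the lower-case form of each word, while the model can still learn what capitalization and repetition mean.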

In [10]:
#@ INSPECTING TOKENIZATION: EXAMPLE:
coll_repr(tkn('&copy; Fast.ai www.fast.ai/INDEX'), 30)   # Inspecting Tokens. 

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"

### **SUBWORD TOKENIZATION:**
- **Word Tokenization** relies on the assumption that spaces provide a useful separation of the components of meaning in a sentence, which is not always appropriate. Languages such as Chinese and Japanese don't use spaces, and in such cases **Subword Tokenization** generally works best. **Subword Tokenization** splits words into smaller parts based on the most commonly occurring substrings. 

In [11]:
#@ INITIALIZING SUBWORD TOKENIZATION: EXAMPLE:
txts = L(o.open().read() for o in files[:2000])                # Getting List of Reviews. 

#@ INITIALIZING SUBWORD TOKENIZER: 
def subword(sz):                                               # Defining Function.      
    sp = SubwordTokenizer(vocab_sz=sz)                         # Initializing Subword Tokenizer. 
    sp.setup(txts)                                             # Building the Vocab from the Texts. 
    return " ".join(first(sp([txt]))[:40])                     # Getting First 40 Subword Tokens. 

#@ IMPLEMENTATION: 
subword(1000)                                                  # Inspecting Subword Tokenization. 

"▁ *** ▁Ma y ▁con t ain ▁sp o il ers *** ▁W ow . ▁This ▁movie ▁is ▁really ▁bad . ▁It ' s ▁so ▁bad ▁that ▁the ▁first ▁word ▁of ▁the ▁title ▁is ▁mis s p ell ed ▁("

**Notes:**
- Here **setup** is a special fastai method that is called automatically in the usual data processing pipelines; it reads the documents and finds the common sequences of characters to create the vocab. [**L**](https://fastcore.fast.ai/#L) is fastcore's list class, sometimes described as a superpowered list. The special character '▁' represents a space character in the original text. 
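
To get a feel for what "finding the common sequences of characters" means, here is a small sketch of a byte-pair-encoding style procedure in plain Python. This is a stand-in chosen because it is easy to follow: fastai's `SubwordTokenizer` delegates to the SentencePiece library, whose algorithm is more sophisticated. The functions `learn_merges`, `apply_merge` and `segment` and the toy corpus are all hypothetical.

```python
from collections import Counter

def apply_merge(seq, pair):
    """Fuse every occurrence of the adjacent symbol pair into a single symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_merges(texts, n_merges):
    """Repeatedly merge the most frequent adjacent pair; returns the list of merges."""
    seqs = [list(t.replace(" ", "▁")) for t in texts]   # start from characters, '▁' marks a space
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:                                   # every text is already one symbol
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges

def segment(text, merges):
    """Tokenize new text by replaying the learned merges in order."""
    seq = list(text.replace(" ", "▁"))
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq

corpus = ["this movie is bad", "this movie is really bad", "the movie is so bad"]
few, many = learn_merges(corpus, 5), learn_merges(corpus, 30)
print(segment("this movie is bad", few))
print(segment("this movie is bad", many))
```

Allowing more merges plays the same role as a larger `vocab_sz`: frequent character sequences fuse into longer tokens, so the same sentence needs fewer of them.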

In [12]:
#@ IMPLEMENTATION OF SUBWORD TOKENIZATION: 
subword(200)                                                  # Inspecting Vocab. 
subword(10000)                                                # Inspecting Vocab. 

"▁*** ▁May ▁contain ▁spoiler s *** ▁Wow . ▁This ▁movie ▁is ▁really ▁bad . ▁It ' s ▁so ▁bad ▁that ▁the ▁first ▁word ▁of ▁the ▁title ▁is ▁miss pell ed ▁( I ' d ▁be ▁willing ▁to ▁allow ▁for ▁the"

**Note:**
- A larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember. On the downside, it means larger embedding matrices, which require more data to learn. **Subword Tokenization** provides a way to easily scale between character tokenization (using a small subword vocab) and word tokenization (using a large subword vocab), and it handles every human language without needing language-specific algorithms to be developed. 
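
Both sides of this trade-off can be checked with simple arithmetic. The sketch below reuses the two 40-token outputs shown above (vocab sizes 1000 and 10000) to compare how much text each token covers, and then computes the number of embedding weights for each vocab size. The embedding width `EMB_DIM = 400` is an assumption I picked for illustration, and `chars_per_token` and `embedding_params` are hypothetical helpers.

```python
# First 40 subword tokens of the same review, copied from the outputs above.
out_1000 = "▁ *** ▁Ma y ▁con t ain ▁sp o il ers *** ▁W ow . ▁This ▁movie ▁is ▁really ▁bad . ▁It ' s ▁so ▁bad ▁that ▁the ▁first ▁word ▁of ▁the ▁title ▁is ▁mis s p ell ed ▁("
out_10000 = "▁*** ▁May ▁contain ▁spoiler s *** ▁Wow . ▁This ▁movie ▁is ▁really ▁bad . ▁It ' s ▁so ▁bad ▁that ▁the ▁first ▁word ▁of ▁the ▁title ▁is ▁miss pell ed ▁( I ' d ▁be ▁willing ▁to ▁allow ▁for ▁the"

def chars_per_token(joined):
    """Average number of characters covered by one token."""
    toks = joined.split(" ")
    return sum(len(t) for t in toks) / len(toks)

def embedding_params(vocab_sz, emb_dim):
    """An embedding matrix has one row of emb_dim weights per vocab entry."""
    return vocab_sz * emb_dim

EMB_DIM = 400  # assumed embedding width, for illustration only

for name, out, vocab_sz in [("small", out_1000, 1000), ("large", out_10000, 10000)]:
    print(name, round(chars_per_token(out), 2), embedding_params(vocab_sz, EMB_DIM))
```

The larger vocab covers more text with the same 40 tokens, but its embedding matrix has ten times as many weights to learn.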

### **NUMERICALIZATION:**
- **Numericalization** is the process of mapping tokens to integers. It involves making a list of all possible levels of the categorical variable, i.e. the vocab, and replacing each level with its index in the vocab.
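
The two steps, building the vocab and replacing tokens with indices, can be sketched in plain Python as follows. This is a simplified illustration rather than fastai's implementation: `build_vocab` and `numericalize` are hypothetical helpers, only three special tokens are reserved, and the `min_freq` threshold is there to show why rare words end up as `xxunk`.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2, specials=("xxunk", "xxpad", "xxbos")):
    """Special tokens first, then tokens seen at least min_freq times, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, c in counts.most_common()
                if c >= min_freq and tok not in specials]
    return list(specials) + frequent

def numericalize(tokens, vocab):
    """Replace each token with its vocab index; unknown tokens map to the xxunk index."""
    index = {tok: i for i, tok in enumerate(vocab)}
    unk = index["xxunk"]
    return [index.get(tok, unk) for tok in tokens]

docs = [["xxbos", "this", "movie", "is", "bad"],
        ["xxbos", "this", "movie", "is", "good"],
        ["xxbos", "this", "film", "is", "bad"]]

vocab = build_vocab(docs, min_freq=2)
ids = numericalize(["xxbos", "this", "film", "is", "bad"], vocab)
decoded = " ".join(vocab[i] for i in ids)
print(vocab)
print(ids)
print(decoded)
```

Because "film" occurs only once in the toy corpus it never enters the vocab, so it comes back as `xxunk` when the integers are mapped to text again.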

In [13]:
#@ INITIALIZING TOKENS: 
toks = tkn(txt)                                              # Getting Tokens. 
print(coll_repr(toks, 31))                                   # Inspecting Tokens. 

(#505) ['xxbos','xxrep','3','*','xxmaj','may','contain','spoilers','xxrep','3','*','xxmaj','wow','.','xxmaj','this','movie','is','really','bad','.','xxmaj','it',"'s",'so','bad','that','the','first','word','of'...]


In [14]:
#@ INITIALIZING TOKENS: 
toks200 = txts[:200].map(tkn)                                # Getting Tokens. 
toks200[0]                                                   # Inspecting Tokens. 

(#505) ['xxbos','xxrep','3','*','xxmaj','may','contain','spoilers','xxrep','3'...]

In [15]:
#@ NUMERICALIZATION USING FASTAI: 
num = Numericalize()                                         # Initializing Numericalization. 
num.setup(toks200)                                           # Building the Vocab. 
coll_repr(num.vocab, 20)                                     # Inspecting Vocabulary. 

"(#2216) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the',',','.','and','a','of','to','is','in','it','that'...]"

In [16]:
#@ INITIALIZING NUMERICALIZATION: 
nums = num(toks)[:20]; nums                                  # Inspection. 
" ".join(num.vocab[o] for o in nums)                         # Getting Original Text. 

'xxbos xxrep 3 * xxmaj may xxunk xxunk xxrep 3 * xxmaj xxunk . xxmaj this movie is really bad'

### **CREATING BATCHES FOR LANGUAGE MODEL:**
- At every epoch I will shuffle the collection of documents, concatenate them into a stream of tokens, and cut that stream into a batch of fixed-size consecutive mini-streams. The model will then read the mini-streams in order. 
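
Here is a minimal sketch of that batching scheme in plain Python, using integers in place of tokens so the layout is easy to see. It is an approximation under my own assumptions: `lm_batches` is a hypothetical helper, the per-epoch shuffling of documents is left out, and leftover tokens at the end of the stream are simply dropped.

```python
def lm_batches(stream, bs, seq_len):
    """Cut a token stream into bs mini-streams and yield (x, y) batches of width seq_len.

    y is x shifted by one token, so the target at each position is the next token.
    """
    n = (len(stream) - 1) // bs                # tokens per mini-stream (one token kept for the shift)
    rows_x = [stream[i * n:(i + 1) * n] for i in range(bs)]
    rows_y = [stream[i * n + 1:(i + 1) * n + 1] for i in range(bs)]
    for start in range(0, n, seq_len):
        x = [row[start:start + seq_len] for row in rows_x]
        y = [row[start:start + seq_len] for row in rows_y]
        yield x, y

stream = list(range(25))                       # stand-in for a concatenated stream of token ids
batches = list(lm_batches(stream, bs=3, seq_len=4))
for x, y in batches:
    print(x, y)
```

Row `i` of each batch continues exactly where row `i` of the previous batch stopped, which is what lets the model carry its state from one batch to the next.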

In [17]:
#@ CREATING BATCHES FOR LANGUAGE MODEL: 
nums200 = toks200.map(num)                                   # Initializing Numericalization. 
dl = LMDataLoader(nums200)                                   # Creating Language Model Data Loaders. 

#@ INSPECTING FIRST BATCH: 
x, y = first(dl)                                             # Getting First Batch of Data. 
x.shape, y.shape                                             # Inspecting Shape of Data. 

(torch.Size([64, 72]), torch.Size([64, 72]))

In [18]:
#@ INSPECTING THE DATA: 
" ".join(num.vocab[o] for o in x[0][:20])                    # Inspecting Independent Variable. 

'xxbos xxrep 3 * xxmaj may xxunk xxunk xxrep 3 * xxmaj xxunk . xxmaj this movie is really bad'

In [19]:
#@ INSPECTING THE DATA: 
" ".join(num.vocab[o] for o in y[0][:20])                    # Inspecting Dependent Variable. 

'xxrep 3 * xxmaj may xxunk xxunk xxrep 3 * xxmaj xxunk . xxmaj this movie is really bad .'

### **TRAINING A TEXT CLASSIFIER:**

**LANGUAGE MODEL USING DATABLOCK:**

In [21]:
#@ CREATING LANGUAGE MODEL USING DATABLOCK: 
get_imdb = partial(get_text_files, folders=["train", "test", "unsup"])    # Getting Text Files. 
db = DataBlock(blocks=TextBlock.from_folder(path, is_lm=True),            # Initializing TextBlock. 
               get_items=get_imdb, splitter=RandomSplitter(0.1))          # Initializing DataBlock. 

#@ CREATING LANGUAGE MODEL DATALOADERS: 
dls_lm = db.dataloaders(path, path=path, bs=128, seq_len=80)              # Initializing Data Loaders. 

In [23]:
#@ INSPECTING THE BATCHES OF DATA: 
dls_lm.show_batch(max_n=2)

| | text | text_ |
|---|---|---|
| 0 | xxbos a beautiful film about the coming of early silent cinema to xxmaj china . xxup shadow xxup magic deftly combines a love story with the drama of the cultural clash between xxmaj china 's ancient traditions and modern xxmaj western culture in the form of film . xxmaj an amazing first film by xxmaj chinese director xxmaj ann xxmaj hu . xxmaj if i correctly understood xxmaj ms . xxmaj hu 's comments at the 2 xxrep 3 0 | a beautiful film about the coming of early silent cinema to xxmaj china . xxup shadow xxup magic deftly combines a love story with the drama of the cultural clash between xxmaj china 's ancient traditions and modern xxmaj western culture in the form of film . xxmaj an amazing first film by xxmaj chinese director xxmaj ann xxmaj hu . xxmaj if i correctly understood xxmaj ms . xxmaj hu 's comments at the 2 xxrep 3 0 xxmaj |
| 1 | is one of the 90 's best thinking person 's romantic movies . xxmaj julie xxmaj delpy turns in one of the decade 's most engaging performances as the xxmaj parisian lass who spends a day with stranger - on - a - train xxmaj ethan xxmaj hawke . xxmaj the dialogue ( and there is oodles of it ) is sometimes meandering and overly precious , but this portrait of two young wannabe - lovers making a romantic , | one of the 90 's best thinking person 's romantic movies . xxmaj julie xxmaj delpy turns in one of the decade 's most engaging performances as the xxmaj parisian lass who spends a day with stranger - on - a - train xxmaj ethan xxmaj hawke . xxmaj the dialogue ( and there is oodles of it ) is sometimes meandering and overly precious , but this portrait of two young wannabe - lovers making a romantic , intellectual |


**FINETUNING THE LANGUAGE MODEL:**
- I will use **Embeddings** to convert the integer word indices into activations that can be used by the neural network. These embeddings are fed into a **Recurrent Neural Network** using an architecture called **AWD-LSTM**. 
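
Conceptually, an embedding layer is just a matrix with one row per vocab entry, and "converting an index into activations" means looking up the corresponding row. The sketch below shows this with plain Python lists; in practice a layer such as PyTorch's `nn.Embedding` does the same lookup on tensors and learns the rows during training. The helpers `make_embedding` and `embed` and the tiny sizes are hypothetical.

```python
import random

def make_embedding(vocab_sz, emb_dim, seed=0):
    """Randomly initialized embedding matrix: one row of emb_dim weights per token id."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(emb_dim)] for _ in range(vocab_sz)]

def embed(token_ids, weights):
    """Look up the row of the matrix for each token id."""
    return [weights[i] for i in token_ids]

weights = make_embedding(vocab_sz=10, emb_dim=4)
acts = embed([2, 8, 5, 2], weights)
print(len(acts), len(acts[0]))      # one vector of width emb_dim per input token
```

The same token id always yields the same vector, so positions 0 and 3 in this example receive identical activations.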

In [24]:
#@ INITIALIZING LANGUAGE MODEL LEARNER: 
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3,                 # Using AWD LSTM Architecture. 
                               metrics=[accuracy, Perplexity()]).to_fp16()      # Initializing LM Learner.                          

In [25]:
#@ TRAINING EMBEDDINGS WITH RANDOM INITIALIZATION: 
learn.fit_one_cycle(1, 2e-2)

| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|---|---|---|---|---|---|
| 0 | 4.006634 | 3.903991 | 0.299049 | 49.600018 | 21:03 |
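
The perplexity metric is the exponential of the cross-entropy loss, so the two numbers in the training output above can be checked against each other. A quick sketch, where `perplexity` is a hypothetical helper and not the fastai metric class:

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is exp(loss) when the loss is the mean cross-entropy in nats."""
    return math.exp(cross_entropy_loss)

valid_loss = 3.903991                    # validation loss reported in the training output above
print(round(perplexity(valid_loss), 2))  # close to the reported perplexity of 49.600018
```

A perplexity of about 50 can be read as the model being roughly as uncertain as if it were choosing uniformly among 50 tokens at each step.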


**SAVING AND LOADING MODELS:**

In [34]:
#@ SAVING MODELS: 
learn.save("/content/gdrive/MyDrive/1Epoch")                  # Saving the Model. 

Path('/content/gdrive/MyDrive/1Epoch.pth')

In [None]:
#@ LOADING MODELS: 
learn = learn.load("/content/gdrive/MyDrive/1Epoch")          # Loading the Model. 