### **INITIALIZATION:**
- I put these three lines at the top of each of my notebooks. The first two reload imported modules automatically before code runs, which prevents stale-code problems when I rerun the same project, and the third renders plots inline within the notebook.

In [1]:
#@ INITIALIZATION: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**LIBRARIES AND DEPENDENCIES:**
- I install and import all the libraries and dependencies required for the project in the cells below.

In [3]:
#@ INSTALLING DEPENDENCIES: UNCOMMENT BELOW: 
# !pip install -Uqq fastbook
# import fastbook
# fastbook.setup_book()

In [4]:
#@ IMPORTING LIBRARIES AND DEPENDENCIES: 
from fastbook import *                              # Getting all the Libraries. 
from fastai.callback.fp16 import *
from fastai.text.all import *                       # Getting all the Libraries.
from IPython.display import display, HTML

### **GETTING THE DATASET:**
- I will get the **IMDB Dataset** here. 

In [5]:
#@ GETTING THE DATASET: 
path = untar_data(URLs.IMDB)                       # Getting Path to the Dataset. 
path.ls()                                          # Inspecting the Path. 

(#7) [Path('/root/.fastai/data/imdb/unsup'),Path('/root/.fastai/data/imdb/tmp_clas'),Path('/root/.fastai/data/imdb/train'),Path('/root/.fastai/data/imdb/imdb.vocab'),Path('/root/.fastai/data/imdb/test'),Path('/root/.fastai/data/imdb/README'),Path('/root/.fastai/data/imdb/tmp_lm')]

In [6]:
#@ GETTING TEXT FILES: 
files = get_text_files(path, folders=["train", "test", "unsup"])        # Getting Text Files. 
txt = files[0].open().read()                                            # Getting a Text. 
txt[:75]                                                                # Inspecting Text. 

"Well it wasn't horrible but it wasn't great. No where near as good as the o"

### **WORD TOKENIZATION:**
- **Word Tokenization** splits a sentence on spaces and also applies language-specific rules to try to separate parts of meaning even when there are no spaces. Punctuation marks are generally split into separate tokens as well. A **Token** is an element of the list created by the **Tokenization** process; it can be a word, a part of a word (a subword), or a single character.

In [7]:
#@ INITIALIZING WORD TOKENIZATION: 
spacy = WordTokenizer()                                  # Initializing Tokenizer. 
toks = first(spacy([txt]))                               # Getting Tokens of Words. 
print(coll_repr(toks, 30))                               # Inspecting Tokens. 

(#298) ['Well','it','was',"n't",'horrible','but','it','was',"n't",'great','.','No','where','near','as','good','as','the','original','.','It','kinda','tried','to','hard','to','be','the','first','movie'...]


In [8]:
#@ INSPECTING TOKENIZATION: EXAMPLE:
first(spacy(['The U.S. dollar $1 is $1.00.']))           # Inspecting Tokens. 

(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']
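
**Sketch:**
- To make the idea of rule-based splitting concrete, here is a minimal sketch in plain Python. It is not spaCy's algorithm: `TOKEN_RE` and `toy_word_tokenize` are hypothetical names of my own, and the pattern only handles numbers, the `n't` contraction, and single punctuation marks. It would, for example, split `U.S.` into four tokens, which the real tokenizer above avoids.

```python
import re

# Hypothetical toy pattern (an assumption for illustration, not spaCy's rules).
# Alternatives are tried from top to bottom at each position.
TOKEN_RE = re.compile(r"""
    \d+(?:\.\d+)?     # a number such as 1 or 1.00
  | n't               # the negative contraction as its own token
  | \w+?(?=n't)       # the word part just before the contraction
  | \w+               # any other run of word characters
  | [^\w\s]           # a single punctuation mark or symbol
""", re.VERBOSE)

def toy_word_tokenize(text):
    """Split text into word, number, contraction and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(toy_word_tokenize("Well it wasn't horrible but it wasn't great."))
```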

In [9]:
#@ INITIALIZING WORD TOKENIZATION WITH FASTAI: 
tkn = Tokenizer(spacy)                                   # Initializing Tokenizer. 
print(coll_repr(tkn(txt), 31))                           # Inspecting Tokens. 

(#331) ['xxbos','xxmaj','well','it','was',"n't",'horrible','but','it','was',"n't",'great','.','xxmaj','no','where','near','as','good','as','the','original','.','xxmaj','it','kinda','tried','to','hard','to','be'...]


**Note:**
- **xxbos** : Indicates the beginning of a text. 
- **xxmaj** : Indicates the next word begins with a capital. 
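- **xxunk** : Indicates the word is unknown, i.e. it is not in the vocab. 
- **xxup** : Indicates the next word was written in all caps. 
- **xxrep** : Indicates the next character is repeated the given number of times. 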
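
**Sketch:**
- The capitalization tokens can be imitated with a few lines of plain Python. This is a simplified approximation under my own assumptions, not fastai's actual rules: `add_special_tokens` is a hypothetical helper that works on an already tokenized list and only covers **xxbos**, **xxmaj** and **xxup**.

```python
BOS, MAJ, UP = "xxbos", "xxmaj", "xxup"

def add_special_tokens(tokens):
    """Lowercase tokens and record the lost capitalization as special tokens."""
    out = [BOS]                              # mark the beginning of the text
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():   # all caps, e.g. "INDEX"
            out += [UP, tok.lower()]
        elif len(tok) > 1 and tok[0].isupper():  # capitalized, e.g. "Well"
            out += [MAJ, tok.lower()]
        else:                                # everything else is just lowercased
            out.append(tok.lower())
    return out

print(add_special_tokens(["Well", "it", "was", "n't", "great", "."]))
```

- Lowercasing everything keeps the vocab small, while the extra tokens let the model still learn what capitalization means.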

In [10]:
#@ INSPECTING TOKENIZATION: EXAMPLE:
coll_repr(tkn('&copy; Fast.ai www.fast.ai/INDEX'), 30)   # Inspecting Tokens. 

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"

### **SUBWORD TOKENIZATION:**
- **Word Tokenization** relies on the assumption that spaces provide a useful separation of the components of meaning in a sentence, which is not always appropriate. Languages such as Chinese and Japanese don't use spaces, and in such cases **Subword Tokenization** generally works best. **Subword Tokenization** splits words into smaller parts based on the most commonly occurring substrings. 

In [11]:
#@ INITIALIZING SUBWORD TOKENIZATION: EXAMPLE:
txts = L(o.open().read() for o in files[:2000])                # Getting List of Reviews. 

#@ INITIALIZING SUBWORD TOKENIZER: 
def subword(sz):                                               # Defining Function.      
    sp = SubwordTokenizer(vocab_sz=sz)                         # Initializing Subword Tokenizer. 
    sp.setup(txts)                                             # Getting Sequence of Characters. 
    return " ".join(first(sp([txt]))[:40])                     # Inspecting the Vocab. 

#@ IMPLEMENTATION: 
subword(1000)                                                  # Inspecting Subword Tokenization. 

"▁We ll ▁it ▁was n ' t ▁horrible ▁but ▁it ▁was n ' t ▁great . ▁No ▁where ▁near ▁as ▁good ▁as ▁the ▁original . ▁It ▁kind a ▁tri ed ▁to ▁hard ▁to ▁be ▁the ▁first ▁movie . ▁I ▁think"

**Notes:**
- Here **setup** is a special fastai method that is called automatically in the usual data processing pipelines. It reads the documents and finds the common sequences of characters to create the vocab. Similarly, [**L**](https://fastcore.fast.ai/#L) is fastcore's list class, often described as a superpowered list. The special character '▁' represents a space character in the original text. 

In [12]:
#@ IMPLEMENTATION OF SUBWORD TOKENIZATION: 
subword(200)                                                  # Inspecting Vocab. 
subword(10000)                                                # Inspecting Vocab. 

"▁Well ▁it ▁wasn ' t ▁horrible ▁but ▁it ▁wasn ' t ▁great . ▁No ▁where ▁near ▁as ▁good ▁as ▁the ▁original . ▁It ▁kind a ▁tried ▁to ▁hard ▁to ▁be ▁the ▁first ▁movie . ▁I ▁think ▁it ▁needed ▁a ▁better"

**Note:**
- A larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember. On the downside, it means larger embedding matrices, which require more data to learn. **Subword Tokenization** provides a way to easily scale between character tokenization, i.e. using a small subword vocab, and word tokenization, i.e. using a large subword vocab. It also handles every human language without needing language-specific algorithms to be developed. 
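
**Sketch:**
- The idea of building a vocab from commonly occurring substrings can be sketched with byte-pair-encoding-style merges. This is a swapped-in, simpler technique for illustration, and I am not claiming it matches what `SubwordTokenizer` does internally. `learn_merges`, `merge_pair` and `segment` are hypothetical helper names. The number of merges plays a role loosely comparable to the vocab size: more merges give longer subwords and fewer tokens per word.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_merges(words, n_merges):
    """Learn up to n_merges merges, always joining the most frequent adjacent pair."""
    # Every word starts as single characters; "▁" marks the space before the word.
    corpus = Counter(tuple("▁" + w) for w in words)
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:                        # every word is already one symbol
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def segment(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple("▁" + word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
for n in (0, 3, 50):                         # few merges ~ characters, many ~ whole words
    print(n, segment("newest", learn_merges(words, n)))
```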

### **NUMERICALIZATION:**
- **Numericalization** is the process of mapping tokens to integers. It involves making a list of all possible levels of the categorical variable, which here is the vocab, and replacing each level with its index in the vocab.
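
**Sketch:**
- The mapping itself needs nothing more than a frequency count and a dictionary. Here is a minimal sketch in plain Python, assuming a shortened list of special tokens and my own helper names (`build_vocab`, `numericalize`, `denumericalize`). It illustrates the idea and is not the code behind fastai's `Numericalize`.

```python
from collections import Counter

SPECIALS = ["xxunk", "xxpad", "xxbos"]       # shortened list, for illustration only

def build_vocab(token_lists, min_freq=1):
    """Special tokens first, then the remaining tokens from most to least frequent."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = [tok for tok, c in counts.most_common()
             if c >= min_freq and tok not in SPECIALS]
    return SPECIALS + words

def numericalize(tokens, vocab):
    """Replace each token with its index; unseen tokens map to xxunk (index 0)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

def denumericalize(ids, vocab):
    """Map indices back to their tokens."""
    return [vocab[i] for i in ids]

docs = [["xxbos", "it", "was", "great"], ["xxbos", "it", "was", "bad"]]
vocab = build_vocab(docs)
ids = numericalize(["xxbos", "it", "was", "awful"], vocab)
print(ids, denumericalize(ids, vocab))
```

- Raising `min_freq` drops rare tokens from the vocab, so they fall back to **xxunk** and the embedding matrix stays smaller.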

In [13]:
#@ INITIALIZING TOKENS: 
toks = tkn(txt)                                              # Getting Tokens. 
print(coll_repr(toks, 31))                                   # Inspecting Tokens. 

(#331) ['xxbos','xxmaj','well','it','was',"n't",'horrible','but','it','was',"n't",'great','.','xxmaj','no','where','near','as','good','as','the','original','.','xxmaj','it','kinda','tried','to','hard','to','be'...]


In [14]:
#@ INITIALIZING TOKENS: 
toks200 = txts[:200].map(tkn)                                # Getting Tokens. 
toks200[0]                                                   # Inspecting Tokens. 

(#331) ['xxbos','xxmaj','well','it','was',"n't",'horrible','but','it','was'...]

In [15]:
#@ NUMERICALIZATION USING FASTAI: 
num = Numericalize()                                         # Initializing Numericalization. 
num.setup(toks200)                                           # Getting Integers. 
coll_repr(num.vocab, 20)                                     # Inspecting Vocabulary. 

"(#2120) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the',',','.','and','a','of','to','is','in','it','i'...]"

In [16]:
#@ INITIALIZING NUMERICALIZATION: 
nums = num(toks)[:20]; nums                                  # Inspection. 
" ".join(num.vocab[o] for o in nums)                         # Getting Original Text. 

"xxbos xxmaj well it was n't horrible but it was n't great . xxmaj no where near as good as"

### **CREATING BATCHES FOR LANGUAGE MODEL:**
- At every epoch I will shuffle the collection of documents, concatenate them into a stream of tokens, and cut that stream into a batch of fixed-size consecutive mini-streams. The model will then read the mini-streams in order. 
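
**Sketch:**
- The cutting step can be sketched on a plain list of integers. This is a simplified illustration under my own assumptions, not the `LMDataLoader` implementation: `lm_batches` is a hypothetical helper, it skips the shuffling, and it drops any leftover tokens that don't fill a mini-stream. The target `y` is the input `x` shifted by one token, and row `j` of each batch continues where row `j` of the previous batch stopped.

```python
def lm_batches(stream, bs, seq_len):
    """Yield (x, y) batches from one long token stream; y is x shifted by one."""
    n = (len(stream) - 1) // bs              # tokens per mini-stream
    # Each mini-stream keeps one extra token so the last target exists.
    rows = [stream[i * n:(i + 1) * n + 1] for i in range(bs)]
    for start in range(0, n, seq_len):
        end = min(start + seq_len, n)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y

stream = list(range(21))                     # stand-in for numericalized tokens
for x, y in lm_batches(stream, bs=2, seq_len=5):
    print(x, y)
```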

In [17]:
#@ CREATING BATCHES FOR LANGUAGE MODEL: 
nums200 = toks200.map(num)                                   # Initializing Numericalization. 
dl = LMDataLoader(nums200)                                   # Creating Language Model Data Loaders. 

#@ INSPECTING FIRST BATCH: 
x, y = first(dl)                                             # Getting First Batch of Data. 
x.shape, y.shape                                             # Inspecting Shape of Data. 

(torch.Size([64, 72]), torch.Size([64, 72]))

In [18]:
#@ INSPECTING THE DATA: 
" ".join(num.vocab[o] for o in x[0][:20])                    # Inspecting Independent Variable. 

"xxbos xxmaj well it was n't horrible but it was n't great . xxmaj no where near as good as"

In [19]:
#@ INSPECTING THE DATA: 
" ".join(num.vocab[o] for o in y[0][:20])                    # Inspecting Dependent Variable. 

"xxmaj well it was n't horrible but it was n't great . xxmaj no where near as good as the"

### **TRAINING A TEXT CLASSIFIER:**

**LANGUAGE MODEL USING DATABLOCK:**

In [20]:
#@ CREATING LANGUAGE MODEL USING DATABLOCK: 
get_imdb = partial(get_text_files, folders=["train", "test", "unsup"])    # Getting Text Files. 
db = DataBlock(blocks=TextBlock.from_folder(path, is_lm=True),            # Initializing TextBlock. 
               get_items=get_imdb, splitter=RandomSplitter(0.1))          # Initializing DataBlock. 

#@ CREATING LANGUAGE MODEL DATALOADERS: 
dls_lm = db.dataloaders(path, path=path, bs=128, seq_len=80)              # Initializing Data Loaders. 

In [21]:
#@ INSPECTING THE BATCHES OF DATA: 
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxmaj this is basically a goofball comedy , with somewhat odd pacing due to some dramatic elements . xxmaj for xxmaj michael xxup j. xxmaj fox and xxmaj paul xxmaj reubens , it was their first xxunk had previously been in a short lived xxup tv series and a xxup tv movie ) . \n\n xxmaj since the movie is basically a race / scavenger hunt type movie , like "" cannonball xxmaj run "" , "" it 's","xxmaj this is basically a goofball comedy , with somewhat odd pacing due to some dramatic elements . xxmaj for xxmaj michael xxup j. xxmaj fox and xxmaj paul xxmaj reubens , it was their first xxunk had previously been in a short lived xxup tv series and a xxup tv movie ) . \n\n xxmaj since the movie is basically a race / scavenger hunt type movie , like "" cannonball xxmaj run "" , "" it 's a"
1,"concerned only with the behavior of its characters , it 's original and challenging . xxmaj then it turns into a story filled with familiar elements , and by the end everything is happening by the numbers . xxbos xxmaj what goes through the mind of the office drone that snaps and shoots up the place ? xxmaj what drives a person like that ? xxmaj come on , we 've all seen the reports "" he xxmaj was a","only with the behavior of its characters , it 's original and challenging . xxmaj then it turns into a story filled with familiar elements , and by the end everything is happening by the numbers . xxbos xxmaj what goes through the mind of the office drone that snaps and shoots up the place ? xxmaj what drives a person like that ? xxmaj come on , we 've all seen the reports "" he xxmaj was a xxmaj"


**FINETUNING THE LANGUAGE MODEL:**
- I will use **Embeddings** to convert the integer word indices into activations that can be used by the neural network. These embeddings are fed into a **Recurrent Neural Network** using an architecture called **AWD-LSTM**. 

In [22]:
#@ INITIALIZING LANGUAGE MODEL LEARNER: 
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3,                 # Using AWD LSTM Architecture. 
                               metrics=[accuracy, Perplexity()]).to_fp16()      # Initializing LM Learner.    

In [None]:
#@ TRAINING EMBEDDINGS WITH RANDOM INITIALIZATION: 
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.006634,3.903991,0.299049,49.600018,21:03
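
**Sketch:**
- The perplexity column is the exponential of the cross-entropy loss, so it can be checked by hand from the validation loss. A minimal sketch, where `perplexity` is my own helper name rather than the fastai metric class:

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the average cross-entropy loss."""
    return math.exp(cross_entropy_loss)

valid_loss = 3.903991                        # validation loss from the table above
print(perplexity(valid_loss))                # should land close to the reported 49.600018
```

- A lower perplexity means the model is less surprised by the next word, so the two columns always move together.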


**SAVING AND LOADING MODELS:**

In [None]:
#@ SAVING MODELS: 
learn.save("/content/gdrive/MyDrive/1Epoch")                  # Saving the Model. 

Path('/content/gdrive/MyDrive/1Epoch.pth')

In [23]:
#@ LOADING MODELS: 
learn = learn.load("/content/gdrive/MyDrive/1Epoch")          # Loading the Model. 

In [24]:
#@ FINETUNING THE LANGUAGE MODEL: 
learn.unfreeze()                                              # Unfreezing the Layers. 
learn.fit_one_cycle(10, 2e-3)                                 # Training the Model. 

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.33838,4.253496,0.291754,70.350945,22:53
1,3.975756,3.908851,0.31335,49.841671,23:06
2,3.798644,3.76222,0.323882,43.043861,22:59
3,3.690822,3.682423,0.330617,39.742588,22:59
4,3.606129,3.640616,0.334283,38.115314,22:53
5,3.529613,3.610161,0.337432,36.972008,23:02
6,3.455461,3.589717,0.339728,36.223839,23:07
7,3.384331,3.582524,0.341028,35.964203,23:23
8,3.340192,3.580934,0.341526,35.907047,23:15
9,3.312969,3.583758,0.341377,36.008621,23:11


**ENCODER**
- An **Encoder** is the model without its task-specific final layers. The term **Encoder** means much the same thing as body when applied to a vision **CNN**, but **Encoder** tends to be used more for NLP and generative models.

In [25]:
#@ SAVING MODELS: 
learn.save_encoder("/content/gdrive/MyDrive/FineTuned")       # Saving the Model. 

### **TEXT GENERATION:**
- I will use the model to write new reviews. 

In [26]:
#@ INITIALIZING TEXT GENERATION: 
TEXT = "I hate the movie because"                            # Text Example. 
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)
         for _ in range(N_SENTENCES)]                       # Initializing Text Generation. 
print(" ".join(preds))                                      # Inspecting Generated Reviews.  

i hate the movie because of the awful acting and the Canadian scenery . 

 This movie is not worth a penny . There is nothing funny or funny about this movie . Here 's another one of those b movies i hate the movie because it is so predictable and ridiculous . No , this one is not for people who are not . This movie is a " thriller " , with a lot of nudity and problem in the plot .
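
**Sketch:**
- The `temperature` argument controls how random the generated words are. A common formulation, which I am assuming is equivalent to what happens inside `learn.predict`, divides the model's scores by the temperature before turning them into probabilities. `softmax_with_temperature` and `sample_next` are hypothetical helper names, and the scores are made up.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, rng=random):
    """Sample the index of the next token from the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

scores = [2.0, 1.0, 0.0]                     # made-up scores for three candidate words
for t in (0.75, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(scores, t)])
```

- A temperature below 1, such as the 0.75 used above, makes likely words even more likely, which gives more coherent but less varied text.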


### **TEXT CLASSIFICATION:**
- Here, I am moving on to **Classifier** fine-tuning rather than the **Language Model** fine-tuning done above. A **Language Model** predicts the next word of a document, so it doesn't require any external labels. A **Classifier** predicts an external label, which for IMDB is the sentiment of a review. 

In [27]:
#@ INITIALIZING DATALOADERS:
db_clas = DataBlock(blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),        # Initializing Text Blocks. 
                            CategoryBlock),                                         # Initializing Category Block.
                    get_y=parent_label,                                             # Getting Target. 
                    get_items=partial(get_text_files,folders=["train","test"]),     # Getting Text Files. 
                    splitter=GrandparentSplitter(valid_name="test"))                # Splitting the Data. 
dls_clas = db_clas.dataloaders(path, path=path, bs=128, seq_len=72)                 # Initializing DataLoaders. 

#@ INSPECTING THE BATCHES: 
dls_clas.show_batch(max_n=2)                                                        # Inspection. 

Unnamed: 0,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos * * attention xxmaj spoilers * * \n\n xxmaj first of all , let me say that xxmaj rob xxmaj roy is one of the best films of the 90 's . xxmaj it was an amazing achievement for all those involved , especially the acting of xxmaj liam xxmaj neeson , xxmaj jessica xxmaj lange , xxmaj john xxmaj hurt , xxmaj brian xxmaj cox , and xxmaj tim xxmaj roth . xxmaj michael xxmaj canton xxmaj jones painted a wonderful portrait of the honor and dishonor that men can represent in themselves . xxmaj but alas … \n\n it constantly , and unfairly gets compared to "" braveheart "" . xxmaj these are two entirely different films , probably only similar in the fact that they are both about xxmaj scots in historical xxmaj scotland . xxmaj yet , this comparison frequently bothers me because it seems",pos


In [30]:
#@ CREATING MODEL FOR TEXT CLASSIFICATION: 
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()                        # Initializing Text Classifier Learner. 
learn = learn.load_encoder("/content/gdrive/MyDrive/FineTuned")                    # Loading the Encoder. 

**FINETUNING THE CLASSIFIER:**
- I will train the **Classifier** with discriminative learning rates and gradual unfreezing. In computer vision, unfreezing the whole model at once is a common approach, but for an **NLP Classifier** unfreezing a few layers at a time makes a real difference. 

In [31]:
#@ TRAINING THE CLASSIFIERS: 
learn.fit_one_cycle(1, 2e-2)                                                       # Training with the Body Frozen. 

epoch,train_loss,valid_loss,accuracy,time
0,0.2382,0.180242,0.93256,01:09


In [32]:
#@ TRAINING THE CLASSIFIERS: UNFREEZING LAYERS: 
learn.freeze_to(-2)                                                               # Unfreezing the Last Two Layer Groups. 
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))                                # Training the Classifier. 

epoch,train_loss,valid_loss,accuracy,time
0,0.219655,0.163169,0.93788,01:14
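
**Sketch:**
- Passing a `slice` as the learning rate gives the earliest layer group the lowest rate and the last group the highest. Here is a minimal sketch of that spread, assuming the rates are spaced geometrically and that the model has 5 layer groups. Both are my assumptions for illustration, and `discriminative_lrs` is a hypothetical helper rather than a fastai function.

```python
def discriminative_lrs(lo, hi, n_groups):
    """Spread learning rates geometrically from lo (first group) to hi (last group)."""
    if n_groups == 1:
        return [hi]
    ratio = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * ratio ** i for i in range(n_groups)]

# Same bounds as slice(1e-2/(2.6**4), 1e-2); n_groups=5 is an assumption.
lrs = discriminative_lrs(1e-2 / 2.6**4, 1e-2, n_groups=5)
print([f"{lr:.2e}" for lr in lrs])
```

- With 5 groups, dividing the top rate by `2.6**4` makes every group train 2.6 times slower than the group after it, so the early layers that hold general language knowledge change the least.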


In [33]:
#@ TRAINING THE CLASSIFIERS: UNFREEZING LAYERS: 
learn.freeze_to(-3)                                                               # Unfreezing the Last Three Layer Groups. 
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))                                # Training the Classifier. 

epoch,train_loss,valid_loss,accuracy,time
0,0.187817,0.147782,0.9446,01:37


In [34]:
#@ TRAINING THE CLASSIFIERS: UNFREEZING LAYERS: 
learn.unfreeze()                                                                  # Unfreezing. 
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))                                # Training the Classifier. 

epoch,train_loss,valid_loss,accuracy,time
0,0.158834,0.149136,0.94448,01:55
1,0.143565,0.14869,0.94772,01:55
