### **INITIALIZATION:**
- I put these three lines of code at the top of each of my notebooks. The first two reload imported modules automatically, which helps prevent problems when rerunning the same project, and the third renders visualizations inline within the notebook.

In [1]:
#@ INITIALIZATION: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**LIBRARIES AND DEPENDENCIES:**
- I have imported all the libraries and dependencies required for the project in a single cell.

In [3]:
#@ INSTALLING DEPENDENCIES: UNCOMMENT BELOW: 
# !pip install -Uqq fastbook
# import fastbook
# fastbook.setup_book()
# !pip install -Uq transformers

In [4]:
#@ IMPORTING LIBRARIES AND DEPENDENCIES: 
from fastai.basics import *
from fastai.callback.all import *
from fastai.text.all import *

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

#@ IGNORING WARNINGS: 
import warnings
warnings.filterwarnings("ignore")

### **GPT2 MODEL:**
- There are several versions of the **GPT2 Model**, listed in the [**Transformers Documentation**](https://huggingface.co/transformers/pretrained_models.html). Here I will inspect the **Tokenizer** and the **Model**. The **Tokenizers** in **HuggingFace** usually perform tokenization and numericalization in one step, and the **Model** can generate predictions from the resulting ids. 
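
To make "tokenization and numericalization in one step" concrete, here is a minimal sketch using a toy whitespace tokenizer. The `ToyTokenizer` class and its vocabulary are hypothetical and only mirror the method names used in this notebook (`tokenize`, `convert_tokens_to_ids`, `encode`, `decode`); the real GPT2 tokenizer uses byte-level BPE rather than whitespace splitting.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer, NOT the GPT2 byte-level BPE tokenizer."""

    def __init__(self, vocab):
        # Map each token to an integer id and back.
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def tokenize(self, text):
        # Step 1: tokenization (text -> list of string tokens).
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        # Step 2: numericalization (string tokens -> integer ids).
        return [self.token_to_id[t] for t in tokens]

    def encode(self, text):
        # Both steps in a single call.
        return self.convert_tokens_to_ids(self.tokenize(text))

    def decode(self, ids):
        # Inverse mapping (integer ids -> text).
        return " ".join(self.id_to_token[i] for i in ids)


toy = ToyTokenizer(["hello", "there", "how", "are", "you"])
sentence = "hello there how are you"

two_step = toy.convert_tokens_to_ids(toy.tokenize(sentence))
one_step = toy.encode(sentence)
round_trip = toy.decode(one_step)
```

In this sketch `two_step` and `one_step` hold the same ids, and `round_trip` recovers the original sentence.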

In [6]:
#@ LOADING PRETRAINED MODEL: 
pretrained_weights = "gpt2"                                         # Name of Pretrained Checkpoint. 
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_weights)   # Initializing Tokenizer. 
model = GPT2LMHeadModel.from_pretrained(pretrained_weights)         # Initializing Pretrained Model. 

In [7]:
#@ INSPECTING TOKENIZER: 
ids = tokenizer.encode("Hello there! How are you?")                 # Implementation of Tokenizer. 
print(ids)                                                          # Inspection. 
tokenizer.decode(ids)                                               # Getting Text. 

[15496, 612, 0, 1374, 389, 345, 30]


'Hello there! How are you?'

In [8]:
#@ INSPECTING PREDICTIONS: 
t = torch.LongTensor(ids)[None]                                     # Adding Batch Dimension: Shape [1, 7]. 
preds = model.generate(t)                                           # Generating Predictions. 
preds.shape, preds[0]                                               # Inspection. 

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


(torch.Size([1, 20]),
 tensor([15496,   612,     0,  1374,   389,   345,    30,   198,   198,    40,  1101,   257,  1310,  1643, 10032,   286,   262,  6678,  3404,    13]))
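
The `preds` tensor above holds 20 ids: the 7 prompt ids followed by 13 generated ones. As a rough sketch of how a prompt can be extended one token at a time, the loop below assumes greedy decoding (always picking the highest-scoring next token) up to a fixed maximum length. The `toy_scores` function is a hypothetical stand-in for the model, and this is not the `generate` implementation itself.

```python
def greedy_generate(prompt_ids, next_token_scores, max_length):
    """Extend prompt_ids until max_length, taking the top-scoring token each step."""
    ids = list(prompt_ids)
    while len(ids) < max_length:
        scores = next_token_scores(ids)                           # One score per vocabulary entry.
        best = max(range(len(scores)), key=scores.__getitem__)    # Argmax over the vocabulary.
        ids.append(best)
    return ids


def toy_scores(ids):
    """Hypothetical model: over a 5-token vocabulary, favours (last id + 1) mod 5."""
    vocab_size = 5
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


generated = greedy_generate([0, 1], toy_scores, max_length=6)
```

The prompt ids are kept at the front of `generated`, just as the prompt ids appear at the front of `preds[0]`.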

In [9]:
#@ INSPECTING PREDICTIONS: 
tokenizer.decode(preds[0].numpy())                                  # Inspection. 

"Hello there! How are you?\n\nI'm a little bit tired of the usual stuff."

### **PREPARING DATA:**
- I will use the **wikitext-2** dataset here. 

In [12]:
#@ GETTING THE DATASET: 
path = untar_data(URLs.WIKITEXT_TINY)                   # Initializing Path to Dataset. 
path.ls()                                               # Inspection.                           

(#2) [Path('/root/.fastai/data/wikitext-2/test.csv'),Path('/root/.fastai/data/wikitext-2/train.csv')]

In [13]:
#@ LOADING THE DATASET: 
df_train = pd.read_csv(path/"train.csv", header=None)   # Reading the Data.
df_valid = pd.read_csv(path/"test.csv", header=None)    # Reading the Data. 
df_train.head(2)                                        # Inspecting the Data. 

|   | 0 |
|---|---|
| 0 | \n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z... |
| 1 | \n = Big Boy ( song ) = \n \n " Big Boy " <unk> " I 'm A Big Boy Now " was the first single ever recorded by the Jackson 5 , which was released by Steeltown Records in January 1968 . The group played instruments on many of their Steeltown compositions , including " Big Boy " . The song was neither a critical nor commercial success , but the Jackson family were delighted with the outcome nonetheless . \n The Jackson 5 would release a second single with Steeltown Records before moving to Motown Records . The group 's recordings at Steeltown Records were thought to be lost , but they were re... |


In [15]:
#@ LOADING THE DATA:
all_texts = np.concatenate([df_train[0].values, 
                            df_valid[0].values])        # Initializing Concatenation. 

### **TRANSFORMERS TOKENIZER:**

**TRANSFORM METHOD:**  
A **Fastai Transform** is defined by:     
- an **encodes** method that is applied when the **Transform** is called. 
- a **decodes** method that is applied when the **decode** method of the **Transform** is called. 
- a **setups** method that sets the inner state of the **Transform**. 
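
The three methods listed above can be sketched in plain Python. `ToyTransform` is a hypothetical stand-in written for illustration, not fastai's `Transform` class, which additionally handles type dispatch and other details omitted here.

```python
class ToyTransform:
    """Hypothetical stand-in for the encodes / decodes / setups interface."""

    def setups(self, items):
        # Build inner state: a vocabulary collected from the given texts.
        self.id_to_word = sorted({w for text in items for w in text.split()})
        self.word_to_id = {w: i for i, w in enumerate(self.id_to_word)}

    def encodes(self, x):
        # Applied when the transform is called: text -> list of ids.
        return [self.word_to_id[w] for w in x.split()]

    def decodes(self, ids):
        # Applied when decode is called: list of ids -> text.
        return " ".join(self.id_to_word[i] for i in ids)

    # Public entry points that delegate to the three methods above.
    def setup(self, items):
        self.setups(items)

    def __call__(self, x):
        return self.encodes(x)

    def decode(self, x):
        return self.decodes(x)


tfm = ToyTransform()
tfm.setup(["a b c", "b c d"])    # Sets inner state (vocabulary: a, b, c, d).
encoded = tfm("b d")             # Calls encodes.
decoded = tfm.decode(encoded)    # Calls decodes.
```

Keeping `decodes` next to `encodes` means anything the transform produces can be turned back into a readable form for inspection.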

In [16]:
#@ DEFINING TRANSFORMERS TOKENIZER: 
class TransformersTokenizer(Transform):                     # Wrapping HuggingFace Tokenizer as a Transform. 
    def __init__(self, tokenizer):                          # Initializing Constructor Function. 
        self.tokenizer = tokenizer                          # Storing Tokenizer. 
    
    def encodes(self, x):                                   # Encoding Text to Tensor of IDs. 
        toks = self.tokenizer.tokenize(x)                   # Splitting Text into Tokens. 
        return tensor(
            self.tokenizer.convert_tokens_to_ids(toks))     # Converting Tokens to IDs. 
    
    def decodes(self, x):                                   # Decoding IDs back to Text. 
        return TitledStr(
            self.tokenizer.decode(x.cpu().numpy()))

In [18]:
#@ IMPLEMENTATION OF TRANSFORM METHOD: 
splits = [range_of(df_train), 
          list(range(len(df_train), len(all_texts)))]       # Train and Validation Indices. 
tls = TfmdLists(all_texts, TransformersTokenizer(tokenizer), 
                splits=splits, dl_type=LMDataLoader)        # Initializing Transformed Lists. 

#@ INSPECTING TRANSFORMED LISTS: 
print(tls.train[0], tls.valid[0])
print(tls.tfms(tls.train.items[0]).shape, 
      tls.tfms(tls.valid.items[0]).shape)

tensor([220, 198, 796,  ..., 198, 220, 198]) tensor([220, 198, 796,  ..., 198, 220, 198])
torch.Size([4576]) torch.Size([1485])
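
The `splits` passed to `TfmdLists` in the cell above are two lists of indices into the concatenated texts: the first covers the training rows and the second covers the rows appended after them. A minimal sketch of the same indexing with small, made-up lists (the item names are hypothetical):

```python
# Made-up stand-ins for the training and validation texts.
train_texts = ["train 0", "train 1", "train 2"]
valid_texts = ["valid 0", "valid 1"]

# Concatenate, as done for all_texts.
all_items = train_texts + valid_texts

# Training indices start at 0; validation indices continue where training ends.
train_idx = list(range(len(train_texts)))
valid_idx = list(range(len(train_texts), len(all_items)))
toy_splits = [train_idx, valid_idx]

# Selecting by index recovers the two original subsets.
train_subset = [all_items[i] for i in toy_splits[0]]
valid_subset = [all_items[i] for i in toy_splits[1]]
```

Because the validation indices start at the length of the training set, the two subsets never overlap.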
