**QUALITY DATA IS TOUGH TO FIND.** 

Yes, it is tough to find, and no one wants to spend days reading research papers in search of that perfect dataset. The perfect publicly available dataset DOES NOT EXIST. 
Sure, you could spend hours on feature engineering, but wait: we can generate our own dataset. YES. THAT IS THE PURPOSE.

(Upvote this or the Chinese Government will take away my access to the internet)

> #### So what? Should I waste my money to use AWS or Azure's annotation services and get my already shady data more shady labels?
Answer: **NO.**

So how do I plan to solve this problem?
## Dumb Idea: Use a Text Generator to Generate More Data to Feed into a Classifier.

# Importing Stuff

### So, what are the main libraries required here?
To use a Hugging Face transformer here we need four libraries: `transformers`, `fastai`, `pandas` and `PyTorch`.

In [None]:
# Installing:
!pip install -Uqq fastai
!pip install transformers

In [None]:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

import torch
from fastai.text.all import *

import pandas as pd

In [None]:
# Importing the model and the tokenizer:

pretrained_weights = 'gpt2'
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model = GPT2LMHeadModel.from_pretrained(pretrained_weights)

Before we move on to the fine-tuning part, let's have a look at this `tokenizer` and this `model`. The tokenizers in `HuggingFace` usually do the tokenization and the numericalization in one step (we ignore the padding warning for now):

# Playing Around with the Tokenizer 🏃🏽‍♀️


In [None]:
# Encoding a sentence and checking it out:

ids = tokenizer.encode('This is an example of text, and')
ids

In [None]:
# Decoding the same bad boy: 

tokenizer.decode(ids)
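
To make the "tokenization and numericalization in one step" idea concrete, here is a toy sketch in plain Python. The vocabulary and the whitespace splitting are made up for illustration (GPT2 really uses byte-level BPE); only the shape of the pipeline is meant to resemble what `tokenizer.encode` and `tokenizer.decode` do.

```python
# Toy vocabulary: hypothetical, NOT GPT2's real one.
vocab = {"this": 0, "is": 1, "an": 2, "example": 3, "<unk>": 4}
inv_vocab = {i: tok for tok, i in vocab.items()}

def toy_tokenize(text):
    # Step 1: tokenization (here just lowercase + whitespace split).
    return text.lower().split()

def toy_numericalize(tokens):
    # Step 2: numericalization (token -> integer id, unknown tokens map to <unk>).
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

def toy_encode(text):
    # Both steps in one call, which is the convenience the real tokenizer offers.
    return toy_numericalize(toy_tokenize(text))

def toy_decode(ids):
    # Map the ids back to tokens and join them into a string.
    return " ".join(inv_vocab[i] for i in ids)

toy_ids = toy_encode("This is an example")
print(toy_ids)              # [0, 1, 2, 3]
print(toy_decode(toy_ids))  # this is an example
```
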

# Preprocessing 👩‍💻

Let's have a look at what those `csv` files look like:

In [None]:
# Reading the training CSV:
df_train = pd.read_csv("../input/nazidataset/nazitweets.csv", header=None, lineterminator='\n')

# Removing the NaN rows.
df_train = df_train.dropna()

# Taking a look at it:
df_train.head()


In [None]:
# Converting all the texts into a numpy array (one entry per tweet):
all_texts = df_train[1].values

In [None]:
# (Stolen Code Alert)
class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x): 
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

In [None]:
# Defining the splits for the dataloader, holding out part of the texts for validation
# (the 10% fraction and the seed are arbitrary choices):
splits = RandomSplitter(valid_pct=0.1, seed=42)(range_of(all_texts))

# Defining the Transformed Lists:
tls = TfmdLists(all_texts, TransformersTokenizer(tokenizer), splits=splits, dl_type=LMDataLoader)

We specify `dl_type=LMDataLoader` for when we will convert this `TfmdLists` to `DataLoaders`: we will use an `LMDataLoader` since we have a language modeling problem, not the usual fastai `TfmdDL`.

In [None]:
show_at(tls.train, 0)

The fastai library expects the data to be assembled in a `DataLoaders` object (something that has a training and validation dataloader). We can get one by using the `dataloaders` method. We just have to specify a batch size and a sequence length. GPT2 was trained with sequences of 1024 tokens, but **we will not use that sequence length. GPT2 is a stateless model, so a shorter sequence length will affect the perplexity; I will use a shorter one nonetheless**:

In [None]:
# bs refers to the Batch Size while sl refers to the sequence length:
bs,sl = 4,240 

# Defining the Dataloader:
dls = tls.dataloaders(bs=bs, seq_len=sl)
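
For intuition on what a batch size and a sequence length mean for a language model, here is a simplified sketch in plain Python. It is not fastai's actual `LMDataLoader` code; `lm_batches` is a hypothetical helper, and the assumption is that all texts are concatenated into one token stream, cut into `bs` rows, and read in windows of `seq_len` tokens with targets shifted by one.

```python
def lm_batches(token_stream, bs, seq_len):
    """Yield (inputs, targets) batches for language modeling.

    The stream is cut into `bs` contiguous rows; each batch takes a
    `seq_len`-wide window from every row, and the targets are the same
    window shifted one token to the right.
    """
    row_len = (len(token_stream) - 1) // bs  # -1 leaves room for the shifted target
    rows = [token_stream[i * row_len:(i + 1) * row_len + 1] for i in range(bs)]
    for start in range(0, row_len, seq_len):
        end = min(start + seq_len, row_len)
        xs = [row[start:end] for row in rows]
        ys = [row[start + 1:end + 1] for row in rows]
        yield xs, ys

stream = list(range(21))  # stand-in for concatenated token ids
batches = list(lm_batches(stream, bs=2, seq_len=5))
first_x, first_y = batches[0]
print(first_x)  # [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]]
print(first_y)  # [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15]]
```

Each target is simply the next token of the input, which is why a language model needs no separate labels.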

Let's take a final look at the data:

In [None]:
dls.show_batch(max_n=2)

Another way to gather the data is to preprocess the texts once and for all and only use the transform to decode the tensors to texts:

In [None]:
# Defining the function:
def tokenize(text):
    toks = tokenizer.tokenize(text)
    return tensor(tokenizer.convert_tokens_to_ids(toks))

# Actually Tokenizing everything:
tokenized = [tokenize(t) for t in progress_bar(all_texts)]

(Stolen Code Alert) Now we change the previous `Tokenizer` like this:

In [None]:
class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x): 
        return x if isinstance(x, Tensor) else tokenize(x)
        
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

In [None]:
# Getting the dataloader:

tls = TfmdLists(tokenized, TransformersTokenizer(tokenizer), splits=splits, dl_type=LMDataLoader)
dls = tls.dataloaders(bs=bs, seq_len=sl)

And we can check it still works properly for showing purposes:

In [None]:
# Hey again, my old friend.

dls.show_batch(max_n=2)

# Fine-tuning the model 🤹🏽‍♀️

The HuggingFace model will return a tuple in outputs, with the actual predictions and some additional activations (should we want to use them in some regularization scheme). To work inside the fastai training loop, we will need to drop those using a `Callback`: we use those to alter the behavior of the training loop.

Here we need to write the event `after_pred` and replace `self.learn.pred` (which contains the predictions that will be passed to the loss function) by just its first element. In callbacks, there is a shortcut that lets you access any of the underlying `Learner` attributes so we can write `self.pred[0]` instead of `self.learn.pred[0]`. That shortcut only works for read access, not write, so we have to write `self.learn.pred` on the right side (otherwise we would set a `pred` attribute in the `Callback`).

In [None]:
# The DropOutput Callback:

class DropOutput(Callback):
    def after_pred(self): self.learn.pred = self.pred[0]
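
The read-versus-write point is easier to see in a toy version. The sketch below uses hypothetical stand-in classes (`ToyLearner`, `ToyCallback`), not fastai's real ones, and assumes the shortcut behaves like attribute delegation through `__getattr__`: reads fall through to the learner, but an assignment lands on the callback itself.

```python
class ToyLearner:
    """Hypothetical stand-in for fastai's Learner; holds only `pred`."""
    def __init__(self):
        self.pred = None

class ToyCallback:
    """Hypothetical stand-in for fastai's Callback."""
    def __init__(self, learn):
        self.learn = learn
    def __getattr__(self, name):
        # Only called when normal lookup fails, so reads fall through to the learner.
        return getattr(self.learn, name)

class ToyDropOutput(ToyCallback):
    def after_pred(self):
        self.learn.pred = self.pred[0]  # read via the shortcut, write on the learner

class ToyBrokenDropOutput(ToyCallback):
    def after_pred(self):
        self.pred = self.pred[0]        # creates a `pred` attribute on the callback instead

learn_ok = ToyLearner()
learn_ok.pred = ("logits", "extra activations")
ToyDropOutput(learn_ok).after_pred()

learn_bad = ToyLearner()
learn_bad.pred = ("logits", "extra activations")
broken = ToyBrokenDropOutput(learn_bad)
broken.after_pred()

print(learn_ok.pred)   # logits
print(learn_bad.pred)  # ('logits', 'extra activations')
```
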

Now we are ready to create our `Learner`, a fastai object that groups the data, model and loss function and handles model training or inference. Since we are in a language model setting, we pass perplexity as a metric, and we need to use the callback we just defined. Lastly, we use mixed precision to save every bit of memory we can (and if you have a modern GPU, it will also make training faster).

I genuinely have no clue how bad the perplexity is going to get, since the sequence length I passed (240) is only about a quarter of the 1024 used to train GPT2.

In [None]:
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), cbs=[DropOutput], metrics=Perplexity()).to_fp16()
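
A quick aside on the metric: perplexity is the exponential of the average cross-entropy loss, so the two always move together. The sketch below shows that relationship in plain Python; the probabilities are made up for illustration and the helper names are hypothetical.

```python
import math

def cross_entropy(probs_of_correct_token):
    # Mean negative log-likelihood of the correct next token.
    n = len(probs_of_correct_token)
    return -sum(math.log(p) for p in probs_of_correct_token) / n

def perplexity(probs_of_correct_token):
    # Perplexity is the exponential of the cross-entropy loss.
    return math.exp(cross_entropy(probs_of_correct_token))

# Made-up probabilities the model assigned to the right token at 4 positions.
probs = [0.25, 0.25, 0.25, 0.25]
print(cross_entropy(probs))  # about 1.386, i.e. log(4)
print(perplexity(probs))     # about 4: as unsure as a uniform pick among 4 tokens
```
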

Checking if the current configuration allows the model to actually work:

In [None]:
learn.validate()

Yes it is **working**, time to find the Learning Rate.

## Learning Rate

`lr_find()` runs a short mock training with increasing learning rates and plots the loss against the learning rate, which helps pick a suitable value.

In [None]:
learn.lr_find()
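
For the curious, here is a rough sketch of the idea behind a learning-rate range test, on a toy one-parameter problem. It is not fastai's implementation: `lr_range_test`, the quadratic loss, the stopping rule and the "steepest drop" pick are all assumptions made for illustration.

```python
def lr_range_test(loss_fn, grad_fn, w0, start_lr=1e-4, end_lr=10.0, num_it=50):
    """Toy learning-rate range test on a single parameter."""
    lrs, losses = [], []
    w, best = w0, float("inf")
    for i in range(num_it):
        # Exponential ramp from start_lr to end_lr.
        lr = start_lr * (end_lr / start_lr) ** (i / (num_it - 1))
        w = w - lr * grad_fn(w)  # one gradient-descent step
        loss = loss_fn(w)
        lrs.append(lr)
        losses.append(loss)
        best = min(best, loss)
        if i > 0 and loss > 4 * best:  # stop once the loss blows up
            break
    return lrs, losses

def toy_loss(w):
    return w * w

def toy_grad(w):
    return 2 * w

lrs, losses = lr_range_test(toy_loss, toy_grad, w0=3.0)

# Pick the learning rate where the loss dropped the most between two steps.
steepest = min(range(1, len(losses)), key=lambda i: losses[i] - losses[i - 1])
suggested_lr = lrs[steepest]
print(len(lrs), suggested_lr)
```
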

## Fitting the Model 🤹🏽‍♀️

In [None]:
learn.fit_one_cycle(5, 1e-5)

As seen above: a pretty awful loss and an awfully good perplexity.

# Generating some Stuff 😏

Defining a function to generate a sentence:

In [None]:
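def predict(prompt, length, beams, temp):
    # Encode the prompt and add a batch dimension, on the same device as the model:
    prompt_ids = tokenizer.encode(prompt)
    inp = tensor(prompt_ids)[None].to(learn.model.device)
    # Beam search generation; note that `temperature` only matters when sampling (do_sample=True):
    preds = learn.model.generate(inp, max_length=length, num_beams=beams, temperature=temp)
    # Decode the first returned sequence back into text:
    return tokenizer.decode(preds[0].cpu().numpy())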
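
Since `predict` takes a temperature, here is a small plain-Python sketch of what temperature does to next-token probabilities. The logits are made up and `softmax_with_temperature` is a hypothetical helper; the only assumption is the usual definition, where logits are divided by the temperature before the softmax.

```python
import math

def softmax_with_temperature(logits, temp):
    # Divide the logits by the temperature before the softmax.
    scaled = [x / temp for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)
plain = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 1.5)

# Probability of the top token shrinks as the temperature goes up.
print(sharp[0], plain[0], flat[0])
```

A temperature below 1 makes the model more confident in its top choice, while a temperature above 1 spreads probability toward less likely tokens.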

Finally trying the model out 🤔

In [None]:
predict("immigrants are bad for", 10, 5, 1.5)

In [None]:
predict("i hate it that these brown dudes have", 13, 5, 1)

Looks like it is kinda working, not that well BUT it is kinda working.

# If you liked what I am doing here: UPVOTE!

If you don't, the Chinese government will take away my access to this laptop and I will be left alone in this basement. Please save me.