In [44]:
!pip install transformers
!pip install datasets
!pip install spacy
!pip install ftfy
!pip install accelerate

Collecting accelerate
  Downloading accelerate-0.23.0-py3-none-any.whl (258 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m258.1/258.1 kB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: accelerate
Successfully installed accelerate-0.23.0


In [1]:
from huggingface_hub import notebook_login

# login to huggingface hub
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

# Import Packages

To train our own GPT ( Generative Pre Trained ) model, we'll use Huggingface's transformers package and load datasets, functions and trainer from it. Huggingface provides set of useful utilities to build your own custom models from scratch! Which is very helpful if you are getting started!

In [21]:
from transformers import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel, OpenAIGPTConfig
from datasets import load_dataset,ReadInstruction

# Load Dataset

To build our text generation model, we'll use "pszemraj/simple_wikipedia_LM" dataset from huggingface. Its a small wikipedia dataset and its a good starting point for our model.

In [23]:
dataset = load_dataset("pszemraj/simple_wikipedia_LM",split=[
    ReadInstruction("train",to=30, unit="%"),
    ReadInstruction("validation",to=10, unit="%")
])

Let's see what our data looks like!

In [28]:
print("\n\n".join(dataset[0]['text'][:5]))

Vitoria Futebol Clube is a Portuguese sports club from the city of Setubal. Popularly known as Vitoria de Setubal (), the club was born under the original name Sport Victoria from the ashes of the small Bonfim Foot-Ball Club.

Pope Shenouda III (3 August 1923 - 17 March 2012) was the 117th Pope of Alexandria & Patriarch of the See of St. Mark. His papacy lasted for forty years, four months, and four days from 14 November 1971 until his death on 17 March 2012.
Pope Shenouda III died on 17 March 2012 in Cairo, Egypt from respiratory and kidney failure, aged 88.

Bob Steele (Robert Adrian Bradbury; January 23, 1907 - December 21, 1988) was an American actor. He was known for his roles in Carson City Kid, Island in the Sky, Rio Bravo, Hang 'Em High, Rio Lobo, and in the television sitcom F Troop.
Steele was born on January 23, 1907 in Portland, Oregon. He was raised in Hollywood, California. Steele was married to Louise A. Chessman from 1931 until they divorced in 1933. Then he was married

In [30]:
dataset[0][0]

{'id': '796322',
 'url': 'https://simple.wikipedia.org/wiki/Vit%C3%B3ria%20F.C.',
 'title': 'Vitória F.C.',
 'text': 'Vitoria Futebol Clube is a Portuguese sports club from the city of Setubal. Popularly known as Vitoria de Setubal (), the club was born under the original name Sport Victoria from the ashes of the small Bonfim Foot-Ball Club.'}

So it's a simple text corpus.

# Tokenization

Since neural networks only understands numbers and not text natively, we'll use a technique called **Tokenization** to convert out text into numbers. But don't worry Huggingface got us covered.

We'll use the OpenAIGPTTokenizer to tokenize this corpus for us. There are plenty of other tokenizers avaiable for us to use but make sure the tokenizer is compatible with your model's architecture.

In [31]:
gpt_tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")

In [32]:
def tokenize_function(data_dict):
    return gpt_tokenizer(data_dict["text"],truncation=True)

train_dataset_tokenized = dataset[0].map(tokenize_function, batched=True, remove_columns=['id','title','url'])
validation_dataset_tokenized = dataset[0].map(tokenize_function, batched=True, remove_columns=['id','title','url'])

Map:   0%|          | 0/67873 [00:00<?, ? examples/s]

Map:   0%|          | 0/67873 [00:00<?, ? examples/s]

In [33]:
train_dataset_tokenized.remove_columns(['attention_mask'])
validation_dataset_tokenized.remove_columns(['attention_mask'])

Dataset({
    features: ['text', 'input_ids'],
    num_rows: 67873
})

In [34]:
train_dataset_tokenized.set_format("torch")
validation_dataset_tokenized.set_format("torch")

In [35]:
gpt_tokenizer.pad_token = "<PAD>"

Okay, So we have loaded the openai-gpt pretrained tokenizer. Now we'll use it to tokenize our text corpus.

# Model Architecture

For text generation, previously RNN based architectures were used. But after transformers 🤖 arrived, RNN's for text generation became obsolete. The transformer decoder only architecture can be used to generate text.

<img src="https://iq.opengenus.org/content/images/2020/11/Screen-Shot-2020-11-16-at-9.24.43-AM.png" width=200/>

You can code this architecture by yourself from scratch to learn the indepths of the transformer architecture, but if you're already aware and know how this architecture works, you can save some time and use Huggingface and skip coding this architecture from scratch and focus on building your text generation model.

We'll use the Huggingface's **OpenAIGPTLMHeadModel** to build our text generation model. **OpenAIGPTLMHeadModel** is the implementation of the GPT Decoder only architecture you see in the image.

In [36]:
len(gpt_tokenizer)

40478

In [38]:
model_config = OpenAIGPTConfig(
    vocab_size=len(gpt_tokenizer),
    n_positions=531, # max sequence length
    n_embd=512, # embedding size per token
    n_layer=4, # number of decoder blocks
    n_head=4, # number of attentions heads in each attention block
)
model = OpenAIGPTLMHeadModel(model_config)

# Inputs & Outputs

Remember, transformer decoder only blocks are autoregressive i.e they take the previous output as input and generates new output.

Right now, our dataset is just sequence of words. We need to convert it into self supervised data i.e The model will take few words and predict the next word.

### Example

Sentence: Toronto is located in Ontario. Toronto is the most loved city in the world


Input: Toronto _ _ _

Output: Toronto is



We can use Huggingface DataCollator to achieve this. Huggingface makes everything easy!!

In [39]:
from transformers import DataCollatorForLanguageModeling

In [40]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=gpt_tokenizer,
    mlm=False
)

# Training Setup

We have everything setup now, Its time to setup our training config and train the model!

In [41]:
from transformers import Trainer, TrainingArguments

In [42]:
gpt_trainer_arguments = TrainingArguments(
    output_dir = "./model/",
    evaluation_strategy = "epoch",
    num_train_epochs = 5,
    save_strategy = "epoch",
    use_cpu = False,
    load_best_model_at_end = True,
    hub_model_id="gpt1_base_abhi",
    push_to_hub=True
)

In [43]:
gpt_trainer = Trainer(
    model = model,
    args=gpt_trainer_arguments,
    data_collator = data_collator,
    train_dataset = train_dataset_tokenized,
    eval_dataset = validation_dataset_tokenized,
    tokenizer = gpt_tokenizer,
)

# Train

Let's train the model now!

In [None]:
gpt_trainer.train()

Epoch,Training Loss,Validation Loss
1,4.943,4.812946
2,4.5439,4.377125
3,4.2816,4.149065
