In [None]:
%%capture
!pip install transformers
!pip install datasets
!pip install GPUtil

In [None]:
import torch
import pandas as pd
from numba import cuda
from datasets import load_dataset
from GPUtil import showUtilization as gpu_usage
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TrainingArguments, Trainer, default_data_collator



In [None]:
%%capture
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

In [None]:
%%capture
model = GPT2LMHeadModel.from_pretrained('gpt2',
                                        pad_token_id=tokenizer.eos_token_id)

In [None]:
print("The max model length is {} for this model".format(tokenizer.model_max_length))
print("The beginning of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.bos_token_id), tokenizer.bos_token_id))
print("The end of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.eos_token_id), tokenizer.eos_token_id))

The max model length is 1024 for this model
The beginning of sequence token <|endoftext|> has the id 50256
The end of sequence token <|endoftext|> has the id 50256


In [None]:
tokenizer.vocab_size

50257

In [None]:
tokenizer

GPT2Tokenizer(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)}, clean_up_tokenization_spaces=True)

In [None]:
tokenizer.all_special_tokens

['<|endoftext|>']

In [None]:
tokenizer.eos_token_id

50256

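GPT-2 does not define a padding token by default, which is why `eos_token_id` was passed as `pad_token_id` when the model was loaded above. The maximum sequence length it supports is 1024 tokens.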

In [None]:
tokenizer.max_model_input_sizes

{'gpt2': 1024,
 'gpt2-medium': 1024,
 'gpt2-large': 1024,
 'gpt2-xl': 1024,
 'distilgpt2': 1024}
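
Since there is no dedicated padding token, batching sequences of different lengths needs a stand-in. The sketch below is a plain-Python illustration (not the tokenizer's own code) of truncating to 1024 tokens, right-padding with the `<|endoftext|>` id, and building an attention mask so the padded positions can be ignored. The helper name `pad_batch` is hypothetical.

```python
EOS_ID = 50256   # id of <|endoftext|>, as printed above
MAX_LEN = 1024   # GPT-2 context window, as printed above

def pad_batch(sequences, pad_id=EOS_ID, max_len=MAX_LEN):
    """Truncate each sequence to max_len, then right-pad to a common width."""
    truncated = [seq[:max_len] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask

batch = [[40, 716, 281], [40, 716, 281, 35941, 9345, 23836]]
padded_ids, mask = pad_batch(batch)
print(padded_ids)
print(mask)
```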

In [None]:
sentence = 'I am an Artificial Intelligence Developer'
input_ids  = tokenizer.encode(sentence,
                              return_tensors = 'pt')

In [None]:
input_ids

tensor([[   40,   716,   281, 35941,  9345, 23836]])

In [None]:
tokenizer.decode(input_ids[0][3])

' Artificial'

In [None]:
greedy_output = model.generate(input_ids,
                               max_length=100,
                               no_repeat_ngram_size=2)

In [None]:
for i, output in enumerate(greedy_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: I am an Artificial Intelligence Developer. I am a software developer. And I'm a programmer.

I'm not a computer scientist. But I do have a lot of experience in the field of Artificial intelligence. So I think that I can help you understand the challenges of AI. You can learn about the problems of artificial intelligence and how they can be solved. It's not just about solving problems. There are many other things that can go wrong. We can't just solve problems by solving them...
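
The `no_repeat_ngram_size=2` argument used above stops any bigram from appearing twice. As a rough illustration of the idea (a simplified sketch, not the library's implementation), the hypothetical helper below lists the tokens that would complete an n-gram already present in the generated sequence; a decoder would then exclude those tokens at the next step.

```python
def banned_next_tokens(generated, n=2):
    """Tokens that would repeat an n-gram already present in `generated`."""
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[-(n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = tuple(generated[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

# Token ids are arbitrary toy values.
history = [1, 2, 3, 1]
print(banned_next_tokens(history, n=2))
```

Here the sequence ends in `1`, and the bigram `(1, 2)` has already occurred, so `2` is the only token that would be blocked.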



In [None]:
beam_output = model.generate(input_ids,
                             max_length = 100,
                             num_beams=5,
                             num_return_sequences=5,
                             no_repeat_ngram_size=2,
                             early_stopping=True)

In [None]:
beam_output

tensor([[   40,   716,   281, 35941,  9345, 23836,    13,   314,   423,   587,
          1762,   319,  9552,   329,   257,   890,   640,   290,   314,   716,
           845,  6568,   546,   262,  2003,   286,  9552,    13,   198,   198,
            40,   423,   257,  1256,   286,  1998,   287,   262,  2214,   286,
         11666,  4430,    13,   554,   262,   938,  1178,   812,    11,   314,
          1053,   587,  2950,   287,   257,  1271,   286,  1180,  4493,    11,
          1390,   262,  2478,   286,   262,  9552,  3859,   329,  3012,   338,
          5565,  3859,    11,   290,   262,  6282,   286,   281,  9552,    12,
         12293,  5175,   598,   329,  4196,   338,  8969,  3859,    13,  2312,
          4493,   423,  2957,   502,   284,   262,  7664,   326,  9552,   318],
        [   40,   716,   281, 35941,  9345, 23836,    13,   314,   423,   587,
          1762,   319,  9552,   329,   257,   890,   640,   290,   314,   716,
           845,  6568,   546,   262,  2003,   286, 

In [None]:
for i, output in enumerate(beam_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: I am an Artificial Intelligence Developer. I have been working on AI for a long time and I am very excited about the future of AI.

I have a lot of experience in the field of artificial intelligence. In the last few years, I've been involved in a number of different projects, including the development of the AI platform for Google's Android platform, and the creation of an AI-powered mobile app for Apple's iOS platform. These projects have led me to the conclusion that AI is...

1: I am an Artificial Intelligence Developer. I have been working on AI for a long time and I am very excited about the future of AI.

I have a lot of experience in the field of artificial intelligence. In the last few years, I've been involved in a number of different projects, including the development of the AI platform for Google's Android platform, and the creation of an AI-powered mobile app for Apple's iOS platform. These projects have led me to believe that AI is a...

2: I am an Artificial Intellige
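
Beam search keeps the `num_beams` highest-scoring partial sequences at every step instead of only the single best one, which is also why the returned sequences above differ only near the end. The sketch below shows the mechanism on a made-up next-token table. `TOY_LM` and its probabilities are invented for illustration, and the sketch omits length normalization and early stopping.

```python
import math

# Hypothetical toy model: P(next token | last token).
TOY_LM = {
    "<s>": {"the": 0.5, "a": 0.4, "dog": 0.1},
    "the": {"dog": 0.4, "cat": 0.35, "end": 0.25},
    "a": {"cat": 0.9, "end": 0.1},
    "dog": {"end": 1.0},
    "cat": {"end": 1.0},
}

def beam_search(start, num_beams, max_steps):
    """Return up to num_beams (tokens, log-probability) pairs, best first."""
    beams = [([start], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            last = tokens[-1]
            if last == "end":
                # Finished hypotheses are carried over unchanged.
                candidates.append((tokens, score))
                continue
            for tok, p in TOY_LM[last].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the num_beams best hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

greedy = beam_search("<s>", num_beams=1, max_steps=3)
beam = beam_search("<s>", num_beams=2, max_steps=3)
print(greedy[0][0], math.exp(greedy[0][1]))
print(beam[0][0], math.exp(beam[0][1]))
```

With one beam the search commits to "the" because it is the most likely first token, while two beams also keep "a" alive and end up with a more probable full sequence.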

In [None]:
random_output = model.generate(input_ids,
                               do_sample=True,
                               max_length=100,
                               top_k=0,
                               temperature=0.8)

In [None]:
for i, output in enumerate(random_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: I am an Artificial Intelligence Developer, who is often asked to explain his technical work to you. I go into a lot of detail about the topics and concepts of artificial intelligence in this course, and it is very easy to understand how people think and work in general. Much of the book is on deep learning, and I will cover many of the topics most people have not learned in several years. In addition to the topics covered, there are some points I experimenting with, including the idea that a computer...
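
The call above samples from the full vocabulary (`top_k=0` disables top-k filtering) with `temperature=0.8`. Temperature divides the logits before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. A minimal sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # arbitrary illustrative values
base = softmax_with_temperature(toy_logits, temperature=1.0)
sharper = softmax_with_temperature(toy_logits, temperature=0.8)
flatter = softmax_with_temperature(toy_logits, temperature=1.5)
print(base)
print(sharper)
print(flatter)
```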



In [None]:
top_k_output = model.generate(input_ids,
                              do_sample=True,
                              max_length=100,
                              top_k=50)

In [None]:
for i, output in enumerate(top_k_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: I am an Artificial Intelligence Developer and I think that even if you're getting a lot of applications of AI in your day to day life, it is very, very difficult to make a good decision or not. The difference between good news and bad news is actually quite significant.

Q. You have started a company to build a new kind of AI, for the sake of innovation, with the help of artificial intelligence. How are you planning on using AI to move forward in this direction?
...
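
With `top_k=50`, only the 50 most probable tokens are candidates at each step, and their probabilities are renormalized before sampling. The sketch below illustrates the filtering step on a four-token toy distribution. The helper name and the numbers are made up, and the library applies the same idea to logits rather than probabilities.

```python
def top_k_filter(probs, k):
    """Zero out everything except the k most probable entries, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

toy_probs = [0.5, 0.3, 0.15, 0.05]
filtered_k = top_k_filter(toy_probs, k=2)
print(filtered_k)
```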



In [None]:
top_p_output = model.generate(input_ids,
                              do_sample=True,
                              max_length=100,
                              top_p=0.8,
                              top_k=0)

In [None]:
for i, output in enumerate(top_p_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: I am an Artificial Intelligence Developer who just recently joined the #IBEL leadership team at UBI. I'm able to talk about the future of AI and my thoughts on AI-driven solutions, whether you're in the fields of UI design, game design, or product development.

JB: Oh my goodness. It's so nice to meet you all.

JH: Thank you.

JB: I love talking to you all! I want to thank you all for being...
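
Top-p (nucleus) sampling, used above with `top_p=0.8`, keeps the smallest set of tokens whose cumulative probability reaches the threshold, so the number of candidates changes from step to step. A simplified sketch on a toy distribution follows. The helper name and the numbers are assumptions for illustration.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set()
    cumulative = 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

toy_probs = [0.5, 0.3, 0.15, 0.05]
filtered_p = top_p_filter(toy_probs, p=0.75)
print(filtered_p)
```

On a peaked distribution the nucleus can shrink to a single token, while on a flat one it can cover most of the vocabulary.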



In [None]:
top_k_p_outputs = model.generate(input_ids,
                                 do_sample=True,
                                 max_length=2*100,
                                 top_k=50,
                                 top_p=0.85,
                                 num_return_sequences=5)

In [None]:
for i, output in enumerate(top_k_p_outputs):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: I am an Artificial Intelligence Developer with a lot of experience in creating AI applications. I have been programming for over 3 years and have worked in a lot of applications. I have taught a lot of programming languages with lots of experience and I do believe that I am the only person who could make an intelligent system that works well. I believe that there is no single solution that I can create that would be perfect for me, but I have made a number of attempts, many of them successful, but I still believe that there is only one way to accomplish this goal. I have come to the conclusion that we need to move to an AI world, where humans will be the dominant player in many fields. We need to be able to build AI systems that do not have to work in humans, and we must be able to make them work in the human world. In this post I am going to talk about what I believe can be done. I will go into the technical aspects of what I believe can...

1: I am an Artificial Intelligence Devel

In [None]:
%%capture
dataset_name = "tiny_shakespeare"
cache_dir = "lm_dataset/"
datasets = load_dataset(dataset_name, cache_dir=cache_dir)

In [None]:
datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1
    })
})

In [None]:
#datasets['train'][:1]

In [None]:
column_names = datasets["train"].column_names
text_column_name = "text" if "text" in column_names else column_names[0]

def tokenize_function(examples):
    output = tokenizer(examples[text_column_name])
    return output

tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=column_names,
    desc="Running tokenizer on dataset")

Running tokenizer on dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (301966 > 1024). Running this sequence through the model will result in indexing errors



In [None]:
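# Use the model's context window as the chunk size, capped at 1024 tokens.
block_size = min(tokenizer.model_max_length, 1024)

def group_texts(examples):
    # Concatenate every tokenized text in the batch into one long list per key.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the trailing remainder so every chunk holds exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    # Split each key into consecutive chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()}
    # For causal LM training the labels are the inputs; the model shifts them internally.
    result["labels"] = result["input_ids"].copy()
    return result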

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    desc=f"Grouping texts in chunks of {block_size}")

Grouping texts in chunks of 1024:   0%|          | 0/1 [00:00<?, ?ba/s]
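
To see what the grouping step does, here is a standalone sketch of the same logic on toy data with a block size of 4. It runs without the tokenizer or the `datasets` library, and the function name and token values are made up. By the same arithmetic, the 301966-token sequence reported in the warning above yields 301966 // 1024 = 294 full blocks.

```python
from itertools import chain

def group_into_blocks(examples, block_size):
    """Concatenate all sequences per key, drop the remainder, split into blocks."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # Labels mirror the inputs for causal language modeling.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result

toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])
print(301966 // 1024)
```

The ten toy tokens give two blocks of four, and the last two tokens are discarded.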


In [None]:
train_dataset = lm_datasets["train"]
eval_dataset = lm_datasets["validation"]

In [None]:
training_args = TrainingArguments(output_dir = "output/",
                                  per_device_train_batch_size=1,
                                  num_train_epochs=50,
                                  save_total_limit=1)

In [None]:
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=eval_dataset,
                  tokenizer=tokenizer,
                  data_collator=default_data_collator)

In [None]:
gpu_usage()

| ID | GPU | MEM |
------------------
|  0 |  0% |  8% |


In [None]:
torch.cuda.empty_cache()

In [None]:
train_result = trainer.train()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Step   Training Loss
 500   3.4861
1000   3.163
1500   2.9468
2000   2.7326
2500   2.5437
3000   2.3804
3500   2.2043
4000   2.0479
4500   1.9113
5000   1.7868


In [None]:
trainer.save_model()

metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)

trainer.save_state()

***** train metrics *****
  epoch                    =       50.0
  total_flos               =  7154406GF
  train_loss               =      1.548
  train_runtime            = 0:54:33.49
  train_samples_per_second =      4.491
  train_steps_per_second   =      4.491
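
Two quick sanity checks can be read off these metrics. The sketch below converts the reported `train_loss` to a perplexity and estimates the number of optimizer steps from the runtime and throughput. It assumes the loss is a mean per-token cross-entropy in nats. Since `train_loss` is, as far as I understand, averaged over the whole run and measured on the training data, the perplexity is only a rough indicator, not a held-out score.

```python
import math

train_loss = 1.548                  # from the train metrics above
train_runtime_s = 54 * 60 + 33.49   # 0:54:33.49 in seconds
steps_per_second = 4.491
num_epochs = 50

perplexity = math.exp(train_loss)
approx_steps = train_runtime_s * steps_per_second
steps_per_epoch = approx_steps / num_epochs

print(f"perplexity ~ {perplexity:.2f}")
print(f"optimizer steps ~ {approx_steps:.0f}")
print(f"steps per epoch ~ {steps_per_epoch:.0f}")
```

With a batch size of 1, the steps per epoch should equal the number of 1024-token training blocks, which gives a convenient cross-check against the grouping step.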


In [None]:
torch.manual_seed(2)

# Place the prompt on whatever device the model is on (GPU after training, else CPU).
ids = tokenizer.encode('One does not simply walk into',
                       return_tensors='pt').to(model.device)

In [None]:
greedy_output = model.generate(ids,
                               max_length=100,
                               no_repeat_ngram_size=2)

In [None]:
for i, output in enumerate(greedy_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: One does not simply walk into a room and yields up his or her body
To such as have no more in common with him;
But in one respect he differs from all others; and
For one thing, he hath in himself a nature unlike
any in nature, and in being thus most
brave, to break from him.

LUCIO:
This is a brave fellow; for he is one
That will, in a word, do more than one thousand...



In [None]:
beam_output = model.generate(ids,
                             max_length = 100,
                             num_beams=5,
                             num_return_sequences=5,
                             no_repeat_ngram_size=2,
                             early_stopping=True)

In [None]:
for i, output in enumerate(beam_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: One does not simply walk into a man's bosom and take his clothes
from him; for I have seen such a thing.

LADY ANNE:
O Warwick, thou art the first that ever didst bend
That wronged thyself in the view of others!
The army of the queen am I arm'd against;
And I, against thy back, will turn the diadem
On thy head, and burn the principal of thy pride
With the...

1: One does not simply walk into a man's bosom and take his clothes
from him; for I have seen such a thing.

LADY ANNE:
O Warwick, thou art the first that ever didst bend
That wronged thyself in the view of others!
The army of the queen am I arm'd against;
And I, against thy back, will turn the diadem
On thy head, and burn the principal of thy pride
With that...

2: One does not simply walk into a man's bosom and take his clothes
from him; for I have seen such a thing.

LADY ANNE:
O Warwick, thou art the first that ever didst bend
That wronged thyself in the view of others!
The army of the queen am I arm'd against;
And I, again

In [None]:
random_output = model.generate(ids,
                               do_sample=True,
                               max_length=100,
                               top_k=0,
                               temperature=0.8)

In [None]:
for i, output in enumerate(random_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: One does not simply walk into a man's mouth and speak;
For I have heard some of these spoken. Thou hast undone thyself.

ROMEO:
Thou detestable traitor, I have seen thy face.

BENVOLIO:
O, make me happy by having him.

MERCUTIO:
And happy too, is it so: a woman is wont to chide.

ROMEO:
He chides for his...



In [None]:
top_k_output = model.generate(ids,
                              do_sample=True,
                              max_length=100,
                              top_k=50)

In [None]:
for i, output in enumerate(top_k_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: One does not simply walk into this world;
And that the naked traveller be king,
His acts of violence transported to the end;
Yet, in this world he is king; and I, his wife,
Can no longer say 'I love thee'
His warlike father advised him to, and I, his wife,
Were angels and nature no better pleased:
For why, 'tis my husband's will,
And I, his wife, should that be so...



In [None]:
top_p_output = model.generate(ids,
                              do_sample=True,
                              max_length=100,
                              top_p=0.8,
                              top_k=0)

In [None]:
for i, output in enumerate(top_p_output):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: One does not simply walk into the mind of the dull;
And yields too much to the common feeling.

Second Murderer:
What rages here in this cell?

CLARENCE:
Let black magic apprehend
This murderous wretch: he is come to know
His evil done, and by that knowledge apprehends
The evil done.

Third Murderer:
What rages here in this cell?

CLARENCE:
That he choose...



In [None]:
top_k_p_outputs = model.generate(ids,
                                 do_sample=True,
                                 max_length=2*100,
                                 top_k=50,
                                 top_p=0.85,
                                 num_return_sequences=5)

In [None]:
for i, output in enumerate(top_k_p_outputs):
    print("{}: {}...".format(i, tokenizer.decode(output,
                                                 skip_special_tokens=True)))
    print('')

0: One does not simply walk into the world;
One at a time, some one at a time,
Could within a mile encompass all the earth,
And nothing can be more than the farthest world,
Within whose vastness all your body is,
When you are cold.

FLORIZEL:
So had you never been cold;
But now you have, since you can no more but think it,
Cold does encompass your thinking;
And cold will cloud your thought;
Since you cannot think it so, do not take
Your apprehension with your apprehension.

LEONTES:
It is a charge he makes against my better nature,
Because I abhor his rude delights. It is spoke so,
More than with thunder or with wind; so goes:
I am no meteor; yet meteor I can behold,
Lords circling in the clouds, that roused up their fury,
To dash down and throw their...

1: One does not simply walk into a man's bosom;
He, in aught, may move or move; a naked man by his garments,
Is not naked in the sense of apparel.
What he does see, he humbles him with precept;
But whether his folly or what contempt
H