<a href="https://colab.research.google.com/github/Srishtijais16/step_demo/blob/day5/day5_task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
from transformers import (
    GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, GPT2LMHeadModel,
    pipeline, Trainer, TrainingArguments,
)

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')  # load the standard GPT-2 tokenizer

# GPT-2 has no padding token of its own, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token
# load our data into a dataset
pds_data = TextDataset(
    tokenizer=tokenizer,
    file_path='/content/Book.txt',
    block_size=64  # number of tokens in each chunk used as one datapoint
)

pds_data[0], pds_data[0].shape  # inspect the first point
print(tokenizer.decode(pds_data[0]))
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
    # MLM is Masked Language Modelling (for BERT + auto-encoding tasks)
)
# example of how collator pads data dynamically
collator_example = data_collator([tokenizer('I am an input'), tokenizer('So am I')])

collator_example
collator_example.input_ids  # 50256 is our pad token id

tokenizer.pad_token_id
collator_example.attention_mask  # Note the 0 in the attention mask where we have a pad token
collator_example.labels  # note the -100 to ignore loss calculation for the padded token
# Labels are shifted inside the GPT model so we don't need to worry about that

model = GPT2LMHeadModel.from_pretrained('gpt2')  # load up a GPT2 model

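# NOTE: generation settings passed as a plain dict via `config` do not appear to be
# applied when `model` is already a loaded model object: the samples printed for this
# call stop at roughly 50 tokens rather than 200. Passing them as keyword arguments
# at call time (e.g. max_length=200) is the more reliable route.
pretrained_generator = pipeline(
    'text-generation', model=model, tokenizer='gpt2',
    config={'max_length': 200, 'do_sample': True, 'top_p': 0.9, 'temperature': 0.7, 'top_k': 10}
)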

print('----------')
for generated_sequence in pretrained_generator('This dataset shows the relationship', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')

training_args = TrainingArguments(
    output_dir="./gpt2_pds",  # the output directory
    overwrite_output_dir=True,  # overwrite the contents of the output directory
    num_train_epochs=3, # number of training epochs
    per_device_train_batch_size=32, # batch size for training
    per_device_eval_batch_size=32,  # batch size for evaluation
    logging_steps=10,
    load_best_model_at_end=True,
    evaluation_strategy='epoch',
    save_strategy='epoch'
)

split_idx = int(len(pds_data.examples) * 0.8)  # 80/20 train/eval split

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=pds_data.examples[:split_idx],
    eval_dataset=pds_data.examples[split_idx:]
)

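trainer.evaluate()  # baseline validation loss before fine-tuning
trainer.train()
# validation loss improves by less each epoch (3.87 -> 3.80 -> 3.78), so gains are flattening
trainer.evaluate()
trainer.save_model()  # writes the fine-tuned weights to output_dir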
loaded_model = GPT2LMHeadModel.from_pretrained('./gpt2_pds')

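# same `config` dict as for the pretrained generator; if it is not picked up, pass
# the generation settings as keyword arguments when calling the generator instead
finetuned_generator = pipeline(
    'text-generation', model=loaded_model, tokenizer=tokenizer,
    config={'max_length': 200, 'do_sample': True, 'top_p': 0.9, 'temperature': 0.7, 'top_k': 10}
)

# sample from the fine-tuned model; outputs should now lean toward the book's subject matter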
print('----------')
for generated_sequence in finetuned_generator('what is megatech', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')
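
The dynamic padding shown by the collator example in the cell above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: `pad_batch` is a hypothetical helper, padding is assumed to go on the right, and the token ids in the usage lines are made up. Only the pad id 50256 and the ignore value -100 come from the comments in the cell.

```python
PAD_ID = 50256        # GPT-2 end-of-text id, reused as the pad id in the cell above
IGNORE_INDEX = -100   # label value that the loss calculation skips


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token-id lists to the longest one in the batch (hypothetical helper)."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 0 marks padding
        labels.append(seq + [IGNORE_INDEX] * n_pad)           # padding is excluded from the loss
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = pad_batch([[11, 12, 13, 14], [21, 22, 23]])  # made-up token ids
print(batch["input_ids"][1])       # [21, 22, 23, 50256]
print(batch["attention_mask"][1])  # [1, 1, 1, 0]
print(batch["labels"][1])          # [21, 22, 23, -100]
```

The sketch masks labels by position. Because padding reuses the end-of-text id, a collator that masks by token id would also exclude genuine end-of-text tokens from the loss, which is worth keeping in mind when the training text contains them.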

MEGATECH
MEGATECH
TECHNOLOGY IN 2O5O
TECHNOLOGY IN 2050
edited by
DANIEL FRANKLINedited by
DANIEL FRANKLIN
Books

Published under exclusive licence from The Economist by
Profile Books Ltd
3 Hol
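
The decoded sample above stops mid-word ("3 Hol") because each datapoint is a fixed window of 64 tokens rather than a sentence or paragraph. A minimal sketch of that kind of chunking, assuming consecutive non-overlapping windows with any short remainder dropped; `chunk_token_ids` is a hypothetical helper and the integers stand in for real token ids:

```python
def chunk_token_ids(token_ids, block_size):
    """Split a flat list of token ids into consecutive full-size blocks."""
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


token_ids = list(range(200))             # stand-in for a tokenized book
blocks = chunk_token_ids(token_ids, 64)
print(len(blocks))                       # 3 full blocks; the last 8 ids are dropped
print(blocks[1][0], blocks[1][-1])       # 64 127
```

Since the windows ignore sentence boundaries, most datapoints begin and end mid-sentence, which is acceptable for causal language modelling but explains the abrupt cut.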


Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


----------
This dataset shows the relationship between the level of social inequality (i.e., social inequality as a function of income) and the share of people receiving social security. More recent data show that this trend is more prevalent among males than females both in the
----------
This dataset shows the relationship between the different sex groups in all four different studies (see Figure 2).

Figure 2. Relationship between sex groups

The figure shows the relationships on the basis of sex and age. In the sex group where the
----------
This dataset shows the relationship between the relative level of the prevalence of cancer and the amount of total fat stored by adiposome cells. To test this, we performed an X-ray computed tomography analysis of the adipose tissue of mice that were
----------








Epoch  Training Loss  Validation Loss  Model Preparation Time
1      No log         3.868581         0.0029
2      No log         3.800093         0.0029
3      4.269800       3.783803         0.0029
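
Validation loss for a language model is typically the mean cross-entropy per token, so exponentiating it gives perplexity, which is often easier to compare across runs. A small sketch using the three validation losses reported above; it assumes the loss is a natural-log cross-entropy, which is the usual convention but is not stated in the output:

```python
import math

# validation losses copied from the training log above
val_losses = {1: 3.868581, 2: 3.800093, 3: 3.783803}

for epoch, loss in val_losses.items():
    print(f"epoch {epoch}: loss={loss:.4f}  perplexity={math.exp(loss):.1f}")

# the per-epoch improvement shrinks, which is what a flattening curve looks like
drops = [val_losses[e] - val_losses[e + 1] for e in (1, 2)]
print([round(d, 4) for d in drops])  # [0.0685, 0.0163]
```

The second drop is roughly a quarter of the first, so additional epochs on the same data would likely yield small gains at best.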


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Device set to use cpu


----------
what is megatech?" asked John Faubert, in an effort to provide some common understanding of the subject of gender differences.
The gender equality movement will continue, the debate has been riled by a spate of recent victories, such as
----------
what is megatech?"
With more than half a century ago, it was clear that the internet allowed anyone to connect the two worlds. Now, Google is taking a different tack by offering its services on a much larger scale. As it prepares
----------
what is megatech] that you are here to do? And do you mean to help this project go forward?
I think I should say something about my attitude about it. In 2006 I came across the book A Short Description of Social Science
----------
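
The `temperature`, `top_k`, and `top_p` settings named in the generator configuration control how the next token is sampled. The sketch below illustrates the general idea in plain Python and is not the library's implementation: `filter_next_token_probs` is a hypothetical helper, the ordering (temperature, then top-k, then top-p) is an assumption, and the logits are made up.

```python
import math


def filter_next_token_probs(logits, temperature=0.7, top_k=10, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]   # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: keep only the k most likely tokens, then renormalize
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in order)
    k_probs = {i: probs[i] / k_total for i in order}

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = {}, 0.0
    for i in order:
        kept[i] = k_probs[i]
        cumulative += k_probs[i]
        if cumulative >= top_p:
            break

    p_total = sum(kept.values())
    return {i: p / p_total for i, p in kept.items()}


kept = filter_next_token_probs([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.9)
print(sorted(kept))  # [0, 1]
```

A token would then be drawn at random from the surviving entries according to their probabilities. Lower `top_k` or `top_p` values make the output more conservative, and higher values make it more varied.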
