# GPT for style completion

In [1]:
from transformers import GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, GPT2LMHeadModel, pipeline, Trainer, TrainingArguments

In [2]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

In [4]:
pds_data = TextDataset(
    tokenizer= tokenizer,
    file_path="/PDS2.txt",
    block_size=32 # this is the length of each chunk of text to use as a data point
)

In [5]:
pds_data[0] , pds_data[0].shape # inspecting the first entry

(tensor([  200, 47231,  6418,   286,  6060,  5800,   198, 12211,  5061,   198,
           198,    32, 31516,   338,  5698,   284, 13905,  7605,   290,  4583,
           284,   198, 11249,   304,   171,   105,   222, 13967,  1366,    12,
         15808,  5479]),
 torch.Size([32]))

In [6]:
print(tokenizer.decode(pds_data[0]))

Principles of Data Science
Second Edition

A beginner's guide to statistical techniques and theory to
build eﬀective data-driven applications


In [7]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm = False # masked language modelling task.
)

In [9]:
tokenizer.pad_token = tokenizer.eos_token

collator_example = data_collator([tokenizer('I am an input'), tokenizer('So am I')])

collator_example

{'input_ids': tensor([[   40,   716,   281,  5128],
        [ 2396,   716,   314, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 0]]), 'labels': tensor([[  40,  716,  281, 5128],
        [2396,  716,  314, -100]])}

In [10]:
collator_example.input_ids

tensor([[   40,   716,   281,  5128],
        [ 2396,   716,   314, 50256]])

In [11]:
tokenizer.pad_token_id

50256

In [12]:
collator_example.attention_mask

tensor([[1, 1, 1, 1],
        [1, 1, 1, 0]])

attention mask is 0 where there is pad token

In [13]:
collator_example.labels

tensor([[  40,  716,  281, 5128],
        [2396,  716,  314, -100]])

In [14]:
model = GPT2LMHeadModel.from_pretrained('gpt2')

pretrained_generator = pipeline(
    'text-generation',
    model=model,
    tokenizer='gpt2',
    config={'max_length':200,
            'do_sample': True,
            'top_p':0.9,
            'temperature': 0.7,
            'top_k': 10}
)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Device set to use cuda:0


In [15]:
for generated_sequence in pretrained_generator('A dataset shows the relationships', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


A dataset shows the relationships of many of the major climate-warming experiments published between 1850 to 1998 and how they compare with each other. The authors show how different measurements have affected the trends of the models. In the case of the LendMean
----------
A dataset shows the relationships between a total of 10,824 different countries. The dataset comes from the Gallup/USA Today Internet Country survey from 1990 onward. It has been rereported on five occasions since then from around 2011 through 2016.


----------
A dataset shows the relationships of two major demographic groups by using the data from the US Census Bureau from 1995 to 2013. Data are weighted based on age ranges, with older age groups being significantly least likely to vote in a given election. For women,
----------


In [16]:
# Initialize training arguments
training_args = TrainingArguments(
    output_dir="./gpt2_pds", # The output directory
    overwrite_output_dir=True, # Overwrite the content of the output directory
    num_train_epochs=3, # Number of training epochs
    per_device_train_batch_size=32, # Batch size for training
    per_device_eval_batch_size=32,  # Batch size for evaluation
    warmup_steps=len(pds_data.examples) // 5, # Number of warmup steps for learning rate scheduler
    logging_steps=50,
    load_best_model_at_end=True,
    evaluation_strategy='epoch',
    save_strategy='epoch',       # Save checkpoint at the end of each epoch
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=pds_data.examples[:int(len(pds_data.examples) * 0.8)],
    eval_dataset=pds_data.examples[int(len(pds_data.examples) * 0.8):],
)

# Start evaluation
trainer.evaluate()


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mnt_[0m ([33meshaan-rithesh2023-vit-chennai[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


{'eval_loss': 4.955997467041016,
 'eval_model_preparation_time': 0.0038,
 'eval_runtime': 2.4741,
 'eval_samples_per_second': 379.942,
 'eval_steps_per_second': 12.126}

In [17]:
trainer.train()

Epoch,Training Loss,Validation Loss,Model Preparation Time
1,4.1646,4.093808,0.0038
2,3.6619,3.860945,0.0038
3,3.3033,3.77576,0.0038


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


TrainOutput(global_step=354, training_loss=3.776356120567537, metrics={'train_runtime': 150.2371, 'train_samples_per_second': 75.001, 'train_steps_per_second': 2.356, 'total_flos': 184014913536000.0, 'train_loss': 3.776356120567537, 'epoch': 3.0})

In [18]:
trainer.evaluate()

{'eval_loss': 3.7757601737976074,
 'eval_model_preparation_time': 0.0038,
 'eval_runtime': 2.5093,
 'eval_samples_per_second': 374.602,
 'eval_steps_per_second': 11.955,
 'epoch': 3.0}

In [19]:
trainer.save_model()

In [20]:
loaded_model = GPT2LMHeadModel.from_pretrained('./gpt2_pds')

finetuned_generator = pipeline(
    'text-generation', model=loaded_model, tokenizer=tokenizer,
    config={'max_length': 200,  'do_sample': True, 'top_p': 0.9, 'temperature': 0.7, 'top_k': 10}
)

Device set to use cuda:0


In [21]:
for generated_sequence in finetuned_generator('A dataset shows the relationships', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')

A dataset shows the relationships of a continuous variable in two ways: first, it is considered the average of the two points, and second, each variable is associated with 1-4 possible results. The
first point is considered the average of the two
----------
A dataset shows the relationships between the
distributions in data:
from clustering import kdf

from pandas import plot, pd
pd.delimit_mean(mean, x, y, width=
----------
A dataset shows the relationships among clusters of a set of
composites: all of them cluster together as a single data point, but are split in the top half by one
point. If the dataset holds a single set of clusters, but
----------
