In [1]:
from transformers import (
    GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling,
    GPT2LMHeadModel, pipeline, Trainer, TrainingArguments,
)

In [2]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')  # load the standard GPT-2 tokenizer

# GPT-2 ships without a pad token, so reuse the EOS token to pad short sequences
tokenizer.pad_token = tokenizer.eos_token


In [3]:
# load up our data into a dataset
pds_data = TextDataset(
    tokenizer=tokenizer,
    # plain-text training corpus ("AI in Everyday Life")
    file_path='/content/AI-in-Everyday-Life-How-Artificial-Intelligence-is-Already-Changing-Our-Experiences (1).txt',
    block_size=64  # length of each chunk of text to use as a datapoint
)
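
The `block_size` argument controls how the tokenized text is cut into training examples. A minimal sketch of that chunking, assuming consecutive non-overlapping blocks and a dropped trailing remainder (the helper name `chunk_token_ids` is hypothetical and is not part of transformers):

```python
def chunk_token_ids(token_ids, block_size):
    """Split a flat list of token ids into consecutive, non-overlapping blocks.

    A trailing remainder shorter than block_size is dropped. This is an
    assumption made for the sketch, not a guarantee about the library.
    """
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


# Toy stream of 10 "token ids" cut into blocks of 4: the last 2 ids are dropped.
toy_ids = list(range(10))
toy_blocks = chunk_token_ids(toy_ids, block_size=4)
print(toy_blocks)
```

With `block_size=64`, every datapoint is a run of 64 token ids, which matches the `torch.Size([64])` shape inspected in the next cell.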



In [4]:
pds_data[0], pds_data[0].shape  # inspect the first point

(tensor([20185,   287, 48154,  5155,    25,  1374, 35941,   198,   198, 24123,
         45329,   594,   318, 27511, 33680,  3954,   198,   198, 20468, 10035,
           628,   628,   198,   198, 21906,   628,   198,  8001,  9542,  4430,
           357, 20185,     8,   318,   783,   281, 19287,   636,   286,   674,
          4445,  3160,    11,  1771,   356,  7564,   340,   393,   407,    13,
           198,   198,  4863,   262,  4410,   356,   779,   284,   262,  5370,
           356,   787,    11,  9552]),
 torch.Size([64]))

In [5]:
print(tokenizer.decode(pds_data[0]))

AI in Everyday Life: How Artificial

Intel igence is Already Changing Our

Experiences





Introduction


Artificial intelligence (AI) is now an integral part of our daily lives, whether we recognize it or not.

From the devices we use to the decisions we make, AI


In [6]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
    # mlm=False selects causal (next-token) language modelling;
    # MLM (masked language modelling) is for BERT-style auto-encoding models
)

In [7]:
# example of how collator pads data dynamically
collator_example = data_collator([tokenizer('I am an input'), tokenizer('So am I')])

collator_example

{'input_ids': tensor([[   40,   716,   281,  5128],
        [ 2396,   716,   314, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 0]]), 'labels': tensor([[  40,  716,  281, 5128],
        [2396,  716,  314, -100]])}

In [8]:
collator_example.input_ids  # 50256 is our pad token id

tensor([[   40,   716,   281,  5128],
        [ 2396,   716,   314, 50256]])

In [9]:
tokenizer.pad_token_id

50256

In [10]:
collator_example.attention_mask  # Note the 0 in the attention mask where we have a pad token

tensor([[1, 1, 1, 1],
        [1, 1, 1, 0]])
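
What the collator does to a ragged batch can be reproduced by hand. The sketch below is a plain-Python approximation, not the library's implementation; `pad_and_label` and `IGNORE_INDEX` are hypothetical names, and the token ids are copied from the output above.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def pad_and_label(sequences, pad_token_id):
    """Right-pad every sequence to the longest one in the batch.

    Returns input ids, an attention mask (1 = real token, 0 = padding),
    and labels that copy the inputs but mark padding with IGNORE_INDEX.
    """
    longest = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = longest - len(seq)
        batch["input_ids"].append(seq + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch


# 'I am an input' and 'So am I', using the ids shown in the collator output
batch = pad_and_label([[40, 716, 281, 5128], [2396, 716, 314]], pad_token_id=50256)
for name, rows in batch.items():
    print(name, rows)
```

One difference worth knowing: this sketch masks only the positions it padded, whereas a collator that masks labels by comparing against the pad token id would also mask genuine EOS tokens when the pad token and EOS token are the same, as they are here.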

In [11]:
collator_example.labels  # note the -100, which excludes the padded position from the loss
# Labels are shifted inside the GPT-2 model, so they are passed in unshifted

tensor([[  40,  716,  281, 5128],
        [2396,  716,  314, -100]])
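
The two points in the comment above (labels are shifted inside the model, and `-100` positions are skipped) can be illustrated with a small hand-rolled loss. This is a simplified sketch for a single sequence in plain Python; `shifted_lm_loss` is a hypothetical helper, not the model's actual code.

```python
import math

IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss


def shifted_lm_loss(logits, labels):
    """Mean next-token cross-entropy for one sequence.

    logits: list of T score lists, one per position, each of vocabulary size V
    labels: list of T token ids (IGNORE_INDEX where the position is skipped)
    The logits at position t are scored against the label at position t + 1.
    """
    total, count = 0.0, 0
    for position_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # padded target: skip it entirely
        log_norm = math.log(sum(math.exp(x) for x in position_logits))
        total += log_norm - position_logits[target]  # -log softmax(target)
        count += 1
    return total / count if count else 0.0


# Toy case: vocabulary of 3, uniform logits, last label is padding.
toy_logits = [[0.0, 0.0, 0.0]] * 4
toy_labels = [0, 1, 2, IGNORE_INDEX]
toy_loss = shifted_lm_loss(toy_logits, toy_labels)
print(toy_loss)  # equals log(3) because every prediction is uniform
```

Only two of the three shifted pairs are counted here, because the third target is the ignored padding position.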

In [12]:
model = GPT2LMHeadModel.from_pretrained('gpt2')  # load up a GPT2 model

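# NOTE: the sampling settings below are passed as a plain dict through `config`;
# depending on the transformers version they may not be applied at generation time.
pretrained_generator = pipeline(  # build a text-generation pipeline around the pretrained model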
    'text-generation', model=model, tokenizer='gpt2',
    config={'max_length': 200, 'do_sample': True, 'top_p': 0.9, 'temperature': 0.7, 'top_k': 10}
)
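
The `temperature`, `top_k`, and `top_p` settings all reshape the next-token distribution before a token is sampled. The sketch below shows one common way to combine them (temperature first, then top-k, then top-p on the renormalized survivors). That ordering and the helper `filter_next_token_probs` are assumptions made for illustration, not the library's code.

```python
import math


def filter_next_token_probs(logits, temperature=0.7, top_k=10, top_p=0.9):
    """Return {token_id: probability} after temperature, top-k, and top-p."""
    # 1. Temperature: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = ranked[:top_k]
    top_total = sum(probs[i] for i in top)

    # 3. Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for token_id in top:
        kept.append(token_id)
        cumulative += probs[token_id] / top_total
        if cumulative >= top_p:
            break

    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
nucleus = filter_next_token_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.9)
print(nucleus)
```

In this toy case top-k removes the least likely token, and top-p then stops after the two most likely ones, so sampling would only ever pick token 0 or token 1.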


Device set to use cuda:0


In [13]:
print('----------')
for generated_sequence in pretrained_generator('This dataset shows the relationship', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


----------
This dataset shows the relationship with the number estimated by TES and the percentage of each parameter having an effect on an estimate of the likelihood (i.e., the threshold estimate) of a significant association between the two sets of parameters. The significance of
----------
This dataset shows the relationship between the first time a person moves in the world; the second is based on their height, weight, and age; and the third is based on the current state of the world. The time to move from the top to
----------
This dataset shows the relationship between gender and IQ [34–39].

The gender difference between white participants and African-American participants has been known. This is consistent with earlier work by J.E. Green and F.E.R.
----------


In [14]:
training_args = TrainingArguments(
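    output_dir="./gpt2_pds",  # output directory for checkpoints
    overwrite_output_dir=True,  # overwrite the contents of the output directory
    num_train_epochs=3,  # number of training epochs
    per_device_train_batch_size=32,  # batch size for training
    per_device_eval_batch_size=32,  # batch size for evaluation
    logging_steps=10,  # log the training loss every 10 optimizer steps
    load_best_model_at_end=True,  # after training, reload the checkpoint with the best eval loss
    evaluation_strategy='epoch',  # evaluate at the end of every epoch
    save_strategy='epoch'  # save a checkpoint at the end of every epoch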
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=pds_data.examples[:int(len(pds_data.examples)*.8)],  # first 80% of the blocks
    eval_dataset=pds_data.examples[int(len(pds_data.examples)*.8):]  # last 20% held out for evaluation
)

trainer.evaluate()  # baseline validation loss before any fine-tuning







{'eval_loss': 3.2615807056427,
 'eval_model_preparation_time': 0.0037,
 'eval_runtime': 0.3704,
 'eval_samples_per_second': 118.78,
 'eval_steps_per_second': 5.399}

In [15]:
trainer.train()

Epoch   Training Loss   Validation Loss   Model Preparation Time
1       No log          2.7691            0.0037
2       3.231000        2.626179          0.0037
3       3.231000        2.585106          0.0037


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


TrainOutput(global_step=18, training_loss=3.0555986828274198, metrics={'train_runtime': 51.5436, 'train_samples_per_second': 10.244, 'train_steps_per_second': 0.349, 'total_flos': 17245274112000.0, 'train_loss': 3.0555986828274198, 'epoch': 3.0})
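
The `global_step=18` figure and the "No log" entry for epoch 1 both follow from simple step arithmetic. The sketch below assumes a single device and no gradient accumulation; the training-set size of 176 is not printed anywhere and is inferred from the reported throughput (10.244 samples/s over 51.5436 s, across 3 epochs).

```python
import math

train_examples = 176  # inferred from the TrainOutput metrics, not printed directly
batch_size = 32       # per_device_train_batch_size
num_epochs = 3        # num_train_epochs
logging_steps = 10    # logging_steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which the training loss is logged, and steps at which an epoch ends
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
epoch_ends = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

print(steps_per_epoch, total_steps, logged_at, epoch_ends)
```

Under these assumptions the first epoch ends before the first logging step, which would explain "No log", and the single logged value is what both later epochs display as the training loss.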

In [16]:
trainer.evaluate()  # validation loss is down from 3.26 to 2.59, but the per-epoch gains are shrinking

{'eval_loss': 2.585106134414673,
 'eval_model_preparation_time': 0.0037,
 'eval_runtime': 0.2541,
 'eval_samples_per_second': 173.167,
 'eval_steps_per_second': 7.871,
 'epoch': 3.0}
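
A cross-entropy loss is easier to interpret as perplexity, the exponential of the mean per-token loss. The sketch below converts the two `eval_loss` values reported in this notebook, assuming they are mean per-token losses in nats.

```python
import math

eval_loss_before = 3.2615807056427   # trainer.evaluate() before fine-tuning
eval_loss_after = 2.585106134414673  # trainer.evaluate() after 3 epochs

perplexity_before = math.exp(eval_loss_before)
perplexity_after = math.exp(eval_loss_after)

print(f"perplexity before: {perplexity_before:.2f}")
print(f"perplexity after:  {perplexity_after:.2f}")
```

Perplexity drops from roughly 26 to roughly 13 on the held-out blocks, so fine-tuning approximately halves it.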

In [17]:
trainer.save_model()

In [18]:
loaded_model = GPT2LMHeadModel.from_pretrained('./gpt2_pds')

finetuned_generator = pipeline(
    'text-generation', model=loaded_model, tokenizer=tokenizer,
    config={'max_length': 200, 'do_sample': True, 'top_p': 0.9, 'temperature': 0.7, 'top_k': 10}
)

Device set to use cuda:0


In [19]:
# after fine-tuning, the generations are now substantially about data
print('----------')
for generated_sequence in finetuned_generator('This dataset shows the relationship', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')

----------
This dataset shows the relationship among data sets and data sets and how their interaction manifests as social interactions and outcomes in practice. This is particularly important for organizations whose data are inherently structured or set high expectations of social interaction. Data sets have been developed to accurately
----------
This dataset shows the relationship between self-reported obesity and the likelihood that a person will be obese over time. In this study, participants who reported eating less than their body weight in the past month were more likely to be obese. This trend continues to
----------
This dataset shows the relationship between educational attainment and the quality of life for individuals in various socioeconomic backgrounds. These findings are important for future research and public policy: it is critical that we ensure that the data we use in decision making is sensitive to human biases
----------
