In [1]:
from dotenv import load_dotenv
import os

# Load environment variables from .env.local
load_dotenv('.env.local')

# Now you can access your token
hf_token = os.getenv("HF_TOKEN")  # Replace with your variable name

In [2]:
from huggingface_hub import login

login(token=hf_token)

Note: Environment variable `HF_TOKEN` is set and is the current active token independently from the token you've just configured.


## Load ELI5 dataset

Start by loading the first 5000 examples from the ELI5-Category dataset with the 🤗 Datasets library. This’ll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

In [3]:
from datasets import load_dataset

eli5 = load_dataset("rexarski/eli5_category", split="train[:5000]")

Generating train split:   0%|          | 0/91772 [00:00<?, ? examples/s]

Generating validation1 split:   0%|          | 0/5446 [00:00<?, ? examples/s]

Generating validation2 split:   0%|          | 0/2375 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5411 [00:00<?, ? examples/s]

In [4]:
eli5 = eli5.train_test_split(test_size=0.2)

In [5]:
eli5["train"][0]

{'q_id': '7ew8vv',
 'title': 'can a star with no galaxy exist?',
 'selftext': "Is it possible for there to be a star in space, but that it's not inside a galaxy?",
 'category': 'Physics',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dq7von1', 'dq7vuf9'],
  'text': ['Yes. Stars can and do exist outside of the boundaries of galaxies. They are commonly referred to as stellar outcasts or intergalactic stars.',
   'Yes, they are known as intergalactic stars or rogue stars. The first were discovered in 1997. It is believed they are born inside of galaxies, and later get ejected by close encounters with black holes or violent galactic collisions. Some theories suggest there may be as many stars outside of galaxies as there are inside.'],
  'score': [7, 6],
  'text_urls': [[], []]},
 'title_urls': ['url'],
 'selftext_urls': ['url']}

## Preprocess

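The next step is to load a DistilGPT2 tokenizer to process the text subfield. Because text is nested inside answers, flatten the dataset afterwards so that each subfield becomes its own column: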

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

In [7]:
eli5 = eli5.flatten()

In [8]:
eli5["train"][0]

{'q_id': '7ew8vv',
 'title': 'can a star with no galaxy exist?',
 'selftext': "Is it possible for there to be a star in space, but that it's not inside a galaxy?",
 'category': 'Physics',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dq7von1', 'dq7vuf9'],
 'answers.text': ['Yes. Stars can and do exist outside of the boundaries of galaxies. They are commonly referred to as stellar outcasts or intergalactic stars.',
  'Yes, they are known as intergalactic stars or rogue stars. The first were discovered in 1997. It is believed they are born inside of galaxies, and later get ejected by close encounters with black holes or violent galactic collisions. Some theories suggest there may be as many stars outside of galaxies as there are inside.'],
 'answers.score': [7, 6],
 'answers.text_urls': [[], []],
 'title_urls': ['url'],
 'selftext_urls': ['url']}

Each subfield is now a separate column, as indicated by the answers prefix, and the answers.text field is a list of strings. Instead of tokenizing each answer separately, join the list into a single string so you can tokenize the answers jointly.
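To make the effect of flattening concrete, here is a minimal pure-Python sketch of the same idea applied to a single, shortened record. The helper name flatten_record is hypothetical, and this is not how 🤗 Datasets implements flatten(); it only mimics the resulting dotted column names.

```python
def flatten_record(record, parent_key=""):
    """Hypothetical helper: lift nested dict fields to top-level dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts such as 'answers'.
            flat.update(flatten_record(value, full_key))
        else:
            flat[full_key] = value
    return flat


# A shortened version of the example record shown earlier.
record = {
    "q_id": "7ew8vv",
    "answers": {"a_id": ["dq7von1", "dq7vuf9"], "score": [7, 6]},
}
flat = flatten_record(record)
print(flat)
```

Lists are left untouched, which is why answers.text remains a list of strings after flattening.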

Here is a first preprocessing function to join the list of strings for each example and tokenize the result:

In [9]:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])
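When the function is applied in batched mode, it receives a dictionary that maps each column name to a list of values, one entry per example. The sketch below uses a made-up two-example batch and shows only the joining step that runs before the tokenizer is called.

```python
# A made-up batch in the dict-of-lists layout that a batched map passes in.
batch = {
    "answers.text": [
        ["Yes.", "They are known as rogue stars."],
        ["Because of lift."],
    ]
}

# One joined string per example; this list is what gets handed to the tokenizer.
joined = [" ".join(answers) for answers in batch["answers.text"]]
print(joined)
```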

To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and increasing the number of processes with num_proc. Remove any columns you don’t need:

In [10]:
tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1862 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1850 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1201 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2653 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1195 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1102 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1310 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1318 > 1024). Running this sequence through the model will result in indexing errors


In [11]:
tokenized_eli5["train"][0]

{'input_ids': [5297, 13, 10271, 460, 290, 466, 2152, 2354, 286, 262, 13215, 286,
  27982, 13, 1119, 389, 8811, 6412, 284, 355, 25041, 503, 40924, 393,
  987, 13528, 12009, 5788, 13, 3363, 11, 484, 389, 1900, 355, 987,
  13528, 12009, 5788, 393, 23586, 5788, 13, 383, 717, 547, 5071, 287,
  8309, 13, 632, 318, 4762, 484, 389, 4642, 2641, 286, 27982, 11,
  290, 1568, 651, 38632, 416, 1969, 16925, 351, 2042, 10421, 393, 6590,
  44280, 31998, 13, 2773, 10946, 1950, 612, 743, 307, 355, 867, 5788,
  2354, 286, 27982, 355, 612, 389, 2641, 13],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  ...

This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.

You can now use a second preprocessing function to:

- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by block_size, which should be both shorter than the maximum input length and short enough for your GPU RAM.

In [12]:
block_size = 128


def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder that does not fill a whole block. You could pad it
    # instead if the model supported padding; customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
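To see what the grouping does, the sketch below repeats the logic of group_texts on a tiny made-up batch with a block size of 4. The name group_texts_demo and the toy token ids are illustrative only; the only change from the function above is that block_size is passed in as an argument.

```python
def group_texts_demo(examples, block_size):
    """Same logic as group_texts, with block_size passed in explicitly."""
    # Concatenate the per-example lists of every column into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Keep only as many tokens as fill whole blocks.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
chunks = group_texts_demo(toy_batch, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Ten tokens yield two blocks of four, and the last two tokens (9 and 10) are dropped. Note that blocks can span the boundary between two original examples.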

In [13]:
lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

Now create a batch of examples using DataCollatorForLanguageModeling. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [14]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
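The sketch below illustrates, in plain Python, what dynamic padding for causal language modeling amounts to: sequences are padded to the longest one in the batch, labels start as a copy of the inputs, and padded label positions are set to -100 so that the loss ignores them. The function name and toy ids are made up, and the real collator works on tensors, so treat this as an approximation of the behavior rather than the library's code.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def collate_causal_lm(sequences, pad_token_id):
    """Hypothetical stand-in: pad to the longest sequence and build labels."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Labels mirror the inputs; padded positions are excluded from the loss.
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch


# 50256 is the GPT-2 end-of-sequence id, reused here as the padding id.
batch = collate_causal_lm([[5297, 13, 10271], [3363, 11]], pad_token_id=50256)
print(batch)
```

The one-position shift between inputs and targets is not done here; the causal language model applies it internally when it computes the loss.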

## Train

You’re ready to start training your model now! Load DistilGPT2 with AutoModelForCausalLM:

In [15]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

At this point, only three steps remain:

- Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir which specifies where to save your model. You’ll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model).
- Pass the training arguments to Trainer along with the model, datasets, and data collator.
- Call train() to finetune your model.

In [17]:
training_args = TrainingArguments(
    output_dir="distilgpt2_eli5_clm-model",
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,  # deprecated in newer transformers releases; they expect processing_class=tokenizer
)

trainer.train()


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.9208        | 3.823804        |
| 2     | 3.8234        | 3.814938        |
| 3     | 3.7679        | 3.812513        |


TrainOutput(global_step=3888, training_loss=3.831155659239969, metrics={'train_runtime': 3956.5693, 'train_samples_per_second': 7.859, 'train_steps_per_second': 0.983, 'total_flos': 1015627807457280.0, 'train_loss': 3.831155659239969, 'epoch': 3.0})
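As a rough sanity check on the reported numbers, the step count can be reproduced with a little arithmetic. This sketch assumes the default per-device batch size of 8, a single device, and no gradient accumulation, none of which the output states explicitly; the number of training blocks is only estimated from the reported throughput.

```python
import math

train_runtime = 3956.5693      # seconds, from TrainOutput
samples_per_second = 7.859     # from TrainOutput
num_epochs = 3
batch_size = 8                 # assumption: the TrainingArguments default

# Estimate how many 128-token blocks one epoch contains.
blocks_per_epoch = round(train_runtime * samples_per_second / num_epochs)

# The last partial batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(blocks_per_epoch / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 1296 3888
```

Under these assumptions the result matches the global_step reported in TrainOutput.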

## Evaluation

Once training is completed, use the evaluate() method to evaluate your model and get its perplexity:

In [18]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 45.26
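Perplexity is the exponential of the average per-token cross-entropy loss, so the figure can be checked by hand. The snippet below plugs in the validation loss reported for the final epoch in the training log; it assumes trainer.evaluate() returns that same loss.

```python
import math

eval_loss = 3.812513  # validation loss after epoch 3, from the training log
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.2f}")  # Perplexity: 45.26
```

Lower is better: a perplexity of about 45 means the model is, on average, roughly as uncertain as if it were choosing uniformly among 45 tokens at each position.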


Then share your model on the Hub with the push_to_hub() method so everyone can use it:

In [19]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/Swagam/distilgpt2_eli5_clm-model/commit/9bf45dbfaf46c38f0ad1156beb027002b7813d9d', commit_message='End of training', commit_description='', oid='9bf45dbfaf46c38f0ad1156beb027002b7813d9d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Swagam/distilgpt2_eli5_clm-model', endpoint='https://huggingface.co', repo_type='model', repo_id='Swagam/distilgpt2_eli5_clm-model'), pr_revision=None, pr_num=None)

## Inference

Great, now that you’ve finetuned a model, you can use it for inference!

The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for text generation with your model, and pass your text to it:

In [20]:
from transformers import pipeline

prompt = "Somatic hypermutation allows the immune system to"

generator = pipeline("text-generation", model="Swagam/distilgpt2_eli5_clm-model")
generator(prompt)

Device set to use mps:0


[{'generated_text': "Somatic hypermutation allows the immune system to stop producing antibodies or to suppress the mutation. The immune system is made up of cells that have no known immune function (at least as yet). We've had this happen a long time. It's"}]

In [22]:
prompt = "Planes fly because of"

generator = pipeline("text-generation", model="Swagam/distilgpt2_eli5_clm-model")
generator(prompt)

Device set to use mps:0


[{'generated_text': 'Planes fly because of their ability to communicate through the air. The same problem was explained in a video posted at the International Space Station. Air currents can affect both air and ground currents, both of which can cause a huge ripple in air. In'}]

## Inference Without a Pipeline

You can also reproduce by hand what the pipeline does. First, tokenize the text and return the input_ids as PyTorch tensors:

In [23]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Swagam/distilgpt2_eli5_clm-model")
inputs = tokenizer(prompt, return_tensors="pt").input_ids

Use the generate() method to generate text. For more details about the different text generation strategies and parameters for controlling generation, check out the Text generation strategies page.

In [24]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Swagam/distilgpt2_eli5_clm-model")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Decode the generated token ids back into text:

In [25]:
tokenizer.batch_decode(outputs, skip_special_tokens=True)

["Planes fly because of the air pressure. The aircraft's main engines tend to have the same temperature. The difference between the water temperature and the air pressure is much bigger than the air pressure is. The temperature difference can be more significant. But the bigger the air pressure (the greater it is), the bigger the air pressure.Because it’s a system called a “computer operating system,” the system is comprised of hundreds of software, hardware, and functions that are designed to handle multiple inputs: 1"]
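The warnings printed by generate() earlier appear because only input_ids was passed to it. One way to avoid them, sketched below, is to forward the tokenizer's attention_mask as well and to set pad_token_id explicitly. The helper name generate_text is hypothetical; it accepts any model and tokenizer objects that follow the usual 🤗 Transformers interface.

```python
def generate_text(model, tokenizer, prompt, **generate_kwargs):
    """Hypothetical helper: generate with an explicit attention mask."""
    # Keep the full tokenizer output instead of extracting only input_ids.
    encoded = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        # GPT-2 style models define no pad token, so reuse the EOS token.
        pad_token_id=tokenizer.eos_token_id,
        **generate_kwargs,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Usage with the model, tokenizer, and prompt loaded above (not run here):
# texts = generate_text(model, tokenizer, prompt, max_new_tokens=100,
#                       do_sample=True, top_k=50, top_p=0.95)
```

Because sampling is enabled in the usage line, the generated text will differ from run to run.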