In [1]:
%pip install datasets
%pip install peft
%pip install trl
%pip install bitsandbytes
%pip install ipywidgets

Note: you may need to restart the kernel to use updated packages.


In [2]:
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import DataCollatorForCompletionOnlyLM, SFTConfig, SFTTrainer

In [3]:
model_id = "HuggingFaceTB/SmolLM-135M-Instruct"
dataset_id = "medalpaca/medical_meadow_medical_flashcards"
device = "cuda" if torch.cuda.is_available() else "cpu"

## Preparing and Formatting the Dataset for Training

We start by preparing a dataset for fine-tuning a large language model (LLM). This involves keeping only the columns we need, renaming them to `instruction` and `response`, and splitting the data into training and evaluation sets. Holding out an evaluation set lets us check how well the fine-tuned model generalizes to examples it has not seen.


In [4]:
def format_dataset(dataset, keys, instruction_col_name, response_col_name):
    """Format the dataset by retaining only necessary columns and renaming them."""
    cols_to_remove = [key for key in keys if key not in [instruction_col_name, response_col_name]]
    dataset = dataset.remove_columns(cols_to_remove)
    dataset = dataset.rename_column(instruction_col_name, "instruction")
    dataset = dataset.rename_column(response_col_name, "response")
    return dataset

def prepare_datasets(dataset, instruction_col_name, response_col_name):
    """Format and split the dataset for training and evaluation."""
    available_cols = list(dataset["train"].features.keys())
    formatted_dataset = format_dataset(
        dataset, available_cols, instruction_col_name, response_col_name
    )

    if "valid" in formatted_dataset:
        train_dataset = formatted_dataset["train"]
        eval_dataset = formatted_dataset["valid"]
    elif "test" in formatted_dataset:
        train_dataset = formatted_dataset["train"]
        eval_dataset = formatted_dataset["test"]
    else:
        split_dataset = formatted_dataset["train"].train_test_split(test_size=0.2)
        train_dataset, eval_dataset = split_dataset["train"], split_dataset["test"]

    return train_dataset, eval_dataset

Load the dataset by its Hub ID or local path. It will be used to train and evaluate the model.


In [5]:
dataset = load_dataset(dataset_id)

Print the dataset information to inspect its structure and column names. This is important to understand the data we're working with and ensure that we correctly identify the columns containing the `instructions` and `responses`.


In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'instruction'],
        num_rows: 33955
    })
})

In [7]:
dataset["train"][0]

{'input': 'What is the relationship between very low Mg2+ levels, PTH levels, and Ca2+ levels?',
 'output': 'Very low Mg2+ levels correspond to low PTH levels which in turn results in low Ca2+ levels.',
 'instruction': 'Answer this question truthfully'}

Format the dataset and split it into training and evaluation sets. Here, `input` and `output` represent the columns in the dataset holding the `instruction` and `response`, respectively.


In [8]:
train_dataset, eval_dataset = prepare_datasets(
    dataset, instruction_col_name="input", response_col_name="output"
)

In [9]:
print(f"{train_dataset = }")
print(f"{eval_dataset = }")

train_dataset = Dataset({
    features: ['instruction', 'response'],
    num_rows: 27164
})
eval_dataset = Dataset({
    features: ['instruction', 'response'],
    num_rows: 6791
})
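
The row counts above (27,164 and 6,791) follow from the 80/20 fallback split in `prepare_datasets`. Because that split is random, the rows that land in each set can change between runs unless a seed is fixed. The sketch below illustrates the idea in plain Python; `split_indices` is a hypothetical helper written for this illustration, not part of the `datasets` library. With the real library, passing `seed=` to `train_test_split` serves the same purpose.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off an evaluation share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    n_eval = round(n_rows * test_size)
    return indices[n_eval:], indices[:n_eval]

# 33955 is the number of rows reported for the train split above.
train_idx, eval_idx = split_indices(33955)
print(len(train_idx), len(eval_idx))

# Calling the helper again with the same seed yields the identical split.
assert split_indices(33955) == (train_idx, eval_idx)
```

A fixed seed matters here because later cells index into `eval_dataset` by position, so an unseeded split can return a different example on each run.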


## Load and Test Pre-trained Model

Define Functions for Response Generation and Display


In [10]:
def generate_response(model, tokenizer, instruction, device="cpu"):
    """Generate a response from the model based on an instruction."""
    messages = [{"role": "user", "content": instruction}]
    input_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
    outputs = model.generate(
        inputs, max_new_tokens=128, temperature=0.2, top_p=0.9, do_sample=True
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

def print_example(example):
    """Print an example from the dataset."""
    print("Original Dataset Example:")
    print(f"Instruction: {example['instruction']}")
    print(f"Response: {example['response']}")
    print("-" * 100)

def print_response(response):
    """Print the model's response."""
    print("Model response:")
    print(response.split("assistant\n")[-1])
    print("-" * 100)
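
To see why `print_response` splits on `"assistant\n"`, it helps to look at the shape of the prompt. The sketch below assumes the tokenizer's chat template follows the ChatML layout (the same `<|im_start|>` / `<|im_end|>` markers used in the training prompt later in this notebook). The real template lives in the tokenizer configuration and may differ, for example by adding a system message. `build_chatml_prompt` and `strip_special_tokens` are hypothetical stand-ins for `apply_chat_template` and `skip_special_tokens=True`.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render chat messages in a ChatML-style layout (assumed, see note above)."""
    prompt = ""
    for message in messages:
        prompt += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its answer.
        prompt += "<|im_start|>assistant\n"
    return prompt

def strip_special_tokens(text):
    """Rough stand-in for decoding with skip_special_tokens=True."""
    for token in ("<|im_start|>", "<|im_end|>"):
        text = text.replace(token, "")
    return text

prompt = build_chatml_prompt([{"role": "user", "content": "What is PTH?"}])
# Pretend the model completed the assistant turn with a short answer.
decoded = strip_special_tokens(prompt + "Parathyroid hormone.<|im_end|>")
answer = decoded.split("assistant\n")[-1]
print(repr(answer))
```

Once the special tokens are stripped, the role names remain as plain text, so everything after the last `"assistant\n"` is the model's answer. The split is a heuristic: an answer that itself contains that string would be cut short.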

Load the Model, and the Tokenizer


In [11]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

Test the Pre-trained Model


In [12]:
# Define a test example
example1 = eval_dataset[1]

response = generate_response(model, tokenizer, example1["instruction"], device)

print_example(example1)
print_response(response)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Original Dataset Example:
Instruction: What thyroid imbalance is associated with anxiety?
Response: Hyperthyroidism presents with anxiety.
----------------------------------------------------------------------------------------------------
Model response:
Thyroid imbalance is a common condition that can contribute to anxiety. Thyroid dysfunction, which affects the thyroid gland, can lead to anxiety symptoms. Here are some ways thyroid imbalance can contribute to anxiety:

1. **Hypothyroidism**: Hypothyroidism, or underactive thyroid, can cause anxiety by disrupting the body's natural balance of hormones. This can lead to feelings of fatigue, weight gain, and mood disturbances.
2. **Hyperthyroidism**: Hyperthyroidism, or overactive thyroid, can cause anxiety by disrupting the body's natural balance of hormones. This can lead to feelings of anxiety, jitteriness, and rapid heartbeat.
----------------------------------------------------------------------------------------------------


### Supervised Fine-tuning Trainer


#### Training adapters [Read More](https://huggingface.co/docs/trl/v0.9.6/en/sft_trainer#training-adapters)


Hugging Face's TRL library integrates tightly with the PEFT library, so we can conveniently train small adapters instead of updating the entire model.


In [13]:
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none"
)
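
As a rough illustration of what `r=16` and `lora_alpha=32` mean: in standard LoRA, a frozen weight matrix of shape `d_out x d_in` is augmented with two small trainable matrices of shapes `d_out x r` and `r x d_in`, and their product is scaled by `lora_alpha / r`. The sketch below does the arithmetic for a single layer. The layer size of 576 is an illustrative assumption, not a value read from the model, and both helper functions are hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r):
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / r

r, lora_alpha = 16, 32      # values from the LoraConfig above
d_in = d_out = 576          # illustrative layer size (assumption)

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"scaling factor: {lora_scaling(lora_alpha, r)}")
print(f"full matrix: {full} params, adapter: {added} params ({added / full:.1%})")
```

The adapter trains only a small fraction of the parameters of the layer it wraps, which is why the saved adapter file is much smaller than the base model.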

#### Customize prompts using packed dataset [Read More](https://huggingface.co/docs/trl/en/sft_trainer#customize-your-prompts-using-packed-dataset)


Since our dataset has two fields, `instruction` and `response`, we need to combine them into a single string before we can pass it to the SFT Trainer.


In [14]:
def formatting_prompts_func(example: dict) -> str:
    """Format prompt for training."""
    text = f"<|im_start|>user\n{example['instruction']}<|im_end|>\n<|im_start|>assistant\n{example['response']}<|im_end|>"
    return text

## Training Parameters


In [20]:
num_train_epochs = 5  # matches the logged run below (1645 steps, ending at epoch 5.0)

output_dir = f"{model_id.split('/')[-1]}-{dataset_id.split('/')[-1]}-{num_train_epochs}epochs"

sft_config = SFTConfig(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    max_seq_length=512,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    save_steps=500,  # save checkpoints every n training steps
    logging_steps=500,
    learning_rate=1e-3,
    weight_decay=0.001,
    fp16=False,
    bf16=True,
    warmup_ratio=0.05,  # has no effect with "constant"; use "constant_with_warmup" to apply warmup
    lr_scheduler_type="constant",
    packing=True
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    formatting_func=formatting_prompts_func,
    peft_config=peft_config,
    args=sft_config,
)
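
With `packing=True`, short examples are concatenated into fixed-length sequences instead of being padded individually. The sketch below shows the general idea on toy token IDs. It is a simplified illustration under stated assumptions, not TRL's actual implementation, which may handle separators and leftover tokens differently. `pack_sequences` is a hypothetical helper.

```python
def pack_sequences(tokenized_examples, block_size, eos_token_id):
    """Concatenate examples (separated by EOS) and cut the stream into equal blocks."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
        stream.append(eos_token_id)  # mark the boundary between examples
    n_blocks = len(stream) // block_size  # this sketch drops any incomplete final block
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Toy token IDs; 0 plays the role of the EOS token.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_sequences(examples, block_size=4, eos_token_id=0)
for block in blocks:
    print(block)
```

Note that a block can start or end in the middle of an example. For short question and answer pairs like these flashcards, packing keeps every position of the 512-token sequences filled with real tokens rather than padding.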

In [21]:
print(torch.cuda.memory_summary(device=None, abbreviated=False))

|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 2            |        cudaMalloc retries: 2         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   1112 MiB |   4883 MiB |  41364 MiB |  40251 MiB |
|       from large pool |   1079 MiB |   4851 MiB |  38604 MiB |  37525 MiB |
|       from small pool |     32 MiB |     34 MiB |   2759 MiB |   2726 MiB |
|---------------------------------------------------------------------------|
| Active memory         |   1112 MiB |   4883 MiB |  41364 MiB |  40251 MiB |
|       from large pool |   1079 MiB |   4851 MiB |  38604 MiB |  37525 MiB |
|       from small pool |     32 MiB |     34 MiB |   2759 MiB |   2726 MiB |
|---------------------------------------------------------------

In [22]:
trainer.train()

  0%|          | 0/1645 [00:00<?, ?it/s]

{'loss': 1.4248, 'grad_norm': 0.11244793981313705, 'learning_rate': 0.001, 'epoch': 1.52}


{'loss': 1.3429, 'grad_norm': 0.14638756215572357, 'learning_rate': 0.001, 'epoch': 3.04}


{'loss': 1.2936, 'grad_norm': 0.14381001889705658, 'learning_rate': 0.001, 'epoch': 4.56}


{'train_runtime': 1721.3863, 'train_samples_per_second': 15.281, 'train_steps_per_second': 0.956, 'train_loss': 1.3473510649428904, 'epoch': 5.0}


TrainOutput(global_step=1645, training_loss=1.3473510649428904, metrics={'train_runtime': 1721.3863, 'train_samples_per_second': 15.281, 'train_steps_per_second': 0.956, 'total_flos': 8656664365301760.0, 'train_loss': 1.3473510649428904, 'epoch': 5.0})
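
The numbers in this log can be cross-checked with a little arithmetic. The sketch below takes the values reported above and assumes a single GPU (an assumption, since the device count is not shown). It estimates how many packed sequences make up one epoch.

```python
# Values copied from the training configuration and the log above.
global_step = 1645
epochs = 5.0
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_devices = 1  # assumption: single-GPU run

# Sequences consumed per optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = global_step / epochs
sequences_per_epoch = steps_per_epoch * effective_batch

# Independent estimate from the reported throughput and runtime.
train_samples_per_second = 15.281
train_runtime = 1721.3863
sequences_from_throughput = train_samples_per_second * train_runtime / epochs

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch:.0f}")
print(f"sequences per epoch (from steps): {sequences_per_epoch:.0f}")
print(f"sequences per epoch (from throughput): {sequences_from_throughput:.0f}")
```

Both estimates come out far below the 27,164 training rows. That is consistent with packing, which merges many short flashcards into each 512-token sequence. The step-based figure is an upper bound, because the last batch of an epoch may be only partly filled.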

In [23]:
trainer.evaluate()

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


  0%|          | 0/166 [00:00<?, ?it/s]

{'eval_runtime': 25.3956,
 'eval_samples_per_second': 52.253,
 'eval_steps_per_second': 6.537,
 'epoch': 5.0}

In [24]:
trainer.save_model()

#### Test Fine-tuned Model


In [26]:
ft_model = AutoModelForCausalLM.from_pretrained(output_dir).to(device)

response = generate_response(ft_model, tokenizer, example1["instruction"], device)

print_example(example1)
print_response(response)

Original Dataset Example:
Instruction: What thyroid imbalance is associated with anxiety?
Response: Hyperthyroidism presents with anxiety.
----------------------------------------------------------------------------------------------------
Model response:
Hypothyroidism is the thyroid imbalance that is associated with anxiety.
----------------------------------------------------------------------------------------------------


#### Test Many Responses


In [40]:
import random

for i in range(10):
    # randrange excludes the upper bound, so the index is always valid (randint would include it)
    example = eval_dataset[random.randrange(len(eval_dataset))]
    test_response = generate_response(ft_model, tokenizer, example["instruction"], device)

    print("=======================", (i+1), "==========================")
    print_example(example)
    print_response(test_response)


Original Dataset Example:
Instruction: What type of toxin is produced by Shigella dysenteriae?
Response: Shigella dysenteriae produces Shiga toxin. Shigella is a type of bacteria that can cause an infection called shigellosis. There are several species of Shigella, and each one may produce different types of toxins. Shigella dysenteriae is one species that is known to produce a toxin called Shiga toxin. This toxin can cause damage to the lining of the intestine and lead to symptoms such as bloody diarrhea, abdominal pain, and fever. In severe cases, Shiga toxin can also cause a condition called hemolytic uremic syndrome (HUS), which can lead to kidney failure and other complications. Treatment for shigellosis typically involves rest, fluids, and antibiotics if necessary. If a person with shigellosis develops severe symptoms or complications such as HUS, they may require hospitalization and more intensive treatment.
-----------------------------------------------------------------------
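
When spot-checking several evaluation rows, it can help to draw distinct indices with a fixed seed so the same examples can be revisited later. The sketch below is one minimal way to do that with the standard library; `sample_indices` is a hypothetical helper, and 6791 is the evaluation set size reported earlier.

```python
import random

def sample_indices(n_rows, k, seed=0):
    """Draw k distinct row indices in [0, n_rows) using a seeded generator."""
    rng = random.Random(seed)
    return rng.sample(range(n_rows), k)

indices = sample_indices(6791, 10)
print(indices)
```

Unlike `random.randint`, whose upper bound is inclusive, sampling from `range(n_rows)` can only yield valid indices, and `sample` never returns the same index twice.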

Push the fine-tuned model to your Hugging Face account


In [41]:
import os

# Read the access token from the environment; never hard-code secrets in a notebook.
hf_access_token = os.environ.get("HF_TOKEN")
if hf_access_token:
    trainer.push_to_hub(token=hf_access_token)
