**Medical Reasoning Fine-tuning** on the *GPT-2* model

Using the [Medical Reasoning Dataset by FreedomIntelligence](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT) and trying to get good results with plain fine-tuning, without overcomplicating things.

In [1]:
!pip install transformers datasets torch


In [2]:
# Importing required libraries
from datasets import load_dataset
import pandas as pd
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
import torch

In [3]:
# 1. Load the dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", split="train")
dataset


Dataset({
    features: ['Question', 'Complex_CoT', 'Response'],
    num_rows: 25682
})

In [4]:
# 2. Initialize model and tokenizer
# Using LLaMA-2 7B or similar models would be ideal, but for demonstration we'll use gpt2 as it will be faster to train
MODEL_NAME = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


In [5]:
# Add padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

In [6]:
# 3. Split dataset
split_datasets = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split_datasets["train"]
validation_dataset = split_datasets["test"]

In [7]:
# 4. Defining Tokenization function
def tokenize_function(examples):
    max_length = 512

    # Formatting the text to match the desired input-output pattern
    formatted_texts = []
    for question, thinking, response in zip(examples["Question"],
                                         examples["Complex_CoT"],
                                         examples["Response"]):
        # Format: Question: [question] Thinking: [thinking] Response: [response]
        text = f"Question: {question}\nThinking: {thinking}\nResponse: {response}{tokenizer.eos_token}"
        formatted_texts.append(text)

    # Tokenize
    tokenized = tokenizer(
        formatted_texts,
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    )

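    # For causal language modeling, labels are a copy of input_ids
    # (the model shifts them internally when computing the loss)
    tokenized["labels"] = tokenized["input_ids"].clone()

    # The tokenizer already returns a correct attention_mask. Do not rebuild it
    # from pad_token_id: pad and EOS share an id here, so that would also mask
    # the real end-of-text token.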

    return tokenized
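
The template used by `tokenize_function` can be looked at on its own. The sketch below is dependency-free and only illustrative: `format_example` and `inference_prompt` are hypothetical helper names that are not part of the notebook, the sample record is made up, and the EOS string is hard-coded to GPT-2's `<|endoftext|>` instead of being read from the tokenizer.

```python
# Hypothetical helpers that mirror the template in tokenize_function.
EOS_TOKEN = "<|endoftext|>"  # GPT-2's end-of-text token, hard-coded for illustration


def format_example(question: str, thinking: str, response: str) -> str:
    """Build one training string in the Question/Thinking/Response layout."""
    return f"Question: {question}\nThinking: {thinking}\nResponse: {response}{EOS_TOKEN}"


def inference_prompt(question: str) -> str:
    """Build the generation prompt: the training text cut off after 'Thinking:'."""
    return f"Question: {question}\nThinking:"


# Made-up record, only to show the layout.
sample = format_example(
    "What vitamin deficiency causes scurvy?",
    "Scurvy is linked to impaired collagen synthesis, which depends on vitamin C.",
    "Vitamin C deficiency.",
)
print(sample)
```

Because the inference prompt is a prefix of the training text, the model is asked to continue a pattern it has already seen during fine-tuning.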

In [8]:
# 5. Apply tokenization
tokenized_train_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=train_dataset.column_names
)
tokenized_validation_dataset = validation_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=validation_dataset.column_names
)

Map:   0%|          | 0/23113 [00:00<?, ? examples/s]

Map:   0%|          | 0/2569 [00:00<?, ? examples/s]

In [9]:
# 6. Configure training arguments
training_args = TrainingArguments(
    output_dir="./medical_reasoning_model_checkpoints",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    save_strategy="epoch",
    save_total_limit=2,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    fp16=torch.cuda.is_available(),
    report_to="none",
    gradient_accumulation_steps=4,
)
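
The batch settings above determine how many optimizer steps the run takes. Here is a rough worked calculation, assuming a single device, the 23,113 rows of the training split, and that the Trainer counts `len(dataloader) // gradient_accumulation_steps` update steps per epoch (my understanding of its behaviour, not something the notebook states):

```python
import math

num_train_examples = 23113          # rows in the training split
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_train_epochs = 3
num_devices = 1                     # assumption: one GPU

# Number of examples that contribute to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Mini-batches per epoch (the last, smaller batch is kept)
batches_per_epoch = math.ceil(num_train_examples / (per_device_train_batch_size * num_devices))

# Optimizer updates per epoch and over the whole run
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size)  # 16
print(batches_per_epoch)     # 5779
print(total_updates)         # 4332
```

Under these assumptions the total agrees with the `global_step=4332` that `trainer.train()` reports later in the notebook.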



In [10]:
# 7. Create data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # because we want causal language modeling
)
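
One consequence of reusing EOS as the pad token: to my understanding, `DataCollatorForLanguageModeling` with `mlm=False` rebuilds `labels` from `input_ids` and sets every position equal to `pad_token_id` to `-100`, so the genuine end-of-text token is excluded from the loss as well. The sketch below reproduces that masking rule on plain Python lists (no `transformers` needed) and compares it with masking by attention mask. The function names and the token ids other than 50256 are made up for illustration.

```python
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss skips by default
EOS_ID = 50256        # GPT-2's <|endoftext|> id, reused as the pad id in this notebook
PAD_ID = EOS_ID


def mask_by_pad_id(input_ids):
    """Mask every label equal to the pad id (the rule described above)."""
    return [IGNORE_INDEX if tok == PAD_ID else tok for tok in input_ids]


def mask_by_attention(input_ids, attention_mask):
    """Mask only the positions that the attention mask marks as padding."""
    return [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(input_ids, attention_mask)]


# Left-padded toy sequence: two pads, three made-up content tokens, then the real EOS.
input_ids = [PAD_ID, PAD_ID, 101, 102, 103, EOS_ID]
attention_mask = [0, 0, 1, 1, 1, 1]

print(mask_by_pad_id(input_ids))                     # [-100, -100, 101, 102, 103, -100]
print(mask_by_attention(input_ids, attention_mask))  # [-100, -100, 101, 102, 103, 50256]
```

If the model never sees EOS as a target, it has little reason to emit it, which would be consistent with generations that run until `max_length`. A dedicated pad token or attention-mask-based label masking are two possible ways around this.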

In [11]:
# 8. Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)




In [12]:
# 9. Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
0,9.7541,2.311642
1,9.2766,2.259904
2,9.2533,2.247069


TrainOutput(global_step=4332, training_loss=9.611415632323656, metrics={'train_runtime': 4383.9694, 'train_samples_per_second': 15.816, 'train_steps_per_second': 0.988, 'total_flos': 1.811276365824e+16, 'train_loss': 9.611415632323656, 'epoch': 2.9994808790448175})
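
The validation losses in the table are easier to compare as perplexities. This is a small sketch, assuming the reported validation loss is the mean per-token cross-entropy in nats, so that perplexity is simply `exp(loss)`:

```python
import math

# Validation loss for epochs 0-2, copied from the training table
validation_loss = [2.311642, 2.259904, 2.247069]

# Perplexity is the exponential of the mean cross-entropy
perplexity = [math.exp(loss) for loss in validation_loss]

for epoch, ppl in enumerate(perplexity):
    print(f"epoch {epoch}: perplexity = {ppl:.2f}")
```

Under that assumption, perplexity falls from roughly 10.1 to roughly 9.5 over the three epochs, a modest improvement.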

In [13]:
# 10. Save the final model
model.save_pretrained("./medical_reasoning_final_model")
tokenizer.save_pretrained("./medical_reasoning_final_model")

('./medical_reasoning_final_model/tokenizer_config.json',
 './medical_reasoning_final_model/special_tokens_map.json',
 './medical_reasoning_final_model/vocab.json',
 './medical_reasoning_final_model/merges.txt',
 './medical_reasoning_final_model/added_tokens.json',
 './medical_reasoning_final_model/tokenizer.json')

In [14]:
# 11. Test inference function
def generate_medical_response(question, model, tokenizer, max_length=512):
    prompt = f"Question: {question}\nThinking:"
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)

    if torch.cuda.is_available():
        inputs = inputs.to("cuda")
        model = model.to("cuda")

    # Generate response; **inputs passes both input_ids and attention_mask
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=1,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

    # Decode and return the response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response
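
The function above returns the prompt and the continuation as one string. A possible way to pull the sections apart afterwards is sketched below; `split_generation` is a hypothetical helper, not part of the notebook, and it assumes the output follows the `Question:` / `Thinking:` / `Response:` layout used in training.

```python
def split_generation(text: str) -> dict:
    """Split 'Question: ... Thinking: ... Response: ...' text into its parts.

    Missing sections come back as empty strings, for example when
    generation stops before 'Response:' is produced.
    """
    _, _, rest = text.partition("Question:")
    question, _, rest = rest.partition("\nThinking:")
    thinking, _, response = rest.partition("\nResponse:")
    return {
        "question": question.strip(),
        "thinking": thinking.strip(),
        "response": response.strip(),
    }


# Made-up strings, only to show the behaviour.
complete = split_generation("Question: Q?\nThinking: Some reasoning.\nResponse: An answer.")
truncated = split_generation("Question: Q?\nThinking: Reasoning that was cut off")
print(complete)
print(truncated)
```

An empty `response` field is a quick signal that the generation hit the length limit before the model reached its final answer.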

In [15]:
# Example usage of inference
example_question = "I have been experiencing severe headaches and dizziness for the past week. What could be wrong?"
response = generate_medical_response(example_question, model, tokenizer)
print(response)



Question: I have been experiencing severe headaches and dizziness for the past week. What could be wrong?
Thinking: Alright, let's think about this. I've been dealing with some pretty serious headaches and dizziness for the past week, and now I'm feeling really tired and have to take a break. Oh, and the doctor said it's going to take a while for me to get used to these symptoms. So, I'm really feeling like I'm in a really bad mood.

I remember there are some classic symptoms like nausea, vomiting, and even feeling nauseous. These can be pretty serious. The doctor also mentioned some dizziness. That sounds pretty concerning.

Now, let's look at what could be wrong with my brain. I've heard that there's a buildup of adrenaline in the brain, which can lead to these symptoms. This can happen when there's too much adrenaline in the blood, leading to these symptoms.

Let me think. Could it be something like a tumor? If I'm not careful, I might not be able to feel the pain or feel the dizzin

In [None]:
# Infinite loop so that the notebook does not close due to inactivity
import time

while True:
    time.sleep(60)  # sleep instead of busy-waiting so the CPU stays idle