## Code to Chapter 7 of LangChain for Life Science and Healthcare book, by Dr. Ivan Reznikov

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1oMijLBEHHl_tcX_-b7qlBNXIwAUwmL0M?usp=sharing)

## Fine-tuning DeepSeek Reasoning Model for Biology Questions

This notebook demonstrates how to fine-tune a DeepSeek-R1-Distill model specifically for biology questions using the Unsloth framework. The model will learn to provide step-by-step reasoning for biology problems.

- **Base Model**: DeepSeek-R1-Distill-Qwen-1.5B (reasoning-capable model)
- **Task**: Biology question answering with chain-of-thought reasoning
- **Method**: LoRA (Low-Rank Adaptation) fine-tuning
- **Framework**: Unsloth for efficient training

<font color='yellow'>Warning: this notebook uses GPU T4</font>

## 1. Setting Up Working Environment

First, we install the necessary packages and set up authentication for Hugging Face.
Unsloth provides optimized implementations for faster training and inference.

In [1]:
%%capture
!pip install unsloth vllm
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [2]:
from google.colab import userdata
import os

os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

## 2. Loading the Model and Tokenizer

We're using DeepSeek-R1-Distill-Qwen-1.5B, which is specifically designed for reasoning tasks.
The model supports chain-of-thought reasoning with <think> tags.

### Key Parameters:
- **max_seq_length**: 2048 tokens (sufficient for biology questions and reasoning)
- **load_in_4bit**: Enables 4-bit quantization for memory efficiency
- **dtype**: Auto-detected based on hardware capabilities

In [3]:
from unsloth import FastLanguageModel

max_seq_length = 2048
dtype = None
load_in_4bit = True


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B",
    #model_name = "unsloth/DeepSeek-R1-Distill-Qwen-7B",
    #model_name = "unsloth/DeepSeek-R1-Distill-Qwen-14B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 07-22 08:37:13 [__init__.py:244] Automatically detected platform cuda.
==((====))==  Unsloth 2025.7.6: Fast Qwen2 patching. Transformers: 4.53.2. vLLM: 0.9.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.81G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

## 3. Loading and Processing the Dataset

### Prompt Template Design

The prompt template is specifically designed for DeepSeek-R1 reasoning models:
- Uses `<think>` tags to encourage step-by-step reasoning
- Follows a clear instruction → question → response format
- Incorporates the reasoning chain from the training data

The template structure mirrors how the model was originally trained, making fine-tuning more effective.

In [4]:
train_prompt_style = """Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are an expert in biology.
Please answer the following question.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

In [5]:
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    inputs = examples["question"]
    cots = [x["reasoning"] for x in examples["metadata"]]
    outputs = examples["answer"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }

### Dataset Loading and Inspection

We'll use the ToT-Biology dataset which contains biology questions with reasoning chains.
This dataset is ideal for training models to provide step-by-step biological reasoning.


In [6]:
from datasets import Dataset, load_dataset

dataset = load_dataset("moremilk/ToT-Biology")

README.md:   0%|          | 0.00/10.0k [00:00<?, ?B/s]

ToT-Biology-32k.json:   0%|          | 0.00/78.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/23000 [00:00<?, ? examples/s]

In [7]:
dataset['train'][0]

{'answer': 'Cell theory, a foundational principle in biology, states that:\n\n*   All living organisms are composed of one or more cells.\n*   The cell is the basic unit of structure and organization in organisms.\n*   Cells arise from pre-existing cells.\n\n### Key Contributors and Observations:\n\n*   **Robert Hooke (1665):** Observed compartments in cork using a microscope and coined the term "cell."  He observed the **compartmentalization** of the cork tissue but did not understand the cell as a fundamental unit of life. His observations were primarily structural.\n*   **Anton van Leeuwenhoek (1670s):**  Observed living microorganisms ("animalcules") in pond water and other samples using more powerful microscopes. His discovery of **living microscopic entities** expanded the understanding of the diversity of life beyond what was visible to the naked eye.\n*   **Matthias Schleiden (1838):** A botanist, concluded that **all plants are made of cells**.  He meticulously studied various

In [8]:
dataset = dataset['train'].map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/23000 [00:00<?, ? examples/s]

In [9]:
dataset

Dataset({
    features: ['answer', 'id', 'metadata', 'question', 'text'],
    num_rows: 23000
})

In [10]:
dataset[0]

{'answer': 'Cell theory, a foundational principle in biology, states that:\n\n*   All living organisms are composed of one or more cells.\n*   The cell is the basic unit of structure and organization in organisms.\n*   Cells arise from pre-existing cells.\n\n### Key Contributors and Observations:\n\n*   **Robert Hooke (1665):** Observed compartments in cork using a microscope and coined the term "cell."  He observed the **compartmentalization** of the cork tissue but did not understand the cell as a fundamental unit of life. His observations were primarily structural.\n*   **Anton van Leeuwenhoek (1670s):**  Observed living microorganisms ("animalcules") in pond water and other samples using more powerful microscopes. His discovery of **living microscopic entities** expanded the understanding of the diversity of life beyond what was visible to the naked eye.\n*   **Matthias Schleiden (1838):** A botanist, concluded that **all plants are made of cells**.  He meticulously studied various

## 4. Fine-Tuning Setup with LoRA

### LoRA (Low-Rank Adaptation) Configuration

We use LoRA to efficiently fine-tune the model by updating only a small number of parameters:

- **Target Modules**: We target key attention and MLP layers for maximum impact
- **Rank (r=16)**: Controls the bottleneck dimension in LoRA layers
- **Alpha (16)**: Scaling factor for LoRA updates
- **Dropout (0)**: No dropout in LoRA layers for this experiment

### Target Modules Explained:
- `q_proj, k_proj, v_proj, o_proj`: Self-attention mechanism components
- `gate_proj, up_proj, down_proj`: Feed-forward network components

In [11]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank - higher values capture more complexity but use more parameters
    target_modules=[
        "q_proj",     # Query projection in self-attention
        "k_proj",     # Key projection in self-attention
        "v_proj",     # Value projection in self-attention
        "o_proj",     # Output projection from attention
        "gate_proj",  # Gate projection in FFN (controls information flow)
        "up_proj",    # Up projection in FFN (expands dimensionality)
        "down_proj",  # Down projection in FFN (reduces dimensionality)
    ],
    lora_alpha=16,        # LoRA scaling parameter
    lora_dropout=0,       # No dropout in LoRA layers
    bias="none",          # Don't update bias terms
    use_gradient_checkpointing="unsloth",  # Memory optimization
    random_state=2025,    # For reproducibility
    use_rslora=False,     # Standard LoRA (not rank-stabilized)
    loftq_config=None,    # No quantization-aware LoRA
)

Unsloth 2025.7.6 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


## 5. Training Configuration

### Data Collator Setup
We use a completion-only data collator that focuses training on the response portion,
ignoring the instruction and question parts during loss calculation.

In [12]:
from trl import DataCollatorForCompletionOnlyLM

instruction_template = "### Instruction:"
response_template = "### Response:"
collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_template, response_template=response_template, tokenizer=tokenizer, mlm=False)

### Training Arguments Configuration

Key training parameters:
- **Batch Size**: 8 per device with 4 gradient accumulation steps (effective batch size: 32)
- **Learning Rate**: 5e-5 (conservative for fine-tuning)
- **Steps**: 200 training steps (adjust based on dataset size)
- **Optimizer**: AdamW with 8-bit precision for memory efficiency

In [13]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    # data_collator=collator,  # Uncomment to use completion-only training
    dataset_num_proc=2,  # Parallel processing for dataset
    args=TrainingArguments(
        per_device_train_batch_size=1,    # Batch size per GPU
        gradient_accumulation_steps=32,     # Effective batch size = gradient_accumulation_steps * per_device_train_batch_size = 32
        warmup_steps=10,                   # Learning rate warmup
        max_steps=200,                     # Total training steps
        learning_rate=5e-5,                # Conservative learning rate
        fp16=not is_bfloat16_supported(),  # Use fp16 if bfloat16 not available
        bf16=is_bfloat16_supported(),      # Use bfloat16 if supported (more stable)
        logging_steps=10,                  # Log every 10 steps
        optim="adamw_8bit",               # 8-bit optimizer for memory efficiency
        weight_decay=0.01,                # L2 regularization
        lr_scheduler_type="linear",       # Linear learning rate decay
        seed=2025,                        # For reproducibility
        output_dir="outputs",             # Output directory
        report_to="none"                  # Disable wandb/tensorboard logging
    ),
)

Unsloth: Tokenizing ["text"]:   0%|          | 0/23000 [00:00<?, ? examples/s]

## 6. Training the Model

Now we begin the actual fine-tuning process. The training will show:
- Loss values (should generally decrease over time)
- Training speed and memory usage
- Gradient norms and learning rate schedule

In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 23,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 32
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 32 x 1) = 32
 "-____-"     Trainable parameters = 18,464,768 of 1,795,552,768 (1.03% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,2.2199
20,2.0936
30,1.9338
40,1.7978
50,1.7259
60,1.672
70,1.625
80,1.5937
90,1.5545
100,1.5564


In [15]:
model.save_pretrained("bio-tuned-deepseek-r1")
tokenizer.save_pretrained("bio-tuned-deepseek-r1")

('bio-tuned-deepseek-r1/tokenizer_config.json',
 'bio-tuned-deepseek-r1/special_tokens_map.json',
 'bio-tuned-deepseek-r1/chat_template.jinja',
 'bio-tuned-deepseek-r1/tokenizer.json')

## 7. Loading Models for Comparison

We load both the original base model and our fine-tuned model to compare their performance
on biology questions. This allows us to evaluate the effectiveness of our fine-tuning.

In [16]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

bio_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "bio-tuned-deepseek-r1",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2025.7.6: Fast Qwen2 patching. Transformers: 4.53.2. vLLM: 0.9.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
==((====))==  Unsloth 2025.7.6: Fast Qwen2 patching. Transformers: 4.53.2. vLLM: 0.9.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## 8. Model Inference and Comparison

### Inference Prompt Template

Note: The inference prompt has a slightly different instruction than training
to test generalization. In practice, you might want to use the exact same format
as training for optimal performance.

In [17]:
responses = []

In [18]:
question_prompt_style = """Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>{}"""

In [19]:
def generate_text(question, _model, max_length=2048):
    FastLanguageModel.for_inference(_model)  # Unsloth has 2x faster inference!
    inputs = tokenizer([question_prompt_style.format(question, "")], return_tensors="pt").to("cuda")

    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=max_length,
        temperature = 0.1,
    )
    response = tokenizer.batch_decode(outputs)
    return response

We'll be using the following evaluation table:

| Score  | Meaning                                                                                  |
| ------ | ---------------------------------------------------------------------------------------- |
| **–1** | **Incorrect** – Wrong conclusion or major misconception                                  |
| **0**  | **Uncertain / No useful answer** – No clear answer given or model admits it doesn't know |
| **1**  | **Somewhat correct** – Contains partial truth or insight but flawed overall              |
| **3**  | **Completely correct** – Accurate answer with clear and logical reasoning                |


### Test Case 1: Membrane Potential and Ion Permeability
Testing understanding of cellular physiology and electrochemical gradients.

In [20]:
_response = {}
question = "If a cell's membrane becomes more permeable to sodium ions but less permeable to potassium ions, what would likely happen to the cell's resting membrane potential? Why?"

In [21]:
output = generate_text(question, model)
_response["base_reasoning"] = output
print(output[0].split("### Response:")[1])


<think>
Okay, so I'm trying to figure out what happens to a cell's resting membrane potential if its membrane becomes more permeable to sodium ions but less permeable to potassium ions. Let me break this down step by step.

First, I remember that the resting membrane potential is the difference between the inside and outside of a cell's membrane. It's usually around 70 millivolts (mV) when the cell is resting. This potential is maintained by ion channels and pumps that allow ions to flow in and out of the cell.

Now, sodium ions (Na+) and potassium ions (K+) are both ions that can cross the membrane. I think sodium is a larger ion compared to potassium, so maybe it's easier for sodium ions to pass through the membrane than potassium ions. But wait, I'm not entirely sure about the permeability of sodium and potassium channels. I recall that sodium channels are more permeable than potassium channels, but I'm not certain about the exact order.

The question says the membrane becomes more

In [22]:
output = generate_text(question, bio_model)
_response["tuned_reasoning"] = output
print(output[0].split("### Response:")[1])


<think>
Okay, so I'm trying to figure out what happens to a cell's resting membrane potential if its membrane becomes more permeable to sodium ions but less permeable to potassium ions. Let me break this down step by step.

First, I remember that the resting membrane potential is the difference between the inside and outside of a cell's membrane. It's usually around 70 millivolts (mV) when the cell is resting. This potential is maintained by ion channels and pumps that allow ions to flow in and out of the cell.

Now, sodium ions (Na+) and potassium ions (K+) are both important ions, but they have different properties. Sodium is a larger ion compared to potassium, which is smaller. I think this affects how easily they can enter or exit the cell.

The cell membrane's permeability to sodium and potassium depends on the ion's charge and size. Sodium ions are positively charged, and potassium ions are negatively charged. I remember that positively charged ions are harder to push into the c

In [23]:
_response["question"] = question
responses.append(_response)

**Ground Truth:**
Increased Na⁺ permeability and decreased K⁺ permeability cause the membrane potential to become **less negative** (i.e., **increase** or depolarize).

| Model     | Score | Notes                                                                                                     |
| --------- | :---: | --------------------------------------------------------------------------------------------------------- |
| **Base**  |   -1   | Concludes a decrease in potential (hyperpolarization), which is wrong. Repetitive and confused reasoning. |
| **Tuned** |   1   | Correct final answer (increase) but poor and incorrect explanation (mixes up charges and movements).      |

### Test Case 2: Genetics and Phenotype Expression
Testing knowledge of dominance relationships and phenotypic expression.

In [24]:
_response = {}
question = """Suppose a mouse population has individuals with genotypes AA, Aa, and aa for coat color, and A is completely dominant over a. Will there be 3 different coat color phenotypes visible in the population?"""

In [25]:
output = generate_text(question, model)
_response["base_reasoning"] = output
print(output[0].split("### Response:")[1])


<think>
Okay, so I'm trying to figure out this genetics question about mice and coat color. The question is about whether there will be three different phenotypes in the population based on the genotypes AA, Aa, and aa. The dominant allele A is completely over the recessive a. 

First, I remember that in genetics, the phenotype is the observable trait, which in this case is the coat color. Since A is dominant, any mouse with at least one A allele will have the dominant phenotype, which I think is black. The aa genotype would have the recessive phenotype, which is white. 

So, looking at the genotypes given: AA is black, Aa is also black because they have at least one A, and aa is white. That means there are only two phenotypes: black and white. 

Wait, but the question is asking if there are three different phenotypes. So, if all three genotypes were present, we would have three phenotypes. But here, only AA, Aa, and aa are present. AA is black, Aa is black, and aa is white. So, only 

In [26]:
output = generate_text(question, bio_model)
_response["tuned_reasoning"] = output
print(output[0].split("### Response:")[1])


<think>
Okay, so I'm trying to figure out whether a mouse population with genotypes AA, Aa, and aa will show three different coat color phenotypes. Let me break this down step by step.

First, I know that coat color in mice is determined by a genetic trait, and here it's completely dominant. So, A is the dominant allele, and a is the recessive one. That means the presence of at least one A allele will result in the AA or Aa genotype, which would be phenotype A. The aa genotype would be phenotype a.

Now, looking at the genotypes given: AA, Aa, and aa. Each of these corresponds to a specific phenotype based on the dominant and recessive alleles. So, AA would be phenotype A, Aa would also be phenotype A because the dominant A is present, and aa would be phenotype a.

Wait, but the question is asking if there are three different phenotypes. So, if I have AA, Aa, and aa, that's three different genotypes. But according to the dominance, AA and Aa both would show phenotype A, and aa would s

In [27]:
_response["question"] = question
responses.append(_response)

**Ground Truth:**
AA and Aa give the same (dominant) phenotype; aa gives recessive. **Only 2 phenotypes** occur.

| Model     | Score | Notes                                                               |
| --------- | :---: | ------------------------------------------------------------------- |
| **Base**  |   1   | Correct answer and logic, but a bit verbose and initially confused. |
| **Tuned** |   3   | Fully correct, clean and concise reasoning.                         |

### Test Case 3: Evolutionary Biology and Natural Selection
Testing understanding of competitive advantage and evolutionary dynamics.

In [28]:
_response = {}
question = "In a population of bacteria, a mutation occurs that makes the bacteria produce a toxic compound that kills nearby competitors. However, the bacteria with this mutation grow more slowly than non-mutated bacteria when alone. Would you expect this mutation to become more or less common over time? Why?"

In [29]:
output = generate_text(question, model)
_response["base_reasoning"] = output
print(output[0].split("### Response:")[1])


<think>
Okay, so I'm trying to figure out whether this mutation in bacteria will become more or less common over time. Let me break this down step by step.

First, the mutation makes the bacteria produce a toxic compound. That toxic compound is supposed to kill nearby competitors. So, if a bacterium with this mutation is in a crowded area, it might be toxic to other bacteria nearby. But the mutation also makes these bacteria grow more slowly when they're alone. So, if they're not around much, they don't grow as fast as normal bacteria.

Now, I need to think about how this mutation affects their survival and reproduction. If the toxic compound is present, it might make it harder for these bacteria to survive when they're in a group. But wait, the mutation also affects their growth rate when alone. So, when they're alone, they don't grow as quickly as others. That could mean they might not be able to reproduce as effectively when they're in a group because they're not as fit.

But how d

In [30]:
output = generate_text(question, bio_model)
_response["tuned_reasoning"] = output
print(output[0].split("### Response:")[1])


<think>
Okay, so I'm trying to figure out whether this mutation in bacteria will become more or less common over time. Let me break it down step by step.

First, the mutation makes the bacteria produce a toxic compound. That toxic compound is supposed to kill nearby competitors. So, if a bacterium with this mutation is around another bacterium that doesn't have the mutation, the one with the mutation will probably die because it's toxic to the competitor.

But wait, the mutation also makes these bacteria grow more slowly when they're alone. So, if they're not around any competitors, they grow slower. But when they are around competitors, they die because of the toxic compound.

Now, I need to think about how this affects their survival and reproduction. If a bacterium with this mutation is in a crowded area, it will die because it's toxic to the nearby competitors. But if it's alone, it grows more slowly, which might mean it's harder to reproduce or survive on its own.

So, in a popul

In [31]:
_response["question"] = question
responses.append(_response)

**Ground Truth:**
Slower growth makes the mutant less fit. Mutation becomes **less common**.

| Model     | Score | Notes                                                                                 |
| --------- | :---: | ------------------------------------------------------------------------------------- |
| **Base**  |   1   | Correct outcome but flawed reasoning about reduced fitness. Repetitive.                      |
| **Tuned** |   -1   | Incorrect reasoning (misunderstands who is harmed), but ends with correct conclusion. |

### Test Case 4: Ecology and Energy Transfer
Testing quantitative understanding of ecological energy pyramids.

In [32]:
_response = {}
question = """In a pond ecosystem, if there are 100 producer organisms (algae), and each trophic level transfers 10% of energy to the next level,
what is the maximum number of tertiary consumers (third-level predators) that could be supported?"""

In [33]:
output = generate_text(question, model)
_response["base_reasoning"] = output
print(output[0].split("### Response:")[1])


<think>
Okay, so I'm trying to figure out this pond ecosystem question. Let me break it down step by step. 

First, the pond has 100 producer organisms, which are the algae. These are the primary producers, right? So they're the first level of the food chain. Each trophic level transfers 10% of the energy to the next level. That means each level is losing 90% of its energy to the next level.

The question is asking about the maximum number of tertiary consumers, which are third-level predators. So, they're predators that depend on the producers and the first-level consumers. 

I think I need to calculate how much energy each trophic level can support. Since each level is losing 90%, the energy available at each level is 10% of the previous level. 

Starting with producers, they have 100 units of energy. Then, each consumer level takes 10% of the previous level. So, for the first consumer (primary consumers), they take 10% of 100, which is 10 units. 

But wait, the question is about te

In [34]:
output = generate_text(question, bio_model)
_response["tuned_reasoning"] = output
print(output[0].split("### Response:")[1])


<think>
Okay, so I'm trying to figure out this pond ecosystem question. Let me start by understanding what's being asked. There are 100 producer organisms, which are the algae. Each trophic level transfers 10% of energy to the next level. I need to find out the maximum number of tertiary consumers, which are third-level predators, that could be supported.

First, I should recall what trophic levels are. The producer is at level 1. Then, each consumer at a higher level consumes the previous one. So, level 2 would be primary consumers, level 3 would be secondary consumers, and level 4 would be tertiary consumers. Wait, no, actually, tertiary consumers are at level 4, right? Because each level is one step up in the food chain.

Now, each trophic level transfers 10% of the energy to the next level. So, if I start with 100 producers, each level will have 10% of the energy from the previous level. But wait, energy isn't transferred directly; it's transferred as metabolic work. So, each cons

In [35]:
_response["question"] = question
responses.append(_response)

**Ground Truth:**
Each level gets 10% of the energy from the level below. From 100 producers → 10 → 1 → **0.1 tertiary consumers** possible (at most).

| Model     | Score | Notes                                                  |
| --------- | :---: | ------------------------------------------------------ |
| **Base**  |   0   | Never concludes, just loops through calculations.      |
| **Tuned** |   1   | Same issue, no clear final answer, but better reasoning (`consumer would need 0.1 units of energy`). |


### Test Case 5: Microbiology and Population Genetics
Testing complex quantitative reasoning involving growth rates and genetic drift.


In [36]:
_response = {}
question = """In a bacterial population of 10⁶ cells carrying a plasmid with genes for both ampicillin resistance and GFP expression,
what is the minimum number of generations needed to lose the plasmid in 99.9% of the population if:
-Plasmid loss rate is 0.1% per cell division
-Cells with plasmid grow 20% slower than cells without plasmid
-No antibiotic selection is present"""

In [37]:
output = generate_text(question, model)
_response["base_reasoning"] = output
print(output[0].split("### Response:")[1])


<think>
Okay, so I'm trying to figure out this bacterial population problem. Let me break it down step by step. 

First, there are 10^6 cells in the population. Each cell has a plasmid with genes for Ampicillin resistance and GFP expression. The goal is to find the minimum number of generations needed for the plasmid to be lost in 99.9% of the population. 

The plasmid loss rate is 0.1% per cell division. That means each generation, there's a 0.1% chance that a cell with the plasmid will lose it. But wait, there's also another factor: cells with plasmids grow 20% slower than those without. So, if a cell has the plasmid, it might not replicate as quickly, which could affect the overall population dynamics.

But the question is about losing the plasmid, not about the growth rates. So, the plasmid loss rate is the key here. Each generation, each cell has a 0.1% chance of losing the plasmid. 

Now, we need to model this as a probability problem. Since each cell's fate is independent, we c

In [38]:
output = generate_text(question, bio_model)
_response["tuned_reasoning"] = output
print(output[0].split("### Response:")[1])


<think>
Okay, so I'm trying to figure out this bacterial population problem. Let me break it down step by step. 

First, there are 10^6 cells in the population. Each cell has a plasmid with genes for Ampicillin resistance and GFP expression. The goal is to find the minimum number of generations needed for the plasmid to be lost in 99.9% of the population. 

The plasmid loss rate is 0.1% per cell division. That means each generation, there's a 0.1% chance that a cell with the plasmid will lose it. But wait, there's also another factor: cells with plasmids grow 20% slower than those without. So, if a cell has the plasmid, it might not replicate as efficiently, which could affect the overall population dynamics.

But the question is about losing the plasmid, not about replication. So, the loss rate is the key here. Each generation, each cell has a 0.1% chance of losing the plasmid. But since the population is large (10^6), we can model this using probability.

I think this is a problem t

In [39]:
_response["question"] = question
responses.append(_response)

**Ground Truth:**
Use $(0.999)^n = 0.001$ → $n \approx 6906$ generations.

| Model     | Score | Notes                                                                                    |
| --------- | :---: | ---------------------------------------------------------------------------------------- |
| **Base**  |   1   | Starts correct but then spirals into confusion with mixed probability models.            |
| **Tuned** |   1   | Calculates correct value but then doubts it and brings in a flawed expected-value model. |

### Test Case 6: Plant Biology and Ecology
Testing understanding of plant-pollinator interactions and experimental design.

In [40]:
_response = {}
question = "A particular species of flowering plant produces nectar with a much higher sugar concentration than most other plants. However, this species has very low seed production compared to related species when growing in its natural habitat. When the same species is grown in a greenhouse with hand pollination, seed production is normal. What might explain this pattern, and what experiment would you design to test your hypothesis?"

In [41]:
output = generate_text(question, model)
_response["base_reasoning"] = output
print(output[0].split("### Response:")[1])


<think><think>
Okay, so I'm trying to figure out why this particular flowering plant species has such a high sugar concentration in its nectar but low seed production when it's grown in a greenhouse with hand pollination. Let me break this down step by step.

First, I know that nectar is usually a source of water and nutrients for plants. High sugar concentration in nectar might mean that the plant is getting more water and nutrients from the nectar. But why would that affect seed production? I remember that seeds are produced when a plant can transfer pollen to another plant, usually through pollination. So, if the plant is producing low seed production, maybe it's not able to transfer pollen effectively.

Now, the plant is in a greenhouse with hand pollination. Hand pollination is when a plant is pollinated by a nearby plant, like a flower or another plant, without the help of a professional pollinator. I think this might be different from natural pollination, which is usually done 

In [42]:
output = generate_text(question, bio_model)
_response["tuned_reasoning"] = output
print(output[0].split("### Response:")[1])


<think><think>
Okay, so I'm trying to figure out this question about a flowering plant species. Let me break it down step by step. 

First, the plant produces nectar with a much higher sugar concentration than most other plants. That makes sense because nectar is usually sweet, so higher sugar content would mean it's more nutritious. But then, when this plant grows in its natural habitat, its seed production is low. That's interesting because usually, plants that produce nectar might have higher seed production, but here it's the opposite.

Now, when the same species is grown in a greenhouse with hand pollination, the seed production is normal. Hand pollination is a method used in some countries to pollinate plants without using pollinator-friendly chemicals. So, in the greenhouse, they're pollinating manually, which might be different from natural pollination. 

The question is asking why the seed production is low in the natural habitat but normal in the greenhouse with hand pollina

In [43]:
_response["question"] = question
responses.append(_response)

**Ground Truth:**
Pollinator limitation (due to nectar) explains low seed set. Hand pollination solves it. Test via pollinator vs hand/no pollination.

| Model     | Score | Notes                                                                              |
| --------- | :---: | ---------------------------------------------------------------------------------- |
| **Base**  |   1   | Understands the problem and proposes a valid experiment, but reasoning is muddled. |
| **Tuned** |   -1   | Misunderstands the role of nectar and proposes irrelevant experimental design.     |


### Test Case 7: Biochemistry - Urea Cycle
Testing detailed knowledge of metabolic pathways.

In [48]:
_response = {}
question = "Detail the step-by-step process of the urea cycle within liver cells, explaining how specific enzymes convert toxic ammonia into urea for excretion."

In [49]:
output = generate_text(question, model)
_response["base_reasoning"] = output
print(output[0].split("### Response:")[1])


<think><think>
Okay, so I need to explain how the urea cycle works in the liver. I remember that the liver has something to do with excreting ammonia, but I'm a bit fuzzy on the details. Let me think through this step by step.

First, I think the liver processes ammonia. Ammonia is toxic, so the liver must convert it into urea. Urea is a waste product, right? So the liver must have enzymes that break down ammonia into urea. But wait, how does that happen?

I recall that ammonia is a nitrogenous compound, and urea is a nitrogenous compound too. So maybe the liver has enzymes that can convert one into the other. But I'm not sure which enzymes do that. I think there are two main enzymes involved here: one that converts ammonia into urea and another that breaks down urea further.

Let me try to outline the steps. The first step is probably the breakdown of ammonia. I think ammonia is converted into urea by the liver's ammonia oxidase. So ammonia (NH3) is oxidized to urea (Urea). But wait,

In [50]:
output = generate_text(question, bio_model)
_response["tuned_reasoning"] = output
print(output[0].split("### Response:")[1])


<think><think>
Okay, so I need to explain how the urea cycle works in the liver. I remember that the liver has something to do with excreting ammonia, but I'm a bit fuzzy on the details. Let me think through this step by step.

First, I think the liver processes ammonia. Ammonia is toxic, so the liver must convert it into urea. Urea is a waste product, right? So the liver must have enzymes that break down ammonia into urea. But how does that happen?

I recall that ammonia is a nitrogenous molecule, and the liver has enzymes that can break it down. Maybe it's something like ammonia deaminase or something similar. These enzymes are probably in the liver's cytoplasm. So the first step is ammonia entering the liver.

Once it's in the liver, it must be processed. I think the liver has a complex system of enzymes and transporters. Maybe it's something like the ammonia deaminase complex. This complex breaks down ammonia into urea. So ammonia is converted into urea in the liver.

But wait, ho

In [51]:
_response["question"] = question
responses.append(_response)

**Ground Truth:**
CPS1 → OTC → ASS1 → ASL → ARG1. Converts ammonia → urea. All mammalian enzymes are known.

| Model     | Score | Notes                                             |
| --------- | :---: | ------------------------------------------------- |
| **Base**  |   -1   | Entirely wrong; fake enzymes and fictional steps. |
| **Tuned** |   -1   | Same issue—completely incorrect pathway.          |

### Test Case 8: Agricultural Biology
Testing applied biological knowledge in sustainable farming.

In [52]:
_response = {}
question = "Describe at least three specific ways that rotating different types of crops (e.g., legumes, grains) sequentially on the same land enhances soil health and reduces the need for synthetic fertilizers."

In [53]:
output = generate_text(question, model)
_response["base_reasoning"] = output
print(output[0].split("### Response:")[1])


<think>
Okay, so I need to figure out how rotating different types of crops on the same land can enhance soil health and reduce the need for synthetic fertilizers. Let me start by thinking about what I know about crop rotation and soil health.

First, I remember that crop rotation involves rotating different crops within the same field or area. This practice is often used in agriculture to improve soil fertility. The idea is that each crop provides different nutrients and nutrients that other crops can absorb. For example, if you rotate between legumes like beans and grains like corn, each crop can provide nitrogen, phosphorus, and potassium, which are essential nutrients for plants.

Now, how does this help with soil health? Well, I think that when you rotate crops, you're exposing the soil to different microorganisms. These microorganisms can break down organic matter and improve soil structure. Also, different crops might have different root systems, which can help the soil retain 

In [54]:
output = generate_text(question, bio_model)
_response["tuned_reasoning"] = output
print(output[0].split("### Response:")[1])


<think>
Okay, so I need to figure out how rotating different types of crops on the same land can enhance soil health and reduce the need for synthetic fertilizers. Let me start by thinking about what I know about crop rotation and soil health.

First, I remember that crop rotation involves rotating different crops in the same field over a period of time. This practice is supposed to improve soil health because it helps maintain the structure of the soil and prevents soil erosion. But how does that relate to reducing synthetic fertilizers?

Well, synthetic fertilizers are usually applied regularly to maintain soil fertility. If you rotate crops, you're not applying the same type of fertilizer every year. So, maybe that reduces the need for synthetic fertilizers because you're not consistently applying them to the same land.

Let me think about each crop type. For example, if you rotate legumes like beans and peas, followed by grains like corn and wheat, and then back to legumes, this c

In [55]:
_response["question"] = question
responses.append(_response)

**Ground Truth:**
Legumes fix N, different crops restore nutrients, prevent erosion, support microbes. Reduces fertilizer need.

| Model     | Score | Notes                                                                                       |
| --------- | :---: | ------------------------------------------------------------------------------------------- |
| **Base**  |   1   | Covers many correct ideas (roots, microbes, erosion), but includes minor misunderstandings. |
| **Tuned** |   1   | Bit more vague; misses key mechanisms like nitrogen fixation. Poor explanation.   

## 9. Saving Results for Analysis

All responses are saved to a JSON file for further analysis and comparison.
This allows for systematic evaluation of the fine-tuning effectiveness.

### Key Evaluation Metrics to Consider:
1. **Reasoning Quality**: Does the model show clearer step-by-step thinking?
2. **Biological Accuracy**: Are the scientific facts and concepts correct?
3. **Depth of Explanation**: Does the fine-tuned model provide more detailed explanations?
4. **Consistency**: Are responses more consistent in format and quality?

Overall we can see that in some cases the tuned model is much better.

The purpose of this was to conduct a simple experiment and learn how to fine-tune models, rather than the model itself. For better results try larger base models, longer finetuning and look into data sources.

In [60]:
!zip -r /content/bio-tuned-deepseek-r1.zip /content/bio-tuned-deepseek-r1

  adding: content/bio-tuned-deepseek-r1/ (stored 0%)
  adding: content/bio-tuned-deepseek-r1/chat_template.jinja (deflated 75%)
  adding: content/bio-tuned-deepseek-r1/adapter_model.safetensors (deflated 8%)
  adding: content/bio-tuned-deepseek-r1/README.md (deflated 65%)
  adding: content/bio-tuned-deepseek-r1/special_tokens_map.json (deflated 70%)
  adding: content/bio-tuned-deepseek-r1/adapter_config.json (deflated 57%)
  adding: content/bio-tuned-deepseek-r1/tokenizer.json (deflated 81%)
  adding: content/bio-tuned-deepseek-r1/tokenizer_config.json (deflated 88%)


In [61]:
from google.colab import files
files.download("/content/bio-tuned-deepseek-r1.zip")
files.download("/content/responses.json")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

### Bonus Test Case: Stem Cell Biology and Regenerative Medicine
Testing knowledge of current therapeutic applications.

In [None]:
_response = {}
question = "What specific roles do different types of stem cells play in cellular aging, and what are some concrete examples of stem cell-based therapies being explored to treat conditions like osteoarthritis or macular degeneration?"

In [None]:
output = generate_text(question, model)
_response["base_reasoning"] = output
print(output[0].split("### Response:")[1])


<think><think>
Okay, so I need to figure out what different types of stem cells play in cellular aging and how stem cell therapies are used to treat conditions like osteoarthritis and macular degeneration. Let me start by recalling what I know about stem cells and cellular aging.

Stem cells are the building blocks of the body, right? They can differentiate into various cell types. Cellular aging is when cells age, usually through damage like oxidative stress or nutrient depletion. Different stem cell types might have different vulnerabilities to this aging process.

First, let's think about the main types of stem cells. There are fibroblasts, mesenchymal stem cells (MSCs), natural stem cells (NSCs), and embryonic stem cells (ESCs). Each has its own role in cellular aging.

Fibroblasts are part of the fibroblast family. They are involved in cell proliferation and differentiation. If they age, they might not be able to perform their functions as well. Fibroblasts can be used in stem ce

In [None]:
output = generate_text(question, bio_model)
_response["tuned_reasoning"] = output
print(output[0].split("### Response:")[1])


<think><think>
Okay, so I need to figure out what different types of stem cells do in cellular aging and then think about specific therapies using them to treat conditions like osteoarthritis or macular degeneration. Let me start by recalling what I know about stem cells and cellular aging.

Stem cells are the building blocks of the body, right? They can differentiate into various cell types. But cellular aging is a process where cells age, and stem cells play a role in this. I remember hearing that there are different types of stem cells, like embryonic stem cells, induced stem cells, and adult stem cells. Each type might have different roles in aging.

So, embryonic stem cells are the most primitive. They can differentiate into various cell types, including muscle, skin, and even blood cells. But they might not be as robust in older age. Maybe they have issues with differentiation or can't handle stress better. I think they might be involved in conditions like osteoarthritis because

In [None]:
_response["question"] = question
responses.append(_response)

In [56]:
import json
with open('responses.json', 'w') as f:
    json.dump(responses, f)