## 02 — Train Flan-T5-Base + LoRA for AppetIte

This notebook fine-tunes **google/flan-t5-base** using **LoRA** on the AppetIte dataset.

Goals:
- Input: ingredients list
- Output: recipe title + step-by-step instructions
- Model: Flan-T5-Base (instruction-tuned)
- Fine-tuning: LoRA (parameter-efficient, M2-friendly)

Assumptions:
- Preprocessed CSVs are available in `data/processed/`:
  - `appetite_train.csv`
  - `appetite_val.csv`
  - `appetite_test.csv`

In [1]:
!pip install transformers datasets peft evaluate sacrebleu rouge_score sentencepiece accelerate --quiet

In [2]:
import os
import gc
import torch
import pandas as pd
from datasets import Dataset

from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq
)

from peft import LoraConfig, get_peft_model, TaskType
import evaluate

os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [3]:
for obj_name in ["model", "trainer"]:
    if obj_name in globals():
        del globals()[obj_name]

gc.collect()
if torch.backends.mps.is_available():
    torch.mps.empty_cache()

if torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

device

'mps'

### Load Processed AppetIte Dataset

We load the cleaned CSV files produced by the preprocessing notebook.

In [4]:
train_df = pd.read_csv("data/processed/appetite_train.csv")
val_df   = pd.read_csv("data/processed/appetite_val.csv")
test_df  = pd.read_csv("data/processed/appetite_test.csv")

train_df.head()

| | Title | ingredients_text | target_text | Image_Name |
|---|---|---|---|---|
| 0 | 3-Ingredient Sweet Potato Casserole with Maple... | Olive oil (for pan), 5 1/2 pounds sweet potato... | Title: 3-Ingredient Sweet Potato Casserole wit... | 3-ingredient-sweet-potato-casserole-with-maple... |
| 1 | Sea Bass with Marinated Vegetables | 1 ripe medium tomato, 1 garlic clove, smashed,... | Title: Sea Bass with Marinated Vegetables\nIns... | sea-bass-with-marinated-vegetables-242301 |
| 2 | Spicy Pork Posole | 1 tablespoon olive oil, 6 ounces pork tenderlo... | Title: Spicy Pork Posole\nInstructions: Heat o... | spicy-pork-posole-357091 |
| 3 | Roast Pork Belly Toasts with Blood-Orange BBQ ... | One 2-pound piece of boneless, skinless pork b... | Title: Roast Pork Belly Toasts with Blood-Oran... | roast-pork-belly-toasts-with-blood-orange-bbq-... |
| 4 | Farmers' Market Salad with Spiced Goat Cheese ... | 2 tablespoons sesame seeds, 2 teaspoons ground... | Title: Farmers' Market Salad with Spiced Goat ... | farmers-market-salad-with-spiced-goat-cheese-r... |


### Final Text Sanitization

We ensure all text fields are proper strings, with no NaN, None, or list objects.
This prevents tokenizer crashes.

In [5]:
def clean_field(x):
    if x is None:
        return ""
    if isinstance(x, float) and pd.isna(x):  # pandas represents missing cells as float NaN
        return ""
    if isinstance(x, list):
        return ", ".join(map(str, x))
    return str(x)

In [6]:
for df_temp in [train_df, val_df, test_df]:
    df_temp["ingredients_text"] = df_temp["ingredients_text"].apply(clean_field)
    df_temp["target_text"]      = df_temp["target_text"].apply(clean_field)

print("Sanitization complete.")
print("Train size:", train_df.shape)
print("Val size:  ", val_df.shape)
print("Test size: ", test_df.shape)

Sanitization complete.
Train size: (10796, 4)
Val size:   (1349, 4)
Test size:  (1350, 4)


### Convert Pandas → HuggingFace Datasets

In [7]:
train_ds = Dataset.from_pandas(train_df)
val_ds   = Dataset.from_pandas(val_df)
test_ds  = Dataset.from_pandas(test_df)

train_ds

Dataset({
    features: ['Title', 'ingredients_text', 'target_text', 'Image_Name'],
    num_rows: 10796
})

### Load Flan-T5-Base

In [8]:
MODEL_NAME = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

tokenizer.padding_side = "right"

model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo):

### Configure LoRA (Low-Rank Adaptation)

In [9]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],  
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

model.to(device)

trainable params: 1,769,472 || all params: 249,347,328 || trainable%: 0.7096


PeftModelForSeq2SeqLM(
  (base_model): LoraModel(
    (model): T5ForConditionalGeneration(
      (shared): Embedding(32128, 768)
      (encoder): T5Stack(
        (embed_tokens): Embedding(32128, 768)
        (block): ModuleList(
          (0): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(
                (SelfAttention): T5Attention(
                  (q): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.05, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=768, out_features=16, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=16, out_features=768, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
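
The trainable-parameter count printed by `print_trainable_parameters()` can be checked by hand. The sketch below assumes the usual T5-base layout (12 encoder blocks with one attention module each, 12 decoder blocks with self-attention plus cross-attention) and that LoRA adds one `lora_A` and one `lora_B` matrix to each targeted `q` and `v` projection. The constant names are illustrative and not part of the notebook.

```python
# Hand check of the LoRA parameter count (illustrative names only).
D_MODEL = 768          # hidden size shown in the printed Linear layers
RANK = 16              # r in LoraConfig
TARGETS_PER_ATTN = 2   # target_modules=["q", "v"]

ENCODER_ATTN = 12      # assumed: one self-attention module per encoder block
DECODER_ATTN = 12 * 2  # assumed: self-attention + cross-attention per decoder block

# Each adapted projection gains lora_A (D_MODEL x RANK) and lora_B (RANK x D_MODEL).
params_per_projection = D_MODEL * RANK + RANK * D_MODEL
lora_params = (ENCODER_ATTN + DECODER_ATTN) * TARGETS_PER_ATTN * params_per_projection

ALL_PARAMS = 249_347_328  # "all params" from the printed summary
trainable_pct = 100 * lora_params / ALL_PARAMS

print(f"{lora_params:,} trainable ({trainable_pct:.4f}%)")
```

Under these assumptions the arithmetic reproduces the reported 1,769,472 trainable parameters, roughly 0.71% of the model.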
            

### Tokenization & Prompt Design

For each example we construct:

**Input (to the model):**

> "Given the following ingredients: {ingredients_text}\nWrite a cooking recipe with a clear title and step-by-step instructions."

**Target (to be generated):**

`target_text` (from preprocessing):
> "Title: ...\nInstructions: ..."

We also cap the sequence lengths to fit Apple M2 memory.
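
As a concrete illustration of the input/target pairing, the sketch below builds one pair from plain strings. `build_prompt` and `build_target` are hypothetical helpers introduced here for clarity; in the notebook the target already arrives as `target_text` from preprocessing, and the prompt wording follows the tokenization code.

```python
def build_prompt(ingredients_text: str) -> str:
    """Hypothetical helper: wrap an ingredient list in the training prompt."""
    return (
        f"Given the following ingredients: {ingredients_text}\n"
        "Write a cooking recipe with a clear title and step-by-step instructions."
    )


def build_target(title: str, instructions: str) -> str:
    """Hypothetical helper: mirror the 'Title: ...\\nInstructions: ...' layout."""
    return f"Title: {title}\nInstructions: {instructions}"


# Made-up example values, used only to show the shape of one training pair.
print(build_prompt("2 eggs, 1 cup flour, 1 cup milk"))
print(build_target("Simple Pancakes", "Whisk everything together. Cook in a hot pan."))
```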

In [10]:
MAX_INPUT_LEN = 256
MAX_TARGET_LEN = 256

def preprocess_function(batch):
    inputs = [
        f"Given the following ingredients: {ing}\n"
        f"Write a cooking recipe with a clear title and step-by-step instructions."
        for ing in batch["ingredients_text"]
    ]

    model_inputs = tokenizer(
        inputs,
        max_length=MAX_INPUT_LEN,
        truncation=True,
        padding="max_length"
    )

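    # `text_target` tokenizes the targets; it replaces the deprecated
    # `as_target_tokenizer()` context manager and yields the same ids for T5.
    labels = tokenizer(
        text_target=batch["target_text"],
        max_length=MAX_TARGET_LEN,
        truncation=True,
        padding="max_length"
    )

    model_inputs["labels"] = labels["input_ids"]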
    return model_inputs
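
One thing to be aware of: because the targets are padded to `MAX_TARGET_LEN` with the pad token, those padded positions count as ordinary labels unless they are replaced with `-100`, the value the cross-entropy loss ignores by default. A minimal sketch of that masking step follows. `mask_padding` is a hypothetical helper that is not part of the notebook, and the toy ids assume T5's pad id `0` and end-of-sequence id `1`.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Toy batch: 0 is the pad id, 1 is the end-of-sequence id.
padded = [[363, 19, 1, 0, 0], [8, 1, 0, 0, 0]]
print(mask_padding(padded, pad_token_id=0))
# → [[363, 19, 1, -100, -100], [8, 1, -100, -100, -100]]
```

In the preprocessing function this could be applied as `model_inputs["labels"] = mask_padding(labels["input_ids"], tokenizer.pad_token_id)`. Doing so changes what the loss measures, so the loss values would no longer be comparable with a run that left the padding unmasked.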

In [11]:
train_tok = train_ds.map(preprocess_function, batched=True)
val_tok   = val_ds.map(preprocess_function, batched=True)
test_tok  = test_ds.map(preprocess_function, batched=True)

train_tok

Dataset({
    features: ['Title', 'ingredients_text', 'target_text', 'Image_Name', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 10796
})

### Data Collator

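We use `DataCollatorForSeq2Seq` to assemble batches. It can pad dynamically to the longest sequence in each batch, but because tokenization above already pads every example to `max_length`, there is no additional padding left to add in this notebook.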

In [12]:
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model
)
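
To make the padding behaviour concrete, here is a simplified pure-Python sketch of padding a batch to its longest member. It is an illustration under stated assumptions (pad id `0` for inputs, `-100` for labels, hypothetical function names), not the library's implementation.

```python
def pad_to_longest(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (width - len(seq)) for seq in sequences]


def collate(features, pad_token_id=0, label_pad_id=-100):
    """Toy stand-in for a seq2seq collator working on plain Python lists."""
    input_ids = [f["input_ids"] for f in features]
    return {
        "input_ids": pad_to_longest(input_ids, pad_token_id),
        # 1 marks real tokens, 0 marks padding.
        "attention_mask": pad_to_longest([[1] * len(ids) for ids in input_ids], 0),
        # Labels are padded with -100 so the loss skips the padded positions.
        "labels": pad_to_longest([f["labels"] for f in features], label_pad_id),
    }


batch = collate([
    {"input_ids": [5, 6, 7, 1], "labels": [9, 1]},
    {"input_ids": [8, 1], "labels": [4, 3, 2, 1]},
])
print(batch["input_ids"])       # [[5, 6, 7, 1], [8, 1, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
print(batch["labels"])          # [[9, 1, -100, -100], [4, 3, 2, 1]]
```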

### Evaluation Metrics: ROUGE-L and BLEU

In [13]:
rouge = evaluate.load("rouge")
bleu = evaluate.load("sacrebleu")


def compute_metrics(eval_pred):
    preds, labels = eval_pred

    # Convert NumPy arrays to nested lists so every token id is a plain Python int.
    preds = preds.tolist()
    labels = labels.tolist()

    clean_preds = []
    for seq in preds:
        clean_seq = []
        for tok in seq:
            if isinstance(tok, int) and 0 <= tok < tokenizer.vocab_size:
                clean_seq.append(tok)
            else:
                clean_seq.append(tokenizer.pad_token_id)
        clean_preds.append(clean_seq)

    clean_labels = []
    for seq in labels:
        clean_seq = []
        for tok in seq:
            if tok == -100 or tok is None:
                clean_seq.append(tokenizer.pad_token_id)
            else:
                clean_seq.append(tok)
        clean_labels.append(clean_seq)

    pred_texts = tokenizer.batch_decode(clean_preds, skip_special_tokens=True)
    label_texts = tokenizer.batch_decode(clean_labels, skip_special_tokens=True)

    rouge_res = rouge.compute(predictions=pred_texts, references=label_texts)
    bleu_res = bleu.compute(
        predictions=pred_texts,
        references=[[ref] for ref in label_texts]
    )

    return {
        "rougeL": rouge_res["rougeL"],
        "bleu": bleu_res["score"]
    }
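
For intuition about what the ROUGE-L number measures, the sketch below computes a simplified ROUGE-L F-score from the longest common subsequence (LCS) of two token lists. It assumes lowercase whitespace tokenization, which differs from the tokenization used by the `rouge` metric, so treat it as an explanation of the idea rather than a reimplementation. Both function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L: harmonic mean of LCS-based precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# The LCS here is "heat oil in a skillet" (5 of 6 tokens on each side).
score = rouge_l_f1("Heat oil in a large skillet", "Heat the oil in a skillet")
print(round(score, 4))  # → 0.8333
```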

### Training Configuration

Training setup:

- Flan-T5-Base with LoRA
- Batch size = 1 on M2 (MPS), 2 on CPU
- 3 epochs (you can increase to 4–5 later)
- No checkpoint saving (to avoid disk overflow)
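
These settings fix the length of the run. The sketch below works out the number of optimizer steps, assuming a single device and no gradient accumulation. `total_training_steps` is a hypothetical helper, and the row count is the training-set size reported earlier.

```python
import math


def total_training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a run with one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


print(total_training_steps(10_796, batch_size=1, epochs=3))  # → 32388
print(total_training_steps(10_796, batch_size=2, epochs=3))  # → 16194
```

With `logging_steps=500`, a batch-size-1 run therefore logs the training loss 64 times.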

In [14]:
BATCH_SIZE = 1 if device == "mps" else 2

training_args = Seq2SeqTrainingArguments(
    output_dir="model/flan_t5_appetite_checkpoints",
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    num_train_epochs=3,
    logging_steps=500,
    save_strategy="no",       
    predict_with_generate=True,
    generation_max_length=256,
)

training_args



### Initialize Trainer

In [15]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer


<transformers.trainer_seq2seq.Seq2SeqTrainer at 0x30ef5a2d0>

### Train the Model

In [24]:
train_result = trainer.train()
train_result



| Step | Training Loss |
|---|---|
| 500 | 6.1666 |
| 1000 | 2.6823 |
| 1500 | 2.2015 |
| 2000 | 2.0187 |
| 2500 | 1.916 |
| 3000 | 1.9165 |
| 3500 | 1.9555 |
| 4000 | 1.8992 |
| 4500 | 1.894 |
| 5000 | 1.9257 |


TrainOutput(global_step=32388, training_loss=1.874864421598084, metrics={'train_runtime': 11807.3115, 'train_samples_per_second': 2.743, 'train_steps_per_second': 2.743, 'total_flos': 1.1176988201975808e+16, 'train_loss': 1.874864421598084, 'epoch': 3.0})

### Evaluate on Test Set

We compute ROUGE-L and BLEU on the held-out test split.

In [25]:
test_metrics = trainer.evaluate(test_tok)
test_metrics



{'eval_loss': 1.547447919845581,
 'eval_model_preparation_time': 0.0108,
 'eval_rougeL': 0.22973915357940208,
 'eval_bleu': 7.959352841736187,
 'eval_runtime': 5476.317,
 'eval_samples_per_second': 0.247,
 'eval_steps_per_second': 0.247,
 'epoch': 3.0}

### Save Final Model

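We save the fine-tuned model and tokenizer for inference. Because `model` is a PEFT model, `save_model` writes the LoRA adapter weights and config rather than a full copy of Flan-T5-Base.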

In [26]:
SAVE_DIR = "model/flan_t5_appetite_lora"
os.makedirs(SAVE_DIR, exist_ok=True)

trainer.save_model(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

print("Model saved to:", SAVE_DIR)

Model saved to: model/flan_t5_appetite_lora
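
To use the saved files in a fresh session, the adapter has to be attached to the base model again. The sketch below shows one way to do that with `PeftModel.from_pretrained`. `load_appetite_model` is a hypothetical helper, its defaults simply reuse the paths and model name from this notebook, and merging the adapter is optional.

```python
def load_appetite_model(adapter_dir="model/flan_t5_appetite_lora",
                        base_name="google/flan-t5-base",
                        merge=False):
    """Hypothetical helper: rebuild the fine-tuned model from base weights + adapter."""
    # Imported here so the heavy libraries are loaded only when the helper is called.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    base_model = AutoModelForSeq2SeqLM.from_pretrained(base_name)
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    if merge:
        # Fold the LoRA weights into the base layers for adapter-free inference.
        model = model.merge_and_unload()
    model.eval()
    return model, tokenizer
```

A call such as `model, tokenizer = load_appetite_model()` would download the base weights if they are not cached and then apply the adapter stored in `adapter_dir`.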


### Quick Inference Test

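We test the model on a sample ingredient list. The sample is the first row of the training set, so this is a sanity check rather than a held-out evaluation. Note that the inference prompt below is more detailed than the one used during fine-tuning.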

In [37]:
model.eval()

def generate_recipe(ingredients_text, max_new_tokens=256):

    prompt = (
        f"Given the following ingredients: {ingredients_text}\n"
        f"Write a well-formatted cooking recipe.\n"
        f"Format STRICTLY as:\n"
        f"Title: <recipe title>\n"
        f"\n"
        f"Instructions:\n"
        f"1. <step 1>\n"
        f"2. <step 2>\n"
        f"3. <step 3>\n"
        f"Each step MUST be on a new line. with step numbers\n"
    )

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True,
                       max_length=256).to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            num_beams=4,
            early_stopping=True,
            no_repeat_ngram_size=3
        )

    text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    text = text.replace("Instructions:", "\n\nInstructions:\n")
    text = text.replace(". ", ".\n")

    return text

sample = train_df["ingredients_text"].iloc[0]
print("INGREDIENTS:\n", sample)
print("\nGENERATED RECIPE:\n")
print(generate_recipe(sample))

INGREDIENTS:
 Olive oil (for pan), 5 1/2 pounds sweet potatoes, peeled, cut into 1 1/2" pieces, 2 teaspoons kosher salt, divided, plus more, 1 1/4 cups pure maple syrup, divided, 2 cups pecan halves (about 7 ounces), 1 1/2 teaspoons freshly ground black pepper, divided

GENERATED RECIPE:

Title: Sweet Potatoes with Maple Syrup and Pecans 

Instructions:
 Preheat oven to 350°F.
Heat oil in a large skillet over medium-high heat.
Add sweet potatoes and cook, stirring occasionally, until tender, about 5 minutes.
Stir in maple syrup, pecans, and pepper.
Cook, stirring frequently, until sweet potatoes are tender, 8 to 10 minutes.
Transfer sweet potatoes to a bowl and let cool, covered, about 1 hour.
Meanwhile, heat remaining 2 tablespoons oil in another skillet over high heat, stirring often, until golden brown, about 2 minutes.
Add pecan halves and pepper to pan and cook until caramelized, about 3 minutes.
Remove from heat and stir in remaining 1 tablespoon maple syrup and pecan.
Add pepper
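
The generated steps come back without numbers even though the prompt asks for them. If numbered steps are needed, they can be added in post-processing. The sketch below is one naive way to do it, assuming a step ends at a period followed by whitespace. `number_steps` is a hypothetical helper, and it will split incorrectly on abbreviations such as "approx. 5 minutes".

```python
import re


def number_steps(instructions: str) -> str:
    """Hypothetical post-processing: split on sentence ends and number each step."""
    parts = re.split(r"(?<=\.)\s+", instructions.strip())
    sentences = [p.strip() for p in parts if p.strip()]
    return "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, start=1))


print(number_steps("Preheat oven. Heat oil in a large skillet. Add sweet potatoes."))
# → 1. Preheat oven.
#   2. Heat oil in a large skillet.
#   3. Add sweet potatoes.
```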