# 🚀 FLAN-T5 Construction Rules Training on Google Colab

This notebook trains a FLAN-T5 model to convert natural language construction requirements into structured JSON.

**Hardware recommendation**: GPU (T4 or better)

**Estimated training time**:
- T4 GPU: ~45-60 minutes
- L4 GPU: ~25-35 minutes  
- A100 GPU: ~15-20 minutes

## Step 1: Check GPU Availability

In [2]:
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
else:
    print("⚠️ Warning: No GPU detected. Training will be very slow on CPU.")
    print("Go to Runtime > Change runtime type > Hardware accelerator > GPU")

PyTorch version: 2.9.0+cu126
CUDA available: True
GPU: Tesla T4
GPU Memory: 14.74 GB


## Step 2: Install Dependencies

In [3]:
# Quote each specifier: unquoted, the shell treats ">" as output redirection
!pip install -q "transformers>=4.30.0" "datasets>=2.0.0" "accelerate>=0.20.0" "sentencepiece>=0.1.99"
print("✅ Dependencies installed successfully!")

✅ Dependencies installed successfully!


## Step 3: Upload Dataset

**Option A**: Upload the dataset file directly (recommended for Colab)

In [4]:
from google.colab import files
import os

print("Please upload your dataset file: construction_ashrae_2013.jsonl")
uploaded = files.upload()

# Get the uploaded filename
dataset_path = list(uploaded.keys())[0]
print(f"\n✅ Dataset uploaded: {dataset_path}")
print(f"File size: {os.path.getsize(dataset_path) / 1024 / 1024:.2f} MB")

Please upload your dataset file: construction_ashrae_2013.jsonl


Saving construction_ashrae_2013.jsonl to construction_ashrae_2013 (1).jsonl

✅ Dataset uploaded: construction_ashrae_2013 (1).jsonl
File size: 2.52 MB


**Option B**: Clone from GitHub (if you pushed the dataset to your repo)

In [5]:
# Uncomment and modify if you want to clone from GitHub
# !git clone https://github.com/YOUR_USERNAME/DL-Construction-Recommendation.git
# dataset_path = "DL-Construction-Recommendation/dataset/construction_ashrae_2013.jsonl"

## Step 4: Define Training Functions

In [6]:
import json
from typing import Dict, List, Any
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
from datasets import Dataset


def load_jsonl_dataset(file_path: str) -> List[Dict[str, Any]]:
    """Load a JSONL dataset file, skipping blank lines."""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # json.loads raises on an empty string
                data.append(json.loads(line))
    return data


def prepare_dataset(jsonl_path: str, tokenizer, max_input_length: int = 512, max_target_length: int = 512):
    """Prepare the dataset for training with dynamic padding."""
    # Load data
    data = load_jsonl_dataset(jsonl_path)

    # Prepare inputs and targets
    inputs = [item["input_text"] for item in data]
    targets = [item["target_json"] for item in data]

    # Create dataset from raw data
    dataset_dict = {
        "input_text": inputs,
        "target_json": targets
    }

    dataset = Dataset.from_dict(dataset_dict)

    # Define tokenization function
    def tokenize_function(examples):
        # Tokenize inputs
        model_inputs = tokenizer(
            examples["input_text"],
            max_length=max_input_length,
            truncation=True,
            padding=False  # Dynamic padding by DataCollator
        )

        # Tokenize targets
        labels = tokenizer(
            text_target=examples["target_json"],
            max_length=max_target_length,
            truncation=True,
            padding=False
        )

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    # Apply tokenization
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=dataset.column_names
    )

    return tokenized_dataset


print("✅ Training functions defined!")

✅ Training functions defined!
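
`prepare_dataset` reads the `input_text` and `target_json` keys from every record and passes both to the tokenizer as text, so a record with a missing key or a non-string value fails only once tokenization starts. The following is a minimal pre-flight check, not part of the original notebook; `find_bad_records` and `REQUIRED_KEYS` are hypothetical names, and the requirement that both values are strings is an assumption inferred from the tokenization code above.

```python
import json
from typing import Iterable, List, Tuple

# Keys that prepare_dataset reads from each record.
REQUIRED_KEYS = ("input_text", "target_json")


def find_bad_records(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for records the loader could not use."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines carry no record
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            if not isinstance(record.get(key), str):
                problems.append((number, f"'{key}' is missing or not a string"))
    return problems


# Illustrative lines only; they are not taken from the real dataset.
sample_lines = [
    '{"input_text": "Walls must not exceed U-0.064.", "target_json": "{\\"max_u_value\\": 0.064}"}',
    '{"input_text": "Missing target"}',
    'not json',
]
for number, problem in find_bad_records(sample_lines):
    print(f"line {number}: {problem}")
```

In the notebook, the same function could be run on the uploaded file by passing an open file handle, for example `find_bad_records(open(dataset_path, encoding="utf-8"))`.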


## Step 5: Configure Training Parameters

In [7]:
# Training configuration
CONFIG = {
    "model_name": "google/flan-t5-base",  # Change to "google/flan-t5-small" for faster training
    "output_dir": "./flan_t5_construction",
    "learning_rate": 5e-5,
    "batch_size": 8,  # Reduce to 4 if OOM (Out of Memory)
    "num_epochs": 5,
    "weight_decay": 0.01,
    "max_input_length": 256,  # Reduced from 512 for efficiency
    "max_target_length": 512,
    "eval_split": 0.1
}

print("Training Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

Training Configuration:
  model_name: google/flan-t5-base
  output_dir: ./flan_t5_construction
  learning_rate: 5e-05
  batch_size: 8
  num_epochs: 5
  weight_decay: 0.01
  max_input_length: 256
  max_target_length: 512
  eval_split: 0.1


## Step 6: Load Model and Tokenizer

In [8]:
print(f"Loading model: {CONFIG['model_name']}...")

tokenizer = AutoTokenizer.from_pretrained(CONFIG["model_name"])
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])

print(f"✅ Model loaded successfully!")
print(f"Model parameters: {model.num_parameters() / 1e6:.1f}M")

Loading model: google/flan-t5-base...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


✅ Model loaded successfully!
Model parameters: 247.6M
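
The parameter count gives a rough lower bound on GPU memory, which helps explain the batch-size guidance in Step 5. The sketch below is a back-of-the-envelope estimate, assuming full fp32 training with the Trainer's default AdamW optimizer (4 bytes each for weights, gradients, and the two Adam moment buffers). Activation memory, which grows with batch size and sequence length, is not included.

```python
# Rough memory estimate for full fp32 fine-tuning with AdamW.
# Assumption: weights, gradients, and both Adam moment buffers are fp32 (4 bytes each).
params = 247.6e6  # "Model parameters: 247.6M" from the cell output

GIB = 1024 ** 3
weights_gib = params * 4 / GIB               # weights only
training_state_gib = params * 4 * 4 / GIB    # weights + gradients + 2 Adam buffers

print(f"Weights only:            {weights_gib:.2f} GiB")
print(f"Weights + grads + AdamW: {training_state_gib:.2f} GiB (activations excluded)")
```

On the 14.74 GB T4 reported in Step 1, the remainder is what is available for activations, which is why reducing `batch_size` is the first thing to try on an out-of-memory error.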


## Step 7: Prepare Dataset

In [9]:
print(f"Loading dataset from {dataset_path}...")
dataset = prepare_dataset(
    dataset_path,
    tokenizer,
    CONFIG["max_input_length"],
    CONFIG["max_target_length"]
)

# Split into train and eval
dataset = dataset.train_test_split(test_size=CONFIG["eval_split"], seed=42)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]

print(f"✅ Dataset prepared!")
print(f"  Train examples: {len(train_dataset)}")
print(f"  Eval examples: {len(eval_dataset)}")

Loading dataset from construction_ashrae_2013 (1).jsonl...


Map:   0%|          | 0/2751 [00:00<?, ? examples/s]

✅ Dataset prepared!
  Train examples: 2475
  Eval examples: 276
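
Because tokenization uses `truncation=True`, any input longer than `max_input_length` or target longer than `max_target_length` is silently cut, and a truncated target can never be complete JSON. The sketch below counts how many texts exceed a limit. `count_over_limit` is a hypothetical helper, and the whitespace tokenizer is only a stand-in so the example runs on its own.

```python
from typing import Callable, Iterable


def count_over_limit(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_length: int,
) -> int:
    """Count texts whose token count exceeds max_length (these would be truncated)."""
    return sum(1 for text in texts if count_tokens(text) > max_length)


def whitespace_count(text: str) -> int:
    """Stand-in token counter for illustration only: splits on whitespace."""
    return len(text.split())


# In the notebook, the counter would instead wrap the real tokenizer, e.g.
#   lambda s: len(tokenizer(s)["input_ids"])
sample = ["a b c", "a b c d e f", "a"]
print(count_over_limit(sample, whitespace_count, max_length=4))
```

Running this on the raw `target_json` strings with the real tokenizer and `CONFIG["max_target_length"]` shows whether the 512-token target limit is large enough for this dataset.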


## Step 8: Setup Training Arguments

In [10]:
# Calculate eval steps
steps_per_epoch = len(train_dataset) // CONFIG["batch_size"]
eval_save_steps = max(50, steps_per_epoch // 3)

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Evaluation/Save every {eval_save_steps} steps")

# Data collator for dynamic padding
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True
)

# Training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir=CONFIG["output_dir"],
    learning_rate=CONFIG["learning_rate"],
    per_device_train_batch_size=CONFIG["batch_size"],
    per_device_eval_batch_size=CONFIG["batch_size"],
    num_train_epochs=CONFIG["num_epochs"],
    weight_decay=CONFIG["weight_decay"],
    logging_dir=f"{CONFIG['output_dir']}/logs",
    logging_steps=50,
    save_steps=eval_save_steps,
    eval_steps=eval_save_steps,
    eval_strategy="steps",
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
    push_to_hub=False,
    report_to="none",
    fp16=False,  # Mixed precision (fp16) disabled; trains in full fp32
)

print("✅ Training arguments configured!")

Steps per epoch: 309
Evaluation/Save every 103 steps
✅ Training arguments configured!
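
The printed "Steps per epoch" comes from floor division and is only used to pick the evaluation interval. The number of optimizer steps the Trainer actually runs is slightly higher, because the last partial batch still counts. The arithmetic below is a sketch assuming a single GPU, no gradient accumulation, and the default behaviour of keeping the final partial batch.

```python
import math

train_examples = 2475  # from the Step 7 output
batch_size = 8
num_epochs = 5

floor_steps = train_examples // batch_size             # 309, the value the cell prints
actual_steps = math.ceil(train_examples / batch_size)  # 310, last partial batch included
eval_every = max(50, floor_steps // 3)                 # 103
total_steps = actual_steps * num_epochs                # 1550

# Steps at which evaluation and checkpointing happen.
eval_points = list(range(eval_every, total_steps + 1, eval_every))

print(f"Optimizer steps per epoch: {actual_steps}")
print(f"Total optimizer steps:     {total_steps}")
print(f"Evaluations during run:    {len(eval_points)} (last at step {eval_points[-1]})")
```

The last evaluation step, 1545, matches the `checkpoint-1545` directory that appears in the Step 14 zip listing.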


## Step 9: Initialize Trainer

In [11]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

print("✅ Trainer initialized!")

✅ Trainer initialized!


## Step 10: Start Training 🚀

This will take approximately 45-60 minutes on a T4 GPU.

In [12]:
import time

print("="*80)
print("STARTING TRAINING")
print("="*80)
print(f"Model: {CONFIG['model_name']}")
print(f"Epochs: {CONFIG['num_epochs']}")
print(f"Batch size: {CONFIG['batch_size']}")
print(f"Learning rate: {CONFIG['learning_rate']}")
print("="*80)

start_time = time.time()

# Train the model
trainer.train()

end_time = time.time()
training_time = (end_time - start_time) / 60

print("\n" + "="*80)
print("TRAINING COMPLETE!")
print("="*80)
print(f"Total training time: {training_time:.2f} minutes")
print("="*80)

STARTING TRAINING
Model: google/flan-t5-base
Epochs: 5
Batch size: 8
Learning rate: 5e-05


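| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 103 | 0.5449 | 0.331342 |
| 206 | 0.346 | 0.287222 |
| 309 | 0.312 | 0.271793 |
| 412 | 0.2939 | 0.262958 |
| 515 | 0.2802 | 0.255931 |
| 618 | 0.2764 | 0.250395 |
| 721 | 0.2684 | 0.247406 |
| 824 | 0.2656 | 0.245175 |
| 927 | 0.264 | 0.243042 |
| 1030 | 0.2595 | 0.241158 |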


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].



TRAINING COMPLETE!
Total training time: 44.29 minutes


## Step 11: Save Final Model

In [13]:
print(f"Saving model to {CONFIG['output_dir']}...")
trainer.save_model()
tokenizer.save_pretrained(CONFIG["output_dir"])

print("✅ Model saved successfully!")

Saving model to ./flan_t5_construction...
✅ Model saved successfully!


## Step 12: Evaluate Model

In [14]:
print("Running final evaluation...")
eval_results = trainer.evaluate()

print("\nEvaluation Results:")
for key, value in eval_results.items():
    print(f"  {key}: {value:.4f}")

Running final evaluation...



Evaluation Results:
  eval_loss: 0.2384
  eval_runtime: 13.7281
  eval_samples_per_second: 20.1050
  eval_steps_per_second: 2.5500
  epoch: 5.0000
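
`eval_loss` is a token-level cross-entropy, so a low value does not guarantee that a generated string parses as JSON or matches the reference. The sketch below shows a task-level check that is not part of the original notebook. `json_metrics` is a hypothetical helper that assumes `predictions` are decoded strings from `model.generate` and `references` are the original `target_json` strings.

```python
import json
from typing import Dict, Sequence


def json_metrics(predictions: Sequence[str], references: Sequence[str]) -> Dict[str, float]:
    """Fraction of predictions that parse as JSON, and that equal the parsed reference."""
    total = len(references)
    if total == 0:
        return {"valid_json_rate": 0.0, "exact_match_rate": 0.0}

    valid = 0
    exact = 0
    for prediction, reference in zip(predictions, references):
        try:
            parsed = json.loads(prediction)
        except json.JSONDecodeError:
            continue  # unparseable output counts against both metrics
        valid += 1
        # Compare parsed objects so key order and whitespace do not matter.
        if parsed == json.loads(reference):
            exact += 1

    return {"valid_json_rate": valid / total, "exact_match_rate": exact / total}


# Illustrative strings only.
predictions = ['{"a": 1}', '{"a": 2}', '"a": 1']
references = ['{"a":1}', '{"a":1}', '{"a":1}']
print(json_metrics(predictions, references))
```

Comparing parsed objects rather than raw strings is a deliberate choice here: two JSON documents that differ only in whitespace or key order are treated as equal.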


## Step 13: Test Inference

Let's test the trained model with a sample input!

In [19]:
# Test inference
test_input = "In climate zone 5, exterior walls of type SteelFramed must not exceed a U-factor of 0.064."

print(f"Input: {test_input}")
print("\nGenerating JSON...")

inputs = tokenizer(test_input, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_length=1024,
    num_beams=5,
    do_sample=False,
    early_stopping=True
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\nRaw Generated Output:")
print(generated_text)


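def fix_json(text):
    """Wrap the output in outer braces if the model omitted them.

    Braces missing around nested objects are not repaired here.
    """
    text = text.strip()

    if not text.startswith('{'):
        text = '{' + text
    if not text.endswith('}'):
        text = text + '}'

    return text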


fixed_json = fix_json(generated_text)

print("\n" + "="*80)
try:
    parsed_json = json.loads(fixed_json)
    print("✅ VALID JSON GENERATED!")
    print("="*80)
    print(json.dumps(parsed_json, indent=2))


    print("\n" + "="*80)
    print("KEY FIELDS VERIFICATION:")
    print("="*80)
    print(f"Rule ID: {parsed_json.get('rule_id', 'MISSING')}")
    print(f"Category: {parsed_json.get('rule_category', 'MISSING')}")
    print(f"Climate Zone: {parsed_json.get('inputs', {}).get('climate_zone', 'MISSING')}")
    print(f"Surface Type: {parsed_json.get('inputs', {}).get('surface_type', 'MISSING')}")
    print(f"Max U-value: {parsed_json.get('outputs', {}).get('max_u_value', 'MISSING')}")

except json.JSONDecodeError as e:
    print(f"❌ JSON PARSING ERROR: {e}")
    print("="*80)
    print("Fixed JSON attempt:")
    print(fixed_json)


Input: In climate zone 5, exterior walls of type SteelFramed must not exceed a U-factor of 0.064.

Generating JSON...

Raw Generated Output:
"rule_id":"edffffff-b0e4-4ed8-b0e4-b0e4ffffff","standard":"ASHRAE 90.1-2013","domain":"Construction","rule_category":"performance","inputs":"climate_zone":"5","building_type":null,"space_type":null,"surface_type":"SteelFramed","construction_type":"SteelFramed","building_category":"Nonresidential","minimum_percent_of_surface":null,"maximum_percent_of_surface":null,"outputs":"construction_name":"Typical Insulated Steel Framed Exterior Wall","assigned_construction_type":null,"max_u_value":0.064,"max_f_factor":null,"max_c_factor":null,"max_shgc":null,"min_vt":null,"min_vt_shgc":null,"units":"u_value":"Btu/h-ft2-F","f_factor":"Btu-in/h-ft2-F","c_factor":"Btu/h-ft2-F","shgc":null,"vt":null,"notes":"u_value_includes_interior_film":true,"u_value_includes_exterior_film":true

❌ JSON PARSING ERROR: Expecting ',' delimiter: line 1 column 156 (char 155)
Fix
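
The raw output contains no `{` or `}` at all, so adding outer braces cannot make it valid: nested objects such as `inputs` and `outputs` have lost their delimiters too. This is consistent with the tokenizer being unable to represent curly braces, although the notebook does not verify that. The sketch below is one possible post-processing repair, not the notebook's method. `restore_braces` is a hypothetical helper, and it rests on three assumptions: the text contains no braces, every nested object is a non-empty direct child of the root, and a nested object runs until the next nested object starts or the text ends.

```python
import json
import re

# A key immediately followed by another key ("inputs":"climate_zone":) marks
# the start of a nested object whose opening brace was dropped.
NESTED_KEY = re.compile(r'"(\w+)":(?="\w+":)')


def restore_braces(text: str) -> str:
    """Re-insert braces around the root object and its flat nested objects."""
    text = text.strip()
    parts = []
    position = 0
    inside_nested = False

    for match in NESTED_KEY.finditer(text):
        chunk = text[position:match.start()]
        if inside_nested:
            # Close the previous nested object before the comma that follows it.
            chunk = chunk.rstrip(',') + '},'
        parts.append(chunk)
        parts.append(f'"{match.group(1)}":{{')
        position = match.end()
        inside_nested = True

    parts.append(text[position:])
    if inside_nested:
        parts.append('}')
    return '{' + ''.join(parts) + '}'


# Shortened, illustrative version of the raw output above.
raw = (
    '"rule_id":"r1","standard":"ASHRAE 90.1-2013",'
    '"inputs":"climate_zone":"5","surface_type":"SteelFramed",'
    '"outputs":"max_u_value":0.064,"max_shgc":null,'
    '"notes":"u_value_includes_interior_film":true'
)
parsed = json.loads(restore_braces(raw))
print(json.dumps(parsed, indent=2))
```

A repair like this only hides the symptom. If the real schema nests objects more deeply, or if a string value happens to look like a key, the heuristic will produce the wrong structure, so addressing the missing braces at training time is likely the more robust fix.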

## Step 14: Download Trained Model

Download the trained model to your local machine.

In [20]:
# Create a zip file of the model
!zip -r flan_t5_construction.zip {CONFIG['output_dir']}

print("✅ Model zipped!")
print(f"Size: {os.path.getsize('flan_t5_construction.zip') / 1024 / 1024:.2f} MB")

# Download the zip file
from google.colab import files
files.download('flan_t5_construction.zip')

print("\n✅ Download started! Check your browser's download folder.")

  adding: flan_t5_construction/ (stored 0%)
  adding: flan_t5_construction/tokenizer_config.json (deflated 95%)
  adding: flan_t5_construction/training_args.bin (deflated 53%)
  adding: flan_t5_construction/checkpoint-1545/ (stored 0%)
  adding: flan_t5_construction/checkpoint-1545/tokenizer_config.json (deflated 95%)
  adding: flan_t5_construction/checkpoint-1545/training_args.bin (deflated 53%)
  adding: flan_t5_construction/checkpoint-1545/scheduler.pt (deflated 61%)
  adding: flan_t5_construction/checkpoint-1545/trainer_state.json (deflated 79%)
  adding: flan_t5_construction/checkpoint-1545/config.json (deflated 62%)
  adding: flan_t5_construction/checkpoint-1545/model.safetensors (deflated 7%)
  adding: flan_t5_construction/checkpoint-1545/tokenizer.json (deflated 74%)
  adding: flan_t5_construction/checkpoint-1545/spiece.model (deflated 48%)
  adding: flan_t5_construction/checkpoint-1545/rng_state.pth (deflated 26%)
  adding: flan_t5_construction/checkpoint-1545/special_tokens_m



✅ Download started! Check your browser's download folder.
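
The zip listing above shows that `zip -r` also packs the intermediate `checkpoint-*` directories, each with its own copy of the model weights. If only the final model is needed, a smaller archive can be built from the top-level files of the output directory. The sketch below uses the standard-library `zipfile` module; `zip_final_model` is a hypothetical helper, and it assumes the final model and tokenizer files sit directly in the output directory, as the listing suggests.

```python
import os
import zipfile
from typing import List


def zip_final_model(model_dir: str, zip_path: str) -> List[str]:
    """Zip only the top-level files of model_dir, skipping subdirectories
    such as checkpoint-* and logs. Returns the names of the files added."""
    root_name = os.path.basename(os.path.normpath(model_dir))
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(os.listdir(model_dir)):
            full_path = os.path.join(model_dir, name)
            if os.path.isfile(full_path):
                # Keep the directory name as a prefix inside the archive.
                archive.write(full_path, arcname=os.path.join(root_name, name))
                added.append(name)
    return added
```

In this notebook the call would be `zip_final_model(CONFIG["output_dir"], "flan_t5_construction_final.zip")`, followed by `files.download("flan_t5_construction_final.zip")`; the archive name here is only an example.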
