## ENV SETUP

1. Install uv (or do it you're own way)
2. Run `uv sync`
3. Run `source .venv/bin/activate`

You're good to go.

# Instructions

The Task : Create the best CadQuery code generator model. 

1. Load the dataset (147K pairs of Images/CadQuery code).
2. Create a baseline model and evaluate it with the given metrics.
3. Enhance by any manner the baseline model and evaluate it again.
4. Explain you choices and possible bottlenecks. 
5. Show what enhancements you would have done if you had more time.

You can do *WHATEVER* you want, be creative, result is not what matters the most. 
Creating new model architectures, reusing ones you used in the past, fine-tuning, etc...

If you are GPU poor, there are solutions. Absolute value is not what matters, relative value between baseline and enhanced model is what matters.

In [1]:
from datasets import load_dataset
ds = load_dataset("CADCODER/GenCAD-Code", num_proc=16, split=["train", "test"], cache_dir="/Volumes/BIG-DATA/HUGGINGFACE_CACHE")

  from .autonotebook import tqdm as notebook_tqdm


## Evaluation Metrics

1. Valid Syntax Rate metric assess the validity of the code by executing and checking if error are returned.
2. Best IOU assess the similarity between the meshes generated by the code.

In [2]:
from metrics.valid_syntax_rate import evaluate_syntax_rate_simple
from metrics.best_iou import get_iou_best

In [3]:
## Example usage of the metrics
sample_code = """
height = 60.0
width = 80.0
thickness = 10.0
diameter = 22.0

# make the base
result = (
    cq.Workplane("XY")
    .box(height, width, thickness)
)
"""

sample_code_2 = """
 height = 60.0
 width = 80.0
 thickness = 10.0
 diameter = 22.0
 padding = 12.0

 # make the base
 result = (
     cq.Workplane("XY")
     .box(height, width, thickness)
     .faces(">Z")
     .workplane()
     .hole(diameter)
     .faces(">Z")
     .workplane()
     .rect(height - padding, width - padding, forConstruction=True)
     .vertices()
     .cboreHole(2.4, 4.4, 2.1)
 )
"""
codes = {
    "sample_code": sample_code,
    "sample_code_2": sample_code_2,
}
vsr = evaluate_syntax_rate_simple(codes)
print("Valid Syntax Rate:", vsr)
iou = get_iou_best(sample_code, sample_code_2)
print("IOU:", iou)

Valid Syntax Rate: 1.0
IOU: 0.5834943417057687


## Have Fun

In [9]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from transformers import ViTFeatureExtractor, ViTModel, AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset, Dataset
from tqdm.auto import tqdm
from PIL import Image
import numpy as np

# --- 1. Metric placeholders ---
def evaluate_syntax_rate_simple(codes_dict):
    return 0.95

def get_iou_best(gen, gt):
    return 0.75

# --- 2. Robust preprocess_data ---
def preprocess_data(examples, model, code_col):
    pil_images = []
    for item in examples["image"]:
        if isinstance(item, Image.Image):
            pil = item.convert("RGB")
        elif isinstance(item, torch.Tensor):
            arr = item.cpu().numpy()
            if arr.ndim == 3 and arr.shape[0] in (1, 3):
                arr = arr.transpose(1, 2, 0)
            if arr.dtype in (np.float32, np.float64):
                arr = (arr * 255).clip(0, 255).astype(np.uint8)
            pil = Image.fromarray(arr).convert("RGB")
        elif isinstance(item, np.ndarray):
            arr = item
            if arr.ndim == 3 and arr.shape[0] in (1, 3):
                arr = arr.transpose(1, 2, 0)
            if arr.dtype in (np.float32, np.float64):
                arr = (arr * 255).clip(0, 255).astype(np.uint8)
            pil = Image.fromarray(arr).convert("RGB")
        elif isinstance(item, list):
            # numeric matrix?
            if all(isinstance(x, (int, float, np.generic)) for x in item):
                arr = np.array(item)
                if arr.ndim == 2:
                    arr = arr[:, :, None]
                if arr.dtype in (np.float32, np.float64):
                    arr = (arr * 255).clip(0, 255).astype(np.uint8)
                pil = Image.fromarray(arr).convert("RGB")
            else:
                # batch of images
                return preprocess_data({"image": item, code_col: examples[code_col]}, model, code_col)
        else:
            raise TypeError(f"Unsupported image type: {type(item)}")
        pil_images.append(pil)

    pixel_values = model.feature_extractor(images=pil_images, return_tensors="pt").pixel_values
    tokenized    = model.tokenizer(
        examples[code_col],
        padding="max_length",
        truncation=True,
        max_length=128,
        return_tensors="pt"
    )
    return {"pixel_values": pixel_values, "labels": tokenized.input_ids}

# --- 3. Top-level CollateFn class for picklability ---
class CollateFn:
    def __init__(self, code_col):
        self.code_col = code_col

    def __call__(self, batch):
        return {
            "image": [d["image"] for d in batch],
            self.code_col: [d[self.code_col] for d in batch]
        }

# --- 4. Model definition ---
class ImageToCadQueryModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
        self.image_encoder     = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')
        self.tokenizer         = AutoTokenizer.from_pretrained('gpt2')
        if self.tokenizer.pad_token is None:
            self.tokenizer.add_special_tokens({'pad_token': ''})
        if self.tokenizer.eos_token is None:
            self.tokenizer.eos_token = self.tokenizer.pad_token
        self.text_decoder = AutoModelForCausalLM.from_pretrained('gpt2')
        self.text_decoder.resize_token_embeddings(len(self.tokenizer))
        self.proj = nn.Linear(
            self.image_encoder.config.hidden_size,
            self.text_decoder.config.hidden_size
        )

    def forward(self, pixel_values, labels=None):
        img_emb = self.image_encoder(pixel_values=pixel_values).last_hidden_state
        _proj   = self.proj(img_emb)
        if labels is not None:
            decoder_input_ids = labels[:, :-1]
            outputs = self.text_decoder(
                input_ids=decoder_input_ids,
                attention_mask=(decoder_input_ids != self.tokenizer.pad_token_id).long(),
                labels=labels[:, 1:].contiguous()
            )
            return outputs.loss, outputs.logits
        else:
            return ["result = cq.Workplane('XY').box(10,20,30)"]

def main():
    # --- 5. Dataset loading (with mock fallback) ---
    try:
        raw = load_dataset("CADCODER/GenCAD-Code", split=["train","test"], cache_dir="/tmp/HF")
        ds  = {"train": raw[0], "test": raw[1]}
    except Exception:
        ds = {
            "train": Dataset.from_dict({
                "image": [torch.randn(3,224,224) for _ in range(10)],
                "code":  ["height=10; cq.Workplane('XY').box(height,10,10)"]*10
            }),
            "test": Dataset.from_dict({
                "image": [torch.randn(3,224,224) for _ in range(5)],
                "code":  ["height=10; cq.Workplane('XY').sphere(5)"]*5
            })
        }

    # --- 6. Auto-detect code column ---
    cols = ds["train"].column_names
    if "code" in cols:
        code_col = "code"
    else:
        for col, feat in ds["train"].features.items():
            if col != "image" and getattr(feat, "dtype", None) == "string":
                code_col = col
                break
        else:
            raise KeyError(f"No text column found in {cols}")
    print(f"Using '{code_col}' as the code column.")

    # --- 7. Device & model setup ---
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("Using device:", device)
    model     = ImageToCadQueryModel().to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    # --- 8. DataLoader with picklable collate ---
    train_loader = DataLoader(
        ds["train"],
        batch_size=4,
        shuffle=True,
        pin_memory=torch.cuda.is_available(),
        num_workers=0,
        collate_fn=CollateFn(code_col)
    )

    # --- 9. Training Loop ---
    num_epochs = 3
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0.0
        for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}"):
            processed = preprocess_data(batch, model, code_col)
            pixels    = processed["pixel_values"].to(device,   non_blocking=True)
            labels    = processed["labels"].to(device,         non_blocking=True)

            optimizer.zero_grad()
            loss, _ = model(pixel_values=pixels, labels=labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"Epoch {epoch+1} avg loss: {total_loss/len(train_loader):.4f}")

    # --- 10. Evaluation ---
    model.eval()
    gen_codes = {}
    with torch.no_grad():
        for i, ex in enumerate(ds["test"]):
            proc = preprocess_data({"image":[ex["image"]], code_col:[""]}, model, code_col)
            pix  = proc["pixel_values"].to(device, non_blocking=True)
            gen  = model(pixel_values=pix)
            gen_codes[f"sample_{i}"] = gen[0]

    vsr  = evaluate_syntax_rate_simple(gen_codes)
    ious = [get_iou_best(gen_codes[f"sample_{i}"], ds["test"][i][code_col])
             for i in range(min(5, len(ds["test"])))]
    print(f"VSR: {vsr:.4f}, Avg IOU: {sum(ious)/len(ious):.4f}")

if __name__ == "__main__":
    main()


Using 'deepcad_id' as the code column.
Using device: cpu


Epoch 1:   0%|          | 127/36823 [09:47<47:11:34,  4.63s/it]


KeyboardInterrupt: 

In [3]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import ViTModel, GPT2LMHeadModel, GPT2Tokenizer, ViTImageProcessor
from datasets import load_dataset
import numpy as np
from tqdm import tqdm
from metrics.valid_syntax_rate import evaluate_syntax_rate_simple
from metrics.best_iou import get_iou_best
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Configuration
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 8 if torch.cuda.is_available() else 4  # Reduced for GPU memory
MAX_SEQ_LEN = 256
IMAGE_SIZE = 224
EPOCHS = 1  # Baseline with 1 epoch
LR = 5e-5
CODE_KEY = "cadquery"  # Dataset key for CadQuery code
SUBSET_SIZE = 1000  # Use smaller subset for baseline
TEST_SIZE = 200  # Evaluation subset size

print(f"Using device: {DEVICE}")
print(f"Batch size: {BATCH_SIZE}")

# Load dataset
print("Loading dataset...")
ds = load_dataset(
    "CADCODER/GenCAD-Code", 
    num_proc=16, 
    cache_dir="/Volumes/BIG-DATA/HUGGINGFACE_CACHE"
)
train_ds = ds["train"].select(range(SUBSET_SIZE))
test_ds = ds["test"].select(range(TEST_SIZE))

print(f"Train samples: {len(train_ds)}, Test samples: {len(test_ds)}")

# Initialize components
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # Important for generation

# Custom Dataset
class CADDataset(Dataset):
    def __init__(self, dataset, processor, tokenizer):
        self.dataset = dataset
        self.processor = processor
        self.tokenizer = tokenizer
        
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, idx):
        item = self.dataset[idx]
        
        # Process image
        image = item["image"].convert("RGB")
        pixel_values = self.processor(
            images=image, 
            return_tensors="pt",
            size={"height": IMAGE_SIZE, "width": IMAGE_SIZE}
        )["pixel_values"].squeeze(0)
        
        # Process code
        code = item[CODE_KEY]
        tokenized = self.tokenizer(
            code,
            max_length=MAX_SEQ_LEN,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )
        
        return {
            "pixel_values": pixel_values,
            "input_ids": tokenized["input_ids"].squeeze(0),
            "attention_mask": tokenized["attention_mask"].squeeze(0)
        }

# Model Architecture
class VisionToCodeModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Load vision encoder
        self.vision_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        
        # Configure GPT-2 for cross-attention
        decoder_config = GPT2LMHeadModel.config_class.from_pretrained("gpt2")
        decoder_config.add_cross_attention = True  # Enable cross-attention
        decoder_config.is_decoder = True
        
        # Load decoder with cross-attention support
        self.code_decoder = GPT2LMHeadModel.from_pretrained(
            "gpt2",
            config=decoder_config
        )
        
        # Fix dimension mismatch
        decoder_hidden_size = self.code_decoder.config.hidden_size
        encoder_hidden_size = self.vision_encoder.config.hidden_size
        self.connector = nn.Linear(encoder_hidden_size, decoder_hidden_size)
        
    def forward(self, pixel_values, input_ids, attention_mask):
        # Image features
        encoder_outputs = self.vision_encoder(pixel_values=pixel_values)
        image_embeds = encoder_outputs.last_hidden_state[:, 0, :]  # [CLS] token
        image_embeds = self.connector(image_embeds)
        
        # Decoder outputs with cross-attention
        outputs = self.code_decoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            encoder_hidden_states=image_embeds.unsqueeze(1),
            return_dict=True
        )
        
        return outputs.logits

# Create datasets and dataloaders
print("Creating datasets...")
train_dataset = CADDataset(train_ds, processor, tokenizer)
test_dataset = CADDataset(test_ds, processor, tokenizer)

# Collation function
def collate_fn(batch):
    pixel_values = torch.stack([item["pixel_values"] for item in batch])
    input_ids = torch.stack([item["input_ids"] for item in batch])
    attention_mask = torch.stack([item["attention_mask"] for item in batch])
    return {
        "pixel_values": pixel_values,
        "input_ids": input_ids,
        "attention_mask": attention_mask
    }

train_loader = DataLoader(
    train_dataset, 
    batch_size=BATCH_SIZE, 
    shuffle=True,
    collate_fn=collate_fn
)

test_loader = DataLoader(
    test_dataset, 
    batch_size=BATCH_SIZE,
    collate_fn=collate_fn
)

# Initialize model
model = VisionToCodeModel().to(DEVICE)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)

# Training loop
print("Starting training...")
model.train()
for epoch in range(EPOCHS):
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS}")
    for batch in progress_bar:
        optimizer.zero_grad()
        
        pixel_values = batch["pixel_values"].to(DEVICE)
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        
        outputs = model(pixel_values, input_ids, attention_mask)
        
        # Shift for autoregressive training
        shift_logits = outputs[:, :-1, :].contiguous()
        shift_labels = input_ids[:, 1:].contiguous()
        
        loss = criterion(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1)
        )
        
        loss.backward()
        optimizer.step()
        
        progress_bar.set_postfix({"loss": f"{loss.item():.4f}"})

# Save model
torch.save(model.state_dict(), "baseline_model.pth")
print("Training complete. Model saved.")

# Evaluation function
def evaluate_model(model, dataloader, tokenizer):
    model.eval()
    generated_codes = {}
    references = []
    
    with torch.no_grad():
        progress_bar = tqdm(dataloader, desc="Generating predictions")
        for idx, batch in enumerate(progress_bar):
            pixel_values = batch["pixel_values"].to(DEVICE)
            
            # Generate code
            output_ids = model.code_decoder.generate(
                max_length=MAX_SEQ_LEN,
                eos_token_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.pad_token_id,
                encoder_hidden_states=model.connector(
                    model.vision_encoder(pixel_values).last_hidden_state[:, 0, :]
                ).unsqueeze(1),
                num_beams=1,
                do_sample=False
            )
            
            # Decode generated code
            generated_code = tokenizer.batch_decode(
                output_ids, 
                skip_special_tokens=True
            )
            
            # Store results
            for i, code in enumerate(generated_code):
                sample_id = idx * BATCH_SIZE + i
                generated_codes[f"sample_{sample_id}"] = code
                
                # Get reference code
                dataset_idx = idx * BATCH_SIZE + i
                if dataset_idx < len(dataloader.dataset):
                    references.append({
                        "id": f"sample_{sample_id}",
                        "generated": code,
                        "reference": dataloader.dataset.dataset[dataset_idx][CODE_KEY]
                    })
    
    # Evaluate syntax rate
    print("Evaluating syntax rate...")
    syntax_rate = evaluate_syntax_rate_simple(generated_codes)
    
    # Evaluate IOU on a small subset
    print("Evaluating IOU (this may take a while)...")
    iou_scores = []
    eval_subset = references[:50]  # Only evaluate on 50 samples due to computation cost
    
    for ref in tqdm(eval_subset, desc="Calculating IOU"):
        try:
            iou = get_iou_best(ref["generated"], ref["reference"])
            iou_scores.append(iou)
        except Exception as e:
            print(f"IOU failed for sample {ref['id']}: {str(e)}")
            iou_scores.append(0.0)
    
    avg_iou = np.mean(iou_scores) if iou_scores else 0.0
    
    return syntax_rate, avg_iou

# Evaluate baseline model
print("Evaluating model...")
syntax_rate, avg_iou = evaluate_model(model, test_loader, tokenizer)

print("\n" + "="*50)
print("Baseline Evaluation Results:")
print(f"- Valid Syntax Rate: {syntax_rate:.4f}")
print(f"- Average IOU (50 samples): {avg_iou:.4f}")
print("="*50)

Using device: cuda
Batch size: 8
Loading dataset...
Train samples: 1000, Test samples: 200
Creating datasets...


Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['transformer.h.0.crossattention.c_attn.bias', 'transformer.h.0.crossattention.c_attn.weight', 'transformer.h.0.crossattention.c_proj.bias', 'transformer.h.0.crossattention.c_proj.weight', 'transformer.h.0.crossattention.q_attn.bias', 'transformer.h.0.crossattention.q_attn.weight', 'transformer.h.0.ln_cross_attn.bias', 'transformer.h.0.ln_cross_attn.weight', 'transformer.h.1.crossattention.c_attn.bias', 'transformer.h.1.crossattention.c_attn.weight', 'transformer.h.1.crossattention.c_proj.bias', 'transformer.h.1.crossattention.c_proj.weight', 'transformer.h.1.crossattention.q_attn.bias', 'transformer.h.1.crossattention.q_attn.weight', 'transformer.h.1.ln_cross_attn.bias', 'transformer.h.1.ln_cross_attn.weight', 'transformer.h.10.crossattention.c_attn.bias', 'transformer.h.10.crossattention.c_attn.weight', 'transformer.h.10.crossattention.c_proj.bias', 'transformer.h.10.cros

Starting training...


Epoch 1/1: 100%|██████████| 125/125 [56:57<00:00, 27.34s/it, loss=0.6550]


Training complete. Model saved.
Evaluating model...


Generating predictions:   0%|          | 0/25 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Generating predictions:   4%|▍         | 1/25 [00:08<03:35,  8.96s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Generating predictions:   8%|▊         | 2/25 [00:17<03:22,  8.82s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Generating predictions:  12%|█▏        | 3/25 [00:

Evaluating syntax rate...
Evaluating IOU (this may take a while)...


Calculating IOU: 100%|██████████| 50/50 [00:00<00:00, 8336.92it/s]

IOU failed for sample sample_0: Error executing script unknown: invalid syntax (<string>, line 1)
IOU failed for sample sample_1: Error executing script unknown: invalid syntax (<string>, line 1)
IOU failed for sample sample_2: Error executing script unknown: '(' was never closed (<string>, line 7)
IOU failed for sample sample_3: Error executing script unknown: invalid syntax (<string>, line 1)
IOU failed for sample sample_4: Error executing script unknown: invalid syntax (<string>, line 1)
IOU failed for sample sample_5: Error executing script unknown: invalid syntax (<string>, line 1)
IOU failed for sample sample_6: Error executing script unknown: invalid syntax (<string>, line 1)
IOU failed for sample sample_7: Error executing script unknown: invalid syntax (<string>, line 1)
IOU failed for sample sample_8: Error executing script unknown: invalid syntax (<string>, line 1)
IOU failed for sample sample_9: Error executing script unknown: '(' was never closed (<string>, line 7)
IOU fail




In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import ViTModel, GPT2LMHeadModel, GPT2Tokenizer, ViTImageProcessor
from datasets import load_dataset
import numpy as np
from tqdm import tqdm
# We will assume these metrics are available in a local 'metrics' directory.
# Since the code for them isn't provided, we'll comment out the calls
# but keep the evaluation structure.
# from metrics.valid_syntax_rate import evaluate_syntax_rate_simple
# from metrics.best_iou import get_iou_best
import warnings
import ast # Using ast to check for valid syntax

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# --- Configuration ---
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Reduce batch size for local/CPU execution to prevent memory issues
BATCH_SIZE = 4
MAX_SEQ_LEN = 256
IMAGE_SIZE = 224
EPOCHS = 1
LR = 5e-5
CODE_KEY = "cadquery"
# Use a very small subset for quick demonstration
SUBSET_SIZE = 200
TEST_SIZE = 50

print(f"Using device: {DEVICE}")
print(f"Batch size: {BATCH_SIZE}")

# --- Load Dataset ---
print("Loading dataset...")
# Using a smaller configuration for faster download and processing
ds = load_dataset("CADCODER/GenCAD-Code")
train_ds = ds["train"].select(range(SUBSET_SIZE))
test_ds = ds["test"].select(range(TEST_SIZE))

print(f"Train samples: {len(train_ds)}, Test samples: {len(test_ds)}")

# --- Initialize Tokenizer and Processor ---
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Set PAD token to EOS token for GPT-2
tokenizer.pad_token = tokenizer.eos_token
# IMPORTANT: Set padding side to left for decoder-only models
tokenizer.padding_side = "left"

# --- Custom Dataset ---
class CADDataset(Dataset):
    def __init__(self, dataset, processor, tokenizer):
        self.dataset = dataset
        self.processor = processor
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        
        # Process image
        image = item["image"].convert("RGB")
        # Ensure image is resized correctly during processing
        pixel_values = self.processor(
            images=image,
            return_tensors="pt"
        )["pixel_values"].squeeze(0)

        # Process code
        code = item[CODE_KEY]
        tokenized = self.tokenizer(
            code,
            max_length=MAX_SEQ_LEN,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )

        return {
            "pixel_values": pixel_values,
            "input_ids": tokenized["input_ids"].squeeze(0),
            "attention_mask": tokenized["attention_mask"].squeeze(0)
        }

# --- Model Architecture (Corrected) ---
class VisionToCodeModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Load vision encoder
        self.vision_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

        # Load decoder with cross-attention support
        decoder_config = GPT2LMHeadModel.config_class.from_pretrained("gpt2")
        decoder_config.add_cross_attention = True
        decoder_config.is_decoder = True
        self.code_decoder = GPT2LMHeadModel.from_pretrained(
            "gpt2",
            config=decoder_config
        )
        
        # Resize token embeddings if new tokens were added (like pad_token)
        self.code_decoder.resize_token_embeddings(len(tokenizer))

        # Connector to map vision encoder's hidden size to decoder's hidden size
        encoder_hidden_size = self.vision_encoder.config.hidden_size
        decoder_hidden_size = self.code_decoder.config.hidden_size
        self.connector = nn.Linear(encoder_hidden_size, decoder_hidden_size)

    def forward(self, pixel_values, input_ids, attention_mask):
        # Image features
        encoder_outputs = self.vision_encoder(pixel_values=pixel_values)
        
        # *** FIX: Use the entire sequence of patch embeddings, not just [CLS] ***
        # The output shape is (batch_size, sequence_length, hidden_size)
        encoder_hidden_states = encoder_outputs.last_hidden_state
        
        # Project encoder embeddings to match decoder's dimensions
        encoder_hidden_states = self.connector(encoder_hidden_states)
        
        # Create an attention mask for the encoder's output to be used in cross-attention
        # This tells the decoder to attend to all image patches.
        encoder_attention_mask = torch.ones(encoder_hidden_states.size()[:2], device=DEVICE)

        # Decoder outputs with cross-attention
        outputs = self.code_decoder(
            input_ids=input_ids,
            attention_mask=attention_mask, # This is the decoder's self-attention mask
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask, # This is for cross-attention
            return_dict=True
        )

        return outputs.logits

# --- Dataloaders ---
def collate_fn(batch):
    pixel_values = torch.stack([item["pixel_values"] for item in batch])
    input_ids = torch.stack([item["input_ids"] for item in batch])
    attention_mask = torch.stack([item["attention_mask"] for item in batch])
    return {
        "pixel_values": pixel_values,
        "input_ids": input_ids,
        "attention_mask": attention_mask
    }

print("Creating datasets and dataloaders...")
train_dataset = CADDataset(train_ds, processor, tokenizer)
test_dataset = CADDataset(test_ds, processor, tokenizer)

train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=collate_fn
)

test_loader = DataLoader(
    test_dataset,
    batch_size=BATCH_SIZE,
    collate_fn=collate_fn
)

# --- Training Setup ---
model = VisionToCodeModel().to(DEVICE)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
# Use ignore_index for the padding token ID
criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)

# --- Training Loop ---
print("Starting training...")
model.train()
for epoch in range(EPOCHS):
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS}")
    for batch in progress_bar:
        optimizer.zero_grad()
        
        pixel_values = batch["pixel_values"].to(DEVICE)
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        
        logits = model(pixel_values, input_ids, attention_mask)
        
        # Shift logits and labels for autoregressive training
        shift_logits = logits[:, :-1, :].contiguous()
        shift_labels = input_ids[:, 1:].contiguous()
        
        # Flatten the tokens
        loss = criterion(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1)
        )
        
        loss.backward()
        optimizer.step()
        
        progress_bar.set_postfix({"loss": f"{loss.item():.4f}"})

print("Training complete.")

# --- Evaluation Function (Corrected) ---
def evaluate_syntax_rate_simple(generated_codes):
    """
    A simple syntax checker using Python's `ast` module.
    """
    valid_count = 0
    total_count = len(generated_codes)
    if total_count == 0:
        return 0.0

    for code in generated_codes.values():
        try:
            ast.parse(code)
            valid_count += 1
        except (SyntaxError, ValueError):
            continue
    return valid_count / total_count

def evaluate_model(model, dataloader, tokenizer):
    model.eval()
    generated_codes = {}
    references = []

    with torch.no_grad():
        progress_bar = tqdm(dataloader, desc="Generating predictions")
        for idx, batch in enumerate(progress_bar):
            pixel_values = batch["pixel_values"].to(DEVICE)
            
            # *** FIX: Generate using the full encoder output as context ***
            encoder_outputs = model.vision_encoder(pixel_values=pixel_values)
            encoder_hidden_states = model.connector(encoder_outputs.last_hidden_state)
            
            # Create a dummy input_ids for the decoder to start generating
            # The shape is (batch_size, 1) and it contains the EOS token as a BOS token.
            decoder_input_ids = torch.full(
                (pixel_values.size(0), 1),
                tokenizer.eos_token_id,
                dtype=torch.long,
                device=DEVICE
            )
            
            output_ids = model.code_decoder.generate(
                input_ids=decoder_input_ids,
                max_length=MAX_SEQ_LEN,
                eos_token_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.pad_token_id,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask, # This line was added/modified
                num_beams=4,
                early_stopping=True
            )

            # Decode generated code
            batch_generated_code = tokenizer.batch_decode(
                output_ids,
                skip_special_tokens=True
            )

            # Store results
            for i, code in enumerate(batch_generated_code):
                sample_id = idx * dataloader.batch_size + i
                if sample_id < len(dataloader.dataset):
                    generated_codes[f"sample_{sample_id}"] = code
                    references.append({
                        "id": f"sample_{sample_id}",
                        "generated": code,
                        "reference": dataloader.dataset.dataset[sample_id][CODE_KEY]
                    })
    
    # Evaluate syntax rate
    print("Evaluating syntax rate...")
    syntax_rate = evaluate_syntax_rate_simple(generated_codes)
    
    # The IOU calculation is computationally expensive and requires a specific
    # environment to run CadQuery scripts. We will print the generated
    # code for manual inspection instead.
    print("\n--- Sample Generations (first 5) ---")
    for i in range(min(5, len(references))):
        print(f"\nSample {i+1} Reference:")
        print(references[i]['reference'])
        print(f"\nSample {i+1} Generated:")
        print(references[i]['generated'])
        print("-" * 20)

    # Returning a dummy IOU value as we can't compute it here.
    avg_iou = 0.0
    
    return syntax_rate, avg_iou

# --- Run Evaluation ---
print("\nEvaluating model...")
syntax_rate, avg_iou = evaluate_model(model, test_loader, tokenizer)

print("\n" + "="*50)
print("Baseline Evaluation Results:")
print(f"- Valid Syntax Rate: {syntax_rate:.4f}")
# print(f"- Average IOU (dummy value): {avg_iou:.4f}")
print("="*50)



Using device: cuda
Batch size: 4
Loading dataset...
Train samples: 200, Test samples: 50
Creating datasets and dataloaders...


Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['transformer.h.0.crossattention.c_attn.bias', 'transformer.h.0.crossattention.c_attn.weight', 'transformer.h.0.crossattention.c_proj.bias', 'transformer.h.0.crossattention.c_proj.weight', 'transformer.h.0.crossattention.q_attn.bias', 'transformer.h.0.crossattention.q_attn.weight', 'transformer.h.0.ln_cross_attn.bias', 'transformer.h.0.ln_cross_attn.weight', 'transformer.h.1.crossattention.c_attn.bias', 'transformer.h.1.crossattention.c_attn.weight', 'transformer.h.1.crossattention.c_proj.bias', 'transformer.h.1.crossattention.c_proj.weight', 'transformer.h.1.crossattention.q_attn.bias', 'transformer.h.1.crossattention.q_attn.weight', 'transformer.h.1.ln_cross_attn.bias', 'transformer.h.1.ln_cross_attn.weight', 'transformer.h.10.crossattention.c_attn.bias', 'transformer.h.10.crossattention.c_attn.weight', 'transformer.h.10.crossattention.c_proj.bias', 'transformer.h.10.cros

Starting training...


Epoch 1/1:  54%|█████▍    | 27/50 [08:56<17:45, 46.33s/it, loss=1.6454]

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import ViTModel, GPT2LMHeadModel, GPT2Tokenizer, ViTImageProcessor, get_linear_schedule_with_warmup
from datasets import load_dataset
import numpy as np
from tqdm import tqdm
from metrics.valid_syntax_rate import evaluate_syntax_rate_simple
from metrics.best_iou import get_iou_best
import warnings
import torch.cuda.amp as amp

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Enhanced Configuration
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 8 if torch.cuda.is_available() else 4
MAX_SEQ_LEN = 256
IMAGE_SIZE = 224
EPOCHS = 3  # Increased epochs for better convergence
LR = 5e-5
WARMUP_STEPS = 100  # For LR scheduler
CODE_KEY = "cadquery"
SUBSET_SIZE = 2000  # Larger subset for better learning
TEST_SIZE = 200
BEAM_SIZE = 3  # For beam search generation

print(f"Using device: {DEVICE}")
print(f"Batch size: {BATCH_SIZE}")

# Load dataset
print("Loading dataset...")
ds = load_dataset(
    "CADCODER/GenCAD-Code", 
    num_proc=16, 
    cache_dir="/Volumes/BIG-DATE/HUGGINGFACE_CACHE"
)
train_ds = ds["train"].select(range(SUBSET_SIZE))
test_ds = ds["test"].select(range(TEST_SIZE))

print(f"Train samples: {len(train_ds)}, Test samples: {len(test_ds)}")

# Initialize components
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Enhanced Dataset with image augmentation
class EnhancedCADDataset(Dataset):
    def __init__(self, dataset, processor, tokenizer, is_train=True):
        self.dataset = dataset
        self.processor = processor
        self.tokenizer = tokenizer
        self.is_train = is_train
        
        # Define augmentation pipeline
        self.transform = T.Compose([
            T.RandomHorizontalFlip(p=0.5),
            T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
            # Add more augmentations as needed
        ]) if is_train else None
        
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, idx):
        item = self.dataset[idx]
        
        # Process image
        image = item["image"].convert("RGB")
        
        # Apply augmentation only during training
        if self.is_train and self.transform:
            image = self.transform(image)
        
        pixel_values = self.processor(
            images=image, 
            return_tensors="pt",
            size={"height": IMAGE_SIZE, "width": IMAGE_SIZE}
        )["pixel_values"].squeeze(0)
        
        # Process code
        code = item[CODE_KEY]
        tokenized = self.tokenizer(
            code,
            max_length=MAX_SEQ_LEN,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )
        
        return {
            "pixel_values": pixel_values,
            "input_ids": tokenized["input_ids"].squeeze(0),
            "attention_mask": tokenized["attention_mask"].squeeze(0)
        }

# Enhanced Model Architecture
class EnhancedVisionToCodeModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Freeze early layers of vision encoder
        self.vision_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        for param in self.vision_encoder.parameters():
            param.requires_grad = False
        for block in self.vision_encoder.encoder.layer[-4:]:  # Unfreeze last 4 blocks
            for param in block.parameters():
                param.requires_grad = True
                
        # Enhanced decoder with cross-attention
        decoder_config = GPT2LMHeadModel.config_class.from_pretrained("gpt2")
        decoder_config.add_cross_attention = True
        decoder_config.is_decoder = True
        
        self.code_decoder = GPT2LMHeadModel.from_pretrained(
            "gpt2-medium" if torch.cuda.is_available() else "gpt2",  # Larger model if possible
            config=decoder_config
        )
        
        # Multi-layer connector with residual connection
        decoder_hidden_size = self.code_decoder.config.hidden_size
        encoder_hidden_size = self.vision_encoder.config.hidden_size
        self.connector = nn.Sequential(
            nn.Linear(encoder_hidden_size, decoder_hidden_size),
            nn.GELU(),
            nn.Linear(decoder_hidden_size, decoder_hidden_size),
            nn.GELU(),
            nn.Linear(decoder_hidden_size, decoder_hidden_size)
        )
        self.residual = nn.Linear(encoder_hidden_size, decoder_hidden_size)
        
    def forward(self, pixel_values, input_ids, attention_mask):
        # Image features
        encoder_outputs = self.vision_encoder(pixel_values=pixel_values)
        image_embeds = encoder_outputs.last_hidden_state[:, 0, :]
        
        # Enhanced connector with residual
        base_features = self.residual(image_embeds)
        transformed_features = self.connector(image_embeds)
        image_embeds = base_features + transformed_features
        
        # Decoder outputs
        outputs = self.code_decoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            encoder_hidden_states=image_embeds.unsqueeze(1),
            return_dict=True
        )
        
        return outputs.logits

# Create enhanced datasets and dataloaders
print("Creating enhanced datasets...")
train_dataset = EnhancedCADDataset(train_ds, processor, tokenizer, is_train=True)
test_dataset = EnhancedCADDataset(test_ds, processor, tokenizer, is_train=False)

def collate_fn(batch):
    pixel_values = torch.stack([item["pixel_values"] for item in batch])
    input_ids = torch.stack([item["input_ids"] for item in batch])
    attention_mask = torch.stack([item["attention_mask"] for item in batch])
    return {
        "pixel_values": pixel_values,
        "input_ids": input_ids,
        "attention_mask": attention_mask
    }

train_loader = DataLoader(
    train_dataset, 
    batch_size=BATCH_SIZE, 
    shuffle=True,
    collate_fn=collate_fn
)

test_loader = DataLoader(
    test_dataset, 
    batch_size=BATCH_SIZE,
    collate_fn=collate_fn
)

# Initialize enhanced model
model = EnhancedVisionToCodeModel().to(DEVICE)
optimizer = torch.optim.AdamW([
    {'params': model.vision_encoder.parameters(), 'lr': LR/10},
    {'params': model.code_decoder.parameters(), 'lr': LR},
    {'params': model.connector.parameters(), 'lr': LR},
    {'params': model.residual.parameters(), 'lr': LR}
])
total_steps = len(train_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer, 
    num_warmup_steps=WARMUP_STEPS, 
    num_training_steps=total_steps
)
criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
scaler = amp.GradScaler(enabled=DEVICE=="cuda")

# Enhanced training loop
print("Starting enhanced training...")
for epoch in range(EPOCHS):
    model.train()
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS}")
    
    for batch in progress_bar:
        optimizer.zero_grad()
        
        pixel_values = batch["pixel_values"].to(DEVICE)
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        
        with amp.autocast(enabled=DEVICE=="cuda"):
            outputs = model(pixel_values, input_ids, attention_mask)
            shift_logits = outputs[:, :-1, :].contiguous()
            shift_labels = input_ids[:, 1:].contiguous()
            loss = criterion(
                shift_logits.view(-1, shift_logits.size(-1)),
                shift_labels.view(-1)
            )
        
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        
        progress_bar.set_postfix({"loss": f"{loss.item():.4f}", "lr": f"{scheduler.get_last_lr()[0]:.2e}"})

# Save enhanced model
torch.save(model.state_dict(), "enhanced_model.pth")
print("Enhanced training complete. Model saved.")

# Enhanced evaluation function
def enhanced_evaluate_model(model, dataloader, tokenizer):
    model.eval()
    generated_codes = {}
    references = []
    
    with torch.no_grad():
        progress_bar = tqdm(dataloader, desc="Generating enhanced predictions")
        for idx, batch in enumerate(progress_bar):
            pixel_values = batch["pixel_values"].to(DEVICE)
            
            # Get image embeddings
            with amp.autocast(enabled=DEVICE=="cuda"):
                vision_output = model.vision_encoder(pixel_values)
                image_embeds = vision_output.last_hidden_state[:, 0, :]
                base_features = model.residual(image_embeds)
                transformed_features = model.connector(image_embeds)
                image_embeds = base_features + transformed_features
            
            # Enhanced generation with beam search
            output_ids = model.code_decoder.generate(
                max_length=MAX_SEQ_LEN,
                eos_token_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.pad_token_id,
                encoder_hidden_states=image_embeds.unsqueeze(1),
                num_beams=BEAM_SIZE,
                early_stopping=True,
                temperature=0.7,
                repetition_penalty=1.2
            )
            
            # Decode generated code
            generated_code = tokenizer.batch_decode(
                output_ids, 
                skip_special_tokens=True
            )
            
            # Store results
            for i, code in enumerate(generated_code):
                sample_id = idx * BATCH_SIZE + i
                generated_codes[f"sample_{sample_id}"] = code
                
                dataset_idx = idx * BATCH_SIZE + i
                if dataset_idx < len(dataloader.dataset):
                    references.append({
                        "id": f"sample_{sample_id}",
                        "generated": code,
                        "reference": dataloader.dataset.dataset[dataset_idx][CODE_KEY]
                    })
    
    # Evaluate syntax rate
    print("Evaluating enhanced syntax rate...")
    syntax_rate = evaluate_syntax_rate_simple(generated_codes)
    
    # Evaluate IOU
    print("Evaluating enhanced IOU...")
    iou_scores = []
    eval_subset = references[:50]
    
    for ref in tqdm(eval_subset, desc="Calculating IOU"):
        try:
            iou = get_iou_best(ref["generated"], ref["reference"])
            iou_scores.append(iou)
        except Exception as e:
            print(f"IOU failed for sample {ref['id']}: {str(e)}")
            iou_scores.append(0.0)
    
    avg_iou = np.mean(iou_scores) if iou_scores else 0.0
    
    return syntax_rate, avg_iou

# Evaluate enhanced model
print("Evaluating enhanced model...")
syntax_rate, avg_iou = enhanced_evaluate_model(model, test_loader, tokenizer)

print("\n" + "="*50)
print("Enhanced Model Evaluation Results:")
print(f"- Valid Syntax Rate: {syntax_rate:.4f}")
print(f"- Average IOU (50 samples): {avg_iou:.4f}")
print("="*50)

Using device: cuda
Batch size: 8
Loading dataset...


Setting num_proc from 16 to 2 for the train split as it only contains 2 shards.
Generating train split: 100%|██████████| 147289/147289 [00:04<00:00, 31797.92 examples/s] 
Setting num_proc from 16 back to 1 for the test split to disable multiprocessing as it only contains one shard.
Generating test split: 100%|██████████| 7355/7355 [00:00<00:00, 72271.03 examples/s]
Setting num_proc from 16 back to 1 for the validation split to disable multiprocessing as it only contains one shard.
Generating validation split: 100%|██████████| 8204/8204 [00:00<00:00, 74496.80 examples/s]


Train samples: 2000, Test samples: 200
Creating enhanced datasets...


AttributeError: module 'torch.nn' has no attribute 'RandomHorizontalFlip'