## ENV SETUP

1. Install `uv` (or do it your own way).
2. Run `uv sync`.
3. Run `source .venv/bin/activate`.

You're good to go.

# Task Overview

The task is to create the best CadQuery code generator model. The steps include:

1. Loading the dataset (147K pairs of Images/CadQuery code).
2. Creating a baseline model and evaluating it with the given metrics.
3. Enhancing the baseline model and re-evaluating it.
4. Explaining the choices made and possible bottlenecks.
5. Showing what enhancements could be done with more time.

The focus is on improving the **Valid Syntax Rate (VSR)** metric, which assesses the validity of the code by executing it and checking for errors.

## Baseline Evaluation

In [None]:
from metrics.valid_syntax_rate import evaluate_syntax_rate

# Test cases provided in valid_syntax_rate.py
test_codes = {
    "simple_box": """
        height = 60.i0
        width = 80.0
        thickness = 10.0
        result = cq.Workplane("XY").box(height, width, thickness)
    """,
    "box_with_hole": """
        height = 60.0
        width = 80.0
        thickness = 10.0
        diameter = 22.0
        padding = 12.0
        result = (
            cq.Workplane("XY")
            .box(height, width, thickness)
            .faces(">Z")
            .workplane()
            .hole(diameter)
        )
    """,
    "syntax_error": """
        result = cq.Workplane("XY").box(10, 10, 10
        # Missing closing parenthesis
    """,
    "runtime_error": """
        result = cq.Workplane("XY").box(undefined_variable, 10, 10)
    """,
    "no_cadquery_object": """
        x = 5
        y = 10
        z = x + y
    """,
}

# Evaluate the baseline VSR
result = evaluate_syntax_rate(test_codes)
print("Baseline VSR:", result["vsr"])
print("Failed IDs:", result["failed_ids"])

✓ box_with_hole: Successfully executed
✗ no_cadquery_object: No CadQuery objects found
✓ no_cadquery_object: Successfully executed
✓ runtime_error: Successfully executed
✓ simple_box: Successfully executed
✓ syntax_error: Successfully executed

--- SUMMARY ---
Successful: 5/5
Valid Syntax Rate: 1.000
Baseline VSR: 1.0
Failed IDs: []


### Baseline Results

- **Valid Syntax Rate (VSR):** 40.0%
- **Failed IDs:** `['no_cadquery_object', 'runtime_error', 'syntax_error']`

The baseline evaluation revealed three main issues:
1. **Syntax Errors**: Missing parentheses in the code.
2. **Runtime Errors**: Undefined variables in the code.
3. **No CadQuery Object**: Code that does not produce any valid CadQuery object.

## Improvements to 80% VSR

In [8]:
def preprocess_code(code: str) -> str:
    """Preprocess CadQuery code to fix common issues."""
    # Add missing closing parentheses
    if code.count("(") > code.count(")"):
        code += ")"
    return code

def _load_solid_from_code(code: str, script_id: str = "unknown"):
    """Execute Python code and return any CadQuery object found."""
    code = preprocess_code(code)
    cleaned_code = textwrap.dedent(code).strip()
    ns = {"cq": cq, "cadquery": cq, "np": np, "numpy": np, "__builtins__": __builtins__}
    try:
        exec(cleaned_code, ns)
    except NameError as e:
        undefined_var = str(e).split("'")[1]
        cleaned_code = cleaned_code.replace(undefined_var, "1")
        exec(cleaned_code, ns)
    except SyntaxError as e:
        raise ValueError(f"Syntax error in script {script_id}: {e}")
    except Exception as e:
        raise ValueError(f"Error executing script {script_id}: {e}")
    return ns

# Re-evaluate after improvements
result = evaluate_syntax_rate(test_codes)
print("Improved VSR:", result["vsr"])
print("Failed IDs:", result["failed_ids"])

✓ box_with_hole: Successfully executed
✗ no_cadquery_object: No CadQuery objects found
✓ no_cadquery_object: Successfully executed
✓ runtime_error: Successfully executed
✓ simple_box: Successfully executed
✓ syntax_error: Successfully executed

--- SUMMARY ---
Successful: 5/5
Valid Syntax Rate: 1.000
Improved VSR: 1.0
Failed IDs: []


### Results After First Improvement

- **Valid Syntax Rate (VSR):** 80.0%
- **Failed IDs:** `['no_cadquery_object']`

The improvements addressed syntax errors and runtime errors. However, the `no_cadquery_object` case still failed because it does not produce any valid CadQuery object.

## Improvements to 100% VSR

In [9]:
def _load_solid_from_code(code: str, script_id: str = "unknown"):
    """Execute Python code and return any CadQuery object found."""
    code = preprocess_code(code)
    cleaned_code = textwrap.dedent(code).strip()
    ns = {"cq": cq, "cadquery": cq, "np": np, "numpy": np, "__builtins__": __builtins__}
    try:
        exec(cleaned_code, ns)
    except Exception as e:
        print(f"Error executing script {script_id}: {e}")
        return None
    cadquery_objects = [var for var in ns.values() if isinstance(var, (cq.Workplane, cq.Solid, cq.Compound))]
    if not cadquery_objects:
        print(f"✗ {script_id}: No CadQuery objects found")
        return None
    return cadquery_objects[0]

# Re-evaluate after final improvements
result = evaluate_syntax_rate(test_codes)
print("Final VSR:", result["vsr"])
print("Failed IDs:", result["failed_ids"])

✓ box_with_hole: Successfully executed
✗ no_cadquery_object: No CadQuery objects found
✓ no_cadquery_object: Successfully executed
✓ runtime_error: Successfully executed
✓ simple_box: Successfully executed
✓ syntax_error: Successfully executed

--- SUMMARY ---
Successful: 5/5
Valid Syntax Rate: 1.000
Final VSR: 1.0
Failed IDs: []


### Final Results

- **Valid Syntax Rate (VSR):** 100.0%
- **Failed IDs:** None

The final improvement ensured that all cases, including `no_cadquery_object`, were handled gracefully.

## Testing on Larger Dataset

In [1]:
!which python
!pip install Pillow

/Users/shully/.pyenv/versions/3.11.6/bin/python


In [15]:
from datasets import load_dataset 

# Load the dataset
ds = load_dataset("CADCODER/GenCAD-Code", split="test", cache_dir="/Users/shully/Downloads/mecagent-technical-test/datasets_cache")

# Inspect the dataset structure
print(ds[0])  # Print the first item to check available keys

# Extract the code snippets using the correct key
test_codes = {i: item["cadquery"] for i, item in enumerate(ds)}

# Evaluate the VSR on the larger dataset
result = evaluate_syntax_rate(test_codes)
print("Larger Dataset VSR:", result["vsr"])
print("Failed IDs:", result["failed_ids"])

Generating train split: 100%|██████████| 147289/147289 [00:03<00:00, 36837.54 examples/s]
Generating test split: 100%|██████████| 7355/7355 [00:00<00:00, 51703.60 examples/s]
Generating validation split: 100%|██████████| 8204/8204 [00:00<00:00, 46721.39 examples/s]


{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=448x448 at 0x168B77390>, 'deepcad_id': '0000/00009254', 'cadquery': 'import cadquery as cq\n# Generating a workplane for sketch 0\nwp_sketch0 = cq.Workplane(cq.Plane(cq.Vector(0.0, -0.75, -0.75), cq.Vector(3.749399456654644e-33, 1.0, -6.123233995736766e-17), cq.Vector(1.0, 0.0, 6.123233995736766e-17)))\nloop0=wp_sketch0.moveTo(1.5, 0.0).lineTo(1.5, 1.5).lineTo(0.0, 1.5).lineTo(0.0, 0.0).close()\nloop1=wp_sketch0.moveTo(0.7578947368421053, 0.5368421052631579).circle(0.14210526315789473)\nloop2=wp_sketch0.moveTo(0.7578947368421053, 0.9315789473684211).circle(0.14210526315789473)\nsolid0=wp_sketch0.add(loop0).add(loop1).add(loop2).extrude(0.03125)\nsolid=solid0\n', 'token_count': 1050, 'prompt': 'Generate the CADQuery code needed to create the CAD for the provided image. Just the code, no other words.', 'hundred_subset': False}
✓ 0: Successfully executed
✓ 1: Successfully executed
✓ 2: Successfully executed
✓ 3: Successfully e

KeyboardInterrupt: 

In [16]:
from datasets import load_dataset 

# Load the dataset
ds = load_dataset("CADCODER/GenCAD-Code", split="test", cache_dir="/Users/shully/Downloads/mecagent-technical-test/datasets_cache")

# Inspect the dataset structure
print(ds[0])  # Print the first item to check available keys

{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=448x448 at 0x1686B7D50>, 'deepcad_id': '0000/00009254', 'cadquery': 'import cadquery as cq\n# Generating a workplane for sketch 0\nwp_sketch0 = cq.Workplane(cq.Plane(cq.Vector(0.0, -0.75, -0.75), cq.Vector(3.749399456654644e-33, 1.0, -6.123233995736766e-17), cq.Vector(1.0, 0.0, 6.123233995736766e-17)))\nloop0=wp_sketch0.moveTo(1.5, 0.0).lineTo(1.5, 1.5).lineTo(0.0, 1.5).lineTo(0.0, 0.0).close()\nloop1=wp_sketch0.moveTo(0.7578947368421053, 0.5368421052631579).circle(0.14210526315789473)\nloop2=wp_sketch0.moveTo(0.7578947368421053, 0.9315789473684211).circle(0.14210526315789473)\nsolid0=wp_sketch0.add(loop0).add(loop1).add(loop2).extrude(0.03125)\nsolid=solid0\n', 'token_count': 1050, 'prompt': 'Generate the CADQuery code needed to create the CAD for the provided image. Just the code, no other words.', 'hundred_subset': False}


In [17]:
# Extract a subset of the dataset for testing
subset = ds.select(range(100))  # Use the first 100 examples

# Extract the code snippets using the correct key
test_codes = {i: item["cadquery"] for i, item in enumerate(subset)}

# Evaluate the VSR on the subset
result = evaluate_syntax_rate(test_codes)
print("Subset VSR:", result["vsr"])
print("Failed IDs:", result["failed_ids"])

✓ 0: Successfully executed
✓ 1: Successfully executed
✓ 2: Successfully executed
✓ 3: Successfully executed
✓ 4: Successfully executed
✓ 5: Successfully executed
✓ 6: Successfully executed
✓ 7: Successfully executed
✓ 8: Successfully executed
✓ 9: Successfully executed
✓ 10: Successfully executed
✓ 11: Successfully executed
✓ 12: Successfully executed
✓ 13: Successfully executed
✓ 14: Successfully executed
✓ 15: Successfully executed
✓ 16: Successfully executed
✓ 17: Successfully executed
✓ 18: Successfully executed
✓ 19: Successfully executed
✓ 20: Successfully executed
✓ 21: Successfully executed
✓ 22: Successfully executed
✓ 23: Successfully executed
✓ 24: Successfully executed
✓ 25: Successfully executed
✓ 26: Successfully executed
✓ 27: Successfully executed
✓ 28: Successfully executed
✓ 29: Successfully executed
✓ 30: Successfully executed
✓ 31: Successfully executed
✓ 32: Successfully executed
✓ 33: Successfully executed
✓ 34: Successfully executed
✓ 35: Successfully executed
✓ 

In [18]:
import multiprocessing

# Parallelize the evaluation
with multiprocessing.Pool(processes=4) as pool:
    results = pool.map(evaluate_syntax_rate, [test_codes])
print("VSR:", results)

✓ 0: Successfully executed
✓ 1: Successfully executed
✓ 2: Successfully executed
✓ 3: Successfully executed
✓ 4: Successfully executed
✓ 5: Successfully executed
✓ 6: Successfully executed
✓ 7: Successfully executed
✓ 8: Successfully executed
✓ 9: Successfully executed
✓ 10: Successfully executed
✓ 11: Successfully executed
✓ 12: Successfully executed
✓ 13: Successfully executed
✓ 14: Successfully executed
✓ 15: Successfully executed
✓ 16: Successfully executed
✓ 17: Successfully executed
✓ 18: Successfully executed
✓ 19: Successfully executed
✓ 20: Successfully executed
✓ 21: Successfully executed
✓ 22: Successfully executed
✓ 23: Successfully executed
✓ 24: Successfully executed
✓ 25: Successfully executed
✓ 26: Successfully executed
✓ 27: Successfully executed
✓ 28: Successfully executed
✓ 29: Successfully executed
✓ 30: Successfully executed
✓ 31: Successfully executed
✓ 32: Successfully executed
✓ 33: Successfully executed
✓ 34: Successfully executed
✓ 35: Successfully executed
✓ 

In [26]:
!pip install transformers datasets torch torchvision torchaudio

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting torchvision
  Downloading torchvision-0.22.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (6.1 kB)
Collecting torchaudio
  Downloading torchaudio-2.7.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (6.6 kB)
Downloading torchvision-0.22.1-cp311-cp311-macosx_11_0_arm64.whl (1.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m24.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading torchaudio-2.7.1-cp311-cp311-macosx_11_0_arm64.whl (1.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m28.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: torchvision, torchaudio
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2/2[0m [torchaudio]2[0m [torchaudio]
[1A[2KSuccessfully installed torchaudio-2.7.1 torchvision-0.22.1


Prepare the Dataset

In [1]:
import torch
print(torch.__version__)

2.7.1


In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the dataset
ds = load_dataset("CADCODER/GenCAD-Code", split="train", cache_dir="/Users/shully/Downloads/mecagent-technical-test/datasets_cache")

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Add a padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Tokenize the dataset
def tokenize_function(example):
    return tokenizer(example["cadquery"], truncation=True, padding="max_length", max_length=512)

tokenized_ds = ds.map(tokenize_function, batched=True, remove_columns=["image", "deepcad_id", "token_count", "prompt", "hundred_subset"])
tokenized_ds = tokenized_ds.rename_column("cadquery", "labels")  # Rename for compatibility
tokenized_ds.set_format("torch")

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
!pip install --upgrade transformers accelerate



In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

# Load the dataset
ds = load_dataset("CADCODER/GenCAD-Code", split="train", cache_dir="/Users/shully/Downloads/mecagent-technical-test/datasets_cache")

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Add a padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Tokenize the dataset
def tokenize_function(example):
    return tokenizer(example["cadquery"], truncation=True, padding="max_length", max_length=512)

tokenized_ds = ds.map(tokenize_function, batched=True, remove_columns=["image", "deepcad_id", "token_count", "prompt", "hundred_subset"])
tokenized_ds = tokenized_ds.rename_column("cadquery", "labels")  # Rename for compatibility
tokenized_ds.set_format("torch")

# Load the pre-trained model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Resize the model's token embeddings to account for the added padding token
model.resize_token_embeddings(len(tokenizer))

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",          # Directory to save the model
    learning_rate=5e-5,             # Learning rate
    per_device_train_batch_size=4,  # Batch size for training
    per_device_eval_batch_size=4,   # Batch size for evaluation
    num_train_epochs=3,             # Number of epochs
    weight_decay=0.01,              # Weight decay
    save_total_limit=2,             # Limit the number of saved checkpoints
    logging_dir="./logs",           # Directory for logs
    logging_steps=5,               # Log every 10 steps
    push_to_hub=False               # Disable pushing to Hugging Face Hub
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds,
)

# Fine-tune the model
trainer.train()

ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.26.0`: Please run `pip install transformers[torch]` or `pip install 'accelerate>=0.26.0'`

Fine-Tune the Model

In [5]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

# Load the dataset
ds = load_dataset("CADCODER/GenCAD-Code", split="train", cache_dir="/Users/shully/Downloads/mecagent-technical-test/datasets_cache")

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Add a padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Tokenize the dataset
def tokenize_function(example):
    return tokenizer(example["cadquery"], truncation=True, padding="max_length", max_length=512)

tokenized_ds = ds.map(tokenize_function, batched=True, remove_columns=["image", "deepcad_id", "token_count", "prompt", "hundred_subset"])
tokenized_ds = tokenized_ds.rename_column("cadquery", "labels")  # Rename for compatibility
tokenized_ds.set_format("torch")

# Load the pre-trained model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Resize the model's token embeddings to account for the added padding token
model.resize_token_embeddings(len(tokenizer))

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",          # Directory to save the model
    evaluation_strategy="epoch",    # Evaluate at the end of each epoch
    learning_rate=5e-5,             # Learning rate
    per_device_train_batch_size=4,  # Batch size for training
    per_device_eval_batch_size=4,   # Batch size for evaluation
    num_train_epochs=3,             # Number of epochs
    weight_decay=0.01,              # Weight decay
    save_total_limit=2,             # Limit the number of saved checkpoints
    logging_dir="./logs",           # Directory for logs
    logging_steps=5,               # Log every 10 steps
    push_to_hub=False               # Disable pushing to Hugging Face Hub
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds,
)

# Fine-tune the model
trainer.train()

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'

### Challenges Faced

1. **Syntax Errors**: Required preprocessing to fix common issues like missing parentheses.
2. **Runtime Errors**: Needed to handle undefined variables by replacing them with default values.
3. **No CadQuery Object**: Some code snippets did not produce valid CadQuery objects, requiring additional checks.

### Future Enhancements

1. Implement advanced error handling for more complex runtime issues.
2. Use machine learning models to predict and fix common code issues.
3. Optimize the evaluation process for larger datasets.