# 🐛 SOFAI-Core: Code Debugging Domain

## Automated Bug Fixing with DebugBench & LeetCode Validation

In this notebook, you will learn:

1. **What is the Code Debugging Domain?** - Problem definition and components
2. **The DebugBench Dataset** - 17 bug types, 4,253+ instances
3. **Setting Up LeetCode Validation** - Step-by-step session cookie setup
4. **Domain Architecture** - All components explained
5. **Running the Full Pipeline** - End-to-end debugging with SOFAI

---

### ⚠️ Important Prerequisites

This domain requires:
1. **DebugBench Dataset** - Already included in `domains/code_debugging/data/`
2. **LeetCode Account** - For real code validation
3. **LEETCODE_SESSION Cookie** - See setup instructions below
4. **Ollama** - For LLM inference (optional for component demos)

In [1]:
# ============================================================
# COLAB SETUP - Run this cell first if using Google Colab
# ============================================================
import subprocess
import sys
import os

# system deps
!sudo apt-get update
!sudo apt-get install -y zstd

# Clone repo
!git clone https://github.com/SOFAI-LM-AAAILab/SOFAI-LM.git
%cd SOFAI-LM

# Setup: add the project root to sys.path (sys and os are already imported above)

project_root = os.path.dirname(os.getcwd()) if 'notebooks' in os.getcwd() else os.getcwd()
if project_root not in sys.path:
    sys.path.insert(0, project_root)

print(f"Project root: {project_root}")

# Install Ollama
!curl https://ollama.ai/install.sh | sh
!pip install -q ollama

!nvidia-smi
!ollama serve > /tmp/ollama.log 2>&1 &
!sleep 2

!pip install -r requirements.txt


In [3]:
!pip install python-leetcode
!pip install gym



---

## Part 1: Understanding the Code Debugging Domain

### What is Code Debugging?

**Code Debugging** is the process of identifying and fixing errors (bugs) in source code. In the SOFAI framework, this is treated as a Constraint Satisfaction Problem where:

- **Problem**: A buggy code snippet with a known problem description
- **Solution**: Fixed code that passes all test cases
- **Constraint**: The fixed code must be semantically equivalent to the intended solution

### The Challenge

Unlike graph coloring where validation is deterministic, code debugging requires:
- **Semantic understanding** of what the code should do
- **Syntax correctness** (no compile errors)
- **Passing all test cases** (often hidden)

### SOFAI's Approach

```
┌─────────────────────────────────────────────────────────────────┐
│                     Code Debugging Flow                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. DEBUGBENCH        2. PROMPT             3. LLM              │
│  ─────────────        ─────────             ──────              │
│  Load buggy code  →   IO_INTENTION   →      Generate            │
│  + description        format prompt         fixed code          │
│                                                                 │
│       ↓                                                         │
│                                                                 │
│  4. PARSER            5. LEETCODE API       6. FEEDBACK         │
│  ─────────            ───────────────       ───────────         │
│  Extract code     →   Submit &       →      Pass: Done!         │
│  from <code> tags     run tests             Fail: Retry         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
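
The flow above can be sketched as a small retry loop. This is a minimal illustration, not the SOFAI implementation: `debug_with_retries` and its callable parameters are hypothetical names, and appending the feedback to the prompt on failure is an assumption about how a retry might be driven.

```python
from typing import Callable, Optional, Tuple


def debug_with_retries(
    generate: Callable[[str], str],
    parse: Callable[[str], str],
    validate: Callable[[str], Tuple[bool, str]],
    prompt: str,
    max_iterations: int = 5,
) -> Optional[str]:
    """Return the first candidate that validates, or None if every attempt fails."""
    current_prompt = prompt
    for _ in range(max_iterations):
        response = generate(current_prompt)   # step 3: the LLM proposes a fix
        candidate = parse(response)           # step 4: extract the code
        ok, feedback = validate(candidate)    # step 5: run the tests
        if ok:
            return candidate                  # step 6: pass, so stop
        # step 6: fail, so retry with the feedback appended (assumed strategy)
        current_prompt = f"{prompt}\n\nPrevious attempt failed:\n{feedback}"
    return None


# Toy stand-ins so the sketch runs without an LLM or a LeetCode session.
attempts = iter(["<code>return 1</code>", "<code>return 2</code>"])
fixed = debug_with_retries(
    generate=lambda _prompt: next(attempts),
    parse=lambda r: r.replace("<code>", "").replace("</code>", ""),
    validate=lambda c: (c == "return 2", "Wrong Answer"),
    prompt="Fix the bug",
)
print(fixed)
```

Passing the three stages in as callables keeps the loop testable offline; in the real domain those roles are played by the LLM, the solution parser, and the LeetCode validator.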

---

## Part 2: The DebugBench Dataset

### What is DebugBench?

[DebugBench](https://github.com/thunlp/DebugBench) is a benchmark dataset for code debugging containing:
- **4,253 buggy code instances** in total, spanning C++, Java, and Python3 (this domain uses only the Python3 subset)
- **17 distinct bug types**
- Problems sourced from LeetCode
- Oracle (correct) solutions for reference

### The 17 Bug Types

| Category | Bug Types |
|----------|----------|
| **Logic Errors** | condition error, operation error, variable error |
| **Syntax Errors** | missing colons, illegal indentation, unclosed parentheses, unclosed string |
| **Reference Errors** | undefined methods, undefined objects, faulty indexing |
| **Semantic Errors** | misused == or =, illegal keywords, illegal comment |
| **Multiple Bugs** | double (2 bugs), triple (3 bugs), quadruple (4 bugs), other error |

### Dataset Location

The dataset is pre-loaded in:
```
domains/code_debugging/data/
├── python3_condition error.json
├── python3_double.json
├── python3_faulty indexing.json
├── python3_illegal comment.json
├── python3_illegal indentation.json
├── python3_illegal keywords.json
├── python3_missing colons.json
├── python3_misused == or =.json
├── python3_operation error.json
├── python3_other error.json
├── python3_quadruple.json
├── python3_triple.json
├── python3_unclosed parentheses.json
├── python3_unclosed string.json
├── python3_undefined methods.json
├── python3_undefined objects.json
└── python3_variable error.json
```
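
The naming scheme in the listing (`python3_<bug type>.json`, spaces included) can be turned into a path with a few lines of Python. This is a sketch under assumptions: `dataset_file` and `load_records` are hypothetical helpers, not functions from `data_loader.py`, and the top-level JSON value is assumed to be a list of records.

```python
import json
from pathlib import Path
from typing import Any, List


def dataset_file(data_dir: Path, bug_type: str, language: str = "python3") -> Path:
    """Build the JSON path for one bug type; file names keep the spaces."""
    return data_dir / f"{language}_{bug_type}.json"


def load_records(path: Path) -> List[Any]:
    """Read one dataset file (assumes the top-level JSON value is a list)."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


path = dataset_file(Path("domains/code_debugging/data"), "condition error")
print(path.name)
if path.exists():  # only true when run from the repository root
    print(len(load_records(path)), "records")
```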

In [3]:
# Explore the bug types and dataset size
from domains.code_debugging.data_loader import BUG_TYPES, get_problem_count, DEBUGBENCH_PATH

print("📊 DebugBench Dataset Overview")
print("=" * 60)
print(f"\nDataset location: {DEBUGBENCH_PATH}")
print(f"\nAvailable bug types ({len(BUG_TYPES)} total):")
print("-" * 60)

total_problems = 0
for bug_type in BUG_TYPES:
    count = get_problem_count(bug_type=bug_type)
    total_problems += count
    print(f"  • {bug_type:25s} : {count:4d} problems")

print("-" * 60)
print(f"  {'TOTAL':25s} : {total_problems:4d} problems")

📊 DebugBench Dataset Overview

Dataset location: /Users/vedantkhandelwal/Desktop/SOFAI-LM/domains/code_debugging/data

Available bug types (17 total):
------------------------------------------------------------
  • condition error           :   83 problems
  • double                    :  250 problems
  • faulty indexing           :   72 problems
  • illegal comment           :   40 problems
  • illegal indentation       :   45 problems
  • illegal keywords          :   39 problems
  • missing colons            :   43 problems
  • misused == or =           :   45 problems
  • operation error           :   62 problems
  • other error               :   20 problems
  • quadruple                 :  231 problems
  • triple                    :  250 problems
  • unclosed parentheses      :   45 problems
  • unclosed string           :   42 problems
  • undefined methods         :   56 problems
  • undefined objects         :   60 problems
  • variable er



In [4]:
# Load and examine a sample problem
from domains.code_debugging.data_loader import load_problem_from_dataset

print("📝 Sample Problem (condition error)")
print("=" * 60)

# Load a specific bug type
problem = load_problem_from_dataset(language="Python3", bug_type="condition error", problem_index=0)

print(f"\n🔖 Problem: {problem.slug}")
print(f"📊 Level: {problem.level}")
print(f"🐛 Bug Type: {problem.bug_type}")
print(f"\n📋 Description:")
print(problem.description[:500] + "..." if len(problem.description) > 500 else problem.description)

📝 Sample Problem (condition error)

🔖 Problem: the-kth-factor-of-n
📊 Level: medium
🐛 Bug Type: condition error

📋 Description:
You are given two positive integers n and k. A factor of an integer n is defined as an integer i where n % i == 0.
Consider a list of all factors of n sorted in ascending order, return the kth factor in this list or return -1 if n has less than k factors.


In [5]:
# Show the buggy code
print("\n🐛 Buggy Code:")
print("-" * 60)
print(problem.buggy_code)


🐛 Buggy Code:
------------------------------------------------------------

class Solution:
    def kthFactor(self, n: int, k: int) -> int:
        j = 0
        for i in range(1, n + 1):
            if n % i == 0:
                num = i
                j += 1
            if j == k:
                break
        return num if j == k+1 else -1



In [6]:
# Show the oracle (correct) code for comparison
print("\n✅ Oracle (Correct) Code:")
print("-" * 60)
print(problem.oracle_code)


✅ Oracle (Correct) Code:
------------------------------------------------------------
class Solution:
    def kthFactor(self, n: int, k: int) -> int:
        j = 0
        for i in range(1, n + 1):
            if n % i == 0:
                num = i
                j += 1
            if j == k:
                break
        return num if j == k else -1
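
To see the effect of the one-token difference without submitting anything, the two versions can be run side by side on the examples from the problem statement. This is a local sketch only: the function names are illustrative, and the third case (`n = 4, k = 4`) is an extra case added here, not taken from the dataset.

```python
def kth_factor_buggy(n: int, k: int) -> int:
    j = 0
    for i in range(1, n + 1):
        if n % i == 0:
            num = i
            j += 1
        if j == k:
            break
    return num if j == k + 1 else -1  # bug: compares against k + 1


def kth_factor_oracle(n: int, k: int) -> int:
    j = 0
    for i in range(1, n + 1):
        if n % i == 0:
            num = i
            j += 1
        if j == k:
            break
    return num if j == k else -1  # fix: compare against k


# ((n, k), expected) pairs; the first two come from the problem statement.
cases = [((12, 3), 3), ((7, 2), 7), ((4, 4), -1)]
for (n, k), expected in cases:
    buggy = kth_factor_buggy(n, k)
    oracle = kth_factor_oracle(n, k)
    print(f"n={n}, k={k}: expected={expected}, buggy={buggy}, oracle={oracle}")
```

Because the loop breaks as soon as `j == k`, the counter can never reach `k + 1`, so the buggy version returns `-1` for every input.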


In [7]:
# The CodeDebuggingProblem dataclass
print("\n📦 CodeDebuggingProblem Structure")
print("=" * 60)
print("""
@dataclass
class CodeDebuggingProblem:
    slug: str          # LeetCode problem ID (e.g., 'two-sum')
    description: str   # Problem statement
    examples: List[str] # Input/output examples
    constraints: str    # Problem constraints
    level: str         # 'easy', 'medium', 'hard'
    buggy_code: str    # The code to fix
    oracle_code: str   # Reference correct solution
    explanations: str  # Bug explanation
    content: str       # Full problem content
    bug_type: str      # One of the 17 bug types
""")


📦 CodeDebuggingProblem Structure

@dataclass
class CodeDebuggingProblem:
    slug: str          # LeetCode problem ID (e.g., 'two-sum')
    description: str   # Problem statement
    examples: List[str] # Input/output examples
    constraints: str    # Problem constraints
    level: str         # 'easy', 'medium', 'hard'
    buggy_code: str    # The code to fix
    oracle_code: str   # Reference correct solution
    explanations: str  # Bug explanation
    content: str       # Full problem content
    bug_type: str      # One of the 17 bug types



---

## Part 3: Setting Up LeetCode Validation 🔐

The Code Debugging domain uses **real LeetCode submission** to validate solutions. This ensures the fixed code actually works!

### Why LeetCode Validation?

- ✅ Real test cases (including hidden ones)
- ✅ Accurate feedback (runtime errors, wrong answers)
- ✅ Performance metrics (runtime, memory)
- ⚠️ Requires LeetCode account
- ⚠️ Rate limited (15s cooldown between submissions)

### Step-by-Step Setup

#### Step 1: Log into LeetCode

1. Open your browser and go to [leetcode.com](https://leetcode.com)
2. Log into your account

#### Step 2: Get the Session Cookie

**For Chrome/Edge:**
1. Right-click anywhere on LeetCode → "Inspect" (or press F12)
2. Go to **Application** tab → **Cookies** → `https://leetcode.com`
3. Find the cookie named `LEETCODE_SESSION`
4. Copy its **Value** (it's a long string starting with `eyJ...`)

**For Firefox:**
1. Right-click → "Inspect Element" → **Storage** tab → **Cookies**
2. Find `LEETCODE_SESSION` and copy its value

#### Step 3: Set Environment Variable

```bash
# In terminal (temporary, for current session)
export LEETCODE_SESSION='eyJ...your_long_session_value...'

# Or add to ~/.bashrc or ~/.zshrc (permanent)
echo 'export LEETCODE_SESSION="eyJ..."' >> ~/.zshrc
source ~/.zshrc
```
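
Inside a notebook, the same variable can be set for the current process without pasting the cookie into a cell. This is a minimal sketch using only the standard library; `mask_secret` and `set_session_interactively` are hypothetical helpers, not part of the SOFAI codebase.

```python
import os
from getpass import getpass


def mask_secret(value: str, visible: int = 4) -> str:
    """Show only the first few characters of a secret for log output."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * 8


def set_session_interactively() -> None:
    """Prompt for the cookie without echoing it, then export it for this process."""
    cookie = getpass("LEETCODE_SESSION: ").strip()
    if cookie:
        os.environ["LEETCODE_SESSION"] = cookie
        print("Session set:", mask_secret(cookie))


# Call set_session_interactively() in a notebook cell to be prompted.
print(mask_secret("eyJ-example-value"))  # eyJ-********
```

Using `getpass` keeps the value out of the saved notebook, and masking the preview avoids leaking most of the cookie into cell output.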

#### Step 4: Verify Setup

In [8]:
# Check if LEETCODE_SESSION is set
from domains.code_debugging.utils import check_leetcode_session

print("🔐 LeetCode Session Check")
print("=" * 60)

if check_leetcode_session():
    session = os.environ.get('LEETCODE_SESSION', '')
    print("✅ LEETCODE_SESSION is set!")
    print(f"   Value preview: {session[:20]}...{session[-20:]}")
    print("\n   LeetCode validation will work.")
else:
    print("❌ LEETCODE_SESSION is NOT set!")
    print("\n   To set it, run in your terminal:")
    print("   export LEETCODE_SESSION='your_session_cookie_value'")
    print("\n   Or set it here for this notebook session:")
    print("   (uncomment and fill in the next cell)")

🔐 LeetCode Session Check
✅ LEETCODE_SESSION is set!
   Value preview: eyJhbGciOiJIUzI1NiIs...6hnTrRgks-6da1hfKf7k

   LeetCode validation will work.


In [9]:
# OPTIONAL: Set the session cookie here (for this notebook only)
# ⚠️ WARNING: Do not commit this value to git!

# Uncomment and fill in your session cookie:
# os.environ['LEETCODE_SESSION'] = '<enter your leetcode session cookie here>'

# Then re-run the check:
# print("Session set!" if check_leetcode_session() else "Failed to set session")

### Important Notes About LeetCode Validation

| Aspect | Details |
|--------|--------|
| **Rate Limiting** | 15-second cooldown between submissions (enforced by code) |
| **Session Expiry** | Cookies expire. If validation fails, get a new session cookie. |
| **Account Risk** | Using automated submissions may violate LeetCode ToS. Use responsibly. |
| **Without Session** | Domain can still load problems and parse solutions, but cannot validate. |
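
The 15-second cooldown in the table can be sketched as a small helper that remembers when the last submission happened. This is an illustration under assumptions, not the code in `validator.py`: `SubmissionCooldown` is a hypothetical class, and the injectable `clock` and `sleep` parameters exist only so the behaviour can be exercised without real waiting.

```python
import time
from typing import Callable, Optional


class SubmissionCooldown:
    """Enforce a minimum gap between consecutive submissions."""

    def __init__(
        self,
        seconds: float = 15.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.seconds = seconds
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous submission

    def wait(self) -> float:
        """Block until the cooldown has elapsed; return the seconds waited."""
        remaining = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            remaining = max(0.0, self.seconds - elapsed)
        if remaining > 0:
            self._sleep(remaining)
        self._last = self._clock()
        return remaining


# Demo with a fake clock so nothing actually sleeps.
fake_now = [100.0]


def fake_sleep(duration: float) -> None:
    fake_now[0] += duration


cooldown = SubmissionCooldown(clock=lambda: fake_now[0], sleep=fake_sleep)
first = cooldown.wait()    # no previous submission, so no wait
fake_now[0] += 0.8         # pretend 0.8 s pass before the next attempt
second = cooldown.wait()   # waits out the rest of the 15 s window
print(first, round(second, 1))
```

A monotonic clock is used by default so that system clock adjustments cannot shorten or lengthen the gap.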

---

## Part 4: Domain Architecture Deep Dive

### File Structure

```
domains/code_debugging/
├── code_debugging_domain.py    # Main DomainInterface implementation
├── data_loader.py              # Load problems from DebugBench JSON
├── validator.py                # LeetCode API wrapper
├── leetcode_tester.py          # LeetCodeTester class
├── leetcode_env/               # LeetCode environment (from DebugBench)
├── prompt_builder.py           # IO_INTENTION_PROMPT construction
├── solution_parser.py          # Extract <code></code> tags
├── utils.py                    # Helper functions
└── data/                       # DebugBench JSON files
```

### 4.1 The Data Loader

In [10]:
print("📂 Data Loader")
print("=" * 60)
print("""
load_problem_from_dataset(
    language='Python3',        # Only Python3 supported
    bug_type=None,             # Specific type or None for random
    problem_index=None         # Specific index or None for random
) -> CodeDebuggingProblem

Loads problems from domains/code_debugging/data/python3_*.json
""")

# Demo: Load random problem
random_problem = load_problem_from_dataset()
print(f"\nRandom problem loaded:")
print(f"  Slug: {random_problem.slug}")
print(f"  Bug type: {random_problem.bug_type}")
print(f"  Level: {random_problem.level}")

📂 Data Loader

load_problem_from_dataset(
    language='Python3',        # Only Python3 supported
    bug_type=None,             # Specific type or None for random
    problem_index=None         # Specific index or None for random
) -> CodeDebuggingProblem

Loads problems from domains/code_debugging/data/python3_*.json


Random problem loaded:
  Slug: number-of-ways-to-split-array
  Bug type: condition error
  Level: medium


### 4.2 The Prompt Builder (IO_INTENTION_PROMPT)

In [11]:
from domains.code_debugging.prompt_builder import build_debugging_prompt, IO_INTENTION_PROMPT_TEMPLATE

print("📝 Prompt Builder (IO_INTENTION_PROMPT)")
print("=" * 60)
print("\nTemplate structure:")
print("-" * 60)
print(IO_INTENTION_PROMPT_TEMPLATE[:500] + "...")

📝 Prompt Builder (IO_INTENTION_PROMPT)

Template structure:
------------------------------------------------------------
Observe the function intention and its corresponding {LANG} implementation which is complete with no extra context. The implementation is faulty. Your task is to fix up the code and explain on the modification in less than 20 words.
You have to write the fixed code again. You should put <code></code> and <exp></exp> on the boundary of the code and the explanation. Do not write anything else in your response. Your reply should be like this:
<code>
fixed code
</code>
<exp>
short explanation about ...


In [12]:
# Generate a full prompt
print("\n📜 Generated Prompt Example")
print("=" * 60)

prompt = build_debugging_prompt(problem, episodic_examples=None)
print(prompt[:1500] + "..." if len(prompt) > 1500 else prompt)


📜 Generated Prompt Example
Observe the function intention and its corresponding Python3 implementation which is complete with no extra context. The implementation is faulty. Your task is to fix up the code and explain on the modification in less than 20 words.
You have to write the fixed code again. You should put <code></code> and <exp></exp> on the boundary of the code and the explanation. Do not write anything else in your response. Your reply should be like this:
<code>
fixed code
</code>
<exp>
short explanation about the bug
</exp>

Function Intention:
You are given two positive integers n and k. A factor of an integer n is defined as an integer i where n % i == 0.
Consider a list of all factors of n sorted in ascending order, return the kth factor in this list or return -1 if n has less than k factors.

Examples:
Input: n = 12, k = 3
Output: 3
Explanation: Factors list is [1, 2, 3, 4, 6, 12], the 3rd factor is 3.
Input: n = 7, k = 2
Output: 7
Explanation: Factors list is [1, 

### Key Points About the Prompt:

1. **Structured Format**: Uses `{LANG}`, `{DESCRIPTION}`, `{EXAMPLES}`, `{CONSTRAINTS}`, `{BUGGY_CODE}`

2. **Expected Output**: LLM must respond with:
   ```
   <code>
   fixed code here
   </code>
   <exp>
   short explanation about the bug
   </exp>
   ```

3. **Episodic Examples**: Can include past problem-solution pairs for few-shot learning
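
The placeholder mechanism in point 1 can be illustrated with plain `str.format`. This is a sketch only: `MINI_TEMPLATE` is a shortened stand-in for the real `IO_INTENTION_PROMPT_TEMPLATE`, `fill_template` is a hypothetical helper, and the section labels and constraint text are illustrative values rather than the repository's wording.

```python
from typing import List

# Shortened stand-in for the real template; placeholders match point 1 above.
MINI_TEMPLATE = (
    "Observe the function intention and its corresponding {LANG} implementation.\n"
    "\n"
    "Function Intention:\n{DESCRIPTION}\n\n"
    "Examples:\n{EXAMPLES}\n\n"
    "Constraints:\n{CONSTRAINTS}\n\n"
    "Faulty Implementation:\n{BUGGY_CODE}\n"
)


def fill_template(
    template: str,
    *,
    lang: str,
    description: str,
    examples: List[str],
    constraints: str,
    buggy_code: str,
) -> str:
    """Substitute every placeholder; examples are joined one per line."""
    return template.format(
        LANG=lang,
        DESCRIPTION=description,
        EXAMPLES="\n".join(examples),
        CONSTRAINTS=constraints,
        BUGGY_CODE=buggy_code,
    )


prompt_text = fill_template(
    MINI_TEMPLATE,
    lang="Python3",
    description="Return the kth factor of n, or -1 if n has fewer than k factors.",
    examples=["Input: n = 12, k = 3", "Output: 3"],
    constraints="1 <= k <= n",  # illustrative value
    buggy_code="seen = {}\nreturn -1",
)
print(prompt_text)
```

Braces inside the substituted values (such as `{}` in the buggy code) are safe here, because `str.format` only interprets braces in the template itself.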

### 4.3 The Solution Parser

In [13]:
from domains.code_debugging.solution_parser import parse_fixed_code, parse_explanation

print("🔍 Solution Parser")
print("=" * 60)
print("""
Extracts fixed code from LLM response.

Priority:
1. <code>...</code> tags (preferred)
2. ```python...``` markdown blocks
3. ```...``` generic code blocks
4. Entire response (fallback)
""")

# Test parsing
test_responses = [
    # Correct format
    """<code>
class Solution:
    def twoSum(self, nums, target):
        for i in range(len(nums)):
            for j in range(i+1, len(nums)):
                if nums[i] + nums[j] == target:
                    return [i, j]
</code>
<exp>
Fixed the loop range to avoid index out of bounds
</exp>""",

    # Markdown format
    """Here is the fixed code:
```python
def solution():
    return True
```""",
]

for i, response in enumerate(test_responses, 1):
    code = parse_fixed_code(response)
    exp = parse_explanation(response)
    print(f"\nResponse {i}:")
    print(f"  Extracted code ({len(code)} chars): {code[:50]}...")
    print(f"  Explanation: {exp if exp else 'None found'}")

🔍 Solution Parser

Extracts fixed code from LLM response.

Priority:
1. <code>...</code> tags (preferred)
2. ```python...``` markdown blocks
3. ```...``` generic code blocks
4. Entire response (fallback)


Response 1:
  Extracted code (212 chars): class Solution:
    def twoSum(self, nums, target)...
  Explanation: Fixed the loop range to avoid index out of bounds

Response 2:
  Extracted code (31 chars): def solution():
    return True...
  Explanation: None found
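
The four-level priority described in this cell can be approximated with a short list of regular expressions tried in order. This is a hedged re-creation for illustration, not the code in `solution_parser.py`: `extract_code` and `extract_explanation` are hypothetical names, and the exact patterns are assumptions.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built this way so the example can live inside a fenced block

# Patterns in priority order: <code> tags, python fences, generic fences.
_PATTERNS = [
    re.compile(r"<code>(.*?)</code>", re.DOTALL),
    re.compile(FENCE + r"python\s*\n(.*?)" + FENCE, re.DOTALL),
    re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL),
]


def extract_code(response: str) -> str:
    """Return the first match by priority, or the whole response as a fallback."""
    for pattern in _PATTERNS:
        match = pattern.search(response)
        if match:
            return match.group(1).strip()
    return response.strip()


def extract_explanation(response: str) -> Optional[str]:
    """Return the text inside <exp> tags, or None when the tags are absent."""
    match = re.search(r"<exp>(.*?)</exp>", response, re.DOTALL)
    return match.group(1).strip() if match else None


tagged = "<code>\nx = 1\n</code>\n<exp>\nset x\n</exp>"
fenced = "Here is the fix:\n" + FENCE + "python\ndef f():\n    return True\n" + FENCE
print(repr(extract_code(tagged)), repr(extract_explanation(tagged)))
print(repr(extract_code(fenced)), repr(extract_explanation(fenced)))
print(repr(extract_code("return 0")))
```

The non-greedy `(.*?)` with `re.DOTALL` stops at the first closing tag or fence, which matters when a response contains more than one block.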


### 4.4 The LeetCode Validator

In [14]:
print("✅ LeetCode Validator")
print("=" * 60)
print("""
validate_code_with_leetcode(
    code: str,           # The Python code to submit
    task_id: str,        # LeetCode problem slug (e.g., 'two-sum')
    language: str        # 'Python3'
) -> Tuple[bool, Dict]

Returns:
  - (True, {status_msg: 'Accepted', runtime, memory})
  - (False, {status_msg, error, last_testcase, expected, actual})

Features:
  • 15-second cooldown between submissions (prevents rate limiting)
  • Singleton pattern (reuses LeetCodeTester instance)
  • Handles environment errors gracefully
""")

✅ LeetCode Validator

validate_code_with_leetcode(
    code: str,           # The Python code to submit
    task_id: str,        # LeetCode problem slug (e.g., 'two-sum')
    language: str        # 'Python3'
) -> Tuple[bool, Dict]

Returns:
  - (True, {status_msg: 'Accepted', runtime, memory})
  - (False, {status_msg, error, last_testcase, expected, actual})

Features:
  • 15-second cooldown between submissions (prevents rate limiting)
  • Singleton pattern (reuses LeetCodeTester instance)
  • Handles environment errors gracefully



In [15]:
# Demonstrate validation (without actually submitting)
print("\n📋 Validation Flow")
print("-" * 60)
print("""
1. Check LEETCODE_SESSION environment variable
2. Create LeetCodeTester (singleton)
3. Wait for cooldown if needed (15s)
4. Submit code to LeetCode API
5. Wait for results
6. Return (is_valid, feedback)
""")

# Show what feedback looks like
print("\n📊 Example Feedback (Success):")
print({
    'status_msg': 'Accepted',
    'runtime': '40 ms',
    'memory': '14.2 MB',
    'status_runtime': 'beats 95%'
})

print("\n📊 Example Feedback (Failure):")
print({
    'status_msg': 'Wrong Answer',
    'last_testcase': 'nums = [2,7,11,15], target = 9',
    'expected_output': '[0, 1]',
    'code_output': '[1, 0]'
})


📋 Validation Flow
------------------------------------------------------------

1. Check LEETCODE_SESSION environment variable
2. Create LeetCodeTester (singleton)
3. Wait for cooldown if needed (15s)
4. Submit code to LeetCode API
5. Wait for results
6. Return (is_valid, feedback)


📊 Example Feedback (Success):
{'status_msg': 'Accepted', 'runtime': '40 ms', 'memory': '14.2 MB', 'status_runtime': 'beats 95%'}

📊 Example Feedback (Failure):
{'status_msg': 'Wrong Answer', 'last_testcase': 'nums = [2,7,11,15], target = 9', 'expected_output': '[0, 1]', 'code_output': '[1, 0]'}


---

## Part 5: Understanding the Feedback Loop

When validation fails, the domain provides detailed feedback to help the LLM correct its solution.

In [16]:
from domains.code_debugging import CodeDebuggingDomain
# Demonstrate feedback formatting
print("🔄 Feedback Examples")
print("=" * 60)

domain = CodeDebuggingDomain()

# Example 1: Accepted
feedback1 = {'status_msg': 'Accepted', 'runtime': '40 ms', 'memory': '14.2 MB'}
print(f"\n✅ Success Feedback:")
print(f"   {domain.format_feedback(feedback1)}")

# Example 2: Wrong Answer
feedback2 = {
    'status_msg': 'Wrong Answer',
    'last_testcase': 'nums = [2,7,11,15], target = 9',
    'expected_output': '[0, 1]',
    'code_output': '[1, 0]'
}
print(f"\n❌ Wrong Answer Feedback:")
print(f"   {domain.format_feedback(feedback2)}")

# Example 3: Runtime Error
feedback3 = {
    'status_msg': 'Runtime Error',
    'full_runtime_error': 'IndexError: list index out of range'
}
print(f"\n❌ Runtime Error Feedback:")
print(f"   {domain.format_feedback(feedback3)}")

# Example 4: Compile Error
feedback4 = {
    'status_msg': 'Compile Error',
    'compile_error': 'SyntaxError: invalid syntax at line 5'
}
print(f"\n❌ Compile Error Feedback:")
print(f"   {domain.format_feedback(feedback4)}")

🔄 Feedback Examples

✅ Success Feedback:
   Solution accepted! Runtime: 40 ms, Memory: 14.2 MB

❌ Wrong Answer Feedback:
   Status: Wrong Answer
Failed Test Case: nums = [2,7,11,15], target = 9
Expected: [0, 1]
Actual: [1, 0]


❌ Runtime Error Feedback:
   Status: Runtime Error
Error: IndexError: list index out of range


❌ Compile Error Feedback:
   Status: Compile Error
Error: SyntaxError: invalid syntax at line 5
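
The mapping from a feedback dictionary to the text shown above can be sketched as follows. This is an approximation inferred from the printed output, not the body of `CodeDebuggingDomain.format_feedback`: `format_feedback_sketch` is a hypothetical function, and the line order and the `N/A` default are assumptions.

```python
from typing import Any, Dict


def format_feedback_sketch(feedback: Dict[str, Any]) -> str:
    """Turn a LeetCode-style result dict into a short message for the LLM."""
    status = feedback.get("status_msg", "Unknown")
    if status == "Accepted":
        return (
            f"Solution accepted! Runtime: {feedback.get('runtime', 'N/A')}, "
            f"Memory: {feedback.get('memory', 'N/A')}"
        )

    lines = [f"Status: {status}"]
    # Runtime and compile errors arrive under different keys.
    error = feedback.get("full_runtime_error") or feedback.get("compile_error")
    if error:
        lines.append(f"Error: {error}")
    if "last_testcase" in feedback:
        lines.append(f"Failed Test Case: {feedback['last_testcase']}")
    if "expected_output" in feedback:
        lines.append(f"Expected: {feedback['expected_output']}")
    if "code_output" in feedback:
        lines.append(f"Actual: {feedback['code_output']}")
    return "\n".join(lines)


accepted = format_feedback_sketch(
    {"status_msg": "Accepted", "runtime": "40 ms", "memory": "14.2 MB"}
)
wrong = format_feedback_sketch(
    {
        "status_msg": "Wrong Answer",
        "last_testcase": "nums = [2,7,11,15], target = 9",
        "expected_output": "[0, 1]",
        "code_output": "[1, 0]",
    }
)
print(accepted)
print(wrong)
```

Keeping the message short and field-by-field makes it easy to append to the next prompt without flooding the model's context.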



---

## Part 6: The Complete Domain Implementation

In [17]:
from domains.code_debugging.code_debugging_domain import CodeDebuggingDomain

print("🎮 Complete Domain Workflow (Without LLM)")
print("=" * 60)

# Step 1: Create domain
domain = CodeDebuggingDomain()
print("\n1️⃣ Created CodeDebuggingDomain")

# Step 2: Generate a problem
problem = domain.generate_problem(bug_type="condition error")
print(f"\n2️⃣ Generated Problem:")
print(f"   Slug: {problem.slug}")
print(f"   Bug Type: {problem.bug_type}")
print(f"   Level: {problem.level}")

# Step 3: Build prompt
prompt = domain.build_prompt(problem, episodic_examples=None)
print(f"\n3️⃣ Built Prompt ({len(prompt)} chars)")

# Step 4: Simulate LLM response (use oracle code)
simulated_response = f"<code>\n{problem.oracle_code}\n</code>\n<exp>Fixed the condition error</exp>"
print(f"\n4️⃣ Simulated LLM Response (using oracle code)")

# Step 5: Parse solution
solution = domain.parse_solution(simulated_response)
print(f"\n5️⃣ Parsed Solution ({len(solution)} chars)")
print(f"   First 100 chars: {solution[:100]}...")

# Step 6: Memory representation
prob_repr = domain.get_problem_representation(problem)
sol_repr = domain.format_solution_for_memory(solution)
print(f"\n6️⃣ Memory Representations:")
print(f"   Problem: {prob_repr[:80]}...")
print(f"   Solution: {sol_repr[:80]}...")

🎮 Complete Domain Workflow (Without LLM)

1️⃣ Created CodeDebuggingDomain

2️⃣ Generated Problem:
   Slug: minimum-bit-flips-to-convert-number
   Bug Type: condition error
   Level: easy

3️⃣ Built Prompt (2186 chars)

4️⃣ Simulated LLM Response (using oracle code)

5️⃣ Parsed Solution (258 chars)
   First 100 chars: class Solution:
    def minBitFlips(self, start: int, goal: int) -> int:
        s=bin(start)[2:].zf...

6️⃣ Memory Representations:
   Problem: Problem: minimum-bit-flips-to-convert-number
Bug Type: condition error
Descripti...
   Solution: Fixed Code:
class Solution:
    def minBitFlips(self, start: int, goal: int) -> ...


In [18]:
# Test validation (only if LEETCODE_SESSION is set)
print("\n7️⃣ Validation Step")
print("-" * 60)

if check_leetcode_session():
    print("⚠️ LeetCode session is set.")
    print("   To actually validate, uncomment the code below.")
    print("   WARNING: This will submit to LeetCode and count toward rate limits!")

    # NOTE: these lines are active in this run, so the cell submits to LeetCode.
    # Comment them out to skip the live submission.
    is_valid, feedback = domain.validate_solution(problem, solution)
    print(f"   Valid: {is_valid}")
    print(f"   Feedback: {domain.format_feedback(feedback)}")
else:
    print("❌ LEETCODE_SESSION not set - skipping validation")
    print("   Set the environment variable to enable real validation.")


7️⃣ Validation Step
------------------------------------------------------------
⚠️ LeetCode session is set.
   To actually validate, uncomment the code below.
   Valid: True
   Feedback: Solution accepted! Runtime: N/A, Memory: 19476000


---

## Part 7: Running with the Full SOFAI Framework 🚀

**Requirements:**
- Ollama running with a model
- LEETCODE_SESSION set (for validation)

Without a LeetCode session, the domain still runs, but every validation attempt fails.

In [19]:
# Check Ollama availability and ensure models are pulled
import subprocess

S1_MODEL = 'codegemma:2b'      # S1 LLM model
S2_MODEL = 'deepseek-r1:1.5b'  # S2 LRM model

def check_ollama():
    """Return True if the `ollama` CLI responds to `ollama list`."""
    try:
        result = subprocess.run(['ollama', 'list'], capture_output=True, text=True, timeout=5)
        if result.returncode == 0:
            print("✅ Ollama is available!")
            return True
    except (OSError, subprocess.SubprocessError):
        # Binary missing, not executable, or the call timed out
        pass
    print("❌ Ollama not available")
    return False

def ensure_model_available(model_name):
    """Check if model exists, pull if not."""
    try:
        result = subprocess.run(['ollama', 'list'], capture_output=True, text=True, timeout=10)
        if model_name in result.stdout:
            print(f"✅ Model '{model_name}' is available.")
            return True

        print(f"⬇️ Model '{model_name}' not found. Pulling...")
        pull_result = subprocess.run(['ollama', 'pull', model_name], timeout=600)
        if pull_result.returncode == 0:
            print(f"✅ Model '{model_name}' pulled successfully.")
            return True
    except Exception as e:
        print(f"❌ Error: {e}")
    return False

ollama_available = check_ollama()
if ollama_available:
    print("\n📦 Ensuring models are available...")
    ensure_model_available(S1_MODEL)
    ensure_model_available(S2_MODEL)


✅ Ollama is available!

📦 Ensuring models are available...
⬇️ Model 'codegemma:2b' not found. Pulling...
✅ Model 'codegemma:2b' pulled successfully.
⬇️ Model 'deepseek-r1:1.5b' not found. Pulling...
✅ Model 'deepseek-r1:1.5b' pulled successfully.


In [20]:
# ============================================================
# EXAMPLE 1
# ============================================================
if ollama_available:
    from core.metacognitive_module import MCModule

    print("=" * 60)
    print("🐛 EXAMPLE 1")
    print("=" * 60)

    # Create an EASY problem - simple syntax error
    domain = CodeDebuggingDomain()
    problem = domain.generate_problem(bug_type="missing colons")

    print(f"\n📊 Problem: {problem.slug}")
    print(f"   Bug Type: {problem.bug_type}")
    print(f"   Level: {problem.level}")

    mc = MCModule(
        domain=domain,
        s1_llm="codegemma:2b",
        max_iterations=5,
        s2_lrm="deepseek-r1:1.5b"
    )

    result = mc.solve(problem, verbose=True)

    print(f"\n✅ Solved by: {'S1 (LLM)' if result['s1_solved'] else 'S2 (LRM)' if result['s2_solved'] else 'None'}")
else:
    print("⚠️ Ollama not available - skipping Example 1")


🐛 EXAMPLE 1

📊 Problem: can-place-flowers
   Bug Type: missing colons
   Level: easy

Starting SOFAI solve process...

--- Iteration 1/5 ---
S1 response generated in 14.16s
✗ Invalid solution. Feedback: Status: Runtime Error
Error: SyntaxError: invalid syntax
              ^^^^^^^
    Corrected Python3 Implementation:
Line 1  (Solution.py)
Failed Test Case: [1,0,0,0,1]
1

--- Iteration 2/5 ---
S1 response generated in 0.76s
Waiting 14.2s before next LeetCode submission...
✗ Invalid solution. Feedback: Status: Runtime Error
Error: SyntaxError: invalid syntax
                    ^
    Feedback: Status: Runtime Error
Line 3  (Solution.py)
Failed Test Case: [1,0,0,0,1]
1

--- Iteration 3/5 ---
S1 response generated in 0.43s
Waiting 14.6s before next LeetCode submission...
✗ Invalid solution. Feedback: Status: Runtime Error
Error: NameError: name 'Ac' is not defined
    Ac
Line 2 in <module> (Solution.py)
Failed Test Case: [1,0,0,0,1]
1

--- Iteration 4/5 ---
S1 response generat

---

## Summary: Key Takeaways 🎯

1. **Code Debugging as CSP**: Fix buggy code to pass all test cases

2. **DebugBench Dataset**: The Python3 subset of a 4,253-instance benchmark, covering 17 bug types

3. **LeetCode Validation**: Real test execution for accurate feedback
   - Requires `LEETCODE_SESSION` environment variable
   - 15-second cooldown between submissions

4. **IO_INTENTION_PROMPT**: Structured format for LLM debugging
   - Expects `<code>...</code>` and `<exp>...</exp>` tags

5. **Feedback Loop**: Detailed error messages help LLM iterate

6. **Challenges**: Rate limiting, session expiry, complex logic bugs

---

## Next Steps

- Review the Graph Coloring notebook for comparison
- Try creating a custom domain using the templates
- Experiment with different LLM models for code debugging

Happy debugging! 🐛🔧