In [5]:
import os
from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables from .env file
load_dotenv()

client = OpenAI()

content = """
type x = int | string
let x = 35 
if x is int {
    print("howdy do")
} else {
    print("dandy")
}

what does this program do? break it down into a syntax tree in a simple markdown format,
then print the answer. 
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": content}]
)

print(response.choices[0].message.content)

Sure! Let's break down the provided code into a syntax tree format first, and then we will analyze what the program does.

### Syntax Tree

```
Program
├── TypeDeclaration
│   └── Type: x
│       └── UnionType
│           ├── int
│           └── string
├── VariableDeclaration
│   ├── Variable: x
│   └── Value: 35
└── IfStatement
    ├── Condition
    │   └── TypeCheck
    │       ├── Variable: x
    │       └── Type: int
    ├── ThenBranch
    │   └── PrintStatement
    │       └── "howdy do"
    └── ElseBranch
        └── PrintStatement
            └── "dandy"
```

### Explanation of the Program

1. **Type Declaration:** 
   - `type x = int | string` declares a type `x` that can either be an `int` or a `string`.

2. **Variable Declaration:**
   - `let x = 35` creates a variable `x` and initializes it with the value `35`, which is an integer.

3. **If Statement:**
   - The program checks if `x` is of type `int`.
   - Since `x` is initialized with the value `35`, it is indeed an integer

Holy shit, that worked? I wonder how consistent that is.

In [6]:
# Try a different model

response = client.chat.completions.create(
    model="gpt-4.1-nano-2025-04-14",
    messages=[{"role": "user", "content": content}]
)

print(response.choices[0].message.content)

### Syntax Tree in Markdown Format

```markdown
Program
├── Type Definition
│   └── type x = int | string
├── Variable Declaration & Initialization
│   └── let x = 35
├── If Statement
│   ├── Condition
│   │   └── x is int
│   └── Then Branch
│       └── print("howdy do")
│   └── Else Branch
│       └── print("dandy")
```

### Explanation:

1. **Type Definition:**  
   Defines a variant type `x` which can be either `int` or `string`.  
   ```ocaml
   type x = int | string
   ```
   
2. **Variable Declaration & Initialization:**  
   Binds the variable `x` to the integer value `35`.  
   ```ocaml
   let x = 35
   ```
   
3. **Conditional Check:**  
   Checks if `x` is of type `int`.  
   ```ocaml
   if x is int { ... }
   ```
   
4. **Then Branch:**  
   Since `x` is indeed an `int` (it's 35), it executes `print("howdy do")`.  
   
5. **Else Branch:**  
   Not executed in this case, but if `x` had been a `string`, it would have printed `"dandy"`.

### What the program does:

- It checks

In [7]:

response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[{"role": "user", "content": content}]
)

print(response.choices[0].message.content)

```
- Declare a type x that can be either an int or a string
- Assign the value 35 to the variable x
- Check if x is of type int
  - If so, print "howdy do"
  - If not, print "dandy"
```

The program checks if the value of x is an integer. Since 35 is an integer, it will print "howdy do".


Wow, even GPT-3.5 can do it. This might be a fun project: write some basic pseudocode examples and see whether the LLM can evaluate them. It's not really an agent though, so maybe we could give it a tool to evaluate expressions when it needs to. We could instruct the agent to parse the program into a syntax tree, then have it "compile" that tree into Python. Since I don't fully trust this, I'll check the output before evaluating the resulting program.

Perhaps the tool would be a REPL that the agent could use to check its work, test things, or even do "compile time" const evaluation: whatever it wants to do to figure out the result of a provided program.
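As a rough idea of what such a tool could look like, here is a minimal sketch of a `py_eval` helper that runs a string of Python and hands back the value of the last expression, REPL style. The name `py_eval` and its behaviour are my own assumptions at this point, and it uses plain `exec`/`eval` with no sandboxing, so it should only ever see code I have read first.

```python
import ast


def py_eval(code: str):
    """Run a string of Python and return the value of its last expression, or None.

    Hypothetical helper: this is NOT a sandbox, it executes whatever it is given.
    """
    tree = ast.parse(code)
    namespace: dict = {}
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        # Split off the trailing expression so its value can be returned.
        last = ast.Expression(tree.body.pop().value)
        exec(compile(tree, "<py_eval>", "exec"), namespace)
        return eval(compile(last, "<py_eval>", "eval"), namespace)
    # No trailing expression: just run the statements.
    exec(compile(tree, "<py_eval>", "exec"), namespace)
    return None


print(py_eval("def add(a, b):\n    return a + b\n\nadd(1, 2)"))  # prints 3
```

Returning the last expression's value keeps the tool close to how a REPL behaves, which is probably what a model expects when it writes `add(1, 2)` on the final line.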

In [8]:
models = [
    "gpt-4o-mini",
    "gpt-4.1-2025-04-14",
    "gpt-4.1-nano-2025-04-14",
    "gpt-3.5-turbo-0125",
]

program1 = """
<program>
let x = 1 
let y = 2 
let z = x + y 

print(z)
</program>


what does this program do? break it down into a syntax tree in a simple markdown format,
then print the answer. 
"""

for model in models:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": program1}]
    )
    print(f"{model}: {response.choices[0].message.content}")


gpt-4o-mini: The provided program performs a simple addition operation using variables and then prints the result. Below is a breakdown of the program's syntax tree in a simple markdown format:

```markdown
Program
├── Variable Declaration
│   ├── let x = 1
│   ├── let y = 2
│   └── let z = x + y
└── Print Statement
    └── print(z)
```

### Explanation:
1. **Variable Declarations**:
   - `let x = 1`: A variable `x` is declared and initialized to `1`.
   - `let y = 2`: A variable `y` is declared and initialized to `2`.
   - `let z = x + y`: A variable `z` is declared and initialized to the sum of `x` and `y`, which is `1 + 2`.

2. **Print Statement**:
   - The program then outputs the value of `z`, which, after the addition, is `3`.

### Final Output:
The final output of the program is:

```
3
```
gpt-4.1-2025-04-14: Absolutely! Let’s break down the program step by step.


## What does the program do?

This program creates three variables (`x`, `y`, `z`).  
- `x` is assigned the value 

They all did it. How does it work? Is it really doing computation?


Let's give it a harder program. Come up with a pseudocode Fibonacci function.


In [15]:
system_prompt = """
what does the following program do? break it down into a syntax tree in a simple markdown format,
then print the answer. 

if you need help evaluating the program, you can convert portions of the program or all of it to python and evaluate it using the `py_eval` tool:

`py_eval` is a tool that can be used to evaluate python code. it takes a string of python code and returns the result.

here's an example of how to use the `py_eval` tool:

<tool name="py_eval">
    <parameter name="code" type="string">
        def add(a, b):
            return a + b

        add(1, 2)
    </parameter>
</tool>

the value returned by the tool is the result of the code. in this case, it would be 3.

The following is a program that you should evaluate:
"""


program2 = """
<program>
fn fib(n: int) -> int: {
    if n <= 1 {
        return n;
    }

    return fib(n - 1) + fib(n - 2);
}

let a = fib(10)

print(a)

</program>
"""


content = system_prompt + program2
print(content)

def run_program(program):
    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": program}]
        )
        print(f"{model}: {response.choices[0].message.content}")

run_program(content)



what does the following program do? break it down into a syntax tree in a simple markdown format,
then print the answer. 

if you need help evaluating the program, you can convert portions of the program or all of it to python and evaluate it using the `py_eval` tool:

`py_eval` is a tool that can be used to evaluate python code. it takes a string of python code and returns the result.

here's an example of how to use the `py_eval` tool:

<tool name="py_eval">
    <parameter name="code" type="string">
        def add(a, b):
            return a + b

        add(1, 2)
    </parameter>
</tool>

the value returned by the tool is the result of the code. in this case, it would be 3.

The following is a program that you should evaluate:

<program>
fn fib(n: int) -> int: {
    if n <= 1 {
        return n;
    }

    return fib(n - 1) + fib(n - 2);
}

let a = fib(10)

print(a)

</program>

gpt-4o-mini: To evaluate the provided program, we'll first break it down into a simple syntax tree. The

OK, let's take a look at how well it did with calling the tool. Here's one of the generated outputs (from gpt-4o-mini):

In [16]:
def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

a = fib(10)
print(a)

55
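I copied that code out of the reply by hand. A small sketch of pulling it out automatically is below, assuming the model reproduces the `<tool name="py_eval">` format from the prompt exactly (it may not, and this does not handle escaped characters or malformed tags). The function name `extract_py_eval_calls` is made up for illustration.

```python
import re
import textwrap

# Matches the tool-call layout shown in the system prompt.
TOOL_CALL = re.compile(
    r'<tool name="py_eval">\s*<parameter name="code"[^>]*>\n?(.*?)</parameter>\s*</tool>',
    re.DOTALL,
)


def extract_py_eval_calls(text: str) -> list[str]:
    """Return the dedented code from every py_eval tool call found in a model reply."""
    return [textwrap.dedent(m.group(1)).strip() for m in TOOL_CALL.finditer(text)]


reply = (
    "Let me check with the tool.\n"
    '<tool name="py_eval">\n'
    '    <parameter name="code" type="string">\n'
    "        def add(a, b):\n"
    "            return a + b\n"
    "\n"
    "        add(1, 2)\n"
    "    </parameter>\n"
    "</tool>\n"
)
for code in extract_py_eval_calls(reply):
    print(code)
```

The dedent step matters because the prompt's example indents the code inside the tag, and Python would reject it as written.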


Let's see if we can get the tool calls cleaned up, add a tool for submitting a response, and add a simple evaluation. This will complete the assignment.

First we need to define a tool-call response in Pydantic, then fix up the system prompt formatting to match the tool.

In [18]:
from pydantic import BaseModel 
#question: is this a tool call description or a response format? 
class ProgramEvalResponse(BaseModel):
    """
    A response to a program evaluation request. Usage is as follows:

    - think: a string of your thoughts about the program
    - ast_markdown: a markdown string of the AST of the program
    - program: a string of the python program that you think is correct
    - result: a string of the result of the program. Format it as a typed variable with the correct type of the result. Ie, if the program returns an int, format it as `result: int = 1`
    """
    think: str
    ast_markdown: str
    program: str 
    result: str

def call_model(model: str, system_prompt: str, user_prompt: str) -> ProgramEvalResponse | None:
    response = client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        response_format=ProgramEvalResponse
    )
    return response.choices[0].message.parsed

# Named `schema` rather than `format` to avoid shadowing the built-in format()
schema = ProgramEvalResponse.model_json_schema()

print(schema)

{'description': 'A response to a program evaluation request. Usage is as follows:\n\n- think: a string of your thoughts about the program\n- ast_markdown: a markdown string of the AST of the program\n- program: a string of the python program that you think is correct\n- result: a string of the result of the program. Format it as a typed variable with the correct type of the result. Ie, if the program returns an int, format it as `result: int = 1`', 'properties': {'think': {'title': 'Think', 'type': 'string'}, 'ast_markdown': {'title': 'Ast Markdown', 'type': 'string'}, 'program': {'title': 'Program', 'type': 'string'}, 'result': {'title': 'Result', 'type': 'string'}}, 'required': ['think', 'ast_markdown', 'program', 'result'], 'title': 'ProgramEvalResponse', 'type': 'object'}


In [19]:
system_prompt_with_pydantic = """
what does the following program do? break it down into a syntax tree in a simple markdown format,
then print the answer. 
"""

response = call_model(
    model="gpt-4o-mini",
    system_prompt=system_prompt_with_pydantic,
    user_prompt=program2
)

print(response)

think='The program defines a recursive function to compute the Fibonacci number of a given integer n. It checks if n is less than or equal to 1, returning n in that case (since fib(0) is 0 and fib(1) is 1). For values greater than 1, it recursively calculates the Fibonacci number by summing the results of fib(n-1) and fib(n-2). It then calls this function with the argument 10, stores the result in variable a, and prints this result. The expected output for fib(10) is 55.' ast_markdown='FunctionDeclaration\n ├─ FunctionName: fib\n ├─ Parameters\n │  └─ Parameter\n │     ├─ Name: n\n │     └─ Type: int\n ├─ ReturnType: int\n └─ Body\n    ├─ IfStatement\n    │  ├─ Condition: n <= 1\n    │  └─ Body\n    │     └─ ReturnStatement\n    │        └─ Value: n\n    └─ ReturnStatement\n       └─ Value\n          ├─ FunctionCall: fib\n          │  ├─ Argument: n - 1\n          └─ Addition\n             ├─ Value: fib(n - 1)\n             └─ Value: fib(n - 2)\n ├─ VariableDeclaration\n │  ├─ Variable

In [20]:
if response is not None:
    print(response.think)
    print(response.ast_markdown)
    print(response.program)
    print(response.result)
else:
    print("No response")

The program defines a recursive function to compute the Fibonacci number of a given integer n. It checks if n is less than or equal to 1, returning n in that case (since fib(0) is 0 and fib(1) is 1). For values greater than 1, it recursively calculates the Fibonacci number by summing the results of fib(n-1) and fib(n-2). It then calls this function with the argument 10, stores the result in variable a, and prints this result. The expected output for fib(10) is 55.
FunctionDeclaration
 ├─ FunctionName: fib
 ├─ Parameters
 │  └─ Parameter
 │     ├─ Name: n
 │     └─ Type: int
 ├─ ReturnType: int
 └─ Body
    ├─ IfStatement
    │  ├─ Condition: n <= 1
    │  └─ Body
    │     └─ ReturnStatement
    │        └─ Value: n
    └─ ReturnStatement
       └─ Value
          ├─ FunctionCall: fib
          │  ├─ Argument: n - 1
          └─ Addition
             ├─ Value: fib(n - 1)
             └─ Value: fib(n - 2)
 ├─ VariableDeclaration
 │  ├─ VariableName: a
 │  └─ Initializer
 │     └─ Functi

That kind of works. For the `program` field it just copied the original, so I might need to iterate on that. For now, I think it's better to focus on whether the result value is correct, since we can evaluate that easily.
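Checking the result could look something like the sketch below. It assumes the model sticks to the `result: <type> = <value>` format requested in the `ProgramEvalResponse` docstring and that the value is a plain Python literal. `parse_result` and `is_correct` are hypothetical helper names.

```python
import ast
import re

# "result: int = 3" -> type "int", value "3"
RESULT_RE = re.compile(r"\s*result\s*:\s*(?P<type>[^=]+?)\s*=\s*(?P<value>.+)", re.DOTALL)


def parse_result(result: str):
    """Split 'result: <type> = <value>' into (type_name, python_value)."""
    match = RESULT_RE.fullmatch(result)
    if match is None:
        raise ValueError(f"unexpected result format: {result!r}")
    # literal_eval only accepts literals, so nothing gets executed here.
    return match["type"], ast.literal_eval(match["value"].strip())


def is_correct(result: str, expected) -> bool:
    """True when the parsed value equals the expected value; False on any parse failure."""
    try:
        _, value = parse_result(result)
    except (ValueError, SyntaxError):
        return False
    return value == expected


print(parse_result("result: int = 3"))   # prints ('int', 3)
print(is_correct("result: int = 3", 3))  # prints True
```

Treating a malformed result as simply wrong keeps the check strict: a model that ignores the format fails the test instead of crashing the run.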

Let's come up with some example programs and their expected outputs.

In [21]:
program_test_1 = ("""
<program>
let x = 1 
let y = 2 
let z = x + y 

print(z)
</program>
""", 3)

program_test_2 = ("""
<program>
# a program that computes the greatest common divisor of two numbers

fn gcd(a: int, b: int) -> int:
    if b == 0:
        return a
    return gcd(b, a % b)

let a = 10
let b = 5
let result = gcd(a, b)

print(result)
</program>
""", 5)


res_1 = call_model("gpt-4o-mini", system_prompt, program_test_1[0])
res_2 = call_model("gpt-4o-mini", system_prompt, program_test_2[0])

if res_1 is not None:
    print(res_1.model_dump_json(indent=2)) 
    print("result: ", res_1.result)
    print("Expected: ", program_test_1[1])

if res_2 is not None:
    print(res_2.model_dump_json(indent=2))
    print("result: ", res_2.result)
    print("Expected: ", program_test_2[1])


{
  "think": "The program initializes two variables, `x` and `y`, with values 1 and 2 respectively. It then computes the sum of `x` and `y`, storing it in variable `z`. Finally, it prints the value of `z`, which should be the sum of 1 and 2.",
  "ast_markdown": "```plaintext\nProgram\n └─ LetStatement\n     ├─ VariableDeclaration (x)\n     │  └─ NumberLiteral (1)\n     ├─ VariableDeclaration (y)\n     │  └─ NumberLiteral (2)\n     └─ VariableDeclaration (z)\n        └─ BinaryExpression ( + )\n           ├─ Identifier (x)\n           └─ Identifier (y)\n └─ CallExpression (print)\n    └─ Identifier (z)\n```",
  "program": "x = 1\n y = 2\n z = x + y\n print(z)",
  "result": "result: int = 3"
}
result:  result: int = 3
Expected:  3
{
  "think": "This program defines a function to compute the greatest common divisor (GCD) using recursion. It then creates two variables, a and b, assigns them values, and calls the GCD function with those values. Finally, it prints the result of the GCD calcul

That's pretty good.

Areas for improvement:
- Make an automated testing framework
- Hook up tools the agent can use in a multi-turn scenario, such as a Python REPL or a linter.
- Get really crazy with the syntax of the pseudocode input
- Try more model types
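
For the first item, a very small harness might be enough to start with. The sketch below is only an assumption about how it could be shaped: `run_suite` takes `(program, expected)` pairs like `program_test_1` and a `predict` callable that returns the model's result string, and the stub `fake_predict` stands in for a real model call so the example runs offline.

```python
from typing import Callable, Optional


def run_suite(
    cases: list[tuple[str, object]],
    predict: Callable[[str], Optional[str]],
) -> list[tuple[int, bool, Optional[str]]]:
    """Run predict on each program and compare its answer with the expected value."""
    report = []
    for index, (program, expected) in enumerate(cases, start=1):
        answer = predict(program)
        # Take whatever follows the last "=" so "result: int = 3" compares as "3".
        got = answer.rsplit("=", 1)[-1].strip() if answer else None
        report.append((index, got == str(expected), answer))
    return report


def fake_predict(program: str) -> str:
    """Stub standing in for a model call; always answers 3."""
    return "result: int = 3"


cases = [("let z = 1 + 2; print(z)", 3), ("print(gcd(10, 5))", 5)]
for index, ok, answer in run_suite(cases, fake_predict):
    print(f"case {index}: {'PASS' if ok else 'FAIL'} ({answer})")
```

In the notebook, `predict` could wrap `call_model` and return the `.result` field of its response. The string comparison is crude (a quoted string result would not match an unquoted expected value), so it is a starting point rather than a real grader.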