The first take-home project is a chance to get hands-on practice with the Week 1 concepts on an agent task of your choice.

Goal:

- Choose a task involving multi-turn interaction/tool use
- Implement an agent scaffold either using an API directly, or one of the frameworks we covered
- Create a small set of "test prompts"
- Create a "reward function" to evaluate your agent
- Test your agent setup with multiple models/prompts
- Examine multiple agent outputs, identify a consistent "problem", adjust the setup (prompts/tools) OR adjust your evals to measure/address the problem
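
The goal steps above (test prompts, a reward function, several agent setups) can be wired together roughly as below. This is a minimal sketch, not a required structure: `TEST_PROMPTS`, `contains_reward`, `evaluate`, and the two stand-in agents are hypothetical names invented for illustration, and the stand-ins return canned strings so the sketch runs without an API key.

```python
from statistics import mean
from typing import Callable

# Hypothetical test prompts: each pairs an input with a substring we expect in the answer.
TEST_PROMPTS = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "What is the capital of France?", "expected": "Paris"},
]

def contains_reward(response: str, expected: str) -> float:
    """Return 1.0 if the expected string appears in the response, else 0.0."""
    return 1.0 if expected.lower() in response.lower() else 0.0

def evaluate(agents: dict[str, Callable[[str], str]]) -> dict[str, float]:
    """Run every agent on every test prompt and return each agent's mean reward."""
    results = {}
    for name, agent in agents.items():
        scores = [
            contains_reward(agent(case["prompt"]), case["expected"])
            for case in TEST_PROMPTS
        ]
        results[name] = mean(scores)
    return results

# Stand-in agents (assumptions) so the sketch runs offline; swap in real model calls.
def strong_agent(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "The capital of France is Paris."

def weak_agent(prompt: str) -> str:
    return "I am not sure."

print(evaluate({"strong": strong_agent, "weak": weak_agent}))
```

Keeping the agent behind a plain `prompt -> response` callable makes it easy to swap models or system prompts without touching the evaluation loop.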

Ideas for agent tasks:
- Search agent for your favorite blog/website
- Agent which
- Agent for playing a simple board/card game
- Code agent specialized to only use a specific library, e.g. iterating on a matplotlib plot until it "looks right"
- Terminal-based chat agent with user handoff/confirmation

Ideas for reward functions:
- Format checks using regex
- Deterministic checks (parsing math answers, running code with test cases, solving a puzzle/game)
- Embedding or text overlap similarity to a "ground truth"
- LLM judges which can see the "ground truth"
- LLM judges which evaluate a set of fuzzy criteria + give scores for each
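
The first three ideas above need no model calls at all. The sketch below shows one possible version of each; the `ANSWER:` line format and all function names are assumptions made for illustration, and the word-set Jaccard score is a crude stand-in for embedding similarity.

```python
import re

def format_reward(response: str) -> float:
    """Format check: 1.0 if some line looks like 'ANSWER: <text>', else 0.0."""
    return 1.0 if re.search(r"^ANSWER:\s*\S+", response, flags=re.MULTILINE) else 0.0

def exact_number_reward(response: str, expected: float) -> float:
    """Deterministic check: compare the last number in the response to the expected value."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - expected) < 1e-9 else 0.0

def token_overlap_reward(response: str, ground_truth: str) -> float:
    """Text overlap: Jaccard similarity between the lowercase word sets of both texts."""
    a = set(re.findall(r"\w+", response.lower()))
    b = set(re.findall(r"\w+", ground_truth.lower()))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(format_reward("Let me think.\nANSWER: 42"))
print(exact_number_reward("The total is 3 + 4 = 7", 7))
print(token_overlap_reward("Paris is the capital", "The capital is Paris"))
```

Cheap checks like these are useful as a first filter; an LLM judge can then be reserved for the cases they cannot decide.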

Tips:
- Start simple, get a basic version working, then ramp up complexity
- If your agent "just works" with a fairly powerful model, try it with a weaker model and see what breaks


Bonus goals:
- Try making a "parallel-friendly" version using asyncio + error handling
- Try implementing Best-of-N selection -- can your eval function match your judgment for which outputs are "best"?
- Try testing either a "multi-agent" (parallelized) version of your agent, OR a Client/Server version (e.g. MCP, A2A)
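
The first two bonus goals combine naturally: sample N candidates concurrently, drop the ones that failed, and keep the candidate with the highest reward. The sketch below assumes a hypothetical `fake_agent` coroutine in place of a real model call (it fails deliberately for one seed to exercise the error handling) and a toy `reward` that counts a keyword; neither comes from any library.

```python
import asyncio

async def fake_agent(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for a model call; raises for seed 2 to simulate an API error."""
    await asyncio.sleep(0.01)
    if seed == 2:
        raise RuntimeError("simulated API failure")
    return f"draft {seed}: " + "detail " * seed

def reward(response: str) -> float:
    """Toy reward (assumption): more occurrences of 'detail' scores higher."""
    return float(response.count("detail"))

async def best_of_n(prompt: str, n: int, max_concurrency: int = 2) -> tuple[str, float, int]:
    """Sample n candidates concurrently; return (best response, its reward, number of failures)."""
    semaphore = asyncio.Semaphore(max_concurrency)  # cap the number of in-flight calls

    async def one_call(seed: int) -> str:
        async with semaphore:
            return await fake_agent(prompt, seed)

    # return_exceptions=True keeps one failed call from cancelling the whole batch
    results = await asyncio.gather(*(one_call(s) for s in range(n)), return_exceptions=True)
    candidates = [r for r in results if not isinstance(r, BaseException)]
    n_failed = len(results) - len(candidates)
    if not candidates:
        raise RuntimeError("all candidate calls failed")
    best = max(candidates, key=reward)
    return best, reward(best), n_failed

# In a notebook, where an event loop is already running, use `await best_of_n(...)` instead.
best, score, n_failed = asyncio.run(best_of_n("Summarize the lineage", n=4))
print(best, score, n_failed)
```

Comparing the selected candidate against the one you would have picked by hand is a quick way to check whether the reward function agrees with your own judgment.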


Deliverable:
- A repo, notebook, *or* short writeup detailing your setup + experimentation
- What approaches did you try?
- What roadblocks did you run into?
- Which evaluation methods worked best for your task?
- What's the smallest model that worked decently well?

In [2]:
# Simplest possible LLM call (OpenAI() reads the API key from the OPENAI_API_KEY environment variable)
from openai import OpenAI

oai = OpenAI()

response = oai.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(response.choices[0].message.content)

The capital of France is Paris.


In [3]:
import re
from openai import OpenAI

class SimpleLineageAgent:
    """Basic lineage agent using only LLM"""
    
    def __init__(self, model: str = "gpt-4.1-mini"):
        self.client = OpenAI()
        self.model = model
    
    def analyze_lineage(self, code: str, target_variable: str) -> str:
        """Analyze lineage using only LLM"""
        
        system_prompt = """You are a data engineer who needs to analyze the lineage of a target variable in order to refactor the code.
        
        Analyze the code and explain how the target variable was created:
        1. What source tables/files were used
        2. What operations were performed 
        3. The step-by-step data flow
        
        Be clear and concise."""
        
        user_prompt = f"""
        Variable to analyze: {target_variable}
        
        Code:
        ```
        {code}
        ```
        """
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0
        )
        
        return response.choices[0].message.content



In [4]:
# Simple test of the lineage agent with the default model

sample_code = '''
# Load data
transactions = spark.table("raw.transactions")
customers = spark.table("raw.customers")

# Filter recent transactions
recent_tx = transactions.filter(col("date") >= "2023-01-01")

# Group by customer
customer_totals = recent_tx.groupBy("customer_id").sum("amount")

# Join with customer names
final_result = customer_totals.join(customers, "customer_id")
'''
    
agent = SimpleLineageAgent()

print("=== Simple Lineage Analysis ===")
print(f"Code:\n{sample_code}")
print("\n" + "="*50)

# Test analyzing final_result
print("Analyzing 'final_result':")
result = agent.analyze_lineage(sample_code, "final_result")
print(result)

print("\n" + "-"*30)

# Test analyzing intermediate variable
print("Analyzing 'customer_totals':")
result2 = agent.analyze_lineage(sample_code, "customer_totals")
print(result2)

=== Simple Lineage Analysis ===
Code:

# Load data
transactions = spark.table("raw.transactions")
customers = spark.table("raw.customers")

# Filter recent transactions
recent_tx = transactions.filter(col("date") >= "2023-01-01")

# Group by customer
customer_totals = recent_tx.groupBy("customer_id").sum("amount")

# Join with customer names
final_result = customer_totals.join(customers, "customer_id")


Analyzing 'final_result':
1. Source tables/files used:
   - "raw.transactions" table
   - "raw.customers" table

2. Operations performed:
   - Load the two tables into Spark DataFrames.
   - Filter the transactions to keep only those from January 1, 2023, onwards.
   - Group the filtered transactions by customer_id and sum the amount for each customer.
   - Join the aggregated transaction totals with the customers table on customer_id.

3. Step-by-step data flow:
   - Load "raw.transactions" into `transactions`.
   - Load "raw.customers" into `customers`.
   - Filter `transactions` to 

In [5]:
# Same test with a smaller model (gpt-4.1-nano)

sample_code = '''
# Load data
transactions = spark.table("raw.transactions")
customers = spark.table("raw.customers")

# Filter recent transactions
recent_tx = transactions.filter(col("date") >= "2023-01-01")

# Group by customer
customer_totals = recent_tx.groupBy("customer_id").sum("amount")

# Join with customer names
final_result = customer_totals.join(customers, "customer_id")
'''
    
agent = SimpleLineageAgent(model="gpt-4.1-nano")

print("=== Simple Lineage Analysis ===")
print(f"Code:\n{sample_code}")
print("\n" + "="*50)

# Test analyzing final_result
print("Analyzing 'final_result':")
result = agent.analyze_lineage(sample_code, "final_result")
print(result)

print("\n" + "-"*30)

# Test analyzing intermediate variable
print("Analyzing 'customer_totals':")
result2 = agent.analyze_lineage(sample_code, "customer_totals")
print(result2)

=== Simple Lineage Analysis ===
Code:

# Load data
transactions = spark.table("raw.transactions")
customers = spark.table("raw.customers")

# Filter recent transactions
recent_tx = transactions.filter(col("date") >= "2023-01-01")

# Group by customer
customer_totals = recent_tx.groupBy("customer_id").sum("amount")

# Join with customer names
final_result = customer_totals.join(customers, "customer_id")


Analyzing 'final_result':
The target variable `final_result` was created through the following data flow:

1. **Source Tables:**
   - `raw.transactions`: Contains transaction data, including `customer_id`, `amount`, and `date`.
   - `raw.customers`: Contains customer details, including `customer_id` and customer name information.

2. **Operations Performed:**
   - Filtered `transactions` to include only recent transactions from January 1, 2023, onwards (`recent_tx`).
   - Aggregated `recent_tx` by `customer_id` to compute the total transaction amount per customer (`customer_totals`).
 

In [12]:
import re  # used by extract_sources_tool below

from agents import Agent, Runner, function_tool

@function_tool
async def extract_sources_tool(code: str) -> list[str]:
    """Extract data sources from code
    
    Args:
        code (str): The code to analyze
    
    Returns:
        list[str]: List of data sources found (files, tables, etc.)
    """
    patterns = [
        r'pd\.read_csv\s*\(\s*["\']([^"\']+)["\']',
        r'spark\.table\s*\(\s*["\']([^"\']+)["\']',
        r'pd\.read_parquet\s*\(\s*["\']([^"\']+)["\']'
    ]
    
    sources = []
    for pattern in patterns:
        matches = re.findall(pattern, code)
        sources.extend(matches)
    
    return list(set(sources))

@function_tool
async def extract_operations_tool(code: str) -> list[str]:
    """Extract operations from code
    
    Args:
        code (str): The code to analyze
    
    Returns:
        list[str]: List of operations found (filter, groupby, join, etc.)
    """
    operations = []
    
    if '.filter(' in code or '.query(' in code:
        operations.append('filter')
    if '.groupby(' in code or '.groupBy(' in code:
        operations.append('groupby') 
    if '.join(' in code or '.merge(' in code:
        operations.append('join')
    if '.sum(' in code:
        operations.append('sum')
    if '.mean(' in code:
        operations.append('mean')
        
    return operations

lineage_agent = Agent(
    model="gpt-4.1-nano",
    name="advanced_lineage_agent",
    
    # SYSTEM PROMPT: Instructions for HOW to behave
    instructions="""You are an expert data lineage analyst. Your job is to help users understand how their data flows through code.

    **Your approach:**
    1. Always use your tools to extract technical details first
    2. Think step-by-step about the data flow
    3. Explain things clearly for both technical and non-technical users
    4. Be thorough but concise
    
    **When analyzing code:**
    - First, use extract_sources_tool to find all data sources
    - Then, use extract_operations_tool to find all operations
    - Finally, synthesize this information into a clear explanation
    
    **Always think step-by-step before calling a tool.**
    
    Format your final answer with:
    - Sources: Where the data comes from
    - Operations: What happens to the data  
    - Flow: Step-by-step explanation""",
    
    # TOOLS: What the agent CAN DO
    tools=[extract_sources_tool, extract_operations_tool],
)

sample_code = '''
import pandas as pd

# Load data
sales = pd.read_csv("sales_data.csv")
customers = pd.read_csv("customers.csv")

# Process data
recent_sales = sales.query("date >= '2023-01-01'")
customer_totals = recent_sales.groupby('customer_id')['amount'].sum().reset_index()  # reset_index() yields a DataFrame, which supports .merge()
final_result = customer_totals.merge(customers, on='customer_id')
'''
    
prompt = f"""
Analyze the lineage of the variable 'final_result' in this code:

{sample_code}

I need to understand where the data comes from and how it's transformed.
"""

print("=== AGENT WITH SYSTEM PROMPT + TOOLS ===")
result = await Runner.run(lineage_agent, prompt)
print(result.final_output)

=== AGENT WITH SYSTEM PROMPT + TOOLS ===
**Sources:**
- The data for `sales` comes from the CSV file "sales_data.csv."
- The data for `customers` comes from the CSV file "customers.csv."

**Operations:**
- The `sales` data is filtered to include only recent sales from 2023-01-01 onwards.
- The filtered sales data (`recent_sales`) is grouped by `customer_id`, summing the `amount` for each customer, producing `customer_totals`.
- The `customer_totals` data is then merged with the `customers` data on `customer_id` to produce the final result.

**Flow:**
1. Data is loaded from two CSV files: sales and customers.
2. From the sales data, only recent sales (from 2023-01-01) are selected.
3. The recent sales are grouped by customer, summing the amounts to get total sales per customer.
4. These totals are merged with customer details, combining sales summaries with customer information.
5. The result, `final_result`, contains each customer’s details along with their total sales for 2023 onwards

In [18]:
from test_cases import TEST_CASES

def simple_reward(agent_response: str, expected_sources: list, expected_operations: list) -> float:
    """Simple reward function that scores agent response"""
    response_lower = agent_response.lower()
    
    # Count how many expected sources are mentioned
    sources_found = sum(1 for source in expected_sources 
                       if source.lower() in response_lower)
    source_score = sources_found / len(expected_sources) if expected_sources else 1.0
    
    # Count how many expected operations are mentioned  
    ops_found = sum(1 for op in expected_operations 
                   if op.replace('_', ' ').lower() in response_lower)
    op_score = ops_found / len(expected_operations) if expected_operations else 1.0
    
    return (source_score + op_score) / 2

for i, test_case in enumerate(TEST_CASES[:3]):  # First 3 cases
    print(f"Test {i+1}: {test_case['name']}")
    
    # Get agent response
    prompt = f"Analyze the lineage of variable '{test_case['target']}' in this code:\n\n{test_case['code']}"
    result = await Runner.run(lineage_agent, prompt)    

    # Calculate reward
    reward = simple_reward(result.final_output, 
                          test_case['expected_sources'], 
                          test_case['expected_operations'])
    
    print(f"  Reward Score: {reward:.2f}")
    print(f"  Expected: sources={test_case['expected_sources']}, ops={test_case['expected_operations']}")
    print(f"  Agent response (first 200 chars): {result.final_output[:200]}...")
    print("-" * 60)

Test 1: basic_pyspark_filter
  Reward Score: 1.00
  Expected: sources=['users'], ops=['filter']
  Agent response (first 200 chars): Let's analyze the data lineage step-by-step:

- **Sources**: The data for `filtered_df` ultimately originates from the "users" table, which is read into the variable `df`.

- **Operations**:
  - First...
------------------------------------------------------------
Test 2: basic_pandas_filter
  Reward Score: 1.00
  Expected: sources=['sales.csv'], ops=['filter']
  Agent response (first 200 chars): **Sources:**
- The data source for `high_sales` is the CSV file named "sales.csv".

**Operations:**
- The variable `high_sales` is created by filtering the DataFrame `df`.
- The filtering operation se...
------------------------------------------------------------
Test 3: pandas_groupby_merge
  Reward Score: 0.57
  Expected: sources=['orders.csv', 'customers.csv', 'products.csv'], ops=['read_csv', 'dropna', 'groupby', 'agg_sum', 'agg_count', 'reset_index', 'merge'
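
One possible contributor to the lower score on Test 3 is that `simple_reward` looks for labels such as `read_csv` or `agg_sum` as literal substrings (after replacing `_` with a space), while the agent tends to describe operations in plain English. A sketch of a more forgiving matcher follows; the `OP_ALIASES` table is entirely hypothetical and would need tuning per task, and substring matching can still produce false positives (for example `sum` inside `summary`).

```python
# Hypothetical alias table: maps an expected operation label to phrases that count as a mention.
OP_ALIASES = {
    "read_csv": ["csv file", "loaded from"],
    "agg_sum": ["sum", "total"],
    "agg_count": ["count", "number of"],
    "groupby": ["group by", "grouped by"],
    "merge": ["merged", "join"],
}

def op_mentioned(op: str, response: str) -> bool:
    """True if the label, its space-separated form, or any alias appears in the response."""
    text = response.lower()
    phrases = [op.lower(), op.replace("_", " ").lower()] + OP_ALIASES.get(op, [])
    return any(phrase in text for phrase in phrases)

def op_recall(response: str, expected_operations: list[str]) -> float:
    """Fraction of expected operations that the response mentions in some form."""
    if not expected_operations:
        return 1.0
    found = sum(op_mentioned(op, response) for op in expected_operations)
    return found / len(expected_operations)

response = "The orders are loaded from a CSV file, grouped by customer, and the amounts are summed."
print(op_recall(response, ["read_csv", "groupby", "agg_sum"]))
```

Whether to loosen the eval like this or to tighten the agent's output format instead is exactly the kind of trade-off the assignment asks you to explore.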

In [23]:
# LLM judge agent with the same tools as the lineage agent
judge_agent = Agent(
    model="gpt-4.1-nano",
    name="lineage_judge_agent",
    
    instructions="""You are an expert judge for data lineage analysis. Your job is to evaluate how well another agent analyzed code lineage.

    **Your approach:**
    1. First, use your tools to find the ground truth (actual sources and operations in the code)
    2. Compare the agent's response against both the ground truth AND the expected results
    3. Give a fair, objective score from 0-10
    
    **When judging:**
    - Use extract_sources_tool to find all actual data sources
    - Use extract_operations_tool to find all actual operations  
    - Compare what the agent found vs. what actually exists
    - Consider accuracy, completeness, and clarity
    
    **Always use your tools first, then provide your judgment.**
    
    Format your response as:
    SCORE: X/10
    EXPLANATION: Explain the score and list anything your tools found that was not in the expected results.""",
    
    tools=[extract_sources_tool, extract_operations_tool],
)

async def agent_judge(code: str, target: str, agent_response: str, expected_sources: list, expected_operations: list) -> float:
    """Use judge agent to evaluate the lineage agent's response"""
    
    judge_prompt = f"""Please evaluate this lineage analysis:

TARGET VARIABLE: {target}

CODE TO ANALYZE:
{code}

EXPECTED RESULTS:
- Sources: {expected_sources}
- Operations: {expected_operations}

AGENT'S RESPONSE:
{agent_response}

Please use your tools to find the ground truth, then score the agent's response from 0-10."""

    # Run the judge agent
    judge_result = await Runner.run(judge_agent, judge_prompt)
    
    # Extract score from judge response
    try:
        response_text = judge_result.final_output.upper()
        if "SCORE:" in response_text:
            score_line = [line for line in response_text.split('\n') if 'SCORE:' in line][0]
            score = float(score_line.split('SCORE:')[1].split('/')[0].strip())
            return score / 10.0  # Convert to 0-1
    except (IndexError, ValueError):
        pass  # Malformed score line; fall through to the default
    
    return 0.5  # Default if parsing fails

# Test judge agent on lineage agent responses
for i, test_case in enumerate(TEST_CASES[:2]):  # First 2 cases only (API cost)
    print(f"Judge Agent Test {i+1}: {test_case['name']}")
    
    # Get lineage agent response
    prompt = f"Analyze the lineage of variable '{test_case['target']}' in this code:\n\n{test_case['code']}"
    result = await Runner.run(lineage_agent, prompt)
    agent_output = result.final_output
    
    # Judge the response using judge agent
    score = await agent_judge(test_case['code'], test_case['target'], agent_output, 
                             test_case['expected_sources'], test_case['expected_operations'])
    
    print(f"  Judge Agent Score: {score:.2f}")
    print(f"  Lineage Agent said: {agent_output}")
    print("-" * 50)

Judge Agent Test 1: basic_pyspark_filter
  Judge Agent Score: 1.00
  Lineage Agent said: **Sources:**  
- The data for `filtered_df` originates from the data source `users`, which appears to be a table in a Spark environment.

**Operations:**  
- The main operation is a filter operation applied to the `df` DataFrame, selecting only the records where the `age` column value is greater than 18.

**Flow:**  
1. The data is loaded from the `users` table into the DataFrame `df`.  
2. The DataFrame `df` undergoes a filtering operation, resulting in `filtered_df`.  
3. `filtered_df` contains only the records from `users` where the `age` exceeds 18.  

This process narrows down the original `users` dataset to include only adult users.
--------------------------------------------------
Judge Agent Test 2: basic_pandas_filter
  Judge Agent Score: 0.80
  Lineage Agent said: **Sources:**  
The variable `high_sales` is derived from the data source `sales.csv`, which is a CSV file containing sales da
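
The score parsing in `agent_judge` returns 0.5 whenever the judge output cannot be parsed, which makes a parsing failure look like a mediocre score. A possible alternative is sketched below: a regex that tolerates markdown around the label and returns `None` on failure so the caller can count or retry those cases. `parse_judge_score` and `SCORE_PATTERN` are hypothetical names, not part of the notebook above.

```python
import re
from typing import Optional

# Matches 'SCORE: 8/10', 'Score: 7.5 / 10', or '**Score:** 9/10' (case-insensitive).
SCORE_PATTERN = re.compile(r"SCORE\W*(\d+(?:\.\d+)?)\s*/\s*10\b", re.IGNORECASE)

def parse_judge_score(judge_output: str) -> Optional[float]:
    """Return the judge score scaled to 0-1, or None if no valid 'SCORE: X/10' is found."""
    match = SCORE_PATTERN.search(judge_output)
    if match is None:
        return None
    score = float(match.group(1))
    if not 0.0 <= score <= 10.0:
        return None  # out-of-range scores are treated as parse failures
    return score / 10.0

print(parse_judge_score("SCORE: 8/10\nEXPLANATION: all sources found"))
print(parse_judge_score("The analysis looks fine."))
```

Tracking how often `None` comes back is itself a useful signal when comparing judge models or judge prompts.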

In [2]:
from tools import CodeAnalysisTools

sample_code = """
import pandas as pd

# Load data
sales = pd.read_csv("sales_data.csv")
customers = pd.read_csv("customers.csv")

# Process data
recent_sales = sales.query("date >= '2023-01-01'")
customer_totals = recent_sales.groupby('customer_id')['amount'].sum().reset_index()
final_result = customer_totals.merge(customers, on='customer_id')
"""

tools = CodeAnalysisTools()

print("🔧 TESTING CODE ANALYSIS TOOLS")
print("=" * 50)

print("📁 Sources found:")
sources = tools.extract_table_sources(sample_code)
print(sources)

print("\n📝 Variable assignments:")
assignments = tools.extract_variable_assignments(sample_code)
for var, lines in assignments.items():
    print(f"  {var}: lines {lines}")

print("\n⚙️ Operations found:")
operations = tools.extract_operations(sample_code)
for op in operations:
    print(f"  Line {op['line']}: {op['operation']} - {op['code']}")

print("\n🎯 Trace 'final_result':")
trace = tools.trace_variable_dependencies(sample_code, "final_result")
print(f"  Sources: {trace['sources']}")
print(f"  Operations: {trace['operations']}")
print(f"  Assignment lines: {trace['assignment_lines']}")

🔧 TESTING CODE ANALYSIS TOOLS
📁 Sources found:
['sales_data.csv', 'customers.csv']

📝 Variable assignments:
  sales: lines [5]
  customers: lines [6]
  recent_sales: lines [9]
  customer_totals: lines [10]
  final_result: lines [11]

⚙️ Operations found:
  Line 9: filter - recent_sales = sales.query("date >= '2023-01-01'")
  Line 10: groupby - customer_totals = recent_sales.groupby('customer_id')['amount'].sum().reset_index()
  Line 10: reset_index - customer_totals = recent_sales.groupby('customer_id')['amount'].sum().reset_index()
  Line 10: agg_sum - customer_totals = recent_sales.groupby('customer_id')['amount'].sum().reset_index()
  Line 10: groupBy - customer_totals = recent_sales.groupby('customer_id')['amount'].sum().reset_index()
  Line 10: sql_SUM - customer_totals = recent_sales.groupby('customer_id')['amount'].sum().reset_index()
  Line 11: merge - final_result = customer_totals.merge(customers, on='customer_id')

🎯 Trace 'final_result':
  Sources: ['sales_data.csv', 'custo

In [None]:
    # def compare_tool_vs_llm(self, code: str, target_variable: str) -> dict[str, Any]:  # needs `from typing import Any`
    #     """Compare tool analysis vs LLM analysis"""

    #     # Tool analysis
    #     tool_result = self.trace_variable_dependencies(code, target_variable)

    #     # Pure LLM analysis (without tools)
    #     system_prompt = """You are a code lineage analysis expert.

    #     Analyze the code to trace how the target variable was created:
    #     1. What source tables/files were used
    #     2. What operations were performed
    #     3. The step-by-step flow"""

    #     user_prompt = f"Target variable: {target_variable}\n\nCode:\n```\n{code}\n```"

    #     llm_response = self.client.chat.completions.create(
    #         model=self.model,
    #         messages=[
    #             {"role": "system", "content": system_prompt},
    #             {"role": "user", "content": user_prompt},
    #         ],
    #         temperature=0,
    #     )

    #     # Enhanced analysis (tools + LLM)
    #     enhanced_response = self.analyze_lineage_with_tools(code, target_variable)

    #     return {
    #         "tool_analysis": tool_result,
    #         "llm_only": llm_response.choices[0].message.content,
    #         "enhanced_analysis": enhanced_response,
    #     }