<img src="http://wandb.me/logo-im-png" width="400" alt="Weights & Biases" />
<!--- @wandbcode{rag-hackercup} -->


# Introduction


<a target="_blank" href="https://colab.research.google.com/github/HackerCupAI/starter-kits/blob/master/rag/demo.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


In this notebook, we will build a few Code Generation agents for the [HackerCup AI](https://hackercupai.github.io/) challenge.

We will build three different agents using different techniques and evaluate them using [W&B Weave](https://weave-docs.wandb.ai/).


<img src="https://raw.githubusercontent.com/wandb/weave/master/docs/static/img/evals-hero.png" width="800" height="450">

A more detailed walkthrough of the approach used in this notebook can be found in the following YouTube video.
Hint: Click on the image to watch the video 😎

<a target="_blank" href="https://www.youtube.com/watch?v=cObBj2UpWK8">
<img src="https://img.youtube.com/vi/cObBj2UpWK8/0.jpg" width="600" height="450">
</a>

## Weave


Weave is a lightweight toolkit for tracking and evaluating LLM applications, built by Weights & Biases. We will use Weave to trace and evaluate the different agents we build.

Our goal is to bring rigor, best-practices, and composability to the inherently experimental process of developing AI applications, without introducing cognitive overhead.

If you want to learn more about Weave, you can [get started](https://weave-docs.wandb.ai/quickstart) by decorating Python functions with `@weave.op`.

## Setup 

**Note: You need to run this cell only once.**
We will clone the starter-kits repo, set the `rag` folder as our working directory, and install the dependencies for the project.

**You can comment out the cell after you have run it once.**

In [None]:
# Clone the starter-kits repo
!git clone https://github.com/HackerCupAI/starter-kits
# Change directory to the rag folder. Running the next line twice in the same session will raise an error.
%cd starter-kits/rag
# Install dependencies
!pip install -r requirements.txt

In [None]:
import weave

# Weave Setup

WEAVE_PROJECT = "hackercup"  # REPLACE WITH YOUR PROJECT NAME
weave_client = weave.init(WEAVE_PROJECT)

## Dataset
We will use the [HackerCup dataset](https://huggingface.co/datasets/hackercupai/hackercup) in this notebook, specifically the **practice** dataset from the **2023** season.

We have already processed the dataset and saved it as a [`weave.Dataset`](https://weave-docs.wandb.ai/guides/core-types/datasets/). You can either use the Dataset by running the next cell or download the dataset using the instructions below.

We will use the dataset to load some practice problems and solutions from the HackerCup dataset and evaluate our agents on it.

In [None]:
from utils import (FAST_LLM, STRONG_LLM, Problem, Solution, async_client,
                   check_correctness, format_response)

practice_dataset_uri = "weave:///parambharat/hackercup/object/practice_dataset:R35fXf9N3FE2IOesg7bRPaPAxiE9YbpirhXO9HcHs8w"
problems_dataset = weave.ref(practice_dataset_uri).get().rows[:]
problems = list(map(lambda x: Problem(**x), problems_dataset))
problem = problems[0]
print("Sample Problem:\n\n", problem.model_dump_json(indent=2))

Alternatively, you can download the dataset by running the download script from the [submit-first-solution](https://github.com/HackerCupAI/starter-kits/tree/main/submit_first_solution) starter kit with the following command:

```bash
python download.py --year 2023 --dataset_folder data
```


This should create a `dataset` folder with the problems and solutions.

Here's an example of what the data looks like for the `dim_sum_delivery` problem from the `2023` season:

```
data/dataset/2023/practice
...
├── dim_sum_delivery.cpp
├── dim_sum_delivery.in
├── dim_sum_delivery.md
├── dim_sum_delivery.out
├── dim_sum_delivery_sample_input.txt
├── dim_sum_delivery_sample_output.txt
├── dim_sum_delivery_sol.md
...
```

Each problem has an `in`, `out`, `md`, `cpp`, and `sol` file, plus sample input and output files:

- The `in` file contains the input data for the problem.
- The `out` file contains the expected output for the problem.
- The `md` file contains the problem statement.
- The `cpp` file contains the source code of the solution.
- The `sol` file contains the detailed solution to the problem.
- The `sample_input.txt` and `sample_output.txt` files contain the sample input and output for the problem. These are the test cases that will be available to the agent during development and evaluation.
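
To make the file layout concrete, here is a minimal sketch that collects the files belonging to one problem. The helper `read_problem_files` and the field names are hypothetical and are not part of the starter kit; they only assume the naming pattern shown in the listing above.

```python
from pathlib import Path

# Hypothetical mapping from a field name to the file suffix seen in the listing.
SUFFIXES = {
    "statement": ".md",
    "input": ".in",
    "output": ".out",
    "solution_code": ".cpp",
    "solution_notes": "_sol.md",
    "sample_input": "_sample_input.txt",
    "sample_output": "_sample_output.txt",
}


def read_problem_files(folder: Path, name: str) -> dict:
    """Return a mapping of field name to file contents for the files that exist."""
    contents = {}
    for field, suffix in SUFFIXES.items():
        path = folder / f"{name}{suffix}"
        if path.exists():
            contents[field] = path.read_text()
    return contents
```

With the downloaded data in place, a call such as `read_problem_files(Path("data/dataset/2023/practice"), "dim_sum_delivery")` would return the statement, inputs, outputs, and reference solution for that problem.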

In [None]:
import asyncio
import logging

from nest_asyncio import apply

apply()

# Some logging to see the progress
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
)

logger = logging.getLogger(__name__)

## Zero-shot Agent

For our first agent, we will use a `zero-shot solver`.
It's a simple LLM API call with a detailed prompt to solve the problem.

But first, we need to load the problems into a more structured format and define a way to run the generated code and evaluate the solution.

We'll start by loading some utilities. While we load several of them, the ones we care about most are `load_problem` and `check_correctness`.

The `load_problem` function will load a problem from our dataset into a more structured format.
The `check_correctness` function will run the generated code and evaluate the solution against the expected output for the sample test cases.
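
To give a feel for what such a check involves, here is a minimal sketch that runs a Python program in a subprocess and compares its output with the expected output. The function `run_and_compare` and its return strings are assumptions made for illustration; the actual `check_correctness` in `utils` may work differently, and this sketch does no sandboxing.

```python
import subprocess
import sys


def run_and_compare(code: str, input_data: str, expected_output: str, timeout: float) -> str:
    """Run `code` in a fresh Python process and compare its stdout to the expected output."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            input=input_data,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timed out"
    if completed.returncode != 0:
        # The program crashed; surface its error output.
        return f"failed: {completed.stderr.strip()}"
    if completed.stdout.strip() == expected_output.strip():
        return "passed"
    return f"wrong answer: got {completed.stdout.strip()!r}"


print(run_and_compare("print(input()[::-1])", "abc\n", "cba", timeout=5))
```

Running untrusted, model-generated code this way is only reasonable in a throwaway environment such as a Colab session.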

In [None]:
import getpass
import os

# Set the OpenAI API KEY for this session
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [None]:
# Simple check to see if the code evaluation works
# We will use this to check the programs our agents generate

program_code = "print('hello, world!')"
input_data = ""
expected_output = "hello, world!"
timeout = 2

test_result = check_correctness(program_code, input_data, expected_output, timeout)
print("Example 1: ", test_result)
test_result = check_correctness("print('goodbye')", input_data, "hi there", timeout)
print("Example 2: ", test_result)

Now that we have a way to load a problem and evaluate a solution, let's define a prompt and create a simple agent to solve the problem.

Here's one such prompt. It contains instructions for the model on how to solve the problem and the response format we expect from the model. Feel free to tweak the prompt if you like, but this should work decently well for our use case.

In [None]:
from agent import SOLVER_INSTRUCTIONS

print(SOLVER_INSTRUCTIONS)

**Note**: Here we have defined a `Solution` model to enforce the format of the response we expect from the model.
If you change the `SOLVER_INSTRUCTIONS`, you need to change the `Solution` model to enforce the new format.
We use `format_response` to enforce the format of the response we expect from the model.

In [None]:
@weave.op
async def draft_solution(
        problem: Problem, model: str = FAST_LLM, temperature: float = 0.0
) -> Solution:
    user_prompt = f"""{problem.as_xml}
---
Let's think step by step to solve the problem:
"""

    response = await async_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SOLVER_INSTRUCTIONS},
            {"role": "user", "content": user_prompt},
        ],
        response_model=None,
        temperature=temperature,
    )
    formatted_response = await format_response(
        response.choices[0].message.content, Solution
    )
    return formatted_response

With the main solution drafter ready, we can define the `zero_shot_solver` agent.
The agent will use the `draft_solution` function to draft a solution and the `check_correctness` function to check the correctness of the generated solution and return the result.



In [None]:
@weave.op
async def zero_shot_solver(
        problem: Problem, model: str = FAST_LLM, temperature: float = 0.0, timeout: int = 10
) -> dict:
    logger.info("Drafting initial zero-shot solution")
    solution = await draft_solution(
        problem=problem,
        model=model,
        temperature=temperature,
    )
    test_report = check_correctness(
        solution.source_code, problem.sample_input, problem.sample_output, timeout
    )
    logger.info(f"Draft solution result: {repr(test_report)}")
    return {"solution": solution, "test_report": test_report, "stage": "zero-shot"}

In [None]:
# test the zero-shot agent on the sample problem
zero_shot_result = await zero_shot_solver(problem)
print("*" * 80)
print(zero_shot_result["solution"].source_code)
print("*" * 80)
print(zero_shot_result["test_report"])

Let's build a simple evaluation using Weave to evaluate the zero-shot agent.
You'll quickly see how this simple evaluation framework can become very powerful and scale to very complex workflows.
Our agent already takes care of running the code, evaluating the solution against the expected output for the sample test cases, and returning the report in the model output.
We expect the `test_report` in the agent output to be `"passed"`, so we can use that to evaluate the agent.

But first we need to load all the problems and convert them to a more structured format. A good agent should be able to handle all the problems in the dataset.

In [None]:
# This is a simple depiction of the evaluation.
# We expect the output to be `"passed"` for all the problems if the agent is working correctly.
examples = [{"problem": problem, "expected": "passed"} for problem in problems]


# A simple scorer that checks if the code generated by the agent passed the test case
@weave.op
def scorer(expected: str, model_output: dict) -> dict:
    return {"passed": expected == model_output["test_report"]}


# This is a simple evaluation that checks if the code generated by the agent passed the test
eval = weave.Evaluation(dataset=examples, scorers=[scorer])
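
To see what the scorer measures without running Weave, the same comparison can be applied to a handful of outputs and averaged into a pass rate. This is a plain-Python sketch: `score`, `pass_rate`, and the sample outputs are made up for illustration and are not how Weave aggregates results internally.

```python
def score(expected: str, model_output: dict) -> dict:
    """Same comparison as the scorer: did the test report match the expected value?"""
    return {"passed": expected == model_output["test_report"]}


def pass_rate(outputs: list[dict], expected: str = "passed") -> float:
    """Fraction of agent outputs whose test report equals `expected`."""
    if not outputs:
        return 0.0
    scores = [score(expected, output)["passed"] for output in outputs]
    return sum(scores) / len(scores)


# Made-up agent outputs, for illustration only.
fake_outputs = [
    {"test_report": "passed"},
    {"test_report": "failed"},
    {"test_report": "passed"},
    {"test_report": "timed out"},
]
print(pass_rate(fake_outputs))
```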

Now we are ready to evaluate the zero-shot agent.
We will create a `weave.Model` instance for the zero-shot agent.
This will help us conduct robust experiments and comparisons by helping us track various settings and parameters for the agent.
For now, we will focus on the `LLM` and the `temperature` for the model.


In [None]:
# Nothing fancy here, just a model that takes in a problem and returns a solution


class ZeroshotAgent(weave.Model):
    model: str = FAST_LLM
    temperature: float = 0.0
    timeout: int = 30

    @weave.op
    async def predict(self, problem: Problem):
        return await zero_shot_solver(
            Problem(**problem),
            model=self.model,
            temperature=self.temperature,
            timeout=self.timeout,
        )

In [None]:
# Evaluate the zero shot agent for all the models and temperatures
eval_models = [FAST_LLM, STRONG_LLM]
eval_temperatures = [0.0, 0.5, 1.0]
tasks = []
for LLM in eval_models:
    for temperature in eval_temperatures:
        zeroshot_agent = ZeroshotAgent(model=LLM, temperature=temperature, timeout=30)
        zeroshot_results = eval.evaluate(zeroshot_agent)
        tasks.append(zeroshot_results)

# Phew that's 2(models)*3(temps)*5(problems) = 30 evaluations

zeroshot_results = await asyncio.gather(*tasks)

Once you have the results, you should also be able to see them in your Weave dashboard.
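
The pattern used above, building one coroutine per configuration and awaiting them together, can be sketched with the standard library alone. Here `fake_evaluate` is a stand-in that sleeps instead of calling an LLM, and the model names are placeholders.

```python
import asyncio
import itertools


async def fake_evaluate(model: str, temperature: float) -> dict:
    """Stand-in for an evaluation run; it sleeps briefly instead of calling an LLM."""
    await asyncio.sleep(0.01)
    return {"model": model, "temperature": temperature}


async def run_grid(models: list[str], temperatures: list[float]) -> list[dict]:
    # Calling an async function only creates a coroutine; nothing runs yet.
    pending = [fake_evaluate(m, t) for m, t in itertools.product(models, temperatures)]
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*pending)


results = asyncio.run(run_grid(["fast-model", "strong-model"], [0.0, 0.5, 1.0]))
print(len(results))
```

In a notebook, where an event loop is already running, `await run_grid(...)` can be used in place of `asyncio.run(...)`.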

## RAG Agent

The RAG agent is more complex: it uses a retriever to find similar problems and their solutions, then passes them to a model as few-shot examples to generate a new solution. We will use the [codecontests](https://huggingface.co/datasets/deepmind/code_contests) dataset as the source of similar problems and solutions.

Retrieving similar problems and solutions for a given problem statement is a non-trivial task. It involves indexing a large corpus of problems and solutions and then using a search algorithm to find the closest matches. We will use the `bm25` algorithm to index the problems and solutions. However, two problems with similar wording are not necessarily similar problems: many unrelated statements feature `Alice` and `Bob`, for example. Because of this limitation, a keyword search algorithm like BM25 might not find truly similar problems and solutions from the problem statement alone.
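
For readers unfamiliar with BM25, here is a small self-contained sketch of the scoring function over an in-memory corpus. The class `TinyBM25`, its tokenizer, and the parameter defaults are illustrative assumptions and not the retriever shipped in the starter kit; the IDF term uses a common non-negative variant.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())


class TinyBM25:
    """Okapi BM25 over a small in-memory corpus (illustrative only)."""

    def __init__(self, documents: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.term_counts = [Counter(tokenize(doc)) for doc in documents]
        self.lengths = [sum(counts.values()) for counts in self.term_counts]
        self.avg_length = sum(self.lengths) / len(self.lengths)
        # Document frequency: number of documents containing each term.
        doc_freq = Counter(term for counts in self.term_counts for term in counts)
        n = len(documents)
        self.idf = {
            term: math.log(1 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def score(self, query: str, doc_id: int) -> float:
        counts, length = self.term_counts[doc_id], self.lengths[doc_id]
        total = 0.0
        for term in tokenize(query):
            tf = counts.get(term, 0)
            if tf == 0:
                continue
            # Longer-than-average documents are penalized through `b`.
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_length)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int = 3) -> list[int]:
        doc_ids = range(len(self.term_counts))
        ranked = sorted(doc_ids, key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]


corpus = [
    "Alice and Bob split a pile of stones and play a game",
    "Find the shortest path between two nodes in a weighted graph",
    "Alice walks through a weighted graph looking for the shortest path",
]
bm25 = TinyBM25(corpus)
print(bm25.top_k("shortest path in a graph", k=2))
```

Note that a query such as "Alice and Bob" would rank the stone game first even if the problem at hand were about graphs, which is exactly the keyword-matching pitfall described above.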

While we could use `semantic search`, it would require a lot of data and compute. Therefore, we will use the `bm25` algorithm to index the problems and solutions and use our zero-shot agent to generate a draft solution for a given problem statement. We can then look for similar problems and solutions by comparing the AST (Abstract Syntax Tree) of the draft solution with those of the indexed solutions. This is a simplistic approach and far from perfect, but it's a good starting point.
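
One way to compare programs by structure rather than by wording is to reduce each to its sequence of AST node types, which ignores variable names, constants, and comments. The sketch below uses Python's `ast` and `difflib` modules; the function names are hypothetical, and the starter-kit retriever may normalize code differently.

```python
import ast
import difflib


def structure_signature(source: str) -> list[str]:
    """Return the sequence of AST node type names, ignoring identifiers and constants."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def structural_similarity(code_a: str, code_b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the two programs have identical node sequences."""
    matcher = difflib.SequenceMatcher(
        None, structure_signature(code_a), structure_signature(code_b), autojunk=False
    )
    return matcher.ratio()


loop_sum = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = "def add_up(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
one_liner = "print(max(map(int, input().split())))\n"

print(structural_similarity(loop_sum, renamed))
print(structural_similarity(loop_sum, one_liner))
```

The two loop-based functions differ only in naming, so they compare as structurally identical, while the one-liner scores lower.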


For now, you can just load the retriever from the W&B artifact store below. If you wish to use your own data, you may need to pre-process it and create the retriever yourself; check out `starter-kits/rag/retriever.py` for more details.


However, BM25 alone is not enough to find similar problems and solutions, because two problems with similar solutions might have different problem statements and vice versa.

We can use semantic search to mitigate this by re-ranking an initial candidate pool retrieved with BM25, which keeps our compute requirements in check. We can use `cosine similarity` to pick the most similar problems and solutions from that pool.
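
The re-ranking step can be sketched as follows. To stay self-contained, this example substitutes simple word-count vectors for the dense embeddings a real semantic search would use, so it only illustrates the cosine-similarity ranking itself; `bag_of_words` and `rerank` are hypothetical names.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    # Stand-in for an embedding model: a sparse vector of word counts.
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(query: str, candidates: list[str], top_n: int = 2) -> list[str]:
    """Keep the `top_n` candidates most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        candidates,
        key=lambda doc: cosine_similarity(query_vec, bag_of_words(doc)),
        reverse=True,
    )
    return ranked[:top_n]


pool = [
    "dynamic programming over subsets",
    "binary search on the answer",
    "greedy sorting by deadline",
]
print(rerank("binary search answer", pool, top_n=1))
```

Because only the small candidate pool is vectorized, the cost of this step stays bounded regardless of the size of the full corpus.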

In [None]:
from agent import describe_examples, format_examples, generate_solution
from retriever import Retriever, rerank_docs

logger.info("Loading retriever ... this may take a while ...")
retriever = Retriever()

We are now ready to build the RAG agent.

As we laid out earlier, the RAG agent takes in a problem, uses the retriever to find similar problems and their solutions, and then uses the model to generate a new solution. We will use the `draft_solution` function to draft a solution for a given problem statement. We can then look for similar problems and solutions by comparing the AST (Abstract Syntax Tree) of the draft solution to the solutions in our dataset. We will then present these as few-shot examples to the model to generate a new solution for the given problem statement.

In [None]:
@weave.op
async def rag_solver(
        retriever: Retriever,
        problem: Problem,
        model: str = FAST_LLM,
        temperature: float = 0.0,
        timeout: int = 10,
) -> dict:
    """The RAG Solver"""

    zero_shot_result = await zero_shot_solver(
        problem=problem,
        model=model,
        temperature=temperature,
        timeout=timeout,
    )
    solution = zero_shot_result["solution"]
    test_report = zero_shot_result["test_report"]
    if test_report == "passed":
        return zero_shot_result
    logger.info("Iterating on a RAG solution")

    @weave.op
    async def create_examplars(
            problem: Problem, solution: Solution, top_k: int = 50, top_n: int = 5
    ):
        logger.info("Generating exemplars:")
        retrieve_docs = retriever.retrieve(solution.source_code, top_k)
        reranked_docs = await rerank_docs(problem, solution, retrieve_docs, top_n)
        analyses = await describe_examples(reranked_docs)
        examplars = format_examples(reranked_docs, analyses)
        return examplars

    @weave.op
    async def rag_solution(
            problem: Problem,
            draft_solution: Solution,
            model: str = STRONG_LLM,
            temperature: float = 0.0,
            timeout: int = timeout,
    ) -> dict:
        logger.info("Generating RAG solution:")
        examplars = await create_examplars(problem, draft_solution)
        rag_solution = await generate_solution(
            problem=problem,
            examples=examplars,
            model=model,
            temperature=temperature,
        )
        test_report = check_correctness(
            rag_solution.source_code,
            problem.sample_input,
            problem.sample_output,
            timeout,
        )
        logger.info(f"RAG Solution Result: {repr(test_report)}")
        return {"solution": rag_solution, "test_report": test_report}

    rag_result = await rag_solution(problem, solution, model, temperature, timeout)
    solution = rag_result["solution"]
    test_report = rag_result["test_report"]
    return {"solution": solution, "stage": "rag", "test_report": test_report}

In [None]:
rag_result = await rag_solver(retriever, problem, timeout=30)
print("*" * 80)
print(rag_result["solution"].source_code)
print("*" * 80)
print(rag_result["test_report"])

We are now ready to evaluate the RAG agent.
We will create a `weave.Model` instance for the RAG agent and evaluate it using the same evaluation framework we used for the zero-shot agent.

In [None]:
class RAGAgent(weave.Model):
    retriever: Retriever
    model: str = FAST_LLM
    temperature: float = 0.0
    timeout: int = 30

    @weave.op
    async def predict(self, problem: Problem):
        return await rag_solver(
            retriever=self.retriever,
            problem=Problem(**problem),
            model=self.model,
            temperature=self.temperature,
            timeout=self.timeout,
        )

In [None]:
# Evaluate the RAG agent for all the models and temperatures
tasks = []
for LLM in eval_models:
    for temperature in eval_temperatures:
        rag_agent = RAGAgent(
            retriever=retriever, model=LLM, temperature=temperature, timeout=30
        )
        rag_results = eval.evaluate(rag_agent)
        tasks.append(rag_results)

# Again, 30 evals for the RAG agent with different models and temperatures

rag_results = await asyncio.gather(*tasks)

## Reflection Agent



While the RAG agent is an improvement over the zero-shot agent, it's still not perfect.
It's still susceptible to hallucinations and incorrect solutions. 
One way to mitigate this is to use reflection.
We can use another LLM call to reflect on the solution and test results and improve it.
We can then use the improved solution to generate new few-shot examples and repeat the process in a loop until we converge to a solution or the iteration limit is reached.

Again, this is not the best approach to solve the problem and has a lot of room for improvement, but it should help us get towards a working solution.
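
The control flow of such a loop can be sketched independently of any LLM. In this simplified version, `draft`, `test`, and `rework` are hypothetical callables standing in for the solver, the correctness check, and the reflection call; it is not the exact loop used by the agent in this notebook.

```python
from typing import Callable


def solve_with_reflection(
    draft: Callable[[], str],
    test: Callable[[str], str],
    rework: Callable[[str, str], str],
    max_iterations: int = 2,
) -> dict:
    """Draft a solution, then alternate testing and reworking until it passes."""
    solution = draft()
    report = test(solution)
    iterations = 0
    while report != "passed" and iterations < max_iterations:
        # Feed the failing solution and its test report back into the next attempt.
        solution = rework(solution, report)
        report = test(solution)
        iterations += 1
    return {"solution": solution, "test_report": report, "iterations": iterations}


# Stubs for illustration: the first draft is wrong, the rework fixes it.
result = solve_with_reflection(
    draft=lambda: "print(1)",
    test=lambda code: "passed" if code == "print(2)" else "wrong answer",
    rework=lambda code, report: "print(2)",
)
print(result)
```

The iteration cap matters because each rework is another LLM call, and a solution that never converges would otherwise loop indefinitely.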

Here are the reflection instructions we will provide to the LLM to reflect on the solution and test results. Feel free to change the instructions to improve the agent's performance.

In [None]:
from agent import REFLECTION_INSTRUCTIONS, rework_solution

print(REFLECTION_INSTRUCTIONS)

In [None]:
@weave.op
async def rag_solver_with_reflection(
        retriever: Retriever,
        problem: Problem,
        model: str = FAST_LLM,
        temperature: float = 0.0,
        max_iterations: int = 2,
        timeout: int = 10,
):
    num_iterations = 0
    test_report = "failed"
    solution = None
    while test_report != "passed" and num_iterations < max_iterations:
        rag_result = await rag_solver(
            retriever=retriever,
            problem=problem,
            timeout=timeout,
            model=model,
            temperature=temperature,
        )
        solution = rag_result["solution"]
        test_report = rag_result["test_report"]
        if test_report == "passed":
            return rag_result
        rework_result = await rework_solution(
            problem=problem,
            incorrect_solution=solution,
            test_report=test_report,
            model=model,
            temperature=temperature,
            timeout=timeout,
        )
        solution = rework_result["solution"]
        test_report = rework_result["test_report"]
        if test_report == "passed":
            return {
                "solution": solution,
                "stage": "reflection",
                "test_report": test_report,
            }
        num_iterations += 1
    logger.info("Failed to generate a solution")
    return {"solution": solution, "stage": "failed", "test_report": test_report}

In [None]:
reflection_result = await rag_solver_with_reflection(
    retriever, problem, max_iterations=2, timeout=30
)

print("*" * 80)
print(reflection_result["solution"].source_code)
print("*" * 80)
print(reflection_result["test_report"])

Great, now we are ready to evaluate a more complex agent that uses reflection.
This agent will try to solve the problem using the retriever. If it fails, it will ask the model to reflect on the problem and re-work the solution, repeating this process until the solution is correct or the iteration limit is reached.

But the best part is that we can use the same evaluation framework we used for the zero-shot and RAG agents to evaluate the RAG reflection agent.

In [None]:
class RAGReflectionAgent(weave.Model):
    retriever: Retriever
    max_iterations: int = 2
    timeout: int = 30
    model: str = STRONG_LLM
    temperature: float = 0.0

    @weave.op
    async def predict(self, problem: Problem):
        return await rag_solver_with_reflection(
            self.retriever,
            Problem(**problem),
            model=self.model,
            temperature=self.temperature,
            max_iterations=self.max_iterations,
            timeout=self.timeout,
        )

In [None]:
# Evaluate the RAG reflection agent for all the models and temperatures
tasks = []
for LLM in eval_models:
    for temperature in eval_temperatures:
        rag_reflection_agent = RAGReflectionAgent(
            retriever=retriever, model=LLM, temperature=temperature, timeout=30
        )
        rag_reflection_results = eval.evaluate(rag_reflection_agent)
        tasks.append(rag_reflection_results)
rag_reflection_results = await asyncio.gather(*tasks)

Okay, that completes the demo!

Key takeaways from this demo:
1. We tried to solve some challenging competitive programming problems using LLM agents.
2. We tried three different agents:
    - Zero-shot agent
    - RAG agent
    - RAG reflection agent
3. We used Weave to evaluate the agents and compare their performance.

We hope you found this demo useful and interesting and that it gave you some ideas on how to use LLM agents to solve challenging problems.
