# Demo Notebook 

This notebook serves the purpose of demonstrating how the Retriaval Augmented Generation improves the LLM code generation. 

## LLM code creation without RAG and CoALA

Firstly, we import the necessary libraries and create an agent object.

In [None]:
import re
import pandas as pd
from llms.agents.react import ReActAgent
from llms.clients.gpt import GPTClient
from llms.settings import settings

client = GPTClient(
    client_id=settings.CLIENT_ID,
    client_secret=settings.CLIENT_SECRET,
    auth_url=settings.AUTH_URL,
    api_base=settings.API_BASE,
    deployment_id="gpt-4-32k",
    max_response_tokens=1000,
    temperature=0.0,
)
agent = ReActAgent(client)

In [None]:
def parse_function_name(function_string: str) -> str | None:
    """Parses the function name from the LLMs response."""
    match = re.match(r"^\s*def\s+([a-zA-Z_]\w*)\s*\(", function_string)
    return match.group(1) if match else None

Then we define a prompt and ask the agent without RAG and CoALA implemented for the code. Then we execute the function code and save the function that the agent generated under `generated_function`.

In [3]:
prompt = """
How can I convert this one-hot encoded dataframe:
df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0], "col2_a": [0, 1, 0], "col2_b": [1, 0, 0], "col2_c": [0, 0, 1]})
into a categorical dataframe?
"""

generated_code = agent.run(prompt)
function_name = parse_function_name(generated_code)

namespace_agent = {}
exec(generated_code, namespace_agent)
generated_function = namespace_agent[function_name]

23/01/24 21:51:51 INFO User prompt: 
How can I convert this one-hot encoded dataframe:
df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0], "col2_a": [0, 1, 0], "col2_b": [1, 0, 0], "col2_c": [0, 0, 1]})
into a categorical dataframe?

23/01/24 21:51:51 INFO Waiting 1s to avoid rate limit
23/01/24 21:51:52 INFO Starting call to 'llms.clients.base.BaseLLMClient._request_handler', this is the 1st time calling it.
23/01/24 21:52:05 INFO API response: 
{'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'content': '{\n    "Thought": "To convert a one-hot encoded dataframe into a categorical dataframe, we need to identify the original columns from the encoded column names, then for each original column, find the category for each row by identifying which encoded column has a 1. We can do this by splitting the column names on the underscore to get the original column names and the categories, then using idxmax to find the column with the maximum value (which will be 1 for o

In [4]:
print(generated_code)

def response_function(df):
    import pandas as pd
    # Split the column names on the underscore to get the original column names and the categories
    df.columns = df.columns.str.split('_', expand=True)
    df.columns = df.columns.swaplevel(0, 1)
    # Group by the original column names and find the column with the maximum value for each row
    df = df.groupby(level=0, axis=1).idxmax(axis=1)
    # Remove the original column name from the category
    df = df.applymap(lambda x: x[1])
    return df


Next, we define our input and the expected function.

In [5]:
df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0], "col2_a": [0, 1, 0], "col2_b": [1, 0, 0], "col2_c": [0, 0, 1]})

def correct_function(df):
    result = pd.from_dummies(df, sep="_")
    return result

df1 = correct_function(df)
df2 = generated_function(df)



Now we can compare the results of the two functions:

In [6]:
print('Is output equal:')
print(df1.equals(df2))

Is output equal:
False


In [7]:
from pprint import pprint
for step in agent.reasoning:
    for key, value in step.items():
        pprint(f"{key}:{value}")
        print()

('User prompt:\n'
 'How can I convert this one-hot encoded dataframe:\n'
 'df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0], "col2_a": [0, '
 '1, 0], "col2_b": [1, 0, 0], "col2_c": [0, 0, 1]})\n'
 'into a categorical dataframe?\n')

('Thought:To convert a one-hot encoded dataframe into a categorical dataframe, '
 'we need to identify the original columns from the encoded column names, then '
 'for each original column, find the category for each row by identifying '
 'which encoded column has a 1. We can do this by splitting the column names '
 'on the underscore to get the original column names and the categories, then '
 'using idxmax to find the column with the maximum value (which will be 1 for '
 'one-hot encoded data) for each row. We will then need to remove the original '
 'column name from the category to get the final category. We can do this '
 'using pandas.')

('Tool:def response_function(df):\n'
 '    import pandas as pd\n'
 '    # Split the column names on the

## LLM code creation with RAG and CoALA

We start by creating a new LLM agent that is augmented by RAG and CoALA.

In [8]:
from llms.rag.faiss import FAISS
from llms.rag.coala import CoALA

filename = "embeddings_EUCLIDEAN_DISTANCE_k_results_1_threshold_0.0_chunk_size_512_avg_True"

docs_vector_store = FAISS.load_local("embeddings/semantic/", filename, client)
code_vector_store = FAISS.load_local("embeddings/episodic/", filename, client)

coala = CoALA(docs_vector_store, code_vector_store)

In [9]:
rag_agent = ReActAgent(client, rag=coala)

In [10]:
prompt = """
How can I convert this one-hot encoded dataframe:
df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0], "col2_a": [0, 1, 0], "col2_b": [1, 0, 0], "col2_c": [0, 0, 1]})
into a categorical dataframe?
"""

generated_code = rag_agent.run(prompt)
function_name = parse_function_name(generated_code)

namespace_rag_agent = {}
exec(generated_code, namespace_rag_agent)
rag_agent_func = namespace_rag_agent[function_name]

23/01/24 21:52:17 INFO User prompt: 
How can I convert this one-hot encoded dataframe:
df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0], "col2_a": [0, 1, 0], "col2_b": [1, 0, 0], "col2_c": [0, 0, 1]})
into a categorical dataframe?

23/01/24 21:52:17 INFO Waiting 1s to avoid rate limit
23/01/24 21:52:18 INFO Starting call to 'llms.clients.base.BaseLLMClient._request_handler', this is the 1st time calling it.
23/01/24 21:52:19 INFO API response: 
{'data': [{'embedding': [0.007271461, 0.023915028, 0.015189274, -0.010354561, -0.00666389, 0.0072456067, 0.02403137, -0.002779314, -0.012442278, -0.025647251, 0.01997228, 0.024238203, -0.006999993, -0.0076398817, -0.015047077, 0.017296381, 0.016766373, -0.0007218945, 0.011362869, -0.028698033, -0.051320355, -0.007620491, -0.022325002, -0.015253909, -0.031050755, 0.015240982, 0.017115403, -0.020049842, -0.018576158, 0.032136627, 0.012119101, 0.00015936619, -0.0011892879, -0.032912247, -0.00649907, -0.007168045, 0.010561394, -0.00316712

As the `df` and `correct_function` stay the same, we can now compare the results of the rag enhanced agent with the expected result.

In [11]:
print(generated_code)

def response_function(df):
    import pandas as pd
    return pd.from_dummies(df, sep='_')


In [12]:
df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0], "col2_a": [0, 1, 0], "col2_b": [1, 0, 0], "col2_c": [0, 0, 1]})

df1 = correct_function(df)
df2 = generated_function(df)



In [13]:
print('Is output equal:')
print(df1.equals(df2))

Is output equal:
False


In [14]:
from pprint import pprint
for step in agent.reasoning:
    for key, value in step.items():
        pprint(f"{key}:{value}")
        print()

('User prompt:\n'
 'How can I convert this one-hot encoded dataframe:\n'
 'df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0], "col2_a": [0, '
 '1, 0], "col2_b": [1, 0, 0], "col2_c": [0, 0, 1]})\n'
 'into a categorical dataframe?\n')

('Thought:To convert a one-hot encoded dataframe into a categorical dataframe, '
 'we need to identify the original columns from the encoded column names, then '
 'for each original column, find the category for each row by identifying '
 'which encoded column has a 1. We can do this by splitting the column names '
 'on the underscore to get the original column names and the categories, then '
 'using idxmax to find the column with the maximum value (which will be 1 for '
 'one-hot encoded data) for each row. We will then need to remove the original '
 'column name from the category to get the final category. We can do this '
 'using pandas.')

('Tool:def response_function(df):\n'
 '    import pandas as pd\n'
 '    # Split the column names on the