# Context

We started our course with:
- [Week 1: Bootstrap retrieval evaluation](https://github.com/567-labs/systematically-improving-rag/tree/main/week1_bootstrap_evals)
- [Week 2: Classify queries to prioritize improvements](https://github.com/567-labs/systematically-improving-rag/tree/main/week2_question_classification)

The improvements you'll prioritize based on week 2 include:
- Adding data sources
- Adding new indices to improve retrieval on existing sources
- Extracting more information from existing data
    - Pre-processing so results are ready at query time
    - Post-processing so results are calculated at query time

This notebook shows sample code for handling new data types (e.g. images, tables, and third-party APIs), adding indices, and extracting more information from existing data sources.

We'll add routers in week 4 to ensure we're calling the correct tools after we build them.

# Example

We continue our example of a hardware e-commerce site answering user questions with a RAG system that retrieves previous product reviews.

Many questions won't be answered by product reviews, but we could look up the answers in other sources. For example:
- `How many of these are available to be shipped right now?`
- `How much will it cost to ship this item to Florida?`

We can answer questions like this with tool calls on new data sources.

In [1]:
import openai
import instructor

from typing import Annotated, Any, Iterable, List, Literal, Optional

from pydantic import BaseModel


class ShippingDateRequest(BaseModel):
    sku: str


class ShippingCostRequest(BaseModel):
    sku: str
    shipping_location: str


def get_available_shipping_date(sku: str) -> str:
    """A mock function"""
    return "September 1, 2024."


def get_shipping_cost(sku: str, shipping_location: str) -> str:
    """A mock function"""
    return "$15.99."


basic_client = instructor.from_openai(openai.OpenAI())
tools_client = instructor.from_openai(
    openai.OpenAI(), mode=instructor.Mode.PARALLEL_TOOLS
)


example_messages = [
    {"role": "system", "content": "You must always use tools"},
    {
        "role": "user",
        "content": """A user asked the following question about a product with sku 1234567890.
When will this product be in stock again? And how much will it cost to ship this item to Florida?
            """,
    },
]

function_calls = tools_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=example_messages,
    response_model=Iterable[ShippingDateRequest | ShippingCostRequest],
)

for call in function_calls:
    if isinstance(call, ShippingDateRequest):
        result = get_available_shipping_date(call.sku)
    elif isinstance(call, ShippingCostRequest):
        result = get_shipping_cost(call.sku, call.shipping_location)
    print(result)



September 1, 2024.
$15.99.


Calling arbitrary functions, as we've seen here, opens up a wide range of data sources.

- Read from a database
- Run calculations
- Trigger side effects (e.g. write to a database, send an email)
- Make more LLM calls

But you can also call functions that do more than reading data.
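
As the number of tools grows, the `isinstance` chain used above gets harder to maintain. One possible alternative is a registry that maps each request type to its handler. The sketch below is an assumption about how you might organize this, not part of Instructor: `HANDLERS` and `dispatch` are hypothetical names, and plain dataclasses stand in for the Pydantic models so the example runs without an API key.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ShippingDateRequest:
    sku: str


@dataclass
class ShippingCostRequest:
    sku: str
    shipping_location: str


def get_available_shipping_date(req: ShippingDateRequest) -> str:
    """Mock lookup returning the same canned answer as the earlier cell."""
    return "September 1, 2024."


def get_shipping_cost(req: ShippingCostRequest) -> str:
    """Mock lookup returning the same canned answer as the earlier cell."""
    return "$15.99."


# Hypothetical registry: request type -> function that handles it.
HANDLERS: dict[type, Callable[..., str]] = {
    ShippingDateRequest: get_available_shipping_date,
    ShippingCostRequest: get_shipping_cost,
}


def dispatch(call: object) -> str:
    """Run the handler registered for this request type."""
    handler = HANDLERS.get(type(call))
    if handler is None:
        raise ValueError(f"No handler registered for {type(call).__name__}")
    return handler(call)


# Stand-in for the parsed tool calls the model would return.
calls = [
    ShippingDateRequest(sku="1234567890"),
    ShippingCostRequest(sku="1234567890", shipping_location="Florida"),
]
results = [dispatch(c) for c in calls]
print(results)
```

Adding a tool then means adding one model and one registry entry, and an unregistered request type fails loudly instead of silently reusing a stale `result`.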

# Logging User Requests

Suppose we decide to store facts about customers and their interests in a database, so we can make better recommendations in the future.

In [2]:
class Fact(BaseModel):
    person: str
    fact: str


def log_requests(person, fact) -> None:
    """A mock function"""
    print(f"Logging data for {person}. Fact: {fact}.")
    return


# Could improve this with examples in the prompt
function_calls = tools_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Extract a list of atomic facts about a person that may help us recommend better tools for them in the future.",
        },
        {
            "role": "user",
            "content": """dan@gmail.com made the following request while viewing a 7" axe:
            I'm looking for an axe with a rubberized handle. I have nerve damage in my elbow which makes hard handles painful. How hard is this axe handle? I will buy an axe with a soft handle if I can find it.
            """,
        },
    ],
    response_model=Iterable[Fact],
    temperature=0.0,
)

for call in function_calls:
    log_requests(call.person, call.fact)

Logging data for dan@gmail.com. Fact: looking for an axe with a rubberized handle.
Logging data for dan@gmail.com. Fact: has nerve damage in elbow.
Logging data for dan@gmail.com. Fact: finds hard handles painful.
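
The mock `log_requests` only prints. A minimal sketch of what a real write could look like follows, assuming an in-memory SQLite database and a hypothetical `customer_facts` table (neither appears in the original notebook). The facts are hard-coded from the output above so the example runs without calling the model.

```python
import sqlite3

# In-memory database for illustration; a real system would use a persistent store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_facts (person TEXT NOT NULL, fact TEXT NOT NULL)"
)


def log_fact(conn: sqlite3.Connection, person: str, fact: str) -> None:
    """Insert one fact, binding values as parameters rather than formatting SQL."""
    conn.execute(
        "INSERT INTO customer_facts (person, fact) VALUES (?, ?)",
        (person, fact),
    )
    conn.commit()


# Facts copied from the extraction output above.
extracted = [
    ("dan@gmail.com", "looking for an axe with a rubberized handle"),
    ("dan@gmail.com", "has nerve damage in elbow"),
    ("dan@gmail.com", "finds hard handles painful"),
]
for person, fact in extracted:
    log_fact(conn, person, fact)

# Read the facts back in insertion order.
facts_for_dan = [
    row[0]
    for row in conn.execute(
        "SELECT fact FROM customer_facts WHERE person = ? ORDER BY rowid",
        ("dan@gmail.com",),
    )
]
print(facts_for_dan)
```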


# Extracting Structured Data

These plain-language facts are useful for some workflows. For example, we could iterate over them to find users who are especially interested in a new product.

But many RAG capabilities benefit from structured data (especially for improving retrieval quality). This is very natural with Instructor.

How would you ensure your system can answer questions like "What is a German-made axe that is at least 20 cm long with a wooden handle?"

We'll need to extract structured data from the request (and possibly from product descriptions). For the sake of example, we'll preprocess axe descriptions to populate a database of product specs which we can later query (details of querying will be in week 4).

In [3]:
class HandToolStats(BaseModel):
    tool_type: Optional[Literal["axe", "hammer", "screwdriver", "saw", "other"]]
    handle_material: Optional[str] = None
    blade_material: Optional[str] = None
    weight_grams: Optional[int] = None
    color: Optional[str] = None
    length_cm: Optional[int] = None
    country_of_origin: Optional[str] = None


# Could improve this with examples in the prompt
axe_specs = basic_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract specs from the product description."},
        {
            "role": "user",
            "content": """
            This Canadian axe is made with the finest spruce and carbon steel. It is 12 inches long
            """,
        },
    ],
    response_model=HandToolStats,
    temperature=0.0,
)

# Iterating over a Pydantic model yields (field name, value) pairs
for field_name, value in axe_specs:
    print(f"{field_name}: {value}")

tool_type: axe
handle_material: spruce
blade_material: carbon steel
weight_grams: None
color: None
length_cm: 30
country_of_origin: Canada


We'll have some `None` values in the extracted data, and users won't always filter on every field. You probably already see how you could put these together to answer questions in real time using SQL queries.
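
Here is one way the German-made axe question could be answered once specs are stored. This is a sketch under assumptions: the `hand_tool_stats` table, the `sku` column, and every row except the Canadian axe are invented for illustration, and SQLite stands in for whatever database you use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE hand_tool_stats (
        sku TEXT PRIMARY KEY,
        tool_type TEXT,
        handle_material TEXT,
        blade_material TEXT,
        weight_grams INTEGER,
        color TEXT,
        length_cm INTEGER,
        country_of_origin TEXT
    )
    """
)

# The first row mirrors the extraction above; the rest are hypothetical.
# Python None is stored as SQL NULL.
rows = [
    ("A-1", "axe", "spruce", "carbon steel", None, None, 30, "Canada"),
    ("A-2", "axe", "wood", "steel", 900, None, 45, "Germany"),
    ("A-3", "axe", "wood", "steel", None, None, None, "Germany"),
    ("H-1", "hammer", "wood", None, 500, None, 30, "Germany"),
]
conn.executemany("INSERT INTO hand_tool_stats VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# "A German-made axe that is at least 20 cm long with a wooden handle"
query = """
    SELECT sku FROM hand_tool_stats
    WHERE tool_type = ?
      AND country_of_origin = ?
      AND handle_material = ?
      AND length_cm >= ?
    ORDER BY sku
"""
matches = [row[0] for row in conn.execute(query, ("axe", "Germany", "wood", 20))]
print(matches)
```

Two things to notice. Row `A-3` is excluded because a comparison against `NULL` is never true in SQL, so missing specs silently drop products from filtered results. And a free-text field like `handle_material` would not match "spruce" to "wood" without some normalization step.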

# Images

We've looked at text input so far. But it's straightforward to extend this to images.

Let's extract some stats from an axe image instead of a written description.

![Carpenter's axe](https://upload.wikimedia.org/wikipedia/commons/thumb/5/55/Carpenter%27s_axe.jpg/340px-Carpenter%27s_axe.jpg)

In [4]:
def image_to_tool_stats(url: str) -> HandToolStats:
    return basic_client.chat.completions.create(
        model="gpt-4o",
        response_model=HandToolStats,
        messages=[
            {
                "role": "system",
                "content": """Extract specs from the product image.
             Include all fields you can extract from the image.""",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": url},
                    },
                ],
            },
        ],
        temperature=0.0,
    )


img_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/5/55/Carpenter%27s_axe.jpg/340px-Carpenter%27s_axe.jpg"
specs = image_to_tool_stats(img_url)
specs

HandToolStats(tool_type='axe', handle_material='wood', blade_material='metal', weight_grams=None, color=None, length_cm=None, country_of_origin=None)

There's nothing special about extraction and images as a pair.

We could have asked "What's in this image" and gotten a string description.

Or if the image can be interpreted as a table, we can ask for that table.
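
The image is just one part of the user message, so the same payload shape works for extraction or for a free-form question. The sketch below builds such a message and also shows how a local file could be sent as a base64 data URL instead of a public link. The helper names are hypothetical, and the bytes are a placeholder rather than a real image.

```python
import base64


def image_bytes_to_data_url(data: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL usable in an image_url part."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(url: str, question: str) -> dict:
    """Build a user message with one image part and one text part."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": url}},
            {"type": "text", "text": question},
        ],
    }


# Placeholder bytes; in practice, read them with open(path, "rb").read().
fake_image = b"\xff\xd8\xff\xe0 placeholder, not a real JPEG"
data_url = image_bytes_to_data_url(fake_image)
message = build_image_message(data_url, "What's in this image?")
print(message["content"][1]["text"])
```

Passing `[message]` as `messages` with `response_model=str` or a structured model is what decides whether you get a description or extracted fields back.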

### Are A Prompt and A Response Model All You Need?

To see this in a stripped-down form, we'll first extract data from an image into a familiar format (a Markdown table). Then we'll add some structure and extract pandas DataFrames from images.

A user may ask which hardware materials will be trending in the coming years.

If we have reports with images like this, we have a source to answer their question.

<img src="https://market.us/wp-content/uploads/2023/02/Cabinet-Hardware-Market-Size.png" height=400>

In [5]:
class Table(BaseModel):
    caption: str
    md_dataframe: str


def img_to_md_table(url: str) -> List[Table]:
    return basic_client.chat.completions.create(
        model="gpt-4o",
        max_tokens=4000,
        response_model=List[Table],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": url},
                    },
                    {
                        "type": "text",
                        "text": """
                            Analyze the image to determine appropriate headers for output tables.
                            For each identified table, create an informative h2 title and a concise description of the contents.
                            Finally, output the markdown representation of all data in the table/graph.

                            Escape the markdown table properly, and make sure to include the caption and the dataframe.
                            Only return a markdown table in dataframe, nothing else. Make sure to capture all data that should be in the table.

                            Capture visual data that is not explicitly labeled with text from the image
                        """,
                    },
                ],
            }
        ],
        temperature=0.0,
    )

In [6]:
url = "https://market.us/wp-content/uploads/2023/02/Cabinet-Hardware-Market-Size.png"

for table in img_to_md_table(url):
    print("-----------------------------------")
    print(table.caption, "\n", table.md_dataframe)

-----------------------------------
Global Cabinet Hardware Market Size, by Material, 2022-2032 (USD Billion) 
 | Year | Metal | Plastic | Ceramic | Glass | Wood |
|------|-------|---------|---------|-------|------|
| 2022 | 3.5   | 2.5     | 1.5     | 1.0   | 1.18 |
| 2023 | 3.7   | 2.7     | 1.6     | 1.1   | 1.27 |
| 2024 | 3.9   | 2.9     | 1.7     | 1.2   | 1.31 |
| 2025 | 4.1   | 3.1     | 1.8     | 1.3   | 1.39 |
| 2026 | 4.3   | 3.3     | 1.9     | 1.4   | 1.46 |
| 2027 | 4.5   | 3.5     | 2.0     | 1.5   | 1.56 |
| 2028 | 4.7   | 3.7     | 2.1     | 1.6   | 1.66 |
| 2029 | 4.9   | 3.9     | 2.2     | 1.7   | 1.68 |
| 2030 | 5.1   | 4.1     | 2.3     | 1.8   | 1.78 |
| 2031 | 5.3   | 4.3     | 2.4     | 1.9   | 1.89 |
| 2032 | 5.5   | 4.5     | 2.5     | 2.0   | 2.39 |


You can also extract info to tables for things that are less obviously "tables" in their raw form.

<img src="https://www.transparencymarketresearch.com/images/household-and-diy-hand-tools-market.jpg" height=400>

In [7]:
url = "https://www.transparencymarketresearch.com/images/household-and-diy-hand-tools-market.jpg"

for table in img_to_md_table(url):
    print("-----------------------------------")
    print(table.caption, "\n", table.md_dataframe)

-----------------------------------
Market Drivers 
 | Market Drivers |
|----------------|
| Rise in e-commerce |
| Surge in trend of do-it-yourself activities |
-----------------------------------
Market Revenue 
 | Year | Revenue (US$ Bn) | CAGR (2023-2031) |
|------|------------------|------------------|
| 2022 | 17.9             | 4.5%             |
-----------------------------------
Sales Channels 
 | Sales Channels |
|----------------|
| Online Sales   |
| Retail Sales   |
| Distributor Sales |
-----------------------------------
Key Players 
 | Key Players                       |
|----------------------------------|
| Akar Tools Ltd.                  |
| Apex Tools Group LLC             |
| Channellock, Inc.                |
| JK Files and Engineering Limited |
| Kennametal Inc.                  |
| Klein Tools, Inc.                |
| Snap-on Incorporated             |
| Stanley Black & Decker, Inc.     |
| Wera Tools                       |
-----------------------------------

## Image to DataFrame

DataFrames support far more operations than Markdown strings, so we likely want to extract a DataFrame directly. This requires some extra code, primarily for converting to and from the DataFrame:

In [8]:
from openai import OpenAI
from io import StringIO
from typing import Annotated, Any, List
from pydantic import (
    BaseModel,
    BeforeValidator,
    PlainSerializer,
    InstanceOf,
    WithJsonSchema,
)
import instructor
import pandas as pd
from rich.console import Console

console = Console()
client = instructor.from_openai(
    client=OpenAI(),
    mode=instructor.Mode.TOOLS,
)


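def md_to_df(data: Any) -> Any:
    """Convert a Markdown table string to a DataFrame; pass other inputs through."""
    if isinstance(data, str):
        return (
            pd.read_csv(
                StringIO(data),
                sep="|",  # Markdown cells are pipe-delimited
                index_col=1,  # Column 0 is the empty text before the leading pipe
            )
            .dropna(axis=1, how="all")  # Drop the empty columns created by the outer pipes
            .iloc[1:]  # Drop the |---|---| separator row
            .map(lambda x: x.strip())  # Trim cell padding (DataFrame.map needs pandas >= 2.1)
        )  # type: ignore
    return data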


MarkdownDataFrame = Annotated[
    InstanceOf[pd.DataFrame],
    BeforeValidator(md_to_df),
    PlainSerializer(lambda x: x.to_markdown()),
    WithJsonSchema(
        {
            "type": "string",
            "description": """
                The markdown representation of the table,
                each one should be tidy, do not try to join tables
                that should be separate""",
        }
    ),
]


class Table(BaseModel):
    caption: str
    dataframe: MarkdownDataFrame


class MultipleTables(BaseModel):
    tables: List[Table]


def extract(url: str) -> MultipleTables:
    return client.chat.completions.create(
        model="gpt-4o",
        max_tokens=4000,
        response_model=MultipleTables,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": url},
                    },
                    {
                        "type": "text",
                        "text": """
                            Analyze the image to determine appropriate headers for output tables.
                            For each identified table, create an informative h2 title and a concise description of the contents.
                            Finally, output the markdown representation of all data in the table/graph.

                            Escape the markdown table properly, and make sure to include the caption and the dataframe.
                            Only return a markdown table in dataframe, nothing else. Make sure to capture all data that should be in the table.

                            Capture visual data that is not explicitly labeled with text from the image
                        """,
                    },
                ],
            }
        ],
        temperature=0.0,
    )


url = "https://market.us/wp-content/uploads/2023/02/Cabinet-Hardware-Market-Size.png"
for table in extract(url).tables:
    console.print(table.caption, "\n", table.dataframe)
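
To see what the conversion step has to handle without calling the API, you can exercise the parsing on a literal Markdown string. The sketch below is a dependency-free stand-in for `md_to_df` (it returns a list of dicts rather than a DataFrame), with `parse_md_table` as a hypothetical helper name. The sample rows are copied from the Markdown output shown earlier.

```python
def parse_md_table(md: str) -> list[dict[str, str]]:
    """Parse a pipe-delimited Markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in md.strip().splitlines() if ln.strip()]

    def split_row(line: str) -> list[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    rows = []
    for line in lines[1:]:
        cells = split_row(line)
        # Skip the |---|---| separator row (cells made only of '-', ':' or spaces).
        if all(c and set(c) <= set("-: ") for c in cells):
            continue
        rows.append(dict(zip(header, cells)))
    return rows


sample = """
| Year | Metal | Plastic |
|------|-------|---------|
| 2022 | 3.5   | 2.5     |
| 2023 | 3.7   | 2.7     |
"""

rows = parse_md_table(sample)
metal_growth = round(float(rows[-1]["Metal"]) - float(rows[0]["Metal"]), 2)
print(rows)
print(metal_growth)
```

All values come back as strings, and escaped pipes inside cells are not handled. Model output can also be malformed, which is one reason to run this conversion inside a validator so a bad table surfaces as a validation error.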

You've seen that you can (and likely will) create many tools. But it would be slow (and possibly harmful) to call all tools for all queries.

The `benchmark_tool_retrieval.ipynb` notebook shows how to measure tool retrieval with synthetic data, much like we benchmarked content retrieval with synthetic data in week 1.
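
As a rough preview of what such a measurement might look like, the sketch below scores tool selection with per-query precision and recall. This is an assumption about a reasonable metric, not necessarily what the benchmark notebook does. The labeled examples and the `search_reviews` tool name are hypothetical.

```python
def tool_selection_scores(
    expected: set[str], predicted: set[str]
) -> tuple[float, float]:
    """Return (precision, recall) for one query's tool selection."""
    hits = len(expected & predicted)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical labeled data: (question, tools that should run, tools the system chose).
examples = [
    (
        "How much will it cost to ship this item to Florida?",
        {"get_shipping_cost"},
        {"get_shipping_cost"},
    ),
    (
        "When will this be in stock, and what does shipping cost?",
        {"get_available_shipping_date", "get_shipping_cost"},
        {"get_shipping_cost"},
    ),
    (
        "How hard is this axe handle?",
        {"search_reviews"},
        {"search_reviews", "get_shipping_cost"},
    ),
]

scores = [tool_selection_scores(exp, pred) for _, exp, pred in examples]
mean_precision = sum(p for p, _ in scores) / len(scores)
mean_recall = sum(r for _, r in scores) / len(scores)
print(f"precision={mean_precision:.2f} recall={mean_recall:.2f}")
```

Low recall means needed tools are being skipped. Low precision means extra tools are being called, which is the slow and possibly harmful case described above.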