# Leveraging large language models and vector databases for exploring life cycle inventory databases

* Author: Selim Youssry
* Kernel: `llm`
* License: [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/)

## ChatGPT

Let's run a simple query.
For that, open [OpenAI Chat](https://chat.openai.com/), sign up/sign in, and create a new chat (it's free).

![ChatGPT: What is Ecoinvent?](chatgpt-1.png)

ChatGPT knows about Ecoinvent, great! The answer seems pretty satisfactory. Let's continue the conversation.

![ChatGPT: Cool, where is it located?](chatgpt-2.png)

### What is special about this query?

<details>
  
<summary><b>Answer</b></summary>
This seems stateful: I wrote "it" and ChatGPT understood that I meant Ecoinvent, as if it kept a session of my conversation.
The context must be saved somehow.

</details>


### Now, how popular is Brightcon?

![ChatGPT: what is Brightcon?](chatgpt-3.png)

Sadly, it does not know. A first question then comes to mind:

### How do we teach ChatGPT / LLMs in general new knowledge? How would you do it with a "traditional ML model"?

<details>
  
<summary><b>Answer</b></summary>
<b>Fine-tuning</b>: take the pretrained model and continue training it on the new data.

Fine-tuning for ChatGPT's models has only very recently become available. Regardless, it is quite hard to do well and should not be the first solution that comes to mind.
</details>


### Let's feed ChatGPT with a little extra information

We do a quick Google search for recent LCA conferences, paste a few summaries into the prompt, and try again.

![ChatGPT: now with context, what is Brightcon?](chatgpt-4.png)

## Checkpoint

- LLMs can generate high-quality human-language answers, based on human-language input
- LLMs can be passed context at runtime, which they can then use in their responses

## Moving to the API

Let's replicate the previous flows with the API.

In [1]:
import os

import openai

# os.environ["OPENAI_API_KEY"] = xxx

In [2]:
response1 = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Ecoinvent?"}
    ]
)
response1

<OpenAIObject chat.completion id=chatcmpl-80G6u82w5wvbzBNrrviiG3VDiTpD4 at 0x7faed8580830> JSON: {
  "id": "chatcmpl-80G6u82w5wvbzBNrrviiG3VDiTpD4",
  "object": "chat.completion",
  "created": 1695072620,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Ecoinvent is a widely used database that provides life cycle inventory data for various products and processes. It includes information on the environmental impacts associated with the production and use of different materials, energy sources, and technologies. This data can be used to assess and compare the sustainability and environmental performance of different products and systems. Ecoinvent is often used in life cycle assessment studies and sustainability analyses."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 78,
    "total_tokens": 101
  }
}

In [3]:
response2 = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Ecoinvent?"},
        {"role": "assistant", "content": response1["choices"][0]["message"]["content"]},
        {"role": "user", "content": "Where is it located?"}
    ]
)
response2

<OpenAIObject chat.completion id=chatcmpl-80G6xzOvJEXWL9biTEFoXfvZIh2jg at 0x7faef455a570> JSON: {
  "id": "chatcmpl-80G6xzOvJEXWL9biTEFoXfvZIh2jg",
  "object": "chat.completion",
  "created": 1695072623,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Ecoinvent is managed by the Swiss Centre for Life Cycle Inventories, located in Switzerland. The database is regularly updated and maintained by a team of experts who collect and process data from a wide range of sources. The Ecoinvent database is available online and can be accessed through a subscription or licensing agreement."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 114,
    "completion_tokens": 62,
    "total_tokens": 176
  }
}
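The follow-up question only works because the whole history is sent again: the API call itself keeps no session. A minimal sketch of a helper that accumulates the messages, assuming a hypothetical `send` callable that wraps the API call (replaced here by a fake so the example runs offline):

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the message history client-side and resends it on every turn."""

    def __init__(self, send: Callable[[List[Message]], str],
                 system_prompt: str = "You are a helpful assistant."):
        # `send` is a hypothetical wrapper: it receives the full message list
        # and returns the assistant's reply as a string.
        self.send = send
        self.messages: List[Message] = [{"role": "system", "content": system_prompt}]

    def ask(self, question: str) -> str:
        self.messages.append({"role": "user", "content": question})
        answer = self.send(list(self.messages))
        # Store the reply so that the next question can refer back to it.
        self.messages.append({"role": "assistant", "content": answer})
        return answer


def fake_send(messages: List[Message]) -> str:
    # Stand-in for the real API call; reports how much history it received.
    return f"(reply based on {len(messages)} messages)"


chat = Conversation(fake_send)
first = chat.ask("What is Ecoinvent?")
second = chat.ask("Where is it located?")
print(first)   # (reply based on 2 messages)
print(second)  # (reply based on 4 messages)
```

With the real API, `send` could wrap `openai.ChatCompletion.create(...)` and return `response["choices"][0]["message"]["content"]`, as in the cells above.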

## What is the `system` role used for?

In [4]:
response_pirate = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a pirate."},
        {"role": "user", "content": "What is Ecoinvent?"}
    ]
)
response_pirate

<OpenAIObject chat.completion id=chatcmpl-80G71sBqTxSApJK6sPrDIevlU4Dli at 0x7faed85812b0> JSON: {
  "id": "chatcmpl-80G71sBqTxSApJK6sPrDIevlU4Dli",
  "object": "chat.completion",
  "created": 1695072627,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Ahoy there! Ecoinvent be a comprehensive life cycle database that provides detailed information about the environmental impacts of various goods and services throughout their entire life cycle. It be used for assessing the carbon footprint, energy consumption, and other environmental aspects of different products. As a pirate, I reckon it be important for us to be mindful of our impact on the environment too, even on the high seas! Arr!"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 22,
    "completion_tokens": 82,
    "total_tokens": 104
  }
}

### Breaking down the response

```javascript
{
  "usage": {
    "prompt_tokens": 148,
    "completion_tokens": 95,
    "total_tokens": 243
  }
}
```

What are these **tokens**?

OpenAI provides a nice web app, the [OpenAI Tokenizer](https://platform.openai.com/tokenizer), to see how text is broken down into tokens.

They say:

```
A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
```

![Tokenizer words](tokenizer-1.png)

### Tokens map sequences of characters to integers

![Tokenizer words](tokenizer-2.png)

### And programmatically with tiktoken

In [5]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

In [6]:
tokens = enc.encode("What is Ecoinvent?")
tokens

[3923, 374, 469, 7307, 688, 30]

In [7]:
[enc.decode([token]) for token in tokens]

['What', ' is', ' E', 'coin', 'vent', '?']
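To make the string-to-integer mapping concrete, here is a toy tokenizer restricted to the six tokens printed above. It uses a greedy longest-match lookup, which is a simplification and not the byte-pair-encoding procedure that tiktoken actually implements; the vocabulary only contains the IDs shown in the previous cells.

```python
from typing import Dict, List

# Tiny vocabulary copied from the tiktoken output above (all other tokens omitted).
TOY_VOCAB: Dict[str, int] = {
    "What": 3923, " is": 374, " E": 469, "coin": 7307, "vent": 688, "?": 30,
}
TOY_VOCAB_INV = {token_id: text for text, token_id in TOY_VOCAB.items()}


def toy_encode(text: str) -> List[int]:
    """Greedy longest-match encoding over TOY_VOCAB."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest vocabulary entries first.
        for piece in sorted(TOY_VOCAB, key=len, reverse=True):
            if text.startswith(piece, pos):
                ids.append(TOY_VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"No token in the toy vocabulary matches at position {pos}")
    return ids


def toy_decode(ids: List[int]) -> str:
    """Concatenate the text of each token ID."""
    return "".join(TOY_VOCAB_INV[i] for i in ids)


ids = toy_encode("What is Ecoinvent?")
print(ids)              # [3923, 374, 469, 7307, 688, 30]
print(toy_decode(ids))  # What is Ecoinvent?
```

Any text outside this six-entry vocabulary raises an error here, whereas a real tokenizer can encode arbitrary input.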

### Why do we care about tokens here?

<details>
  
<summary><b>Answer</b></summary>
Because there is a limit to the size of the context window, and it is measured in tokens.
Also, the pricing is per token.

</details>
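A rough sketch of how both constraints can be checked before and after a call. The context-window size and the prices below are placeholder values chosen for illustration, not official limits or prices; the token estimate uses the ~4 characters per token rule of thumb quoted earlier, whereas `tiktoken` gives the exact count.

```python
import math

# Hypothetical values for illustration only; check the provider's documentation.
CONTEXT_WINDOW_TOKENS = 4096
PRICE_PER_1K_PROMPT_TOKENS = 0.0015      # assumed, in USD
PRICE_PER_1K_COMPLETION_TOKENS = 0.002   # assumed, in USD


def estimate_tokens(text: str) -> int:
    """Rule of thumb: one token is roughly 4 characters of English text."""
    return math.ceil(len(text) / 4)


def fits_in_context(prompt: str, reserved_for_answer: int = 500) -> bool:
    """True if the estimated prompt size leaves room for the answer."""
    return estimate_tokens(prompt) + reserved_for_answer <= CONTEXT_WINDOW_TOKENS


def estimate_cost(usage: dict) -> float:
    """Cost of one call, computed from the `usage` block of a response."""
    return (usage["prompt_tokens"] / 1000 * PRICE_PER_1K_PROMPT_TOKENS
            + usage["completion_tokens"] / 1000 * PRICE_PER_1K_COMPLETION_TOKENS)


# Usage block of the first API response above.
usage = {"prompt_tokens": 23, "completion_tokens": 78, "total_tokens": 101}
print(estimate_tokens("What is Ecoinvent?"))
print(fits_in_context("What is Ecoinvent?"))
print(f"{estimate_cost(usage):.6f} USD")
```

The estimate for "What is Ecoinvent?" is 5 tokens while tiktoken counted 6, which shows that the rule of thumb is only an approximation.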


## Checkpoint

- Text is split into tokens before the model processes it. Tokens are just mappings from strings to integers, and together they define an exhaustive vocabulary.
- There is a context window: we can't pass an unlimited amount of data to LLMs.

## Which applications can you think of in the context of LCAs?

Here is one I used for Brightcon:

- Text generation from a small description

<details>
  
<summary><b>Answer</b></summary>

Let's describe the process of doing an LCA:

1. Collect imperfect data from a customer - for instance a bunch of Excel files with random names for units (seen at the Hackathon): `kg CH4`, `PCS`. <b>Covered application 1: "kg CH4" -> "kg" and "PCS" -> "Item(s)"</b>
2. Create a supply chain in an LCA software, mapping this "company data" to LCI data (like Ecoinvent, but we'll use ELCD here).
3. From the numerous processes in the LCI database, pick the one that most closely matches your data (or refine this simplified approach). <b>Covered application 2: Search and match</b>
4. Compute impacts and do something with them

</details>


# Application 1. Mapping vocabularies

- Company data -> LCI format (units for instance)
- One ontology to another ontology (suggested through the Hackathon)

## Dataset

We assume you did the following:

- Go to [OpenLCA Nexus](https://nexus.openlca.org/)
- Create an account / sign in and download ELCD (a retired free EU LCI database)
- Download OpenLCA V2
- Click "Database > Restore database" and pick the ELCD ".zolca" file you've just downloaded
- Then click "File > Export > JSON-LD"
- Unzip the JSON-LD zip you now have, under `datasets/elcd/`

For licensing reasons, I cannot provide the data as is; you need to follow these steps yourself.

Your folder structure should look like

```bash
tree datasets/ -L 2
datasets/
└── elcd
    ├── actors
    ├── bin
    ├── categories
    ├── context.json
    ├── currencies
    ├── dq_systems
    ├── flow_properties
    ├── flows
    ├── lcia_categories
    ├── lcia_methods
    ├── locations
    ├── meta.info
    ├── nw_sets
    ├── processes
    ├── sources
    └── unit_groups
```
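Before loading anything, it can help to verify that the export landed where expected. A small sketch using the folder names from the tree above; which folders count as required is an assumption made for this example, and `missing_folders` is a hypothetical helper:

```python
import tempfile
from pathlib import Path
from typing import Iterable, List

# Assumed subset of the folders shown in the tree above.
REQUIRED_FOLDERS = ("unit_groups", "flows", "processes")


def missing_folders(elcd_root: str, required: Iterable[str] = REQUIRED_FOLDERS) -> List[str]:
    """Return the required sub-folders that are absent under `elcd_root`."""
    root = Path(elcd_root)
    return [name for name in required if not (root / name).is_dir()]


# Demonstration on a throwaway directory that only contains `unit_groups`.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "unit_groups").mkdir()
    demo_missing = missing_folders(tmp)
print(demo_missing)  # ['flows', 'processes']
```

On a real export, the call would be `missing_folders("datasets/elcd")`, and an empty list means the layout matches.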

In [8]:
# Set the datasets root path here (the folder that contains `elcd/`)
DSROOT="/srv/data/llm/"

In [9]:
import glob
import json
import os

from tqdm import tqdm

def load_units(root: str):
    units_set = set()
    for unit_group_fp in tqdm(glob.glob(os.path.join(root, "unit_groups", "*.json"))):
        with open(unit_group_fp) as unit_group_f:
            unit_group = json.load(unit_group_f)
        for unit in unit_group["units"]:
            units_set.add(unit["name"])
    return units_set

In [10]:
units_set = load_units(os.path.join(DSROOT, "elcd"))
print(f"N units: {len(units_set)}")
print(units_set)

N units: 202
{'kg CO2-Equiv.', 'ug', 'USD 2002', 'bl (Imp)', 'ng', 'fur', 'Rutherford', 'in', 'kg Ethene-Equiv.', 'pt (US fl)', 'EUR 2003', 'a', 'cwt', 'dl', 'p*km', 'nmi', 'pt (US dry)', 'oz av', 's', 'mi', 'mi2*a', 'v*km', 'm3', '(mm*m2)/a', 'mol', 'AUD 2000', 'PJ', 'ct', 'kWh/m2*d', 'lb av', 'nBq', 'yd3', 'nmi2', 'CAD 2000', 'kg/a', 'm3*mi', 'pg', 'lb*mi', 'DKK 2000', 'mg', 'ft2*a', 'gal (Imp)', 'kt*km', 'MWh', 'cl', 'm2*a', 'l*km', 'm3*km', 'l*mi', 'cm', 'M$ 2000', 'ul', 't*mi', 'm3*nmi', 't*nmi', 'bl (US beer)', 'dr (Av)', 'ISK 2000', 'fl oz (Imp)', 'yd', 't*d', 'cu ft', 'Ci', 'mm2', 'TJ', 'kg SWU', 'pt (Imp)', 'ftm', 'CHF 2000', 'EUR 2000', 'Dozen(s)', 'Items*nmi', 'mi2', 'ha', 'm2 yr eq organic arable land', 'lb*nmi', 'l*nmi', 'pk', 'dwt', 'CHF 2005', 'LTL 2000', 'J', '(cmol*m2*a)/kg', 'EEK 2000', 'GBP 2000', 'kg R11-Equiv.', 'in3', 'KRW 2000', 't', 'Bq', 'kg*km', 'GJ', 'ac', 'km', 'dm2', 'sFr', 'TOE', 'USD 2000', 'bl (US fl)', 'Item(s)', 'kWh', 't*km', 'in2', 'yd2', 'MJ/kg*d', 

## Objective: convert "kg N" into a unit from ELCD above

Concrete goal: make a function

```python
from typing import Set

def map_unit(source_unit: str, units: Set[str]) -> str:
    """Return the unit from `units` that `source_unit` corresponds to."""
    ...
```

[Source from Hestia](https://www.hestia.earth/glossary?page=1&query=Excreta%20(kg%20N))

![Hestia kg N](hestia-1.png)

In [11]:
from typing import Set

def map_unit_1(source_unit: str, units_set: Set[str]) -> str:
    sorted_units = list(sorted(units_set))
    
    prompt = f"""
    ```
    {', '.join(sorted_units)}
    ```

    Between the backticks is an exhaustive list of allowed units.
    
    Which of these units does {source_unit} correspond to?
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ]
    )
    return response["choices"][0]["message"]["content"]

In [12]:
map_unit_1("kg N", units_set)

The prompt is: 
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg R11-Equiv., kg SO2-Equiv., kg SWU, kg Sb-Equiv., kg*a, kg*d, kg*km, kg

'The unit "kg N" corresponds to a kilogram of nitrogen.'

In [13]:
map_unit_1("kg N", units_set)

The prompt is: 
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg R11-Equiv., kg SO2-Equiv., kg SWU, kg Sb-Equiv., kg*a, kg*d, kg*km, kg

'The unit "kg N" corresponds to kilograms of nitrogen.'

In [14]:
map_unit_1("kg N", units_set)

The prompt is: 
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg R11-Equiv., kg SO2-Equiv., kg SWU, kg Sb-Equiv., kg*a, kg*d, kg*km, kg

'The unit "kg N" corresponds to kilograms of nitrogen.'

### Notice a difference?

We don't always get the same output: the first answer is worded differently from the next two. There is randomness in the generation, which is a bit scary if we want to rely on it in code.

We can reduce the randomness by setting `temperature=0`.

From now on this will be the default setting.
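To see why a low temperature reduces randomness, here is a sketch of temperature-scaled sampling probabilities over three made-up candidate tokens. The scores are invented for illustration, and an actual API may treat `temperature=0` as a special case (for instance by always picking the top token), since dividing by zero is undefined. Hosted models may also still show small variations between calls.

```python
import math
from typing import Dict


def softmax_with_temperature(scores: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Turn raw scores into sampling probabilities; a lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = {tok: s / temperature for tok, s in scores.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Invented scores for three possible next tokens.
scores = {" kilograms": 2.0, " a": 1.5, " the": 0.5}

for t in (1.0, 0.5, 0.05):
    probs = softmax_with_temperature(scores, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

As the temperature approaches zero, almost all of the probability mass moves to the highest-scoring token, so repeated calls tend to produce the same text.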

### Seems to have worked! But I'd like to use that in my code, how do I do that?

<details>
  
<summary><b>Answer</b></summary>

- Build a regex? Not robust.
- Ask the LLM to return a structured output? Yes please.

</details>


In [15]:
def map_unit_2(source_unit: str, units_set: Set[str]) -> str:
    sorted_units = list(sorted(units_set))

    response_formatter = "Output a JSON"
    
    prompt = f"""
    ```
    {', '.join(sorted_units)}
    ```

    Between the backticks is an exhaustive list of allowed units.
    
    Which of these units does {source_unit} correspond to?

    {response_formatter}
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    return response["choices"][0]["message"]["content"]

In [16]:
map_unit_2("kg N", units_set)

The prompt is: 
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg R11-Equiv., kg SO2-Equiv., kg SWU, kg Sb-Equiv., kg*a, kg*d, kg*km, kg

'{\n  "kg N": "kg of Nitrogen"\n}'

### Cool, that's an improvement, let's parse it

In [17]:
def map_unit_3(source_unit: str, units_set: Set[str]) -> str:
    sorted_units = list(sorted(units_set))

    response_formatter = "Output a JSON"
    
    prompt = f"""
    ```
    {', '.join(sorted_units)}
    ```

    Between the backticks is an exhaustive list of allowed units.
    
    Which of these units does {source_unit} correspond to?

    {response_formatter}
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_json_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_json_str}")

    parsed_output = json.loads(output_json_str)
    return parsed_output[source_unit]


In [18]:
map_unit_3("kg N", units_set)

The prompt is: 
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg R11-Equiv., kg SO2-Equiv., kg SWU, kg Sb-Equiv., kg*a, kg*d, kg*km, kg

'kg of Nitrogen'
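Note that `json.loads` raises if the model wraps its JSON in extra text, and the key lookup raises if the model picks a different key. A hedged sketch of a more defensive parser: the helper name `extract_json_object` is hypothetical, and taking the span between the first `{` and the last `}` is a heuristic that will not rescue every malformed reply.

```python
import json
from typing import Any, Dict


def extract_json_object(reply: str) -> Dict[str, Any]:
    """Parse the span from the first '{' to the last '}' of `reply` as JSON."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError(f"No JSON object found in reply: {reply!r}")
    # json.JSONDecodeError is a subclass of ValueError, so callers can
    # handle both failure modes with a single `except ValueError`.
    return json.loads(reply[start:end + 1])


# A clean reply, as returned in the cell above.
clean = extract_json_object('{\n  "kg N": "kg of Nitrogen"\n}')
print(clean)

# A reply with surrounding chatter (made up for this example).
chatty = extract_json_object('Sure! Here is the result:\n{"unit": "kg"}\nAnything else?')
print(chatty)
```

Using `parsed.get(key)` instead of `parsed[key]` afterwards gives the caller a chance to handle a missing key explicitly.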

In [20]:
def map_unit_4(source_unit: str, units_set: Set[str]) -> str:
    sorted_units = list(sorted(units_set))

    response_formatter = "Output a JSON with key 'unit' and value the unit from the list of units between the backticks"
    
    prompt = f"""
    ```
    {', '.join(sorted_units)}
    ```

    Between the backticks is an exhaustive list of allowed units.
    
    Which of these units does {source_unit} correspond to?

    {response_formatter}
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_json_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_json_str}")

    parsed_output = json.loads(output_json_str)
    return parsed_output["unit"]


In [21]:
map_unit_4("kg N", units_set)

The prompt is: 
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg R11-Equiv., kg SO2-Equiv., kg SWU, kg Sb-Equiv., kg*a, kg*d, kg*km, kg

'kg N'
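Here the model simply echoed `kg N`, which is not in the allowed list. Since the list is known, the answer can be checked in code rather than trusted. A minimal sketch, where `check_allowed_unit` is a hypothetical helper and the small set stands in for the ELCD units:

```python
from typing import Set


def check_allowed_unit(candidate: str, units_set: Set[str]) -> str:
    """Return `candidate` (stripped) if it is an allowed unit, else raise ValueError."""
    cleaned = candidate.strip()
    if cleaned not in units_set:
        raise ValueError(f"{cleaned!r} is not one of the {len(units_set)} allowed units")
    return cleaned


allowed = {"kg", "g", "Item(s)", "MJ"}  # small stand-in for the ELCD unit set

print(check_allowed_unit("kg", allowed))
try:
    check_allowed_unit("kg N", allowed)
except ValueError as err:
    print(f"Rejected: {err}")
```

A rejected answer could then trigger a retry with a stricter prompt, or be flagged for manual review.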

In [22]:
def map_unit_5(source_unit: str, units_set: Set[str]) -> str:
    sorted_units = list(sorted(units_set))

    response_formatter = "Output a JSON with key 'unit' and value the unit from the list of units between the backticks"
    
    prompt = f"""
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    {', '.join(sorted_units)}
    ```

    Which of these units does {source_unit} correspond to?

    {response_formatter}
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_json_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_json_str}")

    parsed_output = json.loads(output_json_str)
    return parsed_output["unit"]


In [23]:
map_unit_5("kg N", units_set)

The prompt is: 
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg

'kg'
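The only change between `map_unit_4` and `map_unit_5` is the position of the sentence describing the list, and the answer went from `kg N` to `kg`. Pulling the prompt construction into its own function makes such variants easier to compare and to test without calling the API. A sketch, with `build_unit_prompt` and its `describe_first` flag as hypothetical names; unlike the cells above, it does not carry the indentation of the triple-quoted string into the prompt.

```python
from typing import Set

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def build_unit_prompt(source_unit: str, units_set: Set[str],
                      response_formatter: str, describe_first: bool = True) -> str:
    """Assemble the unit-mapping prompt; `describe_first` toggles the sentence order."""
    description = "Between the backticks is an exhaustive list of allowed units"
    unit_block = f"{FENCE}\n{', '.join(sorted(units_set))}\n{FENCE}"
    if describe_first:
        head = f"{description}:\n\n{unit_block}"   # layout used by map_unit_5
    else:
        head = f"{unit_block}\n\n{description}."   # layout used by map_unit_4
    question = f"Which of these units does {source_unit} correspond to?"
    return f"{head}\n\n{question}\n\n{response_formatter}"


formatter = "Output a JSON with key 'unit' and value the unit from the list of units between the backticks"
prompt_a = build_unit_prompt("kg N", {"kg", "g", "MJ"}, formatter)
prompt_b = build_unit_prompt("kg N", {"kg", "g", "MJ"}, formatter, describe_first=False)
print(prompt_a)
```

Both variants can then be sent to the model side by side to see which layout gives the more reliable answer.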

## This still feels really brittle. Can we type this?



In [24]:
import pydantic

PYDANTIC_FORMAT_INSTRUCTIONS = """The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {{"properties": {{"foo": {{"title": "Foo", "description": "a list of strings", "type": "array", "items": {{"type": "string"}}}}}}, "required": ["foo"]}}
the object {{"foo": ["bar", "baz"]}} is a well-formatted instance of the schema. The object {{"properties": {{"foo": ["bar", "baz"]}}}} is not well-formatted.

Here is the output schema:
```
{schema}
```"""

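class UnitResponse(pydantic.BaseModel):
    # NOTE: the first positional argument of `pydantic.Field` is the field's
    # default value, so this text shows up under "default" in the schema.
    # Passing it as `description=...` would make it a description instead.
    unit: str = pydantic.Field("the mapped unit from the provided exhaustive list")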

In [25]:
UnitResponse.schema()

/tmp/ipykernel_17791/3982187443.py:1: PydanticDeprecatedSince20: The `schema` method is deprecated; use `model_json_schema` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.3/migration/
  UnitResponse.schema()


{'properties': {'unit': {'default': 'the mapped unit from the provided exhaustive list',
   'title': 'Unit',
   'type': 'string'}},
 'title': 'UnitResponse',
 'type': 'object'}

In [26]:
def describe_pydantic_schema_as_str(model: pydantic.BaseModel):
    schema = model.schema()
    for key, value in schema.get("properties", {}).items():
        if "title" in value:
            del value["title"]
        if "type" in value and "description" in value:
            value = value["description"]
            schema["properties"][key] = value
    return json.dumps(schema)

In [27]:
describe_pydantic_schema_as_str(UnitResponse)

/tmp/ipykernel_17791/2251163081.py:2: PydanticDeprecatedSince20: The `schema` method is deprecated; use `model_json_schema` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.3/migration/
  schema = model.schema()


'{"properties": {"unit": {"default": "the mapped unit from the provided exhaustive list", "type": "string"}}, "title": "UnitResponse", "type": "object"}'

In [28]:
def map_unit_6(source_unit: str, units_set: Set[str]) -> dict:
    sorted_units = list(sorted(units_set))

    response_formatter = PYDANTIC_FORMAT_INSTRUCTIONS.format(
        schema=describe_pydantic_schema_as_str(UnitResponse)
    )
    
    prompt = f"""
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    {', '.join(sorted_units)}
    ```

    Which of these units does {source_unit} correspond to?

    {response_formatter}
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_json_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_json_str}")

    return json.loads(output_json_str)

In [29]:
map_unit_6("kg N", units_set)

/tmp/ipykernel_17791/2251163081.py:2: PydanticDeprecatedSince20: The `schema` method is deprecated; use `model_json_schema` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.3/migration/
  schema = model.schema()


The prompt is: 
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg

{'unit': 'kg'}

In [30]:
def map_unit_7(source_unit: str, units_set: Set[str]) -> dict:
    sorted_units = list(sorted(units_set))

    response_formatter = PYDANTIC_FORMAT_INSTRUCTIONS.format(
        schema=describe_pydantic_schema_as_str(UnitResponse)
    )
    
    prompt = f"""
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    {', '.join(sorted_units)}
    ```

    Which of these units does {source_unit} correspond to?

    {response_formatter}
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_json_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_json_str}")

    parsed_output = json.loads(output_json_str)

    assert UnitResponse.parse_obj(parsed_output)
    
    return parsed_output

In [31]:
map_unit_7("kg N", units_set)

/tmp/ipykernel_17791/2251163081.py:2: PydanticDeprecatedSince20: The `schema` method is deprecated; use `model_json_schema` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.3/migration/
  schema = model.schema()


The prompt is: 
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg

/tmp/ipykernel_17791/4077376177.py:35: PydanticDeprecatedSince20: The `parse_obj` method is deprecated; use `model_validate` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.3/migration/
  assert UnitResponse.parse_obj(parsed_output)


{'unit': 'kg'}

### What if the source unit is a sub/super unit?

Let's make our schema more complex and see if this works out of the box.

In [32]:
class UnitResponse2(pydantic.BaseModel):
    source_unit: str = pydantic.Field("the source unit")
    allowed_unit: str = pydantic.Field("the allowed unit")
    conversion_factor: int = pydantic.Field("the conversion factor from the provided unit to the mapped unit one")


def map_unit_8(source_unit: str, units_set: Set[str]) -> dict:
    sorted_units = list(sorted(units_set))

    response_formatter = PYDANTIC_FORMAT_INSTRUCTIONS.format(
        schema=describe_pydantic_schema_as_str(UnitResponse2)
    )
    
    prompt = f"""
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    {', '.join(sorted_units)}
    ```

    Which of these units does source unit: `{source_unit}` correspond to?

    {response_formatter}
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_json_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_json_str}")

    parsed_output = json.loads(output_json_str)

    assert UnitResponse2.parse_obj(parsed_output)
    
    return parsed_output

In [33]:
map_unit_8("kg N", units_set)

/tmp/ipykernel_17791/2251163081.py:2: PydanticDeprecatedSince20: The `schema` method is deprecated; use `model_json_schema` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.3/migration/
  schema = model.schema()


The prompt is: 
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg

/tmp/ipykernel_17791/3893187837.py:41: PydanticDeprecatedSince20: The `parse_obj` method is deprecated; use `model_validate` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.3/migration/
  assert UnitResponse2.parse_obj(parsed_output)


{'source_unit': 'kg N', 'allowed_unit': 'kg', 'conversion_factor': 1}

In [34]:
def map_unit_9(source_unit: str, units_set: Set[str]) -> dict:
    sorted_units = list(sorted(units_set))

    response_formatter = PYDANTIC_FORMAT_INSTRUCTIONS.format(
        schema=describe_pydantic_schema_as_str(UnitResponse2)
    )
    
    prompt = f"""
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    {', '.join(sorted_units)}
    ```

    Which of these units does source unit: `{source_unit}` correspond to?

    {response_formatter}
    Also, provide a detailed explanation in the output.
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_json_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_json_str}")

    parsed_output = json.loads(output_json_str)

    assert UnitResponse2.parse_obj(parsed_output)
    
    return parsed_output

In [35]:
map_unit_9("kg N", units_set)

/tmp/ipykernel_17791/2251163081.py:2: PydanticDeprecatedSince20: The `schema` method is deprecated; use `model_json_schema` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.3/migration/
  schema = model.schema()


The prompt is: 
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg

JSONDecodeError: Extra data: line 7 column 1 (char 79)

### Too brittle! How do we fix this?

<details>
    <summary><b>Answer</b></summary>
    - Add a regex to parse JSON output out of the text response
</details>

In [36]:
import re

def extract_json(text: str):
    """Return the span from the first "{" to the last "}" in text, or None."""
    # DOTALL lets "." match newlines, so multi-line JSON is captured.
    match = re.search(r"\{.*\}", text.strip(), re.DOTALL)
    if not match:
        return None

    return match.group()

def map_unit_10(source_unit: str, units_set: Set[str]) -> dict:
    sorted_units = list(sorted(units_set))

    response_formatter = PYDANTIC_FORMAT_INSTRUCTIONS.format(
        schema=describe_pydantic_schema_as_str(UnitResponse2)
    )
    
    prompt = f"""
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    {', '.join(sorted_units)}
    ```

    Which of these units does source unit: `{source_unit}` correspond to?

    {response_formatter}
    Also, provide a detailed explanation in the output.
    """

    print(f"The prompt is: {prompt}")
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_str}")
    
    output_json_str = extract_json(output_str)
    print(f"Output JSON: {output_json_str}")

    parsed_output = json.loads(output_json_str)

    assert UnitResponse2.parse_obj(parsed_output)
    
    return parsed_output

In [37]:
map_unit_10("kg N", units_set)

/tmp/ipykernel_17791/2251163081.py:2: PydanticDeprecatedSince20: The `schema` method is deprecated; use `model_json_schema` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.3/migration/
  schema = model.schema()


The prompt is: 
    Between the backticks is an exhaustive list of allowed units:
    
    ```
    $, (cmol*m2)/kg, (cmol*m2*a)/kg, (mm*m2)/a, AUD 2000, Bq, CAD 2000, CHF 2000, CHF 2005, CZK 2000, Ci, DKK 2000, Dozen(s), EEK 2000, EUR, EUR 2000, EUR 2003, GBP 2000, GJ, HUF 2000, ISK 2000, Item(s), Items*a, Items*km, Items*mi, Items*nmi, J, JPY 2000, KRW 2000, LTL 2000, LVL 2000, M$ 2000, MJ, MJ/kg*d, MWh, Mg, Mt, NOK 2000, Nm3, PJ, Rutherford, SEK 2000, TCE, TJ, TOE, UBP, US fl oz, USD 2000, USD 2002, Wh, Yen, ZAR 2000, a, ac, bbl, bl (Imp), bl (US beer), bl (US dry), bl (US fl), bsh (Imp), bsh (US), btu, cg, ch, cl, cm, cm*m2/d, cm*m3, cm2, cm2a, cm3, cm3*a, ct, cu ft, cwt, d, dag, dal, dam, dg, dl, dm, dm2, dm3, dr (Av), dr (Fl), dwt, fl oz (Imp), ft, ft2, ft2*a, ftm, fur, g, g*a, gal (Imp), gal (US dry), gal (US fl), gal (US liq), gill, gr, h, ha, ha*a, hg, hl, hm, in, in2, in3, kBq, kJ, kWh, kWh/m2*d, kcal, kg, kg CO2-Equiv., kg DCB-Equiv., kg Ethene-Equiv., kg Phosphate-Equiv., kg

/tmp/ipykernel_17791/2516497585.py:49: PydanticDeprecatedSince20: The `parse_obj` method is deprecated; use `model_validate` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.3/migration/
  assert UnitResponse2.parse_obj(parsed_output)


{'source_unit': 'kg N', 'allowed_unit': 'kg', 'conversion_factor': 1}
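
The greedy pattern `\{.*\}` in `extract_json` captures everything from the first `{` to the last `}`, so it would over-capture if the explanation after the JSON contained braces of its own. A minimal sketch of an alternative uses `json.JSONDecoder.raw_decode` from the standard library, which parses one JSON value starting at a given index and ignores what follows. The helper name `extract_first_json_object` and the sample reply are made up for illustration.

```python
import json
from typing import Any, Dict, Optional


def extract_first_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object found in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses a single JSON value beginning at `start`
            # and returns it with the index where parsing stopped.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Not valid JSON from this brace: try the next candidate.
        start = text.find("{", start + 1)
    return None


# Hypothetical model reply: valid JSON followed by prose that contains braces.
reply = (
    '{"source_unit": "kg N", "allowed_unit": "kg", "conversion_factor": 1}\n\n'
    "Explanation: {kg N} denotes kilograms of nitrogen."
)
parsed = extract_first_json_object(reply)
print(parsed)
```

This version returns the parsed dictionary directly, so it would replace both `extract_json` and the following `json.loads` call.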

## What we built is a "chain"

See [LangChain chains](https://python.langchain.com/docs/modules/chains/)

The chain was:

- Prompt
- Input variables: "the unit" and the "allowed units"
- Structure that must be present in the output: the Pydantic format
- Output parsing: regex + Pydantic validation
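
The four pieces listed above can be wired together as plain functions. The sketch below is one possible arrangement, not LangChain's API: `fake_llm` is a stand-in for the chat completion call, and a hand-written key check stands in for the Pydantic validation so that the example has no dependencies. All function names are hypothetical.

```python
import json
import re
from typing import Callable, Dict, List

PROMPT_TEMPLATE = (
    "Allowed units: {allowed_units}\n"
    "Which of these units does source unit `{source_unit}` correspond to?\n"
    'Answer as JSON with keys "source_unit" and "allowed_unit".'
)


def build_prompt(source_unit: str, allowed_units: List[str]) -> str:
    # Prompt + input variables.
    return PROMPT_TEMPLATE.format(
        allowed_units=", ".join(sorted(allowed_units)), source_unit=source_unit
    )


def parse_output(text: str, allowed_units: List[str]) -> Dict[str, str]:
    # Output parsing: pull the JSON out of the reply, then validate its structure.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object in model reply")
    data = json.loads(match.group())
    if set(data) != {"source_unit", "allowed_unit"}:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    if data["allowed_unit"] not in allowed_units:
        raise ValueError(f"unit not in allowed list: {data['allowed_unit']}")
    return data


def run_chain(source_unit: str, allowed_units: List[str], llm: Callable[[str], str]) -> Dict[str, str]:
    return parse_output(llm(build_prompt(source_unit, allowed_units)), allowed_units)


def fake_llm(prompt: str) -> str:
    # Stand-in for the chat completion call; always gives the same reply.
    return 'Sure! {"source_unit": "kg N", "allowed_unit": "kg"} Hope this helps.'


result = run_chain("kg N", ["g", "kg", "MJ"], fake_llm)
print(result)
```

Passing the model as a callable keeps the chain testable without network access: the real call can be swapped in later without touching the prompt or the parser.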

# Application 2. Assisted search.

**Scenario:** a company ships 20 tons of aluminum from Luxembourg to Marseille by truck.

ELCD has a bunch of processes. How do we help this company or LCA practitioner find the right one?

### Truck options in ELCD

![Truck options in ELCD](trucks-1.png)

## Based on what we did so far, what's the most naive, brute force idea that comes to mind?

<details>
    <summary><b>Answer</b></summary>
    Show all ELCD processes in a prompt and ask questions!
</details>

In [38]:
def load_processes(root: str):
    processes_by_uuid = dict()
    for process_fp in tqdm(glob.glob(os.path.join(root, "processes", "*.json"))):
        with open(process_fp) as process_f:
            process = json.load(process_f)
        processes_by_uuid[process["@id"]] = process
    return processes_by_uuid

In [39]:
processes_by_uuid = load_processes(os.path.join(DSROOT, "elcd"))

print(len(processes_by_uuid))

  0%|          | 0/609 [00:00<?, ?it/s]

609


In [40]:
sample_process = processes_by_uuid["b4451be0-3393-11dd-bd11-0800200c9a66"]
print(f"name: {sample_process['name']} \n\ndescription: {sample_process['processDocumentation']['technologyDescription']}")

name: Small lorry transport, Euro 0, 1, 2, 3, 4 mix, 7,5 t total weight, 3,3 t max payload 

description: Weighted average of small lorries with 7.5t total weight for emission standards from EURO 0 to EURO 4. Payload of the lorry is 3.3t; its utilization ratio is 85%. The following combustion emissions (measured data) of the lorry are taken into account: ammonia, benzene, carbon dioxide, carbon monoxide, methane, nitrogen oxides, nitrous oxide, NMVOC, particulate PM 2.5, sulfur dioxide, toluene, xylene. NMVOC, toluene and xylene emissions of the vehicle result from imperfect combustion and evaporation losses via diffusion through the tank. Lorry fueled by diesel. Data set includes the whole fuel supply chain from exploration and extraction of crude oil over preparation to transportation to consumer. The background system is addressed as follows: Refinery products: Diesel, gasoline, technical gases, fuel oils, basic oils and residues such as bitumen are modelled via a country-specific, 

In [41]:
from typing import Any, Dict

def generate_processes_prompt(processes: Dict[str, Dict[str, Any]]):
    prompts = []
    for process in processes.values():
        process_prompt = f"""
        ID: `{process["@id"]}`, Name: `{process["name"]}`, Description: `{process.get("processDocumentation", {}).get("technologyDescription")}`
        """
        
        prompts.append(process_prompt)
    return "\n".join(prompts)

In [42]:
naive_prompt = generate_processes_prompt(processes_by_uuid)
print(f"Size: {len(naive_prompt)}")

print(naive_prompt[:10000])

Size: 1610967

        ID: `952ccd7f-dc94-4f84-b2cf-007ef682f3f6`, Name: `Process steam from light fuel oil, consumption mix, at plant, heat plant, MJ`, Description: `The process steam is produced in a light fuel oil specific heat plant. The Czech specific fuel supply (share of resources used, by import and / or domestic supply) including the Czech specific energy carrier properties (e.g. element and energy contents) are accounted for. Furthermore Czech specific technology standards of heat plants regarding efficiency, firing technology, flue-gas desulphurisation, NOx removal and dedusting are considered. The Czech emission factors can be found in the table below in the corresponding column. The data set considers the whole supply chain of the fuels from exploration over extraction and preparation to transport of fuels to the heat plants. Furthermore the data set comprises the infrastructure as well as end-of-life of the plant.   The background system is addressed as follows:  Transpor

In [43]:
user_query = "Truck transport of 20 tonnes of aluminum"

In [44]:
def search_1(user_query: str, processes_prompt: str) -> str:
    
    prompt = f"""
    Between the triple backticks is your knowledge about industrial processes. Each process has an ID, a Name, and a Description, in the format
    ID: `the ID`, Name: `the name`, Description: `the description`
    
    ```
    {processes_prompt}
    ```

    Which of these processes is the most likely to represent: `{user_query}`?
    Return its ID, Name, and Description.
    """
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_str = response["choices"][0]["message"]["content"]
    
    return output_str

In [45]:
search_1(user_query, naive_prompt)

RateLimitError: Rate limit reached for default-gpt-3.5-turbo in organization org-DfwZQdiyjqKIMuK7jmO81ebG on tokens per min. Limit: 90000 / min. Current: 0 / min. Contact us through our help center at help.openai.com if you continue to have issues.

### Limit reached!

We hit the tokens-per-minute rate limit: the prompt is far too large to send in a single request.

**So we need to be a little more subtle.**

#### Idea 2: let's only pass transport processes

In [46]:
sample_process["category"]

{'@type': 'Category',
 '@id': '462bd94c-1d21-3c92-ac49-42263f473d5b',
 'name': 'Road',
 'categoryPath': ['Transport services'],
 'categoryType': 'Process'}

In [47]:
def get_transport_processes(processes_by_uuid):
    filtered_processes_by_uuid = dict()
    for k, process in processes_by_uuid.items():
        if "Transport services" not in str(process["category"].get("categoryPath", [])):
            continue
        filtered_processes_by_uuid[k] = process
    return filtered_processes_by_uuid

In [48]:
transport_processes = get_transport_processes(processes_by_uuid)
print(f"Size: {len(transport_processes)}")

for process in sorted(transport_processes.values(), key=lambda p: p["name"]):
    print(process["name"])

Size: 22
Articulated lorry transport, Euro 0, 1, 2, 3, 4 mix, 40 t total weight, 27 t max payload
Articulated lorry transport, Euro 0, 1, 2, 3, 4 mix, 40 t total weight, 27 t max payload
Barge, technology mix, 1.228 t pay load capacity
Barge, technology mix, 1.228 t pay load capacity
Bulk carrier ocean, technology mix, 100.000-200.000 dwt
Bulk carrier ocean, technology mix, 100.000-200.000 dwt
Container ship ocean, technology mix, 27.500 dwt pay load capacity
Container ship ocean, technology mix, 27.500 dwt pay load capacity
Excavator, technology mix, 100 kW, Construction
Excavator, technology mix, 500 kW, Mining
Lorry transport, Euro 0, 1, 2, 3, 4 mix, 22 t total weight, 17,3 t max payload
Lorry transport, Euro 0, 1, 2, 3, 4 mix, 22 t total weight, 17,3t max payload
Mining Truck, technology mix, 220 t payload, 1.700 kW
Mining Truck, technology mix, 220 t payload, 1.700 kW
Plane, technology mix, cargo, 68 t payload
Plane, technology mix, cargo, 68 t payload
Rail transport, technology m

In [49]:
transport_processes_prompt = generate_processes_prompt(transport_processes)
print(f"Size: {len(transport_processes_prompt)}")

print(transport_processes_prompt[:10000])

Size: 21462

        ID: `6efc3814-2cf9-4d18-894f-c869c3505341`, Name: `Container ship ocean, technology mix, 27.500 dwt pay load capacity`, Description: `Container ship ocean with 27500 dead weight tons (dwt) pay load capacity. Variable parameters (with default setting) are: distance (100km). Inputs: heavy fuel oil and cargo. Outputs: cargo and combustion emissions (carbon dioxide, carbon monoxide, methane, nitrogen oxides, NMVOC, particulate PM 2.5, sulphur dioxide). Vessel production, end-of-life treatment of the vessel and the fuel supply chain (emissions of exploration, refinery, transportation etc.) are not included in the data set.`
        

        ID: `09aa1e7b-1d7d-4a7c-8dde-ec6d40022fa1`, Name: `Barge, technology mix, 1.228 t pay load capacity`, Description: `Barge with 1228t pay load capacity. The following combustion emissions (measured data) of the barge are taken into account: carbon dioxide, carbon monoxide, methane, nitrogen oxides, NMVOC, particulate PM 2.5, sulphur 

In [50]:
search_1(user_query, transport_processes_prompt)

InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 5104 tokens. Please reduce the length of the messages.

### Limit reached!

We hit the limit on tokens per request: the model's context window (4097 tokens).

Read about overcoming context window limitations, if you want to write a full book or PhD thesis with ChatGPT ;).
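
Both errors can be anticipated before calling the API. The sketch below uses the common rule of thumb of roughly 4 characters per token for English text. This is only an approximation, and an exact count needs the model's tokenizer. The two limits are taken from the error messages above, and the function names are hypothetical.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb for English text, not an exact count
CONTEXT_WINDOW_TOKENS = 4097  # from the InvalidRequestError above
RATE_LIMIT_TOKENS_PER_MIN = 90_000  # from the RateLimitError above


def estimate_tokens(num_chars: int) -> int:
    """Approximate the number of tokens in a text of `num_chars` characters."""
    return num_chars // CHARS_PER_TOKEN


def fits_in_context(num_chars: int, limit: int = CONTEXT_WINDOW_TOKENS) -> bool:
    return estimate_tokens(num_chars) <= limit


# Prompt sizes printed earlier in the notebook (in characters).
for label, size in [("all processes", 1_610_967), ("transport only", 21_462)]:
    tokens = estimate_tokens(size)
    over_rate_limit = tokens > RATE_LIMIT_TOKENS_PER_MIN
    print(
        f"{label}: ~{tokens} tokens, "
        f"fits in context: {fits_in_context(size)}, "
        f"over per-minute limit: {over_rate_limit}"
    )
```

For the transport-only prompt the estimate lands in the same range as the 5104 tokens reported by the API, which is good enough to decide whether a request is worth sending.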

**So we need to be a little more subtle.**

#### Idea 3: have you heard of embeddings?

Embedding models represent text as vectors of fixed size (1536 dimensions for `text-embedding-ada-002`, the OpenAI model used here). If a vector is normalized to unit length, each of its components lies between -1 and 1.

They allow us to compute distances between bits of text. Shall we use that to filter down our processes?
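
To make "distance between bits of text" concrete, here is a minimal sketch with made-up 3-dimensional vectors. Real embeddings have 1536 dimensions and come from the model; the numbers below are invented so that `truck` and `lorry` point in a similar direction and `steam` does not.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between a and b: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" (illustration only).
truck = [0.9, 0.1, 0.0]
lorry = [0.8, 0.2, 0.1]
steam = [0.0, 0.2, 0.9]

print(round(cosine(truck, lorry), 3))
print(round(cosine(truck, steam), 3))
# After normalization every component lies between -1 and 1.
print(all(-1.0 <= x <= 1.0 for x in normalize(truck)))
```

The higher the cosine similarity, the closer the two texts are assumed to be in meaning, which is what lets us rank processes against a query.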

**To keep it simple, we'll only get embeddings of process names.**

In [51]:
response = openai.Embedding.create(
    input="But what the heck is an embedding",
    model="text-embedding-ada-002"
)
embeddings = response['data'][0]['embedding']
print(f"Size: {len(embeddings)}")
embeddings[:20]

Size: 1536


[-0.025320781394839287,
 -0.003034152090549469,
 -0.003502800129354,
 -0.004955264739692211,
 -0.01615457609295845,
 -0.005720263812690973,
 -0.013721740804612637,
 -0.02195754274725914,
 -0.030544830486178398,
 -0.018594304099678993,
 0.023639163002371788,
 0.03051726333796978,
 -0.01849781721830368,
 0.010199988260865211,
 -0.017050521448254585,
 0.027140239253640175,
 0.04270211234688759,
 0.01796025037765503,
 0.005489385686814785,
 -0.01779484562575817]

In [52]:
def compute_embeddings(processes_by_uuid, cache_fp: str=None):
    if cache_fp and os.path.exists(cache_fp):
        with open(cache_fp) as cache_f:
            return json.load(cache_f)
    
    embeddings_by_process_uuid = {}
    for process_id, process in tqdm(processes_by_uuid.items()):
        response = openai.Embedding.create(
            input=process["name"],
            model="text-embedding-ada-002"
        )
        embeddings = response['data'][0]['embedding']
        embeddings_by_process_uuid[process_id] = embeddings

    if cache_fp:
        with open(cache_fp, "w") as cache_f:
            json.dump(embeddings_by_process_uuid, cache_f)
    return embeddings_by_process_uuid

In [53]:
embeddings_cache_fp = os.path.join(DSROOT, "elcd_embeddings.json")
embeddings_by_process_uuid = compute_embeddings(processes_by_uuid, cache_fp=embeddings_cache_fp)

In [54]:
len(embeddings_by_process_uuid)

609
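
`compute_embeddings` sends one request per process name, that is 609 requests. The embeddings endpoint also accepts a list of strings as `input`, so names can be sent in batches. The sketch below shows only the batching logic: `embed_batch` is a hypothetical callable that takes a list of texts and returns one vector per text in the same order, `fake_embed_batch` is a stand-in for the real API call, and the batch size of 100 is an arbitrary choice.

```python
from typing import Callable, Dict, Iterator, List

Vector = List[float]


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_names(
    names_by_id: Dict[str, str],
    embed_batch: Callable[[List[str]], List[Vector]],
    batch_size: int = 100,
) -> Dict[str, Vector]:
    ids = list(names_by_id)
    result: Dict[str, Vector] = {}
    for id_batch in batched(ids, batch_size):
        vectors = embed_batch([names_by_id[i] for i in id_batch])
        # Relies on embed_batch preserving input order.
        result.update(zip(id_batch, vectors))
    return result


calls = []


def fake_embed_batch(texts: List[str]) -> List[Vector]:
    # Stand-in for the real API call: records the batch size and
    # returns a 1-dimensional "embedding" (the text length).
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]


names = {f"id-{i}": "x" * i for i in range(250)}
embeddings = embed_names(names, fake_embed_batch, batch_size=100)
print(len(embeddings), calls)
```

With the real endpoint, `embed_batch` would wrap a single embeddings request with a list as `input` and read the vectors back from `response['data']`.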

In [55]:
from typing import List

import pandas as pd

def make_embeddings_df(embeddings_dict: Dict[str, List[float]], processes_dict: Dict[str, Dict[str, Any]]) -> pd.DataFrame:
    return pd.DataFrame([
        {"id": key, "embedding": value, "name": processes_dict[key]["name"]}
        for key, value in embeddings_dict.items()
    ])

In [56]:
embeddings_df = make_embeddings_df(embeddings_by_process_uuid, processes_by_uuid)
embeddings_df

| | id | embedding | name |
|---|---|---|---|
| 0 | 952ccd7f-dc94-4f84-b2cf-007ef682f3f6 | [-0.0014295520959421992, -0.009667548350989819... | Process steam from light fuel oil, consumption... |
| 1 | ceeb4a1a-3f76-4a0c-bf9e-032275044947 | [0.004919763654470444, -0.009595843963325024, ... | Process steam from natural gas 90%, consumptio... |
| 2 | 83c1f02c-f2ef-4ac4-9a57-ac2172c38d15 | [0.024013634771108627, -0.014745914377272129, ... | Electricity Mix, consumption mix, at consumer,... |
| 3 | 339b2536-c881-409d-ac71-49ab0d228fe3 | [-0.014571606181561947, -0.005526041146367788,... | Steel hot dip galvanized (ILCD), production mi... |
| 4 | 9a97696f-2ad0-4f29-8adc-774f42a91056 | [0.006041690707206726, -0.01525551825761795, -... | Process steam from Light fuel oil 90 %, consum... |
| ... | ... | ... | ... |
| 604 | c84025a7-1442-4200-b1d7-16b650017e6d | [-0.0014295520959421992, -0.009667548350989819... | Process steam from light fuel oil, consumption... |
| 605 | df59cb17-abe4-4856-8355-630f4b002e8c | [-0.02165055274963379, -0.006130277179181576, ... | Dummy_secondary fuel |
| 606 | 4a1ebe7c-6835-4a22-8b2e-3201f1cd32e8 | [-0.014447399415075779, -0.011586869135499, -0... | Kaolin coarse filler , at plant, Production |
| 607 | 047bd537-3e15-4490-aa5a-e7205b314a59 | [-0.002389002125710249, -0.0075097838416695595... | Dummy_Hydrogen |


In [57]:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def search_embeddings(embeddings_df: pd.DataFrame, query: str, n: int=10):
    response = openai.Embedding.create(
        input=query,
        model="text-embedding-ada-002"
    )
    query_embeddings = response['data'][0]['embedding']
    embeddings_df['similarities'] = embeddings_df["embedding"].apply(lambda process_embedding: cosine_similarity(process_embedding, query_embeddings))
    return embeddings_df.sort_values('similarities', ascending=False).head(n)

In [58]:
print(f"We search in a DB of size: {len(embeddings_df)}")

embedding_query_results = search_embeddings(embeddings_df, user_query)
embedding_query_results

We search in a DB of size: 609


| | id | embedding | name | similarities |
|---|---|---|---|---|
| 545 | b444f4d3-3393-11dd-bd11-0800200c9a66 | [0.026943303644657135, -0.02280818298459053, 0... | Lorry transport, Euro 0, 1, 2, 3, 4 mix, 22 t ... | 0.854273 |
| 238 | b444f4d2-3393-11dd-bd11-0800200c9a66 | [0.02788308635354042, -0.02220512367784977, 0.... | Lorry transport, Euro 0, 1, 2, 3, 4 mix, 22 t ... | 0.854104 |
| 127 | 40d64e6d-a823-471e-b759-339027c0a4f7 | [0.009484073147177696, -0.024378350004553795, ... | Mining Truck, technology mix, 220 t payload, 1... | 0.851621 |
| 289 | 6b0d5bef-6e69-4b8c-98e2-31a67fe890bc | [0.009484073147177696, -0.024378350004553795, ... | Mining Truck, technology mix, 220 t payload, 1... | 0.851621 |
| 425 | b444f4d0-3393-11dd-bd11-0800200c9a66 | [0.01915021985769272, -0.021432770416140556, 0... | Articulated lorry transport, Euro 0, 1, 2, 3, ... | 0.849613 |
| 290 | b444f4d1-3393-11dd-bd11-0800200c9a66 | [0.01915021985769272, -0.021432770416140556, 0... | Articulated lorry transport, Euro 0, 1, 2, 3, ... | 0.849613 |
| 432 | b4451be0-3393-11dd-bd11-0800200c9a66 | [0.025345314294099808, -0.014712837524712086, ... | Small lorry transport, Euro 0, 1, 2, 3, 4 mix,... | 0.842276 |
| 551 | b444f4d4-3393-11dd-bd11-0800200c9a66 | [0.02534065581858158, -0.014704645611345768, 0... | Small lorry transport, Euro 0, 1, 2, 3, 4 mix,... | 0.842215 |
| 6 | 6efc3814-2cf9-4d18-894f-c869c3505341 | [0.009053954854607582, -0.012108300812542439, ... | Container ship ocean, technology mix, 27.500 d... | 0.833929 |
| 187 | 7d4c6dee-3d6b-4fbf-a496-2354450a1a14 | [0.008841284550726414, -0.012290543876588345, ... | Container ship ocean, technology mix, 27.500 d... | 0.833725 |
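
`search_embeddings` computes the similarity row by row with `DataFrame.apply`, which is fine for 609 processes. For larger collections the same ranking can be computed with a single matrix product. The sketch below uses NumPy with random vectors standing in for real embeddings. `top_n` is a hypothetical helper, and the rows are normalized explicitly so that nothing is assumed about the length of the stored vectors.

```python
import numpy as np


def top_n(matrix: np.ndarray, query: np.ndarray, n: int = 10):
    """Return (indices, similarities) of the n rows most similar to `query`."""
    # Normalize rows and query so that a dot product equals cosine similarity.
    rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    similarities = rows @ q
    order = np.argsort(-similarities)[:n]
    return order, similarities[order]


rng = np.random.default_rng(seed=0)
fake_embeddings = rng.normal(size=(609, 1536))  # placeholder data
# A query that is a slightly perturbed copy of row 42.
fake_query = fake_embeddings[42] + 0.01 * rng.normal(size=1536)

indices, scores = top_n(fake_embeddings, fake_query, n=5)
print(indices[0], scores[0])
```

A vector database does essentially this, with indexing structures that avoid comparing the query against every stored vector.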


In [59]:
def search_2(user_query: str, processes_by_uuid: Dict[str, Dict[str, Any]], embeddings_df: pd.DataFrame, n: int=10) -> str:
    embedding_query_results = search_embeddings(embeddings_df, user_query, n=n)

    potential_processes = {}
    for process_id in embedding_query_results["id"]:
        potential_processes[process_id] = processes_by_uuid[process_id]
    
    processes_prompt = generate_processes_prompt(potential_processes)
    print(f"Prompt size: {len(processes_prompt)}")
        
    prompt = f"""
    Between the triple backticks is your knowledge about industrial processes. Each process has an ID, a Name, and a Description, in the format
    ID: `the ID`, Name: `the name`, Description: `the description`
    
    ```
    {processes_prompt}
    ```

    Which of these processes is the most likely to represent: `{user_query}`?
    Return its ID, Name, and Description.
    """
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )
    output_str = response["choices"][0]["message"]["content"]
    
    return output_str

In [60]:
search_2(user_query, processes_by_uuid, embeddings_df, n=10)

Prompt size: 10788


'The process that is most likely to represent "Truck transport of 20 tonnes of aluminum" is:\n\nID: `b444f4d3-3393-11dd-bd11-0800200c9a66`\nName: `Lorry transport, Euro 0, 1, 2, 3, 4 mix, 22 t total weight, 17,3t max payload`\nDescription: `Weighted average of lorries with 22t total weight for emission standards from to EURO 0 to Euro 4. Total payload of lorry is 17.3t; its utilization ratio is 85%. The following combustion emissions (measured data) of the lorry are taken into account: ammonia, benzene, carbon dioxide, carbon monoxide, methane, nitrogen oxides, nitrous oxide, NMVOC, particulate PM 2.5, sulphur dioxide, toluene, xylene. NMVOC, toluene and xylene emissions of the truck result from imperfect combustion and evaporation losses via diffusion through the tank. Lorry fueled by diesel. Data set includes the whole fuel supply chain from exploration and extraction of crude oil over preparation to transportation to consumer. The background system is addressed as follows: Refinery 

In [61]:
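class SearchResponse2(pydantic.BaseModel):
    # The first positional argument of Field is the default value,
    # so the description has to be passed by keyword.
    process_id: str = pydantic.Field(description="process ID")
    process_name: str = pydantic.Field(description="process name")


def search_3(user_query: str, processes_by_uuid: Dict[str, Dict[str, Any]], embeddings_df: pd.DataFrame, n: int=10) -> Dict[str, Any]: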
    embedding_query_results = search_embeddings(embeddings_df, user_query, n=n)

    response_formatter = PYDANTIC_FORMAT_INSTRUCTIONS.format(
        schema=describe_pydantic_schema_as_str(SearchResponse2)
    )
    

    potential_processes = {}
    for process_id in embedding_query_results["id"]:
        potential_processes[process_id] = processes_by_uuid[process_id]
    
    processes_prompt = generate_processes_prompt(potential_processes)
    print(f"Prompt size: {len(processes_prompt)}")
        
    prompt = f"""
    Between the triple backticks is your knowledge about industrial processes. Each process has an ID, a Name, and a Description, in the format
    ID: `the ID`, Name: `the name`, Description: `the description`
    
    ```
    {processes_prompt}
    ```

    Which of these processes is the most likely to represent: `{user_query}`?
    {response_formatter}
    """
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0
    )

    output_str = response["choices"][0]["message"]["content"]
    print(f"Output: {output_str}")
    
    output_json_str = extract_json(output_str)
    print(f"Output JSON: {output_json_str}")

    parsed_output = json.loads(output_json_str)

    assert SearchResponse2.parse_obj(parsed_output)
    
    return parsed_output

In [62]:
search_3(user_query, processes_by_uuid, embeddings_df, n=5)

/tmp/ipykernel_17791/2251163081.py:2: PydanticDeprecatedSince20: The `schema` method is deprecated; use `model_json_schema` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.3/migration/
  schema = model.schema()


Prompt size: 4843
Output: {
  "process_id": "b444f4d2-3393-11dd-bd11-0800200c9a66",
  "process_name": "Lorry transport, Euro 0, 1, 2, 3, 4 mix, 22 t total weight, 17,3 t max payload"
}
Output JSON: {
  "process_id": "b444f4d2-3393-11dd-bd11-0800200c9a66",
  "process_name": "Lorry transport, Euro 0, 1, 2, 3, 4 mix, 22 t total weight, 17,3 t max payload"
}


/tmp/ipykernel_17791/2171983129.py:50: PydanticDeprecatedSince20: The `parse_obj` method is deprecated; use `model_validate` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.3/migration/
  assert SearchResponse2.parse_obj(parsed_output)


{'process_id': 'b444f4d2-3393-11dd-bd11-0800200c9a66',
 'process_name': 'Lorry transport, Euro 0, 1, 2, 3, 4 mix, 22 t total weight, 17,3 t max payload'}
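
The Pydantic check only confirms that the reply has the expected fields. It does not confirm that the returned ID is one of the candidates that were shown to the model. A minimal sketch of an extra guard follows. `check_search_result` is a hypothetical helper, and the `candidates` dictionary holds two of the processes from the results above.

```python
from typing import Dict


def check_search_result(result: Dict[str, str], candidates: Dict[str, str]) -> Dict[str, str]:
    """Reject answers whose ID was not among the candidates given to the model.

    `candidates` maps process ID to process name.
    """
    process_id = result["process_id"]
    if process_id not in candidates:
        raise ValueError(f"unknown process ID returned by the model: {process_id}")
    # Trust the database, not the model, for the name.
    return {"process_id": process_id, "process_name": candidates[process_id]}


candidates = {
    "b444f4d2-3393-11dd-bd11-0800200c9a66":
        "Lorry transport, Euro 0, 1, 2, 3, 4 mix, 22 t total weight, 17,3 t max payload",
    "40d64e6d-a823-471e-b759-339027c0a4f7":
        "Mining Truck, technology mix, 220 t payload, 1.700 kW",
}

checked = check_search_result(
    {"process_id": "b444f4d2-3393-11dd-bd11-0800200c9a66", "process_name": "Lorry"},
    candidates,
)
print(checked["process_name"])
```

Taking the name from the database also protects against the model paraphrasing or truncating it.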

## This has a name: RAG = Retrieval Augmented Generation

You can also think of it as a more complex chain:

- User query
- Embedding
- Cosine similarity against all stored embeddings (in production, use a vector database rather than Pandas; there are plenty to choose from)
- Pick first N (N=5..20)
- (Please apply other filters based on business logic)
- Build a "sort of structured" prompt from these candidates
- Add the output parser definition
- Parse the output
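
The steps above can be sketched end to end without any API call. Everything in the block below is a stand-in chosen for illustration: `toy_embed` replaces the embedding model with a bag-of-words count, `fake_llm` replaces the chat model, and the three documents are shortened, made-up process names.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple


def toy_embed(text: str) -> Counter:
    # Stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: Dict[str, str], n: int) -> List[Tuple[str, str]]:
    # Steps 1-4: embed the query, score every document, keep the top n.
    q = toy_embed(query)
    ranked = sorted(docs.items(), key=lambda kv: cosine(q, toy_embed(kv[1])), reverse=True)
    return ranked[:n]


def build_prompt(query: str, candidates: List[Tuple[str, str]]) -> str:
    # Step 6: a semi-structured prompt built from the candidates only.
    lines = [f"ID: `{doc_id}`, Name: `{name}`" for doc_id, name in candidates]
    return "\n".join(lines) + f"\nWhich process best represents: `{query}`? Reply with the ID."


def rag(query: str, docs: Dict[str, str], llm: Callable[[str], str], n: int = 2) -> str:
    candidates = retrieve(query, docs, n)
    answer = llm(build_prompt(query, candidates)).strip()
    # Step 8: parse the output and check it against the retrieved candidates.
    if answer not in dict(candidates):
        raise ValueError(f"model returned an ID outside the candidates: {answer}")
    return answer


docs = {
    "p1": "Lorry transport, 22 t total weight",
    "p2": "Process steam from natural gas",
    "p3": "Small lorry transport, 7,5 t total weight",
}


def fake_llm(prompt: str) -> str:
    # Stand-in for the chat model: returns the first candidate ID in the prompt.
    return re.search(r"ID: `([^`]+)`", prompt).group(1)


answer = rag("lorry transport of aluminum", docs, fake_llm)
print(answer)
```

The retrieval step decides what the model gets to see, so a process that is missing from the top N can never be returned, however good the model is.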

# Conclusion / Next steps

- The danger is to find this magical and use it as a SISP ("Solution In Search of a Problem"). A good framework is:
  - Understand your problem thoroughly.
  - Build your ideal engineering system to solve it.
  - If it has bottlenecks that can be resolved with an LLM, try to find the tiniest application and use it there.
  - The smaller the task, the easier it is to understand and debug.
- If you're building a frontend, always give your user the option to go back to "manual" mode / "stick shift".

## Gift

[My AI reading list](https://public.3.basecamp.com/p/RUgMdhPpg72dPP5Y5MNDMXHm)