# Data

This notebook merges several tool datasets into a single dataset with a common format, and embeds the tool descriptions.

Currently collects ToolE, ToolLens, Reverse-chain & Berkeley.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import os
import sys
sys.path.append(os.path.abspath(os.path.join('..')))
import lib
from lib.data import Datasets

## Collecting Data

### Collect Data

In [None]:
dataset_main = Datasets.Main(load=False)

parsed_datasets: list[tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]] = [
    Datasets.ToolE().parse_to_main(),
    Datasets.ToolLens().parse_to_main(),
    Datasets.ReverseChain().parse_to_main(),
    Datasets.Berkeley().parse_to_main(),
]

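# Each parsed tuple holds (queries, tool_descriptions, tool_embeddings); append each part to the main dataset
for t in parsed_datasets: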
    dataset_main.queries = pd.concat([dataset_main.queries, t[0]])
    dataset_main.tool_descriptions = pd.concat([dataset_main.tool_descriptions, t[1]])
    dataset_main.tool_embeddings = pd.concat([dataset_main.tool_embeddings, t[2]])
del t

dataset_main.queries.reset_index(drop=True, inplace=True)
dataset_main.tool_descriptions.reset_index(drop=True, inplace=True)
dataset_main.tool_embeddings.reset_index(drop=True, inplace=True)
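As a side note on the concatenation step, the loop followed by `reset_index(drop=True)` can usually be written as a single `pd.concat` call with `ignore_index=True`. The sketch below uses two small made-up frames (the column values are placeholders, not data from the real datasets) to show that the combined frame ends up with a fresh 0..n-1 index.

```python
import pandas as pd

# Toy stand-ins for the per-dataset query frames (values are invented)
parts = [
    pd.DataFrame({"query": ["q1", "q2"], "source": ["A", "A"]}),
    pd.DataFrame({"query": ["q3"], "source": ["B"]}),
]

# One concat over the whole list; ignore_index=True renumbers the rows,
# which has the same effect as reset_index(drop=True) afterwards
combined = pd.concat(parts, ignore_index=True)
print(combined)
```

Concatenating once over a list also avoids repeatedly copying the growing frame inside the loop, which matters more as the number of source datasets grows.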

In [4]:
dataset_main.tool_descriptions

Unnamed: 0,tool,tool_name,description,source,metadata
0,tool_e.timeport,timeport,"Begin an exciting journey through time, intera...",ToolE,{'source': 'ToolE'}
1,tool_e.airqualityforeast,airqualityforeast,Planning something outdoors? Get the 2-day air...,ToolE,{'source': 'ToolE'}
2,tool_e.copilot,copilot,"Searches every dealer, analyzes & ranks every ...",ToolE,{'source': 'ToolE'}
3,tool_e.tira,tira,Shop Tira for top beauty brands! Explore cosme...,ToolE,{'source': 'ToolE'}
4,tool_e.calculator,calculator,A calculator app that executes a given formula...,ToolE,{'source': 'ToolE'}
...,...,...,...,...,...
2119,berkeley.hotel_book,hotel_book,"Book a hotel room given the location, room typ...",Berkeley,"{'name': 'hotel.book', 'description': 'Book a ..."
2120,berkeley.hotel_room_pricing_get,hotel_room_pricing_get,Get pricing for a specific type of hotel room ...,Berkeley,"{'name': 'hotel_room_pricing.get', 'descriptio..."
2121,berkeley.car_rental_pricing_get,car_rental_pricing_get,Get pricing for a specific type of rental car ...,Berkeley,"{'name': 'car_rental_pricing.get', 'descriptio..."
2122,berkeley.flight_ticket_pricing_get,flight_ticket_pricing_get,Get pricing for a specific type of flight tick...,Berkeley,"{'name': 'flight_ticket_pricing.get', 'descrip..."


### Check for duplicates

We check the tool description data for duplicate tool identifiers before embedding.

In [5]:
duplicates = dataset_main.tool_descriptions[dataset_main.tool_descriptions.duplicated(subset='tool', keep=False)]
duplicates

Unnamed: 0,tool,tool_name,description,source,metadata
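The empty table above means no `tool` value occurs more than once. To illustrate what the check would return if there were duplicates, here is a minimal sketch on an invented frame (the tool names and descriptions are placeholders). With `keep=False`, every row that shares its `tool` value with another row is flagged, not only the later occurrences.

```python
import pandas as pd

# Invented example data; "a.x" appears twice on purpose
tools = pd.DataFrame({
    "tool": ["a.x", "a.y", "b.x", "a.x"],
    "description": ["first", "second", "third", "first again"],
})

# keep=False marks all members of a duplicate group as True
dupes = tools[tools.duplicated(subset="tool", keep=False)]
print(dupes)
```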


### Embedding


[OpenAI - Vector embeddings](https://platform.openai.com/docs/guides/embeddings?lang=python)

[Embeddings pricing](https://platform.openai.com/docs/pricing#embeddings)



In [6]:
embedding_client = lib.embedding.OpenAIEmbeddings('text-embedding-ada-002')

We load the previously saved embedding file, if it exists, and use it to fill in embeddings that are still missing.

In [7]:
embedding_filepath = os.path.join(dataset_main.path, Datasets.Main.FileNames.tool_embeddings) 

if os.path.exists(embedding_filepath):
    print(f"Loading existing embeddings from {embedding_filepath}")
    tool_embeddings = Datasets.Main.read_embeddings(embedding_filepath)
    dataset_main.tool_embeddings.set_index('tool', inplace=True)
    tool_embeddings.set_index('tool', inplace=True)
    dataset_main.tool_embeddings.fillna(tool_embeddings, inplace=True)
    dataset_main.tool_embeddings.reset_index(inplace=True)
    del tool_embeddings

Loading existing embeddings from d:\Github\KU-2025-Master-Thesis\lib\..\data\main\tool_embeddings.pkl
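The reason both frames are indexed by `tool` before calling `fillna` is that `DataFrame.fillna` aligns a DataFrame argument on index and column labels. The following sketch shows this on toy data; the float values stand in for embedding vectors purely to keep the example short, and the frame names are hypothetical.

```python
import numpy as np
import pandas as pd

# Freshly collected tools, none of which has an embedding yet
current = pd.DataFrame({"tool": ["a", "b", "c"],
                        "embedding": [np.nan, np.nan, np.nan]})

# Cached results in a different row order and without tool "b"
cached = pd.DataFrame({"tool": ["c", "a"], "embedding": [0.3, 0.1]})

# Index both frames by tool so fillna matches rows by label, not position
filled = (
    current.set_index("tool")
    .fillna(cached.set_index("tool"))
    .reset_index()
)
print(filled)
```

Tools missing from the cache (here `b`) keep their missing value and are left for the embedding step.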


Embed the tools that still have no embedding

In [8]:
df_embeddings_with_description = pd.merge(dataset_main.tool_embeddings, dataset_main.tool_descriptions, on='tool', how='left')
df_embeddings_with_description_missing = df_embeddings_with_description[df_embeddings_with_description['embedding'].isna()].reset_index(drop=True)

new_embeddings = embedding_client.get_embeddings(df_embeddings_with_description_missing['description'].tolist(), verbose=True)
for i in range(len(df_embeddings_with_description_missing)):
    df_embeddings_with_description_missing.at[i, 'embedding'] = new_embeddings[i]
df_embeddings_with_description_missing = df_embeddings_with_description_missing[['tool', 'embedding']]

dataset_main.tool_embeddings.set_index('tool', inplace=True)
df_embeddings_with_description_missing.set_index('tool', inplace=True)
dataset_main.tool_embeddings.fillna(df_embeddings_with_description_missing, inplace=True)
dataset_main.tool_embeddings.reset_index(inplace=True)

0it [00:00, ?it/s]
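The `0it` progress output suggests that no tool was missing an embedding in this run, so nothing was sent to the embedding client. To show the pattern when some embeddings are missing, the sketch below repeats the merge, filter, and fill steps on toy data. `fake_embed` is a hypothetical stand-in for the embedding client and returns meaningless two-dimensional vectors; it is not the API of `lib.embedding`.

```python
import pandas as pd


def fake_embed(texts: list[str]) -> list[list[float]]:
    """Hypothetical embedder: [character count, space count] per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


# Toy data: only tool "a" already has an embedding
embeddings = pd.DataFrame({"tool": ["a", "b", "c"],
                           "embedding": [[0.1, 0.2], None, None]})
descriptions = pd.DataFrame({
    "tool": ["a", "b", "c"],
    "description": ["adds numbers", "books a hotel", "gets weather"],
})

# Attach descriptions, then select the rows that still lack an embedding
merged = embeddings.merge(descriptions, on="tool", how="left")
missing = merged["embedding"].isna()
new_vectors = fake_embed(merged.loc[missing, "description"].tolist())

# Write each new vector into its row; .at stores the list as a single cell
for idx, vec in zip(merged.index[missing], new_vectors):
    merged.at[idx, "embedding"] = vec

print(merged[["tool", "embedding"]])
```

Only the descriptions of rows without an embedding are passed to the embedder, which keeps repeated runs cheap when the cache already covers most tools.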


Then we save the data

In [9]:
# Save the data
dataset_main.save_data()

### Inspecting data

In [10]:
dataset_main.queries

Unnamed: 0,query,tool,source
0,Can I find academic research papers on this to...,[tool_e.research_helper],ToolE
1,Can I find any peer-reviewed papers?,[tool_e.research_helper],ToolE
2,Can I generate bibtex bibliographies?,[tool_e.research_helper],ToolE
3,Can I use Crossref with Chatbot?,[tool_e.research_helper],ToolE
4,Can you answer a question about a research pap...,[tool_e.research_helper],ToolE
...,...,...,...
42387,Use a tool (or multiple if needed) to assist w...,"[berkeley.currency_exchange_convert, berkeley....",Berkeley
42388,Use a tool (or multiple if needed) to assist w...,"[berkeley.park_information, berkeley.legal_cas...",Berkeley
42389,Use a tool (or multiple if needed) to assist w...,"[berkeley.grocery_store_find_best, berkeley.ca...",Berkeley
42390,Use a tool (or multiple if needed) to assist w...,"[berkeley.sentiment_analysis, berkeley.psych_r...",Berkeley


In [11]:
dataset_main.tool_descriptions

Unnamed: 0,tool,tool_name,description,source,metadata
0,tool_e.timeport,timeport,"Begin an exciting journey through time, intera...",ToolE,{'source': 'ToolE'}
1,tool_e.airqualityforeast,airqualityforeast,Planning something outdoors? Get the 2-day air...,ToolE,{'source': 'ToolE'}
2,tool_e.copilot,copilot,"Searches every dealer, analyzes & ranks every ...",ToolE,{'source': 'ToolE'}
3,tool_e.tira,tira,Shop Tira for top beauty brands! Explore cosme...,ToolE,{'source': 'ToolE'}
4,tool_e.calculator,calculator,A calculator app that executes a given formula...,ToolE,{'source': 'ToolE'}
...,...,...,...,...,...
2119,berkeley.hotel_book,hotel_book,"Book a hotel room given the location, room typ...",Berkeley,"{'name': 'hotel.book', 'description': 'Book a ..."
2120,berkeley.hotel_room_pricing_get,hotel_room_pricing_get,Get pricing for a specific type of hotel room ...,Berkeley,"{'name': 'hotel_room_pricing.get', 'descriptio..."
2121,berkeley.car_rental_pricing_get,car_rental_pricing_get,Get pricing for a specific type of rental car ...,Berkeley,"{'name': 'car_rental_pricing.get', 'descriptio..."
2122,berkeley.flight_ticket_pricing_get,flight_ticket_pricing_get,Get pricing for a specific type of flight tick...,Berkeley,"{'name': 'flight_ticket_pricing.get', 'descrip..."


In [12]:
dataset_main.tool_embeddings

Unnamed: 0,tool,embedding
0,tool_e.timeport,"[-0.002328580478206277, -0.025154519826173782,..."
1,tool_e.airqualityforeast,"[0.0024755524937063456, 0.001670109573751688, ..."
2,tool_e.copilot,"[-0.013786195777356625, 0.011559398844838142, ..."
3,tool_e.tira,"[-0.0014366412069648504, -0.004678442142903805..."
4,tool_e.calculator,"[-0.005550655536353588, 0.014075472950935364, ..."
...,...,...
2121,berkeley.hotel_book,"[0.016627123579382896, 0.003058888018131256, -..."
2122,berkeley.hotel_room_pricing_get,"[0.01577910967171192, 0.002552404534071684, 0...."
2123,berkeley.car_rental_pricing_get,"[0.006298906169831753, -0.005386214703321457, ..."
2124,berkeley.flight_ticket_pricing_get,"[0.005622945725917816, -0.005735338665544987, ..."
