# 🎓 FrugalGPT Experiment on 5 Datasets: Performance and Cost Tradeoffs against ThriftLLM

This notebook illustrates the FrugalGPT framework for _building LLM Applications with budget constraints._

In particular, we will focus on evaluating the performance and cost tradeoffs enabled by FrugalGPT.

NB: We strongly recommend running this notebook on accelerated hardware (GPU/TPU).

## Installation

In [2]:
%load_ext autoreload
%autoreload 2
import sys, json, copy
import pandas as pd
import logging
logging.disable(logging.CRITICAL)
sys.path.append("src/")

## Setup
Next, let us set up the environment and API keys. You do _not_ need API keys to run the notebook! They are only needed if you want to use FrugalGPT for your own queries.

NB: For your own queries, you do not need every API key either. If you only want to use LLMs from, e.g., OpenAI and AI21, setting up API keys for those two providers is sufficient.
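A minimal sketch of checking which provider keys are present before issuing live queries. `OPENAI_API_KEY` is the variable name the OpenAI client conventionally reads; the `AI21_API_KEY` name, the `PROVIDER_ENV_VARS` mapping, and the `missing_keys` helper are hypothetical placeholders, so consult the FrugalGPT README for the names it actually expects.

```python
import os

# Hypothetical mapping from provider prefix to environment variable.
# Only OPENAI_API_KEY is a widely used convention; the rest are assumptions.
PROVIDER_ENV_VARS = {
    "openaichat": "OPENAI_API_KEY",
    "ai21": "AI21_API_KEY",
}

def missing_keys(providers, environ=None):
    """Return the providers whose API key is unset or empty."""
    environ = os.environ if environ is None else environ
    return [p for p in providers if not environ.get(PROVIDER_ENV_VARS[p])]

# Use an explicit environment so the example is reproducible.
fake_env = {"OPENAI_API_KEY": "sk-placeholder"}
print(missing_keys(["openaichat", "ai21"], environ=fake_env))  # ['ai21']
```

Passing `environ=None` makes the helper fall back to the real `os.environ`, which is what you would want inside the notebook.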

In [22]:
import os
from IPython.display import display
import FrugalGPT
import numpy
from tqdm import tqdm

supported_LLM = FrugalGPT.getservicename()
print("supported LLMs:",supported_LLM)
supported_LLM_names = [llm.split("/")[1] for llm in supported_LLM]
print("supported_LLM_names:", supported_LLM_names)

supported LLMs: ['google/gemini-1.5-flash-002', 'google/gemini-1.5-pro-002', 'google/gemini-1.0-pro', 'openaichat/gpt-4o-mini', 'openaichat/gpt-4o', 'azure/Phi-3-mini-4k-instruct', 'azure/Phi-3.5-mini-instruct', 'azure/Phi-3-small-8k-instruct', 'azure/Phi-3-medium-4k-instruct', 'deepinfra/llama-3-8B', 'deepinfra/llama-3-70B', 'deepinfra/mixtral-8x7B']
supported_LLM_names: ['gemini-1.5-flash-002', 'gemini-1.5-pro-002', 'gemini-1.0-pro', 'gpt-4o-mini', 'gpt-4o', 'Phi-3-mini-4k-instruct', 'Phi-3.5-mini-instruct', 'Phi-3-small-8k-instruct', 'Phi-3-medium-4k-instruct', 'llama-3-8B', 'llama-3-70B', 'mixtral-8x7B']
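The `provider/model` convention above can be split and inverted with a small helper. This is a sketch, not part of FrugalGPT: `split_service_name` and `name_to_service` are illustrative names. It splits at the first slash only, because `llm.split("/")[1]` would return just `meta-llama` for a three-part name such as `togetherai/meta-llama/Meta-Llama-3-70B-Instruct-Turbo`.

```python
def split_service_name(service_name):
    """Split 'provider/model' into (provider, model) at the first slash."""
    provider, _, model = service_name.partition("/")
    if not provider or not model:
        raise ValueError(f"expected 'provider/model', got {service_name!r}")
    return provider, model

services = ["openaichat/gpt-4o-mini", "deepinfra/llama-3-8B"]

# Map the short model name (the CSV column name) back to the full service name.
name_to_service = {split_service_name(s)[1]: s for s in services}
print(name_to_service["llama-3-8B"])  # deepinfra/llama-3-8B
```

Whether the short name should keep the inner slash for three-part names depends on how the CSV columns are labelled, so treat that choice as an assumption.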


Generating the tradeoffs involves three major steps: (i) prepare the dataset, (ii) train the FrugalGPT strategy, and (iii) evaluate and save the performance.

## Step 1: Prepare the dataset

In [6]:
# dataname = "HEADLINES"
dataname = "OVERRULING"


In [7]:
# read the cached model answers from data/{dataname}/Queried_{dataname}_all_models_clean.csv (the *_clean_test.csv file is loaded in Step 3)
dataset_df = pd.read_csv(f'data/{dataname}/Queried_{dataname}_all_models_clean.csv', header=0)
dataset_df.head()

Unnamed: 0,query_raw,query,ref_answer,gpt-4o-mini,gpt-4o,llama-3-8B,llama-3-70B,mixtral-8x7B,gemini-1.5-flash-002,gemini-1.0-pro,gemini-1.5-pro-002,Phi-3.5-mini-instruct,Phi-3-small-8k-instruct,Phi-3-mini-4k-instruct,Phi-3-medium-4k-instruct
0,Context: section 3553(c) and this court's case...,Please determine whether a sentence is overrul...,no,no,no,no,no,no,no,no,no,no,yes,no,no
1,"Context: pockman v. leonard, supra, 39 cal.2d ...",Please determine whether a sentence is overrul...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
2,"Context: see, e.g., r.j. griffin & co. v. beac...",Please determine whether a sentence is overrul...,no,no,no,no,no,no,no,no,no,no,no,no,no
3,Context: appellate review that is not founded ...,Please determine whether a sentence is overrul...,no,no,no,no,no,no,no,no,no,no,no,no,no
4,"Context: see also williams v. state, 268 ga. 4...",Please determine whether a sentence is overrul...,no,no,no,no,no,no,no,no,no,no,no,no,no


In [8]:
# Reformat the dataset into a list of records, one per CSV row:
#   train_data[i][0]: the query column
#   train_data[i][1]: the ref_answer column
#   train_data[i][2]: the _id, i.e. the index of the row in the CSV file
#   train_data[i][3]: a dict mapping each model name to that model's answer
# The key requirement is that train_data[i][3][model_name] returns the answer of the corresponding model.

train_data = []
for index, row in dataset_df.iterrows():
    query = row['query']
    ref_answer = row['ref_answer']
    _id = index
    model_answer = {}
    for model_name in supported_LLM_names:
        model_answer[model_name] = row[model_name]
    train_data.append([query, ref_answer, _id, model_answer])
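The same conversion can be wrapped in a reusable function that also fails early when a model column is missing. This is a sketch under assumptions: `build_records` is an illustrative name, and it takes plain row dicts (for example from `dataset_df.to_dict("records")`) so the example runs without the data files.

```python
def build_records(rows, model_names):
    """Convert row dicts into [query, ref_answer, _id, {model: answer}] lists."""
    records = []
    for idx, row in enumerate(rows):
        missing = [m for m in model_names if m not in row]
        if missing:
            raise KeyError(f"row {idx} lacks answers for: {missing}")
        answers = {m: row[m] for m in model_names}
        records.append([row["query"], row["ref_answer"], idx, answers])
    return records

# Toy row standing in for one line of the CSV.
rows = [
    {"query": "Is it overruling? ...", "ref_answer": "no",
     "gpt-4o": "no", "llama-3-8B": "yes"},
]
records = build_records(rows, ["gpt-4o", "llama-3-8B"])
print(records[0][3]["llama-3-8B"])  # yes
```

Here `_id` is the row position, which matches the DataFrame index only when the index is the default 0..n-1 range.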

In [9]:
train_data[3]

['Please determine whether a sentence is overruling a prior decision (Yes or No) in the following statements.\n\nContext: because jones/walker relates only to sufficiency of the evidence, we hereby disavow the language holding otherwise in sandoval.\nQuestion: Is it overruling?\nAnswer: Yes\n\nContext: according to napa auto parts, the straws drove the vehicle """"for approximately six [] weeks and [] for between 500 to 600 miles prior to the accident with no incidents.""""\nQuestion: Is it overruling?\nAnswer: No\n\nContext: appellate review that is not founded upon any factual findings made at the trial court level, but is based upon an independent review and analysis of the contract within the four corners of the document, is not subject to the manifest error rule of law.\nQuestion: Is it overruling?\nAnswer:',
 'no',
 3,
 {'gemini-1.5-flash-002': 'no',
  'gemini-1.5-pro-002': 'no',
  'gemini-1.0-pro': 'no',
  'gpt-4o-mini': 'no',
  'gpt-4o': 'no',
  'Phi-3-mini-4k-instruct': 'no',


In [10]:
# get the answer of the model llama-3-8B
train_data[3][3]['llama-3-8B']

'no'
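Because every record carries all models' cached answers, single-model accuracy can be estimated offline without any API call. The sketch below assumes exact match after stripping whitespace and lower-casing; that normalization is an assumption and may differ from the `em` score that `FrugalGPT.compute_score` reports.

```python
def per_model_accuracy(records):
    """Exact-match accuracy of each model's cached answer against ref_answer."""
    hits = {}
    for _query, ref_answer, _id, answers in records:
        target = str(ref_answer).strip().lower()
        for model, answer in answers.items():
            hits.setdefault(model, 0)
            # bool is added as 0 or 1
            hits[model] += str(answer).strip().lower() == target
    return {model: n / len(records) for model, n in hits.items()}

# Toy records in the same layout as train_data.
toy = [
    ["q1", "no", 0, {"a": "no", "b": "yes"}],
    ["q2", "yes", 1, {"a": "Yes ", "b": "yes"}],
]
print(per_model_accuracy(toy))  # {'a': 1.0, 'b': 0.5}
```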

## Step 2: Train the FrugalGPT strategy for different budgets

In [11]:
service_names = ['openaichat/gpt-4o-mini',
                'openaichat/gpt-4o',
                'google/gemini-1.5-flash-002',
                'google/gemini-1.5-pro-002',
                'google/gemini-1.0-pro',
                'azure/Phi-3-mini-4k-instruct',
                'azure/Phi-3.5-mini-instruct',
                'azure/Phi-3-small-8k-instruct',
                'azure/Phi-3-medium-4k-instruct',
                'deepinfra/llama-3-8B',
                'deepinfra/llama-3-70B',
                'deepinfra/mixtral-8x7B',
                ]

### 1. Let us first evaluate individual models.

In [6]:
import pandas as pd

def generate_dataframe(service_names, train_data, test_data, genparams,
                       db_path="db/SCIQ.sqlite", max_workers=2):
    # Initialize an empty list to store the rows for the DataFrame
    data = []
    MyLLMforAll = FrugalGPT.LLMforAll(
        db_path=db_path,
        max_workers=max_workers,
    )
    # Dictionary to keep track of markers for each provider
    provider_marker = {}

    # Iterate through the service names
    for name in service_names:
        # Extract provider and method
        provider = name.split('/')[0]
        method = name.split('/')[-1]

        # If the provider is seen for the first time, initialize its marker
        if provider not in provider_marker:
            provider_marker[provider] = 1
        else:
            provider_marker[provider] += 1
        # Get the completion batch for train and test data
        r1_train = MyLLMforAll.get_completion_batch(queries=train_data, genparams=genparams, service_name=name)
        r2_train = FrugalGPT.compute_score(r1_train)
        r1_test = MyLLMforAll.get_completion_batch(queries=test_data, genparams=genparams, service_name=name)
        r2_test = FrugalGPT.compute_score(r1_test)

        # Extract accuracy and cost
        train_acc = r2_train['em']
        train_cost = r2_train['cost']
        test_acc = r2_test['em']
        test_cost = r2_test['cost']

        # Create a row with the schema
        row = {
            "Test_acc": test_acc,
            "Test_cost": test_cost,
            "Test_size": len(test_data),
            "Train_acc": train_acc,
            "Train_cost": train_cost,
            "Train_size": len(train_data),
            "Budget": 10,
            "Method": method,
            "Provider": provider,
            "Marker": provider_marker[provider],
        }

        # Append the row to the data list
        data.append(row)

    # Create the DataFrame from the data list
    df = pd.DataFrame(data)

    return df

In [None]:
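# NOTE: this cell needs test_data (built in Step 3) and genparams (defined under "2. Now let us train FrugalGPT"); run those cells first.
sample_size = 500 # 10000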
individualmodel_df = generate_dataframe(service_names,
                                        train_data[0:sample_size], test_data[0:sample_size],
                                        genparams,
                                        db_path=f"db/{dataname}.sqlite",
                                        max_workers=4)
display(individualmodel_df)
individualmodel_df.to_csv(f"summary_{dataname}_e8_2024.csv")


In [48]:
# show the dataframe
display(individualmodel_df)

Unnamed: 0,Test_acc,Test_cost,Test_size,Train_acc,Train_cost,Train_size,Budget,Method,Provider,Marker
0,0.838,3.3e-05,500,0.882,3.3e-05,500,10,gpt-4o-mini,openaichat,1


### 2. Now let us train FrugalGPT on this dataset.

In [26]:
genparams=FrugalGPT.GenerationParameter(max_tokens=50, temperature=0.1, stop=['\n'])
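The `stop=['\n']` setting matters for this task because the few-shot prompt ends in `Answer:` and only the first line of the completion is wanted. The sketch below mimics that effect locally; it is not how FrugalGPT or any provider implements stopping, and `apply_stop` is an illustrative name.

```python
def apply_stop(text, stop):
    """Truncate text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for s in stop:
        pos = text.find(s)
        if pos != -1:
            cut = min(cut, pos)
    return text[:cut]

# A completion that runs on into the next few-shot example.
raw = " No\n\nContext: next example ..."
answer = apply_stop(raw, ["\n"]).strip().lower()
print(repr(answer))  # 'no'
```

With the completion cut at the first newline, a simple normalized string comparison against `ref_answer` is enough to score it.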

In [19]:
def compute_tradeoffs(
    train_data,
    budget_list,
    name="HEADLINES",  # strategies are saved under strategy/{name}/
    service_names=None,
    prefix="",
    skip=0,
    MyCascade=None,
    cascade_depth=3,
):
    # Build the defaults at call time rather than at definition time,
    # so that no cascade object is created (or shared) unless it is needed.
    if service_names is None:
        service_names = [
            'openaichat/gpt-4o-mini',
            'openaichat/gpt-4o',
            'openaichat/gpt-4-turbo',
            'togetherai/meta-llama/Meta-Llama-3-70B-Instruct-Turbo',
            'togetherai/google/gemma-2-9b-it',
        ]
    if MyCascade is None:
        MyCascade = FrugalGPT.LLMCascade(
            score_noise_injection=False,
            db_path="db/SCIQ.sqlite",
        )

    for idx, budget in tqdm(enumerate(budget_list)):
        # Skip budgets whose strategy has already been saved to disk.
        try:
            MyCascade.load(loadpath=f"strategy/{name}/", budget=budget)
            print("Already trained. Skipped.")
            continue
        except Exception:
            print("cannot find, start new training")
        if idx < skip:
            continue
        # The scorer is trained only for the first budget;
        # later budgets pass no_scorer_train=True.
        MyCascade.train(
            train_data,
            budget=budget,
            service_names=service_names,
            no_scorer_train=(idx != 0),
            prefix=prefix,
            cascade_depth=cascade_depth,
        )
        MyCascade.save(savepath=f"strategy/{name}/")
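For intuition about what a trained strategy does at query time, here is a toy cascade: models are tried from cheapest to most expensive, and a scorer decides whether to accept an answer or escalate. This is a sketch of the general idea, not the `LLMCascade` implementation; the stub models, scorers, thresholds, and costs are all hypothetical.

```python
def run_cascade(query, stages):
    """stages: list of (model_fn, scorer_fn, threshold, cost), cheapest first."""
    total_cost = 0.0
    answer = None
    for i, (model_fn, scorer_fn, threshold, cost) in enumerate(stages):
        answer = model_fn(query)
        total_cost += cost
        is_last = i == len(stages) - 1
        # Accept the answer if the scorer is confident, or if no model is left.
        if is_last or scorer_fn(query, answer) >= threshold:
            break
    return answer, total_cost

# Stub models and scorers standing in for real LLM calls.
cheap = lambda q: "yes"
strong = lambda q: "no"
confident = lambda q, a: 0.95
unsure = lambda q, a: 0.40

accepted = run_cascade("q", [(cheap, confident, 0.9, 1e-5), (strong, confident, 0.0, 1e-3)])
escalated = run_cascade("q", [(cheap, unsure, 0.9, 1e-5), (strong, confident, 0.0, 1e-3)])
print(accepted)
print(escalated)
```

Under this picture, the budget would determine which models and thresholds a trained strategy selects, and `cascade_depth=3` would cap the number of stages.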

In [66]:
# start_budget = 5e-05 # 0.0035 
# end_budget = 0.0001
# num_eval = 2
# budget_list = numpy.linspace(start_budget, end_budget, num_eval)

name = f'{dataname}_10152024'
budget_list = [0.00005, 0.0001, 0.0005, 0.0015]

# load data
# dev = FrugalGPT.loadcsvdata(f"data/{dataname}/train.csv")

# train_data = FrugalGPT.formatdata(dev,prefix)
MyCascade = FrugalGPT.LLMCascade(
    score_noise_injection=False,
    db_path=f"db/{dataname}.sqlite",
    batch_build=True,
)
# MyCascade.load(loadpath=f"strategy/{name}/",budget=0.00017)
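The budgets above span more than an order of magnitude, so an evenly spaced grid such as the commented-out `numpy.linspace` call would put most points near the upper end. A log-spaced grid is one alternative; the sketch below is an assumption about what a useful sweep could look like, and `geometric_budgets` is an illustrative helper, not part of FrugalGPT.

```python
def geometric_budgets(start, end, num):
    """Return num budgets spaced evenly on a log scale from start to end."""
    if start <= 0 or end <= 0:
        raise ValueError("budgets must be positive")
    if num < 2:
        return [start]
    ratio = (end / start) ** (1 / (num - 1))
    return [start * ratio ** i for i in range(num)]

budgets = geometric_budgets(5e-05, 0.0015, 4)
print([f"{b:.2e}" for b in budgets])
```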

In [21]:
train_data_sample = train_data[0:] # [0:100]
print(len(train_data_sample))

In [None]:
compute_tradeoffs(train_data=train_data_sample,
                  budget_list=budget_list,
                  name=name,
                  service_names=service_names,
                #   prefix=prefix,
                  skip=0, # you can manually skip the first few budgets if they have already been trained.
                  MyCascade=MyCascade,
                  cascade_depth=3,
                  )

0it [00:00, ?it/s]

cannot find, start new training
train and test size 80 20


## Step 3: Evaluate and save the performance

In [60]:
# read the test split from data/{dataname}/Queried_{dataname}_all_models_clean_test.csv
dataset_df_test = pd.read_csv(f'data/{dataname}/Queried_{dataname}_all_models_clean_test.csv', header=0)
dataset_df_test.head()

Unnamed: 0,query_raw,query,ref_answer,gpt-4o-mini,gpt-4o,llama-3-8B,llama-3-70B,mixtral-8x7B,gemini-1.5-flash-002,gemini-1.0-pro,gemini-1.5-pro-002,Phi-3.5-mini-instruct,Phi-3-small-8k-instruct,Phi-3-mini-4k-instruct,Phi-3-medium-4k-instruct
0,Context: we disapprove orange county v. sealy ...,Please determine whether a sentence is overrul...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
1,"Context: he also left the scene of the crime, ...",Please determine whether a sentence is overrul...,no,no,no,no,no,no,no,no,no,no,no,no,no
2,Context: contrary statements in our opinions a...,Please determine whether a sentence is overrul...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes,no,yes
3,"Context: """"[a] prima facie case of good faith ...",Please determine whether a sentence is overrul...,no,no,no,no,no,no,no,no,no,no,no,no,no
4,"Context: as an intermediate appellate court, w...",Please determine whether a sentence is overrul...,no,no,no,no,no,no,no,no,no,no,no,no,no


In [61]:
test_data = []
for index, row in dataset_df_test.iterrows():
    query = row['query']
    ref_answer = row['ref_answer']
    _id = index
    model_answer = {}
    for model_name in supported_LLM_names:
        model_answer[model_name] = row[model_name]
    test_data.append([query, ref_answer, _id, model_answer])

In [62]:
test_data[3]

['Please determine whether a sentence is overruling a prior decision (Yes or No) in the following statements.\n\nContext: because jones/walker relates only to sufficiency of the evidence, we hereby disavow the language holding otherwise in sandoval.\nQuestion: Is it overruling?\nAnswer: Yes\n\nContext: according to napa auto parts, the straws drove the vehicle """"for approximately six [] weeks and [] for between 500 to 600 miles prior to the accident with no incidents.""""\nQuestion: Is it overruling?\nAnswer: No\n\nContext: ""[a] prima facie case of good faith purpose is achieved by the mere allegation . . . that the information sought is for a proper purpose.""\nQuestion: Is it overruling?\nAnswer:',
 'no',
 3,
 {'gemini-1.5-flash-002': 'no',
  'gemini-1.5-pro-002': 'no',
  'gemini-1.0-pro': 'no',
  'gpt-4o-mini': 'no',
  'gpt-4o': 'no',
  'Phi-3-mini-4k-instruct': 'no',
  'Phi-3.5-mini-instruct': 'no',
  'Phi-3-small-8k-instruct': 'no',
  'Phi-3-medium-4k-instruct': 'no',
  'llama-

In [63]:
# get the answer of the model llama-3-8B
test_data[3][3]['llama-3-8B']

'no'

In [64]:
print(len(test_data))

432
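Since the training and test records come from separate CSV files, it is cheap to confirm that no query appears in both before comparing train and test accuracy. The helper below is a sketch with an illustrative name, shown on toy records; in the notebook you would pass `train_data` and `test_data`.

```python
def overlapping_queries(train_records, test_records):
    """Return the set of query strings that appear in both splits."""
    train_queries = {rec[0] for rec in train_records}
    return {rec[0] for rec in test_records if rec[0] in train_queries}

# Toy records in the [query, ref_answer, _id, answers] layout.
toy_train = [["q1", "no", 0, {}], ["q2", "yes", 1, {}]]
toy_test = [["q2", "yes", 0, {}], ["q3", "no", 1, {}]]
shared = overlapping_queries(toy_train, toy_test)
print(len(shared), "shared queries")  # 1 shared queries
```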


In [68]:
def generate_dataframe_from_cascade(MyCascade, budget_list, train_data, test_data, genparams, name):
    # Initialize an empty list to store the rows for the DataFrame
    data = []

    # Iterate through the budget list
    for budget in tqdm(budget_list):
        # Load the strategy for the given budget
        MyCascade.load(loadpath=f"strategy/{name}/", budget=budget)
        print("loaded from path:",f"strategy/{name}/")
        print("now the budget is:",budget)

        # Get the completion batch for train data
        print("start train data")
        train_result = MyCascade.get_completion_batch(queries=train_data, genparams=genparams)
        print("train_result:",train_result)
        # Compute the ACC and cost for train data
        train_acc_cost = FrugalGPT.compute_score(train_result)

        # Get the completion batch for test data
        test_result = MyCascade.get_completion_batch(queries=test_data, genparams=genparams)

        # Compute the ACC and cost for test data
        test_acc_cost = FrugalGPT.compute_score(test_result)

        # Create a row with the schema
        row = {
            "Test_acc": test_acc_cost['em'],
            "Test_cost": test_acc_cost['cost'],
            "Test_size": len(test_data),
            "Train_acc": train_acc_cost['em'],
            "Train_cost": train_acc_cost['cost'],
            "Train_size": len(train_data),
            "Budget": budget,
            "Method": "FrugalGPT",
            "Provider": "FrugalGPT",
            "Marker": 1,  # Marker is always 1 for this function
        }

        # Append the row to the data list
        data.append(row)
        display(row)

    # Create the DataFrame from the data list
    df = pd.DataFrame(data)

    return df

In [69]:
MyCascade_eval = FrugalGPT.LLMCascade()
# MyCascade_eval.prefix = prefix

# just a demo, so only select first 500 samples
# train_data = train_data[0:500]
# test_data = test_data[0:500]

frugalgpt_df = generate_dataframe_from_cascade(MyCascade_eval,
                                               budget_list, train_data, test_data, genparams,
                                               name)
display(frugalgpt_df)
frugalgpt_df.to_csv(f"summary/summary_{dataname}_e8_frugalgpt_2024.csv")



loaded from path: strategy/OVERRULING_1015/
now the budget is: 5e-05
start train data


Collecting results:   0%|          | 0/2160 [00:00<?, ?it/s]

train_result:        _id answer ref_answer  cost
0        0     no         no     0
1        1    yes        yes     0
2        2     no         no     0
3        3     no         no     0
4        4     no         no     0
...    ...    ...        ...   ...
2155  2155     no         no     0
2156  2156    yes        yes     0
2157  2157     no         no     0
2158  2158    yes        yes     0
2159  2159    yes        yes     0

[2160 rows x 4 columns]


Collecting results:   0%|          | 0/432 [00:00<?, ?it/s]

{'Test_acc': 0.9606481481481481,
 'Test_cost': 0.0,
 'Test_size': 432,
 'Train_acc': 0.9699074074074074,
 'Train_cost': 0.0,
 'Train_size': 2160,
 'Budget': 5e-05,
 'Method': 'FrugalGPT',
 'Provider': 'FrugalGPT',
 'Marker': 1}



loaded from path: strategy/OVERRULING_1015/
now the budget is: 0.0001
start train data


Collecting results:   0%|          | 0/2160 [00:00<?, ?it/s]

train_result:        _id answer ref_answer  cost
0        0     no         no     0
1        1    yes        yes     0
2        2     no         no     0
3        3     no         no     0
4        4     no         no     0
...    ...    ...        ...   ...
2155  2155     no         no     0
2156  2156    yes        yes     0
2157  2157     no         no     0
2158  2158    yes        yes     0
2159  2159    yes        yes     0

[2160 rows x 4 columns]


Collecting results:   0%|          | 0/432 [00:00<?, ?it/s]

{'Test_acc': 0.9606481481481481,
 'Test_cost': 0.0,
 'Test_size': 432,
 'Train_acc': 0.9699074074074074,
 'Train_cost': 0.0,
 'Train_size': 2160,
 'Budget': 0.0001,
 'Method': 'FrugalGPT',
 'Provider': 'FrugalGPT',
 'Marker': 1}

100%|██████████| 2/2 [44:34<00:00, 1337.15s/it]


Unnamed: 0,Test_acc,Test_cost,Test_size,Train_acc,Train_cost,Train_size,Budget,Method,Provider,Marker
0,0.960648,0.0,432,0.969907,0.0,2160,5e-05,FrugalGPT,FrugalGPT,1
1,0.960648,0.0,432,0.969907,0.0,2160,0.0001,FrugalGPT,FrugalGPT,1
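To compare these rows with the single-model baseline, one option is to keep only the points that no cheaper and at least equally accurate point dominates. The sketch below uses the test accuracy and test cost printed in this notebook; `pareto_front` is an illustrative helper. Note that the gpt-4o-mini row was measured on 500 samples and the FrugalGPT rows on 432, so the comparison is indicative only.

```python
def pareto_front(points):
    """Keep (cost, acc, label) points not dominated on cost and accuracy."""
    front = []
    # Sort by cost ascending, breaking ties by higher accuracy first.
    for cost, acc, label in sorted(points, key=lambda p: (p[0], -p[1])):
        # A point survives only if it beats the best accuracy seen so far.
        if not front or acc > front[-1][1]:
            front.append((cost, acc, label))
    return front

points = [
    (3.3e-05, 0.838, "gpt-4o-mini"),
    (0.0, 0.960648, "FrugalGPT, budget 5e-05"),
    (0.0, 0.960648, "FrugalGPT, budget 0.0001"),
]
for cost, acc, label in pareto_front(points):
    print(f"{label}: acc={acc:.3f}, cost={cost:g}")
```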
