# 🎓 FrugalGPT: Performance and Cost Tradeoffs

This notebook illustrates the FrugalGPT framework for _building LLM Applications with budget constraints._

In particular, we will focus on evaluating the performance and cost tradeoffs enabled by FrugalGPT.

NB: We strongly recommend running this notebook on accelerated hardware (GPU/TPU).

## Installation
Let us start by installing FrugalGPT (if you haven't yet!).

In [1]:
# # set up the environment
# %%capture
# ! git clone https://github.com/stanford-futuredata/FrugalGPT
# %cd FrugalGPT
# ! pip install git+https://github.com/stanford-futuredata/FrugalGPT
# !mkdir -p strategy
# ! wget  https://github.com/lchen001/DataHolder/releases/download/v0.0.2/HEADLINES_Model2024.zip
# ! unzip HEADLINES_Model2024.zip -d strategy/
# ! rm HEADLINES_Model2024.zip
# ! wget -P db/ https://github.com/lchen001/DataHolder/releases/download/v0.0.2/HEADLINES.sqlite

In [2]:
%reload_ext autoreload
%autoreload 2
import sys, json, copy
import pandas as pd
import logging
logging.disable(logging.CRITICAL)
sys.path.append("src/")

## Setup
Next, let us set up the environment and API keys. You do _not_ need API keys to run the notebook! They are only needed if you want to use FrugalGPT for your own queries.

NB: For your own queries, you do not need every API key. If you only want to use LLMs from, e.g., OpenAI and AI21, setting up their API keys is sufficient.
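A minimal sketch of checking which keys are present before sending your own queries. The environment variable names are the ones that appear in this notebook's setup cell; the helper `missing_api_keys` and the short provider labels are hypothetical and not part of FrugalGPT.

```python
import os

# Environment variable names as they appear in this notebook's setup cell.
# The short provider labels (dict keys) are made up for this illustration.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "ai21": "AI21_STUDIO_API_KEY",
    "cohere": "COHERE_STUDIO_API_KEY",
    "textsynth": "TEXTSYNTH_API_SECRET_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "together": "TOGETHER_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def missing_api_keys(providers, env=None):
    """Return the env var names that are unset or empty for the given providers."""
    env = os.environ if env is None else env
    return [PROVIDER_KEYS[p] for p in providers if not env.get(PROVIDER_KEYS[p])]


# Only the OpenAI key is "set" in this fake environment, so AI21's is reported.
print(missing_api_keys(["openai", "ai21"], env={"OPENAI_API_KEY": "placeholder"}))
```

Passing a plain dict as `env` keeps the check testable without touching the real environment.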

In [3]:
import os
# os.environ['OPENAI_API_KEY'] = os.environ.get('OPENAI_API_KEY')
# #'OPENAI_API_KEY'
# os.environ['AI21_STUDIO_API_KEY'] = 'AI21_STUDIO_API_KEY'
# os.environ['COHERE_STUDIO_API_KEY'] = 'COHERE_STUDIO_API_KEY'
# os.environ['TEXTSYNTH_API_SECRET_KEY'] = 'TEXTSYNTH_API_SECRET_KEY'
# os.environ['ANTHROPIC_API_KEY'] = 'ANTHROPIC_API_KEY'
# os.environ['TOGETHER_API_KEY'] = 'TOGETHER_API_KEY'
# os.environ["GOOGLE_API_KEY"] = 'GOOGLE_API_KEY'

from IPython.display import display
import FrugalGPT
supported_LLM = FrugalGPT.getservicename()
print("supported LLMs:",supported_LLM)



supported LLMs: ['google/gemini-1.5-flash-002', 'google/gemini-1.5-pro-002', 'google/gemini-1.0-pro', 'openaichat/gpt-4o-mini', 'openaichat/gpt-4o', 'azure/Phi-3-mini-4k-instruct', 'azure/Phi-3.5-mini-instruct', 'azure/Phi-3-small-8k-instruct', 'azure/Phi-3-medium-4k-instruct', 'deepinfra/llama-3-8B', 'deepinfra/llama-3-70B', 'deepinfra/mixtral-8x7B']


In [4]:
supported_LLM_names = [llm.split("/")[1] for llm in supported_LLM]
print("supported_LLM_names:", supported_LLM_names)

supported_LLM_names: ['gemini-1.5-flash-002', 'gemini-1.5-pro-002', 'gemini-1.0-pro', 'gpt-4o-mini', 'gpt-4o', 'Phi-3-mini-4k-instruct', 'Phi-3.5-mini-instruct', 'Phi-3-small-8k-instruct', 'Phi-3-medium-4k-instruct', 'llama-3-8B', 'llama-3-70B', 'mixtral-8x7B']
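Note that `split("/")[1]` keeps only the second path segment. That is fine for the services listed above, but it would truncate a name whose model part itself contains a slash, such as the `togetherai/meta-llama/...` entries that appear as defaults later in this notebook. A small sketch of splitting on the first slash only (the helper name `split_service_name` is hypothetical):

```python
def split_service_name(service_name):
    """Split 'provider/model' on the first slash only, keeping the rest intact."""
    provider, model = service_name.split("/", 1)
    return provider, model


print(split_service_name("openaichat/gpt-4o-mini"))
print(split_service_name("togetherai/meta-llama/Meta-Llama-3-70B-Instruct-Turbo"))
# For comparison, split("/")[1] on the second name yields only "meta-llama".
print("togetherai/meta-llama/Meta-Llama-3-70B-Instruct-Turbo".split("/")[1])
```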


In [18]:
# print(os.environ['ANTHROPIC_API_KEY'])
# print(os.environ['OPENAI_API_KEY'])
# print(os.environ['GOOGLE_API_KEY'])

Generating the tradeoffs involves three major steps: (i) prepare the dataset, (ii) train the FrugalGPT strategy, and (iii) evaluate and save the performance.

## Step 1: Prepare the dataset

In [5]:
# dataname = "HEADLINES"
dataname = "OVERRULING"

# test_data = FrugalGPT.loadcsvdata(f"data/{dataname}/test.csv")
# prefix = open(f'config/prompt/{dataname}/prefix_e8.txt').read()
# test_data = FrugalGPT.formatdata(test_data,prefix)

# train_data = FrugalGPT.loadcsvdata(f"data/{dataname}/train.csv")
# prefix = open(f'config/prompt/{dataname}/prefix_e8.txt').read()
# train_data = FrugalGPT.formatdata(train_data,prefix)


In [6]:
# read the cached answers of all models from data/{dataname}/Queried_{dataname}_all_models_clean.csv
dataset_df = pd.read_csv(f'data/{dataname}/Queried_{dataname}_all_models_clean.csv', header=0)
dataset_df.head()

Unnamed: 0,query_raw,query,ref_answer,gpt-4o-mini,gpt-4o,llama-3-8B,llama-3-70B,mixtral-8x7B,gemini-1.5-flash-002,gemini-1.0-pro,gemini-1.5-pro-002,Phi-3.5-mini-instruct,Phi-3-small-8k-instruct,Phi-3-mini-4k-instruct,Phi-3-medium-4k-instruct
0,Context: to the extent that these cases are in...,Please determine whether a sentence is overrul...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
1,Context: we therefore reverse the order denyin...,Please determine whether a sentence is overrul...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
2,"Context: see brown v. state,\nQuestion: Is it ...",Please determine whether a sentence is overrul...,no,no,no,no,no,no,no,no,no,no,no,no,no
3,"Context: at the very least, this court ought t...",Please determine whether a sentence is overrul...,no,no,no,no,no,no,no,no,no,yes,no,no,no
4,Context: the federal immigration judge and the...,Please determine whether a sentence is overrul...,yes,no,no,yes,no,no,yes,no,no,yes,no,yes,yes


In [7]:
# reformat the dataset into a list of [query, ref_answer, _id, model_answer] records
# print(train_data[0][0]) # the query column
# print(train_data[0][1]) # the ref_answer column
# print(train_data[0][2]) # the _id: the index of the row in the csv file
# print(train_data[0][3]) # the models' answers: a dict mapping each model name to its answer
# the key point: train_data[0][3][model_name] must return the answer of the corresponding model

train_data = []
for index, row in dataset_df.iterrows():
    query = row['query']
    ref_answer = row['ref_answer']
    _id = index
    model_answer = {}
    for model_name in supported_LLM_names:
        model_answer[model_name] = row[model_name]
    train_data.append([query, ref_answer, _id, model_answer])

In [8]:
train_data[3]

['Please determine whether a sentence is overruling a prior decision (Yes or No) in the following statements.\n\nContext: because jones/walker relates only to sufficiency of the evidence, we hereby disavow the language holding otherwise in sandoval.\nQuestion: Is it overruling?\nAnswer: Yes\n\nContext: according to napa auto parts, the straws drove the vehicle """"for approximately six [] weeks and [] for between 500 to 600 miles prior to the accident with no incidents.""""\nQuestion: Is it overruling?\nAnswer: No\n\nContext: at the very least, this court ought to address the problem created by kar because, as this case illustrates, kar is distorting the burden of proof in this important area of the law.\nQuestion: Is it overruling?\nAnswer:',
 'no',
 3,
 {'gemini-1.5-flash-002': 'no',
  'gemini-1.5-pro-002': 'no',
  'gemini-1.0-pro': 'no',
  'gpt-4o-mini': 'no',
  'gpt-4o': 'no',
  'Phi-3-mini-4k-instruct': 'no',
  'Phi-3.5-mini-instruct': 'yes',
  'Phi-3-small-8k-instruct': 'no',
  '

In [9]:
# get the answer of the model llama-3-8B
train_data[3][3]['llama-3-8B']

'no'
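With the records in this `[query, ref_answer, _id, model_answer]` layout, a per-model accuracy can be read straight off the cached answers. The sketch below is a plain exact-match comparison on two toy records that reuse the answers shown for rows 3 and 4 of the dataframe above (three models only). The helper `per_model_accuracy` is hypothetical and is not `FrugalGPT.compute_score`.

```python
def per_model_accuracy(records, model_names):
    """Fraction of records where each model's answer equals the reference answer."""
    accuracy = {}
    for name in model_names:
        hits = sum(1 for _query, ref, _id, answers in records if answers[name] == ref)
        accuracy[name] = hits / len(records)
    return accuracy


# Toy records; the query strings are placeholders, the answers come from rows 3 and 4.
toy_records = [
    ["<query 3>", "no", 3,
     {"gpt-4o-mini": "no", "llama-3-8B": "no", "Phi-3.5-mini-instruct": "yes"}],
    ["<query 4>", "yes", 4,
     {"gpt-4o-mini": "no", "llama-3-8B": "yes", "Phi-3.5-mini-instruct": "yes"}],
]

print(per_model_accuracy(toy_records, ["gpt-4o-mini", "llama-3-8B", "Phi-3.5-mini-instruct"]))
# -> {'gpt-4o-mini': 0.5, 'llama-3-8B': 1.0, 'Phi-3.5-mini-instruct': 0.5}
```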

## Step 2: Train the FrugalGPT strategy for different budgets

In [25]:
service_names = ['openaichat/gpt-4o-mini',
                'openaichat/gpt-4o',
                'google/gemini-1.5-flash-002',
                'google/gemini-1.5-pro-002',
                'google/gemini-1.0-pro',
                'azure/Phi-3-mini-4k-instruct',
                'azure/Phi-3.5-mini-instruct',
                'azure/Phi-3-small-8k-instruct',
                'azure/Phi-3-medium-4k-instruct',
                'deepinfra/llama-3-8B',
                'deepinfra/llama-3-70B',
                'deepinfra/mixtral-8x7B',
                ]

### 1. Let us first evaluate individual models.

In [6]:
import pandas as pd

def generate_dataframe(service_names, train_data, test_data, genparams,
                       db_path="db/SCIQ.sqlite", max_workers=2):
    # Initialize an empty list to store the rows for the DataFrame
    data = []
    MyLLMforAll = FrugalGPT.LLMforAll(
        db_path=db_path,
        max_workers=max_workers,
    )
    # Dictionary to keep track of markers for each provider
    provider_marker = {}

    # Iterate through the service names
    for name in service_names:
        # Extract provider and method
        provider = name.split('/')[0]
        method = name.split('/')[-1]

        # If the provider is seen for the first time, initialize its marker
        if provider not in provider_marker:
            provider_marker[provider] = 1
        else:
            provider_marker[provider] += 1
        # Get the completion batch for train and test data
        r1_train = MyLLMforAll.get_completion_batch(queries=train_data, genparams=genparams, service_name=name)
        r2_train = FrugalGPT.compute_score(r1_train)
        r1_test = MyLLMforAll.get_completion_batch(queries=test_data, genparams=genparams, service_name=name)
        r2_test = FrugalGPT.compute_score(r1_test)

        # Extract accuracy and cost
        train_acc = r2_train['em']
        train_cost = r2_train['cost']
        test_acc = r2_test['em']
        test_cost = r2_test['cost']

        # Create a row with the schema
        row = {
            "Test_acc": test_acc,
            "Test_cost": test_cost,
            "Test_size": len(test_data),
            "Train_acc": train_acc,
            "Train_cost": train_cost,
            "Train_size": len(train_data),
            "Budget": 10,
            "Method": method,
            "Provider": provider,
            "Marker": provider_marker[provider],
        }

        # Append the row to the data list
        data.append(row)

    # Create the DataFrame from the data list
    df = pd.DataFrame(data)

    return df

In [None]:
genparams=FrugalGPT.GenerationParameter(max_tokens=50, temperature=0.1, stop=['\n'])

# NOTE: test_data is not defined unless the commented-out loading code in Step 1 is run first.
sample_size = 500 # 10000
individualmodel_df = generate_dataframe(service_names,
                                        train_data[0:sample_size], test_data[0:sample_size],
                                        genparams,
                                        db_path=f"db/{dataname}.sqlite",
                                        max_workers=4)
display(individualmodel_df)
individualmodel_df.to_csv(f"summary_{dataname}_e8_2024.csv")


In [48]:
# show the dataframe
display(individualmodel_df)

Unnamed: 0,Test_acc,Test_cost,Test_size,Train_acc,Train_cost,Train_size,Budget,Method,Provider,Marker
0,0.838,3.3e-05,500,0.882,3.3e-05,500,10,gpt-4o-mini,openaichat,1


### 2. Now let us train FrugalGPT on this dataset.

In [26]:
import numpy
from tqdm import tqdm
def compute_tradeoffs(
    train_data,
    budget_list,
    name = "HEADLINES", # test
    service_names = ['openaichat/gpt-4o-mini',
                      'openaichat/gpt-4o',
                      'openaichat/gpt-4-turbo',
                      'togetherai/meta-llama/Meta-Llama-3-70B-Instruct-Turbo',
                      'togetherai/google/gemma-2-9b-it',
                    ],
    prefix="",
    skip=0,
    MyCascade=None,
    cascade_depth=3,
    ):
    # Build the default cascade at call time, not at function-definition time.
    if MyCascade is None:
        MyCascade = FrugalGPT.LLMCascade(
            score_noise_injection=False,
            db_path="db/SCIQ.sqlite",
        )

    for idx, budget in tqdm(enumerate(budget_list)):
        user_budget = budget
        # Reuse a previously saved strategy for this budget if one can be loaded.
        try:
            MyCascade.load(loadpath=f"strategy/{name}/", budget=user_budget)
            print("Already trained. Skipped.")
            continue
        except Exception:
            print("cannot find, start new training")
        if idx < skip:
            continue
        # Scorer training is enabled only for the first budget (idx == 0).
        result = MyCascade.train(train_data, budget=user_budget,
                                 service_names=service_names,
                                 no_scorer_train=(idx != 0),
                                 prefix=prefix,
                                 cascade_depth=cascade_depth,
                                 )
        MyCascade.save(savepath=f"strategy/{name}/")
    return
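As background for `cascade_depth`: the general idea of an LLM cascade is to call models one after another and stop as soon as a scorer judges the current answer reliable enough, so that expensive models are only paid for on hard queries. The sketch below is a toy illustration of that control flow with stub models, stub scorers, and made-up costs in arbitrary units. It is not FrugalGPT's implementation, and `run_cascade` and every name in it are hypothetical.

```python
def run_cascade(query, stages):
    """Run stages in order; each stage is (name, answer_fn, score_fn, threshold, cost).

    Returns (answer, name of the stage that answered, total cost spent).
    The last stage's answer is always accepted.
    """
    total_cost = 0.0
    for position, (name, answer_fn, score_fn, threshold, cost) in enumerate(stages):
        answer = answer_fn(query)
        total_cost += cost
        is_last = position == len(stages) - 1
        if is_last or score_fn(query, answer) >= threshold:
            return answer, name, total_cost
    raise ValueError("stages must not be empty")


# Stub stages: the cheap model's answer scores 0.4, below its 0.8 threshold,
# so the cascade falls through to the stronger (and costlier) stub.
stages = [
    ("cheap-stub", lambda q: "no", lambda q, a: 0.4, 0.8, 1.0),
    ("strong-stub", lambda q: "yes", lambda q, a: 0.9, 0.8, 10.0),
]

print(run_cascade("Is it overruling?", stages))
# -> ('yes', 'strong-stub', 11.0)
```

Lowering the first threshold to 0.3 would make the cascade accept the cheap answer at a total cost of 1.0, which is the kind of accuracy/cost tradeoff that training under different budgets explores.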

In [30]:
start_budget = 5e-05 # 0.0035 
end_budget = 0.0001
num_eval = 2

name = f'{dataname}_1015'
budget_list = numpy.linspace(start_budget, end_budget, num_eval)

# load data
# dev = FrugalGPT.loadcsvdata(f"data/{dataname}/train.csv")

# train_data = FrugalGPT.formatdata(dev,prefix)
MyCascade = FrugalGPT.LLMCascade(
    score_noise_injection=False,
    db_path=f"db/{dataname}.sqlite",
    batch_build=True,
)
# MyCascade.load(loadpath=f"strategy/{name}/",budget=0.00017)

In [31]:
train_data_sample = train_data[0:100]

In [33]:
compute_tradeoffs(train_data=train_data_sample,
                  budget_list=budget_list,
                  name=name,
                  service_names=service_names,
                #   prefix=prefix,
                  skip=0, # you can manually skip the first few budgets if they have already been trained.
                  MyCascade=MyCascade,
                  cascade_depth=3,
                  )

0it [00:00, ?it/s]

cannot find, start new training
train and test size 80 20


  0%|          | 0/36 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

0it [00:05, ?it/s]

{'eval_loss': 0.6226165294647217, 'eval_accuracy': 0.90625, 'eval_runtime': 0.3357, 'eval_samples_per_second': 95.31, 'eval_steps_per_second': 2.978, 'epoch': 1.0}


0it [00:08, ?it/s]

{'loss': 0.6241, 'grad_norm': 4.7993879318237305, 'learning_rate': 1.0000000000000002e-06, 'epoch': 1.67}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [00:09, ?it/s]

{'eval_loss': 0.6012663841247559, 'eval_accuracy': 0.90625, 'eval_runtime': 0.3151, 'eval_samples_per_second': 101.553, 'eval_steps_per_second': 3.174, 'epoch': 2.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [00:14, ?it/s]

{'eval_loss': 0.5654211044311523, 'eval_accuracy': 0.90625, 'eval_runtime': 0.3673, 'eval_samples_per_second': 87.13, 'eval_steps_per_second': 2.723, 'epoch': 3.0}


0it [00:19, ?it/s]

{'loss': 0.5924, 'grad_norm': 3.268127202987671, 'learning_rate': 2.0000000000000003e-06, 'epoch': 3.33}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [00:20, ?it/s]

{'eval_loss': 0.5171278715133667, 'eval_accuracy': 0.90625, 'eval_runtime': 0.3729, 'eval_samples_per_second': 85.816, 'eval_steps_per_second': 2.682, 'epoch': 4.0}


0it [00:24, ?it/s]

{'loss': 0.5124, 'grad_norm': 2.804896593093872, 'learning_rate': 3e-06, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [00:25, ?it/s]

{'eval_loss': 0.461524099111557, 'eval_accuracy': 0.90625, 'eval_runtime': 0.3962, 'eval_samples_per_second': 80.76, 'eval_steps_per_second': 2.524, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [00:30, ?it/s]

{'eval_loss': 0.4039931297302246, 'eval_accuracy': 0.90625, 'eval_runtime': 0.4717, 'eval_samples_per_second': 67.836, 'eval_steps_per_second': 2.12, 'epoch': 6.0}


0it [00:32, ?it/s]

{'train_runtime': 29.5958, 'train_samples_per_second': 9.731, 'train_steps_per_second': 1.216, 'train_loss': 0.551682538456387, 'epoch': 6.0}




  0%|          | 0/36 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

0it [00:37, ?it/s]

{'eval_loss': 0.7153152823448181, 'eval_accuracy': 0.03125, 'eval_runtime': 0.3737, 'eval_samples_per_second': 85.639, 'eval_steps_per_second': 2.676, 'epoch': 1.0}


0it [00:39, ?it/s]

{'loss': 0.7165, 'grad_norm': 4.007298946380615, 'learning_rate': 1.0000000000000002e-06, 'epoch': 1.67}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [00:39, ?it/s]

{'eval_loss': 0.6869325637817383, 'eval_accuracy': 0.78125, 'eval_runtime': 0.3535, 'eval_samples_per_second': 90.514, 'eval_steps_per_second': 2.829, 'epoch': 2.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [00:42, ?it/s]

{'eval_loss': 0.6432575583457947, 'eval_accuracy': 0.96875, 'eval_runtime': 0.3412, 'eval_samples_per_second': 93.796, 'eval_steps_per_second': 2.931, 'epoch': 3.0}


0it [00:44, ?it/s]

{'loss': 0.6818, 'grad_norm': 4.821831703186035, 'learning_rate': 2.0000000000000003e-06, 'epoch': 3.33}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [00:46, ?it/s]

{'eval_loss': 0.593876838684082, 'eval_accuracy': 0.96875, 'eval_runtime': 0.3619, 'eval_samples_per_second': 88.424, 'eval_steps_per_second': 2.763, 'epoch': 4.0}


0it [00:48, ?it/s]

{'loss': 0.6122, 'grad_norm': 4.149214267730713, 'learning_rate': 3e-06, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [00:48, ?it/s]

{'eval_loss': 0.5465281009674072, 'eval_accuracy': 0.96875, 'eval_runtime': 0.3482, 'eval_samples_per_second': 91.891, 'eval_steps_per_second': 2.872, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [00:53, ?it/s]

{'eval_loss': 0.49087440967559814, 'eval_accuracy': 0.96875, 'eval_runtime': 0.4573, 'eval_samples_per_second': 69.976, 'eval_steps_per_second': 2.187, 'epoch': 6.0}


0it [00:54, ?it/s]

{'train_runtime': 19.2733, 'train_samples_per_second': 14.943, 'train_steps_per_second': 1.868, 'train_loss': 0.6509347690476311, 'epoch': 6.0}




  0%|          | 0/36 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

0it [00:58, ?it/s]

{'eval_loss': 0.7150681018829346, 'eval_accuracy': 0.03125, 'eval_runtime': 0.568, 'eval_samples_per_second': 56.341, 'eval_steps_per_second': 1.761, 'epoch': 1.0}


0it [01:01, ?it/s]

{'loss': 0.7226, 'grad_norm': 5.161259651184082, 'learning_rate': 1.0000000000000002e-06, 'epoch': 1.67}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [01:02, ?it/s]

{'eval_loss': 0.6861512660980225, 'eval_accuracy': 0.8125, 'eval_runtime': 0.6587, 'eval_samples_per_second': 48.583, 'eval_steps_per_second': 1.518, 'epoch': 2.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [01:06, ?it/s]

{'eval_loss': 0.6417638063430786, 'eval_accuracy': 0.96875, 'eval_runtime': 0.7077, 'eval_samples_per_second': 45.219, 'eval_steps_per_second': 1.413, 'epoch': 3.0}


0it [01:07, ?it/s]

{'loss': 0.6787, 'grad_norm': 4.811600685119629, 'learning_rate': 2.0000000000000003e-06, 'epoch': 3.33}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [01:09, ?it/s]

{'eval_loss': 0.5914919376373291, 'eval_accuracy': 0.96875, 'eval_runtime': 0.7195, 'eval_samples_per_second': 44.476, 'eval_steps_per_second': 1.39, 'epoch': 4.0}


0it [01:11, ?it/s]

{'loss': 0.5959, 'grad_norm': 4.121401309967041, 'learning_rate': 3e-06, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [01:12, ?it/s]

{'eval_loss': 0.542277991771698, 'eval_accuracy': 0.96875, 'eval_runtime': 0.6506, 'eval_samples_per_second': 49.189, 'eval_steps_per_second': 1.537, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [01:17, ?it/s]

{'eval_loss': 0.4839601516723633, 'eval_accuracy': 0.96875, 'eval_runtime': 0.7152, 'eval_samples_per_second': 44.744, 'eval_steps_per_second': 1.398, 'epoch': 6.0}


0it [01:18, ?it/s]

{'train_runtime': 22.7478, 'train_samples_per_second': 12.661, 'train_steps_per_second': 1.583, 'train_loss': 0.6420808368259006, 'epoch': 6.0}




  0%|          | 0/36 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

0it [01:23, ?it/s]

{'eval_loss': 0.7151697874069214, 'eval_accuracy': 0.03125, 'eval_runtime': 0.7391, 'eval_samples_per_second': 43.297, 'eval_steps_per_second': 1.353, 'epoch': 1.0}


0it [01:27, ?it/s]

{'loss': 0.7175, 'grad_norm': 4.002816200256348, 'learning_rate': 1.0000000000000002e-06, 'epoch': 1.67}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [01:27, ?it/s]

{'eval_loss': 0.6864740252494812, 'eval_accuracy': 0.8125, 'eval_runtime': 0.3651, 'eval_samples_per_second': 87.653, 'eval_steps_per_second': 2.739, 'epoch': 2.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [01:30, ?it/s]

{'eval_loss': 0.642462432384491, 'eval_accuracy': 0.96875, 'eval_runtime': 0.4171, 'eval_samples_per_second': 76.719, 'eval_steps_per_second': 2.397, 'epoch': 3.0}


0it [01:33, ?it/s]

{'loss': 0.6816, 'grad_norm': 4.8195719718933105, 'learning_rate': 2.0000000000000003e-06, 'epoch': 3.33}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [01:34, ?it/s]

{'eval_loss': 0.5927524566650391, 'eval_accuracy': 0.96875, 'eval_runtime': 0.3573, 'eval_samples_per_second': 89.55, 'eval_steps_per_second': 2.798, 'epoch': 4.0}


0it [01:37, ?it/s]

{'loss': 0.6081, 'grad_norm': 4.138127326965332, 'learning_rate': 3e-06, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [01:37, ?it/s]

{'eval_loss': 0.5447189807891846, 'eval_accuracy': 0.96875, 'eval_runtime': 0.3558, 'eval_samples_per_second': 89.948, 'eval_steps_per_second': 2.811, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [01:41, ?it/s]

{'eval_loss': 0.4882139265537262, 'eval_accuracy': 0.96875, 'eval_runtime': 0.4411, 'eval_samples_per_second': 72.546, 'eval_steps_per_second': 2.267, 'epoch': 6.0}


0it [01:43, ?it/s]

{'train_runtime': 22.1839, 'train_samples_per_second': 12.982, 'train_steps_per_second': 1.623, 'train_loss': 0.6478457715776231, 'epoch': 6.0}




  0%|          | 0/36 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

0it [01:47, ?it/s]

{'eval_loss': 0.7128154635429382, 'eval_accuracy': 0.09375, 'eval_runtime': 0.3571, 'eval_samples_per_second': 89.621, 'eval_steps_per_second': 2.801, 'epoch': 1.0}


0it [01:49, ?it/s]

{'loss': 0.7165, 'grad_norm': 4.007298946380615, 'learning_rate': 1.0000000000000002e-06, 'epoch': 1.67}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [01:50, ?it/s]

{'eval_loss': 0.688246488571167, 'eval_accuracy': 0.71875, 'eval_runtime': 0.4059, 'eval_samples_per_second': 78.835, 'eval_steps_per_second': 2.464, 'epoch': 2.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [01:56, ?it/s]

{'eval_loss': 0.6505973935127258, 'eval_accuracy': 0.90625, 'eval_runtime': 0.3996, 'eval_samples_per_second': 80.075, 'eval_steps_per_second': 2.502, 'epoch': 3.0}


0it [01:59, ?it/s]

{'loss': 0.6818, 'grad_norm': 4.822581768035889, 'learning_rate': 2.0000000000000003e-06, 'epoch': 3.33}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [02:00, ?it/s]

{'eval_loss': 0.6082526445388794, 'eval_accuracy': 0.90625, 'eval_runtime': 0.3603, 'eval_samples_per_second': 88.822, 'eval_steps_per_second': 2.776, 'epoch': 4.0}


0it [02:05, ?it/s]

{'loss': 0.6122, 'grad_norm': 4.149751663208008, 'learning_rate': 3e-06, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [02:06, ?it/s]

{'eval_loss': 0.5681744813919067, 'eval_accuracy': 0.90625, 'eval_runtime': 0.3617, 'eval_samples_per_second': 88.471, 'eval_steps_per_second': 2.765, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [02:15, ?it/s]

{'eval_loss': 0.5221556425094604, 'eval_accuracy': 0.90625, 'eval_runtime': 2.9405, 'eval_samples_per_second': 10.882, 'eval_steps_per_second': 0.34, 'epoch': 6.0}


0it [02:18, ?it/s]

{'train_runtime': 32.7146, 'train_samples_per_second': 8.803, 'train_steps_per_second': 1.1, 'train_loss': 0.6509523921542697, 'epoch': 6.0}




  0%|          | 0/36 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

0it [02:25, ?it/s]

{'eval_loss': 0.7111203670501709, 'eval_accuracy': 0.125, 'eval_runtime': 0.6292, 'eval_samples_per_second': 50.856, 'eval_steps_per_second': 1.589, 'epoch': 1.0}


0it [02:29, ?it/s]

{'loss': 0.72, 'grad_norm': 5.1706156730651855, 'learning_rate': 1.0000000000000002e-06, 'epoch': 1.67}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [02:30, ?it/s]

{'eval_loss': 0.6890625357627869, 'eval_accuracy': 0.6875, 'eval_runtime': 0.3965, 'eval_samples_per_second': 80.699, 'eval_steps_per_second': 2.522, 'epoch': 2.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [02:33, ?it/s]

{'eval_loss': 0.6556410789489746, 'eval_accuracy': 0.875, 'eval_runtime': 0.3838, 'eval_samples_per_second': 83.387, 'eval_steps_per_second': 2.606, 'epoch': 3.0}


0it [02:35, ?it/s]

{'loss': 0.6822, 'grad_norm': 3.70916485786438, 'learning_rate': 2.0000000000000003e-06, 'epoch': 3.33}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [02:37, ?it/s]

{'eval_loss': 0.6177351474761963, 'eval_accuracy': 0.875, 'eval_runtime': 0.3576, 'eval_samples_per_second': 89.491, 'eval_steps_per_second': 2.797, 'epoch': 4.0}


0it [02:39, ?it/s]

{'loss': 0.6132, 'grad_norm': 4.161294460296631, 'learning_rate': 3e-06, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [02:39, ?it/s]

{'eval_loss': 0.5816041231155396, 'eval_accuracy': 0.875, 'eval_runtime': 0.343, 'eval_samples_per_second': 93.289, 'eval_steps_per_second': 2.915, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [02:44, ?it/s]

{'eval_loss': 0.5414260625839233, 'eval_accuracy': 0.875, 'eval_runtime': 0.6089, 'eval_samples_per_second': 52.554, 'eval_steps_per_second': 1.642, 'epoch': 6.0}


0it [02:45, ?it/s]

{'train_runtime': 23.3531, 'train_samples_per_second': 12.332, 'train_steps_per_second': 1.542, 'train_loss': 0.654221687051985, 'epoch': 6.0}




  0%|          | 0/36 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

0it [02:50, ?it/s]

{'eval_loss': 0.7114455699920654, 'eval_accuracy': 0.125, 'eval_runtime': 0.5165, 'eval_samples_per_second': 61.951, 'eval_steps_per_second': 1.936, 'epoch': 1.0}


0it [02:53, ?it/s]

{'loss': 0.7226, 'grad_norm': 5.16126012802124, 'learning_rate': 1.0000000000000002e-06, 'epoch': 1.67}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [02:54, ?it/s]

{'eval_loss': 0.6884192228317261, 'eval_accuracy': 0.71875, 'eval_runtime': 0.3489, 'eval_samples_per_second': 91.711, 'eval_steps_per_second': 2.866, 'epoch': 2.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [02:58, ?it/s]

{'eval_loss': 0.6533030271530151, 'eval_accuracy': 0.875, 'eval_runtime': 0.3586, 'eval_samples_per_second': 89.244, 'eval_steps_per_second': 2.789, 'epoch': 3.0}


0it [03:01, ?it/s]

{'loss': 0.6787, 'grad_norm': 4.811611175537109, 'learning_rate': 2.0000000000000003e-06, 'epoch': 3.33}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:02, ?it/s]

{'eval_loss': 0.6137349605560303, 'eval_accuracy': 0.875, 'eval_runtime': 0.3599, 'eval_samples_per_second': 88.918, 'eval_steps_per_second': 2.779, 'epoch': 4.0}


0it [03:05, ?it/s]

{'loss': 0.5959, 'grad_norm': 4.121437072753906, 'learning_rate': 3e-06, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:05, ?it/s]

{'eval_loss': 0.5758599638938904, 'eval_accuracy': 0.875, 'eval_runtime': 0.3769, 'eval_samples_per_second': 84.902, 'eval_steps_per_second': 2.653, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:09, ?it/s]

{'eval_loss': 0.5327675342559814, 'eval_accuracy': 0.875, 'eval_runtime': 0.397, 'eval_samples_per_second': 80.597, 'eval_steps_per_second': 2.519, 'epoch': 6.0}


0it [03:10, ?it/s]

{'train_runtime': 22.4522, 'train_samples_per_second': 12.827, 'train_steps_per_second': 1.603, 'train_loss': 0.6420841614405314, 'epoch': 6.0}




  0%|          | 0/36 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:15, ?it/s]

{'eval_loss': 0.7096818685531616, 'eval_accuracy': 0.15625, 'eval_runtime': 0.3827, 'eval_samples_per_second': 83.626, 'eval_steps_per_second': 2.613, 'epoch': 1.0}


0it [03:17, ?it/s]

{'loss': 0.7182, 'grad_norm': 2.8785722255706787, 'learning_rate': 1.0000000000000002e-06, 'epoch': 1.67}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:18, ?it/s]

{'eval_loss': 0.6889460682868958, 'eval_accuracy': 0.6875, 'eval_runtime': 0.3955, 'eval_samples_per_second': 80.91, 'eval_steps_per_second': 2.528, 'epoch': 2.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:21, ?it/s]

{'eval_loss': 0.6574291586875916, 'eval_accuracy': 0.84375, 'eval_runtime': 0.3753, 'eval_samples_per_second': 85.268, 'eval_steps_per_second': 2.665, 'epoch': 3.0}


0it [03:23, ?it/s]

{'loss': 0.6827, 'grad_norm': 4.828229904174805, 'learning_rate': 2.0000000000000003e-06, 'epoch': 3.33}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:24, ?it/s]

{'eval_loss': 0.6232607364654541, 'eval_accuracy': 0.84375, 'eval_runtime': 0.3396, 'eval_samples_per_second': 94.218, 'eval_steps_per_second': 2.944, 'epoch': 4.0}


0it [03:27, ?it/s]

{'loss': 0.6132, 'grad_norm': 4.1539201736450195, 'learning_rate': 3e-06, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:27, ?it/s]

{'eval_loss': 0.5918797254562378, 'eval_accuracy': 0.84375, 'eval_runtime': 0.3852, 'eval_samples_per_second': 83.081, 'eval_steps_per_second': 2.596, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:31, ?it/s]

{'eval_loss': 0.5564340353012085, 'eval_accuracy': 0.84375, 'eval_runtime': 0.4051, 'eval_samples_per_second': 79.002, 'eval_steps_per_second': 2.469, 'epoch': 6.0}


0it [03:32, ?it/s]

{'train_runtime': 19.7395, 'train_samples_per_second': 14.59, 'train_steps_per_second': 1.824, 'train_loss': 0.6521706779797872, 'epoch': 6.0}




  0%|          | 0/36 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:36, ?it/s]

{'eval_loss': 0.7124842405319214, 'eval_accuracy': 0.09375, 'eval_runtime': 0.3545, 'eval_samples_per_second': 90.265, 'eval_steps_per_second': 2.821, 'epoch': 1.0}


0it [03:38, ?it/s]

{'loss': 0.7216, 'grad_norm': 5.161988735198975, 'learning_rate': 1.0000000000000002e-06, 'epoch': 1.67}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:39, ?it/s]

{'eval_loss': 0.6877875328063965, 'eval_accuracy': 0.75, 'eval_runtime': 0.5131, 'eval_samples_per_second': 62.361, 'eval_steps_per_second': 1.949, 'epoch': 2.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:42, ?it/s]

{'eval_loss': 0.650000274181366, 'eval_accuracy': 0.90625, 'eval_runtime': 0.6036, 'eval_samples_per_second': 53.012, 'eval_steps_per_second': 1.657, 'epoch': 3.0}


0it [03:44, ?it/s]

{'loss': 0.6788, 'grad_norm': 4.814874172210693, 'learning_rate': 2.0000000000000003e-06, 'epoch': 3.33}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:46, ?it/s]

{'eval_loss': 0.6073353290557861, 'eval_accuracy': 0.90625, 'eval_runtime': 0.5828, 'eval_samples_per_second': 54.904, 'eval_steps_per_second': 1.716, 'epoch': 4.0}


0it [03:48, ?it/s]

{'loss': 0.6001, 'grad_norm': 4.1302714347839355, 'learning_rate': 3e-06, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:49, ?it/s]

{'eval_loss': 0.5661964416503906, 'eval_accuracy': 0.90625, 'eval_runtime': 0.5349, 'eval_samples_per_second': 59.825, 'eval_steps_per_second': 1.87, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:53, ?it/s]

{'eval_loss': 0.5184752345085144, 'eval_accuracy': 0.90625, 'eval_runtime': 0.6056, 'eval_samples_per_second': 52.837, 'eval_steps_per_second': 1.651, 'epoch': 6.0}


0it [03:54, ?it/s]

{'train_runtime': 20.5565, 'train_samples_per_second': 14.01, 'train_steps_per_second': 1.751, 'train_loss': 0.6451709535386827, 'epoch': 6.0}




  0%|          | 0/36 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

0it [03:59, ?it/s]

{'eval_loss': 0.7125990986824036, 'eval_accuracy': 0.09375, 'eval_runtime': 0.3433, 'eval_samples_per_second': 93.202, 'eval_steps_per_second': 2.913, 'epoch': 1.0}


0it [04:01, ?it/s]

{'loss': 0.7226, 'grad_norm': 5.16126012802124, 'learning_rate': 1.0000000000000002e-06, 'epoch': 1.67}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [04:02, ?it/s]

{'eval_loss': 0.6875611543655396, 'eval_accuracy': 0.75, 'eval_runtime': 0.3653, 'eval_samples_per_second': 87.6, 'eval_steps_per_second': 2.737, 'epoch': 2.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [04:05, ?it/s]

{'eval_loss': 0.6492953300476074, 'eval_accuracy': 0.90625, 'eval_runtime': 0.4078, 'eval_samples_per_second': 78.47, 'eval_steps_per_second': 2.452, 'epoch': 3.0}


0it [04:06, ?it/s]

{'loss': 0.6787, 'grad_norm': 4.811611652374268, 'learning_rate': 2.0000000000000003e-06, 'epoch': 3.33}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [04:08, ?it/s]

{'eval_loss': 0.6061999797821045, 'eval_accuracy': 0.90625, 'eval_runtime': 0.3386, 'eval_samples_per_second': 94.52, 'eval_steps_per_second': 2.954, 'epoch': 4.0}


0it [04:11, ?it/s]

{'loss': 0.5959, 'grad_norm': 4.121437072753906, 'learning_rate': 3e-06, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [04:11, ?it/s]

{'eval_loss': 0.5645829439163208, 'eval_accuracy': 0.90625, 'eval_runtime': 0.3509, 'eval_samples_per_second': 91.195, 'eval_steps_per_second': 2.85, 'epoch': 5.0}


  0%|          | 0/1 [00:00<?, ?it/s]

0it [04:15, ?it/s]

{'eval_loss': 0.5164252519607544, 'eval_accuracy': 0.90625, 'eval_runtime': 0.4531, 'eval_samples_per_second': 70.632, 'eval_steps_per_second': 2.207, 'epoch': 6.0}


0it [04:17, ?it/s]

{'train_runtime': 20.5044, 'train_samples_per_second': 14.046, 'train_steps_per_second': 1.756, 'train_loss': 0.642084174686008, 'epoch': 6.0}





{'eval_loss': 0.7137999534606934, 'eval_accuracy': 0.0625, 'eval_runtime': 0.3657, 'eval_samples_per_second': 87.506, 'eval_steps_per_second': 2.735, 'epoch': 1.0}



{'loss': 0.7175, 'grad_norm': 4.002816677093506, 'learning_rate': 1.0000000000000002e-06, 'epoch': 1.67}



{'eval_loss': 0.68710857629776, 'eval_accuracy': 0.78125, 'eval_runtime': 0.3646, 'eval_samples_per_second': 87.768, 'eval_steps_per_second': 2.743, 'epoch': 2.0}



{'eval_loss': 0.6462719440460205, 'eval_accuracy': 0.9375, 'eval_runtime': 0.3782, 'eval_samples_per_second': 84.604, 'eval_steps_per_second': 2.644, 'epoch': 3.0}




RuntimeError: MPS backend out of memory (MPS allocated: 14.99 GB, other allocations: 3.07 GB, max allowed: 18.13 GB). Tried to allocate 89.42 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

In [None]:
# 0it [01:01, ?it/s]{'eval_loss': 0.718047022819519, 'eval_accuracy': 0.0, 'eval_runtime': 1.7942, 'eval_samples_per_second': 2.229, 'eval_steps_per_second': 0.557, 'epoch': 6.0}
# 0it [01:03, ?it/s]{'train_runtime': 19.4003, 'train_samples_per_second': 1.237, 'train_steps_per_second': 0.309, 'train_loss': 0.7446078459421793, 'epoch': 6.0}
# scores {'openaichat/gpt-4o-mini': {5: 0.48940352, 6: 0.48761868, 1: 0.48318094, 4: 0.48321715, 3: 0.48975572, 9: 0.4886438, 0: 0.48646355, 8: 0.48747963}, 'openaichat/gpt-4o': {5: 0.48923066, 6: 0.48772165, 1: 0.483073, 4: 0.48356527, 3: 0.48954263, 9: 0.48880833, 0: 0.48624623, 8: 0.4873342}, 'google/gemini-1.5-flash-002': {5: 0.48986983, 6: 0.48849732, 1: 0.4837681, 4: 0.48490465, 3: 0.49043417, 9: 0.48936996, 0: 0.48699588, 8: 0.48852345}}
# missing answer 5
# missing answer 6
# missing answer 1
# missing answer 4
# missing answer 3
# missing answer 9
# missing answer 0
# missing answer 8
# first train
# 0it [01:10, ?it/s]responses {'openaichat/gpt-4o-mini': {'answer': {5: 'yes', 6: 'yes', 1: 'yes', 4: 'no', 3: 'no', 9: 'yes', 0: 'yes', 8: 'no'}, 'cost': {5: 0, 6: 0, 1: 0, 4: 0, 3: 0, 9: 0, 0: 0, 8: 0}, 'quality': {5: 1, 6: 1, 1: 1, 4: 0, 3: 1, 9: 1, 0: 1, 8: 0}, 'sp': {}, 'logprobs': {}}, 'openaichat/gpt-4o': {'answer': {5: 'yes', 6: 'yes', 1: 'yes', 4: 'no', 3: 'no', 9: 'yes', 0: 'yes', 8: 'no'}, 'cost': {5: 0, 6: 0, 1: 0, 4: 0, 3: 0, 9: 0, 0: 0, 8: 0}, 'quality': {5: 1, 6: 1, 1: 1, 4: 0, 3: 1, 9: 1, 0: 1, 8: 0}, 'sp': {}, 'logprobs': {}}, 'google/gemini-1.5-flash-002': {'answer': {5: 'yes', 6: 'yes', 1: 'yes', 4: 'yes', 3: 'no', 9: 'yes', 0: 'yes', 8: 'yes'}, 'cost': {5: 0, 6: 0, 1: 0, 4: 0, 3: 0, 9: 0, 0: 0, 8: 0}, 'quality': {5: 1, 6: 1, 1: 1, 4: 1, 3: 1, 9: 1, 0: 1, 8: 1}, 'sp': {}, 'logprobs': {}}}
# labels [{'_id': 5, 'answer': 'yes'}, {'_id': 6, 'answer': 'yes'}, {'_id': 1, 'answer': 'yes'}, {'_id': 4, 'answer': 'yes'}, {'_id': 3, 'answer': 'no'}, {'_id': 9, 'answer': 'yes'}, {'_id': 0, 'answer': 'yes'}, {'_id': 8, 'answer': 'yes'}]
# scores {'openaichat/gpt-4o-mini': {5: 0.48940352, 6: 0.4876187, 1: 0.4831809, 4: 0.48321706, 3: 0.48975572, 9: 0.48864377, 0: 0.48646355, 8: 0.48747963}, 'openaichat/gpt-4o': {5: 0.48923066, 6: 0.48772168, 1: 0.48307303, 4: 0.4835652, 3: 0.4895426, 9: 0.48880833, 0: 0.4862462, 8: 0.4873342}, 'google/gemini-1.5-flash-002': {5: 0.48986983, 6: 0.48849732, 1: 0.4837681, 4: 0.48490465, 3: 0.49043417, 9: 0.48936996, 0: 0.48699588, 8: 0.48852345}}
# missing answer 5
# missing answer 6
# missing answer 1
# missing answer 4
# missing answer 3
# missing answer 9
# missing answer 0
# missing answer 8



## Step 3: Evaluate and save the performance

In [54]:
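from tqdm import tqdm  # progress bar used in the budget loop below

def generate_dataframe_from_cascade(MyCascade, budget_list, train_data, test_data, genparams, name):
    """Evaluate the cascade saved under strategy/{name}/ at each budget and return one summary row per budget."""
    # Initialize an empty list to store the rows for the DataFrame
    data = []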

    # Iterate through the budget list
    for budget in tqdm(budget_list):
        # Load the strategy for the given budget
        MyCascade.load(loadpath=f"strategy/{name}/", budget=budget)

        # Get the completion batch for train data
        train_result = MyCascade.get_completion_batch(queries=train_data, genparams=genparams)

        # Compute the ACC and cost for train data
        train_acc_cost = FrugalGPT.compute_score(train_result)


        # Get the completion batch for test data
        test_result = MyCascade.get_completion_batch(queries=test_data, genparams=genparams)

        # Compute the ACC and cost for test data
        test_acc_cost = FrugalGPT.compute_score(test_result)

        # Create a row with the schema
        row = {
            "Test_acc": test_acc_cost['em'],
            "Test_cost": test_acc_cost['cost'],
            "Test_size": len(test_data),
            "Train_acc": train_acc_cost['em'],
            "Train_cost": train_acc_cost['cost'],
            "Train_size": len(train_data),
            "Budget": budget,
            "Method": "FrugalGPT",
            "Provider": "FrugalGPT",
            "Marker": 1,  # Marker is always 1 for this function
        }

        # Append the row to the data list
        data.append(row)
        display(row)

    # Create the DataFrame from the data list
    df = pd.DataFrame(data)

    return df
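
To make the `em` and `cost` fields of each row more concrete, here is a minimal sketch of an exact-match accuracy and an average per-query cost. It is an illustration only, not the implementation of `FrugalGPT.compute_score`: the helper name `exact_match_and_cost` is hypothetical, and treating `cost` as a per-query average is an assumption. The answers and labels are the eight gpt-4o-mini entries from the commented debugging output above.

```python
# Answers and gold labels copied from the commented debugging output (gpt-4o-mini).
answers = {5: 'yes', 6: 'yes', 1: 'yes', 4: 'no', 3: 'no', 9: 'yes', 0: 'yes', 8: 'no'}
labels = {5: 'yes', 6: 'yes', 1: 'yes', 4: 'yes', 3: 'no', 9: 'yes', 0: 'yes', 8: 'yes'}
costs = {qid: 0 for qid in answers}  # that log reports zero cost for every query


def exact_match_and_cost(answers, labels, costs):
    """Hypothetical helper: fraction of exact matches and mean cost per query.

    ASSUMPTION: this mirrors what the 'em' and 'cost' keys mean; it is not
    taken from the FrugalGPT source code.
    """
    n = len(labels)
    hits = sum(answers.get(qid) == gold for qid, gold in labels.items())
    return {"em": hits / n, "cost": sum(costs.values()) / n}


score = exact_match_and_cost(answers, labels, costs)
print(score)
```

Queries 4 and 8 are the two mismatches, which agrees with the `quality` values logged for that model.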

In [57]:
MyCascade_eval = FrugalGPT.LLMCascade()
MyCascade_eval.prefix = prefix

# This is only a demo, so keep just the first 500 samples of each split
train_data = train_data[0:500]
test_data = test_data[0:500]

frugalgpt_df = generate_dataframe_from_cascade(MyCascade_eval,
                                               budget_list, train_data, test_data, genparams,
                                               name)
display(frugalgpt_df)
frugalgpt_df.to_csv(f"summary_{dataname}_e8_frugalgpt_2024.csv")

  0%|          | 0/2 [00:00<?, ?it/s]

{'Test_acc': 0.782,
 'Test_cost': 5.54532e-05,
 'Test_size': 500,
 'Train_acc': 0.786,
 'Train_cost': 5.5397999999999994e-05,
 'Train_size': 500,
 'Budget': 0.0035,
 'Method': 'FrugalGPT',
 'Provider': 'FrugalGPT',
 'Marker': 1}

 50%|█████     | 1/2 [55:34<55:34, 3334.89s/it]


SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /distilbert-base-uncased/resolve/main/tokenizer_config.json (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1006)')))"), '(Request ID: 15b1ae3d-9886-4920-8e4c-a564bfe15b73)')
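
The run above stopped at the second budget because fetching the tokenizer configuration from huggingface.co failed with an SSL error, after the first budget had already taken about 55 minutes. One possible way to make a long loop like this more tolerant of transient network failures is a small retry wrapper with exponential backoff. The sketch below is generic Python; the helper name `retry` and its parameters are hypothetical and are not part of FrugalGPT, and the failing call is simulated with a stand-in function.

```python
import time


def retry(fn, attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call fn(); on a listed exception, wait base_delay * 2**i seconds and try again.

    Hypothetical helper for illustration. The last failure is re-raised unchanged.
    """
    for i in range(attempts):
        try:
            return fn()
        except exceptions:
            if i == attempts - 1:
                raise
            sleep(base_delay * 2 ** i)


# Stand-in for a network call that fails twice and then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient network failure")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = retry(flaky, attempts=3, base_delay=1.0, exceptions=(OSError,), sleep=delays.append)
print(result, delays)
```

In the evaluation loop, the calls that may touch the network could be wrapped the same way, for example `retry(lambda: MyCascade.load(loadpath=f"strategy/{name}/", budget=budget), exceptions=(OSError,))`. Which call actually triggers the download is an assumption here. If the tokenizer files are already in the local Hugging Face cache, setting the `HF_HUB_OFFLINE=1` environment variable before importing the libraries may avoid the request entirely.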

Now let us save the results to local disk.

In [None]:
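import os
import copy
individualmodel_df2 = copy.copy(individualmodel_df)
#individualmodel_df2['Test_cost'] = individualmodel_df2['Test_cost'] * individualmodel_df2['Test_size']
full_pd = pd.concat([frugalgpt_df, individualmodel_df2])
os.makedirs("results", exist_ok=True)  # to_csv does not create missing directories
full_pd.to_csv(f"results/summary_{dataname}_e8_full_2024.csv")
try:
    # Trigger a browser download only when running on Google Colab
    from google.colab import files
    files.download(f'results/summary_{dataname}_e8_full_2024.csv')
except ImportError:
    pass  # not on Colab; the CSV is already saved under results/
display(full_pd)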


Unnamed: 0,Test_acc,Test_cost,Test_size,Train_acc,Train_cost,Train_size,Budget,Method,Provider,Marker
0,0.8478,3.3e-05,5000,0.8506,3.3e-05,5000,3.5e-05,FrugalGPT,FrugalGPT,1
1,0.8726,0.000183,5000,0.888,0.000221,5000,0.000223,FrugalGPT,FrugalGPT,1
2,0.878,0.000365,5000,0.8908,0.000372,5000,0.00041,FrugalGPT,FrugalGPT,1
3,0.878,0.000365,5000,0.8908,0.000372,5000,0.000598,FrugalGPT,FrugalGPT,1
4,0.878,0.000365,5000,0.8908,0.000372,5000,0.000786,FrugalGPT,FrugalGPT,1
5,0.878,0.000365,5000,0.8908,0.000372,5000,0.000973,FrugalGPT,FrugalGPT,1
6,0.878,0.000365,5000,0.8908,0.000372,5000,0.001161,FrugalGPT,FrugalGPT,1
7,0.878,0.000365,5000,0.8908,0.000372,5000,0.001348,FrugalGPT,FrugalGPT,1
8,0.878,0.000365,5000,0.8908,0.000372,5000,0.001536,FrugalGPT,FrugalGPT,1
9,0.878,0.000365,5000,0.8908,0.000372,5000,0.001724,FrugalGPT,FrugalGPT,1
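
In the table above, test accuracy and test cost stop changing once the budget reaches 0.00041, so larger budgets buy nothing extra on this data. A minimal sketch of how one might pick an operating point from such a table is shown below. The helper name `cheapest_within` is hypothetical, and the tuples are the first four rows copied from the table (budget, test accuracy, test cost).

```python
# (Budget, Test_acc, Test_cost) for rows 0-3 of the table above.
rows = [
    (3.5e-05, 0.8478, 3.3e-05),
    (0.000223, 0.8726, 0.000183),
    (0.00041, 0.878, 0.000365),
    (0.000598, 0.878, 0.000365),
]


def cheapest_within(rows, tolerance):
    """Hypothetical helper: cheapest row whose accuracy is within `tolerance` of the best.

    Ties on cost are broken by the smaller budget.
    """
    best_acc = max(acc for _, acc, _ in rows)
    eligible = [r for r in rows if r[1] >= best_acc - tolerance]
    return min(eligible, key=lambda r: (r[2], r[0]))


plateau = cheapest_within(rows, tolerance=0.0)     # smallest budget that reaches the best accuracy
near_best = cheapest_within(rows, tolerance=0.01)  # accept up to one accuracy point less

print("plateau budget:", plateau[0])
print("near-best budget:", near_best[0])
```

With these rows, accepting roughly half a point less test accuracy (0.8726 instead of 0.878) halves the test cost per query, from 0.000365 to 0.000183.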
