# 🎓 FrugalGPT: Generation Analysis

This notebook illustrates the FrugalGPT framework for _building LLM Applications with budget constraints._

In particular, we focus on analyzing the generation heterogeneity across different models.


## Installation
Let us start by installing FrugalGPT (if you haven't yet!).

In [1]:
%%capture
# set up the environment (the %%capture cell magic must be the first line of the cell)
!git clone https://github.com/stanford-futuredata/FrugalGPT
!pip install git+https://github.com/stanford-futuredata/FrugalGPT
!mkdir -p FrugalGPT/strategy/
!wget https://github.com/lchen001/DataHolder/releases/download/v0.0.3/HEADLINES_Model2024_New.zip
!unzip HEADLINES_Model2024_New.zip -d FrugalGPT/strategy/HEADLINES_Model2024_New
!rm HEADLINES_Model2024_New.zip
!wget -P FrugalGPT/db/ https://github.com/lchen001/DataHolder/releases/download/v0.0.3/HEADLINES.sqlite


In [2]:
%cd FrugalGPT

/content/FrugalGPT


In [3]:
%load_ext autoreload
%autoreload 2
import sys, json, copy
import logging
logging.disable(logging.CRITICAL)
sys.path.append("src/")

## Setup
Next, let us set up the environment and API keys. You do _not_ need API keys to run this notebook! They are only needed if you want to use FrugalGPT for your own queries.
#### NB: _Even for your own queries, you do not need every API key. If you only want to use LLMs from, e.g., OpenAI and AI21, setting up the API keys for those two providers is sufficient._

In [4]:
import os
os.environ['OPENAI_API_KEY'] = 'OPENAI_API_KEY'
os.environ['AI21_STUDIO_API_KEY'] = 'AI21_STUDIO_API_KEY'
os.environ['COHERE_STUDIO_API_KEY'] = 'COHERE_STUDIO_API_KEY'
os.environ['TEXTSYNTH_API_SECRET_KEY'] = 'TEXTSYNTH_API_SECRET_KEY'
os.environ['ANTHROPIC_API_KEY'] = 'ANTHROPIC_API_KEY'
os.environ['TOGETHER_API_KEY'] = 'TOGETHER_API_KEY'
os.environ['GEMINI_API_KEY'] = 'GEMINI_API_KEY'

from IPython.display import display
import FrugalGPT
supported_LLM = FrugalGPT.getservicename()
print("supported LLMs:",supported_LLM)

supported LLMs: ['textsynth/gptneox_20B', 'textsynth/fairseq_gpt_13B', 'textsynth/gptj_6B', 'openai/text-davinci-002', 'openai/text-davinci-003', 'openai/text-curie-001', 'openai/text-babbage-001', 'openai/text-ada-001', 'openaichat/gpt-4o-mini', 'openaichat/gpt-4o-mini-2024-07-18', 'openaichat/gpt-4o', 'openaichat/gpt-4o-2024-05-13', 'openaichat/gpt-4-turbo', 'openaichat/gpt-4o-2024-08-06', 'openaichat/gpt-3.5-turbo', 'openaichat/gpt-4', 'ai21/jamba-1.5-mini', 'ai21/jamba-1.5-large', 'ai21/j1-jumbo', 'ai21/j1-grande', 'ai21/j1-large', 'ai21/j2-ultra', 'ai21/j2-mid', 'ai21/j2-light', 'cohere/command', 'cohere/base', 'cohere/xlarge', 'cohere/medium', 'togetherai/Qwen/Qwen2-72B-Instruct', 'togetherai/mistralai/Mistral-7B-Instruct-v0.1', 'togetherai/google/gemma-2b-it', 'togetherai/google/gemma-2-9b-it', 'togetherai/google/gemma-2-27b-it', 'togetherai/meta-llama/Meta-Llama-3-8B-Instruct-Lite', 'togetherai/Qwen/Qwen1.5-110B-Chat', 'togetherai/mistralai/Mistral-7B-Instruct-v0.3', 'togethera

Analyzing the overlap takes two steps. The first is to extract the generations and answers from all models as a dataframe. The second is to compute the pairwise comparison.

## Step 1: Extract generations

In [9]:
import pandas as pd

def extract_dataframe(
    service_name,
    data,
    genparams,
    db_path="db/SCIQ.sqlite",
    max_workers=1,
):
    # Query a single model on every item in `data`, using the database at `db_path`
    MyLLMforAll = FrugalGPT.LLMforAll(
        db_path=db_path,
        max_workers=max_workers,
    )
    r1_train = MyLLMforAll.get_completion_batch(
        queries=data, genparams=genparams, service_name=service_name
    )
    # Tag every row with the model that produced it
    r1_train['model'] = service_name
    return r1_train

def extract_dataframe_batch(
    service_names,
    data,
    genparams,
    db_path="db/SCIQ.sqlite",
    max_workers=1,
    metric='em',
):
    # List to hold one DataFrame per model
    dfs = []

    # Loop through each service name and call the single-model function
    for name in service_names:
        df = extract_dataframe(
            service_name=name,
            data=data,
            genparams=genparams,
            db_path=db_path,
            max_workers=max_workers,
        )
        dfs.append(df)

    # Concatenate all DataFrames in the list along rows (axis=0)
    result = pd.concat(dfs, ignore_index=True)
    # Score every answer against its reference with the chosen metric
    FrugalGPT.compute_score_full(result, metric=metric)
    return result
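
The stacking pattern used in Step 1 can be sketched without any API calls. In the toy example below, `fake_completions` and the two service names are hypothetical stand-ins (they are not part of FrugalGPT), and the `em` column is computed by strict string equality, which is a simplification of whatever normalization the real scoring applies.

```python
import pandas as pd

def fake_completions(service_name):
    # Hypothetical stand-in for a batch completion call: one row per query.
    return pd.DataFrame({
        "_id": [0, 1],
        "answer": ["up", "down"],
        "ref_answer": ["up", "up"],
    })

frames = []
for name in ["provider_a/model_x", "provider_b/model_y"]:  # made-up names
    df = fake_completions(name)
    df["model"] = name  # tag each row with the model that produced it
    frames.append(df)

# Long format: one row per (query, model) pair
long_df = pd.concat(frames, ignore_index=True)

# Simplified exact-match score (assumption: strict equality, no normalization)
long_df["em"] = (long_df["answer"] == long_df["ref_answer"]).astype(int)
print(long_df)
```

Keeping the data in this long format (one row per query and model) is what lets Step 2 select any two models and align them on `_id`.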

## Step 2: Conduct pairwise comparison

In [10]:
import pandas as pd

def compare_models(df, model_1, model_2):
    """
    Compare two models in the DataFrame `df` to compute the required fractions.

    Args:
        df (pd.DataFrame): The DataFrame containing the data.
        model_1 (str): The name of the first model.
        model_2 (str): The name of the second model.

    Returns:
        dict: A dictionary with the fractions calculated for the three conditions.
    """

    # Select the rows belonging to each model (the `model` column holds the model names)
    df_model_1 = df[df['model'] == model_1]
    df_model_2 = df[df['model'] == model_2]

    # Inner-join on _id so only queries answered by both models are compared;
    # the suffixes yield columns named em_<model_1> and em_<model_2>
    merged_df = pd.merge(df_model_1[['_id', 'em']], df_model_2[['_id', 'em']], on='_id', suffixes=(f'_{model_1}', f'_{model_2}'))

    # (i) Fraction of _ids where both models' em are the same
    same_em = (merged_df[f'em_{model_1}'] == merged_df[f'em_{model_2}']).mean()

    # (ii) Fraction of _ids where the first model's em is 1 and the second model's em is 0
    first_1_second_0 = ((merged_df[f'em_{model_1}'] == 1) & (merged_df[f'em_{model_2}'] == 0)).mean()

    # (iii) Fraction of _ids where the first model's em is 0 and the second model's em is 1
    first_0_second_1 = ((merged_df[f'em_{model_1}'] == 0) & (merged_df[f'em_{model_2}'] == 1)).mean()

    # Return the results in a dictionary
    return {
        'fraction_same_em': same_em,
        'fraction_first_1_second_0': first_1_second_0,
        'fraction_first_0_second_1': first_0_second_1
    }
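
To make the three fractions concrete, here is a minimal sketch on a hand-made table. The model names `model_a` and `model_b` and all scores are invented for illustration, and the logic is inlined rather than calling `compare_models` so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical toy data: two made-up models scored on four queries
toy_df = pd.DataFrame({
    "_id":   [1, 2, 3, 4, 1, 2, 3, 4],
    "model": ["model_a"] * 4 + ["model_b"] * 4,
    "em":    [1, 1, 0, 0, 1, 0, 1, 0],
})

# Align the two models' scores on the shared query id
a = toy_df[toy_df["model"] == "model_a"][["_id", "em"]]
b = toy_df[toy_df["model"] == "model_b"][["_id", "em"]]
merged = pd.merge(a, b, on="_id", suffixes=("_a", "_b"))

same = (merged["em_a"] == merged["em_b"]).mean()                      # both right or both wrong
a_only = ((merged["em_a"] == 1) & (merged["em_b"] == 0)).mean()       # only model_a is right
b_only = ((merged["em_a"] == 0) & (merged["em_b"] == 1)).mean()       # only model_b is right

print(same, a_only, b_only)  # 0.5 0.25 0.25
```

Because `em` is binary, the three cases are exhaustive and mutually exclusive, so the fractions always sum to 1.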

## Analysis on Headlines

In [11]:
dataname = "HEADLINES"

# Load the prompt prefix once; it is shared by the test and train splits
with open(f'config/prompt/{dataname}/prefix_e8.txt') as f:
    prefix = f.read()

test_data = FrugalGPT.loadcsvdata(f"data/{dataname}/test.csv")
test_data = FrugalGPT.formatdata(test_data, prefix)

train_data = FrugalGPT.loadcsvdata(f"data/{dataname}/train.csv")
train_data = FrugalGPT.formatdata(train_data, prefix)

In [12]:
sample_size = 10000  # upper bound on the number of test queries; the slice below returns fewer if the split is smaller
service_names = [
    'ai21/jamba-1.5-large',

    'togetherai/meta-llama/Meta-Llama-3-70B-Instruct-Turbo',
    'togetherai/google/gemma-2-9b-it',

    'google/gemini-1.5-flash',
    'google/gemini-1.5-pro',
    'google/gemini-1.5-flash-8b',

    'openaichat/gpt-4o-2024-05-13',
    'openaichat/gpt-4o-mini',
    'openaichat/gpt-4-turbo',

    'anthropic/claude-3-5-sonnet-20240620',
]
genparams=FrugalGPT.GenerationParameter(max_tokens=50,
                                        temperature=0.1,
                                        stop=['\n'])
individualmodel_df = extract_dataframe_batch(service_names=service_names,
                                        data=test_data[0:sample_size],
                                        genparams=genparams,
                                        db_path=f"db/{dataname}.sqlite",
                                        max_workers=2)
individualmodel_df

5000it [00:08, 590.79it/s]
5000it [00:08, 613.85it/s]
5000it [00:07, 626.96it/s]
5000it [00:08, 606.14it/s]
5000it [00:08, 616.14it/s]
5000it [00:08, 618.90it/s]
5000it [00:07, 639.99it/s]
5000it [00:07, 647.12it/s]
5000it [00:07, 650.60it/s]
5000it [00:08, 589.10it/s]


Unnamed: 0,_id,answer,ref_answer,cost,model,em
0,6556,up.,up,0.000514,ai21/jamba-1.5-large,1
1,5832,up.,none,0.000518,ai21/jamba-1.5-large,0
2,5618,up.,none,0.000518,ai21/jamba-1.5-large,0
3,4205,down,down,0.000506,ai21/jamba-1.5-large,1
4,842,up.,down,0.000510,ai21/jamba-1.5-large,0
...,...,...,...,...,...,...
49995,8376,neutral,up,0.000792,anthropic/claude-3-5-sonnet-20240620,0
49996,4242,down,down,0.000795,anthropic/claude-3-5-sonnet-20240620,1
49997,6890,neutral,neutral,0.000789,anthropic/claude-3-5-sonnet-20240620,1
49998,663,down,up,0.000801,anthropic/claude-3-5-sonnet-20240620,0


In [13]:
result = compare_models(
    individualmodel_df,
    'togetherai/meta-llama/Meta-Llama-3-70B-Instruct-Turbo',
    'openaichat/gpt-4-turbo',
)
print(result)

{'fraction_same_em': 0.8682, 'fraction_first_1_second_0': 0.0512, 'fraction_first_0_second_1': 0.0806}
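
In the result above the three fractions sum to 1 (0.8682 + 0.0512 + 0.0806), and the gap between the two one-sided fractions, 0.0806 - 0.0512 = 0.0294, is the exact-match accuracy difference between the two models on the shared queries. The same comparison can be repeated for every pair of models. The sketch below is one possible way to do that; `disagreement_table` is a hypothetical helper (not part of FrugalGPT), it assumes each `(_id, model)` pair appears at most once, and it is shown on invented toy data.

```python
from itertools import combinations

import pandas as pd

def disagreement_table(df, score_col="em"):
    # Hypothetical helper: one row per unordered pair of models.
    # Pivot to wide form: one row per query, one column per model.
    wide = df.pivot(index="_id", columns="model", values=score_col)
    rows = []
    for m1, m2 in combinations(wide.columns, 2):
        pair = wide[[m1, m2]].dropna()  # keep queries answered by both models
        rows.append({
            "model_1": m1,
            "model_2": m2,
            "fraction_same_em": (pair[m1] == pair[m2]).mean(),
            "fraction_first_1_second_0": ((pair[m1] == 1) & (pair[m2] == 0)).mean(),
            "fraction_first_0_second_1": ((pair[m1] == 0) & (pair[m2] == 1)).mean(),
        })
    return pd.DataFrame(rows)

# Invented toy data: three made-up models scored on four queries
toy_df = pd.DataFrame({
    "_id":   [1, 2, 3, 4] * 3,
    "model": ["model_a"] * 4 + ["model_b"] * 4 + ["model_c"] * 4,
    "em":    [1, 1, 0, 0,   1, 0, 1, 0,   1, 1, 1, 0],
})

table = disagreement_table(toy_df)
print(table)
```

With the notebook's data, calling the helper on `individualmodel_df` would give one row for each of the 45 pairs among the ten models, provided the uniqueness assumption holds.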
