Copyright (c) Microsoft Corporation. All rights reserved. 

Licensed under the MIT License.

# Use FLAML to Tune ChatGPT

FLAML offers a cost-effective hyperparameter optimization technique [EcoOptiGen](https://arxiv.org/abs/2303.04673) for tuning Large Language Models. Our study finds that tuning hyperparameters can significantly improve the utility of LLMs.

In this notebook, we tune OpenAI ChatGPT (both GPT-3.5 and GPT-4) models for math problem solving. We use [the MATH benchmark](https://crfm.stanford.edu/helm/latest/?group=math_chain_of_thought) to measure mathematical problem solving on competition math problems with chain-of-thought style reasoning.

## Requirements

FLAML requires `Python>=3.7`. To run this notebook example, please install flaml with the [openai] option:
```bash
pip install flaml[openai]==1.2.0
```

In [1]:
# %pip install flaml[openai]==1.2.0 datasets

Set your OpenAI key:

In [2]:
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = "<your OpenAI API key here>"
import openai

# Optional: uncomment to read the key from a file instead of the environment variable.
# openai.api_key_path = "key.txt"

Uncomment the following to use Azure OpenAI:

In [3]:
# openai.api_type = "azure"
# openai.api_base = "https://<your_endpoint>.openai.azure.com/"
# openai.api_version = "2023-03-15-preview"

## Load dataset

First, we load the competition_math dataset. Its test split contains 201 "Level 2" Algebra examples, which we use for evaluation. For tuning the generation hyperparameters, we use 20 "Level 2" Algebra examples taken from the shuffled training split.

In [4]:
import datasets

seed = 41
data = datasets.load_dataset("competition_math")
train_data = data["train"].shuffle(seed=seed)
test_data = data["test"].shuffle(seed=seed)
n_tune_data = 20
# keep only the "Level 2" Algebra problems; the first n_tune_data of them are used for tuning
tune_data = [
    {"problem": x["problem"], "solution": x["solution"]}
    for x in train_data
    if x["level"] == "Level 2" and x["type"] == "Algebra"
][:n_tune_data]
# all "Level 2" Algebra problems in the test split are used for evaluation
test_data = [
    {"problem": x["problem"], "solution": x["solution"]}
    for x in test_data
    if x["level"] == "Level 2" and x["type"] == "Algebra"
]
print(len(tune_data), len(test_data))


20 201


Check a tuning example:

In [5]:
print(tune_data[1]["problem"])

Rationalize the denominator of $\displaystyle\frac{21}{\sqrt{21}}$.


Here is its canonical solution:

In [6]:
print(tune_data[1]["solution"])

$\dfrac{21}{\sqrt{21}} = \dfrac{21}{\sqrt{21}} \cdot \dfrac{\sqrt{21}}{\sqrt{21}} = \dfrac{21\sqrt{21}}{21} = \boxed{\!\sqrt{21}}$.


## Define Success Metric

Before we start tuning, we need to define the success metric we want to optimize. For each math task, we use voting to select the response whose answer is the most common among all the generated responses. If its answer is equivalent to that of the canonical solution, we consider the task successfully solved. We can then optimize the mean success rate over a collection of tasks.

In [7]:
from flaml.autogen.math_utils import success_metrics
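
To make the voting idea concrete, here is a minimal, self-contained sketch. It is not the implementation behind `success_metrics`: the helper names (`extract_boxed`, `vote`, `mean_success_vote`) are hypothetical, and answers are compared by exact string match rather than by mathematical equivalence.

```python
from collections import Counter


def extract_boxed(response):
    """Return the content of the last \\boxed{...} in a response, or None."""
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start < 0:
        return None
    begin = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    for i in range(begin, len(response)):
        if response[i] == "{":
            depth += 1
        elif response[i] == "}":
            depth -= 1
            if depth == 0:
                return response[begin:i].strip()
    return None  # unbalanced braces


def vote(responses):
    """Pick the most common extracted answer among the responses."""
    answers = [a for a in map(extract_boxed, responses) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def mean_success_vote(tasks):
    """tasks: list of (responses, canonical_solution) pairs."""
    hits = [vote(responses) == extract_boxed(solution) for responses, solution in tasks]
    return sum(hits) / len(hits)
```

Exact string matching is the weak point of this sketch. The canonical solution shown earlier writes its answer as `\boxed{\!\sqrt{21}}`, which would not match a response ending in `\boxed{\sqrt{21}}`, so a real metric needs an equivalence check on the answers.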

## Use the tuning data to find a good configuration

### Import the oai and tune subpackages from flaml.

FLAML provides `oai.ChatCompletion.tune` for hyperparameter optimization of OpenAI ChatGPT models and `oai.ChatCompletion.create` for making a request with the tuned config. First, we import oai from flaml:

In [8]:
from flaml import oai

For (local) reproducibility and cost efficiency, we cache responses from OpenAI.

In [9]:
oai.ChatCompletion.set_cache(seed)

This will create a disk cache in ".cache/{seed}". You can change `cache_path` in `set_cache()`. Caches for different seeds are stored separately.
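
The idea of a per-seed disk cache can be sketched with the standard library alone. This is an illustration under stated assumptions, not FLAML's cache: the function `cached_call`, the hashing scheme, and the one-JSON-file-per-request layout are all hypothetical. Only the ".cache/{seed}" directory convention comes from the text above.

```python
import hashlib
import json
import os


def cached_call(request, call, seed=41, cache_path=".cache"):
    """Return a cached response for `request`, calling `call(request)` on a miss."""
    # Each seed gets its own directory, so caches for different seeds never mix.
    cache_dir = os.path.join(cache_path, str(seed))
    os.makedirs(cache_dir, exist_ok=True)
    # Key the entry by a hash of the canonical JSON form of the request.
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    entry = os.path.join(cache_dir, key + ".json")
    if os.path.exists(entry):
        with open(entry) as f:
            return json.load(f)
    response = call(request)  # cache miss: this is where money would be spent
    with open(entry, "w") as f:
        json.dump(response, f)
    return response
```

Repeating an identical request with the same seed is then served from disk, which is what makes reruns both reproducible and free.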

### Perform tuning

Tuning takes a while to finish, depending on the optimization budget. It is performed under the specified budgets:

* `inference_budget` is the target average inference budget per instance in the benchmark. For example, 0.004 means the target inference budget is 0.004 dollars, which translates to 2000 tokens (input + output combined) if the gpt-3.5-turbo model is used.
* `optimization_budget` is the total budget allowed for tuning. For example, 1 means 1 dollar is allowed in total, which translates to 500K tokens for the gpt-3.5-turbo model.
* `num_samples` is the number of different hyperparameter configurations that may be tried. Tuning stops after either `num_samples` trials or after `optimization_budget` dollars are spent, whichever happens first. -1 means no hard restriction on the number of trials; the actual number is then determined by `optimization_budget`.
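
The dollar-to-token conversions and the stopping rule above can be checked with a few lines of arithmetic. This is a sketch with hypothetical helper names (`tokens_for_budget`, `should_stop`); the price of $0.002 per 1K tokens for gpt-3.5-turbo is the rate implied by the two examples in the list, not a value read from FLAML.

```python
GPT35_PRICE_PER_1K = 0.002  # dollars per 1K tokens, implied by the examples above


def tokens_for_budget(budget, price_per_1k=GPT35_PRICE_PER_1K):
    """Number of tokens (input + output combined) that `budget` dollars can buy."""
    return round(budget / price_per_1k * 1000)


def should_stop(trials_done, dollars_spent, num_samples, optimization_budget):
    """Stop on whichever limit is hit first; num_samples=-1 disables the trial limit."""
    if dollars_spent >= optimization_budget:
        return True
    return num_samples != -1 and trials_done >= num_samples


print(tokens_for_budget(0.004))  # inference budget of 0.004 dollars
print(tokens_for_budget(1))      # optimization budget of 1 dollar
```

With `num_samples=-1`, only the money spent can end the search, which is the setting used in the tuning call in this notebook.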

Users can specify tuning data, optimization metric, optimization mode, evaluation function, search spaces, etc. The default search space is:

```python
default_search_space = {
    "model": tune.choice([
        "gpt-3.5-turbo",
        "gpt-4",
    ]),
    "temperature_or_top_p": tune.choice(
        [
            {"temperature": tune.uniform(0, 1)},
            {"top_p": tune.uniform(0, 1)},
        ]
    ),
    "max_tokens": tune.lograndint(50, 1000),
    "n": tune.randint(1, 100),
    "prompt": "{prompt}",
}
```
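
To see what one draw from a space of this shape looks like, here is a sketch that uses only Python's `random` module. It mimics the meaning of each domain but is not FLAML's sampler, which searches adaptively rather than at random. The function name `sample_config` is hypothetical, and the exclusive upper bounds for `max_tokens` and `n` are an assumption.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration with the same shape as the default search space."""
    config = {"model": rng.choice(["gpt-3.5-turbo", "gpt-4"])}
    # Exactly one of temperature / top_p is tuned in a given trial.
    if rng.random() < 0.5:
        config["temperature"] = rng.uniform(0, 1)
    else:
        config["top_p"] = rng.uniform(0, 1)
    # Log-uniform integer: small values of max_tokens are sampled more densely.
    raw = int(math.exp(rng.uniform(math.log(50), math.log(1000))))
    config["max_tokens"] = max(50, min(999, raw))  # assumes an exclusive upper bound
    config["n"] = rng.randrange(1, 100)  # assumes an exclusive upper bound
    config["prompt"] = "{prompt}"
    return config


print(sample_config(random.Random(0)))
```

The sketch flattens the `temperature_or_top_p` choice into a single `temperature` or `top_p` key, which is the same shape as the optimized config printed after tuning.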

Users can override the default search space.
For example, the following code specifies a fixed prompt template. For hyperparameters that don't appear in the user's input, the default search space is used.

In [10]:
import logging

prompts = ["{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\boxed{{}}."]
config, analysis = oai.ChatCompletion.tune(
    data=tune_data,  # the data for tuning
    metric="success_vote",  # the metric to optimize
    mode="max",  # the optimization mode
    eval_func=success_metrics,  # the evaluation function to return the success metrics
    # log_file_name="logs/math.log",  # the log file name
    inference_budget=0.02,  # the inference budget (dollar)
    optimization_budget=1,  # the optimization budget (dollar)
    # num_samples can further limit the number of trials for different hyperparameter configurations;
    # -1 means decided by the optimization budget only
    num_samples=-1,
    # model="chatgpt-35-turbo-0301",  # uncomment if using Azure OpenAI
    # model="gpt-3.5-turbo",  # uncomment if you don't have access to gpt-4
    prompt=prompts,  # the prompt templates to choose from
    # stop="###",  # the stop sequence
    logging_level=logging.INFO,  # the logging level
)


[I 2023-03-29 22:15:13,167] A new study created in memory with name: optuna
[I 2023-03-29 22:15:13,169] A new study created in memory with name: optuna


[flaml.tune.tune: 03-29 22:15:13] {832} INFO - trial 1 config: {'model': 'gpt-4', 'temperature_or_top_p': {'temperature': 0.36865945026811975}, 'max_tokens': 347, 'n': 1, 'prompt': 0}
[flaml.tune.tune: 03-29 22:19:26] {215} INFO - result: {'expected_success': 0.8, 'success': 0.8, 'success_vote': 0.8, 'voted_answer': 'We use the distance formula to find the distance between the points $(0,4)$ and $(3,0)$: $$\\sqrt{(3-0)^2+(0-4)^2}=\\sqrt{3^2+(-4)^2}=\\sqrt{9+16}=\\sqrt{25}=\\boxed{5}.$$', 'total_cost': 0.14595, 'cost': 0.14595, 'inference_cost': 0.007297499999999999, 'training_iteration': 0, 'config': {'model': 'gpt-4', 'temperature_or_top_p': {'temperature': 0.36865945026811975}, 'max_tokens': 347, 'n': 1, 'prompt': 0}, 'config/model': 'gpt-4', 'config/temperature_or_top_p': {'temperature': 0.36865945026811975}, 'config/max_tokens': 347, 'config/n': 1, 'config/prompt': 0, 'experiment_tag': 'exp', 'time_total_s': 252.8926329612732}
[flaml.tune.tune: 03-29 22:19:26] {832} INFO - trial 2 

### Output tuning results

After the tuning, we can print out the config and the result found by FLAML:

In [11]:
print("optimized config", config)
print("best result on tuning data", analysis.best_result)

optimized config {'model': 'gpt-3.5-turbo', 'max_tokens': 424, 'n': 54, 'prompt': '{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\boxed{{}}.', 'stop': None, 'temperature': 0.9177741225129434}
best result on tuning data {'expected_success': 0.9999999951313461, 'success': 1.0, 'success_vote': 0.95, 'voted_answer': 'Using the distance formula, we have \\begin{align*}\n\\text{distance} &= \\sqrt{(3-0)^2 + (0-4)^2} \\\\\n&= \\sqrt{9+16} \\\\\n&= \\sqrt{25} \\\\\n&= \\boxed{5}.\n\\end{align*}', 'total_cost': 0.6470319999999998, 'cost': 0.319376, 'inference_cost': 0.0153156, 'training_iteration': 0, 'config': {'model': 'gpt-3.5-turbo', 'temperature_or_top_p': {'temperature': 0.9177741225129434}, 'max_tokens': 424, 'n': 54, 'prompt': 0}, 'config/model': 'gpt-3.5-turbo', 'config/temperature_or_top_p': {'temperature': 0.9177741225129434}, 'config/max_tokens': 424, 'config/n': 54, 'config/prompt': 0, 'experiment_tag': 'exp', 'time_total_

### Make a request with the tuned config

We can apply the tuned config to a request for an example task:

In [12]:
responses = oai.ChatCompletion.create(context=tune_data[1], **config)
print(responses)
print(success_metrics([response["message"]["content"].rstrip() for response in responses["choices"]], **tune_data[1]))

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "We want to get rid of the square root in the denominator. We can do this by multiplying both the numerator and denominator of the fraction by $\\sqrt{21}$: $$\\frac{21}{\\sqrt{21}}\\cdot\\frac{\\sqrt{21}}{\\sqrt{21}}=\\frac{21\\sqrt{21}}{21}=\\sqrt{21}.$$ Thus, $\\displaystyle\\frac{21}{\\sqrt{21}}=\\boxed{\\sqrt{21}}$.",
        "role": "assistant"
      }
    },
    {
      "finish_reason": "stop",
      "index": 1,
      "message": {
        "content": "We have $$\\frac{21}{\\sqrt{21}} = \\frac{21}{\\sqrt{21}} \\cdot \\frac{\\sqrt{21}}{\\sqrt{21}} = \\frac{21\\sqrt{21}}{21} = \\sqrt{21}.$$Therefore, the answer is $\\boxed{\\sqrt{21}}$.",
        "role": "assistant"
      }
    },
    {
      "finish_reason": "stop",
      "index": 2,
      "message": {
        "content": "Using the definition of square roots, we see that $\\sqrt{21}\\cdot\\sqrt{21}=21$. Therefore, we can wr

### Evaluate the success rate on the test data

You can use flaml's `oai.ChatCompletion.eval` to evaluate performance on an entire dataset with the tuned config. To do that, set `oai.ChatCompletion.data` to the data to evaluate. The following code takes a while (30 minutes to 1 hour) to evaluate all the test data instances and costs roughly $3.

In [13]:
oai.ChatCompletion.data = test_data
result = oai.ChatCompletion.eval(analysis.best_config, prune=False, eval_only=True)
print(result)

{'expected_success': 0.9889659482835421, 'success': 0.9950248756218906, 'success_vote': 0.9154228855721394, 'voted_answer': "We can start by setting up a proportion: $$\\frac{\\text{power in kilowatts}}{\\text{power in metric horsepower}}=\\frac{1}{1.36}$$ Let $P$ be the power in kilowatts. Then, we can write: $$\\frac{P}{500}= \\frac{1}{1.36}$$ Cross-multiplying, we have: $$P=\\frac{500}{1.36} \\approx \\boxed{368}$$ Therefore, Eric's car's engine can generate approximately $\\boxed{368}$ kilowatts of power.", 'total_cost': 4.208937999999998, 'cost': 3.186843999999999, 'inference_cost': 0.015838416417910447}


What about the default, untuned gpt-4 config (with the same prompt as the tuned config)? We can evaluate it and compare:

In [14]:
# assuming you have access to gpt-4; otherwise use gpt-3.5-turbo
# the following code costs roughly $2 to run.

default_config = {"model": 'gpt-4', "prompt": 0}
default_result = oai.ChatCompletion.eval(default_config, prune=False, eval_only=True)
print(default_result)
print("default config of GPT-4 succeeds in {:.1f}% test cases".format(default_result["success_vote"] * 100))
print("tuned config succeeds in {:.1f}% test cases".format(result["success_vote"] * 100))


retrying in 10 seconds...
Traceback (most recent call last):
  File "/home/vscode/.local/lib/python3.9/site-packages/openai/api_requestor.py", line 669, in _interpret_response_line
    data = json.loads(rbody)
  File "/usr/local/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspaces/FLAML/flaml/integrations/oai/completion.py", line 139, in _get_response
    response = openai_completion.create(**config)
  File "/home/vscode/.local/lib/python3.9/site-packages/openai/api_resources/chat_completion

{'expected_success': 0.7213930348258707, 'success': 0.7213930348258707, 'success_vote': 0.7213930348258707, 'voted_answer': "First, convert the $500$ horsepower to kilowatts by dividing by $1.36$. $500/1.46 \\approx 360.5$. Therefore Eric's car can generate about $\\boxed{361}$ kilowatts of power.", 'total_cost': 6.099327999999994, 'cost': 1.8903900000000013, 'inference_cost': 0.009186992537313433}
default config of GPT-4 succeeds in 72.1% test cases
tuned config succeeds in 91.5% test cases


The default use of GPT-4 has a much lower accuracy. Note that the default config has a lower inference cost. What if we heuristically increase the number of responses n?

In [15]:
# The following evaluation costs roughly $3 and takes longer than one hour to run.

config_n2 = {"model": 'gpt-4', "prompt": 0, "n": 2}
n2_result = oai.ChatCompletion.eval(config_n2, prune=False, eval_only=True)
print(n2_result)


{'expected_success': 0.7699004975124378, 'success': 0.835820895522388, 'success_vote': 0.7064676616915423, 'voted_answer': "Since $1$ kilowatt is equivalent to $1.36$ horsepower, and Eric's sports car's engine has a power of $500$ metric horsepower, his car's engine can generate $\\frac{500}{1.36} \\approx \\boxed{368}$ kilowatts of power.", 'total_cost': 9.367137999999994, 'cost': 3.267810000000001, 'inference_cost': 0.016039828358208955}


The inference cost is nearly doubled and matches that of the tuned config, but the voted success rate does not improve (70.6% vs. 72.1% for the default config). What if we further increase the number of responses n to 5?

In [16]:
# The following evaluation costs roughly $8 and takes longer than one hour to run.

config_n5 = {"model": 'gpt-4', "prompt": 0, "n": 5}
n5_result = oai.ChatCompletion.eval(config_n5, prune=False, eval_only=True)
print(n5_result)


{'expected_success': 0.9197771144278613, 'success': 0.9552238805970149, 'success_vote': 0.8656716417910447, 'voted_answer': "Since a kilowatt is equivalent to $1.36$ horsepower, then Eric's $500$ horsepower engine can generate $\\frac{500}{1.36}$ kilowatts, or about $\\boxed{368}$ kilowatts.", 'total_cost': 16.885947999999996, 'cost': 7.518809999999999, 'inference_cost': 0.03718908208955224}


We find that the 'success_vote' metric increases, but at the cost of exceeding the inference budget. The tuned configuration has both a higher 'success_vote' (92% vs. 87%) and a lower average inference cost ($0.016 vs. $0.037 per instance).
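
The comparison can be laid out side by side with a short script. The numbers are copied from the evaluation outputs printed in this notebook. The helper `summarize` is hypothetical and only does the arithmetic: it converts each success rate back into a count of solved test problems and checks the cost against the inference budget of 0.02 dollars used for tuning.

```python
N_TEST = 201  # number of test instances
INFERENCE_BUDGET = 0.02  # dollars per instance, as passed to the tuning call

# (success_vote, inference_cost) copied from the evaluation outputs
results = {
    "tuned gpt-3.5-turbo": (0.9154228855721394, 0.015838416417910447),
    "gpt-4, default": (0.7213930348258707, 0.009186992537313433),
    "gpt-4, n=2": (0.7064676616915423, 0.016039828358208955),
    "gpt-4, n=5": (0.8656716417910447, 0.03718908208955224),
}


def summarize(results, n_test=N_TEST, budget=INFERENCE_BUDGET):
    """Return (name, solved_count, within_budget) for each configuration."""
    rows = []
    for name, (success_vote, cost) in results.items():
        rows.append((name, round(success_vote * n_test), cost <= budget))
    return rows


for name, solved, within_budget in summarize(results):
    print(f"{name:20s} solved {solved}/{N_TEST}, within budget: {within_budget}")
```

Only the n=5 configuration exceeds the budget, and it still solves fewer test problems than the tuned configuration.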

A developer can use flaml to tune the configuration to satisfy a target inference budget while getting the most value out of it.