# Enhanced Inference 

`autogen.OpenAIWrapper` provides LLM inference for *openai>=1*.
`autogen.Completion` is a drop-in replacement for `openai.Completion` and `openai.ChatCompletion` that provides enhanced LLM inference with *openai<1*.

## Tune Inference Parameters (for openai<1)

### Choices to optimize

The cost of using foundation models for text generation is typically measured by the number of tokens in the input and output combined. From the perspective of an application builder using foundation models, the goal is to maximize the utility of the generated text under an inference budget constraint. This can be achieved by optimizing the inference hyperparameters, which can significantly affect both the utility and the cost of the generated text.
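
As a rough illustration of how token counts translate into cost, the sketch below prices a single request and averages the cost over several requests to compare it against an inference budget. The price (`PRICE_PER_1K_TOKENS`), the token counts, and the budget are made-up values, and the helper names are hypothetical; real prices depend on the model and often differ for input and output tokens.

```python
from typing import List, Tuple

# Hypothetical price in dollars per 1000 tokens (an assumption, not a real quote).
PRICE_PER_1K_TOKENS = 0.002


def request_cost(prompt_tokens: int, completion_tokens: int,
                 price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Cost of one request: input and output tokens combined, priced per 1000."""
    return (prompt_tokens + completion_tokens) * price_per_1k / 1000


def average_cost(requests: List[Tuple[int, int]]) -> float:
    """Average cost per request over (prompt_tokens, completion_tokens) pairs."""
    return sum(request_cost(p, c) for p, c in requests) / len(requests)


# Made-up token counts for three requests.
requests = [(120, 80), (300, 200), (50, 250)]
inference_budget = 0.001  # dollars per request, illustrative only

avg = average_cost(requests)
print(f"average cost per request: ${avg:.6f}")
print("within budget:", avg <= inference_budget)
```

Changing a hyperparameter such as `max_tokens` or `n` changes the completion token counts, which is how the hyperparameters feed into this cost.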

The tunable hyperparameters include:
1. `model` - required input specifying the model ID to use
2. `prompt`/`messages` - the input prompt or messages, which provide the context for the text generation task
3. `max_tokens` - maximum number of tokens to generate in the output
4. `temperature` - a value (0 to 2 in the OpenAI API) that controls the randomness of the text; higher values give more random and diverse output
5. `top_p` - a value between 0 and 1 that sets the cumulative probability mass of the tokens considered at each generation step
6. `n` - number of responses to generate for a given prompt
7. `stop` - list of strings that, when encountered in the generated text, cause the generation to stop
8. `presence_penalty`, `frequency_penalty` - values that control the relative importance of the presence and frequency of certain words or phrases in the generated text
9. `best_of` - number of responses to generate server-side when selecting the best response for a given prompt

The cost and utility of text generation are intertwined with the joint effect of these hyperparameters. There are also complex interactions among subsets of the hyperparameters.
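
One simple interaction can be sketched as follows, under the assumption that the server generates `max(n, best_of)` candidate completions of up to `max_tokens` tokens each. Raising `n` or `best_of` then multiplies the worst-case number of output tokens, so a larger `max_tokens` becomes proportionally more expensive. The config values and the helper `max_output_tokens` are illustrative only.

```python
def max_output_tokens(config: dict) -> int:
    """Upper bound on generated tokens for one request (a rough sketch)."""
    n = config.get("n", 1)
    # Assumption: best_of candidates are generated server-side, n are returned.
    best_of = config.get("best_of", n)
    return max(n, best_of) * config["max_tokens"]


cheap = {"model": "gpt-3.5-turbo-instruct", "max_tokens": 100, "n": 1}
costly = {"model": "gpt-3.5-turbo-instruct", "max_tokens": 100, "n": 2, "best_of": 5}

print(max_output_tokens(cheap))   # 100
print(max_output_tokens(costly))  # 500
```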

Tuning can be performed with the following information:
1. Validation data:

    Collect a diverse set of instances. They can be stored in an iterable of dicts.

2. Evaluation function:

    The evaluation function should take a list of responses, plus keyword arguments corresponding to the keys in each validation data instance, and return a dict of metrics.

    `autogen.code_utils` and `autogen.math_utils` offer some example evaluation functions for code generation and math problem solving.

3. Metric to optimize:

    The metric to optimize is usually a metric aggregated over all the tuning data instances.

4. Search space:

    i. **Model:** can be a fixed string or multiple choices defined using `flaml.tune.choice`.

    ii. **Prompt/messages:** the prompt is a template for the model's input, given as a single string or a list of strings. Messages are the model's instructions or context, given as a list of dicts or a list of lists.

    iii. **max_tokens, n, best_of:** control how much text the model generates. They can be set to fixed values or searched within a range using `flaml.tune.randint` or `flaml.tune.qrandint`.

    iv. **stop:** specifies where the text generation should stop. It can be a string, a list of strings, or `None`.

    v. **temperature or top_p:** control the randomness of the text. You can choose either `temperature` or `top_p`, but not both at the same time. They can be set as constants or searched within a range using methods like `flaml.tune.uniform` or `flaml.tune.loguniform`. The default range for both is between 0 and 1.

    vi. **presence_penalty, frequency_penalty:** control the penalties applied to repeated phrases. By default, these are not tuned, but you can specify them as constants or search over them using methods like `flaml.tune.uniform`.

5. Budget:

    The inference budget refers to the average inference cost per data instance. The optimization budget refers to the total budget allowed in the tuning process. Both are measured in dollars and follow the price per 1000 tokens.
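
The first three inputs can be sketched without calling any model. In the example below, `tune_data` is an iterable of dicts, `eval_func` takes a list of responses plus the keys of one data instance as keyword arguments and returns a dict of metrics, and the metric `success` is averaged over all instances. The data, the key names (`problem`, `answer`), the exact-match scoring rule, and the hard-coded responses are all assumptions made for illustration.

```python
from statistics import mean

# Validation data: an iterable of dicts (contents are made up).
tune_data = [
    {"problem": "2+2=", "answer": "4"},
    {"problem": "3*3=", "answer": "9"},
    {"problem": "10-7=", "answer": "3"},
]


def eval_func(responses, **data):
    """Score the responses for one instance and return a dict of metrics.

    `data` holds the keys of the validation instance (here: problem, answer).
    An instance counts as a success if any response matches the answer.
    """
    success = any(r.strip() == data["answer"] for r in responses)
    return {"success": float(success)}


# Stand-in for model output: one list of responses per instance.
fake_responses = [["4"], ["6", "9"], ["4"]]

per_instance = [eval_func(r, **d) for r, d in zip(fake_responses, tune_data)]
# Metric to optimize: the mean of "success" over all instances.
aggregated_success = mean(m["success"] for m in per_instance)
print(per_instance)
print(f"success rate: {aggregated_success:.2f}")
```

Because the evaluation function receives every response for an instance, it can reward a config that generates several candidates (`n > 1`) when any one of them is correct.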




### Performance tuning

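```python
import autogen

# `tune_data` (the validation data) and `eval_func` (the evaluation function)
# are assumed to be defined beforehand, as described above.
config, analysis = autogen.Completion.tune(
    data=tune_data,
    metric="success",  # metric to optimize
    mode="max",  # maximize the metric
    eval_func=eval_func,
    inference_budget=0.05,  # average inference cost per data instance, in dollars
    optimization_budget=3,  # total budget for the tuning process, in dollars
    num_samples=-1,  # -1: no limit on sampled configs; the budget stops the search
)
```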
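
To make the roles of `inference_budget` and `optimization_budget` concrete, here is a toy budget-constrained random search. It is not the algorithm `autogen.Completion.tune` uses; `evaluate` is a stand-in that fabricates a success rate and a cost from the config rather than calling a model, and all numbers are made up. The sketch only shows how a per-instance budget filters candidate configs while a total budget stops the search.

```python
import random


def evaluate(config):
    """Stand-in evaluation: returns (success, avg_cost_per_instance, tuning_cost).

    Made-up model: more samples (n) and more tokens raise both success and cost.
    """
    tokens = config["n"] * config["max_tokens"]
    success = min(1.0, 0.3 + tokens / 2000)
    avg_cost = tokens / 10_000      # dollars per data instance
    tuning_cost = avg_cost * 10     # pretend 10 instances were evaluated
    return success, avg_cost, tuning_cost


def toy_tune(inference_budget, optimization_budget, seed=0):
    rng = random.Random(seed)
    spent, best = 0.0, None
    while True:
        config = {"n": rng.randint(1, 5), "max_tokens": rng.choice([50, 100, 200])}
        success, avg_cost, tuning_cost = evaluate(config)
        if spent + tuning_cost > optimization_budget:
            break  # total tuning budget would be exceeded
        spent += tuning_cost
        # Only configs within the per-instance budget are eligible.
        if avg_cost <= inference_budget and (best is None or success > best[1]):
            best = (config, success, avg_cost)
    return best, spent


best, spent = toy_tune(inference_budget=0.05, optimization_budget=3)
print("best (config, success, avg_cost):", best)
print("tuning budget spent:", round(spent, 2))
```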

### API unification

`autogen.OpenAIWrapper.create()` can be used to create completions for both chat and non-chat models, and for both the OpenAI API and the Azure OpenAI API.

For local LLMs, one can spin up an endpoint using a package like FastChat and then use the same API to send a request. For custom model clients, one can register the client with `autogen.OpenAIWrapper.register_model_client` and then use the same API to send a request.


```python
from autogen import OpenAIWrapper

# OpenAI endpoint
client = OpenAIWrapper()
# ChatCompletion
response = client.create(messages=[{"role": "user", "content": "2+2="}], model="gpt-3.5-turbo")
# extract the response text
print(client.extract_text_or_completion_object(response))
# get cost of this completion
print(response.cost)

# Azure OpenAI endpoint (replace each `...` with your own value)
client = OpenAIWrapper(api_key=..., base_url=..., api_version=..., api_type="azure")
# Completion
response = client.create(prompt="2+2=", model="gpt-3.5-turbo-instruct")
# extract the response text
print(client.extract_text_or_completion_object(response))
```
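
For the custom-client path, the sketch below shows the general shape of a client class. It assumes the client protocol used by pyautogen 0.2 (methods `create`, `message_retrieval`, `cost`, and `get_usage`, with registration through a `model_client_cls` entry in the config); check the version you have installed, since this interface may differ. `EchoClient` and its echo behaviour are hypothetical, and the registration lines are left as comments so that the block runs without autogen or network access.

```python
from types import SimpleNamespace


class EchoClient:
    """Hypothetical custom client that echoes the last user message."""

    def __init__(self, config, **kwargs):
        self.model = config.get("model", "echo")

    def create(self, params):
        # Build a response object exposing choices[i].message.content and
        # model, which is the shape assumed in the lead-in above.
        text = params["messages"][-1]["content"]
        message = SimpleNamespace(content=f"echo: {text}", function_call=None)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)],
                               model=self.model)

    def message_retrieval(self, response):
        return [choice.message.content for choice in response.choices]

    def cost(self, response):
        return 0.0  # this sketch attaches no price to a local model

    @staticmethod
    def get_usage(response):
        return {}  # no usage tracking in this sketch


# Assumed registration with autogen (not executed here):
# client = OpenAIWrapper(config_list=[{"model": "echo", "model_client_cls": "EchoClient"}])
# client.register_model_client(model_client_cls=EchoClient)
# response = client.create(messages=[{"role": "user", "content": "2+2="}])

local = EchoClient({"model": "echo"})
reply = local.create({"messages": [{"role": "user", "content": "2+2="}]})
print(local.message_retrieval(reply))  # ['echo: 2+2=']
```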


## Usage Summary

`OpenAIWrapper` from autogen tracks the token counts and costs of your API calls.
Use the `create()` method to initiate requests and `print_usage_summary()` to retrieve a detailed usage report, including the total cost and token usage for both cached and actual requests.

`mode=["actual", "total"]` (default): print the usage summary for both non-cached completions and all completions.

`mode="actual"`: print only non-cached usage.

`mode="total"`: print all usage (including cached).
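
The difference between the two modes can be pictured with a small bookkeeping sketch. This is not how `OpenAIWrapper` is implemented; `UsageTracker` and its fields are hypothetical, and the costs are made up. It only illustrates that "total" counts every completion while "actual" leaves out the ones served from the cache.

```python
class UsageTracker:
    """Toy tracker that separates cached from non-cached usage."""

    def __init__(self):
        self.records = []  # tuples of (cost, tokens, served_from_cache)

    def add(self, cost, tokens, cached=False):
        self.records.append((cost, tokens, cached))

    def summary(self, mode):
        # "total" includes cached completions; "actual" excludes them.
        rows = [r for r in self.records if mode == "total" or not r[2]]
        return {"cost": sum(r[0] for r in rows), "tokens": sum(r[1] for r in rows)}


tracker = UsageTracker()
tracker.add(cost=0.002, tokens=100)               # real API call
tracker.add(cost=0.002, tokens=100, cached=True)  # same request, answered from cache
tracker.add(cost=0.004, tokens=200)               # another real API call

for mode in ("actual", "total"):
    print(mode, tracker.summary(mode))
```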