# Model Optimization

© Advanced Analytics, Amir Ben Haim, 2024

<font color="yellow">UPDATED</font>

<br>
<br>
<hr class="dotted">
<br>
<br>

## Setup

<br></br>

### <u>Resetting OpenAI API **EVALS & FINE-TUNING**</u>

<p style="background-color:blue; font-size:30px; color:yellow"> It's easier to follow the notebook if you reset (delete) OpenAI API <b>FILES & EVALS & FINE-TUNING</b>
<br>Use at your own discretion</p>

[OpenAI Storage](https://platform.openai.com/storage)
<br>
[OpenAI Evals](https://platform.openai.com/evaluations?tab=evaluations)
<br>
[OpenAI Eval Runs](https://platform.openai.com/evaluations?tab=runs)
<br>
[OpenAI Fine-tuning](https://platform.openai.com/finetune)


<br></br>

### <u>API Keys</u>

To use the OpenAI API, you need to generate an API key.
<br></br>
<u>Follow these steps to generate an API key with OpenAI:</u>
- Go to <a href="https://platform.openai.com/apps">https://platform.openai.com/apps</a> and sign up with your email address or connect your Google account.
- Go to View API Keys on the left side of your Personal Account Settings
- Select Create new secret key
- API access to OpenAI is a paid service
- You have to set up billing
- You don’t need ChatGPT Plus - The API and ChatGPT subscriptions are billed separately
<br></br>
<p style="background-color:Tomato"> Make sure you read the Pricing information before experimenting</p>
<p style="background-color:Tomato">Once you add your API key, make sure to not share it with anyone! The API key should remain private</p>
<p style="background-color:Tomato">Use the <code>.env</code> file for your API key</p>

<br></br>

### <u>pip install</u>

```powershell
pip install openai
pip install python-dotenv
pip install scikit-learn
pip install pandas
pip install matplotlib
pip install seaborn
```

<br></br>

### <u>API Key Setup</u>

Before using the OpenAI client, set your API key:

In [2]:
from openai import OpenAI
from dotenv import load_dotenv
import os
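load_dotenv()  # Loads variables from .env into the process environment
openai_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=openai_key)  # Pass the key explicitly; OpenAI() would also read OPENAI_API_KEY itself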
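
Before making paid API calls, it can help to confirm that the key was loaded from `.env` without printing the secret. The following is a minimal sketch; `mask_key` is a hypothetical helper introduced here for illustration and is not part of the `openai` or `python-dotenv` packages.

```python
import os


def mask_key(key, visible=4):
    """Return a masked form of a secret that is safe to print."""
    if not key:
        return "<missing>"
    if len(key) <= visible:
        return "*" * len(key)
    # Keep only the last few characters so the key can be recognized, not reused.
    return "*" * (len(key) - visible) + key[-visible:]


# Report whether the key is present in the environment, never the key itself.
print("OPENAI_API_KEY:", mask_key(os.getenv("OPENAI_API_KEY")))
```

If this prints `<missing>`, the `.env` file was probably not found or does not define `OPENAI_API_KEY`.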

<br>
<br>
<hr class="dotted">
<br>
<br>

## Evals

<br></br>

### <u>Create an eval for a task</u>

<br></br>

#### Exercise 1 - Describe a task to be done by a model

In the example below we want the model to <u>classify Hotel Reviews</u>
> **Goal**: Classify each customer review as `Positive`, `Negative`, or `Neutral`

In [3]:
instructions = """
You are an expert in categorizing  Hotel Guest Complaints. Given the complaint
below, categorize the complaint into one of "Positive", "Negative", "Neutral". Respond with only one of those words.
"""

complaint = "Had a great time. Everything from the service to the bed was perfect"


completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "developer", "content": instructions},
        {"role": "user", "content": complaint}
    ]
)


print(completion.choices[0].message.content)

Positive


<br></br>

#### Exercise 2 - Create an eval

Let's set up an eval to test this behavior

In [4]:
eval_obj = client.evals.create(

    name="Hotel Guests Complaint Routing",

    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "hotel_review": {"type": "string"},
                "correct_label": {"type": "string"},
            },
            "required": ["hotel_review", "correct_label"],
        },
        "include_sample_schema": True,

    },
    testing_criteria=[
        {
            "type": "string_check",
            "name": "Match output to human label",
            "input": "{{ sample.output_text }}",
            "operation": "eq",
            "reference": "{{ item.correct_label }}",
        }
    ],
)


print(eval_obj)
print(eval_obj.id)

EvalCreateResponse(id='eval_68582dfa15388191a483ca7bbdf10b90', created_at=1750609402, data_source_config=EvalCustomDataSourceConfig(schema_={'type': 'object', 'properties': {'item': {'type': 'object', 'properties': {'hotel_review': {'type': 'string'}, 'correct_label': {'type': 'string'}}, 'required': ['hotel_review', 'correct_label']}, 'sample': {'type': 'object', 'properties': {'model': {'type': 'string'}, 'choices': {'type': 'array', 'items': {'type': 'object', 'properties': {'message': {'type': 'object', 'properties': {'role': {'type': 'string', 'enum': ['assistant']}, 'content': {'type': ['string', 'null']}, 'refusal': {'type': ['boolean', 'null']}, 'tool_calls': {'type': ['array', 'null'], 'items': {'type': 'object', 'properties': {'type': {'type': 'string', 'enum': ['function']}, 'function': {'type': 'object', 'properties': {'name': {'type': 'string'}, 'arguments': {'type': 'string'}}, 'required': ['name', 'arguments']}, 'id': {'type': 'string'}}, 'required': ['type', 'function',
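
The `string_check` criterion above uses the `eq` operation, so a sample should pass only when the model output matches the reference label. The sketch below is a local stand-in, assuming `eq` means exact string equality with no trimming or case folding; `string_check_eq` and `normalize_label` are hypothetical helpers, not part of the OpenAI SDK, and the eval itself does not apply any normalization.

```python
def string_check_eq(output_text, reference):
    """Local stand-in for an exact-match check (assumed behavior of `eq`)."""
    return output_text == reference


def normalize_label(text):
    """Hypothetical clean-up: trim whitespace, drop a trailing period, fix casing."""
    return text.strip().rstrip(".").strip().capitalize()


# Compare a few possible model outputs against the reference label "Positive".
for raw in ["Positive", "positive", " Positive.\n"]:
    print(repr(raw), string_check_eq(raw, "Positive"), repr(normalize_label(raw)))
```

This is why the prompt ends with "Respond with only one of those words": any extra punctuation or different casing in the output would likely count as a failure under an exact match.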

<br></br>

### <u>Test a prompt with your eval</u>

We've created an eval that describes the desired behavior of our application. Now let's test a prompt against a set of test data.

<br></br>

#### Exercise 3 - Uploading test data

- Use the **JSONL** file `hotel_review_sentiment_test.jsonl`

- Upload the test data file to the OpenAI platform so we can reference it later (see [OpenAI Storage](https://platform.openai.com/storage))

In [5]:
file = client.files.create(
    file=open("hotel_review_sentiment_test.jsonl", "rb"),
    purpose="evals"
)


print(file)
print(file.id)

FileObject(id='file-9U3oiMqWpGEwvcWNHJ4aZu', bytes=10450, created_at=1750609497, filename='hotel_review_sentiment_test.jsonl', object='file', purpose='evals', status='processed', expires_at=None, status_details=None)
file-9U3oiMqWpGEwvcWNHJ4aZu
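
The contents of `hotel_review_sentiment_test.jsonl` are not shown in this notebook. The sketch below illustrates what such a file might look like and how it could be checked before upload. It assumes each line is a JSON object with an `item` key holding the fields from the eval's `item_schema`, inferred from the `item.` namespace used in the templates. The sample rows are made up for illustration, and `to_jsonl` and `validate_jsonl` are hypothetical helpers.

```python
import json

REQUIRED_KEYS = ("hotel_review", "correct_label")
ALLOWED_LABELS = {"Positive", "Negative", "Neutral"}

# Made-up rows for illustration only; they are not taken from the real test file.
sample_rows = [
    {"hotel_review": "Spotless room and friendly staff.", "correct_label": "Positive"},
    {"hotel_review": "The air conditioning was broken all week.", "correct_label": "Negative"},
]


def to_jsonl(rows):
    """Serialize rows as one JSON object per line, each wrapped in an "item" key."""
    return "\n".join(json.dumps({"item": row}) for row in rows)


def validate_jsonl(text):
    """Return a list of problems found; an empty list means every line looks valid."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        item = record.get("item") if isinstance(record, dict) else None
        if not isinstance(item, dict):
            problems.append(f"line {number}: missing 'item' object")
            continue
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            problems.append(f"line {number}: missing {missing}")
        elif item["correct_label"] not in ALLOWED_LABELS:
            problems.append(f"line {number}: unexpected label {item['correct_label']!r}")
    return problems


jsonl_text = to_jsonl(sample_rows)
print(jsonl_text)
print(validate_jsonl(jsonl_text))
```

Checking labels locally matters here because the eval compares strings exactly, so a mislabeled or misspelled `correct_label` would make a correct prediction fail.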


<br>

<p style="background-color:blue; font-size:20px; color:yellow"> Very Important!</p>

To evaluate the effectiveness of model fine-tuning, <u>**I deliberately included the made-up word "xxx" in several complaint examples**.</u>


<br></br>

#### Exercise 4 - Creating an eval run

With our test data in place, let's evaluate a prompt and see how it performs against our test criteria

In [6]:
instructions = """
You are an expert in categorizing  Hotel Guest Complaints. Given the complaint
below, categorize the complaint into one of "Positive", "Negative", "Neutral". Respond with only one of those words.
"""


run = client.evals.runs.create(

    eval_obj.id, # YOUR_EVAL_ID

    name="Categorization text run",

    data_source={
        "type": "completions",
        "model": "gpt-4.1",
        "input_messages": {
            "type": "template",
            "template": [
                {"role": "developer", "content": instructions},
                {"role": "user", "content": "{{ item.hotel_review }}"},
            ],
        },
        "source": {"type": "file_id", "id": file.id}, # YOUR_FILE_ID
    },
)



print(run)
print(run.id)

RunCreateResponse(id='evalrun_68582ebf20648191bcc89eda10c732f9', created_at=1750609599, data_source=CreateEvalCompletionsRunDataSource(source=SourceFileID(id='file-9U3oiMqWpGEwvcWNHJ4aZu', type='file_id'), type='completions', input_messages=InputMessagesTemplate(template=[InputMessagesTemplateTemplateMessage(content=ResponseInputText(text='\nYou are an expert in categorizing  Hotel Guest Complaints. Given the complaint\nbelow, categorize the complaint into one of "Positive", "Negative", "Neutral". Respond with only one of those words.\n', type='input_text'), role='developer', type='message'), InputMessagesTemplateTemplateMessage(content=ResponseInputText(text='{{ item.hotel_review }}', type='input_text'), role='user', type='message')], type='template'), model='gpt-4.1', sampling_params=None), error=None, eval_id='eval_68582dfa15388191a483ca7bbdf10b90', metadata={}, model='gpt-4.1', name='Categorization text run', object='eval.run', per_model_usage=None, per_testing_criteria_results=Non

<br></br>

### <u>Analyze the results</u>

The eval run has been created and queued. Let's check its status and look at the results.

<br></br>

#### Exercise 5 - Run has now been queued

Fetch the current status of the eval run via the API

In [7]:
run_retrieve = client.evals.runs.retrieve(
    eval_id=eval_obj.id, # YOUR_EVAL_ID
    run_id=run.id # YOUR_RUN_ID
    )


print(run_retrieve)
print(run_retrieve.status)

RunRetrieveResponse(id='evalrun_68582ebf20648191bcc89eda10c732f9', created_at=1750609599, data_source=CreateEvalCompletionsRunDataSource(source=SourceFileID(id='file-9U3oiMqWpGEwvcWNHJ4aZu', type='file_id'), type='completions', input_messages=InputMessagesTemplate(template=[InputMessagesTemplateTemplateMessage(content=ResponseInputText(text='\nYou are an expert in categorizing  Hotel Guest Complaints. Given the complaint\nbelow, categorize the complaint into one of "Positive", "Negative", "Neutral". Respond with only one of those words.\n', type='input_text'), role='developer', type='message'), InputMessagesTemplateTemplateMessage(content=ResponseInputText(text='{{ item.hotel_review }}', type='input_text'), role='user', type='message')], type='template'), model='gpt-4.1', sampling_params=None), error=None, eval_id='eval_68582dfa15388191a483ca7bbdf10b90', metadata={}, model='gpt-4.1', name='Categorization text run', object='eval.run', per_model_usage=[PerModelUsage(cached_tokens=0, comp
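
A run usually takes some time to finish, so a single status fetch may still return a non-final state. The following is a minimal polling sketch. It assumes `completed`, `failed`, and `canceled` are the terminal statuses, which should be checked against the API reference; `wait_for_run` is a hypothetical helper that takes any zero-argument function returning an object with a `status` attribute.

```python
import time

# Assumed terminal statuses; check the API reference for the authoritative list.
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_run(fetch_run, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Call fetch_run() repeatedly until the run reaches a terminal status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        current = fetch_run()
        if current.status in TERMINAL_STATUSES:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {current.status!r} after {timeout_seconds}s")
        sleep(poll_seconds)


# Possible usage in this notebook (not executed here, since it calls the API):
# run_retrieve = wait_for_run(
#     lambda: client.evals.runs.retrieve(eval_id=eval_obj.id, run_id=run.id)
# )
```

Passing the fetch function in as an argument keeps the loop independent of the client, so it can be tried out with a stub before spending API calls.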

<br></br>

####  Exercise 6 - Report

Print the API response in the following format:

In [12]:
from datetime import datetime

print(run_retrieve.status)
print(run_retrieve.model)
print(run_retrieve.name)

# created_at is a Unix timestamp; convert it to a local, human-readable datetime
timestamp = run_retrieve.created_at
readable = datetime.fromtimestamp(timestamp)
print(timestamp)
print(readable)
print(run_retrieve.result_counts)

completed
gpt-4.1
Categorization text run
1750609599
2025-06-22 19:26:39
ResultCounts(errored=0, failed=8, passed=97, total=105)


<br></br>

#### Exercise 7 - Metrics

Fetch the raw evaluation items and compute the following metrics yourself

In [13]:
passed = run_retrieve.result_counts.passed
total = run_retrieve.result_counts.total

print('Accuracy')
print(len('Accuracy')*'-')
print(f'{(passed / total) :.2%}')

Accuracy
--------
92.38%
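
The accuracy above comes from the pass/fail totals (97 of 105), which do not show which labels the model confuses. Per-label precision and recall need the predicted and reference labels for each item. The sketch below computes them in plain Python from two lists; the labels are made up for illustration, and `classification_report_rows` is a hypothetical helper. In the notebook, the lists would have to be built from the run's individual output items.

```python
from collections import Counter


def classification_report_rows(y_true, y_pred):
    """Return overall accuracy plus precision and recall for each label."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # counts of (reference, prediction) pairs
    accuracy = sum(pairs[(label, label)] for label in labels) / len(y_true)
    per_label = {}
    for label in labels:
        true_positive = pairs[(label, label)]
        predicted = sum(pairs[(other, label)] for other in labels)
        actual = sum(pairs[(label, other)] for other in labels)
        per_label[label] = {
            "precision": true_positive / predicted if predicted else 0.0,
            "recall": true_positive / actual if actual else 0.0,
        }
    return accuracy, per_label


# Made-up labels for illustration; they are not results from the run above.
y_true = ["Positive", "Negative", "Neutral", "Negative"]
y_pred = ["Positive", "Negative", "Negative", "Negative"]

accuracy, per_label = classification_report_rows(y_true, y_pred)
print(f"Accuracy: {accuracy:.2%}")
for label, scores in per_label.items():
    print(f"{label:<9} precision={scores['precision']:.2f} recall={scores['recall']:.2f}")
```

With `scikit-learn` installed, `sklearn.metrics.classification_report(y_true, y_pred)` gives a comparable summary.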
