# Fine-Tuning the Model

Another approach is to fine-tune ChatGPT: we further train the model on our specific scenario, supplying the example inputs and outputs as training data instead of through the example selector.

In this notebook we are going to see how that can be accomplished.

In [None]:
%pip install openai
%pip install scipy
%pip install matplotlib
%pip install numpy
%pip install pandas
%pip install langchain
%pip install langsmith
%pip install unstructured
%pip install chromadb
%pip install tiktoken
%pip install ipywidgets

### Setup

Let's start by loading the environment variables.

In [1]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

api_key = os.environ['OPENAI_API_KEY']
base_url = os.environ['OPENAI_BASE_URL']
api_version = os.environ['OPENAI_API_VERSION']

print(base_url)

https://devsquad-eastus-2.openai.azure.com/
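
If one of these variables is missing from the `.env` file, `os.environ[...]` raises a bare `KeyError` for the first one only. A small helper can report every missing name at once. This is just a sketch: `require_env` is a hypothetical helper, not part of the original notebook or of `python-dotenv`.

```python
import os


def require_env(names, environ=os.environ):
    """Return the requested variables, failing early if any is missing or empty.

    `require_env` is a hypothetical helper introduced for illustration.
    """
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Possible usage after load_dotenv():
# settings = require_env(["OPENAI_API_KEY", "OPENAI_BASE_URL", "OPENAI_API_VERSION"])
```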


### Prepare the dataset

The first step in fine-tuning is to prepare a dataset, using the [OpenAI Docs](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset) for reference.

In [3]:
def list_files_in_folders(path):
    return os.listdir(path)

def read_file(file_path):
    with open(file_path, 'r') as file:
        content = file.read()
    return content

basePath = "../go-examples/examples/gin/"
pathList = list_files_in_folders(basePath)
print(pathList)

['basic-conversion-1', 'basic-conversion-2', 'basic-conversion-3', 'basic-conversion-4', 'basic-conversion-5', 's3-conversion-1']
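
The dataset cell further down reads `input/main.go` and `output/main.go` from each of these folders, so a folder missing either file would stop the loop with a `FileNotFoundError`. A quick check before building the dataset could look like the sketch below; `find_incomplete_examples` is a hypothetical helper and the folder layout is assumed from the paths used in this notebook.

```python
import os


def find_incomplete_examples(base_path):
    """Return {example_name: [missing relative paths]} for incomplete folders.

    Hypothetical helper; assumes each example folder should contain
    input/main.go and output/main.go.
    """
    required = [os.path.join("input", "main.go"), os.path.join("output", "main.go")]
    incomplete = {}
    for name in sorted(os.listdir(base_path)):
        folder = os.path.join(base_path, name)
        if not os.path.isdir(folder):
            continue  # ignore stray files next to the example folders
        missing = [rel for rel in required
                   if not os.path.isfile(os.path.join(folder, rel))]
        if missing:
            incomplete[name] = missing
    return incomplete


# Possible usage with the notebook's variable:
# print(find_incomplete_examples(basePath))
```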


In [4]:
system_prompt = '''
You are an AI that responds only in code, NOT ENGLISH.
You will be given a lambda function code. 
Rewrite the code using azure function code without using lambda code.

Use a code block to write your response. For example:

```go
func main() {
        fmt.Println("Hello, World!")
}
```
'''

In [6]:
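dataset = []

for path in pathList:
    example_dir = os.path.join(basePath, path)
    # Wrap both sides in a Go code fence, matching the format the system prompt asks for.
    # Avoid the name `input`, which would shadow the Python builtin.
    user_content = "```go\n" + read_file(os.path.join(example_dir, "input", "main.go")) + "\n```\n"
    assistant_content = "```go\n" + read_file(os.path.join(example_dir, "output", "main.go")) + "\n```\n"
    messages = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": assistant_content},
        ]
    }
    dataset.append(messages)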

Quote from the actual documentation:

> To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples with gpt-3.5-turbo but the right number varies greatly based on the exact use case.
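
The six example folders listed earlier yield six training examples, which is below this minimum of ten. A small pre-flight check can catch that, along with malformed entries, before anything is uploaded. The function below is a sketch under that assumption; `validate_dataset` is a hypothetical helper, and the checks are a subset of what the service itself validates.

```python
def validate_dataset(dataset, min_examples=10):
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical helper: checks the example count and the basic chat format only.
    """
    problems = []
    if len(dataset) < min_examples:
        problems.append(f"only {len(dataset)} examples, at least {min_examples} required")
    for i, example in enumerate(dataset):
        messages = example.get("messages")
        if not isinstance(messages, list) or not messages:
            problems.append(f"example {i}: missing 'messages' list")
            continue
        roles = [m.get("role") for m in messages]
        if "assistant" not in roles:
            problems.append(f"example {i}: no assistant message to learn from")
        if any(not isinstance(m.get("content"), str) for m in messages):
            problems.append(f"example {i}: message content is not a string")
    return problems


# Possible usage with the notebook's `dataset` list:
# for problem in validate_dataset(dataset):
#     print(problem)
```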

Also, it's extremely important to be aware of token limits and token costs, since this is a paid API.

Each training example is limited to the context length of the model (4096 tokens for ChatGPT), so it's worth putting some safeguards in place to avoid issues with large inputs. As a rule of thumb, the OpenAI documentation recommends checking that the total token count of the message contents stays under 4000.
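
That rule of thumb can be turned into a simple filter. The sketch below flags examples whose message contents exceed a limit; `find_oversized_examples` and `rough_count` are hypothetical helpers, and `rough_count` only counts whitespace-separated words, so it is a crude stand-in for a real tokenizer.

```python
def find_oversized_examples(dataset, count_tokens, limit=4000):
    """Return (index, token_count) pairs for examples whose contents exceed `limit`.

    `count_tokens` is any callable mapping a string to a token count.
    """
    oversized = []
    for i, example in enumerate(dataset):
        total = sum(count_tokens(m["content"]) for m in example["messages"])
        if total > limit:
            oversized.append((i, total))
    return oversized


def rough_count(text):
    """Crude stand-in: counts whitespace-separated words, not real tokens."""
    return len(text.split())


# Possible usage with the notebook's `dataset` list:
# print(find_oversized_examples(dataset, rough_count))
```

To count real tokens, pass `lambda text: len(encoding.encode(text))` instead of `rough_count`, where `encoding` is the `tiktoken` encoding created in the token-estimation cell.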

In [9]:
import json

# Writing to a file
data_path = "./dataset.jsonl"
with open(data_path, 'w') as f:
    for entry in dataset:
        f.write(json.dumps(entry) + '\n')

Let's estimate the total number of tokens required

In [None]:
import json
import tiktoken
import numpy as np
from collections import defaultdict

encoding = tiktoken.get_encoding("cl100k_base") # default encoding used by gpt-4, turbo, and text-embedding-ada-002 models

def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    print(f"p5 / p95: {np.quantile(values, 0.05)}, {np.quantile(values, 0.95)}")

files = [data_path]  # the JSONL file written in the previous cell

for file in files:
    print(f"Processing file: {file}")
    with open(file, 'r', encoding='utf-8') as f:
        dataset = [json.loads(line) for line in f]

    total_tokens = []
    assistant_tokens = []

    for ex in dataset:
        messages = ex.get("messages", {})
        total_tokens.append(num_tokens_from_messages(messages))
        assistant_tokens.append(num_assistant_tokens_from_messages(messages))
    
    print_distribution(total_tokens, "total tokens")
    print_distribution(assistant_tokens, "assistant tokens")
    print('*' * 50)

Now, let's upload our newly generated dataset file.

Let's follow this [tutorial](https://learn.microsoft.com/en-us/azure/ai-services/openai/tutorials/fine-tune?tabs=python%2Ccommand-line)

In [10]:
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=api_key,
    azure_endpoint=base_url, 
    api_version=api_version,
    azure_deployment="gpt-4",
)


In [20]:
# Let's see if it works
client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "what is the capital of france?"}],
    temperature=0,
)

ChatCompletion(id='chatcmpl-8i6Xzg9h6gVTIkGFQzESKAN7ScwnP', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content='The capital of France is Paris.', role='assistant', function_call=None, tool_calls=None), content_filter_results={'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}})], created=1705522291, model='gpt-4', object='chat.completion', system_fingerprint='fp_6d044fb900', usage=CompletionUsage(completion_tokens=7, prompt_tokens=14, total_tokens=21), prompt_filter_results=[{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}])

In [22]:
client.files.create(
  file=open(data_path, "rb"),
  purpose="fine-tune"
)

## There's an issue uploading the file: fine-tuning is only available in the
## North Central US region, and the endpoint is in West US 2. Additionally, fine-tuning
## is only available for GPT-3.5, not GPT-4.
## https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#fine-tuning-models
## https://help.openai.com/en/articles/7127982-can-i-fine-tune-on-gpt-4

NotFoundError: Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}
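
Once a resource in a region that supports fine-tuning is available, the remaining steps could be sketched as below: upload the JSONL file, then start a job that references it. `start_fine_tune` is a hypothetical helper. It assumes `client` is an `openai` (v1) client and that `base_model` names a model the resource can actually fine-tune; the service may also need some time to process the upload before it accepts a job.

```python
def start_fine_tune(client, training_path, base_model):
    """Upload a JSONL training file and start a fine-tuning job; return the job id.

    Hypothetical helper built on client.files.create and
    client.fine_tuning.jobs.create from the openai v1 Python SDK.
    """
    with open(training_path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="fine-tune")
    job = client.fine_tuning.jobs.create(
        training_file=uploaded.id,
        model=base_model,  # assumption: must be a model your region supports
    )
    return job.id


# Possible usage (the model name is left for you to fill in):
# job_id = start_fine_tune(client, data_path, base_model="<fine-tunable model>")
```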