This notebook compares the similarity of the JSON output produced by Llama 3, Llama 3 (one-shot), GPT 3.5, and my model on an information extraction (IE) task over API endpoint documentation.

# Load test dataset

In [1]:
# Use the same test/eval split that was used during training
from datasets import load_dataset

dataset = load_dataset('billyfin/doc2json')
# Hold out the last row as the one-shot example and remove it from the dataset
one_shot_example = dataset['train'][166]
dataset = dataset.filter(lambda example, idx: idx != 166, with_indices=True)
dataset = dataset["train"].train_test_split(test_size=0.2, seed=42)
test_dataset = dataset['test']

In [3]:
print(test_dataset['json_form'][0])
print(test_dataset['text_content'][0])

{
    "title": "MyIP.com JSON API Documentation",
    "endpoints": [
        {
            "name": "Get IP Information",
            "description": "Retrieves information about the IP address making the request.",
            "method": "GET",
            "url": "https://api.myip.com",
            "headers": [],
            "required_parameters": [],
            "optional_parameters": []
        }
    ]
}
JSON API | MyIP.com JSON API Contact JSON API You can make automated requests to the site using the API . Access URL: https://api.myip.com Response example: {"ip":"66.249.75.9","country":"United States","cc":"US"} Response elements: ip: IP address country: IP country location in English language cc: Two-letter country code in ISO 3166-1 alpha-2 format If there is no location data for an IP address cc will return "XX" and country "Unknown". Is this a free service? Yes. What are the API usage limits? There is no request limit, the only restriction is the server capacity which I will try 

# Preparation

In [3]:
from transformers import AutoModelForCausalLM, AutoModel, AutoTokenizer, default_data_collator, get_linear_schedule_with_warmup, BitsAndBytesConfig
from huggingface_hub import notebook_login
from peft import get_peft_config, get_peft_model, PromptTuningInit, PromptTuningConfig, TaskType, PeftType, PeftModel, PeftConfig
from torch.utils.data import DataLoader
from tqdm import tqdm
import torch
import transformers

torch.manual_seed(42)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

notebook_login()


# Llama 3 outputs

In [12]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"quantization_config": quantization_config},
    device_map="auto",
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [48]:
import json

count = 1
for test_sample in test_dataset['text_content']:
    messages = [
        {"role": "system", "content": "You will be given an API documentation. Extract the endpoints and output in JSON format."},
        {"role": "user", "content": "API text content: " + test_sample + "\n\nJson: "},
    ]
    outputs = pipeline(
        messages,
        max_new_tokens=1024,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.1,
        return_full_text=False,
    )
    
    result = outputs[0]["generated_text"]
    with open("./model_outputs/llama3/" + str(count) + ".txt", 'w') as file:
        file.write(result)
    
    count+=1

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.

# Llama 3 - one shot outputs

In [None]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"quantization_config": quantization_config},
    device_map="auto",
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

In [49]:
import json

count = 1
for test_sample in test_dataset['text_content']:
    messages = [
        {"role": "user", "content": "You will be given an API documentation. Extract the endpoints and output in JSON format.\n\nAPI text content: " + one_shot_example['text_content'] + "\n\nJson: "},
        {"role": "assistant", "content": one_shot_example['json_form']},
        {"role": "user", "content": "You will be given an API documentation. Extract the endpoints and output in JSON format.\n\nAPI text content: " + test_sample + "\n\nJson: "},
    ]
    outputs = pipeline(
        messages,
        max_new_tokens=1024,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.1,
        return_full_text=False,
    )
    
    result = outputs[0]["generated_text"]
    with open("./model_outputs/llama3_one_shot/" + str(count) + ".txt", 'w') as file:
        file.write(result)
    count+=1

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.

# GPT3.5 - one shot outputs

In [50]:
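from getpass import getpass
from openai import OpenAI
# Read the key without echoing it, so it is not saved in the notebook output
OPENAI_API_KEY = getpass('Please type in your api key: ')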

count = 1
client = OpenAI(api_key=OPENAI_API_KEY)
for test_sample in test_dataset['text_content']:
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        # model="gpt-4-turbo",
        messages=[
            {"role": "user", "content": "You will be given an API documentation. Extract the endpoints and output in JSON format.\n\nAPI text content: " + one_shot_example['text_content'] + "\n\nJson: "},
            {"role": "assistant", "content": one_shot_example['json_form']},
            {"role": "user", "content": "You will be given an API documentation. Extract the endpoints and output in JSON format.\n\nAPI text content: " + test_sample + "\n\nJson: "},
        ],
        temperature=0,
    )
    result = str(completion.choices[0].message.content)
    with open("./model_outputs/gpt3.5_one_shot/" + str(count) + ".txt", 'w') as file:
        file.write(result)
    count+=1

Please type in your api key:  [redacted]


# My model outputs

In [4]:
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available")
    current_device = torch.cuda.current_device()
    device_name = torch.cuda.get_device_name(current_device)
    print("Current CUDA Device:", device_name)
else:
    device = torch.device("cpu")
    print("GPU not available, using CPU instead")

GPU is available
Current CUDA Device: NVIDIA L40


In [5]:
peft_model_id = "billyfin/llama_3_prompt_tuning_api2json_v4"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                             quantization_config=quantization_config,
                                             low_cpu_mem_usage=True,
                                            )
model = PeftModel.from_pretrained(model, peft_model_id)


In [6]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')

# set pad_token_id equal to the eos_token_id if not set
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
max_length = 10240

def format(example):
    input_messages = [
        {"role":"user", "content": one_shot_example['text_content']},
        {"role":"assistant", "content": one_shot_example['json_form']},
        {"role":"user", "content": example},
    ]
    example = tokenizer.apply_chat_template(input_messages, tokenize=False) + "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return example
    
def preprocess_for_inference(examples):
    inputs = f"{examples}"
    
    model_inputs = tokenizer(inputs)
    model_inputs['input_ids'] += [tokenizer.pad_token_id]
    model_inputs["attention_mask"] = [1] * len(model_inputs["input_ids"])
    
    sample_input_ids = model_inputs["input_ids"]
    model_inputs["input_ids"] = [tokenizer.pad_token_id] * (
        max_length - len(sample_input_ids)
    ) + sample_input_ids
    model_inputs["attention_mask"] = [0] * (max_length - len(sample_input_ids)) + model_inputs[
        "attention_mask"
    ]
    model_inputs["input_ids"] = torch.tensor(model_inputs["input_ids"][:max_length])
    model_inputs["attention_mask"] = torch.tensor(model_inputs["attention_mask"][:max_length])
    return model_inputs
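
The left-padding done in `preprocess_for_inference` can be illustrated on plain Python lists. This is only a minimal sketch: `left_pad` is a hypothetical helper that is not part of the notebook, and the token ids are made up. It mirrors the steps of the cell above: append one pad token, pad on the left up to `max_length`, and truncate to the first `max_length` positions.

```python
def left_pad(input_ids, pad_id, max_length):
    """Left-pad a list of token ids and build the matching attention mask."""
    # Append one pad token at the end, as the notebook cell does.
    ids = list(input_ids) + [pad_id]
    mask = [1] * len(ids)

    # Fill the left side with pad tokens; those positions get a mask of 0.
    n_pad = max(0, max_length - len(ids))
    ids = [pad_id] * n_pad + ids
    mask = [0] * n_pad + mask

    # Sequences longer than max_length keep only their first max_length tokens.
    return ids[:max_length], mask[:max_length]


ids, mask = left_pad([11, 12, 13], pad_id=0, max_length=6)
print(ids)   # [0, 0, 11, 12, 13, 0]
print(mask)  # [0, 0, 1, 1, 1, 1]
```

Note that the trailing pad token keeps an attention mask of 1, because the mask is built after that token is appended.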

In [8]:
count = 1
for test_sample in test_dataset['text_content']:
    test_sample = format(test_sample)
    test_input = preprocess_for_inference(test_sample)
    inputs = {k: v.unsqueeze(0).to(device) for k, v in test_input.items()}
    prompt = inputs['input_ids'].shape[1]
    
    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"], 
            attention_mask=inputs["attention_mask"],
            max_new_tokens=1024,
            temperature=0.1
        )
    
    result = tokenizer.decode(outputs[0, prompt:], skip_special_tokens=True)
    with open("./model_outputs/my_model/" + str(count) + ".txt", 'w') as file:
        file.write(result)
    count+=1

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (8192). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.

# Evaluation

In [5]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("infgrad/stella_en_400M_v5", trust_remote_code=True).cuda()

def semantic_similarity(generated, truth):
    global model
    docs = [
        generated,
        truth
    ]
    doc_embeddings = model.encode(docs)
    similarities = model.similarity(doc_embeddings, doc_embeddings)
    return similarities[0][1].item()

def structure_similarity(generated, truth):
    keys1 = set(generated.keys())
    keys2 = set(truth.keys())
    intersection_keys = keys1.intersection(keys2)
    union_keys = keys1.union(keys2)
    if len(union_keys) == 0:
        return 0
    iou = len(intersection_keys) / len(union_keys)
    return iou
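
As a quick illustration of what `structure_similarity` measures, the sketch below recomputes the same intersection-over-union on two small dictionaries. The helper name `key_iou` and the sample endpoints are made up for this example. Only the key names are compared, so two endpoints with identical keys score 1.0 even when their values differ.

```python
def key_iou(a, b):
    """Intersection-over-union of the top-level keys of two dicts."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    if not union:
        return 0
    return len(keys_a & keys_b) / len(union)


# Hypothetical ground truth and model output for one endpoint.
truth = {"name": "Get IP", "description": "Returns the caller IP.",
         "method": "GET", "url": "https://api.myip.com"}
generated = {"name": "Get IP", "method": "GET",
             "url": "https://api.myip.com", "path": "/"}

# Shared keys: name, method, url (3). All keys: 5.
print(key_iou(generated, truth))  # 0.6
```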



In [17]:
# Set up the evaluation output format

import pandas as pd

columns = ["API", "Endpoint_Name", "URL", "Semantic_Similarity:Name&Description", "Method", "Required_Param", "Optional_Param", "Notes"]
df = pd.DataFrame(columns=columns)

def insert(api, endpoint_name, url, similarity, method, required_param_score, optional_param_score, notes):
    global df
    # Prepend the new row, then shift and re-sort the index so it stays 0..n-1
    df.loc[-1] = [api, endpoint_name, url, similarity, method, required_param_score, optional_param_score, notes]
    df.index = df.index + 1
    df = df.sort_index()

In [7]:
import json

# Extract only the JSON portion of a model response
def clean_json(text):
    json_str = text.strip()
    # An object is assumed whenever a '{' appears anywhere in the text;
    # an array is only considered when there is no '{' at all.
    start_index = json_str.find('{')
    json_type = 'object' if start_index != -1 else 'array'
    end_index = json_str.rfind('}') if json_type == 'object' else json_str.rfind(']')
    if start_index == -1:
        start_index = json_str.find('[')
        if start_index == -1:
            raise ValueError("No JSON object or array found in the text")
    if end_index == -1:
        raise ValueError("Incomplete JSON structure, no closing bracket found")

    return json_str[start_index:end_index+1]

# Recursively search for a value in a nested JSON object and return the path
def find_value_path(obj, value, path=None):
    if path is None:
        path = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            result = find_value_path(v, value, path + [k])
            if result:
                return result
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            result = find_value_path(item, value, path + [i])
            if result:
                return result
    else:
        if obj == value:
            return path
    return None
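
The two helpers above are meant to be used together: first cut the JSON out of a chatty model reply, then locate the ground-truth URL inside the parsed structure. The sketch below shows that flow on a made-up reply. `extract_object` and `find_path` are simplified stand-ins written for this example, not the notebook's own functions, and `extract_object` handles the object case only.

```python
import json


def extract_object(text):
    """Return the substring from the first '{' to the last '}'."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found")
    return text[start:end + 1]


def find_path(obj, value, path=()):
    """Depth-first search for `value`; returns the list of keys/indices leading to it."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return list(path) if obj == value else None
    for key, child in items:
        found = find_path(child, value, path + (key,))
        if found is not None:
            return found
    return None


# Hypothetical model reply with text around the JSON payload.
reply = ('Here is the JSON:\n'
         '{"endpoints": [{"name": "Get IP", "url": "https://api.myip.com"}]}\n'
         'Let me know if you need anything else!')

parsed = json.loads(extract_object(reply))
path = find_path(parsed["endpoints"], "https://api.myip.com")
print(path)  # [0, 'url']

# The first path element is the index of the endpoint that holds the URL.
endpoint = parsed["endpoints"][path[0]]
print(endpoint["name"])  # Get IP
```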

In [8]:
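def evaluate_params(generated, truth):
    num_params = len(truth)

    # Nothing to extract, so the (empty) ground truth counts as fully matched
    if num_params == 0:
        return 1.0
    rate = 1 / num_params  # each ground-truth parameter contributes equally
    score = 0

    for param in truth:
        truth_name = param['name']
        # Look up the generated parameter with the same name, if any
        matching_param = next((candidate for candidate in generated if candidate['name'] == truth_name), None)

        if matching_param:
            iou = structure_similarity(matching_param, param)
            # Only score parameters whose key sets are identical
            if iou == 1.0:
                # Half of the credit for a matching type ...
                if param['type'] == matching_param['type']:
                    score += rate * 0.5
                # ... and half scaled by the similarity of the descriptions
                score += rate * 0.5 * (semantic_similarity(matching_param['description'], param['description']))

    return score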
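
To make the scoring rule concrete, here is a small worked example under simplifying assumptions. `score_params` follows the same formula as `evaluate_params`, but takes the similarity function as an argument, and `exact_match` is a hypothetical stand-in for the embedding model that returns 1.0 for identical strings and 0.0 otherwise. The parameter lists are invented.

```python
def exact_match(a, b):
    """Stand-in for the embedding similarity: 1.0 if the strings are equal, else 0.0."""
    return 1.0 if a == b else 0.0


def score_params(generated, truth, similarity):
    if not truth:
        return 1.0
    weight = 1 / len(truth)
    score = 0.0
    for expected in truth:
        # .get avoids a KeyError when a generated parameter has no "name" key.
        match = next((g for g in generated if g.get("name") == expected["name"]), None)
        # A key-set IoU of 1.0 is the same as the two key sets being equal.
        if match is None or set(match) != set(expected):
            continue
        if match["type"] == expected["type"]:
            score += weight * 0.5
        score += weight * 0.5 * similarity(match["description"], expected["description"])
    return score


truth = [
    {"name": "q", "type": "string", "description": "Search query"},
    {"name": "limit", "type": "integer", "description": "Max results"},
]
generated = [
    {"name": "q", "type": "string", "description": "Search query"},
    {"name": "limit", "type": "string", "description": "Max results"},  # wrong type
]

# q: 0.25 (type) + 0.25 (description); limit: 0 (type) + 0.25 (description)
print(score_params(generated, truth, exact_match))  # 0.75
```

With the real embedding model the description term would usually fall somewhere between 0 and 1 rather than being exactly 0.0 or 1.0.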

In [19]:
json_truths = test_dataset['json_form']
for i in range(len(json_truths)):  # one output file per test sample
    with open("./model_outputs/llama3_one_shot/" + str(i + 1) + ".txt", 'r', encoding='utf-8') as file:
        content = file.read()
    try:
        json_content = json.loads(clean_json(content))
        truth = json.loads(json_truths[i])
        if 'endpoints' in json_content:
            endpoints = json_content['endpoints']
            truth_endpoints = truth['endpoints']
        else:
            # No top-level 'endpoints' key, so this output cannot be scored
            insert(str(i + 1), None, None, None, None, None, None, "Structure not matched")
            continue
    except Exception as e:
        print("File {} has an error: {}".format(i+1, e))
        continue
    
    for truth_endpoint in truth_endpoints:
        truth_url = truth_endpoint['url']
        path = find_value_path(endpoints, truth_url)
        if path is not None:
            endpoint = endpoints[path[0]]
            for j in range(1, len(path) - 2):  # use j so the file index i is not overwritten
                endpoint = endpoint[path[j]]
        else:
            # url of one endpoint does not match
            insert(str(i + 1), truth_endpoint['name'], False, None, None, None, None, "This endpoint does not match its URL")
            continue
        iou = structure_similarity(endpoint, truth_endpoint)
        if iou == 1.0:
            generated = endpoint['name'] + ": " + endpoint['description']
            ground_truth = truth_endpoint['name'] + ": " + truth_endpoint['description']
            similarity = semantic_similarity(generated, ground_truth)
            method = endpoint['method'] == truth_endpoint['method']
        else:
            # IoU shows that the structure of this endpoint is not the same. Cannot do further actions.
            insert(str(i + 1), truth_endpoint['name'], True, None, None, None, None, "IoU shows that the structure of this endpoint is not the same.")
            continue
        required_param_score = evaluate_params(endpoint['required_parameters'], truth_endpoint['required_parameters'])
        optional_param_score = evaluate_params(endpoint['optional_parameters'], truth_endpoint['optional_parameters'])
        # A complete evaluation
        insert(str(i + 1), truth_endpoint['name'], True, similarity, method, required_param_score, optional_param_score, "")

    print(str(i + 1) + " completed!")

df

1 completed!
File 2 has an error: Expecting ',' delimiter: line 13 column 24 (char 427)
3 completed!
4 completed!
5 completed!
6 completed!
File 7 has an error: Extra data: line 18 column 1 (char 396)



8 completed!
9 completed!
10 completed!
11 completed!
12 completed!
File 13 has an error: Expecting value: line 21 column 20 (char 876)
14 completed!
15 completed!
16 completed!
17 completed!




18 completed!
19 completed!
20 completed!
21 completed!
File 22 has an error: Expecting value: line 42 column 33 (char 1464)
23 completed!
24 completed!
File 25 has an error: No JSON object or array found in the text
26 completed!
27 completed!
File 28 has an error: No JSON object or array found in the text
File 29 has an error: Extra data: line 47 column 1 (char 1706)
File 30 has an error: No JSON object or array found in the text
31 completed!
File 32 has an error: Expecting value: line 10 column 21 (char 224)
33 completed!
File 34 has an error: Expecting ',' delimiter: line 16 column 38 (char 634)



| | API | Endpoint_Name | URL | Semantic_Similarity:Name&Description | Method | Required_Param | Optional_Param | Notes |
|---|---|---|---|---|---|---|---|---|
| 0 | 33 | Convert JSON to JSONP | False | | | | | This endpoint does not match its URL |
| 1 | 31 | Generate Placeholder Text | False | | | | | This endpoint does not match its URL |
| 2 | 27 | Generate Chart | True | 0.749484 | False | 0 | 0.0 | |
| 3 | 26 | List All Asteroids | True | 0.863466 | True | 1.0 | 0.0 | |
| 4 | 24 | Get COVID-19 Cases for a Specific Country | False | | | | | This endpoint does not match its URL |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | 4 | Get Contributors for Recipe | False | | | | | This endpoint does not match its URL |
| 72 | 4 | Get Random Taco | False | | | | | This endpoint does not match its URL |
| 73 | 3 | Get Supported Color Name Lists | True | | | | | IoU shows that the structure of this endpoint ... |
| 74 | 3 | Get Color Names | False | | | | | This endpoint does not match its URL |


In [21]:
df.to_csv("./results/llama3_one_shot_results.csv", index=False)