# Step 1:

In this step we carry out the following:

1. We load the spreadsheet from SharePoint, which includes the list of
   submissions, the human-defined categories for submitters of interest and any
   further annotations, such as whether we are removing submissions from
   analysis. We format this data into a JSON file, saved in
   `./data/step1/list.json`. This JSON file allows for easier manipulation and
   handling.
2. Using the JSON file from step 1, we create a number of JSONL files that are
   in the correct format for processing by OpenAI's batch API. We create
   multiple JSONL files as each has to be less than 100 MB in size.
3. We upload the JSONL files and trigger the batch processing of them. This can
   take up to 24 hours.
4. Once processing is complete, we download the completed responses for each
   request and update the JSON file from step 1 to include the AI-returned
   data. We also export this data to a spreadsheet for review
   (`./data/step1/review1.xlsx`).

After this step, not only do we have preliminary data for all submissions
(answers to the questions below), but we can also check whether the AI has
categorised any of the unlabelled submissions (that we may have missed) into the
categories of interest. Once we are settled on the categories, we can move on to
`Step 2`, which involves asking the AI the specific questions for each category.

This step asks the AI to evaluate each submission based on the following
questions:

- **substantive_submission**: Does the submission provide a substantive response
  to the consultation (True or False)? Substantive submissions are those that
  provide a detailed or well reasoned response to the consultation. Screenshots
  of memes or other pre-existing content are not considered substantive.
  Submissions relying on conspiracy theories are not considered substantive. A
  submission expressing purely personal opinion, without any supporting
  argument, is not considered substantive. Consider the supplementary materials
  and determine if the submission substantively answers the issues, concerns and
  questions raised in the consultation.

- **responder_category**: One aspect of the research is looking at how different
  categories of responders respond to the consultation. Based on the response
  (especially the name of the responder - e.g. if it is a company or group),
  please select the category that best describes the responder. Options are: 1.
  Individual, 2. Political (e.g. politician or political party), 3. Digital
  Platform (e.g. Meta, Microsoft, Google, etc.), 4. Civil Society (e.g. NGO,
  advocacy group, etc.), 5. Academic, 6. News (e.g. a news company such as News
  Corp, ABC or Nine News, or an industry association representing news
  organisations such as Australian Press Council, Commercial Radio Australia or
  FreeTV), 7. Government (e.g. government agencies such as Victorian Electoral
  Commission or Australian Human Rights Commission), 8. Industry (an industry
  body that does not neatly fit within the predefined categories (e.g. while
  Commercial Radio Australia represents news broadcasters and as such fits
  within News, an industry body such as Communications Alliance, which
  represents communications providers such as telcos and broadband companies,
  would fit here)), 9. Other (please specify). Only return the category that
  best describes the responder. E.g. for a submission from the UTS Centre for
  Media Transition, you would return: 'Academic'.

- **support**: Overall, considering the whole of the submission, does the
  submission support, oppose, or have a neutral stance towards the proposed
  laws? Make sure you truly understand the submission's position, taking into
  account the whole document. Look for express statements of support or
  opposition. If no express statements are present, weigh up the arguments
  against and arguments in favour of the Bill and specific aspects of the
  proposed changes. Some confusion may arise where the submission states it is
  supporting another submission – this does not mean the submission is in
  support of the proposed changes. In these circumstances, if you are unsure
  (i.e. the submission is not clear and you do not have access to the submission
  being referred to), respond with 'unsure'. Return only one of the following
  options: 'support', 'oppose', 'neutral', 'unsure'.

- **motivations**: Number the top 3 motivations or concerns underpinning the
  submission's viewpoint. Keep each motivation general and brief (under five
  words). If there are only one or two key motivations in the submission, return
  only these one or two.

- **changes**: Does this submission provide suggestions on what changes need to
  be made to the Bill? If so, please list each suggested change with a very
  brief summary (no more than a line in total), with the most important changes
  first. You do not need to describe each change in detail; just give a one-line
  statement of what the submitter requests to be changed. If there is no comment
  on this aspect, return 'No comment'.

- **regulation**: Does this submission make any comment on the form of
  regulation provided by the Bill? For example, does the submission comment on
  the practicality, feasibility or merits of self-regulation, codes of practice,
  industry standards or direct government regulation (i.e. in the form of direct
  legislation)? This question is only interested in the submission's view on the
  FORM OF REGULATION and NOT whether or not the issue should be regulated. It is
  also not interested in the submission's view as to what the impacts of
  regulation may be. We are also not interested in the submission's view of the
  impacts that regulation, in general, will have. WE ARE ONLY interested in
  specific comments as to the form of regulation (e.g. self-regulation,
  quasi-regulation, co-regulation, industry codes of practice,
  industry-regulator collaboration, industry standards, Direct government
  regulation, etc.). If the submission makes explicit comments as to the form of
  regulation, please provide a brief summary of the comments and the reasoning
  behind their belief. If there is no comment on this aspect, return 'No
  comment'.

- **perceived_societal_impact**: How does the submitter perceive the impact of
  the proposed laws on combating misinformation and disinformation and the
  broader digital ecosystem, including social media platforms, content creators,
  and the general public? Restrict your summary on this point purely to what
  impacts the submission has highlighted relating to combating misinformation
  and disinformation and the broader digital ecosystem, including social media
  platforms, content creators, and the general public. Do not include comments
  on any aspects covered by other questions. If there is no comment on this
  aspect, return 'No comment'.

- **regulator_trust**: Does the submission express trust or scepticism towards
  the ability of the Australian Communications and Media Authority (ACMA) to
  impartially and effectively use the new powers? Are there any suggestions in
  the submission for ensuring accountability and oversight? Please only
  highlight express statements as to the ability of the ACMA to impartially and
  effectively use the new powers or any express suggestions for ensuring
  accountability and oversight of the ACMA exercising its power. If there is no
  comment on this aspect, return 'No comment'.

- **definitions**: What does the submitter think about the definitions of
  'misinformation', 'disinformation' and 'serious harm'? Only consider specific
  comments on the definitions of 'misinformation', 'disinformation' and 'serious
  harm'. If the submitter makes no comment on the definitions of these specific
  terms, return 'No comment'.
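
Because several of the questions above restrict the AI to a fixed set of return values, a returned answer can be sanity-checked before it is merged into the data. The following is a minimal sketch only: the helper name `validate_response` is hypothetical, and it assumes the free-text fields come back as plain strings and that an 'Other' category is returned as text beginning with "other".

```python
# Allowed values taken from the question descriptions above.
ALLOWED_SUPPORT = {"support", "oppose", "neutral", "unsure"}
ALLOWED_CATEGORIES = {
    "individual", "political", "digital platform", "civil society",
    "academic", "news", "government", "industry",
}
# Questions that should fall back to 'No comment' rather than be left empty.
FREE_TEXT_FIELDS = [
    "changes", "regulation", "perceived_societal_impact",
    "regulator_trust", "definitions",
]


def validate_response(res: dict) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []

    if res.get("support") not in ALLOWED_SUPPORT:
        problems.append(f"unexpected support value: {res.get('support')!r}")

    category = str(res.get("responder_category", "")).strip().lower()
    # Assumption: 'Other (please specify)' comes back as text starting with 'other'.
    if category not in ALLOWED_CATEGORIES and not category.startswith("other"):
        problems.append(f"unexpected responder_category: {category!r}")

    for field in FREE_TEXT_FIELDS:
        value = res.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field} is empty (expected text or 'No comment')")

    return problems
```

Any submission with a non-empty problem list could then be flagged for manual review rather than silently accepted.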

Each call to the AI is formatted with the below prompt. The supplementary
materials,
[the online page](https://www.infrastructure.gov.au/have-your-say/new-acma-powers-combat-misinformation-and-disinformation)
and
[fact sheet](https://www.infrastructure.gov.au/department/media/publications/communications-legislation-amendment-combatting-misinformation-and-disinformation-bill-2023-fact)
provided as part of the consultation are inserted during the process (replacing
the placeholders: `|issues|` & `|fact_sheet|`). `submission_eval` is a reference
to the formatted questions above which are provided as a `tool` (a `tool` just
ensures responses are in the correct format). The submission is added, clearly
noted as such, to the base of the prompt (as per function `prompt_formatted` in
step 2.).
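
To make the idea of a `tool` more concrete, here is an abridged sketch of what a `submission_eval` definition could look like in the OpenAI function-calling format. The real definition is read from `function.json`, which is not reproduced in this notebook, so the field types, the `enum` and the shortened descriptions below are assumptions for illustration only.

```python
import json

# Hypothetical, abridged tool definition. The real `function.json` is not
# shown here, so the types and wording below are assumptions.
submission_eval_tool = {
    "type": "function",
    "function": {
        "name": "submission_eval",
        "description": "Evaluate a public submission against the research questions.",
        "parameters": {
            "type": "object",
            "properties": {
                "substantive_submission": {
                    "type": "boolean",
                    "description": "Does the submission provide a substantive response?",
                },
                "support": {
                    "type": "string",
                    "enum": ["support", "oppose", "neutral", "unsure"],
                    "description": "Overall stance towards the proposed laws.",
                },
                "definitions": {
                    "type": "string",
                    "description": "Comments on the definitions, or 'No comment'.",
                },
            },
            "required": ["substantive_submission", "support", "definitions"],
        },
    },
}

# The definition must be JSON-serialisable, as it is embedded in every request.
serialised = json.dumps(submission_eval_tool)
```

Forcing the model to call this function (via `tool_choice`) is what constrains each answer to the named fields.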

---

```text
LLM is a highly paid professor who is world-renowned for their expertise in the regulation of misinformation and disinformation in Australia.

LLM is working on a research project looking into different attitudes to the Australian Government's proposed laws to provide the independent regulator, the Australian Communications and Media Authority (ACMA), with new powers to combat online misinformation and disinformation.

LLM's first task is to carefully read and understand the supplementary material. The supplementary material will be provided below. It is information provided by the Australian Government in its consultation process to gather input from citizens. Each document will be numbered as SUPPLEMENTARY DOCUMENT 1. SUPPLEMENTARY DOCUMENT 2. etc. A small note introducing each supplementary document will be provided.

Once LLM has understood the supplementary material, LLM will then move on to a public submission in response to the proposed new powers. LLM must carefully read this submission. The start and end of the submission will be clearly labelled. Once LLM has read and understood the submission, LLM must answer the questions noted in the function `submission_eval` and return its response in valid JSON format. Each question in the function definition should first be read so that LLM knows not to repeat itself in different questions. The response to each function argument must be carefully considered. Before returning its response, LLM must first formulate its response, reconsider if the response answers the question to the quality expected of a world-renowed expert on the matter and reformulate the response if necessary to meet these quality expectations. Only after following this process should LLM return its response to each function argument.

LLM is working on this task as part of a research group. LLM's role in carrying out this task to the highest standard of academic excellence is vital in achieving the project goals and will ensure LLM and the rest of the project group not only make a large positive societal impact but will significantly advance LLM and each member of the research group's academic careers.

############################## SUPPLEMENTARY MATERIALS ##############################

SUPPLEMENTARY DOCUMENT 1.

Description: This document is the text provided on the public website that the Australian Government set up to provide information to citizens to inform their submissions.

-------------------------------- SUPPLEMENTARY DOCUMENT 1. START --------------------------------

|issues|

-------------------------------- SUPPLEMENTARY DOCUMENT 1. END --------------------------------

SUPPLEMENTARY DOCUMENT 2.

Description: This document is a fact sheet provided by the Australian Government. It briefly explains some of the key elements of the proposed law changes and the issues they want input on.

-------------------------------- SUPPLEMENTARY DOCUMENT 2. START --------------------------------

|fact_sheet|

-------------------------------- SUPPLEMENTARY DOCUMENT 2. END --------------------------------

############################## SUPPLEMENTARY MATERIALS END ##############################
```

---


## 1. Loading spreadsheet

[The spreadsheet](https://studentutsedu.sharepoint.com/:x:/r/sites/CentreforMediaTransition76/_layouts/15/doc2.aspx?sourcedoc=%7B26015E46-DC17-46F8-85CB-8FF7601BB93E%7D&file=List%20of%20all%20submissions.xlsx&action=default&mobileredirect=true&DefaultItemOpen=1&ct=1715733887026&wdOrigin=OFFICECOM-WEB.START.REC&cid=c80abce9-d7d3-419b-8b78-2504ae4ce71a&wdPreviousSessionSrc=HarmonyWeb&wdPreviousSession=a8d0354e-a231-454b-a8a5-18124a5f1983)


In [8]:
import json
import os
from datetime import datetime

import pandas as pd
import pytz

# Load the Excel spreadsheet into a pandas DataFrame
df = pd.read_excel('./data/step1/list.xlsx')

# Convert the DataFrame to a list of dictionaries
data = df.to_dict(orient='records')

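def extract_name_from_filename(filename):
    # Drop the leading ID segment, then everything after the first '.'
    parts = filename.split('-')
    name = ' '.join(parts[1:]).split('.')[0].lower()
    # Collapse any anonymous variant (checked after lowercasing) to one label
    if 'anonymous' in name:
        name = 'anonymous'
    return name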

missing_files = []
manual_files = []

list_of_files_manually_text_extracted = ['e656', '14193', '18110', '19712', '26222', '33824', '34418', '34756', 'e597' ]

def check_file_exists(doc_id, folder_path = './data/files'):
    for file_name in os.listdir(folder_path):
        if file_name.startswith(doc_id) and doc_id not in list_of_files_manually_text_extracted:
            return True
    if doc_id not in list_of_files_manually_text_extracted:
        missing_files.append(doc_id)
        return False
    manual_files.append(doc_id)
    return False

formatted_data = []
# Convert empty cells in 'Group', 'Comments', and 'Removed (Y)' columns to None
for row in data:
    if pd.isnull(row['Group']):
        row['Group'] = None
    if pd.isnull(row['Comments']):
        row['Comments'] = None
    if pd.isnull(row['Removed (Y)']):
        row['Removed (Y)'] = None

    exist = check_file_exists(row['UniqueID'])
    
    formatted_row = {
        'uniqueId': row['UniqueID'],
        'group': row['Group'],
        'submitter': extract_name_from_filename(row['doc']),        
        "metadata": {
            "groupDefinedBy": "human" if row['Group'] else "AI",
            "removed": row['Removed (Y)'],
            "comments": row['Comments'],
            "text_extraction_method": "Marker2" if exist else 'manual'
        }
    }
    formatted_data.append(formatted_row)
    
# Save the data as a JSON file if it doesn't exist
json_file = './data/step1/list.json'

local_timezone = pytz.timezone('Australia/Sydney')  # Adjust to your local timezone
current_time = datetime.now(local_timezone).strftime("%Y-%m-%d %H:%M:%S %Z")

output = {
    "data": formatted_data,
    "metadata": {
        "timestamp": current_time,
        "submissions_missing_files": missing_files,
        "submissions_using_manually_extracted_files": manual_files
    }
}

with open(json_file, 'w') as f:
    json.dump(output, f)

print(f'{len(missing_files)} submissions missing files: {missing_files}')
print(f'{len(manual_files)} submissions using manually extracted files: {manual_files}')

0 submissions missing files: []
9 submissions using manually extracted files: ['14193', '18110', '26222', '34756', 'e656', '34418', '33824', 'e597', '19712']
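
To illustrate how the submitter name is derived from a filename, the sketch below runs a standalone variant of `extract_name_from_filename` on made-up filenames. The `<id>-<name parts>.<ext>` pattern and the example filenames are assumptions, not taken from the real spreadsheet, and this variant lowercases the name before checking for "anonymous".

```python
def extract_name(filename: str) -> str:
    """Standalone variant of the name extraction used when loading the spreadsheet."""
    parts = filename.split("-")
    # Drop the ID segment, rejoin the rest, and cut at the first '.'
    name = " ".join(parts[1:]).split(".")[0].lower()
    return "anonymous" if "anonymous" in name else name


# Hypothetical filenames, for illustration only.
print(extract_name("12345-Jane-Doe.pdf"))    # jane doe
print(extract_name("e100-Anonymous-12.md"))  # anonymous
print(extract_name("99-J.-Smith.pdf"))       # j
```

The last example shows a limitation worth keeping in mind: because the name is cut at the first `.`, a submitter name containing a full stop is truncated.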


## 2. We now have a JSON file of objects with key value pairs in the below form. We now process this to jsonl form for batch processing

```json
{
  "group": "string | null",
  "submitter": "string",
  "uniqueId": "string",
  "metadata": {
    "groupDefinedBy": "human or AI",
    "removed": "string | null",
    "comments": "string | null",
    "text_extraction_method": "Marker2 | manual"
  }
}
```

In this step we ignore any submissions flagged as removed.

The JSONL files are saved to `./data/step1/toProcess`.
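
Splitting requests across several size-capped JSONL files can be sketched as below. This is an illustrative alternative rather than the notebook's own implementation: the helper name `write_jsonl_batches` and its parameters are hypothetical, and it tracks the bytes written itself instead of checking the file size on disk.

```python
import json
import os


def write_jsonl_batches(requests, out_dir, max_bytes):
    """Write one JSON object per line, starting a new file before exceeding max_bytes."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    current = None
    written = 0
    try:
        for req in requests:
            line = json.dumps(req) + "\n"
            size = len(line.encode("utf-8"))
            # Rotate when the next line would overflow a non-empty file.
            if current is None or (written > 0 and written + size > max_bytes):
                if current is not None:
                    current.close()
                path = os.path.join(out_dir, f"jsonl_{len(paths)}.jsonl")
                current = open(path, "w", encoding="utf-8")
                paths.append(path)
                written = 0
            current.write(line)
            written += size
    finally:
        if current is not None:
            current.close()
    return paths
```

Checking the size before writing means no file exceeds the cap unless a single request is itself larger than `max_bytes`.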


In [3]:
# TESTING
import json
from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv('OPENAI_KEY'), max_retries=3)

def parse_JSON(json_str: str) -> dict:        
    try: 
        return json.loads(json_str)
    except Exception as e:              
        messages = [
      {
        'role': 'system',
        'content':
          'Assistant is a large language model designed to fix and return correct JSON objects.',
      },
      {
        'role': 'user',
        'content': f'ORIGINAL ERROR CONTAINING JSON OBJECT:\n\n{json_str}\n\nERROR MESSAGE: {e}',
      },
    ]
        
        tool_choices = [{
      'type': 'function',
      'function': {
        'name': 'fix_object',
        'description':
          'You will be given an incorrectly formed JSON Object and a error message. You must fix the incorrect JSON Object and return the valid JSON object.',
        'parameters': {
          'type': 'object',
          'properties': {
            'fixedJSON': {
              'type': 'string',
              'description': 'The reformated and error free JSON object. Return the JSON object only!',
            },
          },
          'required': ['fixedJSON'],
        },
      },
    }]                
        params = {
          'model': 'gpt-4o-2024-05-13',
          'messages': messages,
          'tools': tool_choices,
          'temperature': 0,
          'max_tokens': 4096,
          'tool_choice':{ 'type': 'function', 'function': { 'name': 'fix_object' } }
          }
        # `create` takes keyword arguments (including `model`), so unpack the dict
        response = client.chat.completions.create(**params)
                
        second_test_json = response.choices[0].message.tool_calls[0].function.arguments 
                  
        to_return = json.loads(second_test_json)
        return json.loads(to_return['fixedJSON'])

prompt_file = 'prompt_no_guidance.txt'
# Define the prompt for each individual request
def prompt_formatted() -> str:    
    # Read the first file and set a string variable
    with open(prompt_file, 'r') as file:
        prompt = file.read()
        
    with open('prompt_issues.md', 'r') as file:
        issues = file.read()

    # with open('prompt_guidance_note.md', 'r') as file:
    #     guidance_note = file.read()

    with open('prompt_fact_sheet.md', 'r') as file:
        fact_sheet = file.read()

    prompt = prompt.replace('|issues|', issues)
    # prompt = prompt.replace('|guidance_note|', guidance_note)
    prompt = prompt.replace('|fact_sheet|', fact_sheet)    

    return prompt

def get_submission(submission_text: str, submission_author: str):
    prompt = ""
    prompt += "\n\n***************************************** SUBMISSION START *****************************************\n\n"

    prompt += f"Submission from: {submission_author}\n\n"
    
    prompt += submission_text

    prompt += "\n\n***************************************** SUBMISSION END *****************************************\n\n"
    return prompt

def get_function():
    with open('function.json', 'r') as f:
        function = json.load(f)
    return function

def get_file_path(doc_id, folder_path = './data/files'):    
    for file_name in os.listdir(folder_path):
        if file_name.startswith(doc_id):
            return os.path.join(folder_path, file_name)

with open('./data/step1/list.json', 'r') as f:
    list = json.load(f)

md_file_location = './data/files'

file_counter = 0
jsonl_file = f"./data/step1/toProcess/jsonl_{file_counter}.jsonl"

skipped_files = []
empty_files = []

os.makedirs('./data/step1/toProcess', exist_ok=True)

responses = []
# This loop takes each submission and adds it to the jsonl file in a format that can be used by the OpenAI API
for i in list["data"][:5]:      
    md_file_path = get_file_path(i.get("uniqueId"))       
    with open(md_file_path, 'r') as file:
        submission = file.read()        
    if len(submission.strip()) == 0:
        continue
    sub_author = i["submitter"]
    prompt = prompt_formatted()
    submission_formatted = get_submission(submission, sub_author)

    function = get_function()
    ldata = {"custom_id": i["uniqueId"], "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-2024-05-13", "messages": [{"role": "system", "content": prompt},{"role": "user", "content": submission_formatted}], "max_tokens": 4096, "temperature": 1e-9, "frequency_penalty": 0, "presence_penalty" :0, "top_p":0, "tools":[function], "tool_choice": { "type": "function", "function": { "name": "submission_eval" } }}}

    response = client.chat.completions.create(
                    model='gpt-4o-2024-05-13',
                    messages=[{"role": "system", "content": prompt},{"role": "user", "content": submission_formatted}],
                    max_tokens=4096,
                    temperature=1e-9,
                    tools=[function],
                    tool_choice={ 'type': 'function', 'function': { 'name': 'submission_eval' } },
                    frequency_penalty=0,
                    presence_penalty=0,
                    top_p=0
                )
    json_res = parse_JSON(response.choices[0].message.tool_calls[0].function.arguments)
    responses.append({i['uniqueId']: json_res, "meta": {"custom_id": i["uniqueId"], "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-2024-05-13", "messages": [{"role": "system", "content": prompt},{"role": "user", "content": submission_formatted}], "max_tokens": 4096, "temperature": 1e-9, "frequency_penalty": 0, "presence_penalty": 0, "top_p": 0, "tools":[function], "tool_choice": { "type": "function", "function": { "name": "submission_eval" } }}}})

with open('./data/test.json', 'w') as f:
    json.dump(responses, f)

In [6]:
import os
import json

prompt_file = 'prompt_no_guidance.txt'
# Define the prompt for each individual request
def prompt_formatted(submission_string: str, submission_author: str) -> str:    
    # Read the first file and set a string variable
    with open(prompt_file, 'r') as file:
        prompt = file.read()
        
    with open('prompt_issues.md', 'r') as file:
        issues = file.read()

    # with open('prompt_guidance_note.md', 'r') as file:
    #     guidance_note = file.read()

    with open('prompt_fact_sheet.md', 'r') as file:
        fact_sheet = file.read()

    prompt = prompt.replace('|issues|', issues)
    # prompt = prompt.replace('|guidance_note|', guidance_note)
    prompt = prompt.replace('|fact_sheet|', fact_sheet)

    prompt += "\n\n***************************************** SUBMISSION START *****************************************\n\n"

    prompt += f"Submission from: {submission_author}\n\n"
    
    prompt += submission_string

    prompt += "\n\n***************************************** SUBMISSION END *****************************************\n\n"

    return prompt

def get_function():
    with open('function.json', 'r') as f:
        function = json.load(f)
    return function

def get_file_path(doc_id, folder_path = './data/files'):    
    for file_name in os.listdir(folder_path):
        if file_name.startswith(doc_id):
            return os.path.join(folder_path, file_name)

with open('./data/step1/list.json', 'r') as f:
    list = json.load(f)

md_file_location = './data/files'

file_counter = 0
jsonl_file = f"./data/step1/toProcess/jsonl_{file_counter}.jsonl"

skipped_files = []
empty_files = []

os.makedirs('./data/step1/toProcess', exist_ok=True)

# This loop takes each submission and adds it to the jsonl file in a format that can be used by the OpenAI API
for i in list["data"]:   
    if i["metadata"]["removed"] == "Y":
        skipped_files.append(i["uniqueId"])
        continue
    try:
        md_file_path = get_file_path(i.get("uniqueId"))       
        with open(md_file_path, 'r') as file:
            submission = file.read()        
        if len(submission.strip()) == 0:
            i["metadata"]["removed"] = "Y"
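            # Avoid nested double quotes in the f-string (a syntax error before Python 3.12)
            # and avoid writing the literal text "None" when there is no existing comment
            existing_comments = i["metadata"]["comments"]
            removal_note = "Removed due to empty file"
            i["metadata"]["comments"] = f"{existing_comments}\n\n{removal_note}" if existing_comments else removal_note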
            empty_files.append(i["uniqueId"])
            continue
        sub_author = i["submitter"]
        prompt = prompt_formatted(submission, sub_author)
        function = get_function()
        ldata = {"custom_id": i["uniqueId"], "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-2024-05-13", "messages": [{"role": "user", "content": prompt}], "max_tokens": 4096, "temperature": 1e-9, "frequency_penalty": 0, "presence_penalty" :0, "top_p":0, "tools":[function], "tool_choice": { "type": "function", "function": { "name": "submission_eval" } }}}

        # Store the same request that is written to the jsonl file
        i["metadata"]["ai_parameters"] = ldata
        
        if os.path.exists(jsonl_file) and os.path.getsize(jsonl_file) >= 0.5 * 1024 * 1024:  # 0.5MB (very small files due to current restrictions on the OpenAI API at a low usage tier)
            file_counter += 1
            jsonl_file = f"./data/step1/toProcess/jsonl_{file_counter}.jsonl"
        
        i["metadata"]["batch"] = f'jsonl_{file_counter}.jsonl'
        
        with open(jsonl_file, 'a') as f:
            json.dump(ldata, f)
            f.write('\n')

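    except Exception as e:
        # Report which submission failed so it can be followed up
        print(f"Failed to prepare submission {i.get('uniqueId')}: {e}")
        continue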

with open('./data/step1/list.json', 'w') as f:
    json.dump(list, f)

print(f"Empty files: {empty_files}")

Empty files: []


## 3. We now have a folder with all the prepared files for OpenAI batch calls

We will upload each of these files to OpenAI and then trigger batch processing
of each.

**MAKE SURE TO RECORD BATCH IDs CREATED IN THIS STEP SO WE KNOW WHICH FILES TO
EVENTUALLY DOWNLOAD**
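
One way to avoid losing the batch IDs is to write them to disk as soon as they are created. The sketch below is an assumption-laden illustration: the helper names and the `./data/step1/batch_ids.json` path are hypothetical and are not used elsewhere in this notebook.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical location for the record; adjust as needed.
DEFAULT_PATH = "./data/step1/batch_ids.json"


def record_batch_ids(batch_ids, path=DEFAULT_PATH):
    """Append a timestamped list of batch IDs to a JSON history file."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    history = json.loads(p.read_text()) if p.exists() else []
    history.append({
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "batch_ids": list(batch_ids),
    })
    p.write_text(json.dumps(history, indent=2))
    return history


def latest_batch_ids(path=DEFAULT_PATH):
    """Return the most recently recorded list of batch IDs."""
    return json.loads(Path(path).read_text())[-1]["batch_ids"]
```

Calling `record_batch_ids(batch_ids)` straight after the batches are created, and `latest_batch_ids()` when setting `desired_batch_ids`, would remove the need to copy IDs by hand.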


In [None]:
from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv('OPENAI_KEY'), max_retries=3)

jsonl_dir = './data/step1/toProcess'

jsonl_files = [f for f in os.listdir(jsonl_dir) if os.path.isfile(os.path.join(jsonl_dir, f)) and f.endswith('.jsonl')]

file_ids = []

for file in jsonl_files:
    # Open in a context manager so the file handle is closed after upload
    with open(os.path.join(jsonl_dir, file), "rb") as fh:
        file_object = client.files.create(file=fh, purpose="batch")
    file_ids.append(file_object.id)

print(file_ids)

In [8]:
# We have now uploaded all the files and have their IDs, lets create a batch job for each
batch_ids = []

for file_id in file_ids:
    job = client.batches.create(
            input_file_id=file_id,
            endpoint="/v1/chat/completions",
            completion_window="24h"
          )
    batch_ids.append(job.id)

print('Record the following and make sure to add to `desired_batch_ids` in the following cells!')
print(batch_ids)

Record the following and make sure to add to `desired_batch_ids` in the following cells!
['batch_chdZHGDMh0lz2CSM4bsQ960a']


#### The batch processes should now be underway; they will take up to 24 hours

We can run the following cell to check on progress.


In [None]:
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv('OPENAI_KEY'), max_retries=3)

batch_jobs = client.batches.list()

desired_batch_ids = ['', '']

for batch in batch_jobs.data:
    if batch.id in desired_batch_ids:
        print(batch.id, batch.status, batch.request_counts)
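
Rather than re-running the status cell by hand, the check can be wrapped in a simple polling loop. This is a sketch under assumptions: `wait_for_batches` is a hypothetical helper, the ten-minute interval is arbitrary, and the retrieval function is passed in so that the loop can be exercised without calling the API.

```python
import time

# Batch statuses after which no further progress is expected.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batches(retrieve, batch_ids, poll_seconds=600, sleep=time.sleep):
    """Poll each batch until it reaches a terminal status; return {batch_id: status}."""
    pending = set(batch_ids)
    final = {}
    while pending:
        for batch_id in sorted(pending):
            status = retrieve(batch_id).status
            if status in TERMINAL_STATUSES:
                final[batch_id] = status
        pending.difference_update(final)
        if pending:
            sleep(poll_seconds)
    return final
```

With the client from the cell above, this could be called as `wait_for_batches(client.batches.retrieve, desired_batch_ids)`.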

#### Once processing is done, we can download the completed files

Files are saved here: `./data/step1/output`


In [None]:
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv('OPENAI_KEY'), max_retries=3)

batch_jobs = client.batches.list()

# we only want to download the batch jobs that were set up in cell 11
desired_batch_ids = ['batch_vdZSGRcPyfdMH8T0UCfssPpw']

os.makedirs('./data/step1/output', exist_ok=True)

for batch in batch_jobs.data:
    if batch.id in desired_batch_ids:
        output_file = batch.output_file_id
        if output_file is None:
            # No output yet (batch still running, or it failed/expired)
            print(f'{batch.id} has no output file (status: {batch.status})')
            continue
        content = client.files.content(output_file)

        jsonl_file_path = f'./data/step1/output/{output_file}.jsonl'
        content.write_to_file(jsonl_file_path)

## 4. Process AI responses and save data

Now that we have all the AI responses, we need to process and save the results.
This will update the JSON file created in step 1 (`./data/step1/list.json`) and
also export the responses as an Excel file for review. The Excel file will be
located at `./data/step1/review1.xlsx`.
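
Each line of a batch output file describes one request, and a request can fail without the whole batch failing. The sketch below shows one defensive way to pull the tool-call arguments out of a single line. The helper name `extract_arguments` is hypothetical, and the `error` and `status_code` fields reflect my understanding of the batch output format, so treat them as assumptions to verify against a real output file.

```python
import json


def extract_arguments(line: str):
    """Return (custom_id, raw arguments string), or (custom_id, None) if unusable."""
    item = json.loads(line)
    custom_id = item.get("custom_id")
    response = item.get("response") or {}

    # Skip requests that errored or returned a non-200 status.
    if item.get("error") or response.get("status_code") != 200:
        return custom_id, None

    message = response["body"]["choices"][0]["message"]
    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        # The model replied without calling the tool.
        return custom_id, None

    return custom_id, tool_calls[0]["function"]["arguments"]
```

The returned string can then be handed to `parse_JSON`, and any `custom_id` paired with `None` can be collected for a retry or manual review.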


In [None]:
from openai import OpenAI
import os
import json
import pandas as pd

client = OpenAI(api_key=os.getenv('OPENAI_KEY'),max_retries=3)

# Parses the JSON from a function call, if there is an error in JSON parsing, recalls the LLM with the fix json function to get a valid json response.
def parse_JSON(json_str: str) -> dict:        
    try: 
        return json.loads(json_str)
    except Exception as e:              
        messages = [
      {
        'role': 'system',
        'content':
          'Assistant is a large language model designed to fix and return correct JSON objects.',
      },
      {
        'role': 'user',
        'content': f'ORIGINAL ERROR CONTAINING JSON OBJECT:\n\n{json_str}\n\nERROR MESSAGE: {e}',
      },
    ]
        
        tool_choices = [{
      'type': 'function',
      'function': {
        'name': 'fix_object',
        'description':
          'You will be given an incorrectly formed JSON Object and a error message. You must fix the incorrect JSON Object and return the valid JSON object.',
        'parameters': {
          'type': 'object',
          'properties': {
            'fixedJSON': {
              'type': 'string',
              'description': 'The reformated and error free JSON object. Return the JSON object only!',
            },
          },
          'required': ['fixedJSON'],
        },
      },
    }]                
        response = client.chat.completions.create(
                    model='gpt-4o-2024-05-13',
                    messages=messages,                    
                    max_tokens=4096,
                    temperature=0,
                    tools=tool_choices,
                    tool_choice={ 'type': 'function', 'function': { 'name': 'fix_object' } },        
                )        
                
        second_test_json = response.choices[0].message.tool_calls[0].function.arguments 
                  
        to_return = json.loads(second_test_json)
        return json.loads(to_return['fixedJSON'])

output_folder = './data/step1/output'

jsonl_files = [f for f in os.listdir(output_folder) if os.path.isfile(os.path.join(output_folder, f)) and f.endswith('.jsonl')]

# Load original JSON list
with open('./data/step1/list.json', 'r') as f:
    list_data = json.load(f)

def get_correct_category(AI_category):
    AI_category = AI_category.lower()    
    if AI_category == 'digital platform':
        return 'platform'
    if AI_category == 'civil society':
        return 'civil'
    return AI_category

# Load the JSONL files
for file in jsonl_files:
    with open(f"{output_folder}/{file}", "r") as f:
        for line in f:
            item = json.loads(line)            
            item_key = item['custom_id']            
            json_res = parse_JSON(item['response']['body']['choices'][0]['message']['tool_calls'][0]['function']['arguments'])
            # grab the matching item in our list (submissions live under the "data" key)
            list_item = next((x for x in list_data['data'] if x['uniqueId'] == item_key), None)
            if list_item:
                json_res['responder_category'] = get_correct_category(json_res['responder_category'])
                if list_item['group'] is None:
                    list_item['group'] = json_res['responder_category']
                    list_item['metadata']['groupDefinedBy'] = 'AI'
                list_item['AI_response_general'] = json_res
                list_item['AI_response_general']['system_fingerprint'] = item['response']['body']['system_fingerprint']

# Save the updated list back to the json file
with open('./data/step1/list.json', 'w') as f:
    json.dump(list_data, f)

# Export the list to an Excel file for review
# Convert JSON to DataFrame
# Normalise the per-submission records (one row each), not the top-level wrapper
df = pd.json_normalize(list_data['data'])

# Save DataFrame to Excel
df.to_excel('./data/step1/review1.xlsx', index=False)