# Step 1:

In this step we carry out the following:

1. We load the spreadsheet from SharePoint, which includes the list of
   submissions, the human-defined categories for submitters of interest and any
   further annotations, such as whether we are removing submissions from
   analysis. We format this data into a JSON file, saved in
   `./data/step1/list.json`. This JSON file allows for easier manipulation and
   handling.
2. Using the JSON file created in item 1, we create a number of jsonl files
   that are in the correct format for processing by OpenAI's batch API. We
   create multiple jsonl files as each has to be less than 100 MB in size.
3. We upload the jsonl files and trigger the batch processing of them. This can
   take up to 24 hours.
4. Once processing is complete, we download the completed responses for each
   request and update the JSON file created in item 1 to include the
   AI-returned data. We also export this data to a spreadsheet for review
   (`./data/step1/review1.xlsx`).

With this, we can check whether the AI has categorised any of the unlabeled
submissions into categories of interest that we may have missed. Once we are
settled on the categories, we can move on to `Step 2`, which involves asking the
AI the specific questions for each category.

This step asks the AI to evaluate each submission by responding to the following
questions:

- **substantive_submission**: "Does the submission provide a substantive
  response to the consultation (True or False)? Substantive submissions are
  those that provide a detailed or well reasoned response to the consultation.
  Screenshots of memes or other pre-existing content are not considered
  substantive. Submissions relying on conspiracy theories are not considered
  substantive. A submission expressing purely personal opinion, without any
  supporting argument, is not considered substantive. Consider the supplementary
  materials and determine if the submission substantively answers the issues,
  concerns and questions raised in the consultation."
- **responder_category**: "One aspect of the research is looking at how
  different categories of responders respond to the consultation. Based on the
  response (especially the name of the responder - e.g. if it is a company or
  group), please select the category that best describes the responder. Options
  are: 1. Individual 2. Political (e.g. politician or political party) 3.
  Digital Platform (e.g. Meta, Microsoft, Google, etc.) 4. Civil Society (e.g.
  NGO, advocacy group, etc.) 5. Academic 6. News (e.g. a news company such as
  News Corp, ABC or Nine News, or an industry association representing news
  organisations such as Australian Press Council or Commercial Radio Australia
  or FreeTV) 7. Government (e.g. government agencies such as Victorian Electoral
  Commission or Australian Human Rights Commission) 8. Industry: An industry
  body that does not neatly fit within the predefined categories (e.g. while
  Commercial Radio Australia represents news broadcasters and as such fits
  within News, an industry body such as Communications Alliance, which
  represents communications providers such as telcos and broadband companies,
  would fit here) 9. Other (please specify). Only return the category that best
  describes the responder. E.g. for a submission from the UTS Centre for Media
  Transition, you would return: 'Academic'."
- **support**: "Overall, considering the whole of the submission, does the
  submission support, oppose, or have a neutral stance towards the proposed
  laws? Make sure you truly understand the submission's position taking into
  account the whole document. Look for express statements expressing support or
  opposition. If no express statements are present, weigh up the arguments
  against and arguments in favour of the Bill and specific aspects of the
  proposed changes. Some confusion may arise where the submission states it is
  supporting another submission – this does not mean the submission is in
  support of the proposed changes. In these circumstances, if you are unsure
  (i.e. the submission is not clear and you do not have access to the submission
  being referred to), respond with 'unsure'. (return only one of the following
  options: 'support', 'oppose', 'neutral', 'unsure')."
- **motivations**: "Number the top 3 motivations or concerns underpinning the
  submission's viewpoint. Keep each motivation general, short and brief (under
  5 words). If only one or two key motivations, return only these one or two."
- **changes**: "Does this submission provide suggestions on what changes need to
  be made in the Bill? If so, please list each suggested change, with the most
  important changes first. You do not need to describe each change in detail,
  just give a one-line statement of what the submitter requests to be changed.
  If no comment on this aspect, return 'No comment'."
- **regulation**: "Does this submission make any comment on the form of
  regulation provided by the Bill? For example, does the submission comment on
  the practicality, feasibility or merits of self-regulation, codes of practice,
  industry standards or legislation? If co-regulation or self-regulation is
  mentioned, highlight this. This question is only interested in the
  submission's view on the FORM OF REGULATION and NOT whether or not the issue
  should be regulated or what the impacts of regulation may be. Do not summarise
  the impacts the submission may believe regulation in general will have, WE ARE
  ONLY interested in specific comments as to the form of regulation (e.g.
  self-regulation, quasi-regulation, co-regulation, industry codes of practice,
  industry-regulator collaboration, industry standards, direct government
  regulation, etc.). If the submission makes explicit comments as to the form of
  regulation, please provide a brief summary of the comment, and the reasoning
  behind their belief. If no comment on this aspect, return 'No comment'."
- **perceived_societal_impact**: "How does the submitter perceive the impact of
  the proposed laws on combating misinformation and disinformation and the
  broader digital ecosystem, including social media platforms, content creators,
  and the general public? Restrict your summary on this point purely to what
  impacts the submission has highlighted relating to combating misinformation
  and disinformation and the broader digital ecosystem, including social media
  platforms, content creators, and the general public. Do not include comments
  on feasibility or any other aspects covered by other questions. If no comment
  on this aspect, return 'No comment'."
- **regulator_trust**: "Does the submission express trust or skepticism towards
  the Australian Communications and Media Authority (ACMA)'s ability to
  impartially and effectively use the new powers? Are there any suggestions for
  ensuring accountability and oversight? Please only highlight express
  statements as to the ability of the ACMA to impartially and effectively use
  the new powers, or any suggestions for ensuring accountability and oversight
  of ACMA exercising its power. If no comment on this aspect, return 'No
  comment'."
- **definitions**: "What does the submitter think about the definitions of
  misinformation, disinformation and serious harm? Only consider comments on the
  definitions of 'misinformation', 'disinformation' and 'serious harm'. If the
  submitter makes no comment on the definitions of these specific terms, return
  'No comment'."
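
The questions above are passed to the model as the parameters of a function
(tool) named `submission_eval`, which is loaded from `function.json` later in
this notebook. That file is not reproduced here, so the following is only a
rough sketch of what such a definition might look like. The field types, the
`enum` on `support`, the function description and the helper `build_tool` are
all assumptions for illustration; only the function name and the field names
come from this notebook.

```python
import json

# Hypothetical field types; the real definition lives in function.json.
FIELD_TYPES = {
    "substantive_submission": "boolean",
    "responder_category": "string",
    "support": "string",
    "motivations": "string",
    "changes": "string",
    "regulation": "string",
    "perceived_societal_impact": "string",
    "regulator_trust": "string",
    "definitions": "string",
}

def build_tool(descriptions: dict) -> dict:
    """Build a chat-completions tool definition from {field name: question text}."""
    properties = {}
    for name, json_type in FIELD_TYPES.items():
        properties[name] = {
            "type": json_type,
            "description": descriptions.get(name, ""),
        }
    # 'support' only allows a fixed set of answers, so it is expressed as an enum
    properties["support"]["enum"] = ["support", "oppose", "neutral", "unsure"]
    return {
        "type": "function",
        "function": {
            "name": "submission_eval",
            "description": "Evaluate a single consultation submission.",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(FIELD_TYPES),
            },
        },
    }

tool = build_tool({"support": "Does the submission support, oppose, or stay neutral?"})
print(json.dumps(tool["function"]["parameters"]["required"], indent=2))
```

Marking every field as required is one way to make sure each response carries
an answer (or 'No comment') for every question.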


## 1. Loading spreadsheet

[The spreadsheet](https://studentutsedu.sharepoint.com/:x:/r/sites/CentreforMediaTransition76/_layouts/15/doc2.aspx?sourcedoc=%7B26015E46-DC17-46F8-85CB-8FF7601BB93E%7D&file=List%20of%20all%20submissions.xlsx&action=default&mobileredirect=true&DefaultItemOpen=1&ct=1715733887026&wdOrigin=OFFICECOM-WEB.START.REC&cid=c80abce9-d7d3-419b-8b78-2504ae4ce71a&wdPreviousSessionSrc=HarmonyWeb&wdPreviousSession=a8d0354e-a231-454b-a8a5-18124a5f1983)


In [None]:
import json
import pandas as pd

# Load the Excel spreadsheet into a pandas DataFrame
df = pd.read_excel('./data/step1/list.xlsx')

# Convert the DataFrame to a list of dictionaries
data = df.to_dict(orient='records')

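def extract_name_from_filename(filename):
    """Derive a lowercase submitter name from a file name of the form '<id>-<name parts>'."""
    parts = filename.split('-')
    # Drop the leading ID, join the remaining parts and strip any extension
    name = ' '.join(parts[1:]).split('.')[0].lower()
    # Collapse every anonymous submission to a single label (case-insensitive)
    if 'anonymous' in name:
        name = 'anonymous'
    return name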

formatted_data = []
# Convert empty cells in 'Group', 'Comments', and 'Removed (Y)' columns to None
for row in data:
    if pd.isnull(row['Group']):
        row['Group'] = None
    if pd.isnull(row['Comments']):
        row['Comments'] = None
    if pd.isnull(row['Removed (Y)']):
        row['Removed (Y)'] = None

    file_name = row['doc'].replace('acma2023-', '').replace('.pdf', '')
    
    formatted_row = {
        'uniqueId': row['UniqueID'],
        'group': row['Group'],
        'submitter': extract_name_from_filename(file_name),
        'doc': file_name,
        "metadata": {
            "groupDefinedBy": "human" if row['Group'] else "AI",
            "removed": row['Removed (Y)'],
            "comments": row['Comments']            
        }
    }
    formatted_data.append(formatted_row)

# Save the data as a JSON file (this overwrites any existing file)
json_file = './data/step1/list.json'
with open(json_file, 'w') as f:
    json.dump(formatted_data, f)
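
As a quick illustration of how the submitter name is derived from the `doc`
column, the sketch below follows the same steps as the cell above on two
made-up file names. The file names are hypothetical and the helper
`extract_name` is a stand-alone copy written for this example; it checks for
'anonymous' case-insensitively, which is an assumption about the intended
behaviour.

```python
def extract_name(doc: str) -> str:
    """Strip the prefix and extension, drop the leading ID and join the rest."""
    file_name = doc.replace('acma2023-', '').replace('.pdf', '')
    name = ' '.join(file_name.split('-')[1:]).split('.')[0].lower()
    # Any name containing 'anonymous' is collapsed to a single label
    if 'anonymous' in name:
        return 'anonymous'
    return name

# Hypothetical file names, for illustration only
first = extract_name('acma2023-1234-jane-citizen.pdf')
second = extract_name('acma2023-5678-anonymous-12.pdf')
print(first)   # jane citizen
print(second)  # anonymous
```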

## 2. Creating jsonl files for batch processing

We now have a JSON file of objects with key-value pairs in the form below. We
now process this into jsonl form for batch processing.

```json
{
  "group": "string | null",
  "submitter": "string",
  "doc": "string",
  "uniqueId": "string",
  "metadata": {
    "groupDefinedBy": "human or AI",
    "removed": "string | null",
    "comments": "string | null"
  }
}
```

In this step we ignore any submissions flagged as removed.

The jsonl files are saved to `./data/step1/toProcess`.
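
To make the filtering rule concrete, here is a minimal sketch that keeps only
the records not flagged as removed. It assumes, as in the cell below, that a
removed submission carries the value `"Y"` in `metadata.removed`. The helper
name `active_submissions` and the sample records are made up for illustration.

```python
def active_submissions(records: list) -> list:
    """Return only the records that are not flagged as removed ('Y')."""
    return [r for r in records if r["metadata"]["removed"] != "Y"]

# Illustrative records in the form shown above (values are made up)
records = [
    {"uniqueId": "a1", "group": None, "submitter": "anonymous", "doc": "1-anonymous",
     "metadata": {"groupDefinedBy": "AI", "removed": None, "comments": None}},
    {"uniqueId": "a2", "group": "News", "submitter": "example news", "doc": "2-example-news",
     "metadata": {"groupDefinedBy": "human", "removed": "Y", "comments": "duplicate"}},
]

kept = active_submissions(records)
print([r["uniqueId"] for r in kept])  # ['a1']
```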


In [None]:
import os
import json

# Define the prompt for each individual request
def prompt_formatted(submission_string: str, submission_author: str) -> str:    
    # Read the prompt template and the supplementary materials that are inserted into it
    with open('prompt.txt', 'r') as file:
        prompt = file.read()
        
    with open('prompt_issues.md', 'r') as file:
        issues = file.read()

    with open('prompt_guidance_note.md', 'r') as file:
        guidance_note = file.read()

    with open('prompt_fact_sheet.md', 'r') as file:
        fact_sheet = file.read()

    prompt = prompt.replace('|issues|', issues)
    # prompt = prompt.replace('|guidance_note|', guidance_note)
    prompt = prompt.replace('|guidance_note|', '')
    prompt = prompt.replace('|fact_sheet|', fact_sheet)

    prompt += "\n\n***************************************** SUBMISSION START *****************************************\n\n"

    prompt += f"Submission from: {submission_author}\n\n"
    
    prompt += submission_string

    prompt += "\n\n***************************************** SUBMISSION END *****************************************\n\n"

    return prompt    

def get_function():
    with open('function.json', 'r') as f:
        function = json.load(f)
    return function

with open('./data/step1/list.json', 'r') as f:
    submissions = json.load(f)  # renamed from `list` to avoid shadowing the builtin

md_file_location = './data/files/md_files'
output_dir = './data/step1/toProcess'
os.makedirs(output_dir, exist_ok=True)

# Roll over to a new file at 85 MB to stay safely under the batch API size limit
max_file_size = 85 * 1024 * 1024

file_counter = 0
jsonl_file = f"{output_dir}/jsonl_{file_counter}.jsonl"

# The function (tool) definition is identical for every request, so load it once
function = get_function()

counter = 0
# This loop takes each submission and adds it to the jsonl file in a format that can be used by the OpenAI API
for item in submissions:
    # Stop once 200 requests have been written in this run
    if counter >= 200:
        break
    # Skip submissions flagged as removed
    if item["metadata"]["removed"] == "Y":
        continue
    try:
        # Single quotes inside the f-string keep this valid on Python versions before 3.12
        md_file_path = f"{md_file_location}/{item['doc']}/{item['doc']}.md"
        with open(md_file_path, 'r') as file:
            submission = file.read()
        prompt = prompt_formatted(submission, item["submitter"])
        ldata = {
            "custom_id": item["uniqueId"],
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-2024-05-13",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 4096,
                "temperature": 0,
                "tools": [function],
                "tool_choice": {"type": "function", "function": {"name": "submission_eval"}},
            },
        }

        # Start a new jsonl file once the current one reaches the size threshold
        if os.path.exists(jsonl_file) and os.path.getsize(jsonl_file) >= max_file_size:
            file_counter += 1
            jsonl_file = f"{output_dir}/jsonl_{file_counter}.jsonl"

        with open(jsonl_file, 'a') as f:
            json.dump(ldata, f)
            f.write('\n')
        counter += 1

    except Exception as e:
        print(e)
        continue
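
The cell above checks the size of the file on disk before each write, so a
file can end up slightly over the threshold by one request. An alternative,
sketched below, is to measure each serialised line in memory and start a new
chunk before the limit would be exceeded. This is not the approach used in
this notebook; `split_by_size` and the tiny 200-byte limit are hypothetical and
chosen only to keep the example small.

```python
import json

def split_by_size(requests: list, max_bytes: int) -> list:
    """Group requests into chunks of jsonl lines whose total size stays within max_bytes."""
    chunks, current, current_size = [], [], 0
    for request in requests:
        line = json.dumps(request) + "\n"
        size = len(line.encode("utf-8"))
        # Start a new chunk if adding this line would exceed the limit
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

# Made-up requests, each a little over 80 bytes once serialised
requests = [{"custom_id": str(n), "body": "x" * 50} for n in range(5)]
chunks = split_by_size(requests, max_bytes=200)
print([len(chunk) for chunk in chunks])
```

Each chunk can then be written to its own jsonl file. A single request larger
than the limit would still be placed in a chunk of its own.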

## 3. Upload files and trigger batch processing

We now have a folder with all the prepared files for OpenAI batch calls. We
will upload each of these files to OpenAI and then trigger batch processing
of each.

**MAKE SURE TO RECORD BATCH IDs CREATED IN THIS STEP SO WE KNOW WHICH FILES TO
EVENTUALLY DOWNLOAD**


In [None]:
from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv('OPENAI_KEY'), max_retries=3)

jsonl_dir = './data/step1/toProcess'

jsonl_files = [f for f in os.listdir(jsonl_dir) if os.path.isfile(os.path.join(jsonl_dir, f)) and f.endswith('.jsonl')]

file_ids = []

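for file in jsonl_files:
    # Use a context manager so the file handle is closed once the upload finishes
    with open(os.path.join(jsonl_dir, file), "rb") as fh:
        file_object = client.files.create(
            file=fh,
            purpose="batch"
        )
    file_ids.append(file_object.id)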

# We have now uploaded all the files and have their IDs, lets create a batch job for each
batch_ids = []

for file_id in file_ids:
    job = client.batches.create(
        input_file_id=file_id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )
    batch_ids.append(job.id)

print('Record the following and make sure to add to `desired_batch_ids` in the following cells!')
print(batch_ids)
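
Because the batch IDs are needed later to check progress and download results,
it may be safer to write them to disk than to rely on copying them from the
printed output. The following is a minimal sketch of that idea. The helper
`record_batch_ids` is hypothetical, the IDs are made up, and the demonstration
uses a temporary file; in practice a path such as
`./data/step1/batch_ids.json` could be used, though that location is an
assumption and not part of the original workflow.

```python
import json
import os
import tempfile

def record_batch_ids(batch_ids: list, path: str) -> list:
    """Append batch IDs to a JSON file, keeping earlier entries and skipping duplicates."""
    existing = []
    if os.path.exists(path):
        with open(path, "r") as f:
            existing = json.load(f)
    merged = existing + [b for b in batch_ids if b not in existing]
    with open(path, "w") as f:
        json.dump(merged, f, indent=2)
    return merged

# Demonstration with a temporary file and made-up IDs
demo_path = os.path.join(tempfile.mkdtemp(), "batch_ids.json")
record_batch_ids(["batch_example_1"], demo_path)
saved = record_batch_ids(["batch_example_1", "batch_example_2"], demo_path)
print(saved)
```

The saved list could then be loaded in later cells instead of hard-coding
`desired_batch_ids`.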

#### The batch processes should now be underway; they will take up to 24 hours

We can run the following cell to check on progress.


In [None]:
from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv('OPENAI_KEY'), max_retries=3)

batch_jobs = client.batches.list()

desired_batch_ids = ['batch_vdZSGRcPyfdMH8T0UCfssPpw']

for batch in batch_jobs.data:
    if batch.id in desired_batch_ids:
        print(batch.id, batch.status, batch.request_counts)
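
When several batches are running, it can help to summarise which ones have
reached a final state before attempting a download. The sketch below works on
a plain `{batch_id: status}` dictionary, which could be built from the loop
above. The helper `summarise` and the example IDs are hypothetical; the status
strings are the ones the batch API reports, with `completed`, `failed`,
`expired` and `cancelled` treated here as final.

```python
# Statuses after which a batch will not change any further
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def summarise(statuses: dict) -> dict:
    """Split {batch_id: status} into finished and still-running batch IDs."""
    finished = sorted(b for b, s in statuses.items() if s in TERMINAL_STATUSES)
    pending = sorted(b for b, s in statuses.items() if s not in TERMINAL_STATUSES)
    return {"finished": finished, "pending": pending, "all_finished": not pending}

# Made-up example
example = {"batch_a": "completed", "batch_b": "in_progress"}
summary = summarise(example)
print(summary)
```

Note that a finished batch is not necessarily a successful one: only batches
with the status `completed` are expected to have a full output file.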

#### Once processing is done, we can download the completed files

Files are saved here: `./data/step1/output`


In [None]:
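from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv('OPENAI_KEY'), max_retries=3)

batch_jobs = client.batches.list()

# We only want to download the batch jobs that were created in section 3 above
desired_batch_ids = ['batch_vdZSGRcPyfdMH8T0UCfssPpw']

output_dir = './data/step1/output'
os.makedirs(output_dir, exist_ok=True)

for batch in batch_jobs.data:
    if batch.id in desired_batch_ids:
        output_file = batch.output_file_id
        # output_file_id is only set once the batch has produced output
        if output_file is None:
            print(f"{batch.id} has no output file yet (status: {batch.status})")
            continue
        content = client.files.content(output_file)

        jsonl_file_path = f'{output_dir}/{output_file}.jsonl'
        content.write_to_file(jsonl_file_path)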

## 4. Process AI responses and save data

Now that we have all the AI responses, we need to process and save the results.
This will update the JSON file created in section 1, and also export the
responses as an Excel file for review. The Excel file will be located at
`./data/step1/review1.xlsx`.


In [None]:
from openai import AzureOpenAI
from config import AZURE_OPENAI_KEY, AZURE_OPENAI_BASE_URL
import os
import json
import pandas as pd

azure_client = AzureOpenAI(
    api_key=AZURE_OPENAI_KEY,
    api_version="2024-02-15-preview",
    azure_endpoint=AZURE_OPENAI_BASE_URL
)

def parse_JSON(json_str: str) -> dict:
    """Parse the JSON arguments returned by a function call.

    If parsing fails, the LLM is called again with the fix_object function
    to obtain a valid JSON response.
    """
    try:
        return json.loads(json_str)
    except Exception as e:
        messages = [
            {
                'role': 'system',
                'content': 'Assistant is a large language model designed to fix and return correct JSON objects.',
            },
            {
                'role': 'user',
                'content': f'ORIGINAL ERROR CONTAINING JSON OBJECT:\n\n{json_str}\n\nERROR MESSAGE: {e}',
            },
        ]

        tool_choices = [{
            'type': 'function',
            'function': {
                'name': 'fix_object',
                'description': 'You will be given an incorrectly formed JSON Object and an error message. You must fix the incorrect JSON Object and return the valid JSON object.',
                'parameters': {
                    'type': 'object',
                    'properties': {
                        'fixedJSON': {
                            'type': 'string',
                            'description': 'The reformatted and error free JSON object. Return the JSON object only!',
                        },
                    },
                    'required': ['fixedJSON'],
                },
            },
        }]

        response = azure_client.chat.completions.create(
            model='gpt-4',
            messages=messages,
            max_tokens=4096,
            temperature=0,
            tools=tool_choices,
            tool_choice={'type': 'function', 'function': {'name': 'fix_object'}},
        )

        # The tool call arguments are a JSON string wrapping the repaired JSON string
        second_test_json = response.choices[0].message.tool_calls[0].function.arguments
        to_return = json.loads(second_test_json)
        return json.loads(to_return['fixedJSON'])

output_folder = './data/step1/output'

jsonl_files = [f for f in os.listdir(output_folder) if os.path.isfile(os.path.join(output_folder, f)) and f.endswith('.jsonl')]

# Load original JSON list
with open('./data/step1/list.json', 'r') as f:
    list_data = json.load(f)

# Load the JSONL files
for file in jsonl_files:
    with open(f"{output_folder}/{file}", "r") as f:
        for line in f:
            item = json.loads(line)            
            item_key = item['custom_id']            
            json_res = parse_JSON(item['response']['body']['choices'][0]['message']['tool_calls'][0]['function']['arguments'])
            # grab the matching item in our list
            list_item = next((x for x in list_data if x['uniqueId'] == item_key), None)
            if list_item:
                # Only fill in the group when no human-defined category exists
                if list_item['group'] is None:
                    list_item['group'] = json_res['responder_category'].lower()
                    list_item['metadata']['groupDefinedBy'] = 'AI'
                list_item['metadata']['openAI_system_fingerprint'] = item['response']['body']['system_fingerprint']
                list_item['AI_response_general'] = json_res

# Save the updated list back to the json file
with open('./data/step1/list.json', 'w') as f:
    json.dump(list_data, f)

# Export the list to an Excel file for review
# Convert JSON to DataFrame
df = pd.json_normalize(list_data)

# Save DataFrame to Excel
df.to_excel('./data/step1/review1.xlsx', index=False)
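
The processing cell above assumes every output line contains a successful
response with a tool call. If a request failed or the model returned no tool
call, the nested lookup would raise an error. The sketch below shows one way
such lines might be detected and skipped. The helper `extract_arguments` and
both sample lines are made up for illustration and only mirror the nesting
used in the cell above; they are not real API output.

```python
import json

def extract_arguments(line: str):
    """Return (custom_id, arguments string), or (custom_id, None) if no tool call is present."""
    item = json.loads(line)
    custom_id = item.get("custom_id")
    try:
        message = item["response"]["body"]["choices"][0]["message"]
        return custom_id, message["tool_calls"][0]["function"]["arguments"]
    except (KeyError, IndexError, TypeError):
        # The request errored or the model did not return a tool call
        return custom_id, None

# Made-up output lines
good_line = json.dumps({
    "custom_id": "a1",
    "response": {"body": {"choices": [{"message": {"tool_calls": [
        {"function": {"name": "submission_eval", "arguments": "{\"support\": \"oppose\"}"}}
    ]}}]}},
})
bad_line = json.dumps({"custom_id": "a2", "response": None, "error": {"message": "example"}})

good_result = extract_arguments(good_line)
bad_result = extract_arguments(bad_line)
print(good_result)
print(bad_result)
```

Lines that return `None` could be logged by `custom_id` and resubmitted in a
later batch.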