# QDA 2 - Accessing AI

In this notebook we'll take some prompts and generate responses using a commercial AI model via its API. We'll be using Google Gemini, but we also provide some code to use OpenAI's ChatGPT.

#### Getting a prompt

We made a prompt and saved it as `outputs/test_prompt.txt` last time. Let's open it, send it to an AI model, and get the response. `GeminiAIModel` is our little helper around the API, which handles things like getting a list of models, tracking costs, and disabling all of the content filters. We also adopt an output format in common with an OpenAI version so we can easily switch between models.

In [9]:
from resbaz24.genai_google import GeminiAIModel

with open('outputs/test_prompt.txt', 'r', encoding='utf-8') as f:
    prompt_str = f.read()

janky = GeminiAIModel()  # our nickname for Gemini

ai_response = janky.get_completion(prompt=prompt_str) # This returns a CostedResponse
# This will take a little while to run - then let's print the cost, and save the output to a file
if ai_response.response is None:
    raise ValueError("No response was generated")
print(f"{ai_response.model} responded which cost us: ${ai_response.cost:.2f}")
with open('outputs/test_response.txt', 'w', encoding='utf-8') as f:
    f.write(ai_response.response)



Input tokens: 227, output tokens: 354
gemini-1.5-flash-002 responded which cost us: $0.00
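The `$0.00` above is most likely a rounding effect: a few hundred tokens cost a small fraction of a cent, so two decimal places hide the figure. Here is a minimal sketch of how a token-based cost might be estimated. The function name and the per-million-token prices are placeholders invented for illustration. They are not the helper's internals and not real Gemini prices.

```python
# Placeholder prices in USD per one million tokens (assumed values for
# illustration only; look up your provider's current price list).
PRICE_PER_MILLION_INPUT = 0.10
PRICE_PER_MILLION_OUTPUT = 0.40


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD of one request under the placeholder prices."""
    input_cost = input_tokens * PRICE_PER_MILLION_INPUT / 1_000_000
    output_cost = output_tokens * PRICE_PER_MILLION_OUTPUT / 1_000_000
    return input_cost + output_cost


# Token counts taken from the cell output above
cost = estimate_cost(input_tokens=227, output_tokens=354)
print(f"${cost:.2f}")  # two decimal places round a tiny cost down to $0.00
print(f"${cost:.6f}")  # more decimal places make the cost visible
```

When you start sending thousands of prompts, printing more decimal places (or summing costs across calls) gives a better picture of what a run will cost.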


#### Getting serious

Okay, but that's no different from what we could have done by pasting it into the web page. Let's go for a coding task, and this time use JSON as the output format so we can parse the data and reassemble it with the original ids (remember the prompt doesn't have them).

We need to go back to the dataset and get the dataframe, make new documents, get a response, and write it to a file.

In [10]:
import pandas as pd
from resbaz24.documents import SimpleDocument, documents_to_prompts

data_submissions = pd.read_parquet('2024_qld_election_reddit_dataset/submissions.parquet')
# Build one document per submission, using the first five rows that have body text
docs = [
    SimpleDocument(id=row.id, text=f"{row.title}: {row.selftext}")
    for _, row in data_submissions[data_submissions.selftext.notna()].head(5).iterrows()
]
instruction_prompt = "Consider the following Reddit submissions and code them according to two criteria: 1) Whether the submission is about the housing crisis, and 2) Whether the submission author is a nerd, or not. Output a JSON list of objects where each object has boolean properties 'housing_crisis' and 'is_nerd'.\n"
full_prompts = documents_to_prompts(data=docs, prompt=instruction_prompt, max_words=1000)
with open('outputs/json_prompt.txt', 'w', encoding='utf-8') as f:
    f.write(full_prompts[0].prompt)
ai_response = janky.get_completion(prompt=full_prompts[0].prompt, json_mode=True)
if ai_response.response is None:
    raise ValueError("No response was generated")
with open('outputs/json_response.json', 'w', encoding='utf-8') as f:
    f.write(ai_response.response)


Input tokens: 278, output tokens: 67
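Asking for JSON does not guarantee the model returns exactly the shape we described, so it can be worth checking before relying on it. The following is a minimal sketch using only the standard library. The function name `parse_coding_response` and the sample string are made up for illustration; the sample is not real model output.

```python
import json

# The two boolean properties requested in the instruction prompt
EXPECTED_KEYS = {"housing_crisis", "is_nerd"}


def parse_coding_response(raw: str) -> list[dict]:
    """Parse JSON text and check it is a list of objects with boolean flags."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, list):
        raise ValueError("Expected a JSON list at the top level")
    for position, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"Item {position} is not a JSON object")
        missing = EXPECTED_KEYS - set(item)
        if missing:
            raise ValueError(f"Item {position} is missing {sorted(missing)}")
        for key in EXPECTED_KEYS:
            if not isinstance(item[key], bool):
                raise ValueError(f"Item {position}: '{key}' is not a boolean")
    return data


# Made-up example text in the requested shape
sample = (
    '[{"housing_crisis": true, "is_nerd": false},'
    ' {"housing_crisis": false, "is_nerd": true}]'
)
coded = parse_coding_response(sample)
print(coded)
```

Failing loudly here is a design choice: a malformed response is easier to deal with at parse time than after it has been merged into a dataframe.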


#### Parsing JSON data

That's great, but the point of JSON is that we can parse it directly back into Python. Also, our JSON doesn't have the original ids. Remember that the ids are in the `.idlist` property of the `PromptDocument` object. So let's get everything into a proper Python object, and then dump it back out as JSON to have a look at it.

In [11]:
import json
assert ai_response.response is not None # because python notebooks forget we trapped this error in the above cell
decoded_json = json.loads(ai_response.response)
for i in range(len(decoded_json)):
    decoded_json[i]['id'] = full_prompts[0].idlist[i] # add the id from our original data
with open('outputs/json_response_decoded.json', 'w', encoding='utf-8') as f:
    json.dump(decoded_json, f, indent=2) # This indent arg makes the file human-readable
data_submissions[data_submissions.selftext.notna()].head(5) # Let's see the original data for comparison



| Unnamed: 0 | author | id | title | permalink | selftext | created_utc | subreddit | num_comments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | SCOOBASTEVE | 1genh6z | Interested in hearing from people who live/hav... | /r/brisbane/comments/1genh6z/interested_in_hea... | I went for a drive along the Centenary a few w... | 2024-10-29 05:36:36 | brisbane | 23 |
| 3 | ConanTheAquarian | 1gegdid | Five new Brisbane bus routes, changes to dozen... | /r/brisbane/comments/1gegdid/five_new_brisbane... | Buses could be more frequent, more reliable, l... | 2024-10-28 23:22:40 | brisbane | 31 |
| 4 | UnlikelyBicycle2559 | 1gefear | Buying in Bris vs Melb? | /r/brisbane/comments/1gefear/buying_in_bris_vs... | TLDR: Pros and cons of moving to Melb because ... | 2024-10-28 22:39:17 | brisbane | 19 |
| 6 | Grugly | 1ge2t3w | Did you change your mind after voting early? | /r/brisbane/comments/1ge2t3w/did_you_change_yo... | I'm a bit of a data nerd and reviewing the ele... | 2024-10-28 14:02:14 | brisbane | 23 |
| 7 | Wonderful_Alps5959 | 1gdyvc3 | Is Woodridge that bad? FHB | /r/brisbane/comments/1gdyvc3/is_woodridge_that... | Looking around Brisbane as I’m FHB - and have ... | 2024-10-28 10:37:55 | brisbane | 36 |
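Re-attaching ids by position, as in the cell above, only works if the model returned exactly one object per document and in the same order. The order cannot be checked from the response alone, but the count can. Here is a small sketch of a guarded version. The function name `attach_ids`, the placeholder ids, and the codes are all invented for illustration.

```python
def attach_ids(coded_items: list[dict], ids: list[str]) -> list[dict]:
    """Return copies of the coded items with an 'id' added by position."""
    if len(coded_items) != len(ids):
        raise ValueError(
            f"Got {len(coded_items)} coded items for {len(ids)} documents"
        )
    # Build new dicts so the parsed response itself is left untouched
    return [{"id": doc_id, **item} for doc_id, item in zip(ids, coded_items)]


# Placeholder ids and made-up codes, not real data
ids = ["doc_a", "doc_b"]
coded_items = [
    {"housing_crisis": True, "is_nerd": False},
    {"housing_crisis": False, "is_nerd": True},
]
with_ids = attach_ids(coded_items, ids)
print(with_ids)
```

A more robust alternative would be to include the ids in the prompt and ask the model to echo them back, at the cost of a few extra tokens per document.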


#### Making a CSV

CSV is awful, but it's an easy way to get some data into a spreadsheet so we can view it. Let's make a CSV file with the ids, the title, selftext, and columns for the flags we got from the AI model.

In [None]:
main_df = data_submissions[data_submissions.selftext.notna()].head(5)
json_df = pd.DataFrame(decoded_json) # make a little dataframe from our decoded JSON
if len(main_df) != len(json_df):
    raise ValueError("Dataframes are not the same length")
result_df = main_df.merge(
    json_df[['id', 'housing_crisis', 'is_nerd']], 
    on='id', 
    how='left'
)
final_df = result_df[['id', 'title', 'selftext', 'housing_crisis', 'is_nerd']]
final_df.to_csv('outputs/result_df.csv', index=False) # save to a CSV file with pandas
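One reason CSV is awkward is that it stores everything as text, so anything reading the file has to work out the types again. The sketch below shows this with the standard library `csv` module and an in-memory buffer instead of a real file. The rows and the helper `to_bool` are made up for illustration.

```python
import csv
import io

# Made-up rows in the same shape as our result columns
rows = [
    {"id": "doc_a", "housing_crisis": True, "is_nerd": False},
    {"id": "doc_b", "housing_crisis": False, "is_nerd": True},
]

# Write the rows as CSV into an in-memory text buffer
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["id", "housing_crisis", "is_nerd"])
writer.writeheader()
writer.writerows(rows)

# Read them back: every value is now a string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["housing_crisis"]))  # a str, no longer a bool


def to_bool(text: str) -> bool:
    """Convert the text written for a Python bool back into a bool."""
    return text == "True"


restored = [
    {
        "id": row["id"],
        "housing_crisis": to_bool(row["housing_crisis"]),
        "is_nerd": to_bool(row["is_nerd"]),
    }
    for row in loaded
]
print(restored == rows)
```

If the data needs to come back into Python later, the JSON file written earlier keeps the booleans intact; the CSV is best treated as a view for spreadsheets.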

