# QDA 1 - Making AI Prompts

In this notebook we will look at the basics of assembling a text prompt for an AI. We'll use the Queensland Election Reddit dataset as a source and focus on making a prompt we could just go and paste into a chat window.

#### Fetching submissions from the Queensland Election Reddit dataset

We'll start by pulling in some data from the Queensland Election Reddit dataset. The dataset files are in Parquet format, a columnar format that, unlike CSV, stores an explicit type for every column, so values such as timestamps load with the right type instead of having to be re-parsed from text. Thankfully, Pandas can read Parquet files directly into a DataFrame, so let's do that now. See the `dataset_readme.md` file in the dataset directory for more information on the dataset.

In [1]:
import pandas as pd

data_submissions = pd.read_parquet('2024_qld_election_reddit_dataset/submissions.parquet')
data_submissions.head(5)


|   | author | id | title | permalink | selftext | created_utc | subreddit | num_comments |
|---|---|---|---|---|---|---|---|---|
| 0 | ScubaFett | 1geoofo | Anyone else only paying 10c fares on the train... | /r/brisbane/comments/1geoofo/anyone_else_only_... |  | 2024-10-29 07:07:12 | brisbane | 18 |
| 1 | SCOOBASTEVE | 1genh6z | Interested in hearing from people who live/hav... | /r/brisbane/comments/1genh6z/interested_in_hea... | I went for a drive along the Centenary a few w... | 2024-10-29 05:36:36 | brisbane | 23 |
| 2 | Tac0321 | 1gemngt | Doctors call on newly elected Queensland gover... | /r/brisbane/comments/1gemngt/doctors_call_on_n... |  | 2024-10-29 04:42:16 | brisbane | 98 |
| 3 | ConanTheAquarian | 1gegdid | Five new Brisbane bus routes, changes to dozen... | /r/brisbane/comments/1gegdid/five_new_brisbane... | Buses could be more frequent, more reliable, l... | 2024-10-28 23:22:40 | brisbane | 31 |
| 4 | UnlikelyBicycle2559 | 1gefear | Buying in Bris vs Melb? | /r/brisbane/comments/1gefear/buying_in_bris_vs... | TLDR: Pros and cons of moving to Melb because ... | 2024-10-28 22:39:17 | brisbane | 19 |
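
To see why typed columns matter, here is a small sketch, assuming only that pandas is installed. It builds a toy frame (values borrowed from the first two rows of the submissions data) and round-trips it through CSV text in memory, so no Parquet engine is needed for the comparison. The variable names are made up for this illustration.

```python
import io

import pandas as pd

# Toy frame: values borrowed from the first two submissions, for illustration only.
toy = pd.DataFrame({
    "id": ["1geoofo", "1genh6z"],
    "created_utc": pd.to_datetime(["2024-10-29 07:07:12", "2024-10-29 05:36:36"]),
    "num_comments": [18, 23],
})

# Round-trip the frame through CSV text held in memory.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

# The original column is a datetime; the CSV copy comes back as plain text.
original_is_datetime = pd.api.types.is_datetime64_any_dtype(toy["created_utc"])
csv_is_datetime = pd.api.types.is_datetime64_any_dtype(from_csv["created_utc"])

print(toy.dtypes)
print(from_csv.dtypes)
```

The CSV copy loses the datetime type unless you ask `read_csv` to parse it again (for example with `parse_dates`), which is the kind of bookkeeping a Parquet file saves you.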


#### Fetching comments 

This is just an example of reading the comments as well; in this notebook we'll focus on the posts (submissions).

In [4]:
data_comments = pd.read_parquet('2024_qld_election_reddit_dataset/comments.parquet')
data_comments.head(5)


|   | id | author | body | created_utc | root_comment | parent_id | subreddit | submission_id |
|---|---|---|---|---|---|---|---|---|
| 0 | lubakzt | ran_awd | It's an authorisation charge. It used to be $1... | 2024-10-29 07:13:50 | True | 1geoofo | brisbane | 1geoofo |
| 1 | lubfjis | CaptainObvious2794 | Man, my go-card charged 83c on Friday. I guess... | 2024-10-29 08:11:37 | True | 1geoofo | brisbane | 1geoofo |
| 2 | lubidsm | butterbuts | Be careful using your phone for payment. I was... | 2024-10-29 08:45:03 | True | 1geoofo | brisbane | 1geoofo |
| 3 | lubj5fd | DealerGullible4673 | It’s the temporary hold they charge. Money isn... | 2024-10-29 08:53:57 | True | 1geoofo | brisbane | 1geoofo |
| 4 | lubajlk | heisdeadjim_au | See how it says "pending'? You went somewhere... | 2024-10-29 07:13:23 | True | 1geoofo | brisbane | 1geoofo |


#### Making documents

Let's try making a prompt. First, we need a list of documents. We'll use the `SimpleDocument` named tuple to represent them; each one is just an object with an `id` and a `text`. The code below builds the list with a Python list comprehension, which is a for loop written inside list brackets `[]`. We'll be using some code from the `qdaai` package, which includes helpers for QDA pipelines. You don't need to know how they work, but you can look at the code in the `qdaai` directory if you're curious.

In [None]:
from qdaai.documents import SimpleDocument

docs = [
    SimpleDocument(
        id=row.id,
        text=f"{row.title}: {row.selftext}" # Join the submission title and selftext together
    ) # This is what's being returned for each item in our list comprehension
    for _, row in data_submissions[data_submissions.selftext.notna()] # This is what we are looping over - rows where the selftext is not null
    .head(5) # Limit to 5 for demonstration purposes
    .iterrows()
]
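
If list comprehensions are new to you, the sketch below shows the same pattern next to an ordinary for loop. `ToyDocument` is a hypothetical stand-in written for this example, not the real `SimpleDocument` from `qdaai`, and the rows are made-up dictionaries rather than DataFrame rows.

```python
from typing import NamedTuple


class ToyDocument(NamedTuple):
    """Hypothetical stand-in for SimpleDocument: just an id and a text."""
    id: str
    text: str


# Made-up rows for illustration; the second one has no selftext.
rows = [
    {"id": "a1", "title": "First title", "selftext": "first body"},
    {"id": "b2", "title": "Second title", "selftext": None},
    {"id": "c3", "title": "Third title", "selftext": "third body"},
]

# The long way: an explicit for loop that appends to a list.
docs_loop = []
for row in rows:
    if row["selftext"] is not None:
        docs_loop.append(
            ToyDocument(id=row["id"], text=f"{row['title']}: {row['selftext']}")
        )

# The same thing as a list comprehension: expression first, then the loop, then the filter.
docs_comp = [
    ToyDocument(id=row["id"], text=f"{row['title']}: {row['selftext']}")
    for row in rows
    if row["selftext"] is not None
]

print(docs_comp == docs_loop)
print(docs_comp[0].id, docs_comp[0].text)
```

Both versions build the same list; the comprehension is just more compact. Because a named tuple is immutable, each document's `id` and `text` can be read as attributes but not changed after creation.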

#### Assembling a prompt
Now let's assemble a prompt from these documents. A couple of things to note. The prompt presents our documents (Reddit submissions) as an ordered list, because this tends to work better for LLMs. Each `PromptDocument` result has the prompt text as `.prompt` and the original ids as a list in the `.idlist` property. We need that idlist to match up the results from the AI with the original documents.

In [3]:
from qdaai.documents import documents_to_prompts

instruction_prompt = "Thematically speaking, what are the following Reddit submissions about?\n"
full_prompts = documents_to_prompts(data=docs, prompt=instruction_prompt, max_words=1000)
# Those 5 will fit into a single prompt, so we only have one item in the list. Let's write it to a file so we can view it
with open('outputs/test_prompt.txt', 'w', encoding='utf8') as f: # You really need to write Reddit data as utf8, think of all the emojis!
    f.write(full_prompts[0].prompt)
# Print the idlist
print(full_prompts[0].idlist)


['1genh6z', '1gegdid', '1gefear', '1ge2t3w', '1gdyvc3']
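
To make the roles of `max_words` and the idlist more concrete, here is a rough sketch of how this kind of batching could work. This is not the `qdaai` implementation: `toy_documents_to_prompts`, `ToyPrompt`, `ToyDocument`, and `lookup_id` are hypothetical names, and the details (counting words by splitting on whitespace, restarting the numbering at 1 in each prompt) are assumptions made for the example.

```python
from typing import List, NamedTuple


class ToyDocument(NamedTuple):
    """Hypothetical stand-in for SimpleDocument."""
    id: str
    text: str


class ToyPrompt(NamedTuple):
    """Hypothetical stand-in for the prompt result: prompt text plus original ids."""
    prompt: str
    idlist: List[str]


def toy_documents_to_prompts(data, prompt, max_words):
    """Pack documents into numbered prompts, starting a new one when the word budget is hit."""
    base_words = len(prompt.split())
    results = []
    lines, ids, words = [], [], base_words
    for doc in data:
        doc_words = len(doc.text.split())
        # Flush the current prompt if adding this document would exceed the budget.
        if ids and words + doc_words > max_words:
            results.append(ToyPrompt(prompt + "\n".join(lines), ids))
            lines, ids, words = [], [], base_words
        ids.append(doc.id)
        lines.append(f"{len(ids)}. {doc.text}")  # Assumed: numbering restarts per prompt
        words += doc_words
    if ids:
        results.append(ToyPrompt(prompt + "\n".join(lines), ids))
    return results


def lookup_id(toy_prompt, item_number):
    """Map a 1-based item number from the AI's answer back to the original document id."""
    return toy_prompt.idlist[item_number - 1]


toy_docs = [
    ToyDocument("a1", "one two three"),
    ToyDocument("b2", "four five six"),
    ToyDocument("c3", "seven eight nine"),
]
toy_prompts = toy_documents_to_prompts(toy_docs, "What are these about?\n", max_words=10)

print(len(toy_prompts))
print(toy_prompts[0].prompt)
print(lookup_id(toy_prompts[0], 2))
```

With a budget of 10 words, the first two toy documents fit alongside the instruction and the third spills into a second prompt. If the AI's answer refers to "item 2" of the first prompt, `lookup_id` maps it back to the original id, which is the job the real `.idlist` does above.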
