## QA Generation

Last time I did this, it cost about $10 to create the question/answer pairs. This time I'm going to use local models through Ollama instead.
This notebook iterates over the posts extracted from r/LocalLLaMA and generates a QA dataset.

In [1]:
import tqdm
import ollama
import pickle
import pprint as pp
from datetime import datetime

In [None]:
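%%time
# Smoke test. %%time has to be the first line of the cell to time the whole cell.
# Sampling settings such as temperature go in `options`, not inside the message dict.
response = ollama.chat(
    model='mistral:latest',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    options={'temperature': 0.01},
)
print(response['message']['content'])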

In [None]:
response

In [None]:
# test: plain completion endpoint; %time only times the statement on its own line
%time response = ollama.generate(model='mistral:latest', prompt='Why is the sky blue?')
response

---

### Ollama client

Start the Ollama server on port 5050 before running the cells below:

`OLLAMA_HOST=127.0.0.1:5050 ollama serve`

In [2]:
DATA_PATH = "./_output/new/localllama-new-17-02-2024.txt"
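# Read as UTF-8 explicitly so emoji in post titles are not garbled by the platform default encoding
with open(DATA_PATH, "r", encoding="utf-8") as file: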
    data = file.read()

data_chunks = data.split("---\nPost ID:")

print(f"There are {len(data_chunks)} questions in total")
data_chunks[:3]

There are 975 questions in total


["Post ID: 1at0288\nTitle: Ok, which one of you was this? 🤣🤣🤣\nLink: https://redd.it/1at0288\nContent: \nReplies:\n- No, I don't think OpenAI would ever allow porn to be generated. I rather think that copies of Sora, recreated open source image generators will appear and fullfill this task. Porn is always one of the first use cases in any technologie that appeared and I don't think it'll take long for the industry to hop into this new tech. This is good for us as it further pushes open source AI technology for any use case.\n\n",
 ' 1aszy6f\nTitle: What are your favorite resources for evaluating text generation for stuff like readability, engagement (and other "soft" metrics)\nLink: https://redd.it/1aszy6f\nContent: Hi everyone, i\'m working on a thesis looking at different prompt engineering methods and trying to evaluate the quality of generated content for stuff like articles, newsletters = human read content. Most research focuses on stuff like factuality, reasoning but I\
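
Each chunk follows the same `Post ID / Title / Link / Content / Replies` layout, so it can be split into fields if that is ever needed (for filtering out empty posts, say). A minimal sketch, assuming every post uses exactly those labels in that order and that each reply starts with `- ` at the beginning of a line; `parse_post` is a hypothetical helper and is not used elsewhere in this notebook:

```python
import re

# Assumed layout, taken from the sample output above.
POST_PATTERN = re.compile(
    r"^(?P<id>\S+)\s*\n"
    r"Title:(?P<title>.*?)\n"
    r"Link:(?P<link>.*?)\n"
    r"Content:(?P<content>.*?)"
    r"(?:\nReplies:(?P<replies>.*))?$",
    re.DOTALL,
)


def parse_post(chunk):
    """Split one raw chunk into a dict of fields (hypothetical helper)."""
    text = chunk.strip()
    # After data.split("---\nPost ID:") only the first chunk keeps its label.
    if text.startswith("Post ID:"):
        text = text[len("Post ID:"):].strip()

    match = POST_PATTERN.match(text)
    if match is None:
        # Unexpected layout: hand back the raw text instead of guessing.
        return {"id": None, "raw": text}

    fields = {key: (value or "").strip() for key, value in match.groupdict().items()}
    # Replies are a "- " bulleted list; turn them into a Python list.
    fields["replies"] = [
        reply.strip()
        for reply in re.split(r"^- ", fields["replies"], flags=re.MULTILINE)
        if reply.strip()
    ]
    return fields
```

For example, `parse_post(data_chunks[0])["title"]` would give the title of the first post.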

In [3]:
client = ollama.Client(host='http://127.0.0.1:5050')
response_chunks = []

instructions = """
\n
# INSTRUCTIONS: 
Your job is to look at this single Reddit post and produce several technical question/answer pairs based on the content provided.
Your response will be inserted into a question-and-answer dataset made up of QA pairs from hundreds of Reddit posts.
For longer posts (such as ones with a lot of information in the content or with many comments), produce a lot of QA pairs.
For posts with less content, produce fewer. Only include QA pairs with generally useful information.
Also look at the replies for additional technical information.
Write everything in the present tense. Provide code extracts or configurations where appropriate.
Write the QA pairs as general questions that are not specific to the Reddit post itself, because the dataset will not include the post as context.

# RULES: 
Do NOT produce QA pairs for anything that is not in the provided text. 
Do NOT include phrases like "the user", "the poster", "this post", "reddit post", "the person" or "the author".
Only provide the QA pairs. 
Do NOT provide introductions or conclusions.
Do NOT write anything that is personal information, personal opinion, or conversational text.
Failure to comply with these rules will result in you being penalized.
Adhering to the rules will get you a $200 tip.

# FORMAT: 
Write your response in this format:
```
Q: What is the colour of the sky?
A: The colour of the sky is blue.

Q: How old is OpenAI? 
A: OpenAI was founded in 2015, therefore it is 8 years old.
```
"""

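The `Q:` / `A:` format requested in the prompt is easy to turn back into structured pairs later. A rough sketch, assuming the model sticks to that format and that every question starts on a line beginning with `Q:`; `parse_qa_pairs` is a made-up helper name, and local models do drift from the format, so malformed blocks are skipped rather than repaired:

```python
import re


def parse_qa_pairs(response_text):
    """Return a list of (question, answer) tuples from one model response."""
    text = response_text.strip()
    # Drop a wrapping code fence if the model copied it from the format example.
    if len(text) >= 6 and text.startswith("```") and text.endswith("```"):
        text = text[3:-3].strip()

    pairs = []
    # A new block begins at every line that starts with "Q:".
    for block in re.split(r"^(?=Q:)", text, flags=re.MULTILINE):
        question, separator, answer = block.partition("\nA:")
        if not separator or not question.startswith("Q:"):
            continue  # Not a complete Q/A block, skip it.
        question = question[2:].strip()
        answer = answer.strip()  # May span several lines, e.g. code extracts.
        if question and answer:
            pairs.append((question, answer))
    return pairs
```

Splitting on lines that start with `Q:` rather than on blank lines keeps multi-line answers (code, configs) in one piece.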
In [None]:
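output_file = f"./_output/new/responses-mistral-new-{datetime.now().strftime('%d-%m-%Y')}.txt"

print("Generating QA Pairs...")
with open(output_file, "w", encoding="utf-8") as file:
    for chunk in tqdm.tqdm(data_chunks):
        # Wrap the post in a fence so the model can tell it apart from the instructions
        prompt = f"```\n{chunk}\n```\n" + instructions

        response = client.chat(
            model='mistral:latest',
            messages=[{'role': 'user', 'content': prompt}],
            # A 'temperature' key inside a message dict does not set the sampling temperature;
            # sampling settings belong in `options`
            options={'temperature': 0.1},
        )
        # One block of QA pairs per post, separated by a blank line
        file.write(response['message']['content'] + "\n\n")
        # Flush so partial results survive if the run is interrupted
        file.flush()

print("Done!")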
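
With 975 posts and a local model, one failed request should not throw away the whole run. A sketch of a more forgiving loop, assuming the model call is passed in as a plain function; `generate_qa`, `ask` and `fake_ask` are hypothetical names, and the demo uses a stand-in instead of a real Ollama call so it runs anywhere:

```python
import io


def generate_qa(chunks, ask, out, instructions=""):
    """Write one block of QA pairs per chunk to `out`; return the failures.

    `ask` is any function that takes a prompt string and returns the model's
    text. `out` is any writable text stream (an open file, io.StringIO, ...).
    """
    failed = []
    for index, chunk in enumerate(chunks):
        prompt = f"```\n{chunk}\n```\n{instructions}"
        try:
            answer = ask(prompt)
        except Exception as error:
            # Record the failure and carry on with the next post.
            failed.append((index, repr(error)))
            continue
        out.write(answer.strip() + "\n\n")
        out.flush()  # Keep what has been generated so far.
    return failed


# Demo with a stand-in for the model call.
def fake_ask(prompt):
    if "boom" in prompt:
        raise RuntimeError("model unavailable")
    return "Q: example?\nA: example."


buffer = io.StringIO()
failures = generate_qa(["post one", "boom", "post three"], fake_ask, buffer)
print(len(failures))                  # 1
print(buffer.getvalue().count("Q:"))  # 2
```

In the notebook, `ask` could wrap the real call, for example a function returning `client.chat(model='mistral:latest', messages=[{'role': 'user', 'content': prompt}], options={'temperature': 0.1})['message']['content']`, and the indices in the returned list can be retried afterwards.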