# Generate a corpus with an LLM

The last cells in the notebook allow you to delete a corpus or zip a corpus to download.

Run this cell to install a required Python library.

In [None]:
!pip install python-slugify

Run the following cell to import relevant Python libraries used in this notebook and set the logging level.

In [None]:
import logging
import requests
import json
import time
import getpass
from slugify import slugify
import shutil
import os
import csv

logging.basicConfig(level=logging.INFO)

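Enter your Open Router API key when prompted. `getpass` hides what you type, so the key is not displayed in the notebook output.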

In [None]:
OPENROUTER_API_KEY = getpass.getpass()

The following cell contains a function to query Open Router and generate LLM text. Just run it to make the function available. Change it if you know what you are doing.

In [None]:
def query_llm(prompt: str, # prompt to send to LLM
            model: str, # model name e.g. google/gemma-2-9b-it:free
            system_prompt: str = None, # system prompt to send to LLM
            max_tokens: int = 2048, # maximum number of tokens to generate
            response_format: str = None, # response format: 'json' or None
            temperature: float = None # temperature for sampling
            ) -> str:
    """ Query LLM with prompt and return the generated text (None if the key or prompt is missing) """

    api_url = "https://openrouter.ai/api/v1/chat/completions"
    if OPENROUTER_API_KEY is None:
        logging.error("OPENROUTER_API_KEY not set. Not querying llm.")
        return None
    api_key = OPENROUTER_API_KEY
    
    if prompt.strip() == '':
        logging.error('No prompt provided. Not querying llm.')
        return None
    
    messages = []
    if system_prompt is not None and system_prompt.strip() != '':
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})

    request_data = {
                "model": model, 
                "messages": messages,
                'max_tokens': max_tokens
                    }

    if temperature is not None:
        request_data['temperature'] = temperature
    
    if response_format == "json":
        request_data['response_format'] = {"type": "json_object"}
        
    text = None
    response = None # stays None if the request fails before a response arrives

    try:
        response = requests.post(
            url=api_url,
            headers={
                "Authorization": f"Bearer {api_key}",
            },
            data=json.dumps(request_data)
            )
        response.raise_for_status() 
        text = response.json()['choices'][0]['message']['content']
    except Exception as e:
        # covers network errors, HTTP errors and unexpected response structure
        logging.error(f"Error querying LLM: {e}")
        if response is not None:
            print(response.text) # show the raw API response to help with debugging
        raise

    return text
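
The function returns only the generated text, which it reads from `choices[0]['message']['content']` in the API's JSON response. As a rough illustration of that lookup, here is a minimal sketch using a hand-written dictionary. `sample_response` and `extract_text` are made up for this example; this is not real API output, and real responses contain additional fields.

```python
# Hand-written stand-in for a parsed API response (not real output).
sample_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Once upon a time ..."}}
    ]
}

def extract_text(response_json: dict) -> str:
    """Return the generated text, mirroring the lookup used in query_llm."""
    return response_json["choices"][0]["message"]["content"]

text = extract_text(sample_response)
print(text)
```

If the API returns an error body instead, the `choices` key is typically missing, which is why `query_llm` logs the error and shows the raw response.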


## Note about Open Router models (and set a model)

While I've been using Open Router the last few days, sometimes I get some unhelpful errors back from their API if one of the free models is unavailable. If you get errors querying Open Router you can look up the message or error codes in their documentation. Often if you get an error on one model, another will work fine. Here is where you can set the model to use for generation.

In [None]:
model = 'meta-llama/llama-3-8b-instruct:free'
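
Since another model often works when one fails, one option is to try a list of models in order. The following is only a sketch of that idea: `query_with_fallback` and `fake_query` are hypothetical helpers written for this example, and the model names `model-a` and `model-b` are placeholders, not real Open Router models.

```python
import logging

def query_with_fallback(query_fn, prompt, models):
    """Try each model in turn; return (model, text) for the first one that works."""
    for candidate in models:
        try:
            text = query_fn(prompt, candidate)
            if text:  # skip empty or None results
                return candidate, text
        except Exception as e:
            logging.warning(f"Model {candidate} failed: {e}")
    raise RuntimeError("All models failed.")

# Stand-in for query_llm so the sketch runs without an API key.
def fake_query(prompt, model):
    if model == "model-a":
        raise ValueError("model unavailable")
    return f"[{model}] response to: {prompt}"

used_model, text = query_with_fallback(fake_query, "Hello", ["model-a", "model-b"])
print(used_model)
```

In the notebook you could pass `query_llm` as `query_fn`, because its first two positional parameters are the prompt and the model. Keep in mind that every attempt counts toward the request limits.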

## Example 1 - generate a corpus with one LLM prompt

This is a very basic example that uses a single prompt to generate a small corpus. Examples 2 and 3 will probably be more helpful for your assignment.

Set the path to save the corpus ...

In [None]:
corpus_path = 'example1-corpus/'

Check if the path exists, if not create it!

In [None]:
if not os.path.exists(corpus_path):
    print(f'Creating path: {corpus_path}')
    os.makedirs(corpus_path)
else:
    print(f'Path already exists: {corpus_path}')

Below are the settings we will use to generate a tiny corpus of three documents. 

***Warning:***

You can change the number of texts to generate, but DON'T do this during the lab times as this may affect the class's ability to run the notebook. Remember there are limits (200 requests for free models per day, and a maximum of 20 requests per minute).

In [None]:
number_of_texts_to_generate = 3 # PLEASE LEAVE THIS FOR NOW!

system_prompt = '''
'''

prompt = '''
Write a short children's story imagining the future with AI.
'''

max_tokens = 1024

response_format = ''
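
The limits in the warning translate into simple arithmetic: 20 requests per minute means at least 60 / 20 = 3 seconds between requests, and 200 requests per day caps the corpus size for one day. The sketch below checks a planned run against those numbers. `check_plan` is a hypothetical helper, and it assumes one request per text and ignores the time the model takes to respond.

```python
REQUESTS_PER_MINUTE = 20   # limit quoted in the warning above
REQUESTS_PER_DAY = 200     # limit quoted in the warning above

def check_plan(number_of_texts: int, delay_seconds: float) -> dict:
    """Compare a planned run with the per-minute and per-day limits."""
    min_delay = 60 / REQUESTS_PER_MINUTE  # smallest safe gap between requests
    return {
        "min_delay_seconds": min_delay,
        "delay_ok": delay_seconds >= min_delay,
        "within_daily_limit": number_of_texts <= REQUESTS_PER_DAY,
        "total_sleep_seconds": number_of_texts * delay_seconds,
    }

plan = check_plan(number_of_texts=3, delay_seconds=10)
print(plan)
```

With the settings used in this notebook (3 texts, a 10 second delay), the run stays well inside both limits.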

This queries the API and generates a text, displays a preview of each generated text, and then saves the generated text as a file.

In [None]:
for i in range(number_of_texts_to_generate):
    response = query_llm(prompt, model, system_prompt, max_tokens, response_format = response_format)
    print(f'Text {i} preview: {response[0:200]} ...')
    
    filename = f'text-{i}.txt'
    with open(os.path.join(corpus_path, filename), 'w', encoding='utf8') as f:
        print(f'Saving to {os.path.join(corpus_path, filename)}')
        f.write(response)
    
    print('------')
        
    time.sleep(10) # always leave a delay!

Inspect your txt files in the directory path you specified above!

## Example 2: generate a corpus by seeding the prompt with some other generated data

Here we first generate some data, and then we use this data to prompt the LLM. Here it is simply a title, but this could be more complex. For example, perhaps you generate the name of a person and a life history as the basis for a generated biography. 

Set the path to save the corpus ...

In [None]:
corpus_path = 'example2-corpus/'

Check if the path exists, if not create it!

In [None]:
if not os.path.exists(corpus_path):
    print(f'Creating path: {corpus_path}')
    os.makedirs(corpus_path)
else:
    print(f'Path already exists: {corpus_path}')

First, we are going to generate some data that we will then use to generate the corpus. In this case we are going to generate some titles of opinion pieces on AI. Note that here I am generating JSON data: the system prompt advises that JSON output is required, and the user prompt includes the required format. This sometimes may not work with smaller models! If it doesn't, just run it again. If you wanted to generate more titles you would change the number in the user prompt. For now, leave it as it is.

In [None]:
system_prompt = '''
Always respond with JSON data.
'''

prompt = '''
Generate a list of 3 editorial titles about artificial intelligence. 
Vary the length and theme of each title.
JSON format: 
{
    "titles": [
        "title 1",
        "title 2",
        "title 3"
    ]
}
'''

max_tokens = 8096

response_format = 'json'

Query the API to generate the titles ...

In [None]:
response = query_llm(prompt, model, system_prompt, max_tokens, response_format = response_format)
print(response)

If the next cell runs ok you have some valid JSON. If not, run the cell above again. Sometimes LLMs don't follow instructions!

In [None]:
seed_data = json.loads(response)
for seed in seed_data['titles']:
    print(seed)
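
One common way models fail to follow the instructions is by wrapping the JSON in a Markdown code fence. The sketch below shows one possible way to tolerate that and to check the structure before using it. `parse_titles` is a hypothetical helper and the example titles are invented; it is not part of the notebook's workflow, and re-running the query remains the simplest fix.

```python
import json

FENCE = "`" * 3  # three backticks, the Markdown code-fence marker

def parse_titles(text: str) -> list:
    """Parse a JSON object with a "titles" list, tolerating a surrounding code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag) ...
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # ... and everything from the closing fence onwards.
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    data = json.loads(cleaned)  # raises json.JSONDecodeError if still invalid
    titles = data.get("titles") if isinstance(data, dict) else None
    if not isinstance(titles, list) or not all(isinstance(t, str) for t in titles):
        raise ValueError('Expected a "titles" key containing a list of strings.')
    return titles

example = FENCE + 'json\n{"titles": ["AI at Work", "Who Owns the Future?"]}\n' + FENCE
print(parse_titles(example))
```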

Now we will generate a tiny corpus of three documents based on these titles.  
Note: the system prompt holds the instructions, and there is no separate hand-written user prompt.  
The user prompt for each editorial is simply one of the titles we generated above.  

In [None]:
system_prompt = '''
Write an editorial based on the title provided. The editorial should be written for a general audience.
It will appear in a major news outlet. Do not include the title as part of the output.
'''

max_tokens = 2048
response_format = ''

Run this to generate the corpus ...

In [None]:
for i, prompt in enumerate(seed_data['titles']): # i numbers the previews
    print(f'Generating text based on: {prompt}')
    response = query_llm(prompt, model, system_prompt, max_tokens, response_format = response_format)
    print(f'Text {i} preview: {response[0:200]} ...')

    filename = slugify(prompt) + '.txt' # Note: this creates a nice filename from the title
    with open(os.path.join(corpus_path, filename), 'w', encoding='utf8') as f:
        print(f'Saving to {os.path.join(corpus_path, filename)}')
        f.write(response)
    
    print('-----------------')
    time.sleep(10) # always use a time delay

Inspect your txt files in the directory path you specified above!
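
Because each filename is derived from its title, two titles that produce the same filename (or a second run of the loop) will overwrite an earlier file. If you want to keep every text, one possible approach is to add a numeric suffix when a file already exists. `unique_path` is a hypothetical helper for illustration, and the demonstration writes only to a temporary directory.

```python
import os
import tempfile

def unique_path(directory: str, stem: str, extension: str = ".txt") -> str:
    """Return a path in directory that does not exist yet, adding -2, -3, ... if needed."""
    candidate = os.path.join(directory, stem + extension)
    counter = 2
    while os.path.exists(candidate):
        candidate = os.path.join(directory, f"{stem}-{counter}{extension}")
        counter += 1
    return candidate

# Demonstration in a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp:
    first = unique_path(tmp, "ai-and-the-future")
    with open(first, "w", encoding="utf8") as f:
        f.write("first text")
    second = unique_path(tmp, "ai-and-the-future")
    first_name, second_name = os.path.basename(first), os.path.basename(second)

print(first_name, second_name)
```

In the generation loop, the stem would be the slugified title and the directory would be `corpus_path`.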

## Example 3: generate a corpus with prompts from a CSV file

If you are using another data source, whether that is scraped or a corpus you have found online, you may want to generate a comparable corpus using this data. For example, if you want to compare human vs generated news stories and have the human-authored stories collected, you could generate texts based on the titles of the human-authored texts.

Set the path to save the corpus ...

In [None]:
corpus_path = 'example3-corpus/'

Check if the path exists, if not create it!

In [None]:
if not os.path.exists(corpus_path):
    print(f'Creating path: {corpus_path}')
    os.makedirs(corpus_path)
else:
    print(f'Path already exists: {corpus_path}')

Specify the CSV file name here and the field name you are using for prompting. The example is from some recent RadioNZ stories about AI.

In [None]:
csv_file = 'sample-for-example3.csv'
field_name = 'title'

Read the data for field_name and preview it ...

In [None]:
seed_data = [] # defined up front so the final print works even if the field is missing

with open(csv_file, 'r', newline='') as file:
    csv_reader = csv.DictReader(file)
    
    header = csv_reader.fieldnames
    print(f'Header fields: {header}')
    if field_name not in header:
        print(f'The field name {field_name} is not in the header row!')
    else:
        for row in csv_reader:
            seed_data.append(row[field_name]) # use the configured field, not a hard-coded one
            
print(f'Data for prompting: {seed_data}')
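
If you want to see how `csv.DictReader` exposes the header row and the values of one column without needing the sample file, here is a small self-contained sketch. The CSV content, the `url` column and the `read_field` helper are invented for this example; your own file will have different fields.

```python
import csv
import io

# Placeholder CSV content, invented for this example.
sample_csv = (
    "title,url\n"
    "First placeholder title,https://example.com/1\n"
    "Second placeholder title,https://example.com/2\n"
)

def read_field(file_obj, field_name: str) -> list:
    """Return the non-empty values of one column from a CSV with a header row."""
    reader = csv.DictReader(file_obj)
    if reader.fieldnames is None or field_name not in reader.fieldnames:
        raise KeyError(f"The field name {field_name} is not in the header row!")
    # Skip rows where the field is missing or blank.
    return [row[field_name] for row in reader
            if row[field_name] and row[field_name].strip()]

titles = read_field(io.StringIO(sample_csv), "title")
print(titles)
```

Skipping blank values matters here because `query_llm` refuses to send an empty prompt.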

Now we will generate a tiny corpus with one document for each value of the title field.  
Note: the system prompt holds the instructions, and there is no separate hand-written user prompt.  
The user prompt for each story is a title read from the CSV file above.  

In [None]:
system_prompt = '''
Write a news story for Radio New Zealand based on the title provided. This should be written for a general audience. 
Do not include the title as part of the output.
'''

max_tokens = 2048
response_format = ''

Run this to generate the corpus ...

In [None]:
for i, prompt in enumerate(seed_data): # i numbers the previews
    print(f'Generating text based on: {prompt}')
    response = query_llm(prompt, model, system_prompt, max_tokens, response_format = response_format)
    print(f'Text {i} preview: {response[0:200]} ...')

    filename = slugify(prompt) + '.txt' # Note: this creates a nice filename from the title
    with open(os.path.join(corpus_path, filename), 'w', encoding='utf8') as f:
        print(f'Saving to {os.path.join(corpus_path, filename)}')
        f.write(response)
    
    print('-----------------')
    time.sleep(10) # always use a time delay

## Delete your corpus (if you need to)

If you made a mistake or want to remove your corpus files for some other reason, change `i_want_to_delete_my_files` from `False` to `True`. Change it back again afterwards so you don't accidentally run it!

In [None]:
corpus_path = 'example1-corpus/' # set this to whatever path makes sense!

In [None]:
i_want_to_delete_my_files = False

if i_want_to_delete_my_files:
    for filename in os.listdir(corpus_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(corpus_path, filename)
            os.remove(file_path)
            print(f"Deleted: {filename}")
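
Before deleting anything, it can help to preview which files would be removed. The sketch below applies the same `.txt` filter as the cell above but only lists the matches. `list_txt_files` is a hypothetical helper, and the demonstration uses a temporary directory with placeholder files.

```python
import os
import tempfile

def list_txt_files(directory: str) -> list:
    """Return the .txt filenames in directory, sorted, without deleting anything."""
    return sorted(name for name in os.listdir(directory) if name.endswith(".txt"))

# Demonstration in a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["text-0.txt", "text-1.txt", "notes.md"]:
        with open(os.path.join(tmp, name), "w", encoding="utf8") as f:
            f.write("placeholder")
    to_delete = list_txt_files(tmp)

print(to_delete)
```

In the notebook you would call `list_txt_files(corpus_path)` and check the list before setting `i_want_to_delete_my_files` to `True`.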

## Save a zip file of your corpus

If you are happy with your corpus and want to download it, you can zip it and download the zip file.

In [None]:
corpus_file_name = 'example1-corpus.zip'
corpus_path = 'example1-corpus/'

shutil.make_archive(corpus_file_name[:-4], 'zip', corpus_path)
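
To confirm that the archive contains what you expect before downloading it, you can list its contents with the standard `zipfile` module. This is a minimal sketch that builds a throwaway corpus in a temporary directory; the `demo-corpus` name and the placeholder texts are invented for the example.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny placeholder corpus.
    corpus_dir = os.path.join(tmp, "demo-corpus")
    os.makedirs(corpus_dir)
    for i in range(2):
        with open(os.path.join(corpus_dir, f"text-{i}.txt"), "w", encoding="utf8") as f:
            f.write(f"placeholder text {i}")

    # make_archive appends .zip to the base name and returns the archive path.
    archive_path = shutil.make_archive(os.path.join(tmp, "demo-corpus"), "zip", corpus_dir)

    # List the text files stored in the archive.
    with zipfile.ZipFile(archive_path) as zf:
        archived = sorted(n for n in zf.namelist() if n.endswith(".txt"))

print(archived)
```

For your own corpus, opening `corpus_file_name` with `zipfile.ZipFile` and printing `namelist()` gives the same check.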