
# Using OpenAI's API

In [None]:
from openai import OpenAI

secret_token = "" # copy your API key token here!

client = OpenAI(api_key=secret_token)
model = "gpt-4o-mini"


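def estimate_cost(usage):
  # Prices are in USD per 1 million tokens for gpt-4o-mini:
  # https://platform.openai.com/docs/pricing (accessed 16.12.2025)
  input_cost = usage.prompt_tokens / 1000000 * 0.15  # tokens sent in the prompt
  output_cost = usage.completion_tokens / 1000000 * 0.60  # tokens generated by the model
  return input_cost + output_cost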

def generate(prompt):
  chat_completion = client.chat.completions.create(messages=[{"role": "user", "content": prompt}], model=model)
  cost = estimate_cost(chat_completion.usage)
  print("User:", prompt)
  print("Assistant:", chat_completion.choices[0].message.content)
  print(f"Cost: {cost:.6f}$")
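
Pasting the key directly into the notebook makes it easy to share it by accident. As a minimal sketch of an alternative, the hypothetical helper `load_api_key` below (not part of the original notebook) reads the key from the `OPENAI_API_KEY` environment variable and only falls back to an interactive prompt if the variable is missing.

```python
import os
from getpass import getpass


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or ask for it interactively.

    `load_api_key` is a hypothetical helper for illustration only.
    """
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed key, so it does not end up in the cell output
        key = getpass("Paste your OpenAI API key: ")
    return key
```

With this in place, the client could be created as `client = OpenAI(api_key=load_api_key())`, so the notebook itself never contains the key.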


In [None]:
my_prompt = "Translate into Finnish: Please write your own prompt here!"
generate(my_prompt)

User: Translate into Finnish: Please write your own prompt here!
Assistant: Ole hyvä ja kirjoita oma pyyntösi tähän!
Cost: 0.000010$
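
To see how the cost figure is computed without calling the API, the sketch below feeds `estimate_cost` a stand-in usage object. The token counts (1000 prompt tokens, 500 completion tokens) are made up for illustration and are not taken from a real response. The function is repeated here so that the example runs on its own.

```python
from types import SimpleNamespace


def estimate_cost(usage):
    # Same gpt-4o-mini prices as in the notebook (USD per 1 million tokens)
    input_cost = usage.prompt_tokens / 1_000_000 * 0.15
    output_cost = usage.completion_tokens / 1_000_000 * 0.60
    return input_cost + output_cost


# Hypothetical token counts, only to illustrate the arithmetic
fake_usage = SimpleNamespace(prompt_tokens=1000, completion_tokens=500)

# 1000 / 1e6 * 0.15 = 0.00015 and 500 / 1e6 * 0.60 = 0.00030
print(f"Cost: {estimate_cost(fake_usage):.6f}$")
```

Output tokens are four times as expensive as input tokens for this model, so long answers dominate the cost even when the prompt is short.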


# Analyzing a document collection using LLMs

Here is an example of how to use LLMs to analyze a document collection. We will take a few paragraphs from Project Gutenberg books and translate them into Finnish (or any other language).
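
The core idea is that the same instruction is placed in front of every document, and each combination is sent as a separate request. The sketch below shows only that prompt-building step, with no API call. The helper `build_prompts` and the short `sample_docs` list are hypothetical and used only for illustration.

```python
def build_prompts(prompt, dataset):
    """Prepend the same instruction to every document (illustrative helper)."""
    return [prompt + document for document in dataset]


# Shortened stand-ins for the book samples, to keep the example small
sample_docs = ["Call me Ishmael.", "I am by birth a Genevese."]

prompts = build_prompts("Translate into Finnish: ", sample_docs)
for p in prompts:
    print(p)
```

Printing the prompts before sending them is a cheap way to check that the instruction and the document are joined as intended, for example that there is a space or a colon between them.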

In [None]:
# To avoid downloading data, samples from three books are copied here.
# Book 1: Frankenstein, Mary Wollstonecraft Shelley: https://www.gutenberg.org/cache/epub/84/pg84.txt (Beginning of Chapter 1)
# Book 2: Moby Dick, Herman Melville: https://www.gutenberg.org/cache/epub/2701/pg2701.txt (Beginning of Chapter 1)
# Book 3: Pride and Prejudice, Jane Austen: https://www.gutenberg.org/cache/epub/1342/pg1342.txt (Beginning of Chapter 1)

books = ["I am by birth a Genevese, and my family is one of the most distinguished of that republic. My ancestors had been for many years counsellors and syndics, and my father had filled several public situations with honour and reputation. He was respected by all who knew him for his integrity and indefatigable attention to public business. He passed his younger days perpetually occupied by the affairs of his country; a variety of circumstances had prevented his marrying early, nor was it until the decline of life that he became a husband and the father of a family.",
         "Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation.",
         "It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters."]

def generate_dataset(prompt, dataset):
  total_cost = 0
  generated_outputs = []
  for document in dataset:
    # Each request contains the instruction followed by one document
    prompt_with_document = prompt + document
    chat_completion = client.chat.completions.create(messages=[{"role": "user", "content": prompt_with_document}], model=model)
    # Accumulate the cost and keep the generated text for printing later
    total_cost += estimate_cost(chat_completion.usage)
    generated_outputs.append(chat_completion.choices[0].message.content)
  print(f"Generation done, analyzed {len(generated_outputs)} documents.")
  print(f"Total cost: {total_cost:.6f}$\n")
  for output in generated_outputs:
    print(output, "\n")

# call the function
my_prompt = "Translate into Finnish: "
generate_dataset(my_prompt, books)


Generation done, analyzed 3 documents.
Total cost: 0.000268$

Here is the translation into Finnish:

"Olen syntynyt geneveläinen, ja perheeni on yksi tämän tasavallan arvostetuimmista. Esivanhempani olivat olleet monta vuotta neuvonantajia ja syndikkejä, ja isäni oli täyttänyt useita julkisia tehtäviä kunnialla ja hyvässä maineessa. Häntä kunnioitettiin kaikilla, jotka tunsivat hänet, hänen integriteettinsä ja väsymätön huomionsa julkisiin asioihin vuoksi. Hän vietti nuoruutensa jatkuvasti kansansa asioiden parissa; monenlaiset olosuhteet estivät häntä menemästä naimisiin varhain, eikä hänestä tullut aviomiestä tai perheenisää ennen elämänsä ehtoopuolta." 

Sure! Here is the translation into Finnish:

"Puhelkaa minua Ishmaeliksi. Muutamia vuosia sitten—älkää kysykö kuinka pitkään—kun kukkarossani ei ollut juuri ollenkaan rahaa, enkä tiennyt mitään erityistä, mikä kiinnostaisi minua maalla, ajattelin, että purjehtisin vähän ja näkisin maailman vesisiä osia. Se on tapa, jolla karkotan al

# Exercises

1) Use `generate("Put your own prompt here")` to test LLM generation. Try it with different prompts and languages. You can ask anything you like.

2) Use `generate_dataset("Put your own prompt here", books)` to test how an LLM can be applied to a small dataset of 3 book samples. Note that the function combines the given prompt and each book, so each time the LLM will see your instruction + one book text. Try different prompts. In addition to translation, you can try e.g. "list all verbs", "convert text to uppercase", "who is the author of the given book", "simplify language", etc.

3) Extra exercise: The full book "Pride and Prejudice" has about 175,000 tokens (based on the `gpt-4o-mini` tokenizer; note that these differ from linguistic tokens or words). Using the pricing table, estimate the cost of running the full book through `gpt-4o-mini`, for example to convert it to uppercase. Assume that the prompt has 175,000 tokens and that the generated output has the same 175,000 tokens. What would the cost be with `gpt-5.2`?

In [None]:
# Exercise 3:

# gpt-4o-mini
input_cost = 175000 / 1000000 * 0.15
output_cost = 175000 / 1000000 * 0.60
print(f"gpt-4o-mini input cost: {input_cost:.6f}$")
print(f"gpt-4o-mini output cost: {output_cost:.6f}$")
print(f"Total: {input_cost+output_cost:.6f}$\n")

# gpt-5.2
input_cost = 175000 / 1000000 * 1.75
output_cost = 175000 / 1000000 * 14.0
print(f"gpt-5.2 input cost: {input_cost:.6f}$")
print(f"gpt-5.2 output cost: {output_cost:.6f}$")
print(f"Total: {input_cost+output_cost:.6f}$\n")

# gpt-5.2-pro (very expensive model)
input_cost = 175000 / 1000000 * 21.0
output_cost = 175000 / 1000000 * 168.0
print(f"gpt-5.2-pro input cost: {input_cost:.6f}$")
print(f"gpt-5.2-pro output cost: {output_cost:.6f}$")
print(f"Total: {input_cost+output_cost:.6f}$\n")

gpt-4o-mini input cost: 0.026250$
gpt-4o-mini output cost: 0.105000$
Total: 0.131250$

gpt-5.2 input cost: 0.306250$
gpt-5.2 output cost: 2.450000$
Total: 2.756250$

gpt-5.2-pro input cost: 3.675000$
gpt-5.2-pro output cost: 29.400000$
Total: 33.075000$
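
The same calculation can be written once and reused for any model. The sketch below is one possible way to do that: the prices are the ones used in the answer cell above (USD per 1 million tokens), while the names `PRICES_PER_MILLION` and `full_book_cost` are hypothetical and introduced only for this example.

```python
# (input price, output price) in USD per 1 million tokens, as used above
PRICES_PER_MILLION = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-5.2": (1.75, 14.0),
    "gpt-5.2-pro": (21.0, 168.0),
}


def full_book_cost(model_name, input_tokens, output_tokens):
    """Estimate the total cost in USD for one request (illustrative helper)."""
    input_price, output_price = PRICES_PER_MILLION[model_name]
    input_cost = input_tokens / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    return input_cost + output_cost


for name in PRICES_PER_MILLION:
    print(f"{name} total: {full_book_cost(name, 175_000, 175_000):.6f}$")
```

Keeping the prices in one dictionary means that a price change or a new model needs only a single edit, and the formula is not copied three times.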

