<a href="https://colab.research.google.com/github/TurkuNLP/DIGHT25/blob/main/03_summaries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DSPy tutorial part 3

* Works with historical news from the HMD (Heritage Made Digital) project
* The full dataset is massive, so I prepared a small sample in advance
* Here we ask the model for multiple outputs (topics and a summary) for each text


In [1]:
!pip3 install -q dspy
!wget https://github.com/TurkuNLP/DIGHT25/raw/refs/heads/main/hmd_newspapers_texts.json

#Get the API key
!wget -O api-key.txt http://epsilon-it.utu.fi/dight-api-key-1.txt
api_key=open("api-key.txt").read().strip()

#Backup option:
#api_key="sk_...."

--2025-08-28 08:05:39--  https://github.com/TurkuNLP/DIGHT25/raw/refs/heads/main/hmd_newspapers_texts.json
Resolving github.com (github.com)... 140.82.114.4
Co
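
The first cell reads the key from a downloaded file and mentions a hard-coded backup. A minimal sketch of a slightly more defensive loader follows. The helper name `load_api_key` and the `OPENAI_API_KEY` environment variable fallback are assumptions for illustration, not part of the original notebook.

```python
import os
from pathlib import Path


def load_api_key(path="api-key.txt", env_var="OPENAI_API_KEY"):
    """Return the key stored in `path` if present, else the value of `env_var`.

    Hypothetical helper: both the function name and the environment
    variable fallback are illustrative assumptions.
    """
    key_file = Path(path)
    if key_file.is_file():
        key = key_file.read_text(encoding="utf-8").strip()
        if key:
            return key
    # Fall back to the environment when the file is missing or empty
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"No API key found in {path} or in ${env_var}")
    return key
```

With this in place, `api_key = load_api_key()` could stand in for the `open(...).read().strip()` line, and a failed download would raise a clear error instead of passing an empty key to the model.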

In [2]:
import json
import random

import dspy
import pandas as pd

INPUT_JSON = "hmd_newspapers_texts.json"
OUTPUT_XLSX = "news_topics_summary.xlsx"

# --- DSPy signature (class-based), no module class ---
class TopicAndSummary(dspy.Signature):
    """Read a historical news snippet and return:
    - topics: up to 3 concise topics (1–3 words each), comma-separated
    - summary: a brief, neutral summary (1–2 sentences)
    Example topics: crime, politics, war, economy, agriculture, culture, religion, science, health, local news.
    """
    text = dspy.InputField(desc="Historical news")
    topics = dspy.OutputField(desc="Comma-separated topics (≤3)")
    summary = dspy.OutputField(desc="Brief summary (1–2 sentences)")

# --- configure DSPy (adjust model / API as you use elsewhere) ---
lm = dspy.LM("openai/gpt-4.1-mini", api_key=api_key)
dspy.configure(lm=lm)
predict = dspy.Predict(signature=TopicAndSummary)

# --- load & cut down ---
with open(INPUT_JSON, encoding="utf-8") as f:
    texts = json.load(f)
random.shuffle(texts)
texts = texts[:15]  # small subset for the demo/class
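
The shuffle-and-slice above gives a different subset on every run, which makes it hard to compare results across a class. One possible variation, sketched here, draws the subset from a seeded generator so the same 15 texts come back each time. The helper name `sample_texts` and the seed value are assumptions for illustration.

```python
import random


def sample_texts(texts, k=15, seed=42):
    """Return up to k pseudo-randomly chosen items without modifying `texts`."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return rng.sample(texts, min(k, len(texts)))


# Stand-in data; in the notebook this would be the list loaded from INPUT_JSON
demo = [f"article {n}" for n in range(100)]
subset = sample_texts(demo)
```

Unlike `random.shuffle`, this leaves the original list in its loaded order, so a second, larger sample can be drawn later from the same data.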



In [3]:
# --- run extraction ---
rows = []
for i, t in enumerate(texts, 1):
    out = predict(text=t)
    topics = out.topics.strip()
    summary = out.summary.strip()

    # print to console (compact)
    print(f"--- {i} ---")
    preview = t[:200].replace("\n", " ")
    print("TEXT:", preview + ("..." if len(t) > 200 else ""))
    print("TOPICS:", topics)
    print("SUMMARY:", summary)
    print()

    rows.append({"text": t, "topics": topics, "summary": summary})

# --- write Excel (text, topics, summary) ---
df = pd.DataFrame(rows, columns=["text", "topics", "summary"])
# Writing .xlsx needs openpyxl (or xlsxwriter); if neither is installed, run: pip install openpyxl
df.to_excel(OUTPUT_XLSX, index=False)

print(f"Wrote {len(df)} rows to {OUTPUT_XLSX}")

--- 1 ---
TEXT: DOCK COMMITTEE.  The proceedings of this committee were read, awl, after some discussion, on the motion of Mr. C. TURNER, confirmed.  COURTS LAW AED ST. GEORGE'S HALL COMMTTTRE.  The minutes of this c...
TOPICS: local government, public facilities, health committee
SUMMARY: The Dock Committee's proceedings were confirmed after discussion, and the Courts Law and St. George's Hall Committee resolved to open the hall daily to the public with a new organist appointed. The Council expressed concerns about the eastern approach's condition but confirmed the proceedings of the Markets and Health Committees despite some debate.

--- 2 ---
TEXT: COMPLETION OF ALTERATIONS. WATERPROOF, AIRPROOF, INDIARUBBER, AND GIJTTA PERCHA DEPOT, 2 and 3, QUADRANT-BUILDINGS, LIME-STREET. HELLEWELL, in announcing the completion S • of the extensive Alteration...
TOPICS: commerce, manufacturing, health
SUMMARY: Hellwell announces the completion of extensive alterations at their waterproof, airproo
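
The `topics` field comes back as one comma-separated string, so counting topics across articles needs a small parsing step. The sketch below shows one way to do that with the standard library. The helper `parse_topics` is hypothetical, and lowercasing and dropping duplicates are assumptions about how you might want to normalize the model output.

```python
from collections import Counter


def parse_topics(raw, max_topics=3):
    """Split a comma-separated topic string into a cleaned list of at most max_topics items."""
    topics = []
    for part in raw.split(","):
        topic = part.strip().lower()
        if topic and topic not in topics:  # drop empties and duplicates, keep order
            topics.append(topic)
    return topics[:max_topics]


# Topic strings copied from the first two printed results
raw_outputs = [
    "local government, public facilities, health committee",
    "commerce, manufacturing, health",
]
counts = Counter(t for raw in raw_outputs for t in parse_topics(raw))
print(counts.most_common(3))
```

In the notebook, `raw_outputs` could be replaced by `df["topics"]` to see which topics dominate the sample.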

# Further exercises:

* If you finish early, go back to task 2 and try to replicate the OCR error correction idea here. Do you see any improvement in the results?
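
To judge whether a correction step helped, it can be useful to see exactly which words it changed. The sketch below compares two versions of a text word by word with the standard library's `difflib`. The helper `changed_words` is hypothetical, and the "corrected" string is hand-written for illustration rather than actual model output.

```python
import difflib


def changed_words(original, corrected):
    """Return (old, new) word-span pairs that differ between the two texts."""
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# OCR text from the first article, and a hand-corrected version (assumption)
original = "The proceedings of this committee were read, awl, after some discussion"
corrected = "The proceedings of this committee were read, and, after some discussion"
print(changed_words(original, corrected))
```

Running the same comparison on the model's corrected text would show whether it only repairs OCR errors or also rewrites wording that was already correct.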