<a href="https://colab.research.google.com/github/TurkuNLP/DIGHT25/blob/main/03_summaries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DSPy tutorial part 3

* Works with historical newspaper texts from the Heritage Made Digital (HMD) project
* Uses a small pre-prepared sample, because the full dataset is massive


In [7]:
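# Install DSPy and download the pre-prepared sample of HMD newspaper texts.
# Note: wget never overwrites. If the file already exists, the new copy is saved as
# hmd_newspapers_texts.json.N and the code below keeps reading the older
# hmd_newspapers_texts.json. Add `-O hmd_newspapers_texts.json` to overwrite instead.
!pip3 install -q dspy
!wget https://github.com/TurkuNLP/DIGHT25/raw/refs/heads/main/hmd_newspapers_texts.json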

--2025-08-27 16:17:36--  https://github.com/TurkuNLP/DIGHT25/raw/refs/heads/main/hmd_newspapers_texts.json
Resolving github.com (github.com)... 140.82.116.3
Connecting to github.com (github.com)|140.82.116.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/TurkuNLP/DIGHT25/refs/heads/main/hmd_newspapers_texts.json [following]
--2025-08-27 16:17:36--  https://raw.githubusercontent.com/TurkuNLP/DIGHT25/refs/heads/main/hmd_newspapers_texts.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 785985 (768K) [text/plain]
Saving to: ‘hmd_newspapers_texts.json.3’


2025-08-27 16:17:36 (20.7 MB/s) - ‘hmd_newspapers_texts.json.3’ saved [785985/785985]



In [8]:
from google.colab import userdata
import json, dspy
import pandas as pd
import random

INPUT_JSON = "hmd_newspapers_texts.json"
OUTPUT_XLSX = "news_topics_summary.xlsx"

# --- DSPy signature (class-based); passed straight to dspy.Predict, no custom dspy.Module ---
class TopicAndSummary(dspy.Signature):
    """Read a historical news snippet and return:
    - topics: up to 3 concise topics (1–3 words each), comma-separated
    - summary: a brief, neutral summary (1–2 sentences)
    Example topics: crime, politics, war, economy, agriculture, culture, religion, science, health, local news.
    """
    text: str = dspy.InputField(desc="Historical news")
    topics: str = dspy.OutputField(desc="Comma-separated topics (≤3)")
    summary: str = dspy.OutputField(desc="Brief summary (1–2 sentences)")

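# --- configure DSPy (adjust the model name and API key to match your own setup) ---
# userdata.get reads a Colab secret; this assumes one named "openai-api-key" exists.
lm = dspy.LM("openai/gpt-4.1-mini", api_key=userdata.get("openai-api-key"))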
dspy.configure(lm=lm)
predict = dspy.Predict(signature=TopicAndSummary)

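# --- load & cut down ---
# The JSON file holds a list of plain-text strings, one per news item.
with open(INPUT_JSON, encoding="utf-8") as f:
    texts = json.load(f)
random.shuffle(texts)  # unseeded, so each run picks a different subset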
texts = texts[:15]  # small subset for the demo/class
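
Because the shuffle above is unseeded, every run summarises a different set of 15 texts. If you want the same subset each time (for example, so everyone in a class sees the same output), one option is a seeded generator. The following is a minimal sketch, not part of the original notebook: `sample_texts`, its `seed` default, and the toy `demo` list are all hypothetical names introduced for illustration.

```python
import random

def sample_texts(texts, k=15, seed=42):
    """Return up to k texts chosen reproducibly, leaving the input list untouched."""
    rng = random.Random(seed)   # private generator; the global random state is not affected
    k = min(k, len(texts))      # avoid a ValueError when fewer than k texts exist
    return rng.sample(texts, k)

# Toy stand-in for the list loaded from the JSON file (hypothetical data)
demo = [f"article {n}" for n in range(100)]
subset_a = sample_texts(demo)
subset_b = sample_texts(demo)
print(len(subset_a), subset_a == subset_b)  # prints: 15 True
```

In the notebook you could replace the shuffle-and-slice pair with `texts = sample_texts(texts)`; changing `seed` gives a different but still repeatable subset.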



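The signature asks the model for `topics` as a single comma-separated string. If you later want to work with individual topics (for instance, to count how often each one occurs), the string has to be split and cleaned first. Here is a minimal sketch under that assumption; `parse_topics` and `raw_outputs` are hypothetical names, and the third example string is made up to show how messy output would be handled.

```python
from collections import Counter

def parse_topics(raw, max_topics=3):
    """Split a comma-separated topics string into a cleaned list of at most max_topics items."""
    seen = []
    for part in raw.split(","):
        topic = part.strip().lower()
        if topic and topic not in seen:  # drop empty items and duplicates, keep order
            seen.append(topic)
    return seen[:max_topics]

# Strings in the shape the signature requests; the last one is a made-up messy case
raw_outputs = [
    "fashion, retail, millinery",
    "local election, politics, municipal government",
    "Politics, , war, economy, crime",
]
counts = Counter(t for raw in raw_outputs for t in parse_topics(raw))
print(parse_topics(raw_outputs[2]))  # prints: ['politics', 'war', 'economy']
print(counts["politics"])            # prints: 2
```

Lower-casing and truncating are design choices for this sketch: they keep counts comparable across rows and enforce the three-topic limit even when the model returns more.
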
In [9]:
# --- run extraction ---
rows = []
for i, t in enumerate(texts, 1):
    out = predict(text=t)
    topics = out.topics.strip()
    summary = out.summary.strip()

    # print to console (compact)
    print(f"--- {i} ---")
    preview = t[:200].replace("\n", " ")
    print("TEXT:", preview + ("..." if len(t) > 200 else ""))
    print("TOPICS:", topics)
    print("SUMMARY:", summary)
    print()

    rows.append({"text": t, "topics": topics, "summary": summary})

# --- write Excel (text, topics, summary) ---
df = pd.DataFrame(rows, columns=["text", "topics", "summary"])
# to_excel needs an Excel writer engine (openpyxl or xlsxwriter); pip-install one if it is missing.
df.to_excel(OUTPUT_XLSX, index=False)

print(f"Wrote {len(df)} rows to {OUTPUT_XLSX}")

--- 1 ---
TEXT: BONNETS ! BONNETS ! !  MR. GEORGE JONES has ready for INSPEC- TION all the New Styles in SILK, SATIN, VELVET, CRAPE, STRAW, and FANCY BONNETS, at very Moderate Prices. 55 and 57, GREAT CHARLOTTE-STREE...
TOPICS: fashion, retail, millinery
SUMMARY: Several merchants in Liverpool announce the arrival of new styles in bonnets, millinery, and dressmaking for the winter season of 1855, offering a variety of materials and Parisian fashions at moderate prices.

--- 2 ---
TEXT: CASTLE-STREET WARD,  This ward, which has always been considered "safe" by the Radicals, has been most unexpectedly wrested from them this year, the election having ousted Mr. Avison, the retiring mem...
TOPICS: local election, politics, municipal government
SUMMARY: In Castle-Street Ward, traditionally a Radical stronghold, the Conservative candidate Mr. J. G. Livingston won the election, defeating the retiring Radical member Mr. Avison by 48 votes. Mr. Livingston expressed gratitude to his supporters a