Imports & NLTK Setup

In [1]:
import os
import json
import nltk
from nltk.tokenize import sent_tokenize
from tqdm import tqdm
import pandas as pd

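# Download the punkt tokenizer if needed.
# Note: recent NLTK releases may look for 'punkt_tab' instead; if sent_tokenize
# raises a LookupError, download that resource as well.
nltk.download('punkt')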


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\roee1\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Define the Chunking Function

In [2]:
def chunk_by_sentences(text: str, chunk_size: int = 3):
    """
    Split `text` into non-overlapping chunks of `chunk_size` sentences.
    The final chunk may contain fewer than `chunk_size` sentences.
    Returns a list of chunk strings.
    """
    sents = sent_tokenize(text)
    chunks = []
    for i in range(0, len(sents), chunk_size):
        chunk = " ".join(sents[i : i + chunk_size]).strip()
        if chunk:
            chunks.append(chunk)
    return chunks
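
To look at the grouping step in isolation, here is a minimal sketch that applies the same stride-based slicing to an already-split list of sentences, so it runs without the punkt data. The helper name `group_sentences` and the sample sentences are made up for this example and are not part of the notebook's pipeline.

```python
def group_sentences(sents, chunk_size=3):
    """Group a pre-split list of sentences into non-overlapping chunks.

    Hypothetical helper mirroring the loop in chunk_by_sentences; it skips
    tokenization so the slicing behaviour can be checked on its own.
    """
    chunks = []
    for i in range(0, len(sents), chunk_size):
        chunk = " ".join(sents[i : i + chunk_size]).strip()
        if chunk:
            chunks.append(chunk)
    return chunks


# Seven made-up sentences -> chunks of 3, 3 and 1 sentences
demo_sents = [f"Sentence {n}." for n in range(1, 8)]
demo_chunks = group_sentences(demo_sents)
for c in demo_chunks:
    print(c)
```

Because the stride equals `chunk_size`, no sentence appears in two chunks, and the last chunk simply holds whatever is left over.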


Quick Test on 5 Sample Articles

In [3]:
# Test chunking on the first 5 articles from the 10k sample
from itertools import islice

sample_chunks = []
with open("../data/sampled/10k_sample.jsonl", "r", encoding="utf-8", errors="ignore") as fin:
    # islice stops cleanly if the file has fewer than 5 lines
    for line in islice(fin, 5):
        art = json.loads(line)
        full_text = art.get("headline_summary", "") + " " + art.get("body", "")
        sample_chunks.append(chunk_by_sentences(full_text))

for i, chunks in enumerate(sample_chunks, start=1):
    print(f"Article {i} → {len(chunks)} chunks; first chunk preview:\n")
    print(chunks[0][:200] + "...\n")
    print("-" * 60)


Article 1 → 14 chunks; first chunk preview:

Agilent Technologies A is leaving no stone unturned to bolster its Diagnostics and Genomics Group (DGG) segment on the back of its strategic partnerships. This is evident from its latest distribution ...

------------------------------------------------------------
Article 2 → 14 chunks; first chunk preview:

Ziff Davis ZD reported adjusted earnings of $1.58 per share in third-quarter 2022, which met the Zacks Consensus Estimate and increased 6% year over year. Revenues totaled $341.9 million in the quarte...

------------------------------------------------------------
Article 3 → 12 chunks; first chunk preview:

In the latest market close, Alcoa (AA) reached $26.86, with a +1.59% movement compared to the previous day. The stock exceeded the S&P 500, which registered a gain of 0.38% for the day. On the other h...

------------------------------------------------------------
Article 4 → 10 chunks; first chunk preview:

It’s the unofficial st

Process All 10 000 Articles & Save Counts

In [4]:
# Ensure the data/sampled folder exists
os.makedirs("../data/sampled", exist_ok=True)

counts = []
with open("../data/sampled/10k_sample.jsonl", "r", encoding="utf-8", errors="ignore") as fin:
    for line in tqdm(fin, total=10000):
        art = json.loads(line)
        full_text = art.get("headline_summary", "") + " " + art.get("body", "")
        counts.append(len(chunk_by_sentences(full_text)))

# Compute and display summary stats
# (naming the Series gives the CSV a meaningful column header instead of "0")
counts_series = pd.Series(counts, name="num_chunks")
print("Average #chunks/article:", counts_series.mean())
print("Std dev of #chunks/article:", counts_series.std())

# Save to CSV
counts_series.to_csv("../data/sampled/chunks_per_article.csv", index=False)
print("✔ Saved chunk counts to data/sampled/chunks_per_article.csv")


100%|██████████| 10000/10000 [00:07<00:00, 1298.01it/s]

Average #chunks/article: 12.8712
Std dev of #chunks/article: 18.854764935582665
✔ Saved chunk counts to data/sampled/chunks_per_article.csv
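
The standard deviation (about 18.85) is larger than the mean (about 12.87), which for non-negative counts usually points to a right-skewed distribution with a small number of very long articles. In that situation the median and an upper quantile can describe a typical article better than the mean. The sketch below shows the idea with the standard library only; the `toy_counts` values are invented for illustration and are not taken from the dataset.

```python
import statistics

# Invented chunk counts: mostly short articles plus one very long outlier
toy_counts = [4, 5, 6, 6, 7, 8, 9, 10, 12, 95]

mean = statistics.mean(toy_counts)
median = statistics.median(toy_counts)
stdev = statistics.stdev(toy_counts)  # sample std dev, like pandas' default (ddof=1)
# quantiles(n=10) returns the 9 cut points between deciles; the last is the 90th percentile
p90 = statistics.quantiles(toy_counts, n=10)[-1]

print("mean:", mean)
print("median:", median)
print("std dev:", round(stdev, 2))
print("90th percentile:", round(p90, 2))
```

On the real data, `counts_series.describe()` would report the comparable quartiles in one call.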



