# Parsing

Apparently, most financial news websites **do not provide full or free APIs**.  
Some sites have **no public API at all**, and others place most of their content **behind paywalls**.

Aggregator tools exist, though; in this notebook I test [NewsAPI.org](https://newsapi.org/).
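The cells below use `API_KEY` (and later `my_gpt_key`) without showing where they are defined. A minimal sketch of one way to supply them, assuming the keys live in environment variables; the helper `load_key` and the variable names `NEWSAPI_KEY` and `OPENAI_API_KEY` are my assumptions, not part of the original notebook:

```python
import os


def load_key(env_name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(env_name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {env_name} environment variable first.")
    return value


# Hypothetical variable names; adjust them to your own setup.
# API_KEY = load_key("NEWSAPI_KEY")
# my_gpt_key = load_key("OPENAI_API_KEY")
```

Reading keys from the environment keeps them out of the notebook file itself, which matters if the notebook is shared or committed.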

# Testing NewsAPI.org to aggregate news

## How many sources can it aggregate for economic news from the US?

In [2]:
import requests
import pandas as pd

# API_KEY (your NewsAPI.org key) must be defined before running this cell
url = f"https://newsapi.org/v2/top-headlines/sources?apiKey={API_KEY}"

response = requests.get(url)
data = response.json()

if response.status_code != 200:
    print("Error:", data)
else:
    sources = data.get("sources", [])
    df = pd.DataFrame([{
        "id": s["id"],
        "name": s["name"],
        "category": s["category"],
        "language": s["language"],
        "country": s["country"],
        "url": s["url"],
        "description": s["description"]
    } for s in sources])

    # Filter for US + English
    df = df[(df["country"] == "us") & (df["language"] == "en")]

    # Keep relevant categories
    df = df[df["category"].isin(["business", "general", "technology", "science"])]

    # Tag likely economic/financial outlets by keyword match on the description
    relevance_keywords = [
        "finance", "business", "market", "money",
        "economy", "financial", "investment", "stock"
    ]

    df["is_economic"] = df["description"].str.contains(
        "|".join(relevance_keywords), case=False, na=False
    )

    econ_df = df[df["is_economic"]].copy()

    # Sort by category for overview
    econ_df = econ_df.sort_values(["category", "name"])
    econ_df = econ_df.reset_index(drop=True)

In [3]:
econ_df['id']

Unnamed: 0,id
0,bloomberg
1,business-insider
2,fortune
3,associated-press
4,fox-news
5,nbc-news
6,newsweek
7,reuters
8,the-washington-post
9,hacker-news
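The keyword filter in the cell above matches substrings, so a keyword like "market" would also match inside "marketing". A minimal sketch of a stricter, word-boundary variant using only the standard library; the helper name `is_economic` is hypothetical, and applying it may select a different set of sources than the list shown above:

```python
import re

relevance_keywords = [
    "finance", "business", "market", "money",
    "economy", "financial", "investment", "stock",
]

# \b anchors the match at word boundaries; the optional "s" still allows
# simple plurals such as "markets" or "stocks".
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, relevance_keywords)) + r")s?\b",
    re.IGNORECASE,
)


def is_economic(description):
    """Return True if the description contains a keyword as a whole word."""
    if not description:
        return False
    return pattern.search(description) is not None
```

With pandas this could be applied as `df["description"].apply(is_economic)` in place of the `str.contains` call.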


# Twelve sources - not bad

## How many news items for today can it aggregate from these sources?

In [4]:
from datetime import datetime, timezone, timedelta

# 🗓️ Get today's date in UTC (ISO 8601 compatible)
today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
print(today)

2025-11-03


In [5]:
# 🗞️ Collect all articles from your econ_df
all_articles = []

print(f"Fetching today's news ({today}) from {len(econ_df)} economic/financial sources...\n")

for source_id, source_name in zip(econ_df["id"], econ_df["name"]):
    url = (
        f"https://newsapi.org/v2/top-headlines?"
        f"sources={source_id}&"
        f"language=en&"
        f"pageSize=100&"
        f"apiKey={API_KEY}"
    )

    response = requests.get(url, timeout=10)  # avoid hanging on a stalled connection
    data = response.json()

    if response.status_code != 200:
        print(f"❌ {source_name} ({source_id}): Error {response.status_code} → {data.get('message')}")
        continue

    articles = data.get("articles", [])
    print(f"✅ {source_name}: {len(articles)} articles")

    for a in articles:
        all_articles.append({
            "source_id": source_id,
            "source_name": source_name,
            "title": a["title"],
            "description": a["description"],
            "url": a["url"],
            "publishedAt": a["publishedAt"]
        })

Fetching today's news (2025-11-03) from 12 economic/financial sources...

✅ Bloomberg: 10 articles
✅ Business Insider: 10 articles
✅ Fortune: 10 articles
✅ Associated Press: 10 articles
✅ Fox News: 10 articles
✅ NBC News: 10 articles
✅ Newsweek: 10 articles
✅ Reuters: 0 articles
✅ The Washington Post: 10 articles
✅ Hacker News: 10 articles
✅ The Next Web: 7 articles
✅ Wired: 10 articles
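The loop above indexes each article directly (`a["title"]`, `a["description"]`, and so on), which would raise a `KeyError` if a field were ever absent. The notebook does not establish whether NewsAPI omits fields, so the following is only a defensive sketch; the helper name `flatten_articles` and the sample payload are hypothetical, not real API output:

```python
def flatten_articles(payload, source_id, source_name):
    """Turn one top-headlines JSON payload into a list of flat row dicts."""
    rows = []
    # "or []" also covers a payload where "articles" is present but None
    for article in payload.get("articles") or []:
        rows.append({
            "source_id": source_id,
            "source_name": source_name,
            "title": article.get("title"),
            "description": article.get("description"),
            "url": article.get("url"),
            "publishedAt": article.get("publishedAt"),
        })
    return rows


# Hand-written example payload with some fields missing
sample_payload = {
    "status": "ok",
    "articles": [{"title": "Example headline", "url": "https://example.com/a"}],
}
rows = flatten_articles(sample_payload, "example-source", "Example Source")
```

Missing fields come back as `None`, which the later `fillna("")` calls on `title` and `description` already handle.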


In [6]:
if all_articles:
    news_df = pd.DataFrame(all_articles)

    # ✅ robust timestamp parsing
    news_df["publishedAt"] = pd.to_datetime(news_df["publishedAt"], format="ISO8601", errors="coerce")

    news_df = news_df.sort_values("publishedAt", ascending=False).reset_index(drop=True)

    print(f"\n✅ Total articles collected: {len(news_df)} from {news_df['source_name'].nunique()} sources.")
else:
    print("\n⚠️ No articles found for today.")


✅ Total articles collected: 107 from 11 sources.


In [7]:
news_df.head()

Unnamed: 0,source_id,source_name,title,description,url,publishedAt
0,associated-press,Associated Press,Israel says the Red Cross has received the rem...,Israel says the Red Cross has received the rem...,https://apnews.com/article/israel-hamas-hostag...,2025-11-02 18:41:37+00:00
1,bloomberg,Bloomberg,Tesla Owner Complaints Rise in US Probe Over I...,The US auto safety regulator investigating whe...,https://www.bloomberg.com/news/articles/2025-1...,2025-11-02 18:15:20+00:00
2,fox-news,Fox News,Mamdani's socialist allies embrace watchdog's ...,Socialist organizers praised an anti-Mamdani g...,https://www.foxnews.com/politics/mamdanis-soci...,2025-11-02 17:22:24.272071+00:00
3,associated-press,Associated Press,Trump says China's Xi has assured him that he ...,President Donald Trump says Chinese President ...,https://apnews.com/article/trump-xi-china-taiw...,2025-11-02 17:15:45+00:00
4,fox-news,Fox News,ESPN broadcasters roast Oklahoma kicker for we...,Oklahoma Sooners kicker Tate Sandell was the b...,https://www.foxnews.com/sports/espn-broadcaste...,2025-11-02 17:07:24.538435400+00:00


## Some articles are clearly relevant already; others are general or local news

# Analysing the aggregated news

## News categories defined by GPT model (using fixed topic list)

### Fixed categories

In [8]:
# possible topics
topics = [
    "Monetary Policy",
    "Inflation",
    "Labor Market",
    "Corporate Earnings",
    "Stock Markets",
    "Energy Prices",
    "Real Estate",
    "Trade and Geopolitics",
    "Fiscal Policy",
    "Consumer Spending",
    "Banking and Credit",
    "Technology and Innovation",
    "Commodities",
    "Financial Regulation",
    "Auto Industry Issues", # added after check
    "Government Shutdown Effects"  # added after check
]

In [9]:
from openai import OpenAI
from tqdm import tqdm
import pandas as pd

client = OpenAI(api_key=my_gpt_key)

df = news_df.copy()
texts = (
    df["title"].fillna("") + ". " + df["description"].fillna("")
).tolist()

print(f"Using fixed {len(topics)} categories:\n{topics}\n")

# --- classification function ---
def classify_article(text, topics):
    prompt = f"""
You are an economics analyst.
Assign the following news item to the *closest* matching category from this list:
{', '.join(topics)}.

If it is not economic, financial, or business related, respond exactly with: "Other".
You must choose **exactly one** from the list if possible.

News:
{text}

Return only the category name exactly as in the list or "Other".
"""
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # 👈 stability: makes output as repeatable as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content.strip()

# --- run classification ---
labels = []
for text in tqdm(texts, desc="Categorizing articles"):
    label = classify_article(text, topics)
    labels.append(label)

df["category"] = labels
print("✅ Classification complete.")

Using fixed 16 categories:
['Monetary Policy', 'Inflation', 'Labor Market', 'Corporate Earnings', 'Stock Markets', 'Energy Prices', 'Real Estate', 'Trade and Geopolitics', 'Fiscal Policy', 'Consumer Spending', 'Banking and Credit', 'Technology and Innovation', 'Commodities', 'Financial Regulation', 'Auto Industry Issues', 'Government Shutdown Effects']



Categorizing articles: 100%|██████████| 107/107 [00:52<00:00,  2.05it/s]

✅ Classification complete.
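The model's reply is stored as-is, so a label wrapped in quotes, or a variant such as "Not Economic", would slip past the later filter on "Other". A minimal sketch of a guard that maps each reply back onto the fixed list; the helper `normalize_label` and the short `example_topics` list are hypothetical names used for illustration:

```python
def normalize_label(raw, topics, fallback="Other"):
    """Map a raw model reply onto the fixed topic list, or return the fallback."""
    # Remove surrounding whitespace, quotes, asterisks and trailing periods
    cleaned = (raw or "").strip().strip("\"'*. ").casefold()
    # Case-insensitive lookup that returns the canonical spelling
    lookup = {topic.casefold(): topic for topic in topics}
    return lookup.get(cleaned, fallback)


example_topics = ["Monetary Policy", "Inflation", "Labor Market"]
cleaned_labels = [
    normalize_label(reply, example_topics)
    for reply in ['"Inflation".', "monetary policy", "Not Economic"]
]
```

In the notebook this could be applied as `df["category"] = [normalize_label(l, topics) for l in labels]`, so that anything outside the list ends up as "Other".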





### Do these categories reflect the news?

In [10]:
from openai import OpenAI
import re, ast
from datetime import datetime

client = OpenAI(api_key=my_gpt_key)

# --- use the same cleaned df you already have ---
sample_texts = (
    df["title"].fillna("") + ". " + df["description"].fillna("")
).tolist()[:50]  # sample first 50 items for context

# ===============================================================
# 1️⃣  GPT defines categories based on current news headlines
# ===============================================================
prompt_infer = f"""
You are an economics analyst.
Given these news headlines, identify 8–12 short, clear economic or financial categories
that best represent the topics covered. Use 2–3 word labels.

Headlines:
{chr(10).join(sample_texts)}

Return only a valid Python list of strings (no commentary).
"""

resp_infer = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[{"role": "user", "content": prompt_infer}],
)
raw_infer = resp_infer.choices[0].message.content.strip()

# --- clean and parse list ---
raw_infer = re.sub(r"^```[a-zA-Z]*", "", raw_infer).replace("```", "").strip()
try:
    inferred_categories = ast.literal_eval(raw_infer)
except Exception:
    inferred_categories = [c.strip() for c in raw_infer.split(",") if c.strip()]

print(f"\n🧩 GPT-inferred categories ({len(inferred_categories)}):\n{inferred_categories}\n")

# ===============================================================
# 2️⃣  Ask GPT to compare inferred vs. fixed lists
# ===============================================================
prompt_compare = f"""
Compare these two lists of economic categories.

Fixed categories:
{topics}

Newly inferred categories from data:
{inferred_categories}

Analyze whether the fixed list sufficiently covers the new ones.
If some inferred categories are missing or too distinct,
recommend which 1–3 additional categories (if any) should be added.

Be concise, and finish with a clear yes/no about whether to expand the list.
"""

resp_compare = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[{"role": "user", "content": prompt_compare}],
)
conclusion = resp_compare.choices[0].message.content.strip()

print("### 🧠 Category Coverage Analysis\n")
print(conclusion)


🧩 GPT-inferred categories (12):
['Geopolitical Tensions', 'Auto Industry Issues', 'Socialism vs Capitalism', 'Government Shutdown', 'Drug Policy', 'Election Dynamics', 'Food Assistance Programs', 'Market Reforms', 'Investment Strategies', 'Bond Market Trends', 'Academic Integrity', 'Natural Disasters']

### 🧠 Category Coverage Analysis

The fixed categories cover a broad range of economic topics, but there are some gaps when compared to the newly inferred categories. Specifically:

1. **Geopolitical Tensions** - While "Trade and Geopolitics" is included, the specific focus on geopolitical tensions is distinct and warrants its own category.
2. **Socialism vs Capitalism** - This ideological debate is not explicitly covered in the fixed list and is significant in economic discussions.
3. **Government Shutdown** - While "Government Shutdown Effects" is included, it may be beneficial to have a separate category for the event itself.
4. **Drug Policy** - This is not addressed in the f
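The list-parsing step in the cell above (strip code fences, try `ast.literal_eval`, fall back to splitting on commas) can be wrapped in one reusable function. A minimal sketch, assuming the reply contains at most one bracketed list; the helper name `parse_category_list` is hypothetical, and unlike the original fallback it also strips leftover quotes and brackets:

```python
import ast
import re


def parse_category_list(raw):
    """Parse a model reply that should contain a Python list of strings."""
    # Drop optional Markdown code-fence markers (three backticks plus a language tag)
    text = re.sub(r"`{3}[a-zA-Z]*", "", raw).strip()
    # Try the outermost [...] span first, ignoring any commentary around it
    start, end = text.find("["), text.rfind("]")
    if start != -1 and end > start:
        try:
            parsed = ast.literal_eval(text[start:end + 1])
            if isinstance(parsed, (list, tuple)):
                return [str(item).strip() for item in parsed if str(item).strip()]
        except (ValueError, SyntaxError):
            pass  # fall through to the plain-text fallback
    # Fallback: split on commas or newlines, then strip quotes and brackets
    parts = [p.strip(" \t'\"[]") for p in re.split(r"[,\n]", text)]
    return [p for p in parts if p]
```

Catching only `ValueError` and `SyntaxError` keeps unrelated failures visible instead of silently falling back.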

## Summary of economic news

In [11]:
df = df[df["category"] != "Other"].copy()
df = df.reset_index(drop=True)

In [13]:
# --- Setup ---
client = OpenAI(api_key=my_gpt_key)

# --- Time with timezone ---
tz_offset = timezone(timedelta(hours=1))  # set your offset here
current_time = datetime.now(tz=tz_offset).strftime("%Y-%m-%d %H:%M:%S (UTC+1)")

# --- Basic stats ---
total_econ = len(df)
source_counts = df["source_name"].value_counts().to_dict()
cat_counts = df["category"].value_counts()
cat_counts_dict = cat_counts.to_dict()

# --- Sort by most frequent categories ---
df["category"] = pd.Categorical(df["category"], categories=cat_counts.index, ordered=True)
df = df.sort_values("category")

# --- HEADER ---
print(f"### 🗓️ Daily Economic News Summary ({current_time})\n")
print(f"Till this time today, NewsAPI aggregated **{total_econ} economic news items**.\n")

# --- SOURCES ---
print("**Sources contributing (news count in brackets):**")
print(", ".join([f"{src} ({n})" for src, n in source_counts.items()]))
print("\n")

# --- CATEGORIES ---
print(f"GPT categorized them into **{len(cat_counts)} economic categories:**")
print(", ".join([f"{cat} ({count})" for cat, count in cat_counts_dict.items()]))
print("\n")

# --- HEADLINES BY CATEGORY ---
for cat in cat_counts.index:
    print(f"#### {cat} ({cat_counts_dict[cat]} news)\n")
    cat_df = df[df["category"] == cat]
    for _, row in cat_df.iterrows():
        print(f"- {row['title']} ({row['source_name']})")
    print()

# --- GPT SUMMARIES FOR TOP 3 CATEGORIES ---
top3 = cat_counts.head(3).index.tolist()

for cat in top3:
    cat_df = df[df["category"] == cat]
    headlines = "\n".join(cat_df["title"].tolist())

    prompt_summary = f"""
You are an economics journalist writing concise market briefs.
Read the following headlines about "{cat}" and produce a **short 2–3 sentence summary** only.

Summarize:
- What specifically happened today
- Why it matters economically
- What it could mean going forward

Use clear, factual language, keep it under 80 words total, and include key numbers or names if relevant.
Do NOT exceed three sentences.

Headlines:
{headlines}

Return only the summary text (no intro, no list, no formatting).
"""

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt_summary}],
    )

    summary = resp.choices[0].message.content.strip()
    print(f"### 🧠 {cat} - Summary\n{summary}\n")


### 🗓️ Daily Economic News Summary (2025-11-03 20:09:49 (UTC+1))

Till this time today, NewsAPI aggregated **38 economic news items**.

**Sources contributing (news count in brackets):**
Bloomberg (10), The Next Web (7), Business Insider (5), Fortune (4), Wired (3), Associated Press (3), Newsweek (2), Hacker News (1), The Washington Post (1), Fox News (1), NBC News (1)


GPT categorized them into **11 economic categories:**
Technology and Innovation (12), Trade and Geopolitics (7), Stock Markets (5), Fiscal Policy (2), Auto Industry Issues (2), Energy Prices (2), Banking and Credit (2), Corporate Earnings (2), Labor Market (2), Government Shutdown Effects (1), Real Estate (1)


#### Technology and Innovation (12 news)

- THE AI ISSUE (Wired)
- EU's EV battery ambitions hang in the balance (The Next Web)
- 3D-printed rocket engine revs up for orbital launch in Scotland (The Next Web)
- A metaverse network plots an escape from Meta's 'walled gardens' (The Next Web)
- This startup gi