<a href="https://colab.research.google.com/github/NataliaLyubaykina/agents_test/blob/main/testing_NewsAPI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parsing

Apparently, most financial news websites **do not provide full or free APIs**.  
Some sites have **no public API at all**, and others place most of their content **behind paywalls**.

There are aggregation tools, however; in this notebook I test [NewsAPI.org](https://newsapi.org/).
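The cells below reference `API_KEY` without defining it. A minimal sketch of one way to supply it, assuming the key is stored in an environment variable; the variable name `NEWSAPI_KEY` and the helper `load_api_key` are hypothetical and not part of NewsAPI or of this notebook.

```python
import os


def load_api_key(var_name="NEWSAPI_KEY"):
    """Return the API key stored in an environment variable (the name is an assumption)."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early with a readable message instead of sending an empty key
        raise RuntimeError(f"Set the {var_name} environment variable before running the notebook.")
    return key
```

With the variable set, `API_KEY = load_api_key()` would make the key available to the later cells without hard-coding it in the notebook.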

# Testing NewsAPI.org to aggregate news

## How many sources can it aggregate for economic news from the US?

In [2]:
import requests
import pandas as pd

url = f"https://newsapi.org/v2/top-headlines/sources?apiKey={API_KEY}"

response = requests.get(url)
data = response.json()

if response.status_code != 200:
    print("Error:", data)
else:
    sources = data.get("sources", [])
    df = pd.DataFrame([{
        "id": s["id"],
        "name": s["name"],
        "category": s["category"],
        "language": s["language"],
        "country": s["country"],
        "url": s["url"],
        "description": s["description"]
    } for s in sources])

    # Filter for US + English
    df = df[(df["country"] == "us") & (df["language"] == "en")]

    # Keep relevant categories
    df = df[df["category"].isin(["business", "general", "technology", "science"])]

    # Manually tag most relevant economic/financial outlets
    relevance_keywords = [
        "finance", "business", "market", "money",
        "economy", "financial", "investment", "stock"
    ]

    df["is_economic"] = df["description"].str.contains(
        "|".join(relevance_keywords), case=False, na=False
    )

    econ_df = df[df["is_economic"]].copy()

    # Sort by category for overview
    econ_df = econ_df.sort_values(["category", "name"])
    econ_df = econ_df.reset_index(drop=True)
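The `is_economic` flag above relies on a case-insensitive regular expression built by joining the keywords with `|`. The standalone sketch below reproduces that pattern with the standard `re` module on made-up descriptions. It also shows that plain alternation matches substrings (for example `market` inside `supermarket`) and adds a hypothetical word-boundary variant that avoids this. The sample strings, the `strict` pattern, and the `is_economic` helper are illustrative assumptions, not part of the notebook.

```python
import re

relevance_keywords = [
    "finance", "business", "market", "money",
    "economy", "financial", "investment", "stock",
]

# Same idea as "|".join(relevance_keywords) with case=False in pandas
loose = re.compile("|".join(relevance_keywords), re.IGNORECASE)

# Hypothetical stricter variant: a keyword must start at a word boundary
strict = re.compile(r"\b(?:" + "|".join(relevance_keywords) + ")", re.IGNORECASE)


def is_economic(description, pattern=loose):
    """Return True if the description mentions any keyword; None or empty counts as False."""
    return bool(description) and pattern.search(description) is not None


print(is_economic("Local supermarket opens"))          # True (substring match)
print(is_economic("Local supermarket opens", strict))  # False
```

Whether the looser or the stricter match is preferable depends on how many false positives are acceptable in the source list.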

In [3]:
econ_df['id']

| | id |
|---|---|
| 0 | bloomberg |
| 1 | business-insider |
| 2 | fortune |
| 3 | associated-press |
| 4 | fox-news |
| 5 | nbc-news |
| 6 | newsweek |
| 7 | reuters |
| 8 | the-washington-post |
| 9 | hacker-news |


# Twelve sources - not bad

## How many of today's news items can it aggregate from these sources?

In [4]:
from datetime import datetime, timezone, timedelta

# 🗓️ Get today's date in UTC (ISO 8601 compatible)
today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
print(today)

2025-11-05


In [5]:
# 🗞️ Collect all articles from your econ_df
all_articles = []

print(f"Fetching today's news ({today}) from {len(econ_df)} economic/financial sources...\n")

for source_id, source_name in zip(econ_df["id"], econ_df["name"]):
    url = (
        f"https://newsapi.org/v2/top-headlines?"
        f"sources={source_id}&"
        f"language=en&"
        f"pageSize=100&"
        f"apiKey={API_KEY}"
    )

    response = requests.get(url)
    data = response.json()

    if response.status_code != 200:
        print(f"❌ {source_name} ({source_id}): Error {response.status_code} → {data.get('message')}")
        continue

    articles = data.get("articles", [])
    print(f"✅ {source_name}: {len(articles)} articles")

    for a in articles:
        all_articles.append({
            "source_id": source_id,
            "source_name": source_name,
            "title": a["title"],
            "description": a["description"],
            "url": a["url"],
            "publishedAt": a["publishedAt"]
        })

Fetching today's news (2025-11-05) from 12 economic/financial sources...

✅ Bloomberg: 10 articles
✅ Business Insider: 10 articles
✅ Fortune: 10 articles
✅ Associated Press: 10 articles
✅ Fox News: 10 articles
✅ NBC News: 10 articles
✅ Newsweek: 10 articles
✅ Reuters: 0 articles
✅ The Washington Post: 10 articles
✅ Hacker News: 10 articles
✅ The Next Web: 7 articles
✅ Wired: 10 articles
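The request URLs above are assembled by hand with f-strings. As a sketch, the same query string can be built with `urllib.parse.urlencode`, which also escapes special characters. `build_headlines_url` and `BASE_URL` are hypothetical names; the endpoint and the parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

BASE_URL = "https://newsapi.org/v2/top-headlines"


def build_headlines_url(source_id, api_key, page_size=100):
    """Build the top-headlines URL for one source, using the notebook's parameters."""
    params = {
        "sources": source_id,
        "language": "en",
        "pageSize": page_size,
        "apiKey": api_key,
    }
    # urlencode keeps the dict order and percent-encodes values where needed
    return f"{BASE_URL}?{urlencode(params)}"


print(build_headlines_url("bloomberg", "KEY"))
# https://newsapi.org/v2/top-headlines?sources=bloomberg&language=en&pageSize=100&apiKey=KEY
```

An alternative with the same effect is to pass the dictionary directly, as in `requests.get(BASE_URL, params=params, timeout=10)`, which also adds a timeout so a stalled request does not block the loop.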


In [6]:
if all_articles:
    news_df = pd.DataFrame(all_articles)

    # ✅ robust timestamp parsing
    news_df["publishedAt"] = pd.to_datetime(news_df["publishedAt"], format="ISO8601", errors="coerce")

    news_df = news_df.sort_values("publishedAt", ascending=False).reset_index(drop=True)

    print(f"\n✅ Total articles collected: {len(news_df)} from {news_df['source_name'].nunique()} sources.")
else:
    print("\n⚠️ No articles found for today.")


✅ Total articles collected: 107 from 11 sources.


In [7]:
news_df.head()

| | source_id | source_name | title | description | url | publishedAt |
|---|---|---|---|---|---|---|
| 0 | fox-news | Fox News | The 2025 election that may determine if Republ... | California voters decide Proposition 50 Tuesda... | https://www.foxnews.com/politics/2025-election... | 2025-11-04 17:07:24.105661300+00:00 |
| 1 | fox-news | Fox News | FBI arrests 2 men in connection with Harvard M... | The FBI's Boston Field Office arrested two Mas... | https://www.foxnews.com/us/fbi-arrests-2-men-c... | 2025-11-04 17:07:19.619986200+00:00 |
| 2 | bloomberg | Bloomberg | YPF Hails Milei Agenda as Abu Dhabi Joins Arge... | State-run YPF SA said the oil and gas industry... | https://www.bloomberg.com/news/articles/2025-1... | 2025-11-04 16:55:16+00:00 |
| 3 | fox-news | Fox News | Erika Kirk reveals her message to Jimmy Kimmel... | Erika Kirk says she doesn’t "need" Jimmy Kimme... | https://www.foxnews.com/media/erika-kirk-revea... | 2025-11-04 16:52:22.479545700+00:00 |
| 4 | fox-news | Fox News | ‘Golden Bachelor’ star Gerry Turner admits mar... | Gerry Turner opens up about his failed "Golden... | https://www.foxnews.com/entertainment/golden-b... | 2025-11-04 16:22:24.338996700+00:00 |
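The timestamps in this preview are from 2025-11-04, while `today` printed 2025-11-05: the `today` value is computed but never used to filter the request or the results. A minimal sketch of filtering the collected records by date, assuming `publishedAt` is an ISO 8601 string in UTC as in the responses above. `published_on` is a hypothetical helper and the sample records are made up.

```python
def published_on(articles, day):
    """Keep articles whose ISO 8601 `publishedAt` string starts with `day` (YYYY-MM-DD)."""
    # A missing or None timestamp is treated as "not published on that day"
    return [a for a in articles if (a.get("publishedAt") or "").startswith(day)]


sample = [
    {"title": "A", "publishedAt": "2025-11-05T08:15:00Z"},
    {"title": "B", "publishedAt": "2025-11-04T17:07:24.1056613Z"},
    {"title": "C", "publishedAt": None},
]
todays = published_on(sample, "2025-11-05")
print([a["title"] for a in todays])  # ['A']
```

Comparing the date prefix as a string avoids parsing timestamps with a varying number of fractional digits, at the cost of only working when all timestamps share one time zone.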


## Many articles already look relevant, though some are local news

# Analysing the aggregated news

## News categories defined by GPT model (using fixed topic list)

### Fixed categories

In [8]:
# possible topics
topics = [
    "Financial Markets",
    "Stock Markets",
    "Economic Policies",
    "Financial Regulations",
    "Inflation",
    "Labor Market",
    "Corporate Strategy",
    "Energy Sector",
    "Real Estate",
    "Consumer Insights",
    "Banking and Credit",
    "Technology Innovations",
    "Commodities",
    "Auto Industry Issues",
    "Political Dynamics",
    "Political Elections",
    "Social Issues",
    "International Relations",
    "Public Health Policies",
    "Entertainment News"
]



In [9]:
from openai import OpenAI
from tqdm import tqdm
import pandas as pd

client = OpenAI(api_key=my_gpt_key)

df = news_df.copy()
texts = (
    df["title"].fillna("") + ". " + df["description"].fillna("")
).tolist()

print(f"Using fixed {len(topics)} categories:\n{topics}\n")

# --- classification function ---
def classify_article(text, topics):
    prompt = f"""
You are an economics analyst.
Assign the following news item to the *closest* matching category from this list:
{', '.join(topics)}.

If it is not economic, financial, or business related, respond exactly with: "Other".
You must choose **exactly one** from the list if possible.

News:
{text}

Return only the category name exactly as in the list or "Other".
"""
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # 👈 stability: keeps output as repeatable as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content.strip()

# --- run classification ---
labels = []
for text in tqdm(texts, desc="Categorizing articles"):
    label = classify_article(text, topics)
    labels.append(label)

df["category"] = labels
print("✅ Classification complete.")

Using fixed 20 categories:
['Financial Markets', 'Stock Markets', 'Economic Policies', 'Financial Regulations', 'Inflation', 'Labor Market', 'Corporate Strategy', 'Energy Sector', 'Real Estate', 'Consumer Insights', 'Banking and Credit', 'Technology Innovations', 'Commodities', 'Auto Industry Issues', 'Political Dynamics', 'Political Elections', 'Social Issues', 'International Relations', 'Public Health Policies', 'Entertainment News']



Categorizing articles: 100%|██████████| 107/107 [00:58<00:00,  1.84it/s]

✅ Classification complete.
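Because the model returns free text, a reply may differ slightly from the fixed list (extra quotes, different casing, or a label outside the list). A sketch of one way to normalize replies before storing them. `normalize_label` is a hypothetical helper, and the fallback to `"Other"` is an assumption chosen to match the filter applied later in the notebook.

```python
def normalize_label(raw, topics, fallback="Other"):
    """Map a raw model reply onto the fixed topic list, or return the fallback."""
    # Remove surrounding whitespace, quotes, markdown asterisks, backticks and periods
    cleaned = raw.strip().strip("\"'*.` ").casefold()
    lookup = {t.casefold(): t for t in topics}
    return lookup.get(cleaned, fallback)


topics_demo = ["Financial Markets", "Labor Market", "Energy Sector"]
print(normalize_label('"labor market".', topics_demo))  # Labor Market
print(normalize_label("Not Economic", topics_demo))     # Other
```

Applied inside the loop, this would guarantee that every stored label is either one of the fixed topics or the fallback, so the later filter on the category column behaves predictably.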





In [10]:
from openai import OpenAI
import re, ast
from datetime import datetime

client = OpenAI(api_key=my_gpt_key)

# --- use the same cleaned df you already have ---
sample_texts = (
    df["title"].fillna("") + ". " + df["description"].fillna("")
).tolist()[:50]  # sample first 50 items for context

# ===============================================================
# 1️⃣  GPT defines categories based on current news headlines
# ===============================================================
prompt_infer = f"""
You are an economics analyst.
Given these news headlines, identify 8–12 short, clear economic or financial categories
that best represent the topics covered. Use 2–3 word labels.

Headlines:
{chr(10).join(sample_texts)}

Return only a valid Python list of strings (no commentary).
"""

resp_infer = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[{"role": "user", "content": prompt_infer}],
)
raw_infer = resp_infer.choices[0].message.content.strip()

# --- clean and parse list ---
raw_infer = re.sub(r"^```[a-zA-Z]*", "", raw_infer).replace("```", "").strip()
try:
    inferred_categories = ast.literal_eval(raw_infer)
except Exception:
    inferred_categories = [c.strip() for c in raw_infer.split(",") if c.strip()]

print(f"\n🧩 GPT-inferred categories ({len(inferred_categories)}):\n{inferred_categories}\n")

# ===============================================================
# 2️⃣  Ask GPT to compare inferred vs. fixed lists
# ===============================================================
prompt_compare = f"""
Compare these two lists of economic categories.

Fixed categories:
{topics}

Newly inferred categories from data:
{inferred_categories}

Analyze whether the fixed list sufficiently covers the new ones.
If some inferred categories are missing or too distinct,
recommend how the list of fixed categories {topics} should be optimized to better reflect today's news with minimal changes,
and keeping it not significantly longer than it is now.

Be concise, and finish with a clear yes/no about whether to expand the list.
"""  # earlier variant: recommend which 1–3 additional categories (if any) should be added.

resp_compare = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[{"role": "user", "content": prompt_compare}],
)
conclusion = resp_compare.choices[0].message.content.strip()

print("### 🧠 Category Coverage Analysis\n")
print(conclusion)


🧩 GPT-inferred categories (12):
['Political Elections', 'Healthcare Incidents', 'Energy Sector', 'Consumer Sentiment', 'Financial Services', 'Market Trends', 'Corporate Strategy', 'International Relations', 'Technology Innovations', 'Economic Policy', 'Labor Market', 'Tourism Industry']

### 🧠 Category Coverage Analysis

The fixed categories cover many of the newly inferred categories, but there are some gaps and distinctions that should be addressed:

1. **Healthcare Incidents**: This is not explicitly covered in the fixed list. It could be integrated into "Public Health Policies" or added as a separate category.
  
2. **Consumer Sentiment**: This is somewhat related to "Consumer Insights," but it may warrant its own category to reflect the focus on consumer attitudes and behaviors.

3. **Financial Services**: This could be encompassed within "Banking and Credit," but it may be beneficial to explicitly include it to capture a broader range of financial activities.

4. **Market 
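The list-parsing step in the cell above can be wrapped in a small function and checked without calling the API. This is a sketch under the assumption that the model returns either a Python-style list, possibly wrapped in a Markdown code fence, or a plain comma-separated line. `parse_category_list` is a hypothetical name, and the fallback additionally strips stray brackets and quotes, which the original fallback does not do.

```python
import ast
import re

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this block readable


def parse_category_list(raw):
    """Parse a model reply into a list of category strings."""
    # Drop a leading fence with optional language tag, then any remaining fence markers
    text = re.sub("^" + FENCE + "[a-zA-Z]*", "", raw.strip()).replace(FENCE, "").strip()
    try:
        parsed = ast.literal_eval(text)
        if isinstance(parsed, (list, tuple)):
            return [str(c).strip() for c in parsed if str(c).strip()]
    except (ValueError, SyntaxError, TypeError):
        pass
    # Fallback: split on commas and strip stray brackets, quotes and whitespace
    junk = " []'\"\n"
    return [c.strip(junk) for c in text.split(",") if c.strip(junk)]


print(parse_category_list("['Energy Sector', 'Labor Market']"))
# ['Energy Sector', 'Labor Market']
```

Keeping the parser separate from the API call makes it possible to test odd replies (missing quotes, fenced output) offline before spending tokens.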

## Summary of economic news

In [11]:
df = df[df["category"] != "Other"].copy()
df = df.reset_index(drop=True)

In [13]:
from datetime import datetime, timedelta, timezone
import pandas as pd
from openai import OpenAI

# --- Setup ---
client = OpenAI(api_key=my_gpt_key)

# define filename with date
date_str = datetime.now().strftime("%Y-%m-%d")
filename = f"news_summary_for_{date_str}.txt"

# small helper: print + write
def tee_print(text="", end="\n"):
    """Print to console and also write to file."""
    print(text, end=end)
    with open(filename, "a", encoding="utf-8") as f:
        f.write(text + end)

# --- Time with timezone ---
tz_offset = timezone(timedelta(hours=1))  # set your offset here
current_time = datetime.now(tz=tz_offset).strftime("%Y-%m-%d %H:%M:%S (UTC+1)")

# --- Basic stats ---
total_econ = len(df)
source_counts = df["source_name"].value_counts().to_dict()
cat_counts = df["category"].value_counts()
cat_counts_dict = cat_counts.to_dict()

# --- Sort by most frequent categories ---
df["category"] = pd.Categorical(df["category"], categories=cat_counts.index, ordered=True)
df = df.sort_values("category")

# clear file first
open(filename, "w").close()

# --- HEADER ---
tee_print(f"### 🗓️ Daily Economic News Summary ({current_time})\n")
tee_print(f"Till this time today, NewsAPI aggregated **{total_econ} economic news items**.\n")

# --- SOURCES ---
tee_print("**Sources contributing (news count in brackets):**")
tee_print(", ".join([f"{src} ({n})" for src, n in source_counts.items()]))
tee_print()

# --- CATEGORIES ---
tee_print(f"GPT categorized them into **{len(cat_counts)} economic categories:**")
tee_print(", ".join([f"{cat} ({count})" for cat, count in cat_counts_dict.items()]))
tee_print()

# --- HEADLINES BY CATEGORY (prints only) ---
for cat in cat_counts.index:
    print(f"#### {cat} ({cat_counts_dict[cat]} news)\n")
    cat_df = df[df["category"] == cat]
    for _, row in cat_df.iterrows():
        print(f"- {row['title']} ({row['source_name']})")
    print()

# --- GPT SUMMARIES FOR TOP 3 CATEGORIES ---
top3 = cat_counts.head(3).index.tolist()

for cat in top3:
    cat_df = df[df["category"] == cat]
    headlines = "\n".join(cat_df["title"].tolist())

    prompt_summary = f"""
You are an economics journalist writing concise market briefs.
Read the following headlines about "{cat}" and produce a **short 2–3 sentence summary** only.

Summarize:
- What specifically happened today
- Why it matters economically
- What it could mean going forward

Use clear, factual language, keep it under 80 words total, and include key numbers or names if relevant.
Do NOT exceed three sentences.

Headlines:
{headlines}

Return only the summary text (no intro, no list, no formatting).
"""

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt_summary}],
    )

    summary = resp.choices[0].message.content.strip()
    tee_print(f"### 🧠 {cat} - Summary\n{summary}\n")

print(f"\n✅ Summary sections saved to: {filename}")


### 🗓️ Daily Economic News Summary (2025-11-05 18:40:21 (UTC+1))

Till this time today, NewsAPI aggregated **55 economic news items**.

**Sources contributing (news count in brackets):**
Bloomberg (8), The Next Web (7), Fortune (7), Fox News (6), Business Insider (6), Associated Press (5), Wired (5), The Washington Post (5), Hacker News (2), Newsweek (2), NBC News (2)

GPT categorized them into **17 economic categories:**
Technology Innovations (10), Political Dynamics (8), Political Elections (7), Corporate Strategy (5), Financial Markets (3), International Relations (3), Energy Sector (3), Stock Markets (3), Social Issues (2), Labor Market (2), Financial Regulations (2), Auto Industry Issues (2), Public Health Policies (1), Inflation (1), Consumer Insights (1), Real Estate (1), Economic Policies (1)

#### Technology Innovations (10 news)

- Sequoia PGP is now LGPL 2.0+ (Hacker News)
- The EV Battery Tech That’s Worth the Hype, According to Experts (Wired)
- The next Silicon Va