
# 📈 Financial News NER (Stocks, Companies, Events)

This notebook builds a **Financial News Named Entity Recognition (NER)** system using **Hugging Face Transformers only**.

It extracts:
- **Stocks / Tickers**: e.g., `AAPL`, `TSLA`
- **Companies**: e.g., `Apple Inc.`, `Tesla`
- **Events**: e.g., `earnings`, `merger`, `dividend`



## 1️⃣ Install Dependencies

In [1]:

# Comment these out once the packages are installed
%pip install -U transformers torch pandas tqdm rich requests python-dateutil
%pip install newsapi-python

Note: you may need to restart the kernel to use updated packages.





## 2️⃣ Load Model and Define Config

In [2]:

from transformers import pipeline
import re, json, pandas as pd
from typing import List, Dict, Any
from dataclasses import dataclass, field

PRIMARY_MODEL = "dslim/bert-base-NER"

def load_ner_pipeline():
    try:
        # Current API: "simple" merges adjacent tokens that share an entity label
        return pipeline("ner", model=PRIMARY_MODEL, aggregation_strategy="simple")
    except TypeError:
        # Fallback for older transformers releases that only accept the deprecated flag
        return pipeline("ner", model=PRIMARY_MODEL, grouped_entities=True)

ner = load_ner_pipeline()
print("NER pipeline ready ✅")




NER pipeline ready ✅




## 3️⃣ Helper Functions

In [3]:

TICKER_REGEX = re.compile(r"\b[A-Z][A-Z\.\-]{0,4}\b")
EVENT_HINTS = ["earnings","guidance","dividend","merger","acquisition","buyback","IPO","bankruptcy","layoff","lawsuit","settlement","investigation","partnership","forecast"]

@dataclass
class NERResult:
    text: str
    companies: List[str] = field(default_factory=list)
    tickers: List[str] = field(default_factory=list)
    events: List[str] = field(default_factory=list)
    raw_entities: List[Dict[str, Any]] = field(default_factory=list)

def normalize_ticker(token: str) -> str:
    t = token.upper().strip(".,:;!?)([]")
    if not TICKER_REGEX.fullmatch(t) or len(t) == 1:
        return ""
    return t

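def detect_events(text: str) -> List[str]:
    # Lowercase both sides so mixed-case hints such as "IPO" can match;
    # iterating over EVENT_HINTS keeps the result order stable.
    lowered = text.lower()
    return [e for e in EVENT_HINTS if e.lower() in lowered]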

def extract_entities(text: str) -> NERResult:
    res = NERResult(text=text)
    ents = ner(text)
    res.raw_entities = ents
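    for e in ents:
        word = e["word"]
        group = e["entity_group"].upper()
        # dslim/bert-base-NER emits only PER, ORG, LOC, and MISC, so the extra
        # labels below only matter if a finance-specific model is swapped in.
        if group in ["ORG","COMPANY","ORGANIZATION"]:
            res.companies.append(word)
        elif group in ["TICKER","STOCK","SYMBOL"]:
            t = normalize_ticker(word)
            if t:
                res.tickers.append(t)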
    for t in TICKER_REGEX.findall(text):
        tt = normalize_ticker(t)
        if tt:
            res.tickers.append(tt)
    res.events = detect_events(text)
    res.companies = list(dict.fromkeys(res.companies))
    res.tickers = list(dict.fromkeys(res.tickers))
    res.events = list(dict.fromkeys(res.events))
    return res
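
`TICKER_REGEX` accepts any run of one to five capital letters, so ordinary acronyms can be picked up as tickers. A stricter alternative, sketched below, only accepts symbols that appear in parentheses, optionally after an exchange label. The names `PAREN_TICKER` and `extract_parenthesized_tickers` are hypothetical and not part of the notebook, and the pattern will still accept parenthesized acronyms such as `(CEO)`.

```python
import re
from typing import List

# Hypothetical pattern: a ticker in parentheses, optionally prefixed by an
# exchange label such as "NYSE:" or "NASDAQ:". A single-letter class suffix
# (e.g. "BRK.B") is allowed.
PAREN_TICKER = re.compile(r"\((?:[A-Z]+:\s*)?([A-Z]{1,5}(?:\.[A-Z])?)\)")

def extract_parenthesized_tickers(text: str) -> List[str]:
    """Return unique tickers found inside parentheses, in order of appearance."""
    return list(dict.fromkeys(PAREN_TICKER.findall(text)))

print(extract_parenthesized_tickers("Apple (AAPL) raised its Q4 guidance."))
print(extract_parenthesized_tickers("EU investigates vaccine pricing."))
```

This trades recall for precision: a headline that mentions a bare symbol without parentheses would no longer yield a ticker.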


## 4️⃣ Demo Run

In [4]:

examples = [
    "Apple (AAPL) raised its Q4 guidance after strong iPhone sales.",
    "Tesla Inc. (TSLA) announced a $5B buyback and unveiled a new battery.",
    "NVIDIA (NVDA) beats earnings and issues positive outlook.",
    "Pfizer and BioNTech expand partnership; EU investigates vaccine pricing.",
    "WeWork files for Chapter 11 bankruptcy protection."
]

results = [extract_entities(x) for x in examples]
for r in results:
    print("\nüì∞", r.text)
    print("üè¢ Companies:", r.companies)
    print("üíπ Tickers:", r.tickers)
    print("üóìÔ∏è Events:", r.events)



📰 Apple (AAPL) raised its Q4 guidance after strong iPhone sales.
🏢 Companies: ['Apple', 'AAPL']
💹 Tickers: ['AAPL']
🗓️ Events: ['guidance']

📰 Tesla Inc. (TSLA) announced a $5B buyback and unveiled a new battery.
🏢 Companies: ['Tesla Inc', 'T', '##SLA']
💹 Tickers: ['TSLA']
🗓️ Events: ['buyback']

📰 NVIDIA (NVDA) beats earnings and issues positive outlook.
🏢 Companies: ['N', '##VIDIA', 'NVDA']
💹 Tickers: ['NVDA']
🗓️ Events: ['earnings']

📰 Pfizer and BioNTech expand partnership; EU investigates vaccine pricing.
🏢 Companies: ['P', '##fizer', 'B', '##ioNTech', 'EU']
💹 Tickers: ['EU']
🗓️ Events: ['partnership']

📰 WeWork files for Chapter 11 bankruptcy protection.
🏢 Companies: ['WeWork']
💹 Tickers: []
🗓️ Events: ['bankruptcy']
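
The company lists in this output contain WordPiece fragments such as `'T', '##SLA'` and `'P', '##fizer'`, which the model split and the pipeline did not rejoin. One possible cleanup is to glue each `##` piece back onto the entry before it. The helper below, `merge_wordpieces`, is a hypothetical post-processing sketch and not part of the notebook; it assumes the fragments are still adjacent and in their original order. The pipeline's `aggregation_strategy` parameter also accepts word-level values such as `"first"`, which may avoid the split at the source.

```python
from typing import List

def merge_wordpieces(words: List[str]) -> List[str]:
    """Join '##'-prefixed WordPiece continuations onto the preceding entry."""
    merged: List[str] = []
    for w in words:
        if w.startswith("##") and merged:
            # Continuation piece: strip the marker and append to the previous word.
            merged[-1] += w[2:]
        else:
            merged.append(w)
    return merged

print(merge_wordpieces(["Tesla Inc", "T", "##SLA"]))
print(merge_wordpieces(["P", "##fizer", "B", "##ioNTech", "EU"]))
```

Applying it before the `dict.fromkeys` de-duplication in `extract_entities` would be the safer order, since de-duplication can drop a repeated fragment and break adjacency.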


## 5️⃣ Batch Processing Example

In [5]:

def process_batch(texts: List[str]) -> pd.DataFrame:
    rows = []
    for t in texts:
        r = extract_entities(t)
        rows.append({
            "text": t,
            "companies": ", ".join(r.companies),
            "tickers": ", ".join(r.tickers),
            "events": ", ".join(r.events)
        })
    return pd.DataFrame(rows)

df = process_batch(examples)
df


| | text | companies | tickers | events |
|---|---|---|---|---|
| 0 | Apple (AAPL) raised its Q4 guidance after stro... | Apple, AAPL | AAPL | guidance |
| 1 | Tesla Inc. (TSLA) announced a $5B buyback and ... | Tesla Inc, T, ##SLA | TSLA | buyback |
| 2 | NVIDIA (NVDA) beats earnings and issues positi... | N, ##VIDIA, NVDA | NVDA | earnings |
| 3 | Pfizer and BioNTech expand partnership; EU inv... | P, ##fizer, B, ##ioNTech, EU | EU | partnership |
| 4 | WeWork files for Chapter 11 bankruptcy protect... | WeWork | | bankruptcy |


## 6️⃣ Live Data: NewsAPI.org Integration

**Get a key at https://newsapi.org**, then paste it into the `NEWSAPI_KEY` variable below or set the `NEWSAPI_KEY` environment variable.

In [None]:
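import os, requests, json, re
from urllib.parse import urlencode
import pandas as pd
from datetime import datetime, timedelta, timezone
from dateutil import parser as dateparser

# Prefer a key already set in the environment; otherwise paste one in place of the placeholder.
NEWSAPI_KEY = os.environ.get("NEWSAPI_KEY") or "PASTE_YOUR_KEY_HERE"
os.environ["NEWSAPI_KEY"] = NEWSAPI_KEY
# Show only the last 6 characters so the full key never appears in the output.
print("Key loaded:", "*" * max(len(NEWSAPI_KEY) - 6, 0) + NEWSAPI_KEY[-6:])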

Key loaded: **************************46421e


In [7]:
import requests
r = requests.get(
    "https://newsapi.org/v2/top-headlines",
    params={"apiKey": os.environ.get("NEWSAPI_KEY",""), "q": "tesla", "language": "en", "pageSize": 1},
    timeout=30,
)
print(r.status_code)
print(r.json())

200
{'status': 'ok', 'totalResults': 0, 'articles': []}


In [None]:
def fetch_newsapi_articles(query: str, from_iso: str = None, to_iso: str = None, language: str = "en", page_size: int = 50):
    import os, requests
    key = os.getenv("NEWSAPI_KEY", "").strip()
    if not key or key == "PASTE_YOUR_KEY_HERE":
        raise RuntimeError("NEWSAPI_KEY is not set. Set it via env var or notebook cell as shown above.")

    url = "https://newsapi.org/v2/everything"
    params = {
        "q": query,
        "language": language,
        "pageSize": min(max(page_size, 1), 100),
        "sortBy": "publishedAt",
        "apiKey": key,
    }
    if from_iso: params["from"] = from_iso
    if to_iso:   params["to"]   = to_iso

    resp = requests.get(url, params=params, timeout=30)
    try:
        data = resp.json()
    except Exception:
        resp.raise_for_status()
        raise

    if resp.status_code == 401:
        # NewsAPI usually returns a helpful message here
        msg = data.get("message", "Unauthorized")
        raise RuntimeError(f"NewsAPI 401 Unauthorized: {msg}. "
                           f"Double-check your key, plan, and that you didn't leave the placeholder.")
    if resp.status_code != 200:
        raise RuntimeError(f"NewsAPI error {resp.status_code}: {data}")

    return data.get("articles", [])

def normalize_article(a):
    title = (a.get("title") or "").strip()
    desc = (a.get("description") or "").strip()
    content = (a.get("content") or "").strip()
    combined = " ".join([x for x in [title, desc, content] if x])
    published_at = a.get("publishedAt")
    try:
        published_dt = dateparser.parse(published_at) if published_at else None
    except Exception:
        published_dt = None
    return {
        "source": (a.get("source") or {}).get("name", ""),
        "title": title,
        "text": combined,
        "url": a.get("url", ""),
        "publishedAt": published_at,
        "published_dt": published_dt.isoformat() if published_dt else ""
    }

def extract_from_articles(articles):
    rows = []
    for a in articles:
        text = a.get("text", "")
        r = extract_entities(text)  
        rows.append({
            "publishedAt": a.get("publishedAt", ""),
            "source": a.get("source", ""),
            "title": a.get("title", ""),
            "url": a.get("url", ""),
            "companies": ", ".join(r.companies),
            "tickers": ", ".join(r.tickers),
            "events": ", ".join(r.events)
        })
    return pd.DataFrame(rows)

def newsapi_ner_demo(query: str, page_size: int = 25, hours_back: int = 72):
    to_iso = datetime.now(timezone.utc).isoformat(timespec="seconds")
    from_iso = (datetime.now(timezone.utc) - timedelta(hours=hours_back)).isoformat(timespec="seconds")
    articles = fetch_newsapi_articles(query=query, from_iso=from_iso, to_iso=to_iso, page_size=page_size)
    normalized = [normalize_article(a) for a in articles]
    return extract_from_articles(normalized)

print("✅ newsapi_ner_demo() is ready! Now you can call it.")

✅ newsapi_ner_demo() is ready! Now you can call it.
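
Because `fetch_newsapi_articles` calls the API on every invocation, repeated experiments with the same query spend quota each time. A minimal sketch of an on-disk cache follows. `cached_fetch` and `fake_fetch` are hypothetical names, the cache key is derived from the query string only, and entries never expire, so treat this as a starting point and not a finished design.

```python
import hashlib
import json
import os
import tempfile
from typing import Any, Callable, Dict, List

Articles = List[Dict[str, Any]]

def cached_fetch(fetch: Callable[[str], Articles], query: str, cache_dir: str) -> Articles:
    """Return cached articles for `query` if present; otherwise call `fetch` and store the result."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]
    path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    articles = fetch(query)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False)
    return articles

# Demo with a stand-in fetcher so the sketch runs without a network call or API key.
def fake_fetch(query: str) -> Articles:
    return [{"title": f"Result for {query}", "url": "https://example.com/a"}]

demo_dir = tempfile.mkdtemp()
first = cached_fetch(fake_fetch, "tesla", demo_dir)   # calls fake_fetch
second = cached_fetch(fake_fetch, "tesla", demo_dir)  # served from disk
print(first == second)
```

In the notebook, `fetch` could be a small wrapper such as `lambda q: fetch_newsapi_articles(query=q)`. Time-windowed queries would need the date range included in the cache key.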


In [11]:
df_live = newsapi_ner_demo("earnings OR results OR guidance", page_size=20, hours_back=72)
df_live.head(20)

| | publishedAt | source | title | url | companies | tickers | events |
|---|---|---|---|---|---|---|---|
| 0 | 2025-11-15T12:41:00Z | Onefootball.com | Ancelotti closely watches Estêvão with the nat... | https://onefootball.com/en/news/ancelotti-clos... | Team, FIFA | FIFA | |
| 1 | 2025-11-15T12:32:26Z | Slashdot.org | Single ticket in Georgia claims $980 million M... | https://slashdot.org/firehose.pl?op=view&i... | CNN, Yahoo | CNN | |
| 2 | 2025-11-15T12:31:09Z | Github.com | Show HN: RAG-chunk – A CLI to test RAG chunkin... | https://github.com/messkan/rag-chunk | | HN, CLI, RAG, URL | |
| 3 | 2025-11-15T12:30:00Z | Mother Jones | This Invasive Disease-Carrier Is Showing Up in... | https://www.motherjones.com/environment/2025/1... | Inside Climate News, Climate Desk | | |
| 4 | 2025-11-15T12:28:48Z | Biztoc.com | Dow Jones Futures: Nvidia Looms For Market Aft... | https://biztoc.com/x/713104b80cdd4fe2 | Nvidia, S & P, NVDA, S & am | NVDA | earnings |
| 5 | 2025-11-15T12:28:21Z | Biztoc.com | Hits 2-Year High on Revenue, Margin Beat | https://biztoc.com/x/a96f671fcc79bb68 | Canadian Solar Inc, CSI, Canadian Solar | CSIQ | |
| 6 | 2025-11-15T12:28:18Z | Biztoc.com | Jumps 16% on Q3 Blowout | https://biztoc.com/x/4373098aaa60f4ca | Figure Technology Solutions, Inc, Figure Techn... | FIGR | |
| 7 | 2025-11-15T12:28:17Z | Biztoc.com | Climbs 10% on Q3 Blowout | https://biztoc.com/x/a95787de062b075b | RLX Technology Inc, RLX Technology, Fr | RLX, NYSE | |
| 8 | 2025-11-15T12:26:06Z | Mindful.org | Do I Need to Meditate to Be Mindful? | https://www.mindful.org/do-i-need-to-meditate-... | Mindful | | |
| 9 | 2025-11-15T12:25:06Z | The-independent.com | The vegetable that can lower cholesterol and c... | https://www.the-independent.com/life-style/hea... | Big Tech, The, Independent | | |
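
Several rows in this result have no ticker or no event, which suggests the broad query pulls in non-financial articles. One rough way to narrow the table is to keep only rows where both fields are populated. The sketch below works on plain dictionaries, which could be obtained with `df_live.to_dict("records")`. `keep_financial_rows` is a hypothetical helper, and the sample rows are abbreviated from the table above.

```python
from typing import Dict, List

def keep_financial_rows(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep rows that have at least one ticker and at least one detected event."""
    return [
        r for r in rows
        if (r.get("tickers") or "").strip() and (r.get("events") or "").strip()
    ]

sample = [
    {"source": "Biztoc.com", "tickers": "NVDA", "events": "earnings"},
    {"source": "Mindful.org", "tickers": "", "events": ""},
    {"source": "Onefootball.com", "tickers": "FIFA", "events": ""},
]
print(keep_financial_rows(sample))
```

The filter is strict: a relevant article whose text contains none of the `EVENT_HINTS` keywords would be dropped too, so requiring only a ticker may be a better fit for some uses.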


## 7️⃣ Export Results (CSV / JSONL)

In [10]:
def export_results(df: pd.DataFrame, stem: str = "newsapi_ner") -> str:
    import json
    csv_path = f"{stem}.csv"
    jsonl_path = f"{stem}.jsonl"
    df.to_csv(csv_path, index=False)
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for _, row in df.iterrows():
            f.write(json.dumps(row.to_dict(), ensure_ascii=False) + "\n")
    print("Saved:", csv_path, "and", jsonl_path)
    return csv_path
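
The JSONL file written by `export_results` holds one JSON object per line, so it can be read back with the standard library alone. The sketch below shows one way to do that. `read_jsonl` is a hypothetical helper, not part of the notebook, and the demo writes its own small file so it does not depend on an earlier export.

```python
import json
import os
import tempfile
from typing import Any, Dict, List

def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Load a JSON Lines file into a list of dictionaries, skipping blank lines."""
    rows: List[Dict[str, Any]] = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows

# Demo: write two records in the same one-object-per-line layout, then read them back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"title": "A", "tickers": "NVDA"}, ensure_ascii=False) + "\n")
    f.write(json.dumps({"title": "B", "tickers": ""}, ensure_ascii=False) + "\n")

print(read_jsonl(demo_path))
```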


## 8️⃣ Notes & Limits

- **NewsAPI free tier** has request and source limitations; consider batching queries and caching results.
- This notebook keeps dependencies minimal. For advanced event extraction, consider a relation extraction model.
- For production: deduplicate articles by URL, filter out irrelevant sources, and enrich tickers via a reference mapping (e.g., OpenFIGI).
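
The first two production suggestions can be sketched in a few lines. The example below assumes article dictionaries in the shape returned by `normalize_article`, where `source` is a plain string. `dedupe_and_filter` and the blocklist contents are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, List, Set

def dedupe_and_filter(articles: List[Dict[str, Any]], blocked_sources: Set[str]) -> List[Dict[str, Any]]:
    """Drop articles with a missing or repeated URL, or from a blocked source (case-insensitive)."""
    blocked = {s.lower() for s in blocked_sources}
    seen_urls: Set[str] = set()
    kept: List[Dict[str, Any]] = []
    for a in articles:
        url = (a.get("url") or "").strip()
        source = (a.get("source") or "").strip().lower()
        if not url or url in seen_urls or source in blocked:
            continue
        seen_urls.add(url)
        kept.append(a)
    return kept

articles = [
    {"source": "Biztoc.com", "url": "https://biztoc.com/x/1", "title": "Q3 results"},
    {"source": "Biztoc.com", "url": "https://biztoc.com/x/1", "title": "Q3 results (repeat)"},
    {"source": "Mindful.org", "url": "https://www.mindful.org/a", "title": "Meditation"},
]
print(dedupe_and_filter(articles, blocked_sources={"mindful.org"}))
```

An allowlist of financial sources would work the same way with the membership test inverted, and is usually easier to maintain than a blocklist.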
