# Tech Monthly
Adds **combined news sentiment** (Polygon + Finnhub), **earnings**, **sector & AI exposure indexes**, and **FRED macro lags**.

Outputs CSVs and builds per‑ticker enriched features for modeling.

## Purpose

Build a monthly feature set for major tech stocks (AAPL, MSFT, GOOGL, NVDA, META, AMZN) by combining:

* News sentiment (Polygon + Finnhub, merged & deduped)

* Earnings surprises (Finnhub)

* Sector & AI exposure indexes (yfinance)

* Macro factors from FRED (with engineered features and lags)

* TEST: run safe OLS regressions per ticker to see which features explain returns. All intermediate data are saved as CSVs.

## Setup

In [7]:
# If needed, install packages (uncomment):
# !pip install requests pandas numpy scikit-learn statsmodels matplotlib yfinance pyarrow nltk

import os, time, json, datetime as dt
from typing import List, Dict, Any
import requests
import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
import statsmodels.api as sm

print('pandas', pd.__version__)


pandas 2.3.1


## Keys & controls

* Defines tickers, date range (START = "2015-01-01" → END = today), and lag lengths [1,3,6].

* Creates output folder data_enriched.

In [None]:
FRED_KEY     = os.getenv("FRED_API_KEY")
POLYGON_KEY  = os.getenv("POLYGON_API_KEY") 
FINNHUB_KEY  = os.getenv("FINNHUB_API_KEY")

FRED_BASE    = "https://api.stlouisfed.org/fred/series/observations"
POLY_BASE    = "https://api.polygon.io"
FINNHUB_BASE = "https://finnhub.io/api/v1"

OUT_DIR = "data_enriched"
os.makedirs(OUT_DIR, exist_ok=True)

TECH_TICKERS = ["AAPL", "MSFT", "GOOGL", "NVDA", "META", "AMZN"]
AI_BASKET    = ["NVDA", "META", "MSFT", "GOOGL", "AMD", "AVGO"]
START  = "2018-01-01"
END    = dt.date.today().isoformat()
LAGS   = [1,3,6]

print("Keys present:", {"FRED": bool(FRED_KEY), "POLYGON": bool(POLYGON_KEY), "FINNHUB": bool(FINNHUB_KEY)})
print("Universe:", TECH_TICKERS)


Keys present: {'FRED': True, 'POLYGON': True, 'FINNHUB': True}
Universe: ['AAPL', 'MSFT', 'GOOGL', 'NVDA', 'META', 'AMZN']


## Helpers — HTTP, CSV, lags, diagnostics

* HTTP GET with retry, CSV saver, lag builder, simple diagnostics.

In [9]:
def _get_json(url: str, params: Dict[str, Any], max_retries=3, backoff=1.0):
    for i in range(max_retries):
        r = requests.get(url, params=params, timeout=30)
        if r.status_code == 429:
            time.sleep(backoff * (i+1) * 2)
            continue
        r.raise_for_status()
        try:
            return r.json()
        except json.JSONDecodeError:
            time.sleep(backoff)
    raise RuntimeError(f"GET failed after retries: {url}")

def _to_csv(df: pd.DataFrame, path: str):
    df.to_csv(path, index=True)
    print(f"Saved → {path}")

def make_lags(df: pd.DataFrame, cols: List[str], lags=(1,3,6)) -> pd.DataFrame:
    out = df.copy()
    for c in cols:
        if c not in out.columns: 
            continue
        for L in lags:
            out[f"{c}_lag{L}"] = out[c].shift(L)
    return out

def diagnostics(df: pd.DataFrame, name: str, top=10):
    print(f"\n[Diag] {name}: shape={df.shape}, index=({df.index.min()}, {df.index.max()})")
    if df.index.duplicated().any():
        dup = df.index[df.index.duplicated()]
        print(f"[Diag] WARNING: duplicate index rows={len(dup)}")
    na_pct = df.isna().mean().sort_values(ascending=False).head(top)
    print("[Diag] Top columns by NaN%:\n", na_pct.to_string())


## FRED macro (monthly)

* Pulls monthly Fed Funds, CPI, 10Y, Unemployment.

* Engineers features: inflation_yoy, us10y_chg, fedfunds_chg, unrate_chg.

* Saves macro_monthly.csv.

In [10]:
def fred_series_monthly(series_id: str, start: str, end: str, how="mean") -> pd.Series:
    if not FRED_KEY:
        raise RuntimeError("Missing FRED_API_KEY")
    url = FRED_BASE
    params = {"series_id":series_id, "api_key":FRED_KEY, "file_type":"json",
              "observation_start":start, "observation_end":end}
    js = _get_json(url, params)
    obs = js.get("observations", [])
    if not obs:
        return pd.Series(dtype=float)
    df = pd.DataFrame(obs)
    df["date"] = pd.to_datetime(df["date"])
    df["value"] = pd.to_numeric(df["value"], errors="coerce")
    s = df.set_index("date")["value"].astype(float)
    return s.resample("M").last() if how == "last" else s.resample("M").mean()

def get_macro_block(start: str, end: str) -> pd.DataFrame:
    series = {"FEDFUNDS":"fed_funds_rate", "CPIAUCSL":"cpi_index", "DGS10":"us10y", "UNRATE":"unemployment_rate"}
    cols = {}
    for sid, name in series.items():
        cols[name] = fred_series_monthly(sid, start, end, how="mean")
    macro = pd.DataFrame(cols).sort_index()
    # Engineered features
    if "cpi_index" in macro:
        macro["inflation_yoy"] = macro["cpi_index"].pct_change(12) * 100
    if "us10y" in macro:
        macro["us10y_chg"] = macro["us10y"].diff(1)
    if "fed_funds_rate" in macro:
        macro["fedfunds_chg"] = macro["fed_funds_rate"].diff(1)
    if "unemployment_rate" in macro:
        macro["unrate_chg"] = macro["unemployment_rate"].diff(1)
    diagnostics(macro, "macro monthly")
    _to_csv(macro, f"{OUT_DIR}/macro_monthly.csv")
    return macro

macro = get_macro_block(START, END)


  return s.resample("M").last() if how == "last" else s.resample("M").mean()
  return s.resample("M").last() if how == "last" else s.resample("M").mean()
  return s.resample("M").last() if how == "last" else s.resample("M").mean()



[Diag] macro monthly: shape=(94, 8), index=(2018-01-31 00:00:00, 2025-10-31 00:00:00)
[Diag] Top columns by NaN%:
 inflation_yoy        0.127660
unrate_chg           0.031915
cpi_index            0.021277
unemployment_rate    0.021277
fedfunds_chg         0.021277
fed_funds_rate       0.010638
us10y_chg            0.010638
us10y                0.000000
Saved → data_enriched/macro_monthly.csv


  return s.resample("M").last() if how == "last" else s.resample("M").mean()
  macro["inflation_yoy"] = macro["cpi_index"].pct_change(12) * 100


## Prices & sector / AI indexes (yfinance)

* Computes monthly returns for NASDAQ Composite (^IXIC) and XLK.

* Builds AI basket (equal-weight ret of NVDA, META, MSFT, GOOGL, AMD, AVGO).

* Saves ixic_rets.csv, xlk_rets.csv, ai_basket_rets.csv.

In [11]:
def monthly_from_yf(tickers, start, end):
    data = yf.download(tickers, start=start, end=end, progress=False, auto_adjust=True)
    if isinstance(tickers, str) or (isinstance(tickers, list) and len(tickers) == 1):
        close = data["Close"]
        if isinstance(close, pd.Series):
            close = close.to_frame(name=tickers if isinstance(tickers, str) else tickers[0])
    else:
        close = data["Close"]
    m = close.resample("M").last()
    rets = m.pct_change()
    return m, rets

ixic_close, ixic_rets = monthly_from_yf("^IXIC", START, END); ixic_rets.columns = ["ixic_ret"]
xlk_close,  xlk_rets  = monthly_from_yf("XLK",  START, END);  xlk_rets.columns  = ["xlk_ret"]

ai_close, ai_rets = monthly_from_yf(["NVDA","META","MSFT","GOOGL","AMD","AVGO"], START, END)
ai_ret_eqw = ai_rets.mean(axis=1).to_frame(name="ai_basket_ret")

_to_csv(ixic_rets, f"{OUT_DIR}/ixic_rets.csv")
_to_csv(xlk_rets,  f"{OUT_DIR}/xlk_rets.csv")
_to_csv(ai_ret_eqw, f"{OUT_DIR}/ai_basket_rets.csv")


  m = close.resample("M").last()
  m = close.resample("M").last()


Saved → data_enriched/ixic_rets.csv
Saved → data_enriched/xlk_rets.csv
Saved → data_enriched/ai_basket_rets.csv


  m = close.resample("M").last()


## Combined news sentiment (Polygon + Finnhub)

* Fetches news from both APIs, fault-tolerant (soft failure if one is down).

* Coerces timestamps to datetimes, dedupes near-duplicate headlines within a few days.

* Scores sentiment per article with VADER (fallback keyword heuristic).

* Aggregates monthly:

    * sent_mean, sent_count (all articles both sources)

    * sent_mean_weighted (count-weighted blend of source means)

    * Per-source diagnostics: sent_mean_poly, sent_count_poly, sent_mean_fin, sent_count_fin

* *aves per-ticker *_news_sentiment_combined.csv.

In [12]:
def _get_json_safe(url: str, params: dict, max_retries=3, backoff=1.5):
    last_err = None
    for i in range(max_retries):
        try:
            r = requests.get(url, params=params, timeout=30)
            if r.status_code == 429:  # rate limited
                time.sleep(backoff * (i + 1))
                continue
            r.raise_for_status()
            try:
                return r.json()
            except json.JSONDecodeError as e:
                last_err = e
                time.sleep(backoff)
                continue
        except Exception as e:
            last_err = e
            time.sleep(backoff)
    print(f"[warn] _get_json_safe failed: {url} | {type(last_err).__name__}: {last_err}")
    return None  # <-- crucial: fail soft

def polygon_news_raw(ticker: str, api_key: str, start=None, end=None, max_pages=20, page_limit=1000, sleep=0.4):
    """
    Paginate Polygon /v2/reference/news using the 'cursor' token.
    Stops when:
      - no more pages
      - hit max_pages
      - (optional) item date < start (fast-exit filter)
    Returns standardized columns: [date, headline, summary, source]
    """
    cols = ["date","headline","summary","source"]
    if not api_key:
        return pd.DataFrame(columns=cols)

    url = f"{POLY_BASE}/v2/reference/news"
    params = {
        "ticker": ticker,
        "limit": min(page_limit, 1000),
        "order": "desc",
        "apiKey": api_key,
    }
    if start: params["published_utc.gte"] = pd.Timestamp(start).strftime("%Y-%m-%d")
    if end:   params["published_utc.lte"] = pd.Timestamp(end).strftime("%Y-%m-%d")

    all_rows = []
    cursor = None
    pages = 0

    while True:
        if cursor:
            params["cursor"] = cursor
        js = _get_json_safe(url, params)
        if not js or "results" not in js or not js["results"]:
            break

        rows = js["results"]
        for r in rows:
            d  = pd.to_datetime(r.get("published_utc"), errors="coerce")
            tl = r.get("title", "")
            ds = r.get("description", "")
            if pd.isna(d): 
                continue
            # optional fast-exit if we already scrolled past 'start'
            if start and d < pd.Timestamp(start):
                break
            all_rows.append({"date": d, "headline": tl, "summary": ds, "source": "polygon"})

        pages += 1
        cursor = js.get("next_url") or js.get("next") or js.get("cursor")
        if not cursor or pages >= max_pages:
            break
        time.sleep(sleep)

    if not all_rows:
        print(f"[warn] Polygon news empty for {ticker} (pages={pages}).")
        return pd.DataFrame(columns=cols)

    df = pd.DataFrame(all_rows).sort_values("date")
    return df[cols]

def finnhub_news_raw(ticker: str, api_key: str, start: str, end: str, chunk="365D", sleep=0.3):
    """
    Fetch Finnhub company-news in date chunks (e.g., per ~year) to avoid silent truncation.
    Returns standardized columns: [date, headline, summary, source]
    """
    cols = ["date","headline","summary","source"]
    if not api_key:
        return pd.DataFrame(columns=cols)

    start_ts = pd.Timestamp(start)
    end_ts   = pd.Timestamp(end)
    step     = pd.Timedelta(chunk)

    frames = []
    lo = start_ts
    while lo <= end_ts:
        hi = min(lo + step, end_ts)
        url = f"{FINNHUB_BASE}/company-news"
        params = {"symbol": ticker, "from": lo.date().isoformat(), "to": hi.date().isoformat(), "token": api_key}
        js = _get_json_safe(url, params)
        if isinstance(js, list) and js:
            df = pd.DataFrame(js)
            d  = pd.to_datetime(df.get("datetime"), unit="s", errors="coerce")
            if d.isna().all():
                d = pd.to_datetime(df.get("time"), unit="ms", errors="coerce")
            df_out = pd.DataFrame({
                "date": d,
                "headline": df.get("headline", ""),
                "summary":  df.get("summary", ""),
                "source":   "finnhub"
            }).dropna(subset=["date"])
            if not df_out.empty:
                frames.append(df_out)
        else:
            print(f"[warn] Finnhub news empty for {ticker} in {lo.date()}→{hi.date()}")
        lo = hi + pd.Timedelta("1D")
        time.sleep(sleep)

    if not frames:
        return pd.DataFrame(columns=cols)
    return pd.concat(frames, ignore_index=True).sort_values("date")[cols]


# Ensure VADER or fallback exists (reuse from earlier cell if present)
try:
    from nltk.sentiment import SentimentIntensityAnalyzer
    _ = SentimentIntensityAnalyzer()
    _VADER_OK = True
except Exception:
    _VADER_OK = False

def simple_sentiment(text: str) -> float:
    if not isinstance(text, str) or not text.strip():
        return 0.0
    if _VADER_OK:
        sia = SentimentIntensityAnalyzer()
        return float(sia.polarity_scores(text)["compound"])
    t = text.lower()
    pos = sum(w in t for w in ["beat","record","growth","surge","profit","upgrade","outperform","strong","rally"])
    neg = sum(w in t for w in ["miss","cut","probe","lawsuit","downgrade","decline","headwind","weak","plunge"])
    return (pos - neg) / 6.0

def build_monthly_news_sentiment_combined(
    ticker: str,
    start: str,
    end: str,
    polygon_key: str,
    finnhub_key: str,
    dedup_within_days: int = 3
) -> pd.DataFrame:
    # Pull both; either may be empty if API failed
    poly_df = polygon_news_raw(ticker, polygon_key)
    fin_df  = finnhub_news_raw(ticker, finnhub_key, start, end)
    combined = pd.concat([poly_df, fin_df], ignore_index=True)

    # Sanitize
    combined["date"] = pd.to_datetime(combined.get("date"), errors="coerce", utc=True)
    combined = combined.dropna(subset=["date"]).copy()
    try:
        combined["date"] = combined["date"].dt.tz_convert(None)
    except Exception:
        pass
    for col in ["headline","summary"]:
        if col not in combined.columns:
            combined[col] = ""
        combined[col] = combined[col].fillna("").astype(str)

    if combined.empty:
        cols = ["sent_mean","sent_count","sent_mean_weighted","sent_mean_poly","sent_count_poly","sent_mean_fin","sent_count_fin"]
        return pd.DataFrame(columns=cols, dtype=float)

    # Deduplicate within window by (floor(date), headline)
    if dedup_within_days and dedup_within_days > 0:
        rd = f"{dedup_within_days}D"
        combined["date_round"] = combined["date"].dt.floor(rd)
        combined = combined.drop_duplicates(subset=["date_round","headline"])

    # Sentiment & monthly bins
    texts = combined["headline"] + ". " + combined["summary"]
    combined["sent"] = texts.apply(simple_sentiment)
    combined["month"] = combined["date"].dt.to_period("M").dt.to_timestamp("M")

    # Per-source monthly stats
    src = combined.groupby(["month","source"]).agg(
        sent_mean=("sent","mean"),
        sent_count=("sent","size")
    ).reset_index()
    src_piv = src.pivot(index="month", columns="source", values=["sent_mean","sent_count"])

    # Ensure expected columns exist
    for c in [("sent_mean","polygon"), ("sent_mean","finnhub"), ("sent_count","polygon"), ("sent_count","finnhub")]:
        if c not in src_piv.columns:
            src_piv[c] = np.nan if "mean" in c[0] else 0
    src_piv = src_piv.sort_index()

    out = pd.DataFrame(index=src_piv.index)
    out["sent_mean_poly"]  = src_piv[("sent_mean","polygon")]
    out["sent_count_poly"] = src_piv[("sent_count","polygon")].fillna(0).astype(int)
    out["sent_mean_fin"]   = src_piv[("sent_mean","finnhub")]
    out["sent_count_fin"]  = src_piv[("sent_count","finnhub")].fillna(0).astype(int)

    # Combined article-level mean & count
    all_month = combined.groupby("month").agg(sent_mean=("sent","mean"), sent_count=("sent","size"))
    out = out.join(all_month, how="outer")

    # Count-weighted per-source mean
    num = (out["sent_mean_poly"].fillna(0) * out["sent_count_poly"].astype(float) +
           out["sent_mean_fin"].fillna(0)  * out["sent_count_fin"].astype(float))
    den = (out["sent_count_poly"].astype(float) + out["sent_count_fin"].astype(float))
    out["sent_mean_weighted"] = np.where(den > 0, num / den, np.nan)

    return out[[
        "sent_mean","sent_count","sent_mean_weighted",
        "sent_mean_poly","sent_count_poly","sent_mean_fin","sent_count_fin"
    ]].sort_index()

news_sent_maps = {}
for t in TECH_TICKERS:
    print(f"Combined sentiment for {t} ...")
    s = build_monthly_news_sentiment_combined(t, START, END, POLYGON_KEY, FINNHUB_KEY)
    news_sent_maps[t] = s
    _to_csv(s, f"{OUT_DIR}/{t}_news_sentiment_combined.csv")


Combined sentiment for AAPL ...
[warn] _get_json_safe failed: https://api.polygon.io/v2/reference/news | HTTPError: 400 Client Error: Bad Request for url: https://api.polygon.io/v2/reference/news?ticker=AAPL&limit=1000&order=desc&apiKey=TNRyzBuL5CNEIxE3aGKd66ssjVJkVq8B&cursor=https%3A%2F%2Fapi.polygon.io%2Fv2%2Freference%2Fnews%3Fcursor%3DYXA9MjAyNS0wMy0xOVQwOCUzQTIxJTNBMDBaJmFzPTA2MDY2ZTFhM2VkZTY1NzNhMjNiYzEzMGRjMTczMmJlNDBkYzFlYTczYWI4MjA4YzI0OGU2MDZhZWMyNWFiMjImbGltaXQ9MTAwMCZvcmRlcj1kZXNjZW5kaW5nJnRpY2tlcj1BQVBM
[warn] Finnhub news empty for AAPL in 2018-01-01→2019-01-01
[warn] Finnhub news empty for AAPL in 2019-01-02→2020-01-02
[warn] Finnhub news empty for AAPL in 2020-01-03→2021-01-02
[warn] Finnhub news empty for AAPL in 2021-01-03→2022-01-03
[warn] Finnhub news empty for AAPL in 2022-01-04→2023-01-04
[warn] Finnhub news empty for AAPL in 2023-01-05→2024-01-05
Saved → data_enriched/AAPL_news_sentiment_combined.csv
Combined sentiment for MSFT ...
[warn] _get_json_safe failed: h

  combined = pd.concat([poly_df, fin_df], ignore_index=True)


Saved → data_enriched/META_news_sentiment_combined.csv
Combined sentiment for AMZN ...
[warn] _get_json_safe failed: https://api.polygon.io/v2/reference/news | NoneType: None
[warn] Polygon news empty for AMZN (pages=0).
[warn] Finnhub news empty for AMZN in 2018-01-01→2019-01-01
[warn] Finnhub news empty for AMZN in 2019-01-02→2020-01-02
[warn] Finnhub news empty for AMZN in 2020-01-03→2021-01-02
[warn] Finnhub news empty for AMZN in 2021-01-03→2022-01-03
[warn] Finnhub news empty for AMZN in 2022-01-04→2023-01-04
[warn] Finnhub news empty for AMZN in 2023-01-05→2024-01-05


  combined = pd.concat([poly_df, fin_df], ignore_index=True)


Saved → data_enriched/AMZN_news_sentiment_combined.csv


## Earnings features (Finnhub)

* Pulls quarterly EPS actual/estimate; computes surprise %.

* Aggregates to monthly: eps_surprise_mean, eps_surprise_last.

* Saves per-ticker *_earnings_features.csv.

In [13]:
def finnhub_earnings(ticker: str, start: str, end: str) -> pd.DataFrame:
    if not FINNHUB_KEY:
        return pd.DataFrame(columns=["date","epsActual","epsEstimate","surprisePercent"])
    url = f"{FINNHUB_BASE}/stock/earnings"
    params = {"symbol": ticker, "token": FINNHUB_KEY}
    js = _get_json(url, params)
    if not isinstance(js, list) or len(js) == 0:
        url2 = f"{FINNHUB_BASE}/calendar/earnings"
        params2 = {"from": start, "to": end, "token": FINNHUB_KEY}
        js2 = _get_json(url2, params2)
        df2 = pd.DataFrame(js2.get("earningsCalendar", []))
        if df2.empty:
            return pd.DataFrame(columns=["date","epsActual","epsEstimate","surprisePercent"])
        df2["date"] = pd.to_datetime(df2["date"], errors="coerce")
        df2 = df2[df2.get("symbol","") == ticker]
        keep = [c for c in ["date","epsActual","epsEstimate","surprisePercent"] if c in df2.columns]
        return df2[keep].dropna(subset=["date"]).sort_values("date")
    df = pd.DataFrame(js)
    if "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"], errors="coerce")
    elif "period" in df.columns:
        df["date"] = pd.to_datetime(df["period"], errors="coerce")
    if "surprisePercent" not in df.columns and {"epsActual","epsEstimate"} <= set(df.columns):
        with np.errstate(divide='ignore', invalid='ignore'):
            df["surprisePercent"] = (df["epsActual"] - df["epsEstimate"]) / df["epsEstimate"] * 100.0
    keep = [c for c in ["date","epsActual","epsEstimate","surprisePercent"] if c in df.columns]
    return df[keep].dropna(subset=["date"]).sort_values("date")

def monthly_earnings_features(ticker: str) -> pd.DataFrame:
    df = finnhub_earnings(ticker, START, END)
    if df.empty:
        return pd.DataFrame(columns=["month","eps_surprise_mean","eps_surprise_last"]).set_index("month")
    df["month"] = df["date"].dt.to_period("M").dt.to_timestamp("M")
    agg = df.groupby("month").agg(eps_surprise_mean=("surprisePercent","mean"),
                                  eps_surprise_last=("surprisePercent","last")).sort_index()
    return agg

earnings_maps = {}
for t in TECH_TICKERS:
    print(f"Earnings features for {t} ...")
    e = monthly_earnings_features(t)
    earnings_maps[t] = e
    _to_csv(e, f"{OUT_DIR}/{t}_earnings_features.csv")


Earnings features for AAPL ...
Saved → data_enriched/AAPL_earnings_features.csv
Earnings features for MSFT ...
Saved → data_enriched/MSFT_earnings_features.csv
Earnings features for GOOGL ...
Saved → data_enriched/GOOGL_earnings_features.csv
Earnings features for NVDA ...
Saved → data_enriched/NVDA_earnings_features.csv
Earnings features for META ...
Saved → data_enriched/META_earnings_features.csv
Earnings features for AMZN ...
Saved → data_enriched/AMZN_earnings_features.csv


## Build per‑ticker enriched features

* For each ticker: joins monthly returns with IXIC, XLK, AI basket, news sentiment, earnings, and macro lags (to avoid leakage).

* Trims early rows to respect lag availability.

* Saves per-ticker *_features_enriched.csv and a tech_features_combined.csv (wide panel).

In [14]:
def build_ticker_features(ticker: str) -> pd.DataFrame:
    close, rets = monthly_from_yf(ticker, START, END)
    t_ret = rets.rename(columns={rets.columns[0]: f"{ticker}_ret"})
    feats = t_ret.join(ixic_rets, how="left").join(xlk_rets, how="left").join(ai_ret_eqw, how="left")
    if ticker in news_sent_maps: feats = feats.join(news_sent_maps[ticker], how="left")
    if ticker in earnings_maps: feats = feats.join(earnings_maps[ticker], how="left")
    base_feats = ["inflation_yoy","us10y","us10y_chg","fed_funds_rate","fedfunds_chg","unemployment_rate","unrate_chg"]
    macro_lagged = make_lags(macro, base_feats, lags=LAGS)
    feats = feats.join(macro_lagged, how="left")
    max_lag = max(LAGS) if LAGS else 0
    if len(feats) > max_lag:
        feats = feats.iloc[max_lag:]
    return feats

all_feat = {}
for t in TECH_TICKERS:
    print(f"Features for {t} ...")
    ft = build_ticker_features(t)
    diagnostics(ft, f"{t} features")
    all_feat[t] = ft
    _to_csv(ft, f"{OUT_DIR}/{t}_features_enriched.csv")

combined = pd.concat(all_feat, axis=1)
_to_csv(combined, f"{OUT_DIR}/tech_features_combined.csv")
combined.tail()


Features for AAPL ...


  m = close.resample("M").last()
  m = close.resample("M").last()
  m = close.resample("M").last()
  m = close.resample("M").last()
  m = close.resample("M").last()



[Diag] AAPL features: shape=(88, 42), index=(2018-07-31 00:00:00, 2025-10-31 00:00:00)
[Diag] Top columns by NaN%:
 sent_mean_fin         0.954545
eps_surprise_mean     0.954545
eps_surprise_last     0.954545
sent_mean_poly        0.909091
sent_mean_weighted    0.886364
sent_count_poly       0.886364
sent_count            0.886364
sent_count_fin        0.886364
sent_mean             0.886364
inflation_yoy_lag6    0.136364
Saved → data_enriched/AAPL_features_enriched.csv
Features for MSFT ...

[Diag] MSFT features: shape=(88, 42), index=(2018-07-31 00:00:00, 2025-10-31 00:00:00)
[Diag] Top columns by NaN%:
 sent_mean_fin         0.954545
eps_surprise_mean     0.954545
eps_surprise_last     0.954545
sent_mean_poly        0.909091
sent_mean_weighted    0.886364
sent_count_poly       0.886364
sent_count            0.886364
sent_count_fin        0.886364
sent_mean             0.886364
inflation_yoy_lag6    0.136364
Saved → data_enriched/MSFT_features_enriched.csv
Features for GOOGL ...

[D

  m = close.resample("M").last()


Unnamed: 0_level_0,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,...,AMZN,AMZN,AMZN,AMZN,AMZN,AMZN,AMZN,AMZN,AMZN,AMZN
Unnamed: 0_level_1,AAPL_ret,ixic_ret,xlk_ret,ai_basket_ret,sent_mean,sent_count,sent_mean_weighted,sent_mean_poly,sent_count_poly,sent_mean_fin,...,fed_funds_rate_lag6,fedfunds_chg_lag1,fedfunds_chg_lag3,fedfunds_chg_lag6,unemployment_rate_lag1,unemployment_rate_lag3,unemployment_rate_lag6,unrate_chg_lag1,unrate_chg_lag3,unrate_chg_lag6
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2025-06-30,0.021509,0.06571,0.098479,0.140136,0.443481,112.0,0.443481,0.443481,112.0,,...,4.48,0.0,0.0,-0.16,4.2,4.2,4.1,0.0,0.1,-0.1
2025-07-31,0.011698,0.036953,0.037555,0.107196,0.501022,196.0,0.501022,0.501022,196.0,,...,4.33,0.0,0.0,-0.15,4.1,4.2,4.0,-0.1,0.0,-0.1
2025-08-31,0.119639,0.01577,-0.001104,-0.011649,0.503038,195.0,0.503038,0.503038,195.0,,...,4.33,0.0,0.0,0.0,4.2,4.2,4.1,0.1,0.0,0.1
2025-09-30,0.096881,0.056137,0.075337,0.0562,0.417668,324.0,0.417668,0.466827,194.0,0.344308,...,4.33,0.0,0.0,0.0,4.3,4.1,4.2,0.1,-0.1,0.1
2025-10-31,0.013313,0.005318,0.010147,0.004113,0.416321,118.0,0.416321,0.22952,15.0,0.443525,...,4.33,-0.11,0.0,0.0,,4.2,4.2,,0.1,0.0


## Safe OLS per ticker

Simple Ordinary Least Squares, which is one of the most common methods for estimating the parameters of a linear regression model. This will:

* Selects best-covered features (limits to a max count) so you don’t lose the sample to sparsity.

* Fits per-ticker OLS with a minimum-row guard; prints summary if enough data.

* Reports row counts before/after dropna.

In [15]:
def fit_ols_safe(df: pd.DataFrame, target_col: str, min_rows: int = 12):
    X = df.drop(columns=[target_col])
    y = df[target_col]
    aligned = pd.concat([y, X], axis=1).dropna()
    info = {"rows_before": len(df), "rows_after_dropna": len(aligned), "n_features": X.shape[1]}
    if len(aligned) < min_rows:
        return None, info
    y_clean = aligned.iloc[:,0]
    X_clean = aligned.iloc[:,1:]
    X_clean = sm.add_constant(X_clean, has_constant="add")
    m = sm.OLS(y_clean, X_clean).fit()
    return m, info

def top_features_by_coverage(df: pd.DataFrame, target_col: str, k: int = 20):
    na_rates = df.drop(columns=[target_col]).isna().mean()
    keep_feats = na_rates.sort_values().index[:k].tolist()
    cols = [target_col] + keep_feats
    return df[cols]

MIN_ROWS = 12
MAX_FEATS = 20

for t in TECH_TICKERS:
    print(f"\n=== {t} OLS (enriched) ===")
    df_t = all_feat[t].copy()
    target = f"{t}_ret"
    df_t_red = top_features_by_coverage(df_t, target, k=MAX_FEATS)
    m, info = fit_ols_safe(df_t_red, target, min_rows=MIN_ROWS)
    print(f"rows_before={info['rows_before']}, rows_after_dropna={info['rows_after_dropna']}, features={info['n_features']}")
    if m is None:
        print(f"Not enough data after dropna (need >= {MIN_ROWS}). Try longer date range, fewer lags, or fewer sparse features.")
        continue
    try:
        print(m.summary())
    except Exception as e:
        print("Could not print full summary:", e)
        print("Params:\n", m.params)
        print("R2:", getattr(m, "rsquared", None), "Adj R2:", getattr(m, "rsquared_adj", None))



=== AAPL OLS (enriched) ===
rows_before=88, rows_after_dropna=86, features=20
                            OLS Regression Results                            
Dep. Variable:               AAPL_ret   R-squared:                       0.740
Model:                            OLS   Adj. R-squared:                  0.665
Method:                 Least Squares   F-statistic:                     9.878
Date:                Sun, 05 Oct 2025   Prob (F-statistic):           8.91e-13
Time:                        07:33:13   Log-Likelihood:                 148.25
No. Observations:                  86   AIC:                            -256.5
Df Residuals:                      66   BIC:                            -207.4
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------

# Outputs

data_enriched/macro_monthly.csv

data_enriched/ixic_rets.csv, xlk_rets.csv, ai_basket_rets.csv

Per ticker:

[TECH TICKER]_news_sentiment_combined.csv, …

[TECH TICKER]_earnings_features.csv, …

[TECH TICKER]_features_enriched.csv, …

Panel: data_enriched/tech_features_combined.csv