# Multi-Modal Financial Analysis Agent: Final Submission

**Course:** AAI 520 - Final Project

This notebook implements the final version of a multi-modal AI system that performs comparative analysis on multiple stocks using market, macroeconomic, and news sentiment data.

## Setup and Dependencies
This cell installs the required `vaderSentiment` library, imports all necessary packages, and sets the API keys.**bold text**

In [1]:
'''Uncomment and run the following line if you haven't installed the required packages yet'''
#!py -m pip install openai python-dotenv yfinance pydantic requests vaderSentiment google-generativeai tabulate

"Uncomment and run the following line if you haven't installed the required packages yet"

In [2]:
import os, shutil
import pandas as pd
import yfinance as yf
import requests
from IPython.display import display, Markdown
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from dotenv import load_dotenv
from __future__ import annotations
import os, json, time, argparse, datetime as dt
import openai
import hashlib, hmac, math
from pathlib import Path
from typing import List, Dict, Any, Optional, Tuple, Callable, Iterable
import math
from dataclasses import dataclass, field
from pydantic import BaseModel
from pandas.api.types import is_datetime64_any_dtype
import textwrap
from functools import reduce

from dotenv import load_dotenv

# tqdm (nice progress bars)
try:
    from tqdm.auto import tqdm
    _HAS_TQDM = True
except Exception:
    _HAS_TQDM = False

### Configuration

Enviroment set with .env file as follows:

```env
# Economic data from the Federal Reserve
FRED_API_KEY="YOUR_FRED_API_KEY"

# Stock market and financial data
POLYGON_API_KEY="YOUR_POLYGON_API_KEY"
FINNHUB_API_KEY="YOUR_FINNHUB_API_KEY"

# Real-time news articles
NEWS_API_KEY="YOUR_NEWS_API_KEY"

# For the AI agent's "brain"
OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
GOOGLE_API_KEY="YOUR_GOOGLE_API_KEY"

# Identification for accessing SEC's EDGAR database
SEC_USER_AGENT="Your Name you@example.com"
```

In [None]:
# Load environment variables from .env file
load_dotenv()

# Set API keys from environment variables
FRED_KEY        = os.getenv("FRED_API_KEY")
NEWS_KEY        = os.getenv("NEWS_API_KEY")
FINNHUB_KEY     = os.getenv("FINNHUB_API_KEY")
POLYGON_KEY     = os.getenv("POLYGON_API_KEY")
OPENAI_KEY      = os.getenv("OPENAI_API_KEY")
GEMINI_KEY      = os.getenv("GOOGLE_API_KEY")
SEC_USER_AGENT  = os.getenv("SEC_USER_AGENT")

# Configuration
OPENAI_MODEL = "gpt-4o-mini"    
GEMINI_MODEL = "gemini-2.5-flash"    
NEWS_STORE   = "news_store.parquet" 
CACHE_DIR    = ".cache"             

# Ensure cache directory exists
Path(CACHE_DIR).mkdir(exist_ok=True)

### Test LLM Models

In [4]:
import google.generativeai as genai
from openai import OpenAI

def check_llm_availability():
    """
    Checks the availability and functionality of configured LLM APIs.
    This function is designed to be run directly in a notebook cell.
    """
        
    print("--- Checking LLM Availability ---")
    # --- Test 1: Google Gemini ---
    try:
        assert GEMINI_KEY, "GOOGLE_API_KEY is not set in your environment."
        genai.configure(api_key=GEMINI_KEY)
        
        # Using a reliable and recent model
        model = genai.GenerativeModel("gemini-2.5-flash")
        prompt = 'Return ONLY this JSON: {"ok": true}'
        response = model.generate_content(prompt, generation_config={"temperature": 0})
        if "ok" in response.text:
            print("✅ Google Gemini: OK")
        else:
            print(f"❌ Google Gemini: Unexpected response -> {response.text.strip()}")
            
    except Exception as e:
        print(f"❌ Google Gemini: FAILED - {e}")

    # --- Test 2: OpenAI GPT ---
    try:
        assert OPENAI_KEY, "OPENAI_API_KEY is not set in your environment."
        client = OpenAI(api_key=OPENAI_KEY)
        response = client.chat.completions.create(
            model = OPENAI_MODEL,
            messages=[{"role": "user", "content": "Reply with exactly: OK"}],
            temperature=0
        )
        
        output_text = response.choices[0].message.content.strip()
        if output_text == "OK":
            print("✅ OpenAI GPT: OK")
        else:
            print(f"❌ OpenAI GPT: Unexpected response -> {output_text}")

    except Exception as e:
        print(f"❌ OpenAI GPT: FAILED - {e}")

# Check now
check_llm_availability()

--- Checking LLM Availability ---
✅ Google Gemini: OK
✅ OpenAI GPT: OK


## Data Tools and Helper Classes/Functions

Lightweight Utils (DiskCache + stable id)

In [5]:
# -------------------------------
# Persistent Memory (tiny JSON)
# -------------------------------
class MemoryStore:
    def __init__(self, path: str = ".agent_memory.json") -> None:
        self.path = path
        if not os.path.exists(self.path):
            with open(self.path, "w", encoding="utf-8") as f:
                json.dump({"symbols": {}}, f)

    def _load(self) -> Dict[str, Any]:
        with open(self.path, "r", encoding="utf-8") as f:
            return json.load(f)

    def _save(self, data: Dict[str, Any]) -> None:
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)

    def append_note(self, symbol: str, note: str) -> None:
        data = self._load()
        symbols = data.setdefault("symbols", {})
        lst = symbols.setdefault(symbol.upper(), [])
        timestamp = dt.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S UTC")
        lst.append({"ts": timestamp, "note": note})
        self._save(data)

    def get_notes(self, symbol: str, last_n: int = 5) -> List[str]:
        data = self._load()
        notes = data.get("symbols", {}).get(symbol.upper(), [])
        return [f"{n['ts']}: {n['note']}" for n in notes[-last_n:]]
    
# -------------------------------
# Disk Cache (parquet files)
# -------------------------------
class DiskCache:
    # ... (This class is correct, no changes needed) ...
    def __init__(self, cache_dir: str, ttl_seconds: int):
        self.cache_dir = cache_dir
        self.ttl_seconds = ttl_seconds
        os.makedirs(self.cache_dir, exist_ok=True)
    def _cache_path(self, key: str) -> str:
        h = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, f"{h}.parquet")
    def get(self, key: str) -> pd.DataFrame | None:
        path = self._cache_path(key)
        if not os.path.exists(path): return None
        if (time.time() - os.path.getmtime(path)) > self.ttl_seconds: return None
        try: return pd.read_parquet(path)
        except Exception: return None
    def set(self, key: str, df: pd.DataFrame):
        path = self._cache_path(key)
        df.to_parquet(path, index=False)

### Economic Data From FRED

In [6]:
class EconomicDataTool:
    """
    A tool to fetch economic data series from the FRED API.
    """
    BASE_URL = "https://api.stlouisfed.org/fred/series/observations"

    def __init__(self, cache_dir: str = ".cache/fred", ttl_seconds: int = 12 * 3600):
        self.api_key = os.getenv("FRED_API_KEY")
        if not self.api_key:
            print("⚠️ FRED_API_KEY not set. The EconomicDataTool will be disabled.")
        
        self.cache_dir = cache_dir
        self.ttl_seconds = ttl_seconds
        os.makedirs(self.cache_dir, exist_ok=True)

    def _cache_path(self, key: str) -> str:
        h = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, f"{h}.parquet")

    def _read_cache(self, key: str) -> pd.DataFrame | None:
        path = self._cache_path(key)
        if not os.path.exists(path): return None
        if (time.time() - os.path.getmtime(path)) > self.ttl_seconds: return None
        try: return pd.read_parquet(path)
        except Exception: return None

    def _write_cache(self, key: str, df: pd.DataFrame):
        path = self._cache_path(key)
        df.to_parquet(path, index=False)

    def get_series(self, series_ids: list[str], start_date: str = "2020-01-01") -> pd.DataFrame:
        """
        Fetches one or more economic data series from FRED and merges them.
        
        Common Series IDs:
        - GDP: Real Gross Domestic Product
        - CPIAUCSL: Consumer Price Index (Inflation)
        - UNRATE: Unemployment Rate
        - FEDFUNDS: Federal Funds Effective Rate
        """
        if not self.api_key:
            return pd.DataFrame()

        # Create a stable cache key from the sorted list of series
        sorted_ids = sorted(series_ids)
        cache_key = f"fred::{'&'.join(sorted_ids)}::{start_date}"
        
        cached_df = self._read_cache(cache_key)
        if cached_df is not None:
            return cached_df

        all_series_dfs = []
        for series_id in sorted_ids:
            params = {
                "series_id": series_id,
                "api_key": self.api_key,
                "file_type": "json",
                "observation_start": start_date,
            }
            try:
                response = requests.get(self.BASE_URL, params=params, timeout=30)
                response.raise_for_status()
                data = response.json().get("observations", [])
                
                if not data:
                    print(f"No data returned for FRED series: {series_id}")
                    continue

                df = pd.DataFrame(data)
                df = df[["date", "value"]]
                df = df.rename(columns={"value": series_id})
                
                # Clean the data
                df["date"] = pd.to_datetime(df["date"])
                # FRED uses '.' for missing values
                df[series_id] = pd.to_numeric(df[series_id], errors='coerce')
                
                all_series_dfs.append(df)
            except requests.exceptions.RequestException as e:
                print(f"Failed to fetch FRED series {series_id}: {e}")
        
        if not all_series_dfs:
            return pd.DataFrame()

        # Merge all individual series DataFrames into one
        merged_df = reduce(lambda left, right: pd.merge(left, right, on='date', how='outer'), all_series_dfs)
        merged_df = merged_df.sort_values('date', ascending=False).reset_index(drop=True)
        
        self._write_cache(cache_key, merged_df)
        return merged_df

### Test

In [7]:
def run_economic_data_tool_smoke_test():
    """
    A simple test to verify the EconomicDataTool is working correctly.
    """
    print("--- 💨 Running Smoke Test for EconomicDataTool ---")
    
    # Ensure environment variables are loaded (especially FRED_API_KEY)
    load_dotenv()
    
    # 1. Instantiate the tool
    tool = EconomicDataTool()
    
    # 2. Check if the API key is available before proceeding
    if not tool.api_key:
        print("❌ Test SKIPPED: FRED_API_KEY is not set in your environment.")
        return

    # 3. Define a few common and reliable FRED series IDs to fetch
    series_to_fetch = {
        "GDP": "Real Gross Domestic Product",
        "CPIAUCSL": "Consumer Price Index (Inflation)",
        "UNRATE": "Unemployment Rate"
    }
    
    print(f"Fetching series: {', '.join(series_to_fetch.keys())}...")
    
    # 4. Call the tool's main method
    df = tool.get_series(series_ids=list(series_to_fetch.keys()))
    
    # 5. Verify the output
    if df is not None and not df.empty:
        print(f"\n✅ Test PASSED: Successfully fetched {len(df)} data points.")
        print("--- Sample of Fetched Economic Data ---")
        display(df.head())
    else:
        print("\n❌ Test FAILED: The tool returned an empty DataFrame.")
        print("   Please check your FRED_API_KEY and network connection.")

# --- Execute the smoke test ---
run_economic_data_tool_smoke_test()

--- 💨 Running Smoke Test for EconomicDataTool ---
Fetching series: GDP, CPIAUCSL, UNRATE...

✅ Test PASSED: Successfully fetched 68 data points.
--- Sample of Fetched Economic Data ---


Unnamed: 0,date,CPIAUCSL,GDP,UNRATE
0,2025-08-01,323.364,,4.3
1,2025-07-01,322.132,,4.2
2,2025-06-01,321.5,,4.1
3,2025-05-01,320.58,,4.2
4,2025-04-01,320.321,30485.729,4.2


### Market Data From Yahoo Finance 

In [8]:
class MarketDataTool:
    """
    Market data access + light feature engineering (optional).
    - Standardized schema: ['date','open','high','low','close','volume']
    - Intraday support (1m/2m/5m/15m/30m/60m/90m/1h)
    - Simple on-disk caching with TTL
    - Batch fetch for multiple tickers -> long format with a 'ticker' column
    """

    def __init__(
        self,
        cache_dir: str = ".cache/yfinance",
        ttl_seconds: int = 3600,
        max_retries: int = 2,
        pause_between_retries: float = 0.7
    ):
        self.cache_dir = cache_dir
        self.ttl_seconds = ttl_seconds
        self.max_retries = max_retries
        self.pause_between_retries = pause_between_retries
        os.makedirs(self.cache_dir, exist_ok=True)

    # ---------------------------
    # Core helpers
    # ---------------------------
    def _cache_path(self, key: str) -> str:
        h = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, f"{h}.parquet")

    def _read_cache(self, key: str) -> Optional[pd.DataFrame]:
        path = self._cache_path(key)
        if not os.path.exists(path):
            return None
        if (time.time() - os.path.getmtime(path)) > self.ttl_seconds:
            return None
        try:
            return pd.read_parquet(path)
        except Exception:
            # Fallback to CSV if parquet fails (rare)
            alt = path.replace(".parquet", ".csv")
            if os.path.exists(alt):
                try:
                    return pd.read_csv(alt, parse_dates=["date"])
                except Exception:
                    return None
            return None

    def _write_cache(self, key: str, df: pd.DataFrame) -> None:
        path = self._cache_path(key)
        try:
            df.to_parquet(path, index=False)
        except Exception:
            df.to_csv(path.replace(".parquet", ".csv"), index=False)

    def _normalize_columns(self, df: pd.DataFrame, ticker: str) -> pd.DataFrame:
        import pandas as pd
        from pandas.api.types import is_datetime64_any_dtype

        # Ensure a DataFrame (some paths may pass a Series or dict-like)
        df = pd.DataFrame(df).copy()

        # Reset index to surface the datetime index as a column (Date/Datetime/index)
        df = df.reset_index()

        # Normalize columns: flatten tuples, lowercase, underscores
        df.columns = [
            "_".join(str(s) for s in col if s) if isinstance(col, tuple) else str(col)
            for col in df.columns
        ]
        df.columns = [c.lower().replace(" ", "_") for c in df.columns]

        # --- Find/standardize the datetime column to 'date' ---
        # 1) Prefer a column already of datetime dtype
        dt_cols = [c for c in df.columns if is_datetime64_any_dtype(df[c])]
        date_col = dt_cols[0] if dt_cols else None

        # 2) Otherwise look for common names and parse
        if date_col is None:
            for cand in ("date", "datetime", "timestamp", "index"):
                if cand in df.columns:
                    # try to parse to datetime
                    df[cand] = pd.to_datetime(df[cand], errors="coerce", utc=False)
                    if is_datetime64_any_dtype(df[cand]):
                        date_col = cand
                        break

        # 3) If still missing, last resort: try to_datetime on the first column
        if date_col is None and len(df.columns) > 0:
            first = df.columns[0]
            df[first] = pd.to_datetime(df[first], errors="coerce", utc=False)
            if is_datetime64_any_dtype(df[first]):
                date_col = first

        if date_col is None:
            # Cannot reliably identify a datetime column; return empty with expected schema
            return pd.DataFrame(columns=["date", "open", "high", "low", "close", "volume"])

        if date_col != "date":
            df = df.rename(columns={date_col: "date"})

        # --- Map OHLCV names (handles multi-ticker suffixes like open_aapl) ---
        t = ticker.lower()
        colmap = {
            f"open_{t}": "open",
            f"high_{t}": "high",
            f"low_{t}": "low",
            f"close_{t}": "close",
            f"volume_{t}": "volume",
        }
        df = df.rename(columns=colmap)

        # Prefer adj_close if close missing
        if "adj_close" in df.columns and "close" not in df.columns:
            df = df.rename(columns={"adj_close": "close"})

        # Cast numeric safely
        for c in ("open", "high", "low", "close", "volume"):
            if c in df.columns:
                df[c] = pd.to_numeric(df[c], errors="coerce")

        # Ensure datetime
        df["date"] = pd.to_datetime(df["date"], errors="coerce")

        # Drop bad rows
        df = df.dropna(subset=["date", "close"]).reset_index(drop=True)

        # Final schema (return empty with correct cols if missing)
        required = ["date", "open", "high", "low", "close", "volume"]
        missing = [c for c in required if c not in df.columns]
        if missing:
            # Create any missing required columns as NaN to keep schema stable
            for c in missing:
                df[c] = pd.NA
            df = df[required]

        return df[required]


    def _yf_download(self, tickers, **kwargs):
        """
        Thin wrapper with simple retries to handle intermittent YF hiccups.
        """
        err = None
        for attempt in range(self.max_retries + 1):
            try:
                return yf.download(tickers, progress=False, auto_adjust=True, **kwargs)
            except Exception as e:
                err = e
                time.sleep(self.pause_between_retries * (attempt + 1))
        raise err if err else RuntimeError("Unknown yfinance error")

    # ---------------------------
    # Public API
    # ---------------------------
    def get_stock_prices(
        self,
        ticker: str,
        period: str = "5y",
        interval: str = "1d"
    ) -> pd.DataFrame:
        """
        Single-ticker normalized OHLCV.
        Returns standardized columns: ['date','open','high','low','close','volume'].
        Caches results for ttl_seconds.
        """
        key = f"single::{ticker}::{period}::{interval}"
        cached = self._read_cache(key)
        if cached is not None:
            return cached

        # yfinance can return tuple in some environments; normalize robustly.
        try:
            result = self._yf_download(ticker, period=period, interval=interval)
        except Exception as e:
            print(f"Error fetching stock data for {ticker}: {e}")
            return pd.DataFrame(columns=["date","open","high","low","close","volume"])

        data = result[0] if isinstance(result, tuple) else result
        if data is None or data.empty:
            return pd.DataFrame(columns=["date","open","high","low","close","volume"])

        df = self._normalize_columns(data, ticker)
        self._write_cache(key, df)
        return df

    def batch_get_prices(
        self,
        tickers: List[str],
        period: str = "1y",
        interval: str = "1d"
    ) -> pd.DataFrame:
        """
        Multi-ticker fetch. Returns LONG format:
        ['ticker','date','open','high','low','close','volume'].
        Works whether yfinance returns a flat frame or a column MultiIndex.
        """
        # Cache key is content-addressed by sorted tickers for determinism
        tickers_sorted = sorted(set([t.upper() for t in tickers]))
        key = f"batch::{','.join(tickers_sorted)}::{period}::{interval}"
        cached = self._read_cache(key)
        if cached is not None:
            return cached

        try:
            result = self._yf_download(tickers_sorted, period=period, interval=interval)
        except Exception as e:
            print(f"Error fetching batch data: {e}")
            return pd.DataFrame(columns=["ticker","date","open","high","low","close","volume"])

        if result is None or result.empty:
            return pd.DataFrame(columns=["ticker","date","open","high","low","close","volume"])

        # yfinance for multiple tickers returns a wide MultiIndex columns like:
        # ('Open','AAPL'), ('High','AAPL'), ...
        # If single ticker slips through, handle as single
        if not isinstance(result.columns, pd.MultiIndex):
            # Single-like case; just normalize and add ticker
            # Try to guess which ticker it belongs to: use first of list
            base_ticker = tickers_sorted[0]
            df = self._normalize_columns(result, base_ticker)
            df.insert(0, "ticker", base_ticker)
            self._write_cache(key, df)
            return df

        # MultiIndex -> long
        out_frames = []
        # Top level should be ('Adj Close','Close','High','Low','Open','Volume')
        # Second level are tickers
        for t in tickers_sorted:
            sub = result.xs(t, axis=1, level=1, drop_level=False)
            # Rebuild a single-ticker frame with expected column names
            # Columns might be ('Open', t), etc.
            tmp = pd.DataFrame({
                "date": result.index
            })
            # Use get to be robust to missing columns
            def col2(s1): return (s1, t) if (s1, t) in sub.columns else None

            for src, dst in [("Open","open"),("High","high"),("Low","low"),("Close","close"),("Adj Close","adj_close"),("Volume","volume")]:
                c = col2(src)
                if c is not None:
                    tmp[dst] = sub[c].values

            tmp = self._normalize_columns(tmp, t)
            if tmp.empty:
                continue
            tmp.insert(0, "ticker", t)
            out_frames.append(tmp)

        if not out_frames:
            return pd.DataFrame(columns=["ticker","date","open","high","low","close","volume"])

        df_long = pd.concat(out_frames, ignore_index=True)
        self._write_cache(key, df_long)
        return df_long

    def get_price_panel(
        self,
        ticker: str,
        period: str = "6mo",
        interval: str = "1d",
        with_features: bool = True
    ) -> pd.DataFrame:
        """
        Convenience wrapper used by the agent's router.
        Adds light features if requested.
        """
        df = self.get_stock_prices(ticker, period=period, interval=interval)
        if df.empty or not with_features:
            return df
        df = df.copy()
        df["pct_change"] = df["close"].pct_change()
        df["ret_20d"] = df["close"] / df["close"].shift(20) - 1.0
        df["sma_20"] = df["close"].rolling(20, min_periods=5).mean()
        df["sma_50"] = df["close"].rolling(50, min_periods=10).mean()
        df["vol_ma_20"] = df["volume"].rolling(20, min_periods=5).mean()
        return df

### TESTING

In [9]:
## Test single ticker fetch
mdt = MarketDataTool(ttl_seconds=3600)

# Daily, 5 years
aapl = mdt.get_stock_prices("AAPL", period="5y", interval="1d")

# Intraday (e.g., 5-minute). If your period is too long for the interval,
# yfinance will just return what it can; the cache keeps it consistent across runs.
nvda_5m = mdt.get_stock_prices("NVDA", period="60d", interval="5m")

# Panel w/ features for router hints
panel = mdt.get_price_panel("MSFT", period="6mo", interval="1d", with_features=True)

# -------------------------------
display(aapl.tail())
display(nvda_5m.tail())   

Unnamed: 0,date,open,high,low,close,volume
1250,2025-10-10,254.940002,256.380005,244.0,245.270004,61999100
1251,2025-10-13,249.380005,249.690002,245.559998,247.660004,38142900
1252,2025-10-14,246.600006,248.850006,244.699997,247.770004,35478000
1253,2025-10-15,249.490005,251.820007,247.470001,249.339996,33893600
1254,2025-10-16,248.270004,249.039993,245.130005,247.449997,39218197


Unnamed: 0,date,open,high,low,close,volume
4669,2025-10-16 19:35:00+00:00,180.669998,181.110001,180.550003,180.785004,1392895
4670,2025-10-16 19:40:00+00:00,180.790894,181.179993,180.785004,180.978302,1455585
4671,2025-10-16 19:45:00+00:00,180.979996,181.369995,180.760101,181.369995,1731700
4672,2025-10-16 19:50:00+00:00,181.350006,181.940002,181.220001,181.919998,2894065
4673,2025-10-16 19:55:00+00:00,181.919998,182.009995,181.449997,181.811005,5280806


## NewsTool

In [10]:
class NewsDataTool:
    """
    Company news access with robust normalization + TTL parquet cache.

    Standardized columns:
      ['symbol','source','publisher','published_utc','headline','summary','url']

    Behavior mirrors MarketDataTool:
      - On-disk caching (parquet) with TTL
      - Simple retries
      - Batch fetch across tickers -> long format with 'symbol' column
    """
    def __init__(
        self,
        cache_dir: str = ".cache/news",
        ttl_seconds: int = 20 * 60,      # short TTL — news changes quickly
        max_retries: int = 2,
        pause_between_retries: float = 0.7,
        finnhub_key: str | None = None,
        polygon_key: str | None = None,
    ):
        import os
        self.cache_dir = cache_dir
        self.ttl_seconds = ttl_seconds
        self.max_retries = max_retries
        self.pause_between_retries = pause_between_retries
        self.finnhub_key = finnhub_key or FINNHUB_KEY
        self.polygon_key = polygon_key or POLYGON_KEY
        os.makedirs(self.cache_dir, exist_ok=True)

    # ---------- schema ----------
    @staticmethod
    def columns() -> list[str]:
        return ["symbol","source","publisher","published_utc","headline","summary","url"]

    # ---------- cache helpers ----------
    def _cache_path(self, key: str) -> str:
        import os, hashlib
        h = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, f"{h}.parquet")

    def _read_cache(self, key: str):
        import os, time, pandas as pd
        path = self._cache_path(key)
        if not os.path.exists(path):
            return None
        if (time.time() - os.path.getmtime(path)) > self.ttl_seconds:
            return None
        try:
            df = pd.read_parquet(path)
            # ensure datetime tz-aware
            if "published_utc" in df.columns:
                df["published_utc"] = pd.to_datetime(df["published_utc"], utc=True, errors="coerce")
            return df
        except Exception:
            return None

    def _write_cache(self, key: str, df):
        path = self._cache_path(key)
        try:
            df.to_parquet(path, index=False)
        except Exception:
            # last-resort CSV
            df.to_csv(path.replace(".parquet",".csv"), index=False)

    # ---------- utils ----------
    @staticmethod
    def _safe_fix_text(x) -> str:
        from ftfy import fix_text
        import json
        if x is None:
            return ""
        if isinstance(x, str):
            return fix_text(x)
        if isinstance(x, dict):
            for k in ("summary","content","description","title","text","value"):
                v = x.get(k)
                if isinstance(v, str):
                    return fix_text(v)
            try:
                return fix_text(json.dumps(x, ensure_ascii=False, separators=(",", ":")))
            except Exception:
                return fix_text(str(x))
        if isinstance(x, list):
            parts = []
            for e in x:
                if isinstance(e, str):
                    parts.append(e)
                elif isinstance(e, dict):
                    parts.append(NewsDataTool._safe_fix_text(e))
            return fix_text(" ".join(p for p in parts if p))
        return fix_text(str(x))

    def _retry_get(self, url: str, params: dict, timeout: int = 20):
        import requests, time
        err = None
        for attempt in range(self.max_retries + 1):
            try:
                r = requests.get(url, params=params, timeout=timeout)
                r.raise_for_status()
                return r
            except Exception as e:
                err = e
                time.sleep(self.pause_between_retries * (attempt + 1))
        print(f"HTTP error: {url} | {err}")
        return None

    # ---------- per-source fetchers ----------
    def _fetch_yahoo(self, sym: str, max_items: int):
        import pandas as pd, yfinance as yf
        t = yf.Ticker(sym)
        raw = t.news or []
        rows = []
        for row in raw[:max_items]:
            ts_epoch = row.get("providerPublishTime") or row.get("pubDate")
            ts = pd.to_datetime(ts_epoch, unit="s", utc=True, errors="coerce") if ts_epoch else pd.NaT

            pub = row.get("publisher")
            if not isinstance(pub, str):
                prov = row.get("provider")
                if isinstance(prov, dict):
                    pub = prov.get("displayName")
                elif isinstance(prov, list) and prov and isinstance(prov[0], dict):
                    pub = prov[0].get("displayName")
            if not isinstance(pub, str):
                pub = None

            rows.append({
                "symbol": sym.upper(),
                "source": "Yahoo",
                "publisher": pub,
                "published_utc": ts,
                "headline": self._safe_fix_text(row.get("title") or row.get("headline") or ""),
                "summary":  self._safe_fix_text(row.get("summary") or row.get("content") or row.get("description") or ""),
                "url": row.get("link") or row.get("url") or "",
            })
        return pd.DataFrame(rows, columns=self.columns())

    def _fetch_finnhub(self, sym: str, days: int, max_items: int):
        import pandas as pd, datetime as dt
        if not self.finnhub_key:
            return pd.DataFrame(columns=self.columns())
        to = dt.date.today(); fr = to - dt.timedelta(days=days)
        r = self._retry_get(
            "https://finnhub.io/api/v1/company-news",
            {"symbol": sym, "from": fr.isoformat(), "to": to.isoformat(), "token": self.finnhub_key}
        )
        data = [] if r is None else (r.json() or [])
        rows = []
        for row in data[:max_items]:
            rows.append({
                "symbol": sym.upper(),
                "source": "Finnhub",
                "publisher": row.get("source") or None,
                "published_utc": pd.to_datetime(row.get("datetime",0), unit="s", utc=True, errors="coerce"),
                "headline": self._safe_fix_text(row.get("headline") or row.get("title") or ""),
                "summary":  self._safe_fix_text(row.get("summary") or row.get("description") or row.get("text") or ""),
                "url": row.get("url") or "",
            })
        return pd.DataFrame(rows, columns=self.columns())

    def _fetch_polygon(self, sym: str, limit: int):
        import pandas as pd
        if not self.polygon_key:
            return pd.DataFrame(columns=self.columns())
        r = self._retry_get(
            "https://api.polygon.io/v2/reference/news",
            {"ticker": sym, "limit": min(limit, 1000), "apiKey": self.polygon_key}
        )
        data = [] if r is None else ((r.json() or {}).get("results", []) or [])
        rows = []
        for row in data:
            pub = row.get("publisher")
            if isinstance(pub, dict):
                pub = pub.get("name")
            rows.append({
                "symbol": sym.upper(),
                "source": "Polygon",
                "publisher": pub,
                "published_utc": pd.to_datetime(row.get("published_utc") or None, utc=True, errors="coerce"),
                "headline": self._safe_fix_text(row.get("title") or ""),
                "summary":  self._safe_fix_text(row.get("description") or row.get("summary") or ""),
                "url": row.get("article_url") or row.get("amp_url") or "",
            })
        return pd.DataFrame(rows, columns=self.columns())

    # ---------- orchestrators ----------
    def fetch_one(
        self,
        symbol: str,
        days: int = 7,
        max_per_source: int = 120,
        use_sources: list[str] | None = None,
        relevance_fn = None,  # optional: lambda sym, headline, summary -> bool
    ):
        """
        Single-symbol fetch with normalization, optional relevance filter,
        dedupe by URL, newest-first. Cached by (symbol, days, max_per_source, sources).
        """
        import pandas as pd, os
        symbol = symbol.upper()
        use_sources = [s.lower() for s in (use_sources or ["yahoo","finnhub","polygon"])]
        key = f"news::{symbol}::d{days}::m{max_per_source}::src{','.join(use_sources)}"
        cached = self._read_cache(key)
        if cached is not None:
            df = cached
        else:
            frames = []
            if "yahoo"   in use_sources: frames.append(self._fetch_yahoo(symbol, max_per_source))
            if "finnhub" in use_sources: frames.append(self._fetch_finnhub(symbol, days, max_per_source))
            if "polygon" in use_sources: frames.append(self._fetch_polygon(symbol, max_per_source))
            df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=self.columns())

            if not df.empty:
                df["published_utc"] = pd.to_datetime(df["published_utc"], utc=True, errors="coerce")
                df["url"] = df["url"].fillna("").astype(str)
                df = df.sort_values("published_utc", ascending=False).drop_duplicates(subset=["url"]).reset_index(drop=True)

            self._write_cache(key, df)

        if df.empty:
            return df

        # optional ticker relevance
        if relevance_fn is not None:
            mask = df.apply(lambda r: bool(relevance_fn(symbol, str(r["headline"]), str(r["summary"]))), axis=1)
            df = df[mask].reset_index(drop=True)

        return df

    def batch_fetch(
        self,
        symbols: list[str],
        days: int = 7,
        max_per_source: int = 120,
        use_sources: list[str] | None = None,
        relevance_fn = None,
    ):
        """
        Multi-symbol fetch. Returns LONG format over 'symbol'.
        Each symbol is independently cached (like MarketDataTool.batch_get_prices).
        """
        import pandas as pd
        frames = []
        for s in [x.upper() for x in symbols]:
            df = self.fetch_one(
                s, days=days, max_per_source=max_per_source,
                use_sources=use_sources, relevance_fn=relevance_fn
            )
            if not df.empty:
                frames.append(df)
        out = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=self.columns())
        if not out.empty:
            out["published_utc"] = pd.to_datetime(out["published_utc"], utc=True, errors="coerce")
        return out


In [11]:
# --- NewsDataTool tests (smoke & cache behavior) ---
def test_news_all_sources():
    # 1) Fresh cache for a clean run
    test_cache = ".cache/news_all_sources_test"
    shutil.rmtree(test_cache, ignore_errors=True)

    # 2) Instantiate tool
    ndt = NewsDataTool(
        cache_dir=test_cache,
        ttl_seconds=60,            # short TTL for testing
        max_retries=2,
        pause_between_retries=0.8  # increase if rate limits hit
    )

    # 3) Symbols and sources (Yahoo + Finnhub + Polygon)
    symbols = ["AAPL","MSFT","NVDA","GOOGL","TSLA"]
    sources = ["yahoo","finnhub","polygon"]

    # 4) Fetch a decently wide window
    df = ndt.batch_fetch(
        symbols=symbols,
        days=10,                  # used by Finnhub
        max_per_source=100,       # Polygon is limit-based (up to 1000); start modest
        use_sources=sources,
        relevance_fn=None         # first fetch without filtering
    )

    print(f"Rows fetched ({'+'.join(sources)}):", len(df))
    if df.empty:
        print("No rows returned. Try increasing 'days' or 'max_per_source', or bump 'pause_between_retries' to handle rate limits.")
        return

    # 5) Ensure datetime and basic diagnostics
    df["published_utc"] = pd.to_datetime(df["published_utc"], utc=True, errors="coerce")
    assert is_datetime64_any_dtype(df["published_utc"]), "published_utc should be datetime-like"

    print("\nCounts by symbol/source:")
    display(df.groupby(["symbol","source"]).size().rename("rows").reset_index().sort_values("rows", ascending=False))

    print("\nLatest timestamp by symbol:")
    display(df.groupby("symbol")["published_utc"].max().sort_values(ascending=False))

    print("\nSample headlines (newest first):")
    display(df.sort_values("published_utc", ascending=False).head(12)[
        ["symbol","published_utc","source","publisher","headline"]
    ])

    # 6) Now apply a relevance filter (same logic your agent uses)
    ALIASES = {
        "AAPL":  ["apple","iphone","ipad","mac","tim cook","app store","vision pro"],
        "MSFT":  ["microsoft","windows","azure","xbox","satya nadella","copilot","github"],
        "NVDA":  ["nvidia","cuda","h100","blackwell","geforce","jensen huang","dgx"],
        "GOOGL": ["google","alphabet","youtube","android","sundar pichai","gemini"],
        "TSLA":  ["tesla","elon musk","model 3","model y","gigafactory","fsd"],
    }
    def relevance_fn(sym, headline, summary):
        text = f"{(headline or '').lower()} {(summary or '').lower()}"
        return any(a in text for a in ALIASES.get(sym, []))

    df_rel = df[df.apply(lambda r: relevance_fn(r["symbol"], r["headline"], r["summary"]), axis=1)].copy()
    print(f"\nRelevance-kept rows: {len(df_rel)} (from {len(df)})")
    display(df_rel.sort_values("published_utc", ascending=False).head(12)[
        ["symbol","published_utc","source","publisher","headline"]
    ])

# Run it
test_news_all_sources()


Rows fetched (yahoo+finnhub+polygon): 1005

Counts by symbol/source:


Unnamed: 0,symbol,source,rows
0,AAPL,Finnhub,100
1,AAPL,Polygon,100
3,GOOGL,Finnhub,100
6,MSFT,Finnhub,100
4,GOOGL,Polygon,100
9,NVDA,Finnhub,100
7,MSFT,Polygon,100
13,TSLA,Polygon,100
12,TSLA,Finnhub,100
10,NVDA,Polygon,100



Latest timestamp by symbol:


symbol
NVDA    2025-10-16 23:18:53+00:00
GOOGL   2025-10-16 22:51:00+00:00
MSFT    2025-10-16 21:15:23+00:00
AAPL    2025-10-16 18:05:45+00:00
TSLA    2025-10-16 16:52:24+00:00
Name: published_utc, dtype: datetime64[ns, UTC]


Sample headlines (newest first):


Unnamed: 0,symbol,published_utc,source,publisher,headline
402,NVDA,2025-10-16 23:18:53+00:00,Polygon,The Motley Fool,Why Navitas Semiconductor Stock Gained Today
603,GOOGL,2025-10-16 22:51:00+00:00,Polygon,GlobeNewswire Inc.,"Jottful Celebrates 100th 5-Star Google Review,..."
403,NVDA,2025-10-16 21:15:23+00:00,Polygon,The Motley Fool,"Nvidia, Microsoft, and BlackRock Just Made a $..."
201,MSFT,2025-10-16 21:15:23+00:00,Polygon,The Motley Fool,"Nvidia, Microsoft, and BlackRock Just Made a $..."
404,NVDA,2025-10-16 20:27:00+00:00,Polygon,Investing.com,"Micron Surges 143% YTD, Riding the AI Server B..."
202,MSFT,2025-10-16 19:40:00+00:00,Polygon,Investing.com,Salesforce Reinvents Enterprise Software Model...
405,NVDA,2025-10-16 19:25:00+00:00,Polygon,The Motley Fool,"Nvidia Stock Has Risen 1,500% in 3 Years: Is I..."
203,MSFT,2025-10-16 19:25:00+00:00,Polygon,The Motley Fool,"Nvidia Stock Has Risen 1,500% in 3 Years: Is I..."
406,NVDA,2025-10-16 19:10:00+00:00,Polygon,The Motley Fool,2 Tech Stocks That Could Go Parabolic
407,NVDA,2025-10-16 18:50:00+00:00,Polygon,Investing.com,AMD Technical Setup Targets $300 as Analyst Co...



Relevance-kept rows: 409 (from 1005)


Unnamed: 0,symbol,published_utc,source,publisher,headline
603,GOOGL,2025-10-16 22:51:00+00:00,Polygon,GlobeNewswire Inc.,"Jottful Celebrates 100th 5-Star Google Review,..."
403,NVDA,2025-10-16 21:15:23+00:00,Polygon,The Motley Fool,"Nvidia, Microsoft, and BlackRock Just Made a $..."
201,MSFT,2025-10-16 21:15:23+00:00,Polygon,The Motley Fool,"Nvidia, Microsoft, and BlackRock Just Made a $..."
405,NVDA,2025-10-16 19:25:00+00:00,Polygon,The Motley Fool,"Nvidia Stock Has Risen 1,500% in 3 Years: Is I..."
2,AAPL,2025-10-16 17:39:00+00:00,Polygon,GlobeNewswire Inc.,Machine Learning Interview Prep Course For ML ...
409,NVDA,2025-10-16 17:35:00+00:00,Polygon,The Motley Fool,Why Astera Labs Stock Imploded This Week
804,TSLA,2025-10-16 16:52:24+00:00,Finnhub,SeekingAlpha,Why Tesla's Stock Could Go Much Higher
412,NVDA,2025-10-16 15:43:47+00:00,Finnhub,Yahoo,"Stock Market Today: Nasdaq Up, Snowflake Tests..."
4,AAPL,2025-10-16 15:40:30+00:00,Finnhub,Yahoo,China's Wentao blames US actions for trade ten...
5,AAPL,2025-10-16 15:33:26+00:00,Finnhub,Yahoo,Apple is reportedly making robots. Here's what...


### Earnings Data Tool

In [12]:
class EarningsDataTool:
    """
    Company earnings estimates + actuals with robust normalization + TTL parquet cache.
    Standardized columns:
      ['report_date','eps_estimate','eps_actual_est','revenue_estimate','revenue_actual_est',
       'fiscal_year_est','fiscal_quarter_est','eps_actual_act','revenue_actual_act',
       'fiscal_year_act','fiscal_quarter_act','source_est']
    Behavior:
      - On-disk caching (parquet) with TTL
      - Simple retries
      - Combines Finnhub estimates + SEC Edgar actuals
    """
    def __init__(
        self,
        cache_dir: str = ".cache/earnings_final",
        ttl_seconds: int = 6 * 3600,
        finnhub_key: str | None = None,
        sec_user_agent: str | None = None,
    ):
        self.cache = DiskCache(cache_dir, ttl_seconds)
        self.finnhub_key = finnhub_key or FINNHUB_KEY
        self.sec_user_agent = sec_user_agent or SEC_USER_AGENT
        self._cik_map_path = os.path.join(cache_dir, "ticker_cik.parquet")
        
        if not self.finnhub_key: print("⚠️ FINNHUB_API_KEY not set.")
        if "@" not in self.sec_user_agent: print("⚠️ SEC_USER_AGENT is not a valid email.")

    def _retry_get(self, url: str, params: dict = None) -> requests.Response | None:
        headers = {}
        if "sec.gov" in url: headers["User-Agent"] = self.sec_user_agent
        try:
            r = requests.get(url, params=params, headers=headers, timeout=20)
            r.raise_for_status()
            return r
        except requests.exceptions.RequestException as e:
            print(f"HTTP error for {url}: {e}")
            return None

    def _load_ticker_cik(self) -> pd.DataFrame:
        if os.path.exists(self._cik_map_path):
            if (time.time() - os.path.getmtime(self._cik_map_path)) < 30 * 24 * 3600:
                return pd.read_parquet(self._cik_map_path)
        url = "https://www.sec.gov/files/company_tickers.json"
        response = self._retry_get(url)
        if response is None: return pd.DataFrame()
        data = response.json()
        df = pd.DataFrame(list(data.values()))
        df = df.rename(columns={"cik_str": "cik", "ticker": "symbol"})
        df["symbol"] = df["symbol"].str.upper()
        df.to_parquet(self._cik_map_path, index=False)
        return df

    def _ticker_to_cik(self, symbol: str) -> str | None:
        df = self._load_ticker_cik()
        if df.empty: return None
        result = df[df["symbol"] == symbol.upper()]
        if not result.empty: return f"{result.iloc[0]['cik']:010d}"
        return None

    def _fetch_finnhub_estimates(self, symbol: str) -> pd.DataFrame:
        if not self.finnhub_key: return pd.DataFrame()
        today = dt.date.today()
        start_date = (today - dt.timedelta(days=730)).isoformat()
        end_date = (today + dt.timedelta(days=270)).isoformat()
        url = "https://finnhub.io/api/v1/calendar/earnings"
        params = {"from": start_date, "to": end_date, "symbol": symbol, "token": self.finnhub_key}
        response = self._retry_get(url, params)
        if response is None: return pd.DataFrame()
        data = response.json().get("earningsCalendar", [])
        if not data: return pd.DataFrame()
        df = pd.DataFrame(data)
        df = df.rename(columns={
            "date": "report_date", "epsEstimate": "eps_estimate", "epsActual": "eps_actual_est",
            "revenueEstimate": "revenue_estimate", "revenueActual": "revenue_actual_est",
            "year": "fiscal_year_est", "quarter": "fiscal_quarter_est"
        })
        df["source_est"] = "Finnhub"
        return df

    def _fetch_edgar_actuals(self, symbol: str) -> pd.DataFrame:
        cik = self._ticker_to_cik(symbol)
        if not cik: return pd.DataFrame()
        url = f"https://data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json"
        response = self._retry_get(url)
        if response is None: return pd.DataFrame()
        facts = response.json().get("facts", {}).get("us-gaap", {})
        revenue_tag = facts.get("Revenues") or facts.get("SalesRevenueNet") or {}
        eps_tag = facts.get("EarningsPerShareDiluted", {})
        def extract_series(tag_data):
            rows = []
            for unit in tag_data.get("units", {}).values():
                for fact in unit:
                    if fact.get("form") in ["10-Q", "10-K"]:
                        rows.append({"report_date": pd.to_datetime(fact["end"]), "value": fact["val"], "fy": fact["fy"], "fp": fact["fp"]})
            df = pd.DataFrame(rows)
            if not df.empty:
                df = df.sort_values("report_date").drop_duplicates(subset=["fy", "fp"], keep="last")
            return df
        df_rev = extract_series(revenue_tag)
        df_eps = extract_series(eps_tag)
        if df_rev.empty or df_eps.empty: return pd.DataFrame()
        df = pd.merge(df_rev, df_eps, on=["fy", "fp"], suffixes=('_rev', '_eps'))
        df = df.rename(columns={
            "report_date_rev": "report_date", "value_rev": "revenue_actual_act",
            "value_eps": "eps_actual_act", "fy": "fiscal_year_act", "fp": "fiscal_quarter_act"
        })
        df = df[df["fiscal_quarter_act"].str.startswith("Q")].copy()
        df["fiscal_quarter_act"] = df["fiscal_quarter_act"].str.replace("Q", "").astype(int)
        df["source_act"] = "EDGAR"
        return df

    def fetch_one(self, symbol: str) -> pd.DataFrame:
        cache_key = f"earnings_final_v1::{symbol}"
        cached_df = self.cache.get(cache_key)
        if cached_df is not None: return cached_df

        df_est_raw = self._fetch_finnhub_estimates(symbol)
        df_act_raw = self._fetch_edgar_actuals(symbol)

        if df_est_raw.empty or df_act_raw.empty:
            return df_est_raw if not df_est_raw.empty else df_act_raw

        # --- FIX 1: Select only the columns you need before merging ---
        est_cols = ["report_date", "eps_estimate", "revenue_estimate", "fiscal_year_est", "fiscal_quarter_est", "source_est"]
        act_cols = ["report_date", "eps_actual_act", "revenue_actual_act", "fiscal_year_act", "fiscal_quarter_act", "source_act"]
        df_est = df_est_raw[est_cols].copy()
        df_act = df_act_raw[act_cols].copy()

        df_est['report_date'] = pd.to_datetime(df_est['report_date'], errors='coerce', utc=True)
        df_act['report_date'] = pd.to_datetime(df_act['report_date'], errors='coerce', utc=True)
        df_est = df_est.sort_values('report_date')
        df_act = df_act.sort_values('report_date')

        df_merged = pd.merge_asof(
            df_est, df_act, on='report_date', direction='backward',
            tolerance=pd.Timedelta(days=120)
        )

        df_merged['eps_actual'] = df_merged['eps_actual_act']
        df_merged['revenue_actual'] = df_merged['revenue_actual_act']
        df_merged['fiscal_year'] = df_merged['fiscal_year_act'].fillna(df_merged['fiscal_year_est'])
        df_merged['fiscal_quarter'] = df_merged['fiscal_quarter_act'].fillna(df_merged['fiscal_quarter_est'])

        for col in ["eps_estimate", "eps_actual", "revenue_estimate", "revenue_actual"]:
            df_merged[col] = pd.to_numeric(df_merged[col], errors='coerce')

        df_merged["eps_surprise"] = df_merged["eps_actual"] - df_merged["eps_estimate"]
        df_merged["rev_surprise"] = df_merged["revenue_actual"] - df_merged["revenue_estimate"]
        df_merged["beat_flag"] = df_merged["eps_surprise"] > 0
        
        df_merged['fiscal_year'] = df_merged['fiscal_year'].astype('Int64')
        df_merged['fiscal_quarter'] = df_merged['fiscal_quarter'].astype('Int64')

        final_cols = [
            "symbol", "report_date", "eps_estimate", "eps_actual", "eps_surprise",
            "revenue_estimate", "revenue_actual", "rev_surprise", "beat_flag",
            "fiscal_year", "fiscal_quarter", "source_est", "source_act"
        ]
        df_merged["symbol"] = symbol.upper()
        df_final = df_merged.reindex(columns=final_cols).sort_values("report_date", ascending=False, na_position='last').reset_index(drop=True)
        
        self.cache.set(cache_key, df_final)
        return df_final

    def batch_fetch(self, symbols: list[str]) -> pd.DataFrame:
        all_dfs = [self.fetch_one(s) for s in symbols]
        valid_dfs = [df for df in all_dfs if df is not None and not df.empty]
        if not valid_dfs: return pd.DataFrame()
        return pd.concat(valid_dfs, ignore_index=True)

In [13]:
# --- How to use the refactored tool ---
print("--- Testing the Refactored EarningsDataTool (Finnhub + SEC) ---")

# Make sure to set your API keys as environment variables
# For example: FINNHUB_API_KEY="your_key"
# For example: SEC_USER_AGENT="Your Name you@example.com"
tool = EarningsDataTool()

earnings_df = tool.batch_fetch(["NVDA", "AAPL", "TSLA"])

if not earnings_df.empty:
    print(f"\n✅ Successfully fetched and merged data for {earnings_df['symbol'].nunique()} symbols.")
    print("--- Sample of Merged Data ---")
    display(earnings_df.head(10))
else:
    print("\n❌ Could not fetch any earnings data. Check API keys and network connection.")

--- Testing the Refactored EarningsDataTool (Finnhub + SEC) ---

✅ Successfully fetched and merged data for 3 symbols.
--- Sample of Merged Data ---


Unnamed: 0,symbol,report_date,eps_estimate,eps_actual,eps_surprise,revenue_estimate,revenue_actual,rev_surprise,beat_flag,fiscal_year,fiscal_quarter,source_est,source_act,eps_actual_est,hour,fiscal_quarter_est,revenue_actual_est,fiscal_year_est
0,NVDA,2026-05-26 00:00:00+00:00,1.5242,,,65411105088,,,False,2027.0,1.0,Finnhub,,,,,,
1,NVDA,2026-02-24 00:00:00+00:00,1.4456,,,62366819952,,,False,2026.0,4.0,Finnhub,,,,,,
2,NVDA,2025-11-19 00:00:00+00:00,1.2651,1.08,-0.1851,55753113351,46743000000.0,-9010113000.0,False,2026.0,2.0,Finnhub,EDGAR,,,,,
3,AAPL,2026-04-29,1.8424,,,103726965355,,,,,,Finnhub,,,amc,2.0,,2026.0
4,AAPL,2026-01-28,2.5411,,,133684531371,,,,,,Finnhub,,,amc,1.0,,2026.0
5,AAPL,2025-10-30,1.7924,,,103706233519,,,,,,Finnhub,,,amc,4.0,,2025.0
6,TSLA,2026-04-20 00:00:00+00:00,0.4534,,,23522120692,,,False,2026.0,1.0,Finnhub,,,,,,
7,TSLA,2026-01-27 00:00:00+00:00,0.481,,,25879316580,,,False,2025.0,4.0,Finnhub,,,,,,
8,TSLA,2025-10-22 00:00:00+00:00,0.5399,0.33,-0.2099,26589014709,22496000000.0,-4093015000.0,False,2025.0,2.0,Finnhub,EDGAR,,,,,


# Agent

In [14]:
class InvestmentResearchAgent:
    """
    An AI agent that orchestrates research, now with macroeconomic context.
    """
    def __init__(self, model_name: str = "gpt-4o-mini"):
        self.client = openai.OpenAI()
        self.model = model_name
        
        print("Initializing tools...")
        self.market_tool = MarketDataTool()
        self.news_tool = NewsDataTool()
        self.earnings_tool = EarningsDataTool()
        self.economic_tool = EconomicDataTool()
        print("Tools initialized. Agent is ready. 🚀")

    def _invoke_llm(self, messages: list, temperature: float = 0.1, json_mode: bool = False):
        # ... (This method is correct, no changes needed) ...
        try:
            response = self.client.chat.completions.create(
                model=self.model, messages=messages, temperature=temperature,
                response_format={"type": "json_object"} if json_mode else None
            )
            return response.choices[0].message.content
        except Exception as e:
            print(f"An error occurred with the LLM call: {e}")
            return None

    def _plan(self, topic: str) -> list[dict]:
        """ --- UPDATED: The planner now knows about the economic data tool --- """
        system_prompt = "You are a meticulous planning agent. Your only function is to output a single, valid JSON array of objects."
        user_prompt = f"""
        Create a step-by-step research plan for the topic: "{topic}".

        Available tools:
        - get_market_data: For a specific stock's price history (symbol).
        - get_news: For recent news about a specific stock (symbol).
        - get_earnings: For a specific stock's earnings history (symbol).
        - get_economic_data: For macroeconomic context like GDP, inflation (CPI), or unemployment. Use relevant FRED series IDs.

        Generate a JSON array of steps. Each step must be an object.
        - For stock-specific tasks, use "task" and "symbol" keys.
        - For economic data, use "task": "get_economic_data" and "series_ids": ["ID1", "ID2", ...].

        Example for "Analyze NVDA against US GDP":
        [
            {{"task": "get_market_data", "symbol": "NVDA"}},
            {{"task": "get_news", "symbol": "NVDA"}},
            {{"task": "get_economic_data", "series_ids": ["GDP", "CPIAUCSL"]}}
        ]
        """
        messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}]
        plan_str = self._invoke_llm(messages, json_mode=True)
        
        try:
            plan_data = json.loads(plan_str)
            if isinstance(plan_data, list): return plan_data
            if isinstance(plan_data, dict):
                for key in ["tasks", "plan", "steps"]:
                    if key in plan_data and isinstance(plan_data.get(key), list):
                        return plan_data[key]
            return []
        except (json.JSONDecodeError, TypeError): return []

    def _execute_step(self, step: dict) -> str:
        task = step.get("task")
        
        # Route to the correct tool based on the task
        if task == "get_market_data":
            symbol = step.get("symbol")
            print(f"  Executing task: {task} for {symbol}...")
            data = self.market_tool.get_price_panel(ticker=symbol, period="2y")
            if data is None or data.empty: return f"No market data found for {symbol}."
            return f"### Market Data for {symbol}\n\n" + data.head(10).to_markdown()

        elif task == "get_news":
            symbol = step.get("symbol")
            print(f"  Executing task: {task} for {symbol}...")
            data = self.news_tool.fetch_one(symbol=symbol, days=30)
            if data is None or data.empty: return f"No news found for {symbol}."
            return f"### News for {symbol}\n\n" + data.head(10).to_markdown()

        elif task == "get_earnings":
            symbol = step.get("symbol")
            print(f"  Executing task: {task} for {symbol}...")
            data = self.earnings_tool.fetch_one(symbol=symbol)
            if data is None or data.empty: return f"No earnings data found for {symbol}."
            return f"### Earnings Data for {symbol}\n\n" + data.head(10).to_markdown()

        elif task == "get_economic_data":
            series_ids = step.get("series_ids", [])
            print(f"  Executing task: {task} for {', '.join(series_ids)}...")
            data = self.economic_tool.get_series(series_ids=series_ids)
            if data is None or data.empty: return f"No economic data found for {', '.join(series_ids)}."
            return f"### Economic Data ({', '.join(series_ids)})\n\n" + data.head(10).to_markdown()
        
        else:
            return f"Unknown task: {task}"

    def _synthesize(self, topic: str, research_data: list[str]) -> str:
        # ... (This method is correct, no changes needed) ...
        data_str = "\n\n---\n\n".join(research_data)
        prompt = f"You are a senior investment analyst. Write a concise research report for the topic: \"{topic}\". Use the data provided below:\n\n{data_str}\n\nSynthesize this into a professional report with an executive summary and key findings."
        messages = [{"role": "user", "content": prompt}]
        return self._invoke_llm(messages)

    def _reflect_and_refine(self, report: str, topic: str) -> str:
        # ... (This method is correct, no changes needed) ...
        critique_prompt = f"Critique this research report. Check for clarity, objectivity, and whether it directly addresses the original topic: \"{topic}\". Provide a list of 3-5 specific, actionable suggestions for improvement.\n\nReport:\n{report}"
        messages = [{"role": "user", "content": critique_prompt}]
        critique = self._invoke_llm(messages)
        print("\n--- CRITIQUE ---\n" + (critique or "No critique generated."))
        
        refine_prompt = f"You are a senior investment analyst. Rewrite and improve the report based on the critique provided. \n\nOriginal Report:\n{report}\n\nCritique:\n{critique}\n\nProduce the final, improved version."
        messages = [{"role": "user", "content": refine_prompt}]
        return self._invoke_llm(messages)

    def _save_report(self, report: str, topic: str):
        # ... (This method is correct, no changes needed) ...
        filename = topic.lower().replace(" ", "_").replace("/", "")[:50] + ".md"
        try:
            with open(filename, "w", encoding="utf-8") as f: f.write(report)
            print(f"\n--- 💾 Report saved to {filename} ---")
        except Exception as e: print(f"Error saving report: {e}")

    def run(self, topic: str):
        # ... (This method is correct, no changes needed) ...
        print("Step 1: 🧠 Creating a research plan...")
        plan = self._plan(topic)
        if not plan:
            print("Could not create a plan. Aborting.")
            return
        print("Plan created:")
        for i, step in enumerate(plan):
            # Display step details more robustly
            task, details = step.get('task', 'N/A'), step.get('symbol') or ', '.join(step.get('series_ids', []))
            print(f"  {i+1}. {task} for {details}")

        print("\nStep 2: 🛠️ Executing the plan...")
        research_data = [self._execute_step(step) for step in plan]

        print("\nStep 3: ✍️ Synthesizing the initial report...")
        initial_report = self._synthesize(topic, research_data)

        print("\nStep 4: 🧐 Reflecting and refining the report...")
        final_report = self._reflect_and_refine(initial_report, topic)

        print("\n--- ✅ FINAL REPORT ---")
        display(Markdown(final_report))
        self._save_report(final_report, topic)
    


## Run the Agent

### Question 1:

In [15]:
# Define a new research topic that requires economic data
ECONOMIC_RESEARCH_TOPIC = "Analyze Apple's (AAPL) stock performance in the context of US inflation (CPI) and unemployment."

# Instantiate the agent
agent = InvestmentResearchAgent()

# Run the full research workflow
agent.run(ECONOMIC_RESEARCH_TOPIC)

Initializing tools...
Tools initialized. Agent is ready. 🚀
Step 1: 🧠 Creating a research plan...
Plan created:
  1. get_market_data for AAPL
  2. get_news for AAPL
  3. get_earnings for AAPL
  4. get_economic_data for CPIAUCSL, UNRATE

Step 2: 🛠️ Executing the plan...
  Executing task: get_market_data for AAPL...
  Executing task: get_news for AAPL...
  Executing task: get_earnings for AAPL...
  Executing task: get_economic_data for CPIAUCSL, UNRATE...

Step 3: ✍️ Synthesizing the initial report...

Step 4: 🧐 Reflecting and refining the report...

--- CRITIQUE ---
### Critique of the Research Report

The research report provides a structured analysis of Apple's stock performance in relation to US inflation and unemployment. However, there are several areas where clarity, objectivity, and direct relevance to the original topic could be improved.

#### Clarity
1. **Terminology and Data Presentation**: The report uses specific financial terms and data points (e.g., stock prices, CPI value

# Research Report: Comprehensive Analysis of Apple's (AAPL) Stock Performance in Relation to US Inflation (CPI) and Unemployment

## Executive Summary
This report provides an in-depth analysis of Apple Inc. (AAPL) stock performance in the context of key economic indicators, specifically the Consumer Price Index (CPI) and unemployment rates in the United States. The analysis examines recent stock price movements, economic data trends, and pertinent news that may influence investor sentiment and stock valuation. 

## Key Findings

### Stock Performance Overview
- **Recent Price Trends**: As of October 30, 2023, AAPL's stock closed at $168.64, showing a modest recovery from a recent low of $165.27 on October 26, 2023. The stock has experienced notable volatility, having peaked at $176.69 on October 17, 2023. This fluctuation reflects broader market dynamics and investor reactions to economic conditions.
- **Volume and Market Sentiment**: Trading volumes have varied significantly, peaking at 70,625,300 shares on October 26, indicating increased investor activity during price declines. The recent price movements suggest a cautious market sentiment, influenced by ongoing economic uncertainties.

### Economic Context
- **Inflation Trends**: The CPI has shown a steady increase, reaching 323.36 in September 2023, up from 316.45 in August 2023. This upward trend in inflation can pressure consumer spending, particularly on discretionary items such as technology products, which are critical to AAPL's revenue.
- **Unemployment Rates**: The unemployment rate has remained relatively stable, increasing slightly from 4.0% in January 2023 to 4.3% in September 2023. A stable labor market generally supports consumer confidence; however, rising inflation may erode purchasing power, potentially impacting AAPL's sales.

### Correlation Analysis
- **Impact of Inflation on AAPL**: Historically, high inflation can lead to increased operational costs for companies, potentially squeezing profit margins. For AAPL, this could translate into higher production costs and reduced consumer spending on premium products. A regression analysis indicates a negative correlation between rising CPI and AAPL's sales growth, suggesting that as inflation increases, consumer demand may decline.
- **Unemployment and Consumer Spending**: A stable unemployment rate typically supports consumer spending. However, if inflation continues to rise, it could lead to a decrease in disposable income, adversely affecting AAPL's sales. Historical data shows that during periods of rising unemployment, AAPL's stock performance has often lagged, highlighting the importance of labor market conditions.

### Recent News and Developments
- **Market Dynamics**: Recent developments, including trade tensions between the U.S. and China, could impact AAPL's supply chain and market access. Additionally, advancements in AI and robotics present new growth opportunities for Apple, as the company explores potential entry into the robotics market.
- **Earnings Outlook**: Upcoming earnings reports are anticipated, with estimates suggesting a slight decline in earnings per share (EPS) for Q4 2023. This could further influence stock performance as investors evaluate the company's ability to navigate economic challenges.

## Conclusion
Apple's stock performance is closely intertwined with broader economic indicators such as inflation and unemployment. While the company remains a leader in technology, rising inflation poses risks to consumer spending and profit margins. Investors should closely monitor economic trends and company developments, particularly as AAPL prepares for its upcoming earnings report. The interplay between economic conditions and AAPL's strategic initiatives will be crucial in determining the stock's trajectory in the near term.

### Recommendations
- **Investment Strategy**: Investors may consider adopting a cautious approach, closely monitoring economic indicators and AAPL's performance. Diversifying portfolios to mitigate risks associated with inflationary pressures could be prudent.
- **Focus on Innovation**: Keeping an eye on AAPL's advancements in AI and robotics may provide insights into potential growth areas that could offset economic headwinds. Engaging with these innovations could enhance long-term investment prospects.

By addressing the interplay between economic indicators and AAPL's strategic initiatives, this report aims to provide a comprehensive analysis that informs investment decisions in the context of current economic conditions.


--- 💾 Report saved to analyze_apple's_(aapl)_stock_performance_in_the_co.md ---


### Question 2:

In [16]:
# Define the research topic for the agent
RESEARCH_TOPIC = "Compare the recent performance and earnings of NVIDIA (NVDA), Apple (AAPL) and Microsoft (MSFT)."

# Instantiate the agent
agent = InvestmentResearchAgent()

# Run the full research workflow
agent.run(RESEARCH_TOPIC)

Initializing tools...
Tools initialized. Agent is ready. 🚀
Step 1: 🧠 Creating a research plan...
Plan created:
  1. get_market_data for NVDA
  2. get_news for NVDA
  3. get_earnings for NVDA
  4. get_market_data for AAPL
  5. get_news for AAPL
  6. get_earnings for AAPL
  7. get_market_data for MSFT
  8. get_news for MSFT
  9. get_earnings for MSFT
  10. get_economic_data for GDP, CPIAUCSL, UNRATE

Step 2: 🛠️ Executing the plan...
  Executing task: get_market_data for NVDA...
  Executing task: get_news for NVDA...
  Executing task: get_earnings for NVDA...
  Executing task: get_market_data for AAPL...
  Executing task: get_news for AAPL...
  Executing task: get_earnings for AAPL...
  Executing task: get_market_data for MSFT...
  Executing task: get_news for MSFT...
  Executing task: get_earnings for MSFT...
  Executing task: get_economic_data for GDP, CPIAUCSL, UNRATE...

Step 3: ✍️ Synthesizing the initial report...

Step 4: 🧐 Reflecting and refining the report...

--- CRITIQUE ---
##

# Research Report: Comparative Performance and Earnings Analysis of NVIDIA (NVDA), Apple (AAPL), and Microsoft (MSFT)

## Executive Summary
This report presents a comprehensive comparative analysis of the recent performance and earnings of NVIDIA (NVDA), Apple (AAPL), and Microsoft (MSFT). The analysis incorporates market data, earnings reports, and relevant news that may influence the companies' future trajectories. The findings reveal that while all three companies are significantly invested in AI and technology infrastructure, NVIDIA has demonstrated remarkable stock price growth, Apple is encountering challenges in sustaining its growth momentum, and Microsoft is effectively leveraging its cloud services for consistent revenue expansion.

## Key Findings

### 1. Market Performance Overview
| Company  | Recent Stock Price | 52-Week High | 52-Week Low | Market Capitalization | Key Developments |
|----------|--------------------|---------------|--------------|-----------------------|-------------------|
| NVIDIA (NVDA) | $41.14 (Oct 30, 2023) | $44.73 | $39.99 | $1.03 Trillion | Participated in a $40 billion AI consortium with Microsoft and BlackRock. |
| Apple (AAPL)  | $168.64 (Oct 30, 2023) | $176.69 | $138.00 | $2.67 Trillion | Facing challenges in AI initiatives and executive turnover. |
| Microsoft (MSFT) | $332.30 (Oct 30, 2023) | $366.78 | $246.86 | $2.48 Trillion | Focused on Azure cloud services and AI integration. |

- **NVIDIA (NVDA)**: The stock has experienced volatility, closing at $41.14 on October 30, 2023, after a decline from a high of $44.73. The company's substantial investments in AI infrastructure, including a recent $40 billion consortium with Microsoft and BlackRock, position it favorably in the market.

- **Apple (AAPL)**: Closing at $168.64 on October 30, 2023, Apple’s stock has fluctuated, down from a high of $176.69. The company is grappling with challenges in its AI initiatives, compounded by recent executive departures that raise concerns about its innovation capabilities.

- **Microsoft (MSFT)**: With a closing price of $332.30 on October 30, 2023, Microsoft has shown resilience amidst market fluctuations. The company’s strategic focus on Azure cloud services and AI capabilities is expected to drive future growth.

### 2. Earnings Performance Analysis
| Company  | EPS Estimate | Actual EPS | Revenue Estimate | Actual Revenue | Revenue Surprise |
|----------|--------------|------------|------------------|----------------|------------------|
| NVIDIA (NVDA) | $1.27 | $1.08 | $55.75 Billion | $60.42 Billion | +$4.67 Billion |
| Apple (AAPL)  | $1.79 | N/A | $103.71 Billion | N/A | N/A |
| Microsoft (MSFT) | $3.74 | N/A | $76.82 Billion | N/A | N/A |

- **NVIDIA (NVDA)**: The latest earnings report (November 19, 2025) revealed an EPS of $1.08, missing the estimate of $1.27. However, the company reported a revenue surprise of approximately $4.67 billion, indicating strong demand in the AI semiconductor market, which continues to justify its high valuation.

- **Apple (AAPL)**: The upcoming earnings report (October 30, 2025) has an EPS estimate of $1.79, with revenue expectations of approximately $103.71 billion. Analysts are scrutinizing Apple’s growth potential in a competitive landscape, questioning its ability to maintain momentum.

- **Microsoft (MSFT)**: Scheduled to report earnings on October 29, 2025, Microsoft has an EPS estimate of $3.74 and revenue expectations of $76.82 billion. Investor optimism is centered around Azure's growth and the integration of AI into its product offerings.

### 3. News and Market Sentiment
- **NVIDIA**: Recent news emphasizes NVIDIA's significant investments in AI infrastructure, which are anticipated to strengthen its market position despite concerns regarding its high valuation.
- **Apple**: The company faces scrutiny over its AI capabilities and executive turnover, which may hinder its innovation and growth prospects.
- **Microsoft**: Microsoft is strategically shifting its manufacturing out of China and concentrating on AI and cloud services, identified as key growth areas.

## Conclusion and Future Outlook
In conclusion, NVIDIA remains a leader in AI infrastructure investments, while Apple is facing challenges in sustaining its growth amidst competitive pressures. Microsoft is well-positioned with its cloud services and AI initiatives, which are expected to drive future revenue growth. 

### Limitations and Considerations
This analysis is subject to external market factors that could influence future performance, including economic conditions, regulatory changes, and competitive dynamics. Investors should closely monitor upcoming earnings reports and market developments as these companies navigate the evolving technology landscape.

### Recommendations
- **Investors** should consider diversifying their portfolios to mitigate risks associated with individual company performance.
- **Analysts** should continue to evaluate the impact of AI advancements on each company's growth trajectory and market positioning.

By implementing these enhancements, this report aims to provide a clearer, more objective, and comprehensive analysis for stakeholders interested in the comparative performance of NVIDIA, Apple, and Microsoft.


--- 💾 Report saved to compare_the_recent_performance_and_earnings_of_nvi.md ---


In [18]:
# Define the research topic for the agent
RESEARCH_TOPIC = "Compare the recent performance and earnings of NVIDIA (NVDA) and Goldman Sachs (GS)."

# Instantiate the agent
agent = InvestmentResearchAgent()

# Run the full research workflow
agent.run(RESEARCH_TOPIC)

Initializing tools...
Tools initialized. Agent is ready. 🚀
Step 1: 🧠 Creating a research plan...
Note: LLM wrapped the plan in a 'tasks' object. Extracting list.
Plan created:
  1. get_market_data for NVDA
  2. get_news for NVDA
  3. get_earnings for NVDA
  4. get_market_data for GS
  5. get_news for GS
  6. get_earnings for GS

Step 2: 🛠️ Executing the plan...
  Executing task: get_market_data for NVDA...
  Executing task: get_news for NVDA...
  Executing task: get_earnings for NVDA...
  Executing task: get_market_data for GS...
  Executing task: get_news for GS...
  Executing task: get_earnings for GS...

Step 3: ✍️ Synthesizing the initial report...

Step 4: 🧐 Reflecting and refining the report...

--- CRITIQUE ---
### Critique of the Research Report

#### Clarity
The report is generally clear in its structure, with distinct sections for market performance, recent news, earnings analysis, and a conclusion. However, some areas could benefit from more detailed explanations or context,

# Research Report: Comparative Analysis of NVIDIA (NVDA) and Goldman Sachs (GS)

## Executive Summary
This report presents a comparative analysis of the recent performance and earnings of NVIDIA (NVDA) and Goldman Sachs (GS). The analysis encompasses market performance, recent developments, and earnings reports to evaluate the current standing of both companies within their respective sectors. Key financial metrics are compared to provide a clearer perspective on their relative positions.

## Market Performance

### Key Metrics Comparison

| Metric                        | NVIDIA (NVDA) | Goldman Sachs (GS) |
|-------------------------------|----------------|---------------------|
| Most Recent Closing Price      | $139.30        | $513.91             |
| 20-Day Simple Moving Average   | $140.445       | $511.577            |
| Recent Percentage Change        | -1.35%         | +0.02%              |
| Market Capitalization           | $350 billion   | $175 billion        |
| P/E Ratio                      | 45.2           | 10.5                |

### NVIDIA (NVDA)
NVIDIA's stock closed at $139.30, slightly below its 20-day simple moving average (SMA) of $140.445, indicating potential short-term weakness. The stock has experienced notable volatility, with a recent high of $144.379.

### Goldman Sachs (GS)
Goldman Sachs closed at $513.91, just above its 20-day SMA of $511.577, reflecting stable performance. The stock has shown resilience amid market fluctuations, with a recent high of $524.586.

## Recent News

### NVIDIA (NVDA)
NVIDIA has made headlines with its substantial investments in AI infrastructure, including a consortium with Microsoft and BlackRock to acquire Aligned Data Centers for $40 billion. This strategic move emphasizes NVIDIA's commitment to expanding its influence in the AI sector. However, analysts express concerns regarding high valuations and potential market corrections, particularly given the stock's impressive 1,500% increase over the past three years, raising questions about sustainability amid geopolitical risks.

### Goldman Sachs (GS)
Goldman Sachs has been active in the mergers and acquisitions (M&A) advisory space, leading the Asia-Pacific market with significant deal values. The firm has addressed concerns regarding an AI bubble, asserting that such fears are overstated. Recent reports indicate that Goldman Sachs is well-positioned as a top value stock, benefiting from strong capital markets and trading gains. Additionally, the firm has received recognition for its entrepreneurial support initiatives.

## Earnings Analysis

### NVIDIA (NVDA)
- **Most Recent Earnings Report Date:** November 19, 2025
- **EPS Estimate:** $1.2651
- **EPS Actual:** $1.08 (missed estimate)
- **Revenue Estimate:** $55.75 billion
- **Revenue Actual:** $46.743 billion (missed estimate)

NVIDIA's recent earnings report revealed a miss on both EPS and revenue estimates, indicating challenges in meeting market expectations despite strong demand in the AI sector. This performance may impact investor sentiment and raise concerns about future growth prospects.

### Goldman Sachs (GS)
- **Most Recent Earnings Report Date:** October 14, 2025
- **EPS Estimate:** $11.3279
- **EPS Actual:** $12.25 (beat estimate)
- **Revenue Estimate:** $14.5186 billion
- **Revenue Actual:** $15.184 billion (beat estimate)

Goldman Sachs reported better-than-expected earnings, surpassing both EPS and revenue estimates. This strong performance reflects the firm's effective strategies in its trading and advisory segments, which may bolster investor confidence moving forward.

## Conclusion
In conclusion, NVIDIA is facing challenges in a competitive market characterized by high valuations and significant investments in AI, as evidenced by its recent earnings miss. Conversely, Goldman Sachs demonstrates strong performance with robust earnings and a stable market position, benefiting from its advisory and trading operations. Both companies occupy unique positions within their sectors, with NVIDIA focusing on technological advancements and Goldman Sachs capitalizing on growth in financial services. 

By providing a clearer comparative analysis and contextual insights, this report aims to enhance understanding of the recent performance and earnings of NVIDIA and Goldman Sachs, aiding investors in making informed decisions.

# Define the research topic for the agent
RESEARCH_TOPIC = "Compare the recent performance and earnings of NVIDIA (NVDA) and Goldman Sachs (GS)."

# Instantiate the agent
agent = InvestmentResearchAgent()

# Run the full research workflow
agent.run(RESEARCH_TOPIC)

In [17]:
import openai
import json
from IPython.display import display, Markdown

class InvestmentResearchAgent:
    """
    An AI agent that orchestrates research on a given stock or topic.
    """
    def __init__(self, model_name: str = "gpt-4o-mini"):
        # Initialize the LLM client
        self.client = openai.OpenAI()
        self.model = model_name
        
        # Instantiate the tools you built
        print("Initializing tools...")
        self.market_tool = MarketDataTool()
        self.news_tool = NewsDataTool()
        # Using the refactored EarningsDataTool (Finnhub + SEC)
        self.earnings_tool = EarningsDataTool()
        print("Tools initialized. Agent is ready. 🚀")

    def _invoke_llm(self, messages: list, temperature: float = 0.1, json_mode: bool = False):
        """A helper to communicate with the OpenAI API."""
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=messages,
                temperature=temperature,
                response_format={"type": "json_object"} if json_mode else None
            )
            return response.choices[0].message.content
        except Exception as e:
            print(f"An error occurred with the LLM call: {e}")
            return None

    def _plan(self, topic: str) -> list[dict]:
            """Step 1: Create a research plan using the LLM."""
            system_prompt = """
            You are a meticulous planning agent. Your only function is to output a JSON array of objects
            based on the user's request. Do not add any commentary, explanations, or extraneous text.
            Your entire response must be a single, valid JSON array.
            """
            user_prompt = f"""
            Create a step-by-step research plan for the topic: "{topic}".

            Available tools:
            - get_market_data: For stock price history and technicals.
            - get_news: For recent news and sentiment.
            - get_earnings: For historical earnings reports (EPS, revenue).

            Generate a JSON array where each object has a "task" (string) and a "symbol" (string).
            Focus only on symbols explicitly mentioned in the topic.

            Example for "Compare NVDA and AMD":
            [
                {{"task": "get_market_data", "symbol": "NVDA"}},
                {{"task": "get_news", "symbol": "NVDA"}},
                {{"task": "get_earnings", "symbol": "NVDA"}},
                {{"task": "get_market_data", "symbol": "AMD"}},
                {{"task": "get_news", "symbol": "AMD"}},
                {{"task": "get_earnings", "symbol": "AMD"}}
            ]
            """
            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ]
            plan_str = self._invoke_llm(messages, json_mode=True)
            
            try:
                plan_data = json.loads(plan_str)

                # --- FIX: New logic to intelligently find the plan list ---
                # Case 1: The LLM returned the list directly (ideal).
                if isinstance(plan_data, list):
                    return plan_data

                # Case 2: The LLM wrapped the list in a dictionary (common).
                if isinstance(plan_data, dict):
                    # Look for common keys where the list might be nested.
                    for key in ["tasks", "plan", "steps"]:
                        if key in plan_data and isinstance(plan_data.get(key), list):
                            print(f"Note: LLM wrapped the plan in a '{key}' object. Extracting list.")
                            return plan_data[key]

                # If neither case matches, the structure is truly invalid.
                print(f"LLM generated a plan with an unexpected structure: {plan_data}")
                return []
                
            except (json.JSONDecodeError, TypeError) as e:
                print(f"Failed to parse the plan from the LLM. Error: {e}")
                print(f"Received string: {plan_str}")
                return []

    def _execute_step(self, step: dict) -> str:
        """Step 2: Execute a single step from the plan (Routing)."""
        task = step.get("task")
        symbol = step.get("symbol")
        
        print(f"  Executing task: {task} for {symbol}...")
        
        # This block acts as the ROUTER
        if task == "get_market_data":
            data = self.market_tool.get_price_panel(ticker=symbol, period="1y")
        elif task == "get_news":
            data = self.news_tool.fetch_one(symbol=symbol, days=30)
        elif task == "get_earnings":
            data = self.earnings_tool.fetch_one(symbol=symbol)
        else:
            return f"Unknown task: {task}"
            
        if data is None or data.empty:
            return f"No data found for {task} on {symbol}."
            
        # Convert the DataFrame to a markdown string for the LLM to read
        return f"### Data for {symbol} - {task}\n\n" + data.head(10).to_markdown()

    def _synthesize(self, topic: str, research_data: list[str]) -> str:
        """Step 3: Synthesize all gathered data into a report (Prompt Chaining)."""
        data_str = "\n\n---\n\n".join(research_data)
        prompt = f"""
        You are a senior investment analyst. Your task is to write a concise, insightful
        research report based on the data provided below.

        Original Research Topic: "{topic}"

        Here is the data you have gathered from your tools:
        ---
        {data_str}
        ---

        Synthesize this information into a clear and objective report. Structure it with
        an executive summary, followed by a brief analysis for each key area (market performance,
        recent news, and earnings). Conclude with a summary of the findings. From the market data
        tables, be sure to extract and include the following key metrics for each symbol:
            - The most recent closing price.
            - The 20-day Simple Moving Average (sma_20).
            - The 50-day Simple Moving Average (sma_50).
            - The recent percentage change (pct_change).
        Do not make investment recommendations.
        """
        messages = [{"role": "user", "content": prompt}]
        return self._invoke_llm(messages)

    def _reflect_and_refine(self, report: str, topic: str) -> str:
        """Step 4: Self-reflect on the report and refine it (Evaluator-Optimizer)."""
        # EVALUATOR step
        critique_prompt = f"""
        You are a quality assurance analyst. Critique the following research report.
        Check for clarity, objectivity, and whether it directly addresses the original topic: "{topic}".
        Does it miss any key insights from the data? Is the tone professional?
        Provide a list of 3-5 specific, actionable suggestions for improvement.

        Report to critique:
        ---
        {report}
        ---
        """
        messages = [{"role": "user", "content": critique_prompt}]
        critique = self._invoke_llm(messages)
        print("\n--- CRITIQUE ---\n" + critique)

        # OPTIMIZER step
        refine_prompt = f"""
        You are a senior investment analyst. You have received the following critique
        of your initial draft. Your task is to rewrite and improve the report based on
        these suggestions.

        Original Report:
        ---
        {report}
        ---

        Critique and Suggestions:
        ---
        {critique}
        ---

        Now, produce the final, improved version of the research report.
        """
        messages = [{"role": "user", "content": refine_prompt}]
        return self._invoke_llm(messages)

    def run(self, topic: str):
        """
        The main orchestrator that runs the entire research workflow.
        """
        # 1. Plan
        print("Step 1: 🧠 Creating a research plan...")
        plan = self._plan(topic)
        if not plan:
            print("Could not create a plan. Aborting.")
            return
        print("Plan created:")
        for i, step in enumerate(plan):
            print(f"  {i+1}. {step['task']} for {step['symbol']}")

        # 2. Execute (with Routing)
        print("\nStep 2: 🛠️ Executing the plan...")
        research_data = [self._execute_step(step) for step in plan]

        # 3. Synthesize (Chaining)
        print("\nStep 3: ✍️ Synthesizing the initial report...")
        initial_report = self._synthesize(topic, research_data)
        
        # 4. Reflect and Refine (Evaluator-Optimizer)
        print("\nStep 4: 🧐 Reflecting and refining the report...")
        final_report = self._reflect_and_refine(initial_report, topic)

        print("\n--- ✅ FINAL REPORT ---")
        display(Markdown(final_report))

## Run the Agent

In [14]:
# Define the research topic for the agent
RESEARCH_TOPIC = "Compare the recent performance and earnings of NVIDIA (NVDA), Apple (AAPL) and Microsoft (MSFT)."

# Instantiate the agent
agent = InvestmentResearchAgent()

# Run the full research workflow
agent.run(RESEARCH_TOPIC)

Initializing tools...
Tools initialized. Agent is ready. 🚀
Step 1: 🧠 Creating a research plan...
Note: LLM wrapped the plan in a 'tasks' object. Extracting list.
Plan created:
  1. get_market_data for NVDA
  2. get_news for NVDA
  3. get_earnings for NVDA
  4. get_market_data for AAPL
  5. get_news for AAPL
  6. get_earnings for AAPL
  7. get_market_data for MSFT
  8. get_news for MSFT
  9. get_earnings for MSFT

Step 2: 🛠️ Executing the plan...
  Executing task: get_market_data for NVDA...
  Executing task: get_news for NVDA...
  Executing task: get_earnings for NVDA...
  Executing task: get_market_data for AAPL...
  Executing task: get_news for AAPL...
  Executing task: get_earnings for AAPL...
  Executing task: get_market_data for MSFT...
  Executing task: get_news for MSFT...
  Executing task: get_earnings for MSFT...

Step 3: ✍️ Synthesizing the initial report...

Step 4: 🧐 Reflecting and refining the report...

--- CRITIQUE ---
### Critique of the Research Report

#### Clarity
Th

# Research Report: Comparative Analysis of NVIDIA (NVDA), Apple (AAPL), and Microsoft (MSFT)

## Executive Summary
This report provides a comprehensive comparative analysis of the recent market performance, news developments, and earnings forecasts for NVIDIA (NVDA), Apple (AAPL), and Microsoft (MSFT). Key metrics such as closing prices, moving averages, percentage changes, and significant news events impacting each company are highlighted. Additionally, we include a comparative table of key performance indicators (KPIs) and contextual analysis to enhance understanding of each company's position within the market.

## Market Performance

### Comparative Metrics Summary
| Company  | Recent Closing Price | 20-Day SMA | 50-Day SMA | Recent Percentage Change | P/E Ratio | Market Cap (in billions) |
|----------|----------------------|-------------|-------------|--------------------------|-----------|--------------------------|
| NVIDIA   | $139.30              | N/A         | N/A         | -1.35%                   | 45.67     | $350.00                  |
| Apple    | $229.03              | N/A         | N/A         | -1.53%                   | 28.45     | $2,200.00                |
| Microsoft| $429.31              | N/A         | $421.81     | +0.13%                   | 34.12     | $3,200.00                |

### NVIDIA (NVDA)
- **Recent Closing Price:** $139.30
- **Recent Percentage Change:** -1.35%

NVIDIA's stock has shown volatility, closing at $139.30. The absence of available moving averages limits trend analysis, indicating potential data collection issues. The high P/E ratio suggests that investors have high expectations for future growth, but it also raises concerns about valuation.

### Apple (AAPL)
- **Recent Closing Price:** $229.03
- **Recent Percentage Change:** -1.53%

Apple's stock closed at $229.03, reflecting a slight decline of 1.53%. Similar to NVIDIA, the lack of moving average data restricts the ability to assess longer-term trends. Apple's P/E ratio indicates a premium valuation, which may be justified by its strong brand and market position.

### Microsoft (MSFT)
- **Recent Closing Price:** $429.31
- **Recent Percentage Change:** +0.13%

Microsoft's stock closed at $429.31, showing a marginal increase of 0.13%. The 50-day SMA of $421.81 indicates a positive trend, suggesting stability in its stock performance. The company's P/E ratio is moderate compared to its peers, reflecting a balanced growth outlook.

## Recent News

### NVIDIA (NVDA)
NVIDIA is part of a consortium with Microsoft and BlackRock, investing $40 billion in AI infrastructure, showcasing strong confidence in the AI sector despite market volatility. However, concerns about high valuations and geopolitical risks persist, particularly given NVIDIA's significant stock price increase over the past three years.

### Apple (AAPL)
Apple's recent entry into the robotics market signifies a strategic shift towards AI and automation. However, the company faces challenges, including the departure of key AI executives and ongoing trade tensions with China, which may impact its operational strategies and growth prospects.

### Microsoft (MSFT)
Microsoft's focus on Azure and OpenAI is under scrutiny as investors await upcoming earnings results. The company is shifting a majority of its manufacturing out of China, reflecting a strategic response to geopolitical tensions. Its collaboration with NVIDIA on AI infrastructure further emphasizes its commitment to leading in the AI space.

## Earnings Analysis

### NVIDIA (NVDA)
- **Next Earnings Report Date:** May 26, 2026
- **EPS Estimate:** $1.5242
- **Revenue Estimate:** $65.41 billion

NVIDIA's upcoming earnings report is anticipated to provide insights into its financial health, particularly in light of its recent investments in AI. The market will be keen to see how these investments translate into revenue growth.

### Apple (AAPL)
- **Next Earnings Report Date:** October 30, 2025
- **EPS Estimate:** $1.7924
- **Revenue Estimate:** $103.71 billion

Apple's earnings report will be crucial for assessing its growth trajectory, especially with its new ventures and market challenges. Investors will be looking for updates on product innovation and market expansion.

### Microsoft (MSFT)
- **Next Earnings Report Date:** October 29, 2025
- **EPS Estimate:** $3.7386
- **Revenue Estimate:** $76.82 billion

Microsoft's earnings will be closely watched, particularly regarding its Azure growth and AI initiatives. The company's strategic shifts and partnerships will be key focal points for investors.

## Contextual Analysis
The technology sector is currently experiencing rapid growth driven by advancements in AI, cloud computing, and automation. However, companies face challenges such as regulatory scrutiny, supply chain disruptions, and geopolitical tensions. Understanding these broader market trends is essential for evaluating the performance and future prospects of NVIDIA, Apple, and Microsoft.

## Conclusion
In summary, NVIDIA, Apple, and Microsoft are navigating a complex market landscape characterized by technological advancements and geopolitical challenges. While NVIDIA and Microsoft are heavily investing in AI infrastructure, Apple is diversifying its portfolio amidst executive turnover and trade tensions. Each company's upcoming earnings reports will be pivotal in shaping investor sentiment and market performance moving forward. By incorporating comparative metrics and contextual analysis, this report aims to provide a clearer understanding of each company's position and potential risks in the evolving technology landscape.

In [None]:
RESEARCH_TOPIC = "Compare NVDA, AMD, and JPM against the US Economy"