### ⚡ Scoring Logic & Performance

**The scorer splits each transcript into chunks, scores each chunk, and aggregates the chunk scores.** Here's how it works:

#### 1. **Text Chunking** (Smart Splitting)
- Each transcript is split into chunks (default: **2000 characters** each)
- Splits on paragraph/sentence boundaries to maintain context
- Example: 100,000 char transcript = **50 chunks**
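
The splitting step above can be sketched as follows. This is a minimal illustration, not the `LLMScorer` implementation (which is not shown in this notebook): the function name `split_into_chunks`, the regexes, and the greedy packing strategy are all assumptions. Paragraphs are joined with a single space, so paragraph breaks inside a chunk are not preserved.

```python
import re

def split_into_chunks(text: str, chunk_size: int = 2000) -> list[str]:
    """Greedily pack paragraphs (then sentences) into chunks of at most chunk_size chars.

    Hypothetical helper for illustration; the real scorer may split differently.
    """
    # Split on blank lines first; break oversized paragraphs into sentences
    units = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if len(para) <= chunk_size:
            units.append(para)
        else:
            units.extend(s for s in re.split(r"(?<=[.!?])\s+", para) if s)

    chunks, current = [], ""
    for unit in units:
        # Hard-split any single unit that is still longer than the limit
        while len(unit) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(unit[:chunk_size])
            unit = unit[chunk_size:]
        if not unit:
            continue
        candidate = f"{current} {unit}" if current else unit
        if len(candidate) <= chunk_size:
            current = candidate      # unit still fits in the open chunk
        else:
            chunks.append(current)   # close the open chunk, start a new one
            current = unit
    if current:
        chunks.append(current)
    return chunks
```

Because chunks end on paragraph or sentence boundaries where possible, a real transcript usually yields slightly more chunks than `len(text) / chunk_size`, since chunks are rarely filled to the limit.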

#### 2. **LLM Scoring** (Per Chunk)
- Each chunk is scored independently via API call (score: 1-5)
- Uses gpt-4o-mini at temperature 0.0 for near-deterministic output
- Each chunk costs ~$0.001 and takes ~0.5-1 second

#### 3. **Score Aggregation** (Sophisticated Methods)
Combines chunk scores using:
- **Trimmed mean**: Removes outliers (top/bottom 10%)
- **Position weighting**: Early chunks weighted higher (forward guidance)
- **Confidence scoring**: Based on score variance across chunks
- **Trend analysis**: Detects sentiment shifts across transcript
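
As a rough sketch of how these four pieces could fit together, the function below computes each one from a list of chunk scores. The name `aggregate_chunk_scores`, the linear weight decay, the `early_weight` value, and the confidence and trend formulas are assumptions chosen for illustration; the actual `LLMScorer` logic may differ.

```python
import numpy as np

def aggregate_chunk_scores(scores, trim_fraction=0.10, early_weight=1.5):
    """Summarize per-chunk scores (1-5) for one transcript. Illustrative only."""
    s = np.asarray(scores, dtype=float)
    n = len(s)
    if n == 0:
        raise ValueError("need at least one chunk score")

    # Trimmed mean: drop the lowest and highest trim_fraction of the scores
    k = int(n * trim_fraction)
    ordered = np.sort(s)
    trimmed = ordered[k:n - k] if n - 2 * k > 0 else ordered
    trimmed_mean = trimmed.mean()

    # Position weighting: weights fall linearly from early_weight to 1.0
    # (assumed scheme), so early chunks count more
    weights = np.linspace(early_weight, 1.0, n)
    weighted_mean = np.average(s, weights=weights)

    # Confidence: 1.0 when every chunk agrees, lower as the spread grows.
    # 2.0 is the largest possible standard deviation for scores in [1, 5].
    confidence = 1.0 - min(s.std() / 2.0, 1.0)

    # Trend: slope of a least-squares line through score vs. chunk position
    trend = np.polyfit(np.arange(n), s, 1)[0] if n > 1 else 0.0

    return {
        "trimmed_mean": float(trimmed_mean),
        "weighted_mean": float(weighted_mean),
        "confidence": float(confidence),
        "trend": float(trend),
    }

result = aggregate_chunk_scores([1, 3, 3, 3, 3, 3, 3, 3, 3, 5])
print(result)
```

In this example the trimmed mean ignores the single 1 and the single 5, while the position-weighted mean is pulled below 3 because the low score sits in the first, most heavily weighted chunk.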

#### 📊 **Why It's Slow**
- **Default chunk size (2000 chars) = too many chunks!**
- Example: 100k-char transcript = 50 chunks = 50 API calls = ~50 seconds
- 919 transcripts × 50 chunks = **45,950 API calls** = ~10 hours!

#### ⚡ **Speed Fix**
**Increase chunk size** to reduce API calls:
- 2000 chars → 10,000 chars = **5x faster**
- 50 chunks → 10 chunks per transcript
- 10 hours → **2 hours**
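
The arithmetic behind these figures can be reproduced with a small helper. `estimate_scoring_run` is a hypothetical function, and `seconds_per_call=0.8` is an assumed value inside the 0.5-1 second range quoted above. The sketch also assumes per-call latency does not grow with chunk size, which is the assumption behind the 5x figure.

```python
import math

def estimate_scoring_run(n_transcripts, transcript_chars, chunk_size,
                         seconds_per_call=0.8):
    """Estimate API calls and wall-clock hours for a sequential scoring run."""
    chunks = math.ceil(transcript_chars / chunk_size)
    calls = n_transcripts * chunks
    return {
        "chunks_per_transcript": chunks,
        "api_calls": calls,
        "hours": calls * seconds_per_call / 3600,
    }

base = estimate_scoring_run(919, 100_000, 2_000)
fast = estimate_scoring_run(919, 100_000, 10_000)
print(base["api_calls"], f"{base['hours']:.1f}")  # 45950 10.2
print(fast["api_calls"], f"{fast['hours']:.1f}")  # 9190 2.0
```

API pricing is per token rather than per call, so larger chunks reduce the number of calls and the elapsed time but should not be expected to cut total cost by the same factor.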

Run the optimization cell below before scoring!

# AI Economy Score Predictor - Full Pipeline

Complete end-to-end implementation of the earnings call sentiment → economic prediction → trading strategy pipeline.

## Setup & Configuration

In [1]:
import pandas as pd
import numpy as np
import yaml
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')
from data_acquisition import DataAcquisition
from llm_scorer import LLMScorer
from feature_engineering import FeatureEngineer
from prediction_model import PredictionModel
from signal_generator import SignalGenerator
from backtester import Backtester
with open('config.yaml', 'r') as f:
    config = yaml.safe_load(f)

print("✓ Pipeline modules loaded")
print(f"✓ Config loaded: {len(config)} sections")

✓ Pipeline modules loaded
✓ Config loaded: 9 sections


## Step 1: Data Acquisition

In [2]:
# Initialize data acquisition
data_acq = DataAcquisition('config.yaml')
sp500 = data_acq.fetch_sp500_constituents()
sp500.head(10)

✓ FRED API initialized
✓ Loaded 503 S&P 500 constituents


Unnamed: 0,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,66740,1902
1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,91142,1916
2,ABT,Abbott Laboratories,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,1800,1888
3,ABBV,AbbVie,Health Care,Biotechnology,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)
4,ACN,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989
5,ADBE,Adobe Inc.,Information Technology,Application Software,"San Jose, California",1997-05-05,796343,1982
6,AMD,Advanced Micro Devices,Information Technology,Semiconductors,"Santa Clara, California",2017-03-20,2488,1969
7,AES,AES Corporation,Utilities,Independent Power Producers & Energy Traders,"Arlington, Virginia",1998-10-02,874761,1981
8,AFL,Aflac,Financials,Life & Health Insurance,"Columbus, Georgia",1999-05-28,4977,1955
9,A,Agilent Technologies,Health Care,Life Sciences Tools & Services,"Santa Clara, California",2000-06-05,1090872,1999


# Data Fetch Testing

In [3]:
import pandas as pd
from data_acquisition import DataAcquisition
data = DataAcquisition("config.yaml")
transcripts = data.fetch_earnings_transcripts('2025-01-01', '2026-01-01')

# Filter to ensure only 2025 data
transcripts['date'] = pd.to_datetime(transcripts['date'])
transcripts = transcripts[transcripts['date'] >= '2025-01-01'].copy()

print(f"Loaded {len(transcripts)} transcripts for 2025")
print(f"Date range: {transcripts['date'].min().date()} to {transcripts['date'].max().date()}")

macro = data.fetch_macro_data('2025-01-01', '2025-12-31')
print(f"Loaded {len(macro)} macro indicators")
sp500 = data.fetch_sp500_constituents()
print(f"Loaded {len(sp500)} S&P 500 stocks")

✓ FRED API initialized
Fetching transcripts from Hugging Face (kurry/sp500_earnings_transcripts)...
Downloading dataset...
Converting to DataFrame...
✓ Loaded 33,362 total transcripts
✓ Loaded 503 S&P 500 constituents
Filtering by date and S&P 500 membership...
  After date filter: 941 transcripts
✓ Final result: 919 S&P 500 transcripts (2025-01-01 to 2026-01-01)
Loaded 919 transcripts for 2025
Date range: 2025-01-10 to 2025-05-15
✓ Fetched gdp: 3 observations
✓ Fetched industrial_production: 12 observations
✓ Fetched employment: 12 observations
✓ Fetched wages: 12 observations
Loaded 4 macro indicators
✓ Loaded 503 S&P 500 constituents
Loaded 503 S&P 500 stocks


## Step 1b: Fetch Macro Data (FRED API)

**Note**: If you get FRED API errors, restart the kernel to reload the config with the updated API key.

In [4]:
# Fetch macroeconomic data for 2025 only
start_date = '2025-01-01'
end_date = '2025-12-31'
macro_data = data_acq.fetch_macro_data(start_date, end_date)
print(f"\nMacroeconomic Data (2025):")
for name, df in macro_data.items():
    if len(df) > 0:
        df_temp = df.copy()
        df_temp['date'] = pd.to_datetime(df_temp['date'])
        # Filter to 2025
        df_2025 = df_temp[df_temp['date'] >= '2025-01-01']
        print(f"  {name}: {len(df_2025)} observations (filtered to 2025)")

✓ Fetched gdp: 3 observations
✓ Fetched industrial_production: 12 observations
✓ Fetched employment: 12 observations
✓ Fetched wages: 12 observations

Macroeconomic Data (2025):
  gdp: 3 observations (filtered to 2025)
  industrial_production: 12 observations (filtered to 2025)
  employment: 12 observations (filtered to 2025)
  wages: 12 observations (filtered to 2025)


In [5]:
import pandas as pd
import re

pmi_path = 'pmi_data.csv'
pmi_df = pd.read_csv(pmi_path)
pmi_df.columns = [c.strip().lower().replace(' ', '_') for c in pmi_df.columns]
print("Columns in PMI file:", pmi_df.columns.tolist())

# Find date and PMI columns
date_col = [col for col in pmi_df.columns if 'date' in col][0]
pmi_col = [col for col in pmi_df.columns if 'pmi' in col][0]

def clean_date(val):
    # Extract the part before the first parenthesis
    val = str(val).split('(')[0].strip()
    try:
        return pd.to_datetime(val)
    except Exception:
        return pd.NaT

pmi_df[date_col] = pmi_df[date_col].apply(clean_date)
pmi_df = pmi_df.dropna(subset=[date_col, pmi_col])

# Rename columns to standard names
pmi_df = pmi_df.rename(columns={date_col: 'date', pmi_col: 'pmi'})

print(f"Loaded PMI data: {len(pmi_df)} rows")
print(pmi_df.tail())

Columns in PMI file: ['date', 'pmi']
Loaded PMI data: 133 rows
          date   pmi
128 2015-05-01  51.5
129 2015-04-01  51.5
130 2015-03-02  52.9
131 2015-02-02  53.5
132 2015-01-02  55.5


In [6]:
# Keep PMI from December 2024 onwards (so early January 2025 has a value to
# forward-fill from) and create a daily index for merging
pmi_df = pmi_df[pmi_df['date'] >= '2024-12-01'].copy()
pmi_df = pmi_df.sort_values('date')

# Create a complete daily date range from 2024-12-01 through the end of 2025
date_range = pd.date_range(start='2024-12-01', end='2025-12-31', freq='D')
pmi_daily = pd.DataFrame({'date': date_range})

# Merge and forward-fill PMI values
pmi_daily = pmi_daily.merge(pmi_df, on='date', how='left')
# Use both forward-fill and backward-fill to handle initial NaN values
pmi_daily['pmi'] = pmi_daily['pmi'].ffill().bfill()

print(f"Filtered PMI data: {len(pmi_df)} original rows")
print(f"Created daily PMI data: {len(pmi_daily)} rows (forward-filled)")
print(f"\nFirst few rows:")
print(pmi_daily.head(10))
print(f"\nCheck for remaining NaN values: {pmi_daily['pmi'].isna().sum()}")

# Use the daily PMI data for merging
pmi_df = pmi_daily

Filtered PMI data: 14 original rows
Created daily PMI data: 396 rows (forward-filled)

First few rows:
        date   pmi
0 2024-12-01  48.4
1 2024-12-02  48.4
2 2024-12-03  48.4
3 2024-12-04  48.4
4 2024-12-05  48.4
5 2024-12-06  48.4
6 2024-12-07  48.4
7 2024-12-08  48.4
8 2024-12-09  48.4
9 2024-12-10  48.4

Check for remaining NaN values: 0


In [7]:
# Verify PMI values change throughout the year (show monthly transitions)
print("PMI values at monthly transitions:")
sample_dates = ['2025-01-03', '2025-02-03', '2025-03-03', '2025-04-01', 
                '2025-05-01', '2025-06-02', '2025-07-01', '2025-08-01']
for date in sample_dates:
    value = pmi_df[pmi_df['date'] == date]['pmi'].values
    if len(value) > 0:
        print(f"  {date}: {value[0]}")

print(f"\nUnique PMI values in 2025: {sorted(pmi_df['pmi'].unique())}")
print(f"This is correct - PMI is monthly, so each value repeats until the next release")

PMI values at monthly transitions:
  2025-01-03: 49.3
  2025-02-03: 50.9
  2025-03-03: 50.3
  2025-04-01: 49.0
  2025-05-01: 48.7
  2025-06-02: 48.5
  2025-07-01: 49.0
  2025-08-01: 48.0

Unique PMI values in 2025: [np.float64(48.0), np.float64(48.2), np.float64(48.4), np.float64(48.5), np.float64(48.7), np.float64(49.0), np.float64(49.1), np.float64(49.3), np.float64(50.3), np.float64(50.9)]
This is correct - PMI is monthly, so each value repeats until the next release


In [15]:
# Fetch control variables
controls = data_acq.fetch_control_variables(start_date, end_date)
print(f"\nControl Variables: {len(controls)} observations")
controls.head()

✓ Fetched yield curve slope
✓ Fetched consumer sentiment
✓ Fetched unemployment rate
✗ No local PMI data provided; PMI not included in controls.
✓ Control variables: 12 observations

Control Variables: 12 observations


Unnamed: 0,yield_curve_slope,consumer_sentiment,unemployment_rate
2025-01-01,0.36,71.7,4.0
2025-02-01,0.24,64.7,4.2
2025-03-01,0.31,57.0,4.2
2025-04-01,0.5,52.2,4.2
2025-05-01,0.5,52.2,4.3


In [16]:
data_acq.pmi_df = pmi_df
controls = data_acq.fetch_control_variables(start_date, end_date, pmi_df=pmi_df)

✓ Fetched yield curve slope
✓ Fetched consumer sentiment
✓ Fetched unemployment rate
✓ Used local PMI data: 365 rows
✓ Control variables: 365 observations


In [17]:
# Move the index into a regular column named 'date'
controls = controls.reset_index().rename(columns={'index': 'date'})

In [18]:
# Resample daily controls to monthly, keeping the last observation in each month
controls = controls.groupby(controls['date'].dt.to_period('M')).last().reset_index(drop=True)
controls['date'] = pd.to_datetime(controls['date'].astype(str))
print(f"Resampled to monthly: {len(controls)} observations")
controls.head(100)

Resampled to monthly: 12 observations


Unnamed: 0,date,yield_curve_slope,consumer_sentiment,unemployment_rate,pmi
0,2025-01-31,0.36,71.7,4.0,49.3
1,2025-02-28,0.24,64.7,4.2,50.9
2,2025-03-31,0.31,57.0,4.2,50.3
3,2025-04-30,0.5,52.2,4.2,49.0
4,2025-05-31,0.5,52.2,4.3,48.7
5,2025-06-30,0.49,60.7,4.1,48.5
6,2025-07-31,0.51,61.7,4.3,49.0
7,2025-08-31,0.56,58.2,4.3,48.0
8,2025-09-30,0.55,55.1,4.4,48.7
9,2025-10-31,0.54,53.6,4.4,49.1


## Step 2: LLM Scoring

In [19]:
# Initialize LLM scorer
scorer = LLMScorer('config.yaml')

## Step 3: Feature Engineering

In [20]:
def aggregate_scores_by_quarter(scored_transcripts):
    """
    Aggregate individual transcript scores into quarterly AGG scores.
    
    Args:
        scored_transcripts: List of dicts with 'symbol', 'date', 'score', 'market_cap'
        
    Returns:
        DataFrame with quarterly AGG scores
    """
    df = pd.DataFrame(scored_transcripts)
    df['date'] = pd.to_datetime(df['date'])
    df['year'] = df['date'].dt.year
    df['quarter'] = df['date'].dt.quarter
    df['quarter_date'] = df['date'].dt.to_period('Q').dt.to_timestamp()
    
    # Aggregate by quarter using value-weighted average
    quarterly = df.groupby('quarter_date').apply(
        lambda x: np.average(x['score'], weights=x.get('market_cap', [1]*len(x)))
    ).reset_index()
    
    quarterly.columns = ['date', 'agg_score']
    quarterly['year'] = quarterly['date'].dt.year
    quarterly['quarter'] = quarterly['date'].dt.quarter
    
    return quarterly[['date', 'year', 'quarter', 'agg_score']]

# Example usage (commented out - requires real transcript scores):
# scored_transcripts = scorer.score_multiple_transcripts(transcripts)
# agg_scores = aggregate_scores_by_quarter(scored_transcripts)
# agg_scores.to_csv('agg_scores.csv', index=False)
print("✓ AGG score aggregation function defined")

✓ AGG score aggregation function defined
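
To make the value weighting concrete, here is a small worked example with made-up numbers. The tickers, scores, and market caps are hypothetical. It compares an equal-weighted quarterly mean with the market-cap-weighted mean that `aggregate_scores_by_quarter` computes when a `market_cap` column is present.

```python
import numpy as np
import pandas as pd

# Hypothetical scored transcripts (not real data)
scores = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC"],
    "date": pd.to_datetime(["2025-01-15", "2025-02-20", "2025-04-25"]),
    "score": [4.0, 2.0, 3.0],
    "market_cap": [300.0, 100.0, 50.0],
})
# Map each call date to the first day of its calendar quarter
scores["quarter_date"] = scores["date"].dt.to_period("Q").dt.to_timestamp()

rows = []
for quarter_date, grp in scores.groupby("quarter_date"):
    rows.append({
        "date": quarter_date,
        "equal_weighted": grp["score"].mean(),
        "value_weighted": np.average(grp["score"], weights=grp["market_cap"]),
    })
quarterly = pd.DataFrame(rows)
print(quarterly)
```

In Q1 the larger firm's score of 4 dominates, so the value-weighted score is (4 × 300 + 2 × 100) / 400 = 3.5 against an equal-weighted 3.0. Note that the scoring cell further down stores only `symbol`, `date`, `score`, and `transcript_length`, so without a `market_cap` column the aggregation falls back to equal weights.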


In [21]:
# Prepare transcripts for scoring (uses whatever was fetched based on config.yaml)
if 'transcripts' not in dir() or len(transcripts) == 0:
    print("ERROR: No transcripts loaded!")
    print("Please run the data acquisition cell first to load transcripts.")
    raise ValueError("Transcripts not loaded. Run data acquisition cell first.")

# Use the loaded transcripts
scoring_transcripts = transcripts.copy()
scoring_transcripts['date'] = pd.to_datetime(scoring_transcripts['date'])

# Generate save path based on date range
date_min = scoring_transcripts['date'].min()
date_max = scoring_transcripts['date'].max()
year_range = f"{date_min.year}_{date_max.year}"
save_path = f'scored_transcripts_{year_range}.csv'

print(f"Preparing to score {len(scoring_transcripts)} transcripts")
print(f"Date range: {date_min.date()} to {date_max.date()}")
print(f"Estimated cost: ${len(scoring_transcripts) * 0.001:.2f} - ${len(scoring_transcripts) * 0.002:.2f}")
print(f"Estimated time: {len(scoring_transcripts) * 2 / 60:.1f} - {len(scoring_transcripts) * 3 / 60:.1f} minutes")
print(f"Results will be saved to: {save_path}")

# Show breakdown by year
year_counts = scoring_transcripts['date'].dt.year.value_counts().sort_index()
print(f"\nTranscripts by year:")
for year, count in year_counts.items():
    print(f"  {year}: {count} transcripts")

print(f"\nCheckpoints will be saved every 50 transcripts")

Preparing to score 919 transcripts
Date range: 2025-01-10 to 2025-05-15
Estimated cost: $0.92 - $1.84
Estimated time: 30.6 - 46.0 minutes
Results will be saved to: scored_transcripts_2025_2025.csv

Transcripts by year:
  2025: 919 transcripts

Checkpoints will be saved every 50 transcripts


In [22]:
# Define the scoring function with progress tracking
import time
from tqdm.notebook import tqdm
from datetime import datetime

def score_quarter_transcripts(transcripts_df, scorer, save_path='scored_transcripts.csv'):
    """
    Score all transcripts with progress tracking, checkpointing, and error handling.
    """
    # First, inspect the data structure
    print("Inspecting data structure...")
    print(f"Type: {type(transcripts_df)}")
    print(f"Columns: {transcripts_df.columns.tolist()}")
    print(f"\nFirst row type: {type(transcripts_df.iloc[0])}")
    print(f"First row preview:")
    print(transcripts_df.iloc[0])
    
    print(f"\nScoring {len(transcripts_df)} transcripts...")
    print(f"Estimated cost: ${len(transcripts_df) * 0.001:.2f} (GPT-4o-mini)")
    print(f"Estimated time: {len(transcripts_df) * 2 / 60:.1f} minutes")
    
    # Check for existing progress
    try:
        existing = pd.read_csv(save_path)
        already_scored = set(existing['symbol'] + '_' + existing['date'].astype(str))
        print(f"Found {len(already_scored)} previously scored transcripts")
    except FileNotFoundError:
        already_scored = set()
        existing = pd.DataFrame()
    
    scored_results = []
    errors = []
    
    # Determine transcript column name - check what's actually in the DataFrame
    available_cols = transcripts_df.columns.tolist()
    transcript_col = None
    
    for possible_name in ['transcript', 'text', 'content', 'full_text', 'body']:
        if possible_name in available_cols:
            transcript_col = possible_name
            break
    
    if transcript_col is None:
        print(f"ERROR: Could not find transcript column. Available columns: {available_cols}")
        return existing if len(existing) > 0 else pd.DataFrame()
    
    print(f"Using transcript column: '{transcript_col}'")
    
    # Convert to dict records for easier iteration
    records = transcripts_df.to_dict('records')
    
    for idx, row in enumerate(tqdm(records, desc="Scoring")):
        # Handle different possible column names
        symbol = row.get('symbol') or row.get('ticker') or 'UNKNOWN'
        date = row.get('date') or row.get('filing_date') or 'UNKNOWN'
        transcript_id = f"{symbol}_{date}"
        
        # Skip if already scored
        if transcript_id in already_scored:
            continue
        
        try:
            # Get the transcript text
            transcript_text = row.get(transcript_col, '')
            
            if not transcript_text or transcript_text == '':
                errors.append({'symbol': symbol, 'date': date, 'error': 'Empty transcript'})
                continue
            
            # Score transcript - wrap in expected dictionary format
            # The scorer expects a dict with 'full_text' key
            transcript_dict = {'full_text': transcript_text}
            result = scorer.score_transcript(transcript_dict, use_md_a_only=False)
            score = result['firm_score']
            
            if score is None:
                errors.append({'symbol': symbol, 'date': date, 'error': 'Scoring returned None'})
                continue
            
            scored_results.append({
                'symbol': symbol,
                'date': date,
                'score': score,
                'transcript_length': len(str(transcript_text))
            })
            
            # Save checkpoint every 50 transcripts
            if len(scored_results) % 50 == 0:
                temp_df = pd.DataFrame(scored_results)
                combined = pd.concat([existing, temp_df], ignore_index=True)
                combined.to_csv(save_path, index=False)
                print(f"\nCheckpoint: Saved {len(combined)} scores")
            
            # Rate limiting (to avoid API limits)
            time.sleep(0.5)
            
        except Exception as e:
            errors.append({'symbol': symbol, 'date': date, 'error': str(e)})
            if idx < 5:  # Only print first few errors in detail
                print(f"\nError scoring {symbol}: {e}")
    
    # Final save - handle case where nothing was scored
    if scored_results:
        final_df = pd.DataFrame(scored_results)
        combined = pd.concat([existing, final_df], ignore_index=True)
        combined.to_csv(save_path, index=False)
        print(f"\nSaved {len(combined)} total scored transcripts to {save_path}")
    elif len(existing) > 0:
        combined = existing
        print(f"\nNo new transcripts scored. Returning {len(existing)} existing scores.")
    else:
        combined = pd.DataFrame(columns=['symbol', 'date', 'score', 'transcript_length'])
        print("\nWARNING: No transcripts were scored successfully!")
    
    if errors:
        error_df = pd.DataFrame(errors)
        error_df.to_csv('scoring_errors.csv', index=False)
        print(f"\nWARNING: {len(errors)} errors occurred (saved to scoring_errors.csv)")
        print(f"First few unique errors:")
        unique_errors = error_df['error'].value_counts().head(3)
        for error_msg, count in unique_errors.items():
            print(f"  {error_msg}: {count} occurrences")
    
    return combined

print("Scoring function ready")

Scoring function ready


In [None]:


# Check current chunk size
print(f"Current chunk size: {scorer.llm_config.get('chunk_size', 'not set')} characters")
print(f"Estimated chunks per transcript: {100000 // scorer.llm_config.get('chunk_size', 2000)} (for 100k char transcript)")

# Increase chunk size for faster scoring (about 5x fewer API calls: 50 -> 10 per 100k-char transcript)
scorer.llm_config['chunk_size'] = 10000  # Increase from 2000 to 10000

print(f"\n✓ Updated chunk size to: {scorer.llm_config['chunk_size']} characters")
print(f"✓ New estimated chunks: {100000 // scorer.llm_config['chunk_size']} per transcript")
print("✓ This will make scoring ~5x faster!")

Current chunk size: 2000 characters
Estimated chunks per transcript: 50 (for 100k char transcript)

✓ Updated chunk size to: 10000 characters
✓ New estimated chunks: 10 per transcript
✓ This will make scoring ~5x faster!


In [None]:
# Run scoring (make sure you've run the previous cells first)
if 'scoring_transcripts' not in dir() or 'save_path' not in dir():
    print("ERROR: Please run the previous cell to prepare transcripts first.")
    raise NameError("Run the transcript preparation cell first")

print(f"Starting scoring at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 70)

scored_data = score_quarter_transcripts(
    scoring_transcripts, 
    scorer, 
    save_path=save_path
)

print("=" * 70)
print(f"Completed at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Starting scoring at 2026-02-02 23:07:35
Inspecting data structure...
Type: <class 'pandas.core.frame.DataFrame'>
Columns: ['symbol', 'quarter', 'year', 'date', 'content', 'structured_content', 'company_name', 'company_id']

First row type: <class 'pandas.core.series.Series'>
First row preview:
symbol                                                                A
quarter                                                               1
year                                                               2025
date                                                2025-02-26 16:30:00
content               Operator: Good afternoon. My name is Regina, a...
structured_content    [{'speaker': 'Operator', 'text': 'Good afterno...
company_name                                 Agilent Technologies, Inc.
company_id                                                     154924.0
Name: 20, dtype: object

Scoring 919 transcripts...
Estimated cost: $0.92 (GPT-4o-mini)
Estimated time: 30.6 minutes
Using transc

Scoring:   0%|          | 0/919 [00:00<?, ?it/s]

In [None]:
# Display scoring results
if 'scored_data' in dir() and len(scored_data) > 0:
    print(f"\nFinal Results:")
    print(f"  Total scored: {len(scored_data)}")
    print(f"  Date range: {scored_data['date'].min()} to {scored_data['date'].max()}")
    print(f"  Average score: {scored_data['score'].mean():.2f}")
    print(f"  Score distribution:")
    print(scored_data['score'].value_counts().sort_index())
    print(f"\nSaved to: {save_path}")
else:
    print("No scored data available. Run the scoring cell first.")

In [None]:
# Aggregate scored transcripts into quarterly AGG scores
if 'scored_data' not in dir() or len(scored_data) == 0:
    print("ERROR: No scored data available. Run the scoring cell first.")
else:
    print("Aggregating individual scores into quarterly AGG scores...")
    
    # Convert to DataFrame if needed
    if isinstance(scored_data, pd.DataFrame):
        scored_df = scored_data.copy()
    else:
        scored_df = pd.DataFrame(scored_data)
    
    # Ensure date column is datetime
    scored_df['date'] = pd.to_datetime(scored_df['date'])
    scored_df['year'] = scored_df['date'].dt.year
    scored_df['quarter'] = scored_df['date'].dt.quarter
    
    # Group by quarter and calculate aggregate score
    agg_scores = scored_df.groupby(['year', 'quarter']).agg({
        'score': ['mean', 'std', 'count']
    }).reset_index()
    
    agg_scores.columns = ['year', 'quarter', 'agg_score', 'score_std', 'num_firms']
    
    # Create quarter date
    agg_scores['date'] = pd.to_datetime(
        agg_scores['year'].astype(str) + '-Q' + agg_scores['quarter'].astype(str)
    )
    
    # Reorder columns
    final_agg_scores = agg_scores[['date', 'year', 'quarter', 'agg_score', 'score_std', 'num_firms']]
    
    # Save AGG scores (filename based on date range)
    agg_filename = f'agg_scores_{year_range}.csv'
    final_agg_scores.to_csv(agg_filename, index=False)
    print(f"\nSUCCESS: Saved {len(final_agg_scores)} quarterly AGG scores to {agg_filename}")
    
    # Display results
    print(f"\nAGG Scores Summary:")
    print(final_agg_scores)
    print(f"\nStatistics:")
    print(f"  Quarters covered: {len(final_agg_scores)}")
    print(f"  Date range: {final_agg_scores['date'].min().strftime('%Y-%m-%d')} to {final_agg_scores['date'].max().strftime('%Y-%m-%d')}")
    print(f"  Mean AGG score: {final_agg_scores['agg_score'].mean():.3f}")
    print(f"  Std AGG score: {final_agg_scores['agg_score'].std():.3f}")
    print(f"  Average firms/quarter: {final_agg_scores['num_firms'].mean():.0f}")

In [None]:
# Initialize feature engineer
engineer = FeatureEngineer('config.yaml')

# Load real AGG scores from a saved file. The aggregation cell above writes
# agg_scores_{year_range}.csv, so use that name when year_range is defined.
agg_path = f'agg_scores_{year_range}.csv' if 'year_range' in dir() else 'agg_scores.csv'
try:
    agg_scores = pd.read_csv(agg_path)
    agg_scores['date'] = pd.to_datetime(agg_scores['date'])
    print(f"✓ Loaded real AGG scores from {agg_path}: {len(agg_scores)} quarters")
    print(agg_scores.head())
except FileNotFoundError:
    print(f"⚠ No saved AGG scores found at {agg_path}. You need to:")
    print("  1. Score earnings transcripts using LLMScorer.score_multiple_transcripts()")
    print("  2. Aggregate scores by quarter using aggregate_scores_by_quarter()")
    print(f"  3. Save to '{agg_path}'")
    print("\nFor demonstration, showing expected data structure...")
    # Empty frame with the expected columns instead of synthetic data.
    # Real agg_score values would be 1-5 from the LLM.
    agg_scores = pd.DataFrame(columns=['date', 'year', 'quarter', 'agg_score'])
    print("\nExpected columns: date, year, quarter, agg_score")
    print("Cannot proceed with feature engineering without real data")

In [None]:
# Normalize scores (only if we have real data)
if len(agg_scores) > 0 and 'agg_score' in agg_scores.columns:
    normalized = engineer.normalize_scores(agg_scores, method='zscore', window=20)
    print("\nNormalized Scores:")
    print(normalized[['date', 'agg_score', 'agg_score_norm']].head(10))
else:
    print("⚠ Cannot normalize without real AGG scores")
    normalized = pd.DataFrame()

In [None]:
# Create delta features (only if we have normalized data)
if len(normalized) > 0:
    with_deltas = engineer.create_delta_features(normalized)
    print("\nDelta Features:")
    print(with_deltas[['date', 'agg_score', 'yoy_change', 'qoq_change', 'momentum']].tail(10))
else:
    print("⚠ Cannot create delta features without normalized scores")
    with_deltas = pd.DataFrame()

In [None]:
# Visualize AGG score and deltas (only if we have features)
if len(with_deltas) > 0:
    fig, axes = plt.subplots(3, 1, figsize=(12, 8))

    # AGG score
    axes[0].plot(with_deltas['date'], with_deltas['agg_score'], linewidth=2)
    axes[0].set_title('AGG Score (National Economic Sentiment)', fontsize=12, fontweight='bold')
    axes[0].set_ylabel('Score')
    axes[0].grid(True, alpha=0.3)

    # YoY change
    valid_yoy = with_deltas.dropna(subset=['yoy_change'])
    axes[1].bar(valid_yoy['date'], valid_yoy['yoy_change'], color='steelblue', alpha=0.7)
    axes[1].set_title('YoY Change (AGG_t - AGG_t-4)', fontsize=12, fontweight='bold')
    axes[1].set_ylabel('Change')
    axes[1].grid(True, alpha=0.3)

    # Momentum
    valid_momentum = with_deltas.dropna(subset=['momentum'])
    axes[2].bar(valid_momentum['date'], valid_momentum['momentum'], color='coral', alpha=0.7)
    axes[2].set_title('Momentum (Acceleration)', fontsize=12, fontweight='bold')
    axes[2].set_ylabel('Momentum')
    axes[2].set_xlabel('Date')
    axes[2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    print("✓ Feature visualization complete")
else:
    print("⚠ Cannot visualize features without delta features")

## Step 4: Prediction Models

In [None]:
pred_model = PredictionModel('config.yaml')
print(dir(pred_model))

In [None]:
feature_cols = ['agg_score_norm', 'yoy_change', 'qoq_change', 'momentum']

# Keep 'date' next to the features so each row stays aligned with its date after dropna()
features = with_deltas[['date'] + feature_cols].dropna(subset=feature_cols).reset_index(drop=True)

# Join features to GDP observations on the quarter date
gdp_df = macro_data['gdp'].copy()
gdp_df['date'] = pd.to_datetime(gdp_df['date'])
train_data = features.merge(gdp_df, on='date', how='inner')

# NumPy arrays for the model; 'value' holds the GDP observations
X_train = train_data[feature_cols].values
y_train = train_data['value'].values
print(f"Training data: {X_train.shape}")
print(f"Target data: {y_train.shape}")
gdp_models = pred_model.train_gdp_models(X_train, y_train)
print(f"Model R²: {gdp_models['gdp'].score(X_train, y_train):.3f}")

In [None]:
# Train GDP prediction model
gdp_model = pred_model.train_gdp_model(X_train, y_train)
print(f"\nGDP Model Trained")
print(f"  Model type: {type(gdp_model).__name__}")
print(f"  Training R²: {gdp_model.score(X_train, y_train):.3f}")

In [None]:
# Make predictions using real test data
if len(agg_scores) > 0 and 'agg_score' in agg_scores.columns:
    # Use the most recent features for out-of-sample prediction
    test_features = with_deltas[['agg_score_norm', 'yoy_change', 'qoq_change', 'momentum']].dropna().tail(10)
    test_dates = with_deltas.loc[test_features.index, 'date']
    
    predictions = gdp_model.predict(test_features.values)

    print(f"\nGDP Predictions (1Q ahead) for recent quarters:")
    for date, pred in zip(test_dates, predictions):
        print(f"  {date.strftime('%Y-%m-%d')}: {pred:.3f}%")
    print(f"\n  Mean: {predictions.mean():.3f}%")
    print(f"  Std: {predictions.std():.3f}%")
    print(f"  Range: [{predictions.min():.3f}, {predictions.max():.3f}]%")
else:
    print("⚠ Cannot make predictions without real AGG scores")

## Step 5: Signal Generation & Backtesting

In [None]:
# Initialize signal generator
signal_gen = SignalGenerator('config.yaml')

# Use real predictions from trained models
# This requires: 
# 1. Features from AGG scores
# 2. Trained GDP/IP models
# 3. SPF forecasts from data_acq.fetch_spf_forecasts()

if len(agg_scores) > 0 and 'agg_score' in agg_scores.columns:
    # Use real model predictions
    features_for_pred = with_deltas[['agg_score_norm', 'yoy_change', 'qoq_change', 'momentum']].dropna()
    dates_for_pred = with_deltas.loc[features_for_pred.index, 'date']

    # Get predictions from trained model
    gdp_predictions = gdp_model.predict(features_for_pred.values)

    # Fetch real SPF forecasts
    try:
        spf_data = data_acq.fetch_spf_forecasts(start_date, end_date)
        spf_data['date'] = pd.to_datetime(spf_data['date'])
    except Exception as e:
        print(f"⚠ Could not fetch SPF data: {e}")
        # Fall back to a constant 2.0 placeholder so the merge below still works
        spf_data = pd.DataFrame({'date': dates_for_pred, 'rgdp_1q': [2.0]*len(dates_for_pred)})

    # Combine predictions with SPF
    predictions_df = pd.DataFrame({
        'date': dates_for_pred.values,
        'gdp_pred': gdp_predictions
    })
    predictions_df = predictions_df.merge(spf_data[['date', 'rgdp_1q']], on='date', how='left')
    predictions_df.rename(columns={'rgdp_1q': 'gdp_spf'}, inplace=True)

    print("✓ Real Predictions vs SPF:")
    print(predictions_df.head())
else:
    print("⚠ Cannot generate predictions without real AGG scores")
    predictions_df = pd.DataFrame()

In [None]:
# Generate trading signals (only if we have real predictions)
if len(predictions_df) > 0:
    signals = signal_gen.generate_signals(predictions_df)
    print(f"\n📊 Trading Signals Generated:")
    print(signals.head(10))
    print(f"\nSignal distribution:")
    print(signals['signal'].value_counts())
else:
    print("⚠ Cannot generate signals without predictions")
    signals = pd.DataFrame()

In [None]:
# Initialize backtester
backtester = Backtester('config.yaml')

# Use real returns from strategy execution
# This requires:
# 1. Trading signals from signal_gen.generate_signals()
# 2. Sector ETF price data
# 3. Portfolio construction and rebalancing

if len(predictions_df) > 0:
    # Fetch real ETF price data for sectors
    sector_etfs = config['strategy']['sector_etfs']
    etf_start = config['backtest']['test_start']
    etf_end = config['backtest']['test_end']
    
    etf_prices = data_acq.fetch_etf_prices(sector_etfs, etf_start, etf_end)
    
    if etf_prices:
        print(f"✓ Fetched price data for {len(etf_prices)} sector ETFs")

        # Run backtest with real data
        # Note: This requires implementing the full backtesting logic
        # For now, we show the structure
        print("\n⚠ Full backtest execution requires:")
        print("  1. Signals from signal_gen.generate_signals(predictions_df)")
        print("  2. Portfolio construction based on signals")
        print("  3. Daily rebalancing and return calculation")
        print("  4. Benchmark comparison (SPY or equal-weight)")

        portfolio_returns = pd.DataFrame()
        print("\nPlease implement backtester.run_backtest(signals, etf_prices) for real returns")
    else:
        print("⚠ No ETF price data available")
        portfolio_returns = pd.DataFrame()
else:
    print("⚠ Cannot run backtest without predictions")
    portfolio_returns = pd.DataFrame()

if len(portfolio_returns) > 0:
    # Calculate performance metrics
    metrics = backtester.calculate_metrics(portfolio_returns)
    print(f"\n📈 Performance Metrics:")
    for metric, value in metrics.items():
        if isinstance(value, float):
            print(f"  {metric}: {value:.3f}")
        else:
            print(f"  {metric}: {value}")
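The cell above leaves `backtester.run_backtest(signals, etf_prices)` unimplemented. As a rough sketch of the return calculation it lists (portfolio weights, daily returns, equal-weight benchmark), the function below assumes inputs that this notebook does not confirm: a `weights` frame of target weights per rebalance date and a `prices` frame of daily closes, both with tickers as columns. The name `sketch_run_backtest` and the two tickers are illustrative. The output columns `date`, `strategy_return`, and `benchmark_return` match what the plotting cell expects.

```python
import pandas as pd

def sketch_run_backtest(weights, prices):
    """Hypothetical stand-in for backtester.run_backtest().

    weights: target weights per rebalance date (index=date, columns=tickers)
    prices:  daily close prices (index=date, columns=tickers)
    """
    returns = prices.pct_change().dropna(how="all")
    # Carry each target weight forward, then lag by one day so a weight
    # set at the close of day t only earns returns from day t+1 onward.
    daily_w = weights.reindex(prices.index).ffill().shift(1)
    daily_w = daily_w.reindex(returns.index).fillna(0.0)
    return pd.DataFrame({
        "date": returns.index,
        "strategy_return": (daily_w * returns).sum(axis=1).values,
        "benchmark_return": returns.mean(axis=1).values,  # equal-weight benchmark
    })

# Illustrative prices only
dates = pd.date_range("2024-01-01", periods=4, freq="D")
prices = pd.DataFrame({"XLK": [100.0, 110.0, 121.0, 121.0],
                       "XLE": [100.0, 100.0, 90.0, 99.0]}, index=dates)
weights = pd.DataFrame({"XLK": [1.0], "XLE": [0.0]}, index=dates[:1])
demo_returns = sketch_run_backtest(weights, prices)
print(demo_returns)
```

The one-day lag on weights is the main design choice here: without it, the strategy would earn the return of the same day on which the signal became known. Transaction costs are ignored in this sketch.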

In [None]:
# Calculate cumulative returns and plot (only if we have real returns)
if len(portfolio_returns) > 0 and 'strategy_return' in portfolio_returns.columns:
    portfolio_returns['strategy_cumret'] = (1 + portfolio_returns['strategy_return']).cumprod() - 1
    portfolio_returns['benchmark_cumret'] = (1 + portfolio_returns['benchmark_return']).cumprod() - 1

    fig, ax = plt.subplots(figsize=(12, 6))
    ax.plot(portfolio_returns['date'], portfolio_returns['strategy_cumret'] * 100, 
            label='Strategy', linewidth=2)
    ax.plot(portfolio_returns['date'], portfolio_returns['benchmark_cumret'] * 100, 
            label='Benchmark', linewidth=2, linestyle='--')

    ax.set_title('Strategy vs Benchmark Cumulative Returns', fontsize=12, fontweight='bold')
    ax.set_ylabel('Return (%)')
    ax.set_xlabel('Date')
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()


    print("✓ Backtest visualization complete")
else:
    print("⚠ No portfolio returns available for visualization")
    print("\nTo complete the full pipeline with real data:")
    print("1. Score earnings transcripts → agg_scores.csv")
    print("2. Engineer features from AGG scores")
    print("3. Train prediction models")
    print("4. Generate trading signals")
    print("5. Execute backtest with real ETF prices")
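The cumulative-return lines in the cell above compound daily returns with `(1 + r).cumprod() - 1`. The same wealth curve is enough to derive common summary metrics. The sketch below is a guess at what a function like `backtester.calculate_metrics()` might report, not its actual implementation: the name `sketch_metrics`, the metric names, the 252 trading days per year, and the zero risk-free rate are all assumptions, and it takes a plain sequence of returns, not the full `portfolio_returns` frame.

```python
import numpy as np
import pandas as pd

def sketch_metrics(daily_returns, periods_per_year=252):
    """Hypothetical stand-in for backtester.calculate_metrics()."""
    r = pd.Series(daily_returns, dtype=float)
    wealth = (1 + r).cumprod()  # growth of 1 unit of starting capital
    # Running peak, floored at the 1.0 starting capital
    peak = wealth.cummax().clip(lower=1.0)
    vol = r.std(ddof=1)
    return {
        "total_return": float(wealth.iloc[-1] - 1),
        "annualized_volatility": float(vol * np.sqrt(periods_per_year)),
        # Sharpe ratio with a zero risk-free rate (assumption)
        "sharpe_ratio": float(r.mean() / vol * np.sqrt(periods_per_year))
                        if vol > 0 else float("nan"),
        "max_drawdown": float((wealth / peak - 1).min()),
    }

# Illustrative returns only
demo_metrics = sketch_metrics([0.10, -0.50, 0.20])
for name, value in demo_metrics.items():
    print(f"  {name}: {value:.3f}")
```

With the notebook's data, the input would be something like `portfolio_returns['strategy_return']`, once real returns exist.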

## Summary: Complete Pipeline with Real Data

This notebook demonstrates the **AI Economy Score Predictor** strategy pipeline using **real data sources**:

### ✅ Real Data Used:
1. **Macroeconomic Data**: From FRED API (GDP, Industrial Production, Employment, Wages)
2. **Control Variables**: From FRED API (Yield Curve, Consumer Sentiment, Unemployment)
3. **PMI Data**: Loaded from `pmi_data.csv` 
4. **S&P 500 Constituents**: From `constituents.csv`
5. **ETF Prices**: Fetched via yfinance API

### ⚠️ Real Data Needed:
- **Earnings Call Transcripts** with LLM sentiment scores aggregated quarterly → `agg_scores.csv`

### Pipeline Steps:
1. **Data Acquisition** ✓ Uses real FRED API and local files
2. **LLM Scoring** → Requires real earnings transcripts (Seeking Alpha, CapIQ, Bloomberg)
3. **Feature Engineering** ✓ Works with real AGG scores once available
4. **Prediction Models** ✓ Trains on real macro data + AGG features
5. **Signal Generation** ✓ Compares predictions to SPF forecasts
6. **Backtesting** ✓ Fetches real sector ETF prices; computing returns still requires implementing `backtester.run_backtest(signals, etf_prices)`

### Next Steps:
1. Obtain earnings call transcripts from a data provider
2. Score transcripts using `LLMScorer.score_multiple_transcripts()`
3. Aggregate scores by quarter and save to `agg_scores.csv`
4. Re-run this notebook to execute the full pipeline with real signals

**No synthetic/random data is used for actual trading signals - all results require real transcript scoring.**
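Step 3 of the next steps (aggregate scores by quarter) can be sketched as follows. This is not the notebook's own `aggregate_scores_by_quarter()` function, which is defined elsewhere; the name `sketch_aggregate_by_quarter`, the per-transcript input columns `date` and `score`, the simple mean, and the `n_transcripts` column are assumptions. The `agg_score` output column matches what the prediction cell checks for.

```python
import pandas as pd

def sketch_aggregate_by_quarter(scores):
    """Hypothetical helper: average per-transcript scores within each quarter."""
    df = scores.copy()
    df["date"] = pd.to_datetime(df["date"])
    quarter = df["date"].dt.to_period("Q")
    out = (
        df.groupby(quarter)["score"]
          .agg(agg_score="mean", n_transcripts="count")
          .reset_index()
    )
    # Stamp each row with the last calendar day of its quarter
    out["date"] = out["date"].dt.end_time.dt.normalize()
    return out

# Illustrative scores on the 1-5 scale (tickers are placeholders)
demo_scores = pd.DataFrame({
    "date": ["2023-01-15", "2023-02-20", "2023-05-10"],
    "ticker": ["AAA", "BBB", "AAA"],
    "score": [4.0, 2.0, 5.0],
})
quarterly = sketch_aggregate_by_quarter(demo_scores)
print(quarterly)
```

The result could then be written out with `quarterly.to_csv('agg_scores.csv', index=False)`. A weighted mean (for example by market capitalization) would be a reasonable alternative to the equal-weight average used here.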

In [None]:
# Check data availability
import os

print("📁 Data File Status:\n")

required_files = {
    'config.yaml': 'Configuration file',
    'constituents.csv': 'S&P 500 constituents',
    'pmi_data.csv': 'PMI data'
}

optional_files = {
    'agg_scores.csv': 'Aggregated LLM sentiment scores (REQUIRED for full pipeline)'
}

for file, desc in required_files.items():
    status = "✓" if os.path.exists(file) else "✗"
    print(f"{status} {file}: {desc}")

print("\nOptional (but critical):")
for file, desc in optional_files.items():
    status = "✓" if os.path.exists(file) else "✗ MISSING"
    print(f"{status} {file}: {desc}")

if not os.path.exists('agg_scores.csv'):
    print("\n⚠️  To create agg_scores.csv, you need to:")
    print("   1. Get earnings transcripts from a data provider")
    print("   2. Run LLM scoring (see 'Note: To Use Real Data' section above)")
    print("   3. Use the aggregate_scores_by_quarter() function")