## Notebook 02: LLM Analysis and Relevance Scoring

**Objective**: Use GPT to analyze regulatory relevance of collected news articles

**LLM Outputs (structured):**
- Relevance score (0-10): How relevant the article is to institutional investment regulation
- Impact level: High/Medium/Low impact on institutional investors
- Category: Type of regulation (monetary policy, banking regulation, securities regulation, etc.)
- Key regulators: Regulators and institutions mentioned
- What happened: Brief overview of the regulatory event or announcement
- Why relevant: Why the event matters for institutional investors
- Confidence: High/Medium/Low clarity of regulatory significance

**Approach:**
1. Load raw data from Notebook 01
2. Design and test prompt on sample articles
3. Batch process all articles with rate limiting
4. Validate LLM outputs
5. Save enriched dataset for dashboard

**Model**: GPT-3.5-turbo (cost-effective, sufficient for classification task)

### **Basic Library Import and Setup**

**What I am doing:**
- Import libraries for data processing and OpenAI API
- Load environment variables and configuration
- Initialize OpenAI client

**Why I'm doing this:**
- Setup environment for LLM calls
- Verify API credentials before processing
- Load config parameters for analysis

**Technical Note:** openai library, dotenv, pandas
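
The cells in this notebook read several keys from `config.yaml`. As a minimal sketch, the check below lists any that are missing before an API call is made. The key names are the ones used later in this notebook; the helper `missing_config_keys` and the values in `example_config` are hypothetical placeholders, not the project's real settings.

```python
# Keys this notebook reads from config.yaml, grouped by section
REQUIRED_KEYS = {
    "llm": ["model", "temperature", "max_tokens"],
    "analysis": ["relevance_threshold", "categories", "impact_levels", "rate_limit_delay"],
}


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return dotted names of required keys that are absent from config."""
    missing = []
    for section, keys in required.items():
        section_values = config.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return missing


# Placeholder values for illustration only (not the real config)
example_config = {
    "llm": {"model": "gpt-3.5-turbo", "temperature": 0.0, "max_tokens": 500},
    "analysis": {"relevance_threshold": 5, "rate_limit_delay": 0.5},
}

print(missing_config_keys(example_config))
```

In the notebook this could run right after `yaml.safe_load`, raising an error if the returned list is not empty.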

In [None]:
import os
import json
import yaml
from datetime import datetime
from pathlib import Path
import time

import pandas as pd
import numpy as np
from openai import OpenAI
from dotenv import load_dotenv
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')

In [None]:
load_dotenv()

with open('../config.yaml', 'r') as f:
    config = yaml.safe_load(f)

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

# Fail fast: every later cell needs `client`, so stop here if the key is missing
if not OPENAI_API_KEY or OPENAI_API_KEY == 'your_openai_key_here':
    raise RuntimeError("OpenAI API key not found. Set OPENAI_API_KEY in the environment or .env file.")

print("OpenAI key loaded")
client = OpenAI(api_key=OPENAI_API_KEY)
print("OpenAI client initialized")

print(f"Model: {config['llm']['model']}")
print(f"Analysis threshold: {config['analysis']['relevance_threshold']}")

### **Loading Raw Data From NB01**

**What I am doing:**
- Find and load the most recent raw data file from Notebook 01
- Display basic statistics about the dataset
- Verify data structure before LLM processing

**Why I'm doing this:**
- Continue pipeline from data collection
- Validate data loaded correctly before expensive API calls
- Confirm article counts and structure

**Technical Note:** pathlib for file handling, pandas for data loading
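
To verify the data structure before any paid API calls, a small column check is one option. This is a sketch under stated assumptions: the column names are those this notebook reads (`title`, `market_name`, `source_type`, plus `description` and `source_name`, which the prompt treats as optional), and `check_columns` is a hypothetical helper.

```python
# Columns accessed directly in this notebook (a missing one raises KeyError later)
REQUIRED_COLUMNS = ["title", "market_name", "source_type"]
# Columns read with .get() and a default, so their absence is tolerated
OPTIONAL_COLUMNS = ["description", "source_name"]


def check_columns(columns, required=REQUIRED_COLUMNS, optional=OPTIONAL_COLUMNS):
    """Report which expected columns are absent from an iterable of column names."""
    present = set(columns)
    return {
        "missing_required": [c for c in required if c not in present],
        "missing_optional": [c for c in optional if c not in present],
    }


# Illustrative column list; in the notebook this would be df.columns
report = check_columns(["title", "market_name", "description", "url"])
print(report)
```

A non-empty `missing_required` list is a reason to stop and rerun Notebook 01; a non-empty `missing_optional` list only means the prompt falls back to its defaults.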

In [None]:
raw_data_dir = Path('../data/raw')
raw_files = sorted(raw_data_dir.glob('news_raw_*.json'))

if not raw_files:
    print("ERROR: No raw data files found. Run Notebook 01 first.")
else:
    latest_file = raw_files[-1]
    print(f"Loading: {latest_file.name}")
    
    df = pd.read_json(latest_file)
    
    print(f"\nDataset loaded: {len(df)} articles")
    print(f"Markets: {df['market_name'].value_counts().to_dict()}")
    print(f"Sources: {df['source_type'].value_counts().to_dict()}")
    print(f"\nColumns: {list(df.columns)}")
    print(f"\nSample article:")
    print(f"Title: {df.iloc[0]['title']}")
    print(f"Market: {df.iloc[0]['market_name']}")

### **LLM Analysis Prompt**

**What I am doing:**
- Create system prompt defining analyst role for Norges Bank context
- Design structured JSON output with split summary (event + relevance)
- Define analysis criteria focused on portfolio impact and regulatory signals

**Why I'm doing this:**
- Separating the event description from its relevance makes each insight easier to read
- A prompt tailored to the Norges Bank context produces more focused analysis
- Structured output ensures reliable parsing for dashboard display

**Technical Note:** String templating, JSON schema design
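
JSON mode guarantees parseable JSON but not that every requested field is present or in range. A minimal sketch of a per-response check follows; the field names come from the JSON format requested in the prompt in the next cell, while `validate_analysis` and both sample dictionaries are hypothetical and made up for illustration.

```python
REQUIRED_FIELDS = (
    "relevance_score", "impact_level", "category", "key_regulators",
    "what_happened", "why_relevant", "confidence",
)
LEVELS = {"high", "medium", "low"}


def validate_analysis(analysis):
    """Return a list of problems in one parsed response (empty list means it passed)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in analysis]

    if "relevance_score" in analysis:
        score = analysis["relevance_score"]
        # bool is a subclass of int in Python, so reject it explicitly
        is_int = isinstance(score, int) and not isinstance(score, bool)
        if not is_int or not 0 <= score <= 10:
            problems.append(f"relevance_score must be an integer from 0 to 10, got {score!r}")

    for name in ("impact_level", "confidence"):
        if name in analysis and str(analysis[name]).strip().lower() not in LEVELS:
            problems.append(f"{name} must be high, medium or low, got {analysis[name]!r}")

    if "key_regulators" in analysis and not isinstance(analysis["key_regulators"], list):
        problems.append("key_regulators must be a list")

    return problems


# Invented examples, not real model output
good = {
    "relevance_score": 8, "impact_level": "high", "category": "monetary_policy",
    "key_regulators": ["European Central Bank"], "what_happened": "Rate decision",
    "why_relevant": "Affects bond holdings", "confidence": "medium",
}
bad = {"relevance_score": 14, "impact_level": "severe"}

print(validate_analysis(good))
print(len(validate_analysis(bad)), "problems found in the bad example")
```

Responses that return problems could be logged alongside API errors instead of flowing silently into the dashboard.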

In [None]:
SYSTEM_PROMPT = """You are a regulatory intelligence analyst for Norges Bank Investment Management, the world's largest sovereign wealth fund managing a $1.5 trillion global equity portfolio.

Your role is to identify regulatory developments that could impact institutional investment operations.

Evaluate news for:
1. Portfolio Impact: Does this affect holdings, investment strategies, or market access?
2. Systemic Risk: Does this signal broader market instability or regulatory shifts?
3. Early Warning: Is this an emerging regulatory trend before formal implementation?

Prioritize: Central bank policy, securities regulation, cross-border investment rules, and systemic financial regulation.

Assign LOW scores (0-3) to: Company-specific news, non-financial regulation, opinion pieces without regulatory substance."""

def create_analysis_prompt(article):
    """Generate analysis prompt for an article"""
    
    user_prompt = f"""Analyze this article for regulatory relevance to institutional investors:

Title: {article['title']}
Description: {article.get('description', 'N/A')}
Source: {article.get('source_name', 'Unknown')}
Market: {article['market_name']}

Return analysis in this exact JSON format:
{{
  "relevance_score": <0-10 integer, where 10 is critical regulatory impact>,
  "impact_level": "<high/medium/low>",
  "category": "<monetary_policy/banking_regulation/securities_regulation/market_infrastructure/central_banking/cross_border_investment/other>",
  "key_regulators": ["<primary regulator>", "<secondary if applicable>"],
  "what_happened": "<Brief overview of the regulatory event or announcement>",
  "why_relevant": "<Why this matters specifically for institutional investors like Norges Bank>",
  "confidence": "<high/medium/low - clarity of regulatory significance>"
}}

Be concise and actionable. Focus on investment decision-making intelligence."""
    
    return user_prompt

print("Prompt design complete")
print(f"\nCategories: {config['analysis']['categories']}")
print(f"Impact levels: {config['analysis']['impact_levels']}")

In [None]:
print("Testing prompt on sample articles\n")

test_articles = df.sample(3, random_state=42)

for idx, article in test_articles.iterrows():
    print(f"\n{'='*60}")
    print(f"TEST ARTICLE: {article['title'][:80]}...")
    print(f"Market: {article['market_name']}")
    print(f"{'='*60}\n")
    
    try:
        response = client.chat.completions.create(
            model=config['llm']['model'],
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": create_analysis_prompt(article)}
            ],
            temperature=config['llm']['temperature'],
            max_tokens=config['llm']['max_tokens'],
            response_format={"type": "json_object"}
        )
        
        analysis = json.loads(response.choices[0].message.content)
        
        print("LLM Analysis:")
        print(f"Relevance Score: {analysis.get('relevance_score')}/10")
        print(f"Impact Level: {analysis.get('impact_level')}")
        print(f"Category: {analysis.get('category')}")
        print(f"Key Regulators: {analysis.get('key_regulators')}")
        print(f"\nWhat Happened: {analysis.get('what_happened')}")
        print(f"Why Relevant: {analysis.get('why_relevant')}")
        print(f"Confidence: {analysis.get('confidence')}")
        
    except Exception as e:
        print(f"ERROR: {e}")

print("\n\nPrompt test complete. Review outputs before full batch processing.")

In [None]:
print(json.dumps(analysis, indent=2))

### **Batch Processing**

**What I am doing:**
- Loop through all 652 articles and call OpenAI API for each
- Add rate limiting delays to avoid API throttling
- Store analysis results with error handling for failed articles

**Why I'm doing this:**
- Generate relevance scores and categorization for entire dataset
- Progress bar shows estimated completion time (~5-7 minutes)
- Error handling ensures one bad article doesn't stop entire batch

**Technical Note:** tqdm for progress tracking, time.sleep for rate limiting, try-except for error handling
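
The loop in the next cell records a failed article and moves on. An alternative for transient failures such as throttling is to retry with exponential backoff before giving up. The sketch below is not what this notebook does; `call_with_retry` is a hypothetical helper, and the demo uses a simulated failure and a fake `sleep` so it runs instantly.

```python
import time


def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**attempt, then try again.

    The exception from the final attempt is re-raised to the caller.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Simulated call that fails twice, then succeeds
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated transient failure")
    return "ok"


waits = []  # collects the delays instead of actually sleeping
result = call_with_retry(flaky, sleep=waits.append)
print(result, waits)
```

In the batch loop, `fn` would be a zero-argument lambda wrapping the `client.chat.completions.create(...)` call, with the existing `except` block kept as the final fallback.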

In [None]:
print(f"Starting batch analysis of {len(df)} articles")
print(f"Estimated time: {len(df) * 1.5 / 60:.1f} minutes\n")

results = []
errors = []

for idx, article in tqdm(df.iterrows(), total=len(df), desc="Analyzing articles"):
    try:
        response = client.chat.completions.create(
            model=config['llm']['model'],
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": create_analysis_prompt(article)}
            ],
            temperature=config['llm']['temperature'],
            max_tokens=config['llm']['max_tokens'],
            response_format={"type": "json_object"}
        )
        
        analysis = json.loads(response.choices[0].message.content)
        analysis['article_index'] = idx
        results.append(analysis)

    except Exception as e:
        # Record the failure and add a placeholder row so the merge still covers every article
        errors.append({'index': idx, 'title': article['title'], 'error': str(e)})
        results.append({
            'article_index': idx,
            'relevance_score': 0,
            'impact_level': 'low',
            'category': 'other',
            'key_regulators': [],
            'what_happened': 'Analysis failed',
            'why_relevant': 'N/A',
            'confidence': 'low'
        })
    finally:
        # Pause after every request, including failed ones, so errors do not bypass rate limiting
        time.sleep(config['analysis']['rate_limit_delay'])

print(f"\n\nAnalysis complete!")
print(f"Successful: {len(results) - len(errors)}")
print(f"Errors: {len(errors)}")

if errors:
    print("\nFirst few errors:")
    for err in errors[:3]:
        print(f"  - {err['title'][:60]}... : {err['error']}")

### **Merging Analysis With Article Data**

**What I am doing:**
- Convert LLM analysis results to DataFrame
- Merge analysis with original article data on index
- Verify merge completed successfully with all expected columns

**Why I'm doing this:**
- Creates single enriched dataset with articles and analysis
- Enables filtering by relevance scores and categories
- Prepares data structure for dashboard consumption

**Technical Note:** pandas DataFrame operations, merge validation
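
One way to make the merge verification explicit is to use the `validate` and `indicator` options of pandas `merge`. The toy frames below are invented for illustration and stand in for `df` and `analysis_df`; this is a sketch of the technique, not the notebook's own cell.

```python
import pandas as pd

# Toy stand-ins: three articles, but analysis rows for only two of them
articles = pd.DataFrame({"title": ["A", "B", "C"]})
analysis = pd.DataFrame({"article_index": [0, 1], "relevance_score": [7, 2]})

articles = articles.assign(article_index=articles.index)

merged = articles.merge(
    analysis,
    on="article_index",
    how="left",
    validate="one_to_one",  # raises MergeError if a key repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Articles that received no analysis row
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), list(unmatched["title"]))
```

With a left merge the row count should equal the number of articles, and any `left_only` rows point to articles the batch loop skipped.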

In [None]:
print("Merging analysis with article data\n")

analysis_df = pd.DataFrame(results)

df_enriched = df.copy()
df_enriched['article_index'] = df_enriched.index

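# validate='one_to_one' makes pandas raise if article_index is duplicated on either side
df_enriched = df_enriched.merge(
    analysis_df,
    on='article_index',
    how='left',
    validate='one_to_one'
)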

print(f"Enriched dataset shape: {df_enriched.shape}")
print(f"New columns added: {[col for col in analysis_df.columns if col != 'article_index']}")

print("\nSample enriched article:")
sample = df_enriched.iloc[0]
print(f"Title: {sample['title']}")
print(f"Market: {sample['market_name']}")
print(f"Relevance: {sample['relevance_score']}/10")
print(f"Impact: {sample['impact_level']}")
print(f"Category: {sample['category']}")
print(f"What happened: {sample['what_happened']}")

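# Check the result instead of assuming success: a left merge must keep one row per article
assert len(df_enriched) == len(df), "Row count changed during merge"
missing_scores = int(df_enriched['relevance_score'].isna().sum())
print(f"\nMerge complete: {len(df_enriched)} rows, {missing_scores} without a relevance score")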

### **Validating Analysis Results**

**What I am doing:**
- Check distribution of relevance scores across markets
- Count articles by category and impact level
- Identify high-relevance articles for dashboard spotlight

**Why I'm doing this:**
- Ensure LLM analysis is reasonable (not all 10s or all 0s)
- Understand category distribution for visualization planning
- Find top articles to feature prominently in dashboard

**Technical Note:** pandas value_counts, groupby, filtering
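
The "not all 10s or all 0s" check can also be automated. The sketch below flags a score column where a single value dominates. The helper `score_concentration` and the 0.8 cutoff are hypothetical choices, and the sample scores are invented.

```python
from collections import Counter


def score_concentration(scores, max_share=0.8):
    """Report the most common score and whether it exceeds max_share of all scores."""
    scores = list(scores)
    if not scores:
        raise ValueError("scores is empty")
    top_score, top_count = Counter(scores).most_common(1)[0]
    share = top_count / len(scores)
    return {"top_score": top_score, "share": share, "suspicious": share > max_share}


# Invented example: nine articles scored 0 and one scored 7
skewed = score_concentration([0, 0, 0, 0, 0, 0, 0, 0, 0, 7])
print(skewed["top_score"], skewed["suspicious"])
```

In the notebook the input would be `df_enriched['relevance_score']`. A large share of zeros can also come from the placeholder rows written for failed API calls, so it is worth comparing against the error count.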

In [None]:

print("\nRelevance Score Distribution:")
print(df_enriched['relevance_score'].describe())

print(f"\nScore breakdown:")
print(df_enriched['relevance_score'].value_counts().sort_index())


In [None]:
print("\nArticles by Category:")
print(df_enriched['category'].value_counts())

print("\nArticles by Impact Level:")
print(df_enriched['impact_level'].value_counts())

In [None]:
print("\nAverage Relevance by Market:")
print(df_enriched.groupby('market_name')['relevance_score'].mean().sort_values(ascending=False))

print("\nHigh-Relevance Articles (score >= 7):")
high_relevance = df_enriched[df_enriched['relevance_score'] >= 7]
print(f"Count: {len(high_relevance)} ({len(high_relevance)/len(df_enriched)*100:.1f}%)")

print("\nTop 5 Highest Scored Articles:")
top_articles = df_enriched.nlargest(5, 'relevance_score')[['title', 'market_name', 'relevance_score', 'category']]
for idx, row in top_articles.iterrows():
    print(f"\n[{row['relevance_score']}/10] {row['title'][:70]}...")
    print(f"  Market: {row['market_name']} | Category: {row['category']}")

### **Saving Processed Data**

In [None]:
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
output_dir = Path('../data/processed')
output_dir.mkdir(parents=True, exist_ok=True)

output_file = output_dir / f'analyzed_articles_{timestamp}.json'
df_enriched.to_json(output_file, orient='records', date_format='iso', indent=2)

print(f"Saved to: {output_file}")
print(f"Total articles analyzed: {len(df_enriched)}")

summary = {
    'analysis_timestamp': timestamp,
    'total_articles': len(df_enriched),
    'llm_model': config['llm']['model'],
    'relevance_stats': {
        'mean': float(df_enriched['relevance_score'].mean()),
        'median': float(df_enriched['relevance_score'].median()),
        'high_relevance_count': int((df_enriched['relevance_score'] >= 7).sum())
    },
    'categories': df_enriched['category'].value_counts().to_dict(),
    'impact_levels': df_enriched['impact_level'].value_counts().to_dict(),
    'by_market': df_enriched.groupby('market_name')['relevance_score'].agg(['count', 'mean']).to_dict()
}

summary_file = output_dir / f'analysis_summary_{timestamp}.json'
with open(summary_file, 'w') as f:
    json.dump(summary, f, indent=2)

print(f"\nSummary saved to: {summary_file}")
print("\nNotebook 02 complete! Ready for visualization in Notebook 03.")