# Day 03: Sentiment Analysis for Trading Signals

## Week 19: NLP & Alternative Data

---

## Learning Objectives

1. **Understand sentiment analysis fundamentals** for financial applications
2. **Implement lexicon-based methods** (VADER, Loughran-McDonald)
3. **Apply ML-based sentiment models** (FinBERT, DistilBERT)
4. **Convert sentiment scores to trading signals**
5. **Backtest sentiment-based strategies**

---

## Table of Contents

1. [Introduction to Financial Sentiment Analysis](#1-introduction)
2. [Text Preprocessing for Financial Data](#2-preprocessing)
3. [Lexicon-Based Sentiment Analysis](#3-lexicon)
4. [ML-Based Sentiment (FinBERT)](#4-finbert)
5. [Aggregating Sentiment Signals](#5-aggregation)
6. [Building Trading Signals from Sentiment](#6-signals)
7. [Backtesting Sentiment Strategies](#7-backtesting)
8. [Interview Questions](#8-interview)

---

## 1. Introduction to Financial Sentiment Analysis <a id='1-introduction'></a>

### Why Sentiment Matters in Trading

- **Market psychology**: Prices partly reflect the collective sentiment of market participants
- **Information asymmetry**: Text data may contain information that is not yet reflected in prices
- **Lead-lag relationships**: Shifts in sentiment sometimes precede price moves

### Sources of Sentiment Data

| Source | Latency | Signal Quality | Cost |
|--------|---------|----------------|------|
| News Articles | Minutes | High | High |
| Social Media | Seconds | Variable | Low |
| Earnings Calls | Hours | Very High | Medium |
| SEC Filings | Days | High | Low |
| Analyst Reports | Hours | Very High | High |

In [None]:
# Core imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# NLP imports
import re
import string
from collections import Counter

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print("Core libraries loaded successfully!")

In [None]:
# Install required packages (run once)
# !pip install vaderSentiment nltk transformers torch yfinance

In [None]:
# NLP-specific imports
try:
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    print("✓ VADER loaded")
except ImportError:
    print("✗ Install: pip install vaderSentiment")

try:
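    import nltk
    nltk.download('punkt', quiet=True)
    # Newer NLTK releases load the tokenizer from 'punkt_tab' instead of 'punkt'
    nltk.download('punkt_tab', quiet=True)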
    nltk.download('stopwords', quiet=True)
    nltk.download('wordnet', quiet=True)
    from nltk.tokenize import word_tokenize, sent_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    print("✓ NLTK loaded")
except ImportError:
    print("✗ Install: pip install nltk")

try:
    import yfinance as yf
    print("✓ yfinance loaded")
except ImportError:
    print("✗ Install: pip install yfinance")

---

## 2. Text Preprocessing for Financial Data <a id='2-preprocessing'></a>

### Key Preprocessing Steps

1. **Lowercasing** - Normalize case
2. **Remove noise** - HTML, special characters
3. **Tokenization** - Split into words
4. **Remove stopwords** - Keep meaningful words
5. **Lemmatization** - Reduce to root forms
6. **Handle financial terms** - Preserve important phrases

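Step 6 can be handled by joining known multi-word phrases into single tokens, so that later stopword removal and tokenization cannot split them. A minimal sketch, assuming a small hand-picked phrase list and an underscore-joining convention (both are illustrative choices, and `protect_phrases` is a hypothetical helper rather than part of any library):

```python
import re

# Illustrative phrase list; a real pipeline would load a curated one
FINANCIAL_PHRASES = ["beat expectations", "price target", "market share"]

def protect_phrases(text: str, phrases=FINANCIAL_PHRASES) -> str:
    """Join each known phrase into one underscore-linked token."""
    text = text.lower()
    for phrase in phrases:
        # \b keeps the match on word boundaries; \s+ tolerates extra spaces
        pattern = r"\b" + r"\s+".join(map(re.escape, phrase.split())) + r"\b"
        text = re.sub(pattern, phrase.replace(" ", "_"), text)
    return text

print(protect_phrases("Analysts raised the price target after results beat expectations"))
```

A cleaning step that strips every character outside a fixed whitelist would also strip the underscores, so a helper like this would need to run after cleaning and before tokenization.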
In [None]:
class FinancialTextPreprocessor:
    """
    Text preprocessor optimized for financial text.
    Preserves important financial terms and phrases.
    """
    
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
        
        # Financial terms to preserve (don't remove as stopwords)
        self.financial_terms = {
            'up', 'down', 'high', 'low', 'above', 'below',
            'buy', 'sell', 'long', 'short', 'bull', 'bear',
            'gain', 'loss', 'profit', 'revenue', 'earnings'
        }
        
        # Remove financial terms from stopwords
        self.stop_words -= self.financial_terms
        
        # Important financial bigrams/phrases (listed for reference;
        # tokenize() below does not use them yet)
        self.financial_phrases = [
            'beat expectations', 'miss expectations',
            'guidance raised', 'guidance lowered',
            'stock buyback', 'dividend increase',
            'revenue growth', 'profit margin',
            'market share', 'price target'
        ]
    
    def clean_text(self, text: str) -> str:
        """Basic text cleaning."""
        if not isinstance(text, str):
            return ""
        
        # Lowercase
        text = text.lower()
        
        # Remove HTML tags
        text = re.sub(r'<[^>]+>', '', text)
        
        # Remove URLs
        text = re.sub(r'http\S+|www\S+', '', text)
        
        # Remove special characters but keep financial symbols
        text = re.sub(r'[^a-zA-Z0-9\s\$\%\.\-]', '', text)
        
        # Remove extra whitespace
        text = ' '.join(text.split())
        
        return text
    
    def tokenize(self, text: str) -> list:
        """Tokenize and lemmatize text."""
        text = self.clean_text(text)
        tokens = word_tokenize(text)
        
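        # Remove stopwords and lemmatize; short tokens are dropped unless
        # they are protected financial terms such as 'up'
        processed = [
            self.lemmatizer.lemmatize(token)
            for token in tokens
            if token not in self.stop_words
            and (len(token) > 2 or token in self.financial_terms)
        ]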
        
        return processed
    
    def process(self, text: str) -> str:
        """Full preprocessing pipeline returning string."""
        tokens = self.tokenize(text)
        return ' '.join(tokens)


# Initialize preprocessor
preprocessor = FinancialTextPreprocessor()

# Example
sample_text = """
Apple Inc. (AAPL) reported Q4 earnings that BEAT analyst expectations!
Revenue grew 15% YoY to $89.5B. The company raised guidance for next quarter.
Stock price jumped 5% in after-hours trading. Analysts remain bullish.
"""

print("Original text:")
print(sample_text)
print("\nProcessed text:")
print(preprocessor.process(sample_text))
print("\nTokens:")
print(preprocessor.tokenize(sample_text))

---

## 3. Lexicon-Based Sentiment Analysis <a id='3-lexicon'></a>

### 3.1 VADER (Valence Aware Dictionary and sEntiment Reasoner)

VADER is a rule-based lexicon tuned for social media text. It is a fast baseline for financial headlines, although its general-purpose lexicon misses some finance-specific meanings:
- **Compound score**: -1 (most negative) to +1 (most positive)
- Handles emojis, slang, and intensifiers
- Fast and doesn't require training

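The compound score stays inside (-1, 1) because VADER squashes the summed word valences with a fixed normalization, x / sqrt(x² + α), with α = 15 in the reference implementation. A stdlib-only sketch of that single step (the name `normalize_compound` is ours, and the rule-based adjustments VADER applies before summing are omitted):

```python
import math

def normalize_compound(valence_sum: float, alpha: float = 15.0) -> float:
    """Map an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

for total in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"valence sum {total:+.1f} -> compound {normalize_compound(total):+.3f}")
```

Under this formula a lone word with valence 2.0 already maps to roughly 0.46, so a small classification threshold such as 0.05 is crossed as soon as almost any lexicon word matches.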
In [None]:
class VADERSentimentAnalyzer:
    """
    VADER-based sentiment analyzer for financial text.
    """
    
    def __init__(self):
        self.analyzer = SentimentIntensityAnalyzer()
        
        # Add financial lexicon updates
        financial_lexicon = {
            'bullish': 3.0,
            'bearish': -3.0,
            'outperform': 2.5,
            'underperform': -2.5,
            'upgrade': 2.5,
            'downgrade': -2.5,
            'beat': 2.0,
            'miss': -2.0,
            'exceeds': 2.0,
            'disappoints': -2.0,
            'rally': 2.0,
            'plunge': -2.5,
            'surge': 2.5,
            'crash': -3.0,
            'soar': 2.5,
            'tumble': -2.0,
            'bankruptcy': -3.5,
            'default': -3.0,
            'dividend': 1.5,
            'buyback': 1.5,
            'guidance': 0.5,
            'momentum': 1.0,
            'volatility': -0.5,
            'recession': -2.5,
            'growth': 1.5,
            'profit': 1.5,
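            'loss': -1.5,
            # VADER matches exact lowercase tokens without stemming, so
            # inflected forms need their own entries
            'beats': 2.0,
            'misses': -2.0,
            'surges': 2.5,
            'plunges': -2.5,
            'losses': -1.5,
        }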
        
        self.analyzer.lexicon.update(financial_lexicon)
    
    def analyze(self, text: str) -> dict:
        """Get sentiment scores for text."""
        scores = self.analyzer.polarity_scores(text)
        return scores
    
    def get_compound(self, text: str) -> float:
        """Get compound sentiment score."""
        return self.analyze(text)['compound']
    
    def classify(self, text: str, threshold: float = 0.05) -> str:
        """Classify sentiment as positive, negative, or neutral."""
        compound = self.get_compound(text)
        
        if compound >= threshold:
            return 'positive'
        elif compound <= -threshold:
            return 'negative'
        else:
            return 'neutral'


# Initialize analyzer
vader = VADERSentimentAnalyzer()

# Test on financial headlines
headlines = [
    "Apple beats earnings expectations, stock surges 5%",
    "Tesla misses delivery targets, shares plunge in premarket",
    "Fed holds rates steady, market remains uncertain",
    "Amazon announces $10B stock buyback program",
    "Bank warns of potential recession risks ahead",
    "Tech rally continues as investors remain bullish",
    "Company files for bankruptcy after years of losses"
]

print("VADER Sentiment Analysis Results:")
print("=" * 80)

results = []
for headline in headlines:
    scores = vader.analyze(headline)
    sentiment = vader.classify(headline)
    results.append({
        'headline': headline[:50] + '...' if len(headline) > 50 else headline,
        'compound': scores['compound'],
        'positive': scores['pos'],
        'negative': scores['neg'],
        'neutral': scores['neu'],
        'sentiment': sentiment
    })

vader_df = pd.DataFrame(results)
print(vader_df.to_string(index=False))

### 3.2 Loughran-McDonald Financial Sentiment Dictionary

The **Loughran-McDonald** dictionary is specifically designed for financial text:
- Developed from 10-K filings
- Different word lists: Positive, Negative, Uncertainty, Litigious, etc.
- More accurate for formal financial documents

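Plain word counts treat "not strong" as positive. One common refinement, sketched here under our own assumptions (tiny illustrative word sets, a three-word look-back window, and a hypothetical `count_tone` helper), is to skip a positive word when a negator appears shortly before it:

```python
import re

POSITIVE = {"strong", "profitable", "improve", "growth"}   # illustrative subset
NEGATIVE = {"loss", "decline", "weak", "risk"}             # illustrative subset
NEGATORS = {"no", "not", "none", "neither", "never", "nobody"}

def count_tone(text: str, window: int = 3) -> dict:
    """Count positive/negative words, ignoring negated positive words."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = neg = 0
    for i, word in enumerate(words):
        if word in NEGATIVE:
            neg += 1
        elif word in POSITIVE:
            # Look back up to `window` words for a negator
            if NEGATORS.isdisjoint(words[max(0, i - window):i]):
                pos += 1
    total = len(words)
    net = (pos - neg) / total if total else 0.0
    return {"positive": pos, "negative": neg, "net": net}

print(count_tone("Results were not strong. Next year, growth should improve."))
```

A fixed window is crude: it ignores sentence boundaries, so a negator can also cancel a positive word in the following clause. Treat the window length as a tuning choice.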
In [None]:
class LoughranMcDonaldSentiment:
    """
    Loughran-McDonald-style dictionary sentiment for financial text.
    Uses short illustrative word lists that may differ from the official
    dictionary; load the published word lists for real analysis.
    """
    
    def __init__(self):
        # Simplified Loughran-McDonald word lists
        self.positive_words = {
            'accomplish', 'accomplishment', 'achieve', 'achievement', 'advantage',
            'beneficial', 'benefit', 'best', 'better', 'boost', 'breakthrough',
            'creative', 'delight', 'deliver', 'desirable', 'dream', 'easy',
            'effective', 'efficiency', 'efficient', 'enhance', 'enjoy', 'enthusiasm',
            'excellent', 'exceptional', 'exciting', 'exclusive', 'favorable', 'gain',
            'good', 'great', 'grow', 'growth', 'happy', 'highest', 'ideal',
            'improve', 'improvement', 'incredible', 'innovative', 'leader', 'leadership',
            'opportunity', 'optimal', 'optimistic', 'outperform', 'outstanding', 'perfect',
            'positive', 'profitable', 'profitability', 'progress', 'prosper', 'record',
            'reward', 'rewarding', 'solid', 'strength', 'strengthen', 'strong',
            'succeed', 'success', 'successful', 'superior', 'surpass', 'top', 'upturn',
            'win', 'winner', 'winning'
        }
        
        self.negative_words = {
            'abandon', 'adverse', 'against', 'allegation', 'argue', 'bad', 'bankruptcy',
            'blame', 'breach', 'burden', 'catastrophe', 'challenge', 'claim', 'closure',
            'collapse', 'concern', 'conflict', 'crisis', 'critical', 'damage', 'danger',
            'decline', 'default', 'deficit', 'delay', 'deteriorate', 'difficult',
            'difficulty', 'disappoint', 'disappointing', 'disaster', 'disclose', 'disclosure',
            'doubt', 'downturn', 'drop', 'failure', 'fall', 'fear', 'fraud', 'harm',
            'hurt', 'impair', 'impairment', 'impossible', 'inability', 'inadequate',
            'investigation', 'lawsuit', 'layoff', 'liquidation', 'litigation', 'lose',
            'loss', 'losses', 'negative', 'negligence', 'obstacle', 'penalty', 'plunge',
            'poor', 'problem', 'recall', 'recession', 'restructuring', 'risk', 'risky',
            'scandal', 'setback', 'shortage', 'slowdown', 'slump', 'struggle', 'subprime',
            'terminate', 'threat', 'trouble', 'tumble', 'uncertain', 'uncertainty',
            'underperform', 'unfavorable', 'violation', 'volatile', 'volatility',
            'weak', 'weakness', 'worse', 'worsen', 'worst', 'writedown', 'writeoff'
        }
        
        self.uncertainty_words = {
            'almost', 'anticipate', 'apparent', 'appear', 'approximate', 'assume',
            'believe', 'conditional', 'confuse', 'contingency', 'contingent', 'could',
            'depend', 'doubt', 'estimate', 'expect', 'forecast', 'hope', 'if',
            'indefinite', 'indicate', 'likelihood', 'likely', 'may', 'maybe', 'might',
            'pending', 'perhaps', 'possibility', 'possible', 'possibly', 'potential',
            'predict', 'probable', 'probably', 'project', 'risk', 'roughly', 'seem',
            'sometimes', 'suggest', 'suppose', 'uncertain', 'uncertainty', 'unclear',
            'unknown', 'unlikely', 'unpredictable', 'unsure', 'variable', 'volatility'
        }
    
    def analyze(self, text: str) -> dict:
        """Analyze text using Loughran-McDonald dictionary."""
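        # Extract alphabetic tokens so trailing punctuation ("success.")
        # does not block a match; inflected forms such as "risks" still
        # need their own entries or a lemmatizer
        words = re.findall(r"[a-z]+", text.lower())
        total_words = len(words)
        
        if total_words == 0:
            return {'positive': 0.0, 'negative': 0.0, 'uncertainty': 0.0,
                    'sentiment': 0.0, 'pos_words': 0, 'neg_words': 0}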
        
        pos_count = sum(1 for w in words if w in self.positive_words)
        neg_count = sum(1 for w in words if w in self.negative_words)
        unc_count = sum(1 for w in words if w in self.uncertainty_words)
        
        # Normalize by total words
        pos_score = pos_count / total_words
        neg_score = neg_count / total_words
        unc_score = unc_count / total_words
        
        # Net sentiment
        sentiment = (pos_count - neg_count) / total_words
        
        return {
            'positive': pos_score,
            'negative': neg_score,
            'uncertainty': unc_score,
            'sentiment': sentiment,
            'pos_words': pos_count,
            'neg_words': neg_count
        }


# Test Loughran-McDonald
lm = LoughranMcDonaldSentiment()

# Example: 10-K excerpt
text_10k = """
Our business faces significant risks and uncertainties that could materially affect 
our future financial results. Competition continues to increase, and we may face 
challenges in maintaining our market position. However, we believe our strong 
leadership and innovative products position us well for future growth and success.
"""

lm_scores = lm.analyze(text_10k)
print("Loughran-McDonald Analysis:")
print(f"  Positive score:    {lm_scores['positive']:.4f} ({lm_scores['pos_words']} words)")
print(f"  Negative score:    {lm_scores['negative']:.4f} ({lm_scores['neg_words']} words)")
print(f"  Uncertainty score: {lm_scores['uncertainty']:.4f}")
print(f"  Net sentiment:     {lm_scores['sentiment']:.4f}")

---

## 4. ML-Based Sentiment (FinBERT) <a id='4-finbert'></a>

### FinBERT: BERT Pre-trained on Financial Text

FinBERT is a BERT model further pre-trained on a financial corpus and then fine-tuned for sentiment classification:
- Better context understanding than lexicon methods
- Handles negation and complex sentences
- Three classes: positive, negative, neutral

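A three-class classifier emits one logit per class. Softmax turns the logits into probabilities, and a single signed score can be taken as P(positive) minus P(negative). A stdlib sketch with made-up logits (the label order is an assumption for this illustration, and `logits_to_sentiment` is a hypothetical helper):

```python
import math

LABELS = ("positive", "negative", "neutral")  # assumed order for this sketch

def logits_to_sentiment(logits):
    """Convert raw class logits to probabilities and a net score."""
    shift = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = dict(zip(LABELS, (e / total for e in exps)))
    label = max(probs, key=probs.get)
    return {"label": label, **probs, "compound": probs["positive"] - probs["negative"]}

result = logits_to_sentiment([2.1, -0.7, 0.3])   # made-up logits
print(result["label"], round(result["compound"], 3))
```

The net score ignores the neutral probability, so a confident neutral prediction and an even positive/negative split both land near zero.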
In [None]:
# Check if transformers is available
try:
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch
    TRANSFORMERS_AVAILABLE = True
    print("✓ Transformers library available")
except ImportError:
    TRANSFORMERS_AVAILABLE = False
    print("✗ Transformers not available. Install with: pip install transformers torch")

In [None]:
class FinBERTSentimentAnalyzer:
    """
    FinBERT-based sentiment analyzer.
    Uses the ProsusAI/finbert model from Hugging Face.
    """
    
    def __init__(self, model_name: str = "ProsusAI/finbert"):
        if not TRANSFORMERS_AVAILABLE:
            raise ImportError("transformers library required")
        
        print(f"Loading {model_name}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()
        
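        # Label order used by ProsusAI/finbert; for any other checkpoint,
        # check self.model.config.id2label before relying on these indices
        self.labels = ['positive', 'negative', 'neutral']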
        print("Model loaded successfully!")
    
    def analyze(self, text: str) -> dict:
        """Analyze sentiment of a single text."""
        # Tokenize
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=512
        )
        
        # Get predictions
        with torch.no_grad():
            outputs = self.model(**inputs)
            probabilities = torch.softmax(outputs.logits, dim=1)
        
        probs = probabilities[0].tolist()
        
        return {
            'label': self.labels[np.argmax(probs)],
            'positive': probs[0],
            'negative': probs[1],
            'neutral': probs[2],
            'compound': probs[0] - probs[1]  # Net sentiment
        }
    
    def analyze_batch(self, texts: list, batch_size: int = 8) -> list:
        """Analyze sentiment of multiple texts."""
        results = []
        
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            
            inputs = self.tokenizer(
                batch,
                return_tensors="pt",
                truncation=True,
                padding=True,
                max_length=512
            )
            
            with torch.no_grad():
                outputs = self.model(**inputs)
                probabilities = torch.softmax(outputs.logits, dim=1)
            
            for j, probs in enumerate(probabilities.tolist()):
                results.append({
                    'text': batch[j][:50] + '...' if len(batch[j]) > 50 else batch[j],
                    'label': self.labels[np.argmax(probs)],
                    'positive': probs[0],
                    'negative': probs[1],
                    'neutral': probs[2],
                    'compound': probs[0] - probs[1]
                })
        
        return results


# Note: Only run if transformers is available and you have enough memory
if TRANSFORMERS_AVAILABLE:
    print("\nFinBERT is available. Uncomment below to load and test.")
    print("Note: First run will download ~440MB model.")

# Uncomment to test:
# finbert = FinBERTSentimentAnalyzer()
# result = finbert.analyze("Apple stock surges after beating earnings expectations")
# print(result)

In [None]:
# Simulated FinBERT results for demonstration
# (Use actual FinBERT in production)

def simulate_finbert_results(headlines: list) -> pd.DataFrame:
    """
    Simulate FinBERT-like results for demonstration.
    In practice, use actual FinBERT model.
    """
    np.random.seed(42)
    
    # Use VADER as proxy and add noise
    vader = VADERSentimentAnalyzer()
    
    results = []
    for headline in headlines:
        compound = vader.get_compound(headline)
        
        # Convert to probabilities (simulated)
        if compound > 0.2:
            pos = 0.6 + np.random.uniform(0, 0.3)
            neg = np.random.uniform(0, 0.15)
            neu = 1 - pos - neg
            label = 'positive'
        elif compound < -0.2:
            neg = 0.6 + np.random.uniform(0, 0.3)
            pos = np.random.uniform(0, 0.15)
            neu = 1 - pos - neg
            label = 'negative'
        else:
            # Draw pos and neg first so that neu = 1 - pos - neg stays positive
            pos = np.random.uniform(0.1, 0.25)
            neg = np.random.uniform(0.1, 0.25)
            neu = 1 - pos - neg
            label = 'neutral'
        
        results.append({
            'headline': headline[:50] + '...' if len(headline) > 50 else headline,
            'label': label,
            'positive': round(pos, 3),
            'negative': round(neg, 3),
            'neutral': round(neu, 3),
            'compound': round(pos - neg, 3)
        })
    
    return pd.DataFrame(results)


# Compare VADER vs "FinBERT" (simulated)
print("Simulated FinBERT Results (for demonstration):")
print("=" * 80)
finbert_df = simulate_finbert_results(headlines)
print(finbert_df.to_string(index=False))

---

## 5. Aggregating Sentiment Signals <a id='5-aggregation'></a>

### Methods for Aggregating Multiple Sentiment Scores

When dealing with multiple news articles or sources:

1. **Simple Average**: Mean of all sentiment scores
2. **Time-Weighted**: Recent articles weighted more heavily
3. **Volume-Weighted**: Weight by article importance/reach
4. **Exponential Moving Average**: Smooth signal over time

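Two of these methods reduce to one-line formulas. With a half-life h, an article that is a hours old gets weight exp(-ln 2 · a / h), and an EMA with span s uses the recursion y[t] = α·x[t] + (1 − α)·y[t−1] with α = 2 / (s + 1). A stdlib sketch of both (function names are ours; the EMA is seeded with the first observation, which is one of several possible conventions):

```python
import math

def decay_weight(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Weight of an article that is `age_hours` old; halves every half-life."""
    return math.exp(-math.log(2) * age_hours / half_life_hours)

def ema(values, span: int = 5):
    """Recursive EMA: y[t] = alpha * x[t] + (1 - alpha) * y[t-1]."""
    alpha = 2.0 / (span + 1)
    smoothed = []
    for x in values:
        prev = smoothed[-1] if smoothed else x   # seed with the first observation
        smoothed.append(alpha * x + (1 - alpha) * prev)
    return smoothed

print([round(decay_weight(h), 3) for h in (0, 12, 24, 48)])
print([round(v, 3) for v in ema([0.2, -0.1, 0.4, 0.0], span=3)])
```

A shorter half-life or a smaller span reacts faster to new articles at the cost of a noisier signal.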
In [None]:
class SentimentAggregator:
    """
    Aggregates multiple sentiment scores into a single signal.
    """
    
    @staticmethod
    def simple_average(sentiments: list) -> float:
        """Simple mean of sentiment scores."""
        if not sentiments:
            return 0.0
        return np.mean(sentiments)
    
    @staticmethod
    def time_weighted_average(
        sentiments: list,
        timestamps: list,
        half_life_hours: float = 24
    ) -> float:
        """
        Time-weighted average with exponential decay.
        More recent articles have higher weight.
        """
        if not sentiments:
            return 0.0
        
        now = max(timestamps)
        decay_rate = np.log(2) / half_life_hours
        
        weights = []
        for ts in timestamps:
            hours_ago = (now - ts).total_seconds() / 3600
            weight = np.exp(-decay_rate * hours_ago)
            weights.append(weight)
        
        weights = np.array(weights)
        weights /= weights.sum()  # Normalize
        
        return np.dot(sentiments, weights)
    
    @staticmethod
    def volume_weighted_average(
        sentiments: list,
        volumes: list
    ) -> float:
        """
        Volume-weighted average (e.g., by article views or source importance).
        """
        if not sentiments:
            return 0.0
        
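        volumes = np.array(volumes, dtype=float)
        total = volumes.sum()
        if total == 0:
            # No volume information: fall back to an unweighted mean
            return float(np.mean(sentiments))
        weights = volumes / total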
        
        return np.dot(sentiments, weights)
    
    @staticmethod
    def ema(sentiments: pd.Series, span: int = 5) -> pd.Series:
        """Exponential moving average of sentiment."""
        return sentiments.ewm(span=span, adjust=False).mean()


# Generate sample data
np.random.seed(42)
n_articles = 20

# Simulate news articles over past 48 hours
base_time = datetime(2025, 1, 23, 9, 30)
news_data = pd.DataFrame({
    'timestamp': [base_time - timedelta(hours=np.random.uniform(0, 48)) for _ in range(n_articles)],
    'headline': [f"News article {i+1} about AAPL" for i in range(n_articles)],
    'sentiment': np.random.uniform(-0.5, 0.5, n_articles),
    'views': np.random.randint(1000, 100000, n_articles)  # Article views
})

news_data = news_data.sort_values('timestamp').reset_index(drop=True)

# Calculate aggregated sentiment
aggregator = SentimentAggregator()

simple_avg = aggregator.simple_average(news_data['sentiment'].tolist())
time_weighted = aggregator.time_weighted_average(
    news_data['sentiment'].tolist(),
    news_data['timestamp'].tolist(),
    half_life_hours=12
)
volume_weighted = aggregator.volume_weighted_average(
    news_data['sentiment'].tolist(),
    news_data['views'].tolist()
)

print("Sentiment Aggregation Methods:")
print(f"  Simple Average:      {simple_avg:.4f}")
print(f"  Time-Weighted (12h): {time_weighted:.4f}")
print(f"  Volume-Weighted:     {volume_weighted:.4f}")

In [None]:
# Visualize sentiment over time with EMA
news_data = news_data.set_index('timestamp').sort_index()
news_data['sentiment_ema'] = aggregator.ema(news_data['sentiment'], span=5)

fig, ax = plt.subplots(figsize=(12, 5))

ax.scatter(news_data.index, news_data['sentiment'], 
           alpha=0.6, s=news_data['views']/2000, label='Individual Articles')
ax.plot(news_data.index, news_data['sentiment_ema'], 
        color='red', linewidth=2, label='EMA (span=5)')
ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax.axhline(y=simple_avg, color='green', linestyle=':', label=f'Simple Avg: {simple_avg:.3f}')

ax.set_xlabel('Timestamp')
ax.set_ylabel('Sentiment Score')
ax.set_title('News Sentiment Over Time (Size = Article Views)')
ax.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

---

## 6. Building Trading Signals from Sentiment <a id='6-signals'></a>

### Signal Generation Approaches

1. **Threshold-Based**: Go long if sentiment > threshold
2. **Z-Score**: Trade when sentiment deviates from historical mean
3. **Sentiment Momentum**: Trade sentiment changes
4. **Cross-Sectional**: Rank stocks by sentiment, long/short extremes

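The cross-sectional approach compares stocks with each other on the same date instead of comparing one stock with its own history. A minimal sketch, assuming made-up tickers and scores and a hypothetical `cross_sectional_signal` helper: rank the universe by sentiment, go long the top fraction, and short the bottom fraction.

```python
def cross_sectional_signal(scores: dict, quantile: float = 0.2) -> dict:
    """Long the top `quantile` of tickers by sentiment, short the bottom."""
    ranked = sorted(scores, key=scores.get)           # ascending by sentiment
    # At least one name per side, but never more than half the universe
    n_side = min(len(ranked) // 2, max(1, int(len(ranked) * quantile)))
    signal = {ticker: 0 for ticker in ranked}
    for ticker in ranked[:n_side]:
        signal[ticker] = -1                           # lowest sentiment: short
    for ticker in ranked[len(ranked) - n_side:]:
        signal[ticker] = 1                            # highest sentiment: long
    return signal

# Made-up sentiment scores for one day
scores = {"AAA": 0.42, "BBB": -0.31, "CCC": 0.05, "DDD": 0.18, "EEE": -0.07}
print(cross_sectional_signal(scores, quantile=0.2))
```

Because the long and short legs hold the same number of names, the signals sum to zero, which removes most of the exposure to a market-wide shift in sentiment.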
In [None]:
class SentimentSignalGenerator:
    """
    Generates trading signals from sentiment data.
    """
    
    def __init__(self, lookback: int = 20):
        self.lookback = lookback
    
    def threshold_signal(
        self,
        sentiment: pd.Series,
        long_threshold: float = 0.1,
        short_threshold: float = -0.1
    ) -> pd.Series:
        """
        Simple threshold-based signal.
        Returns: 1 (long), -1 (short), 0 (neutral)
        """
        signal = pd.Series(0, index=sentiment.index)
        signal[sentiment > long_threshold] = 1
        signal[sentiment < short_threshold] = -1
        return signal
    
    def zscore_signal(
        self,
        sentiment: pd.Series,
        zscore_threshold: float = 1.0
    ) -> pd.Series:
        """
        Z-score based signal using rolling statistics.
        Trade when sentiment deviates significantly from recent mean.
        """
        rolling_mean = sentiment.rolling(self.lookback).mean()
        rolling_std = sentiment.rolling(self.lookback).std()
        
        zscore = (sentiment - rolling_mean) / rolling_std
        
        signal = pd.Series(0, index=sentiment.index)
        signal[zscore > zscore_threshold] = 1
        signal[zscore < -zscore_threshold] = -1
        
        return signal
    
    def momentum_signal(
        self,
        sentiment: pd.Series,
        change_threshold: float = 0.05
    ) -> pd.Series:
        """
        Sentiment momentum signal.
        Trade based on change in sentiment.
        """
        sentiment_change = sentiment.diff()
        
        signal = pd.Series(0, index=sentiment.index)
        signal[sentiment_change > change_threshold] = 1
        signal[sentiment_change < -change_threshold] = -1
        
        return signal
    
    def composite_signal(
        self,
        sentiment: pd.Series,
        price: pd.Series = None
    ) -> pd.Series:
        """
        Composite signal combining multiple methods.
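        Returns the equal-weighted average of the three discrete signals,
        so values lie in [-1, 1] in steps of 1/3.
        The `price` argument is currently unused.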
        """
        # Individual signals
        threshold_sig = self.threshold_signal(sentiment)
        zscore_sig = self.zscore_signal(sentiment)
        momentum_sig = self.momentum_signal(sentiment)
        
        # Combine with equal weights
        composite = (threshold_sig + zscore_sig + momentum_sig) / 3
        
        return composite


# Generate synthetic daily sentiment data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=250, freq='B')

# Simulate sentiment with mean-reversion and trends
sentiment_noise = np.random.randn(250) * 0.1
sentiment_trend = np.sin(np.linspace(0, 4*np.pi, 250)) * 0.2
sentiment_series = pd.Series(
    sentiment_trend + sentiment_noise,
    index=dates,
    name='sentiment'
)

# Generate signals
signal_gen = SentimentSignalGenerator(lookback=20)

signals_df = pd.DataFrame({
    'sentiment': sentiment_series,
    'threshold_signal': signal_gen.threshold_signal(sentiment_series),
    'zscore_signal': signal_gen.zscore_signal(sentiment_series),
    'momentum_signal': signal_gen.momentum_signal(sentiment_series),
    'composite_signal': signal_gen.composite_signal(sentiment_series)
})

print("Signal Statistics:")
print(signals_df.describe().round(3))

In [None]:
# Visualize signals
fig, axes = plt.subplots(4, 1, figsize=(14, 10), sharex=True)

# Sentiment
axes[0].plot(signals_df['sentiment'], color='blue', linewidth=1)
axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[0].axhline(y=0.1, color='green', linestyle=':', alpha=0.7)
axes[0].axhline(y=-0.1, color='red', linestyle=':', alpha=0.7)
axes[0].set_ylabel('Sentiment')
axes[0].set_title('Raw Sentiment Score')
axes[0].fill_between(signals_df.index, 0, signals_df['sentiment'], 
                     where=signals_df['sentiment'] > 0, color='green', alpha=0.3)
axes[0].fill_between(signals_df.index, 0, signals_df['sentiment'], 
                     where=signals_df['sentiment'] < 0, color='red', alpha=0.3)

# Threshold signal
axes[1].plot(signals_df['threshold_signal'], color='purple', linewidth=1, drawstyle='steps-post')
axes[1].set_ylabel('Signal')
axes[1].set_title('Threshold-Based Signal')
axes[1].set_ylim(-1.5, 1.5)

# Z-score signal
axes[2].plot(signals_df['zscore_signal'], color='orange', linewidth=1, drawstyle='steps-post')
axes[2].set_ylabel('Signal')
axes[2].set_title('Z-Score Signal')
axes[2].set_ylim(-1.5, 1.5)

# Composite signal
axes[3].plot(signals_df['composite_signal'], color='black', linewidth=1.5)
axes[3].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[3].set_ylabel('Signal')
axes[3].set_xlabel('Date')
axes[3].set_title('Composite Signal (Continuous)')
axes[3].fill_between(signals_df.index, 0, signals_df['composite_signal'], 
                     where=signals_df['composite_signal'] > 0, color='green', alpha=0.3)
axes[3].fill_between(signals_df.index, 0, signals_df['composite_signal'], 
                     where=signals_df['composite_signal'] < 0, color='red', alpha=0.3)

plt.tight_layout()
plt.show()

---

## 7. Backtesting Sentiment Strategies <a id='7-backtesting'></a>

### Simple Backtest Framework

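Test sentiment signals against price data. The backtester below trades each signal with a one-day lag to avoid look-ahead bias and deducts a proportional transaction cost on every position change.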

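Two mechanics matter most in a simple backtest: the position held on day t must come from the signal known on day t-1, and every position change pays a cost. A stdlib sketch with made-up numbers (`lagged_strategy_returns` is a hypothetical helper, and the cost is assumed proportional to the size of the position change):

```python
def lagged_strategy_returns(signals, asset_returns, cost_per_unit=0.001):
    """Net daily returns when the signal from day t-1 is held on day t."""
    net = []
    prev_position = 0
    for t in range(len(asset_returns)):
        position = signals[t - 1] if t > 0 else 0    # trade on yesterday's signal
        turnover = abs(position - prev_position)     # 2 when flipping long to short
        net.append(position * asset_returns[t] - turnover * cost_per_unit)
        prev_position = position
    return net

signals = [1, 1, -1, 0]                     # made-up signal known at each close
asset_returns = [0.00, 0.01, -0.02, 0.03]   # made-up daily returns
print([round(r, 4) for r in lagged_strategy_returns(signals, asset_returns)])
```

Dropping the lag, that is, multiplying each day's return by the same day's signal, is the most common way a backtest ends up looking better than anything that could have been traded.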
In [None]:
class SentimentBacktester:
    """
    Simple backtester for sentiment-based strategies.
    """
    
    def __init__(self, prices: pd.Series, sentiment: pd.Series):
        # Align data
        self.data = pd.DataFrame({
            'price': prices,
            'sentiment': sentiment
        }).dropna()
        
        self.data['returns'] = self.data['price'].pct_change()
    
    def run_backtest(
        self,
        signal: pd.Series,
        transaction_cost: float = 0.001
    ) -> pd.DataFrame:
        """
        Run backtest with given signal.
        Signal should be -1, 0, or 1.
        """
        # Align signal with data
        bt = self.data.copy()
        bt['signal'] = signal.reindex(bt.index).fillna(0)
        
        # Shift signal by 1 to avoid look-ahead bias
        bt['position'] = bt['signal'].shift(1).fillna(0)
        
        # Calculate strategy returns
        bt['strategy_returns'] = bt['position'] * bt['returns']
        
        # Transaction costs
        bt['trades'] = bt['position'].diff().abs()
        bt['tc'] = bt['trades'] * transaction_cost
        bt['strategy_returns_net'] = bt['strategy_returns'] - bt['tc']
        
        # Cumulative returns
        bt['cum_returns'] = (1 + bt['returns']).cumprod() - 1
        bt['cum_strategy'] = (1 + bt['strategy_returns_net']).cumprod() - 1
        
        return bt
    
    def calculate_metrics(self, bt: pd.DataFrame) -> dict:
        """Calculate performance metrics."""
        returns = bt['strategy_returns_net'].dropna()
        
        total_return = (1 + returns).prod() - 1
        annual_return = (1 + total_return) ** (252 / len(returns)) - 1
        volatility = returns.std() * np.sqrt(252)
        # Sharpe ratio with a zero risk-free rate, based on the compounded annual return
        sharpe = annual_return / volatility if volatility > 0 else 0
        
        # Max drawdown
        cum_returns = (1 + returns).cumprod()
        rolling_max = cum_returns.expanding().max()
        drawdown = (cum_returns - rolling_max) / rolling_max
        max_drawdown = drawdown.min()
        
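        # Win rate, measured per day: share of days with a non-zero net
        # return that were positive (not a per-trade statistic)
        winning_days = (returns > 0).sum()
        active_days = (returns != 0).sum()
        win_rate = winning_days / active_days if active_days > 0 else 0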
        
        return {
            'Total Return': f"{total_return:.2%}",
            'Annual Return': f"{annual_return:.2%}",
            'Volatility': f"{volatility:.2%}",
            'Sharpe Ratio': f"{sharpe:.2f}",
            'Max Drawdown': f"{max_drawdown:.2%}",
            'Win Rate': f"{win_rate:.2%}",
            # Position changes summed and halved: an approximate round-trip count
            'Total Trades': int(bt['trades'].sum() / 2)
        }

In [None]:
# Download real price data
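try:
    ticker = "SPY"
    prices = yf.download(ticker, start='2024-01-01', end='2025-01-01', progress=False)['Close']
    prices = prices.squeeze()
    # yfinance may return an empty result instead of raising on a failed download,
    # so check explicitly to make sure the synthetic fallback is used
    if len(prices) == 0:
        raise ValueError(f"no rows returned for {ticker}")
    print(f"Downloaded {len(prices)} days of {ticker} data")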
except Exception as e:
    print(f"Error downloading data: {e}")
    # Create synthetic prices
    np.random.seed(42)
    dates = pd.date_range('2024-01-01', periods=250, freq='B')
    returns = np.random.randn(250) * 0.01 + 0.0003
    prices = pd.Series(100 * np.exp(np.cumsum(returns)), index=dates, name='Close')
    print("Using synthetic price data")

In [None]:
# Generate synthetic sentiment that has some predictive power
np.random.seed(42)

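# Make sentiment partially predictive of returns.
# NOTE: this deliberately builds next-day returns into the sentiment series,
# so the backtest results below are illustrative, not evidence of real alpha.
price_returns = prices.pct_change().shift(-1)  # Next day returns
noise = pd.Series(np.random.randn(len(prices)) * 0.15, index=prices.index)

# Sentiment = signal + noise (signal amplitude 0.3 vs. noise std 0.15,
# which is much stronger than one would expect from real sentiment data)
synthetic_sentiment = 0.3 * np.sign(price_returns.fillna(0)) + noise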
synthetic_sentiment = synthetic_sentiment.clip(-1, 1)
synthetic_sentiment.name = 'sentiment'

# Initialize backtester
backtester = SentimentBacktester(prices, synthetic_sentiment)

# Generate signal
signal_generator = SentimentSignalGenerator(lookback=20)
trading_signal = signal_generator.threshold_signal(
    synthetic_sentiment,
    long_threshold=0.1,
    short_threshold=-0.1
)

# Run backtest
bt_results = backtester.run_backtest(trading_signal, transaction_cost=0.001)

# Calculate metrics
metrics = backtester.calculate_metrics(bt_results)

print("\nBacktest Results (Threshold Signal):")
print("=" * 40)
for metric, value in metrics.items():
    print(f"  {metric}: {value}")

In [None]:
# Compare different signal methods
signal_methods = {
    'Threshold': signal_generator.threshold_signal(synthetic_sentiment),
    'Z-Score': signal_generator.zscore_signal(synthetic_sentiment),
    'Momentum': signal_generator.momentum_signal(synthetic_sentiment),
    'Composite': np.sign(signal_generator.composite_signal(synthetic_sentiment))
}

all_metrics = {}
all_bt = {}

for name, signal in signal_methods.items():
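    # Pass the cost explicitly so every method is charged the same as the run above
    bt = backtester.run_backtest(signal, transaction_cost=0.001)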
    all_bt[name] = bt
    all_metrics[name] = backtester.calculate_metrics(bt)

# Create comparison DataFrame
comparison_df = pd.DataFrame(all_metrics).T
print("\nSignal Method Comparison:")
print("=" * 80)
print(comparison_df.to_string())

In [None]:
# Plot cumulative returns
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Cumulative returns comparison
ax1 = axes[0]
ax1.plot(bt_results['cum_returns'] * 100, label='Buy & Hold', color='gray', linewidth=2)

colors = ['blue', 'green', 'orange', 'purple']
for (name, bt), color in zip(all_bt.items(), colors):
    ax1.plot(bt['cum_strategy'] * 100, label=name, color=color, linewidth=1.5)

ax1.set_ylabel('Cumulative Return (%)')
ax1.set_title('Sentiment Strategy Performance Comparison')
ax1.legend(loc='upper left')
ax1.grid(True, alpha=0.3)

# Sentiment and signal
ax2 = axes[1]
ax2.plot(synthetic_sentiment, color='blue', alpha=0.6, label='Sentiment')
ax2.fill_between(bt_results.index, 0, bt_results['position'], 
                  where=bt_results['position'] > 0, color='green', alpha=0.3, label='Long')
ax2.fill_between(bt_results.index, 0, bt_results['position'], 
                  where=bt_results['position'] < 0, color='red', alpha=0.3, label='Short')
ax2.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax2.set_ylabel('Sentiment / Position')
ax2.set_xlabel('Date')
ax2.set_title('Sentiment Score and Trading Position (Threshold Signal)')
ax2.legend(loc='upper right')

plt.tight_layout()
plt.show()

---

## 8. Interview Questions <a id='8-interview'></a>

### Conceptual Questions

**Q1: What are the key differences between lexicon-based and ML-based sentiment analysis?**

<details>
<summary>Answer</summary>

| Aspect | Lexicon-Based | ML-Based |
|--------|---------------|----------|
| Speed | Very fast | Slower (inference) |
| Interpretability | High (word counts) | Lower (black box) |
| Context | Limited (word-level) | Rich (sentence-level) |
| Negation handling | Limited (rule-based at best) | Better (learned from context) |
| Domain adaptation | Requires custom lexicons | Requires fine-tuning |
| Training data | None needed | Requires labeled data |

</details>
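
To make the negation row concrete, here is a minimal sketch of why plain word counting struggles with negation and how a simple rule helps. The mini-lexicon, the negator list, and both function names are illustrative assumptions; they are not taken from VADER or Loughran-McDonald.

```python
# Toy lexicon: words and weights are illustrative assumptions,
# not entries from VADER or Loughran-McDonald.
TOY_LEXICON = {"profit": 1, "growth": 1, "loss": -1, "decline": -1}
NEGATORS = {"no", "not", "never", "without"}


def naive_score(text: str) -> int:
    """Sum word polarities with no regard for context."""
    return sum(TOY_LEXICON.get(tok, 0) for tok in text.lower().split())


def negation_aware_score(text: str) -> int:
    """Flip the polarity of a word that directly follows a negator."""
    tokens = text.lower().split()
    score = 0
    for i, tok in enumerate(tokens):
        polarity = TOY_LEXICON.get(tok, 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score


headline = "company reports no growth"
print("naive:", naive_score(headline))
print("negation-aware:", negation_aware_score(headline))
```

The naive scorer reads "no growth" as positive because it only sees "growth". The one-token rule fixes this case but would still miss longer-range negation ("growth did not materialize"), which is where sentence-level models tend to do better.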

**Q2: How would you handle look-ahead bias in sentiment-based trading signals?**

<details>
<summary>Answer</summary>

1. **Timestamp alignment**: Use only sentiment available before the trading decision
2. **Signal lag**: Shift signal by 1 period before calculating returns
3. **Point-in-time data**: Use data as it was known at that moment (no revisions)
4. **News timing**: Account for news publication time vs. market hours
5. **Embargo periods**: Consider when earnings calls become public

</details>
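
Points 1 and 4 can be sketched as a timestamp-to-trading-day mapping. This is a minimal illustration under stated assumptions: timestamps are already in exchange-local time, the close is at 16:00, and trading days are plain business days (exchange holidays are ignored). The function name `assign_trading_day` and the sample data are hypothetical.

```python
import pandas as pd
from pandas.tseries.offsets import BDay


def assign_trading_day(timestamps: pd.Series, close_hour: int = 16) -> pd.Series:
    """Map each publication time to the first trading day whose close comes after it."""
    ts = pd.to_datetime(timestamps)
    day = ts.dt.normalize()
    # Articles published at or after the close belong to the next calendar day
    after_close = ts.dt.hour >= close_hour
    day = day + pd.to_timedelta(after_close.astype(int), unit="D")
    # Roll weekend dates forward to the next business day (holidays not handled)
    return pd.to_datetime(day.map(BDay().rollforward))


news = pd.DataFrame({
    "published": [
        "2024-01-05 10:30",  # Friday, during market hours
        "2024-01-05 17:45",  # Friday, after the close
        "2024-01-06 09:00",  # Saturday
        "2024-01-08 15:59",  # Monday, just before the close
    ],
    "sentiment": [0.5, -0.2, 0.4, 0.1],
})
news["trading_day"] = assign_trading_day(news["published"])

# Daily sentiment uses only articles that were public by that day's close
daily = news.groupby("trading_day")["sentiment"].mean()
print(daily)
```

A daily series built this way is known at each day's close, so shifting the resulting signal by one period (point 2) means positions are only taken on the following trading day.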

**Q3: How does sentiment decay affect trading signals?**

<details>
<summary>Answer</summary>

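- **Information half-life**: The price impact of a news item fades as the market absorbs it
- **Signal dilution**: Averaging stale articles with fresh ones weakens the current signal
- **Regime changes**: The sentiment-price relationship, and therefore the decay rate, shifts with market conditions
- **Solution**: Use time-weighted aggregation, e.g. exponential decay with a chosen half-life
- **Typical decay**: Roughly 12-48 hours for news and longer for earnings, as a rule of thumb; estimate it empirically per source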

</details>
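
The exponential-decay idea can be sketched as a weighted average in which each article's weight halves every `half_life_hours`. The function name and the 24-hour default are assumptions chosen for illustration (24 hours sits inside the 12-48 hour range mentioned above); in practice the half-life would be fitted to data.

```python
import numpy as np
import pandas as pd


def decayed_sentiment(scores, timestamps, as_of, half_life_hours: float = 24.0) -> float:
    """Exponentially time-weighted average of article sentiment as of a given time."""
    ts = pd.to_datetime(pd.Series(timestamps))
    s = np.asarray(scores, dtype=float)
    age_hours = (pd.Timestamp(as_of) - ts).dt.total_seconds().to_numpy() / 3600.0

    # Only articles published at or before `as_of` may be used (no look-ahead)
    known = age_hours >= 0
    if not known.any():
        return float("nan")

    # Weight halves for every `half_life_hours` of age
    weights = 0.5 ** (age_hours[known] / half_life_hours)
    return float(np.sum(weights * s[known]) / np.sum(weights))


value = decayed_sentiment(
    scores=[1.0, -1.0, 1.0],
    timestamps=["2024-03-01 16:00", "2024-02-29 16:00", "2024-03-02 09:00"],
    as_of="2024-03-01 16:00",
)
print(round(value, 4))
```

In this example the fresh positive article gets weight 1.0, the day-old negative article gets weight 0.5, and the article published after `as_of` is ignored, so the aggregate stays positive at one third.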

In [None]:
# Interview Coding Challenge: Sentiment-Price Correlation Analysis

def analyze_sentiment_price_relationship(
    sentiment: pd.Series,
    prices: pd.Series,
    lags: tuple = (0, 1, 2, 3, 5)  # tuple avoids a mutable default argument
) -> pd.DataFrame:
    """
    Analyze lead-lag relationship between sentiment and returns.
    
    Parameters:
    -----------
    sentiment : pd.Series
        Daily sentiment scores
    prices : pd.Series
        Daily prices
    lags : list or tuple of int
        Number of days to lag returns (positive = future returns)
    
    Returns:
    --------
    DataFrame with correlation analysis
    """
    returns = prices.pct_change()
    
    results = []
    for lag in lags:
        # shift(-lag) moves the return at t+lag back to t, so a positive lag pairs
        # today's sentiment with future returns and a negative lag with past returns
        lagged_returns = returns.shift(-lag)
        description = f"Returns t{lag:+d}"
        
        # Align and calculate correlation
        aligned = pd.concat([sentiment, lagged_returns], axis=1).dropna()
        aligned.columns = ['sentiment', 'returns']
        
        corr = aligned['sentiment'].corr(aligned['returns'])
        
        # Information coefficient (rank correlation)
        ic = aligned['sentiment'].corr(aligned['returns'], method='spearman')
        
        results.append({
            'Lag': lag,
            'Description': description,
            'Pearson Corr': round(corr, 4),
            'Spearman IC': round(ic, 4),
            'N': len(aligned)
        })
    
    return pd.DataFrame(results)


# Run analysis
correlation_analysis = analyze_sentiment_price_relationship(
    synthetic_sentiment,
    prices,
    lags=[0, 1, 2, 3, 5, 10]
)

print("Sentiment-Price Lead-Lag Analysis:")
print("=" * 60)
print(correlation_analysis.to_string(index=False))
print("\nInterpretation:")
print("- Positive correlation at t+1 suggests sentiment predicts next-day returns")
print("- Higher Spearman IC indicates better rank-ordering ability")
print("- Decay in correlation over lags shows information half-life")

### Practical Interview Question

**Q4: Design a sentiment-based trading system for a fund processing 10,000 news articles daily.**

<details>
<summary>Answer</summary>

**Architecture:**
```
News Feed → Pre-processing → Entity Extraction → Sentiment Scoring → Aggregation → Signal Generation → Execution
```

**Key Components:**

1. **Data Ingestion**:
   - Multiple news sources (Reuters, Bloomberg, social media)
   - Streaming pipeline (Kafka/Kinesis)
   - Deduplication and filtering

2. **NLP Pipeline**:
   - Named Entity Recognition (link articles to tickers)
   - FinBERT for sentiment scoring
   - Batch inference for efficiency (10,000 articles/day is about 7 per minute, so GPU clusters matter mainly for bursts and backfills)

3. **Signal Generation**:
   - Time-weighted aggregation per ticker
   - Cross-sectional ranking
   - Confidence scoring based on article volume

4. **Risk Management**:
   - Position limits based on sentiment confidence
   - Correlation with existing positions
   - Drawdown controls

5. **Monitoring**:
   - Real-time sentiment dashboard
   - Model drift detection
   - PnL attribution to sentiment signals

</details>
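
The signal-generation step (per-ticker aggregation, cross-sectional ranking, and volume-based confidence) can be sketched as follows. This is one possible design under stated assumptions: the column names `ticker` and `sentiment`, the function name, and the shrinkage form `n / (n + shrinkage)` are all illustrative choices, not a prescribed method.

```python
import pandas as pd


def cross_sectional_signal(articles: pd.DataFrame, shrinkage: float = 2.0) -> pd.DataFrame:
    """Aggregate article sentiment per ticker and rank tickers against each other."""
    per_ticker = articles.groupby("ticker")["sentiment"].agg(
        mean_sentiment="mean",
        n_articles="count",
    )
    # Confidence grows with article volume: 1 article -> 0.33, 4 articles -> 0.67
    n = per_ticker["n_articles"]
    per_ticker["confidence"] = n / (n + shrinkage)
    # Shrink thinly covered tickers toward neutral before ranking
    per_ticker["score"] = per_ticker["mean_sentiment"] * per_ticker["confidence"]
    per_ticker["rank_pct"] = per_ticker["score"].rank(pct=True)
    return per_ticker.sort_values("score", ascending=False)


articles = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "AAA", "BBB", "CCC", "CCC"],
    "sentiment": [0.8, 0.6, 0.7, 0.9, 0.9, -0.5, -0.3],
})
ranked = cross_sectional_signal(articles)
print(ranked)
```

Here `BBB` has the highest raw mean sentiment but rests on a single article, so after shrinkage it ranks below `AAA`. A long-short book could then go long the top of `rank_pct` and short the bottom, with position limits tied to `confidence`.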

---

## Summary

### Key Takeaways

1. **Preprocessing matters**: Financial text requires domain-specific handling

2. **Multiple approaches**: Lexicon (fast, interpretable) vs ML (context-aware, usually more accurate but slower)

3. **Aggregation is crucial**: How you combine multiple signals affects performance

4. **Avoid look-ahead bias**: Always lag signals before calculating returns

5. **Transaction costs**: Include realistic costs in backtests

### Next Steps

- Day 04: News Event Detection
- Day 05: Entity Recognition for Finance
- Day 06: Topic Modeling on Earnings Calls

---

*Notebook created for Week 19: NLP & Alternative Data*