# MarketPulse Analytics Studio - Complete Walkthrough
## Cazandra Aporbo, MS
## May-Sep 2025

This notebook walks through my entire MarketPulse system. I'll show you how sentiment analysis, feature engineering, and ensemble learning come together to predict market movements. 

I wrote this notebook to be readable by humans. Every variable name tells you what it does. Every comment explains why, not just what. By the end, you'll understand how the pieces of a prediction-driven trading system fit together, and where a demo like this one stops short of production.

Let's dive in.

In [None]:
# I always start with imports organized by purpose
# Makes it easier to see what the code actually does

# Data manipulation - the foundation of everything
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Visualization - because numbers without pictures are just numbers
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt

# Machine learning - where the magic happens
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# My custom modules - this is where I put months of work
import sys
sys.path.append('../')  # Add parent directory to path

from core.sentiment_engine import SentimentEngine, FinancialLexicon
from core.feature_factory import FeatureFactory, FeatureConfig
from core.model_ensemble import ModelEnsemble, EnsembleConfig

# Styling for prettier outputs
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 20)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

print("Everything loaded. Let's build something cool.")

## Step 1: Generate Market Data

I'm using synthetic data here so anyone can run this notebook without API keys. In production, you'd pull from Yahoo Finance or your broker's API. The synthetic data mimics real market behavior - trends, volatility clusters, the works.
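
One way to sanity-check the "volatility clusters" claim is to look at the autocorrelation of absolute returns: with constant volatility it sits near zero, while clustering pushes it positive. The sketch below is illustrative only. The helper name `abs_return_autocorr`, the seed, and the sample size are my assumptions; the regime rule mirrors the one in the generator cell.

```python
import numpy as np
import pandas as pd

def abs_return_autocorr(returns, lag=1):
    """Lag-k autocorrelation of absolute returns (a rough clustering gauge)."""
    return pd.Series(np.abs(returns)).autocorr(lag=lag)

rng = np.random.default_rng(42)  # arbitrary seed, only for repeatability
base_vol = 0.02
n_days = 2000

# A big move yesterday raises today's volatility;
# otherwise volatility decays back toward its baseline.
clustered = []
current_vol = base_vol
for i in range(n_days):
    if i > 0 and abs(clustered[-1]) > base_vol * 1.5:
        current_vol = base_vol * 1.5
    else:
        current_vol = current_vol * 0.9 + base_vol * 0.1
    clustered.append(rng.normal(0.0003, current_vol))

# Constant-volatility baseline for comparison
plain = rng.normal(0.0003, base_vol, n_days)

print(f"clustered: {abs_return_autocorr(clustered):+.3f}")
print(f"baseline:  {abs_return_autocorr(plain):+.3f}")
```

If the clustered number is not clearly above the baseline, the regime rule is probably too weak to matter for volatility features downstream.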

In [None]:
def create_realistic_market_data(ticker='DEMO', days=365):
    """
    I create synthetic market data that actually looks like real markets.
    Random walks are for academics. Real markets trend, mean-revert, and cluster.
    """
    
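    # End today and go back `days` calendar days. I normalize to midnight so
    # this index lines up with the daily sentiment index built later on.
    dates = pd.date_range(end=pd.Timestamp.now().normalize(), periods=days, freq='D')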
    
    # I use a combination of trends and noise to make it realistic
    trend_strength = 0.0003  # Slight upward bias, like real markets
    volatility = 0.02  # 2% daily vol is about right for stocks
    
    # Generate returns with volatility clustering
    # Markets go through calm and stormy periods
    daily_returns = []
    current_vol = volatility
    
    for i in range(days):
        # Volatility clusters - if yesterday was volatile, today probably is too
        if i > 0 and abs(daily_returns[-1]) > volatility * 1.5:
            current_vol = volatility * 1.5  # Elevated volatility
        else:
            current_vol = current_vol * 0.9 + volatility * 0.1  # Decay back to normal
        
        # Generate return with trend and noise
        daily_return = np.random.normal(trend_strength, current_vol)
        daily_returns.append(daily_return)
    
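    # Convert the list to an array so abs() and arithmetic work element-wise below
    daily_returns = np.array(daily_returns)
    
    # Convert (log) returns to prices
    # I start at 100 because it's a nice round number
    initial_price = 100
    prices = initial_price * np.exp(np.cumsum(daily_returns))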
    
    # Create OHLC data - the four numbers that define a trading day
    # I add some intraday variation to make it realistic
    market_data = pd.DataFrame(index=dates)
    market_data['Close'] = prices
    
    # Open is usually close to previous close (gaps happen but not always)
    market_data['Open'] = market_data['Close'].shift(1) * np.random.uniform(0.995, 1.005, days)
    # Assign the result back; in-place fillna on a column selection is unreliable
    market_data['Open'] = market_data['Open'].fillna(initial_price)
    
    # High and Low show the day's range
    # Bigger moves mean bigger ranges
    daily_range = abs(daily_returns) + 0.005  # Minimum 0.5% range
    market_data['High'] = market_data[['Open', 'Close']].max(axis=1) * (1 + daily_range)
    market_data['Low'] = market_data[['Open', 'Close']].min(axis=1) * (1 - daily_range)
    
    # Volume - higher on big move days (people trade more when stuff happens)
    base_volume = 10_000_000  # 10 million shares baseline
    volume_multiplier = 1 + abs(daily_returns) * 10  # Big moves = big volume
    market_data['Volume'] = (base_volume * volume_multiplier * 
                             np.random.uniform(0.8, 1.2, days)).astype(int)
    
    # Add the ticker for reference
    market_data['Ticker'] = ticker
    
    return market_data

# Generate our test data
market_data = create_realistic_market_data(ticker='TECH', days=500)

print(f"Created {len(market_data)} days of market data")
print(f"Price range: ${market_data['Close'].min():.2f} to ${market_data['Close'].max():.2f}")
print(f"\nLast 5 days:")
market_data.tail()

## Step 2: Generate News Sentiment

Real markets react to news. I'll create synthetic news that correlates with price moves, but not perfectly. Sometimes good news is ignored. Sometimes rumors move markets more than facts. That's the reality I'm modeling here.
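
To get a feel for how "not perfectly" plays out, here is a rough sketch of the noise model used in the next cell: sentiment is 20 times the daily return plus Gaussian noise with standard deviation 0.3, clipped to [-1, 1]. With 2% daily volatility the signal has a standard deviation of about 0.4, so before clipping the correlation should land near 0.4 / sqrt(0.4² + 0.3²) = 0.8. The constant volatility, the seed, and the sample size are simplifying assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
daily_returns = rng.normal(0.0, 0.02, 5000)

signal = daily_returns * 20                       # return scaled to sentiment
noise = rng.normal(0.0, 0.3, daily_returns.size)  # news that ignores price
sentiment = np.clip(signal + noise, -1, 1)

corr = np.corrcoef(daily_returns, sentiment)[0, 1]
print(f"return/sentiment correlation: {corr:.2f}")
```

Raising the noise scale or lowering the multiplier is the knob for making the synthetic news less informative.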

In [None]:
def generate_market_news(market_data, news_per_day=3):
    """
    I generate synthetic news that somewhat correlates with price moves.
    The correlation isn't perfect because markets aren't efficient.
    Sometimes the news follows price, sometimes price follows news.
    """
    
    # Calculate daily returns to base sentiment on
    # The first day has no prior close, so I treat its return as zero
    returns = market_data['Close'].pct_change().fillna(0)
    
    # News headlines that sound real
    # I spent way too long making these sound authentic
    # Every placeholder receives a small fraction (e.g. 0.03),
    # so they all use percent formatting
    bullish_templates = [
        "TECH beats earnings expectations by {:.1%}, raises full-year guidance",
        "Analysts upgrade TECH to buy, cite strong fundamentals",
        "TECH announces breakthrough product, stock rallies",
        "Institutional investors increase TECH holdings by {:.1%}",
        "TECH CEO discusses expansion plans, market reacts positively",
        "Breaking: TECH wins major contract, shares climb {:.1%}",
        "TECH stock hits new 52-week high on strong volume"
    ]
    
    bearish_templates = [
        "TECH misses revenue estimates, shares tumble {:.1%}",
        "Concerns grow over TECH valuation metrics",
        "TECH faces regulatory scrutiny, uncertainty weighs on stock",
        "Major investor reduces TECH position by {:.1%}",
        "TECH announces layoffs, restructuring costs mount",
        "Supply chain issues impact TECH production forecasts",
        "TECH loses market share to competitors, analysts concerned"
    ]
    
    neutral_templates = [
        "TECH trading flat as investors await earnings report",
        "Market watches TECH for breakout signals",
        "TECH maintains steady course amid market volatility",
        "Analysts mixed on TECH near-term prospects",
        "TECH consolidating after recent moves"
    ]
    
    # Sources with different reliability
    reliable_sources = ['Bloomberg', 'Reuters', 'WSJ', 'Financial Times']
    medium_sources = ['MarketWatch', 'CNBC', 'Yahoo Finance']
    noise_sources = ['TradingBlog', 'StockTwits', 'Reddit']
    
    all_news = []
    
    for date, row in market_data.iterrows():
        daily_return = returns.loc[date]
        
        # Generate multiple news items per day
        for news_item in range(news_per_day):
            # News sentiment loosely follows price
            # But I add noise because markets aren't perfectly efficient
            sentiment_bias = daily_return * 20  # Scale return to sentiment
            sentiment_noise = np.random.normal(0, 0.3)  # Random noise
            
            # Final sentiment with some randomness
            true_sentiment = np.clip(sentiment_bias + sentiment_noise, -1, 1)
            
            # Pick headline based on sentiment
            if true_sentiment > 0.2:
                headline = np.random.choice(bullish_templates)
                # Format with random numbers for realism
                headline = headline.format(abs(daily_return) * np.random.uniform(1, 3))
            elif true_sentiment < -0.2:
                headline = np.random.choice(bearish_templates)
                headline = headline.format(abs(daily_return) * np.random.uniform(1, 3))
            else:
                headline = np.random.choice(neutral_templates)
            
            # Pick source based on sentiment magnitude
            # Big news comes from reliable sources
            if abs(true_sentiment) > 0.5:
                source = np.random.choice(reliable_sources)
            elif abs(true_sentiment) > 0.2:
                source = np.random.choice(medium_sources)
            else:
                source = np.random.choice(noise_sources)
            
            # Add some time variation during the day
            hours_offset = np.random.randint(0, 24)
            timestamp = date + timedelta(hours=hours_offset)
            
            all_news.append({
                'timestamp': timestamp,
                'headline': headline,
                'source': source,
                'true_sentiment': true_sentiment  # I keep this for validation
            })
    
    return pd.DataFrame(all_news)

# Generate news for our market data
news_data = generate_market_news(market_data, news_per_day=5)

print(f"Generated {len(news_data)} news articles")
print(f"Sources: {news_data['source'].value_counts().to_dict()}")
print(f"\nRecent headlines:")
for _, article in news_data.tail(5).iterrows():
    print(f"  [{article['source']}] {article['headline'][:60]}...")

## Step 3: Sentiment Analysis

Now I'll run my sentiment engine on the news. This shows how different sources get weighted differently and how sentiment momentum is calculated.
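
`SentimentEngine` lives in `core/` and isn't reproduced in this notebook, so the following is only a sketch of what source weighting and momentum could look like. The weights in `SOURCE_WEIGHTS`, the function name `sentiment_dynamics`, the `raw_score` column, and the 3-day/10-day windows (borrowed from the plot title in the next cell) are assumptions, not the engine's actual implementation.

```python
import pandas as pd

# Hypothetical reliability weights; the real engine may use different values.
SOURCE_WEIGHTS = {"Reuters": 1.0, "CNBC": 0.7, "Reddit": 0.3}

def sentiment_dynamics(articles, fast=3, slow=10):
    """Weight raw scores by source, average per day, then derive momentum."""
    df = articles.copy()
    df["weight"] = df["source"].map(SOURCE_WEIGHTS).fillna(0.5)
    df["weighted"] = df["raw_score"] * df["weight"]

    # Weighted average per calendar day
    grouped = df.groupby(df["timestamp"].dt.normalize())
    daily = grouped["weighted"].sum() / grouped["weight"].sum()

    out = pd.DataFrame({"weighted_score": daily})
    fast_avg = out["weighted_score"].rolling(fast, min_periods=1).mean()
    slow_avg = out["weighted_score"].rolling(slow, min_periods=1).mean()
    out["sentiment_momentum"] = fast_avg - slow_avg               # level change
    out["sentiment_acceleration"] = out["sentiment_momentum"].diff()  # change of change
    return out

articles = pd.DataFrame({
    "timestamp": pd.date_range("2025-01-01", periods=12, freq="D"),
    "source": ["Reuters", "CNBC", "Reddit"] * 4,
    "raw_score": [0.2] * 12,
})
print(sentiment_dynamics(articles).tail(3))
```

With a constant raw score the momentum and acceleration columns stay at zero, which is a handy check that the two derived series only react to changes in sentiment, not to its level.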

In [None]:
# Initialize my sentiment engine
sentiment_brain = SentimentEngine()

# Process all the news
print("Analyzing sentiment for all articles...")
analyzed_articles = sentiment_brain.analyze_batch(
    news_data.to_dict('records')
)

print(f"\nSentiment Statistics:")
print(f"  Average sentiment: {analyzed_articles['weighted_score'].mean():.3f}")
print(f"  Sentiment volatility: {analyzed_articles['weighted_score'].std():.3f}")
print(f"  Most positive article: {analyzed_articles['weighted_score'].max():.3f}")
print(f"  Most negative article: {analyzed_articles['weighted_score'].min():.3f}")

# Show how sentiment momentum works
# This is the secret sauce - it's not just sentiment, it's how fast it's changing
latest_sentiment = analyzed_articles.tail(20)

# Visualize sentiment dynamics
fig = make_subplots(
    rows=3, cols=1,
    subplot_titles=(
        'Raw Sentiment Scores',
        'Sentiment Momentum (3-day vs 10-day)',
        'Sentiment Acceleration'
    ),
    shared_xaxes=True
)

# Raw sentiment
fig.add_trace(
    go.Scatter(
        x=latest_sentiment['timestamp'],
        y=latest_sentiment['weighted_score'],
        mode='lines+markers',
        name='Weighted Sentiment',
        line=dict(color='#4a9b7f', width=2)
    ),
    row=1, col=1
)

# Sentiment momentum
fig.add_trace(
    go.Scatter(
        x=latest_sentiment['timestamp'],
        y=latest_sentiment['sentiment_momentum'],
        mode='lines',
        name='Momentum',
        line=dict(color='#457b9d', width=2),
        fill='tozeroy'
    ),
    row=2, col=1
)

# Sentiment acceleration
fig.add_trace(
    go.Scatter(
        x=latest_sentiment['timestamp'],
        y=latest_sentiment['sentiment_acceleration'],
        mode='lines',
        name='Acceleration',
        line=dict(color='#c0504d', width=2)
    ),
    row=3, col=1
)

fig.update_layout(
    height=700,
    showlegend=False,
    title_text="Sentiment Dynamics - The Three Dimensions"
)

fig.show()

# Aggregate sentiment to daily level for merging with market data
daily_sentiment = analyzed_articles.groupby(
    analyzed_articles['timestamp'].dt.date
).agg({
    'weighted_score': 'mean',
    'confidence': 'mean',
    'text': 'count'  # News volume
}).rename(columns={'text': 'news_count'})

daily_sentiment.index = pd.to_datetime(daily_sentiment.index)

print(f"\nAggregated to {len(daily_sentiment)} daily sentiment scores")
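
When daily sentiment is joined to prices, the two indexes have to match exactly. A price index stamped with a time of day will not line up with a sentiment index at midnight, and pandas fills the column with NaN without raising an error. A small sketch of the pitfall and one way around it, using toy frames rather than this notebook's data:

```python
import pandas as pd

# Price index carrying a time of day, as datetime.now() would produce
price_index = pd.date_range(end=pd.Timestamp("2025-06-05 14:30"), periods=3, freq="D")
prices = pd.DataFrame({"Close": [100.0, 101.0, 99.5]}, index=price_index)

# Daily sentiment indexed at midnight, as a groupby on .dt.date would give
sentiment = pd.Series(
    [0.1, -0.2, 0.3],
    index=pd.to_datetime(["2025-06-03", "2025-06-04", "2025-06-05"]),
)

misaligned = prices.copy()
misaligned["sentiment"] = sentiment        # no index labels match
print(misaligned["sentiment"].isna().all())  # True

aligned = prices.copy()
aligned.index = aligned.index.normalize()  # drop the time of day
aligned["sentiment"] = sentiment.reindex(aligned.index).ffill()
print(aligned)
```

Checking `isna().mean()` on the merged column right after a join is a cheap guard against this kind of silent mismatch.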

## Step 4: Feature Engineering

This is where data science becomes art. I combine price, volume, technical indicators, and sentiment into candidate predictors. I started with 200+ features and kept only the ones that held up; the config below sets `max_features=30` to keep the final set manageable.
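
`FeatureFactory` isn't shown in this notebook, so here is a rough sketch of the kind of features that the column names used later (`returns`, `volume_ratio`, `volatility_20`, `rsi`) suggest. The formulas are common textbook definitions and are my assumption about what the factory computes. The RSI here is the simple-moving-average variant, not Wilder's smoothed version, and `basic_features` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def basic_features(ohlcv, window=20, rsi_window=14):
    """Illustrative price/volume features; not the FeatureFactory itself."""
    feats = pd.DataFrame(index=ohlcv.index)
    feats["returns"] = ohlcv["Close"].pct_change()
    feats["volatility_20"] = feats["returns"].rolling(window).std()
    feats["volume_ratio"] = ohlcv["Volume"] / ohlcv["Volume"].rolling(window).mean()

    # RSI from average gains and average losses over the lookback window
    delta = ohlcv["Close"].diff()
    avg_gain = delta.clip(lower=0).rolling(rsi_window).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(rsi_window).mean()
    feats["rsi"] = 100 - 100 / (1 + avg_gain / avg_loss)
    return feats

rng = np.random.default_rng(0)  # arbitrary seed
idx = pd.date_range("2025-01-01", periods=60, freq="D")
demo = pd.DataFrame({
    "Close": 100 * np.exp(np.cumsum(rng.normal(0, 0.02, 60))),
    "Volume": rng.integers(8_000_000, 12_000_000, 60),
}, index=idx)

print(basic_features(demo).dropna().tail(3))
```

Every feature uses only data up to and including the current row, which is what keeps the later "predict tomorrow" setup free of look-ahead.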

In [None]:
# Merge market data with sentiment
# I need to be careful about alignment here - markets and news run on different clocks
combined_data = market_data.copy()
combined_data['sentiment'] = daily_sentiment['weighted_score']
combined_data['news_count'] = daily_sentiment['news_count']

# Forward fill sentiment for days without news (weekends, holidays)
# I assign the results back instead of using inplace=True on a column selection
combined_data['sentiment'] = combined_data['sentiment'].ffill()
combined_data['news_count'] = combined_data['news_count'].fillna(0)

# Initialize my feature factory
feature_config = FeatureConfig(
    fast_window=5,
    medium_window=20,
    slow_window=50,
    max_features=30  # Keep it manageable
)

feature_wizard = FeatureFactory(config=feature_config)

# Create all features
print("Engineering features from raw data...")
feature_matrix = feature_wizard.create_features(
    combined_data,
    include_sentiment=True
)

print(f"\nCreated {len(feature_matrix.columns)} features")
print(f"Feature categories:")

# I like to see what types of features I have
feature_categories = {
    'Price': [col for col in feature_matrix.columns if 'price' in col.lower() or 'return' in col.lower()],
    'Volume': [col for col in feature_matrix.columns if 'volume' in col.lower()],
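    # 'vol' alone would also match every volume column, so I exclude those here
    'Volatility': [col for col in feature_matrix.columns
                   if ('vol' in col.lower() and 'volume' not in col.lower()) or 'atr' in col.lower()],
    'Momentum': [col for col in feature_matrix.columns
                 if any(key in col.lower() for key in ('rsi', 'roc', 'macd'))],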
    'Sentiment': [col for col in feature_matrix.columns if 'sent' in col.lower()],
}

for category, features in feature_categories.items():
    print(f"  {category}: {len(features)} features")

# Show correlation between key features
key_features = ['returns', 'sentiment', 'volume_ratio', 'volatility_20', 'rsi']
available_features = [f for f in key_features if f in feature_matrix.columns]

if len(available_features) > 0:
    correlation_matrix = feature_matrix[available_features].corr()
    
    fig = px.imshow(
        correlation_matrix,
        text_auto=True,
        color_continuous_scale='RdBu',
        zmin=-1, zmax=1,
        title="Feature Correlation Matrix - Looking for Relationships"
    )
    fig.update_layout(height=500)
    fig.show()

# Show feature importance (based on variance)
feature_importance = feature_wizard.get_feature_importance()
if feature_importance:
    top_features = sorted(feature_importance.items(), key=lambda x: x[1], reverse=True)[:10]
    
    print("\nTop 10 Features by Importance:")
    for rank, (feature, score) in enumerate(top_features, 1):
        print(f"  {rank}. {feature}: {score:.4f}")

## Step 5: Create Trading Signals

I need to create labels for supervised learning. In this case, I'm predicting whether tomorrow's price will be up or down. Simple but effective.
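
The labeling trick is `shift(-1)`: it moves tomorrow's return onto today's row, so each row pairs today's features with tomorrow's outcome. A tiny worked example with made-up prices:

```python
import pandas as pd

close = pd.Series(
    [100.0, 102.0, 101.0, 105.0],
    index=pd.date_range("2025-01-01", periods=4, freq="D"),
)

# Row t holds the return from day t to day t+1
tomorrow_return = close.pct_change().shift(-1)
labels = (tomorrow_return > 0).astype(int)

# The final row has no tomorrow. NaN > 0 evaluates to False, so without
# this filter it would be silently labeled 0 (DOWN).
labels = labels[tomorrow_return.notna()]
print(labels.tolist())  # [1, 0, 1]
```

That silent 0 on the last row is why the last row has to be dropped explicitly rather than relying on a NaN check of the label column.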

In [None]:
# Create target variable - next day's direction
# I'm predicting tomorrow based on today's features
tomorrow_return = combined_data['Close'].pct_change().shift(-1)
trading_signal = (tomorrow_return > 0).astype(int)

# Remove the last row (no tomorrow for that one)
feature_matrix = feature_matrix[:-1]
trading_signal = trading_signal[:-1]

# Remove any remaining NaN values
# I need clean data for the models
clean_mask = ~(feature_matrix.isna().any(axis=1) | trading_signal.isna())
X_clean = feature_matrix[clean_mask]
y_clean = trading_signal[clean_mask]

print(f"Final dataset: {len(X_clean)} samples with {len(X_clean.columns)} features")
print(f"Target distribution:")
print(f"  Up days (1): {y_clean.sum()} ({y_clean.mean():.1%})")
print(f"  Down days (0): {len(y_clean) - y_clean.sum()} ({1-y_clean.mean():.1%})")

# Split data temporally (never randomly for time series!)
# I use 80% for training, 20% for testing
split_point = int(len(X_clean) * 0.8)

X_train = X_clean.iloc[:split_point]
X_test = X_clean.iloc[split_point:]
y_train = y_clean.iloc[:split_point]
y_test = y_clean.iloc[split_point:]

print(f"\nTraining set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"\nFeatures look like this:")
X_train.head()

## Step 6: Train the Ensemble

Now for the fun part - training multiple models and combining them. Each model sees the problem differently, and that diversity is our strength.
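
`ModelEnsemble` is a custom class whose internals aren't shown here. A common way to combine models with fixed weights is soft voting: average each model's predicted probability of UP using the weights, then apply a threshold. The sketch below assumes that approach; the function name `weighted_vote`, the 0.5 threshold, and the probability values are hypothetical.

```python
import numpy as np

def weighted_vote(probabilities, weights, threshold=0.5):
    """Blend per-model P(UP) arrays with normalized weights, then threshold."""
    total = sum(weights[name] for name in probabilities)
    blended = sum(
        np.asarray(p, dtype=float) * (weights[name] / total)
        for name, p in probabilities.items()
    )
    return blended, (blended >= threshold).astype(int)

weights = {"logistic": 0.3, "random_forest": 0.7}
probabilities = {
    "logistic":      [0.60, 0.40, 0.55],
    "random_forest": [0.45, 0.70, 0.30],
}

blended, signal = weighted_vote(probabilities, weights)
print(blended.round(3))  # [0.495 0.61  0.375]
print(signal)            # [0 1 0]
```

On the first sample the logistic model says UP (0.60) but the heavier random forest says DOWN (0.45), and the blend of 0.495 follows the forest. That is the practical meaning of a 0.3/0.7 weighting.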

In [None]:
# Configure the ensemble
# I've tuned these weights based on backtesting
ensemble_config = EnsembleConfig(
    model_weights={
        'logistic': 0.3,     # Simple but stable
        'random_forest': 0.7  # Captures non-linearity
    },
    n_splits=3,  # For time series cross-validation
    min_accuracy=0.52  # Better than coin flip
)

# Initialize the ensemble
market_oracle = ModelEnsemble(config=ensemble_config)

# Train all models
print("Training the ensemble (this takes a moment)...\n")
training_metrics = market_oracle.train(
    X_train, 
    y_train,
    validate=True  # Use time series cross-validation
)

# Display training results
print("\n" + "="*50)
print("TRAINING RESULTS")
print("="*50)

for model_name, metrics in training_metrics.items():
    print(f"\n{model_name.upper()}:")
    print(f"  Accuracy:  {metrics.accuracy:.3f}")
    print(f"  Precision: {metrics.precision:.3f}")
    print(f"  Recall:    {metrics.recall:.3f}")
    print(f"  F1 Score:  {metrics.f1:.3f}")

# Get ensemble weights
final_weights = market_oracle.get_model_weights()
print("\n" + "="*50)
print("ENSEMBLE WEIGHTS (Performance-Adjusted)")
print("="*50)
for model, weight in final_weights.items():
    print(f"  {model}: {weight:.2%}")
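
The `n_splits=3` setting refers to time series cross-validation. I can't show `ModelEnsemble.train` here, but scikit-learn's `TimeSeriesSplit` illustrates the idea: each fold trains on an expanding window of the past and validates on the block right after it. Whether the ensemble uses this exact class is an assumption on my part.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # eight time-ordered samples

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))

for fold, (train_idx, val_idx) in enumerate(folds, 1):
    print(f"fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")
    # Every training index comes before every validation index
    assert train_idx.max() < val_idx.min()
```

Unlike shuffled K-fold, no fold ever trains on data that comes after its validation block, so the scores are not inflated by leakage from the future.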

## Step 7: Make Predictions

Time to see how well our ensemble performs on data it's never seen. This is the moment of truth.

In [None]:
# Make predictions on test set
print("Making predictions on test set...\n")

# Get predictions from ensemble
ensemble_predictions = market_oracle.predict(X_test, method='weighted')

# Also get individual model predictions for comparison
individual_predictions = {}
for model_name, model in market_oracle.models.items():
    individual_predictions[model_name] = model.predict(X_test)

# Calculate performance metrics
# classification_report and confusion_matrix were already imported in the first cell
from sklearn.metrics import accuracy_score

print("="*50)
print("TEST SET PERFORMANCE")
print("="*50)

# Ensemble performance
ensemble_accuracy = accuracy_score(y_test, ensemble_predictions)
print(f"\nENSEMBLE ACCURACY: {ensemble_accuracy:.3f}")

# Individual model performance
print("\nIndividual Model Accuracies:")
for model_name, predictions in individual_predictions.items():
    accuracy = accuracy_score(y_test, predictions)
    print(f"  {model_name}: {accuracy:.3f}")

# Detailed classification report for ensemble
print("\n" + "="*50)
print("ENSEMBLE CLASSIFICATION REPORT")
print("="*50)
print(classification_report(
    y_test, 
    ensemble_predictions,
    target_names=['DOWN', 'UP']
))

# Confusion Matrix
cm = confusion_matrix(y_test, ensemble_predictions)
print("\nConfusion Matrix:")
print("                Predicted")
print("                DOWN   UP")
print(f"Actual DOWN     {cm[0,0]:4d}  {cm[0,1]:4d}")
print(f"       UP       {cm[1,0]:4d}  {cm[1,1]:4d}")

# Calculate some trading metrics
# This is what actually matters for making money
true_positives = cm[1,1]  # Correctly predicted up days
false_positives = cm[0,1]  # Incorrectly predicted up days
true_negatives = cm[0,0]  # Correctly predicted down days
false_negatives = cm[1,0]  # Incorrectly predicted down days

if (true_positives + false_positives) > 0:
    precision_up = true_positives / (true_positives + false_positives)
    print(f"\nWhen we predict UP, we're right {precision_up:.1%} of the time")

if (true_negatives + false_negatives) > 0:
    precision_down = true_negatives / (true_negatives + false_negatives)
    print(f"When we predict DOWN, we're right {precision_down:.1%} of the time")
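
The per-direction hit rates come straight from the columns of the confusion matrix. A worked example with made-up counts (these are not results from this notebook):

```python
import numpy as np

# Rows are actual (DOWN, UP); columns are predicted (DOWN, UP)
cm = np.array([[30, 20],
               [15, 35]])

precision_up = cm[1, 1] / cm[:, 1].sum()    # 35 / (35 + 20)
precision_down = cm[0, 0] / cm[:, 0].sum()  # 30 / (30 + 15)
accuracy = np.trace(cm) / cm.sum()          # (30 + 35) / 100

print(f"UP precision:   {precision_up:.1%}")    # 63.6%
print(f"DOWN precision: {precision_down:.1%}")  # 66.7%
print(f"Accuracy:       {accuracy:.1%}")        # 65.0%
```

For a long-only strategy the UP precision is the number to watch, because only UP predictions put capital at risk.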

## Step 8: Visualize Performance

Numbers are nice, but pictures tell the story better. Let's see how our predictions look over time.

In [None]:
# Create a results dataframe for visualization
results_df = pd.DataFrame({
    'date': X_test.index,
    'actual': y_test.values,
    'predicted': ensemble_predictions,
    'correct': (y_test.values == ensemble_predictions).astype(int)
})

# Add prices for context
results_df['price'] = combined_data.loc[X_test.index, 'Close'].values

# Calculate cumulative accuracy over time
# I want to see if the model gets better or worse over time
results_df['cumulative_accuracy'] = results_df['correct'].expanding().mean()

# Create visualization
fig = make_subplots(
    rows=3, cols=1,
    subplot_titles=(
        'Stock Price with Predictions',
        'Prediction Accuracy Over Time',
        'Rolling 20-Day Accuracy'
    ),
    shared_xaxes=True,
    vertical_spacing=0.05
)

# Plot 1: Price with prediction markers
# Green dots for correct predictions, red for wrong ones
fig.add_trace(
    go.Scatter(
        x=results_df['date'],
        y=results_df['price'],
        mode='lines',
        name='Price',
        line=dict(color='#2E86AB', width=2)
    ),
    row=1, col=1
)

# Add correct predictions
correct_mask = results_df['correct'] == 1
fig.add_trace(
    go.Scatter(
        x=results_df[correct_mask]['date'],
        y=results_df[correct_mask]['price'],
        mode='markers',
        name='Correct',
        marker=dict(color='#4a9b7f', size=6)
    ),
    row=1, col=1
)

# Add wrong predictions
wrong_mask = results_df['correct'] == 0
fig.add_trace(
    go.Scatter(
        x=results_df[wrong_mask]['date'],
        y=results_df[wrong_mask]['price'],
        mode='markers',
        name='Wrong',
        marker=dict(color='#c0504d', size=6)
    ),
    row=1, col=1
)

# Plot 2: Cumulative accuracy
fig.add_trace(
    go.Scatter(
        x=results_df['date'],
        y=results_df['cumulative_accuracy'],
        mode='lines',
        name='Cumulative Accuracy',
        line=dict(color='#F18F01', width=2),
        fill='tozeroy'
    ),
    row=2, col=1
)

# Add 50% reference line (random guessing)
fig.add_hline(
    y=0.5, 
    line_dash="dash", 
    line_color="gray",
    row=2, col=1
)

# Plot 3: Rolling accuracy
results_df['rolling_accuracy'] = results_df['correct'].rolling(20).mean()
fig.add_trace(
    go.Scatter(
        x=results_df['date'],
        y=results_df['rolling_accuracy'],
        mode='lines',
        name='20-Day Rolling',
        line=dict(color='#8B5A3C', width=2)
    ),
    row=3, col=1
)

# Add 50% reference line
fig.add_hline(
    y=0.5, 
    line_dash="dash", 
    line_color="gray",
    row=3, col=1
)

fig.update_layout(
    height=900,
    title_text="Model Performance Analysis - How Did We Do?",
    showlegend=False
)

fig.update_yaxes(title_text="Price ($)", row=1, col=1)
fig.update_yaxes(title_text="Accuracy", tickformat=".0%", row=2, col=1)
fig.update_yaxes(title_text="Accuracy", tickformat=".0%", row=3, col=1)

fig.show()

# Summary statistics
print("\n" + "="*50)
print("PERFORMANCE SUMMARY")
print("="*50)
print(f"Overall Accuracy: {results_df['correct'].mean():.1%}")
print(f"Best 20-day Period: {results_df['rolling_accuracy'].max():.1%}")
print(f"Worst 20-day Period: {results_df['rolling_accuracy'].min():.1%}")
print(f"Accuracy Volatility: {results_df['rolling_accuracy'].std():.1%}")
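
The two accuracy curves answer different questions. `expanding().mean()` averages everything seen so far, so it smooths out over time, while `rolling(n).mean()` only looks at the last `n` days and is undefined until `n` days have passed. A toy example with a 3-day window instead of 20:

```python
import pandas as pd

correct = pd.Series([1, 1, 0, 0, 0, 1])  # 1 = right call, 0 = wrong call

cumulative = correct.expanding().mean()  # all history up to each day
rolling_3 = correct.rolling(3).mean()    # only the most recent 3 days

print(pd.DataFrame({
    "correct": correct,
    "cumulative": cumulative,
    "rolling_3": rolling_3,
}))
```

On day five the rolling value drops to zero while the cumulative value is still 0.4. That gap is why the rolling curve is the better one for spotting stretches where the model stops working.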

## Step 9: Simulate Trading Performance

Accuracy is nice, but what really matters is: can this make money? Let's simulate a simple trading strategy based on our predictions.

In [None]:
def simulate_trading(results_df, initial_capital=10000):
    """
    I simulate a simple trading strategy:
    - Buy when model predicts UP
    - Sell (or stay out) when model predicts DOWN
    
    This ignores transaction costs for simplicity.
    In real trading, those matter a lot.
    """
    
    capital = initial_capital
    shares = 0
    trades = []
    portfolio_value = []
    
    for i, row in results_df.iterrows():
        current_price = row['price']
        
        # Portfolio value = cash + stock value
        current_value = capital + (shares * current_price)
        portfolio_value.append(current_value)
        
        # Trading logic
        if row['predicted'] == 1 and shares == 0:
            # Buy signal and we're not in the market
            shares_to_buy = capital // current_price
            if shares_to_buy > 0:
                shares = shares_to_buy
                capital -= shares * current_price
                trades.append({
                    'date': row['date'],
                    'action': 'BUY',
                    'price': current_price,
                    'shares': shares
                })
                
        elif row['predicted'] == 0 and shares > 0:
            # Sell signal and we have position
            capital += shares * current_price
            trades.append({
                'date': row['date'],
                'action': 'SELL',
                'price': current_price,
                'shares': shares
            })
            shares = 0
    
    # Close any remaining position
    if shares > 0:
        final_price = results_df.iloc[-1]['price']
        capital += shares * final_price
        trades.append({
            'date': results_df.iloc[-1]['date'],
            'action': 'SELL',
            'price': final_price,
            'shares': shares
        })
    
    return portfolio_value, trades, capital

# Run the simulation
portfolio_values, trade_history, final_capital = simulate_trading(results_df)

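# Calculate buy and hold for comparison
# This is our benchmark - could we beat just holding the stock?
# I keep the leftover cash in the total so both approaches start at exactly $10,000
first_price = results_df.iloc[0]['price']
buy_hold_shares = 10000 // first_price
buy_hold_cash = 10000 - buy_hold_shares * first_price
buy_hold_values = [buy_hold_cash + buy_hold_shares * price for price in results_df['price']]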

# Calculate returns
strategy_return = (final_capital - 10000) / 10000
buy_hold_return = (buy_hold_values[-1] - 10000) / 10000

print("="*50)
print("TRADING SIMULATION RESULTS")
print("="*50)
print(f"\nStarting Capital: $10,000")
print(f"Final Capital (Our Strategy): ${final_capital:,.2f}")
print(f"Final Value (Buy & Hold): ${buy_hold_values[-1]:,.2f}")
print(f"\nReturns:")
print(f"  Our Strategy: {strategy_return:+.1%}")
print(f"  Buy & Hold: {buy_hold_return:+.1%}")
print(f"  Excess Return vs Buy & Hold: {strategy_return - buy_hold_return:+.1%}")

print(f"\nNumber of Trades: {len(trade_history)}")
if len(trade_history) > 0:
    print(f"\nSample Trades:")
    for trade in trade_history[:5]:
        print(f"  {trade['date'].date()}: {trade['action']} {trade['shares']} @ ${trade['price']:.2f}")

# Visualize portfolio performance
fig = go.Figure()

# Our strategy
fig.add_trace(
    go.Scatter(
        x=results_df['date'],
        y=portfolio_values,
        mode='lines',
        name='ML Strategy',
        line=dict(color='#4a9b7f', width=3)
    )
)

# Buy and hold
fig.add_trace(
    go.Scatter(
        x=results_df['date'],
        y=buy_hold_values,
        mode='lines',
        name='Buy & Hold',
        line=dict(color='#c0504d', width=2, dash='dash')
    )
)

# Add trade markers
buy_trades = [t for t in trade_history if t['action'] == 'BUY']
sell_trades = [t for t in trade_history if t['action'] == 'SELL']

if buy_trades:
    buy_dates = [t['date'] for t in buy_trades]
    buy_values = [portfolio_values[results_df[results_df['date'] == d].index[0]] 
                  for d in buy_dates if d in results_df['date'].values]
    fig.add_trace(
        go.Scatter(
            x=buy_dates[:len(buy_values)],
            y=buy_values,
            mode='markers',
            name='Buy',
            marker=dict(color='green', size=10, symbol='triangle-up')
        )
    )

if sell_trades:
    sell_dates = [t['date'] for t in sell_trades]
    sell_values = [portfolio_values[results_df[results_df['date'] == d].index[0]] 
                   for d in sell_dates if d in results_df['date'].values]
    fig.add_trace(
        go.Scatter(
            x=sell_dates[:len(sell_values)],
            y=sell_values,
            mode='markers',
            name='Sell',
            marker=dict(color='red', size=10, symbol='triangle-down')
        )
    )

fig.update_layout(
    title="Portfolio Value Over Time - ML Strategy vs Buy & Hold",
    xaxis_title="Date",
    yaxis_title="Portfolio Value ($)",
    height=500,
    hovermode='x unified'
)

fig.show()
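
The simulation above ignores trading costs. A rough way to see their effect is to charge a proportional fee on every fill. The sketch below is a simplified, fractional-share version of the same long-or-flat logic; `simulate_with_costs`, the `cost_bps` parameter, and the toy price and signal series are assumptions for illustration.

```python
def simulate_with_costs(prices, signals, initial_capital=10_000.0, cost_bps=10.0):
    """Long-or-flat strategy with a proportional fee charged on each trade."""
    fee = cost_bps / 10_000
    cash, shares = initial_capital, 0.0
    for price, signal in zip(prices, signals):
        if signal == 1 and shares == 0:
            shares = cash * (1 - fee) / price  # fee taken out of the purchase
            cash = 0.0
        elif signal == 0 and shares > 0:
            cash = shares * price * (1 - fee)  # fee taken out of the proceeds
            shares = 0.0
    # Mark any open position to the last price (no exit fee charged here)
    return cash + shares * prices[-1]

prices = [100.0, 102.0, 101.0, 103.0, 104.0]
signals = [1, 0, 1, 0, 1]

for bps in (0, 10, 50):
    final = simulate_with_costs(prices, signals, cost_bps=bps)
    print(f"{bps:>2} bps per trade -> final value ${final:,.2f}")
```

Because this toy strategy trades on every signal flip, the fee is paid five times in five days. A model that flips often needs a correspondingly larger edge per trade to stay ahead of buy and hold.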

## Conclusions and Next Steps

So what did I learn from all this?

1. **Sentiment matters, but not as much as you'd think.** Price action still dominates.

2. **Feature engineering is everything.** The right features matter more than the fanciest models.

3. **Ensemble methods work.** Different models catch different patterns.

4. **Time series is tricky.** Never use random splits. Always validate temporally.

5. **Transaction costs kill strategies.** What looks good in backtesting often fails in reality.

### Next Steps for Improvement:

- Add more sophisticated NLP (BERT, FinBERT)
- Include options flow data
- Add regime detection (bull vs bear markets need different models)
- Implement proper position sizing and risk management
- Test on out-of-sample data from different time periods

Remember: This is a demonstration system. Real trading requires much more robust testing, risk management, and probably a good lawyer. But the concepts here? They're solid. And that's what matters for a portfolio project.

Thanks for following along. Now go build something awesome.

- Cazandra Aporbo, MS