# Bitcoin Trading Sentiment Analysis: Exploring Market Sentiment vs Trader Performance

## Objective
Explore the relationship between Bitcoin market sentiment (Fear/Greed Index) and trader performance on Hyperliquid. The goal is to check whether trader outcomes (PnL, win rate, leverage usage) differ systematically across sentiment regimes, and to turn any such patterns into testable ideas for trading strategies.

---

## Datasets
1. **Fear/Greed Index**: Date, Classification (Fear/Greed)
2. **Hyperliquid Historical Trader Data**: account, symbol, execution price, size, side, time, start position, event, closedPnL, leverage, etc.

---

## Analysis Pipeline
1. Load and explore datasets
2. Data preprocessing and cleaning
3. Merge datasets on date/time
4. Exploratory data analysis
5. Sentiment distribution analysis
6. Trader performance metrics
7. Correlation analysis: sentiment vs performance
8. Pattern discovery and feature engineering
9. Visualize key insights
10. Statistical analysis

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
from pathlib import Path
import sys

# Add src directory to path
sys.path.insert(0, str(Path('../src').resolve()))

# Configure plotting
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
%matplotlib inline

print("✓ Libraries imported successfully!")

✓ Libraries imported successfully!


In [3]:
# Visualize performance by sentiment
if pnl_col and 'sentiment' in merged_df.columns:
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # Box plot: PnL by Sentiment
    merged_df.boxplot(column=pnl_col, by='sentiment', ax=axes[0, 0])
    axes[0, 0].set_title('PnL Distribution by Sentiment', fontsize=14, fontweight='bold')
    axes[0, 0].set_xlabel('Sentiment')
    axes[0, 0].set_ylabel('PnL ($)')
    axes[0, 0].axhline(0, color='red', linestyle='--', alpha=0.5)
    plt.sca(axes[0, 0])
    plt.xticks(rotation=45)
    
    # Bar chart: Mean PnL by Sentiment
    mean_pnl = merged_df.groupby('sentiment')[pnl_col].mean()
    axes[0, 1].bar(mean_pnl.index, mean_pnl.values, edgecolor='black', alpha=0.7)
    axes[0, 1].set_title('Average PnL by Sentiment', fontsize=14, fontweight='bold')
    axes[0, 1].set_xlabel('Sentiment')
    axes[0, 1].set_ylabel('Average PnL ($)')
    axes[0, 1].axhline(0, color='red', linestyle='--', alpha=0.5)
    axes[0, 1].tick_params(axis='x', rotation=45)
    axes[0, 1].grid(alpha=0.3)
    
    # Win Rate by Sentiment
    # Select the PnL column before applying, so the lambda never touches the grouping column
    win_rate_by_sent = merged_df.groupby('sentiment')[pnl_col].apply(
        lambda pnl: (pnl > 0).mean() * 100
    )
    axes[1, 0].bar(win_rate_by_sent.index, win_rate_by_sent.values, edgecolor='black', alpha=0.7, color='green')
    axes[1, 0].set_title('Win Rate by Sentiment', fontsize=14, fontweight='bold')
    axes[1, 0].set_xlabel('Sentiment')
    axes[1, 0].set_ylabel('Win Rate (%)')
    axes[1, 0].tick_params(axis='x', rotation=45)
    axes[1, 0].grid(alpha=0.3)
    axes[1, 0].set_ylim(0, 100)
    
    # Violin plot: PnL distribution by Sentiment
    sentiment_order = sorted(merged_df['sentiment'].dropna().unique())
    sns.violinplot(data=merged_df, x='sentiment', y=pnl_col, order=sentiment_order, ax=axes[1, 1])
    axes[1, 1].set_title('PnL Distribution (Violin) by Sentiment', fontsize=14, fontweight='bold')
    axes[1, 1].set_xlabel('Sentiment')
    axes[1, 1].set_ylabel('PnL ($)')
    axes[1, 1].axhline(0, color='red', linestyle='--', alpha=0.5)
    axes[1, 1].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()

NameError: name 'pnl_col' is not defined

## Summary and Key Insights

### Recap of Findings

This analysis explored the relationship between Bitcoin market sentiment (Fear/Greed Index) and trader performance on Hyperliquid. Key areas investigated:

1. **Data Overview**: Loaded and validated both datasets
2. **Preprocessing**: Cleaned, aligned, and merged sentiment with trading data
3. **Sentiment Patterns**: Analyzed distribution and duration of Fear/Greed periods
4. **Performance Metrics**: Calculated win rates, PnL, leverage usage
5. **Correlations**: Examined relationships between sentiment and performance
6. **Lag Effects**: Tested whether past sentiment predicts current performance
7. **Transitions**: Analyzed performance during sentiment shifts
8. **Statistical Tests**: Validated findings with hypothesis testing

### Next Steps

1. **Build Predictive Models**: Use engineered features to predict PnL
2. **Trader Segmentation**: Cluster traders by behavior patterns
3. **Strategy Development**: Design counter-trend vs trend-following strategies
4. **Real-time Integration**: Connect to live sentiment feeds
5. **Backtesting**: Validate strategies with historical simulations

### How to Export Results

```python
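# Create the output directory first; to_csv fails if it is missing
out_dir = Path('../deliverables')  # Path is imported in the first cell
out_dir.mkdir(parents=True, exist_ok=True)

# Save merged dataset
merged_enriched.to_csv(out_dir / 'merged_sentiment_trading_data.csv', index=False)

# Save key metrics
perf_by_sentiment.to_csv(out_dir / 'performance_by_sentiment.csv')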
```

In [None]:
# ANOVA - Test if PnL differs across ALL sentiment categories
from scipy import stats as scipy_stats  # imported here so this cell runs on its own

if pnl_col and 'sentiment' in merged_df.columns:
    sentiment_groups = [merged_df[merged_df['sentiment'] == sent][pnl_col].dropna() 
                       for sent in merged_df['sentiment'].dropna().unique()]
    
    # Filter out empty groups
    sentiment_groups = [group for group in sentiment_groups if len(group) > 0]
    
    if len(sentiment_groups) >= 2:
        f_stat, anova_pvalue = scipy_stats.f_oneway(*sentiment_groups)
        
        print("\n" + "="*60)
        print("ANOVA TEST: PnL Across All Sentiment Categories")
        print("="*60)
        print(f"\nH0: All sentiment categories have equal mean PnL")
        print(f"H1: At least one sentiment category differs\n")
        print(f"F-statistic: {f_stat:.4f}")
        print(f"P-value: {anova_pvalue:.4f}")
        print(f"Result: {'SIGNIFICANT' if anova_pvalue < 0.05 else 'NOT SIGNIFICANT'} at α=0.05")
        print("="*60)
        
        if anova_pvalue < 0.05:
            print("\n✓ Significant difference detected! Further post-hoc analysis recommended.")

In [None]:
# Hypothesis Test: Are PnL differences between Fear and Greed significant?
from scipy import stats as scipy_stats

if pnl_col and 'sentiment' in merged_df.columns:
    fear_data = merged_df[merged_df['sentiment'].str.contains('Fear', na=False)][pnl_col].dropna()
    greed_data = merged_df[merged_df['sentiment'].str.contains('Greed', na=False)][pnl_col].dropna()
    
    if len(fear_data) > 0 and len(greed_data) > 0:
        # T-test
        t_stat, p_value = scipy_stats.ttest_ind(fear_data, greed_data)
        
        # Mann-Whitney U test (non-parametric alternative)
        u_stat, u_pvalue = scipy_stats.mannwhitneyu(fear_data, greed_data, alternative='two-sided')
        
        print("="*60)
        print("HYPOTHESIS TEST: Fear vs Greed Performance")
        print("="*60)
        print(f"\nH0: No difference in PnL between Fear and Greed periods")
        print(f"H1: Significant difference exists\n")
        
        print(f"Fear Sentiment:")
        print(f"  Sample size: {len(fear_data)}")
        print(f"  Mean PnL: ${fear_data.mean():.2f}")
        print(f"  Median PnL: ${fear_data.median():.2f}")
        print(f"  Std Dev: ${fear_data.std():.2f}")
        
        print(f"\nGreed Sentiment:")
        print(f"  Sample size: {len(greed_data)}")
        print(f"  Mean PnL: ${greed_data.mean():.2f}")
        print(f"  Median PnL: ${greed_data.median():.2f}")
        print(f"  Std Dev: ${greed_data.std():.2f}")
        
        print(f"\nIndependent T-Test:")
        print(f"  T-statistic: {t_stat:.4f}")
        print(f"  P-value: {p_value:.4f}")
        print(f"  Result: {'SIGNIFICANT' if p_value < 0.05 else 'NOT SIGNIFICANT'} at α=0.05")
        
        print(f"\nMann-Whitney U Test (non-parametric):")
        print(f"  U-statistic: {u_stat:.4f}")
        print(f"  P-value: {u_pvalue:.4f}")
        print(f"  Result: {'SIGNIFICANT' if u_pvalue < 0.05 else 'NOT SIGNIFICANT'} at α=0.05")
        
        print("\n" + "="*60)
        
        # Visualization
        fig, axes = plt.subplots(1, 2, figsize=(15, 6))
        
        axes[0].hist(fear_data, bins=50, alpha=0.6, label='Fear', edgecolor='black', color='red')
        axes[0].hist(greed_data, bins=50, alpha=0.6, label='Greed', edgecolor='black', color='green')
        axes[0].set_title('PnL Distribution: Fear vs Greed', fontsize=14, fontweight='bold')
        axes[0].set_xlabel('PnL ($)')
        axes[0].set_ylabel('Frequency')
        axes[0].legend()
        axes[0].grid(alpha=0.3)
        
        axes[1].boxplot([fear_data, greed_data], labels=['Fear', 'Greed'])
        axes[1].set_title('PnL Comparison: Fear vs Greed', fontsize=14, fontweight='bold')
        axes[1].set_ylabel('PnL ($)')
        axes[1].axhline(0, color='red', linestyle='--', alpha=0.5)
        axes[1].grid(alpha=0.3)
        
        plt.tight_layout()
        plt.show()
    else:
        print("⚠️ Insufficient data for Fear vs Greed comparison")

## 10. Statistical Analysis

Perform hypothesis testing to validate our findings.
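
With many trade records, a p-value can fall below 0.05 even when the difference in PnL is too small to matter in practice, so it can help to report an effect size next to the test result. The following is a minimal sketch, not the notebook's own method: it uses Welch's t-test (which does not assume equal variances) plus Cohen's d. The function name `welch_test_with_effect_size` and the sample arrays are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def welch_test_with_effect_size(a, b):
    """Welch's t-test plus Cohen's d for two independent samples.

    Hypothetical helper for illustration; not part of the notebook's src/ modules.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # equal_var=False switches ttest_ind from Student's to Welch's t-test
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

    # Cohen's d with the pooled sample standard deviation
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    cohens_d = (a.mean() - b.mean()) / np.sqrt(pooled_var)

    return {"t_stat": float(t_stat), "p_value": float(p_value), "cohens_d": float(cohens_d)}


# Made-up numbers standing in for PnL under Fear and Greed
fear_sample = [1.0, 2.0, 3.0, 4.0, 5.0]
greed_sample = [3.0, 4.0, 5.0, 6.0, 7.0]
effect = welch_test_with_effect_size(fear_sample, greed_sample)
print(effect)
```

A common rule of thumb treats |d| around 0.2 as small, 0.5 as medium, and 0.8 as large, though these thresholds are conventions rather than hard limits.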

In [None]:
# 3D scatter plot: Sentiment, Leverage, PnL (if leverage available)
leverage_cols = [col for col in merged_df.columns if 'leverage' in col.lower()]

if pnl_col and 'sentiment_numeric' in merged_df.columns and leverage_cols:
    leverage_col = leverage_cols[0]
    
    # Sample data if too large (for performance)
    sample_size = min(5000, len(merged_df))
    sample_df = merged_df.sample(n=sample_size, random_state=42)
    
    fig = px.scatter_3d(sample_df, x='sentiment_numeric', y=leverage_col, z=pnl_col,
                        color=pnl_col, color_continuous_scale='RdYlGn',
                        title=f'3D View: Sentiment vs Leverage vs PnL (Sample: {sample_size} records)',
                        labels={'sentiment_numeric': 'Sentiment', 
                               leverage_col: 'Leverage', 
                               pnl_col: 'PnL ($)'},
                        hover_data=['date'] if 'date' in sample_df.columns else None)
    
    fig.update_layout(height=700)
    fig.show()

In [None]:
# Interactive scatter: Daily PnL vs Sentiment with trend line
if pnl_col and 'sentiment_numeric' in merged_df.columns:
    daily_data = merged_df.groupby('date').agg({
        pnl_col: 'sum',
        'sentiment_numeric': 'first'
    }).reset_index()
    
    fig = px.scatter(daily_data, x='sentiment_numeric', y=pnl_col,
                     title='Daily PnL vs Sentiment (with Trendline)',
                     labels={'sentiment_numeric': 'Sentiment', pnl_col: 'Daily PnL ($)'},
                     trendline='ols', trendline_color_override='red',
                     hover_data=['date'])
    
    fig.add_hline(y=0, line_dash="dash", line_color="gray", annotation_text="Break-even")
    fig.add_vline(x=0, line_dash="dash", line_color="gray", annotation_text="Neutral")
    fig.update_layout(height=600)
    fig.show()

In [None]:
# Interactive time series: PnL and Sentiment overlay
if pnl_col and 'date' in merged_df.columns and 'sentiment_numeric' in merged_df.columns:
    daily_agg = merged_df.groupby('date').agg({
        pnl_col: 'sum',
        'sentiment_numeric': 'first'
    }).reset_index()
    
    daily_agg['cumulative_pnl'] = daily_agg[pnl_col].cumsum()
    
    # Create subplot with dual y-axes
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    
    # Add cumulative PnL
    fig.add_trace(
        go.Scatter(x=daily_agg['date'], y=daily_agg['cumulative_pnl'], 
                   name='Cumulative PnL', mode='lines', line=dict(width=2)),
        secondary_y=False
    )
    
    # Add sentiment
    fig.add_trace(
        go.Scatter(x=daily_agg['date'], y=daily_agg['sentiment_numeric'], 
                   name='Sentiment', mode='lines', line=dict(dash='dot', color='orange', width=2)),
        secondary_y=True
    )
    
    fig.update_xaxes(title_text="Date")
    fig.update_yaxes(title_text="Cumulative PnL ($)", secondary_y=False)
    fig.update_yaxes(title_text="Sentiment (Numeric)", secondary_y=True)
    fig.update_layout(title_text="Trading Performance vs Market Sentiment Over Time", 
                      height=600, hovermode='x unified')
    
    fig.show()

## 9. Visualize Key Insights

Create comprehensive interactive visualizations with Plotly.

In [None]:
# Day of week patterns
if 'day_of_week' in merged_enriched.columns and pnl_col:
    dow_performance = merged_enriched.groupby('day_of_week')[pnl_col].agg(['mean', 'count'])
    dow_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    dow_performance.index = [dow_names[i] for i in dow_performance.index]
    
    print("\n" + "="*60)
    print("PERFORMANCE BY DAY OF WEEK")
    print("="*60)
    display(dow_performance)
    
    plt.figure(figsize=(10, 6))
    plt.bar(dow_performance.index, dow_performance['mean'], edgecolor='black', alpha=0.7)
    plt.axhline(0, color='red', linestyle='--', alpha=0.5)
    plt.title('Average PnL by Day of Week', fontsize=14, fontweight='bold')
    plt.xlabel('Day of Week')
    plt.ylabel('Average PnL ($)')
    plt.xticks(rotation=45)
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()

In [None]:
# Sentiment transition analysis - how does changing sentiment affect performance?
if pnl_col and 'sentiment' in merged_df.columns:
    transitions = analyzer.sentiment_transition_analysis(pnl_col)
    
    print("\n" + "="*60)
    print("SENTIMENT TRANSITION ANALYSIS")
    print("="*60)
    print("Performance during sentiment shifts:\n")
    display(transitions.head(15))
    
    # Visualize top transitions
    top_transitions = transitions.head(10)
    
    fig = px.bar(top_transitions.reset_index(), x='transition', y='mean_pnl',
                 title='Average PnL by Sentiment Transition',
                 labels={'transition': 'Sentiment Transition', 'mean_pnl': 'Average PnL ($)'},
                 hover_data=['count', 'median_pnl'])
    fig.add_hline(y=0, line_dash="dash", line_color="red")
    fig.update_layout(height=500, xaxis_tickangle=-45)
    fig.show()

In [None]:
# Create additional features
from features import create_rolling_features, create_time_features, create_sentiment_features

# Add time features
merged_enriched = create_time_features(merged_df)

# Add rolling PnL features (if pnl_col exists)
if pnl_col:
    merged_enriched = create_rolling_features(
        merged_enriched, 
        columns=[pnl_col],
        windows=[3, 7],
        group_col='account' if 'account' in merged_enriched.columns else None
    )

# Add sentiment features
if 'sentiment' in merged_enriched.columns:
    merged_enriched = create_sentiment_features(merged_enriched, windows=[3, 7])

print("✓ Features engineered successfully")
print(f"  New shape: {merged_enriched.shape}")
print(f"  New columns: {merged_enriched.shape[1] - merged_df.shape[1]}")

# Show sample of new features
new_cols = [col for col in merged_enriched.columns if col not in merged_df.columns]
print(f"\nNew feature columns: {new_cols[:10]}...")  # Show first 10

## 8. Pattern Discovery and Feature Engineering

Identify patterns and create engineered features for deeper analysis.
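
The feature helpers (`create_rolling_features` and friends) come from the project's `features` module, which is not shown in this notebook. As a rough idea of what a per-account rolling feature could look like, here is a sketch under stated assumptions: the function name `add_rolling_mean`, the `_roll_mean_{w}` column suffix, and the choice to shift by one row (so each feature only uses earlier days) are all illustrative and may differ from the real module.

```python
import pandas as pd


def add_rolling_mean(df, column, windows=(3, 7), group_col="account"):
    """Add trailing rolling means of `column`, computed separately per group.

    Hypothetical stand-in for a rolling-feature helper. Assumes a `date` column.
    """
    out = df.sort_values([group_col, "date"]).copy()
    for w in windows:
        # shift(1) excludes the current row, so the feature cannot leak
        # the same-day value it might later be used to predict
        out[f"{column}_roll_mean_{w}"] = out.groupby(group_col)[column].transform(
            lambda s: s.shift(1).rolling(w, min_periods=1).mean()
        )
    return out


# Tiny made-up example: two accounts with daily PnL
toy = pd.DataFrame({
    "account": ["A", "A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-01", "2024-01-02"]
    ),
    "pnl": [10.0, 20.0, 30.0, 40.0, 1.0, 3.0],
})
rolled = add_rolling_mean(toy, "pnl", windows=(2,))
print(rolled)
```

The first row of each account has no history, so its rolling value is NaN by design.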

In [None]:
# Leverage analysis by sentiment
leverage_cols = [col for col in merged_df.columns if 'leverage' in col.lower()]

if leverage_cols and 'sentiment' in merged_df.columns:
    leverage_col = leverage_cols[0]
    leverage_by_sent = analyzer.leverage_by_sentiment(leverage_col)
    
    print("\n" + "="*60)
    print("LEVERAGE USAGE BY SENTIMENT")
    print("="*60)
    display(leverage_by_sent)
    
    # Visualize
    plt.figure(figsize=(10, 6))
    merged_df.boxplot(column=leverage_col, by='sentiment')
    plt.title('Leverage Distribution by Market Sentiment', fontsize=14, fontweight='bold')
    plt.suptitle('')  # Remove default title
    plt.xlabel('Sentiment', fontsize=12)
    plt.ylabel('Leverage', fontsize=12)
    plt.xticks(rotation=45)
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("⚠️ Leverage data not available")

In [None]:
# Lag analysis - does past sentiment predict current performance?
if pnl_col:
    lag_results = analyzer.lag_analysis(pnl_col, max_lag=7)
    
    print("\n" + "="*60)
    print("LAG ANALYSIS: Past Sentiment vs Current Performance")
    print("="*60)
    display(lag_results)
    
    # Visualize lag correlations
    plt.figure(figsize=(10, 6))
    plt.bar(lag_results['lag_days'], lag_results['correlation'], edgecolor='black', alpha=0.7)
    plt.axhline(0, color='red', linestyle='--', alpha=0.5)
    plt.xlabel('Lag (Days)', fontsize=12)
    plt.ylabel('Correlation with PnL', fontsize=12)
    plt.title('Sentiment Lag Effect on Trading Performance', fontsize=14, fontweight='bold')
    plt.grid(alpha=0.3)
    
    # Annotate with significance
    for idx, row in lag_results.iterrows():
        if row['p_value'] < 0.05:
            plt.text(row['lag_days'], row['correlation'], '**', ha='center', va='bottom', fontsize=14, color='red')
    
    plt.tight_layout()
    plt.show()
    print("\n** indicates statistically significant (p < 0.05)")

In [None]:
# Correlation analysis with p-values
if pnl_col and 'sentiment_numeric' in merged_df.columns:
    metrics_to_analyze = [col for col in merged_df.columns if any(keyword in col.lower() 
                          for keyword in ['pnl', 'size', 'leverage', 'win'])]
    
    if metrics_to_analyze:
        corr_results = analyzer.correlation_analysis(metrics_to_analyze)
        
        print("\n" + "="*60)
        print("CORRELATION ANALYSIS: Sentiment vs Metrics")
        print("="*60)
        display(corr_results.sort_values('correlation', key=abs, ascending=False))
        print("="*60)

In [None]:
# Correlation matrix - numeric columns only
numeric_cols = merged_df.select_dtypes(include=[np.number]).columns.tolist()

# Filter to relevant columns
relevant_cols = [col for col in numeric_cols if any(keyword in col.lower() 
                 for keyword in ['pnl', 'sentiment', 'leverage', 'size', 'win'])]

if len(relevant_cols) > 2:
    corr_matrix = merged_df[relevant_cols].corr()
    
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, 
                square=True, linewidths=1, cbar_kws={"shrink": 0.8})
    plt.title('Correlation Heatmap: Sentiment vs Performance Metrics', fontsize=16, fontweight='bold', pad=20)
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()
else:
    print("⚠️ Not enough numeric columns for correlation analysis")

In [None]:
# Performance by sentiment category
from analysis import SentimentPerformanceAnalyzer

if pnl_col and 'sentiment' in merged_df.columns:
    analyzer = SentimentPerformanceAnalyzer(merged_df)
    
    perf_by_sentiment = analyzer.performance_by_sentiment(pnl_col)
    
    print("="*60)
    print("PERFORMANCE BY SENTIMENT CATEGORY")
    print("="*60)
    display(perf_by_sentiment)
    print("="*60)

## 7. Correlation Analysis: Sentiment vs Performance

Analyze the relationship between market sentiment and trading performance.
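
The lag analysis in this section is delegated to `analyzer.lag_analysis`, whose code lives in the `analysis` module and is not reproduced here. The sketch below shows one way such a table could be computed, assuming one row per day sorted by date; the function name `lag_correlations` and its default column names are hypothetical, while the output columns `lag_days`, `correlation`, and `p_value` mirror the ones the plotting cell reads.

```python
import pandas as pd
from scipy import stats


def lag_correlations(daily, pnl_col="pnl", sentiment_col="sentiment_numeric", max_lag=3):
    """Pearson correlation between PnL and sentiment from `lag` days earlier.

    Hypothetical helper; assumes `daily` has one row per day, in date order.
    """
    rows = []
    for lag in range(max_lag + 1):
        pair = pd.DataFrame({
            "x": daily[sentiment_col].shift(lag),  # sentiment `lag` rows earlier
            "y": daily[pnl_col],
        }).dropna()
        # pearsonr needs at least a few points and non-constant inputs
        if len(pair) < 3 or pair["x"].nunique() < 2 or pair["y"].nunique() < 2:
            continue
        r, p = stats.pearsonr(pair["x"], pair["y"])
        rows.append({"lag_days": lag, "correlation": float(r), "p_value": float(p), "n": len(pair)})
    return pd.DataFrame(rows)


# Made-up series where PnL follows the previous day's sentiment exactly
sentiment_values = [-2, -1, 0, 1, 2, 1, 0, -1, -2, -1]
daily_toy = pd.DataFrame({
    "sentiment_numeric": sentiment_values,
    "pnl": [0.0] + [5.0 * s for s in sentiment_values[:-1]],
})
lag_table = lag_correlations(daily_toy)
print(lag_table)
```

Testing several lags at once inflates the chance of a false positive, so a single significant lag is better treated as a lead to investigate than as a finding.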

In [None]:
# Account-level performance (if account column exists)
if 'account' in merged_df.columns and pnl_col:
    account_performance = merged_df.groupby('account').agg({
        pnl_col: ['sum', 'mean', 'count'],
    }).round(2)
    
    account_performance.columns = ['Total_PnL', 'Avg_PnL', 'Trade_Count']
    account_performance = account_performance.sort_values('Total_PnL', ascending=False)
    
    print("\nTop 10 Traders by Total PnL:")
    display(account_performance.head(10))
    
    print("\nBottom 10 Traders by Total PnL:")
    display(account_performance.tail(10))
    
    # Distribution of account performance
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    axes[0].hist(account_performance['Total_PnL'], bins=50, edgecolor='black', alpha=0.7)
    axes[0].set_title('Distribution of Total PnL by Account', fontsize=14, fontweight='bold')
    axes[0].set_xlabel('Total PnL ($)')
    axes[0].set_ylabel('Number of Accounts')
    axes[0].axvline(0, color='red', linestyle='--', label='Break-even')
    axes[0].legend()
    axes[0].grid(alpha=0.3)
    
    axes[1].scatter(account_performance['Trade_Count'], account_performance['Total_PnL'], alpha=0.6)
    axes[1].set_title('Trade Count vs Total PnL', fontsize=14, fontweight='bold')
    axes[1].set_xlabel('Number of Trades')
    axes[1].set_ylabel('Total PnL ($)')
    axes[1].axhline(0, color='red', linestyle='--', alpha=0.5)
    axes[1].grid(alpha=0.3)
    
    plt.tight_layout()
    plt.show()

In [None]:
# Overall performance metrics
if pnl_col:
    total_pnl = merged_df[pnl_col].sum()
    win_rate = (merged_df[pnl_col] > 0).sum() / len(merged_df) * 100
    avg_win = merged_df[merged_df[pnl_col] > 0][pnl_col].mean()
    avg_loss = merged_df[merged_df[pnl_col] < 0][pnl_col].mean()
    gross_profit = merged_df.loc[merged_df[pnl_col] > 0, pnl_col].sum()
    gross_loss = merged_df.loc[merged_df[pnl_col] < 0, pnl_col].sum()
    profit_factor = abs(gross_profit / gross_loss) if gross_loss != 0 else np.inf
    
    print("="*60)
    print("OVERALL PERFORMANCE METRICS")
    print("="*60)
    print(f"Total PnL: ${total_pnl:,.2f}")
    print(f"Win Rate: {win_rate:.2f}%")
    print(f"Average Win: ${avg_win:,.2f}")
    print(f"Average Loss: ${avg_loss:,.2f}")
    print(f"Profit Factor: {profit_factor:.2f}")
    print(f"Risk/Reward Ratio: {abs(avg_win/avg_loss):.2f}" if avg_loss != 0 else "Risk/Reward: N/A")
    print("="*60)

## 6. Trader Performance Metrics

Calculate and visualize key performance indicators for traders.
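
The metrics in this section (win rate, average win and loss, profit factor) can be bundled into one reusable function. The sketch below is an illustration rather than the notebook's code: `summarize_pnl` is a hypothetical name, and unlike the cell in this section it drops missing PnL values before computing the win rate.

```python
import pandas as pd


def summarize_pnl(pnl: pd.Series) -> dict:
    """Basic performance metrics for a series of per-record PnL values.

    Hypothetical helper for illustration only.
    """
    pnl = pnl.dropna()
    wins = pnl[pnl > 0]
    losses = pnl[pnl < 0]
    gross_loss = -losses.sum()  # positive number when there are losing records

    return {
        "total_pnl": float(pnl.sum()),
        "win_rate_pct": float((pnl > 0).mean() * 100) if len(pnl) else float("nan"),
        "avg_win": float(wins.mean()) if len(wins) else float("nan"),
        "avg_loss": float(losses.mean()) if len(losses) else float("nan"),
        # Profit factor = gross profit / gross loss; infinite when nothing was lost
        "profit_factor": float(wins.sum() / gross_loss) if gross_loss > 0 else float("inf"),
    }


# Made-up PnL values, including one break-even record
metrics = summarize_pnl(pd.Series([100.0, -50.0, 25.0, -25.0, 0.0]))
print(metrics)
```

Break-even records count toward the denominator of the win rate but are neither wins nor losses, which matches the `> 0` and `< 0` comparisons used in the notebook.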

In [None]:
# Sentiment frequency and duration analysis
if 'sentiment' in sentiment_clean.columns:
    # Calculate sentiment streaks (consecutive days)
    sentiment_clean['sentiment_change'] = (sentiment_clean['sentiment'] != sentiment_clean['sentiment'].shift()).astype(int)
    sentiment_clean['streak_id'] = sentiment_clean['sentiment_change'].cumsum()
    
    streak_lengths = sentiment_clean.groupby(['streak_id', 'sentiment']).size().reset_index(name='duration')
    
    print("Sentiment Streak Statistics:")
    print(streak_lengths.groupby('sentiment')['duration'].describe())
    
    # Visualize streak durations
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    for sentiment_val in streak_lengths['sentiment'].unique():
        subset = streak_lengths[streak_lengths['sentiment'] == sentiment_val]['duration']
        axes[0].hist(subset, alpha=0.6, label=sentiment_val, bins=20, edgecolor='black')
    
    axes[0].set_title('Distribution of Sentiment Streak Durations', fontsize=14, fontweight='bold')
    axes[0].set_xlabel('Days in Streak')
    axes[0].set_ylabel('Frequency')
    axes[0].legend()
    axes[0].grid(alpha=0.3)
    
    # Pie chart of total days by sentiment
    sentiment_totals = sentiment_clean['sentiment'].value_counts()
    axes[1].pie(sentiment_totals.values, labels=sentiment_totals.index, autopct='%1.1f%%', startangle=90)
    axes[1].set_title('Proportion of Days by Sentiment', fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    plt.show()

In [None]:
# Sentiment over time - line chart with numeric encoding
if 'sentiment_numeric' in sentiment_clean.columns:
    fig = px.line(sentiment_clean, x='date', y='sentiment_numeric', 
                  title='Bitcoin Market Sentiment Over Time',
                  labels={'sentiment_numeric': 'Sentiment (-2=Extreme Fear, 2=Extreme Greed)', 'date': 'Date'},
                  hover_data=['sentiment'])
    
    fig.add_hline(y=0, line_dash="dash", line_color="gray", annotation_text="Neutral")
    fig.update_layout(height=500, hovermode='x unified')
    fig.show()
else:
    print("⚠️ Sentiment numeric encoding not available")

## 5. Sentiment Distribution Analysis

Deep dive into how sentiment (Fear/Greed) is distributed over time.

In [None]:
# Distribution of key metrics
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

if pnl_col:
    # PnL distribution
    axes[0, 0].hist(merged_df[pnl_col].dropna(), bins=50, edgecolor='black', alpha=0.7)
    axes[0, 0].axvline(merged_df[pnl_col].mean(), color='red', linestyle='--', label=f'Mean: ${merged_df[pnl_col].mean():.2f}')
    axes[0, 0].set_title('PnL Distribution', fontsize=14, fontweight='bold')
    axes[0, 0].set_xlabel('PnL ($)')
    axes[0, 0].set_ylabel('Frequency')
    axes[0, 0].legend()
    axes[0, 0].grid(alpha=0.3)

# Sentiment distribution
if 'sentiment' in merged_df.columns:
    sentiment_counts = merged_df['sentiment'].value_counts()
    axes[0, 1].bar(sentiment_counts.index, sentiment_counts.values, edgecolor='black', alpha=0.7)
    axes[0, 1].set_title('Sentiment Distribution', fontsize=14, fontweight='bold')
    axes[0, 1].set_xlabel('Sentiment')
    axes[0, 1].set_ylabel('Count')
    axes[0, 1].grid(alpha=0.3)
    axes[0, 1].tick_params(axis='x', rotation=45)

# Trade count distribution over time
if 'date' in merged_df.columns and pnl_col:
    daily_trades = merged_df.groupby('date').size()
    axes[1, 0].plot(daily_trades.index, daily_trades.values, alpha=0.7)
    axes[1, 0].set_title('Daily Trading Activity', fontsize=14, fontweight='bold')
    axes[1, 0].set_xlabel('Date')
    axes[1, 0].set_ylabel('Number of Records')
    axes[1, 0].grid(alpha=0.3)
    axes[1, 0].tick_params(axis='x', rotation=45)

# Cumulative PnL over time
if pnl_col and 'date' in merged_df.columns:
    cumulative_pnl = merged_df.groupby('date')[pnl_col].sum().cumsum()
    axes[1, 1].plot(cumulative_pnl.index, cumulative_pnl.values, linewidth=2, alpha=0.8)
    axes[1, 1].set_title('Cumulative PnL Over Time', fontsize=14, fontweight='bold')
    axes[1, 1].set_xlabel('Date')
    axes[1, 1].set_ylabel('Cumulative PnL ($)')
    axes[1, 1].grid(alpha=0.3)
    axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Summary statistics for numeric columns
print("="*60)
print("SUMMARY STATISTICS")
print("="*60)
display(merged_df.describe())

## 4. Exploratory Data Analysis (EDA)

Summary statistics and initial exploration of the merged dataset.

In [None]:
# Check merge quality - how many records matched?
print("Merge Quality Check:")
print(f"Records with sentiment data: {merged_df['sentiment'].notna().sum():,} ({merged_df['sentiment'].notna().sum()/len(merged_df)*100:.1f}%)")
print(f"Records without sentiment: {merged_df['sentiment'].isna().sum():,}")

# Show sentiment distribution in merged data
if 'sentiment' in merged_df.columns:
    print("\nSentiment Distribution in Merged Data:")
    print(merged_df['sentiment'].value_counts())
    print(f"\nSentiment Numeric Distribution:")
    print(merged_df['sentiment_numeric'].value_counts().sort_index())

In [None]:
# Identify PnL column for analysis
# A case-insensitive match on 'pnl' already covers names such as 'closedPnL'
pnl_columns = [col for col in merged_df.columns if 'pnl' in col.lower()]
print(f"PnL columns found: {pnl_columns}")

if pnl_columns:
    pnl_col = pnl_columns[0]  # Use first PnL column
    print(f"\nUsing '{pnl_col}' as main PnL metric")
    
    # Basic PnL stats
    print(f"\nPnL Statistics:")
    print(f"  Total PnL: ${merged_df[pnl_col].sum():,.2f}")
    print(f"  Mean PnL: ${merged_df[pnl_col].mean():,.2f}")
    print(f"  Median PnL: ${merged_df[pnl_col].median():,.2f}")
    print(f"  Std Dev: ${merged_df[pnl_col].std():,.2f}")
    print(f"  Win Rate: {(merged_df[pnl_col] > 0).sum() / len(merged_df) * 100:.2f}%")
else:
    pnl_col = None
    print("⚠️ No PnL column found. Some analyses may be limited.")

In [None]:
# Merge trading performance with sentiment (including lags)
merged_df = preprocessor.merge_with_sentiment(
    daily_performance, 
    sentiment_clean,
    lag_days=[0, 1, 3, 7]
)

print("\n" + "="*60)
print("✓ Datasets merged successfully")
print(f"  Total records: {len(merged_df):,}")
print(f"  Columns: {len(merged_df.columns)}")
print(f"  Date range: {merged_df['date'].min()} to {merged_df['date'].max()}")
print("="*60)

display(merged_df.head(15))

## 3. Merge Datasets on Date/Time

Merge sentiment data with trader performance data, including lagged sentiment features.
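
The merge itself is handled by `preprocessor.merge_with_sentiment`, which is defined in the project's `src/` code and not shown in this notebook. As a rough sketch of what a lagged merge could look like, assuming daily sentiment rows with `date` and `sentiment_numeric` columns (the function name `merge_with_lagged_sentiment` and the `_lag{n}` column suffix are hypothetical):

```python
import pandas as pd


def merge_with_lagged_sentiment(perf, sentiment, lag_days=(0, 1, 3, 7)):
    """Left-join daily performance onto sentiment and add lagged sentiment columns.

    Hypothetical stand-in for the notebook's merge step.
    """
    sent = sentiment.sort_values("date").set_index("date")

    # Reindex to a gap-free daily calendar so shifting by N rows means N days
    full_range = pd.date_range(sent.index.min(), sent.index.max(), freq="D")
    sent = sent.reindex(full_range)

    for lag in lag_days:
        if lag > 0:
            sent[f"sentiment_numeric_lag{lag}"] = sent["sentiment_numeric"].shift(lag)

    sent = sent.rename_axis("date").reset_index()
    # how="left" keeps every performance row, even on days without sentiment
    return perf.merge(sent, on="date", how="left")


# Made-up inputs: four days of sentiment, two days of trading for one account
sentiment_toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
    "sentiment_numeric": [-2, -1, 0, 1],
})
perf_toy = pd.DataFrame({
    "account": ["A", "A"],
    "date": pd.to_datetime(["2024-01-02", "2024-01-04"]),
    "pnl": [12.5, -3.0],
})
merged_toy = merge_with_lagged_sentiment(perf_toy, sentiment_toy, lag_days=(0, 1))
print(merged_toy)
```

Lagging on the sentiment calendar before merging matters: shifting after the merge would step back by one trading row rather than one calendar day.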

In [None]:
# Check for missing values after preprocessing
print("Missing Values in Historical Data:")
missing_hist = historical_clean.isnull().sum()
print(missing_hist[missing_hist > 0] if (missing_hist > 0).any() else "None")

print("\nMissing Values in Sentiment Data:")
missing_sent = sentiment_clean.isnull().sum()
print(missing_sent[missing_sent > 0] if (missing_sent > 0).any() else "None")

In [None]:
# Aggregate trading data to daily account-level metrics
daily_performance = preprocessor.aggregate_daily_performance(historical_clean)

print("\n" + "="*60)
print("✓ Trading data aggregated to daily performance")
display(daily_performance.head(15))

In [None]:
# Preprocess historical trading data
historical_clean = preprocessor.preprocess_historical(historical_df)

print("\n" + "="*60)
print("✓ Historical data preprocessed")
print(f"  Date range: {historical_clean['date'].min()} to {historical_clean['date'].max()}")
print(f"  Total trades: {len(historical_clean):,}")
display(historical_clean.head(10))

In [None]:
# Preprocess both datasets
from preprocessing import DataPreprocessor, create_sentiment_numeric_encoding

preprocessor = DataPreprocessor()

# Preprocess sentiment data
sentiment_clean = preprocessor.preprocess_sentiment(sentiment_df)
sentiment_clean = create_sentiment_numeric_encoding(sentiment_clean)

print("\n" + "="*60)
print("✓ Sentiment data preprocessed")
print(f"  Date range: {sentiment_clean['date'].min()} to {sentiment_clean['date'].max()}")
print(f"  Total days: {len(sentiment_clean)}")
display(sentiment_clean.head(10))

## 2. Data Preprocessing and Cleaning

Clean both datasets, handle missing values, and standardize column formats.
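
Part of this step is turning sentiment labels into numbers via `create_sentiment_numeric_encoding`, which is imported from the `preprocessing` module and not shown here. The sketch below illustrates one plausible encoding. Only the endpoints (-2 for Extreme Fear, 2 for Extreme Greed) appear in the notebook's chart label; the intermediate values, the label spellings, and the function name `encode_sentiment` are assumptions.

```python
import pandas as pd

# Assumed five-level scale; only the -2 and 2 endpoints are stated in the notebook
SENTIMENT_SCALE = {
    "Extreme Fear": -2,
    "Fear": -1,
    "Neutral": 0,
    "Greed": 1,
    "Extreme Greed": 2,
}


def encode_sentiment(df: pd.DataFrame, label_col: str = "sentiment") -> pd.DataFrame:
    """Return a copy with a `sentiment_numeric` column; unknown labels become NaN.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    # Normalise whitespace and capitalisation before looking up the label
    labels = out[label_col].str.strip().str.title()
    out["sentiment_numeric"] = labels.map(SENTIMENT_SCALE)
    return out


# Made-up rows, including a lowercase label and an unrecognised one
example = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "sentiment": ["Extreme Fear", "greed", "Unknown"],
})
encoded = encode_sentiment(example)
print(encoded)
```

Leaving unknown labels as NaN, rather than mapping them to 0, keeps them from being mistaken for Neutral in later correlations.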

In [None]:
# Data types and memory usage
print("HISTORICAL DATA INFO:")
print(historical_df.info())

print("\n" + "="*60)
print("\nSENTIMENT DATA INFO:")
print(sentiment_df.info())

In [None]:
# Display basic information about the datasets
print("="*60)
print("HISTORICAL TRADING DATA")
print("="*60)
print(f"Shape: {historical_df.shape}")
print(f"\nColumns ({len(historical_df.columns)}):")
print(historical_df.columns.tolist())
print(f"\nFirst few rows:")
display(historical_df.head())

print("\n" + "="*60)
print("SENTIMENT DATA (Fear/Greed Index)")
print("="*60)
print(f"Shape: {sentiment_df.shape}")
print(f"\nColumns ({len(sentiment_df.columns)}):")
print(sentiment_df.columns.tolist())
print(f"\nFirst few rows:")
display(sentiment_df.head())

In [None]:
# Load datasets using custom data loader
from data_loader import DataLoader

loader = DataLoader(data_dir='../data')

try:
    historical_df, sentiment_df = loader.load_all()
    print("\n" + "="*60)
    print("✓ Datasets loaded successfully!")
    print("="*60)
except FileNotFoundError as e:
    print("\n⚠️ ERROR: Datasets not found!")
    print("\n📥 NEXT STEPS:")
    print("1. Download 'historical_data.csv' from:")
    print("   https://drive.google.com/file/d/1IAfLZwu6rJzyWKgBToqwSmmVYU6VbjVs/view")
    print("\n2. Download 'fear_greed_index.csv' from:")
    print("   https://drive.google.com/file/d/1PgQC0tO8XN-wqkNyghWc_-mnrYv_nhSf/view")
    print("\n3. Place both files in the '../data/' directory")
    print("\n4. Re-run this cell")
    print("="*60)
    raise

In [None]:
# Validate data quality
validation_hist = loader.validate_data(historical_df, "Historical Trading Data")
validation_sent = loader.validate_data(sentiment_df, "Sentiment Data")

## 1. Load and Explore Datasets

First, we'll import necessary libraries and load both datasets.