# Kalshi Sentiment Analysis

This notebook analyzes the correlation between public sentiment and Kalshi prediction market prices. The demo below runs on generated sample data rather than live Kalshi or social media data.

## Project Overview
We aim to determine whether public sentiment (from social media, news, etc.) correlates with or predicts movements in Kalshi prediction market prices.

## 1. Setup and Imports

In [None]:
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

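# Silence all library warnings to keep the notebook output readable.
# Comment this out while debugging, since it also hides real problems.
warnings.filterwarnings('ignore')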

# Import our modules
from kalshi_api import KalshiDataCollector, fetch_sample_markets
from sentiment_analyzer import SentimentAnalyzer, create_sample_text_data
from data_processor import DataProcessor
from statistical_analysis import StatisticalAnalyzer
from visualizations import Visualizer

print("✓ All modules imported successfully")

## 2. Data Collection

### 2.1 Collect Market Price Data

In [None]:
# Initialize data collector
collector = KalshiDataCollector()

# For this demo, we'll use sample data
# In a real scenario, you would fetch actual Kalshi data
market_name = "Will Biden win the 2024 Presidential Election?"
market_df = collector.create_sample_market_data(market_name, days=30)

print(f"Collected {len(market_df)} days of market data")
market_df.head()

### 2.2 Collect and Analyze Sentiment Data

In [None]:
# Create sample text data (in a real scenario, use web scraping or existing datasets)
text_df = create_sample_text_data(market_name, days=30)

print(f"Collected {len(text_df)} text samples")
print("\nSample texts:")
text_df.head()

In [None]:
# Analyze sentiment
analyzer = SentimentAnalyzer()
sentiment_df = analyzer.analyze_dataframe(text_df)

print("Sentiment analysis complete!")
sentiment_df[['date', 'text', 'sentiment_label', 'sentiment_normalized']].head(10)
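
The notebook does not show how `SentimentAnalyzer` derives the `sentiment_normalized` column. As a rough sketch of one plausible scheme, assume a classifier returns a label and a confidence in [0, 1], and that the normalized score is the confidence signed by the label. The function name `normalize_sentiment` and the label set are hypothetical, not taken from the project code.

```python
def normalize_sentiment(label, confidence):
    """Map a (label, confidence) pair to a signed score in [-1, 1].

    Assumption: labels are 'positive', 'neutral', or 'negative', and
    confidence is the classifier's probability for that label.
    """
    signs = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}
    if label not in signs:
        raise ValueError(f"unknown label: {label!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return signs[label] * confidence


# Example: three hypothetical classifier outputs
for label, conf in [("positive", 0.9), ("negative", 0.75), ("neutral", 0.6)]:
    print(label, normalize_sentiment(label, conf))
```

A signed score like this can be averaged per day and compared directly with price movements, which is why a single numeric column is convenient for the later correlation steps.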

## 3. Data Processing and Alignment

In [None]:
# Process and combine data
processor = DataProcessor()
combined_df = processor.prepare_analysis_dataset(market_df, sentiment_df)

print("Data processing complete!")
print(f"\nCombined dataset shape: {combined_df.shape}")
print(f"\nColumns: {combined_df.columns.tolist()}")
combined_df.head()

In [None]:
# Get summary statistics
summary = processor.get_summary_statistics(combined_df)
print("\nSummary Statistics:")
summary
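
How `prepare_analysis_dataset` aligns the two sources is not shown in this notebook. The sketch below illustrates one common approach under stated assumptions: several text samples per day are averaged into one daily score, and only dates present in both sources are kept. The function name `align_daily` and the tuple layout are hypothetical, and plain Python is used so the example runs without the project modules.

```python
from collections import defaultdict
from statistics import mean


def align_daily(prices, sentiments):
    """Join daily prices with the mean sentiment of the same day.

    prices:     list of (date_string, price), one entry per day
    sentiments: list of (date_string, score), possibly many per day
    Returns a list of (date_string, price, mean_score) for days that
    appear in both inputs (an inner join on the date).
    """
    scores_by_day = defaultdict(list)
    for day, score in sentiments:
        scores_by_day[day].append(score)

    return [
        (day, price, mean(scores_by_day[day]))
        for day, price in prices
        if day in scores_by_day
    ]


# Example with made-up values: the third day has no text samples and is dropped
prices = [("2024-01-01", 0.52), ("2024-01-02", 0.55), ("2024-01-03", 0.53)]
sentiments = [("2024-01-01", 0.2), ("2024-01-01", 0.4), ("2024-01-02", -0.1)]
for row in align_daily(prices, sentiments):
    print(row)
```

An inner join keeps the later statistics simple, at the cost of discarding days with no text. Forward-filling sentiment is an alternative, but it can inflate apparent correlations.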

## 4. Statistical Analysis

### 4.1 Correlation Analysis

In [None]:
# Run comprehensive statistical analysis
stat_analyzer = StatisticalAnalyzer()
results = stat_analyzer.calculate_metrics(combined_df)

print("Statistical analysis complete!")
print(f"\nCorrelation: {results['correlation']['correlation']:.3f}")
print(f"P-value: {results['correlation']['p_value']:.4f}")
print(f"Significant: {results['correlation']['significant']}")

### 4.2 Lead-Lag Analysis

In [None]:
# Display lead-lag results
print("Lead-Lag Analysis:")
results['lead_lag']

In [None]:
# Find strongest relationship
print("\nStrongest Relationship:")
print(results['strongest_relationship']['interpretation'])
print(f"Correlation: {results['strongest_relationship']['correlation']:.3f}")
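
To make the lead-lag idea concrete, here is a minimal sketch that shifts one series against the other and computes the Pearson correlation at each lag. It is an illustration, not the `StatisticalAnalyzer` implementation. The function names are hypothetical, and the sign convention (negative lag means sentiment precedes price) is assumed to match the one used in the Key Findings cell of this notebook.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def lead_lag(sentiment, price, max_lag=3):
    """Correlate sentiment[t + lag] with price[t] for each lag.

    lag < 0 pairs earlier sentiment with later price (sentiment leads);
    lag > 0 pairs earlier price with later sentiment (price leads).
    """
    if len(sentiment) != len(price):
        raise ValueError("series must have the same length")
    n = len(price)
    results = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag < 0:
            s, p = sentiment[:n + lag], price[-lag:]
        elif lag > 0:
            s, p = sentiment[lag:], price[:n - lag]
        else:
            s, p = sentiment, price
        results[lag] = pearson(s, p)
    return results


# Made-up data: price follows yesterday's sentiment, so lag -1 should stand out
sentiment = [0.1, 0.5, -0.2, 0.3, 0.8, -0.4, 0.0, 0.6]
price = [0.5] + [0.5 + 0.1 * s for s in sentiment[:-1]]
for lag, r in lead_lag(sentiment, price, max_lag=2).items():
    print(f"lag {lag:+d}: r = {r:.3f}")
```

Each shift shortens the overlapping window, so correlations at large lags rest on fewer points. With only 30 days of data, small differences between lags should be read cautiously.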

### 4.3 Granger Causality Test

In [None]:
# Check if sentiment Granger-causes price
if 'error' not in results['sentiment_causes_price']:
    print("Sentiment → Price:")
    print(results['sentiment_causes_price']['interpretation'])
else:
    print("Granger test error:", results['sentiment_causes_price']['error'])

### 4.4 Generate Full Report

In [None]:
# Generate and display comprehensive report
report = stat_analyzer.generate_report(results)
print(report)

## 5. Visualizations

In [None]:
# Initialize visualizer
viz = Visualizer(output_dir='../outputs/')

### 5.1 Time Series Plot

In [None]:
viz.plot_time_series(
    combined_df,
    title=f"{market_name}: Price vs Sentiment",
    save_name="time_series.png"
)

### 5.2 Scatter Plot with Correlation

In [None]:
viz.plot_scatter(
    combined_df,
    title="Sentiment vs Market Price Correlation",
    save_name="scatter.png"
)

### 5.3 Lead-Lag Analysis Plot

In [None]:
viz.plot_lead_lag(
    results['lead_lag'],
    title="Lead-Lag Correlation Analysis",
    save_name="lead_lag.png"
)

### 5.4 Comprehensive Dashboard

In [None]:
viz.create_dashboard(
    combined_df,
    results['lead_lag'],
    results,
    market_name=market_name,
    save_name="dashboard.png"
)

## 6. Key Findings and Conclusions

In [None]:
print("=" * 70)
print("KEY FINDINGS")
print("=" * 70)

# Correlation
corr = results['correlation']
if corr['significant']:
    direction = "positive" if corr['correlation'] > 0 else "negative"
    print(f"\n1. There is a significant {direction} correlation ({corr['correlation']:.3f})")
    print(f"   between sentiment and market prices (p={corr['p_value']:.4f})")
else:
    print(f"\n1. No significant correlation found (p={corr['p_value']:.4f})")

# Lead-lag
strongest = results['strongest_relationship']
print(f"\n2. {strongest['interpretation']}")
print(f"   Correlation: {strongest['correlation']:.3f}")

# Sign convention used here: a negative lag means sentiment precedes price
if strongest['lag'] < 0:
    print("   → Sentiment may be a LEADING indicator of price")
elif strongest['lag'] > 0:
    print("   → Price may be a LEADING indicator of sentiment")
else:
    print("   → Sentiment and price move together contemporaneously")

# Regression
if 'error' not in results['regression']:
    r2 = results['regression']['r2']
    print(f"\n3. Regression R² = {r2:.3f}")
    print(f"   Sentiment explains {r2*100:.1f}% of price variance")

print("\n" + "=" * 70)
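
For reference, the R² reported above can be understood through a simple least-squares fit of price on sentiment. The sketch below assumes a single-predictor linear regression with an intercept. Whether `StatisticalAnalyzer` fits exactly this model is not shown in the notebook, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def r_squared(x, y):
    """Share of the variance in y captured by the fitted line."""
    slope, intercept = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Made-up sentiment scores and prices
sentiment = [0.0, 1.0, 2.0, 3.0]
price = [1.0, 3.0, 2.0, 4.0]
print(f"R^2 = {r_squared(sentiment, price):.2f}")
```

In this single-predictor case R² equals the squared Pearson correlation, so it adds no information beyond the correlation in finding 1. It is also an in-sample measure of association: "explains" here does not imply that sentiment causes price changes.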

## 7. Save Results

In [None]:
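import os

# Make sure the target directories exist before writing
os.makedirs('../data/processed', exist_ok=True)
os.makedirs('../outputs', exist_ok=True)

# Save combined dataset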
combined_df.to_csv('../data/processed/combined_analysis.csv', index=False)
print("✓ Saved combined dataset")

# Save report
# Write as UTF-8 in case the report contains non-ASCII symbols
with open('../outputs/analysis_report.txt', 'w', encoding='utf-8') as f:
    f.write(report)
print("✓ Saved analysis report")

# Save lead-lag results
results['lead_lag'].to_csv('../outputs/lead_lag_results.csv', index=False)
print("✓ Saved lead-lag results")

print("\nResults saved to ../data/processed/ and ../outputs/")

## Next Steps

1. **Expand Data Sources**: Add real Twitter, Reddit, and news data
2. **More Markets**: Analyze multiple Kalshi markets
3. **Advanced Models**: Try more sophisticated sentiment models or fine-tuning
4. **Prediction**: Build a predictive model using sentiment features
5. **Real-time**: Set up a pipeline for real-time analysis