# Week 18: Alternative Data - Satellites, Geo, Web Scraping

## 🎯 Learning Objectives

By the end of this week, you will understand:
- **Alternative Data Types**: Non-traditional data sources
- **Web Scraping**: Extracting data from websites
- **Geolocation Data**: Foot traffic, shipping
- **Data Quality**: Validation and cleaning

---

## Why Alternative Data?

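Traditional data (prices, financial statements) is available to every market participant, so signals built on it are crowded. Alternative data can offer an edge through:
- **Timing**: Estimate results before the earnings release
- **Granularity**: Store-level detail instead of company-level totals
- **Unique insights**: Information that most participants do not have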

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
print("✅ Libraries loaded!")
print("📚 Week 18: Alternative Data")

---

## Part 1: Types of Alternative Data

### Categories

| Type | Examples | Use Case |
|------|----------|----------|
| Satellite | Parking lots, oil tanks | Retail, energy |
| Geolocation | Foot traffic, ships | Retail, commodities |
| Web/Social | Reviews, job postings | Sentiment, growth |
| Transactions | Credit cards, receipts | Revenue nowcasting |
| Sensor | IoT, weather | Agriculture, energy |

In [None]:
# Simulate satellite parking lot data
n_weeks = 52
dates = pd.date_range('2023-01-01', periods=n_weeks, freq='W')

# True company revenue (what we want to predict)
true_revenue = 1000 + np.cumsum(np.random.randn(n_weeks) * 50) + 100 * np.sin(np.linspace(0, 4*np.pi, n_weeks))

# Satellite-observed parking lot occupancy: a noisy, real-time proxy for true revenue
# (it leads the *reported* figure below, not true revenue itself)
parking_occupancy = 0.6 + 0.3 * (true_revenue - true_revenue.mean()) / true_revenue.std() + np.random.randn(n_weeks) * 0.1
parking_occupancy = np.clip(parking_occupancy, 0.2, 0.95)

# Quarterly reported revenue (lagged, what market sees)
quarterly_revenue = pd.Series(true_revenue).rolling(13).mean().shift(4)  # 4 week reporting lag

df_alt = pd.DataFrame({
    'date': dates,
    'true_revenue': true_revenue,
    'parking_occupancy': parking_occupancy,
    'reported_revenue': quarterly_revenue
}).set_index('date')

print("Satellite Data vs. Reported Financials")
print("="*50)
print(f"Correlation (True Revenue vs Parking): {np.corrcoef(true_revenue, parking_occupancy)[0,1]:.3f}")

In [None]:
# Visualize
fig, axes = plt.subplots(3, 1, figsize=(12, 8), sharex=True)

axes[0].plot(df_alt.index, df_alt['parking_occupancy'], 'b-', label='Parking Occupancy')
axes[0].set_ylabel('Occupancy Rate')
axes[0].legend()
axes[0].set_title('Satellite Data (Real-time)')

axes[1].plot(df_alt.index, df_alt['true_revenue'], 'g-', label='True Revenue')
axes[1].plot(df_alt.index, df_alt['reported_revenue'], 'r--', label='Reported (Lagged)', alpha=0.7)
axes[1].set_ylabel('Revenue')
axes[1].legend()
axes[1].set_title('Revenue: True vs. Reported')

# Information advantage: 4-week change in smoothed occupancy.
# Note: this uses only the parking series; reported revenue is not an input.
occupancy_4w = df_alt['parking_occupancy'].rolling(4).mean()
info_advantage = occupancy_4w - occupancy_4w.shift(4)
axes[2].bar(df_alt.index, info_advantage.fillna(0), color='purple', alpha=0.6)
axes[2].set_ylabel('Info Advantage')
axes[2].set_title('Trading Signal (4-Week Change in Smoothed Occupancy)')

plt.tight_layout()
plt.show()

---

## Part 2: Web Scraping Basics

### Key Libraries

- **requests**: HTTP requests
- **BeautifulSoup**: HTML parsing
- **Selenium**: Dynamic content
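
As a rough, offline sketch of the parsing step, the snippet below counts job listings per department from an inline HTML string. It uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup so that it runs without extra installs. The HTML fragment, the `job` class, and the `data-dept` attribute are made up for illustration; real careers pages will use different markup.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical careers-page fragment (not taken from any real site).
SAMPLE_HTML = """
<ul id="openings">
  <li class="job" data-dept="engineering">Backend Engineer</li>
  <li class="job" data-dept="engineering">Data Engineer</li>
  <li class="job" data-dept="sales">Account Executive</li>
  <li class="note">We are an equal opportunity employer</li>
</ul>
"""


class JobCounter(HTMLParser):
    """Count <li class="job"> elements by their data-dept attribute."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "li" and attr_map.get("class") == "job":
            self.counts[attr_map.get("data-dept", "unknown")] += 1


def count_jobs_by_department(html):
    """Parse an HTML string and return {department: number_of_postings}."""
    parser = JobCounter()
    parser.feed(html)
    return dict(parser.counts)


job_counts = count_jobs_by_department(SAMPLE_HTML)
print(job_counts)
```

With BeautifulSoup the same idea would typically be a selector query over the parsed tree; the counting logic stays the same.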

### Ethical Considerations

- Respect robots.txt
- Rate limiting
- Terms of service
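
The first two points can be sketched with the standard library alone. The example below parses an inline robots.txt with `urllib.robotparser`, skips disallowed URLs, and waits between requests. The robots.txt content, the `my-research-bot` user agent, the URLs, and the stub `fetch` function are all assumptions for illustration; no network request is made.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice it is fetched from the site.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_crawl(urls, rules, fetch, user_agent="my-research-bot",
                 default_delay=1.0, sleep=time.sleep):
    """Fetch only allowed URLs, pausing between requests."""
    # Use the site's Crawl-delay if it declares one, else a default pause.
    delay = rules.crawl_delay(user_agent) or default_delay
    results = {}
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # robots.txt disallows this path
        results[url] = fetch(url)
        sleep(delay)  # rate limiting
    return results


rules = build_robot_rules(ROBOTS_TXT)
waits = []  # record the pauses instead of actually sleeping in this demo
pages = polite_crawl(
    ["https://example.com/careers", "https://example.com/private/payroll"],
    rules,
    fetch=lambda url: f"<html>stub for {url}</html>",  # stand-in for an HTTP GET
    sleep=waits.append,
)
print(list(pages), waits)
```

Terms of service cannot be checked in code; they need to be read before any scraping starts.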

In [None]:
# Web scraping example (conceptual: returns simulated data instead of fetching a real URL)

def scrape_job_postings(company_name):
    """Conceptual job posting scraper"""
    # In practice, you would:
    # 1. Use requests to get the page
    # 2. Parse HTML with BeautifulSoup
    # 3. Extract job counts by department
    
    # Simulated data
    return {
        'company': company_name,
        'engineering': np.random.randint(50, 200),
        'sales': np.random.randint(20, 100),
        'research': np.random.randint(10, 50),
        'timestamp': pd.Timestamp.now()
    }

# Simulate tracking job postings over time
companies = ['TechCo', 'FinanceCorp', 'RetailInc']
job_data = []

# 'MS' (month start) works across pandas versions; the 'M' alias is deprecated since pandas 2.2
for date in pd.date_range('2023-01-01', periods=12, freq='MS'):
    for company in companies:
        data = scrape_job_postings(company)
        data['date'] = date
        job_data.append(data)

df_jobs = pd.DataFrame(job_data)
print("Job Posting Data Sample:")
print(df_jobs.head(10))

---

## Part 3: Data Quality Checks

### Common Issues

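1. **Missing data**: Gaps in coverage (e.g., cloud cover blocking satellite images)
2. **Survivorship bias**: The dataset only includes companies that survived, which overstates historical performance
3. **Look-ahead bias**: Using values that were not actually available at decision time (e.g., revised or backfilled data)
4. **Noise**: Low signal-to-noise ratio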
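
Look-ahead is the easiest of these to introduce by accident. One common guard is a point-in-time join: match each decision date to the latest observation that had already been delivered, not the latest one that had been recorded. The sketch below uses `pd.merge_asof`; the column names and the 3-day delivery delay are assumptions for illustration.

```python
import pandas as pd

# Hypothetical satellite readings: image date vs. vendor delivery date
# (delivery is assumed to lag the image by 3 days).
readings = pd.DataFrame({
    "observed_at": pd.to_datetime(["2023-01-02", "2023-01-09", "2023-01-16"]),
    "available_at": pd.to_datetime(["2023-01-05", "2023-01-12", "2023-01-19"]),
    "occupancy": [0.55, 0.62, 0.70],
})

# Dates on which a trading decision is made.
decisions = pd.DataFrame({
    "decision_date": pd.to_datetime(
        ["2023-01-04", "2023-01-10", "2023-01-13", "2023-01-20"]
    ),
})

# Naive join on the observation date: leaks data not yet delivered.
# (merge_asof requires both frames to be sorted by their join keys.)
naive = pd.merge_asof(
    decisions, readings,
    left_on="decision_date", right_on="observed_at", direction="backward",
)

# Point-in-time join on the delivery date: only uses what was available.
point_in_time = pd.merge_asof(
    decisions, readings,
    left_on="decision_date", right_on="available_at", direction="backward",
)

print(naive[["decision_date", "occupancy"]])
print(point_in_time[["decision_date", "occupancy"]])
```

On 2023-01-04 and 2023-01-10 the naive join returns readings that had been captured but not yet delivered, while the point-in-time join returns nothing or the older reading.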

In [None]:
def data_quality_report(df, time_col='date'):
    """Generate data quality report"""
    report = {
        'total_rows': len(df),
        'date_range': f"{df[time_col].min()} to {df[time_col].max()}",
        'missing_pct': df.isnull().sum().sum() / (len(df) * len(df.columns)) * 100,
        'duplicates': df.duplicated().sum(),
    }
    
    # Column-specific
    for col in df.select_dtypes(include=[np.number]).columns:
        report[f'{col}_missing'] = df[col].isnull().sum()
        report[f'{col}_zeros'] = (df[col] == 0).sum()
    
    return report

# Example
report = data_quality_report(df_alt.reset_index(), time_col='date')
print("Data Quality Report")
print("="*50)
for k, v in report.items():
    print(f"{k}: {v}")

---

## Part 4: Building Alt Data Trading Signal

### Process

1. **Acquire**: Get raw data
2. **Clean**: Handle missing, outliers
3. **Aggregate**: Combine to company level
4. **Normalize**: Make comparable
5. **Signal**: Convert to trading signal
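
The five steps can be sketched end to end on simulated store-level foot traffic. Everything here is an assumption for illustration: the store IDs, the visit counts, the median/MAD clipping rule, and the expanding z-score are one reasonable set of choices, not a prescribed method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 1. Acquire: simulated weekly visits for three stores (values are made up).
weeks = pd.date_range("2023-01-01", periods=8, freq="W")
raw = pd.DataFrame({
    "week": np.tile(weeks, 3),
    "store_id": np.repeat(["S1", "S2", "S3"], len(weeks)),
    "visits": rng.normal(1000, 100, size=3 * len(weeks)),
})
raw.loc[5, "visits"] = np.nan    # a missing observation
raw.loc[12, "visits"] = 25_000   # an obvious outlier

# 2. Clean: forward-fill gaps within each store, then clip outliers to
#    median +/- 5 robust standard deviations (MAD-based).
#    The bounds use the full sample for brevity; a backtest should use
#    trailing statistics to avoid look-ahead.
clean = raw.copy()
clean["visits"] = clean.groupby("store_id")["visits"].ffill()
median = clean["visits"].median()
mad = (clean["visits"] - median).abs().median()
robust_std = 1.4826 * mad
clean["visits"] = clean["visits"].clip(median - 5 * robust_std,
                                       median + 5 * robust_std)

# 3. Aggregate: sum stores up to the company level per week.
company = clean.groupby("week")["visits"].sum()

# 4. Normalize: z-score against an expanding history (past and current only).
hist_mean = company.expanding(min_periods=4).mean()
hist_std = company.expanding(min_periods=4).std()
zscore = (company - hist_mean) / hist_std

# 5. Signal: long when traffic is unusually high, short when unusually low.
signal = np.sign(zscore)
print(signal)
```

The first three weeks have no signal because the expanding window needs at least four observations before it produces a z-score.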

In [None]:
# Complete example: Satellite to trading signal
def create_alt_data_signal(parking_data, price_data):
    """Create trading signal from satellite parking data.

    Note: price_data is currently unused; the signal depends only on parking_data.
    """
    
    # 1. Rolling average (smooth noise)
    smooth_parking = parking_data.rolling(4).mean()
    
    # 2. Z-score normalization
    zscore = (smooth_parking - smooth_parking.rolling(13).mean()) / smooth_parking.rolling(13).std()
    
    # 3. Generate signal
    signal = np.sign(zscore)
    
    return signal

# Simulate price data
price_returns = 0.001 * df_alt['true_revenue'].pct_change() + np.random.randn(len(df_alt)) * 0.02
prices = 100 * (1 + price_returns).cumprod()

# Create signal
signal = create_alt_data_signal(df_alt['parking_occupancy'], prices)

# Backtest
strategy_returns = signal.shift(1) * price_returns
strategy_returns = strategy_returns.dropna()

sharpe = strategy_returns.mean() / strategy_returns.std() * np.sqrt(52)
print(f"\nAlt Data Strategy Sharpe: {sharpe:.2f}")

---

## Interview Questions

### Conceptual
1. What makes alternative data valuable?
2. How do you evaluate alt data before buying?
3. What are the risks of relying on alt data?

### Technical
1. How do you handle missing satellite images?
2. What's the latency of different alt data sources?
3. How do you validate alt data predictions?

### Finance-Specific
1. How do you prevent alpha decay with alt data?
2. What's the typical cost structure for alt data?
3. How do you combine multiple alt data sources?

---

## Key Takeaways

| Data Type | Latency | Cost | Alpha Potential |
|-----------|---------|------|----------------|
| Satellite | Days | High | High (if novel) |
| Web Scrape | Hours | Low | Medium |
| Transactions | Days | Very High | High |