# Day 5: Alternative Data Sources for Trading

## Learning Objectives
- Understand the landscape of alternative data in quantitative finance
- Learn to source, process, and analyze various alternative data types
- Build alpha signals from satellite imagery, web scraping, and social media
- Implement data quality frameworks for alternative data
- Understand regulatory and ethical considerations

---

## 1. Introduction to Alternative Data

### What is Alternative Data?

Alternative data refers to non-traditional data sources used by investors to gain insights beyond conventional financial statements and market data.

### Categories of Alternative Data

| Category | Examples | Use Cases |
|----------|----------|----------|
| **Satellite & Geolocation** | Parking lot imagery, ship tracking, crop monitoring | Retail sales, supply chain, commodities |
| **Web & Social Media** | Twitter sentiment, Reddit discussions, web traffic | Market sentiment, product launches |
| **Transaction Data** | Credit card purchases, POS data | Consumer spending, revenue nowcasting |
| **Sensor & IoT** | Weather sensors, industrial sensors | Agriculture, energy, manufacturing |
| **Text & NLP** | SEC filings, earnings calls, news | Event-driven strategies |
| **App Usage** | Mobile app downloads, usage metrics | Company growth, user engagement |

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# For web scraping
import requests
from bs4 import BeautifulSoup
import json
import time
import re

# For data processing
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from scipy import stats

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

print("Alternative Data Analysis Environment Ready!")

## 2. Satellite & Geolocation Data

### 2.1 Parking Lot Analysis (Retail Nowcasting)

Car counts from satellite imagery of retail parking lots serve as a proxy for store traffic and can help estimate quarterly revenue ahead of official announcements.

In [None]:
class ParkingLotAnalyzer:
    """
    Simulates satellite parking lot analysis for retail revenue prediction.
    In production, this would use actual satellite imagery APIs.
    """
    
    def __init__(self, n_stores: int = 100):
        self.n_stores = n_stores
        self.store_capacity = np.random.randint(50, 500, n_stores)
        
    def generate_parking_data(self, n_days: int = 90) -> pd.DataFrame:
        """
        Generate simulated parking lot occupancy data.
        
        In reality, this comes from:
        - Orbital Insight
        - RS Metrics
        - Geospatial Insight
        """
        dates = pd.date_range(end=datetime.now(), periods=n_days, freq='D')
        
        data = []
        for date in dates:
            # Seasonal pattern (higher in Q4 for retail)
            month = date.month
            seasonal_factor = 1.0 + 0.3 * (month in [11, 12]) + 0.1 * (month in [6, 7])
            
            # Day of week pattern (weekends higher)
            dow_factor = 1.2 if date.dayofweek >= 5 else 1.0
            
            # Deterministic linear growth trend (0.1% per day)
            trend = 1.0 + 0.001 * (date - dates[0]).days
            
            for store_id in range(self.n_stores):
                base_occupancy = 0.4 + np.random.normal(0, 0.1)
                occupancy = base_occupancy * seasonal_factor * dow_factor * trend
                occupancy = np.clip(occupancy + np.random.normal(0, 0.05), 0, 1)
                
                car_count = int(occupancy * self.store_capacity[store_id])
                
                data.append({
                    'date': date,
                    'store_id': store_id,
                    'car_count': car_count,
                    'capacity': self.store_capacity[store_id],
                    'occupancy_rate': occupancy
                })
        
        return pd.DataFrame(data)
    
    def calculate_aggregate_metrics(self, df: pd.DataFrame) -> pd.DataFrame:
        """Calculate aggregate parking metrics for revenue prediction."""
        
        daily_agg = df.groupby('date').agg({
            'car_count': 'sum',
            'occupancy_rate': 'mean'
        }).reset_index()
        
        # Calculate rolling metrics
        daily_agg['car_count_7d_avg'] = daily_agg['car_count'].rolling(7).mean()
        daily_agg['occupancy_7d_avg'] = daily_agg['occupancy_rate'].rolling(7).mean()
        daily_agg['yoy_change'] = daily_agg['car_count'].pct_change(365) if len(daily_agg) > 365 else np.nan
        
        # Week-over-week change
        daily_agg['wow_change'] = daily_agg['car_count_7d_avg'].pct_change(7)
        
        return daily_agg


# Generate and analyze parking lot data
analyzer = ParkingLotAnalyzer(n_stores=100)
parking_data = analyzer.generate_parking_data(n_days=180)
agg_metrics = analyzer.calculate_aggregate_metrics(parking_data)

print("Parking Lot Data Summary:")
print(f"Total observations: {len(parking_data):,}")
print(f"Date range: {parking_data['date'].min().date()} to {parking_data['date'].max().date()}")
print(f"\nAggregate Metrics (last 5 days):")
print(agg_metrics[['date', 'car_count', 'occupancy_rate', 'wow_change']].tail())

In [None]:
# Visualize parking lot trends
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Total car count over time
ax1 = axes[0, 0]
ax1.plot(agg_metrics['date'], agg_metrics['car_count'], alpha=0.3, label='Daily')
ax1.plot(agg_metrics['date'], agg_metrics['car_count_7d_avg'], linewidth=2, label='7-day Avg')
ax1.set_title('Aggregate Parking Lot Traffic')
ax1.set_xlabel('Date')
ax1.set_ylabel('Total Car Count')
ax1.legend()

# Occupancy rate distribution
ax2 = axes[0, 1]
ax2.hist(parking_data['occupancy_rate'], bins=50, edgecolor='black', alpha=0.7)
ax2.axvline(parking_data['occupancy_rate'].mean(), color='red', linestyle='--', label=f"Mean: {parking_data['occupancy_rate'].mean():.2f}")
ax2.set_title('Occupancy Rate Distribution')
ax2.set_xlabel('Occupancy Rate')
ax2.set_ylabel('Frequency')
ax2.legend()

# Week-over-week changes
ax3 = axes[1, 0]
valid_wow = agg_metrics.dropna(subset=['wow_change'])
colors = ['green' if x > 0 else 'red' for x in valid_wow['wow_change']]
ax3.bar(valid_wow['date'], valid_wow['wow_change'] * 100, color=colors, alpha=0.7)
ax3.axhline(0, color='black', linewidth=0.5)
ax3.set_title('Week-over-Week Traffic Change')
ax3.set_xlabel('Date')
ax3.set_ylabel('WoW Change (%)')

# Day of week pattern
ax4 = axes[1, 1]
parking_data['dayofweek'] = parking_data['date'].dt.dayofweek
dow_avg = parking_data.groupby('dayofweek')['occupancy_rate'].mean()
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
ax4.bar(days, dow_avg.values, color='steelblue', edgecolor='black')
ax4.set_title('Average Occupancy by Day of Week')
ax4.set_xlabel('Day of Week')
ax4.set_ylabel('Avg Occupancy Rate')

plt.tight_layout()
plt.show()
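
The cells above simulate and aggregate car counts but stop short of the nowcast itself. As a minimal sketch, assuming a history of quarterly traffic growth paired with reported revenue growth is available, a one-variable least-squares fit can map the current quarter's traffic growth to a revenue-growth estimate. All numbers below are synthetic, and the names `traffic_growth`, `revenue_growth`, and `nowcast_revenue_growth` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic history (assumption): 12 quarters of YoY car-count growth
# and the revenue growth the company later reported for each quarter.
traffic_growth = rng.normal(0.03, 0.04, size=12)
revenue_growth = 0.01 + 0.6 * traffic_growth + rng.normal(0, 0.005, size=12)

# Ordinary least squares: revenue_growth ~ intercept + slope * traffic_growth
slope, intercept = np.polyfit(traffic_growth, revenue_growth, deg=1)


def nowcast_revenue_growth(current_traffic_growth: float) -> float:
    """Map the traffic growth observed so far this quarter to a revenue-growth estimate."""
    return intercept + slope * current_traffic_growth


estimate = nowcast_revenue_growth(0.05)
print(f"Fitted slope: {slope:.2f}, intercept: {intercept:.3f}")
print(f"Nowcast for +5% traffic growth: {estimate:.2%}")
```

With only a handful of quarters the fit is fragile, so in practice the estimate would be compared against analyst consensus rather than traded on directly.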

### 2.2 Ship Tracking (AIS Data)

Automatic Identification System (AIS) data tracks global shipping, useful for:
- Commodity flow analysis
- Supply chain monitoring
- Trade flow prediction

In [None]:
class ShipTrackingAnalyzer:
    """
    Simulates AIS ship tracking data analysis.
    
    Real providers include:
    - MarineTraffic
    - VesselFinder
    - Spire Global
    - Kpler
    """
    
    def __init__(self):
        self.vessel_types = ['Tanker', 'Bulk Carrier', 'Container', 'LNG Carrier']
        self.routes = [
            ('Shanghai', 'Los Angeles', 'Pacific'),
            ('Rotterdam', 'New York', 'Atlantic'),
            ('Singapore', 'Dubai', 'Indian Ocean'),
            ('Houston', 'Amsterdam', 'Atlantic'),
            ('Santos', 'Shanghai', 'Pacific')
        ]
        
    def generate_vessel_data(self, n_vessels: int = 500, n_days: int = 90) -> pd.DataFrame:
        """Generate simulated vessel tracking data."""
        
        data = []
        dates = pd.date_range(end=datetime.now(), periods=n_days, freq='D')
        
        for vessel_id in range(n_vessels):
            vessel_type = np.random.choice(self.vessel_types)
            route = self.routes[np.random.randint(len(self.routes))]
            
            # Capacity based on vessel type (in DWT - deadweight tonnage)
            capacity_ranges = {
                'Tanker': (50000, 300000),
                'Bulk Carrier': (30000, 200000),
                'Container': (10000, 150000),
                'LNG Carrier': (80000, 180000)
            }
            capacity = np.random.randint(*capacity_ranges[vessel_type])
            
            # Generate daily positions for this vessel
            for date in dates:
                # Simulate cargo utilization
                cargo_util = np.clip(np.random.normal(0.85, 0.1), 0.5, 1.0)
                
                # Simulate speed (knots)
                avg_speed = np.random.normal(14, 2)
                
                data.append({
                    'date': date,
                    'vessel_id': f'V{vessel_id:04d}',
                    'vessel_type': vessel_type,
                    'origin': route[0],
                    'destination': route[1],
                    'ocean': route[2],
                    'capacity_dwt': capacity,
                    'cargo_utilization': cargo_util,
                    'cargo_tonnes': int(capacity * cargo_util),
                    'speed_knots': avg_speed
                })
        
        return pd.DataFrame(data)
    
    def calculate_trade_flows(self, df: pd.DataFrame) -> pd.DataFrame:
        """Calculate aggregate trade flow metrics."""
        
        # Daily cargo by destination and vessel type
        daily_flows = df.groupby(['date', 'vessel_type', 'destination']).agg({
            'cargo_tonnes': 'sum',
            'vessel_id': 'count',
            'cargo_utilization': 'mean'
        }).reset_index()
        daily_flows.columns = ['date', 'vessel_type', 'destination', 'total_cargo', 'vessel_count', 'avg_utilization']
        
        return daily_flows


# Generate ship tracking data
ship_analyzer = ShipTrackingAnalyzer()
vessel_data = ship_analyzer.generate_vessel_data(n_vessels=200, n_days=60)
trade_flows = ship_analyzer.calculate_trade_flows(vessel_data)

print("Ship Tracking Data Summary:")
print(f"\nVessel Type Distribution:")
print(vessel_data.groupby('vessel_type').agg({
    'vessel_id': 'nunique',
    'cargo_tonnes': 'mean'
}).round(0))

In [None]:
# Visualize shipping data
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Cargo flow by vessel type
ax1 = axes[0, 0]
vessel_type_cargo = trade_flows.groupby(['date', 'vessel_type'])['total_cargo'].sum().unstack()
vessel_type_cargo.plot(ax=ax1, linewidth=2)
ax1.set_title('Daily Cargo Flow by Vessel Type')
ax1.set_xlabel('Date')
ax1.set_ylabel('Total Cargo (tonnes)')
ax1.legend(title='Vessel Type')

# Destination analysis
ax2 = axes[0, 1]
dest_cargo = vessel_data.groupby('destination')['cargo_tonnes'].sum().sort_values(ascending=True)
dest_cargo.plot(kind='barh', ax=ax2, color='steelblue', edgecolor='black')
ax2.set_title('Total Cargo by Destination')
ax2.set_xlabel('Total Cargo (tonnes)')

# Utilization trends
ax3 = axes[1, 0]
daily_util = vessel_data.groupby('date')['cargo_utilization'].mean()
ax3.plot(daily_util.index, daily_util.values, linewidth=2)
ax3.axhline(daily_util.mean(), color='red', linestyle='--', label=f'Mean: {daily_util.mean():.2%}')
ax3.set_title('Average Fleet Cargo Utilization')
ax3.set_xlabel('Date')
ax3.set_ylabel('Utilization Rate')
ax3.legend()

# Speed distribution by vessel type
ax4 = axes[1, 1]
vessel_data.boxplot(column='speed_knots', by='vessel_type', ax=ax4)
ax4.set_title('Speed Distribution by Vessel Type')
ax4.set_xlabel('Vessel Type')
ax4.set_ylabel('Speed (knots)')
plt.suptitle('')

plt.tight_layout()
plt.show()
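
The simulation above draws vessel speed at random, whereas raw AIS messages carry timestamps and latitude/longitude positions. A minimal sketch of how speed could be derived from two consecutive position reports uses the haversine great-circle distance. The tuple layout `(timestamp, lat, lon)` and the function names are assumptions for illustration, not a provider's schema.

```python
import math
from datetime import datetime

# Mean Earth radius (6371 km) converted to nautical miles (1 nm = 1.852 km)
EARTH_RADIUS_NM = 6371.0 / 1.852


def haversine_nm(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in nautical miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))


def implied_speed_knots(ping_a: tuple, ping_b: tuple) -> float:
    """Average speed between two (timestamp, lat, lon) reports, in knots."""
    (t_a, lat_a, lon_a), (t_b, lat_b, lon_b) = ping_a, ping_b
    hours = (t_b - t_a).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("ping_b must be later than ping_a")
    return haversine_nm(lat_a, lon_a, lat_b, lon_b) / hours


# Two hypothetical reports six hours apart, one degree of longitude along the equator
ping_1 = (datetime(2024, 1, 1, 0, 0), 0.0, 0.0)
ping_2 = (datetime(2024, 1, 1, 6, 0), 0.0, 1.0)
distance = haversine_nm(0.0, 0.0, 0.0, 1.0)
speed = implied_speed_knots(ping_1, ping_2)
print(f"Distance: {distance:.2f} nm, implied speed: {speed:.2f} knots")
```

A sustained drop in implied speed across a fleet is one way slow steaming or port congestion might show up in the data.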

## 3. Web Scraping & Traffic Data

### 3.1 Web Traffic Analysis

Website traffic data can serve as a leading indicator of company performance, especially for e-commerce and SaaS businesses whose revenue depends directly on site visits.

In [None]:
class WebTrafficAnalyzer:
    """
    Simulates web traffic analysis for trading signals.
    
    Real providers include:
    - SimilarWeb
    - SEMrush
    - Alexa (discontinued)
    - Cloudflare Radar
    """
    
    def __init__(self):
        self.companies = {
            'AMZN': {'base_visits': 2.5e9, 'volatility': 0.1},
            'SHOP': {'base_visits': 500e6, 'volatility': 0.15},
            'EBAY': {'base_visits': 800e6, 'volatility': 0.12},
            'ETSY': {'base_visits': 400e6, 'volatility': 0.18},
            'MELI': {'base_visits': 600e6, 'volatility': 0.14}
        }
    
    def generate_traffic_data(self, n_days: int = 180) -> pd.DataFrame:
        """Generate simulated web traffic data."""
        
        dates = pd.date_range(end=datetime.now(), periods=n_days, freq='D')
        data = []
        
        for ticker, params in self.companies.items():
            # Generate base traffic with trend and seasonality
            for i, date in enumerate(dates):
                # Trend component (slight growth)
                trend = 1 + 0.0005 * i
                
                # Seasonal component (holiday boost)
                month = date.month
                seasonal = 1 + 0.4 * (month in [11, 12]) + 0.1 * (month in [6, 7])
                
                # Day of week effect
                dow = date.dayofweek
                dow_effect = 1.1 if dow < 5 else 0.9
                
                # Random noise
                noise = np.random.normal(1, params['volatility'])
                
                visits = params['base_visits'] * trend * seasonal * dow_effect * noise
                
                # Engagement metrics
                avg_duration = np.random.normal(180, 30)  # seconds
                pages_per_visit = np.random.normal(5, 1)
                bounce_rate = np.random.normal(0.4, 0.05)
                
                data.append({
                    'date': date,
                    'ticker': ticker,
                    'visits': int(visits),
                    'unique_visitors': int(visits * np.random.uniform(0.6, 0.8)),
                    'avg_duration_sec': max(60, avg_duration),
                    'pages_per_visit': max(1, pages_per_visit),
                    'bounce_rate': np.clip(bounce_rate, 0.2, 0.7)
                })
        
        return pd.DataFrame(data)
    
    def calculate_traffic_signals(self, df: pd.DataFrame) -> pd.DataFrame:
        """Calculate trading signals from web traffic data."""
        
        signals = []
        
        for ticker in df['ticker'].unique():
            ticker_data = df[df['ticker'] == ticker].sort_values('date').copy()
            
            # Calculate rolling metrics
            ticker_data['visits_7d_avg'] = ticker_data['visits'].rolling(7).mean()
            ticker_data['visits_30d_avg'] = ticker_data['visits'].rolling(30).mean()
            
            # 90-day change, used as a stand-in for year-over-year because the sample
            # spans less than a year (the column keeps the name 'visits_yoy')
            ticker_data['visits_yoy'] = ticker_data['visits'].pct_change(90)
            
            # Momentum signal: short-term vs long-term
            ticker_data['momentum_signal'] = (
                ticker_data['visits_7d_avg'] / ticker_data['visits_30d_avg'] - 1
            )
            
            # Engagement quality score
            ticker_data['engagement_score'] = (
                ticker_data['avg_duration_sec'] / 180 * 0.4 +
                ticker_data['pages_per_visit'] / 5 * 0.4 +
                (1 - ticker_data['bounce_rate']) * 0.2
            )
            
            signals.append(ticker_data)
        
        return pd.concat(signals, ignore_index=True)


# Generate web traffic data
traffic_analyzer = WebTrafficAnalyzer()
traffic_data = traffic_analyzer.generate_traffic_data(n_days=180)
traffic_signals = traffic_analyzer.calculate_traffic_signals(traffic_data)

print("Web Traffic Data Summary:")
print(traffic_data.groupby('ticker').agg({
    'visits': ['mean', 'std'],
    'bounce_rate': 'mean'
}).round(2))

In [None]:
# Visualize web traffic signals
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Traffic trends by company
ax1 = axes[0, 0]
for ticker in traffic_signals['ticker'].unique():
    ticker_data = traffic_signals[traffic_signals['ticker'] == ticker]
    ax1.plot(ticker_data['date'], ticker_data['visits_7d_avg'] / 1e6, label=ticker)
ax1.set_title('7-Day Average Web Traffic')
ax1.set_xlabel('Date')
ax1.set_ylabel('Visits (Millions)')
ax1.legend()

# Momentum signals
ax2 = axes[0, 1]
latest_signals = traffic_signals.groupby('ticker').last().reset_index()
colors = ['green' if x > 0 else 'red' for x in latest_signals['momentum_signal']]
ax2.barh(latest_signals['ticker'], latest_signals['momentum_signal'] * 100, color=colors)
ax2.axvline(0, color='black', linewidth=0.5)
ax2.set_title('Current Momentum Signal (7d vs 30d Avg)')
ax2.set_xlabel('Momentum (%)')

# Engagement score over time
ax3 = axes[1, 0]
for ticker in traffic_signals['ticker'].unique():
    ticker_data = traffic_signals[traffic_signals['ticker'] == ticker]
    ax3.plot(ticker_data['date'], ticker_data['engagement_score'].rolling(7).mean(), label=ticker)
ax3.set_title('Engagement Quality Score (7-day Avg)')
ax3.set_xlabel('Date')
ax3.set_ylabel('Engagement Score')
ax3.legend()

# Bounce rate comparison
ax4 = axes[1, 1]
bounce_rates = traffic_data.groupby('ticker')['bounce_rate'].mean().sort_values()
ax4.barh(bounce_rates.index, bounce_rates.values * 100, color='coral', edgecolor='black')
ax4.set_title('Average Bounce Rate by Company')
ax4.set_xlabel('Bounce Rate (%)')

plt.tight_layout()
plt.show()
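
The `momentum_signal` computed above is a per-ticker number. To trade on it, it still has to be turned into positions. One common approach, sketched here under the assumption of a simple dollar-neutral book, is to z-score the signal across tickers and scale so that long and short weights offset. The function name and the sample momentum values are hypothetical.

```python
import pandas as pd


def momentum_to_weights(momentum: pd.Series) -> pd.Series:
    """Map per-ticker momentum to dollar-neutral weights with unit gross exposure."""
    spread = momentum.std(ddof=0)
    if pd.isna(spread) or spread == 0:
        # No cross-sectional dispersion: take no position
        return pd.Series(0.0, index=momentum.index)
    z = (momentum - momentum.mean()) / spread
    # Weights sum to zero (z has zero mean); absolute weights sum to one
    return z / z.abs().sum()


# Hypothetical latest 7d-vs-30d momentum readings
latest_momentum = pd.Series(
    {"AMZN": 0.04, "SHOP": -0.02, "EBAY": 0.01, "ETSY": -0.05, "MELI": 0.02}
)
weights = momentum_to_weights(latest_momentum)
print(weights.round(3))
```

Demeaning across tickers removes the common seasonal component (for example, a holiday traffic boost that lifts every e-commerce site), leaving only relative strength.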

### 3.2 Web Scraping Best Practices

Ethical web scraping considerations:
- Respect `robots.txt`
- Implement rate limiting
- Identify your scraper with a proper User-Agent
- Cache responses to minimize requests

In [None]:
class EthicalWebScraper:
    """
    A framework for ethical web scraping with rate limiting and caching.
    """
    
    def __init__(self, base_delay: float = 1.0, user_agent: str = None):
        self.base_delay = base_delay
        self.cache = {}
        self.last_request_time = {}
        
        self.headers = {
            'User-Agent': user_agent or 'ResearchBot/1.0 (Academic Research; contact@example.edu)'
        }
    
    def check_robots_txt(self, url: str) -> dict:
        """Check robots.txt for the domain (simulated)."""
        from urllib.parse import urlparse
        
        parsed = urlparse(url)
        domain = parsed.netloc
        
        # In production, actually fetch and parse robots.txt
        # This is a placeholder showing the structure
        return {
            'domain': domain,
            'crawl_delay': self.base_delay,
            'allowed': True,
            'disallowed_paths': ['/admin', '/api/private']
        }
    
    def rate_limit(self, domain: str):
        """Implement rate limiting per domain."""
        current_time = time.time()
        
        if domain in self.last_request_time:
            elapsed = current_time - self.last_request_time[domain]
            if elapsed < self.base_delay:
                time.sleep(self.base_delay - elapsed)
        
        self.last_request_time[domain] = time.time()
    
    def get_cached_or_fetch(self, url: str, max_age_hours: int = 24) -> dict:
        """
        Get from cache or fetch with rate limiting.
        Returns simulated response for demonstration.
        """
        from urllib.parse import urlparse
        
        # Check cache
        if url in self.cache:
            cached = self.cache[url]
            age_hours = (datetime.now() - cached['timestamp']).total_seconds() / 3600
            if age_hours < max_age_hours:
                return {'source': 'cache', 'data': cached['data']}
        
        # Rate limit
        domain = urlparse(url).netloc
        self.rate_limit(domain)
        
        # In production, this would actually make the request
        # response = requests.get(url, headers=self.headers)
        
        # Simulated response
        data = {
            'url': url,
            'status': 200,
            'content': f'Simulated content from {url}'
        }
        
        # Cache the response
        self.cache[url] = {
            'timestamp': datetime.now(),
            'data': data
        }
        
        return {'source': 'fetch', 'data': data}


# Demonstrate ethical scraping framework
scraper = EthicalWebScraper(base_delay=2.0)

# Check robots.txt
robots_info = scraper.check_robots_txt('https://example.com/data')
print("Robots.txt Information:")
print(json.dumps(robots_info, indent=2))

# Demonstrate caching
print("\nFirst request (fetch):")
result1 = scraper.get_cached_or_fetch('https://example.com/data')
print(f"Source: {result1['source']}")

print("\nSecond request (cache):")
result2 = scraper.get_cached_or_fetch('https://example.com/data')
print(f"Source: {result2['source']}")
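
The `check_robots_txt` method above returns a hard-coded placeholder. A minimal sketch of real parsing can use the standard library's `urllib.robotparser`. To stay offline, the example parses an in-memory robots.txt whose contents are invented for illustration. A live scraper would call `set_url(...)` and `read()` on the parser instead.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content (assumption, for demonstration only)
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /admin
Disallow: /api/private
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "ResearchBot"  # product token of the User-Agent used in this notebook

allowed_data = parser.can_fetch(USER_AGENT, "https://example.com/data")
allowed_admin = parser.can_fetch(USER_AGENT, "https://example.com/admin/users")
delay = parser.crawl_delay(USER_AGENT)  # None when no Crawl-delay applies

print(f"/data allowed: {allowed_data}")
print(f"/admin/users allowed: {allowed_admin}")
print(f"Crawl delay: {delay}")
```

The returned crawl delay could replace `base_delay` in `rate_limit` whenever the site asks for a longer pause than the scraper's default.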

## 4. Social Media & Sentiment Data

### 4.1 Social Media Volume & Sentiment Analysis

In [None]:
class SocialMediaAnalyzer:
    """
    Analyze social media data for trading signals.
    
    Real providers include:
    - Stocktwits API
    - Twitter/X API
    - Reddit API
    - Alternative.me (Crypto Fear & Greed)
    """
    
    def __init__(self):
        self.tickers = ['AAPL', 'TSLA', 'NVDA', 'AMD', 'GME', 'AMC']
        
    def generate_social_data(self, n_days: int = 90) -> pd.DataFrame:
        """Generate simulated social media data."""
        
        dates = pd.date_range(end=datetime.now(), periods=n_days, freq='D')
        data = []
        
        for date in dates:
            for ticker in self.tickers:
                # Base mention volume varies by ticker popularity
                base_mentions = {
                    'AAPL': 50000, 'TSLA': 80000, 'NVDA': 60000,
                    'AMD': 30000, 'GME': 40000, 'AMC': 35000
                }
                
                # Add random events (earnings, news)
                event_multiplier = (1 + np.random.exponential(0.3)) if np.random.random() < 0.1 else 1
                
                mentions = int(base_mentions[ticker] * np.random.lognormal(0, 0.3) * event_multiplier)
                
                # Sentiment distribution (positive, negative, neutral)
                base_sentiment = np.random.normal(0.55, 0.1)  # Slight positive bias
                positive_ratio = np.clip(base_sentiment, 0.2, 0.8)
                negative_ratio = np.clip(np.random.normal(0.25, 0.05), 0.1, 0.5)
                neutral_ratio = 1 - positive_ratio - negative_ratio
                neutral_ratio = max(0, neutral_ratio)
                
                # Normalize
                total = positive_ratio + negative_ratio + neutral_ratio
                positive_ratio /= total
                negative_ratio /= total
                neutral_ratio /= total
                
                data.append({
                    'date': date,
                    'ticker': ticker,
                    'mentions': mentions,
                    'positive_ratio': positive_ratio,
                    'negative_ratio': negative_ratio,
                    'neutral_ratio': neutral_ratio,
                    'sentiment_score': positive_ratio - negative_ratio,  # Net sentiment
                    'unique_users': int(mentions * np.random.uniform(0.3, 0.6)),
                    'avg_engagement': np.random.lognormal(2, 0.5)
                })
        
        return pd.DataFrame(data)
    
    def calculate_sentiment_signals(self, df: pd.DataFrame) -> pd.DataFrame:
        """Calculate trading signals from social sentiment."""
        
        signals = []
        
        for ticker in df['ticker'].unique():
            ticker_data = df[df['ticker'] == ticker].sort_values('date').copy()
            
            # Rolling metrics
            ticker_data['mentions_7d_avg'] = ticker_data['mentions'].rolling(7).mean()
            ticker_data['sentiment_7d_avg'] = ticker_data['sentiment_score'].rolling(7).mean()
            
            # Z-score of mentions (unusual activity detector)
            ticker_data['mentions_zscore'] = (
                (ticker_data['mentions'] - ticker_data['mentions'].rolling(30).mean()) /
                ticker_data['mentions'].rolling(30).std()
            )
            
            # Sentiment momentum
            ticker_data['sentiment_momentum'] = ticker_data['sentiment_7d_avg'].diff(7)
            
            # Combined signal: high sentiment + unusual volume
            ticker_data['combined_signal'] = (
                ticker_data['sentiment_score'] * 0.5 +
                np.clip(ticker_data['mentions_zscore'], -2, 2) / 4 * 0.5
            )
            
            signals.append(ticker_data)
        
        return pd.concat(signals, ignore_index=True)


# Generate social media data
social_analyzer = SocialMediaAnalyzer()
social_data = social_analyzer.generate_social_data(n_days=90)
social_signals = social_analyzer.calculate_sentiment_signals(social_data)

print("Social Media Data Summary:")
print(social_data.groupby('ticker').agg({
    'mentions': 'mean',
    'sentiment_score': 'mean',
    'unique_users': 'mean'
}).round(2))

In [None]:
# Visualize social sentiment signals
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Mentions volume by ticker
ax1 = axes[0, 0]
for ticker in ['TSLA', 'NVDA', 'GME']:
    ticker_data = social_signals[social_signals['ticker'] == ticker]
    ax1.plot(ticker_data['date'], ticker_data['mentions_7d_avg'] / 1000, label=ticker)
ax1.set_title('Social Media Mentions (7-day Avg)')
ax1.set_xlabel('Date')
ax1.set_ylabel('Mentions (Thousands)')
ax1.legend()

# Sentiment heatmap
ax2 = axes[0, 1]
pivot_sentiment = social_data.pivot_table(
    index='ticker', 
    columns=social_data['date'].dt.isocalendar().week.astype(int),  # Series.dt.week was removed in pandas 2.0
    values='sentiment_score',
    aggfunc='mean'
)
sns.heatmap(pivot_sentiment, cmap='RdYlGn', center=0, ax=ax2, cbar_kws={'label': 'Sentiment'})
ax2.set_title('Weekly Sentiment Heatmap by Ticker')
ax2.set_xlabel('Week')

# Unusual activity detection
ax3 = axes[1, 0]
recent_signals = social_signals.groupby('ticker').last().reset_index()
colors = ['red' if abs(x) > 1.5 else 'gray' for x in recent_signals['mentions_zscore']]
ax3.bar(recent_signals['ticker'], recent_signals['mentions_zscore'], color=colors)
ax3.axhline(1.5, color='red', linestyle='--', alpha=0.5, label='Alert Threshold')
ax3.axhline(-1.5, color='red', linestyle='--', alpha=0.5)
ax3.set_title('Mentions Z-Score (Unusual Activity)')
ax3.set_xlabel('Ticker')
ax3.set_ylabel('Z-Score')
ax3.legend()

# Combined signal distribution
ax4 = axes[1, 1]
for ticker in ['AAPL', 'TSLA', 'GME']:
    ticker_data = social_signals[social_signals['ticker'] == ticker]
    ax4.hist(ticker_data['combined_signal'].dropna(), alpha=0.5, bins=30, label=ticker)
ax4.set_title('Combined Signal Distribution')
ax4.set_xlabel('Signal Value')
ax4.set_ylabel('Frequency')
ax4.legend()

plt.tight_layout()
plt.show()
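
The `mentions_zscore` above includes the current day in its own 30-day baseline, which dampens the spikes it is meant to detect. A variation, sketched below as one possible alternative rather than the notebook's method, is to compute the baseline from the preceding window only by shifting the series one step. The function name `lagged_zscore` is hypothetical.

```python
import pandas as pd


def lagged_zscore(series: pd.Series, window: int = 30) -> pd.Series:
    """Z-score of each value against the mean/std of the *previous* `window` observations."""
    baseline = series.shift(1).rolling(window)
    return (series - baseline.mean()) / baseline.std()


# Synthetic mentions: 30 quiet days alternating between 9k and 11k, then a spike to 50k
quiet = [9_000, 11_000] * 15
mentions = pd.Series(quiet + [50_000], dtype=float)

z = lagged_zscore(mentions, window=30)
print(f"Valid z-scores: {z.notna().sum()}")
print(f"Z-score on spike day: {z.iloc[-1]:.1f}")
```

The cost is one extra day of warm-up: the first `window` observations have no score because a full lagged baseline is not yet available.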

## 5. Transaction & Credit Card Data

### 5.1 Consumer Spending Analysis

Aggregated credit card and transaction data provide near-real-time insight into consumer spending, typically weeks before companies report results.

In [None]:
class TransactionDataAnalyzer:
    """
    Analyze aggregated transaction data for revenue prediction.
    
    Real providers include:
    - Bloomberg Second Measure
    - Earnest Research
    - Mastercard SpendingPulse
    - Visa Merchant Insights
    """
    
    def __init__(self):
        self.merchants = {
            'AMZN': {'category': 'E-Commerce', 'base_txn': 10e6},
            'WMT': {'category': 'Retail', 'base_txn': 15e6},
            'SBUX': {'category': 'Restaurants', 'base_txn': 5e6},
            'MCD': {'category': 'Restaurants', 'base_txn': 8e6},
            'HD': {'category': 'Home Improvement', 'base_txn': 3e6},
            'NKE': {'category': 'Apparel', 'base_txn': 2e6}
        }
    
    def generate_transaction_data(self, n_weeks: int = 52) -> pd.DataFrame:
        """Generate simulated weekly transaction data."""
        
        weeks = pd.date_range(end=datetime.now(), periods=n_weeks, freq='W')
        data = []
        
        for week in weeks:
            for ticker, params in self.merchants.items():
                # Seasonal patterns
                month = week.month
                if params['category'] in ['E-Commerce', 'Retail']:
                    seasonal = 1 + 0.5 * (month in [11, 12]) + 0.1 * (month in [6, 7])
                elif params['category'] == 'Restaurants':
                    seasonal = 1 + 0.1 * (month in [6, 7, 8])  # Summer boost
                elif params['category'] == 'Home Improvement':
                    seasonal = 1 + 0.2 * (month in [4, 5, 6])  # Spring boost
                else:
                    seasonal = 1.0
                
                # Trend (growth/decline)
                trend = 1 + 0.001 * (week - weeks[0]).days
                
                # Random variation
                noise = np.random.lognormal(0, 0.1)
                
                txn_count = int(params['base_txn'] * seasonal * trend * noise)
                # Clamp the ticket size first so total_spend uses the same (non-negative) value
                avg_ticket = np.random.normal(50, 10) if params['category'] != 'Restaurants' else np.random.normal(15, 5)
                avg_ticket = max(5, avg_ticket)
                
                data.append({
                    'week': week,
                    'ticker': ticker,
                    'category': params['category'],
                    'transaction_count': txn_count,
                    'avg_ticket_size': avg_ticket,
                    'total_spend': txn_count * avg_ticket,
                    'unique_customers': int(txn_count * np.random.uniform(0.4, 0.7)),
                    'repeat_rate': np.random.uniform(0.2, 0.5)
                })
        
        return pd.DataFrame(data)
    
    def predict_quarterly_revenue(self, df: pd.DataFrame, ticker: str) -> dict:
        """Aggregate weekly panel spend by quarter as a proxy for reported revenue."""
        
        ticker_data = df[df['ticker'] == ticker].copy()
        ticker_data['quarter'] = ticker_data['week'].dt.to_period('Q')
        
        # Aggregate by quarter
        quarterly = ticker_data.groupby('quarter').agg({
            'total_spend': 'sum',
            'transaction_count': 'sum',
            'unique_customers': 'sum',  # sum of weekly uniques; repeat shoppers count once per week
            'avg_ticket_size': 'mean'
        }).reset_index()
        
        # Calculate growth rates
        quarterly['spend_qoq'] = quarterly['total_spend'].pct_change()
        quarterly['spend_yoy'] = quarterly['total_spend'].pct_change(4) if len(quarterly) > 4 else np.nan
        
        return {
            'ticker': ticker,
            'quarterly_data': quarterly,
            'latest_quarter_spend': quarterly['total_spend'].iloc[-1],
            'qoq_growth': quarterly['spend_qoq'].iloc[-1] if len(quarterly) > 1 else None
        }


# Generate transaction data
txn_analyzer = TransactionDataAnalyzer()
txn_data = txn_analyzer.generate_transaction_data(n_weeks=52)

print("Transaction Data Summary by Category:")
print(txn_data.groupby('category').agg({
    'total_spend': 'sum',
    'transaction_count': 'sum',
    'avg_ticket_size': 'mean'
}).round(2))

In [None]:
# Visualize transaction data
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Total spend by company
ax1 = axes[0, 0]
for ticker in ['AMZN', 'WMT', 'SBUX']:
    ticker_data = txn_data[txn_data['ticker'] == ticker]
    ax1.plot(ticker_data['week'], ticker_data['total_spend'] / 1e9, label=ticker)
ax1.set_title('Weekly Consumer Spend by Company')
ax1.set_xlabel('Week')
ax1.set_ylabel('Total Spend ($B)')
ax1.legend()

# Category comparison
ax2 = axes[0, 1]
category_spend = txn_data.groupby('category')['total_spend'].sum().sort_values()
category_spend.plot(kind='barh', ax=ax2, color='steelblue', edgecolor='black')
ax2.set_title('Total Annual Spend by Category')
ax2.set_xlabel('Total Spend ($)')

# Average ticket size trends
ax3 = axes[1, 0]
category_ticket = txn_data.groupby(['week', 'category'])['avg_ticket_size'].mean().unstack()
category_ticket.plot(ax=ax3)
ax3.set_title('Average Ticket Size by Category')
ax3.set_xlabel('Week')
ax3.set_ylabel('Avg Ticket ($)')
ax3.legend(title='Category', loc='upper left')

# Quarter-over-quarter spend growth by ticker
ax4 = axes[1, 1]
predictions = {}
for ticker in ['AMZN', 'WMT', 'HD']:
    pred = txn_analyzer.predict_quarterly_revenue(txn_data, ticker)
    predictions[ticker] = pred

tickers = list(predictions.keys())
# Treat missing growth (None or NaN) as 0 so the bar chart still renders
qoq_growth = [predictions[t]['qoq_growth'] * 100 if pd.notna(predictions[t]['qoq_growth']) else 0 for t in tickers]
colors = ['green' if x > 0 else 'red' for x in qoq_growth]
ax4.bar(tickers, qoq_growth, color=colors, edgecolor='black')
ax4.axhline(0, color='black', linewidth=0.5)
ax4.set_title('Quarter-over-Quarter Spend Growth')
ax4.set_xlabel('Ticker')
ax4.set_ylabel('QoQ Growth (%)')

plt.tight_layout()
plt.show()

## 6. Alternative Data Quality Framework

### 6.1 Data Quality Assessment

In [None]:
class AltDataQualityAssessor:
    """
    Framework for assessing alternative data quality.
    
    Key dimensions:
    1. Coverage - How representative is the sample?
    2. Timeliness - How fresh is the data?
    3. Accuracy - How reliable is the data?
    4. Consistency - Is the data stable over time?
    5. Completeness - Are there gaps or missing values?
    """
    
    def __init__(self, df: pd.DataFrame, date_col: str, value_col: str):
        self.df = df.copy()
        self.date_col = date_col
        self.value_col = value_col
        
    def assess_coverage(self, expected_entities: list = None) -> dict:
        """Assess data coverage."""
        
        unique_dates = self.df[self.date_col].nunique()
        # Inclusive span in days; the ratio below assumes one observation per calendar day
        date_range = (self.df[self.date_col].max() - self.df[self.date_col].min()).days + 1
        
        return {
            'unique_dates': unique_dates,
            'date_range_days': date_range,
            'coverage_ratio': unique_dates / max(date_range, 1),
            'total_records': len(self.df)
        }
    
    def assess_timeliness(self) -> dict:
        """Assess data timeliness."""
        
        latest_date = self.df[self.date_col].max()
        lag_days = (datetime.now() - pd.to_datetime(latest_date)).days
        
        return {
            'latest_date': latest_date,
            'lag_days': lag_days,
            'is_stale': lag_days > 7  # Data older than 7 days is considered stale
        }
    
    def assess_completeness(self) -> dict:
        """Assess data completeness."""
        
        missing_pct = self.df[self.value_col].isna().mean() * 100
        zero_pct = (self.df[self.value_col] == 0).mean() * 100
        
        return {
            'missing_pct': missing_pct,
            'zero_values_pct': zero_pct,
            'completeness_score': 100 - missing_pct
        }
    
    def assess_consistency(self, window: int = 30) -> dict:
        """Assess data consistency over time."""
        
        daily_mean = self.df.groupby(self.date_col)[self.value_col].mean()
        
        # Calculate coefficient of variation
        cv = daily_mean.std() / daily_mean.mean() if daily_mean.mean() != 0 else np.inf
        
        # Detect regime changes (large jumps)
        pct_changes = daily_mean.pct_change().abs()
        anomalies = (pct_changes > 0.5).sum()  # >50% daily change
        
        return {
            'coefficient_of_variation': cv,
            'anomaly_days': anomalies,
            'is_stable': cv < 0.5 and anomalies < 5
        }
    
    def full_assessment(self) -> pd.DataFrame:
        """Run full quality assessment."""
        
        coverage = self.assess_coverage()
        timeliness = self.assess_timeliness()
        completeness = self.assess_completeness()
        consistency = self.assess_consistency()
        
        # Calculate overall quality score
        quality_score = (
            min(coverage['coverage_ratio'], 1) * 25 +
            (1 - min(timeliness['lag_days'] / 30, 1)) * 25 +
            completeness['completeness_score'] / 100 * 25 +
            (1 if consistency['is_stable'] else 0.5) * 25
        )
        
        assessment = pd.DataFrame([
            {'Dimension': 'Coverage', 'Score': min(coverage['coverage_ratio'], 1) * 100, 'Details': f"{coverage['unique_dates']} unique dates"},
            {'Dimension': 'Timeliness', 'Score': max(0, 100 - timeliness['lag_days'] * 5), 'Details': f"{timeliness['lag_days']} days lag"},
            {'Dimension': 'Completeness', 'Score': completeness['completeness_score'], 'Details': f"{completeness['missing_pct']:.1f}% missing"},
            {'Dimension': 'Consistency', 'Score': 80 if consistency['is_stable'] else 50, 'Details': f"CV: {consistency['coefficient_of_variation']:.2f}"},
            {'Dimension': 'Overall', 'Score': quality_score, 'Details': 'Composite (up to 25 points per dimension)'}
        ])
        
        return assessment


# Assess parking data quality
assessor = AltDataQualityAssessor(agg_metrics, 'date', 'car_count')
quality_report = assessor.full_assessment()

print("\n" + "="*50)
print("Alternative Data Quality Assessment Report")
print("="*50)
print(quality_report.to_string(index=False))

In [None]:
# Visualize quality assessment
# Create the polar axes up front instead of stacking a second axes on a Cartesian one
fig, ax = plt.subplots(figsize=(10, 6), subplot_kw={'projection': 'polar'})

dimensions = quality_report['Dimension'][:-1].tolist()  # Exclude 'Overall'
scores = quality_report['Score'][:-1].tolist()

# Create radar chart
angles = np.linspace(0, 2 * np.pi, len(dimensions), endpoint=False).tolist()
scores_plot = scores + [scores[0]]  # Close the polygon
angles += angles[:1]

ax.plot(angles, scores_plot, 'o-', linewidth=2, color='steelblue')
ax.fill(angles, scores_plot, alpha=0.25, color='steelblue')
ax.set_xticks(angles[:-1])
ax.set_xticklabels(dimensions)
ax.set_ylim(0, 100)
ax.set_title('Data Quality Assessment Radar', size=14, y=1.1)

plt.tight_layout()
plt.show()

# Print overall score interpretation
overall_score = quality_report[quality_report['Dimension'] == 'Overall']['Score'].values[0]
if overall_score >= 80:
    quality_grade = 'HIGH QUALITY - Suitable for trading signals'
elif overall_score >= 60:
    quality_grade = 'MEDIUM QUALITY - Use with caution'
else:
    quality_grade = 'LOW QUALITY - Not recommended for trading'

print(f"\nOverall Quality Score: {overall_score:.1f}/100")
print(f"Assessment: {quality_grade}")
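
The overall score in `full_assessment` awards up to 25 points per dimension. The standalone sketch below reproduces that arithmetic with hypothetical inputs so each contribution can be checked by hand. The function name `composite_quality_score` and the input values are assumptions for illustration only.

```python
def composite_quality_score(coverage_ratio: float, lag_days: float,
                            completeness_pct: float, is_stable: bool) -> float:
    """Mirror of the overall-score arithmetic in full_assessment (max 100)."""
    coverage_pts = min(coverage_ratio, 1) * 25
    timeliness_pts = (1 - min(lag_days / 30, 1)) * 25  # zero points at 30+ days of lag
    completeness_pts = completeness_pct / 100 * 25
    consistency_pts = (1 if is_stable else 0.5) * 25
    return coverage_pts + timeliness_pts + completeness_pts + consistency_pts


# Hypothetical inputs: 90% date coverage, 3-day lag, 98% complete, stable series
# 22.5 + 22.5 + 24.5 + 25 = 94.5
score = composite_quality_score(0.90, 3, 98.0, True)
print(f"{score:.1f}")
```

Note that an unstable series still earns 12.5 consistency points under this scheme, and a feed that is 30 or more days stale can reach at most 75 points.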

## 7. Combining Alternative Data Sources

### 7.1 Multi-Source Alpha Generation

In [None]:
class MultiSourceAlphaGenerator:
    """
    Combine multiple alternative data sources for alpha generation.
    """
    
    def __init__(self):
        self.source_weights = {
            'web_traffic': 0.25,
            'social_sentiment': 0.25,
            'transaction_data': 0.30,
            'satellite_data': 0.20
        }
    
    def generate_combined_signal(self, 
                                  web_signal: float,
                                  social_signal: float,
                                  txn_signal: float,
                                  satellite_signal: float) -> dict:
        """Generate combined alpha signal from multiple sources."""
        
        # Clip signals to [-1, 1] (inputs are assumed to be pre-scaled)
        signals = {
            'web_traffic': np.clip(web_signal, -1, 1),
            'social_sentiment': np.clip(social_signal, -1, 1),
            'transaction_data': np.clip(txn_signal, -1, 1),
            'satellite_data': np.clip(satellite_signal, -1, 1)
        }
        
        # Weighted combination
        combined = sum(
            signals[source] * weight 
            for source, weight in self.source_weights.items()
        )
        
        # Calculate signal strength (agreement across sources)
        signal_signs = [np.sign(s) for s in signals.values()]
        agreement = abs(sum(signal_signs)) / len(signal_signs)
        
        # Confidence score based on agreement
        confidence = agreement * abs(combined)
        
        return {
            'combined_signal': combined,
            'agreement': agreement,
            'confidence': confidence,
            'individual_signals': signals,
            'recommendation': 'LONG' if combined > 0.2 else 'SHORT' if combined < -0.2 else 'NEUTRAL'
        }
    
    def backtest_combined_signal(self, n_periods: int = 100) -> pd.DataFrame:
        """Backtest the combined signal approach."""
        
        results = []
        
        for i in range(n_periods):
            # Generate random signals (in reality, these come from actual data)
            web = np.random.normal(0, 0.3)
            social = np.random.normal(0, 0.4)
            txn = np.random.normal(0.05, 0.2)  # Slight positive bias
            satellite = np.random.normal(0, 0.25)
            
            signal = self.generate_combined_signal(web, social, txn, satellite)
            
            # Simulate forward return (correlated with signal)
            forward_return = (
                signal['combined_signal'] * 0.02 +  # Signal has predictive power
                np.random.normal(0, 0.02)  # Random noise
            )
            
            results.append({
                'period': i,
                'combined_signal': signal['combined_signal'],
                'confidence': signal['confidence'],
                'recommendation': signal['recommendation'],
                'forward_return': forward_return
            })
        
        return pd.DataFrame(results)


# Demonstrate multi-source alpha generation
alpha_gen = MultiSourceAlphaGenerator()

# Example signal combination
example_signal = alpha_gen.generate_combined_signal(
    web_signal=0.3,      # Positive web traffic momentum
    social_signal=0.5,   # Bullish social sentiment
    txn_signal=0.2,      # Above-average transaction growth
    satellite_signal=-0.1 # Slightly negative parking data
)

print("Multi-Source Alpha Signal Analysis")
print("="*50)
print(f"\nIndividual Signals:")
for source, value in example_signal['individual_signals'].items():
    print(f"  {source}: {value:+.2f}")
print(f"\nCombined Signal: {example_signal['combined_signal']:+.3f}")
print(f"Source Agreement: {example_signal['agreement']:.2%}")
print(f"Confidence Score: {example_signal['confidence']:.3f}")
print(f"Recommendation: {example_signal['recommendation']}")
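
The example's numbers can be verified by hand. The sketch below repeats the weighted sum, agreement, and confidence calculations outside the class, using the same weights and inputs as the example above.

```python
import numpy as np

weights = {'web_traffic': 0.25, 'social_sentiment': 0.25,
           'transaction_data': 0.30, 'satellite_data': 0.20}
signals = {'web_traffic': 0.3, 'social_sentiment': 0.5,
           'transaction_data': 0.2, 'satellite_data': -0.1}

# Weighted sum: 0.075 + 0.125 + 0.060 - 0.020 = 0.240
combined = sum(signals[k] * weights[k] for k in weights)

# Three sources are positive and one is negative: |1 + 1 + 1 - 1| / 4 = 0.5
agreement = abs(sum(np.sign(v) for v in signals.values())) / len(signals)

# Confidence scales the combined signal by the agreement: 0.5 * 0.24 = 0.12
confidence = agreement * abs(combined)

print(f"combined={combined:.3f}, agreement={agreement:.2f}, confidence={confidence:.3f}")
```

Since 0.24 exceeds the 0.2 threshold, the recommendation is LONG. A 2-2 split between positive and negative sources gives an agreement of 0, and therefore a confidence of 0, regardless of how large the individual signals are.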

In [None]:
# Backtest the combined signal
backtest_results = alpha_gen.backtest_combined_signal(n_periods=200)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Signal vs Return scatter
ax1 = axes[0, 0]
ax1.scatter(backtest_results['combined_signal'], 
            backtest_results['forward_return'] * 100,
            alpha=0.5, c=backtest_results['confidence'], cmap='viridis')
ax1.axhline(0, color='black', linewidth=0.5)
ax1.axvline(0, color='black', linewidth=0.5)

# Add regression line
z = np.polyfit(backtest_results['combined_signal'], backtest_results['forward_return'] * 100, 1)
p = np.poly1d(z)
x_line = np.linspace(backtest_results['combined_signal'].min(), backtest_results['combined_signal'].max(), 100)
ax1.plot(x_line, p(x_line), 'r--', label=f'Regression (slope={z[0]:.2f})')
ax1.set_title('Signal vs Forward Return')
ax1.set_xlabel('Combined Signal')
ax1.set_ylabel('Forward Return (%)')
ax1.legend()

# Return by recommendation
ax2 = axes[0, 1]
rec_returns = backtest_results.groupby('recommendation')['forward_return'].mean() * 100
colors = ['green' if x > 0 else 'red' for x in rec_returns.values]
ax2.bar(rec_returns.index, rec_returns.values, color=colors, edgecolor='black')
ax2.axhline(0, color='black', linewidth=0.5)
ax2.set_title('Average Return by Recommendation')
ax2.set_xlabel('Recommendation')
ax2.set_ylabel('Avg Return (%)')

# Cumulative returns
ax3 = axes[1, 0]
backtest_results['strategy_return'] = np.sign(backtest_results['combined_signal']) * backtest_results['forward_return']
backtest_results['cum_strategy'] = (1 + backtest_results['strategy_return']).cumprod()
backtest_results['cum_market'] = (1 + backtest_results['forward_return']).cumprod()

ax3.plot(backtest_results['period'], backtest_results['cum_strategy'], label='Strategy', linewidth=2)
ax3.plot(backtest_results['period'], backtest_results['cum_market'], label='Buy & Hold', linewidth=2, linestyle='--')
ax3.set_title('Cumulative Returns: Strategy vs Buy & Hold')
ax3.set_xlabel('Period')
ax3.set_ylabel('Cumulative Return')
ax3.legend()

# Confidence distribution
ax4 = axes[1, 1]
ax4.hist(backtest_results['confidence'], bins=30, edgecolor='black', alpha=0.7)
ax4.axvline(backtest_results['confidence'].mean(), color='red', linestyle='--', 
            label=f"Mean: {backtest_results['confidence'].mean():.3f}")
ax4.set_title('Confidence Score Distribution')
ax4.set_xlabel('Confidence')
ax4.set_ylabel('Frequency')
ax4.legend()

plt.tight_layout()
plt.show()

# Print performance metrics
# Note: the sqrt(252) annualization assumes each simulated period is one trading day
print("\nBacktest Performance Metrics:")
print(f"Strategy Sharpe Ratio: {backtest_results['strategy_return'].mean() / backtest_results['strategy_return'].std() * np.sqrt(252):.2f}")
print(f"Hit Rate: {(np.sign(backtest_results['combined_signal']) == np.sign(backtest_results['forward_return'])).mean():.1%}")
print(f"Information Coefficient: {stats.pearsonr(backtest_results['combined_signal'], backtest_results['forward_return'])[0]:.3f}")
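
Pearson correlation can be dominated by a few extreme observations, so a rank-based (Spearman) information coefficient is a common complement. The sketch below computes a rank IC and the approximate t-statistic t = IC * sqrt((n - 2) / (1 - IC^2)), which assumes independent periods. The helper name `rank_ic` and the synthetic data are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def rank_ic(signal, forward_return) -> dict:
    """Spearman rank IC with an approximate t-statistic (assumes independent periods)."""
    signal = np.asarray(signal, dtype=float)
    forward_return = np.asarray(forward_return, dtype=float)
    n = len(signal)
    ic, p_value = stats.spearmanr(signal, forward_return)
    # Undefined when |ic| == 1; not handled in this sketch
    t_stat = ic * np.sqrt((n - 2) / (1 - ic ** 2))
    return {'rank_ic': ic, 't_stat': t_stat, 'p_value': p_value, 'n': n}


# Synthetic data only: a weak linear signal buried in noise
rng = np.random.default_rng(0)
sig = rng.normal(0, 0.2, size=500)
ret = 0.02 * sig + rng.normal(0, 0.02, size=500)
result = rank_ic(sig, ret)
print({k: round(float(v), 3) for k, v in result.items()})
```

On real data, overlapping return windows and autocorrelated signals violate the independence assumption, so the t-statistic should be read as a rough guide rather than a formal test.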

## 8. Regulatory & Ethical Considerations

### Key Compliance Areas

1. **Material Non-Public Information (MNPI)**
   - Alternative data must not constitute insider information
   - Data providers should obtain proper consent and deliver data in aggregated form

2. **Personal Data Protection**
   - GDPR (Europe), CCPA (California)
   - Proper anonymization and aggregation required

3. **Data Licensing**
   - Ensure proper rights to use data for trading
   - Redistribution restrictions

4. **Web Scraping Legality**
   - Terms of Service compliance
   - Computer Fraud and Abuse Act considerations
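
One small, automatable piece of the web scraping item is checking a site's `robots.txt` before crawling. The sketch below uses the standard library's `urllib.robotparser` on a hypothetical `robots.txt`, parsed locally so no network access is needed. The rules, the `research-bot` user agent, and the `example.com` URLs are assumptions for illustration. Passing this check is not a legal determination; Terms of Service and applicable law still need separate review.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally so the example needs no network access
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

user_agent = "research-bot"  # hypothetical crawler name
for path in ["/jobs/listing", "/private/reports"]:
    allowed = parser.can_fetch(user_agent, f"https://example.com{path}")
    print(f"{path}: {'allowed' if allowed else 'disallowed'}")

# Requested minimum delay between requests, if the site declares one
print("Crawl delay (s):", parser.crawl_delay(user_agent))
```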

In [None]:
class ComplianceChecker:
    """
    Framework for checking alternative data compliance.
    """
    
    def __init__(self):
        self.compliance_checklist = {
            'data_source': [
                'Source has documented data collection methodology',
                'Data is properly anonymized/aggregated',
                'No direct individual identification possible',
                'Source has necessary licenses and permissions'
            ],
            'mnpi_risk': [
                'Data is not obtained from company insiders',
                'Information is derived from public observations',
                'Data is available to any willing buyer',
                'No special access or relationships involved'
            ],
            'usage_rights': [
                'License permits use for trading signals',
                'Redistribution terms are understood',
                'Creation of derivative works is permitted',
                'Data retention policy is compliant'
            ],
            'privacy': [
                'GDPR requirements satisfied (if EU data)',
                'CCPA requirements satisfied (if CA data)',
                'Consent obtained for data collection',
                'Data minimization principles followed'
            ]
        }
    
    def generate_checklist(self, data_type: str) -> pd.DataFrame:
        """Generate compliance checklist for a data type."""
        
        rows = []
        for category, items in self.compliance_checklist.items():
            for item in items:
                rows.append({
                    'Category': category.replace('_', ' ').title(),
                    'Requirement': item,
                    'Status': '☐ Pending',
                    'Notes': ''
                })
        
        return pd.DataFrame(rows)


# Generate compliance checklist
compliance = ComplianceChecker()
checklist = compliance.generate_checklist('web_scraping')

print("\n" + "="*70)
print("ALTERNATIVE DATA COMPLIANCE CHECKLIST")
print("="*70)
print(checklist[['Category', 'Requirement', 'Status']].to_string(index=False))

## 9. Key Takeaways & Best Practices

### Alternative Data Integration Framework

1. **Data Sourcing**
   - Evaluate multiple providers for each data type
   - Conduct thorough due diligence on data provenance
   - Ensure compliance with all regulations

2. **Data Quality**
   - Implement robust quality assessment frameworks
   - Monitor for coverage changes and data drift
   - Validate against known benchmarks

3. **Signal Construction**
   - Combine multiple orthogonal data sources
   - Account for signal decay and crowding
   - Implement proper backtesting methodology

4. **Production Considerations**
   - Build robust data pipelines with monitoring
   - Handle missing data and delays gracefully
   - Document all data transformations
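
As one possible way to handle missing data and delays gracefully, the sketch below forward-fills a signal for a limited number of periods and records how stale each value is, so downstream code can down-weight or ignore old observations. The helper name `fill_with_staleness_limit` and the 3-period limit are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def fill_with_staleness_limit(series: pd.Series, max_stale: int = 3) -> pd.DataFrame:
    """Forward-fill gaps for at most `max_stale` periods and record each value's age."""
    observed = series.notna()
    # Each real observation starts a new group; age counts periods since that observation
    group = observed.cumsum()
    age = series.groupby(group).cumcount()
    age = age.where(group > 0)  # undefined before the first observation
    filled = series.ffill(limit=max_stale)
    return pd.DataFrame({'value': filled, 'age': age, 'is_observed': observed})


# Synthetic daily signal with a short gap (2 days) and a long gap (5 days)
idx = pd.date_range('2024-01-01', periods=10, freq='D')
raw = pd.Series([1.0, np.nan, np.nan, 2.0, np.nan, np.nan, np.nan, np.nan, np.nan, 3.0],
                index=idx)
out = fill_with_staleness_limit(raw, max_stale=3)
print(out)
```

The short gap is filled completely, while the long gap is filled for three days and then left missing, which forces the strategy to treat the signal as unavailable rather than trading on stale values.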

### Common Pitfalls to Avoid

- **Survivorship bias**: Ensure historical data includes failed companies
- **Look-ahead bias**: Respect data availability timestamps
- **Overfitting**: Use out-of-sample validation rigorously
- **Signal decay**: Monitor for alpha erosion as data becomes commoditized
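
Look-ahead bias often enters when alternative data is joined on the period it describes rather than the time it was delivered. The sketch below shows a point-in-time join with `pd.merge_asof`, assuming each record carries an `available_at` delivery timestamp. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical alt-data records: the period each value describes and when it was delivered
alt = pd.DataFrame({
    'period_end': pd.to_datetime(['2024-01-07', '2024-01-14', '2024-01-21']),
    'available_at': pd.to_datetime(['2024-01-10', '2024-01-17', '2024-01-24']),
    'signal': [0.1, 0.4, -0.2],
}).sort_values('available_at')

# Dates on which a trading decision is made
decisions = pd.DataFrame({
    'trade_date': pd.to_datetime(['2024-01-09', '2024-01-15', '2024-01-18', '2024-01-25'])
})

# Backward as-of join on the delivery timestamp: each trade date only sees
# the latest record that had already been delivered.
pit = pd.merge_asof(decisions, alt, left_on='trade_date', right_on='available_at',
                    direction='backward')
print(pit[['trade_date', 'period_end', 'signal']])
```

The first trade date has no signal because nothing had been delivered yet. Joining on `period_end` instead would have assigned it the first record a day before that record actually arrived.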

In [None]:
# Summary of alternative data landscape
alt_data_summary = pd.DataFrame({
    'Data Type': ['Satellite/Geo', 'Web Traffic', 'Social Media', 'Transaction', 'App Usage', 'Sensor/IoT'],
    'Latency': ['Days-Weeks', 'Hours-Days', 'Minutes-Hours', 'Days', 'Days-Weeks', 'Real-time'],
    'Cost': ['$$$$$', '$$$', '$$', '$$$$', '$$$', '$$$$'],
    'Alpha Decay': ['Medium', 'High', 'Very High', 'Medium', 'High', 'Low'],
    'Coverage': ['Global', 'Online only', 'Tech-savvy', 'Card users', 'App users', 'Specific sectors'],
    'Complexity': ['High', 'Medium', 'Medium', 'Low', 'Low', 'High']
})

print("\n" + "="*80)
print("ALTERNATIVE DATA LANDSCAPE SUMMARY")
print("="*80)
print(alt_data_summary.to_string(index=False))

print("\n" + "="*80)
print("KEY VENDORS BY DATA TYPE")
print("="*80)

vendors = {
    'Satellite': ['Orbital Insight', 'RS Metrics', 'Descartes Labs', 'Planet Labs'],
    'Web/Social': ['SimilarWeb', 'Thinknum', 'Eagle Alpha', 'Quandl'],
    'Transaction': ['Bloomberg Second Measure', 'Earnest Research'],
    'Shipping': ['MarineTraffic', 'Kpler', 'Spire Global', 'VesselsValue']
}

for category, provider_list in vendors.items():
    print(f"\n{category}: {', '.join(provider_list)}")

---

## Practice Exercises

1. **Satellite Data Analysis**: Extend the parking lot analyzer to include weather data adjustments
2. **Web Scraping Project**: Build a scraper for job postings to predict company growth
3. **Multi-Source Integration**: Combine 3+ alternative data sources for a single stock
4. **Quality Framework**: Implement automated data quality alerts
5. **Compliance Review**: Create a compliance checklist for a new data source

---

## Further Reading

- "Alternative Data: A Guide to Intelligent Use" - CFA Institute
- "The Rise of Alternative Data" - JPMorgan Quantitative Research
- "Big Data and Machine Learning in Quantitative Investment" - Tony Guida
- Alternative Data sources: [alternativedata.org](https://alternativedata.org)