# Data Collection: Kenya Facilities + 7-Day Weather Forecasts

## Objective
Collect the complete dataset for MVP Plan 2 (Temporal Prediction):
1. Fetch 50-100 Kenya health facilities (Healthsites.io API)
2. Fetch 7-day weather forecasts for each facility (OpenWeatherMap)
3. Prepare data for daily failure prediction model

**Output**: `data/processed/facilities_with_daily_weather.csv`

In [None]:
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import time
from tqdm import tqdm

# Our custom modules
from weather_api import WeatherAPI
from facility_data_loader import KenyaFacilityLoader

print("✓ Imports successful")

## Step 1: Fetch Kenya Health Facilities

Using Healthsites.io API (free, no authentication needed)

In [None]:
# Initialize facility loader
loader = KenyaFacilityLoader()

# Fetch facilities from Healthsites.io
print("Fetching Kenya health facilities...")
facilities_raw = loader.fetch_from_healthsites(country="Kenya", limit=150)

print(f"\n✓ Fetched {len(facilities_raw)} facilities")
print(f"\nSample data:")
facilities_raw.head()

In [None]:
# Check facility types distribution
print("Facility types:")
print(facilities_raw['facility_type'].value_counts())

print("\nFacilities by completeness:")
print(facilities_raw['completeness'].describe())

In [None]:
# Clean and prepare facility data
facilities_clean = loader.prepare_for_model(facilities_raw)

# Rank facilities by data completeness (most complete first).
# Note: this does not enforce geographic diversity; the spread is only reported below.
facilities_clean = facilities_clean.sort_values('completeness', ascending=False)

# Take top 100 by completeness
facilities = facilities_clean.head(100).copy()

print(f"\n✓ Selected {len(facilities)} facilities for analysis")
print(f"\nGeographic spread:")
print(f"  Latitude range: {facilities['latitude'].min():.2f} to {facilities['latitude'].max():.2f}")
print(f"  Longitude range: {facilities['longitude'].min():.2f} to {facilities['longitude'].max():.2f}")

facilities.head()

## Step 2: Fetch 7-Day Weather Forecasts

For each facility, get daily weather forecast (temperature, clouds, humidity, etc.)

**Note**: This makes one API call per facility (up to 100), which fits within the free tier (1000 calls/day).
Expect roughly 3-5 minutes in total, including the 0.5-second delay between calls.

In [None]:
# Initialize weather API
weather_api = WeatherAPI()

print("Weather API initialized")
print(f"Will fetch forecasts for {len(facilities)} facilities")
print("Estimated time: 3-5 minutes (with rate limiting)\n")

In [None]:
# Test with one facility first
test_facility = facilities.iloc[0]
print(f"Testing with: {test_facility['name']}")
print(f"Location: ({test_facility['latitude']}, {test_facility['longitude']})\n")

test_forecast = weather_api.get_7day_forecast(
    lat=test_facility['latitude'],
    lon=test_facility['longitude']
)

if test_forecast:
    print("✓ Weather API working!")
    test_df = weather_api.parse_forecast(test_forecast)
    print(f"\nForecast data (7 days):")
    print(test_df[['date', 'temp_max', 'temp_min', 'clouds', 'humidity']].to_string())
else:
    print("✗ API call failed - check your API key in .env file")

In [None]:
# Fetch forecasts for all facilities
print("Fetching weather forecasts for all facilities...")
print("This may take 3-5 minutes.\n")

weather_data = []
failed_facilities = []

for idx, facility in tqdm(facilities.iterrows(), total=len(facilities), desc="Fetching forecasts"):
    try:
        # Get forecast
        forecast = weather_api.get_7day_forecast(
            lat=facility['latitude'],
            lon=facility['longitude']
        )
        
        if forecast:
            df_forecast = weather_api.parse_forecast(forecast)
            
            # Add facility info to each day's forecast
            df_forecast['facility_id'] = facility['facility_id']
            df_forecast['facility_name'] = facility['name']
            df_forecast['latitude'] = facility['latitude']
            df_forecast['longitude'] = facility['longitude']
            df_forecast['facility_type'] = facility['facility_type']
            df_forecast['power_source'] = facility['power_source']
            
            weather_data.append(df_forecast)
        else:
            failed_facilities.append(facility['facility_id'])
            
        # Rate limiting: small delay to avoid overwhelming API
        time.sleep(0.5)  # 0.5 second delay between calls
        
    except Exception as e:
        print(f"\nError for {facility['name']}: {e}")
        failed_facilities.append(facility['facility_id'])
        continue

print(f"\n✓ Completed!")
print(f"  Successful: {len(weather_data)} facilities")
print(f"  Failed: {len(failed_facilities)} facilities")

In [None]:
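# Combine all weather data (pd.concat raises on an empty list, so fail early with a clear message)
if not weather_data:
    raise RuntimeError("No forecasts were fetched - check the API key and network connection")
df_weather_long = pd.concat(weather_data, ignore_index=True)

print(f"Weather data shape: {df_weather_long.shape}")
print(f"  {df_weather_long['facility_id'].nunique()} unique facilities")
print(f"  {len(df_weather_long)} total rows (facilities × 7 days)")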

print("\nSample weather data:")
df_weather_long.head(10)

## Step 3: Reshape for MVP Plan 2 (Daily Features)

For temporal prediction, we need features structured as:
- `temp_day1`, `temp_day2`, ..., `temp_day7`
- `clouds_day1`, `clouds_day2`, ..., `clouds_day7`
- etc.

This allows the model to see the daily progression of weather.
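
To make the target layout concrete, here is a small sketch of the long-to-wide reshape on toy data. The values and the two-day horizon are invented for illustration; the real data has seven days per facility and more weather variables.

```python
import pandas as pd

# Toy long-format data: one row per facility per day (values are made up).
toy_long = pd.DataFrame({
    "facility_id": ["A", "A", "B", "B"],
    "day_num": [1, 2, 1, 2],
    "temp_max": [30.1, 31.4, 27.0, 26.5],
    "clouds": [20, 45, 80, 75],
})

# Pivot several value columns at once; the result has (feature, day) column pairs.
toy_wide = toy_long.pivot(
    index="facility_id",
    columns="day_num",
    values=["temp_max", "clouds"],
)

# Flatten the column pairs into names like temp_max_day1, clouds_day2.
toy_wide.columns = [f"{feature}_day{day}" for feature, day in toy_wide.columns]
toy_wide = toy_wide.reset_index()

print(toy_wide)
```

Each facility ends up as a single row, with one column per feature per day.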

In [None]:
# Add day number (1-7)
df_weather_long = df_weather_long.sort_values(['facility_id', 'date'])
df_weather_long['day_num'] = df_weather_long.groupby('facility_id').cumcount() + 1

print("Sample with day numbers:")
print(df_weather_long[['facility_name', 'date', 'day_num', 'temp_max', 'clouds']].head(14))

In [None]:
# Pivot to wide format (one row per facility, columns for each day)
weather_features = ['temp_max', 'temp_min', 'temp_day', 'clouds', 'humidity', 'wind_speed']

# Create wide format DataFrame
facilities_wide = facilities[['facility_id', 'name', 'latitude', 'longitude', 'facility_type', 'power_source']].copy()

# Pivot each weather variable
for feature in weather_features:
    pivot = df_weather_long.pivot(
        index='facility_id',
        columns='day_num',
        values=feature
    )
    
    # Rename columns to feature_day1, feature_day2, etc.
    pivot.columns = [f'{feature}_day{day}' for day in pivot.columns]
    
    # Merge with facilities
    facilities_wide = facilities_wide.merge(pivot, left_on='facility_id', right_index=True, how='left')

print(f"\n✓ Wide format created: {facilities_wide.shape}")
print(f"  {len(facilities_wide)} facilities")
print(f"  {len(facilities_wide.columns)} columns (facility info + daily weather features)")

print("\nSample wide format (first 2 facilities, selected columns):")
sample_cols = ['name', 'power_source', 'temp_max_day1', 'temp_max_day2', 'temp_max_day3', 
               'clouds_day1', 'clouds_day2', 'clouds_day3']
print(facilities_wide[sample_cols].head(2).to_string())

## Step 4: Add Temporal Features

Month and season information for model

In [None]:
# Add current month
current_month = datetime.now().month
facilities_wide['month'] = current_month

# Kenya seasons:
# Dry: Jan-Mar, Jun-Oct
# Rainy: Apr-May, Nov-Dec
dry_season_months = [1, 2, 3, 6, 7, 8, 9, 10]
rainy_season_months = [4, 5, 11, 12]

# `in` returns a plain Python bool, which has no .astype(); convert with int() instead
facilities_wide['is_dry_season'] = int(current_month in dry_season_months)
facilities_wide['is_rainy_season'] = int(current_month in rainy_season_months)

print(f"Current month: {current_month}")
print(f"Season: {'Dry' if current_month in dry_season_months else 'Rainy'}")

print("\nTemporal features added:")
print(facilities_wide[['name', 'month', 'is_dry_season', 'is_rainy_season']].head())
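
The season lookup can also be wrapped in a small helper so the same logic is reusable for any month, not only the current one. This is a sketch: `season_flags` is a hypothetical name, and the month groupings are copied from the cell above.

```python
# Month groupings as used in this notebook (simplified Kenya seasons).
DRY_SEASON_MONTHS = {1, 2, 3, 6, 7, 8, 9, 10}
RAINY_SEASON_MONTHS = {4, 5, 11, 12}


def season_flags(month: int) -> dict:
    """Return the month plus 0/1 season indicators for a month number (1-12)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return {
        "month": month,
        "is_dry_season": int(month in DRY_SEASON_MONTHS),
        "is_rainy_season": int(month in RAINY_SEASON_MONTHS),
    }


print(season_flags(4))
print(season_flags(7))
```

Because every month belongs to exactly one of the two sets, the two flags always sum to 1, so a model only needs one of them.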

## Step 5: Calculate Derived Features

Useful features for the model:
- Days with temp > 35°C
- Days with temp > 38°C
- Average temperature across 7 days
- Maximum temperature across 7 days
- Cloudy days (>60% cloud cover)
- Heat wave indicator
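
As a worked example of the hot-day count and the heat wave rule (3 or more consecutive days above 35°C), the sketch below runs both on one invented week of maximum temperatures. `longest_hot_run` is a hypothetical helper name; missing values are treated as breaking a run, which is an assumption rather than something the notebook specifies.

```python
import math


def longest_hot_run(temps, threshold=35.0):
    """Length of the longest run of consecutive days strictly above `threshold`."""
    longest = current = 0
    for temp in temps:
        is_missing = temp is None or math.isnan(temp)
        if not is_missing and temp > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a cooler or missing day breaks the run
    return longest


# Invented daily maxima for one facility (°C).
week = [33.0, 36.2, 36.8, 35.5, 34.0, 36.1, 37.0]

hot_days = sum(temp > 35 for temp in week)
heat_wave = int(longest_hot_run(week) >= 3)

print(hot_days, longest_hot_run(week), heat_wave)
```

Here five days exceed 35°C, but the longest unbroken stretch is three days (days 2 to 4), which is just enough to set the indicator.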

In [None]:
# Calculate aggregate features
temp_cols = [f'temp_max_day{i}' for i in range(1, 8)]
clouds_cols = [f'clouds_day{i}' for i in range(1, 8)]

# Temperature features
facilities_wide['max_temp_7d'] = facilities_wide[temp_cols].max(axis=1)
facilities_wide['min_temp_7d'] = facilities_wide[temp_cols].min(axis=1)
facilities_wide['avg_temp_7d'] = facilities_wide[temp_cols].mean(axis=1)
facilities_wide['temp_above_35_days'] = (facilities_wide[temp_cols] > 35).sum(axis=1)
facilities_wide['temp_above_38_days'] = (facilities_wide[temp_cols] > 38).sum(axis=1)

# Cloud features
facilities_wide['avg_cloud_cover_7d'] = facilities_wide[clouds_cols].mean(axis=1)
facilities_wide['cloudy_days'] = (facilities_wide[clouds_cols] > 60).sum(axis=1)

# Heat wave indicator (3+ consecutive days > 35°C)
def detect_heat_wave(row):
    temps = [row[f'temp_max_day{i}'] for i in range(1, 8)]
    consecutive = 0
    max_consecutive = 0
    for temp in temps:
        if temp > 35:
            consecutive += 1
            max_consecutive = max(max_consecutive, consecutive)
        else:
            consecutive = 0
    return 1 if max_consecutive >= 3 else 0

facilities_wide['heat_wave_indicator'] = facilities_wide.apply(detect_heat_wave, axis=1)

print("\nDerived features:")
print(facilities_wide[['name', 'max_temp_7d', 'avg_temp_7d', 'temp_above_35_days', 
                      'heat_wave_indicator', 'cloudy_days']].head())

## Step 6: Data Summary & Quality Check

In [None]:
print("=" * 60)
print("DATA COLLECTION SUMMARY")
print("=" * 60)

print(f"\nFacilities: {len(facilities_wide)}")
print(f"Features: {len(facilities_wide.columns)}")

print(f"\nFacility types:")
print(facilities_wide['facility_type'].value_counts())

print(f"\nPower sources (estimated):")
print(facilities_wide['power_source'].value_counts())

print(f"\nWeather summary (7-day forecast):")
print(f"  Max temperature: {facilities_wide['max_temp_7d'].max():.1f}°C")
print(f"  Min temperature: {facilities_wide['min_temp_7d'].min():.1f}°C")
print(f"  Avg temperature: {facilities_wide['avg_temp_7d'].mean():.1f}°C")
print(f"  Facilities with heat wave forecast: {facilities_wide['heat_wave_indicator'].sum()}")
print(f"  Facilities with >3 days above 35°C: {(facilities_wide['temp_above_35_days'] > 3).sum()}")

print(f"\nMissing values:")
missing = facilities_wide.isnull().sum()
if missing.sum() == 0:
    print("  ✓ No missing values")
else:
    print(missing[missing > 0])
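
A complementary check is whether every facility actually received a full set of forecast days, since a short forecast shows up later as missing `*_day6` or `*_day7` columns. The sketch below assumes a long-format frame with `facility_id` and `date` columns, as in `df_weather_long`; `facilities_missing_days` is a hypothetical helper and the data is invented.

```python
import pandas as pd


def facilities_missing_days(df_long: pd.DataFrame, expected_days: int = 7) -> list:
    """Return facility IDs that have fewer than `expected_days` distinct dates."""
    days_per_facility = df_long.groupby("facility_id")["date"].nunique()
    return days_per_facility[days_per_facility < expected_days].index.tolist()


# Toy data: facility A has a full week, facility B only five days.
toy_dates = pd.date_range("2024-01-01", periods=7, freq="D")
toy_forecasts = pd.DataFrame({
    "facility_id": ["A"] * 7 + ["B"] * 5,
    "date": list(toy_dates) + list(toy_dates[:5]),
})

incomplete = facilities_missing_days(toy_forecasts)
print(incomplete)
```

Facilities with no forecast at all never appear in the long frame, so they would not be listed here; those are tracked separately in `failed_facilities`.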

## Step 7: Save Dataset

In [None]:
import os

# Save wide format (for modeling)
output_path_wide = '../data/processed/facilities_with_daily_weather.csv'
os.makedirs(os.path.dirname(output_path_wide), exist_ok=True)  # create the folder if it is missing
facilities_wide.to_csv(output_path_wide, index=False)
print(f"✓ Saved wide format to: {output_path_wide}")
print(f"  Shape: {facilities_wide.shape}")

# Also save long format (for visualization)
output_path_long = '../data/processed/weather_forecasts_long.csv'
df_weather_long.to_csv(output_path_long, index=False)
print(f"\n✓ Saved long format to: {output_path_long}")
print(f"  Shape: {df_weather_long.shape}")

# Save facility list only
facilities_only = facilities[['facility_id', 'name', 'latitude', 'longitude', 
                             'facility_type', 'power_source']].copy()
output_path_facilities = '../data/processed/kenya_facilities.csv'
facilities_only.to_csv(output_path_facilities, index=False)
print(f"\n✓ Saved facility list to: {output_path_facilities}")
print(f"  Shape: {facilities_only.shape}")

## Step 8: Quick Visualization Preview

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Temperature distribution
axes[0, 0].hist(facilities_wide['max_temp_7d'], bins=20, color='red', alpha=0.7, edgecolor='black')
axes[0, 0].axvline(35, color='orange', linestyle='--', linewidth=2, label='35°C threshold')
axes[0, 0].axvline(38, color='darkred', linestyle='--', linewidth=2, label='38°C threshold')
axes[0, 0].set_xlabel('Maximum Temperature (°C)', fontsize=11)
axes[0, 0].set_ylabel('Number of Facilities', fontsize=11)
axes[0, 0].set_title('Distribution of Max Temperatures (7-day)', fontsize=12, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# 2. Power source distribution
power_counts = facilities_wide['power_source'].value_counts()
axes[0, 1].bar(range(len(power_counts)), power_counts.values, color=['green', 'orange', 'blue', 'red'])
axes[0, 1].set_xticks(range(len(power_counts)))
axes[0, 1].set_xticklabels(power_counts.index, rotation=45, ha='right')
axes[0, 1].set_ylabel('Number of Facilities', fontsize=11)
axes[0, 1].set_title('Power Source Distribution (Estimated)', fontsize=12, fontweight='bold')
axes[0, 1].grid(axis='y', alpha=0.3)

# 3. Cloud cover vs temperature
axes[1, 0].scatter(facilities_wide['avg_cloud_cover_7d'], facilities_wide['avg_temp_7d'], 
                   alpha=0.6, s=50, c=facilities_wide['heat_wave_indicator'], 
                   cmap='YlOrRd', edgecolors='black', linewidth=0.5)
axes[1, 0].set_xlabel('Average Cloud Cover (%)', fontsize=11)
axes[1, 0].set_ylabel('Average Temperature (°C)', fontsize=11)
axes[1, 0].set_title('Cloud Cover vs Temperature\n(Red = Heat Wave)', fontsize=12, fontweight='bold')
axes[1, 0].grid(alpha=0.3)

# 4. Hot days distribution
hot_days = facilities_wide['temp_above_35_days'].value_counts().sort_index()
axes[1, 1].bar(hot_days.index, hot_days.values, color='coral', edgecolor='black')
axes[1, 1].set_xlabel('Days with Temp > 35°C', fontsize=11)
axes[1, 1].set_ylabel('Number of Facilities', fontsize=11)
axes[1, 1].set_title('Hot Days Distribution (7-day forecast)', fontsize=12, fontweight='bold')
axes[1, 1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('../outputs/figures/data_collection_summary.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Visualization saved to outputs/figures/data_collection_summary.png")

## ✅ Data Collection Complete!

### What We Have:
- ✅ 100 Kenya health facilities with GPS coordinates
- ✅ 7-day weather forecasts for each facility
- ✅ Daily temperature, cloud cover, humidity data
- ✅ Power source estimates
- ✅ Temporal features (month, season)
- ✅ Derived features (heat wave, hot days, etc.)

### Datasets Saved:
1. `data/processed/facilities_with_daily_weather.csv` - Wide format for modeling
2. `data/processed/weather_forecasts_long.csv` - Long format for visualization
3. `data/processed/kenya_facilities.csv` - Facility list

### Next Steps:
1. **EDA (Notebook 02)**: Explore weather patterns, facility distribution
2. **Feature Engineering (Notebook 02)**: Create 7 target variables (failure_day1 - failure_day7)
3. **Model Training (Notebook 03)**: Train multi-output classifier
4. **Demo (Notebook 04)**: Interactive heatmap + predictions

**You're now ready for Week 2: EDA & Feature Engineering! 🚀**