# 📊 Training Data Generation from Satellite Data

**Objective:** Generate complete training dataset with engineered features.

**Process:**
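1. Load raw satellite data (one row per location; 180 locations in the run below)
2. Generate features for each location × crop pair across 8 months, March through October, to simulate seasonality (180 × 15 × 8 = 21,600 samples)
3. Calculate risk scores based on crop-location-season matching
4. Save enriched training data with 30 columns (28 features plus `risk_score` and `risk_category`)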

**Output:** `data/processed/training_data_satellite_enriched.csv`

In [7]:
# Import libraries
import pandas as pd
import numpy as np
from pathlib import Path
import sys
from tqdm import tqdm

# Add project paths
sys.path.append(str(Path.cwd().parent / "src"))
sys.path.append(str(Path.cwd().parent / "data" / "raw"))

# Import feature engineering functions from notebook 02
%run 02_feature_engineering.ipynb

from crops_database import CROPS, get_crop_info

print("✓ Libraries and functions imported")

✓ Libraries imported successfully
✓ Feature engineering functions defined
✓ Weather API function defined
📡 Using Open-Meteo API (free, no key required)
📊 Soil moisture estimated from precipitation and temperature
Testing weather API for Tashkent (41.3167, 69.2167)...

✅ Weather API working!
Current Temperature: 6.6°C
Current Precipitation: 0.0mm
Current Soil Moisture: 0.335
14-day Forecast Temp: 6.7°C
14-day Forecast Precip: 29.0mm
✓ Complete feature generation pipeline defined
Testing feature generation for: Tashkent City / Almazar
Testing with crop: cotton
Using real-time weather: Yes

GENERATED FEATURES:
  region                   : Tashkent City
  district                 : Almazar
  latitude                 : 41.3167
  longitude                : 69.2167
  climate_zone             : tashkent
  month                    : 6
  hist_temp_mean           : 4.871428571428572
  hist_precip_annual       : 253.6
  hist_soil_moisture       : 0.51618909636
  current_temp_mean

## 1. Risk Score Calculation Function

In [8]:
def calculate_risk_score(features):
    """
    Calculate risk score (0-100) based on all features.
    Higher score = better conditions, lower risk.
    
    Args:
        features: Dictionary of features
    
    Returns:
        float: Risk score (0-100)
    """
    score = 50  # Base score
    
    # Temperature matching (+/- 20 points)
    temp = features["current_temp_mean"]
    temp_min = features["crop_temp_min"]
    temp_max = features["crop_temp_max"]
    
    if temp_min <= temp <= temp_max:
        score += 20  # Optimal temperature
    elif temp < temp_min:
        penalty = min(20, abs(temp - temp_min) * 2)
        score -= penalty
    else:  # temp > temp_max
        penalty = min(20, abs(temp - temp_max) * 2)
        score -= penalty
    
    # Water availability (+15 / +10 / -15 points)
    precip = features["hist_precip_annual"]
    water_need = features["crop_water_need"]
    water_ratio = precip / water_need
    
    if water_ratio >= 0.8:
        score += 15
    elif water_ratio >= 0.5:
        score += 10
    else:
        score -= 15
    
    # Soil moisture (+/- 10 points)
    soil_moisture = features["current_soil_moisture"]
    min_moisture = features["crop_moisture_min"]
    
    if soil_moisture >= min_moisture:
        score += 10
    else:
        score -= 10
    
    # NDVI health indicator (+10 / +5 / -10 points)
    ndvi = features["ndvi_current"]
    if ndvi >= 0.4:
        score += 10
    elif ndvi >= 0.3:
        score += 5
    else:
        score -= 10
    
    # Regional suitability (+15 / -10 points)
    if features["region_suitable"] == 1:
        score += 15
    else:
        score -= 10
    
    # Seasonal suitability (+/- 15 points)
    if features["season_suitable"] == 1:
        score += 15
    else:
        score -= 15
    
    # Risk penalties
    if features["frost_risk"] == 1:
        score -= 20
    
    if features["drought_risk"] == 1:
        score -= 15
    
    # Clip to the 0-100 range and return a plain Python float, as documented
    return float(np.clip(score, 0, 100))


def get_risk_category(score):
    """Convert risk score to category."""
    if score >= 70:
        return "green"
    elif score >= 40:
        return "yellow"
    else:
        return "red"


print("✓ Risk scoring functions defined")

✓ Risk scoring functions defined
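
The point budget in `calculate_risk_score` can be tallied to see how much of the raw range the final clip removes. The sketch below copies the best-case and worst-case adjustment of each component from the function above into a `COMPONENTS` table and sums them. `COMPONENTS` and `BASE_SCORE` are hypothetical names introduced for illustration; they are not part of the notebook.

```python
# Best-case and worst-case adjustment of each scoring component,
# copied from calculate_risk_score above. COMPONENTS and BASE_SCORE are
# hypothetical helper names used only for this illustration.
BASE_SCORE = 50
COMPONENTS = {
    "temperature": (20, -20),
    "water_availability": (15, -15),
    "soil_moisture": (10, -10),
    "ndvi": (10, -10),
    "region_suitable": (15, -10),
    "season_suitable": (15, -15),
    "frost_risk": (0, -20),    # penalty only
    "drought_risk": (0, -15),  # penalty only
}

# Sum the best and worst adjustments on top of the base score.
raw_max = BASE_SCORE + sum(best for best, _ in COMPONENTS.values())
raw_min = BASE_SCORE + sum(worst for _, worst in COMPONENTS.values())

print(f"Raw score range before clipping: {raw_min} to {raw_max}")
```

With these values the raw score runs from -65 to 135, so the clip to 0-100 can saturate at both ends.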


## 2. Generate Training Data

In [9]:
# Load satellite data
satellite_df = pd.read_csv("../data/raw/satellite_data.csv")
print(f"Loaded satellite data: {satellite_df.shape}")
print(f"Locations: {len(satellite_df)}")
print(f"Crops: {len(CROPS)}")

# Generate training samples
# For each location × crop × selected months
training_samples = []

# Select key months to represent different seasons
months_to_generate = [3, 4, 5, 6, 7, 8, 9, 10]  # Spring, summer, autumn

print("\nGenerating training data...")
print(f"Locations: {len(satellite_df)}")
print(f"Crops: {len(CROPS)}")
print(f"Months: {len(months_to_generate)}")
print(f"Expected samples: {len(satellite_df) * len(CROPS) * len(months_to_generate):,}")

# Note: We'll use historical data (no real-time API calls during training data generation)
# This makes training data generation reproducible and fast

for idx, sat_row in tqdm(satellite_df.iterrows(), total=len(satellite_df), desc="Processing locations"):
    for crop_name in CROPS.keys():
        for month in months_to_generate:
            try:
                # Generate features (without real-time weather to keep it fast)
                features = generate_complete_features(
                    sat_row, 
                    crop_name, 
                    month=month, 
                    use_real_weather=False  # Use historical estimates for training data
                )
                
                # Calculate risk score
                risk_score = calculate_risk_score(features)
                features["risk_score"] = risk_score
                features["risk_category"] = get_risk_category(risk_score)
                
                training_samples.append(features)
                
            except Exception as e:
                # Report the failing combination, including the month, then move on
                print(f"Error processing {sat_row['region']}/{sat_row['district']}/{crop_name}/month {month}: {e}")
                continue

print(f"\n✓ Generated {len(training_samples):,} training samples")

Loaded satellite data: (180, 14)
Locations: 180
Crops: 15

Generating training data...
Locations: 180
Crops: 15
Months: 8
Expected samples: 21,600


Processing locations: 100%|██████████| 180/180 [00:00<00:00, 639.13it/s]


✓ Generated 21,600 training samples
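
The expected sample count is the size of the location × crop × month cross product. As a minimal sketch, the same enumeration can be written with `itertools.product`, collecting failures in a list instead of printing them, so the generated count can be compared with the expected one afterwards. The names `locations`, `crops`, and `build_sample` are hypothetical stand-ins for `satellite_df`, `CROPS`, and `generate_complete_features`; the simulated failure is invented purely to exercise the error path.

```python
from itertools import product

# Hypothetical stand-ins for satellite_df rows, CROPS keys, and the month list.
locations = ["district_a", "district_b", "district_c"]
crops = ["cotton", "wheat"]
months = [3, 4, 5, 6, 7, 8, 9, 10]


def build_sample(location, crop, month):
    """Hypothetical placeholder for generate_complete_features."""
    if crop == "wheat" and month == 10:
        raise ValueError("no seasonal profile")  # simulated failure
    return {"district": location, "crop": crop, "month": month}


samples, failures = [], []
for location, crop, month in product(locations, crops, months):
    try:
        samples.append(build_sample(location, crop, month))
    except ValueError as exc:
        # Keep the full key, including the month, so the failure can be traced.
        failures.append((location, crop, month, str(exc)))

expected = len(locations) * len(crops) * len(months)
print(f"Expected {expected}, generated {len(samples)}, failed {len(failures)}")
```

Comparing `len(samples) + len(failures)` with `expected` gives a quick check that no combination was skipped silently.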





In [10]:
# Convert to DataFrame
training_df = pd.DataFrame(training_samples)

print("="*80)
print("TRAINING DATA SUMMARY:")
print("="*80)
print(f"Total samples: {len(training_df):,}")
print(f"Features: {len(training_df.columns)}")
print(f"\nFeature columns:")
print(list(training_df.columns))

print(f"\n\nRisk score distribution:")
print(training_df["risk_score"].describe())

print(f"\n\nRisk category distribution:")
print(training_df["risk_category"].value_counts())

training_df.head()

TRAINING DATA SUMMARY:
Total samples: 21,600
Features: 30

Feature columns:
['region', 'district', 'latitude', 'longitude', 'climate_zone', 'month', 'hist_temp_mean', 'hist_precip_annual', 'hist_soil_moisture', 'current_temp_mean', 'current_precip', 'current_soil_moisture', 'forecast_temp_14d', 'forecast_precip_14d', 'frost_risk', 'drought_risk', 'ndvi_current', 'ndvi_forecast', 'crop', 'crop_category', 'crop_temp_min', 'crop_temp_max', 'crop_water_need', 'crop_moisture_min', 'crop_drought_sens', 'crop_frost_sens', 'region_suitable', 'season_suitable', 'risk_score', 'risk_category']


Risk score distribution:
count    21600.000000
mean        40.168111
std         24.520996
min          0.000000
25%         20.000000
50%         40.000000
75%         55.000000
max        100.000000
Name: risk_score, dtype: float64


Risk category distribution:
risk_category
yellow    10750
red        8411
green      2439
Name: count, dtype: int64


| | region | district | latitude | longitude | climate_zone | month | hist_temp_mean | hist_precip_annual | hist_soil_moisture | current_temp_mean | ... | crop_temp_min | crop_temp_max | crop_water_need | crop_moisture_min | crop_drought_sens | crop_frost_sens | region_suitable | season_suitable | risk_score | risk_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tashkent City | Almazar | 41.3167 | 69.2167 | tashkent | 3 | 24.7 | 253.6 | 0.516189 | 24.7 | ... | 20 | 35 | 700 | 0.3 | 0.5 | 0.8 | 0 | 0 | 10.0 | red |
| 1 | Tashkent City | Almazar | 41.3167 | 69.2167 | tashkent | 4 | 24.7 | 253.6 | 0.516189 | 24.7 | ... | 20 | 35 | 700 | 0.3 | 0.5 | 0.8 | 0 | 1 | 40.0 | yellow |
| 2 | Tashkent City | Almazar | 41.3167 | 69.2167 | tashkent | 5 | 24.7 | 253.6 | 0.516189 | 24.7 | ... | 20 | 35 | 700 | 0.3 | 0.5 | 0.8 | 0 | 1 | 40.0 | yellow |
| 3 | Tashkent City | Almazar | 41.3167 | 69.2167 | tashkent | 6 | 24.7 | 253.6 | 0.516189 | 24.7 | ... | 20 | 35 | 700 | 0.3 | 0.5 | 0.8 | 0 | 1 | 40.0 | yellow |
| 4 | Tashkent City | Almazar | 41.3167 | 69.2167 | tashkent | 7 | 24.7 | 253.6 | 0.516189 | 24.7 | ... | 20 | 35 | 700 | 0.3 | 0.5 | 0.8 | 0 | 1 | 40.0 | yellow |
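
Because `risk_category` is derived purely from `risk_score`, the two columns can be cross-checked. One way, sketched here on a small hypothetical frame rather than the real `training_df`, is to recompute the categories with `pandas.cut` using the same thresholds as `get_risk_category` (40 and 70) and count disagreements. The frame `demo_df` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for training_df.
demo_df = pd.DataFrame({
    "risk_score": [0.0, 39.9, 40.0, 69.9, 70.0, 100.0],
    "risk_category": ["red", "red", "yellow", "yellow", "green", "green"],
})

# Left-closed bins reproduce the >= comparisons in get_risk_category:
# [-inf, 40) -> red, [40, 70) -> yellow, [70, inf) -> green.
recomputed = pd.cut(
    demo_df["risk_score"],
    bins=[-np.inf, 40, 70, np.inf],
    labels=["red", "yellow", "green"],
    right=False,
).astype(str)

# Count rows where the stored category disagrees with the recomputed one.
mismatches = int((recomputed != demo_df["risk_category"]).sum())
print(f"Mismatched categories: {mismatches}")
```

Setting `right=False` matters here: with the default right-closed bins, a score of exactly 40 or 70 would fall into the lower category.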


## 3. Save Training Data

In [11]:
# Save to CSV
output_path = Path("../data/processed/training_data_satellite_enriched.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)

training_df.to_csv(output_path, index=False)

print(f"✅ Training data saved to: {output_path}")
print(f"   Shape: {training_df.shape}")
print(f"   Size: {output_path.stat().st_size / 1024 / 1024:.2f} MB")

✅ Training data saved to: ../data/processed/training_data_satellite_enriched.csv
   Shape: (21600, 30)
   Size: 4.37 MB
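
A round-trip read is one way to confirm that a saved file matches what was written. The sketch below uses a temporary directory and a tiny hypothetical frame so that it runs on its own; with the real data, the same check would point at `output_path` and `training_df`. The names `demo_df` and `demo_path` are assumptions made for this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny hypothetical frame standing in for training_df.
demo_df = pd.DataFrame({
    "district": ["Almazar", "Almazar"],
    "month": [3, 4],
    "risk_score": [10.0, 40.0],
    "risk_category": ["red", "yellow"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the notebook's pattern: create the parent folder, then write.
    demo_path = Path(tmp_dir) / "processed" / "demo_training_data.csv"
    demo_path.parent.mkdir(parents=True, exist_ok=True)
    demo_df.to_csv(demo_path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(demo_path)

# Shape and column order should survive the round trip.
same_shape = reloaded.shape == demo_df.shape
same_columns = list(reloaded.columns) == list(demo_df.columns)
print(f"Shape preserved: {same_shape}, columns preserved: {same_columns}")
```

Writing with `index=False`, as the notebook does, is what keeps an extra unnamed index column from appearing when the file is read back.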
