# AgroGraphNet: Data Preprocessing

This notebook preprocesses the raw data: it processes satellite imagery, cleans environmental data, and prepares features for graph construction.

## Objectives:
1. Process satellite imagery and calculate vegetation indices
2. Clean and normalize environmental data
3. Handle missing values and outliers
4. Create temporal features
5. Prepare data for graph construction

In [None]:
# Import required libraries
import sys
import os
sys.path.append('../src')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')

# Import custom modules
from config import *
from data_utils import *
from visualization import *

# Set random seed for reproducibility
np.random.seed(RANDOM_SEED)

print("Libraries imported successfully!")
print(f"Processing data from: {RAW_DATA_DIR}")
print(f"Output will be saved to: {PROCESSED_DATA_DIR}")

## 1. Load Raw Data

First, let's load the datasets created in the previous notebook.

In [None]:
# Load datasets
print("Loading raw datasets...")

# Farm locations
farm_files = list(FARM_LOCATIONS_DIR.glob('*.csv'))
if farm_files:
    farms_df = pd.read_csv(farm_files[0])
    print(f"✅ Loaded farm locations: {len(farms_df)} farms")
else:
    raise FileNotFoundError("No farm location files found. Please run notebook 01 first.")

# Weather data
weather_files = list(WEATHER_DIR.glob('*.csv'))
if weather_files:
    weather_df = load_weather_data(str(weather_files[0]))
    print(f"✅ Loaded weather data: {len(weather_df)} records")
else:
    raise FileNotFoundError("No weather files found. Please run notebook 01 first.")

# Disease data
disease_files = list(DISEASE_LABELS_DIR.glob('*.csv'))
if disease_files:
    disease_df = load_disease_labels(str(disease_files[0]))
    print(f"✅ Loaded disease data: {len(disease_df)} records")
else:
    raise FileNotFoundError("No disease files found. Please run notebook 01 first.")

print("\nData loaded successfully!")

## 2. Satellite Imagery Processing

Since we're working with sample data, we'll simulate satellite imagery processing. In a real scenario, you would load actual GeoTIFF files here.

In [None]:
# Check for satellite imagery files
satellite_files = list(SATELLITE_DIR.glob('*.tif')) + list(SATELLITE_DIR.glob('*.tiff'))

if satellite_files:
    print(f"Found {len(satellite_files)} satellite imagery files")
    
    # NOTE: real GeoTIFF processing is not implemented in this notebook.
    # A full pipeline would load each file and extract pixel values per farm;
    # for demonstration, simulated band values are generated instead.
    print("Simulating satellite features (GeoTIFF extraction not implemented)...")
    
    # Simulate satellite band values for each farm
    satellite_features = {}
    
    for band_name in SATELLITE_BANDS.keys():
        # Simulate realistic band values
        if band_name in ['B02', 'B03', 'B04']:  # Visible bands
            values = np.random.uniform(0.05, 0.3, len(farms_df))
        elif band_name == 'B08':  # NIR
            values = np.random.uniform(0.3, 0.8, len(farms_df))
        else:  # SWIR bands
            values = np.random.uniform(0.1, 0.4, len(farms_df))
        
        satellite_features[band_name] = values
    
else:
    print("No satellite imagery files found. Creating simulated satellite features...")
    
    # Create simulated satellite features for each farm
    satellite_features = {}
    
    for band_name in SATELLITE_BANDS.keys():
        # Simulate realistic band values based on crop health
        base_values = np.random.uniform(0.1, 0.5, len(farms_df))
        
        # Add some correlation with disease status if available
        if len(disease_df) > 0:
            # Get latest disease status for each farm
            latest_disease = disease_df.loc[disease_df.groupby('farm_id')['date'].idxmax()]
            
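            # enumerate() gives a positional index, which stays valid even if farms_df has a non-default index
            for i, (_, farm) in enumerate(farms_df.iterrows()):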
                farm_disease = latest_disease[latest_disease['farm_id'] == farm['farm_id']]
                if len(farm_disease) > 0 and farm_disease.iloc[0]['disease_type'] != 'Healthy':
                    # Diseased farms have different spectral signatures
                    if band_name == 'B08':  # NIR typically lower in diseased plants
                        base_values[i] *= 0.8
                    elif band_name in ['B04']:  # Red might be higher
                        base_values[i] *= 1.2
        
        satellite_features[band_name] = base_values

# Convert to DataFrame
satellite_df = pd.DataFrame(satellite_features)
# Assign by position (.values) so a non-default farms_df index cannot misalign the IDs
satellite_df['farm_id'] = farms_df['farm_id'].values

print(f"✅ Satellite features created: {satellite_df.shape}")
print(f"Bands: {list(satellite_features.keys())}")

In [None]:
# Calculate vegetation indices
print("Calculating vegetation indices...")

# Create band mapping for vegetation index calculation
band_mapping = {
    'Red': 'B04',
    'NIR': 'B08',
    'Blue': 'B02',
    'SWIR1': 'B11'
}

# Calculate indices for each farm
vegetation_indices = {}

for index_name in VEGETATION_INDICES:
    if index_name == 'NDVI':
        # NDVI = (NIR - Red) / (NIR + Red)
        nir = satellite_df['B08'].values
        red = satellite_df['B04'].values
        vegetation_indices[index_name] = (nir - red) / (nir + red + 1e-8)
    
    elif index_name == 'EVI':
        # EVI = 2.5 * (NIR - Red) / (NIR + 6*Red - 7.5*Blue + 1)
        nir = satellite_df['B08'].values
        red = satellite_df['B04'].values
        blue = satellite_df['B02'].values
        vegetation_indices[index_name] = 2.5 * (nir - red) / (nir + 6*red - 7.5*blue + 1 + 1e-8)
    
    elif index_name == 'SAVI':
        # SAVI = (1 + L) * (NIR - Red) / (NIR + Red + L)
        L = 0.5
        nir = satellite_df['B08'].values
        red = satellite_df['B04'].values
        vegetation_indices[index_name] = (1 + L) * (nir - red) / (nir + red + L + 1e-8)
    
    elif index_name == 'NDWI':
        # NDWI = (NIR - SWIR1) / (NIR + SWIR1)
        nir = satellite_df['B08'].values
        swir1 = satellite_df['B11'].values
        vegetation_indices[index_name] = (nir - swir1) / (nir + swir1 + 1e-8)

# Add vegetation indices to satellite DataFrame
for index_name, values in vegetation_indices.items():
    satellite_df[index_name] = values

print(f"✅ Vegetation indices calculated: {list(vegetation_indices.keys())}")
print(f"Updated satellite features shape: {satellite_df.shape}")

# Display statistics
print("\nVegetation Index Statistics:")
# Iterate over the indices actually computed, so an unsupported name in VEGETATION_INDICES cannot raise KeyError
for index_name, values in vegetation_indices.items():
    print(f"{index_name}: mean={values.mean():.3f}, std={values.std():.3f}, range=[{values.min():.3f}, {values.max():.3f}]")
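
As a sanity check on the formulas in the cell above, here is a minimal sketch that applies the same four expressions to two made-up pixels. The reflectance values and the helper name `vegetation_indices_for` are illustrative assumptions, not data from this project.

```python
import numpy as np

def vegetation_indices_for(nir, red, blue, swir1, L=0.5, eps=1e-8):
    """Return NDVI, EVI, SAVI and NDWI for arrays of reflectance in [0, 1]."""
    nir, red, blue, swir1 = (
        np.asarray(a, dtype=float) for a in (nir, red, blue, swir1)
    )
    return {
        "NDVI": (nir - red) / (nir + red + eps),
        "EVI": 2.5 * (nir - red) / (nir + 6 * red - 7.5 * blue + 1 + eps),
        "SAVI": (1 + L) * (nir - red) / (nir + red + L + eps),
        "NDWI": (nir - swir1) / (nir + swir1 + eps),
    }

# Hypothetical pixels: index 0 is a dense healthy canopy, index 1 a stressed one
idx = vegetation_indices_for(
    nir=[0.50, 0.30], red=[0.05, 0.15], blue=[0.03, 0.08], swir1=[0.20, 0.28]
)
for name, values in idx.items():
    print(name, np.round(values, 3))
```

With these inputs every index is higher for the first pixel than for the second, which is the direction the simulation assumes when it lowers NIR and raises red for diseased farms.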

## 3. Environmental Data Preprocessing
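
The weather cell in this section flags values outside 1.5 × IQR of the quartiles and replaces them with the column median. The sketch below reproduces that rule on two toy series; the helper name `replace_outliers_with_median` and the numbers are assumptions for illustration only.

```python
import pandas as pd

def replace_outliers_with_median(series, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the series median."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series.mask(mask, series.median()), mask

# A temperature-like series with one implausible reading
temps = pd.Series([20.0, 21.0, 19.5, 22.0, 20.5, 45.0])
temps_clean, temps_flagged = replace_outliers_with_median(temps)

# A precipitation-like series: mostly dry days and one rain event
rain = pd.Series([0.0] * 7 + [12.0])
rain_clean, rain_flagged = replace_outliers_with_median(rain)

print(temps_clean.tolist(), int(temps_flagged.sum()))
print(rain_clean.tolist(), int(rain_flagged.sum()))
```

The second series shows a caveat worth checking on real data: when most values are zero the IQR is zero, so every rain event is flagged and overwritten. For zero-inflated variables such as precipitation, a different rule (or no outlier step) may be more appropriate.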

In [None]:
# Clean and preprocess weather data
print("Preprocessing weather data...")

# Check for outliers and anomalies
weather_stats = weather_df.describe()
print("Weather data statistics:")
display(weather_stats)

# Handle outliers using IQR method
def remove_outliers_iqr(df, columns):
    df_clean = df.copy()
    outliers_removed = 0
    
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        outliers = (df[col] < lower_bound) | (df[col] > upper_bound)
        outliers_removed += outliers.sum()
        
        # Replace outliers with median values
        df_clean.loc[outliers, col] = df[col].median()
    
    return df_clean, outliers_removed

# Replace outliers in weather data with the column median
weather_columns = ['temperature', 'humidity', 'precipitation', 'wind_speed']
weather_clean, outliers_count = remove_outliers_iqr(weather_df, weather_columns)

print(f"\n✅ Outliers handled: {outliers_count} values replaced with median")

# Handle missing values
missing_before = weather_clean.isnull().sum().sum()
if missing_before > 0:
    print(f"Handling {missing_before} missing values...")
    
    # Forward-fill, then backward-fill, within each location's own time series
    weather_clean = weather_clean.sort_values(['lat', 'lon', 'date'])
    weather_clean[weather_columns] = (
        weather_clean.groupby(['lat', 'lon'])[weather_columns]
        .transform(lambda s: s.ffill().bfill())
    )
    
    # If still missing, use overall median
    for col in weather_columns:
        weather_clean[col] = weather_clean[col].fillna(weather_clean[col].median())
    
    missing_after = weather_clean.isnull().sum().sum()
    print(f"✅ Missing values after cleaning: {missing_after}")
else:
    print("✅ No missing values found in weather data")

# Create temporal features
print("\nCreating temporal features...")
weather_clean['month'] = weather_clean['date'].dt.month
weather_clean['day_of_year'] = weather_clean['date'].dt.dayofyear
weather_clean['season'] = weather_clean['month'].map({
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall'
})

print(f"✅ Weather data preprocessed: {weather_clean.shape}")
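
When filling gaps in the weather series, both the forward and the backward fill need to stay inside one location's records. The following sketch, using a made-up two-station table (the column name `station` is an assumption), shows what happens when the fill is not grouped.

```python
import numpy as np
import pandas as pd

obs = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"] * 2),
    "temperature": [10.0, np.nan, 12.0, np.nan, 25.0, np.nan],
}).sort_values(["station", "date"])

# Ungrouped fill: station B's first gap is filled from station A's last reading
ungrouped = obs["temperature"].ffill().bfill()

# Grouped fill: each station is filled only from its own series
grouped = (
    obs.groupby("station")["temperature"]
       .transform(lambda s: s.ffill().bfill())
)

print(ungrouped.tolist())
print(grouped.tolist())
```

In the ungrouped result the first value for station B comes from station A, while the grouped version takes it from station B's own next reading.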

In [None]:
# Preprocess disease data
print("Preprocessing disease data...")

# Create numerical labels for disease types
disease_mapping = {disease: idx for idx, disease in DISEASE_CLASSES.items()}
disease_clean = disease_df.copy()
disease_clean['disease_label'] = disease_clean['disease_type'].map(disease_mapping)

print(f"Disease mapping: {disease_mapping}")

# Handle missing severity values
missing_severity = disease_clean['severity'].isnull().sum()
if missing_severity > 0:
    print(f"Handling {missing_severity} missing severity values...")
    # Set severity to 0 for healthy crops, use median for others
    disease_clean.loc[disease_clean['disease_type'] == 'Healthy', 'severity'] = 0
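    diseased_median = disease_clean.loc[disease_clean['disease_type'] != 'Healthy', 'severity'].median()
    # Assign the result back instead of using inplace=True on a column selection
    disease_clean['severity'] = disease_clean['severity'].fillna(diseased_median)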

# Create binary disease indicator
disease_clean['is_diseased'] = (disease_clean['disease_type'] != 'Healthy').astype(int)

# Create temporal features
disease_clean['month'] = disease_clean['date'].dt.month
disease_clean['season'] = disease_clean['month'].map({
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall'
})

print(f"✅ Disease data preprocessed: {disease_clean.shape}")
print(f"Disease distribution: {disease_clean['disease_type'].value_counts().to_dict()}")
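
`Series.map` returns NaN for any `disease_type` that is missing from the mapping, so a misspelled or new class name would silently produce an empty label. A minimal check, assuming (as the cell above does) that `DISEASE_CLASSES` maps integer index to class name; the class table below is hypothetical.

```python
import pandas as pd

# Hypothetical class table; the real one comes from DISEASE_CLASSES in config
classes = {0: "Healthy", 1: "Blight", 2: "Rust"}
name_to_label = {name: idx for idx, name in classes.items()}

labels = pd.DataFrame({"disease_type": ["Healthy", "Rust", "blight", "Mildew"]})
labels["disease_label"] = labels["disease_type"].map(name_to_label)

# Collect the names that did not match any known class
unmapped = (
    labels.loc[labels["disease_label"].isna(), "disease_type"].unique().tolist()
)
if unmapped:
    print(f"Unmapped disease types: {unmapped}")
```

Running a check like this before building the feature matrix makes it easier to decide whether to normalize the names, extend the class table, or drop the rows.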

## 4. Feature Engineering and Aggregation

In [None]:
# Create aggregated features for each farm
print("Creating aggregated features for each farm...")

# Aggregate weather data by farm and time period
def aggregate_weather_by_farm(weather_df, farms_df, tolerance=0.01):
    """
    Aggregate weather data for each farm location
    """
    farm_weather = []
    
    for _, farm in farms_df.iterrows():
        # Find weather data near this farm
        nearby_weather = weather_df[
            (abs(weather_df['lat'] - farm['lat']) < tolerance) & 
            (abs(weather_df['lon'] - farm['lon']) < tolerance)
        ]
        
        if len(nearby_weather) == 0:
            # If no nearby weather, fall back to all records from the closest station
            distances = np.sqrt(
                (weather_df['lat'] - farm['lat'])**2 + 
                (weather_df['lon'] - farm['lon'])**2
            )
            closest = weather_df.loc[distances.idxmin()]
            nearby_weather = weather_df[
                (weather_df['lat'] == closest['lat']) &
                (weather_df['lon'] == closest['lon'])
            ]
        
        # Aggregate by time period
        for date in nearby_weather['date'].unique():
            date_weather = nearby_weather[nearby_weather['date'] == date]
            
            farm_weather.append({
                'farm_id': farm['farm_id'],
                'date': date,
                'temperature': date_weather['temperature'].mean(),
                'humidity': date_weather['humidity'].mean(),
                'precipitation': date_weather['precipitation'].mean(),
                'wind_speed': date_weather['wind_speed'].mean(),
                'wind_direction': date_weather['wind_direction'].mean(),
                'month': date_weather['month'].iloc[0],
                'season': date_weather['season'].iloc[0]
            })
    
    return pd.DataFrame(farm_weather)

# Aggregate weather data
farm_weather_df = aggregate_weather_by_farm(weather_clean, farms_df)
print(f"✅ Farm weather data aggregated: {farm_weather_df.shape}")

# Create temporal weather features (rolling mean and standard deviation)
print("Creating temporal weather features...")

farm_weather_df = farm_weather_df.sort_values(['farm_id', 'date'])

# Calculate rolling statistics over a window of 3 consecutive records per farm
weather_features = ['temperature', 'humidity', 'precipitation', 'wind_speed']
for feature in weather_features:
    farm_weather_df[f'{feature}_rolling_mean'] = farm_weather_df.groupby('farm_id')[feature].rolling(window=3, min_periods=1).mean().reset_index(0, drop=True)
    farm_weather_df[f'{feature}_rolling_std'] = farm_weather_df.groupby('farm_id')[feature].rolling(window=3, min_periods=1).std().reset_index(0, drop=True)

# Fill NaN values in rolling std with 0
rolling_std_cols = [col for col in farm_weather_df.columns if 'rolling_std' in col]
farm_weather_df[rolling_std_cols] = farm_weather_df[rolling_std_cols].fillna(0)

print(f"✅ Temporal weather features created: {farm_weather_df.shape}")
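
The rolling features above use a count-based window (`window=3`), which spans whatever dates the last three records happen to cover. If a fixed calendar span is wanted instead, pandas also accepts an offset string such as `"30D"` on a datetime index. The sketch below compares the two on made-up, irregularly spaced observations.

```python
import pandas as pd

obs = pd.DataFrame({
    "farm_id": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2023-01-01", "2023-01-20", "2023-03-01", "2023-04-01", "2023-04-10"]
    ),
    "temperature": [10.0, 20.0, 30.0, 5.0, 7.0],
}).sort_values(["farm_id", "date"])

# Count-based window: the last 3 records per farm, whatever their dates
by_count = (
    obs.groupby("farm_id")["temperature"]
       .rolling(window=3, min_periods=1)
       .mean()
       .reset_index(level=0, drop=True)
)

# Time-based window: records from the 30 days up to and including each date
by_time = (
    obs.set_index("date")
       .groupby("farm_id")["temperature"]
       .rolling("30D", min_periods=1)
       .mean()
)

print(by_count.tolist())
print(by_time.tolist())
```

The two versions differ for farm A's March record: the count-based mean still includes both January readings, while the 30-day window contains only the March reading. Which behavior is appropriate depends on the sampling interval of the weather data.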

In [None]:
# Create comprehensive feature matrix for each farm at each time point
print("Creating comprehensive feature matrix...")

# Get unique time points
time_points = sorted(disease_clean['date'].unique())
print(f"Time points: {len(time_points)}")

# Create feature matrix
feature_data = []

for time_point in time_points:
    for _, farm in farms_df.iterrows():
        farm_id = farm['farm_id']
        
        # Get farm static features
        farm_features = {
            'farm_id': farm_id,
            'date': time_point,
            'lat': farm['lat'],
            'lon': farm['lon'],
            'area_hectares': farm['area_hectares']
        }
        
        # Add crop type (one-hot encoded)
        for crop_type in farms_df['crop_type'].unique():
            farm_features[f'crop_{crop_type}'] = int(farm['crop_type'] == crop_type)
        
        # Add satellite features
        farm_satellite = satellite_df[satellite_df['farm_id'] == farm_id]
        if len(farm_satellite) > 0:
            for band in SATELLITE_BANDS.keys():
                farm_features[f'satellite_{band}'] = farm_satellite[band].iloc[0]
            
            for index in VEGETATION_INDICES:
                farm_features[f'vegetation_{index}'] = farm_satellite[index].iloc[0]
        else:
            # Fill with median values if no satellite data
            for band in SATELLITE_BANDS.keys():
                farm_features[f'satellite_{band}'] = satellite_df[band].median()
            
            for index in VEGETATION_INDICES:
                farm_features[f'vegetation_{index}'] = satellite_df[index].median()
        
        # Add weather features
        farm_weather = farm_weather_df[
            (farm_weather_df['farm_id'] == farm_id) & 
            (farm_weather_df['date'] == time_point)
        ]
        
        if len(farm_weather) > 0:
            weather_cols = ['temperature', 'humidity', 'precipitation', 'wind_speed', 'wind_direction']
            for col in weather_cols:
                farm_features[f'weather_{col}'] = farm_weather[col].iloc[0]
                
                # Add rolling features if available
                if f'{col}_rolling_mean' in farm_weather.columns:
                    farm_features[f'weather_{col}_rolling_mean'] = farm_weather[f'{col}_rolling_mean'].iloc[0]
                    farm_features[f'weather_{col}_rolling_std'] = farm_weather[f'{col}_rolling_std'].iloc[0]
            
            # Add temporal features
            farm_features['month'] = farm_weather['month'].iloc[0]
            farm_features['season'] = farm_weather['season'].iloc[0]
        else:
            # Fill with overall averages if no weather data
            weather_cols = ['temperature', 'humidity', 'precipitation', 'wind_speed', 'wind_direction']
            for col in weather_cols:
                farm_features[f'weather_{col}'] = farm_weather_df[col].mean()
                # Rolling columns exist only for some features (not wind_direction)
                if f'{col}_rolling_mean' in farm_weather_df.columns:
                    farm_features[f'weather_{col}_rolling_mean'] = farm_weather_df[f'{col}_rolling_mean'].mean()
                    farm_features[f'weather_{col}_rolling_std'] = farm_weather_df[f'{col}_rolling_std'].mean()
            
            farm_features['month'] = pd.to_datetime(time_point).month
            farm_features['season'] = {12: 'Winter', 1: 'Winter', 2: 'Winter', 3: 'Spring', 4: 'Spring', 5: 'Spring', 6: 'Summer', 7: 'Summer', 8: 'Summer', 9: 'Fall', 10: 'Fall', 11: 'Fall'}[pd.to_datetime(time_point).month]
        
        # Add disease labels
        farm_disease = disease_clean[
            (disease_clean['farm_id'] == farm_id) & 
            (disease_clean['date'] == time_point)
        ]
        
        if len(farm_disease) > 0:
            farm_features['disease_type'] = farm_disease['disease_type'].iloc[0]
            farm_features['disease_label'] = farm_disease['disease_label'].iloc[0]
            farm_features['severity'] = farm_disease['severity'].iloc[0]
            farm_features['is_diseased'] = farm_disease['is_diseased'].iloc[0]
        else:
            # Default to healthy if no disease data
            farm_features['disease_type'] = 'Healthy'
            farm_features['disease_label'] = 0
            farm_features['severity'] = 0.0
            farm_features['is_diseased'] = 0
        
        feature_data.append(farm_features)

# Convert to DataFrame
features_df = pd.DataFrame(feature_data)

print(f"✅ Comprehensive feature matrix created: {features_df.shape}")
print(f"Features per sample: {len([col for col in features_df.columns if col not in ['farm_id', 'date', 'disease_type', 'disease_label', 'severity', 'is_diseased']])}")
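
The nested loops above filter each table once per farm and time point, which can become slow as the number of farms grows. One possible alternative, sketched here on toy tables with hypothetical values, is to build the farm × date grid once and attach the other tables with left merges.

```python
import pandas as pd

farms = pd.DataFrame({"farm_id": [1, 2], "area_hectares": [10.0, 25.0]})
weather = pd.DataFrame({
    "farm_id": [1, 1, 2],
    "date": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-01-01"]),
    "temperature": [5.0, 7.0, 6.0],
})
disease = pd.DataFrame({
    "farm_id": [2],
    "date": pd.to_datetime(["2023-02-01"]),
    "disease_type": ["Rust"],
})

# One row per (farm, date) combination
dates = pd.to_datetime(["2023-01-01", "2023-02-01"])
grid = pd.MultiIndex.from_product(
    [farms["farm_id"], dates], names=["farm_id", "date"]
).to_frame(index=False)

matrix = (
    grid.merge(farms, on="farm_id", how="left")
        .merge(weather, on=["farm_id", "date"], how="left")
        .merge(disease, on=["farm_id", "date"], how="left")
)

# Same fallbacks as the loop: overall mean for weather, 'Healthy' for labels
matrix["temperature"] = matrix["temperature"].fillna(weather["temperature"].mean())
matrix["disease_type"] = matrix["disease_type"].fillna("Healthy")
print(matrix)
```

Unlike the loop, which takes the first matching row with `.iloc[0]`, a merge keeps every match, so tables with more than one row per farm and date would need `drop_duplicates` (or an aggregation) first.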

## 5. Feature Scaling and Normalization

In [None]:
# Identify feature columns for scaling
feature_columns = [col for col in features_df.columns if col not in [
    'farm_id', 'date', 'disease_type', 'disease_label', 'severity', 'is_diseased', 'season'
]]

print(f"Features to scale: {len(feature_columns)}")

# Separate categorical and numerical features
categorical_features = [col for col in feature_columns if col.startswith('crop_') or col == 'month']
numerical_features = [col for col in feature_columns if col not in categorical_features]

print(f"Categorical features: {len(categorical_features)}")
print(f"Numerical features: {len(numerical_features)}")

# Scale numerical features
print("\nScaling numerical features...")
scaler = StandardScaler()
features_scaled = features_df.copy()

# Fit scaler on numerical features
features_scaled[numerical_features] = scaler.fit_transform(features_df[numerical_features])

print(f"✅ Numerical features scaled using StandardScaler")

# Categorical features are left unscaled: crop_* columns are one-hot encoded, month stays an integer (1-12)
print("✅ Categorical features left unscaled")

# Check for any remaining NaN values
nan_count = features_scaled[feature_columns].isnull().sum().sum()
if nan_count > 0:
    print(f"⚠️ Warning: {nan_count} NaN values found after scaling")
    # Fill remaining NaN with 0
    features_scaled[feature_columns] = features_scaled[feature_columns].fillna(0)
    print("✅ NaN values filled with 0")
else:
    print("✅ No NaN values found")

print(f"\nFinal feature matrix shape: {features_scaled.shape}")
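
The cell above fits the scaler on every row. If the data is later split into training and evaluation sets, a common precaution is to compute the scaling statistics on the training rows only and reuse them elsewhere. The NumPy sketch below illustrates the idea with the same per-column mean and population standard deviation (`ddof=0`) that `StandardScaler` uses; the split and the numbers are assumptions.

```python
import numpy as np

def fit_standardizer(X):
    """Return per-column mean and population std (ddof=0)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)      # NumPy's default is ddof=0
    std[std == 0] = 1.0      # constant columns: avoid division by zero
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale X with statistics computed elsewhere."""
    return (X - mean) / std

# Hypothetical data: first three rows for training, last row held out
X = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [10.0, 10.0]])
train, held_out = X[:3], X[3:]

mean, std = fit_standardizer(train)
train_z = apply_standardizer(train, mean, std)
held_out_z = apply_standardizer(held_out, mean, std)

print(np.round(train_z, 3))
print(np.round(held_out_z, 3))
```

The held-out row is scaled with the training mean and standard deviation, so its values do not influence the statistics. With scikit-learn, the equivalent is calling `fit` on the training rows and `transform` on the rest.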

## 6. Data Quality Visualization

In [None]:
# Visualize feature distributions and correlations
print("Creating data quality visualizations...")

# Plot feature distributions
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

# Select key features to visualize
key_features = [
    'vegetation_NDVI', 'vegetation_EVI', 'weather_temperature',
    'weather_humidity', 'weather_precipitation', 'area_hectares'
]

for i, feature in enumerate(key_features):
    if feature in features_scaled.columns:
        axes[i].hist(features_scaled[feature], bins=30, alpha=0.7, color='skyblue')
        axes[i].set_title(f'{feature} Distribution')
        axes[i].set_xlabel('Scaled Value')
        axes[i].set_ylabel('Frequency')
        axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(RESULTS_DIR / '02_feature_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

# Correlation matrix for key numerical features
print("\nCreating correlation matrix...")
correlation_features = [f for f in key_features if f in features_scaled.columns]
correlation_matrix = features_scaled[correlation_features].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig(RESULTS_DIR / '02_correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Visualize disease distribution over time and space
print("Visualizing disease patterns...")

fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Disease distribution over time
disease_time = features_scaled.groupby(['date', 'disease_type']).size().unstack(fill_value=0)
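# Map colors by column name: unstack() sorts columns alphabetically, so a positional color list would mislabel classes
bar_colors = {'Healthy': 'green', 'Blight': 'red', 'Rust': 'orange', 'Mosaic': 'purple', 'Bacterial': 'darkred'}
disease_time.plot(kind='bar', stacked=True, ax=axes[0, 0],
                  color=[bar_colors.get(c, 'gray') for c in disease_time.columns])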
axes[0, 0].set_title('Disease Distribution Over Time')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Number of Farms')
axes[0, 0].tick_params(axis='x', rotation=45)
axes[0, 0].legend(bbox_to_anchor=(1.05, 1), loc='upper left')

# Disease severity distribution
diseased_farms = features_scaled[features_scaled['is_diseased'] == 1]
if len(diseased_farms) > 0:
    axes[0, 1].hist(diseased_farms['severity'], bins=20, alpha=0.7, color='red')
    axes[0, 1].set_title('Disease Severity Distribution')
    axes[0, 1].set_xlabel('Severity')
    axes[0, 1].set_ylabel('Frequency')

# NDVI vs Disease Status
healthy_farms = features_scaled[features_scaled['disease_type'] == 'Healthy']
diseased_farms = features_scaled[features_scaled['disease_type'] != 'Healthy']

if 'vegetation_NDVI' in features_scaled.columns:
    axes[1, 0].hist(healthy_farms['vegetation_NDVI'], bins=20, alpha=0.7, 
                   color='green', label='Healthy', density=True)
    axes[1, 0].hist(diseased_farms['vegetation_NDVI'], bins=20, alpha=0.7, 
                   color='red', label='Diseased', density=True)
    axes[1, 0].set_title('NDVI Distribution by Disease Status')
    axes[1, 0].set_xlabel('NDVI (scaled)')
    axes[1, 0].set_ylabel('Density')
    axes[1, 0].legend()

# Geographic distribution of diseases
disease_colors = {'Healthy': 'green', 'Blight': 'red', 'Rust': 'orange', 'Mosaic': 'purple', 'Bacterial': 'darkred'}
for disease_type, color in disease_colors.items():
    disease_subset = features_scaled[features_scaled['disease_type'] == disease_type]
    if len(disease_subset) > 0:
        axes[1, 1].scatter(disease_subset['lon'], disease_subset['lat'], 
                          c=color, label=disease_type, alpha=0.6, s=20)

axes[1, 1].set_title('Geographic Distribution of Diseases')
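# lat/lon were standardized with the other numerical features, so the axes are in scaled units
axes[1, 1].set_xlabel('Longitude (scaled)')
axes[1, 1].set_ylabel('Latitude (scaled)')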
axes[1, 1].legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.savefig(RESULTS_DIR / '02_disease_patterns.png', dpi=300, bbox_inches='tight')
plt.show()

## 7. Save Processed Data

In [None]:
# Save processed datasets
print("Saving processed data...")

# Save main feature matrix
features_scaled.to_csv(PROCESSED_DATA_DIR / 'features_scaled.csv', index=False)
print(f"✅ Scaled features saved: {PROCESSED_DATA_DIR / 'features_scaled.csv'}")

# Save individual processed datasets
farms_df.to_csv(PROCESSED_DATA_DIR / 'farms_processed.csv', index=False)
weather_clean.to_csv(PROCESSED_DATA_DIR / 'weather_processed.csv', index=False)
disease_clean.to_csv(PROCESSED_DATA_DIR / 'disease_processed.csv', index=False)
satellite_df.to_csv(PROCESSED_DATA_DIR / 'satellite_processed.csv', index=False)
farm_weather_df.to_csv(PROCESSED_DATA_DIR / 'farm_weather_aggregated.csv', index=False)

print("✅ Individual processed datasets saved")

# Save feature scaler for future use
import joblib
joblib.dump(scaler, PROCESSED_DATA_DIR / 'feature_scaler.pkl')
print("✅ Feature scaler saved")

# Save feature column names
feature_info = {
    'all_features': feature_columns,
    'numerical_features': numerical_features,
    'categorical_features': categorical_features,
    'target_columns': ['disease_label', 'disease_type', 'severity', 'is_diseased']
}

import json
with open(PROCESSED_DATA_DIR / 'feature_info.json', 'w') as f:
    json.dump(feature_info, f, indent=2)

print("✅ Feature information saved")

# Create summary statistics
summary_stats = {
    'total_samples': len(features_scaled),
    'total_farms': features_scaled['farm_id'].nunique(),
    'time_points': len(features_scaled['date'].unique()),
    'total_features': len(feature_columns),
    'disease_distribution': features_scaled['disease_type'].value_counts().to_dict(),
    'feature_statistics': features_scaled[numerical_features].describe().to_dict()
}

with open(PROCESSED_DATA_DIR / 'data_summary.json', 'w') as f:
    json.dump(summary_stats, f, indent=2, default=str)

print("✅ Data summary saved")

print(f"\n🎉 Data preprocessing completed successfully!")
print(f"Processed data saved to: {PROCESSED_DATA_DIR}")
print(f"\nDataset summary:")
print(f"  - Total samples: {len(features_scaled):,}")
print(f"  - Unique farms: {features_scaled['farm_id'].nunique()}")
print(f"  - Time points: {len(features_scaled['date'].unique())}")
print(f"  - Total features: {len(feature_columns)}")
print(f"  - Disease classes: {len(features_scaled['disease_type'].unique())}")
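
A later notebook can recover the column lists from `feature_info.json` rather than re-deriving them. The sketch below round-trips a small dictionary through a temporary file to show the pattern; the feature names are placeholders, and the real file lives in `PROCESSED_DATA_DIR`.

```python
import json
import tempfile
from pathlib import Path

# Placeholder content mirroring the structure saved above
feature_info = {
    "numerical_features": ["lat", "lon", "weather_temperature"],
    "categorical_features": ["crop_wheat", "month"],
    "target_columns": ["disease_label", "severity"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "feature_info.json"
    path.write_text(json.dumps(feature_info, indent=2))
    loaded = json.loads(path.read_text())

# Rebuild the model input columns in a fixed, documented order
model_inputs = loaded["numerical_features"] + loaded["categorical_features"]
print(model_inputs)
```

The saved scaler can be restored in the same way with `joblib.load`, so new data is transformed with the statistics computed here.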

## Summary

This notebook has completed the following preprocessing tasks:

1. ✅ **Satellite Imagery Processing**
   - Simulated satellite band values for each farm
   - Calculated vegetation indices (NDVI, EVI, SAVI, NDWI)
   - Adjusted simulated red and NIR values for diseased farms (when no imagery files are present)

2. ✅ **Environmental Data Cleaning**
   - Replaced outliers with the column median using the IQR method
   - Filled missing values by forward and backward fill within each location
   - Created temporal features (month, day of year, season)
   - Calculated rolling means and standard deviations

3. ✅ **Feature Engineering**
   - Aggregated weather data by farm location
   - Created comprehensive feature matrix
   - One-hot encoded categorical variables
   - Combined static and temporal features

4. ✅ **Data Scaling and Normalization**
   - Applied StandardScaler to numerical features
   - Preserved categorical encodings
   - Handled remaining missing values

5. ✅ **Quality Assessment and Visualization**
   - Created feature distribution plots
   - Generated correlation matrices
   - Visualized disease patterns over time and space

6. ✅ **Data Persistence**
   - Saved processed datasets and feature matrices
   - Stored feature scaler for future use
   - Created comprehensive data summary

**Next Steps:**
- Run notebook `03_graph_construction.ipynb` to build farm network graphs
- The processed feature matrix is ready for graph neural network training
- Outliers and missing values have been handled, and numerical features are standardized