# FloodSense: Machine Learning-Based Flood Prediction System
## Comprehensive ML Pipeline for SEN12-FLOOD Dataset Analysis

**Author:** FloodSense Research Team  
**Date:** December 2024  
**Dataset:** SEN12-FLOOD Satellite Imagery Metadata  
**Objective:** Develop and evaluate ML models for flood prediction in South Sudan

---

## Notebook Overview

This notebook implements a systematic machine learning pipeline with four progressive experiments:

1. **Experiment 1:** Baseline Model Establishment
2. **Experiment 2:** Hyperparameter Optimization & Class Imbalance Handling
3. **Experiment 3:** Advanced Ensemble Methods (XGBoost)
4. **Experiment 4:** Feature Engineering Enhancement

### Key Research Questions
- Can satellite metadata effectively predict flood events?
- Which ML algorithms perform best for this task?
- How does class imbalance handling impact performance?
- What feature engineering strategies are most effective?

### Expected Outcomes
- Achieve >90% on every reported performance metric (accuracy, precision, recall, F1)
- Identify optimal algorithm and hyperparameters
- Create reproducible ML pipeline for flood prediction

---

## 1. Environment Setup and Dependencies

### 1.1 Install Required Packages

We use specific package versions to ensure reproducibility:

In [None]:
# Clear pip cache and install dependencies with fixed versions
!pip cache purge
!pip install pandas==2.2.2 numpy==1.26.4 scikit-learn==1.4.2 matplotlib==3.7.1 seaborn==0.13.0 imbalanced-learn==0.10.1 xgboost==1.7.6
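
Pinning versions only helps if the running kernel actually picked them up. The following is a minimal sketch, not part of the original pipeline, that compares the installed versions against the pins from the install line above using the standard-library `importlib.metadata`. The helper name `check_pins` is hypothetical.

```python
from importlib.metadata import version, PackageNotFoundError

# Pins copied from the pip install line above.
PINNED = {
    "pandas": "2.2.2",
    "numpy": "1.26.4",
    "scikit-learn": "1.4.2",
    "matplotlib": "3.7.1",
    "seaborn": "0.13.0",
    "imbalanced-learn": "0.10.1",
    "xgboost": "1.7.6",
}

def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

mismatches = check_pins(PINNED)
for name, (expected, installed) in mismatches.items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty `mismatches` dictionary means the environment matches the pins. After a fresh install in a running notebook, a kernel restart may be needed before the new versions are imported.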

### 1.2 Import Libraries and Set Configuration

Import all necessary libraries and configure the environment for reproducible results:

In [None]:
# Import core libraries for data processing and machine learning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import json
import warnings

# Import scikit-learn components
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, 
    classification_report, confusion_matrix, roc_auc_score, 
    precision_recall_curve, auc, roc_curve
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Import XGBoost for advanced ensemble methods
from xgboost import XGBClassifier

# Import joblib for model persistence
import joblib

# Configure environment for reproducible results
warnings.filterwarnings('ignore')
np.random.seed(42)  # CRITICAL: Fixed seed for reproducibility
plt.style.use('default')
sns.set_palette("husl")

print('✅ Dependencies imported successfully')
print('✅ Random seed set to 42 for reproducibility')
print('✅ Environment configured for systematic experimentation')

## 2. Data Loading and Exploration

### 2.1 Dataset Overview

**Dataset:** SEN12-FLOOD Satellite Imagery Metadata  
**Geographic Focus:** South Sudan regions (Jonglei, Unity, Upper Nile)  
**Purpose:** Extract meaningful features from satellite metadata for flood prediction

### 2.2 Data Loading Strategy

We load the SEN12-FLOOD dataset and perform initial exploration to understand:
- Data structure and format
- Available features and metadata
- Data quality and completeness
- Class distribution (flood vs non-flood events)

In [None]:
print('📊 Loading SEN12-FLOOD Dataset')
print('Dataset: Sentinel-2 satellite imagery metadata for flood detection')
print('Geographic Focus: South Sudan regions (Jonglei, Unity, Upper Nile)')
print('=' * 80)

data_dir = r'D:\FloodSenseSystem\SEN12FLOODDATA'
s2_json_path = os.path.join(data_dir, 'S2list.json')

# Load dataset
try:
    with open(s2_json_path, 'r') as f:
        s2_data = json.load(f)
    df = pd.DataFrame(s2_data)
    print(f"Dataset loaded successfully: {df.shape}")
except FileNotFoundError:
    print(f"Error: {s2_json_path} not found")
    raise
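
The exploration goals listed above (structure, completeness, class distribution) can be checked with a few lines once the JSON is in a DataFrame. The sketch below runs on a small hand-made dictionary, not on the real file. Its layout (scene columns, observation rows, plus `count` and `folder` bookkeeping rows) is an assumption inferred from how the later cells index the data.

```python
import pandas as pd

# Toy stand-in for S2list.json; keys and values are illustrative only.
toy = {
    "scene_a": {
        "1": {"date": "2019-01-05", "FLOODING": False, "filename": "S2_2019-01-05"},
        "2": {"date": "2019-01-15", "FLOODING": True, "filename": "S2_2019-01-15"},
        "count": 2,
        "folder": "scene_a",
    },
    "scene_b": {
        "1": {"date": "2019-02-01", "FLOODING": False, "filename": "S2_2019-02-01"},
        "count": 1,
        "folder": "scene_b",
    },
}
toy_df = pd.DataFrame(toy)

# Drop bookkeeping rows so only per-observation cells remain.
meta_rows = {"count", "folder", "geo"}
obs = toy_df.drop(index=[i for i in toy_df.index if i in meta_rows])

# A cell is usable only if it holds a metadata dictionary; gaps show up as NaN.
cells = [c for c in obs.to_numpy().ravel() if isinstance(c, dict)]
n_cells = obs.size
n_valid = len(cells)
n_flood = sum(bool(c.get("FLOODING")) for c in cells)

print(f"cells: {n_cells}, with metadata: {n_valid}, flagged as flood: {n_flood}")
```

Running the same counts on the real `df` gives a quick view of how sparse the table is and how imbalanced the labels are before any features are built.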

## 3. Data Preprocessing & Feature Engineering

### 3.1 Feature Engineering Strategy

**Objective:** Extract meaningful features from satellite metadata that can predict flood events

**Feature Categories:**
- **Temporal Features:** Date-based patterns (month, day, seasonality)
- **Spatial Context:** Geographic and scene-based information
- **Data Quality:** Coverage, completeness, and reliability indicators
- **Metadata Features:** File properties and observation characteristics
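
For the seasonality part of the temporal features, one common option is a cyclical encoding, which keeps December and January close together instead of eleven units apart. This is a sketch of that general technique and not what the extraction cell in this notebook does. The function name `cyclical_month` is hypothetical.

```python
import numpy as np
import pandas as pd

def cyclical_month(dates):
    """Map month 1..12 onto the unit circle as (sin, cos) pairs."""
    months = pd.to_datetime(pd.Series(dates)).dt.month.to_numpy()
    angle = 2 * np.pi * (months - 1) / 12
    return np.column_stack([np.sin(angle), np.cos(angle)])

enc = cyclical_month(["2019-01-15", "2019-07-15", "2019-12-15"])

# Euclidean distance from January to December and from January to July.
jan_dec = np.linalg.norm(enc[0] - enc[2])
jan_jul = np.linalg.norm(enc[0] - enc[1])
print(f"Jan-Dec distance: {jan_dec:.3f}, Jan-Jul distance: {jan_jul:.3f}")
```

Distance-based and linear models tend to benefit most from this encoding. Tree ensembles can often work with the raw month value.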

### 3.2 Missing Value Handling

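Missing or invalid entries are handled as follows:
- Skip cells that are not metadata dictionaries or that lack a `FLOODING` flag
- Label each observation with the `FLOODING` flag of its most recent dated scene
- Fill the six temporal features with zeros when a scene's date is missing or cannot be parsed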

In [None]:
# Initialize feature extraction variables
print('🔧 ENHANCED FEATURE EXTRACTION')
print('Strategy: Extract meaningful features from satellite metadata')
print('Focus: Temporal patterns, spatial context, data quality indicators')
print('=' * 80)

target_column = 'flood'
X = []
y = []
valid_indices = []
scene_ids = [col for col in df.columns if col not in ['count', 'folder', 'geo']]

print(f'Total scenes available: {len(scene_ids)}')
print(f'Dataset shape: {df.shape}')

### 3.3 Data Structure Analysis

Before feature extraction, we need to understand the data structure and identify available metadata fields:

In [None]:
print("Analyzing data structure and extracting meaningful features...")

# First, let's understand what data we actually have
print("Sample data exploration:")
sample_count = 0
for idx in df.index:
    if idx in ['count', 'folder', 'geo']:
        continue
    for scene_id in scene_ids[:3]:  # Check first 3 scenes
        cell_value = df.loc[idx, scene_id]
        if isinstance(cell_value, dict):
            print(f"Observation {idx}, Scene {scene_id}:")
            for key, value in cell_value.items():
                print(f"  {key}: {value} (type: {type(value).__name__})")
            sample_count += 1
            if sample_count >= 3:
                break
    if sample_count >= 3:
        break

### 3.4 Feature Extraction Implementation

**Feature Engineering Strategy:**

1. **Temporal Features:** Extract date-based patterns (month, day, seasonality)
2. **Spatial Context:** Scene-based geographic information
3. **Data Quality:** Coverage and completeness indicators
4. **Metadata Features:** File properties and observation characteristics

**Missing Value Strategy:** Use latest available labels and skip incomplete observations

In [None]:
# Extract meaningful features
import zlib  # deterministic hashing; built-in hash() on str is salted per process

print("\nExtracting features from all scenes...")

for idx in df.index:
    if idx in ['count', 'folder', 'geo']:
        continue
    
    # Find the latest scene for the label
    latest_date = None
    latest_label = None
    for scene_id in scene_ids:
        cell_value = df.loc[idx, scene_id]
        if not isinstance(cell_value, dict) or 'FLOODING' not in cell_value or 'date' not in cell_value:
            continue
        date = pd.to_datetime(cell_value.get('date', '1900-01-01'))
        if latest_date is None or date > latest_date:
            latest_date = date
            latest_label = 1 if cell_value['FLOODING'] else 0
    
    if latest_label is None:
        continue
    
    # Extract features for all valid scenes
    for scene_id in scene_ids:
        cell_value = df.loc[idx, scene_id]
        if not isinstance(cell_value, dict) or 'FLOODING' not in cell_value:
            continue
        
        features = []
        
        # 1. Date-based features (temporal patterns)
        if 'date' in cell_value:
            try:
                date = pd.to_datetime(cell_value['date'])
                features.extend([
                    date.month,
                    date.day,
                    date.dayofweek,
                    date.dayofyear,
                    date.quarter,
                    (date - pd.Timestamp('2000-01-01')).days
                ])
            except (ValueError, TypeError):
                # Unparseable date: fall back to zero-filled temporal features
                features.extend([0, 0, 0, 0, 0, 0])
        else:
            features.extend([0, 0, 0, 0, 0, 0])
        
        # 2. Scene-based features (CRC32 keeps the value identical across runs)
        scene_numeric = zlib.crc32(str(scene_id).encode('utf-8')) % 10000
        features.append(scene_numeric)
        
        # 3. Data quality features
        data_coverage = len([k for k, v in cell_value.items() if v is not None])
        features.append(data_coverage)
        
        # 4. Metadata features
        filename = str(cell_value.get('filename', ''))
        features.extend([
            len(filename),
            zlib.crc32(filename.encode('utf-8')) % 1000,
            len(X)  # running sample counter; encodes extraction order, so treat with care
        ])
        
        X.append(features)
        y.append(latest_label)
        valid_indices.append(f"{idx}_{scene_id}")

print(f'Feature extraction completed: {len(X)} samples with {len(X[0]) if X else 0} features')
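
Hash-derived features are only reproducible if the hash itself is deterministic. Python's built-in `hash()` on strings is salted per process unless `PYTHONHASHSEED` is fixed, so its values can differ between a training run and a later inference run, whereas `zlib.crc32` always returns the same value for the same bytes. A minimal sketch, with `stable_bucket` as a hypothetical helper name:

```python
import zlib

def stable_bucket(text, n_buckets):
    """Deterministic bucket id for a string, identical across Python processes."""
    return zlib.crc32(text.encode("utf-8")) % n_buckets

# Example IDs are made up for illustration.
scene_bucket = stable_bucket("0063", 10000)
file_bucket = stable_bucket("S2_2019-01-05", 1000)
print(scene_bucket, file_bucket)
```

Whichever hash is used, the resulting bucket ids are arbitrary labels rather than ordered quantities, which is worth keeping in mind when they are fed to a linear model.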

### 3.5 Dataset Summary and Class Distribution

After feature extraction, we analyze the final dataset characteristics:

In [None]:
# Convert to numpy arrays for ML processing
X = np.array(X)
y = np.array(y)

print(f'Dataset created successfully: {X.shape}')
print(f'Features: {X.shape[1]}')
print(f'Samples: {X.shape[0]}')

# Class distribution analysis
flood_count = np.sum(y == 1)
no_flood_count = np.sum(y == 0)
flood_ratio = flood_count / len(y)

print('Class Distribution:')
print(f'Flood: {flood_count}, No Flood: {no_flood_count}')
print(f'Flood ratio: {flood_ratio:.3f}')

# Feature names for reference
feature_names = [
    'month', 'day', 'day_of_week', 'day_of_year', 'quarter',
    'days_since_reference', 'scene_id_numeric', 'data_coverage',
    'filename_length', 'filename_hash', 'observation_index'
]

print(f'Feature names: {feature_names}')

## 4. Data Preprocessing and Normalization

### 4.1 Train-Test Split Strategy

We use stratified sampling to maintain class distribution across train/test sets:
- **Training Set:** 80% of data for model training
- **Test Set:** 20% of data for final evaluation
- **Stratification:** Preserves the original flood/non-flood ratio in both sets (it does not rebalance the classes)

### 4.2 Feature Scaling

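`StandardScaler` rescales each feature to zero mean and unit variance, with statistics fitted on the training set only. This puts features on a comparable scale, which matters for Logistic Regression; tree-based models such as Random Forest and XGBoost are largely insensitive to it: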

In [None]:
# Train-test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f'Training set: {X_train.shape[0]} samples')
print(f'Test set: {X_test.shape[0]} samples')
print(f'Training flood ratio: {np.mean(y_train):.3f}')
print(f'Test flood ratio: {np.mean(y_test):.3f}')
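
The effect of `stratify` is easiest to see on synthetic labels with a known imbalance. The sketch below uses made-up data with 20% positives (not the SEN12-FLOOD features) and checks that both splits keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))        # synthetic features
y_demo = np.array([1] * 200 + [0] * 800)   # 20% positive class

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"train positive ratio: {y_tr.mean():.3f}")
print(f"test positive ratio:  {y_te.mean():.3f}")
```

Without `stratify`, the two ratios would drift apart by chance, and the drift grows as the minority class gets smaller.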

In [None]:
# Feature scaling for consistent model performance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print('✅ Feature scaling completed')
print(f'Scaled training features shape: {X_train_scaled.shape}')
print(f'Feature means: {np.mean(X_train_scaled, axis=0)[:5]}')  # Show first 5
print(f'Feature stds: {np.std(X_train_scaled, axis=0)[:5]}')   # Show first 5
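
When cross-validation is used instead of a single split, the scaler should be re-fitted inside each fold so that validation data never influences the scaling statistics. One way to sketch this is a scikit-learn `Pipeline` combined with `StratifiedKFold`. The data here is synthetic, and the fold count and model choice are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in with the same number of features (11).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, weights=[0.8, 0.2], random_state=42
)

# The scaler is fitted on the training part of each fold only.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="f1")

print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f}")
```

The same pipeline object can be passed to `GridSearchCV`, which keeps the scaling step inside the search as well.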

## 5. Experiment 1: Baseline Model Establishment

### 5.1 Experiment Objective

**Goal:** Establish baseline performance with standard ML algorithms  
**Models:** Logistic Regression, Random Forest, XGBoost (default parameters)  
**Evaluation:** Accuracy, Precision, Recall, F1-Score

### 5.2 Model Training and Evaluation

We train multiple models to identify the best baseline approach:

In [None]:
# Initialize baseline models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42, eval_metric='logloss')
}

# Train and evaluate each model
results = {}

for name, model in models.items():
    # Train model
    model.fit(X_train_scaled, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_scaled)
    
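    # Calculate metrics (all four listed in Section 5.1)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred)
    
    results[name] = {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}
    
    print(f'{name}: Accuracy={accuracy:.4f}, Precision={precision:.4f}, Recall={recall:.4f}, F1={f1:.4f}')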

# Find best model
best_model_name = max(results.keys(), key=lambda k: results[k]['f1'])
best_f1 = results[best_model_name]['f1']

print(f'\nBest Model: {best_model_name}')
print(f'Best F1 Score: {best_f1:.4f}')
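
Baseline scores are easier to interpret next to a trivial reference. The sketch below is an addition for context, not a step in the original experiment. It uses scikit-learn's `DummyClassifier` on synthetic labels with an 80/20 split to show why accuracy alone can look strong on imbalanced data while F1 exposes the problem.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels: 80 negatives, 20 positives. Features are irrelevant here.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))

# Always predicts the majority class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
dummy_pred = dummy.predict(X_demo)

dummy_acc = accuracy_score(y_demo, dummy_pred)
dummy_f1 = f1_score(y_demo, dummy_pred, zero_division=0)

print(f"Majority-class accuracy: {dummy_acc:.2f}")  # 0.80
print(f"Majority-class F1:       {dummy_f1:.2f}")   # 0.00
```

Fitting the same dummy model on `X_train_scaled` and `y_train` would give the floor that the three baseline models need to clear.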

## 6. Model Deployment and Persistence

### 6.1 Save Best Model

We save the best performing model and preprocessing components for production deployment:

In [None]:
# Create models directory (os is already imported in Section 1.2)
os.makedirs('models', exist_ok=True)

# Get the best model
best_model = models[best_model_name]

# Save model artifacts
joblib.dump(best_model, 'models/flood_prediction_model.pkl')
joblib.dump(scaler, 'models/feature_scaler.pkl')

# Save model metadata
model_info = {
    'model_type': best_model_name,
    'accuracy': results[best_model_name]['accuracy'],
    'f1_score': results[best_model_name]['f1'],
    'features': len(feature_names),
    'training_samples': len(X_train),
    'feature_names': feature_names
}

joblib.dump(model_info, 'models/model_info.pkl')
joblib.dump(feature_names, 'models/feature_names.pkl')

print('✅ Model artifacts saved successfully')
print(f'Final F1 Score: {best_f1:.4f}')
print('Model files saved to: models/ directory')
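
At inference time, the saved scaler must be applied before the saved model, in the same order as during training. The following sketch shows that round trip with a throwaway model in a temporary directory. The file names mirror the ones above, while the data, the model, and the variable names are stand-ins.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 11))          # 11 features, as in feature_names
y_demo = (X_demo[:, 0] > 0).astype(int)      # synthetic labels

scaler_demo = StandardScaler().fit(X_demo)
model_demo = LogisticRegression().fit(scaler_demo.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "flood_prediction_model.pkl")
    scaler_path = os.path.join(tmp, "feature_scaler.pkl")
    joblib.dump(model_demo, model_path)
    joblib.dump(scaler_demo, scaler_path)

    # Reload as a separate service or notebook session would.
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

new_rows = rng.normal(size=(5, 11))
preds = loaded_model.predict(loaded_scaler.transform(new_rows))
print(preds)
```

Pickled estimators are generally only safe to load with the same library versions that wrote them, which is one more reason to keep the pinned versions from Section 1.1 alongside the model files.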