## Notebook Overview: Satellite Water Detection Model Pipeline

This notebook walks through the complete end-to-end pipeline for training and evaluating machine-learning models for satellite-based water detection.  
It covers everything from loading preprocessed pixel-level data to generating final prediction masks and preparing a handover package.

### 🔍 What this notebook does:

1. **Load Sample Dataset**  
   Retrieve the preprocessed feature (`X`) and label (`y`) arrays used for training and testing.

2. **Create Train/Validation/Test Splits**  
   Split the dataset into 70% training, 15% validation, and 15% testing to ensure fair and consistent evaluation.

3. **Train Baseline Models (XGBoost & LightGBM)**  
   Train two tree-based models with preset hyperparameters and evaluate their performance on the validation set.

4. **Select Best Model & Save Artifacts**  
   Save both trained models, mark LightGBM as the best model, and store the validation metrics for reproducibility and future inference.

5. **Generate Test Predictions & Reshape into 64×64 Masks**  
   Convert pixel-level predictions back into image tiles for visualization and comparison with ground truth.

6. **Prepare a Clean Handover Package**  
   Organize key outputs (model, metrics, masks, and split details) into a structured folder for easy sharing.

7. **Upload Results to Hugging Face**  
   The final handover folder is manually uploaded to a Hugging Face dataset repository for accessibility.

---

Overall, this notebook provides a clear, reproducible workflow for building and evaluating satellite image classification models, making it easy for teammates and downstream systems to use the outputs.  


### Step 0: Install and Log In to Hugging Face

In [None]:
!pip install huggingface_hub --quiet
from huggingface_hub import hf_hub_download
from huggingface_hub import login
login()

List All Files in the Hugging Face Dataset Repository

In [None]:
from huggingface_hub import list_repo_files

files = list_repo_files(
    repo_id="mishhkaa/satellite-water-detection",
    repo_type="dataset"
)

for f in files:
    print(f)


### Step 1: Load X_sample and y_sample  
This cell downloads the preprocessed feature (`X`) and label (`y`) arrays from the Hugging Face dataset repository.  
It then loads them into NumPy arrays and prints their shapes and label distribution to verify successful loading.
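
The label-distribution check described above can be made a little more informative by reporting class fractions rather than only the unique values. The following is a minimal sketch: `summarize_labels` is a hypothetical helper, the small array stands in for `y_sample.npy`, and the encoding (0 = non-water, 1 = water) is an assumption, since the notebook does not state it.

```python
import numpy as np

def summarize_labels(y):
    """Return {label: fraction of samples} for a label array (hypothetical helper)."""
    labels, counts = np.unique(np.asarray(y).ravel(), return_counts=True)
    total = counts.sum()
    return {int(label): float(count) / float(total) for label, count in zip(labels, counts)}

# Synthetic stand-in for y_sample.npy: 3 of 8 pixels carry label 1 (assumed to mean water).
y_demo = np.array([0, 0, 1, 0, 1, 0, 0, 1])
distribution = summarize_labels(y_demo)

for label, fraction in distribution.items():
    print(f"label {label}: {fraction:.1%} of pixels")
```

A heavily skewed distribution here would be a reason to read the F1 score in later steps more closely than the accuracy.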


In [None]:
# =========================================================
# STEP 1: Load X_sample and y_sample
# =========================================================

from huggingface_hub import hf_hub_download
import numpy as np

repo = "mishhkaa/satellite-water-detection"

# Download and load arrays
X_path = hf_hub_download(repo_id=repo, filename="X_sample.npy", repo_type="dataset")
y_path = hf_hub_download(repo_id=repo, filename="y_sample.npy", repo_type="dataset")

X = np.load(X_path)
y = np.load(y_path)

print("✅ Dataset Loaded Successfully!")
print("X shape:", X.shape)
print("y shape:", y.shape)
print("Unique labels in y:", np.unique(y))


### Step 2: Train/Val/Test Split  
This cell splits the dataset into training, validation, and test sets using `train_test_split`.  
70% of the data is used for training, while the remaining 30% is evenly divided into validation and test sets.
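
The 70/15/15 proportions follow from applying `test_size=0.30` first and then `test_size=0.50` to the held-out 30%. The sketch below works through that arithmetic for a given sample count. `expected_split_sizes` is a hypothetical helper, and it assumes scikit-learn rounds a fractional `test_size` up and gives the remainder to the other side.

```python
import math

def expected_split_sizes(n_samples, temp_fraction=0.30, test_fraction_of_temp=0.50):
    """Expected (train, val, test) sizes of the two-stage split (hypothetical helper).

    Assumption: the held-out side of each split is rounded up and the
    remainder goes to the other side.
    """
    n_temp = math.ceil(temp_fraction * n_samples)        # validation + test pool
    n_train = n_samples - n_temp
    n_test = math.ceil(test_fraction_of_temp * n_temp)   # second split: test side
    n_val = n_temp - n_test
    return n_train, n_val, n_test

n_train, n_val, n_test = expected_split_sizes(1000)
print(f"train={n_train}, val={n_val}, test={n_test}")
```

Comparing these numbers with the shapes printed by the cell below is a quick way to confirm the split behaved as intended.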

In [None]:
# =========================================================
# STEP 2: Train/Val/Test Split
# =========================================================

from sklearn.model_selection import train_test_split

# First: split train vs temp (validation + test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, shuffle=True
)

# Next: split validation vs test equally (15% each)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, shuffle=True
)

print("Split Completed!")
print("Train:", X_train.shape, y_train.shape)
print("Val:", X_val.shape, y_val.shape)
print("Test:", X_test.shape, y_test.shape)


### Step 3: Train Baseline XGBoost Model  
This cell installs XGBoost, defines a baseline `XGBClassifier` with preset hyperparameters, and trains it on the training split.  
After training, it evaluates the model on the validation set using Accuracy and F1 Score to measure baseline performance.
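
To make the two metrics concrete, the sketch below computes accuracy and binary F1 directly from confusion counts. `accuracy_and_f1` is a hypothetical helper rather than a replacement for the scikit-learn functions the notebook uses, and the toy arrays assume label 1 is the positive (water) class.

```python
import numpy as np

def accuracy_and_f1(y_true, y_pred):
    """Accuracy and binary F1 for 0/1 labels (hypothetical helper)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))     # water predicted as water
    fp = int(np.sum(~y_true & y_pred))    # non-water predicted as water
    fn = int(np.sum(y_true & ~y_pred))    # water predicted as non-water
    tn = int(np.sum(~y_true & ~y_pred))   # non-water predicted as non-water
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# Toy case: 10% positives and a model that never predicts the positive class.
y_true_demo = np.array([0] * 9 + [1])
y_pred_demo = np.zeros(10, dtype=int)
acc_demo, f1_demo = accuracy_and_f1(y_true_demo, y_pred_demo)
print(f"accuracy={acc_demo:.2f}, f1={f1_demo:.2f}")
```

In this toy case accuracy stays high while F1 drops to zero, which is why the notebook reports both.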


In [None]:
# =========================================================
# STEP 3: Train Baseline XGBoost Model
# =========================================================

!pip install xgboost --quiet

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score

# Define model
xgb_model = XGBClassifier(
    n_estimators=200,
    max_depth=8,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    tree_method="hist",
    random_state=42,
    n_jobs=-1
)

print("Training XGBoost model...")
xgb_model.fit(X_train, y_train)

# Predict on validation set
val_preds = xgb_model.predict(X_val)

# Compute accuracy & F1
val_acc = accuracy_score(y_val, val_preds)
val_f1 = f1_score(y_val, val_preds)

print("🏁 Training Complete!")
print("Validation Accuracy:", val_acc)
print("Validation F1 Score:", val_f1)


### Step 4: Train LightGBM Model  
This cell sets up LightGBM datasets, trains a gradient-boosting model for up to 200 rounds with early stopping (patience of 30 rounds), and logs progress every 10 rounds.  
After training, it thresholds the predicted probabilities at 0.5 on the validation set and computes Accuracy and F1 Score to evaluate performance.
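
The early-stopping rule can be illustrated without LightGBM: training stops once the validation loss has not improved for a fixed number of consecutive rounds, and the best round seen so far is kept. The sketch below is a plain-Python illustration of that rule, not LightGBM's implementation. `early_stopping_round` and the loss values are made up for the example.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, stopped_round), both 1-based (hypothetical helper).

    Stops as soon as `patience` rounds have passed without a new best loss.
    """
    best_loss = float("inf")
    best_round = 0
    for round_idx, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_round = round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    # Never triggered: training ran through all available rounds.
    return best_round, len(val_losses)

# Illustrative validation losses; the improvement in round 7 is never reached.
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.44]
best_round, stopped_round = early_stopping_round(losses, patience=3)
print(f"best round: {best_round}, stopped at round: {stopped_round}")
```

With `stopping_rounds=30` and `num_boost_round=200`, the same logic means training may end well before round 200 if the validation log loss plateaus.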


In [None]:
# =========================================================
# STEP 4: LightGBM Training
# =========================================================

!pip install lightgbm --quiet

import lightgbm as lgb
from sklearn.metrics import accuracy_score, f1_score

print("🏁 Preparing LightGBM datasets...")

lgb_train = lgb.Dataset(X_train, label=y_train)
# Tie the validation set to the training set so both share the same feature binning
lgb_val = lgb.Dataset(X_val, label=y_val, reference=lgb_train)

print("🚀 Starting LightGBM Training...")

lgb_model = lgb.train(
    params={
        "objective": "binary",
        "metric": "binary_logloss",
        "boosting_type": "gbdt",
        "num_leaves": 64,
        "learning_rate": 0.05,
        "feature_fraction": 0.8,
        "bagging_fraction": 0.8,
        "bagging_freq": 5,
        "max_depth": -1,
        "verbose": -1
    },
    train_set=lgb_train,
    valid_sets=[lgb_train, lgb_val],
    num_boost_round=200,
    callbacks=[
        lgb.early_stopping(stopping_rounds=30),
        lgb.log_evaluation(period=10)   # Print every 10 rounds
    ]
)

print("🎉 Training finished!")

# Make predictions
val_preds_lgb = (lgb_model.predict(X_val) > 0.5).astype(int)

# Metrics
val_acc_lgb = accuracy_score(y_val, val_preds_lgb)
val_f1_lgb = f1_score(y_val, val_preds_lgb)

print("\n📈 LightGBM Validation Results")
print("Validation Accuracy:", val_acc_lgb)
print("Validation F1 Score:", val_f1_lgb)


### Step 5: Save Models and Metrics  
This cell creates `models/` and `results/` directories, saves the trained XGBoost and LightGBM models using `joblib`, and writes all validation metrics into a JSON file.  
LightGBM is also saved as `best_model.pkl` and recorded as the best model in the metrics file.
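
The cell below records LightGBM as the best model directly. One possible alternative is to derive that choice from the validation F1 scores. The sketch below assumes the same key naming as `metrics.json` (`<name>_f1`). `pick_best_model` is a hypothetical helper, and the scores are placeholders for illustration, not results from this notebook.

```python
import json
import os
import tempfile

def pick_best_model(metrics, candidates=("xgboost", "lightgbm")):
    """Return the candidate with the highest validation F1 (hypothetical helper)."""
    return max(candidates, key=lambda name: metrics[f"{name}_f1"])

# Placeholder scores for illustration only.
demo_metrics = {
    "xgboost_accuracy": 0.90, "xgboost_f1": 0.80,
    "lightgbm_accuracy": 0.91, "lightgbm_f1": 0.82,
}
demo_metrics["best_model"] = pick_best_model(demo_metrics)

# Round-trip through JSON in a temporary directory to mirror results/metrics.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    metrics_path = os.path.join(tmp_dir, "metrics.json")
    with open(metrics_path, "w") as f:
        json.dump(demo_metrics, f, indent=4)
    with open(metrics_path) as f:
        reloaded = json.load(f)

print("best model:", reloaded["best_model"])
```

If the selection were made this way, Step 6 would also need to predict with whichever model was chosen rather than always using `lgb_model`.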


In [None]:
# =========================================================
# STEP 5: Save Models & Metrics
# =========================================================

import joblib
import json
import os

os.makedirs("models", exist_ok=True)
os.makedirs("results", exist_ok=True)

# Save models
joblib.dump(xgb_model, "models/xgb_model.pkl")
joblib.dump(lgb_model, "models/lgbm_model.pkl")
# LightGBM is saved as the best model; this choice is hardcoded, not computed from the metrics
joblib.dump(lgb_model, "models/best_model.pkl")

# Save validation results
metrics = {
    "xgboost_accuracy": float(val_acc),
    "xgboost_f1": float(val_f1),
    "lightgbm_accuracy": float(val_acc_lgb),
    "lightgbm_f1": float(val_f1_lgb),
    "best_model": "lightgbm"
}

with open("results/metrics.json", "w") as f:
    json.dump(metrics, f, indent=4)

print("✅ Models and metrics saved successfully!")


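### Step 6: Generate Test Predictions & Reshape into Masks  
This cell uses the trained LightGBM model to predict labels for the test set, trims the predictions to a multiple of 4096 pixels (64 × 64), and reshapes them into 64×64 mask tiles. Because Step 2 shuffles individual pixels before splitting, each tile groups 4096 test pixels in split order rather than reconstructing an original image patch.  
Both predicted and true masks are then saved for later visualization or evaluation.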
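
The threshold, trim, and reshape logic can be tried in isolation on synthetic data. In the sketch below, `to_tiles` is a hypothetical helper and the random probabilities stand in for the output of `lgb_model.predict(X_test)`.

```python
import numpy as np

TILE_SIDE = 64

def to_tiles(flat_values, tile_side=TILE_SIDE):
    """Drop the trailing remainder and reshape a 1-D array into square tiles."""
    flat_values = np.asarray(flat_values)
    pixels_per_tile = tile_side * tile_side             # 4096 for 64x64 tiles
    num_tiles = flat_values.shape[0] // pixels_per_tile
    usable = num_tiles * pixels_per_tile
    return flat_values[:usable].reshape(num_tiles, tile_side, tile_side)

# Synthetic stand-in for predicted probabilities on 10,000 test pixels.
rng = np.random.default_rng(0)
probs_demo = rng.random(10_000)
labels_demo = (probs_demo > 0.5).astype(int)            # same 0.5 threshold as the notebook

tiles_demo = to_tiles(labels_demo)
dropped = labels_demo.shape[0] - tiles_demo.size
print("tiles shape:", tiles_demo.shape, "| pixels dropped:", dropped)
```

The number of dropped pixels is worth logging, since any remainder below 4096 is silently excluded from the saved masks.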

In [None]:
# =========================================================
# STEP 6: Generate Predicted Masks for Test Set
# =========================================================

import os

import numpy as np

print("📌 Predicting on test set using best LightGBM model...")

test_preds = (lgb_model.predict(X_test) > 0.5).astype(int)

print("Raw prediction shape:", test_preds.shape)

# Ensure divisible by 4096 pixels
usable_pixels = (test_preds.shape[0] // 4096) * 4096

test_preds = test_preds[:usable_pixels]
y_test_trimmed = y_test[:usable_pixels]

num_tiles = usable_pixels // 4096

# Reshape predictions & ground truth back to tile format
predicted_masks = test_preds.reshape(num_tiles, 64, 64)
true_masks = y_test_trimmed.reshape(num_tiles, 64, 64)

print("Predicted masks shape:", predicted_masks.shape)
print("True masks shape:", true_masks.shape)

# Save arrays
os.makedirs("predictions", exist_ok=True)
np.save("predictions/predicted_masks.npy", predicted_masks)
np.save("predictions/true_masks.npy", true_masks)

print("🎉 Predicted masks saved successfully!")


### Step 7: Prepare Handover Package  
This cell records dataset split details into a JSON file and copies key artifacts (metrics, the best model, and prediction masks) into a dedicated `handover` folder.  
It ensures all essential files are neatly packaged for delivery or further processing.
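
The cells below copy files with `!cp`, which depends on a Unix-like shell. A portable alternative using only the standard library could look like the sketch below. `build_handover` is a hypothetical helper, and the temporary directory stands in for the notebook's working directory.

```python
import shutil
import tempfile
from pathlib import Path

def build_handover(source_files, handover_dir):
    """Copy existing files into handover_dir; return (copied names, missing paths)."""
    handover_dir = Path(handover_dir)
    handover_dir.mkdir(parents=True, exist_ok=True)
    copied, missing = [], []
    for src in map(Path, source_files):
        if src.is_file():
            shutil.copy2(src, handover_dir / src.name)
            copied.append(src.name)
        else:
            missing.append(str(src))   # reported instead of failing silently
    return copied, missing

# Demo in a temporary directory: one artifact exists, one does not.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "results").mkdir()
    (root / "results" / "metrics.json").write_text("{}")
    copied, missing = build_handover(
        [root / "results" / "metrics.json", root / "models" / "best_model.pkl"],
        root / "handover",
    )
    packaged = sorted(p.name for p in (root / "handover").iterdir())

print("copied:", copied, "| missing count:", len(missing))
```

Returning the list of missing files makes an incomplete package visible before it is shared.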


In [None]:
# =========================================================
# STEP 7: Package deliverables
# =========================================================

import json
import os

os.makedirs("handover", exist_ok=True)

# Save split information
split_info = {
    "train_size": len(X_train),
    "val_size": len(X_val),
    "test_size": len(X_test),
    "tile_size": [64, 64],
    "num_test_tiles": predicted_masks.shape[0]
}

with open("handover/train_test_split_info.json", "w") as f:
    json.dump(split_info, f, indent=4)

# Copy files into handover folder
!cp results/metrics.json handover/
!cp models/best_model.pkl handover/
!cp predictions/predicted_masks.npy handover/
!cp predictions/true_masks.npy handover/

print("🎁 Handover package prepared successfully!")
print("Files in handover folder:")
!ls handover


In [None]:
# =========================================================
# COPY ALL MODEL FILES INTO HANDOVER
# =========================================================

import os

# Ensure folder exists
os.makedirs("handover", exist_ok=True)

# Copy both models using shell command
!cp models/xgb_model.pkl handover/
!cp models/lgbm_model.pkl handover/

# Show folder content
print("📦 Updated handover folder now contains:")
!ls -lh handover


### Final Step: Upload Handover Package to Hugging Face  
The generated `handover` folder (containing the models, metrics, prediction masks, and split details) was manually uploaded to a Hugging Face dataset repository.  
This allows easy sharing, versioning, and access for downstream tasks or team members.
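
The upload in this notebook was done by hand. For repeat runs, a programmatic alternative could use `upload_folder` from `huggingface_hub`, as in the sketch below. `upload_handover` is a hypothetical wrapper, the `repo_id` is a placeholder to replace with your own repository, and the call assumes you have already run `login()` and have write access.

```python
from pathlib import Path

def upload_handover(folder="handover", repo_id="your-username/your-dataset-repo"):
    """Upload a local handover folder to a Hugging Face dataset repo (hypothetical wrapper)."""
    folder_path = Path(folder)
    if not folder_path.is_dir():
        raise FileNotFoundError(f"Handover folder not found: {folder_path}")

    # Imported lazily so the folder check above works without the package installed.
    from huggingface_hub import upload_folder

    return upload_folder(
        repo_id=repo_id,                  # placeholder; replace with your own repository
        folder_path=str(folder_path),
        path_in_repo="handover",          # assumed target location inside the repo
        repo_type="dataset",
        commit_message="Add handover package",
    )

# Example call (needs network access, a prior login(), and write permission):
# upload_handover("handover", repo_id="your-username/your-dataset-repo")
```

Checking that the folder exists first avoids a confusing remote error when the packaging step has not been run.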