# 🌧️ Flood Prediction (NCR, Philippines) — Simple Machine Learning Notebook

<div align="center">

**Topic:** Flood Prediction using Rainfall, Water Level, and Elevation  
**Model:** Logistic Regression (Binary Classification)  
**Platform:** Google Colab  

</div>

---

## 📌 Quick Background (student write-up)
Flooding is a common concern in some areas of the National Capital Region (NCR), especially during heavy rainfall.  
In this activity, the proponents created a simple machine learning model that predicts whether flooding will occur (**FloodOccurrence**) based on:

- **Rainfall_mm** (mm)  
- **WaterLevel_m** (m)  
- **Elevation_m** (m)

**Output**
- `0` = No Flood  
- `1` = Flood  

> Note: The dataset typically has **more “No Flood”** rows than “Flood” rows, so the notebook handles the imbalance with class weighting (`class_weight="balanced"`).

---

## ✅ What this notebook covers (based on the given instructions)
A. Dataset Loading  
B. Preprocessing (missing values, scaling, train/test split)  
C. Logistic Regression Training  
D. Model Evaluation (Accuracy, Confusion Matrix, Precision/Recall/F1, ROC)  
E. Insights + Suggestions  
F. A simple function to predict from new inputs


## 1) ⚙️ Imports
This section imports the libraries used for data handling, training, and evaluation.

In [None]:
import pandas as pd
import numpy as np

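import matplotlib.pyplot as plt
from IPython.display import display  # explicit import so display() also works outside a notebook cell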

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
    roc_curve, auc
)

RANDOM_STATE = 123


## 2) 📂 Dataset Loading (Colab)

### How the proponents load the dataset:
1. Upload the cleaned dataset (`cleaned_data.csv`) into Colab.
2. Read the file using `pandas`.

✅ **Tip:** If the dataset name is different, just edit `CSV_PATH`.


In [None]:
# Upload dataset file here (recommended in Google Colab)
try:
    from google.colab import files
    uploaded = files.upload()
    print("Uploaded:", list(uploaded.keys()))
except Exception as e:
    print("Upload skipped or not running in Colab.")
    print("Info:", e)


In [None]:
# Set the dataset filename here (default = cleaned_data.csv)
CSV_PATH = "cleaned_data.csv"

df = pd.read_csv(CSV_PATH)
df.head()


## 3) 🔍 Quick Dataset Check

The proponents checked the following:
- dataset size
- column names
- missing values
- target distribution (to see if there is imbalance)


In [None]:
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())

print("\nMissing values per column:")
display(df.isna().sum())

if "FloodOccurrence" in df.columns:
    print("\nFloodOccurrence counts:")
    display(df["FloodOccurrence"].value_counts())

    print("\nFloodOccurrence ratio:")
    display(df["FloodOccurrence"].value_counts(normalize=True))
else:
    print("⚠️ FloodOccurrence column was not found. Please check the dataset.")


## 4) 🧼 Preprocessing

Based on the task instructions, preprocessing should include:

✅ **Handle missing values**  
✅ **Feature scaling** (StandardScaler recommended)  
✅ **Train–test split**

### Why scaling?
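Rainfall is recorded in millimeters and can reach much larger values than elevation and water level, which are recorded in meters.  
Scikit-learn's `LogisticRegression` applies L2 regularization by default, so putting all features on the same scale keeps the penalty from treating them unevenly and helps the solver converge.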
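
As a small illustration of what `StandardScaler` does, the sketch below scales a few made-up readings (the numbers are placeholders, not rows from `cleaned_data.csv`). After scaling, every column should have a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up readings for illustration only:
# columns are rainfall (mm), water level (m), elevation (m)
X_demo = np.array([
    [10.0, 0.8, 15.0],
    [60.0, 1.5, 10.0],
    [150.0, 3.0, 5.0],
    [30.0, 1.0, 12.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# Before scaling, the rainfall column has a far larger spread than the others
print("Raw std per column    :", X_demo.std(axis=0).round(3))

# After scaling, each column is centered at 0 with unit spread
print("Scaled mean per column:", X_scaled.mean(axis=0).round(3))
print("Scaled std per column :", X_scaled.std(axis=0).round(3))
```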

### Why imbalance handling?
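Flood events are often fewer than non-flood events, so a model can look accurate while mostly predicting "No Flood".  
To counter this, the proponents used:
- `class_weight="balanced"`, which weights each class inversely to how often it appears in the training data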
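
To see what the `"balanced"` setting means in numbers, here is a minimal sketch using scikit-learn's `compute_class_weight` helper on made-up labels (8 "No Flood" and 2 "Flood"; `y_demo` is a placeholder, not the real target column). The weights follow `n_samples / (n_classes * count_per_class)`, so the rarer class receives the larger weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels for illustration: 8 "No Flood" (0) and 2 "Flood" (1)
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

classes = np.unique(y_demo)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same values computed by hand: n_samples / (n_classes * count_per_class)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print("Class weights:", dict(zip(classes.tolist(), weights.tolist())))
print("Manual check :", manual.tolist())
```

With these labels, class `0` gets a weight of 0.625 and class `1` gets 2.5, so each flood row counts four times as much as a no-flood row during training.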


In [None]:
FEATURES = ["Rainfall_mm", "WaterLevel_m", "Elevation_m"]
TARGET = "FloodOccurrence"

missing_cols = [c for c in FEATURES + [TARGET] if c not in df.columns]
if missing_cols:
    raise ValueError(f"Missing required columns: {missing_cols}")

X = df[FEATURES].copy()
y = df[TARGET].copy()

# Train-test split (required format from instructions)
# stratify=y keeps the flood / no-flood ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y
)

print("Train shape:", X_train.shape)
print("Test shape :", X_test.shape)


### 4.1 Preprocessing + Model Pipeline

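To keep the workflow organized, the proponents used a pipeline. Because the imputer and scaler sit inside the pipeline, they are fitted on the training data only and then reused unchanged on the test data, which avoids data leakage. The steps are: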

1. Median Imputation (for missing values)
2. StandardScaler (feature scaling)
3. Logistic Regression (classification model)
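
The sketch below shows the three steps on a tiny made-up training set (the rows and the names `X_train_demo`, `demo_pipeline`, and `X_new` are placeholders for illustration). It checks that the imputer stores medians learned from the training rows and reuses them when a new row has a missing value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training rows with one missing rainfall value
X_train_demo = pd.DataFrame({
    "Rainfall_mm": [10.0, np.nan, 150.0, 30.0, 90.0, 20.0],
    "WaterLevel_m": [0.8, 1.5, 3.0, 1.0, 2.4, 0.9],
    "Elevation_m": [15.0, 10.0, 5.0, 12.0, 6.0, 14.0],
})
y_train_demo = pd.Series([0, 0, 1, 0, 1, 0])

demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
demo_pipeline.fit(X_train_demo, y_train_demo)

# The imputer keeps one median per feature, learned from the training rows only
learned_medians = demo_pipeline.named_steps["imputer"].statistics_
print(dict(zip(X_train_demo.columns, learned_medians.tolist())))

# A new row with a missing rainfall value is filled with the training median
X_new = pd.DataFrame({"Rainfall_mm": [np.nan], "WaterLevel_m": [2.0], "Elevation_m": [8.0]})
proba_new = demo_pipeline.predict_proba(X_new)
print(proba_new)
```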


In [None]:
pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(
        max_iter=1000,
        class_weight="balanced",
        random_state=RANDOM_STATE
    ))
])

pipeline


## 5) 🤖 Model Training (Logistic Regression)

In this step, the model learns from the training data to predict FloodOccurrence.


In [None]:
pipeline.fit(X_train, y_train)
print("✅ Model training finished.")


## 6) 📊 Model Evaluation

For Logistic Regression, the required evaluation metrics are:

- **Accuracy**
- **Confusion Matrix**
- **Precision, Recall, F1-score**
- (Optional) ROC Curve

> Because a missed flood is more dangerous than a false alarm, the proponents focused more on **Recall** (the share of actual flood cases that the model catches).
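
To illustrate why accuracy alone can mislead on imbalanced data, the sketch below scores a made-up "model" that always predicts No Flood (the labels are placeholders, not the real test set). Its accuracy looks high, yet it catches none of the floods.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up test labels: 18 "No Flood" cases and 2 "Flood" cases
y_true_demo = np.array([0] * 18 + [1] * 2)

# A useless baseline that always predicts "No Flood"
y_always_no = np.zeros_like(y_true_demo)

acc_demo = accuracy_score(y_true_demo, y_always_no)
rec_demo = recall_score(y_true_demo, y_always_no, zero_division=0)

print(f"Accuracy: {acc_demo:.2f}")  # 18 of 20 labels match
print(f"Recall  : {rec_demo:.2f}")  # 0 of 2 floods are caught
```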


In [None]:
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, zero_division=0)
rec = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)

print(f"Accuracy : {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall   : {rec:.4f}")
print(f"F1-score : {f1:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, digits=4, zero_division=0))


### 6.1 Confusion Matrix

- **TN**: correct No Flood
- **FP**: false alarm (predicted flood but no flood)
- **FN**: missed flood (most critical)
- **TP**: correct Flood
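
As a quick sketch of how the four cells map to code, scikit-learn's `confusion_matrix` returns them in the order TN, FP, FN, TP for binary labels `[0, 1]`, so they can be unpacked with `.ravel()`. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (0 = No Flood, 1 = Flood)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

recall_demo = tp / (tp + fn)        # share of real floods that were caught
precision_demo = tp / (tp + fp)     # share of flood alerts that were correct

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Recall={recall_demo:.3f}, Precision={precision_demo:.3f}")
```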


In [None]:
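# Fix the label order so the matrix is always 2x2 and matches the display labels below
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])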
print("Confusion Matrix:\n", cm)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["No Flood (0)", "Flood (1)"])
disp.plot(values_format="d")
plt.title("Confusion Matrix — Flood Prediction")
plt.show()


### 6.2 ROC Curve (Optional)

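The ROC curve plots the true positive rate against the false positive rate at every possible threshold, which shows how well the model separates flood from no-flood cases. An AUC of 0.5 is no better than chance, while an AUC of 1.0 means perfect separation.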
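
The sketch below runs `roc_curve` and `roc_auc_score` on four made-up scores so the numbers are easy to follow by hand (the arrays are placeholders, not model output). Three of the four flood/no-flood pairs are ranked correctly, so the AUC is 0.75.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted flood probabilities
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.10, 0.40, 0.35, 0.80])

fpr_demo, tpr_demo, thr_demo = roc_curve(y_true_demo, scores_demo)
auc_demo = roc_auc_score(y_true_demo, scores_demo)

# Each row is one point on the ROC curve
for f, t, th in zip(fpr_demo, tpr_demo, thr_demo):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

print(f"AUC = {auc_demo:.2f}")
```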


In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.4f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance (AUC = 0.50)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.title("ROC Curve — Flood Prediction")
plt.legend(loc="lower right")
plt.show()


## 7) 🧠 Insights (3–5)

Below are sample insights written in a student-friendly way.  
(Students can revise depending on their final metrics.)

1. The dataset shows **class imbalance**, so `class_weight="balanced"` was applied to help the model detect flood cases better.  
2. **StandardScaler** puts the features on a common scale, which helps training stability because they have different ranges and units.  
3. **Recall** is important because missing a flood (false negative) can be riskier than a false alarm (false positive).  
4. Higher rainfall and water level are expected to raise flood probability, while higher elevation is expected to lower it; the signs of the coefficients in 7.1 show whether the dataset agrees.  
5. The model can still be improved by adding more features (e.g., soil moisture, location, seasonal variables) and tuning the decision threshold.


### 7.1 Feature Influence (Optional Interpretation)

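This prints the coefficients to show which features push the prediction toward flood (positive coefficient) or no flood (negative coefficient). Because the features are standardized inside the pipeline, each coefficient and its odds ratio describe the effect of a one-standard-deviation increase in that feature, not a one-unit increase.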
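
If per-unit effects (per mm or per m) are easier to explain, one option is to divide each standardized coefficient by the scaler's stored standard deviation. The sketch below does this on synthetic data generated only to get a fitted model; `X_demo`, `y_demo`, and `demo_pipeline` are placeholders, and the resulting numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "Rainfall_mm": rng.uniform(0, 200, n),
    "WaterLevel_m": rng.uniform(0.5, 3.5, n),
    "Elevation_m": rng.uniform(2, 30, n),
})
score = 0.03 * X_demo["Rainfall_mm"] + 1.5 * X_demo["WaterLevel_m"] - 0.1 * X_demo["Elevation_m"]
y_demo = (score + rng.normal(0, 1, n) > score.median()).astype(int)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
demo_pipeline.fit(X_demo, y_demo)

coef_std = demo_pipeline.named_steps["model"].coef_[0]  # effect per 1 standard deviation
scale = demo_pipeline.named_steps["scaler"].scale_      # training std of each feature
coef_raw = coef_std / scale                             # effect per 1 original unit

table = pd.DataFrame({
    "Feature": X_demo.columns,
    "Coef_per_SD": coef_std,
    "Coef_per_unit": coef_raw,
    "Odds_Ratio_per_unit": np.exp(coef_raw),
})
print(table)
```

The per-unit odds ratio reads as "the odds of flooding are multiplied by this factor for every extra 1 mm (or 1 m), holding the other features fixed".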


In [None]:
model = pipeline.named_steps["model"]
coefs = model.coef_[0]

coef_table = pd.DataFrame({
    "Feature": FEATURES,
    "Coefficient": coefs,
    "Odds_Ratio (exp(coef))": np.exp(coefs)
}).sort_values("Coefficient", ascending=False)

display(coef_table)


## 8) 🔮 Predict Flood From New Inputs (Demo)

This function is useful for the final presentation:
- Input rainfall, water level, elevation
- Output probability + predicted label

The threshold can be adjusted:
- Default: 0.50
- Lower threshold: catches more floods (higher recall) but more false alarms
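
The trade-off above can be checked with a small sweep. This sketch applies three thresholds to made-up probabilities (`y_true_demo` and `proba_demo` are placeholders, not model output) and reports recall, precision, and the number of false alarms at each one.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted flood probabilities
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba_demo = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.35, 0.45, 0.60, 0.90])

results = []
for threshold in (0.50, 0.35, 0.25):
    # Predict "Flood" when the probability reaches the threshold
    y_hat = (proba_demo >= threshold).astype(int)
    results.append({
        "threshold": threshold,
        "recall": recall_score(y_true_demo, y_hat, zero_division=0),
        "precision": precision_score(y_true_demo, y_hat, zero_division=0),
        "false_alarms": int(((y_hat == 1) & (y_true_demo == 0)).sum()),
    })

for row in results:
    print(row)
```

In this toy case, lowering the threshold from 0.50 to 0.35 raises recall from 0.5 to 1.0 while the false alarms go from 1 to 2.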


In [None]:
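def predict_flood(rainfall_mm: float, waterlevel_m: float, elevation_m: float, threshold: float = 0.50) -> dict:
    """Return the flood probability and the predicted class at the given threshold."""
    # Column names must match the FEATURES used to train the pipeline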
    new_data = pd.DataFrame([{
        "Rainfall_mm": rainfall_mm,
        "WaterLevel_m": waterlevel_m,
        "Elevation_m": elevation_m
    }])

    prob_flood = pipeline.predict_proba(new_data)[0][1]
    pred = int(prob_flood >= threshold)

    return {
        "Flood_Probability": float(prob_flood),
        "Predicted_Class": pred,
        "Meaning": "FLOOD" if pred == 1 else "NO FLOOD",
        "Threshold_Used": threshold
    }

# Sample demo values
print(predict_flood(rainfall_mm=40, waterlevel_m=1.2, elevation_m=12))
print(predict_flood(rainfall_mm=120, waterlevel_m=2.8, elevation_m=8))
print(predict_flood(rainfall_mm=80, waterlevel_m=2.0, elevation_m=10, threshold=0.35))


---  
## ✅ End of Notebook

**Submission reminder (GitHub structure):**
- Notebook: `notebooks/`
- Dataset: `data/`
- README.md + requirements.txt in the main folder

After running, the proponents should copy the final evaluation results (Accuracy, Recall, F1, AUC) into the README.
