# Flood Data Processing, Feature Selection, and Modeling

This script processes flood-related data, performs feature selection, and trains a machine learning model.  
It is designed to automatically detect whether the target variable is **categorical (classification)** or **continuous (regression)** and adapt accordingly.

---

## 🔹 1. Configuration
- **DATA_PATH**: Path to your dataset (`flood_data.csv`).
- **TARGET_COLUMN**: The column you want to predict (e.g., `storm_drain_proximity_m`).
- **Output paths**: Where processed data, selected features, and figures will be saved.
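
A minimal sketch of what such a configuration block might look like. The output file names and the `OUTPUT_DIR`, `FIGURES_DIR`, `SELECTED_FEATURES_PATH`, and `PROCESSED_DATA_PATH` variable names are assumptions for illustration; only `DATA_PATH`, `TARGET_COLUMN`, and the `outputs/figures` directory come from this README.

```python
from pathlib import Path

# Values named in this README.
DATA_PATH = Path("flood_data.csv")
TARGET_COLUMN = "storm_drain_proximity_m"

# Hypothetical output locations; adjust to match the actual script.
OUTPUT_DIR = Path("outputs")
FIGURES_DIR = OUTPUT_DIR / "figures"
SELECTED_FEATURES_PATH = OUTPUT_DIR / "selected_features.txt"
PROCESSED_DATA_PATH = OUTPUT_DIR / "processed_flood_data.csv"

# Create the output directories up front so later save calls cannot fail on a missing folder.
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
```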

---

## 🔹 2. Load Data
- Reads the CSV file into a Pandas DataFrame.
- Prints dataset shape, info, summary statistics, and missing values.
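
The loading and inspection step can be sketched roughly as follows. To keep the example self-contained it reads a tiny inline CSV with made-up columns instead of `flood_data.csv`; the column names are illustrative only.

```python
import io

import pandas as pd

# Stand-in for flood_data.csv; columns and values are invented for the example.
csv_text = """rainfall_mm,elevation_m,land_use,flood
12.5,30.1,urban,0
80.2,,rural,1
45.0,12.4,urban,
"""

df = pd.read_csv(io.StringIO(csv_text))

print("Shape:", df.shape)       # (rows, columns)
df.info()                       # dtypes and non-null counts
print(df.describe())            # summary statistics for numeric columns
missing_counts = df.isnull().sum()
print(missing_counts)           # missing values per column
```

With the real dataset, `pd.read_csv(DATA_PATH)` would replace the `io.StringIO` call.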

---

## 🔹 3. Target Column
- Checks if the specified `TARGET_COLUMN` exists.
- If it does not, tries to guess the target from common target names (`flood`, `label`, etc.).
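
One way this lookup could work is sketched below. The helper name `resolve_target`, the case-insensitive matching, and the `target` candidate are assumptions; the README only names `flood` and `label` as examples.

```python
import pandas as pd


def resolve_target(df, target_column, candidates=("flood", "label", "target")):
    """Return the configured target if present, else the first matching fallback name."""
    if target_column in df.columns:
        return target_column
    # Compare case-insensitively so that "Label" matches "label" (an assumption).
    lowered = {str(col).lower(): col for col in df.columns}
    for name in candidates:
        if name in lowered:
            return lowered[name]
    raise KeyError(
        f"Target column {target_column!r} not found and no fallback name matched."
    )


example = pd.DataFrame({"rainfall_mm": [12.5, 80.2], "Label": [0, 1]})
resolved = resolve_target(example, "storm_drain_proximity_m")
```

Raising an error when nothing matches is a design choice for the sketch; it avoids silently training on the wrong column.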

---

## 🔹 4. Clean Target
- Removes rows where the target is missing (`NaN`).
- Supervised models cannot be trained on missing labels, so these rows are dropped before modeling.

---

## 🔹 5. Split Features & Target
- `X`: Features (all columns except the target).
- `y`: Target column.
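
Steps 4 and 5 together can be sketched as below, using a small made-up DataFrame with `flood` as the target. Resetting the index after dropping rows is an assumption, not something this README states.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "rainfall_mm": [12.5, 80.2, 45.0],
        "elevation_m": [30.1, 8.7, 12.4],
        "flood": [0, 1, np.nan],  # last row has no label
    }
)
TARGET_COLUMN = "flood"

# Step 4: drop rows whose target is missing.
df = df.dropna(subset=[TARGET_COLUMN]).reset_index(drop=True)

# Step 5: separate features from the target.
X = df.drop(columns=[TARGET_COLUMN])
y = df[TARGET_COLUMN]
```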

---

## 🔹 6. Handle Missing Values in Features
- Numeric columns → filled with **median**.
- Categorical columns → filled with **mode** (most frequent value).
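
A minimal sketch of this imputation, assuming that every non-numeric column is treated as categorical. The example columns are invented.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {
        "elevation_m": [10.0, np.nan, 30.0],
        "land_use": ["urban", None, "urban"],
    }
)

numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns

# Numeric columns: fill gaps with the column median.
X[numeric_cols] = X[numeric_cols].fillna(X[numeric_cols].median())

# Categorical columns: fill gaps with the most frequent value, if one exists.
for col in categorical_cols:
    mode = X[col].mode(dropna=True)
    if not mode.empty:
        X[col] = X[col].fillna(mode.iloc[0])
```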

---

## 🔹 7. Exploratory Data Analysis (EDA)
- Histograms for numeric features.
- Box plots for numeric features.
- Correlation heatmap of numeric columns.
- All plots are saved to the `outputs/figures` directory.
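
The three plot types could be produced along these lines. This sketch uses plain Matplotlib on synthetic data; the plotting library the script actually uses and the figure file names are assumptions.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: write files, open no windows
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "rainfall_mm": rng.gamma(2.0, 20.0, 200),
        "elevation_m": rng.normal(50.0, 10.0, 200),
    }
)
numeric = X.select_dtypes(include="number")

figures_dir = Path("outputs/figures")
figures_dir.mkdir(parents=True, exist_ok=True)

# Histograms: one panel per numeric feature.
numeric.hist(bins=20, figsize=(8, 4))
plt.tight_layout()
plt.savefig(figures_dir / "histograms.png")
plt.close("all")

# Box plots of the numeric features.
fig, ax = plt.subplots(figsize=(6, 4))
numeric.boxplot(ax=ax)
fig.tight_layout()
fig.savefig(figures_dir / "boxplots.png")
plt.close(fig)

# Correlation heatmap of the numeric columns.
corr = numeric.corr()
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
fig.tight_layout()
fig.savefig(figures_dir / "correlation_heatmap.png")
plt.close(fig)
```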

---

## 🔹 8. Handle Categorical Data
- Converts categorical variables into numeric format using **one-hot encoding** (`pd.get_dummies`).
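
For illustration, here is `pd.get_dummies` applied to a small made-up frame with default arguments. Whether the script passes options such as `drop_first` is not stated in this README.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "rainfall_mm": [12.5, 80.2, 45.0],
        "land_use": ["urban", "rural", "urban"],
    }
)

# Each category of land_use becomes its own indicator column;
# numeric columns pass through unchanged.
X_encoded = pd.get_dummies(X)
print(X_encoded.columns.tolist())
```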

---

## 🔹 9. Scaling
- Standardizes numeric features using `StandardScaler` (mean = 0, std = 1).
- Puts numeric features on a comparable scale, which matters for scale-sensitive methods; tree-based models such as Random Forest are largely unaffected by it.
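
A sketch of the scaling step on made-up data. It assumes that only the original numeric columns are scaled and that one-hot indicator columns are left as they are; the script may handle this differently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame(
    {
        "rainfall_mm": [10.0, 20.0, 30.0],
        "elevation_m": [100.0, 200.0, 600.0],
        "land_use_urban": [1, 0, 1],  # indicator column, left unscaled here
    }
)
numeric_cols = ["rainfall_mm", "elevation_m"]

scaler = StandardScaler()
scaled = scaler.fit_transform(X[numeric_cols])

# Rebuild the frame so the scaled columns keep their names and row index.
X_scaled = X.copy()
X_scaled[numeric_cols] = pd.DataFrame(scaled, columns=numeric_cols, index=X.index)
```

`StandardScaler` uses the population standard deviation, so the scaled columns have mean 0 and standard deviation 1 when computed with `ddof=0`.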

---

## 🔹 10. Train-Test Split
- Splits the dataset into training (80%) and test (20%) sets.
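
With scikit-learn, an 80/20 split can be written as below. The `random_state` value is an assumption added for reproducibility, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y = pd.Series(rng.integers(0, 2, size=100), name="flood")

# test_size=0.2 holds out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```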

---

## 🔹 11. Task Detection (Classification vs Regression)
- If the target (`y`) is **numeric with many unique values** → treat as **regression**.
- Otherwise → treat as **classification**.
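
The rule above might be implemented along these lines. The function name `detect_task` and the cutoff of 20 unique values are assumptions; this README does not state what threshold the script uses for "many".

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def detect_task(y, max_classes=20):
    """Guess the task type from the target series.

    max_classes is a hypothetical cutoff: a numeric target with more
    unique values than this is treated as continuous.
    """
    if is_numeric_dtype(y) and y.nunique() > max_classes:
        return "regression"
    return "classification"


continuous_target = pd.Series([i * 0.5 for i in range(100)])
binary_target = pd.Series([0, 1] * 50)
text_target = pd.Series(["low", "high", "low"])

tasks = [detect_task(t) for t in (continuous_target, binary_target, text_target)]
```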

---

## 🔹 12. Feature Selection
- **Classification** → uses `f_classif` (ANOVA F-test).
- **Regression** → uses `f_regression`.
- Selects the top **10 most relevant features**.
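
A sketch of univariate selection with these score functions. Using `SelectKBest` as the selector is an assumption (the README names only the score functions), and the data is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, f_regression

X_arr, y = make_classification(
    n_samples=200, n_features=25, n_informative=5, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(X_arr.shape[1])])

task = "classification"  # would come from the task-detection step
score_func = f_classif if task == "classification" else f_regression

# Guard against datasets with fewer than 10 columns.
k = min(10, X.shape[1])
selector = SelectKBest(score_func=score_func, k=k)
selector.fit(X, y)

selected_features = X.columns[selector.get_support()].tolist()
X_selected = X[selected_features]
```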

---

## 🔹 13. Model Training
- **Classification** → trains a `RandomForestClassifier`.
- **Regression** → trains a `RandomForestRegressor`.
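
Choosing the estimator by task type can be sketched as follows. The helper `build_model` and the hyperparameters (`n_estimators=100`, `random_state=42`) are assumptions; the script's actual settings are not given here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split


def build_model(task, random_state=42):
    """Return an unfitted Random Forest matching the detected task."""
    if task == "classification":
        return RandomForestClassifier(n_estimators=100, random_state=random_state)
    return RandomForestRegressor(n_estimators=100, random_state=random_state)


# Synthetic regression data stands in for the selected flood features.
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = build_model("regression")
model.fit(X_train, y_train)

# score() reports R^2 for regressors and accuracy for classifiers.
test_score = model.score(X_test, y_test)
predictions = model.predict(X_test)
```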

---

## 🔹 14. Feature Importances
- Extracts feature importance scores from the trained model.
- Saves a bar plot of the **top 10 most important features**.
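
One possible version of this step, fitted on synthetic data. The figure file name and the horizontal bar layout are assumptions.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: write files, open no windows
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_arr, y = make_classification(
    n_samples=200, n_features=15, n_informative=5, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(X_arr.shape[1])])
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Pair each importance score with its column name and keep the 10 largest.
importances = pd.Series(model.feature_importances_, index=X.columns)
top10 = importances.sort_values(ascending=False).head(10)

figures_dir = Path("outputs/figures")
figures_dir.mkdir(parents=True, exist_ok=True)

fig, ax = plt.subplots(figsize=(7, 4))
top10.sort_values().plot(kind="barh", ax=ax)  # ascending order puts the largest bar on top
ax.set_xlabel("Importance")
ax.set_title("Top 10 feature importances")
fig.tight_layout()
fig.savefig(figures_dir / "feature_importances.png")
plt.close(fig)
```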

---

## 🔹 15. Save Results
- Saves the **list of selected features** to a text file.
- Creates a **processed dataset** containing only the selected features + target.
- Saves the processed dataset as a CSV.
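
The save step could look roughly like this. The file names are hypothetical, and the small frame and feature list stand in for the real pipeline outputs.

```python
from pathlib import Path

import pandas as pd

# Stand-ins for the pipeline's outputs.
X = pd.DataFrame(
    {
        "rainfall_mm": [12.5, 80.2, 45.0],
        "elevation_m": [30.1, 8.7, 12.4],
        "soil_moisture": [0.2, 0.6, 0.4],
    }
)
y = pd.Series([0, 1, 0], name="flood")
selected_features = ["rainfall_mm", "elevation_m"]

output_dir = Path("outputs")
output_dir.mkdir(parents=True, exist_ok=True)
features_path = output_dir / "selected_features.txt"   # hypothetical name
processed_path = output_dir / "processed_flood_data.csv"  # hypothetical name

# One feature name per line.
features_path.write_text("\n".join(selected_features) + "\n")

# Selected features plus the target, written without the index column.
processed = X[selected_features].copy()
processed[y.name] = y.to_numpy()
processed.to_csv(processed_path, index=False)
```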

---

## ✅ Summary
This script:
1. Cleans and preprocesses flood data.
2. Detects the correct type of task (classification or regression).
3. Selects the top 10 features.
4. Trains a Random Forest model.
5. Saves results and figures for later use.
