# Two-Stage Precipitation Modeling with Negative Sampling

This notebook demonstrates how to handle a highly imbalanced precipitation dataset
(≈99% no-rain events) using structured undersampling and a two-stage modeling approach:
1. Classifier for rain vs. no-rain
2. Regressor for precipitation amount given rain

Here is a summary of what the code does:

- Undersample the majority (no-rain) class to balance the training data.
- Train a classifier to detect rain events.
- Train a regressor to predict the rain amount on rain events.
- Combine both into an end-to-end precipitation model.
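
To see why plain accuracy is misleading at this level of imbalance, here is a minimal sketch on synthetic labels. The 1% rain rate mirrors the figure quoted above; the sample size and variable names are illustrative assumptions, not part of the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels (assumption): 10,000 samples, exactly 1% rain events.
n_samples = 10_000
y_true = np.zeros(n_samples, dtype=int)
y_true[:100] = 1

# A trivial "model" that always predicts no rain.
y_always_dry = np.zeros(n_samples, dtype=int)

baseline_accuracy = accuracy_score(y_true, y_always_dry)
baseline_recall = recall_score(y_true, y_always_dry, zero_division=0)

print(f"Accuracy: {baseline_accuracy:.2f}, rain recall: {baseline_recall:.2f}")
```

A model that never predicts rain scores 99% accuracy yet detects no rain events, which is why the notebook rebalances the training data and reports per-class metrics.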

In [None]:
#
## 1. Setup
#
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import classification_report, mean_squared_error
from imblearn.under_sampling import ClusterCentroids

#
## 2. Load and Prepare Data
#

# Replace the CSV path with your actual dataset path.
df = pd.read_csv("precipitation_dataset.csv")

# Separate predictors and target
X = df.drop(columns=["precipitation_mm"])
y = df["precipitation_mm"]

# Create binary target (1 = rain, i.e. more than 0.1 mm; 0 = no rain)
y_binary = (y > 0.1).astype(int)

print("Rain events:", y_binary.sum(), "of", len(y_binary))

#
## 3. Visualize Class Distribution
#

plt.figure(figsize=(5,4))
sns.countplot(x=y_binary, palette="coolwarm")
plt.title("Class Distribution (Rain vs. No Rain)")
plt.xlabel("Rain (1) / No Rain (0)")
plt.ylabel("Count")
plt.show()

#
## 4. Train/Test Split
#

X_train, X_test, y_train_bin, y_test_bin = train_test_split(
    X, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)

#
## 5. Structured Undersampling of the No-Rain Class
#

# ClusterCentroids replaces the no-rain samples with K-means cluster centroids,
# so the reduced majority class consists of synthetic but representative points.
cc = ClusterCentroids(random_state=42)
X_resampled, y_resampled = cc.fit_resample(X_train, y_train_bin)

print("Before:", np.bincount(y_train_bin))
print("After:", np.bincount(y_resampled))

# Visualize new balance
plt.figure(figsize=(5,4))
sns.countplot(x=y_resampled, palette="coolwarm")
plt.title("After ClusterCentroids Undersampling")
plt.xlabel("Rain (1) / No Rain (0)")
plt.ylabel("Count")
plt.show()

#
## 6. Train Rain/No-Rain Classifier
#
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_resampled, y_resampled)

# Evaluate
print(classification_report(y_test_bin, clf.predict(X_test)))

# 
## 7. Feature Importance Visualization (Classifier)
#

importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
plt.figure(figsize=(8,5))
sns.barplot(x=importances[:15], y=importances.index[:15], palette="viridis")
plt.title("Top 15 Feature Importances (Rain Classifier)")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()

#
## 8. Train Regressor for Rain Amounts
#

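# Fit on rain events from the training split only, so the test rows stay unseen.
y_train = y.loc[X_train.index]
rain_mask = y_train > 0.1
X_rain = X_train[rain_mask]
y_rain = y_train[rain_mask]

reg = RandomForestRegressor(n_estimators=300, random_state=42)
reg.fit(X_rain, y_rain)

# Evaluate on held-out rain events from the test split
y_test = y.loc[X_test.index]
test_rain_mask = y_test > 0.1
y_pred_rain = reg.predict(X_test[test_rain_mask])
rmse = np.sqrt(mean_squared_error(y_test[test_rain_mask], y_pred_rain))
print(f"RMSE on held-out rain events: {rmse:.3f}")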

# 
## 9. Feature Importance Visualization (Regressor)
#

reg_importances = pd.Series(reg.feature_importances_, index=X.columns).sort_values(ascending=False)
plt.figure(figsize=(8,5))
sns.barplot(x=reg_importances[:15], y=reg_importances.index[:15], palette="crest")
plt.title("Top 15 Feature Importances (Rain Regressor)")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()

# 
## 10. Combined Prediction Function
#
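def predict_precipitation(x_new, threshold=0.5):
    """Return predicted precipitation (mm) for one sample given as a pandas Series."""
    # Rebuild a one-row DataFrame so both models see the feature names they were fitted with.
    row = pd.DataFrame([x_new])
    rain_prob = clf.predict_proba(row)[0, 1]
    if rain_prob < threshold:
        return 0.0
    return reg.predict(row)[0]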

# Example usage:
example = X_test.iloc[0]
y_hat = predict_precipitation(example)
print("Predicted precipitation (mm):", y_hat)

## Possible Refinements

- Use **NearMiss** or **TomekLinks** instead of ClusterCentroids for undersampling.
- Apply **SMOTE** or **ADASYN** to oversample rare rain events.
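- Calibrate classifier probabilities with `CalibratedClassifierCV`, since training on undersampled data inflates the predicted rain probabilities.
- If the samples are time-ordered, use temporal cross-validation (for example, `TimeSeriesSplit`) so that each fold is evaluated only on data that follows its training period.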
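
The last two refinements can be sketched together as follows. This is a minimal illustration on synthetic, time-ordered data, not the notebook's dataset: `X_demo`, `y_demo`, the feature count, and the choice of average precision as the score are all assumptions made for the example. The calibrator's inner `cv=3` split is a simplification and is not itself temporal.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in data (assumption): rows are treated as already sorted by time.
rng = np.random.default_rng(42)
n_samples = 3000
X_demo = rng.normal(size=(n_samples, 4))

# Rain is the minority class and becomes more likely when the first feature is large.
logit = 2.0 * X_demo[:, 0] - 3.0
y_demo = (rng.random(n_samples) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Each split trains on an initial segment and tests on the segment that follows it.
tscv = TimeSeriesSplit(n_splits=4)
fold_scores = []
for train_idx, test_idx in tscv.split(X_demo):
    base = RandomForestClassifier(n_estimators=50, random_state=42)
    # Sigmoid (Platt) calibration fitted with internal 3-fold CV on the training segment.
    calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3)
    calibrated.fit(X_demo[train_idx], y_demo[train_idx])

    rain_prob = calibrated.predict_proba(X_demo[test_idx])[:, 1]
    fold_scores.append(average_precision_score(y_demo[test_idx], rain_prob))

print("Average precision per fold:", np.round(fold_scores, 3))
```

In the notebook's pipeline, any undersampling step would presumably be applied inside each training segment only, so that the evaluation segments keep the original class balance.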