<a href="https://colab.research.google.com/github/Kevan123/AI4SIDS/blob/main/ai4sids_predictive_model_converted.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI4SIDS Predictive Model: Flood Risk Analysis and Prediction

This notebook focuses on developing a predictive model to assess and forecast flood risk. It integrates data from various sources, including IoT river gauges, weather sensors, and social media, to build a comprehensive dataset for analysis and modeling.

The primary objectives of this notebook are:

1.  **Data Integration and Preparation**: Load, merge, and clean data from disparate sources, ensuring it is in a usable format for machine learning.
2.  **Feature Engineering**: Create relevant time-based features from timestamp data to capture temporal patterns.
3.  **Exploratory Data Analysis (EDA)**: Visualize the spatial distribution of flood events and river levels to gain insights into the data.
4.  **Model Development**: Train machine learning models (specifically, a Random Forest Classifier and Regressors) to:
    *   Classify whether a flood event is likely to occur.
    *   Predict the location (latitude and longitude) and timing (hour) of flood events when they are predicted to occur.
5.  **Model Evaluation**: Assess the performance of the trained models using appropriate metrics (e.g., ROC AUC, Classification Report, Mean Absolute Error).
6.  **Future Simulation and Prediction**: Generate simulated future data and use the trained models to predict flood risk, location, and hour for a future period (e.g., the next 7 days).
7.  **Visualization of Predictions**: Visualize the simulated future flood predictions on a map to provide a clear spatial and temporal understanding of potential flood events.

This work aims to contribute to the development of an AI-driven system for Small Island Developing States (SIDS) to enhance their capacity for predicting and managing flood risks.

In [91]:
# -*- coding: utf-8 -*-
#"""AI4SIDS Predictive Model


In [92]:
#Original file is located at
 #   https://colab.research.google.com/drive/1xaybAlLzzFVl7L5LSGw9k-DCBa36pcVB


In [93]:
# Re-import libraries and reload datasets
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer


In [94]:
# File paths
gauge_data_path = "/content/Updated_Caroni_River_Monthly_Gauge_Data.csv"
weather_data_path = "/content/Weather_Data_by_Sensor_ID_and_City.csv"
social_media_data_path = "/content/social_media_simulated_dataset_transformed_agg_only.xlsx"


In [95]:
# Load data
gauge_data = pd.read_csv(gauge_data_path)
weather_data = pd.read_csv(weather_data_path)
social_media_data = pd.read_excel(social_media_data_path)


In [96]:
# Rename key columns for merging
gauge_data.rename(columns={"Time_Sensor_ID": "time_sensor_id"}, inplace=True)
weather_data.rename(columns={"Time_Sensor_ID": "time_sensor_id"}, inplace=True)
social_media_data.rename(columns={"datestamp_gauge": "time_sensor_id"}, inplace=True)


In [97]:
# Add source prefixes
gauge_data = gauge_data.add_prefix("iot_")
weather_data = weather_data.add_prefix("wthr_")
social_media_data = social_media_data.add_prefix("sm_")


In [98]:
# Re-align key column names post-prefix
gauge_data.rename(columns={"iot_time_sensor_id": "time_sensor_id"}, inplace=True)
weather_data.rename(columns={"wthr_time_sensor_id": "time_sensor_id"}, inplace=True)
social_media_data.rename(columns={"sm_time_sensor_id": "time_sensor_id"}, inplace=True)


In [99]:
# Merge datasets
merged_data = gauge_data.merge(weather_data, on="time_sensor_id", how="outer")
merged_data = merged_data.merge(social_media_data, on="time_sensor_id", how="outer")


In [100]:
# Begin data cleaning and transformation
data = merged_data.copy()


In [101]:
# Impute numerical features
numerical_cols = data.select_dtypes(include=[np.number]).columns.tolist()
num_imputer = SimpleImputer(strategy="mean")
data[numerical_cols] = num_imputer.fit_transform(data[numerical_cols])


In [102]:
# Impute categorical features
categorical_cols = data.select_dtypes(include=["object"]).columns.tolist()
cat_imputer = SimpleImputer(strategy="most_frequent")
data[categorical_cols] = cat_imputer.fit_transform(data[categorical_cols])


In [103]:
# Convert target to binary


In [104]:
# Detect and convert the flood event column
flood_event_col = next((col for col in data.columns if "Flood Event" in col), None)
if flood_event_col:
    data[flood_event_col] = data[flood_event_col].map({"Yes": 1, "No": 0})
    data.rename(columns={flood_event_col: "Flood_Event"}, inplace=True)
else:
    raise KeyError("Column containing 'Flood Event' not found.")


In [105]:

# Parse timestamp and derive time-based features
data["iot_Timestamp"] = pd.to_datetime(data["iot_Timestamp"], errors="coerce")
data["date"] = data["iot_Timestamp"].dt.date
data["hour"] = data["iot_Timestamp"].dt.hour
data["dayofweek"] = data["iot_Timestamp"].dt.dayofweek
data["month"] = data["iot_Timestamp"].dt.month


In [106]:
# Ensure target and spatial info is present
data = data.dropna(subset=["iot_Latitude", "iot_Longitude", "Flood_Event"])


In [107]:
import plotly.express as px

# Assuming 'data' DataFrame is already created and processed as in previous cells
# Ensure 'iot_Latitude', 'iot_Longitude', and 'Flood_Event' columns exist

if not data.empty:
    # Create a scatter plot on a map
    fig = px.scatter_mapbox(data,
                            lat="iot_Latitude",
                            lon="iot_Longitude",
                            color="Flood_Event", # Color points by flood event (0 or 1)
                            size="iot_River Level (m)", # Size of points by river level
                            color_continuous_scale="Viridis",
                            zoom=7,
                            height=600,
                            hover_data=["iot_Timestamp", "iot_Sensor ID", "iot_River Level (m)", "Flood_Event"])

    fig.update_layout(mapbox_style="open-street-map")
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    fig.update_layout(title_text="Flood Events and River Levels at Sensor Locations")
    fig.show()
else:
    print("The 'data' DataFrame is empty.")

In [108]:
# Inject label noise (flip 3% of the flood labels)
def inject_label_noise(df, target_col="Flood_Event", noise_level=0.03, random_state=42):
    np.random.seed(random_state)
    noisy_df = df.copy()
    mask = np.random.rand(len(noisy_df)) < noise_level
    noisy_df.loc[mask, target_col] = 1 - noisy_df.loc[mask, target_col]
    return noisy_df


In [109]:
# Apply label noise injection
data = inject_label_noise(data, target_col="Flood_Event", noise_level=0.03)


In [110]:

# Now use df_noisy to train/test your classifier


In [111]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import classification_report, roc_auc_score, mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


In [112]:
# Define updated feature columns
feature_cols = [
    "iot_River Level (m)", "iot_Change in Level (m)",
    "wthr_Actual Temperature (°C)", "wthr_Actual Humidity (%)",
    "wthr_Actual Rainfall (mm)", "wthr_Actual Windspeed (km/h)",
    "sm_Average of sentiment_score", "sm_Sum of distance_to_gauge_km",
    "sm_Count of post_id", "hour", "dayofweek", "month"
]


In [113]:
# Drop rows with missing values in any selected feature columns
data_model = data.dropna(subset=feature_cols)


In [114]:
# Define inputs and targets
X = data_model[feature_cols]
y_class = data_model["Flood_Event"]
y_lat = data_model["iot_Latitude"]
y_lon = data_model["iot_Longitude"]
y_hour = data_model["hour"]


In [115]:
# Split data for classification model
X_train, X_test, y_class_train, y_class_test = train_test_split(X, y_class, test_size=0.3, random_state=42)


In [116]:
# Classifier pipeline
clf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42))
])
clf_pipeline.fit(X_train, y_class_train)


In [117]:
# Classification predictions
y_class_pred = clf_pipeline.predict(X_test)
y_class_proba = clf_pipeline.predict_proba(X_test)[:, 1]
class_report = classification_report(y_class_test, y_class_pred, output_dict=True)
roc_auc = roc_auc_score(y_class_test, y_class_proba)


In [118]:
# Subset data for flood-only regression tasks
flood_data = data_model[data_model["Flood_Event"] == 1]
X_flood = flood_data[feature_cols]
y_flood_lat = flood_data["iot_Latitude"]
y_flood_lon = flood_data["iot_Longitude"]
y_flood_hour = flood_data["hour"]


In [119]:
# Train regressors
lat_model = RandomForestRegressor(random_state=42).fit(X_flood, y_flood_lat)
lon_model = RandomForestRegressor(random_state=42).fit(X_flood, y_flood_lon)
hour_model = RandomForestRegressor(random_state=42).fit(X_flood, y_flood_hour)


In [120]:
# Predict and calculate errors
lat_preds = lat_model.predict(X_flood)
lon_preds = lon_model.predict(X_flood)
hour_preds = hour_model.predict(X_flood)


In [121]:
lat_error = mean_absolute_error(y_flood_lat, lat_preds)
lon_error = mean_absolute_error(y_flood_lon, lon_preds)
hour_error = mean_absolute_error(y_flood_hour, hour_preds)


In [122]:
# Return evaluation results
evaluation_results = {
    "Flood Classifier ROC AUC": roc_auc,
    "Flood Classifier Report": class_report,
    "Latitude MAE": lat_error,
    "Longitude MAE": lon_error,
    "Hour MAE": hour_error
}


In [123]:
evaluation_results


{'Flood Classifier ROC AUC': np.float64(0.8186865590299067),
 'Flood Classifier Report': {'0': {'precision': 0.972594752186589,
   'recall': 0.9991015274034142,
   'f1-score': 0.9856699660215689,
   'support': 3339.0},
  '1': {'precision': 0.9788732394366197,
   'recall': 0.5965665236051502,
   'f1-score': 0.7413333333333333,
   'support': 233.0},
  'accuracy': 0.9728443449048152,
  'macro avg': {'precision': 0.9757339958116044,
   'recall': 0.7978340255042822,
   'f1-score': 0.863501649677451,
   'support': 3572.0},
  'weighted avg': {'precision': 0.973004295167904,
   'recall': 0.9728443449048152,
   'f1-score': 0.969731994180483,
   'support': 3572.0}},
 'Latitude MAE': 0.0028925684258075536,
 'Longitude MAE': 0.010751233325810327,
 'Hour MAE': 0.0}

In [124]:
# Extract report and convert to DataFrame
clf_report = evaluation_results["Flood Classifier Report"]
clf_df = pd.DataFrame(clf_report).T  # Transpose to get classes as rows


In [125]:
# Separate non-numeric values if needed
roc_auc = evaluation_results["Flood Classifier ROC AUC"]
lat_mae = evaluation_results["Latitude MAE"]
lon_mae = evaluation_results["Longitude MAE"]
hour_mae = evaluation_results["Hour MAE"]


In [126]:
# Display classifier report
print("=== Classification Report ===")
display(clf_df.round(4))  # If in notebook, else use print(clf_df)


=== Classification Report ===


Unnamed: 0,precision,recall,f1-score,support
0,0.9726,0.9991,0.9857,3339.0
1,0.9789,0.5966,0.7413,233.0
accuracy,0.9728,0.9728,0.9728,0.9728
macro avg,0.9757,0.7978,0.8635,3572.0
weighted avg,0.973,0.9728,0.9697,3572.0


In [127]:
# Display additional metrics
summary_df = pd.DataFrame({
    "Metric": ["ROC AUC", "Latitude MAE", "Longitude MAE", "Hour MAE"],
    "Value": [roc_auc, lat_mae, lon_mae, hour_mae]
})
print("\n=== Other Metrics ===")
display(summary_df)



=== Other Metrics ===


Unnamed: 0,Metric,Value
0,ROC AUC,0.818687
1,Latitude MAE,0.002893
2,Longitude MAE,0.010751
3,Hour MAE,0.0


In [128]:
# Use the previously merged and processed data
merged = merged_data.copy()


In [129]:
# Step 2: Correct sentiment and rainfall field names
merged = merged.rename(columns={
    "sm_Average of sentiment_score": "Average of sentiment_score",
    "wthr_Actual Rainfall (mm)": "Actual Rainfall (mm)",
    "iot_River Level (m)": "River Level (m)",
    "iot_Sensor ID": "Sensor ID",
    "iot_Latitude": "Latitude",
    "iot_Longitude": "Longitude",
    "sm_hourly_timestamp": "Timestamp"
})


In [130]:
# Step 3: Fill missing values with default assumptions
merged["Average of sentiment_score"] = merged["Average of sentiment_score"].fillna(0)
merged["Actual Rainfall (mm)"] = merged["Actual Rainfall (mm)"].fillna(0)
merged["River Level (m)"] = merged["River Level (m)"].fillna(0)


In [131]:
# Step 4: Simulate Flood Risk
def simulate_flood(row):
    flood = 0
    if (row["River Level (m)"] > 3.0) or (row["Actual Rainfall (mm)"] > 10):
        flood = 1
    elif (2.8 <= row["River Level (m)"] <= 3.0) and (row["Average of sentiment_score"] < -0.2):
        flood = np.random.choice([0, 1], p=[0.7, 0.3])
    return flood


In [132]:
merged["Flood Risk (Simulated)"] = merged.apply(simulate_flood, axis=1)


In [133]:
# Step 5: Final dataset formatting
final_merged = merged.rename(columns={
    "Average of sentiment_score": "Avg Sentiment",
    "Actual Rainfall (mm)": "Rainfall (mm)"
})[
    ["Timestamp", "Sensor ID", "Latitude", "Longitude",
     "Rainfall (mm)", "River Level (m)", "Avg Sentiment", "Flood Risk (Simulated)"]
]


In [134]:
# Save the new simulated dataset
simulated_flood_dataset_path = "/content/simulated_flood_risk_dataset.csv"
final_merged.to_csv(simulated_flood_dataset_path, index=False)


In [135]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Load the simulated historical data
simulated_data_path = "/content/simulated_flood_risk_dataset.csv"
simulated_df = pd.read_csv(simulated_data_path)

# Convert Timestamp to datetime
simulated_df["Timestamp"] = pd.to_datetime(simulated_df["Timestamp"])

# Create subplots
fig = make_subplots(rows=3, cols=1, shared_xaxes=True,
                    subplot_titles=("Rainfall (mm) Over Time",
                                      "River Level (m) Over Time",
                                      "Simulated Flood Risk Over Time"))

# Add Rainfall trace
fig.add_trace(go.Scatter(x=simulated_df["Timestamp"], y=simulated_df["Rainfall (mm)"],
                         mode='lines', name='Rainfall (mm)'), row=1, col=1)

# Add River Level trace
fig.add_trace(go.Scatter(x=simulated_df["Timestamp"], y=simulated_df["River Level (m)"],
                         mode='lines', name='River Level (m)'), row=2, col=1)

# Add Flood Risk trace (using scatter for clarity of points)
fig.add_trace(go.Scatter(x=simulated_df["Timestamp"], y=simulated_df["Flood Risk (Simulated)"],
                         mode='markers', name='Flood Risk (Simulated)'), row=3, col=1)

# Update layout
fig.update_layout(height=800, title_text="Simulated Flood Risk Factors Over Time")
fig.update_xaxes(title_text="Timestamp")
fig.update_yaxes(title_text="Rainfall (mm)", row=1, col=1)
fig.update_yaxes(title_text="River Level (m)", row=2, col=1)
fig.update_yaxes(title_text="Flood Risk", row=3, col=1)
fig.show()

In [136]:
# Output path
simulated_flood_dataset_path


'/content/simulated_flood_risk_dataset.csv'

In [137]:
# Generate the future 7-day flood simulation dataset
import pandas as pd
import numpy as np
from datetime import timedelta


In [138]:
# Setup
np.random.seed(42)  # For reproducibility


In [139]:
# Define sensors with static lat/lon (Caroni River gauges)
sensors = {
    "CR-001": (10.5500, -61.3333),
    "CR-002": (10.6000, -61.3500),
    "CR-003": (10.6500, -61.4000),
    "CR-004": (10.6200, -61.3800),
    "CR-005": (10.5800, -61.3400),
    "CR-006": (10.5300, -61.3200),
    "CR-007": (10.6100, -61.3700),
    "CR-008": (10.5700, -61.3300),
}


In [140]:
# Generate timestamps: 1-hour intervals, 7 days starting April 1st, 2025
timestamps = pd.date_range(start="2025-04-01", periods=24*7, freq="H")



'H' is deprecated and will be removed in a future version, please use 'h' instead.



In [141]:
# Create future simulated data
simulated_records = []
for sensor_id, (lat, lon) in sensors.items():
    for ts in timestamps:
        # Rainfall (higher at night 7PM–7AM)
        if ts.hour >= 19 or ts.hour <= 7:
            rainfall = np.random.normal(loc=8, scale=5)  # Higher rain at night
        else:
            rainfall = np.random.normal(loc=3, scale=2)  # Lower rain during day
        rainfall = max(0, rainfall)  # No negative rain


In [142]:
        # River Level (correlate slightly with rainfall)
        river_level = np.random.normal(loc=2.5 + 0.1*(rainfall/10), scale=0.3)
        river_level = max(0, river_level)


In [143]:
        # Avg Sentiment (worse with more rainfall)
        if rainfall > 10:
            avg_sentiment = np.random.normal(loc=-0.4, scale=0.2)
        else:
            avg_sentiment = np.random.normal(loc=0.0, scale=0.2)
        avg_sentiment = np.clip(avg_sentiment, -1, 1)


In [144]:
        # Simulate Flood Risk
        flood_risk = 0
        if river_level > 3.0 or rainfall > 10:
            flood_risk = 1
        elif (2.8 <= river_level <= 3.0) and avg_sentiment < -0.2:
            flood_risk = np.random.choice([0, 1], p=[0.7, 0.3])


In [145]:
        simulated_records.append({
            "Timestamp": ts,
            "Sensor ID": sensor_id,
            "Latitude": lat,
            "Longitude": lon,
            "Rainfall (mm)": round(rainfall, 2),
            "River Level (m)": round(river_level, 2),
            "Avg Sentiment": round(avg_sentiment, 2),
            "Flood Risk (Simulated)": flood_risk
        })


In [146]:
# Convert to DataFrame
simulated_future_df = pd.DataFrame(simulated_records)


In [147]:
# Save the future dataset
future_simulated_data_path = "/content/simulated_7days_flood_test_data.csv"
simulated_future_df.to_csv(future_simulated_data_path, index=False)


In [148]:
future_simulated_data_path


'/content/simulated_7days_flood_test_data.csv'

In [149]:
# Step 1: Load Training and Testing Datasets
train_path = "/content/simulated_flood_risk_dataset.csv"
test_path = "/content/simulated_7days_flood_test_data.csv"


In [150]:
# Load datasets
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)


In [151]:
# Step 2: Feature Engineering
def engineer_features(df):
    df["Timestamp"] = pd.to_datetime(df["Timestamp"])
    df["hour"] = df["Timestamp"].dt.hour
    df["dayofweek"] = df["Timestamp"].dt.dayofweek
    return df


In [152]:
train_df = engineer_features(train_df)
test_df = engineer_features(test_df)


In [153]:
# Step 3: Define Features and Target
feature_cols = [
    "hour", "dayofweek", "Latitude", "Longitude",
    "Rainfall (mm)", "River Level (m)", "Avg Sentiment"
]
target_col = "Flood Risk (Simulated)"


In [154]:
X_train = train_df[feature_cols]
y_train = train_df[target_col]
X_test = test_df[feature_cols]
y_test = test_df[target_col]


In [155]:
# Step 4: Train a RandomForest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score


In [156]:
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)


In [157]:
# Step 5: Predict and Evaluate
y_pred = clf.predict(X_test)
y_pred_prob = clf.predict_proba(X_test)[:, 1]


In [158]:
# Evaluation Metrics
clf_report = classification_report(y_test, y_pred, output_dict=True)
roc_auc = roc_auc_score(y_test, y_pred_prob)



Only one class is present in y_true. ROC AUC score is not defined in that case.



In [159]:
# Save report
evaluation_results = {
    "Flood Classifier ROC AUC": roc_auc,
    "Flood Classifier Report": clf_report
}


In [160]:
evaluation_results


{'Flood Classifier ROC AUC': nan,
 'Flood Classifier Report': {'1': {'precision': 1.0,
   'recall': 1.0,
   'f1-score': 1.0,
   'support': 1.0},
  'accuracy': 1.0,
  'macro avg': {'precision': 1.0,
   'recall': 1.0,
   'f1-score': 1.0,
   'support': 1.0},
  'weighted avg': {'precision': 1.0,
   'recall': 1.0,
   'f1-score': 1.0,
   'support': 1.0}}}

In [161]:
merged_data.head(5)


Unnamed: 0,iot_Timestamp,iot_Sensor ID,time_sensor_id,iot_Latitude,iot_Longitude,iot_Location,iot_River Level (m),iot_Change in Level (m),iot_Flood Event,wthr_Timestamp,...,wthr_Actual Storm,wthr_Flood Event,sm_hourly_timestamp,sm_Average of sentiment_score,sm_Sum of distance_to_gauge_km,sm_Count of post_id,sm_Cunupia,sm_Piarco,sm_St. Augustine,sm_St. Helena
0,,,01-03-2025 00-00-00-CITY-Arima,,,,,,,3/1/2025 0:00,...,Cloudy,No,NaT,,,,,,,
1,,,01-03-2025 00-00-00-CITY-Chaguanas,,,,,,,3/1/2025 0:00,...,Clear,No,NaT,,,,,,,
2,,,01-03-2025 00-00-00-CITY-Couva,,,,,,,3/1/2025 0:00,...,Rainy,No,NaT,,,,,,,
3,,,01-03-2025 00-00-00-CITY-PointFortin,,,,,,,3/1/2025 0:00,...,Clear,No,NaT,,,,,,,
4,,,01-03-2025 00-00-00-CITY-PortofSpain,,,,,,,3/1/2025 0:00,...,Cloudy,No,NaT,,,,,,,


In [162]:
# Re-import all required libraries
import pandas as pd
import numpy as np
from datetime import timedelta
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


In [163]:
# ---- Load and Prefix Data (Assumes DataFrames Already Loaded) ----


In [164]:
# Standardize merge keys
gauge_data.rename(columns={"Time_Sensor_ID": "time_sensor_id"}, inplace=True)
weather_data.rename(columns={"Time_Sensor_ID": "time_sensor_id"}, inplace=True)
social_media_data.rename(columns={"datestamp_gauge": "time_sensor_id"}, inplace=True)


In [165]:
# Prefix all columns for provenance
gauge_data = gauge_data.add_prefix("iot_")
weather_data = weather_data.add_prefix("wthr_")
social_media_data = social_media_data.add_prefix("sm_")


In [166]:
# Reassign time_sensor_id for merging
gauge_data.rename(columns={"iot_time_sensor_id": "time_sensor_id"}, inplace=True)
weather_data.rename(columns={"wthr_time_sensor_id": "time_sensor_id"}, inplace=True)
social_media_data.rename(columns={"sm_time_sensor_id": "time_sensor_id"}, inplace=True)


In [167]:
# Merge all sources into a unified dataset
merged_data = gauge_data.merge(weather_data, on="time_sensor_id", how="outer")
merged_data = merged_data.merge(social_media_data, on="time_sensor_id", how="outer")


In [168]:
# ---- Clean, Impute, and Feature Engineer ----
data = merged_data.copy()


In [169]:
# Impute numerics and categoricals
num_cols = data.select_dtypes(include=[np.number]).columns
cat_cols = data.select_dtypes(include=["object"]).columns
data[num_cols] = SimpleImputer(strategy="mean").fit_transform(data[num_cols])
data[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(data[cat_cols])


In [170]:
# Encode flood event
# Changed line to correctly reference 'iot_Flood Event' column


In [171]:
# Detect and convert the flood event column
flood_event_col = next((col for col in data.columns if "Flood Event" in col), None)
if flood_event_col:
    data[flood_event_col] = data[flood_event_col].map({"Yes": 1, "No": 0})
    data.rename(columns={flood_event_col: "Flood_Event"}, inplace=True)
else:
    raise KeyError("Column containing 'Flood Event' not found.")


In [172]:

# Parse timestamps and extract date features
data["iot_Timestamp"] = pd.to_datetime(data["iot_iot_Timestamp"], errors="coerce")
data["date"] = data["iot_Timestamp"].dt.date
data["hour"] = data["iot_Timestamp"].dt.hour
data["dayofweek"] = data["iot_Timestamp"].dt.dayofweek
data["month"] = data["iot_Timestamp"].dt.month


In [173]:
# Filter valid rows for modeling
data = data.dropna(subset=["iot_iot_Latitude", "iot_iot_Longitude", "Flood_Event"])


In [174]:
# ---- Define Model Inputs ----
feature_cols = [
    "iot_River Level (m)", "iot_Change in Level (m)",
    "wthr_Actual Temperature (°C)", "wthr_Actual Humidity (%)",
    "wthr_Actual Rainfall (mm)", "wthr_Actual Windspeed (km/h)",
    "sm_Average of sentiment_score", "sm_Sum of distance_to_gauge_km",
    "sm_Count of post_id"
]
extra_cols = ["hour", "dayofweek", "month"]
model_features = feature_cols + extra_cols

# Check if the columns in model_features exist in the DataFrame
for col in model_features:
    if col not in data.columns:
        print(f"Column '{col}' not found in DataFrame")

Column 'iot_River Level (m)' not found in DataFrame
Column 'iot_Change in Level (m)' not found in DataFrame
Column 'wthr_Actual Temperature (°C)' not found in DataFrame
Column 'wthr_Actual Humidity (%)' not found in DataFrame
Column 'wthr_Actual Rainfall (mm)' not found in DataFrame
Column 'wthr_Actual Windspeed (km/h)' not found in DataFrame
Column 'sm_Average of sentiment_score' not found in DataFrame
Column 'sm_Sum of distance_to_gauge_km' not found in DataFrame
Column 'sm_Count of post_id' not found in DataFrame


In [175]:
# Ensure only existing columns are used in subset
existing_model_features = [col for col in model_features if col in data.columns]

# Optional: Warn if any expected columns are missing
missing = list(set(model_features) - set(existing_model_features))
if missing:
    print("⚠️ Missing columns in data:", missing)

# Drop only on existing feature columns
data_model = data.dropna(subset=existing_model_features)

⚠️ Missing columns in data: ['wthr_Actual Humidity (%)', 'wthr_Actual Temperature (°C)', 'wthr_Actual Rainfall (mm)', 'sm_Average of sentiment_score', 'wthr_Actual Windspeed (km/h)', 'iot_River Level (m)', 'sm_Count of post_id', 'iot_Change in Level (m)', 'sm_Sum of distance_to_gauge_km']


In [176]:
data_model.head(5)

Unnamed: 0,iot_iot_Timestamp,iot_iot_Sensor ID,time_sensor_id,iot_iot_Latitude,iot_iot_Longitude,iot_iot_Location,iot_iot_River Level (m),iot_iot_Change in Level (m),Flood_Event,wthr_wthr_Timestamp,...,sm_sm_Count of post_id,sm_sm_Cunupia,sm_sm_Piarco,sm_sm_St. Augustine,sm_sm_St. Helena,iot_Timestamp,date,hour,dayofweek,month
0,3/1/2025 0:00,CR-001,01-03-2025 00-00-00-CITY-Arima,10.601419,-61.372977,Caroni River Mouth,2.751186,-1.3e-05,0,3/1/2025 0:00,...,2.702703,0.68,0.687568,0.678919,0.656216,2025-03-01,2025-03-01,0,5,3
1,3/1/2025 0:00,CR-001,01-03-2025 00-00-00-CITY-Chaguanas,10.601419,-61.372977,Caroni River Mouth,2.751186,-1.3e-05,0,3/1/2025 0:00,...,2.702703,0.68,0.687568,0.678919,0.656216,2025-03-01,2025-03-01,0,5,3
2,3/1/2025 0:00,CR-001,01-03-2025 00-00-00-CITY-Couva,10.601419,-61.372977,Caroni River Mouth,2.751186,-1.3e-05,0,3/1/2025 0:00,...,2.702703,0.68,0.687568,0.678919,0.656216,2025-03-01,2025-03-01,0,5,3
3,3/1/2025 0:00,CR-001,01-03-2025 00-00-00-CITY-PointFortin,10.601419,-61.372977,Caroni River Mouth,2.751186,-1.3e-05,0,3/1/2025 0:00,...,2.702703,0.68,0.687568,0.678919,0.656216,2025-03-01,2025-03-01,0,5,3
4,3/1/2025 0:00,CR-001,01-03-2025 00-00-00-CITY-PortofSpain,10.601419,-61.372977,Caroni River Mouth,2.751186,-1.3e-05,0,3/1/2025 0:00,...,2.702703,0.68,0.687568,0.678919,0.656216,2025-03-01,2025-03-01,0,5,3


In [177]:
# Inputs and outputs
X = data_model[existing_model_features]
y_class = data_model["Flood_Event"]
y_lat = data_model["iot_iot_Latitude"]
y_lon = data_model["iot_iot_Longitude"]
y_hour = data_model["hour"]


In [178]:
# ---- Train Models ----
X_train, X_test, y_class_train, y_class_test = train_test_split(X, y_class, test_size=0.3, random_state=42)


In [179]:
# Pipeline for classification
clf_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42))
])
clf_pipeline.fit(X_train, y_class_train)


In [180]:
# Train regressors on flood-only rows
flood_data = data_model[data_model["Flood_Event"] == 1]
X_flood = flood_data[existing_model_features]
lat_model = RandomForestRegressor(random_state=42).fit(X_flood, flood_data["iot_iot_Latitude"])
lon_model = RandomForestRegressor(random_state=42).fit(X_flood, flood_data["iot_iot_Longitude"])
hour_model = RandomForestRegressor(random_state=42).fit(X_flood, flood_data["hour"])


In [181]:
# ---- Simulate Future Data for April 1–7 ----
last_march = data[data["date"] >= pd.to_datetime("2025-03-25").date()]
feature_means = last_march[existing_model_features].mean()
feature_stds = last_march[existing_model_features].std()


In [182]:
future_dates = [pd.Timestamp("2025-04-01") + timedelta(hours=h) for h in range(0, 24 * 7)]
simulated_rows = []


In [183]:
for dt in future_dates:
    row = {
        "hour": dt.hour,
        "dayofweek": dt.dayofweek,
        "month": dt.month,
        "timestamp": dt
    }
    for col in existing_model_features:
        mean = feature_means[col]
        std = feature_stds[col]
        row[col] = np.random.normal(mean, std if not np.isnan(std) else 0.01)
    simulated_rows.append(row)


In [184]:
future_df = pd.DataFrame(simulated_rows)


In [185]:
# ---- Predict Flood Risk & Location ----
future_scaled = clf_pipeline.named_steps['scaler'].transform(future_df[existing_model_features])
flood_probabilities = clf_pipeline.named_steps['clf'].predict_proba(future_scaled)[:, 1]
future_df["Flood_Probability"] = flood_probabilities


In [186]:
# Subset and predict spatial + time attributes
flood_risk_df = future_df[future_df["Flood_Probability"] > 0.2].copy()
if not flood_risk_df.empty:
    flood_risk_df["Pred_Latitude"] = lat_model.predict(flood_risk_df[existing_model_features])
    flood_risk_df["Pred_Longitude"] = lon_model.predict(flood_risk_df[existing_model_features])
    flood_risk_df["Pred_Hour"] = hour_model.predict(flood_risk_df[existing_model_features])


In [187]:
flood_risk_df.head(5)

Unnamed: 0,hour,dayofweek,month,timestamp,Flood_Probability,Pred_Latitude,Pred_Longitude,Pred_Hour
11,24.144685,4.356356,3.0,2025-04-01 11:00:00,0.269144,10.590318,-61.340901,23.0
40,23.063693,3.789637,3.0,2025-04-02 16:00:00,0.269144,10.590318,-61.340901,23.0
87,22.750514,3.646399,3.0,2025-04-04 15:00:00,0.269144,10.590318,-61.340901,23.0
101,29.432115,4.159698,3.0,2025-04-05 05:00:00,0.269144,10.590318,-61.340901,23.0
138,22.856366,4.484807,3.0,2025-04-06 18:00:00,0.269144,10.590318,-61.340901,23.0


In [188]:
import plotly.express as px

if not flood_risk_df.empty:
    fig = px.scatter_mapbox(flood_risk_df,
                            lat="Pred_Latitude",
                            lon="Pred_Longitude",
                            color="Flood_Probability",
                            size="Flood_Probability", # Size of dots based on probability
                            color_continuous_scale="Viridis",
                            zoom=7,
                            height=600,
                            hover_data=["timestamp", "Pred_Hour"])

    fig.update_layout(mapbox_style="open-street-map")
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    fig.update_layout(title_text="Predicted Flood Locations and Probabilities (Next 7 Days)")
    fig.show()
else:
    print("No flood events predicted in the next 7 days.")