In [1]:
import matplotlib
matplotlib.use('Agg')  # Use a non-interactive backend
import matplotlib.pyplot as plt
print("Matplotlib imported successfully with Agg backend!")

Matplotlib imported successfully with Agg backend!


# Intuitive Supervised Learning for Flood Prediction

This notebook demonstrates how we can use machine learning to predict flood probability based on various environmental and human factors. We'll walk through each step, explaining how it relates to our flood prediction goal.

In [2]:
# First, we import the tools we need for our flood prediction project
import pandas as pd  # For handling our flood data
import numpy as np  # For numerical operations
from sklearn.model_selection import train_test_split  # To split our flood data
from sklearn.ensemble import RandomForestRegressor  # Our flood prediction model
from sklearn.metrics import mean_squared_error, r2_score  # To evaluate our flood predictions

import matplotlib.pyplot as plt  # For visualizing flood risk factors
import seaborn as sns  # For prettier visualizations

# We set a random seed to make our flood predictions reproducible
np.random.seed(42)

## Step 1: Loading and Exploring Our Flood Data

First, we need to load and examine our flood-related data to understand what information we have to work with.

In [3]:
# Load the flood data from our CSV file
flood_data = pd.read_csv('flood_kaggle.csv')

# Let's look at the first few rows of our flood data
print("Here are the first few rows of our flood data:")
print(flood_data.head())

# And get some basic information about our flood dataset
print("\nHere's some information about our flood dataset:")
print(flood_data.info())

# This gives us an overview of our dataset. We can see all the factors that might influence flood probability,
# like MonsoonIntensity, TopographyDrainage, etc., and our target variable 'FloodProbability' at the end.
# Understanding these factors is crucial for predicting flood risk.

Here are the first few rows of our flood data:
   MonsoonIntensity  TopographyDrainage  RiverManagement  Deforestation  \
0                 3                   8                6              6   
1                 8                   4                5              7   
2                 3                  10                4              1   
3                 4                   4                2              7   
4                 3                   7                5              2   

   Urbanization  ClimateChange  DamsQuality  Siltation  AgriculturalPractices  \
0             4              4            6          2                      3   
1             7              9            1          5                      5   
2             7              5            4          7                      4   
3             3              4            1          4                      6   
4             5              8            5          2                      7   

   Encroachment

## Step 2: Preparing Our Flood Data

Now that we've loaded our data, we need to prepare it for our machine learning model. We'll split our data into two parts:
1. The features (X) - all the factors that might influence flooding
2. The target (y) - the flood probability we want to predict

In [4]:
# Separate our flood risk factors (X) and flood probability (y)
X = flood_data.drop('FloodProbability', axis=1)  # All columns except FloodProbability
y = flood_data['FloodProbability']  # Just the FloodProbability column

# Now we split our data into training and testing sets
# We'll use 80% of the data to train our flood prediction model, and 20% to test it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"We have {X_train.shape[0]} flood scenarios to train our model")
print(f"And {X_test.shape[0]} flood scenarios to test it")

# This split allows us to train our model on one set of flood data and then test how well it performs on data it hasn't seen before.
# This helps us understand if our model can generalize to new flood scenarios, which is crucial for predicting future flood risks.

We have 40000 flood scenarios to train our model
And 10000 flood scenarios to test it
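
As a quick sanity check on the split, the following minimal sketch repeats the same `train_test_split` call on a small made-up table. The column names `FeatureA` and `FeatureB` and the random values are placeholders, not the real flood data. It checks that the two sets do not share rows and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for flood_data: 50 rows, two features and a target
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "FeatureA": rng.integers(0, 10, size=50),
    "FeatureB": rng.integers(0, 10, size=50),
    "FloodProbability": rng.random(50),
})
X_demo = demo.drop("FloodProbability", axis=1)
y_demo = demo["FloodProbability"]

# Same call shape as in the notebook: 20% held out, fixed seed
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = split_a

# Row counts of the training and test sets
print(len(X_tr), len(X_te))

# No row should appear in both sets
no_overlap = set(X_tr.index).isdisjoint(X_te.index)

# A fixed random_state should select the same test rows every time
same_split = X_te.index.equals(split_b[1].index)
print(no_overlap, same_split)
```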


## Step 3: Training Our Flood Prediction Model

Now we'll use a Random Forest model to learn patterns from our training data. Think of this like the model studying many examples of past flood scenarios to understand what factors lead to higher flood probabilities.

In [5]:
# Create our Random Forest model for flood prediction
# n_estimators=100 means it will create 100 decision trees to make its predictions
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model on our flood data
model.fit(X_train, y_train)

print("Our flood prediction model has finished learning from the training data!")

# The model has now learned patterns from the training data. It's like it has studied many past flood scenarios
# and understands how different factors (like monsoon intensity, topography, etc.) relate to flood probability.

Our flood prediction model has finished learning from the training data!
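
To make "100 decision trees" more concrete, here is a small sketch on synthetic data (the features and target below are invented for illustration). It shows that a `RandomForestRegressor` prediction is the average of the predictions of its individual trees, which scikit-learn exposes through the `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 200 rows, 4 integer-valued features (assumption)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(200, 4)).astype(float)
y_demo = X_demo.sum(axis=1) / 40.0 + rng.normal(0, 0.01, size=200)

# A small forest keeps the example fast; the notebook uses 100 trees
forest = RandomForestRegressor(n_estimators=10, random_state=42)
forest.fit(X_demo, y_demo)

# Predict the first five rows with every individual tree
tree_preds = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])

# Averaging the trees by hand should reproduce the forest's own prediction
manual_average = tree_preds.mean(axis=0)
forest_preds = forest.predict(X_demo[:5])
matches = np.allclose(manual_average, forest_preds)
print(matches)
```

Averaging many trees smooths out the errors of any single tree, which is why adding trees usually stabilizes predictions at the cost of training time.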


## Step 4: Evaluating Our Flood Prediction Model

Now that our model has learned, let's see how well it can predict flood probabilities for scenarios it hasn't seen before.

In [6]:
# Use our trained model to make flood probability predictions on the test data
y_pred = model.predict(X_test)

# Calculate how well our predictions match the actual flood probabilities
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared Score: {r2:.4f}")

# The Mean Squared Error (MSE) is the average of the squared differences between predicted and actual flood probabilities. Lower is better.
# The R-squared score is the share of the variance in flood probability that our model explains. Closer to 1 is better.
# These metrics help us understand how reliable our flood predictions might be for new areas or future scenarios.

Mean Squared Error: 0.0007
R-squared Score: 0.7295
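
The two metrics can be computed by hand, which may help in reading the numbers above. The sketch below uses five made-up probabilities (not values from the flood dataset) and compares the manual results with scikit-learn's `mean_squared_error` and `r2_score`. It also takes the square root of the MSE (the RMSE), which is in the same units as the target. For the MSE of about 0.0007 reported above, that is roughly 0.026.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted probabilities, for illustration only
y_true = np.array([0.40, 0.50, 0.55, 0.60, 0.45])
y_hat = np.array([0.42, 0.48, 0.55, 0.57, 0.47])

# MSE: mean of the squared errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# RMSE: square root of the MSE, in the same units as the target
rmse_manual = np.sqrt(mse_manual)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE:  {mse_manual:.6f} (sklearn: {mean_squared_error(y_true, y_hat):.6f})")
print(f"RMSE: {rmse_manual:.6f}")
print(f"R2:   {r2_manual:.6f} (sklearn: {r2_score(y_true, y_hat):.6f})")
```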


## Step 5: Understanding Important Factors for Flood Prediction

One of the benefits of our model is that it can tell us which factors are most important in predicting flood probability. This can help focus flood prevention efforts.

In [7]:
# Get the importance of each flood risk factor
feature_importance = model.feature_importances_
features = X.columns

# Sort flood risk factors by importance
feature_importance_sorted = sorted(zip(feature_importance, features), reverse=True)

# Create a bar chart of flood risk factor importances
plt.figure(figsize=(12, 8))
sns.barplot(x=[imp for imp, _ in feature_importance_sorted], 
            y=[feat for _, feat in feature_importance_sorted])
plt.title("Which Factors Are Most Important for Predicting Floods?")
plt.xlabel("Importance Score")
plt.tight_layout()
plt.show()

# Print the top 5 most important flood risk factors
print("The 5 most important factors for predicting flood probability are:")
for imp, feat in feature_importance_sorted[:5]:
    print(f"{feat}: {imp:.4f}")

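# This analysis ranks the factors by how much the model relies on them when predicting flood probability.
# Here the top five scores are nearly identical (all about 0.05), so the ranking separates the factors only weakly.
# With a clearer spread, it could guide where to focus flood prevention efforts or what to monitor most closely for early warning systems.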

The 5 most important factors for predicting flood probability are:
TopographyDrainage: 0.0530
DamsQuality: 0.0528
PoliticalFactors: 0.0524
IneffectiveDisasterPreparedness: 0.0516
PopulationScore: 0.0516
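
Because the impurity-based scores above are so close together, a second measure can be a useful cross-check. The sketch below uses scikit-learn's `permutation_importance`, which shuffles one feature at a time on held-out data and records how much the score drops. This is a swapped-in technique, not something the notebook does. The data is synthetic and built so that only the first feature matters strongly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): column 0 matters a lot, column 1 a little, column 2 not at all
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(500, 3)).astype(float)
y_demo = 0.05 * X_demo[:, 0] + 0.01 * X_demo[:, 1] + rng.normal(0, 0.005, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the held-out set and measure the drop in R-squared
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)

# Report features from most to least important, with the spread across repeats
for idx in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")

top_feature = int(np.argmax(result.importances_mean))
```

In the notebook, the same call could be pointed at `model`, `X_test`, and `y_test`; the standard deviation across repeats gives a sense of whether small differences between factors are meaningful.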




In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize and train the Linear Regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Make predictions
y_pred = linear_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Linear Regression - MSE: {mse}, R²: {r2}")


Linear Regression - MSE: 1.1059953150715025e-31, R²: 1.0
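
An R² of 1.0 with an MSE at the level of floating-point round-off is unusual. One plausible reading, which is an assumption and not something the dataset documentation confirms here, is that `FloodProbability` is an exact linear function of the input factors. The sketch below builds a synthetic target of that kind and shows the same pattern: Linear Regression recovers it essentially perfectly, while a Random Forest only approximates it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 20 integer-scored factors and a noise-free linear target
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 11, size=(2000, 20)).astype(float)
y_demo = 0.005 * X_demo.sum(axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# A linear model can represent this target exactly
linear_demo = LinearRegression().fit(X_tr, y_tr)
linear_r2 = r2_score(y_te, linear_demo.predict(X_te))

# A forest approximates the same target with piecewise-constant steps
forest_demo = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)
forest_r2 = r2_score(y_te, forest_demo.predict(X_te))

print(f"Linear Regression R2: {linear_r2:.6f}")
print(f"Random Forest R2:     {forest_r2:.6f}")
```

In the notebook itself, inspecting `linear_model.coef_` and `linear_model.intercept_` would show whether every factor carries a similar weight, which would be consistent with this reading.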


In [9]:
import matplotlib.pyplot as plt

# Scatter plot for predicted vs actual values
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5, label="Predictions")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', label="Perfect Fit")
plt.title("Linear Regression: Predicted vs Actual Flood Probabilities")
plt.xlabel("Actual Flood Probability")
plt.ylabel("Predicted Flood Probability")
plt.legend()
plt.savefig("linear_regression_actual_vs_predicted.png")  # Save before show(), which can clear the figure
plt.show()




In [10]:
from sklearn.ensemble import RandomForestRegressor

# Initialize and train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=50, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"Random Forest - MSE: {mse_rf}, R²: {r2_rf}")


Random Forest - MSE: 0.0006917651559999998, R²: 0.7222480062226527


In [11]:
import matplotlib.pyplot as plt

# Scatter plot for predicted vs actual values (Random Forest)
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred_rf, alpha=0.5, label="Predictions")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', label="Perfect Fit")
plt.title("Random Forest: Predicted vs Actual Flood Probabilities")
plt.xlabel("Actual Flood Probability")
plt.ylabel("Predicted Flood Probability")
plt.legend()
plt.savefig("random_forest_actual_vs_predicted.png")  # Save the figure
plt.show()




In [12]:
!pip install xgboost  # Install the package this cell imports (PyQt5 is not needed with the Agg backend)
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Initialize and train the XGBoost model
xgb_model = XGBRegressor(n_estimators=50, random_state=42)
xgb_model.fit(X_train, y_train)

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate the model
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"XGBoost - MSE: {mse_xgb}, R²: {r2_xgb}")



XGBoost - MSE: 0.0002371374625304009, R²: 0.9047864691567092
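
With three models evaluated on the same test set, a side-by-side table makes the comparison easier to read. The sketch below copies the metrics printed by the cells above into a DataFrame and adds the RMSE. The row labels are chosen here for readability and do not come from the notebook.

```python
import numpy as np
import pandas as pd

# Test-set metrics copied from the printed outputs of the three model cells
results = pd.DataFrame(
    {
        "MSE": [1.1059953150715025e-31, 0.0006917651559999998, 0.0002371374625304009],
        "R2": [1.0, 0.7222480062226527, 0.9047864691567092],
    },
    index=["Linear Regression", "Random Forest (50 trees)", "XGBoost (50 trees)"],
)

# RMSE is in the same units as FloodProbability, so it is easier to interpret
results["RMSE"] = np.sqrt(results["MSE"])

# Rank models from lowest to highest error
ranked = results.sort_values("MSE")
print(ranked)
```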


In [13]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the dataset (adjust file name if needed)
kaggle_data = pd.read_csv('flood_kaggle.csv')

# Define the target variable (Y); the heatmap below uses every column, including the target
Y_kaggle = kaggle_data['FloodProbability']  # Ensure this column exists in your dataset

# Correlation heatmap
plt.figure(figsize=(12, 8))
correlation = kaggle_data.corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.savefig("feature_correlation_heatmap.png")  # Save the plot as an image
plt.close()  # Close the plot to free memory


In [14]:
import pandas as pd
import matplotlib.pyplot as plt

# Feature importance for Random Forest
feature_importance = rf_model.feature_importances_
features = X_train.columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importance})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plot
plt.figure(figsize=(12, 8))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
plt.title("Feature Importance (Random Forest)")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.gca().invert_yaxis()
plt.savefig("random_forest_feature_importance.png")  # Save the plot
plt.close()


In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

kaggle_data = pd.read_csv('flood_kaggle.csv')
Y_kaggle = kaggle_data['FloodProbability']

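# Interaction term: product of the two top-ranked factors from Step 5
kaggle_data['Interaction_Topography_Dams'] = kaggle_data['TopographyDrainage'] * kaggle_data['DamsQuality']

# Keep every original feature plus the interaction term, so the comparison with
# the baseline Random Forest (mse_rf, r2_rf) is like for like. The earlier slice,
# iloc[:, 1:14], dropped MonsoonIntensity and any column from position 14 onward.
X_interaction = kaggle_data.drop(columns='FloodProbability')

# Same test_size and random_state as the baseline, so the same rows are held out
X_train_int, X_test_int, y_train_int, y_test_int = train_test_split(X_interaction, Y_kaggle, test_size=0.2, random_state=42)

# Use a separate name so the baseline rf_model from In [10] is not overwritten
rf_model_int = RandomForestRegressor(n_estimators=50, random_state=42)
rf_model_int.fit(X_train_int, y_train_int)

y_pred_int = rf_model_int.predict(X_test_int)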
mse_int = mean_squared_error(y_test_int, y_pred_int)
r2_int = r2_score(y_test_int, y_pred_int)

metrics = {'Before': [mse_rf, r2_rf], 'After': [mse_int, r2_int]}
categories = ['MSE', 'R²']
metrics_df = pd.DataFrame(metrics, index=categories)

metrics_df.plot(kind='bar', figsize=(10, 6))
plt.title("Model Performance Before and After Interaction Terms")
plt.ylabel("Metric Value")
plt.legend(title="Model")
plt.savefig("model_performance_interaction_terms.png")
plt.close()


In [16]:
# Histogram of Interaction_Topography_Dams
plt.figure(figsize=(8, 6))
kaggle_data['Interaction_Topography_Dams'].hist(bins=30, color='lightgreen')
plt.title("Distribution of Interaction Term: TopographyDrainage * DamsQuality")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.savefig("interaction_term_distribution.png")  # Save the plot
plt.close()


In [17]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Define the model
rf_model = RandomForestRegressor(random_state=42)

# Define hyperparameters to tune
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform GridSearchCV
grid_search_rf = GridSearchCV(estimator=rf_model, param_grid=param_grid_rf, cv=3, n_jobs=-1, verbose=2)
grid_search_rf.fit(X_train, y_train)

# Best parameters and best mean cross-validated score (R², the default scorer for regressors)
print(f"Best Parameters: {grid_search_rf.best_params_}")
print(f"Best Score: {grid_search_rf.best_score_}")


Fitting 3 folds for each of 108 candidates, totalling 324 fits
Best Parameters: {'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best Score: 0.7236758283465475
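
The "Best Score" printed above is a mean cross-validated R² on the training folds, because `GridSearchCV` falls back to the estimator's own `score` method when no `scoring` argument is given. The sketch below, on a small synthetic dataset and a deliberately tiny grid (both are assumptions made to keep it fast), shows how to request a different metric and how to inspect every candidate through `cv_results_`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption), small enough to search in seconds
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(300, 4)).astype(float)
y_demo = X_demo.sum(axis=1) / 40.0 + rng.normal(0, 0.01, size=300)

# A tiny grid: 2 x 2 = 4 candidates, each fitted on 3 folds
small_grid = {"n_estimators": [10, 20], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid=small_grid,
    cv=3,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
)
search.fit(X_demo, y_demo)

# One row per candidate, best first
cv_table = (
    pd.DataFrame(search.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
)
print(cv_table)

# Flip the sign back to get the best cross-validated MSE
best_cv_mse = -search.best_score_
print(f"Best cross-validated MSE: {best_cv_mse:.6f}")
```

Looking at the whole table, and not only `best_params_`, shows how much the candidates actually differ. In the Random Forest search above, the best cross-validated R² (0.7237) is close to the untuned test-set scores, which suggests the grid changed little.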


In [18]:
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

# Define the model
xgb_model = XGBRegressor(random_state=42)

# Define hyperparameters to tune
param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0]
}

# Perform GridSearchCV
grid_search_xgb = GridSearchCV(estimator=xgb_model, param_grid=param_grid_xgb, cv=3, n_jobs=-1, verbose=2)
grid_search_xgb.fit(X_train, y_train)

# Best parameters and best mean cross-validated score (R², the default scorer for regressors)
print(f"Best Parameters: {grid_search_xgb.best_params_}")
print(f"Best Score: {grid_search_xgb.best_score_}")


Fitting 3 folds for each of 54 candidates, totalling 162 fits
Best Parameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.8}
Best Score: 0.9560455015270932


In [20]:
# Get the best model from GridSearchCV
best_rf_model = grid_search_rf.best_estimator_

# Make predictions with the best model
y_pred_rf_best = best_rf_model.predict(X_test)

# Calculate MSE and R² for the best model
best_mse_rf = mean_squared_error(y_test, y_pred_rf_best)
best_r2_rf = r2_score(y_test, y_pred_rf_best)

# Compare the baseline Random Forest from In [10] ('Before') with the tuned model ('After')
metrics = {'Before': [mse_rf, r2_rf], 'After': [best_mse_rf, best_r2_rf]}
categories = ['MSE', 'R²']
metrics_df = pd.DataFrame(metrics, index=categories)

metrics_df.plot(kind='bar', figsize=(10, 6))
plt.title("Performance Comparison Before and After Hyperparameter Tuning")
plt.ylabel("Metric Value")
plt.legend(title="Model")
plt.savefig("performance_comparison_tuning.png")
plt.close()



## Conclusion

We've now built a model that can predict flood probability based on various environmental and human factors. This model could be used to:
1. Estimate flood risk for new areas
2. Identify the most critical factors contributing to flood risk
3. Guide decision-making for flood prevention and preparedness

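Remember, this is a simplified model, and real-world flood prediction is very complex. On this dataset, Linear Regression reproduced the test-set flood probabilities almost exactly (R² = 1.0), ahead of XGBoost (R² ≈ 0.90) and Random Forest (R² ≈ 0.72), so a more complex model is not automatically a better one. Still, this gives us a starting point for understanding and predicting flood risks. As you continue your project, you might want to consider: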
- Collecting more detailed local data to improve predictions
- Incorporating time-based data to predict flood risks over time
- Exploring other machine learning models to see if they perform better
- Creating a user-friendly interface for local authorities to use these predictions