# GIS Wind-Methane Dispersion Analysis - Predictive Modeling

This notebook focuses on developing predictive models for methane concentration based on wind patterns and spatial features. We will explore both machine learning regression models and time series forecasting approaches.

## Objectives

1. Build regression models to predict methane concentration based on wind and location data
2. Develop time series forecasting models for specific sensor locations
3. Evaluate model performance and feature importance
4. Visualize predictions and forecasts

## 1. Setup and Data Loading

In [None]:
# Import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
from sklearn.model_selection import TimeSeriesSplit, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor
from statsmodels.tsa.arima.model import ARIMA
from pmdarima import auto_arima
import warnings

warnings.filterwarnings('ignore')

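# Set up plotting parameters
# 'seaborn-whitegrid' was renamed to 'seaborn-v0_8-whitegrid' in Matplotlib 3.6
# and the old name was later removed, so use whichever this install provides.
style_name = ('seaborn-v0_8-whitegrid'
              if 'seaborn-v0_8-whitegrid' in plt.style.available
              else 'seaborn-whitegrid')
plt.style.use(style_name)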
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
sns.set_style("whitegrid")

In [None]:
# Import from our project modules
import sys
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from src.data_processing import load_data, preprocess_methane_data, preprocess_wind_data, merge_data
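from src.predictive_model import (
    prepare_regression_data, train_random_forest_model, train_xgboost_model,
    # Used in Section 7; assumed to be defined in the same module
    prepare_time_series_data, fit_arima_model, forecast_methane,
)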

In [None]:
# Define file paths
project_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
methane_path = os.path.join(project_dir, 'data', 'methane_sensors.csv')
wind_path = os.path.join(project_dir, 'data', 'wind_data.csv')

# Alternative direct paths if needed
if not os.path.exists(methane_path):
    methane_path = r"C:\Users\pradeep dubey\Downloads\methane_sensors.csv"
    wind_path = r"C:\Users\pradeep dubey\Downloads\wind_data.csv"

# Load and preprocess data
methane_df, wind_df = load_data(methane_path, wind_path)
methane_gdf = preprocess_methane_data(methane_df)
wind_df_processed = preprocess_wind_data(wind_df)
merged_gdf = merge_data(methane_gdf, wind_df_processed)

print(f"Loaded {len(methane_gdf)} methane records and {len(wind_df_processed)} wind records")
print(f"Merged dataset contains {len(merged_gdf)} records")
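
The `U` and `V` columns used in later cells are presumably the eastward and northward wind components produced by `preprocess_wind_data`. That function is not shown in this notebook, so the following is only a sketch of one common convention (the meteorological one, where direction is the bearing the wind blows *from*); `wind_to_uv` is a hypothetical helper, not the project's implementation.

```python
import numpy as np

def wind_to_uv(speed, direction_deg):
    """Convert wind speed and meteorological direction to (u, v) components.

    direction_deg is the bearing the wind blows FROM, in degrees clockwise
    from north (an assumption; the project may use a different convention).
    u is the eastward component, v the northward component.
    """
    theta = np.deg2rad(direction_deg)
    u = -speed * np.sin(theta)
    v = -speed * np.cos(theta)
    return u, v

# A 5 m/s wind from the north blows southward; one from the east blows westward.
u, v = wind_to_uv(np.array([5.0, 5.0]), np.array([0.0, 90.0]))
print(np.round(u, 6), np.round(v, 6))
```

Unlike the raw direction in degrees, these components are continuous across the 359°/0° boundary, which is why they are usually the safer inputs for correlation and regression.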

## 2. Exploratory Data Analysis for Modeling

In [None]:
# Check distribution of methane concentration
plt.figure(figsize=(10, 6))
sns.histplot(merged_gdf['Methane_Concentration (ppm)'], kde=True)
plt.title('Distribution of Methane Concentration', fontsize=14)
plt.xlabel('Methane Concentration (ppm)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Check correlation between wind features and methane concentration
# Convert timestamp to hour for temporal features
merged_gdf['Hour'] = merged_gdf['Timestamp'].dt.hour
merged_gdf['Minute'] = merged_gdf['Timestamp'].dt.minute
    
# Calculate hour of day as continuous feature (for cyclical patterns)
merged_gdf['Hour_Continuous'] = merged_gdf['Hour'] + merged_gdf['Minute']/60
    
# Create cyclical features for time of day
merged_gdf['Hour_Sin'] = np.sin(2 * np.pi * merged_gdf['Hour_Continuous']/24)
merged_gdf['Hour_Cos'] = np.cos(2 * np.pi * merged_gdf['Hour_Continuous']/24)

# Calculate correlation matrix for relevant features
correlation_cols = [
    'Methane_Concentration (ppm)',
    'Wind_Speed (m/s)', 
    'Wind_Direction (°)', 
    'U', 
    'V',
    'Hour_Sin',
    'Hour_Cos',
    'Latitude',
    'Longitude'
]

corr_matrix = merged_gdf[correlation_cols].corr()

# Plot correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Features', fontsize=16)
plt.tight_layout()
plt.show()
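
The sine/cosine pair above exists because raw hour values treat 23:30 and 00:30 as 23 hours apart even though they are one hour apart on the clock. The short check below illustrates this with the same formula; `encode_hour` is a name introduced only for this example.

```python
import numpy as np

def encode_hour(hour_continuous):
    """Map hour of day (0-24) onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(hour_continuous, dtype=float) / 24
    return np.column_stack([np.sin(angle), np.cos(angle)])

late, early, noon = encode_hour([23.5, 0.5, 12.0])

# Distance in encoded space tracks distance on the clock face
gap_midnight = np.linalg.norm(late - early)   # 23:30 vs 00:30
gap_noon = np.linalg.norm(late - noon)        # 23:30 vs 12:00
print(f"23:30 vs 00:30: {gap_midnight:.3f}, 23:30 vs 12:00: {gap_noon:.3f}")
```

The same wrap-around problem applies to `Wind_Direction (°)`, so its Pearson correlation in the heatmap should be read with caution; the `U` and `V` components are the more reliable indicators there.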

## 3. Prepare Data for Regression Modeling

In [None]:
# Prepare data for regression modeling
X, y, feature_names = prepare_regression_data(merged_gdf)

print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"Features: {feature_names}")

In [None]:
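# Split data into train/test sets
# (for inspection only: the training helpers in Sections 4 and 5 take X and y
# and return their own split and scaler)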
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")
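
A random split shuffles rows across time, so readings taken at the same moment can end up in both the training and test sets. `TimeSeriesSplit` is imported above but never used; the sketch below shows how it could provide a chronological alternative. It runs on synthetic data and assumes the rows are already sorted by timestamp.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for (X, y); rows are assumed to be in chronological order
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = rng.normal(size=40)

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X_demo))

for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    # Every training row precedes every test row, so no future data leaks in
    print(f"Fold {fold}: train rows 0-{train_idx.max()}, "
          f"test rows {test_idx.min()}-{test_idx.max()}")
```

With several sensors reporting at each timestamp, rows sharing a timestamp can still straddle a fold boundary, so splitting on unique timestamps rather than on rows may be the cleaner option.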

## 4. Random Forest Regression Model

In [None]:
# Train Random Forest model
rf_model, X_train_rf, X_test_rf, y_train_rf, y_test_rf, y_pred_rf, rf_scaler = train_random_forest_model(X, y)

# Evaluate feature importance
importance = rf_model.feature_importances_
indices = np.argsort(importance)[::-1]

# Plot feature importance
plt.figure(figsize=(12, 6))
plt.bar(range(len(indices)), importance[indices], align='center')
plt.xticks(range(len(indices)), [feature_names[i] for i in indices], rotation=45)
plt.title('Feature Importance (Random Forest)', fontsize=14)
plt.tight_layout()
plt.show()
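
Impurity-based `feature_importances_` are computed on the training data and can favor continuous or high-cardinality features. Permutation importance on held-out data is a common cross-check. The example below is a minimal sketch on synthetic data in which only the first column carries signal; the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: only column 0 influences the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

X_fit, X_hold = X_demo[:200], X_demo[200:]
y_fit, y_hold = y_demo[:200], y_demo[200:]

demo_model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_fit, y_fit)

# Shuffle one column at a time on held-out data and measure the drop in score
perm = permutation_importance(demo_model, X_hold, y_hold, n_repeats=5, random_state=42)
ranking = np.argsort(perm.importances_mean)[::-1]
print("Columns ranked by permutation importance:", ranking)
```

In this notebook the equivalent call would be something like `permutation_importance(rf_model, X_test_rf, y_test_rf, n_repeats=10, random_state=42)`, assuming `X_test_rf` is already in the scaled form the model was trained on.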

In [None]:
# Plot actual vs. predicted values for Random Forest model
plt.figure(figsize=(10, 6))
plt.scatter(y_test_rf, y_pred_rf, alpha=0.5)
plt.plot([y_test_rf.min(), y_test_rf.max()], [y_test_rf.min(), y_test_rf.max()], 'r--')
plt.title('Random Forest: Actual vs. Predicted Methane Concentration', fontsize=14)
plt.xlabel('Actual Methane Concentration (ppm)', fontsize=12)
plt.ylabel('Predicted Methane Concentration (ppm)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Plot residuals
residuals_rf = y_test_rf - y_pred_rf
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_rf, residuals_rf, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Random Forest: Residual Plot', fontsize=14)
plt.xlabel('Predicted Methane Concentration (ppm)', fontsize=12)
plt.ylabel('Residuals', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 5. XGBoost Regression Model

In [None]:
# Train XGBoost model
xgb_model, X_train_xgb, X_test_xgb, y_train_xgb, y_test_xgb, y_pred_xgb, xgb_scaler = train_xgboost_model(X, y)

# Evaluate feature importance
xgb_importance = xgb_model.feature_importances_
xgb_indices = np.argsort(xgb_importance)[::-1]

# Plot feature importance
plt.figure(figsize=(12, 6))
plt.bar(range(len(xgb_indices)), xgb_importance[xgb_indices], align='center')
plt.xticks(range(len(xgb_indices)), [feature_names[i] for i in xgb_indices], rotation=45)
plt.title('Feature Importance (XGBoost)', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# Plot actual vs. predicted values for XGBoost model
plt.figure(figsize=(10, 6))
plt.scatter(y_test_xgb, y_pred_xgb, alpha=0.5)
plt.plot([y_test_xgb.min(), y_test_xgb.max()], [y_test_xgb.min(), y_test_xgb.max()], 'r--')
plt.title('XGBoost: Actual vs. Predicted Methane Concentration', fontsize=14)
plt.xlabel('Actual Methane Concentration (ppm)', fontsize=12)
plt.ylabel('Predicted Methane Concentration (ppm)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Plot residuals
residuals_xgb = y_test_xgb - y_pred_xgb
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_xgb, residuals_xgb, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('XGBoost: Residual Plot', fontsize=14)
plt.xlabel('Predicted Methane Concentration (ppm)', fontsize=12)
plt.ylabel('Residuals', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 6. Compare Model Performance

In [None]:
# Calculate metrics for both models
rf_rmse = np.sqrt(mean_squared_error(y_test_rf, y_pred_rf))
rf_r2 = r2_score(y_test_rf, y_pred_rf)

xgb_rmse = np.sqrt(mean_squared_error(y_test_xgb, y_pred_xgb))
xgb_r2 = r2_score(y_test_xgb, y_pred_xgb)

# Create comparison dataframe
model_comparison = pd.DataFrame({
    'Model': ['Random Forest', 'XGBoost'],
    'RMSE': [rf_rmse, xgb_rmse],
    'R²': [rf_r2, xgb_r2]
})

display(model_comparison)

# Plot comparison
plt.figure(figsize=(10, 6))
plt.bar(model_comparison['Model'], model_comparison['RMSE'])
plt.title('Model Comparison: RMSE (lower is better)', fontsize=14)
plt.ylabel('RMSE', fontsize=12)
plt.grid(axis='y', alpha=0.3)
plt.show()

plt.figure(figsize=(10, 6))
plt.bar(model_comparison['Model'], model_comparison['R²'])
plt.title('Model Comparison: R² (higher is better)', fontsize=14)
plt.ylabel('R²', fontsize=12)
plt.grid(axis='y', alpha=0.3)
plt.show()
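
RMSE alone is hard to judge without a reference point. One option is to compare each model against a trivial predictor that always outputs the training mean. The helper below is a hypothetical sketch of that comparison, shown with toy numbers rather than notebook results.

```python
import numpy as np

def rmse_vs_mean_baseline(y_train, y_test, y_pred):
    """Return (model RMSE, baseline RMSE, skill), where the baseline always
    predicts the training mean and skill = 1 - model_rmse / baseline_rmse."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    baseline_pred = np.full_like(y_test, np.mean(y_train))
    model_rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
    baseline_rmse = np.sqrt(np.mean((y_test - baseline_pred) ** 2))
    return model_rmse, baseline_rmse, 1.0 - model_rmse / baseline_rmse

# Toy numbers only: the model halves the baseline error, so skill is 0.5
model_rmse, baseline_rmse, skill = rmse_vs_mean_baseline(
    y_train=[2.0, 2.0, 2.0], y_test=[1.0, 3.0], y_pred=[1.5, 2.5])
print(model_rmse, baseline_rmse, skill)
```

A skill near 0 means the model is no better than predicting the mean; applied here, the call would take `y_train_rf`, `y_test_rf` and `y_pred_rf` (or the XGBoost equivalents).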

## 7. Time Series Analysis and Forecasting

In [None]:
# Prepare time series data for a specific sensor
sensor_id = 'S1'  # Choose a sensor for analysis
ts_data = prepare_time_series_data(merged_gdf, sensor_id)

# Plot time series
plt.figure(figsize=(12, 6))
plt.plot(ts_data.index, ts_data['Methane_Concentration (ppm)'], marker='o')
plt.title(f'Methane Concentration Time Series for Sensor {sensor_id}', fontsize=14)
plt.xlabel('Time', fontsize=12)
plt.ylabel('Methane Concentration (ppm)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Fit ARIMA model to the time series data
arima_model, arima_order = fit_arima_model(ts_data)

# Display model summary
display(arima_model.summary())

In [None]:
# Generate forecasts for the next 12 time steps (6 hours at 30-minute sampling)
forecast_steps = 12
forecast_series, _ = forecast_methane(arima_model, steps=forecast_steps)

# Plot forecasts with historical data
plt.figure(figsize=(12, 6))

# Plot historical data
plt.plot(ts_data.index, ts_data['Methane_Concentration (ppm)'], label='Historical', color='blue')

# Plot forecast
plt.plot(forecast_series.index, forecast_series, label='Forecast', color='red')

# Add prediction intervals if available
try:
    forecast_obj = arima_model.get_forecast(steps=forecast_steps)
    conf_int = forecast_obj.conf_int()
    plt.fill_between(
        forecast_series.index,
        conf_int.iloc[:, 0],
        conf_int.iloc[:, 1],
        color='red',
        alpha=0.2,
        label='95% Confidence Interval'
    )
except Exception:
    pass  # Skip if confidence intervals aren't available

plt.title(f'Methane Concentration Forecast for Sensor {sensor_id}', fontsize=14)
plt.xlabel('Time', fontsize=12)
plt.ylabel('Methane Concentration (ppm)', fontsize=12)
plt.legend()
plt.grid(True, alpha=0.3)
plt.gcf().autofmt_xdate()
plt.tight_layout()
plt.show()
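
The forecast plot does not show how accurate the ARIMA model is. A simple check is to hold out the last few observations and compare the model against a persistence forecast that repeats the last known value. The sketch below shows only the persistence side, on a made-up series at 30-minute spacing; `persistence_backtest` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def persistence_backtest(series, horizon):
    """Hold out the last `horizon` points and forecast them by repeating
    the final training value. Returns the forecast and its RMSE."""
    train, test = series.iloc[:-horizon], series.iloc[-horizon:]
    forecast = pd.Series(train.iloc[-1], index=test.index)
    rmse = float(np.sqrt(((test - forecast) ** 2).mean()))
    return forecast, rmse

# Made-up readings at 30-minute intervals (not real sensor data)
idx = pd.date_range("2025-02-10 00:00", periods=8, freq="30min")
demo = pd.Series([2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7], index=idx)

forecast, rmse = persistence_backtest(demo, horizon=2)
print(forecast)
print(f"Persistence RMSE: {rmse:.4f}")
```

Fitting the ARIMA model on the same training portion of `ts_data` and computing its RMSE over the same held-out window would show whether it actually beats this baseline.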

## 8. Spatial Prediction

Let's visualize how our model predicts methane concentrations across space.

In [None]:
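def spatial_prediction_grid(model, scaler, methane_gdf, wind_df, timestamp, resolution=50):
    """
    Create a spatial prediction grid using the trained model.

    Returns the longitude grid, latitude grid, predicted concentrations on that
    grid, and the sensor observations at `timestamp`.
    """
    # Filter data for the specified timestamp
    time_methane = methane_gdf[methane_gdf['Timestamp'] == timestamp]
    wind_rows = wind_df[wind_df['Timestamp'] == timestamp]
    if time_methane.empty or wind_rows.empty:
        raise ValueError(f"No methane or wind records found for timestamp {timestamp}")
    time_wind = wind_rows.iloc[0]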
    
    # Get bounds
    min_lon = time_methane.geometry.x.min() - 0.001
    max_lon = time_methane.geometry.x.max() + 0.001
    min_lat = time_methane.geometry.y.min() - 0.001
    max_lat = time_methane.geometry.y.max() + 0.001
    
    # Create grid
    lon_vals = np.linspace(min_lon, max_lon, resolution)
    lat_vals = np.linspace(min_lat, max_lat, resolution)
    xx, yy = np.meshgrid(lon_vals, lat_vals)
    
    # Extract wind features
    wind_speed = time_wind['Wind_Speed (m/s)']
    wind_direction = time_wind['Wind_Direction (°)']
    u = time_wind['U']
    v = time_wind['V']
    
    # Get hour features
    hour = timestamp.hour
    minute = timestamp.minute
    hour_continuous = hour + minute/60
    hour_sin = np.sin(2 * np.pi * hour_continuous/24)
    hour_cos = np.cos(2 * np.pi * hour_continuous/24)
    
    # Create input features for each grid point (row-major order, matching xx.ravel())
    # NOTE: the column order must match the feature order used to train the model
    n_points = xx.size
    grid_features = np.column_stack([
        np.full(n_points, wind_speed),
        np.full(n_points, wind_direction),
        np.full(n_points, u),
        np.full(n_points, v),
        np.full(n_points, hour_sin),
        np.full(n_points, hour_cos),
        yy.ravel(),  # latitude
        xx.ravel(),  # longitude
    ])
    
    # Scale features
    scaled_features = scaler.transform(grid_features)
    
    # Predict methane concentrations
    predicted_methane = model.predict(scaled_features)
    
    # Reshape for plotting
    zz = predicted_methane.reshape(xx.shape)
    
    return xx, yy, zz, time_methane

# Generate spatial prediction for noon timestamp
noon_timestamp = pd.Timestamp('2025-02-10 12:00:00')
xx_pred, yy_pred, zz_pred, noon_data = spatial_prediction_grid(
    rf_model, rf_scaler, methane_gdf, wind_df_processed, noon_timestamp)

# Plot the spatial prediction
fig, ax = plt.subplots(figsize=(12, 10))

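# Plot predicted surface
contour = ax.contourf(xx_pred, yy_pred, zz_pred, cmap='YlOrRd', levels=15)

# Plot actual observations on the same color scale as the predicted surface,
# so that sensor markers and contours can be compared against one colorbar
scatter = ax.scatter(
    noon_data.geometry.x,
    noon_data.geometry.y,
    c=noon_data['Methane_Concentration (ppm)'],
    cmap='YlOrRd',
    norm=contour.norm,
    edgecolor='k',
    s=100
)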

# Add sensor labels
for idx, row in noon_data.iterrows():
    ax.annotate(
        row['Sensor_ID'], 
        (row.geometry.x, row.geometry.y),
        xytext=(5, 5),
        textcoords="offset points",
        fontsize=10,
        color='black',
        fontweight='bold'
    )

# Add colorbar
cbar = plt.colorbar(contour, ax=ax)
cbar.set_label('Predicted Methane Concentration (ppm)', fontsize=12)

# Set title and labels
ax.set_title(f'Random Forest Predicted Methane Concentration\n{noon_timestamp}', fontsize=14)
ax.set_xlabel('Longitude', fontsize=12)
ax.set_ylabel('Latitude', fontsize=12)

plt.tight_layout()
plt.show()
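
Tree-based models predict piecewise-constant values, so with only a few sensors the surface above can look blocky between stations. An inverse-distance-weighted (IDW) interpolation of the observations is a useful sanity check alongside it. The function below is a minimal sketch that treats longitude and latitude as planar coordinates, which is only reasonable over a small area; `idw_interpolate` and its parameters are illustrative names.

```python
import numpy as np

def idw_interpolate(sensor_xy, sensor_values, grid_xy, power=2.0, eps=1e-12):
    """Inverse-distance-weighted estimate at each grid point.

    sensor_xy: (n_sensors, 2) coordinates, sensor_values: (n_sensors,),
    grid_xy: (n_grid, 2). Closer sensors receive larger weights.
    """
    sensor_xy = np.asarray(sensor_xy, dtype=float)
    sensor_values = np.asarray(sensor_values, dtype=float)
    grid_xy = np.asarray(grid_xy, dtype=float)
    # Pairwise distances, shape (n_grid, n_sensors)
    dist = np.linalg.norm(grid_xy[:, None, :] - sensor_xy[None, :, :], axis=2)
    # eps avoids division by zero when a grid point coincides with a sensor
    weights = 1.0 / np.maximum(dist, eps) ** power
    return (weights @ sensor_values) / weights.sum(axis=1)

# Two sensors: the midpoint gets their average, a sensor location gets its own value
sensors = np.array([[0.0, 0.0], [1.0, 0.0]])
values = np.array([2.0, 4.0])
grid = np.array([[0.5, 0.0], [0.0, 0.0]])
estimates = idw_interpolate(sensors, values, grid)
print(estimates)
```

To use it with the grid from `spatial_prediction_grid`, the grid points could be passed as `np.column_stack([xx_pred.ravel(), yy_pred.ravel()])` and the result reshaped to `xx_pred.shape`.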

## 9. Conclusions from Predictive Modeling

Based on our modeling and analysis, we can draw the following conclusions:

1. **Regression Model Performance**:
   - Both Random Forest and XGBoost models show good predictive performance for methane concentrations based on wind and spatial features, as measured by the test-set RMSE and R² reported in Section 6.
   - Key predictive factors include wind direction, wind speed, time of day, and spatial coordinates.

2. **Time Series Forecasting**:
   - ARIMA modeling of a single sensor's series suggests that methane concentrations follow temporal patterns that can be predicted.
   - The 12-step (6-hour) forecast shows the expected short-term evolution of methane levels, together with its 95% interval.

3. **Spatial Prediction**:
   - Our models can generate spatial predictions of methane concentration on a grid covering the bounding box of the sensor network.
   - This could be valuable for identifying potential high-concentration areas between sensors.

4. **Feature Importance**:
   - Wind direction and speed are among the most important features for predicting methane concentrations, as shown in the feature importance plots in Sections 4 and 5.
   - Time of day (cyclical hour features) also plays a significant role.
   - Location coordinates capture the spatial variability in methane levels.

These models could be used to:
- Predict methane concentrations at unmonitored locations
- Forecast future methane levels based on expected wind conditions
- Identify the most influential factors driving methane dispersion
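
Predicting at unmonitored locations is a stronger claim than a random train/test split can support, because every sensor contributes rows to both sides of that split. Leaving out one sensor at a time gives a more direct test. The sketch below uses synthetic data with a stand-in for `Sensor_ID`; it is not the evaluation performed in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Synthetic data: 4 sensors with 30 readings each (stand-in for Sensor_ID groups)
rng = np.random.default_rng(0)
n_sensors, n_times = 4, 30
sensor_ids = np.repeat(np.arange(n_sensors), n_times)
X_demo = rng.normal(size=(n_sensors * n_times, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=n_sensors * n_times)

# Each fold trains on three sensors and tests on the one left out
scores = cross_val_score(
    RandomForestRegressor(n_estimators=30, random_state=0),
    X_demo, y_demo,
    groups=sensor_ids,
    cv=LeaveOneGroupOut(),
    scoring="neg_root_mean_squared_error",
)
rmse_per_sensor = -scores
print("RMSE when each sensor is held out:", np.round(rmse_per_sensor, 3))
```

If the held-out RMSE is much worse than the random-split RMSE from Section 6, the model is likely memorizing sensor locations rather than learning how wind moves methane between them.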