# Maize Vulnerability Modeling: A Regularized GLM Approach

## Goal

The goal of this notebook is to develop a statistically robust and interpretable explanatory model for maize yield in Northern Italy. Following a thorough baseline analysis and advisor feedback, this new approach prioritizes rigorous model specification and selection over preliminary visualization.

Our objectives are to:
1.  Systematically identify the most important climate stressors from a large set of correlated monthly variables.
2.  Use a regularized Gamma GLM to automatically perform variable selection and handle multicollinearity.
3.  Carefully interpret the coefficients and interactions of the final "champion" model to build a scientifically sound narrative about maize vulnerability.
4.  Refine the final model by testing for non-linear relationships using explicit, interpretable functions (e.g., quadratic terms).

## Theoretical Background

To achieve these goals, we will employ a specific set of statistical techniques:

### Why a Generalized Linear Model (GLM)?
Standard linear models (OLS) assume that the model's errors are normally distributed and the outcome can be any real number. Crop yield, however, is a continuous variable that is always positive and often has a right-skewed distribution. A **Generalized Linear Model (GLM)** is a more flexible framework that allows us to choose a probability distribution that matches the nature of our outcome variable.

### The Gamma Distribution
For this analysis, we will use a **Gamma GLM**. The Gamma distribution is ideal for modeling continuous, strictly positive, and often skewed data like crop yield (measured in tonnes per hectare). We will use a `log` link function, which ensures that the model's predictions are always positive and helps in interpreting the coefficients as multiplicative (percentage) effects.
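
To make the multiplicative reading concrete, here is a minimal sketch of how a log-link coefficient converts to a percentage effect. The helper name `pct_effect` is hypothetical. The example value is the standardized `temperature_Jul` coefficient printed for `alpha_0.01` in the main-effects regularization path later in this notebook.

```python
import math

def pct_effect(coef: float, delta: float = 1.0) -> float:
    """Percent change in the expected outcome for a `delta`-unit increase
    in one predictor under a log link, holding the other predictors fixed."""
    return (math.exp(coef * delta) - 1.0) * 100.0

# Standardized temperature_Jul coefficient from the alpha_0.01 refit.
# The predictor was z-scored, so one unit equals one standard deviation.
effect = pct_effect(-0.156565)
print(f"{effect:.1f}% change in expected yield per +1 SD of July temperature")
```

Because the effect is multiplicative, a coefficient of about -0.157 corresponds to roughly a 14.5% lower expected yield per standard deviation, not a fixed loss in tonnes per hectare.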

### The Challenge of Multicollinearity
Our dataset contains many monthly climate variables that are naturally correlated (e.g., `temperature_May` is correlated with `temperature_Jun`). When multiple predictors are highly correlated, a standard model struggles to disentangle their individual effects, leading to unstable and unreliable coefficient estimates. Our previous VIF analysis confirmed this was a significant issue.
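
As a rough illustration of why correlated months inflate coefficient variance, the sketch below computes the variance inflation factor for the special case of two predictors, where it reduces to 1 / (1 - r^2). The data are synthetic and the function names are hypothetical; with more than two predictors, the VIF uses the R-squared from regressing each predictor on all the others.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x, y):
    """VIF shared by both predictors when they are the only two regressors."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)

# Synthetic, illustrative temperatures (NOT the notebook's data):
# June tracks May closely, as neighbouring months tend to do.
rng = random.Random(0)
temp_may = [17.0 + rng.gauss(0, 1.5) for _ in range(200)]
temp_jun = [t + 4.0 + rng.gauss(0, 0.5) for t in temp_may]

r = pearson_r(temp_may, temp_jun)
vif = vif_two_predictors(temp_may, temp_jun)
print(f"r = {r:.3f}, VIF = {vif:.1f}")
```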

### The Solution: Regularization (Elastic Net)
Instead of manually dropping variables, which can be subjective, we will use **regularization**: a penalty added to the fitting objective that discourages large coefficients. We use the **elastic net** penalty, which mixes an **L1 (Lasso)** term with an L2 (ridge) term. All fits in this notebook use `L1_wt=0.5`, an equal mix; `L1_wt=1` would be a pure Lasso.

*   **How it Works:** The L1 part of the penalty is proportional to the absolute value of the coefficients. A key feature of this term is that it can force the coefficients of less important or redundant variables to become **exactly zero**. The L2 part shrinks correlated predictors together, which stabilizes the selection when predictors are collinear.
*   **Automatic Variable Selection:** For a given penalty strength `alpha`, this process performs an automatic and reproducible form of variable selection, "silencing" the predictors that don't contribute enough unique information.
*   **`fit_regularized(refit=True)`:** We will use the `.fit_regularized()` method in `statsmodels`. With `refit=True`, the process is twofold:
    1.  **Selection:** The model is first fit with the penalty, which zeroes out a subset of coefficients.
    2.  **Refitting:** The model then takes only the "surviving" variables (those with non-zero coefficients) and fits a standard, unpenalized GLM on just them. This removes the shrinkage from the retained coefficients and produces a standard summary table. Because the same data were used to select the variables, the reported standard errors and p-values do not account for the selection step and should be read as optimistic.
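
The code cells in this notebook call `fit_regularized` with `method='elastic_net'` and `L1_wt=0.5`. As I read the statsmodels documentation, the term added to the scaled negative log-likelihood is `alpha * ((1 - L1_wt) * |params|_2^2 / 2 + L1_wt * |params|_1)`. The sketch below evaluates only that penalty term; the function name and the coefficient values are hypothetical.

```python
def elastic_net_penalty(params, alpha, l1_wt):
    """Elastic net penalty: a weighted mix of the L1 norm and half the squared L2 norm."""
    l1 = sum(abs(b) for b in params)
    l2_sq = sum(b * b for b in params)
    return alpha * ((1.0 - l1_wt) * l2_sq / 2.0 + l1_wt * l1)

# Hypothetical coefficient vector, for illustration only.
coefs = [-0.15, 0.04, 0.0]

mixed = elastic_net_penalty(coefs, alpha=0.01, l1_wt=0.5)  # setting used in this notebook
lasso = elastic_net_penalty(coefs, alpha=0.01, l1_wt=1.0)  # pure L1
ridge = elastic_net_penalty(coefs, alpha=0.01, l1_wt=0.0)  # pure L2
print(mixed, lasso, ridge)
```

Only the L1 part can set coefficients exactly to zero, so with `L1_wt=0.5` the selection behaviour is weaker than a pure Lasso at the same `alpha`.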

## Plan of Action

This notebook will proceed in a structured, step-by-step manner:

1.  **Setup & EDA:** We will load the crop-specific, growing-season-filtered maize dataset and import necessary libraries. Our primary EDA will be to compute and visualize a full pairwise correlation matrix of all monthly stressors to understand their relationships.

2.  **Full Model Definition:** We will define a comprehensive "full" model that includes all relevant monthly temperature and precipitation variables, along with their key two- and three-way interactions. This model will also include the `year` trend and spatial splines (`bs(lat) + bs(lon)`) as controls.

3.  **Regularized Model Fitting:** We will fit this complex model using `statsmodels`' `.fit_regularized()` method with an elastic net penalty (`L1_wt=0.5`) over a range of penalty strengths (`alpha`). The L1 component of the penalty automatically selects a subset of variables and interactions at each `alpha`.

4.  **Analysis of the Champion Model:** We will carefully analyze the summary of the refitted "champion" model. We will interpret the coefficients, p-values, and interactions of the variables that "survived" the regularization process to build our primary narrative.

5.  **Refinement with Non-Linear Terms:** Guided by agronomic logic and the results of the champion model, we will test if adding explicit non-linear functions (e.g., quadratic terms `I(variable**2)`) for the most important surviving stressors can further improve the model's fit and interpretation, using AIC for comparison.
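
For the AIC comparison in step 5, a minimal sketch of the calculation is shown here. The log-likelihoods and parameter counts are hypothetical placeholders, not results from this notebook; in practice the fitted `statsmodels` results object exposes the value directly as `.aic`.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2*logL. Lower is better."""
    return 2.0 * n_params - 2.0 * log_likelihood

# Hypothetical values for illustration only.
aic_linear = aic(log_likelihood=-2740.0, n_params=13)     # main effects only
aic_quadratic = aic(log_likelihood=-2726.0, n_params=15)  # plus two quadratic terms

delta = aic_quadratic - aic_linear
print(f"AIC linear = {aic_linear:.1f}, AIC quadratic = {aic_quadratic:.1f}, delta = {delta:.1f}")
```

A negative `delta` favours the model with quadratic terms: the gain in log-likelihood outweighs the penalty of two per extra parameter.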

In [1]:
# Libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler
import warnings

In [None]:
# Exploratory Data Analysis - Full Correlation Matrix

print("--- EDA: Correlation Analysis of All Monthly Stressors ---")

# --- 1. Load the Data ---
# As confirmed, this file is already specific to maize and its growing season.
file_path = '../data-cherry-pick/maize_ITnorth_core42_1982_2016_allstressors_with_monthly.csv'

try:
    df_maize = pd.read_csv(file_path)
    print(f"Successfully loaded dataset from: {file_path}")

    # --- 2. Select Only the Monthly Stressor Variables ---
    # We select every column whose name contains an underscore and does not contain
    # 'yield'. This heuristic relies on the `<variable>_<Month>` naming pattern and
    # also picks up any other underscore-named column, so check the printed count.
    monthly_stressors = [col for col in df_maize.columns if '_' in col and 'yield' not in col]
    df_corr = df_maize[monthly_stressors]
    
    print(f"\nSelected {len(df_corr.columns)} monthly stressor variables for correlation analysis.")

    # --- 3. Calculate and Print the Correlation Matrix ---
    correlation_matrix = df_corr.corr()
    
    # Optional: If you want to see the full numerical matrix, uncomment the next line
    # print("\n--- Full Pairwise Correlation Matrix ---")
    # print(correlation_matrix)

    # --- 4. Visualize the Matrix with a Heatmap ---
    # A heatmap is the best way to see the broad patterns of collinearity.
    print("\nGenerating correlation heatmap...")
    
    plt.figure(figsize=(18, 15))
    heatmap = sns.heatmap(
        correlation_matrix,
        cmap='coolwarm',  # Use a diverging colormap (red=positive, blue=negative)
        center=0,         # Center the colormap at zero
        vmin=-1,          # Set the color scale limits to the theoretical min/max
        vmax=1,
        linewidths=.5,
        annot=False       # Annotations are turned off as the matrix is too large to be readable
    )
    
    plt.title('Pairwise Correlation Matrix of All Monthly Stressors for Maize', fontsize=20)
    plt.show()

except FileNotFoundError:
    print(f"ERROR: File not found at the specified path: {file_path}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

In [9]:
# Full INTERACTION Model

print("--- Full INTERACTION Model ---")

# --- 1. Load and Prepare Data ---
file_path = '../data-cherry-pick/maize_ITnorth_core42_1982_2016_allstressors_with_monthly.csv'
df_maize = pd.read_csv(file_path)
df_maize = df_maize[df_maize['yield_maize'] > 0].copy()
print("Data prepared.")

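# --- Define Variables and Standardize ---
# Every column except the response and the spline inputs (lat, lon) is z-scored,
# including `year`, so the penalty acts on coefficients that share a common scale.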
predictors_to_scale = [col for col in df_maize.columns if col not in ['yield_maize', 'lat', 'lon']]
df_to_scale = df_maize[predictors_to_scale]

scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_to_scale), columns=predictors_to_scale, index=df_maize.index)

df_model_ready = pd.concat([df_maize[['yield_maize', 'lat', 'lon']], df_scaled], axis=1)


# --- 1. Define the FULL INTERACTION Model Formula ---
# This is the complex formula from your advisor's notes.
formula_full_interaction = """
    yield_maize ~ year 
                  + temperature_May * temperature_Jun * temperature_Jul
                  + precipitation_May + precipitation_Jun + precipitation_Jul
                  + bs(lat, df=4) + bs(lon, df=4)
"""

print("\nUsing the 'full interaction' formula for selection:")
print(formula_full_interaction)

# --- 2. Fit the Regularization Path ---
alphas = [0.1, 0.05, 0.01, 0.005, 0.001]
results_path_complex = {}

print("\n--- Fitting Regularization Path (Elastic Net, L1_wt=0.5) ---")
for alpha_val in alphas:
    try:
        glm_complex = smf.glm(
            formula=formula_full_interaction,
            data=df_model_ready, # Use the STANDARDIZED data
            family=sm.families.Gamma(link=sm.families.links.log())
        )
        
        model_reg_complex = glm_complex.fit_regularized(
            method='elastic_net',
            alpha=alpha_val,
            L1_wt=0.5,
            refit=True,
            maxiter=200 # Increase iterations to give the model a better chance to converge
        )
        
        params = pd.Series(model_reg_complex.params, index=model_reg_complex.model.exog_names)
        results_path_complex[f"alpha_{alpha_val}"] = params[params.abs() > 1e-6]

    except Exception as e:
        print(f"Could not fit model for alpha={alpha_val}. Error: {e}")

# --- 3. Display the Results of the Path ---
print("\n--- Regularization Path Results (Full Interaction Model) ---")
print("Showing which variables 'survived' at each penalty level (alpha).")

for alpha_key, params in results_path_complex.items():
    print(f"\n--- {alpha_key} ---")
    stressor_params = params.filter(regex='temp|precip')
    if not stressor_params.empty:
        print(stressor_params.reindex(stressor_params.abs().sort_values(ascending=False).index).to_string())
    else:
        print("All stressor coefficients were shrunk to zero.")


# --- 4. Fit and Summarize the "Champion" Complex Model ---
# We will still choose alpha=0.01 for a direct comparison with the simpler model.
champion_alpha_complex = 0.01
print(f"\n\n--- Detailed Summary for the Complex Model (alpha={champion_alpha_complex}) ---")

try:
    glm_champion_complex = smf.glm(
        formula=formula_full_interaction, 
        data=df_model_ready, 
        family=sm.families.Gamma(link=sm.families.links.log())
    )
    
    model_champion_complex = glm_champion_complex.fit_regularized(
        method='elastic_net',
        alpha=champion_alpha_complex,
        L1_wt=0.5,
        refit=True,
        maxiter=200
    )
    
    print(model_champion_complex.summary())

except Exception as e:
    print(f"An error occurred while fitting the champion complex model: {e}")

--- Full INTERACTION Model ---
Data prepared.

Using the 'full interaction' formula for selection:

    yield_maize ~ year 
                  + temperature_May * temperature_Jun * temperature_Jul
                  + precipitation_May + precipitation_Jun + precipitation_Jul
                  + bs(lat, df=4) + bs(lon, df=4)


--- Fitting Regularization Path (Elastic Net, L1_wt=0.5) ---





--- Regularization Path Results (Full Interaction Model) ---
Showing which variables 'survived' at each penalty level (alpha).

--- alpha_0.1 ---
temperature_Jul   -0.086252

--- alpha_0.05 ---
temperature_Jul                   -0.174719
temperature_May:temperature_Jun   -0.109306

--- alpha_0.01 ---
temperature_Jul                                   -0.145968
temperature_May:temperature_Jun                   -0.122060
temperature_May:temperature_Jun:temperature_Jul   -0.013699

--- alpha_0.005 ---
temperature_Jul                                   -0.184663
temperature_May:temperature_Jul                   -0.088230
temperature_May:temperature_Jun                   -0.045426
temperature_Jun                                   -0.038470
precipitation_Jul                                 -0.027964
precipitation_May                                 -0.018074
temperature_May:temperature_Jun:temperature_Jul   -0.017401
precipitation_Jun                                 -0.013951

--- alpha_0.001

In [None]:
# Regularized Path and Champion Model Summary

print("--- Regularization Path and Champion Model Summary ---")

# --- 1. Load and Prepare Data ---
file_path = '../data-cherry-pick/maize_ITnorth_core42_1982_2016_allstressors_with_monthly.csv'
df_maize = pd.read_csv(file_path)
df_maize = df_maize[df_maize['yield_maize'] > 0].copy()
print("Data prepared.")

# --- 2. Define Variables and Standardize ---
predictors_to_scale = [col for col in df_maize.columns if col not in ['yield_maize', 'lat', 'lon']]
df_to_scale = df_maize[predictors_to_scale]

scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_to_scale), columns=predictors_to_scale, index=df_maize.index)

df_model_ready = pd.concat([df_maize[['yield_maize', 'lat', 'lon']], df_scaled], axis=1)


# --- 3. Define the Model Formula ---
formula_simple = """
    yield_maize ~ year 
                  + temperature_May + temperature_Jun + temperature_Jul + temperature_Aug + temperature_Sep
                  + precipitation_May + precipitation_Jun + precipitation_Jul + precipitation_Aug + precipitation_Sep
                  + soil_water_May + soil_water_Jun + soil_water_Jul + soil_water_Aug + soil_water_Sep
                  + bs(lat, df=4) + bs(lon, df=4)
"""
print("\nUsing the 'main effects' formula for selection.")


# --- 4. Fit the Regularization Path (for high-level overview) ---
alphas = [0.1, 0.05, 0.01, 0.005, 0.001]
results_path = {}

print("\n--- Fitting Regularization Path (Elastic Net, L1_wt=0.5) ---")
for alpha_val in alphas:
    try:
        glm_model = smf.glm(formula=formula_simple, data=df_model_ready, family=sm.families.Gamma(link=sm.families.links.log()))
        model_reg = glm_model.fit_regularized(method='elastic_net', alpha=alpha_val, L1_wt=0.5, refit=True)
        params = pd.Series(model_reg.params, index=model_reg.model.exog_names)
        results_path[f"alpha_{alpha_val}"] = params[params.abs() > 1e-6]
    except Exception as e:
        print(f"Could not fit model for alpha={alpha_val}. Error: {e}")

# --- 5. Display the Results of the Path ---
print("\n--- Regularization Path Results (High-Level View) ---")
print("Showing which stressors 'survived' at each penalty level (alpha).")

for alpha_key, params in results_path.items():
    print(f"\n--- {alpha_key} ---")
    stressor_params = params.filter(regex='temp|precip|soil')
    if not stressor_params.empty:
        print(stressor_params.reindex(stressor_params.abs().sort_values(ascending=False).index).to_string())
    else:
        print("All stressor coefficients were shrunk to zero.")


# --- 6. Fit and Summarize the "Champion" Model ---
# Based on the path, alpha=0.01 seems like a good balance. Let's fit it
# and show the full summary for detailed interpretation.
champion_alpha = 0.01
print(f"\n\n--- Detailed Summary for our Champion Model (alpha={champion_alpha}) ---")

try:
    glm_champion = smf.glm(formula=formula_simple, data=df_model_ready, family=sm.families.Gamma(link=sm.families.links.log()))
    
    model_champion = glm_champion.fit_regularized(
        method='elastic_net',
        alpha=champion_alpha,
        L1_wt=0.5,
        refit=True
    )
    
    print(model_champion.summary())

except Exception as e:
    print(f"An error occurred while fitting the champion model: {e}")

--- Step 3 (Final): Regularization Path and Champion Model Summary ---
Data prepared.

Using the 'main effects' formula for selection.

--- Fitting Regularization Path (Elastic Net, L1_wt=0.5) ---





--- Regularization Path Results (High-Level View) ---
Showing which stressors 'survived' at each penalty level (alpha).

--- alpha_0.1 ---
temperature_Jul   -0.086252

--- alpha_0.05 ---
temperature_Jul   -0.089489

--- alpha_0.01 ---
temperature_Jul   -0.156565
soil_water_Sep     0.038958
soil_water_Jul     0.004207

--- alpha_0.005 ---
temperature_Jul     -0.198160
precipitation_Jul   -0.066454
soil_water_Aug       0.055624
precipitation_Aug   -0.038040
soil_water_Sep       0.022459
soil_water_Jul       0.020979
precipitation_Sep    0.020011
precipitation_May   -0.014178

--- alpha_0.001 ---
temperature_Jul     -0.172572
temperature_Sep     -0.067603
soil_water_Jul       0.061563
soil_water_Aug       0.050745
precipitation_Jul   -0.047415
temperature_Jun     -0.039850
precipitation_Sep    0.031610
temperature_Aug     -0.020012
precipitation_May   -0.012150
precipitation_Aug   -0.008309
soil_water_Sep      -0.007860
precipitation_Jun    0.005515


--- Detailed Summary for our Champio



In [10]:
# The Final Champion Explanatory Model

print("--- Building and Interpreting the Final Champion Model ---")

# --- 1. Load Data (no scaling needed for this final model) ---
file_path = '../data-cherry-pick/maize_ITnorth_core42_1982_2016_allstressors_with_monthly.csv'
df_maize = pd.read_csv(file_path)
df_maize = df_maize[df_maize['yield_maize'] > 0].copy()
print("Data prepared.")

# --- 2. Define and Fit the Champion Model Formula ---
# This formula is the result of our entire diagnostic process.
# It is simple, robust, and focuses on the most important effects.
champion_formula = """
    yield_maize ~ year 
                  + temperature_Jul + I(temperature_Jul**2)
                  + precipitation_Jul + I(precipitation_Jul**2)
                  + temperature_Jul:precipitation_Jul
                  + bs(lat, df=4) + bs(lon, df=4)
"""

print("\nFitting the Champion Model with the formula:")
print(champion_formula)

try:
    champion_model = smf.glm(
        formula=champion_formula,
        data=df_maize,
        family=sm.families.Gamma(link=sm.families.links.log())
    ).fit()

    print("\n--- Summary of the Final Champion Model ---")
    print(champion_model.summary())
    
    # We can also check the AIC as a final performance metric
    print(f"\nFinal Model AIC: {champion_model.aic:,.2f}")

except Exception as e:
    print(f"An error occurred: {e}")

--- Building and Interpreting the Final Champion Model ---
Data prepared.

Fitting the Champion Model with the formula:

    yield_maize ~ year 
                  + temperature_Jul + I(temperature_Jul**2)
                  + precipitation_Jul + I(precipitation_Jul**2)
                  + temperature_Jul:precipitation_Jul
                  + bs(lat, df=4) + bs(lon, df=4)


--- Summary of the Final Champion Model ---
                 Generalized Linear Model Regression Results                  
Dep. Variable:            yield_maize   No. Observations:                 1470
Model:                            GLM   Df Residuals:                     1455
Model Family:                   Gamma   Df Model:                           14
Link Function:                    log   Scale:                        0.025392
Method:                          IRLS   Log-Likelihood:                -2726.3
Date:                Tue, 11 Nov 2025   Deviance:                       37.253
Time:                       



A note on coefficient magnitudes: `I(precipitation_Jul**2)` has a tiny coefficient (-4.578e-06), but the input to this term is precipitation squared. If July precipitation is 100 mm, the input is 10,000, so the term contributes 10000 × -4.578e-06 = -0.04578 on the log-yield scale, which is not negligible.

Likewise, `temperature_Jul:precipitation_Jul` has a small coefficient (-0.0002), but its input is temperature times precipitation (e.g., 25 × 100 = 2500), so the term contributes 2500 × -0.0002 = -0.5 on the log-yield scale. Small coefficients on unscaled squared and product terms can therefore carry large effects.
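
The same arithmetic can be reproduced in a few lines. This is a sketch using the rounded coefficients quoted above and the illustrative inputs of 100 mm and 25 degrees; the variable names are hypothetical. Each value is the contribution of a single term to the linear predictor, not the total effect of the variable.

```python
import math

# Coefficients as quoted in the text (the interaction is rounded as displayed).
b_precip_sq = -4.578e-06
b_interaction = -0.0002

# Illustrative inputs, not observed values.
precip_mm = 100.0
temp_c = 25.0

contrib_sq = b_precip_sq * precip_mm ** 2          # quadratic precipitation term
contrib_int = b_interaction * temp_c * precip_mm   # interaction term

for name, value in [("precip^2", contrib_sq), ("temp x precip", contrib_int)]:
    # exp() converts a log-scale contribution into a multiplicative factor on yield
    print(f"{name}: {value:+.5f} on log scale, factor {math.exp(value):.3f}")
```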

### Analysis of the Final Champion Model

After a rigorous, multi-step process of model selection and refinement, this model represents our final, champion explanatory model for maize yield. It is statistically robust, parsimonious, and provides several key insights.

#### Model Performance and Justification
The model's **Pseudo R-squared of 0.826** is high, indicating that the chosen predictors fit substantially better than an intercept-only model. Note that a GLM pseudo R-squared is a likelihood-based measure, not the share of variance explained in the OLS sense. All key stressor terms are **highly statistically significant** (p < 0.001), although these p-values are conditional on a model structure that was itself chosen using the same data.

This model structure was chosen after a thorough diagnostic process. An initial attempt to model all monthly stressors with complex interactions proved to be statistically unstable due to severe multicollinearity, a finding confirmed by a `ConvergenceWarning` and high Variance Inflation Factors (VIFs). A subsequent regularization pass using an Elastic Net (`fit_regularized`) retained `temperature_Jul` at every penalty level shown in both paths, making it the most consistent predictor. `precipitation_Jul` entered once the penalty was relaxed to `alpha=0.005`, where it had the second-largest stressor coefficient in the main-effects path. Soil-water terms were also retained from `alpha=0.01` onward but are not part of the final specification. Our final model is therefore built on the two July variables, which keeps the coefficients stable and interpretable.

#### Interpretation of Key Climate Effects

The model reveals a complex, multi-layered story about how climate affects maize yield:

1.  **The Dominant Role of July Heat (Non-Linear):**
    The model includes both a linear (`temperature_Jul`) and a quadratic (`I(temperature_Jul**2)`) term for July temperature.
    *   The linear term's coefficient is **positive** (`0.1233`), while the quadratic term's is **negative** (`-0.0039`). This mathematical combination describes an **inverted 'U' shape**. However, because the range of historical July temperatures in Northern Italy falls on the right-hand side of this curve, the practical result is a story of accelerating damage. As temperatures rise into the stressful high-20s, the negative quadratic term begins to dominate, causing yield to decrease at an ever-faster rate.

2.  **The Role of July Precipitation (Non-Linear):**
    Similarly, the model finds a significant quadratic relationship for `precipitation_Jul`.
    *   The linear term is **positive** (`0.0049`) and the quadratic term is **negative** (`-4.578e-06`). This describes a concave crop response curve: the gain from additional rainfall diminishes as rainfall increases and eventually turns negative. Taken on their own, these two coefficients place the turning point near 535 mm (`0.0049 / (2 * 4.578e-06)`); the interaction term described next lowers that turning point as July temperature rises.

3.  **The Critical Heat x Drought Interaction:**
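    The interaction term `temperature_Jul:precipitation_Jul` is **highly significant** (reported as p=0.000, i.e. p < 0.001) with a **negative coefficient** (`-0.0002`), so the two stressors do not act independently. The sign needs careful reading. The marginal effect of July temperature on log-yield is `0.1233 + 2 * (-0.0039) * temperature_Jul + (-0.0002) * precipitation_Jul`, so a negative interaction makes the temperature effect **more negative as precipitation increases**; equivalently, the benefit of additional rainfall shrinks in hotter Julys. A buffering effect, in which a hot July is less damaging when rainfall is adequate, would require a **positive** interaction coefficient. As estimated, the model therefore does not support the classic heat-drought buffering interpretation; the term is better described as heat and rainfall compounding. Because uncentered linear, squared, and product terms are typically strongly correlated with one another, the individual signs should be interpreted cautiously.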
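
The sign logic of the three effects above can be checked by differentiating the linear predictor. The sketch below uses the rounded coefficients quoted in this section (the interaction is displayed only as -0.0002, so the results are approximate); the function names and the evaluation points of 27 degrees, 50 mm and 150 mm are hypothetical.

```python
# Rounded coefficients as quoted in the text.
B_T, B_TT = 0.1233, -0.0039      # temperature_Jul, I(temperature_Jul**2)
B_P, B_PP = 0.0049, -4.578e-06   # precipitation_Jul, I(precipitation_Jul**2)
B_TP = -0.0002                   # temperature_Jul:precipitation_Jul

def slope_temp(temp, precip):
    """d(log yield)/d(temperature_Jul) at the given temperature and precipitation."""
    return B_T + 2.0 * B_TT * temp + B_TP * precip

def slope_precip(temp, precip):
    """d(log yield)/d(precipitation_Jul) at the given temperature and precipitation."""
    return B_P + 2.0 * B_PP * precip + B_TP * temp

# Turning points of each quadratic taken on its own (interaction ignored).
turning_temp = -B_T / (2.0 * B_TT)
turning_precip = -B_P / (2.0 * B_PP)
print(f"turning points: {turning_temp:.1f} degrees, {turning_precip:.0f} mm")

# Temperature slope at 27 degrees under a dry and a wet July (illustrative values).
dry = slope_temp(27.0, 50.0)
wet = slope_temp(27.0, 150.0)
print(f"slope at 27 degrees: dry = {dry:.4f}, wet = {wet:.4f}")
```

With these rounded values the temperature turning point is about 15.8 degrees, well below the high-20s range discussed above, and the temperature slope is more negative in the wet case than in the dry case, which is what a negative interaction coefficient implies.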

#### Final Conclusion for Maize

In summary, our final model indicates that maize yield in Northern Italy is primarily driven by a non-linear vulnerability to July heat. This vulnerability is not fixed: it varies with July precipitation through a significant interaction, whose negative sign implies that, within this model, heat damage is larger rather than smaller in wetter Julys. The crop's response to rainfall itself follows a concave curve with diminishing returns. The model estimates these weather-driven effects while controlling for both long-term trends in production (the `year` term) and smooth geographic variations in baseline yield (the `bs(lat)` and `bs(lon)` terms).