# Multicollinearity Analysis for Environmental Modelling

## 1. DTM and Slope 

In environmental modelling, multicollinearity between predictor variables can introduce redundancy and distort the results of statistical analyses and machine learning models. Slope and Digital Terrain Model (DTM) elevation data are often derived from the same terrain source, which can result in a high degree of correlation. This analysis aims to determine the extent of multicollinearity between these two variables to guide their inclusion in species distribution models (SDMs).

The analysis calculates the Pearson correlation coefficient to assess the linear relationship between the two variables and uses the Variance Inflation Factor (VIF) to quantify multicollinearity. These metrics inform whether both predictors should be retained or if one should be excluded from the model to ensure robustness.
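
As a minimal sketch of how the two metrics relate, the snippet below computes the Pearson coefficient from its definition and then the VIF for a pair of predictors, which reduces to 1 / (1 - r²) when an intercept is included. The function names `pearson_r` and `two_predictor_vif` and the toy values are hypothetical and are not taken from the rasters used in this analysis.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def two_predictor_vif(x, y):
    """Centred VIF for a pair of predictors: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical toy values for illustration only
elevation = [100.0, 250.0, 400.0, 620.0, 900.0]
slope = [2.0, 9.0, 6.0, 18.0, 21.0]

r = pearson_r(elevation, slope)
vif = two_predictor_vif(elevation, slope)
print(f"r = {r:.3f}, VIF = {vif:.3f}")
```

With only two predictors the VIF is the same for both variables, because each auxiliary regression has the same squared correlation.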


In [1]:
import rasterio
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Define file paths for the DTM and slope rasters
dtm_path = "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/DTM/DTM_30.tif"
slope_path = "C:/GIS_Course/MScThesis-MaviSantarelli/data/Predictors/Slope/Slope_30.tif"

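# Function to read raster data and return a flattened float array of values
def read_raster_as_array(raster_path):
    with rasterio.open(raster_path) as src:
        # Read the first band as float so NaN can be stored even for integer rasters
        data = src.read(1).astype("float64")
        if src.nodata is not None:  # Some rasters do not declare a nodata value
            data[data == src.nodata] = np.nan  # Replace nodata values with NaN
    return data.flatten()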

# Load raster data
print("Loading raster data...")
dtm_values = read_raster_as_array(dtm_path)
slope_values = read_raster_as_array(slope_path)

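# Remove NaN values from both arrays (both rasters must share the same grid)
print("Cleaning data...")
if dtm_values.shape != slope_values.shape:
    raise ValueError("DTM and slope rasters must have the same dimensions.")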
valid_mask = ~np.isnan(dtm_values) & ~np.isnan(slope_values)
dtm_clean = dtm_values[valid_mask]
slope_clean = slope_values[valid_mask]

# Combine into a DataFrame for analysis
data = pd.DataFrame({
    "DTM": dtm_clean,
    "Slope": slope_clean
})

# Calculate Pearson correlation coefficient
correlation = data.corr().loc["DTM", "Slope"]
print(f"Pearson correlation coefficient between DTM and Slope: {correlation:.2f}")

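# Calculate Variance Inflation Factor (VIF)
# Note: no intercept column is added to X, so each auxiliary regression is
# fitted through the origin and the values below are uncentred VIFs.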
def calculate_vif(df):
    X = df.values
    vif_data = pd.DataFrame()
    vif_data["Variable"] = df.columns
    vif_data["VIF"] = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
    return vif_data

vif = calculate_vif(data)
print("\nVariance Inflation Factor (VIF):")
print(vif)

# Interpretation
if abs(correlation) > 0.7 or vif["VIF"].max() > 5:
    print("\nHigh multicollinearity detected. Consider including only one variable (either DTM or Slope) in the model.")
else:
    print("\nLow multicollinearity detected. Both variables (DTM and Slope) can be included in the model.")

Loading raster data...
Cleaning data...
Pearson correlation coefficient between DTM and Slope: 0.73

Variance Inflation Factor (VIF):
  Variable       VIF
0      DTM  2.766431
1    Slope  2.766439

High multicollinearity detected. Consider including only one variable (either DTM or Slope) in the model.



### Results

### Pearson Correlation Coefficient
The Pearson correlation coefficient between DTM and Slope is **0.73**, which exceeds the 0.7 threshold used in the code and indicates a strong positive linear relationship. Slope is computed from the local rate of change of elevation rather than from elevation itself, so the correlation is not a mathematical necessity; it reflects terrain in which higher ground tends to be steeper.
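
To illustrate how slope depends on elevation differences rather than on elevation itself, here is a minimal sketch that estimates slope with simple central differences on a small grid. This is an assumption made for illustration: the method used to produce `Slope_30.tif` is not stated in the source, and GIS tools often use other schemes. The function `slope_degrees`, the 30 m cell size, and the tilted-plane grid are all hypothetical.

```python
import math


def slope_degrees(dem, cell_size):
    """Slope in degrees at the interior cells of a 2D grid (list of rows),
    estimated with central differences. Edge cells are skipped."""
    rows, cols = len(dem), len(dem[0])
    result = []
    for i in range(1, rows - 1):
        row = []
        for j in range(1, cols - 1):
            # Rate of change of elevation along each axis
            dz_dx = (dem[i][j + 1] - dem[i][j - 1]) / (2 * cell_size)
            dz_dy = (dem[i + 1][j] - dem[i - 1][j]) / (2 * cell_size)
            row.append(math.degrees(math.atan(math.hypot(dz_dx, dz_dy))))
        result.append(row)
    return result


# Hypothetical tilted plane: elevation rises 30 m per 30 m cell eastwards
cell_size = 30.0
plane = [[100.0 + 30.0 * j for j in range(5)] for _ in range(5)]
slopes = slope_degrees(plane, cell_size)
print(slopes)
```

On this plane the elevation changes from cell to cell while the slope stays the same everywhere, so any correlation between the two layers comes from the shape of the real terrain and not from the slope calculation.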

### Variance Inflation Factor (VIF)
The calculated VIF values for both variables are:
- **DTM:** 2.77
- **Slope:** 2.77

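Both values are below the common threshold of 5, so by the VIF criterion alone the collinearity is moderate rather than severe. A VIF measures how much the variance of a coefficient estimate is inflated by correlation with the other predictors; a value of 1 indicates no inflation. Note that `variance_inflation_factor` is called on the raw predictors without an intercept column, so these are uncentred VIFs. With only two predictors, the centred VIF equals 1 / (1 - r²), which for r = 0.73 is roughly 2.1.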
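
The difference between a VIF computed with and without an intercept can be sketched for the two-predictor case in plain Python. The function names `centred_vif` and `uncentred_vif` and the toy values are hypothetical; the uncentred version is intended to mirror an auxiliary regression fitted through the origin, which is what happens when no constant column is supplied.

```python
def centred_vif(x, y):
    """VIF from a regression of x on y with an intercept: 1 / (1 - r^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r_squared = sxy ** 2 / (sxx * syy)
    return 1.0 / (1.0 - r_squared)


def uncentred_vif(x, y):
    """VIF from a regression of x on y through the origin (no intercept),
    using the uncentred R^2 = (sum xy)^2 / (sum x^2 * sum y^2)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    r_squared = sxy ** 2 / (sxx * syy)
    return 1.0 / (1.0 - r_squared)


# Hypothetical toy values for illustration only
elevation = [100.0, 250.0, 400.0, 620.0, 900.0]
slope = [2.0, 9.0, 6.0, 18.0, 21.0]
# Adding a constant offset to elevation does not change its correlation with slope
shifted = [e + 1000.0 for e in elevation]

print("centred:  ", centred_vif(elevation, slope), centred_vif(shifted, slope))
print("uncentred:", uncentred_vif(elevation, slope), uncentred_vif(shifted, slope))
```

The centred value is unaffected by the offset, whereas the uncentred value changes with it. Adding a constant column to the predictors before computing VIF is therefore a reasonable option when the result should depend only on how the variables co-vary.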

### Interpretation
The correlation exceeds the 0.7 threshold while the VIF values stay below 5, so the two criteria do not fully agree; the code flags high multicollinearity because it requires only one of them to be met. On that basis, retaining both variables is unlikely to add much predictive power and could introduce redundancy, complicate model interpretation, and inflate standard errors.

### Why This Analysis is Interesting
Applying multicollinearity analysis to slope and DTM is particularly relevant because:
1. Both are derived from the same terrain data, so a relationship between them is expected and can still affect model performance.
2. Environmental predictors in SDMs benefit from being close to orthogonal (uncorrelated), as this improves model stability and interpretability.
3. Deciding whether to include slope, DTM, or both keeps the model streamlined, reduces computational cost, and focuses it on predictors that add unique explanatory value.

This analysis underscores the importance of evaluating predictor variables for multicollinearity to enhance the reliability of species distribution models.