This notebook investigates how the Soil_Type-based features (climatic and geological zones) relate to the forest Cover_Type.

In [1]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

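# Move to the parent directory (assumed to be the project root) so that `src` is importable.
# Re-running this cell moves up one more level, so run it only once per kernel session.
os.chdir("..")
sys.path.append(os.getcwd())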

%matplotlib inline

sns.set_theme(style="whitegrid")


In [10]:
from src.preprocessing import preprocess_pipeline
from src.feature_engineering import engineer_features
from src.plot_utils import display_zone_distributions

X_train, X_val, y_train, y_val, test_df = preprocess_pipeline()

train_df = X_train.copy()
train_df["Cover_Type"] = y_train.values

In [None]:
# zone_columns = [col for col in train_df.columns if "climatic_zone" in col or "geological_zone" in col]
# corr_df = train_df[zone_columns + ["Cover_Type"]].corr()

# plt.figure(figsize=(8, 6))
# sns.heatmap(corr_df, annot=True, cmap="coolwarm", fmt=".2f")
# plt.title("Correlation Heatmap: Climatic and Geological Zones vs Cover_Type")
# plt.show()

In [11]:
climatic_zone_names = ["climatic_zone_2", "climatic_zone_3", "climatic_zone_4", 
                       "climatic_zone_5", "climatic_zone_6", "climatic_zone_7", 
                       "climatic_zone_8"]

geological_zone_names = ["geological_zone_1", "geological_zone_2", "geological_zone_5", "geological_zone_7"]


print("Cover_Type Distribution (%) by Climatic Zone:")
display(display_zone_distributions(train_df, climatic_zone_names) * 100)

print("\nCover_Type Distribution (%) by Geological Zone:")
display(display_zone_distributions(train_df, geological_zone_names) * 100)

Cover_Type Distribution (%) by Climatic Zone:


| Cover_Type | climatic_zone_2 | climatic_zone_3 | climatic_zone_4 | climatic_zone_5 | climatic_zone_6 | climatic_zone_7 | climatic_zone_8 |
|---|---|---|---|---|---|---|---|
| 1 | 0.087566 | 37.5 | 3.090677 | 0.0 | 1.147959 | 36.568799 | 20.464656 |
| 2 | 4.816112 | 62.5 | 34.891443 | 0.0 | 11.479592 | 45.585013 | 2.224419 |
| 3 | 35.055458 | 0.0 | 17.598978 | 3.846154 | 5.357143 | 0.065732 | 0.0 |
| 4 | 38.674839 | 0.0 | 4.342273 | 69.871795 | 43.367347 | 0.0 | 0.0 |
| 5 | 5.370695 | 0.0 | 11.443167 | 0.0 | 24.362245 | 12.280894 | 0.0 |
| 6 | 15.703444 | 0.0 | 28.633461 | 26.282051 | 14.285714 | 1.4461 | 0.0 |
| 7 | 0.291886 | 0.0 | 0.0 | 0.0 | 0.0 | 4.053462 | 77.310924 |



Cover_Type Distribution (%) by Geological Zone:


| Cover_Type | geological_zone_1 | geological_zone_2 | geological_zone_5 | geological_zone_7 |
|---|---|---|---|---|
| 1 | 11.509591 | 55.239064 | 37.5 | 16.358271 |
| 2 | 15.929942 | 34.079349 | 62.5 | 30.528063 |
| 3 | 4.003336 | 0.0 | 0.0 | 11.655499 |
| 4 | 37.447873 | 0.0 | 0.0 | 9.190385 |
| 5 | 15.346122 | 6.256358 | 0.0 | 10.063318 |
| 6 | 15.679733 | 0.050865 | 0.0 | 10.788713 |
| 7 | 0.083403 | 4.374364 | 0.0 | 11.41575 |
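
Each column in the two tables sums to 100, so the values read as the share of each Cover_Type among the samples that belong to a zone. The implementation of `display_zone_distributions` in `src.plot_utils` is not shown in this notebook; the following is only a minimal sketch of how such a table could be computed, assuming the zone columns are 0/1 indicator flags. The function name `zone_cover_distribution` and the toy data are hypothetical.

```python
import pandas as pd


def zone_cover_distribution(df, zone_columns, target="Cover_Type"):
    """Share of each target class among the rows where a zone flag equals 1.

    Hypothetical helper: assumes every zone column is a 0/1 indicator.
    """
    shares = {
        zone: df.loc[df[zone] == 1, target].value_counts(normalize=True)
        for zone in zone_columns
    }
    # Classes that never occur in a zone show up as NaN; report them as 0.
    return pd.DataFrame(shares).fillna(0.0).sort_index()


# Toy data for illustration only (not taken from the real dataset).
toy = pd.DataFrame(
    {
        "zone_a": [1, 1, 0, 0, 1, 0],
        "zone_b": [0, 0, 1, 1, 0, 1],
        "Cover_Type": [1, 1, 2, 2, 2, 1],
    }
)

dist = zone_cover_distribution(toy, ["zone_a", "zone_b"])
print(dist * 100)
```

Normalizing within each zone, rather than across the whole dataset, keeps zones with few samples comparable to zones with many.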


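Based on the distribution tables above:

- **Climatic Zones:**  
  Several zones are concentrated in a few classes. For example, 77.3% of the `climatic_zone_8` samples are Cover_Type 7 and another 20.5% are Cover_Type 1, which makes it a strongly discriminative feature.
  
- **Geological Zones:**  
  `geological_zone_7` is spread across all Cover_Type classes: Cover_Type 2 accounts for 30.5% and Cover_Type 1 for 16.4%, while each of the remaining five classes falls between roughly 9% and 12%. On its own it therefore carries comparatively little discriminative information.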
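
One way to put a number on how uniform a zone's distribution is: the normalized Shannon entropy of its Cover_Type shares, where 1 means perfectly uniform and 0 means all samples fall in a single class. This is a sketch added for illustration, not part of the original analysis; the helper name `normalized_entropy` is hypothetical, and the shares are copied from the two tables above.

```python
import numpy as np


def normalized_entropy(shares):
    """Shannon entropy of a class distribution, divided by log(number of classes)."""
    p = np.asarray(shares, dtype=float)
    p = p / p.sum()          # accepts percentages or fractions
    nonzero = p[p > 0]       # 0 * log(0) is treated as 0
    return float(-(nonzero * np.log(nonzero)).sum() / np.log(p.size))


# Cover_Type shares (%) for classes 1 to 7, copied from the tables above.
climatic_zone_8 = [20.464656, 2.224419, 0.0, 0.0, 0.0, 0.0, 77.310924]
geological_zone_7 = [16.358271, 30.528063, 11.655499, 9.190385,
                     10.063318, 10.788713, 11.41575]

h_cz8 = normalized_entropy(climatic_zone_8)
h_gz7 = normalized_entropy(geological_zone_7)
print(f"climatic_zone_8:   {h_cz8:.2f}")
print(f"geological_zone_7: {h_gz7:.2f}")
```

`geological_zone_7` comes out at roughly 0.95 and `climatic_zone_8` at roughly 0.31. A uniform reference only makes sense if the classes are roughly balanced in `train_df`; otherwise the shares should be compared against the overall class distribution instead.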

**Recommendation:**   
Consider excluding `geological_zone_7` during model training to reduce noise and dimensionality, and check on the validation set that performance does not degrade without it.

To drop `geological_zone_7` from the data before training the model:
```python
# Drop the column from every split so that all of them keep the same feature set.
# errors="ignore" makes the call a no-op for a frame that lacks the column.
X_train = X_train.drop(columns=["geological_zone_7"], errors="ignore")
X_val = X_val.drop(columns=["geological_zone_7"], errors="ignore")
test_df = test_df.drop(columns=["geological_zone_7"], errors="ignore")
```