## Kaggle ML2
## Matteo A. D'Alessandro, Carlo A. Patti

For basic statistics and visualizations check the profile_report.html file in ../assets

The study area includes four wilderness areas located in the Roosevelt National Forest of Northern Colorado. Each observation is a 30m x 30m patch. There are seven forest cover types:

1. Spruce/Fir
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz

# Data Fields
- **Elevation** elevation in meters
- **Aspect** aspect in degrees azimuth
- **Slope** slope in degrees
- **Horizontal_Distance_To_Hydrology** Horz dist to nearest surface water features
- **Vertical_Distance_To_Hydrology** Vert dist to nearest surface water features
- **Horizontal_Distance_To_Roadways** Horz dist to nearest roadway
- **Hillshade_9am** (0 to 255 index) hillshade index at 9am, summer solstice
- **Hillshade_Noon** (0 to 255 index) hillshade index at noon, summer solstice
- **Hillshade_3pm** (0 to 255 index) hillshade index at 3pm, summer solstice
- **Horizontal_Distance_To_Fire_Points** Horz dist to nearest wildfire ignition points
- **Wilderness_Area** (4 binary cols, 0=abs or 1=pres) wilderness area designation
- **Soil_Type** (40 binary cols, 0=abs or 1=pres) soil type designation
- **Cover_Type** (7 types) forest cover type designation

The wilderness areas are:
1. Rawah
2. Neota
3. Comanche Peak
4. Cache la Poudre

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
from ydata_profiling import ProfileReport
import plotly.express as px

sys.path.append('../../src')
from dataloader import *

%reload_ext autoreload
%autoreload 2

plots_theme = "plotly_dark"

In [2]:
df = load_train_df(
    PATH = '../../data',
    decode_dummies=False
)

In [3]:
df.head()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
0,2881.0,130.0,22.0,210.0,54.0,1020.0,250.0,221.0,88.0,342.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,3005.0,351.0,14.0,242.0,-16.0,1371.0,194.0,215.0,159.0,842.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,3226.0,63.0,14.0,618.0,2.0,1092.0,232.0,210.0,107.0,2018.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,3298.0,317.0,8.0,661.0,60.0,752.0,198.0,233.0,174.0,1248.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,3080.0,35.0,6.0,175.0,26.0,3705.0,219.0,227.0,144.0,2673.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15120 entries, 0 to 15119
Data columns (total 55 columns):
 #   Column                              Non-Null Count  Dtype   
---  ------                              --------------  -----   
 0   Elevation                           15120 non-null  float64 
 1   Aspect                              15120 non-null  float64 
 2   Slope                               15120 non-null  float64 
 3   Horizontal_Distance_To_Hydrology    15120 non-null  float64 
 4   Vertical_Distance_To_Hydrology      15120 non-null  float64 
 5   Horizontal_Distance_To_Roadways     15120 non-null  float64 
 6   Hillshade_9am                       15120 non-null  float64 
 7   Hillshade_Noon                      15120 non-null  float64 
 8   Hillshade_3pm                       15120 non-null  float64 
 9   Horizontal_Distance_To_Fire_Points  15120 non-null  float64 
 10  Wilderness_Area1                    15120 non-null  float64 
 11  Wilderness_Area2            

In [5]:
print("The number of features is" ,df.shape[1] - 1)
print("The number of samples is" ,df.shape[0])

The number of features is 54
The number of samples is 15120


- Our dataset has **54** features and **1** target variable, `Cover_Type`. 
- From 54 features, 10 are numeric and 44 are categorical.
- From 44 categorical, 40 are `Soil_Type` and 4 of `Wilderness_Area`
- These are the following forest cover types in target variable `Cover_Type`:
    1. Spruce/Fir
    2. Lodgepole Pine
    3. Ponderosa Pine
    4. Cottonwood/Willow
    5. Aspen
    6. Douglas-fir
    7. Krummholz

# Data Exploration
## Feature Statistics
- Part 1. Describe **numerical features**
- Part 2. Describe **binary/categorical features**

In [6]:
# extract all numerical features from train
num_features = df.iloc[:,1:11]

# extract all binary features from train
cat_features = df.iloc[:, 11:-1]

### Section 1: Quantitative Analysis of Feature Characteristics

In [8]:
num_features.describe()

Unnamed: 0,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area1
count,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0
mean,155.834524,16.556746,228.376521,51.311706,1717.977712,213.028836,218.865741,134.477116,1527.357804,0.235979
std,109.745372,8.534602,209.196381,61.520488,1330.26345,30.638406,22.797288,46.070054,1116.636997,0.424623
min,0.0,0.0,0.0,-135.0,0.0,52.0,99.0,0.0,0.0,0.0
25%,65.0,10.0,67.0,5.0,760.0,197.0,207.0,106.0,750.0,0.0
50%,125.0,15.0,180.0,32.0,1315.0,220.0,223.0,138.0,1266.0,0.0
75%,257.0,22.0,330.0,80.0,2292.0,236.0,235.0,166.0,2002.0,0.0
max,360.0,50.0,1376.0,570.0,6803.0,254.0,254.0,251.0,7095.0,1.0




#### Statistical Distribution Overview
- The **average value** across various attributes spans a broad spectrum, with the mean of individual features ranging between 16 and 2749.
- Variability, as indicated by the **standard deviation (std)**, exhibits notable disparities among features. Specifically, `Horizontal_Distance_To_Roadways` demonstrates the greatest variance, succeeded by `Horizontal_Distance_To_Fire_Points` and `Elevation` in terms of data dispersion.
- The attribute `Slope`, alongside the trio of `Hillshade` metrics, manifests the closest alignment to their respective mean values, indicating a high density around the mean.
- The **minimum values** across the dataset predominantly anchor at 0, with the exceptions being `Elevation` and `Vertical_Distance_To_Hydrology`. The former records the highest minimum threshold, whereas the latter incorporates negative values.
- Excluding `Hillshade_3pm`, the `Hillshade` variables share a proximate **maximum value**.
- Among all features, `Horizontal_Distance_To_Fire_Points` secures the highest maximum value, closely followed by `Horizontal_Distance_To_Roadways`. These attributes also delineate the upper echelons in terms of overall range.
- `Slope` is distinguished by the lowest maximum value and range within the dataset, with the `Aspect` feature marginally exceeding in similar aspects.

#### Measurement Units and Implications
- It is pertinent to acknowledge the units of measurement as a contributing factor to the observed statistical distribution. Specifically, five out of the ten evaluated variables are quantified in meters, encompassing `Elevation`, `Horizontal_Distance_To_Hydrology`, `Vertical_Distance_To_Hydrology`, `Horizontal_Distance_To_Roadways`, and `Horizontal_Distance_To_Fire_Points`. This metric basis rationalizes the elevated figures and expansive ranges noted in these variables.
- Conversely, features such as `Aspect` and `Slope` are calibrated in degrees, inherently capping their maximum feasible values at 360. Similarly, the `Hillshade` attributes are constrained to a maximum potential value of 255, adhering to their distinct measurement scale.


### Part 2. Describe categorical features

In [9]:
cat_features.describe()

Unnamed: 0,Wilderness_Area2,Wilderness_Area3,Wilderness_Area4,Soil_Type1,Soil_Type2,Soil_Type3,Soil_Type4,Soil_Type5,Soil_Type6,Soil_Type7,...,Soil_Type31,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40
count,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,...,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0
mean,0.037632,0.416799,0.30959,0.022421,0.041468,0.066534,0.055489,0.011971,0.044907,6.6e-05,...,0.020106,0.043849,0.040939,0.00119,0.006812,0.000926,0.002116,0.049206,0.041931,0.030159
std,0.190312,0.493045,0.46234,0.148052,0.199377,0.249222,0.228941,0.108758,0.207108,0.008133,...,0.140367,0.204766,0.198156,0.034484,0.082257,0.030416,0.045957,0.216306,0.200439,0.17103
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


- Categorical features, represented as binary variables (0 or 1), reveal significant insights through their **means**:
    - **Prevalence in Data**: `Wilderness_Area3` and `Wilderness_Area4` exhibit the highest means, indicating their predominant presence within the dataset. Conversely, `Wilderness_Area2` shows the least presence.
    - **Exclusive Observations**: The cumulative mean of all `Wilderness_Area` categories approximates 1, suggesting that observations exclusively belong to a single Wilderness Area category.
- **Probability Distribution**:
    - `Wilderness_Area3` holds the highest probability of occurrence at 42.0%, followed by `Wilderness_Area4` at 30.9%. Refer to **Barplot #2** in the *Feature Visualization Section* for a detailed distribution view.
    - A similar probability analysis applies to `Soil_Types`, detailed in **Barplot #3**.

Given the disparity in distribution across categorical features, feature scaling is recommended to normalize feature ranges between 0 and 1. This standardization is crucial as some algorithms may yield skewed results due to sensitivity to higher value ranges, whereas others may not be affected.


### Feature Skewness Insights

- **Ideal Distribution**: In a perfectly normal distribution, skewness is expected to be zero, indicating a balanced dataset.
- **Skewness Interpretation**:
    - **Negative Skewness**: Indicates leftward skew, where the left tail of the distribution extends longer than the right, suggesting the bulk of data values lie to the right of the mean.
    - **Positive Skewness**: Signifies rightward skew, characterized by a longer right tail, indicating that most data values are concentrated to the left of the mean.

In [10]:
skew = df.skew()
skew_df = pd.DataFrame(skew, index=None, columns=['Skewness'])

  skew = df.skew()


In [11]:
print(skew)

Elevation                               0.074424
Aspect                                  0.466449
Slope                                   0.532567
Horizontal_Distance_To_Hydrology        1.438858
Vertical_Distance_To_Hydrology          1.509920
Horizontal_Distance_To_Roadways         1.247749
Hillshade_9am                          -1.075491
Hillshade_Noon                         -0.942747
Hillshade_3pm                          -0.353418
Horizontal_Distance_To_Fire_Points      1.651684
Wilderness_Area1                        1.243720
Wilderness_Area2                        4.859704
Wilderness_Area3                        0.337543
Wilderness_Area4                        0.823789
Soil_Type1                              6.452361
Soil_Type2                              4.600249
Soil_Type3                              3.479008
Soil_Type4                              3.883709
Soil_Type5                              8.975746
Soil_Type6                              4.395326
Soil_Type7          