# Feature Engineering

## Kaggle ML2
## Matteo A. D'Alessandro, Carlo A. Patti

In this notebook we are going to:

- Delete columns whose value is `0` for every observation.
- Delete observations that have a null value in any feature.
- Delete duplicate entries, keeping the first occurrence.
- Check whether any observation is flagged for more than one type within the same category (`Wilderness Area` or `Soil Type`).
- Reduce the number of features by keeping the most important ones.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sys

sys.path.append('../../src')
from dataloader import *

%reload_ext autoreload
%autoreload 2

plots_theme = "plotly_dark"

In [2]:
df = load_train_df(
    PATH = '../../data',
    decode_dummies=False
)

In [3]:
num_features = df.iloc[:,0:10]
cat_features = df.iloc[:, 10:-1]

wild_data, soil_data = cat_features.iloc[:,:4], cat_features.iloc[:,4:]

### Observation Cleaning

An observation may have `Soil Type` or `Wilderness Area` recorded as present for more than one type, or for none at all.

The code below shows whether we have any such observations.

**Checking for Wilderness Area.**

Checking whether any observation is flagged for more than one Wilderness Area at the same time, or for none.

In [4]:
# Count of observations flagged in more than one area
more_count = 0
# Count of observations flagged in no area
none_count = 0

for index, row in wild_data.iterrows():
    # Number of Wilderness Area columns set to 1 for this observation
    total = row.sum()

    # More than one area flagged: increment the count
    if total > 1:
        more_count += 1

    # No area flagged: increment the count
    if total == 0:
        none_count += 1

print('We have ', more_count, ' observations that shows presence in more than 1 Wilderness Area.')
print('We have ', none_count, ' observations that shows no presence in any Wilderness Area.')

We have  0  observations that shows presence in more than 1 Wilderness Area.
We have  0  observations that shows no presence in any Wilderness Area.


**Checking for Soil Type.**

Checking whether any observation is flagged for more than one Soil Type at the same time, or for none.

In [5]:
# Counts of observations flagged in more than one soil type, and in none
more_count = 0
none_count = 0

for index, row in soil_data.iterrows():
    # Number of Soil Type columns set to 1 for this observation
    total = row.sum()

    if total > 1:
        more_count += 1

    if total == 0:
        none_count += 1

print('We have ', more_count, ' observations that shows presence in more than 1 Soil Type Area.')
print('We have ', none_count, ' observations that shows no presence in any Soil Type Area.')

We have  0  observations that shows presence in more than 1 Soil Type Area.
We have  0  observations that shows no presence in any Soil Type Area.
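The same check can be written without a row loop. The sketch below is a minimal vectorized version on a small made-up frame; the helper name `one_hot_violations` and the toy columns are illustrative assumptions, not part of the notebook's `dataloader` module.

```python
import pandas as pd


def one_hot_violations(block: pd.DataFrame) -> dict:
    """Count rows of a 0/1 indicator block that flag several columns or none."""
    row_totals = block.sum(axis=1)  # number of flagged columns per row
    return {
        "more_than_one": int((row_totals > 1).sum()),
        "none": int((row_totals == 0).sum()),
    }


# Toy data (hypothetical): the second row flags two areas, the third flags none
toy = pd.DataFrame({"Area1": [1, 1, 0, 0], "Area2": [0, 1, 0, 1]})
result = one_hot_violations(toy)
print(result)
```

Applied to the notebook's data, the calls would be `one_hot_violations(wild_data)` and `one_hot_violations(soil_data)`.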


#### Handling Missing Values

**Removing observations that have any missing values**

In [6]:
# dropna returns a new DataFrame, so assign the result back
df = df.dropna()

df.shape

(15120, 55)

There are no missing values in the dataset.
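Because `dropna` returns a new DataFrame rather than modifying the original, counting missing cells directly is a more explicit way to confirm a statement like the one above. A minimal sketch on a made-up frame (the `toy` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Missing cells per column, then the overall total
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

# dropna returns a new frame; the original keeps its rows unless reassigned
cleaned = toy.dropna()
print(total_missing, toy.shape, cleaned.shape)
```

On the notebook's data, `df.isna().sum().sum()` equal to zero would confirm that no cell is missing.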

#### Handling Duplicates

In [7]:
# drop_duplicates returns a new DataFrame, so assign the result back
df = df.drop_duplicates(keep='first')

df.shape

(15120, 55)

There are no duplicates in the dataset.
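Duplicates can also be counted before anything is dropped. The sketch below uses a made-up frame (an assumption for illustration) to show `duplicated` next to `drop_duplicates`:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 1, 2], "b": [3, 3, 4]})

# duplicated() marks each repeat of an earlier row; with keep="first"
# the first copy is left unmarked
n_duplicates = int(toy.duplicated(keep="first").sum())
deduplicated = toy.drop_duplicates(keep="first")
print(n_duplicates, deduplicated.shape)
```

A count of zero from `df.duplicated().sum()` would mean there is nothing to remove.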

### Dimensionality Reduction

- We have plenty of observations to train the models, but we also have many features. A large feature set can make algorithms run slowly, make learning harder, and encourage overfitting on the training set and worse performance on the test set.

- As seen in the visualization section, no `Wilderness Area` or `Soil Type` category is without observations. Every feature is present in at least one observation, so we cannot simply delete a feature, since it may carry important information for predicting classes.

- To approach this problem, we need to see how much each feature contributes to predicting classes, and the most direct way to do this is to ask the models themselves.

- Classifiers such as `Extra Trees`, `Random Forest`, `Gradient Boosting` and `AdaBoost` expose an attribute called `feature_importances_`, which shows how important each feature is relative to the others and by how much.

So let's fit all 4 classifiers on the entire dataset and see which features each one found most important for predicting classes.
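As a compact illustration of the idea, the sketch below fits the four classifiers in one loop and gathers their `feature_importances_` into a single DataFrame. It runs on synthetic data from `make_classification`, and the feature names `f0` to `f5` and the small `n_estimators` values are assumptions chosen to keep the example fast; it is not the notebook's dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

# Synthetic stand-in data (assumption): 200 rows, 6 features, 3 classes
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_toy = pd.DataFrame(X_toy, columns=[f"f{i}" for i in range(6)])

models = {
    "ETC": ExtraTreesClassifier(n_estimators=20, random_state=42),
    "RFC": RandomForestClassifier(n_estimators=20, random_state=42),
    "ADB": AdaBoostClassifier(n_estimators=20, random_state=53),
    "GBC": GradientBoostingClassifier(n_estimators=20, random_state=53),
}

# One column of importances per classifier, indexed by feature name
importances = pd.DataFrame(
    {name: model.fit(X_toy, y_toy).feature_importances_
     for name, model in models.items()},
    index=X_toy.columns,
)
print(importances.round(3))
```

Keeping all four columns in one frame makes it easy to sort by any classifier or compare ranks row by row.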

#### Extra-Trees Classifier

In [8]:
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier(random_state = 42)

X = df.iloc[:,:-1]
y = df['Cover_Type']

model.fit(X, y)

# extracting feature importance from model and making a dataframe of it in descending order
ETC_feature_importances = pd.DataFrame(model.feature_importances_, index = X.columns, columns=['ETC']).sort_values('ETC', ascending=False)

model = None

ETC_feature_importances.head(10)

| Feature | ETC |
|---|---|
| Elevation | 0.166172 |
| Horizontal_Distance_To_Roadways | 0.081416 |
| Horizontal_Distance_To_Fire_Points | 0.067628 |
| Horizontal_Distance_To_Hydrology | 0.058656 |
| Wilderness_Area4 | 0.052043 |
| Vertical_Distance_To_Hydrology | 0.049678 |
| Aspect | 0.049159 |
| Hillshade_9am | 0.048314 |
| Hillshade_3pm | 0.045403 |
| Hillshade_Noon | 0.045079 |


#### Random Forest Classifier

In [9]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state = 42)

model.fit(X, y)

# extracting feature importance from model and making a dataframe of it in descending order
RFC_feature_importances = pd.DataFrame(model.feature_importances_, index = X.columns, columns=['RFC']).sort_values('RFC', ascending=False)

model = None

RFC_feature_importances.head(10)

| Feature | RFC |
|---|---|
| Elevation | 0.223542 |
| Horizontal_Distance_To_Roadways | 0.094263 |
| Horizontal_Distance_To_Fire_Points | 0.074507 |
| Horizontal_Distance_To_Hydrology | 0.06182 |
| Vertical_Distance_To_Hydrology | 0.053246 |
| Hillshade_9am | 0.050765 |
| Aspect | 0.047537 |
| Hillshade_3pm | 0.045444 |
| Hillshade_Noon | 0.044717 |
| Wilderness_Area4 | 0.042976 |


#### AdaBoost Classifier

In [10]:
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(random_state = 53)

model.fit(X, y)

# extracting feature importance from model and making a dataframe of it in descending order
ADB_feature_importances = pd.DataFrame(model.feature_importances_, index = X.columns, columns=['ADB']).sort_values('ADB', ascending=False)

model = None

ADB_feature_importances.head(10)

| Feature | ADB |
|---|---|
| Wilderness_Area4 | 0.46 |
| Elevation | 0.3 |
| Horizontal_Distance_To_Fire_Points | 0.08 |
| Vertical_Distance_To_Hydrology | 0.06 |
| Aspect | 0.04 |
| Horizontal_Distance_To_Hydrology | 0.02 |
| Slope | 0.02 |
| Soil_Type4 | 0.02 |
| Soil_Type20 | 0.0 |
| Soil_Type21 | 0.0 |


#### Gradient Boosting Classifier

In [11]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state = 53)

model.fit(X, y)

# extracting feature importance from model and making a dataframe of it in descending order
GBC_feature_importances = pd.DataFrame(model.feature_importances_, index = X.columns, columns=['GBC']).sort_values('GBC', ascending=False)

model = None

GBC_feature_importances.head(10)

| Feature | GBC |
|---|---|
| Elevation | 0.585156 |
| Soil_Type10 | 0.057545 |
| Horizontal_Distance_To_Roadways | 0.047709 |
| Horizontal_Distance_To_Hydrology | 0.04228 |
| Hillshade_9am | 0.039397 |
| Horizontal_Distance_To_Fire_Points | 0.03929 |
| Soil_Type30 | 0.021908 |
| Vertical_Distance_To_Hydrology | 0.020034 |
| Soil_Type4 | 0.016131 |
| Soil_Type12 | 0.01435 |


Each classifier has now given its top features.

- `RFC` and `ETC` show similar results. Some features appear at different ranks, but the differences are small and the importance values are close.

- `GBC` shows broadly similar results, but differs from `RFC` and `ETC` by giving `Elevation` a much larger share (about `58.5%`) and by placing several `Soil_Type` features in its top `10`.

- `ADB` shows a unique and very interesting result. Only its top `8` features receive non-zero importance, with `Wilderness_Area4` ranked highest, followed by `Elevation`, which leads in the other classifiers. This is interesting because `Wilderness_Area4` ranks only fifth in `ETC` (about `5.2%`) and tenth in `RFC` (about `4.3%`), and is absent from the top `10` of `GBC`, which is very different from `ADB`'s score of `46%`.

- `Elevation` is dominant in predicting the class throughout. It is present in the top `10` of all classifiers. `ETC`, `RFC` and `GBC` show it as the most important feature, and `ADB` shows it as the second most important.

- `Hillshade` features appear in the top 10 of every classifier except `ADB`. `ETC` and `RFC` give all three `Hillshade` features similar importance (roughly `4.5%` to `5%`), while `GBC` lists only `Hillshade_9am`, at about `3.9%`.

- In the [Correlation] section of the EDA notebook, we saw that the `Hillshade` features were well correlated with each other, and that other features such as `Slope`, `Aspect`, and `Horizontal and Vertical Distance to Hydrology` also showed high correlation values. They rank highly here as well, meaning that although they are correlated, they still carry useful information for predicting the target variable.

- `Elevation` and the `Vertical and Horizontal Distance to Hydrology` features appear in the top 10 of all classifiers, so they are important features.

- All these classifiers tell us one thing in common: numerical features dominate when it comes to predicting forest cover classes.
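The overlap between the four rankings can be checked mechanically. The sketch below copies the top-10 feature names from the four tables above into sets and intersects them; note that the last two `ADB` entries (`Soil_Type20`, `Soil_Type21`) have zero importance and appear in its top 10 only because of row ordering.

```python
# Top-10 feature names copied from the four importance tables above
top10 = {
    "ETC": {
        "Elevation", "Horizontal_Distance_To_Roadways",
        "Horizontal_Distance_To_Fire_Points", "Horizontal_Distance_To_Hydrology",
        "Wilderness_Area4", "Vertical_Distance_To_Hydrology", "Aspect",
        "Hillshade_9am", "Hillshade_3pm", "Hillshade_Noon",
    },
    "RFC": {
        "Elevation", "Horizontal_Distance_To_Roadways",
        "Horizontal_Distance_To_Fire_Points", "Horizontal_Distance_To_Hydrology",
        "Vertical_Distance_To_Hydrology", "Hillshade_9am", "Aspect",
        "Hillshade_3pm", "Hillshade_Noon", "Wilderness_Area4",
    },
    "ADB": {
        "Wilderness_Area4", "Elevation", "Horizontal_Distance_To_Fire_Points",
        "Vertical_Distance_To_Hydrology", "Aspect",
        "Horizontal_Distance_To_Hydrology", "Slope", "Soil_Type4",
        "Soil_Type20", "Soil_Type21",
    },
    "GBC": {
        "Elevation", "Soil_Type10", "Horizontal_Distance_To_Roadways",
        "Horizontal_Distance_To_Hydrology", "Hillshade_9am",
        "Horizontal_Distance_To_Fire_Points", "Soil_Type30",
        "Vertical_Distance_To_Hydrology", "Soil_Type4", "Soil_Type12",
    },
}

# Features that every classifier places in its top 10
common = set.intersection(*top10.values())
print(sorted(common))
```

Four features survive the intersection: `Elevation`, `Horizontal_Distance_To_Fire_Points`, `Horizontal_Distance_To_Hydrology` and `Vertical_Distance_To_Hydrology`.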

**Comparing the top 20 feature scores from `Random Forest` and `Extra Trees` side by side**

In [12]:
# Select the top 20 features from each model
top20_RFC = RFC_feature_importances.head(20)
top20_ETC = ETC_feature_importances.head(20)

# Merge the top 20 features from both models for side-by-side comparison
# Use 'outer' join to ensure all unique features from both models are included
comparison_df = pd.merge(top20_RFC, top20_ETC, left_index=True, right_index=True, how='outer')

# Fill NaN values with 0, assuming that a missing value indicates the feature wasn't among the top 20 for that model
comparison_df.fillna(0, inplace=True)

# Sort by the sum of both models' scores to see the most important features across both models
comparison_df['Total_Importance'] = comparison_df.sum(axis=1)
comparison_df_sorted = comparison_df.sort_values(by='Total_Importance', ascending=False).drop('Total_Importance', axis=1)

# Display the comparison of the top 20 features
print(comparison_df_sorted.head(20))


                                         RFC       ETC
Elevation                           0.223542  0.166172
Horizontal_Distance_To_Roadways     0.094263  0.081416
Horizontal_Distance_To_Fire_Points  0.074507  0.067628
Horizontal_Distance_To_Hydrology    0.061820  0.058656
Vertical_Distance_To_Hydrology      0.053246  0.049678
Hillshade_9am                       0.050765  0.048314
Aspect                              0.047537  0.049159
Wilderness_Area4                    0.042976  0.052043
Hillshade_3pm                       0.045444  0.045403
Hillshade_Noon                      0.044717  0.045079
Slope                               0.036404  0.041639
Soil_Type10                         0.026358  0.034067
Soil_Type38                         0.021231  0.026985
Soil_Type39                         0.018348  0.023963
Soil_Type3                          0.018547  0.021386
Wilderness_Area3                    0.018334  0.021166
Wilderness_Area1                    0.017784  0.020716
Soil_Type4

In [13]:
top_20_feature_names = [
    "Elevation",
    "Horizontal_Distance_To_Roadways",
    "Horizontal_Distance_To_Fire_Points",
    "Horizontal_Distance_To_Hydrology",
    "Vertical_Distance_To_Hydrology",
    "Hillshade_9am",
    "Aspect",
    "Wilderness_Area4",
    "Hillshade_3pm",
    "Hillshade_Noon",
    "Slope",
    "Soil_Type10",
    "Soil_Type38",
    "Soil_Type39",
    "Soil_Type3",
    "Wilderness_Area3",
    "Wilderness_Area1",
    "Soil_Type4",
    "Soil_Type40",
    "Soil_Type30"
]

In [14]:
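# .copy() makes rs_df an independent DataFrame, which avoids pandas'
# SettingWithCopyWarning when new columns are added to it
rs_df = df[top_20_feature_names].copy()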

### Feature Creation

* **Elevation, Horizontal_Distance_To_Roadways, Horizontal_Distance_To_Fire_Points, Horizontal_Distance_To_Hydrology and Vertical_Distance_To_Hydrology** are the most important features, so let's generate new features from combinations of them.
* **Horizontal_Distance_To_Hydrology and Vertical_Distance_To_Hydrology** measure the same distance along different axes, so a better feature may be the hypotenuse of the two.
* Finally, since we want to use ensemble learning algorithms (such as RandomForest), we need consistent features, so we need to group all **Soil_TypeXX** columns into one column, and likewise for **Wilderness_AreaX**. A random forest trains many small trees, each on a subset of the features, so it is not appropriate to take some of the **Soil_TypeXX** columns without the others.
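The last two points can be sketched on a small made-up frame. The column names `hydro_dist` and the two-column soil block are illustrative assumptions; the grouping step assumes each row has exactly one soil column set to 1, which is what the earlier cleaning check looks for.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), using the notebook's column naming
toy = pd.DataFrame({
    "Horizontal_Distance_To_Hydrology": [3.0, 6.0],
    "Vertical_Distance_To_Hydrology": [4.0, -8.0],
    "Soil_Type1": [1, 0],
    "Soil_Type2": [0, 1],
})

# Straight-line distance to water: hypotenuse of the horizontal and vertical legs
toy["hydro_dist"] = np.hypot(
    toy["Horizontal_Distance_To_Hydrology"],
    toy["Vertical_Distance_To_Hydrology"],
)

# Collapse the one-hot block into a single integer-coded column:
# idxmax returns the name of the column holding the 1 in each row
soil_cols = ["Soil_Type1", "Soil_Type2"]
toy["Soil"] = (
    toy[soil_cols]
    .idxmax(axis=1)
    .str.replace("Soil_Type", "", regex=False)
    .astype(int)
)
print(toy[["hydro_dist", "Soil"]])
```

The same pattern would apply to the `Wilderness_AreaX` columns with the prefix changed.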

In [15]:
 
# Straight-line distance to hydrology (hypotenuse of horizontal and vertical distances)
rs_df['Dist_to_Hydrolody'] = (rs_df['Horizontal_Distance_To_Hydrology']**2 + rs_df['Vertical_Distance_To_Hydrology']**2) ** 0.5

# Elevation minus (_m_) and plus (_p_) each hydrology distance
rs_df['Elev_m_VDH'] = rs_df['Elevation'] - rs_df['Vertical_Distance_To_Hydrology']
rs_df['Elev_p_VDH'] = rs_df['Elevation'] + rs_df['Vertical_Distance_To_Hydrology']
rs_df['Elev_m_HDH'] = rs_df['Elevation'] - rs_df['Horizontal_Distance_To_Hydrology']
rs_df['Elev_p_HDH'] = rs_df['Elevation'] + rs_df['Horizontal_Distance_To_Hydrology']
rs_df['Elev_m_DH'] = rs_df['Elevation'] - rs_df['Dist_to_Hydrolody']
rs_df['Elev_p_DH'] = rs_df['Elevation'] + rs_df['Dist_to_Hydrolody']

# Pairwise sums and differences of the horizontal distances
rs_df['Hydro_p_Fire'] = rs_df['Horizontal_Distance_To_Hydrology'] + rs_df['Horizontal_Distance_To_Fire_Points']
rs_df['Hydro_m_Fire'] = rs_df['Horizontal_Distance_To_Hydrology'] - rs_df['Horizontal_Distance_To_Fire_Points']
rs_df['Hydro_p_Road'] = rs_df['Horizontal_Distance_To_Hydrology'] + rs_df['Horizontal_Distance_To_Roadways']
# The difference gets its own column name so it does not overwrite Hydro_p_Road
rs_df['Hydro_m_Road'] = rs_df['Horizontal_Distance_To_Hydrology'] - rs_df['Horizontal_Distance_To_Roadways']
rs_df['Fire_p_Road'] = rs_df['Horizontal_Distance_To_Fire_Points'] + rs_df['Horizontal_Distance_To_Roadways']
rs_df['Fire_m_Road'] = rs_df['Horizontal_Distance_To_Fire_Points'] - rs_df['Horizontal_Distance_To_Roadways']



In [16]:
# 'Soil' and 'Wilderness_Area' refer to the grouped categorical columns described above
new_features = ['Elevation','Aspect','Slope','Horizontal_Distance_To_Hydrology','Vertical_Distance_To_Hydrology',
    'Horizontal_Distance_To_Roadways','Hillshade_9am','Hillshade_Noon','Hillshade_3pm','Horizontal_Distance_To_Fire_Points',
    'Elev_m_VDH', 'Elev_p_VDH', 'Dist_to_Hydrolody', 'Hydro_p_Fire', 'Hydro_m_Fire', 'Hydro_p_Road', 'Hydro_m_Road',
    'Fire_p_Road', 'Fire_m_Road', 'Soil', 'Wilderness_Area',
    'Elev_m_HDH', 'Elev_p_HDH', 'Elev_m_DH', 'Elev_p_DH']

In [17]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state = 42)

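# Note: X is still the original feature matrix, so this fit does not include
# the engineered columns in rs_df and reproduces the earlier Random Forest importances
model.fit(X, y)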

# extracting feature importance from model and making a dataframe of it in descending order
RFC_feature_importances = pd.DataFrame(model.feature_importances_, index = X.columns, columns=['RFC']).sort_values('RFC', ascending=False)

model = None

RFC_feature_importances.head(20)

| Feature | RFC |
|---|---|
| Elevation | 0.223542 |
| Horizontal_Distance_To_Roadways | 0.094263 |
| Horizontal_Distance_To_Fire_Points | 0.074507 |
| Horizontal_Distance_To_Hydrology | 0.06182 |
| Vertical_Distance_To_Hydrology | 0.053246 |
| Hillshade_9am | 0.050765 |
| Aspect | 0.047537 |
| Hillshade_3pm | 0.045444 |
| Hillshade_Noon | 0.044717 |
| Wilderness_Area4 | 0.042976 |
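The cell above fits on `X`, the original feature matrix. To score engineered columns, the model would need to be fitted on a frame that contains them. The sketch below shows the pattern on synthetic data: the value ranges, the target rule and the name `toy_importances` are assumptions for illustration, so the printed numbers say nothing about the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in columns (assumption), not the competition data
toy = pd.DataFrame({
    "Elevation": rng.uniform(2000, 3500, n),
    "Vertical_Distance_To_Hydrology": rng.uniform(-100, 300, n),
})

# Engineered column built the same way as Elev_m_VDH in the notebook
toy["Elev_m_VDH"] = toy["Elevation"] - toy["Vertical_Distance_To_Hydrology"]

# Synthetic binary target that depends on the engineered quantity
target = (toy["Elev_m_VDH"] > toy["Elev_m_VDH"].median()).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(toy, target)

# Importances indexed by the columns the model was actually fitted on
toy_importances = pd.Series(
    model.feature_importances_, index=toy.columns
).sort_values(ascending=False)
print(toy_importances)
```

In the notebook, the analogous step would be fitting on `rs_df` (with `y` as the target) and indexing the importances by `rs_df.columns`.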
