In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

import seaborn as sns

## Feature Importance

The objective of this notebook is to select the most relevant features. Feature selection is known to reduce overfitting, improve accuracy, and shorten training time: once noisy or misleading inputs are removed, the algorithms no longer base decisions on them, which lowers model complexity and computational cost.

There are three general classes of feature selection:
- **Filter**
    - Apply a statistical measure to assign a score to each feature. These measures are usually univariate and consider each feature independently;
- **Wrapper**
    - Treat the selection of a set of features as a search problem, where different combinations of features are prepared, evaluated, and compared with other combinations. A machine learning model scores each group of features based on a previously established error metric;
- **Embedded Methods**
    - Learn which features contribute most to the accuracy of the model while the model is being created, which is often done with regularization methods.
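As a rough illustration of the embedded class, the sketch below fits an L1-regularized linear model (Lasso) and reads the selected features off its non-zero coefficients. The synthetic data, the column names `x0`..`x4`, and `alpha=0.1` are assumptions made for this example only. Lasso is not one of the methods used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data (assumption): only x0 and x1 drive the target
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=[f"x{i}" for i in range(5)])
y = 3 * X["x0"] + 2 * X["x1"] + rng.normal(scale=0.1, size=300)

# The L1 penalty shrinks the coefficients of uninformative features to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
coefs = pd.Series(lasso.coef_, index=X.columns)

# Features kept by the model are those with a non-zero coefficient
embedded_selected = coefs[coefs != 0].index.tolist()
print(embedded_selected)
```

The selection happens as a side effect of training, which is what distinguishes embedded methods from filters (scored before modeling) and wrappers (scored by repeatedly refitting a model).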

This project first applied a filter method, **Pearson Correlation**, to select the most relevant WCD features. A wrapper method, **Recursive Feature Elimination (RFE)**, was then applied to the remaining features using an XGBoost model and a Random Forest ([Fig. 1](#feature_selection_img)).

<a id='feature_selection_img'></a>

<img src="https://imgur.com/hmA31x5.png" width="400" height="550" align="center"/>
<center> <b>Fig. 1</b> - Procedure diagram for feature selection </center>
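The wrapper step can be sketched with scikit-learn's `RFE`, which repeatedly fits an estimator and discards the least important feature until the requested number remains. This is a minimal sketch on synthetic data, not the project's actual configuration: the column names, `n_features_to_select=2`, and the Random Forest settings are assumptions, and only the Random Forest variant is shown.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic data (assumption): only x0 and x1 carry signal
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=[f"x{i}" for i in range(5)])
y = 3 * X["x0"] + 2 * X["x1"] + rng.normal(scale=0.1, size=300)

# RFE drops one feature per iteration (step=1), ranked by the forest's feature_importances_
estimator = RandomForestRegressor(n_estimators=50, random_state=0)
rfe = RFE(estimator, n_features_to_select=2, step=1).fit(X, y)

# support_ is a boolean mask over the columns; ranking_ is 1 for every kept feature
rfe_selected = X.columns[rfe.support_].tolist()
print(rfe_selected)
```

Any estimator exposing `feature_importances_` or `coef_` can take the place of the Random Forest here, including gradient-boosted models.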

**Load DataFrames**

In [5]:
# Civil building
dt_civil_hour = pd.read_csv('Preprocessed_Data/_03_dt_01_civil_hour.csv', index_col=[0], parse_dates=[0], header=0)
dt_civil_day = pd.read_csv('Preprocessed_Data/_03_dt_01_civil_day.csv', index_col=[0], parse_dates=[0], header=0)
dt_civil_week = pd.read_csv('Preprocessed_Data/_03_dt_01_civil_week.csv', index_col=[0], parse_dates=[0], header=0)

# South tower
dt_stw_hour = pd.read_csv('Preprocessed_Data/_03_dt_01_south_tower_hour.csv', index_col=[0], parse_dates=[0], header=0)
dt_stw_day = pd.read_csv('Preprocessed_Data/_03_dt_01_south_tower_day.csv', index_col=[0], parse_dates=[0], header=0)
dt_stw_week = pd.read_csv('Preprocessed_Data/_03_dt_01_south_tower_week.csv', index_col=[0], parse_dates=[0], header=0)

### 1.1 Pearson's Correlation

The Pearson correlation coefficient summarizes the strength of the linear relationship between two data samples. It is calculated as the **covariance** of the two variables divided by the product of their standard deviations.

Given two samples X and Y, Pearson's correlation is: `covariance(X, Y) / (stdv(X) * stdv(Y))`
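As a quick check of the definition, the sketch below computes the coefficient by hand from the covariance and the standard deviations and compares it with NumPy's `np.corrcoef`. The two samples are synthetic and exist only for this illustration.

```python
import numpy as np

# Two synthetic samples (assumption): y depends linearly on x, plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# Sample covariance divided by the product of the sample standard deviations.
# np.cov normalizes by n - 1 by default, so np.std uses ddof=1 to match.
cov_xy = np.cov(x, y)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Reference value from NumPy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(np.isclose(r_manual, r_numpy))
```

The parentheses around the denominator matter: dividing by `stdv(X)` and then multiplying by `stdv(Y)` gives a different quantity that is not bounded by -1 and 1.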

In [56]:
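def pearson_corr(df):
    """Return the absolute Pearson correlation of each feature with the building's target column."""

    # The target column is named after the building: 'civil' or 'south_tower'
    if 'civil' in df.columns:
        building = 'civil'
        name = 'civil'
    else:
        building = 'south_tower'
        name = 'stw'

    # Correlate every column with the target, then drop the target itself and the miss_* columns
    corr = df.corr()[building].abs()
    corr = corr.drop([building, 'miss_wt', 'miss_' + building])
    return corr.to_frame(name=name + '_corr')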

In [57]:
a = pearson_corr(dt_civil_hour)

In [58]:
a

| feature | civil_corr |
|---|---|
| wt_temp | 0.174258 |
| wt_tmpap | 0.110934 |
| wt_hr | 0.225638 |
| wt_max_windgust | 0.110871 |
| wt_mean_windspd | 0.094486 |
| wt_mean_pres | 0.041384 |
| wt_mean_solarrad | 0.454649 |
| wt_rain_day | 0.096881 |
| t_hour | 0.232984 |
| t_month | 0.060504 |
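One way to turn these scores into a filter step is to keep only the features whose absolute correlation reaches a cut-off. The sketch below does this for the hourly civil building scores shown above. The `threshold` of 0.1 is a hypothetical value chosen for illustration, since the notebook does not state which cut-off was used.

```python
import pandas as pd

# Absolute correlations with 'civil', copied from the output above
civil_corr = pd.Series({
    'wt_temp': 0.174258, 'wt_tmpap': 0.110934, 'wt_hr': 0.225638,
    'wt_max_windgust': 0.110871, 'wt_mean_windspd': 0.094486,
    'wt_mean_pres': 0.041384, 'wt_mean_solarrad': 0.454649,
    'wt_rain_day': 0.096881, 't_hour': 0.232984, 't_month': 0.060504,
}, name='civil_corr')

threshold = 0.1  # hypothetical cut-off, not taken from the notebook

# Keep features at or above the cut-off, strongest first
filter_selected = civil_corr[civil_corr >= threshold].sort_values(ascending=False)
print(filter_selected)
```

With this cut-off, six of the ten features remain, led by `wt_mean_solarrad`. A different threshold would change the set passed on to the wrapper step.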


### 1.2 Shapiro Feature Importance

**------------------------------------ Work in Progress ------------------------------------**