In [1]:
!pip install C:\Users\lucasrc\Documents\Codes\mlutils\dist\mlutils-0.0.1-py3-none-any.whl

Processing c:\users\lucasrc\documents\codes\mlutils\dist\mlutils-0.0.1-py3-none-any.whl
Installing collected packages: mlutils
Successfully installed mlutils-0.0.1




## 1.0 Feature Selection

This tutorial applies each implemented feature selection method to the iris and Boston housing datasets from `sklearn.datasets` and shows the list of relevant features that each method returns.

The feature selection methods are:

Classification methods:

- `feature_selection_filter` -> A filter method built on sklearn.feature_selection that uses chi2 as the scoring function to select the most relevant features.

- `feature_selection_wrapper` -> A wrapper method built on sklearn.feature_selection that uses LogisticRegression as the estimator to select the most important features.

- `feature_selection_embedded` -> An embedded method built on sklearn.feature_selection that uses LightGBM's LGBMClassifier as the estimator to select the most relevant features.


Regression methods:

- `feature_selection_stepwise` -> A stepwise selection procedure built on statsmodels.api that selects relevant features according to their p-values.

- `feature_selection_f_regression` -> A wrapper around sklearn.feature_selection.f_regression, which selects input features for regression models based on the p-value of a univariate F-test.

- `feature_selection_mutual_information` -> A wrapper around sklearn.feature_selection.mutual_info_regression, which selects significant features by their mutual information with the target, that is, by how much knowing the feature reduces the entropy of the target.

Ordering method:

- `ordering_filter` -> This method returns the indexes of the rows that should be dropped, given lower and upper thresholds specified by the data scientist. It sorts the chosen feature and then marks rows for dropping according to where the lower and upper thresholds fall in terms of percentiles.

## 1.1 Import modules

In [3]:
# for classification
from mlutils.feature_engineering import feature_selection_filter
from mlutils.feature_engineering import feature_selection_wrapper
from mlutils.feature_engineering import feature_selection_embedded

# for regression
from mlutils.feature_engineering import feature_selection_stepwise
from mlutils.feature_engineering import feature_selection_f_regression
from mlutils.feature_engineering import feature_selection_mutual_information

# for ordering
from mlutils.feature_engineering import ordering_filter

# Others
from sklearn import datasets
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## 1.2 Gathering the datasets

In [4]:
# for classification
iris_data = datasets.load_iris()
df_iris = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
df_iris['class'] = iris_data.target
df_iris.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [5]:
# for regression
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
boston_data = datasets.load_boston()
df_boston = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
df_boston["MEDV"] = boston_data.target
df_boston.head()

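| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |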


## 1.3 Applying feature engineering to a classification model

In [6]:
feature_selection_filter(df=df_iris, target="class", num_feats=3)

['sepal length (cm)', 'petal length (cm)', 'petal width (cm)']
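For intuition, a chi2-based filter can be approximated with plain scikit-learn. This is a minimal sketch and not the `mlutils` implementation: it assumes `SelectKBest` with `chi2` as the scoring function, applied to the raw (already non-negative) iris measurements. The variable names are illustrative.

```python
import pandas as pd
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2

# Rebuild the iris features so the sketch is self-contained.
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# chi2 requires non-negative inputs; the raw iris measurements already are.
selector = SelectKBest(score_func=chi2, k=3).fit(X, y)

# get_support() returns a boolean mask in the original column order.
filter_selected = X.columns[selector.get_support()].tolist()
print(filter_selected)
```

The filter scores each feature independently of the others, so it is cheap to run but cannot detect features that are only useful in combination.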

In [7]:
feature_selection_wrapper(df=df_iris, target="class", num_feats=3)

['sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
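A wrapper method repeatedly fits an estimator and discards the weakest features. The sketch below assumes recursive feature elimination (`RFE`) around `LogisticRegression` with min-max scaled inputs; `RFE` and `MinMaxScaler` appear in the references, but the text does not confirm that `mlutils` combines them this way, so treat the details as assumptions.

```python
import pandas as pd
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Scaling puts the coefficients on a comparable footing (assumed preprocessing step).
X_scaled = MinMaxScaler().fit_transform(X)

# RFE drops the feature with the smallest coefficient magnitude, refits, and repeats.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_scaled, y)

wrapper_selected = X.columns[rfe.support_].tolist()
print(wrapper_selected)
```

The exact features kept can vary with the scaling and solver settings, so the result may not match the cell above.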

In [8]:
feature_selection_embedded(df_iris, "class", 3, 50)

['petal length (cm)']
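An embedded method reads feature importance from a model fitted on all features. The sketch below assumes scikit-learn's `SelectFromModel` and swaps in `RandomForestClassifier` as a stand-in estimator so that it runs without LightGBM; `LGBMClassifier` could be passed instead if it is installed. Whether `mlutils` works this way is an assumption.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Stand-in tree ensemble; n_estimators is a value chosen for illustration.
model = RandomForestClassifier(n_estimators=50, random_state=0)

# With the default threshold, only features whose importance reaches the mean
# importance are kept, and max_features caps how many of those are returned.
sfm = SelectFromModel(model, max_features=3).fit(X, y)

embedded_selected = X.columns[sfm.get_support()].tolist()
print(embedded_selected)
```

Because the importance threshold and the `max_features` cap both apply, a selector built this way can return fewer features than the requested maximum.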

## 1.4 Applying feature engineering to a regression model

In [9]:
feature_selection_stepwise(df_boston, target="MEDV", threshold_in=0.01, threshold_out=0.05, verbose=False)

['LSTAT', 'RM', 'PTRATIO', 'DIS', 'NOX', 'CHAS', 'B', 'ZN']

In [10]:
feature_selection_f_regression(df_boston, target="MEDV", num_feats=3)

['RM', 'PTRATIO', 'LSTAT']
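A univariate F-test selection can be reproduced directly with scikit-learn. The sketch assumes `SelectKBest` with `f_regression` as the scoring function and uses synthetic data, since `load_boston` is not available in recent scikit-learn releases. The column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: x0 and x1 are linearly related to the target, x2 and x3 are noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x0", "x1", "x2", "x3"])
y = 3.0 * X["x0"] - 2.0 * X["x1"] + rng.normal(scale=0.5, size=200)

# f_regression runs one simple linear regression per feature and returns
# the F-statistic and p-value of each; SelectKBest keeps the k best scores.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)

f_selected = X.columns[selector.get_support()].tolist()
print(f_selected)
```

Because each feature is tested on its own, this score only captures linear relationships with the target.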

In [11]:
feature_selection_mutual_information(df_boston, target="MEDV", num_feats=3)

['INDUS', 'RM', 'LSTAT']
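Mutual information can pick up non-linear dependence that an F-test misses. The sketch below calls `mutual_info_regression` directly on synthetic data in which the target depends on the square of one feature; the data and the top-k ranking step are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: the target is a quadratic function of x0, so the linear
# correlation is close to zero while the statistical dependence is strong.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(-1, 1, size=(500, 3)), columns=["x0", "x1", "x2"])
y = X["x0"] ** 2 + rng.normal(scale=0.05, size=500)

# One non-negative score per feature; larger means more shared information.
scores = pd.Series(
    mutual_info_regression(X, y, random_state=0), index=X.columns
)

# Keep the single highest-scoring feature.
mi_selected = scores.nlargest(1).index.tolist()
print(mi_selected)
```

The estimator is based on nearest neighbours, so fixing `random_state` keeps the scores reproducible between runs.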

## 2.0 Ordering Filter

The main objective of this function is the following: given a DataFrame as input, return the indexes of the rows that might be dropped, using the "lower_percentile" threshold for the lowest values and the "upper_percentile" threshold for the highest values.

In [12]:
df2 = pd.DataFrame(
        np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=["a", "b", "c"]
    )
df2

| | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |


In [13]:
ordering_filter(df2, variables="a", lower_percentile=0.4, upper_percentile=0.1)

[0, 2]
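One reading that is consistent with this output is a quantile cut on each tail. The sketch below is an assumption about the behaviour, not the `mlutils` source: `ordering_filter_sketch` is a hypothetical helper that flags rows below the `lower_percentile` quantile or above the `1 - upper_percentile` quantile.

```python
import numpy as np
import pandas as pd


def ordering_filter_sketch(df, variable, lower_percentile, upper_percentile):
    """Return the indexes of rows falling in the lower or upper tail (hypothetical)."""
    low_cut = df[variable].quantile(lower_percentile)
    high_cut = df[variable].quantile(1 - upper_percentile)
    in_tail = (df[variable] < low_cut) | (df[variable] > high_cut)
    return df.index[in_tail].tolist()


df2 = pd.DataFrame(
    np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=["a", "b", "c"]
)

# For column "a" = [1, 4, 7] the cuts are 3.4 (40th percentile) and
# 6.4 (90th percentile), so rows 0 and 2 fall outside them.
tail_indexes = ordering_filter_sketch(df2, "a", 0.4, 0.1)
print(tail_indexes)  # [0, 2]
```

The returned list can be passed to `df2.drop(index=tail_indexes)` to remove the flagged rows.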

## 3.0 Conclusion

This implementation helps the feature selection process during the development of ML models because it standardizes that process. You can run several different feature selection methods by specifying only basic hyperparameters and the DataFrame to use, which makes it easy to run many tests and find the best set of features for the train/test phase.

## References

[pandas](https://pandas.pydata.org/)


[SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)


[Chi2](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html?highlight=chi2#sklearn.feature_selection.chi2)


[RFE](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html?highlight=rfe#sklearn.feature_selection.RFE)

[f_regression](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html?highlight=f_regre#sklearn.feature_selection.f_regression)



[mutual_info_regression](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html?highlight=mutual_info#sklearn.feature_selection.mutual_info_regression)


[MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html?highlight=minmax#sklearn.preprocessing.MinMaxScaler)


[LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)



[LGBMClassifier](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html)


[statsmodels.api](https://www.statsmodels.org/stable/index.html)