<a href="https://colab.research.google.com/github/BragatteMAS/BioinfoEstrutural/blob/master/DSZ_Features_selection_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Why select features?
[Tutorial](https://www.youtube.com/watch?v=4RGT2YRHERY&feature=em-uploademail)

*   Uninformative features add noise to the model;
*   Simple models are easier to explain, and we should avoid losing the explainability of our models;
*   Too many features can cause problems such as excessive training times or difficulty putting models into production.

## Load the data

In [None]:
import pandas as pd

In [None]:
colnames = ['preg','plas','pres','skin','test','mass','pedi','age','class']

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv',names=colnames)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
x = df.drop(['class'], axis=1)  # separate the features from the class label
y = df['class']

## Statistics Tests
*   The tests used here are for classification problems
*   `f_classif` is appropriate when the features are numeric and the target variable is categorical
*   `mutual_info_classif` is more suitable when there is no linear dependency between the features and the target variable
*   `f_regression` applies to regression problems
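
To make the difference between the two classification scores concrete, here is a minimal sketch on synthetic data. The arrays `x_demo` and `y_demo` are invented for illustration and are not part of the Pima dataset. The label depends on the magnitude of the first feature rather than its sign, so both classes have the same mean and the F-test finds nothing, while mutual information still detects the dependency.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

# Hypothetical toy data: the label depends on |feature 0|, a non-linear relation.
rng = np.random.default_rng(0)
nonlinear = np.linspace(-3, 3, 1000)   # values symmetric around zero
noise = rng.normal(size=1000)          # unrelated to the label
x_demo = np.column_stack([nonlinear, noise])
y_demo = (np.abs(nonlinear) > 1.5).astype(int)

# The F-test compares class means; mutual information makes no linearity assumption.
f_scores, p_values = f_classif(x_demo, y_demo)
mi_scores = mutual_info_classif(x_demo, y_demo, random_state=0)

print("F-scores:          ", f_scores)
print("Mutual information:", mi_scores)
```

The F-score of the first feature is essentially zero even though that feature fully determines the label, which is the situation where `mutual_info_classif` is the safer choice.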

In [None]:
from sklearn.feature_selection import SelectKBest  # keeps the k highest-scoring features
from sklearn.feature_selection import f_classif, mutual_info_classif  # use mutual_info_classif when the dependency is not linear

In [None]:
# Use a separate name so the imported f_classif function is not overwritten
selector = SelectKBest(score_func=f_classif, k=4)  # score_func is the statistical test used to score each feature

In [None]:
fit = selector.fit(x, y)  # compute the score of every feature

In [None]:
features = fit.transform(x)  # keep only the selected columns

In [None]:
print(features)  # values of the selected features

In [None]:
cols = fit.get_support(indices = True)
df.iloc[:, cols]  # show the selected columns by index

## Apply Chi2
*   Normally used when both the features and the target variable are categorical; scikit-learn's `chi2` requires non-negative feature values
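
As a rough illustration of the categorical case, the sketch below scores one-hot encoded categories with `chi2`. The `toy` table, its column names, and its values are made up for this example. `smoker` lines up perfectly with the label, while `city` is split evenly across both classes.

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Hypothetical categorical data, invented for illustration only.
toy = pd.DataFrame({
    "smoker": ["yes", "yes", "yes", "no", "no", "no", "yes", "no"],
    "city":   ["a",   "b",   "a",   "b",  "a",  "b",  "b",   "a"],
    "sick":   [1,     1,     1,     0,    0,    0,    1,     0],
})

# chi2 needs non-negative numbers, so one-hot encode the categories as 0/1 columns.
x_cat = pd.get_dummies(toy[["smoker", "city"]]).astype(int)
scores, p_values = chi2(x_cat, toy["sick"])

chi2_scores = pd.Series(scores, index=x_cat.columns)
print(chi2_scores.sort_values(ascending=False))
```

Here the two `smoker` columns score 4.0 and the two `city` columns score 0.0, so `SelectKBest(chi2, k=2)` would keep the `smoker` columns.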

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2 

In [None]:
# feature extraction: keep the 4 features with the highest chi2 score
test = SelectKBest(chi2, k=4)

In [None]:
fit = test.fit(x, y)

In [None]:
fit.get_support(indices=True)  # indices of the best features

In [None]:
features = fit.transform(x)  # apply the learned selection

In [None]:
print(features)

In [None]:
cols = fit.get_support(indices = True)
df.iloc[:, cols]  # show the selected columns by index

## Recursive Feature Elimination - RFE
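*   RFE repeatedly fits a model and removes the weakest features until the requested number remains
*   It uses the fitted model (in scikit-learn, its coefficients or feature importances) to identify the attributes, or combination of attributes, that contribute most to performance
*   Processing time can become an issue on larger datasets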
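
The elimination loop can be inspected through the fitted selector's `support_` and `ranking_` attributes. The following is a small sketch on synthetic data generated with `make_classification`; the dataset and the names `x_syn`, `y_syn`, and `rfe_demo` are assumptions made for this example and are unrelated to the Pima data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: 8 features, of which 3 carry signal.
x_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)

# step=1 drops one feature per round until 3 remain.
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe_demo.fit(x_syn, y_syn)

print("Selected mask:", rfe_demo.support_)  # True for the features that were kept
print("Ranking:      ", rfe_demo.ranking_)  # 1 = kept; larger = eliminated earlier
```

Each round refits the model, which is why the cost grows with the number of features and the size of the dataset.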

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression(max_iter=2000)  # the default is 100; raised so the solver converges

In [None]:
from sklearn.feature_selection import RFE

In [None]:
rfe = RFE(model, n_features_to_select=4)  # recent scikit-learn versions require the keyword argument

In [None]:
fit = rfe.fit(x,y)

In [None]:
print(f"Number of features: {fit.n_features_}")

In [None]:
cols = fit.get_support(indices = True)
df.iloc[:, cols]  # show the selected columns by index

## Feature importance

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# feature extraction
model = RandomForestClassifier(n_estimators=10) #ten trees
model.fit(x,y) #training model

In [None]:
print(model.feature_importances_)  # importance score of each feature, in column order

In [None]:
colnames  # feature names (the last entry, 'class', is the target)

In [None]:
import pandas as pd
feature_importances = pd.DataFrame(model.feature_importances_,
                                   index = x.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)   

In [None]:
feature_importances

In [None]:
feature_importances.plot(kind='bar');

## Automating Feature Selection
*   scikit-learn pipelines can chain feature selection and model training into a single estimator

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
# Create a pipeline
clf = Pipeline([
    ('feature_selection', RFE(LogisticRegression(max_iter=2000), n_features_to_select=4)),  # run RFE with logistic regression
    ('classification', RandomForestClassifier())  # classify with a random forest
])

In [None]:
clf.fit(x,y)

In [None]:
clf.steps
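
One practical benefit of keeping the selector inside the pipeline is that cross-validation refits it on each training fold, so the held-out fold never influences which features are chosen. The sketch below shows this on synthetic data; `x_cv`, `y_cv`, and `pipe_demo` are hypothetical names, and the dataset is generated only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical synthetic data standing in for a real dataset.
x_cv, y_cv = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)

pipe_demo = Pipeline([
    ("feature_selection", RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)),
    ("classification", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Selection and classification are both refit inside every fold.
cv_scores = cross_val_score(pipe_demo, x_cv, y_cv, cv=5)
print("Mean accuracy:", cv_scores.mean())

# After fitting on all the data, the chosen features can be read from the named step.
pipe_demo.fit(x_cv, y_cv)
selected_mask = pipe_demo.named_steps["feature_selection"].get_support()
print("Selected mask:", selected_mask)
```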

## Which method should you use?
*   Try RFE first if computational resources are available
*   For classification with numeric features, use `f_classif` or `mutual_info_classif`
*   For regression with numeric features, use `f_regression` or `mutual_info_regression`
*   For categorical features, use `chi2`
*   Automate the steps with pipelines to avoid errors
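
The regression case is not shown elsewhere in this notebook, so here is a brief sketch of `SelectKBest` with `f_regression`. The data comes from `make_regression`, and the names `x_reg`, `y_reg`, and `reg_selector` are assumptions for illustration. `mutual_info_regression` could be passed as `score_func` in the same way.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical synthetic regression data: 6 numeric features, 2 of them informative.
x_reg, y_reg = make_regression(
    n_samples=200, n_features=6, n_informative=2, noise=5.0,
    random_state=0,
)

# Keep the 2 features with the highest F-score against the continuous target.
reg_selector = SelectKBest(score_func=f_regression, k=2)
x_reg_selected = reg_selector.fit_transform(x_reg, y_reg)

print(x_reg_selected.shape)                     # (200, 2)
print(reg_selector.get_support(indices=True))   # indices of the kept columns
```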





