<h1 style="color:black">Feature Engineering</h1>


<h4 style="line-height:1.5;background-color:antiquewhite;padding:20px;border:1px solid black">Feature Engineering is the process of selecting the features in your data that contribute most to the prediction
variable<br>and creating new variables from the existing ones</h4>

<h4 style="color:red;border:2px solid black;padding:10px">Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression</h4>

__The objectives of feature engineering are:__

- __Reduces Overfitting__: Less redundant data means less opportunity to make decisions based on noise
- __Improves Accuracy__: Less misleading data improves modelling accuracy
- __Reduces Training Time__: Less data means less training time required

<h1 style="color:limegreen">1 . Univariate Selection</h1>

- One common feature selection method that is used with text data is the __Chi-Square Feature Selection__
- The __Chi-Square test__ is used in statistics to test the independence of two events
- More specifically, in feature selection we use it to test __whether the occurrence of a specific term and the occurrence<br>
    of a specific class are independent__

<div style="background-color:beige;padding:20px;border:1px solid black">
<h5>More formally, given a document D, we estimate the following Chi-Square values and rank them by their scores</h5>
<br>


<img src="images\\Feature_Engineering_D1.png" alt="image" width="400px" style="margin-left:20%;mix-blend-mode:multiply">

Where

- __N__ is the observed frequency in D and __E__ the expected frequency
- e<sub>t</sub> takes the value 1 if the document contains term t and 0 otherwise
- e<sub>c</sub> takes the value 1 if the document is in class c and 0 otherwise
    
</div>

- For each feature/term, a __high Chi-Square score indicates that the null hypothesis H<sub>0</sub> of independence__ <br>
    (meaning that the feature has no influence on the dependent variable) __should be rejected__, so the occurrence of the feature<br>
    and the class are dependent
- __A high Chi-Square value suggests that the feature is useful in predicting the class variable__
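The statistic described above can be computed by hand for a single term and class. The sketch below is illustrative only: the counts are hypothetical, and the helper name `chi_square_2x2` is not part of any library. It derives each expected frequency from the row and column totals under the independence assumption, then sums the squared deviations.

```python
def chi_square_2x2(observed):
    """Chi-square statistic for a 2x2 table of counts.

    observed[e_t][e_c] is the number of documents with term indicator e_t
    (0 = term absent, 1 = term present) and class indicator e_c.
    """
    total = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]

    stat = 0.0
    for e_t in range(2):
        for e_c in range(2):
            # Expected count if term and class were independent
            expected = row_totals[e_t] * col_totals[e_c] / total
            stat += (observed[e_t][e_c] - expected) ** 2 / expected
    return stat


# Hypothetical counts for one term and one class (100 documents in total)
observed = [
    [30, 10],  # term absent:  30 documents outside the class, 10 inside
    [20, 40],  # term present: 20 documents outside the class, 40 inside
]

score = chi_square_2x2(observed)
print(round(score, 4))  # 16.6667
```

With these counts every expected cell differs from the observed one by 10, so the score is well above zero and the term would rank as informative for the class.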

<h1 style="color:limegreen">2 . Recursive Feature Elimination</h1>

- Recursive Feature Elimination (RFE) __works by recursively removing attributes and building a model on those
  attributes that remain__
- It __uses the fitted model__ (in scikit-learn, its coefficients or feature importances) to identify which attributes (and combinations of attributes) __contribute the most to predicting
    the target attribute__
- __RFE is commonly paired with the logistic regression__ algorithm to select the top features, but it works with any estimator that exposes coefficients or feature importances
- The choice of algorithm does not matter too much as long as it is skillful and consistent
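The elimination loop can be written out by hand to show what happens at each step. This is a minimal sketch under stated assumptions, not scikit-learn's own implementation: the data is synthetic (`make_classification`), the target of 4 features is arbitrary, and the weakest feature is taken to be the one with the smallest absolute logistic-regression coefficient, which is only meaningful when features are on comparable scales.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical): 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

remaining = list(range(X_demo.shape[1]))  # indices of features still in play
n_keep = 4                                # assumed target number of features

while len(remaining) > n_keep:
    # Refit on the surviving features only
    model = LogisticRegression(max_iter=1000).fit(X_demo[:, remaining], y_demo)
    # Drop the feature whose coefficient has the smallest magnitude
    weakest = int(np.argmin(np.abs(model.coef_[0])))
    eliminated = remaining.pop(weakest)
    print(f"dropped feature {eliminated}")

print("kept features:", remaining)
```

Each pass refits the model, so a feature that looked weak next to a correlated neighbour gets re-evaluated once that neighbour is gone.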

<h1 style="color:limegreen">3 . Tree Based Methods</h1>

- __Decision-tree-based methods__ can be used to rank features by importance.
- They score each feature by how much its splits improve the __Information Gain or Gini Index__
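Both criteria measure how much a split reduces the impurity of the class labels. The sketch below uses a hypothetical six-sample node and hypothetical helper names (`gini`, `entropy`, `split_gain`) to compute that reduction for a single candidate split.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def split_gain(parent, left, right, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted


# Hypothetical node with three samples per class, split into two children
parent = [0, 0, 0, 1, 1, 1]
left, right = [0, 0, 0, 1], [1, 1]

print(round(split_gain(parent, left, right, gini), 4))     # 0.25
print(round(split_gain(parent, left, right, entropy), 4))  # 0.4591
```

A tree's importance score for a feature accumulates these reductions over every split that uses the feature, weighted by how many samples reach the split.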

____

# Example Code

## Univariate Feature Selection

In [119]:
import pandas as pd
import numpy as np

In [121]:
# importing required packages

from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest  # for univariate selection

In [125]:
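# importing dataset
# note: the file has no header row, so read_csv uses the first record as column names
# (the frame below therefore has 767 rows); pass header=None to keep that record as data

df = pd.read_csv("Datasets\\pima-indians-diabetes.data.csv")
df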

Unnamed: 0,6,148,72,35,0,33.6,0.627,50,1
0,1,85,66,29,0,26.6,0.351,31,0
1,8,183,64,0,0,23.3,0.672,32,1
2,1,89,66,23,94,28.1,0.167,21,0
3,0,137,40,35,168,43.1,2.288,33,1
4,5,116,74,0,0,25.6,0.201,30,0
...,...,...,...,...,...,...,...,...,...
762,10,101,76,48,180,32.9,0.171,63,0
763,2,122,70,27,0,36.8,0.340,27,0
764,5,121,72,23,112,26.2,0.245,30,0
765,1,126,60,0,0,30.1,0.349,47,1
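The header row of the table above (6, 148, 72, ...) is really the first record of the file, which `read_csv` promoted to column names. A minimal sketch of how to avoid losing it, using an in-memory copy of the first three records instead of the real file and the same column names that are assigned in the next cell:

```python
import io

import pandas as pd

# First three records of the file, inlined so the sketch runs without the dataset
raw = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# header=None keeps the first line as data; names supplies the column labels
df_demo = pd.read_csv(raw, header=None, names=names)
print(df_demo.shape)  # (3, 9)
```

Applied to the full file, this would keep one more row than the 767 used in the rest of this notebook, so the scores shown later would change slightly.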


In [127]:
# the columns have no names, so let's add column names

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

df.columns = names
df.head()

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
0,1,85,66,29,0,26.6,0.351,31,0
1,8,183,64,0,0,23.3,0.672,32,1
2,1,89,66,23,94,28.1,0.167,21,0
3,0,137,40,35,168,43.1,2.288,33,1
4,5,116,74,0,0,25.6,0.201,30,0


In [129]:
df.shape

(767, 9)

In [133]:
# dividing into dependent and independent sets

X = df.iloc[:,:-1]
y = df.iloc[:,-1]


In [135]:
X

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age
0,1,85,66,29,0,26.6,0.351,31
1,8,183,64,0,0,23.3,0.672,32
2,1,89,66,23,94,28.1,0.167,21
3,0,137,40,35,168,43.1,2.288,33
4,5,116,74,0,0,25.6,0.201,30
...,...,...,...,...,...,...,...,...
762,10,101,76,48,180,32.9,0.171,63
763,2,122,70,27,0,36.8,0.340,27
764,5,121,72,23,112,26.2,0.245,30
765,1,126,60,0,0,30.1,0.349,47


In [139]:
y

0      0
1      1
2      0
3      1
4      0
      ..
762    0
763    0
764    0
765    1
766    0
Name: class, Length: 767, dtype: int64

In [141]:
# feature selection 

test = SelectKBest(k=5,score_func=chi2)  # keep the top 5 features; k must be less than or equal to the total number of features

fit = test.fit(X,y)

In [145]:
# summarize scores

np.set_printoptions(precision=4)  # 4 digits after the decimal point

fit.scores_

# For regression: f_regression, mutual_info_regression
# For classification: chi2, f_classif, mutual_info_classif
# For feature 'test' the chi-square value is 2219.3978, so it scores highest, followed by plas, age, ...

array([ 110.7272, 1406.5905,   17.505 ,   51.0079, 2219.3978,  127.6715,
          5.3564,  178.0108])

In [165]:
# Let's convert to df

data = pd.DataFrame(fit.scores_).T
data

Unnamed: 0,0,1,2,3,4,5,6,7
0,110.727182,1406.590491,17.504998,51.007895,2219.397819,127.671491,5.356364,178.01076


In [167]:
# Assign column names too

data.columns=names[:-1]
data

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age
0,110.727182,1406.590491,17.504998,51.007895,2219.397819,127.671491,5.356364,178.01076


In [177]:
data.sort_values(by = 0,ascending=False,axis=1)

Unnamed: 0,test,plas,age,mass,preg,skin,pres,pedi
0,2219.397819,1406.590491,178.01076,127.671491,110.727182,51.007895,17.504998,5.356364
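The scores can also be labelled and the selected columns read off directly from the fitted selector. The sketch below runs on small synthetic count data rather than the diabetes file; the column names `signal`, `noise_a`, and `noise_b` are hypothetical. Note that scikit-learn's `chi2` requires non-negative feature values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Hypothetical data: one count feature that depends on the class, two that do not
y_demo = rng.integers(0, 2, size=200)
X_demo = pd.DataFrame({
    "signal": rng.poisson(lam=2 + 6 * y_demo),
    "noise_a": rng.poisson(lam=5, size=200),
    "noise_b": rng.poisson(lam=5, size=200),
})

selector = SelectKBest(score_func=chi2, k=1).fit(X_demo, y_demo)

# Label the scores with column names and sort them in one step
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
print(scores)

# get_support() returns a boolean mask over the input columns
selected = list(X_demo.columns[selector.get_support()])
print(selected)
```

Using a `Series` indexed by column name avoids the transpose and the manual column assignment, and `get_support()` gives the chosen columns without reading them off the sorted table.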


In [179]:
df

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
0,1,85,66,29,0,26.6,0.351,31,0
1,8,183,64,0,0,23.3,0.672,32,1
2,1,89,66,23,94,28.1,0.167,21,0
3,0,137,40,35,168,43.1,2.288,33,1
4,5,116,74,0,0,25.6,0.201,30,0
...,...,...,...,...,...,...,...,...,...
762,10,101,76,48,180,32.9,0.171,63,0
763,2,122,70,27,0,36.8,0.340,27,0
764,5,121,72,23,112,26.2,0.245,30,0
765,1,126,60,0,0,30.1,0.349,47,1


## Recursive Feature Elimination

In [184]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [186]:
# RFE is commonly used with logistic regression, but it works with any estimator that exposes coef_ or feature_importances_

In [188]:
model = LogisticRegression(max_iter=400)  # the solver estimates the coefficients iteratively
# max_iter caps the number of iterations of the optimization algorithm inside logistic regression.
# The default value is 100; it is raised here to give the solver more room to converge.

In [190]:
rfe = RFE(model)   # n_features_to_select is left at its default, which keeps half of the features

fit = rfe.fit(X,y)

In [192]:
# Number of features kept: 4 (by default RFE keeps half of the 8 features)

fit.n_features_

4

In [194]:
#Selected Features:
fit.support_

array([ True,  True, False, False, False,  True,  True, False])

In [196]:
names

['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [198]:
# Feature Ranking:
fit.ranking_

array([1, 1, 3, 5, 4, 1, 1, 2])

In [200]:
names

['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
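Reading the mask and the ranking against the list of names by eye is error-prone. A small sketch that pairs them up, using the `support_` and `ranking_` values printed above copied in as plain lists:

```python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Values copied from the fit.support_ and fit.ranking_ outputs above
support = [True, True, False, False, False, True, True, False]
ranking = [1, 1, 3, 5, 4, 1, 1, 2]

feature_names = names[:-1]  # drop the target column 'class'

# Features that RFE kept (rank 1)
selected = [name for name, keep in zip(feature_names, support) if keep]
print(selected)  # ['preg', 'plas', 'mass', 'pedi']

# All features ordered from best rank to worst
by_rank = sorted(zip(ranking, feature_names))
print(by_rank)
```

The eliminated features follow in order of removal: `age` was dropped last (rank 2) and `skin` first (rank 5).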

## Feature Importance using Decision Tree

In [203]:
from sklearn.tree import DecisionTreeClassifier

In [205]:
model = DecisionTreeClassifier()  # no random_state is set, so the importances can vary slightly between runs
model.fit(X,y)

In [209]:
model.feature_importances_   # feature importance scores

array([0.0532, 0.3292, 0.0923, 0.0234, 0.0288, 0.2235, 0.1328, 0.1167])

In [211]:
names

['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [217]:
imps = pd.DataFrame({'feature':names[:-1],'score':model.feature_importances_})
imps

Unnamed: 0,feature,score
0,preg,0.053232
1,plas,0.329244
2,pres,0.092314
3,skin,0.023428
4,test,0.028835
5,mass,0.223493
6,pedi,0.132769
7,age,0.116684


In [219]:
imps.sort_values('score',ascending=False)

Unnamed: 0,feature,score
1,plas,0.329244
5,mass,0.223493
6,pedi,0.132769
7,age,0.116684
2,pres,0.092314
0,preg,0.053232
4,test,0.028835
3,skin,0.023428


In [221]:
# plas is the most important feature, then mass and pedi

<h3 style="color:red">The methods give different results, so we should favor the features that are common to all of them</h3>
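One way to find the common features is to intersect the selections. The sketch below copies the scores and the RFE selection from the outputs above; taking the top 4 from the chi-square and tree scores is an assumption made here so that each method contributes the same number of features as RFE, and the helper `top_k` is hypothetical.

```python
# Scores copied from the outputs above
chi2_scores = {
    'preg': 110.727182, 'plas': 1406.590491, 'pres': 17.504998, 'skin': 51.007895,
    'test': 2219.397819, 'mass': 127.671491, 'pedi': 5.356364, 'age': 178.01076,
}
tree_scores = {
    'preg': 0.053232, 'plas': 0.329244, 'pres': 0.092314, 'skin': 0.023428,
    'test': 0.028835, 'mass': 0.223493, 'pedi': 0.132769, 'age': 0.116684,
}
rfe_selected = {'preg', 'plas', 'mass', 'pedi'}


def top_k(scores, k):
    """Names of the k highest-scoring features."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


k = 4  # assumed cut-off, chosen to match the number of features RFE kept
common = top_k(chi2_scores, k) & rfe_selected & top_k(tree_scores, k)
print(sorted(common))  # ['mass', 'plas']
```

With this cut-off only `plas` and `mass` are chosen by all three methods; a looser rule, such as keeping features chosen by at least two methods, would also admit `pedi` and `age`.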