### Feature Selection -- 3 Techniques
Feature selection, one step of feature engineering, is the process of choosing the features in your dataset that contribute the most to the prediction variable.

#### Univariate Feature Selection -- 1st Technique

In [1]:
# Feature selection with univariate statistical tests (chi-squared for classification)
import pandas as pd
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

#For regression: f_regression, mutual_info_regression
#For classification: chi2, f_classif, mutual_info_classif

In [2]:
#Loading the Dataset
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv("C:/Users/Akaash/Downloads/pima-indians-diabetes_data.csv",names=names)
dataframe.head()

,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
#Separating the Input / Output variable
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [4]:
# feature selection: keep the 4 features with the highest chi2 scores
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)

In [5]:
# summarize scores (one chi2 score per input column)
set_printoptions(precision=3)
print(fit.scores_)
# reduce X to the 4 selected columns
features = fit.transform(X)

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
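
To make the scores above less of a black box, here is a minimal sketch of a chi-squared score computed by hand. It assumes the usual observed-versus-expected formulation for non-negative features: per class, the observed sum of a feature is compared with the sum expected if the feature were independent of the class. The function name `chi2_scores` and the toy data are hypothetical, and this is not scikit-learn's own code.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature column (features must be non-negative)."""
    n_samples = len(X)
    n_features = len(X[0])
    # Total of each feature over all samples
    feature_total = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        class_prob = len(rows) / n_samples
        for j in range(n_features):
            observed = sum(row[j] for row in rows)
            # Expected class total if the feature were independent of the class
            expected = class_prob * feature_total[j]
            scores[j] += (observed - expected) ** 2 / expected
    return scores


# Hypothetical toy data: column 0 separates the classes, column 1 is constant
X_toy = [[1, 5], [2, 5], [8, 5], [9, 5]]
y_toy = [0, 0, 1, 1]
toy_scores = chi2_scores(X_toy, y_toy)
print(toy_scores)
```

A column whose class totals match the expected totals scores zero, which is why a high score is read as evidence that the feature depends on the class. The sketch divides by the expected total, so it does not handle a column that is zero everywhere.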


Inference: with univariate feature selection using the chi2 test, the scores above show that the columns at indices 4, 1, 7 and 5 (test, plas, age, mass) are the four best features.
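
Reading the top four scores off by eye is error-prone, so a short sketch of the same ranking may help. It hard-codes the scores printed above rather than refitting the selector; `top_idx` and `top_names` are illustrative names.

```python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# chi2 scores copied from the printed output above
scores = [111.52, 1411.887, 17.605, 53.108, 2175.565, 127.669, 5.393, 181.304]

k = 4
# Column indices sorted by score, highest first, truncated to the top k
top_idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
top_names = [names[i] for i in top_idx]

for i in top_idx:
    print(i, names[i], scores[i])
```

On the fitted selector itself, `fit.get_support(indices=True)` should return the same columns, sorted by index rather than by score.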

#### Recursive Feature Elimination -- 2nd Technique

In [6]:
#Importing the Required libraries
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings("ignore")  #--to ignore warnings

In [7]:
#Loading the Dataset
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv("C:/Users/Akaash/Downloads/pima-indians-diabetes_data.csv",names=names)
dataframe.head()

,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [8]:
#Separating the Input / Output variable
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [9]:
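# feature selection -- RFE wrapped around LogisticRegression
model = LogisticRegression(max_iter=400)
# n_features_to_select is passed by keyword; recent scikit-learn releases reject it as a positional argument
rfe = RFE(model, n_features_to_select=4)
fit = rfe.fit(X, Y)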

In [10]:
#Num Features: 
fit.n_features_

4

In [11]:
#Feature Ranking:
fit.ranking_

array([1, 1, 3, 5, 4, 1, 1, 2])

In [12]:
#Selected Features:
fit.support_

array([ True,  True, False, False, False,  True,  True, False])

Inference: with recursive feature elimination wrapped around a logistic regression model, the boolean mask above shows that the columns at indices 0, 1, 5 and 6 (preg, plas, mass, pedi) are the best features.
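
The ranking and the mask above carry the same information, which the following sketch makes explicit. It hard-codes the printed `ranking_` values and relies on two properties of RFE: selected features get rank 1, and a higher rank means the feature was dropped earlier. The variable names are illustrative.

```python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# ranking_ copied from the printed output above
ranking = [1, 1, 3, 5, 4, 1, 1, 2]

# Rank 1 marks a selected feature, so the mask can be rebuilt from the ranking
support = [r == 1 for r in ranking]
selected = [n for n, keep in zip(names, support) if keep]

# Higher rank means eliminated earlier; sort the dropped features accordingly
dropped = sorted(((r, n) for r, n in zip(ranking, names) if r > 1), reverse=True)
elimination_order = [n for _, n in dropped]

print(support)
print("selected:", selected)
print("eliminated, first to last:", elimination_order)
```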

#### Feature Importance using Decision Tree -- 3rd Technique

In [13]:
#Importing the Required Libraries
import pandas as pd
from sklearn.tree import  DecisionTreeClassifier

In [14]:
#Loading the Dataset
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv("C:/Users/Akaash/Downloads/pima-indians-diabetes_data.csv",names=names)
dataframe.head()

,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [15]:
#Separating the Input / Output variable
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [16]:
# feature importance -- fit a DecisionTreeClassifier() and read its importances
model = DecisionTreeClassifier()
model.fit(X, Y)
print(model.feature_importances_)

[0.067 0.32  0.082 0.018 0.045 0.226 0.127 0.116]


Inference: with a decision tree (DecisionTreeClassifier()), the importance scores above show that the columns at indices 1, 5, 6 and 7 (plas, mass, pedi, age) are the best features.
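
As with the chi2 scores, sorting the importances makes the top four explicit. This sketch hard-codes the values printed above; `ranked` and `top4` are illustrative names.

```python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# feature_importances_ copied from the printed output above
importances = [0.067, 0.32, 0.082, 0.018, 0.045, 0.226, 0.127, 0.116]

# Pair each column with its importance and sort, most important first
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
top4 = [name for name, _ in ranked[:4]]

for name, score in ranked:
    print(f"{name:5s} {score:.3f}")
```

Because the tree above is built without a fixed seed, the importances can change slightly from run to run; passing `random_state` to `DecisionTreeClassifier` is one way to make them repeatable.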

#### Inference: from all three techniques, choose the features that occur in all three
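
A minimal sketch of that final step, assuming the top four features per technique are derived from the outputs printed in the three sections above (chi2 scores, RFE mask, tree importances). The helper `top_k` is a hypothetical name.

```python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

# Values copied from the printed outputs of the three techniques
chi2_scores = [111.52, 1411.887, 17.605, 53.108, 2175.565, 127.669, 5.393, 181.304]
rfe_support = [True, True, False, False, False, True, True, False]
tree_importances = [0.067, 0.32, 0.082, 0.018, 0.045, 0.226, 0.127, 0.116]


def top_k(values, k=4):
    """Indices of the k largest values, as a set."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])


chi2_top = top_k(chi2_scores)
rfe_top = {i for i, keep in enumerate(rfe_support) if keep}
tree_top = top_k(tree_importances)

# Features chosen by all three techniques
common = sorted(chi2_top & rfe_top & tree_top)
common_names = [names[i] for i in common]
print(common, common_names)
```

With these outputs only two columns survive the intersection, so a looser rule (for example, features picked by at least two techniques) may be worth considering if more inputs are needed.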