### Feature Engineering

Any work you do on a dataset's features (removing, adding, transforming, or selecting them) is called Feature Engineering.

* Feature Elimination    - dropping features.
* Feature Addition       - adding new features.
* Feature Transformation - transforming feature values onto another scale, e.g. log transformation or square-root transformation.
* Feature Selection      - deciding which of the many features are important and using only those features for model building.
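
As a minimal sketch of the four operations listed above, the snippet below applies each one to a small made-up DataFrame. The column names (`id`, `area`, `perimeter`, `constant`), the derived `area_per_perimeter` feature, and the "variance greater than zero" selection rule are illustrative assumptions, not part of this notebook's data or method.

```python
import numpy as np
import pandas as pd

# Toy data; the columns and values are invented for illustration only.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "area": [100.0, 400.0, 900.0, 1600.0],
    "perimeter": [40.0, 80.0, 120.0, 160.0],
    "constant": [1.0, 1.0, 1.0, 1.0],
})

# Feature elimination: drop a column that carries no predictive signal.
df = df.drop(columns=["id"])

# Feature addition: derive a new feature from existing ones.
df["area_per_perimeter"] = df["area"] / df["perimeter"]

# Feature transformation: move a feature onto another scale.
df["log_area"] = np.log(df["area"])
df["sqrt_area"] = np.sqrt(df["area"])

# Feature selection: keep only the features judged useful.
# The rule here is deliberately simple: keep columns whose variance is non-zero.
selected = [col for col in df.columns if df[col].var() > 0]
df_selected = df[selected]

print(df_selected.columns.tolist())
```

The model-based selectors used later in this notebook replace the simple variance rule with importances learned by an estimator.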

#### Feature Selection Techniques

* sklearn - SelectFromModel
* sklearn - RFE (i.e., Recursive Feature Elimination)

### Import Libraries

In [1]:
import pandas as pd

from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import RFE,SelectFromModel
from sklearn.datasets import load_breast_cancer

import warnings
warnings.filterwarnings('ignore')

### Import Data

In [2]:
cancer=load_breast_cancer()
cancer_df=pd.DataFrame(data=cancer.data,columns=cancer.feature_names)
cancer_df['target']=cancer.target
cancer_df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


### Data Understanding

In [3]:
cancer_df.shape

(569, 31)

In [4]:
cancer_df.isna().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64

In [5]:
cancer_df.dtypes

mean radius                float64
mean texture               float64
mean perimeter             float64
mean area                  float64
mean smoothness            float64
mean compactness           float64
mean concavity             float64
mean concave points        float64
mean symmetry              float64
mean fractal dimension     float64
radius error               float64
texture error              float64
perimeter error            float64
area error                 float64
smoothness error           float64
compactness error          float64
concavity error            float64
concave points error       float64
symmetry error             float64
fractal dimension error    float64
worst radius               float64
worst texture              float64
worst perimeter            float64
worst area                 float64
worst smoothness           float64
worst compactness          float64
worst concavity            float64
worst concave points       float64
worst symmetry      

### Model Building

In [6]:
X=cancer_df.drop('target',axis=1)
y=cancer_df[['target']]

In [7]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,stratify=y)

In [8]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((455, 30), (114, 30), (455, 1), (114, 1))
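
Two details of the split above are worth noting. First, `y` is a one-column DataFrame (shape `(455, 1)`), and scikit-learn estimators typically emit a `DataConversionWarning` for a column-vector target, which the blanket `warnings.filterwarnings('ignore')` call hides. Second, `train_test_split` is called without `random_state`, so the accuracy figures in this notebook will change from run to run. The sketch below shows one possible variant with a 1-D target and a fixed seed; the seed value `42` is an arbitrary assumption.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# A Series (1-D) target instead of a one-column DataFrame.
y = pd.Series(cancer.target, name="target")

# random_state makes the split repeatable; 42 is an arbitrary choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With a different seed the rows in each split change, so the accuracies reported elsewhere in this notebook would not be reproduced exactly.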

##### ==================================================================================================================

### Feature Engineering - SelectFromModel

In [14]:
sfm=SelectFromModel(estimator=RandomForestClassifier(max_depth=5,random_state=12),max_features=None)
sfm.fit(X_train,y_train)

SelectFromModel(estimator=RandomForestClassifier(max_depth=5, random_state=12))

In [15]:
sfm.get_support()

array([ True, False, False,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True, False,  True,  True, False, False, False,
        True, False, False])

In [16]:
X_train.columns

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')

In [17]:
X_train.columns[sfm.get_support()]

Index(['mean radius', 'mean area', 'mean concavity', 'mean concave points',
       'area error', 'worst radius', 'worst perimeter', 'worst area',
       'worst concave points'],
      dtype='object')

In [18]:
len(X_train.columns[sfm.get_support()])

9
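
To make the selection rule less opaque: when no `threshold` is given and the estimator is a tree ensemble, `SelectFromModel` keeps the features whose importance is at least the mean importance. The sketch below rebuilds that mask by hand and compares it with `get_support()`. It fits on the full dataset rather than on this notebook's training split (an assumption made so the snippet is self-contained), so the selected features may differ from the nine listed above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

sfm = SelectFromModel(
    estimator=RandomForestClassifier(max_depth=5, random_state=12)
)
sfm.fit(X, y)

# Importances of the fitted forest, one value per input feature.
importances = sfm.estimator_.feature_importances_

# Threshold actually used by the selector (the mean importance by default here).
threshold = sfm.threshold_

# Rebuild the selection mask by hand and compare it with get_support().
manual_mask = importances >= threshold

print("threshold:", threshold)
print("selected :", manual_mask.sum(), "of", manual_mask.size)
print("matches get_support():", np.array_equal(manual_mask, sfm.get_support()))
```

Passing `threshold="median"` or a scaled string such as `"1.25*mean"` to `SelectFromModel` would tighten or loosen the cut-off.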

In [19]:
sfm_transformed_Xtrain=sfm.transform(X_train)

In [20]:
sfm_transformed_Xtest=sfm.transform(X_test)

In [21]:
def run_rf_classifer(X_train,X_test,y_train,y_test):
    # Fit a random forest on the given training features and score it on the test features.
    rf_classifier=RandomForestClassifier(random_state=12)
    rf_classifier.fit(X_train,y_train)
    y_pred_rf=rf_classifier.predict(X_test)
    # Compute the accuracy once, print it, and return it.
    accuracy=accuracy_score(y_test,y_pred_rf)
    print("Accuracy is :",accuracy)
    return accuracy

In [22]:
%%time
run_rf_classifer(X_train,X_test,y_train,y_test)

Accuracy is : 0.9473684210526315
Wall time: 254 ms


0.9473684210526315

In [23]:
%%time
run_rf_classifer(sfm_transformed_Xtrain,sfm_transformed_Xtest,y_train,y_test)

Accuracy is : 0.9210526315789473
Wall time: 219 ms


0.9210526315789473
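
The two accuracies above differ by only three test samples (108 versus 105 correct out of 114), so a single split is a noisy basis for comparison. One way to get a steadier estimate is to wrap the selector and the classifier in a `Pipeline` and cross-validate both variants, so that the selection is refit inside every fold. The sketch below assumes 5-fold cross-validation on the full dataset; the fold count and the step names `select` and `clf` are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# Baseline: random forest on all 30 features.
baseline = RandomForestClassifier(random_state=12)

# Selection + classifier; the selector is refit on each training fold.
with_selection = Pipeline([
    ("select", SelectFromModel(RandomForestClassifier(max_depth=5, random_state=12))),
    ("clf", RandomForestClassifier(random_state=12)),
])

baseline_scores = cross_val_score(baseline, X, y, cv=5, scoring="accuracy")
selection_scores = cross_val_score(with_selection, X, y, cv=5, scoring="accuracy")

print("All features      : %.4f +/- %.4f" % (baseline_scores.mean(), baseline_scores.std()))
print("SelectFromModel   : %.4f +/- %.4f" % (selection_scores.mean(), selection_scores.std()))
```

If the two means lie within one standard deviation of each other, the smaller feature set can reasonably be preferred for its simplicity.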

### Feature Engineering - RFE - Recursive Feature Elimination

In [24]:
rfe=RFE(estimator=RandomForestClassifier(max_depth=5,random_state=12),n_features_to_select=7)
rfe.fit(X_train,y_train)

RFE(estimator=RandomForestClassifier(max_depth=5, random_state=12),
    n_features_to_select=7)

In [25]:
rfe.get_support()

array([False, False,  True, False, False, False,  True,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True, False,  True,  True, False, False, False,
        True, False, False])

In [26]:
X_train.columns[rfe.get_support()]

Index(['mean perimeter', 'mean concavity', 'mean concave points',
       'worst radius', 'worst perimeter', 'worst area',
       'worst concave points'],
      dtype='object')

In [27]:
len(X_train.columns[rfe.get_support()])

7

In [28]:
rfe_transformed_Xtrain=rfe.transform(X_train)
rfe_transformed_Xtest=rfe.transform(X_test)

In [29]:
rfe_transformed_Xtrain.shape,rfe_transformed_Xtest.shape

((455, 7), (114, 7))
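
Besides the boolean mask, a fitted `RFE` also exposes `ranking_`, which records the order of elimination: selected features have rank 1, the last feature eliminated has rank 2, and so on. The sketch below prints that ranking by feature name. It fits on the full dataset rather than on this notebook's training split (an assumption made for self-containment), so the ranks may not match the seven features shown above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target

rfe = RFE(
    estimator=RandomForestClassifier(max_depth=5, random_state=12),
    n_features_to_select=7,
    step=1,  # remove one feature per iteration (the default)
)
rfe.fit(X, y)

# Rank 1 = selected; higher ranks were eliminated earlier.
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(ranking.head(10))
```

Unlike `SelectFromModel`, which fits the estimator once, `RFE` refits it after every elimination, which is why it is slower but lets importances be re-evaluated as correlated features drop out.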

In [30]:
%%time
run_rf_classifer(X_train,X_test,y_train,y_test)

Accuracy is : 0.9473684210526315
Wall time: 273 ms


0.9473684210526315

In [31]:
%%time
run_rf_classifer(rfe_transformed_Xtrain,rfe_transformed_Xtest,y_train,y_test)

Accuracy is : 0.9210526315789473
Wall time: 206 ms


0.9210526315789473

### How to Find the Optimal Number of Features to Select

In [33]:
# Try every subset size from 1 feature up to all 30 and score each subset on the test split.
for i in range(1,31):
    rfe=RFE(estimator=RandomForestClassifier(max_depth=5,random_state=12),n_features_to_select=i)
    rfe.fit(X_train,y_train)
    rfe_transformedXtrain=rfe.transform(X_train)
    rfe_transformedXtest=rfe.transform(X_test)
    print("No of features :",i)
    run_rf_classifer(rfe_transformedXtrain,rfe_transformedXtest,y_train,y_test)

No of features : 1
Accuracy is : 0.8859649122807017
No of features : 2
Accuracy is : 0.8859649122807017
No of features : 3
Accuracy is : 0.9298245614035088
No of features : 4
Accuracy is : 0.9210526315789473
No of features : 5
Accuracy is : 0.9210526315789473
No of features : 6
Accuracy is : 0.9298245614035088
No of features : 7
Accuracy is : 0.9210526315789473
No of features : 8
Accuracy is : 0.9473684210526315
No of features : 9
Accuracy is : 0.9473684210526315
No of features : 10
Accuracy is : 0.9298245614035088
No of features : 11
Accuracy is : 0.9385964912280702
No of features : 12
Accuracy is : 0.956140350877193
No of features : 13
Accuracy is : 0.9385964912280702
No of features : 14
Accuracy is : 0.9473684210526315
No of features : 15
Accuracy is : 0.9385964912280702
No of features : 16
Accuracy is : 0.9473684210526315
No of features : 17
Accuracy is : 0.9473684210526315
No of features : 18
Accuracy is : 0.9385964912280702
No of features : 19
Accuracy is : 0.9473684210526315
No 
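
The loop above refits `RFE` from scratch for every subset size. With `step=1` and a fixed `random_state`, the elimination order should be identical in every run, so a single fit with `n_features_to_select=1` can supply the subset for every size through `ranking_`. The sketch below works under those assumptions; the seeded split and the helper `subset_for` are hypothetical additions, not part of this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # arbitrary seed
)

# One full elimination run: rank 1 is the last surviving feature,
# rank 2 the feature eliminated just before it, and so on.
rfe_full = RFE(
    estimator=RandomForestClassifier(max_depth=5, random_state=12),
    n_features_to_select=1,
    step=1,
)
rfe_full.fit(X_train, y_train)


def subset_for(k):
    """Boolean mask of the k features that survive longest (hypothetical helper)."""
    return rfe_full.ranking_ <= k


# Reuse the single ranking for several subset sizes.
for k in (3, 7, 12):
    mask = subset_for(k)
    print(k, "features ->", X_train[:, mask].shape)
```

Each mask can then be used to slice `X_train` and `X_test` before calling the scoring function, which replaces 30 separate `RFE` fits with one.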

In [34]:
# Same search, but features are ranked with gradient boosting.
# The selected subset is still scored with the random forest inside run_rf_classifer.
for i in range(1,31):
    rfe=RFE(estimator=GradientBoostingClassifier(max_depth=5,random_state=12),n_features_to_select=i)
    rfe.fit(X_train,y_train)
    rfe_transformedXtrain=rfe.transform(X_train)
    rfe_transformedXtest=rfe.transform(X_test)
    print("No of features :",i)
    run_rf_classifer(rfe_transformedXtrain,rfe_transformedXtest,y_train,y_test)

No of features : 1
Accuracy is : 0.8859649122807017
No of features : 2
Accuracy is : 0.8859649122807017
No of features : 3
Accuracy is : 0.9473684210526315
No of features : 4
Accuracy is : 0.9385964912280702
No of features : 5
Accuracy is : 0.956140350877193
No of features : 6
Accuracy is : 0.9649122807017544
No of features : 7
Accuracy is : 0.9473684210526315
No of features : 8
Accuracy is : 0.9473684210526315
No of features : 9
Accuracy is : 0.9649122807017544
No of features : 10
Accuracy is : 0.9473684210526315
No of features : 11
Accuracy is : 0.9385964912280702
No of features : 12
Accuracy is : 0.9473684210526315
No of features : 13
Accuracy is : 0.956140350877193
No of features : 14
Accuracy is : 0.9385964912280702
No of features : 15
Accuracy is : 0.9385964912280702
No of features : 16
Accuracy is : 0.9473684210526315
No of features : 17
Accuracy is : 0.9298245614035088
No of features : 18
Accuracy is : 0.9649122807017544
No of features : 19
Accuracy is : 0.9473684210526315
No o
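
Choosing the feature count by comparing test-set accuracies, as the loops above do, lets the test set influence the choice, so the best reported accuracy is likely to be optimistic. scikit-learn's `RFECV` makes the same choice by cross-validation on the training data only. The sketch below is one possible setup, not this notebook's method: the seeded split, `n_estimators=50` (reduced to keep the run short), and the 5-fold `StratifiedKFold` are all assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # arbitrary seed
)

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, max_depth=5, random_state=12),
    step=1,                      # eliminate one feature per iteration
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=12),
    scoring="accuracy",
    min_features_to_select=1,
)
rfecv.fit(X_train, y_train)      # the test split is never seen during selection

print("Features chosen by cross-validation:", rfecv.n_features_)
print("Held-out test accuracy:", rfecv.score(X_test, y_test))
```

Because the test split plays no part in picking `n_features_`, the final accuracy is a fairer estimate of how the reduced model would perform on unseen data.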