# Introduction to Feature Selection
Machine learning and deep learning algorithms learn from data, which consists of different types of features. Both the training time and the performance of a machine learning algorithm depend heavily on the features in the dataset. Ideally, we should retain only those features that actually help the model learn something.

Unnecessary and redundant features not only slow down training, they can also hurt the algorithm's performance. The process of selecting the most suitable features for training a machine learning model is called "feature selection".

### Advantages of feature selection
1) Models with fewer features are easier to explain

2) It is easier to implement machine learning models with a reduced set of features

3) Fewer features can reduce overfitting, which in turn improves generalization

4) Feature selection removes data redundancy

5) Training time of models with fewer features is typically lower

6) Models with fewer features are less prone to errors caused by noisy or irrelevant inputs

## Let's see feature selection by removing correlated features

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
from sklearn import preprocessing

In [2]:
data=pd.read_csv(r'C:\Users\DELL\Desktop\breast cancer.csv')
data.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [3]:
# Encode the string labels in the 'diagnosis' column as integers
encode = preprocessing.LabelEncoder()
target = data['diagnosis']
y = encode.fit_transform(target)
y

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
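
To check which class each integer stands for, the fitted encoder's `classes_` attribute can be inspected. The following is a minimal sketch on a small hand-made list of labels; the toy values and the names `toy_labels`, `toy_encoder`, `toy_codes`, and `label_mapping` are assumptions for illustration, not rows or variables from the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'diagnosis' column (illustrative only)
toy_labels = ['M', 'B', 'M', 'B', 'B']

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels)

# classes_ is stored in sorted order, so 'B' is encoded as 0 and 'M' as 1
label_mapping = {str(label): int(code) for code, label in enumerate(toy_encoder.classes_)}
print(label_mapping)
print(toy_codes)
```

Because the classes are sorted alphabetically, `M` maps to 1, which is consistent with the encoded array starting with a run of 1s for the leading `M` rows shown earlier.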

In [4]:
# Hold out 20% of the rows for testing; 'diagnosis' and 'id' are not features
train_features, test_features, train_labels, test_labels = train_test_split(
    data.drop(labels=['diagnosis', 'id'], axis=1), y, test_size=0.2, random_state=41)


In [6]:
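correlated_features = set()
# Compute correlations on the numeric feature columns only;
# 'diagnosis' holds strings and 'id' is just an identifier
correlation_matrix = data.drop(labels=['diagnosis', 'id'], axis=1).corr()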

In [7]:
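# For every pair of columns (j < i), flag the later column i
# when the absolute correlation of the pair exceeds 0.8
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)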
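
The nested loop above can also be written without explicit indexing by masking the correlation matrix down to its strict upper triangle. This is only a sketch of an equivalent approach: the helper name `find_correlated_columns` and the synthetic frame `toy` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def find_correlated_columns(frame, threshold=0.8):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle: entry (j, i) with j < i pairs
    # column i with an earlier column j, mirroring the nested loop
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return {col for col in upper.columns if (upper[col] > threshold).any()}

# Synthetic data (illustrative only): 'b' is almost a copy of 'a', 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.01, size=200),
    'c': rng.normal(size=200),
})
print(find_correlated_columns(toy))
```

On this toy frame only `b` is flagged, since it is the later column of the one highly correlated pair.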

In [8]:
len(correlated_features)

17

In [9]:
print(correlated_features)

{'fractal_dimension_worst', 'perimeter_worst', 'area_mean', 'radius_worst', 'area_worst', 'smoothness_worst', 'concave points_mean', 'perimeter_mean', 'concavity_mean', 'concave points_worst', 'concavity_se', 'compactness_worst', 'area_se', 'fractal_dimension_se', 'perimeter_se', 'texture_worst', 'concavity_worst'}


In [10]:
# Drop the flagged columns from both splits (the set is converted to a list for drop)
train_features = train_features.drop(columns=list(correlated_features))
test_features = test_features.drop(columns=list(correlated_features))

In [11]:
train_features

Unnamed: 0,radius_mean,texture_mean,smoothness_mean,compactness_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,smoothness_se,compactness_se,concave points_se,symmetry_se,symmetry_worst
475,12.83,15.73,0.09040,0.08269,0.1705,0.05913,0.1499,0.4875,0.004873,0.01796,0.008360,0.01601,0.3006
298,14.26,18.17,0.06576,0.05220,0.1635,0.05586,0.2300,0.6690,0.003169,0.01377,0.005243,0.01103,0.2636
220,13.65,13.16,0.09646,0.08711,0.1360,0.06344,0.2102,0.4336,0.004133,0.01695,0.006659,0.01371,0.2380
549,10.82,24.21,0.08192,0.06602,0.1976,0.06328,0.5196,1.9180,0.008263,0.01870,0.005917,0.02466,0.3059
567,20.60,29.33,0.11780,0.27700,0.2397,0.07016,0.7260,1.5950,0.006522,0.06158,0.016640,0.02324,0.4087
...,...,...,...,...,...,...,...,...,...,...,...,...,...
469,11.62,18.18,0.11750,0.14830,0.1957,0.07255,0.4101,1.7400,0.014590,0.03206,0.018410,0.01807,0.2660
407,12.85,21.37,0.07551,0.08316,0.1580,0.06114,0.4993,1.7980,0.006011,0.04480,0.013410,0.02669,0.2488
243,13.75,23.77,0.08043,0.06807,0.1773,0.05429,0.4347,1.0570,0.004351,0.02667,0.010070,0.02598,0.2663
321,20.16,19.66,0.08020,0.08564,0.1928,0.05096,0.5925,0.6863,0.004536,0.01376,0.012470,0.02193,0.3055


## Let's apply a Gradient Boosting Classifier machine learning algorithm

In [12]:
from sklearn.preprocessing import MinMaxScaler

In [13]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(train_features)
X_test_scaled = scaler.transform(test_features)
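
To make the scaling step concrete: `MinMaxScaler` maps each feature to `(x - min) / (max - min)` using the minimum and maximum learned during `fit`, and `transform` reuses those training statistics on new data. A minimal sketch on toy numbers (the arrays and the `toy_` names below are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (illustrative only)
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[15.0], [40.0]])

toy_scaler = MinMaxScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns min=10 and max=30
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the training min and max

print(toy_train_scaled.ravel())  # values 0.0, 0.5, 1.0
print(toy_test_scaled.ravel())   # values 0.25, 1.5
```

Note that a test value above the training maximum is scaled to a value greater than 1, because the test data never influences the learned range.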

In [14]:
from sklearn.ensemble import GradientBoostingClassifier

In [15]:
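# A small ensemble of shallow trees: 20 boosting stages, depth 2,
# with 2 randomly chosen features considered at each split
gb_classifier = GradientBoostingClassifier(n_estimators=20, learning_rate=0.1, max_features=2,
                                           max_depth=2, random_state=0)
# Note: the model is fit on the unscaled features. Tree-based models split on
# thresholds, so min-max scaling should not change the result here.
gb_model = gb_classifier.fit(train_features, train_labels)

# Accuracy on the training split and on the held-out test split
print("train score - " + str(gb_model.score(train_features, train_labels)))
print("test score - " + str(gb_model.score(test_features, test_labels)))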

train score - 0.9406593406593406
test score - 0.9385964912280702
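
To judge whether dropping correlated features helped or hurt, it is useful to train the same classifier on all features as a baseline. The sketch below assumes scikit-learn's bundled `load_breast_cancer` dataset in place of the local CSV (its column names and label encoding differ from the file used above), and the helper `fit_and_score` is hypothetical. The scores it prints are therefore not expected to match the numbers reported above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled copy of the dataset replaces the local CSV
features, labels = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=41)

# Pick correlated columns from the training split only
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]

def fit_and_score(train_X, test_X):
    """Fit the same classifier settings and return (train accuracy, test accuracy)."""
    model = GradientBoostingClassifier(n_estimators=20, learning_rate=0.1,
                                       max_features=2, max_depth=2, random_state=0)
    model.fit(train_X, y_train)
    return model.score(train_X, y_train), model.score(test_X, y_test)

all_scores = fit_and_score(X_train, X_test)
reduced_scores = fit_and_score(X_train.drop(columns=to_drop),
                               X_test.drop(columns=to_drop))
print("all features     (train, test):", all_scores)
print("reduced features (train, test):", reduced_scores)
```

Computing the correlation matrix on the training split alone keeps the test rows out of the selection step, which is a slightly stricter setup than the one used in the cells above.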
