# <center> <u> Feature Selection </u> </center>

<h2>What's the Purpose of Feature Selection?</h2>
<p>Many learning algorithms perform poorly on high-dimensional data. This is known as the <b>curse of dimensionality</b>.</p>
<p>There are other reasons we may wish to reduce the number of features, including:</p>
<p>1. Reducing computational cost</p>
<p>2. Reducing the cost associated with data collection</p>
<p>3. Improving interpretability</p>

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import  SelectFromModel
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from math import sqrt

# 1. Filter Methods



A filter method applies a statistical measure to assign a score to each feature. We can then decide to keep or remove features based on those scores. These methods are often univariate and consider each feature independently, or with regard to the dependent variable.

In this section we will cover the following approaches:

1. Missing Value Ratio Threshold
2. Variance Threshold
3. $\chi^2$ Test
4. ANOVA Test

## (a) Missing Value Ratio Threshold

Data Dict:
---

**Pregnancies:** Number of times pregnant <br>
**Glucose:** Plasma glucose concentration at 2 hours in an oral glucose tolerance test.<br>
**BloodPressure:** Diastolic blood pressure (mm Hg).<br>
**SkinThickness:** Triceps skin fold thickness (mm).<br>
**Insulin:** 2-Hour serum insulin (mu U/ml).<br>
**BMI:** Body mass index (weight in kg/(height in m)^2). <br>
**DiabetesPedigreeFunction:** A function which scores likelihood of diabetes based on family history<br>
**Age:** Age (years)<br>
**Outcome:** Class variable (0 or 1)




In [2]:
diabetes =pd.read_csv('diabetes.csv') 
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
# Zeros in these features are not valid measurements, so replace them with NaN.
# (Glucose is not replaced here; the counts below show no missing Glucose values.)
zero_as_missing = ['BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes[zero_as_missing] = diabetes[zero_as_missing].replace(0, np.nan)

In [4]:
diabetes.isnull().sum()

Pregnancies                   0
Glucose                       0
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

Now let's look at the percentage of missing values for each feature.

In [5]:
#percentage of missing values for Glucose
diabetes['Glucose'].isnull().sum()/len(diabetes)*100

0.0

In [6]:
diabetes['BloodPressure'].isnull().sum()/len(diabetes)*100

4.557291666666666

In [7]:
diabetes['SkinThickness'].isnull().sum()/len(diabetes)*100

29.557291666666668

In [8]:
diabetes['Insulin'].isnull().sum()/len(diabetes)*100

48.69791666666667

In [9]:
diabetes['BMI'].isnull().sum()/len(diabetes)*100

1.4322916666666665

A large share of the data is missing in SkinThickness and Insulin.

We will keep only those features with less than 10% missing data, using 10% as our threshold.
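The per-column checks above can also be done in one pass. The following is a minimal sketch on a small made-up frame (the `toy` data and the `max_missing_pct` name are assumptions for illustration, not part of the diabetes dataset):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data: 20 rows, three columns.
n = 20
toy = pd.DataFrame({
    "BloodPressure": np.linspace(60, 98, n),
    "Insulin": np.linspace(80, 270, n),
    "Age": np.arange(21, 21 + n),
})
toy.loc[3, "BloodPressure"] = np.nan   # 1 of 20 missing -> 5%
toy.loc[::2, "Insulin"] = np.nan       # 10 of 20 missing -> 50%

# isnull() yields booleans; the column mean is the fraction of missing values.
missing_pct = toy.isnull().mean() * 100

# Keep only the columns whose missing percentage is below the threshold.
max_missing_pct = 10
kept_columns = missing_pct[missing_pct < max_missing_pct].index.tolist()
toy_reduced = toy[kept_columns]

print(missing_pct)
print(kept_columns)
```

Computing the ratio explicitly makes the threshold easy to read and to change, whereas `dropna(thresh=...)` expresses the same rule as a minimum count of non-missing rows.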

In [10]:
int(diabetes.shape[0]*0.9)

691

In [11]:
diabetes_missing_value_threshold = diabetes.dropna(thresh=int(diabetes.shape[0] * 0.9), axis = 1)
diabetes_missing_value_threshold 

Unnamed: 0,Pregnancies,Glucose,BloodPressure,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72.0,33.6,0.627,50,1
1,1,85,66.0,26.6,0.351,31,0
2,8,183,64.0,23.3,0.672,32,1
3,1,89,66.0,28.1,0.167,21,0
4,0,137,40.0,43.1,2.288,33,1
...,...,...,...,...,...,...,...
763,10,101,76.0,32.9,0.171,63,0
764,2,122,70.0,36.8,0.340,27,0
765,5,121,72.0,26.2,0.245,30,0
766,1,126,60.0,30.1,0.349,47,1


In [12]:

diabetes_missing_value_threshold_features = diabetes_missing_value_threshold.drop('Outcome',axis=1)

diabetes_missing_value_threshold_label= diabetes_missing_value_threshold['Outcome']


In [13]:
diabetes_missing_value_threshold_features

Unnamed: 0,Pregnancies,Glucose,BloodPressure,BMI,DiabetesPedigreeFunction,Age
0,6,148,72.0,33.6,0.627,50
1,1,85,66.0,26.6,0.351,31
2,8,183,64.0,23.3,0.672,32
3,1,89,66.0,28.1,0.167,21
4,0,137,40.0,43.1,2.288,33
...,...,...,...,...,...,...
763,10,101,76.0,32.9,0.171,63
764,2,122,70.0,36.8,0.340,27
765,5,121,72.0,26.2,0.245,30
766,1,126,60.0,30.1,0.349,47


In [14]:
diabetes_missing_value_threshold_label

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

## (b) Variance Threshold


If the variance is low or close to zero, then a feature is approximately constant and will not improve the performance of the model. In that case, it should be removed.

Variance will also be very low for a feature if only a handful of observations of that feature differ from a constant value.


In [15]:
diabetes =pd.read_csv('diabetes.csv') 
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [16]:
X =diabetes.iloc[:,0:8]
Y =diabetes.iloc[:,8]

In [17]:
X.var(axis=0)


Pregnancies                    11.354056
Glucose                      1022.248314
BloodPressure                 374.647271
SkinThickness                 254.473245
Insulin                     13281.180078
BMI                            62.159984
DiabetesPedigreeFunction        0.109779
Age                           138.303046
dtype: float64

DiabetesPedigreeFunction has by far the lowest variance, which might suggest that it is almost constant and carries little information. That could justify removing the column, but variance depends on the unit of measurement and these features are on very different scales, so we should scale them before comparing variances.
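To see why raw variances are not comparable across features, here is a small sketch with made-up weights expressed in two units (the numbers are illustrative assumptions). Multiplying a feature by a constant c multiplies its variance by c squared, while min-max scaling removes the effect of the unit:

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# The same made-up measurements in kilograms and in grams.
weights_kg = np.array([55.0, 62.5, 70.0, 81.0, 93.5])
weights_g = weights_kg * 1000

# Sample variance (ddof=1, as pandas uses): grows by a factor of 1000**2.
var_kg = weights_kg.var(ddof=1)
var_g = weights_g.var(ddof=1)

# After min-max scaling both versions are identical, so the variances agree.
scaled_var_kg = minmax_scale(weights_kg).var(ddof=1)
scaled_var_g = minmax_scale(weights_g).var(ddof=1)

print(var_kg, var_g)
print(scaled_var_kg, scaled_var_g)
```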

In [18]:
from sklearn.preprocessing import minmax_scale
X_scaled_df =pd.DataFrame(minmax_scale(X,feature_range=(0,10)),columns=X.columns)

In [19]:
X_scaled_df 

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.529412,7.437186,5.901639,3.535354,0.000000,5.007452,2.344150,4.833333
1,0.588235,4.271357,5.409836,2.929293,0.000000,3.964232,1.165670,1.666667
2,4.705882,9.195980,5.245902,0.000000,0.000000,3.472429,2.536294,1.833333
3,0.588235,4.472362,5.409836,2.323232,1.111111,4.187779,0.380017,0.000000
4,0.000000,6.884422,3.278689,3.535354,1.985816,6.423249,9.436379,2.000000
...,...,...,...,...,...,...,...,...
763,5.882353,5.075377,6.229508,4.848485,2.127660,4.903130,0.397096,7.000000
764,1.176471,6.130653,5.737705,2.727273,0.000000,5.484352,1.118702,1.000000
765,2.941176,6.080402,5.901639,2.323232,1.323877,3.904620,0.713066,1.500000
766,0.588235,6.331658,4.918033,0.000000,0.000000,4.485842,1.157131,4.333333


In [20]:
X_scaled_df.var(axis=0)

Pregnancies                 3.928739
Glucose                     2.581370
BloodPressure               2.517114
SkinThickness               2.596401
Insulin                     1.855649
BMI                         1.380594
DiabetesPedigreeFunction    2.001447
Age                         3.841751
dtype: float64

In [21]:
from sklearn.feature_selection import VarianceThreshold
select_features = VarianceThreshold(threshold=1)

In [22]:
X_variance_threshold_df = select_features.fit_transform(X_scaled_df)


In [23]:
X_variance_threshold_df

array([[3.52941176, 7.43718593, 5.90163934, ..., 5.00745156, 2.3441503 ,
        4.83333333],
       [0.58823529, 4.27135678, 5.40983607, ..., 3.96423249, 1.16567037,
        1.66666667],
       [4.70588235, 9.1959799 , 5.24590164, ..., 3.47242921, 2.53629377,
        1.83333333],
       ...,
       [2.94117647, 6.08040201, 5.90163934, ..., 3.90461997, 0.71306576,
        1.5       ],
       [0.58823529, 6.33165829, 4.91803279, ..., 4.48584203, 1.15713066,
        4.33333333],
       [0.58823529, 4.67336683, 5.73770492, ..., 4.53055142, 1.01195559,
        0.33333333]])

In [24]:
X_variance_threshold_df = pd.DataFrame(X_variance_threshold_df)

In [25]:
X_variance_threshold_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,3.529412,7.437186,5.901639,3.535354,0.0,5.007452,2.34415,4.833333
1,0.588235,4.271357,5.409836,2.929293,0.0,3.964232,1.16567,1.666667
2,4.705882,9.19598,5.245902,0.0,0.0,3.472429,2.536294,1.833333
3,0.588235,4.472362,5.409836,2.323232,1.111111,4.187779,0.380017,0.0
4,0.0,6.884422,3.278689,3.535354,1.985816,6.423249,9.436379,2.0


In [26]:
def get_selected_features(raw_df,processed_df):
    selected_features=[]
    for i in range(len(processed_df.columns)):
        for j in range(len(raw_df.columns)):
            if (processed_df.iloc[:,i].equals(raw_df.iloc[:,j])):
                selected_features.append(raw_df.columns[j])
    return selected_features
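As an alternative to matching columns by value, the fitted selector can report which columns it kept through `get_support()`. A minimal sketch on made-up data (the `toy` frame and its column names are assumptions for illustration):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Made-up frame: one nearly constant column and two columns that vary a lot.
toy = pd.DataFrame({
    "almost_constant": [1.0, 1.0, 1.0, 1.0, 1.1],
    "spread_a": [0.0, 2.5, 5.0, 7.5, 10.0],
    "spread_b": [10.0, 0.0, 10.0, 0.0, 10.0],
})

selector = VarianceThreshold(threshold=1).fit(toy)

# Boolean mask over the input columns: True where the column was kept.
mask = selector.get_support()
kept = toy.columns[mask].tolist()

# Rebuild a labelled DataFrame from the transformed array.
toy_selected = pd.DataFrame(selector.transform(toy), columns=kept, index=toy.index)
print(kept)
```

Using the mask avoids the nested loop and still works when two columns happen to hold identical values.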

In [27]:
selected_features = get_selected_features(X_scaled_df,X_variance_threshold_df)
selected_features

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

In [28]:
X_variance_threshold_df.columns = selected_features
X_variance_threshold_df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.529412,7.437186,5.901639,3.535354,0.000000,5.007452,2.344150,4.833333
1,0.588235,4.271357,5.409836,2.929293,0.000000,3.964232,1.165670,1.666667
2,4.705882,9.195980,5.245902,0.000000,0.000000,3.472429,2.536294,1.833333
3,0.588235,4.472362,5.409836,2.323232,1.111111,4.187779,0.380017,0.000000
4,0.000000,6.884422,3.278689,3.535354,1.985816,6.423249,9.436379,2.000000
...,...,...,...,...,...,...,...,...
763,5.882353,5.075377,6.229508,4.848485,2.127660,4.903130,0.397096,7.000000
764,1.176471,6.130653,5.737705,2.727273,0.000000,5.484352,1.118702,1.000000
765,2.941176,6.080402,5.901639,2.323232,1.323877,3.904620,0.713066,1.500000
766,0.588235,6.331658,4.918033,0.000000,0.000000,4.485842,1.157131,4.333333


## (c) Chi-Squared statistical test (SelectKBest)


The chi-squared ($\chi^2$) statistic measures the dependence between two variables. It acts as a goodness-of-fit measure: it compares the observed distribution of a feature across the classes with the distribution that would be expected if the feature and the target were independent. Note that scikit-learn's `chi2` requires non-negative feature values.

Scikit-Learn offers a feature selection estimator named SelectKBest, which selects the K highest-scoring features according to a chosen statistical test.
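The observed-versus-expected comparison can be written out by hand. The sketch below uses a tiny made-up count matrix (`X_toy`, `y_toy` and `chi2_by_hand` are illustrative names, not part of the notebook) and checks the result against `sklearn.feature_selection.chi2`:

```python
import numpy as np
from sklearn.feature_selection import chi2

# Made-up non-negative features (rows = samples) and a binary target.
X_toy = np.array([[3, 0, 1],
                  [4, 1, 0],
                  [0, 5, 1],
                  [1, 4, 0],
                  [5, 0, 1],
                  [0, 6, 0]], dtype=float)
y_toy = np.array([1, 1, 0, 0, 1, 0])

def chi2_by_hand(X, y):
    classes = np.unique(y)
    # Observed: total of each feature within each class.
    observed = np.array([X[y == c].sum(axis=0) for c in classes])
    # Expected under independence: each class gets its share of every feature total.
    class_prob = np.array([(y == c).mean() for c in classes])
    expected = np.outer(class_prob, X.sum(axis=0))
    return ((observed - expected) ** 2 / expected).sum(axis=0)

manual_scores = chi2_by_hand(X_toy, y_toy)
sklearn_scores, p_values = chi2(X_toy, y_toy)
print(manual_scores)
print(sklearn_scores)
```

A feature whose total is spread across the classes in proportion to the class sizes gets a score near zero; a feature concentrated in one class gets a large score.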



In [29]:
def generate_feature_scores_df(X,Score):
    # Build the table in one step: one row per feature with its score.
    feature_score = pd.DataFrame({"Features": X.columns, "Score": Score})
    return feature_score

In [30]:
# create a data frame named diabetes and load the csv file again
diabetes =pd.read_csv('diabetes.csv') 

In [31]:
# assign features to X variable and 'outcome' to y variable from the dataframe diabetes
X =diabetes.iloc[:,0:8]
Y =diabetes.iloc[:,8]

In [32]:
#import chi2 and SelectKBest
from sklearn.feature_selection import SelectKBest, chi2

In [33]:
# cast the feature data to float type
X=X.astype(np.float64)

In [34]:
chi2_test = SelectKBest(score_func=chi2, k=4  )
chi2_model = chi2_test.fit(X,Y)

In [35]:
chi2_model.scores_

array([ 111.51969064, 1411.88704064,   17.60537322,   53.10803984,
       2175.56527292,  127.66934333,    5.39268155,  181.30368904])

In [36]:
feature_score_df = generate_feature_scores_df(X,chi2_model.scores_)
feature_score_df

Unnamed: 0,Features,Score
0,Pregnancies,111.519691
1,Glucose,1411.887041
2,BloodPressure,17.605373
3,SkinThickness,53.10804
4,Insulin,2175.565273
5,BMI,127.669343
6,DiabetesPedigreeFunction,5.392682
7,Age,181.303689


The higher the score, the more strongly the feature is associated with the target.

In [37]:
X_new = chi2_model.transform(X)

In [38]:
X_new= pd.DataFrame(X_new)

In [39]:
selected_features = get_selected_features( X ,X_new)
selected_features

['Glucose', 'Insulin', 'BMI', 'Age']

Let's subset X to the features listed in selected_features and save the resulting dataframe in the variable chi2_best_features.

In [40]:
chi2_best_features = X[selected_features]
chi2_best_features.head()

Unnamed: 0,Glucose,Insulin,BMI,Age
0,148.0,0.0,33.6,50.0
1,85.0,0.0,26.6,31.0
2,183.0,0.0,23.3,32.0
3,89.0,94.0,28.1,21.0
4,137.0,168.0,43.1,33.0


## (d) Anova-F Test

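The ANOVA F-value examines the variance of a numerical feature after grouping it by the target vector, and tests whether the means of the groups are significantly different. A large F-value means the variance between the class means is large relative to the variance within each class.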
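The ratio of between-group to within-group variance can be computed directly. This is a sketch on made-up data (`X_toy`, `y_toy` and `anova_f_by_hand` are illustrative names), checked against `f_classif`:

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Made-up data: the first feature separates the classes, the second does not.
X_toy = np.array([[1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [1.1, 6.0],
                  [3.0, 5.5], [3.2, 3.5], [2.9, 4.5], [3.1, 5.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def anova_f_by_hand(x, y):
    classes = np.unique(y)
    n, k = len(x), len(classes)
    grand_mean = x.mean()
    # Between-group sum of squares: how far each class mean is from the grand mean.
    ss_between = sum((y == c).sum() * (x[y == c].mean() - grand_mean) ** 2
                     for c in classes)
    # Within-group sum of squares: spread of the values around their class mean.
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

manual_f = np.array([anova_f_by_hand(X_toy[:, j], y_toy)
                     for j in range(X_toy.shape[1])])
sklearn_f, p_values = f_classif(X_toy, y_toy)
print(manual_f)
print(sklearn_f)
```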

In [41]:
from sklearn.feature_selection import f_classif,SelectPercentile
Anova_test = SelectPercentile(f_classif, percentile=80)
Anova_model= Anova_test.fit(X,Y)

In [42]:
Anova_model.scores_

array([ 39.67022739, 213.16175218,   3.2569504 ,   4.30438091,
        13.28110753,  71.7720721 ,  23.8713002 ,  46.14061124])

In [43]:
feature_scores_df = generate_feature_scores_df(X, Anova_model.scores_)
feature_scores_df

Unnamed: 0,Features,Score
0,Pregnancies,39.670227
1,Glucose,213.161752
2,BloodPressure,3.25695
3,SkinThickness,4.304381
4,Insulin,13.281108
5,BMI,71.772072
6,DiabetesPedigreeFunction,23.8713
7,Age,46.140611


In [44]:
cols =  Anova_model.get_support(indices = True)
X_new = X.iloc[:, cols]

In [45]:
X_new.head()

Unnamed: 0,Pregnancies,Glucose,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6.0,148.0,0.0,33.6,0.627,50.0
1,1.0,85.0,0.0,26.6,0.351,31.0
2,8.0,183.0,0.0,23.3,0.672,32.0
3,1.0,89.0,94.0,28.1,0.167,21.0
4,0.0,137.0,168.0,43.1,2.288,33.0


# 2. Wrapper Methods

Wrapper methods select a set of features by preparing different combinations of features and then evaluating each combination and comparing it with the others. A predictive model is trained on each combination, and a score based on model accuracy is used to evaluate it.
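In its simplest form, the idea can be sketched as an exhaustive search over all feature subsets of a fixed size, scoring each with cross-validated accuracy. The synthetic data, the subset size and the choice of logistic regression below are assumptions for illustration only:

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 5 features, 2 of them informative.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, n_informative=2,
                                   n_redundant=0, random_state=0)

subset_size = 2
results = {}
for subset in combinations(range(X_toy.shape[1]), subset_size):
    model = LogisticRegression(solver="liblinear")
    # Mean 5-fold accuracy of a model trained on this subset only.
    scores = cross_val_score(model, X_toy[:, list(subset)], y_toy,
                             cv=5, scoring="accuracy")
    results[subset] = scores.mean()

# The subset with the highest cross-validated accuracy wins.
best_subset = max(results, key=results.get)
print(best_subset, results[best_subset])
```

The number of subsets grows combinatorially with the number of features, which is why practical wrapper methods such as RFE use a greedy search instead.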

In [46]:
diabetes =pd.read_csv('diabetes.csv') 
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [47]:
X =diabetes.iloc[:,0:8]
Y =diabetes.iloc[:,8]

X,Y


(     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
 0              6      148             72             35        0  33.6   
 1              1       85             66             29        0  26.6   
 2              8      183             64              0        0  23.3   
 3              1       89             66             23       94  28.1   
 4              0      137             40             35      168  43.1   
 ..           ...      ...            ...            ...      ...   ...   
 763           10      101             76             48      180  32.9   
 764            2      122             70             27        0  36.8   
 765            5      121             72             23      112  26.2   
 766            1      126             60              0        0  30.1   
 767            1       93             70             31        0  30.4   
 
      DiabetesPedigreeFunction  Age  
 0                       0.627   50  
 1                    

## (a) Recursive Feature Elimination

Recursive Feature Elimination (RFE) selects features by recursively considering smaller and smaller subsets, pruning the least important feature at each step. A model is fitted iteratively; in each iteration the weakest remaining feature is identified and removed, and the process continues until the requested number of features is left. Each feature is then ranked according to its elimination order. In the worst case, for a dataset with N features and one feature removed per step, this greedy search fits up to N models and examines on the order of $N^2$ feature weights in total.
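The elimination loop can be sketched by hand. The version below assumes a linear model whose absolute coefficients serve as feature importance and removes one feature per step; the synthetic data and the `rfe_by_hand` helper are illustrative, and the result is compared with scikit-learn's `RFE`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 6 features, 3 of them informative.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=0)

def rfe_by_hand(X, y, n_features_to_select):
    remaining = list(range(X.shape[1]))   # column indices still in play
    elimination_order = []
    while len(remaining) > n_features_to_select:
        model = LogisticRegression(solver="liblinear", random_state=0)
        model.fit(X[:, remaining], y)
        # The feature with the smallest absolute coefficient is the weakest.
        weakest = int(np.argmin(np.abs(model.coef_[0])))
        elimination_order.append(remaining.pop(weakest))
    return remaining, elimination_order

manual_selected, elimination_order = rfe_by_hand(X_toy, y_toy, 3)

rfe = RFE(LogisticRegression(solver="liblinear", random_state=0),
          n_features_to_select=3).fit(X_toy, y_toy)
sklearn_selected = np.flatnonzero(rfe.support_).tolist()

print(manual_selected, elimination_order)
print(sklearn_selected, rfe.ranking_)
```

Features that survive get rank 1, and the feature eliminated first receives the largest rank.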

In [48]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [49]:
model = LogisticRegression(solver = 'liblinear')
rfe = RFE(model,n_features_to_select=4 )

In [50]:
fit = rfe.fit(X,Y)

In [51]:
print("Number of selected features", fit.n_features_)
print("Selected features", fit.support_)
print("Feature Rankings", fit.ranking_)

Number of selected features 4
Selected features [ True  True False False False  True  True False]
Feature Rankings [1 1 2 4 5 1 1 3]


In [52]:
def feature_ranks(X,Rank,Support):
    # One row per feature: its RFE rank and whether it was selected.
    feature_rank = pd.DataFrame({"Features": X.columns, "Rank": Rank, "Selected": Support})
    return feature_rank


In [53]:
feature_rank_df =feature_ranks(X, fit.ranking_,fit.support_)
feature_rank_df

Unnamed: 0,Features,Rank,Selected
0,Pregnancies,1,True
1,Glucose,1,True
2,BloodPressure,2,False
3,SkinThickness,4,False
4,Insulin,5,False
5,BMI,1,True
6,DiabetesPedigreeFunction,1,True
7,Age,3,False


We can see that four features have rank 1; RFE considers these the most significant features.

In [54]:
# keep only the rows of feature_rank_df where the Selected column is True
recursive_feature_names = feature_rank_df[feature_rank_df['Selected']]
recursive_feature_names


Unnamed: 0,Features,Rank,Selected
0,Pregnancies,1,True
1,Glucose,1,True
5,BMI,1,True
6,DiabetesPedigreeFunction,1,True


In [55]:
RFE_selected_features = X[recursive_feature_names['Features'].values]
RFE_selected_features.head()

Unnamed: 0,Pregnancies,Glucose,BMI,DiabetesPedigreeFunction
0,6,148,33.6,0.627
1,1,85,26.6,0.351
2,8,183,23.3,0.672
3,1,89,28.1,0.167
4,0,137,43.1,2.288


# 3. Embedded Method Using Random Forest

Feature selection using a random forest falls under the category of embedded methods. Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection mechanisms. Some of the benefits of embedded methods are:
1. They tend to be highly accurate.
2. They tend to generalize better.
3. They are interpretable.

In [56]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
import pandas as pd


In [57]:
diabetes =pd.read_csv('diabetes.csv') 
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [58]:
# assign features to X and target 'outcome' to Y(Think why the 'outcome' column is taken as the target)
X =diabetes.iloc[:,0:8]
Y =diabetes.iloc[:,8]


In [59]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [60]:
# Create a SelectFromModel instance, passing a RandomForestClassifier with n_estimators=100 as the estimator.
sel = SelectFromModel(RandomForestClassifier(n_estimators=100))
# Fit sel on the training data only.
sel.fit(X_train,y_train)


In [61]:
sel.get_support()

array([False,  True, False, False, False,  True, False,  True])

In [62]:
selected_feat = X_train.columns[(sel.get_support())]
len(selected_feat)

3

In [63]:
print(selected_feat)

Index(['Glucose', 'BMI', 'Age'], dtype='object')


## Feature selection using SelectFromModel

SelectFromModel is a meta-transformer that can be used along with any estimator that has a `coef_` or `feature_importances_` attribute after fitting. Features are considered unimportant and removed if the corresponding `coef_` or `feature_importances_` values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.
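The threshold rule can be reproduced by hand. The sketch below uses synthetic data (an assumption for illustration) and relies on the documented default that, for an estimator without an L1 penalty such as a random forest, the threshold is the mean importance:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 6 features, 3 of them informative.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=0)

# Default threshold: keep features whose importance is at least the mean importance.
default_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0)
).fit(X_toy, y_toy)
importances = default_selector.estimator_.feature_importances_
mask_by_hand = importances >= importances.mean()

# A string heuristic: a stricter cut-off at 1.5 times the mean importance.
strict_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="1.5*mean",
).fit(X_toy, y_toy)

print(default_selector.threshold_, default_selector.get_support())
print(strict_selector.threshold_, strict_selector.get_support())
```

Fixing `random_state` makes the importances, and therefore the selected features, reproducible between runs.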


In [64]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

In [65]:
m = SelectFromModel(LinearSVC())
m.fit(X,Y)



In [66]:
# m was fitted on X, so read the column names from X
selected_feat = X.columns[m.get_support()]
print(selected_feat)

Index(['Pregnancies', 'DiabetesPedigreeFunction'], dtype='object')


# 4. Handling Multicollinearity with VIF

Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. We have perfect multicollinearity if, for example, the correlation between two independent variables is equal to 1 or −1.

Variance inflation factor measures how much the behavior (variance) of an independent variable is influenced, or inflated, by its interaction/correlation with the other independent variables.

The full definition of VIF is longer, but for now it is enough to know that the variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables.
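For feature $j$, the standard definition is $\mathrm{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ is the coefficient of determination obtained by regressing feature $j$ on all the other features. A minimal sketch of that computation on made-up data (the `vif_by_hand` helper and the synthetic columns are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up features: c is nearly a copy of a, b is independent of both.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + 0.1 * rng.normal(size=n)
X_toy = np.column_stack([a, b, c])

def vif_by_hand(X, j):
    # Regress column j on the remaining columns and read off R^2.
    others = np.delete(X, j, axis=1)
    r_squared = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r_squared)

vifs = [vif_by_hand(X_toy, j) for j in range(X_toy.shape[1])]
print(np.round(vifs, 2))
```

The two nearly identical columns get a large VIF, while the independent column stays close to 1.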

In [67]:
dia_df = pd.read_csv('diabetes_cleaned.csv')
dia_df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.0,218.937760,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,70.189298,26.6,0.351,31.0,0
2,8.0,183.0,64.0,29.0,269.968908,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94.000000,28.1,0.167,21.0,0
4,0.0,137.0,40.0,35.0,168.000000,43.1,2.288,33.0,1
...,...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,48.0,180.000000,32.9,0.171,63.0,0
764,2.0,122.0,70.0,27.0,158.815881,36.8,0.340,27.0,0
765,5.0,121.0,72.0,23.0,112.000000,26.2,0.245,30.0,0
766,1.0,126.0,60.0,29.0,173.820363,30.1,0.349,47.0,1


In [68]:
dia_df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,121.539062,72.405184,29.108073,152.222767,32.307682,0.471876,33.240885,0.348958
std,3.369578,30.49066,12.096346,8.791221,97.387162,6.986674,0.331329,11.760232,0.476951
min,0.0,44.0,24.0,7.0,-17.757186,18.2,0.078,21.0,0.0
25%,1.0,99.0,64.0,25.0,89.647494,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.202592,29.0,130.0,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,188.448695,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


As we can see, the ranges of these features are very different, which means they are on different scales, so let's standardize the features using sklearn's scale function.

reference doc: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html

In [69]:
from sklearn import preprocessing
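# Standardize every column in place to zero mean and unit variance
# (this loop also rescales Outcome, as the describe() output below shows).
for col in dia_df.columns:
    dia_df[[col]] = preprocessing.scale(dia_df[[col]].astype('float64'))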

In [70]:
dia_df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,-6.476301e-17,-4.625929e-18,6.915764e-16,-1.526557e-16,-3.4694470000000005e-17,1.272131e-16,2.174187e-16,1.931325e-16,7.401487e-17
std,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652
min,-1.141852,-2.5447,-4.004245,-2.516429,-1.746542,-2.020543,-1.189553,-1.041549,-0.7321202
25%,-0.8448851,-0.7396938,-0.695306,-0.4675972,-0.64296,-0.7172147,-0.6889685,-0.7862862,-0.7321202
50%,-0.2509521,-0.1489643,-0.01675912,-0.01230129,-0.2283386,-0.04406715,-0.3001282,-0.3608474,-0.7321202
75%,0.6399473,0.6140612,0.6282695,0.3291706,0.3722209,0.6147581,0.4662269,0.6602056,1.365896
max,3.906578,2.542136,4.102655,7.955377,7.128551,4.983056,5.883565,4.063716,1.365896


In [71]:
from sklearn.model_selection import train_test_split

In [72]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [73]:
X =dia_df.iloc[:,0:8]
Y =dia_df.iloc[:,8]
x_train,x_test,y_train,y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [74]:
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values,i) for i in range(X.shape[1])]


In [75]:
vif['Features'] = X.columns

In [76]:
vif.round(2)

Unnamed: 0,VIF Factor,Features
0,1.43,Pregnancies
1,2.07,Glucose
2,1.24,BloodPressure
3,1.43,SkinThickness
4,2.04,Insulin
5,1.58,BMI
6,1.05,DiabetesPedigreeFunction
7,1.62,Age


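* VIF = 1: Not correlated
* VIF = 1-5: Moderately correlated
* VIF > 5: Highly correlated

Glucose, Insulin, and Age have the largest VIF scores. All of them are well below 5, so the multicollinearity here is only moderate, but to illustrate the procedure we drop the feature with the highest VIF, Glucose.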



In [77]:
X = X.drop(['Glucose'],axis=1)

Now we recalculate the VIF for the remaining features.

Repeat the previous steps: assign an empty DataFrame to vif, add a new column 'VIF Factor', and compute variance_inflation_factor for each column of X.


In [78]:
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values,i) for i in range (X.shape[1])]

In [79]:
vif['Features'] = X.columns
vif.round(2)

Unnamed: 0,VIF Factor,Features
0,1.43,Pregnancies
1,1.22,BloodPressure
2,1.43,SkinThickness
3,1.15,Insulin
4,1.58,BMI
5,1.04,DiabetesPedigreeFunction
6,1.61,Age


Dropping Glucose has reduced the collinearity among the remaining features: the VIF of Insulin, for example, falls from 2.04 to 1.15.