<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Feature-Selection" data-toc-modified-id="Feature-Selection-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Feature Selection</a></span><ul class="toc-item"><li><span><a href="#Variance-Thresholding" data-toc-modified-id="Variance-Thresholding-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Variance Thresholding</a></span></li><li><span><a href="#Anova-Test" data-toc-modified-id="Anova-Test-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Anova Test</a></span></li><li><span><a href="#RFE-(Recursive-Feature-Elimination)" data-toc-modified-id="RFE-(Recursive-Feature-Elimination)-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>RFE (Recursive Feature Elimination)</a></span></li><li><span><a href="#Feature-Selection-using-RandomForest" data-toc-modified-id="Feature-Selection-using-RandomForest-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Feature Selection using RandomForest</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></li><li><span><a href="#Modelling" data-toc-modified-id="Modelling-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Modelling</a></span><ul class="toc-item"><li><span><a href="#Decision-tree-modelling" data-toc-modified-id="Decision-tree-modelling-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Decision tree modelling</a></span></li><li><span><a href="#Hyperparameter-tuning-the-decision-tree" data-toc-modified-id="Hyperparameter-tuning-the-decision-tree-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Hyperparameter tuning the decision tree</a></span></li></ul></li></ul></div>

<center><h1>Feature Selection</h1></center>

In [34]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import (
    MinMaxScaler, 
    StandardScaler
)

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics

from sklearn.model_selection import train_test_split, KFold

from sklearn.feature_selection import (
    VarianceThreshold, 
    f_classif,
    SelectKBest,
    SelectFromModel,
    RFE
)

from prettytable import PrettyTable

In [2]:
df = pd.read_csv("./Data_removed_outlier_iqr.csv")

In [3]:
df.head()

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class
0,35280,717.703,264.99525,170.035245,1.558472,0.766994,35772,211.943132,0.703616,0.986246,0.860694,0.7998,0.007511,0.001896,0.63968,0.996923,DERMASON
1,83296,1142.638,446.765889,239.013317,1.869209,0.844861,84270,325.662035,0.702588,0.988442,0.801709,0.728932,0.005364,0.000934,0.531342,0.99319,CALI
2,35594,689.634,254.572928,178.441837,1.426644,0.713214,35966,212.884213,0.811629,0.989657,0.940479,0.836241,0.007152,0.002157,0.699298,0.99765,DERMASON
3,52710,872.7,326.039383,207.39945,1.572036,0.771592,53280,259.06072,0.677419,0.989302,0.869707,0.794569,0.006186,0.001521,0.63134,0.992488,SIRA
4,62855,1004.759,413.879306,194.299306,2.130112,0.882954,63781,282.894807,0.59834,0.985482,0.782395,0.68352,0.006585,0.000887,0.4672,0.995188,HOROZ


In [4]:
X = df.drop('Class', axis=1)
y = df['Class'].astype('category').cat.codes

In [5]:
X.var(numeric_only=True)

Area               2.822575e+08
Perimeter          2.552720e+04
MajorAxisLength    4.481631e+03
MinorAxisLength    8.885643e+02
AspectRation       6.266115e-02
Eccentricity       8.710692e-03
ConvexArea         2.930358e+08
EquivDiameter      1.712622e+03
Extent             2.414141e-03
Solidity           2.150092e-05
roundness          3.647966e-03
Compactness        3.917317e-03
ShapeFactor1       9.303870e-07
ShapeFactor2       3.371308e-07
ShapeFactor3       1.007840e-02
ShapeFactor4       1.858476e-05
dtype: float64

In [6]:
scaler = MinMaxScaler(feature_range=(1, 10))
X_scaled = scaler.fit_transform(X)

In [7]:
X_scaled_df = pd.DataFrame(X_scaled, columns=df.columns[:-1].tolist())

In [8]:
X_scaled_df.head()

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4
0,2.098625,3.01691,3.087722,2.683622,4.417042,8.122872,2.109502,2.727128,5.293329,8.994033,7.665143,5.132964,6.251251,4.536541,4.656919,9.514172
1,5.648529,7.458373,7.750059,5.127365,6.406905,9.134902,5.675823,6.601082,5.263568,9.256008,6.605667,3.293449,3.512836,1.566158,2.929442,8.86855
2,2.121839,2.72353,2.820394,2.98145,3.572851,7.4239,2.123768,2.759187,8.420327,9.400973,9.098218,6.078862,5.793397,5.344295,5.607558,9.639797
3,3.387254,4.636954,4.653477,4.007355,4.503898,8.182628,3.396961,4.332238,4.53494,9.358605,7.827025,4.997184,4.560889,3.378249,4.523934,8.747308
4,4.137291,6.017248,6.906533,3.543245,8.077647,9.629996,4.169156,5.144171,2.24558,8.902799,6.258752,2.114683,5.069827,1.41947,1.906669,9.214072


In [9]:
X_scaled_df.var()

Area               1.542790
Perimeter          2.788754
MajorAxisLength    2.948464
MinorAxisLength    1.115265
AspectRation       2.569564
Eccentricity       1.471410
ConvexArea         1.584579
EquivDiameter      1.987491
Extent             2.023309
Solidity           0.306082
roundness          1.176912
Compactness        2.639371
ShapeFactor1       1.512696
ShapeFactor2       3.215439
ShapeFactor3       2.562470
ShapeFactor4       0.555753
dtype: float64

## Variance Thresholding

If a feature's variance is zero or close to zero, the feature is approximately constant and carries little information for the model, so it can be removed.

The variance is also very low when only a handful of observations differ from an otherwise constant value.

Because variance depends on the scale of each feature, the features were first rescaled to a common range (the `MinMaxScaler` step above). We can then set a threshold and drop the features whose variance falls below it.

In [10]:
vt = VarianceThreshold(threshold=1)
X_scaled_var_feats = vt.fit_transform(X_scaled)

In [11]:
X_scaled_var_feats_df = pd.DataFrame(
    X_scaled_var_feats, 
    columns=X_scaled_df.columns[vt.get_support(indices=True)]
)

In [12]:
X_scaled_var_feats_df.head()

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3
0,2.098625,3.01691,3.087722,2.683622,4.417042,8.122872,2.109502,2.727128,5.293329,7.665143,5.132964,6.251251,4.536541,4.656919
1,5.648529,7.458373,7.750059,5.127365,6.406905,9.134902,5.675823,6.601082,5.263568,6.605667,3.293449,3.512836,1.566158,2.929442
2,2.121839,2.72353,2.820394,2.98145,3.572851,7.4239,2.123768,2.759187,8.420327,9.098218,6.078862,5.793397,5.344295,5.607558
3,3.387254,4.636954,4.653477,4.007355,4.503898,8.182628,3.396961,4.332238,4.53494,7.827025,4.997184,4.560889,3.378249,4.523934
4,4.137291,6.017248,6.906533,3.543245,8.077647,9.629996,4.169156,5.144171,2.24558,6.258752,2.114683,5.069827,1.41947,1.906669


**Observation**

 - Two low-variance features, `Solidity` and `ShapeFactor4`, are removed.
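As a minimal sketch of why the scaling step matters, the toy example below (made-up data, hypothetical column names, and an arbitrary threshold of 5) compares raw variances with variances after min-max scaling and then lists the columns that `VarianceThreshold` keeps and drops. Note that `VarianceThreshold` uses the population variance (`ddof=0`), whereas `DataFrame.var()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Toy data (illustrative only): two columns with the same shape but different
# units, and one column that barely moves around a constant value.
toy = pd.DataFrame({
    "spread": np.arange(10, dtype=float),
    "large_units": np.arange(10, dtype=float) * 1000.0,
    "near_constant": [5.0] * 8 + [4.0, 6.0],
})

# Raw variances are dominated by the units of each column.
raw_var = toy.var(ddof=0)
print(raw_var)

# After scaling every column to the same range the variances become comparable.
toy_scaled = MinMaxScaler(feature_range=(1, 10)).fit_transform(toy)

vt_toy = VarianceThreshold(threshold=5.0)  # threshold chosen for this toy data
vt_toy.fit(toy_scaled)

mask = vt_toy.get_support()            # boolean mask of kept columns
kept = toy.columns[mask].tolist()
dropped = toy.columns[~mask].tolist()

print("scaled variances:", vt_toy.variances_)
print("kept:", kept)
print("dropped:", dropped)
```

The same `get_support()` mask can be applied to the real `X_scaled_df.columns` to list which features a given threshold removes.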

## Anova Test

Analysis of variance (ANOVA) is a statistical technique for checking whether the means of two or more groups differ significantly from each other. Here the groups are the bean classes: for each feature, the F-test compares the variation between the class means with the variation within each class, so a higher score indicates a feature that separates the classes better.

If the features were categorical we would use the $\chi^2$ test instead. Since all our features are numeric, we use the ANOVA F-test.
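To make the score concrete, here is a small pure-Python sketch of the one-way ANOVA F-statistic for a single feature. The helper name `anova_f` and the toy groups are illustrative assumptions; in practice `f_classif` computes this kind of statistic for every feature at once.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a single feature.

    `groups` is a list of lists: the feature values of each class.
    """
    k = len(groups)                       # number of classes
    n = sum(len(g) for g in groups)       # total number of samples
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: how far each class mean is from the grand mean.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of the values around their own class mean.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    return (ssb / (k - 1)) / (ssw / (n - k))


# A feature whose class means differ clearly ...
f_informative = anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
# ... and one whose class means are identical.
f_uninformative = anova_f([[1, 5, 3], [2, 4, 3], [2, 3, 4]])

print(f_informative, f_uninformative)
```

A large F-value means the class means are far apart relative to the within-class spread, which is why `SelectKBest` keeps the features with the highest scores.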

In [13]:
anova_filter = SelectKBest(score_func=f_classif, k=8)
anova_filter.fit(X, y)

SelectKBest(k=8)

In [14]:
anova_df = pd.DataFrame({
    'features': X.columns[anova_filter.get_support(indices=True)],
    'scores': anova_filter.scores_[anova_filter.get_support(indices=True)]
})

In [15]:
anova_df

Unnamed: 0,features,scores
0,Area,12106.046095
1,Perimeter,13571.034479
2,MajorAxisLength,13811.838587
3,AspectRation,10508.820302
4,ConvexArea,12251.935448
5,EquivDiameter,12284.5185
6,Compactness,10309.663853
7,ShapeFactor2,10833.619482


**Observation**

 - We have selected the 8 best features according to their ANOVA F-scores.
 - We could also select more features. The number of features can be thought of as a hyperparameter to be tuned, i.e. by trying models with different numbers of features.
 - I chose 8 because in the previous EDA notebook we saw that 8 principal components could be used to describe the whole dataset.
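One possible way to treat the number of features as a hyperparameter is sketched below: `SelectKBest` is placed in a `Pipeline` and `k` is searched with `GridSearchCV`. The synthetic data from `make_classification`, the grid of `k` values, and the decision tree classifier are assumptions made for illustration, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=2, random_state=0
)

# Putting the selector inside the pipeline means it is refitted on each
# training fold, so the held-out fold never influences the selection.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

k_grid = [2, 4, 6, 8, 10]
search = GridSearchCV(
    pipe,
    param_grid={"select__k": k_grid},
    scoring="f1_weighted",
    cv=5,
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["select__k"]
print(f"best k: {best_k}, mean weighted F1: {search.best_score_:.3f}")
```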

## RFE (Recursive Feature Elimination)

Recursive Feature Elimination (RFE) selects features by recursively considering smaller and smaller subsets, pruning the least important feature at each step. A model is fitted iteratively: in each iteration the features are ranked by the importance the model assigns to them (here the decision tree's `feature_importances_`), the weakest one is dropped, and the process repeats until the requested number of features remains. Each feature is then ranked by its elimination order, with the selected features sharing rank 1. Because one feature is removed per iteration (the default `step=1`), a dataset with $N$ features needs at most about $N$ model fits, so the search is greedy rather than exhaustive over all feature subsets.
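The elimination loop can be sketched by hand as follows. This is a simplified illustration on synthetic data, not scikit-learn's actual implementation: the helper `manual_rfe` is hypothetical, and it removes exactly one feature per iteration based on a decision tree's `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def manual_rfe(X, y, n_features_to_select, random_state=0):
    """Greedy backward elimination: drop the least important feature each round."""
    remaining = list(range(X.shape[1]))   # column indices still in play
    eliminated = []                       # columns in the order they were dropped
    while len(remaining) > n_features_to_select:
        model = DecisionTreeClassifier(random_state=random_state)
        model.fit(X[:, remaining], y)
        # Position (within `remaining`) of the feature with the lowest importance.
        worst = int(np.argmin(model.feature_importances_))
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


# Synthetic stand-in data: 3 informative features and 3 noise features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

kept, dropped = manual_rfe(X_demo, y_demo, n_features_to_select=3)
print("kept columns:", kept)
print("dropped (first eliminated first):", dropped)
```

Each pass through the loop corresponds to one "Fitting estimator with ... features." line in the verbose output of `RFE`.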

In [16]:
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=8, verbose=3)
rfe.fit(X, y)

Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.


RFE(estimator=DecisionTreeClassifier(), n_features_to_select=8, verbose=3)

In [19]:
rfe.get_feature_names_out()

array(['Perimeter', 'MajorAxisLength', 'MinorAxisLength', 'Solidity',
       'roundness', 'ShapeFactor1', 'ShapeFactor3', 'ShapeFactor4'],
      dtype=object)

In [20]:
rfe_df = pd.DataFrame(
    {
        'features': X.columns.to_list(),
        'rank': rfe.ranking_,
        'selected': rfe.support_
    }
)

In [21]:
rfe_df

Unnamed: 0,features,rank,selected
0,Area,8,False
1,Perimeter,1,True
2,MajorAxisLength,1,True
3,MinorAxisLength,1,True
4,AspectRation,9,False
5,Eccentricity,6,False
6,ConvexArea,5,False
7,EquivDiameter,2,False
8,Extent,4,False
9,Solidity,1,True


**Observation**

 - The features marked `True` in the `selected` column (rank 1) are the ones RFE kept.

## Feature Selection using RandomForest

Feature selection using a random forest falls under the category of embedded methods. Embedded methods combine qualities of filter and wrapper methods: the selection is carried out by an algorithm with its own built-in measure of feature importance, computed while the model is trained. Some commonly cited benefits of embedded methods are:

1. They tend to be accurate, because the importances come from a model fitted to the target.
2. They tend to generalize well.
3. They are interpretable, since each feature receives an importance score.

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [23]:
selector = SelectFromModel(estimator=RandomForestClassifier(n_estimators=100, verbose=1, n_jobs=-1))
selector.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.5s finished


SelectFromModel(estimator=RandomForestClassifier(n_jobs=-1, verbose=1))

In [24]:
print(f"Selected features are: {selector.get_feature_names_out()}")

Selected features are: ['Perimeter' 'MajorAxisLength' 'MinorAxisLength' 'AspectRation'
 'ConvexArea' 'Compactness' 'ShapeFactor1' 'ShapeFactor3']
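No number of features was passed to `SelectFromModel` above. When no `threshold` is given and the estimator exposes `feature_importances_`, scikit-learn uses the mean importance as the cut-off. The sketch below checks this on synthetic data; the dataset and the forest settings are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

selector_demo = SelectFromModel(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0)
)
selector_demo.fit(X_demo, y_demo)

# Importances of the forest that the selector fitted internally.
importances = selector_demo.estimator_.feature_importances_

# Reproduce the selection by hand: keep features at or above the threshold.
manual_mask = importances >= selector_demo.threshold_

print("threshold used:", selector_demo.threshold_)
print("mean importance:", importances.mean())
print("selected columns:", np.flatnonzero(selector_demo.get_support()))
```

Passing an explicit `threshold` (for example `"median"`) or `max_features` to `SelectFromModel` changes how many features are kept.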


## Conclusion

Below is a summary of all the methods used and the features each one selected.

In [25]:
methods = [
    'Variance Thresholding (threshold = 1)',
    'ANOVA F-test',
    'Recursive Feature Elimination (estimator = DecisionTreeClassifier)',
    'Using RandomForest feature importance'
]

features = [
    X_scaled_var_feats_df.columns.to_list(),
    X.columns[anova_filter.get_support(indices=True)].to_list(),
    rfe.get_feature_names_out().tolist(),
    selector.get_feature_names_out().tolist()
]

In [26]:
table = PrettyTable(['Methods', 'Features Selected'])
for m, f in zip(methods, features):
    table.add_row([m, f])

In [27]:
table

Methods,Features Selected
Variance Thresholding (threshold = 1),"['Area', 'Perimeter', 'MajorAxisLength', 'MinorAxisLength', 'AspectRation', 'Eccentricity', 'ConvexArea', 'EquivDiameter', 'Extent', 'roundness', 'Compactness', 'ShapeFactor1', 'ShapeFactor2', 'ShapeFactor3']"
ANOVA F-test,"['Area', 'Perimeter', 'MajorAxisLength', 'AspectRation', 'ConvexArea', 'EquivDiameter', 'Compactness', 'ShapeFactor2']"
Recursive Feature Elimination (estimator = DecisionTreeClassifier),"['Perimeter', 'MajorAxisLength', 'MinorAxisLength', 'Solidity', 'roundness', 'ShapeFactor1', 'ShapeFactor3', 'ShapeFactor4']"
Using RandomForest feature importance,"['Perimeter', 'MajorAxisLength', 'MinorAxisLength', 'AspectRation', 'ConvexArea', 'Compactness', 'ShapeFactor1', 'ShapeFactor3']"


# Modelling

Let's start with the features selected using the ANOVA F-test.

Our metric of choice is the weighted F1-score: the F1-score is computed per class and then averaged, with each class weighted by its number of true samples.
Here's a very good explanation of it: [link](https://stats.stackexchange.com/questions/463224/which-performance-metrics-for-highly-imbalanced-multiclass-dataset)
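As a rough illustration of how the weighted F1-score is put together, the pure-Python sketch below computes it by hand on made-up labels. The helper `weighted_f1` is hypothetical and only meant to mirror the definition (per-class F1, weighted by the number of true samples of each class). It also shows that swapping the true and predicted labels changes the result, because the weights come from the true labels.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true support."""
    classes = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    total = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0   # F1 = 2TP / (2TP + FP + FN)
        support = tp + fn                       # number of true samples of class c
        total += support / n * f1
    return total


# Made-up labels for three classes.
y_true_demo = [0, 0, 0, 0, 1, 2]
y_pred_demo = [0, 0, 1, 1, 1, 2]

score_correct = weighted_f1(y_true_demo, y_pred_demo)   # 25/36
score_swapped = weighted_f1(y_pred_demo, y_true_demo)   # 23/36

print(round(score_correct, 4), round(score_swapped, 4))
```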

In [28]:
feats = anova_filter.get_feature_names_out().tolist()

In [29]:
X[feats].head(10)

Unnamed: 0,Area,Perimeter,MajorAxisLength,AspectRation,ConvexArea,EquivDiameter,Compactness,ShapeFactor2
0,35280,717.703,264.99525,1.558472,35772,211.943132,0.7998,0.001896
1,83296,1142.638,446.765889,1.869209,84270,325.662035,0.728932,0.000934
2,35594,689.634,254.572928,1.426644,35966,212.884213,0.836241,0.002157
3,52710,872.7,326.039383,1.572036,53280,259.06072,0.794569,0.001521
4,62855,1004.759,413.879306,2.130112,63781,282.894807,0.68352,0.000887
5,36112,723.077,225.328776,1.100957,36709,214.427672,0.951621,0.003156
6,59442,975.979,402.887444,2.118794,60289,275.107079,0.682839,0.000909
7,58931,957.164,388.354344,2.001383,59526,273.922032,0.70534,0.001006
8,34010,681.989,257.82964,1.528818,34410,208.093433,0.807097,0.001984
9,27280,603.203,223.021086,1.428199,27550,186.370531,0.835663,0.002459


In [32]:
X_ = X[feats].values
y_ = y.values

In [66]:
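def cross_validate(X, y, model):
    """Run 10-fold cross-validation and print the weighted F1-score of each fold."""
    kfold = KFold(n_splits=10)
    # enumerate from 1 so the fold counter is maintained automatically
    for idx, (train_index, test_index) in enumerate(kfold.split(X=X, y=y), start=1):
        # positional indexing, so this works for any DataFrame/Series index
        X_train_, X_test_ = X.iloc[train_index, :], X.iloc[test_index, :]
        y_train_, y_test_ = y.iloc[train_index], y.iloc[test_index]

        model.fit(X_train_, y_train_)
        y_pred_ = model.predict(X_test_)

        # f1_score expects (y_true, y_pred); the weights come from the true labels
        score = metrics.f1_score(y_test_, y_pred_, average='weighted')
        print(f"[FOLD {idx}] Weighted F1-score: {round(score, 3)} ")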
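For comparison, the same kind of evaluation can be sketched with scikit-learn's built-in `cross_val_score` and the `"f1_weighted"` scorer. The synthetic data and the use of `StratifiedKFold` with shuffling are assumptions made for this example; stratification keeps the class proportions similar across folds, which can help when classes are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in data (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Stratified, shuffled folds with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(
    DecisionTreeClassifier(random_state=0),
    X_demo,
    y_demo,
    cv=cv,
    scoring="f1_weighted",
)

for fold, score in enumerate(scores, start=1):
    print(f"[FOLD {fold}] Weighted F1-score: {score:.3f}")
print(f"Mean: {scores.mean():.3f} (std {scores.std():.3f})")
```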

## Decision tree modelling

In [67]:
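# Note: this call uses all columns of X; pass X[feats] to evaluate only the ANOVA-selected features
cross_validate(X, y, DecisionTreeClassifier())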

[FOLD 1] Weighted F1-score: 0.897 
[FOLD 2] Weighted F1-score: 0.874 
[FOLD 3] Weighted F1-score: 0.895 
[FOLD 4] Weighted F1-score: 0.902 
[FOLD 5] Weighted F1-score: 0.901 
[FOLD 6] Weighted F1-score: 0.885 
[FOLD 7] Weighted F1-score: 0.886 
[FOLD 8] Weighted F1-score: 0.9 
[FOLD 9] Weighted F1-score: 0.882 
[FOLD 10] Weighted F1-score: 0.885 


## Hyperparameter tuning the decision tree