## Feature Selection
#### Feature selection is the process of reducing the number of input variables when developing a predictive model. Fewer input variables lower the computational cost of modeling and, in some cases, improve the performance of the model.

## Variance Threshold
________
#### Variance threshold is a simple baseline approach to feature selection. It removes every feature whose variance does not exceed a chosen threshold, on the assumption that features with a higher variance may contain more useful information.
#### Here it is used as a feature selector that removes all low-variance features before the classifiers are trained.

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/10/Image-6-1.png)

#### Features whose variance is lower than or equal to the threshold value are marked 'False' in the boolean array returned by .get_support() and are dropped.
#### After dropping these features, we check the accuracy on the reduced dataset and compare it with the accuracy on the original dataset.
#### To choose the threshold values, I experimented with several numbers and kept the ones that suited each dataset.
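As a minimal sketch of the masking behaviour described above, the toy frame below has population variances of 0, 0.25 and 5. The columns `a`, `b` and `c` are made up and do not come from any dataset used in this notebook. With a threshold of 0.25 only `c` survives, because a feature is kept only when its variance is strictly greater than the threshold.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data for illustration only; the column names are hypothetical.
toy = pd.DataFrame({
    "a": [0, 0, 0, 0],   # constant column, variance 0
    "b": [0, 1, 0, 1],   # population variance 0.25
    "c": [0, 2, 4, 6],   # population variance 5
})

selector = VarianceThreshold(threshold=0.25)
selector.fit(toy)

mask = selector.get_support()        # boolean array, one entry per column
kept = toy.columns[mask].tolist()    # names of the surviving features

print(selector.variances_)  # per-feature variances computed during fit
print(mask)                 # False marks a feature that will be dropped
print(kept)
```

Column `b` sits exactly on the threshold and is still dropped, which is the "lower than or equal to" case.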

__________

### Importing the required libraries 

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 
import warnings
warnings.filterwarnings("ignore")


from sklearn.model_selection import train_test_split
def split(df,label):
    X_tr, X_te, Y_tr, Y_te = train_test_split(df, label, test_size=0.25, random_state=42)
    return X_tr, X_te, Y_tr, Y_te


from sklearn.feature_selection import VarianceThreshold
def variance_threshold(df,th):
    var_thres=VarianceThreshold(threshold=th)
    var_thres.fit(df)
    new_cols = var_thres.get_support()
    return df.iloc[:,new_cols]
    

# Classifiers and metrics used below (unused imports removed).
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

classifiers = ['LinearSVM', 'RadialSVM', 
               'Logistic',  'RandomForest', 
               'AdaBoost',  'DecisionTree', 
               'KNeighbors','GradientBoosting']

models = [svm.SVC(kernel='linear'),
          svm.SVC(kernel='rbf'),
          LogisticRegression(max_iter = 1000),
          RandomForestClassifier(n_estimators=200, random_state=0),
          AdaBoostClassifier(random_state = 0),
          DecisionTreeClassifier(random_state=0),
          KNeighborsClassifier(),
          GradientBoostingClassifier(random_state=0)]

def acc_score(df,label):
    # Accuracy of every classifier on a single train/test split of df.
    # `classifiers` and `models` must be in the same order.
    Score = pd.DataFrame({"Classifier":classifiers})
    acc = []
    X_train,X_test,Y_train,Y_test = split(df,label)
    for model in models:
        model.fit(X_train,Y_train)
        predictions = model.predict(X_test)
        acc.append(accuracy_score(Y_test,predictions))
    Score["Accuracy"] = acc
    # Best classifier first.
    Score.sort_values(by="Accuracy", ascending=False,inplace = True)
    Score.reset_index(drop=True, inplace=True)
    return Score

def acc_score_thr(df,label,thr_list):
    # One accuracy column per threshold; rows follow the order of `models`.
    Score = pd.DataFrame({"Classifier":classifiers})
    for th in thr_list:
        # Drop low-variance features, then split and evaluate as in acc_score().
        df2 = variance_threshold(df,th)
        X_train,X_test,Y_train,Y_test = split(df2,label)
        acc = []
        for model in models:
            model.fit(X_train,Y_train)
            predictions = model.predict(X_test)
            acc.append(accuracy_score(Y_test,predictions))
        Score[str(th)] = acc
    return Score

        
def plot2(df,l1,l2,p1,p2,c = "b"):
    # First column holds the classifier names; the rest are threshold columns.
    feat = df.columns.tolist()[1:]
    plt.figure(figsize = (16, 18))
    for j in range(df.shape[0]):
        # Accuracies of classifier j across all thresholds.
        # df.iloc[j, i] avoids the deprecated positional lookup df.iloc[j][i].
        value = df.iloc[j, 1:].tolist()
        plt.subplot(4, 4,j+1)
        ax = sns.pointplot(x=feat, y=value,color = c )
        # Label the subplot with the classifier name at position (p1, p2).
        plt.text(p1,p2,df.iloc[j, 0])
        plt.xticks(rotation=90)
        ax.set(ylim=(l1,l2))
        

def highlight_max(data, color='aquamarine'):
    attr = 'background-color: {}'.format(color)
    if data.ndim == 1:  
        is_max = data == data.max()
        return [attr if v else '' for v in is_max]
    else: 
        is_max = data == data.max().max()
        return pd.DataFrame(np.where(is_max, attr, ''),
                            index=data.index, columns=data.columns)

_______
### Function Description
#### 1. split():
Splits the dataset into training and test set.
#### 2. variance_threshold():
Returns the dataframe after dropping features whose variance is lower than or equal to the threshold value.
#### 3. acc_score():
Returns accuracy for all the classifiers.
#### 4. acc_score_thr():
Returns accuracy for all the classifiers for the respective threshold value.
#### 5. plot2():
For plotting the results.
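
The helpers above fit `VarianceThreshold` on the full dataframe before it is split. A possible variation, sketched here with a hypothetical helper named `variance_threshold_train_only` that is not part of this notebook, fits the selector on the training split only and applies the same column mask to the test split, so the test rows play no part in choosing the features. The data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split


def variance_threshold_train_only(X_train, X_test, th):
    """Hypothetical helper: learn the mask on X_train, apply it to both splits."""
    selector = VarianceThreshold(threshold=th)
    selector.fit(X_train)
    mask = selector.get_support()
    return X_train.loc[:, mask], X_test.loc[:, mask]


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "wide": rng.normal(0, 1.0, 200),     # variance close to 1
    "narrow": rng.normal(0, 0.01, 200),  # variance close to 0.0001
})
demo_label = rng.integers(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo, demo_label, test_size=0.25, random_state=42
)
X_tr_sel, X_te_sel = variance_threshold_train_only(X_tr, X_te, 0.01)
print(X_tr_sel.columns.tolist(), X_te_sel.shape)
```

Whether this changes the scores depends on the dataset; it is shown as an alternative design, not as the method used for the results reported here.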
___________
### The following 3 datasets are used:
1. Breast Cancer
2. Parkinson's Disease
3. PCOS
________
### Plan of action:
* Looking at dataset (includes a little preprocessing)
* Checking Accuracy (comparing accuracies with the new dataset)
* Visualization (Plotting the graphs)
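
The three steps above can be sketched end to end on the breast cancer dataset bundled with scikit-learn (`load_breast_cancer`), used here only so that the example runs without the Kaggle CSV files. The threshold of 0.01 and the single random forest are illustrative choices, the helper `holdout_accuracy` is hypothetical, and the printed numbers are not expected to match the tables in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Look at the dataset.
bunch = load_breast_cancer(as_frame=True)
X, y = bunch.data, bunch.target
print(X.shape[0], "records,", X.shape[1], "features")


def holdout_accuracy(features, target):
    """Train on 75% of the rows and return accuracy on the remaining 25%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, target, test_size=0.25, random_state=42
    )
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


# 2. Check accuracy with all features and after the variance filter.
mask = VarianceThreshold(threshold=0.01).fit(X).get_support()
X_reduced = X.loc[:, mask]
results = pd.Series({
    "all features": holdout_accuracy(X, y),
    "variance > 0.01": holdout_accuracy(X_reduced, y),
})
print(X_reduced.shape[1], "features kept")
print(results)

# 3. Visualization: results.plot(kind="bar") would chart the two scores.
```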
_______

______________
# Breast Cancer
_____________

### 1. Looking at dataset

In [None]:
data_bc = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")
label_bc = data_bc["diagnosis"]
label_bc = np.where(label_bc == 'M',1,0)
data_bc.drop(["id","diagnosis","Unnamed: 32"],axis = 1,inplace = True)

print("Breast Cancer dataset:\n",data_bc.shape[0],"Records\n",data_bc.shape[1],"Features")

In [None]:
display(data_bc.head())
print("All the features in this dataset have continuous values")

### 2. Checking Accuracy

In [None]:
score1 = acc_score(data_bc,label_bc)
score1

In [None]:
threshold_bc = [0.04,0.02,0.01,0.008,0.004,0.001]
# `classifiers` must stay in the same order as `models`. Reassigning it from the
# sorted score table would attach accuracies to the wrong classifier names.
score_bc = acc_score_thr(data_bc,label_bc,threshold_bc)
score_bc.style.apply(highlight_max, subset = score_bc.columns[1:], axis=None)

#### Best accuracy with all features: RandomForest Classifier - 0.972
#### Best accuracy after applying VarianceThreshold(): LinearSVM - 0.979, for threshold = (0.04, 0.02, 0.01, 0.008)
#### Here we see only a slight improvement, from 0.972 to 0.979.

### 3. Visualization

In [None]:
plot2(score_bc,0.90,1,2.5,0.91,c = "gold")

______
# Parkinson's Disease
_______

### 1. Looking at dataset

In [None]:
data_pd = pd.read_csv("../input/parkinson-disease-detection/Parkinsson disease.csv")
label_pd = data_pd["status"]
data_pd.drop(["status","name"],axis = 1,inplace = True)

print("Parkinson's disease dataset:\n",data_pd.shape[0],"Records\n",data_pd.shape[1],"Features")

In [None]:
display(data_pd.head())
print("All the features in this dataset have continuous values")

### 2. Checking Accuracy

In [None]:
score3 = acc_score(data_pd,label_pd)
score3

In [None]:
threshold_pd = [0.05,0.01,0.005,0.001,0.0001,0.0005]
# `classifiers` must stay in the same order as `models`. Reassigning it from the
# sorted score table would attach accuracies to the wrong classifier names.
score_pd = acc_score_thr(data_pd,label_pd,threshold_pd)
score_pd.style.apply(highlight_max, subset = score_pd.columns[1:], axis=None)

#### Best accuracy with all features: RandomForest Classifier - 0.918
#### Best accuracy after applying VarianceThreshold(): RandomForest Classifier - 0.938, for threshold = (0.005, 0.0001, 0.0005), and LinearSVM - 0.938, for threshold = (0.0005)
#### Here we can see an improvement of about 2 percentage points.

### 3. Visualization

In [None]:
plot2(score_pd,0.65,1.0,1,0.7,c = "orange")

________
# PCOS
________

### 1. Looking at dataset

In [None]:
data_pcos = pd.read_csv("../input/pcos-dataset/PCOS_data.csv")
label_pcos = data_pcos["PCOS (Y/N)"]
data_pcos.drop(["Sl. No","Patient File No.","PCOS (Y/N)","Unnamed: 44","II    beta-HCG(mIU/mL)","AMH(ng/mL)"],axis = 1,inplace = True)
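# Fill missing values: median for marriage duration, 1 for the fast-food flag.
# Assigning the result back avoids chained in-place fillna on a column.
data_pcos["Marraige Status (Yrs)"] = data_pcos["Marraige Status (Yrs)"].fillna(data_pcos["Marraige Status (Yrs)"].median())
data_pcos["Fast food (Y/N)"] = data_pcos["Fast food (Y/N)"].fillna(1)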

print("PCOS dataset:\n",data_pcos.shape[0],"Records\n",data_pcos.shape[1],"Features")

In [None]:
display(data_pcos.head())
print("The features in this dataset have both discrete and continuous values")

### 2. Checking Accuracy

In [None]:
score4 = acc_score(data_pcos,label_pcos)
score4

In [None]:
threshold_pcos = [0.17,0.19,0.21,0.23,0.5,0.8]
# `classifiers` must stay in the same order as `models`. Reassigning it from the
# sorted score table would attach accuracies to the wrong classifier names.
score_pcos = acc_score_thr(data_pcos,label_pcos,threshold_pcos)
score_pcos.style.apply(highlight_max, subset = score_pcos.columns[1:], axis=None)

#### Best accuracy with all features: RandomForest Classifier - 0.889
#### Best accuracy after applying VarianceThreshold(): DecisionTree Classifier - 0.897, for threshold = (0.19)
#### Here we can see an improvement of about 1 percentage point.

### 3. Visualization

In [None]:
plot2(score_pcos,0.3,1.0,1,0.35,c = "limegreen")

________

#### These results show that removing features with low variance can slightly improve accuracy: the best score rose by roughly 1 to 2 percentage points on each of the three datasets.
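
To see how many features a given threshold actually removes, it can help to count the survivors per threshold. The helper below is a hypothetical addition (`features_kept` is not defined in this notebook). It compares each column's population variance with the threshold, mirroring the strictly-greater-than rule that `VarianceThreshold` applies. The demo frame is made up.

```python
import pandas as pd


def features_kept(df, thresholds):
    """Hypothetical helper: number of columns whose variance exceeds each threshold."""
    # ddof=0 gives the population variance (divide by n rather than n - 1).
    variances = df.var(ddof=0)
    return pd.Series({th: int((variances > th).sum()) for th in thresholds})


# Made-up data for illustration only.
demo = pd.DataFrame({
    "a": [0, 0, 0, 0],   # variance 0
    "b": [0, 1, 0, 1],   # variance 0.25
    "c": [0, 2, 4, 6],   # variance 5
})
counts = features_kept(demo, [0.1, 0.25, 10])
print(counts)
```

A threshold that keeps zero features is not usable with `VarianceThreshold`, which raises an error when no feature meets the threshold.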
#### Links to other feature selection methods:
##### [Genetic Algorithm](https://www.kaggle.com/tanmayunhale/genetic-algorithm-for-feature-selection)
##### [Pearson Correlation](https://www.kaggle.com/tanmayunhale/feature-selection-pearson-correlation)
##### [F-score](https://www.kaggle.com/tanmayunhale/feature-selection-f-score)
