## FEATURE SELECTION

* Proper feature selection can:
  * Reduce overfitting
  * Improve accuracy
  * Reduce training time
  * Improve interpretability

* Feature Selection Algorithms:
1. Filter Methods 
2. Wrapper Methods 
3. Embedded Methods 

## FILTER METHODS
* Rank each feature with a statistical measure, independently of any model
* Two filter methods are covered here:
  * Variance Threshold
  * Univariate Techniques

### VARIANCE THRESHOLD 
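* Ranks features by their variance, on the assumption that higher variance means more potentially predictive information
* A feature with zero variance (a constant) has no predictive power at all
* Variance is computed from the features alone; the target is never used
* It is very sensitive to scale, so features should be brought to a common range (for example with min-max scaling) before their variances are compared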
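
The two points above (constant features are dropped, and variance depends on scale) can be sketched on a toy array. The data and the names `X_toy`, `vt_toy`, and `ratio` are made up for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical): column 0 is constant, columns 1 and 2 hold
# the same measurements expressed in units that differ by a factor of 100.
X_toy = np.array([
    [1.0, 0.1, 10.0],
    [1.0, 0.2, 20.0],
    [1.0, 0.3, 30.0],
    [1.0, 0.4, 40.0],
])

# With the default threshold of 0.0, only zero-variance columns are removed.
vt_toy = VarianceThreshold()
X_kept = vt_toy.fit_transform(X_toy)
print(vt_toy.get_support())   # boolean mask of the columns that were kept
print(X_kept.shape)

# Same information, different units: the variance grows with the square of the scale.
var_small = X_toy[:, 1].var()
var_large = X_toy[:, 2].var()
ratio = var_large / var_small
print(ratio)
```

Because a change of units alone multiplies the variance, comparing raw variances across features only makes sense once they share a common scale.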

### UNIVARIATE TECHNIQUES 
* Score the relationship between each individual feature and the target, one feature at a time (for example with an ANOVA F-test or mutual information)

## WRAPPER METHODS 
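* Train a model on combinations of features and use the model itself to judge them
* The example used here is Recursive Feature Elimination (RFE), which repeatedly fits the model and drops the weakest feature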
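
To make the idea of recursive elimination concrete, here is a simplified hand-written loop. It is a sketch, not scikit-learn's `RFE` implementation. It assumes scikit-learn's bundled `load_iris` data as a stand-in for the CSV file, a `LogisticRegression` model, and the sum of squared coefficients as the importance measure; the names `remaining`, `eliminated`, and `ranking` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_demo, y_demo = iris.data, iris.target

remaining = list(range(X_demo.shape[1]))  # indices of features still in play
eliminated = []                           # indices in the order they were dropped

while len(remaining) > 1:
    # Refit the model on the features that are still in play
    model = LogisticRegression(max_iter=1000).fit(X_demo[:, remaining], y_demo)
    # Assumed importance measure: squared coefficients summed over the classes
    importance = (model.coef_ ** 2).sum(axis=0)
    weakest = remaining[int(np.argmin(importance))]
    remaining.remove(weakest)
    eliminated.append(weakest)

# The last survivor ranks first, the first feature dropped ranks last
ranking = remaining + eliminated[::-1]
print([iris.feature_names[i] for i in ranking])
```

The resulting order depends on the model and the importance measure, so a different estimator can produce a different ranking.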

## EMBEDDED METHODS 
* For some models, such as decision tree and random forest regressors (dtr, rfr), feature importance is obtained as a by-product of training the model itself
* Not all models can rank features during training
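
As a rough illustration of the last point, the check below compares a decision tree with a k-nearest-neighbours classifier. It assumes scikit-learn's bundled `load_iris` data and uses classifiers rather than the regressors named above, since the target is a class label; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
knn = KNeighborsClassifier().fit(X_demo, y_demo)

# A fitted tree exposes importances learned during training ...
tree_can_rank = hasattr(tree, "feature_importances_")
# ... while k-nearest neighbours exposes neither importances nor coefficients
knn_can_rank = hasattr(knn, "feature_importances_") or hasattr(knn, "coef_")

print(tree_can_rank, knn_can_rank)
print(tree.feature_importances_)  # one value per feature
```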

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns

In [2]:
df= pd.read_csv('C:/Users/Palla Anuraag Sharma/Downloads/Datacamp/Datasets/Iris DataSet/Iris.csv')

In [4]:
df = df.set_index('Id')

In [5]:
df.sample(5)

Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
51,7.0,3.2,4.7,1.4,Iris-versicolor
103,7.1,3.0,5.9,2.1,Iris-virginica
82,5.5,2.4,3.7,1.0,Iris-versicolor
64,6.1,2.9,4.7,1.4,Iris-versicolor
91,5.5,2.6,4.4,1.2,Iris-versicolor


In [7]:
x = df[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]

In [8]:
from sklearn.preprocessing import LabelEncoder 
le = LabelEncoder()
df['Species_cat'] = le.fit_transform(df.Species)

In [9]:
y = df['Species_cat']

### Filter Methods 

### Variance Threshold

In [12]:
# Without feature Scaling 
from sklearn.feature_selection import VarianceThreshold 
vt = VarianceThreshold()
vt.fit_transform(x)
for var, name in zip(vt.variances_, x):
    print(f'{name:>10} variance = {var:5.3f}')

SepalLengthCm variance = 0.681
SepalWidthCm variance = 0.187
PetalLengthCm variance = 3.092
PetalWidthCm variance = 0.579


In [14]:
# With Feature Scaling 
from sklearn.preprocessing import MinMaxScaler
x_scaled = MinMaxScaler().fit_transform(x)
vt.fit_transform(x_scaled)
for var,name in zip(vt.variances_,x):
    print(f'{name:>10} variance = {var:5.3f}')

SepalLengthCm variance = 0.053
SepalWidthCm variance = 0.032
PetalLengthCm variance = 0.089
PetalWidthCm variance = 0.100


In [15]:
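# After min-max scaling, PetalWidthCm has the highest variance (0.100), with PetalLengthCm close behind (0.089).
# Variance is computed without the target, so it measures spread, not how strongly a feature affects the target.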
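
The cells above only print variances. To actually drop low-variance features, a threshold can be passed to `VarianceThreshold`. The sketch below assumes scikit-learn's bundled `load_iris` data as a stand-in for the CSV file (its column names differ, and its values may differ slightly) and an arbitrary cutoff of 0.04 chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
X_scaled_demo = MinMaxScaler().fit_transform(iris.data)

# Assumed cutoff: keep only features whose scaled variance exceeds 0.04
vt_demo = VarianceThreshold(threshold=0.04)
X_reduced = vt_demo.fit_transform(X_scaled_demo)

# get_support() returns a boolean mask aligned with the input columns
kept = [name for name, keep in zip(iris.feature_names, vt_demo.get_support()) if keep]
print(kept)
print(X_reduced.shape)
```

With this cutoff only the sepal width column falls below the threshold, which matches the scaled variances printed above.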

### Univariate Techniques

In [17]:
from sklearn.feature_selection import SelectKBest
# No score function is passed, so SelectKBest uses its default, f_classif (ANOVA F-value)
skb = SelectKBest(k='all')
fs = skb.fit(x,y)
for score, name in zip(fs.scores_, x):
    print(f'{name:>18} score = {score:5.3f}')

     SepalLengthCm score = 119.265
      SepalWidthCm score = 47.364
     PetalLengthCm score = 1179.034
      PetalWidthCm score = 959.324


In [20]:
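from sklearn.feature_selection import mutual_info_classif
# Mutual information scores can vary slightly between runs because the estimator adds random noise
skb = SelectKBest(mutual_info_classif,k='all')
fs = skb.fit(x,y)
for score,name in zip(fs.scores_,x):
    print(f'{name:>18} score = {score:5.3f}')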

     SepalLengthCm score = 0.509
      SepalWidthCm score = 0.227
     PetalLengthCm score = 0.986
      PetalWidthCm score = 0.995
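
Both cells above use `k='all'`, which scores every feature but removes none. A minimal sketch of actually selecting the top features follows. It assumes scikit-learn's bundled `load_iris` data as a stand-in for the CSV file, and `k=2` is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()

# Keep the two features with the highest ANOVA F-scores
selector = SelectKBest(f_classif, k=2)
X_top2 = selector.fit_transform(iris.data, iris.target)

# get_support() returns a boolean mask aligned with the input columns
chosen = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]
print(chosen)
print(X_top2.shape)
```

Consistent with the F-scores printed above, the two petal measurements are the ones retained.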


### WRAPPER METHODS 

### Recursive Feature Elimination

In [28]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

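svc = LinearSVC(random_state=2)
# n_features_to_select must be passed by keyword in recent scikit-learn releases
rfe = RFE(svc, n_features_to_select=1)
rfe.fit(x,y)
for rank, name in sorted(zip(rfe.ranking_,x), key=lambda pair: pair[0]):
    print(f'{name:>18} rank = {rank}')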

      PetalWidthCm rank = 1
      SepalWidthCm rank = 2
     PetalLengthCm rank = 3
     SepalLengthCm rank = 4




### EMBEDDED METHODS

### Feature importances from a tree-based model (RandomForestClassifier)

In [29]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x,y)
print(f'{"Label":18s}: Importance')
print(26*'-')
for val, name in sorted(zip(rfc.feature_importances_, x), 
                        key=lambda x: x[0], reverse=True):
    print(f'{name:>18}: {100.0*val:05.2f}%')

Label             : Importance
--------------------------
     PetalLengthCm: 47.91%
      PetalWidthCm: 40.85%
     SepalLengthCm: 08.83%
      SepalWidthCm: 02.41%
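
The importances above can also drive the selection itself. The sketch below uses `SelectFromModel` and assumes scikit-learn's bundled `load_iris` data as a stand-in for the CSV file, a fixed `random_state` for repeatability, and the mean importance as the cutoff; all three are illustrative choices rather than part of the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

# Fit the forest, then keep features whose importance is at least the mean importance
forest = RandomForestClassifier(n_estimators=100, random_state=0)
sfm = SelectFromModel(forest, threshold="mean")
X_selected = sfm.fit_transform(iris.data, iris.target)

chosen = [name for name, keep in zip(iris.feature_names, sfm.get_support()) if keep]
print(chosen)
print(X_selected.shape)
```

Since the forest is built with randomness, the exact importances change with the seed, which is why a `random_state` is fixed here.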
