[Reference](https://towardsdatascience.com/intro-to-feature-selection-methods-for-data-science-4cae2178a00a)

There are three types of feature selection: Wrapper methods (forward, backward, and stepwise selection), Filter methods (ANOVA, Pearson correlation, variance thresholding), and Embedded methods (Lasso, Ridge, Decision Tree). 

# Wrapper methods

## Forward selection

```python
import pandas as pd
import statsmodels.api as sm

def forward_selection(X, y, initial_list=None, threshold_in=0.01, verbose=True):
    # Copy the starting features (None avoids a mutable default argument)
    included = list(initial_list or [])
    while True:
        changed = False
        # forward step: fit one model per candidate feature and record its p-value
        excluded = list(set(X.columns) - set(included))
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        ...  # Check the entire function on my GitHub page.
    return included
```

## Backward selection

```python
import pandas as pd
import statsmodels.api as sm

def back_selection(X, y, threshold_out=0.05, verbose=True):
    included = X.columns.tolist()
    while True:
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # skip the intercept when looking for the least significant feature
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max()  # NaN if pvalues is empty
        if worst_pval > threshold_out:
            # idxmax returns the feature's label; argmax would return its integer position
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
        # check the entire function on my GitHub page
    return included
```

## Stepwise selection

```python
import pandas as pd
import statsmodels.api as sm

def stepwise_selection(X, y, initial_list=None, threshold_in=0.01, threshold_out=0.05, verbose=True):
    included = list(initial_list or [])
    while True:
        changed = False  # set to True by the (elided) add/remove steps
        # forward step: p-value of each candidate when added to the current model
        excluded = list(set(X.columns) - set(included))
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        # backward step: refit on the features currently included
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # check the entire function on my GitHub page
        if not changed:
            break
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
    return included
```

# Filter methods

## ANOVA (Analysis of variance)

```python
from sklearn.feature_selection import f_regression

def ANOVA(X,y):
    '''Univariate linear regression tests
    Quick linear model for sequentially testing the effect of many regressors
    Using scikit learn's Feature selection toolbox
    Returns:
        F (array) = F-values for regressors
        pvalues (array) = p-values for F-scores'''
    (F,pvalues) = f_regression(X,y)
    return (F,pvalues)
```
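A short usage sketch of `f_regression` on synthetic data, to show how the two returned arrays can be turned into a selection. The data and the 0.05 cutoff are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data (assumption): only the first of three columns drives y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

F, pvalues = f_regression(X_demo, y_demo)
significant = pvalues < 0.05  # boolean mask of features passing the cutoff
print(F.round(2), pvalues.round(4), significant)
```

Each feature is tested on its own, so this filter does not account for redundancy between features.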

## Pearson correlation

```python
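# Correlation of every numeric column with the target, computed once.
# numeric_only=True skips text columns, which newer pandas versions refuse to correlate.
target_corr = df.corr(numeric_only=True)["SalePrice"]
imt_features = []
# 30 most positively correlated columns (SalePrice itself comes first, with correlation 1.0)
imt_features.extend(target_corr.sort_values(ascending=False).index.tolist()[:30])
# 30 most negatively (or least) correlated columns
imt_features.extend(target_corr.sort_values(ascending=True).index.tolist()[:30])
# check the entire analysis on my GitHub page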
```
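An alternative to taking the top and bottom of the list separately is to rank features by the absolute value of their correlation with the target. A minimal sketch on a toy DataFrame (the column names and data are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (assumption): price rises with area, falls with age, ignores noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "area": rng.normal(size=200),
    "age": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
toy["price"] = 2 * toy["area"] - 1.5 * toy["age"] + rng.normal(scale=0.3, size=200)

corr = toy.corr()["price"].drop("price")          # drop the target's self-correlation
ranked = corr.abs().sort_values(ascending=False)  # strongest linear relationships first
top_features = ranked.index[:2].tolist()
print(corr.round(2), top_features)
```

Pearson correlation only captures linear relationships, so a feature with a strong non-linear effect can still rank low.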

## Variance thresholding

```python
from sklearn.feature_selection import VarianceThreshold

# Keep only the columns whose variance is above the threshold (default 0.5)
def variance_threshold_selector(data, threshold=0.5):
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]
```
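A small usage sketch on a toy DataFrame, to show which columns survive. The data is invented for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "almost_constant": [1, 1, 1, 1, 1, 2],  # population variance is about 0.14
    "spread": [1, 5, 9, 13, 17, 21],        # population variance is well above 0.5
})
selector = VarianceThreshold(threshold=0.5)
selector.fit(toy)
kept = toy.columns[selector.get_support()].tolist()
print(kept)
```

Variance depends on the unit of measurement, so the threshold is only meaningful when the features are on comparable scales.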

## Interactions

```python
#Add one interacting term and check its effect on model improvement. 
X_int=X_orig.copy()
X_int["Quality*Condition"]=X_int["OverallQual"]*X_int["OverallCond"]
```
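One way to check the effect of an interaction term is to compare a model with and without it on held-out data. A minimal sketch on synthetic data, where the target is built with a true interaction (the column names `qual` and `cond` and the data are hypothetical, not the housing dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): y contains a genuine qual*cond effect
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"qual": rng.uniform(1, 10, 300), "cond": rng.uniform(1, 10, 300)})
y_demo = (X_demo["qual"] + X_demo["cond"]
          + 0.5 * X_demo["qual"] * X_demo["cond"]
          + rng.normal(scale=1.0, size=300))

X_int_demo = X_demo.copy()
X_int_demo["qual*cond"] = X_int_demo["qual"] * X_int_demo["cond"]

# Same split for both feature sets, so the scores are comparable
train_idx, test_idx = train_test_split(X_demo.index, test_size=0.3, random_state=0)

def holdout_r2(features):
    model = LinearRegression().fit(features.loc[train_idx], y_demo.loc[train_idx])
    return model.score(features.loc[test_idx], y_demo.loc[test_idx])

r2_base = holdout_r2(X_demo)
r2_int = holdout_r2(X_int_demo)
print(round(r2_base, 3), round(r2_int, 3))
```

The comparison uses held-out data because in-sample R² can never decrease when a feature is added.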

# Embedded Methods

## Ridge regression

![Ridge](https://miro.medium.com/max/1312/0*KYSUZMCtRuwrlmYW)

```python
from sklearn.linear_model import Ridge
# The higher alpha is, the more the coefficients are shrunk toward zero.
# With a very low alpha the coefficients are barely restricted,
# and ridge regression behaves almost like ordinary linear regression.
rr = Ridge(alpha=0.01)
rr.fit(X_train, y_train)
```
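The effect of `alpha` can be made visible by comparing the size of the coefficient vector for a few values. A minimal sketch on synthetic data (the data and the alpha grid are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (assumption): three informative features, two irrelevant ones
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

# L2 norm of the coefficients for plain least squares and for increasing alpha
ols_norm = np.linalg.norm(LinearRegression().fit(X_demo, y_demo).coef_)
norms = {
    alpha: np.linalg.norm(Ridge(alpha=alpha).fit(X_demo, y_demo).coef_)
    for alpha in (0.01, 1.0, 100.0)
}
print(round(ols_norm, 3), {a: round(n, 3) for a, n in norms.items()})
```

Ridge shrinks coefficients but does not set them exactly to zero, so on its own it ranks features rather than removing them.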

## Lasso Regression

![Lasso](https://miro.medium.com/max/1274/0*7wWuuNtsjiGbBXre)

```python

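import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso()  # default alpha=1.0
lasso.fit(X_train, y_train)
# R^2 on the training and test sets
train_score = lasso.score(X_train, y_train)
test_score = lasso.score(X_test, y_test)
# Number of features the model kept (non-zero coefficients)
coeff_used = np.sum(lasso.coef_ != 0)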
```
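Because lasso can drive coefficients exactly to zero, the number of features it keeps depends on `alpha`. A minimal sketch on synthetic data (the data and the alpha values are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 3 informative features out of 10
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -4.0, 3.0] + [0.0] * 7)
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

# Count the non-zero coefficients for a weak, a moderate and a very strong penalty
counts = {}
for alpha in (0.01, 0.5, 10.0):
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    counts[alpha] = int(np.sum(model.coef_ != 0))
print(counts)
```

In practice `alpha` is usually chosen by cross-validation, for example with `LassoCV`, rather than picked by hand.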

## Decision Tree

```python
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
rf.fit(X_train, y_train)
# Get numerical feature importances
importances = list(rf.feature_importances_)
```
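The importances come back as a plain array in column order, so it helps to pair them with the feature names before ranking. A minimal sketch on synthetic data (the column names and data are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): "strong" matters most, "noise" not at all
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["strong", "weak", "noise"])
y_demo = 5 * X_demo["strong"] + 1 * X_demo["weak"] + rng.normal(scale=0.5, size=300)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Label each importance with its column name and sort from most to least important
ranked = pd.Series(rf_demo.feature_importances_, index=X_demo.columns).sort_values(ascending=False)
print(ranked.round(3))
```

These are impurity-based importances and they sum to 1, so they describe relative rather than absolute contribution.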

- Feature: an x variable, most often a column in a dataset
- Feature selection: optimizing a model by selecting a subset of the features to use
- Wrapper method: trying models with different subsets of features and picking the best combination
- Forward selection: adding features one by one to reach the optimal model
- Backward selection: removing features one by one to reach the optimal model
- Stepwise selection: a hybrid of forward and backward selection; adding and removing features one by one to reach the optimal model
- Filter method: selecting a subset of features by a measure other than error (a measure that is inherent to the feature and not dependent on a model)
- Pearson Correlation: a measure of the linear correlation between two variables
- Variance thresholding: selecting the features above a variance cutoff to preserve most of the information from the data
- ANOVA: (analysis of variance) a group of statistical estimation procedures and models that is used to observe differences in treatment (sample) means; can be used to tell when a feature is statistically significant to a model
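- Interacting term: the product of two features, which captures how the effect of one feature on the target depends on the value of the other; can provide further insight into the data. The product is often correlated with the features it is built from, so it tends to add multicollinearity rather than alleviate it unless the features are centered first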
- Multicollinearity: occurs when two or more independent variables are highly correlated with each other
- Embedded method: selecting and tuning the subset of features during the model creation process
- Ridge Regression: a modified least squares regression that penalizes features for having inflated beta coefficients by applying a lambda term to the cost function
- Lasso Regression: similar to ridge regression, but different in that the lambda term added to the cost function can force a beta coefficient to zero
- Decision Tree: a non-parametric model that uses features as nodes to split samples in order to classify an observation or predict a value. In a random forest model, feature importance can be calculated as the mean decrease in impurity (Gini impurity for classification, variance for regression)
- Cross Validation: a method to iteratively generate training and test datasets to estimate model performance on future unknown datasets
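Cross-validation ties the methods above together, since it gives a way to compare feature subsets or penalty strengths on data the model has not seen. A minimal sketch with scikit-learn's `cross_val_score` (the synthetic data, the lasso model and `alpha=0.1` are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption): 3 informative features out of 10
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -4.0, 3.0] + [0.0] * 7)
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

# 5-fold cross-validation: each fold is held out once while the other four train the model
scores = cross_val_score(Lasso(alpha=0.1), X_demo, y_demo, cv=5, scoring="r2")
print(scores.round(3), round(scores.mean(), 3))
```

When feature selection is part of the workflow, it should be run inside each training fold (for example in a `Pipeline`), otherwise the held-out scores are optimistic.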