# Feature Selection and Dimensionality Reduction
"Feature selection is the process of selecting a subset of relevant features for use in model construction".  Under normal circumstances, domain knowledge plays an important role.  Unfortunately, in the Don't Overfit II competition we have a binary target and 300 continuous variables "of mysterious origin", which forces us to try automatic feature selection techniques. https://www.kaggle.com/c/dont-overfit-ii



In [None]:
import numpy as np 
import pandas as pd
#import matplotlib.pyplot as plt
import seaborn as sns

### Load the data

In [None]:
df = pd.read_csv('data/mysterious.csv')
df.head()

In [None]:
df.describe()

* we have only 250 samples and 302 columns 
* with more columns than samples, the risk of overfitting is high

#### create our X and y

In [None]:
# prepare for modeling
X = df.drop(['id', 'target'], axis=1)
y = df['target']

## Feature Scaling

While many algorithms (such as K-nearest neighbors and logistic regression) work best when features are normalized, Principal Component Analysis (PCA) is a prime example of when normalization is important. In PCA we are interested in the components that maximize the variance. If one feature (e.g. human height) varies less than another (e.g. weight) only because of their respective scales (meters vs. kilograms), PCA on unscaled data may conclude that the direction of maximal variance lies along the 'weight' axis. Since a change in height of one meter is far more significant than a change in weight of one kilogram, this is clearly incorrect.

In [None]:
# scaling data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
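
The height and weight argument can be checked on a small synthetic example. This is only a sketch: the data is randomly generated, and the names `height_m`, `weight_kg` and `demo` are illustrative and not part of the competition data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic, independent features: height in meters, weight in kilograms.
height_m = rng.normal(1.7, 0.1, size=500)
weight_kg = rng.normal(70.0, 15.0, size=500)
demo = np.column_stack([height_m, weight_kg])

# Without scaling, the first component points almost entirely along weight,
# simply because weight has the larger numeric variance.
pc1_raw = PCA(n_components=1).fit(demo).components_[0]

# After standardization both features have unit variance, so neither
# dominates the explained variance.
demo_scaled = StandardScaler().fit_transform(demo)
ratio_scaled = PCA(n_components=2).fit(demo_scaled).explained_variance_ratio_

print("raw PC1 loadings (height, weight):", np.round(pc1_raw, 3))
print("explained variance ratio after scaling:", np.round(ratio_scaled, 3))
```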

### Split the data

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y)
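
Note that the scaler above was fit on every row before the split, so the held-out rows influenced the means and variances used for scaling. One way to avoid this is to split first and let a pipeline fit the scaler on the training rows only. The sketch below uses synthetic stand-in data; `X_demo`, `y_demo`, `pipe` and the `random_state` and `test_size` values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in with the same shape as the competition data.
X_demo = rng.normal(size=(250, 300))
y_demo = (X_demo[:, 0] + rng.normal(size=250) > 0).astype(int)

# Split first; stratify keeps the class balance similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training rows only and reuses
# those statistics when scoring the test rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

print("train accuracy:", pipe.score(X_tr, y_tr))
print("test accuracy:", pipe.score(X_te, y_te))
```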

## Baseline Models
We'll use logistic regression as a baseline because it is fast to train and predict and scales well.  We'll also use a random forest: its `feature_importances_` attribute gives a sense of which features are most important.


In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)

### fit the base models

In [None]:
lr.fit(x_train, y_train)
rfc.fit(x_train, y_train)

## baseline scores

In [None]:
# the first model is a LogisticRegression; score() returns mean accuracy
print("logistic regression train ", lr.score(x_train, y_train))
print("Random Forest train ", rfc.score(x_train, y_train))
print("logistic regression test ", lr.score(x_test, y_test))
print("Random Forest test ", rfc.score(x_test, y_test))

In [None]:
#list(zip(df.columns[2:], rfc.feature_importances_))

In [None]:
featureImportance = pd.DataFrame(rfc.feature_importances_, columns=["importance"])
featureImportance = featureImportance.sort_values(["importance"], ascending=False)  # very old pandas versions used sort() instead of sort_values()
featureImportance.head(10)
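
Because the scaled `X` is a NumPy array, the index of the table above is the column position rather than the column name. A minimal sketch of attaching names to the importances follows, on synthetic data; `feature_names`, `forest` and `importances` are hypothetical names introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical column names, mimicking the "0", "1", ... style of the data.
feature_names = [str(i) for i in range(20)]
X_demo = rng.normal(size=(200, 20))
y_demo = (X_demo[:, 3] > 0).astype(int)  # only column "3" carries signal

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Index the importances by name so the ranking stays readable.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances.head(5))
```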

# We can apply feature selection techniques to improve model performance.

## 1.  Remove features with missing values
This one is pretty self-explanatory.  First we check for missing values, then we remove any columns whose share of missing values exceeds a threshold we define.

In [None]:
# check missing values
df.isnull().any().any()

The dataset has no missing values and therefore no features to remove at this step.
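
For a dataset that does have gaps, the thresholding step could look like the sketch below. The toy frame `demo` and the 50% cutoff `max_missing` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "c": [1.0, np.nan, 3.0, 4.0],        # 25% missing
})

max_missing = 0.5  # hypothetical threshold: drop columns above 50% missing

# isnull().mean() gives the fraction of missing values per column.
missing_fraction = demo.isnull().mean()
to_drop = missing_fraction[missing_fraction > max_missing].index
reduced = demo.drop(columns=to_drop)

print(missing_fraction)
print(list(reduced.columns))
```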


## 2.  Remove highly correlated features

Features that are highly correlated or collinear can cause overfitting.  Here we will explore correlations among features.

In [None]:
# absolute pairwise correlations between all columns (features and target)
corr_matrix = df.corr().abs()

**Using abs()**, negative correlations become positive, so sorting ranks features by the strength of the correlation regardless of its sign.

In [None]:
sns.heatmap(corr_matrix)

* not much correlation
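
Had the heatmap shown strongly correlated pairs, one common approach is to scan the upper triangle of the correlation matrix and drop one feature from each pair above a cutoff. The sketch below uses synthetic data; the column names and the 0.95 `threshold` are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "f0": base,
    "f1": base + rng.normal(scale=0.01, size=200),  # near-copy of f0
    "f2": rng.normal(size=200),                     # independent feature
})

corr = demo.corr().abs()

# Keep only the strict upper triangle so each pair is checked once and
# the diagonal (always 1.0) is ignored; everything else becomes NaN.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # hypothetical cutoff
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = demo.drop(columns=to_drop)

print("dropped:", to_drop)
print("kept:", list(reduced.columns))
```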

## 3.  Correlation to the target

In [None]:
print(corr_matrix['target'].sort_values(ascending=False).head(10))

The heatmap above shows that there are no highly correlated features in the dataset.  Even the correlation to the target is weak: feature 33 has the highest correlation, at only 0.37 (the target is, of course, perfectly correlated with itself).

## 4.  Univariate Feature Selection
We can use sklearn's SelectKBest to select a fixed number of features to keep.  It scores each feature against the target with a univariate statistical test (by default `f_classif`, the ANOVA F-test) and keeps the k highest-scoring features.  Here we will keep the top 100 features. https://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html#sphx-glr-auto-examples-feature-selection-plot-feature-selection-py

In [None]:
from sklearn.feature_selection import SelectKBest

# feature selection: keep the 100 best-scoring features
k_best = SelectKBest(k=100)
# fit on train set
k_best.fit(x_train, y_train)
# transform train set
univariate_features = k_best.transform(x_train)
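
The transformed array no longer shows which original columns survived. A sketch of recovering them with `get_support` follows, on synthetic data where columns 5 and 12 carry the signal; `selector`, `kept` and `top` are illustrative names, and `k=5` is chosen only to keep the output short.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 30))
# Synthetic target driven by columns 5 and 12.
y_demo = (X_demo[:, 5] + 0.5 * X_demo[:, 12] > 0).astype(int)

# f_classif is the default score function; it is spelled out for clarity.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X_demo, y_demo)

# Positions of the retained columns in the original feature matrix.
kept = selector.get_support(indices=True)

# The same columns, ranked from highest to lowest F-score.
top = np.argsort(selector.scores_)[::-1][:5]

print("kept columns:", kept)
print("ranked by score:", top)
```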


* create the model again
* fit the model with the selected features

In [None]:
lr = LogisticRegression()
rfc = RandomForestClassifier(n_estimators=100)
lr.fit(univariate_features, y_train)
rfc.fit(univariate_features, y_train)

* we need to transform the test data as well

In [None]:
univariate_features_test = k_best.transform(x_test)

In [None]:
# the first model is a LogisticRegression; score() returns mean accuracy
print("logistic regression train ", lr.score(univariate_features, y_train))
print("Random Forest train ", rfc.score(univariate_features, y_train))
print("logistic regression test ", lr.score(univariate_features_test, y_test))
print("Random Forest test ", rfc.score(univariate_features_test, y_test))

In [None]:
univariate_features.shape

In [None]:
featureImportance = pd.DataFrame(rfc.feature_importances_, columns=["importance"])
featureImportance = featureImportance.sort_values(["importance"], ascending=False)  # index is the position among the 100 selected features, not the original column
featureImportance.head(10)

## 5. PCA

PCA (Principal Component Analysis) is a dimensionality reduction technique that projects the data into a lower-dimensional space.
PCA can be useful in many situations, but especially when there is excessive multicollinearity or when interpreting the individual predictors is not a priority.

In [None]:
from sklearn.decomposition import PCA
# pca - keep enough components to explain 50% of the variance
pca = PCA(0.50)

x_train_components = pca.fit_transform(x_train)
x_train_components.shape

* we need to transform the test data as well

In [None]:
x_test_components = pca.transform(x_test)
x_test_components.shape
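
When `n_components` is a float between 0 and 1, PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below checks this on synthetic data; `X_demo`, `full` and `pca_half` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 40))

# Fit all components and accumulate the explained variance ratios.
full = PCA().fit(X_demo)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 50%.
n_needed = int(np.searchsorted(cumulative, 0.5) + 1)

# Passing the fraction directly lets PCA pick that number itself.
pca_half = PCA(n_components=0.5).fit(X_demo)

print("components needed for 50% variance:", n_needed)
print("components chosen by PCA(0.5):", pca_half.n_components_)
```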

In [None]:
y_train.shape

In [None]:
lr = LogisticRegression()
rfc = RandomForestClassifier(n_estimators=100, max_depth=100)
lr.fit(x_train_components, y_train)
rfc.fit(x_train_components, y_train)

In [None]:
# the first model is a LogisticRegression; score() returns mean accuracy
print("logistic regression train ", lr.score(x_train_components, y_train))
print("Random Forest train ", rfc.score(x_train_components, y_train))

print("logistic regression test ", lr.score(x_test_components, y_test))
print("Random Forest test ", rfc.score(x_test_components, y_test))

## YOUR TURN 

* try to keep 70% of the variance
* try to use n_components=10 in the PCA parameter
* what is the main advantage of using PCA in this case?
* how would you reduce the random forest's accuracy on the training data?

In [None]:
pca = PCA(n_components=1)

x_train_components = pca.fit_transform(x_train)
print(x_train_components.shape)

x_test_components = pca.transform(x_test)
print(x_test_components.shape)

lr = LogisticRegression()
rfc = RandomForestClassifier(n_estimators=100, max_depth=3)
lr.fit(x_train_components, y_train)
rfc.fit(x_train_components, y_train)

print("logistic regression train ", lr.score(x_train_components, y_train))
print("Random Forest train ", rfc.score(x_train_components, y_train))

print("logistic regression test ", lr.score(x_test_components, y_test))
print("Random Forest test ", rfc.score(x_test_components, y_test))
