### Feature Selection: Forward Step-Wise

So far I've tried lasso regression on all data points to select features, including non-linearly transformed ones. A polynomial PCA of the 21 selected features shows a pattern among the potent compounds, but does not separate them well from the inactive compounds.

I want to explore new feature selection methods and compare them in two ways: by how accurately they predict the potency of OSM-S-106, and through visualizations of their dimensionality. To show model performance visually, I will select models that output feature importance, pass all principal components of the underlying features through them, and then plot the most important dimensions.
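
As a rough illustration of the visualization idea above, here is a minimal sketch on synthetic data. The random forest, the plain linear `PCA`, and the names `X_demo`, `y_demo`, and `plot_coords` are assumptions for illustration only; the models and projection actually compared may differ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 60 compounds, 8 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 8))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 3] + rng.normal(scale=0.1, size=60)

# Project the standardized features onto all principal components
components = PCA().fit_transform(StandardScaler().fit_transform(X_demo))

# Fit a model that exposes feature importances, using the components as inputs
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(components, y_demo)

# Indices of the two most important components, most important first
top_two = np.argsort(model.feature_importances_)[::-1][:2]

# Coordinates for a 2-D scatter plot, one row per compound
plot_coords = components[:, top_two]
```

`plot_coords` can then be passed to `plt.scatter`, colored by potency, to see whether the most important dimensions separate potent from inactive compounds.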

**Methods to combine**
* Forward step-wise feature selection
* Include decoy data points
* Cross-validate with held-out potent compounds for step-wise selection
* Test feature selection and models with OSM-S-106 for final comparison

Hypothesis:
* A model can accurately predict the potency of OSM-S-106 without using it, or anything derived from its data, in the training data

Assumption:
* At least some of the relatively potent compounds measured must perform in a similar way to OSM-S-106
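
To make the forward step-wise idea from the methods list concrete, here is a minimal sketch of a greedy selection loop. It assumes a plain `LinearRegression` scored by ordinary k-fold cross-validated R², which is simpler than the held-out potent compound scheme described above; `forward_stepwise`, `X_demo`, and `y_demo` are hypothetical names, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_stepwise(X, y, max_features=3, cv=5):
    """Greedily add the feature that most improves the mean CV R^2."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        # Score every candidate feature when added to the current selection
        scores = {}
        for j in remaining:
            cols = selected + [j]
            scores[j] = cross_val_score(
                LinearRegression(), X[:, cols], y, cv=cv).mean()
        best_j = max(scores, key=scores.get)
        # Stop as soon as no candidate improves the cross-validated score
        if scores[best_j] <= best_score:
            break
        best_score = scores[best_j]
        selected.append(best_j)
        remaining.remove(best_j)
    return selected, best_score


# Synthetic data (hypothetical): only features 1 and 4 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 6))
y_demo = (3.0 * X_demo[:, 1] - 2.0 * X_demo[:, 4]
          + rng.normal(scale=0.1, size=100))

selected, score = forward_stepwise(X_demo, y_demo)
```

In this toy setup the first two entries of `selected` are the informative features. Swapping the `cv` argument for a custom splitter would be one way to hold out potent compounds instead of random folds.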

In [1]:
import pandas as pd
from sklearn.decomposition import KernelPCA
from matplotlib import pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from scipy.stats import skewtest
from sklearn.linear_model import Lasso
%matplotlib inline

#### Read in data and pre-process variables

In [2]:
df = pd.read_csv("data/Series3_6.15.17_padel.csv")
# Drop examples without IC50
df = df[~df.IC50.isnull()]

# Column types and counts
np.unique(df.dtypes)
len(df.columns[df.dtypes == 'O'])
len(df.columns[df.dtypes == 'int64'])
len(df.columns[df.dtypes == 'float64'])

""" Preprocessing Variables """
# Categorical Variables: No missing values
sum(df[df.columns[df.dtypes == 'int64']].isnull().sum())
# Get dummy vars: filter to int type, convert to object, pass to get_dummies.
cat_vars_df = pd.get_dummies(
    df[df.columns[df.dtypes == 'int64']].astype('O'))

# Continuous Variables: 67 columns have missing values
# (the expression below returns the total number of missing cells)
sum(df[df.columns[df.dtypes == 'float64']].isnull().sum())
# Impute or remove? (for now remove any columns with nan)
cont_vars_df = df[df.columns[df.dtypes == 'float64']].dropna(axis=1)
# Drop target variable
cont_vars_df.drop("IC50", axis=1, inplace=True)
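
The "impute or remove?" question above can be illustrated on a toy frame. This is only a sketch of one alternative, assuming median imputation via scikit-learn's `SimpleImputer`; the notebook itself removes the columns, and `toy`, `dropped`, and `imputed` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical): columns "a" and "c" each have one missing value
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [0.5, 0.1, 0.3, 0.2],
    "c": [np.nan, 10.0, 30.0, 20.0],
})

# Option used in this notebook: drop every column that contains a NaN
dropped = toy.dropna(axis=1)

# Alternative: fill each NaN with the median of its column
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index)
```

Dropping keeps only column `b` here, while imputation keeps all three columns at the cost of inventing values, which may matter with so few compounds.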

#### Process continuous variables: Add transformations, remove skewed features

In [4]:
def add_transformations(df, feat):
    """Add non-linear transformations of column `feat` to `df` as new columns.

    Transformations that produce NaN or inf (e.g. sqrt of a negative value)
    are expected to be dropped by the caller afterwards.
    """
    feature_df = df.loc[:, feat].copy()
    if feature_df.min() > 0:  # Logs are undefined for 0 or negative values
        df.loc[:, feat + "_log"] = np.log(feature_df)  # natural log
        df.loc[:, feat + "_log2"] = np.log2(feature_df)  # log base 2
        df.loc[:, feat + "_log10"] = np.log10(feature_df)  # log base 10
    # np.cbrt handles negative inputs; np.power(x, 1/3) would return NaN
    df.loc[:, feat + "_cubert"] = np.cbrt(feature_df)  # cube root
    df.loc[:, feat + "_sqrt"] = np.sqrt(feature_df)  # square root
    if feature_df.max() < 10:  # Avoid extremely large values
        df.loc[:, feat + "_sq"] = np.square(feature_df)  # square
        df.loc[:, feat + "_cube"] = np.power(feature_df, 3)  # cube
        df.loc[:, feat + "_exp"] = np.exp(feature_df)  # e**x
        df.loc[:, feat + "_exp2"] = np.exp2(feature_df)  # 2**x
    return df

# Add above transformations for all continuous variables
for feature in cont_vars_df.columns:
    cont_vars_df = add_transformations(cont_vars_df, feature)
# Drop any new columns with NaN due to improper transformation
cont_vars_df.replace([np.inf, -np.inf], np.nan, inplace=True)
cont_vars_df.dropna(axis=1, inplace=True)

# Treat a feature as skewed if skewtest rejects the null hypothesis of
# no skew at the 5% significance level (p-value <= 0.05)
# Remove any skewed features after adding transformations
cont_vars_df = cont_vars_df.loc[:, cont_vars_df.apply(
    lambda x: skewtest(x)[1] > .05).values]

# Combine datasets
vars_df = pd.concat([cat_vars_df, cont_vars_df], axis=1)
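
To show how the skew filter interacts with the added transformations, here is a small sketch on deterministic toy data. The column names `raw` and `raw_log` are hypothetical; the 0.05 threshold mirrors the one used above.

```python
import numpy as np
import pandas as pd
from scipy.stats import skewtest

# Toy feature (hypothetical): exponentially spaced values are right-skewed,
# while their log is evenly spaced and therefore symmetric
toy = pd.DataFrame({"raw": np.exp(np.linspace(0, 3, 200))})
toy["raw_log"] = np.log(toy["raw"])

# Two-sided p-value of the skewness test for each column
p_values = toy.apply(lambda col: skewtest(col).pvalue)

# Keep only columns where the no-skew null hypothesis is not rejected
kept = toy.loc[:, p_values > 0.05]
```

In this example the raw feature is filtered out and only its log transform survives, which is the intended effect of adding transformations before removing skewed columns.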