In [3]:
import pandas as pd
from sklearn import linear_model
from sklearn import svm
import statsmodels.api as sm
import matplotlib.pyplot as plt
import skfeature
from sklearn import metrics
%matplotlib inline

# An Introduction to Feature Engineering

By Eric Schles

Today we are going to cover feature engineering. We'll start with the easiest form of feature engineering: creating dummy variables. Then we'll move on to taking the log of different variables, using a variable together with its square (and other polynomial terms), adding the outputs of other models as features, and finally automatic feature selection via skfeature and sklearn.

To summarize we'll cover:

* dummy variables
* logs of variables
* polynomial variants of variables
* the outputs of other models as features
* automatic feature selection via skfeature
* automatic feature selection via PCA
* automatic feature selection via t-SNE
* automatic feature selection via DBSCAN
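
As a quick preview of the log and polynomial items in the list above, here is a minimal sketch. The `toy` frame and its `area` column are made up for illustration and are not part of `train.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from train.csv.
toy = pd.DataFrame({"area": [1.0, 10.0, 100.0]})

# Log feature: compresses large values; only defined for positive inputs.
toy["log_area"] = np.log(toy["area"])

# Polynomial feature: keep the original column and add its square.
toy["area_squared"] = toy["area"] ** 2

print(toy)
```

Both new columns sit next to the original one, so a model can use whichever form fits best.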

In [4]:
# first let's import some data

train = pd.read_csv("train.csv")

In [6]:
# starting with dummy variables
# We'll want to pick dummy variables whose values occur a somewhat balanced number of times.
# That makes variables like the following poor choices:

train["Street"].value_counts()

Pave    1454
Grvl       6
Name: Street, dtype: int64

In [14]:
# And it makes variables like LotShape pretty good choices:
train["LotShape"].value_counts()

Reg    925
IR1    484
IR2     41
IR3     10
Name: LotShape, dtype: int64
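
One way to put a number on "balanced" is the share of rows taken by the most common value. The sketch below rebuilds the two columns from the counts printed above rather than reading `train.csv`, and `top_share` is a hypothetical helper name, not something from pandas.

```python
import pandas as pd

# Rebuild the two columns from the value counts shown above (not the real file).
street = pd.Series(["Pave"] * 1454 + ["Grvl"] * 6, name="Street")
lot_shape = pd.Series(
    ["Reg"] * 925 + ["IR1"] * 484 + ["IR2"] * 41 + ["IR3"] * 10, name="LotShape"
)

def top_share(series):
    """Fraction of non-missing rows taken by the most common value."""
    # value_counts sorts in descending order, so the first entry is the largest.
    return series.value_counts(normalize=True).iloc[0]

street_share = top_share(street)
lot_shape_share = top_share(lot_shape)
print(f"Street: {street_share:.3f}, LotShape: {lot_shape_share:.3f}")
# prints "Street: 0.996, LotShape: 0.634"
```

A single value covering almost every row, as with `Street`, leaves a dummy column that is nearly constant and carries little signal.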

## Operationalizing dummy variables

Now that we've figured out what a good dummy variable looks like, let's operationalize it with a function!  

In [15]:
def generate_dummy_variables(df, length_cut_off_percent=0.75, max_concentration=0.65):
    """Replace suitable categorical columns of df with dummy (one-hot) columns."""
    selected_columns = []
    for column in df.columns:
        # only string (object) columns are candidates
        if df[column].dtype != object:
            continue
        value_counts = df[column].value_counts()
        # skip columns with too many distinct values relative to the number of rows
        if len(value_counts) > len(df) * length_cut_off_percent:
            continue
        # skip columns where a single value dominates
        if (value_counts / value_counts.sum()).max() > max_concentration:
            continue
        selected_columns.append(column)
    for column in selected_columns:
        # prefix with the source column name so dummies from different columns can't collide
        dummy_columns = pd.get_dummies(df[column], prefix=column)
        df = pd.concat([df.drop(column, axis=1), dummy_columns], axis=1)
    return df
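
To see what the dummy-encoding step itself produces, here is a small self-contained sketch using `pd.get_dummies` directly. The `toy` frame is made up: the column names mirror `train.csv`, but the rows are illustrative only.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "LotShape": ["Reg", "IR1", "Reg", "IR2"],
        "LotArea": [8450, 9600, 11250, 9550],
    }
)

# One indicator column per category, named "<column>_<category>".
dummies = pd.get_dummies(toy["LotShape"], prefix="LotShape")

# Swap the original text column for its indicator columns.
encoded = pd.concat([toy.drop("LotShape", axis=1), dummies], axis=1)
print(list(encoded.columns))
# prints ['LotArea', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_Reg']
```

Each row has exactly one indicator set among the `LotShape_*` columns, which is the form most sklearn estimators expect for categorical inputs.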