In [1]:
import pandas as pd
from sklearn import linear_model
from sklearn import svm
import statsmodels.api as sm
import matplotlib.pyplot as plt
import skfeature
from sklearn import metrics
import numpy as np
%matplotlib inline

  from pandas.core import datetools


# An Introduction to Feature Engineering

By Eric Schles

Today we are going to cover feature engineering.  We'll start with the easiest form of feature engineering - creating dummy variables.  And then we'll move onto taking the log of diffierent variables, taking the variable and the square of variables, adding the results of other models as features to the model, and finally automatic feature selection, via skfeature and sklearn.  

To summarize we'll cover:

* dummy variables
* logs of variables
* multiplying variables
* the outputs of other models as features
* automatic feature selection via skfeature
* automatic feature selection via PCA
* automatic feature selection via TSNE
* automatic feature selection via DBSCAN

In [2]:
# first let's import some data

train = pd.read_csv("train.csv")

# A brief introduction to Dummy variables

A dummy variable is a transformation of a categorical variable into a numeric variable.  This allows us to process a whole new set of data points, that we couldn't before.  However, there are some dangers.  We shouldn't interpret dummy variables as truly numeric, because there isn't necessarily a meaning to the transformation we choose.  As long as we map to variables that don't have large magnitudinal differences, and there are only a few categories, dummy variables give us a lot of power.

But if we have a ton of categories, a strict dummy variable mapping may be ill advised.

In [6]:
# starting with dummy variables
# We'll want to pick dummary variables that occur a somewhat balanced number of times
# that makes variables like the following poor choices:

train["Street"].value_counts()

Pave    1454
Grvl       6
Name: Street, dtype: int64

In [14]:
# And variables like LotShape pretty good:
train["LotShape"].value_counts()

Reg    925
IR1    484
IR2     41
IR3     10
Name: LotShape, dtype: int64

## Operationalizing dummy variables

Now that we've figured out what a good dummy variable looks like, let's operationalize it with a function!  

In [17]:
def generate_dummy_variables(df, length_cut_off_percent = 0.75, max_concentration = 0.65):
    candidate_columns = []
    for column in df.columns:
        if df[column].dtype == object:
            candidate_columns.append(column)
    
    for column in candidate_columns:
        value_counts = df[column].value_counts()
        if len(value_counts) < len(df)*length_cut_off_percent: 
            candidate_columns.remove(column)
            continue
        sum_vals = sum(value_counts)
        percentages = [elem/sum_vals > max_concentration for elem in value_counts]
        if any(percentages):
            candidate_columns.remove(column)
        continue
    
    for column in candidate_columns:
        dummy_columns = pd.get_dummies(df[column])
        df = pd.concat([df, dummy_columns], axis=1)
        df.drop(column, axis=1, inplace=True)
    return df

In [21]:
# now let's see which variables we are going to turn into dummies
train_with_dummies = generate_dummy_variables(train)
print(train.shape, train_with_dummies.shape)

(1460, 83) (1460, 175)


As you can see we went from 83 columns to 175.  That's a lot more features!  

# Logarithmic transformation

I'm not sure if logarithmic transformations are considered "feature engineering" however they are a powerful tool used often in econometrics to bring flexibility to the modeling of data.  In order to make use of and interpret this next technique we will need some information about our dataset.  Does it makes sense to take the log of any of the variables?

One of the ways you can tell this, is if the dependent variable and the independent variables are not in the same scale.  Which is often the case when dealing with prices.  So let's go ahead and build a rather standard model with our data:

`log(SalePrice) = B[0] + B[1]*log(LotArea) + B[2]*log(BedroomAbvGr) + B[3]*log(ExterQual) + B[4]*log(ExterCond) + u`

```
SalePrice = the price the house was sold for 
LotArea = the size of the lot
BedroomAbvGr = the number of bedrooms above basement level
ExterQual = the quality of the exterior of the home (See Note)
ExterCond = the condition of the exterior of the home (See Note)
```

Note - Even though these are categorical variables, there is an ordering to each variable Excellent, Good, Average, Fair and Poor.  So we can translate these variables into numeric ones.

In [3]:
mapping = {
    "Ex": 5, # Excellent
    "Gd": 4, # Good
    "TA": 3, # Average/Typical
    "Fa": 2, # Fair
    "Po": 1 # Poor
}
train["ExterQual"] = train["ExterQual"].map(mapping)
train["ExterCond"] = train["ExterCond"].map(mapping)

In [4]:
train["log_SalePrice"] = np.log(train["SalePrice"])
train["log_LotArea"] = np.log(train["LotArea"])
train["log_BedroomAbvGr"] = np.log(train["BedroomAbvGr"])
train["log_ExterQual"] = np.log(train["ExterQual"])
train["log_ExterCond"] = np.log(train["ExterCond"])

sale_price = train["log_SalePrice"]
X = train[["log_LotArea", "log_BedroomAbvGr", "log_ExterQual","log_ExterCond"]]
X = sm.add_constant(X)

  This is separate from the ipykernel package so we can avoid doing imports until


In [5]:
model = sm.OLS(sale_price, X)

In [None]:
model_results = model.fit()

In [None]:
model_results.summary()

## Interpretting our results

Generally speaking when one applies a log to both sides of an OLS model, we are asking "What is the elasticity of sale price with respect to lot area, number of above ground bedrooms, external quality of the material of the house and external condition of the house?"  Thinking of this another way, we can see this as the percentage change in price with respect to a percentage change in a dependent variable.  

Thus, we have put both sides of our equation in relative, not absolute terms.  In terms of a percentage change, things can be misleading, if not well understood.  But when treated carefully, percentage changes can shed more light than just dealing with absolute numbers.  Especially when changes in Y appear insensitive to changes in X, due to scaling reasons.  

This that in mind, let's look at our results!
