# Feature Engineering

In the following example, we build a scikit-learn transformer that adds a new feature to the Boston housing dataset.

Boston House Prices dataset
---------------------------

Data Set Characteristics:  

Number of Instances: 506 

Number of Attributes: 13 numeric/categorical predictive

Median Value (attribute 14) is usually the target

Attribute Information (in order):

- CRIM -    per capita crime rate by town
- ZN -      proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS -   proportion of non-retail business acres per town
- **CHAS -    Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)**
- NOX -     nitric oxides concentration (parts per 10 million)
- RM -      average number of rooms per dwelling
- AGE -     proportion of owner-occupied units built prior to 1940
- DIS -     weighted distances to five Boston employment centres
- RAD -     index of accessibility to radial highways
- **TAX -     full-value property-tax rate per 10,000 dollars**
- PTRATIO - pupil-teacher ratio by town
- B -       1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT -   % lower status of the population
- MEDV -    median value of owner-occupied homes in thousands of dollars

Missing Attribute Values: None

Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

In [3]:
from sklearn.datasets import load_boston
import pandas as pd

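# Load the Boston housing data.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release.
boston = load_boston()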
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target

X.head()

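|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |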


# Split data

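Even if we do not end up fitting a model, splitting the data first is still good practice: any statistic that a feature depends on (here, a quantile of `TAX`) should be computed from the training set only, so that nothing about the test set leaks into the feature.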

In [8]:
from sklearn.model_selection import train_test_split

seed = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=seed)

In [11]:
X_train.TAX.quantile(0.75)

666.0
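
To illustrate why the split comes first, here is a minimal sketch, not part of the original notebook, that learns a quartile threshold on the training rows and then reuses that same number on the held-out rows. It assumes a made-up frame named `toy` with a single synthetic `TAX` column, so the printed values will not match the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing frame: one TAX column with made-up values.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"TAX": rng.integers(180, 720, size=200).astype(float)})

toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

# Learn the threshold from the training rows only...
train_q75 = toy_train["TAX"].quantile(0.75)

# ...then apply that same stored number to the held-out rows.
test_flag = (toy_test["TAX"] >= train_q75).astype(int)

print(f"threshold learned on train: {train_q75:.1f}")
print(f"share of test rows flagged: {test_flag.mean():.2f}")
```

The share of flagged test rows will usually be close to, but not exactly, 25%, because the threshold was never tuned to the test rows.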

# Extract a new feature

Suppose we want a feature that flags tracts that bound the Charles River (`CHAS` = 1) *and* whose tax rate falls in the top quartile of the training data. We will create a custom scikit-learn transformer for the task.

In [20]:
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.utils.validation import check_is_fitted
import numpy as np

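class RiverHighTaxTransformer(TransformerMixin, BaseEstimator):
    """Add a flag for tracts that bound the river and have top-quartile TAX."""
    # Mixins are listed before BaseEstimator, as scikit-learn recommends.

    def __init__(self, feat_name='RIVER_AND_HIGH_TAX'):
        self.feat_name = feat_name

    def fit(self, X, y=None):
        # column access below relies on X being a DataFrame
        if not isinstance(X, pd.DataFrame):
            raise TypeError("X must be a DataFrame")

        # learn the upper-quartile TAX threshold from the data passed to fit
        self.upper_q_ = X.TAX.quantile(0.75)
        return self

    def transform(self, X):
        check_is_fitted(self, 'upper_q_')

        # column access below relies on X being a DataFrame
        if not isinstance(X, pd.DataFrame):
            raise TypeError("X must be a DataFrame")

        # work on a copy so the caller's frame is not modified
        X = X.copy()

        # 1 only where TAX is at or above the threshold and CHAS == 1
        X[self.feat_name] = (X.TAX >= self.upper_q_).astype(int) * X.CHAS
        return X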

In [22]:
# define
transformer = RiverHighTaxTransformer()

# fit
transformer.fit(X_train, y_train)

# transform
transformer.transform(X_train)\
           .sort_values('RIVER_AND_HIGH_TAX', 
                        ascending=False)\
           .head()

|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | RIVER_AND_HIGH_TAX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 358 | 5.20177 | 0.0 | 18.1 | 1.0 | 0.77 | 6.127 | 83.4 | 2.7227 | 24.0 | 666.0 | 20.2 | 395.43 | 11.48 | 1.0 |
| 364 | 3.47428 | 0.0 | 18.1 | 1.0 | 0.718 | 8.78 | 82.9 | 1.9047 | 24.0 | 666.0 | 20.2 | 354.55 | 5.29 | 1.0 |
| 370 | 6.53876 | 0.0 | 18.1 | 1.0 | 0.631 | 7.016 | 97.5 | 1.2024 | 24.0 | 666.0 | 20.2 | 392.05 | 2.96 | 1.0 |
| 357 | 3.8497 | 0.0 | 18.1 | 1.0 | 0.77 | 6.395 | 91.0 | 2.5052 | 24.0 | 666.0 | 20.2 | 391.34 | 13.27 | 1.0 |
| 369 | 5.66998 | 0.0 | 18.1 | 1.0 | 0.631 | 6.683 | 96.8 | 1.3567 | 24.0 | 666.0 | 20.2 | 375.33 | 3.73 | 1.0 |
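
One reason to wrap the feature in a transformer rather than a one-off pandas expression is that it can sit inside a scikit-learn `Pipeline`, which fits the threshold on the training rows and reuses it at prediction time. The sketch below is an illustration under stated assumptions, not part of the original notebook: it uses made-up data with only `TAX` and `CHAS` columns, a compact copy of the class so the block runs on its own, and `LinearRegression` as an arbitrary stand-in model.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.utils.validation import check_is_fitted


class RiverHighTaxTransformer(TransformerMixin, BaseEstimator):
    """Compact copy of the transformer above so this block runs on its own."""

    def __init__(self, feat_name="RIVER_AND_HIGH_TAX"):
        self.feat_name = feat_name

    def fit(self, X, y=None):
        self.upper_q_ = X["TAX"].quantile(0.75)
        return self

    def transform(self, X):
        check_is_fitted(self, "upper_q_")
        X = X.copy()
        X[self.feat_name] = (X["TAX"] >= self.upper_q_).astype(int) * X["CHAS"]
        return X


# Synthetic data (made-up values) with only the two columns the transformer reads.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "TAX": rng.integers(180, 720, size=200).astype(float),
    "CHAS": rng.integers(0, 2, size=200).astype(float),
})
toy_y = rng.normal(22.0, 5.0, size=200)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("river_high_tax", RiverHighTaxTransformer()),
    ("model", LinearRegression()),  # arbitrary stand-in model
])

# fit() learns the TAX quartile from the training rows only;
# predict() reuses that stored threshold on the test rows.
pipe.fit(toy_X_train, toy_y_train)
preds = pipe.predict(toy_X_test)

learned_q = pipe.named_steps["river_high_tax"].upper_q_
print(f"threshold stored by the pipeline: {learned_q:.1f}")
print(f"number of predictions: {len(preds)}")
```

Because the threshold lives in the fitted `upper_q_` attribute, the same pipeline object can be passed to cross-validation utilities, which would refit the quartile on each training fold.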
