# Massive Machine Learning Pipelines - Part 4

## Feature Engineering

All of the above represents a fairly 'vanilla' massive machine learning pipeline. Summarizing, we have done the following:
* Classified each column as either string or numeric and as either nominal, ordinal, or continuous
* Built a pipeline for each of these 5 column groupings by filling in missing values and either one-hot encoding or standardizing
* Used cross validation to estimate the root mean squared log error on the test set
* Modeled on the log of the sale price
* Used grid search to find the optimal value for the penalty term for ridge regression

We have essentially reached the limit of our learning without doing too much thinking. True, we did not choose a more complex type of model like a Random Forest or Gradient Boosted Tree. If we desire a better model, we will need to think about how to engineer better features from our data. 

### Replacing low-frequency categorical values

Let's look at individual columns for particular values that appear very few times (sometimes loosely referred to as outliers). Categorical values that appear infrequently are candidates to be reclassified as another similar category or to be grouped together with other infrequent categories into an 'other' category.

### Why reclassify low-frequency categoricals?
A primary goal of machine learning is to build a model that generalizes well to future, unseen data. If our model is built with too many low-frequency categorical values, it may overfit to those particular categories. As a concrete example, imagine that there are just 2 houses in our training data from a particular neighborhood and both of these houses, by chance, just happen to be very poor quality houses and are not representative of the entire neighborhood. Our model might unfairly give too much negative weight to that neighborhood and then make poor predictions in the future.

Of course, this isn't always the case and a single unique category can actually give useful information. Perhaps there is a single house that has a solid-gold toilet that massively increases the value of the house. 

But, in general, I like to experiment with consolidating low-frequency categories so that the model can generalize better.

### Finding low-frequency categoricals
The `value_counts` Series method finds the number of times each category appears. Let's see an example.

In [None]:
housing['LotConfig'].value_counts()

In this example, the `LotConfig` feature has 5 unique values but 'FR3' only appears 4 times. By looking at the data dictionary, we can see its description is similar to that of FR2, so we can consider replacing it.
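
Rare values like this can also be flagged programmatically. The sketch below is only an illustration on a made-up Series (the counts are not from the housing data), and the `min_count` cutoff is a hypothetical choice, not a recommended value. It lumps every category at or below the cutoff into a shared 'OTHER' bucket.

```python
import pandas as pd

# Toy stand-in for a categorical column; the values and counts are made up.
lot_config = pd.Series(
    ['Inside'] * 8 + ['Corner'] * 5 + ['CulDSac'] * 3 + ['FR2'] * 2 + ['FR3'] * 1
)

counts = lot_config.value_counts()

# Hypothetical cutoff: categories seen this many times or fewer count as rare.
min_count = 2
rare = counts[counts <= min_count].index

# Keep frequent categories and send every rare one to a shared 'OTHER' bucket.
lumped = lot_config.where(~lot_config.isin(rare), 'OTHER')
print(lumped.value_counts())
```

Grouping by a count threshold is blunt. When the data dictionary shows that a rare value is close in meaning to a common one, merging it into that specific category may preserve more information.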

### An automated way to find low-frequency categoricals
We can loop through each column and run the `value_counts` method on it if it is a string column ('object' in pandas). There are several columns in this dataset that are numeric but represent discrete categories, as we saw with the first column, 'MSSubClass'. To catch these, we also run `value_counts` when the number of unique values is below a certain threshold (30 here). Below, only the columns that have at least one category appearing 5 or fewer times are printed to the screen.

In [None]:
for col in housing.columns:
    if housing[col].dtype == 'object' or housing[col].nunique() < 30:
        vc = housing[col].value_counts(dropna=False)
        if vc.min() <= 5:
            print(f'\nColumn {col}')
            print(vc)

In [None]:
print(open('data/data_description.txt').read())

Columns to remove completely:
* Utilities, Condition2, RoofMatl, Heating, LowQualFinSF, PoolArea, MSSubClass
* MSSubClass maps almost one-to-one onto HouseStyle, so it adds little new information

### Replacing with `replace`

The pandas DataFrame `replace` method helps us replace values within particular columns. We do this by creating a dictionary that maps each column name to another dictionary, which in turn maps each value to be replaced to its replacement value.
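
As a minimal sketch of the nested-dictionary form, here is `replace` applied to a tiny made-up DataFrame (the rows are illustrative, not taken from the housing data). Each outer key is a column name, and each inner dictionary lists old value to new value pairs for that column only.

```python
import pandas as pd

# Tiny made-up frame that reuses two column names from the housing data.
df = pd.DataFrame({
    'LotConfig': ['Inside', 'FR2', 'FR3', 'Corner'],
    'OverallQual': [1, 5, 7, 1],
})

# Outer keys are column names; inner dicts map old value -> new value.
mapping = {
    'LotConfig': {'FR2': 'FR', 'FR3': 'FR'},
    'OverallQual': {1: 2},
}

# replace returns a new DataFrame and leaves df untouched.
replaced = df.replace(mapping)
print(replaced)
```

Values that are not listed in an inner dictionary pass through unchanged, so the mapping only needs to name the categories being consolidated.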

In [None]:
replace_dict = \
{
    'LotConfig': {'FR3': 'FR', 'FR2': 'FR'},
    'Condition1': {'PosA': 'Pos', 
                   'PosN': 'Pos', 
                   'RRAe': 'RR', 
                   'RRNe': 'RR', 
                   'RRAn': 'RR', 
                   'RRNn': 'RR'},
    'OverallQual': {1: 2},
    'OverallCond': {1: 2},
    'Exterior1st': {'BrkComm': 'OTHER', 
                    'Stone': 'OTHER', 
                    'AsphShn': 'OTHER',
                    'CBlock': 'OTHER',
                    'ImStucc': 'OTHER',
                    'Other': 'OTHER',
                    'Brk Cmn': 'OTHER'},
    'Exterior2nd': {'BrkComm': 'OTHER', 
                    'Stone': 'OTHER', 
                    'AsphShn': 'OTHER',
                    'CBlock': 'OTHER',
                    'ImStucc': 'OTHER',
                    'Other': 'OTHER',
                    'Brk Cmn': 'OTHER'},
    'ExterCond':{'Po': 'Fa', 
                 'Ex': 'Gd'},
    'Foundation': {'Stone': 'OTHER',
                   'Wood': 'OTHER'},
    'BsmtCond': {'Po': 'Fa'},
    'Functional': {'Sev': 'Maj',
                   'Maj1': 'Maj',
                   'Maj2': 'Maj',
                   'Min1': 'Min',
                   'Min2': 'Min'},
    'HeatingQC': {'Po': 'Fa'},
    'GarageQual': {'Ex': 'Gd', 'Po': 'Fa'},
    'GarageCond': {'Ex': 'Gd', 'Po': 'Fa'},
    'MoSold': {12: 'Winter', 1: 'Winter', 2: 'Winter',
                3: 'Spring', 4: 'Spring',5: 'Spring',
                6: 'Summer', 7: 'Summer',8: 'Summer',
                9: 'Fall', 10: 'Fall', 11: 'Fall'}
}

# Columns where only the listed values are kept; every other value becomes 'OTHER'
keep_dict = {
    'RoofStyle': ['Gable', 'Hip'],
    'SaleType': ['WD', 'New', 'COD'],
    'Heating':  ['GasA', 'GasW']
}

# Lower and/or upper bounds passed to Series.clip for each column
clip_dict = {
    'BsmtFullBath': {'upper': 2},
    'BedroomAbvGr': {'lower': 1, 'upper': 5},
    'TotRmsAbvGrd': {'lower': 3, 'upper': 11},
    'Fireplaces': {'upper': 2},
    'GarageCars': {'upper': 3}
}

# Maps column -> [value (or callable) to test for, whether to flip the result]
binarizer_dict = {
    'LotShape': ['Reg', True],
    'LandSlope': ['Gtl', False],
    'BsmtHalfBath': [0, True],
    '3SsnPorch': [0, True],
    'WoodDeckSF': [0, True],
    'EnclosedPorch': [0, True],
    'OpenPorchSF': [0, True],
    'ScreenPorch': [0, True],
    'PoolQC': [lambda x: x.notna(), False],
    'MiscFeature': ['Shed', False],
    'Electrical': ['SBrkr', False]
}

# Maps column -> the total column it is divided by, giving a fraction of that total
percentile_dict = {
    'LowQualFinSF': 'GrLivArea',
    '2ndFlrSF': 'GrLivArea',
    'BsmtFinSF1': 'TotalBsmtSF', 
    'BsmtFinSF2': 'TotalBsmtSF', 
    'BsmtUnfSF': 'TotalBsmtSF'
}

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class Replacer(BaseEstimator, TransformerMixin):
    
    def __init__(self, replace_dict=None, keep_dict=None):
        self.replace_dict = replace_dict
        self.keep_dict = keep_dict
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        if self.replace_dict:
            X = X.replace(self.replace_dict)
        else:
            X = X.copy()
        if self.keep_dict:
            for col, keep_vals in self.keep_dict.items():
                keep = X[col].isin(keep_vals)
                X[col] = X[col].where(keep, 'OTHER')
        return X
    
class Clipper(BaseEstimator, TransformerMixin):
    
    def __init__(self, clip_dict):
        self.clip_dict = clip_dict
        
    def fit(self, X, y=None):
        return self
        
    def transform(self, X, y=None):
        X = X.copy()
        for col, clip_kwargs in self.clip_dict.items():
            X[col] = X[col].clip(**clip_kwargs)
        return X
    
class MyBinarizer(BaseEstimator, TransformerMixin):
    
    def __init__(self, binarize_dict):
        self.binarize_dict = binarize_dict
        
    def fit(self, X, y=None):
        return self
        
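    def transform(self, X, y=None):
        X = X.copy()
        for col, (item, reverse) in self.binarize_dict.items():
            if callable(item):
                # A callable receives the whole column and returns a boolean Series
                X[col] = item(X[col])
            else:
                # Otherwise flag the rows whose value equals `item`
                X[col] = X[col] == item
            if reverse:
                # Flip the flag: 1 where the test failed, 0 where it passed
                X[col] = 1 - X[col]
        return X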
    
class Percentiler(BaseEstimator, TransformerMixin):
    
    def __init__(self, percentile_dict=None):
        self.percentile_dict = percentile_dict
        
    def fit(self, X, y=None):
        return self
        
    def transform(self, X, y=None):
        X = X.copy()
        for col, total_col in self.percentile_dict.items():
            # 0 / 0 yields NaN here; the constant imputer later fills it with 0
            X[col] = X[col] / X[total_col]
        return X
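
To make the four transformers easier to follow, the sketch below runs the pandas operation at the heart of each one on a small made-up DataFrame. The rows are illustrative only, and the variable names (`toy`, `clipped`, `kept`, `has_deck`, `share`) are hypothetical.

```python
import pandas as pd

# Small made-up frame that reuses a few housing column names.
toy = pd.DataFrame({
    'Fireplaces': [0, 1, 3],
    'RoofStyle': ['Gable', 'Hip', 'Mansard'],
    'WoodDeckSF': [0, 120, 0],
    '2ndFlrSF': [0, 500, 800],
    'GrLivArea': [1000, 1500, 1600],
})

# Clipper: cap values at a bound (here an upper bound of 2).
clipped = toy['Fireplaces'].clip(upper=2)

# Replacer with keep_dict: keep the listed values, send the rest to 'OTHER'.
keep = toy['RoofStyle'].isin(['Gable', 'Hip'])
kept = toy['RoofStyle'].where(keep, 'OTHER')

# MyBinarizer with item=0 and reverse=True: 1 where the value is not 0.
has_deck = 1 - (toy['WoodDeckSF'] == 0)

# Percentiler: express one column as a fraction of a total column.
share = toy['2ndFlrSF'] / toy['GrLivArea']

print(pd.DataFrame({'clipped': clipped, 'kept': kept,
                    'has_deck': has_deck, 'share': share}))
```

Wrapping these one-liners in classes with `fit` and `transform` is what lets them sit inside a scikit-learn `Pipeline`, so the same steps are applied to each cross-validation fold and to new data.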

In [None]:
not_used = ['Utilities', 'Condition2', 'RoofMatl', 'Heating', 
            'PoolArea', 'MSSubClass', '1stFlrSF']

str_nomial = ['MSZoning', 'Street', 'Alley', 'LandContour', 'LotConfig', 
              'Neighborhood', 'Condition1', 'BldgType', 'HouseStyle',
              'RoofStyle', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation',
              'CentralAir', 'GarageType', 'GarageFinish', 'PavedDrive',
              'SaleType', 'SaleCondition', 'MoSold']
str_ordinal = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 
               'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 
               'Functional', 'GarageQual', 'GarageCond', 'Fence', 'FireplaceQu']

numeric_nominal = ['YrSold']
numeric_ordinal = ['OverallQual', 'OverallCond', 'FullBath', 'HalfBath',
                   'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars']
numeric_cont = ['LotFrontage', 'LotArea', 'MasVnrArea', 'TotalBsmtSF', 
                'GrLivArea', 'GarageArea']
numeric_years = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']
numeric_perc = ['LowQualFinSF', '2ndFlrSF', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF']

all_binarized = ['WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 
                 'BsmtHalfBath', 'LotShape', 'LandSlope', 'PoolQC', 'MiscFeature', 'Electrical']

from sklearn.preprocessing import KBinsDiscretizer

numeric_years_steps = [
    ('si', SimpleImputer(strategy='median')),
    ('kbd', KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='kmeans'))
]
numeric_perc_steps = [
    ('si', SimpleImputer(strategy='constant', fill_value=0))
]
all_binarized_steps = [
    ('si', SimpleImputer(strategy='constant', fill_value=0))
]
numeric_years_pipe = Pipeline(numeric_years_steps)
numeric_perc_pipe = Pipeline(numeric_perc_steps)
all_binarized_pipe = Pipeline(all_binarized_steps)

transformers = [
    ('str_nominal_pipe', str_nominal_pipe, str_nomial),
    ('str_ordinal_pipe', str_ordinal_pipe, str_ordinal),
    ('numeric_nominal_pipe', numeric_nominal_pipe, numeric_nominal),
    ('numeric_ordinal_pipe', numeric_ordinal_pipe, numeric_ordinal),
    ('numeric_cont_pipe', numeric_cont_pipe, numeric_cont),
    ('numeric_years_pipe', numeric_years_pipe, numeric_years),
    ('numeric_perc_pipe', numeric_perc_pipe, numeric_perc),
    ('all_binarized_pipe', all_binarized_pipe, all_binarized)
]
ct = ColumnTransformer(transformers)
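
The year columns above go through `KBinsDiscretizer`, which turns a numeric column into a handful of bins and one-hot encodes the bin membership. As a rough sketch of what that step produces, here it is on six made-up years with 3 bins (the pipeline above uses 5); the data and the bin count are assumptions chosen to keep the output small.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Six made-up construction years, shaped as a single-column 2D array.
years = np.array([[1900], [1905], [1950], [1955], [2000], [2005]])

# strategy='kmeans' places bin edges between 1D k-means cluster centers.
kbd = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='kmeans')
binned = kbd.fit_transform(years)

print(kbd.bin_edges_[0])  # learned edges for the single column
print(binned)             # one row per house, one indicator column per bin
```

Binning lets a linear model such as ridge regression assign a separate coefficient to each era instead of assuming price changes linearly with the year.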

In [None]:
mb = MyBinarizer(binarizer_dict)
clipper = Clipper(clip_dict)
replacer = Replacer(replace_dict, keep_dict)
percentiler = Percentiler(percentile_dict)
rp = Pipeline([('mb', mb), 
               ('clipper', clipper), 
               ('replacer', replacer),
               ('percentiler', percentiler),
               ('ct', ct), 
               ('ridge', ridge)])
rp_ttr = TransformedTargetRegressor(rp, func=np.log, inverse_func=np.exp)

In [None]:
param_grid = {'regressor__ridge__alpha': np.logspace(-1, 3, 10)}
gs = GridSearchCV(rp_ttr, param_grid, cv=kf, scoring='neg_mean_squared_log_error')
gs.fit(housing, y);
rp_ttr_best = gs.best_estimator_
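
The parameter name `regressor__ridge__alpha` reads from the outside in: `regressor` is the pipeline wrapped by `TransformedTargetRegressor`, `ridge` is the step name inside that pipeline, and `alpha` is the ridge penalty. The self-contained sketch below reproduces that structure on synthetic data (the data, the `scale` step, and the fold settings are assumptions, not part of the housing pipeline) and shows how the search results might be inspected.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, strictly positive target so the log transform is valid.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.exp(1.0 + X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200))

# The step name 'ridge' is what appears in the parameter path below.
pipe = Pipeline([('scale', StandardScaler()), ('ridge', Ridge())])
ttr = TransformedTargetRegressor(regressor=pipe, func=np.log, inverse_func=np.exp)

param_grid = {'regressor__ridge__alpha': np.logspace(-1, 3, 10)}
kf = KFold(n_splits=5, shuffle=True, random_state=0)
gs = GridSearchCV(ttr, param_grid, cv=kf, scoring='neg_mean_squared_log_error')
gs.fit(X, y)

print(gs.best_params_)
# best_score_ is the negated mean squared log error averaged over folds,
# so negate it before taking the square root.
print(np.sqrt(-gs.best_score_))
```

Because scikit-learn scorers follow a "higher is better" convention, the error is reported as a negative number, and the sign has to be flipped before comparing it with earlier root mean squared log error estimates.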