# Massive Machine Learning Pipelines - Part 2
We will move on to building the massive machine learning pipeline. The overall architecture will look similar to the mini-pipeline from above with the major difference being the number of distinct column groups. Each of the column groupings we create will have its own pipeline.

## Create column groupings
For this massive pipeline, we will use all of the columns. Typically, it's a good idea to do exploratory data analysis first to manually inspect the columns and possibly select a subset to model on. We will forgo this step and go straight into pipeline construction. Further along in the tutorial we will do some data inspection and feature engineering.

Below, we separate the data into five separate column groupings. The data type of each column is either numeric or string. Both numeric and string columns can be categorical, but only numeric data can be continuous. Categorical data is further subdivided into nominal (no natural ordering) or ordinal (has a natural ordering - basement quality for example). The data dictionary was used to help classify each column correctly.

In [None]:
str_nomial = ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 
              'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
              'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation',
              'Heating', 'CentralAir', 'Electrical', 'GarageType', 'GarageFinish', 'PavedDrive',
              'MiscFeature', 'SaleType', 'SaleCondition']
str_ordinal = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
               'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'Functional', 'GarageQual', 'GarageCond',
               'PoolQC', 'Fence', 'FireplaceQu']

numeric_nominal = ['MSSubClass', 'MoSold', 'YrSold', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt']
numeric_ordinal = ['OverallQual', 'OverallCond', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
                   'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars']
numeric_cont = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 
                'TotalBsmtSF','1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 
                'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
                'MiscVal']
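
With this many hand-typed lists, it is easy to list a column twice or leave one out. The sketch below is one possible sanity check, not part of the original notebook: the helper name `check_column_groups` and the toy lists are assumptions made for illustration. It reports columns that appear in more than one group and columns that were never assigned.

```python
from collections import Counter

def check_column_groups(groups, all_columns):
    """Return (duplicated, unassigned) columns for a dict of group name -> column list."""
    counts = Counter(col for cols in groups.values() for col in cols)
    duplicated = sorted(col for col, n in counts.items() if n > 1)
    unassigned = sorted(set(all_columns) - set(counts))
    return duplicated, unassigned

# Toy example with a handful of housing columns (not the full lists above).
toy_groups = {
    'str_nominal': ['MSZoning', 'Street'],
    'numeric_ordinal': ['OverallQual', 'Fireplaces'],
    'numeric_cont': ['LotArea', 'Fireplaces'],  # 'Fireplaces' repeated on purpose
}
toy_columns = ['Id', 'MSZoning', 'Street', 'OverallQual', 'Fireplaces', 'LotArea', 'GarageArea']

duplicated, unassigned = check_column_groups(toy_groups, toy_columns)
print(duplicated)  # ['Fireplaces']
print(unassigned)  # ['GarageArea', 'Id']
```

On the real data, the same helper could be called with the five lists from the cell above and `housing.columns`. There, `Id` would be expected to show up as unassigned.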

### Nominal vs Ordinal
Most of the time it is easy to determine whether a categorical column is nominal or ordinal. The neighborhood column, for instance, has no natural ordering and would therefore be classified as nominal.

But with a column like LotShape, it isn't so clear. The values are Reg (Regular), IR1 (Slightly Irregular), IR2 (Moderately Irregular), and IR3 (Irregular). The values appear to have an order with Reg being the 'best' and IR3 the 'worst'. But, without being an expert in this field, it's probably safer to assume less and treat it as nominal.

### One-Hot encoding all categorical columns
While scikit-learn does provide an `OrdinalEncoder` to encode ordinal features, we will opt to one-hot encode them instead, for a couple of reasons. First, ordinal encoding places a stricter assumption on the data: that each category is worth precisely one more than the prior one. Second, we would need to provide an ordered list of categories for each column, which would be quite tedious. So, partly out of laziness, we will one-hot encode all categorical columns regardless of whether they are nominal or ordinal.
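
To make the difference concrete, here is a small sketch on a toy quality column. It is only an illustration: the four sample values and the variable names are made up, and the `Po < Fa < TA < Gd < Ex` ordering is the kind of list we would have to supply by hand for every ordinal column. `OrdinalEncoder` maps each category to its position in that list, while `OneHotEncoder` creates one indicator column per category seen in the data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy quality ratings; the ordered category list must be written out manually.
quality = np.array([['TA'], ['Gd'], ['Ex'], ['Fa']], dtype=object)
order = ['Po', 'Fa', 'TA', 'Gd', 'Ex']

# Ordinal encoding: each category becomes its index in `order`.
ordinal = OrdinalEncoder(categories=[order]).fit_transform(quality)
print(ordinal.ravel())  # [2. 3. 4. 1.]

# One-hot encoding: one 0/1 column per category seen during fit.
# The default output is sparse, so convert it to a dense array to inspect it.
onehot = OneHotEncoder(handle_unknown='ignore').fit_transform(quality).toarray()
print(onehot.shape)  # (4, 4)
```

With the ordinal encoding, a linear model sees 'Ex' as exactly two units above 'TA'. With one-hot encoding, each category gets its own coefficient.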

### Create pipeline steps for each column group

We will now create a pipeline for each subset of columns. We will fill missing values with the most frequent value for all of the categorical columns and with the median for the continuous columns.

In [None]:
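# Imports are included in case they are not already in scope from earlier cells.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Note: scikit-learn 1.2 renamed OneHotEncoder's `sparse` parameter to `sparse_output`;
# on scikit-learn 1.4 or later, pass `sparse_output=False` instead.
str_nominal_steps = [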
    ('si', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))
]
str_orinal_steps = [
    ('si', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))
]
numeric_nominal_steps = [
    ('si', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))
]
numeric_ordinal_steps = [
    ('si', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))
]
numeric_cont_steps = [
    ('si', SimpleImputer(strategy='median')),
    ('ss', StandardScaler())
]
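
As a quick illustration of the two imputation strategies used above, the sketch below runs `SimpleImputer` on two tiny made-up columns (the values and variable names are assumptions chosen for the example). The categorical column is filled with its most common value and the numeric column with its median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy columns, each with one missing value.
cat = np.array([['Gd'], ['TA'], [np.nan], ['TA']], dtype=object)
num = np.array([[10.0], [np.nan], [30.0], [200.0]])

# 'most_frequent' works for strings; 'TA' appears twice, so it fills the gap.
cat_filled = SimpleImputer(strategy='most_frequent').fit_transform(cat)
print(cat_filled.ravel())  # ['Gd' 'TA' 'TA' 'TA']

# 'median' ignores the missing entry: the median of 10, 30, 200 is 30.
num_filled = SimpleImputer(strategy='median').fit_transform(num)
print(num_filled.ravel())  # [ 10.  30.  30. 200.]
```

The median is less sensitive to outliers such as the 200 here than the mean would be.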

### Create pipeline for each column group
We instantiate each pipeline with the steps (a list of two-item tuples) from above. Notice that all pipelines are identical except for the continuous one. We could have used just two pipelines to transform the data like we did in our first attempt. In upcoming examples, we will see different sets of transformations applied to each pipeline. Regardless, it is still useful to see how different columns may be categorized.

In [None]:
str_nominal_pipe = Pipeline(str_nominal_steps)
str_ordinal_pipe = Pipeline(str_orinal_steps)
numeric_nominal_pipe = Pipeline(numeric_nominal_steps)
numeric_ordinal_pipe = Pipeline(numeric_ordinal_steps)
numeric_cont_pipe = Pipeline(numeric_cont_steps)

### Create ColumnTransformer
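We instantiate our `ColumnTransformer` with a list of three-item tuples, each holding a name, a pipeline, and the list of columns that pipeline is applied to. Columns that are not listed in any tuple, such as `Id`, are dropped by default.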

In [None]:
transformers = [
    ('str_nominal_pipe', str_nominal_pipe, str_nomial),
    ('str_ordinal_pipe', str_ordinal_pipe, str_ordinal),
    ('numeric_nominal_pipe', numeric_nominal_pipe, numeric_nominal),
    ('numeric_ordinal_pipe', numeric_ordinal_pipe, numeric_ordinal),
    ('numeric_cont_pipe', numeric_cont_pipe, numeric_cont)
]
ct = ColumnTransformer(transformers)
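
To see what the `ColumnTransformer` produces, here is a scaled-down sketch with one categorical and one continuous column. The toy DataFrame and the names `toy`, `cat_pipe`, `num_pipe`, and `toy_ct` are assumptions for illustration only. Passing `sparse_threshold=0` is a choice made here to force a dense result; the notebook instead requests dense output from each `OneHotEncoder`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing value per feature column.
toy = pd.DataFrame({
    'Id': [1, 2, 3, 4],  # not listed in any transformer, so it is dropped
    'Street': ['Pave', 'Grvl', 'Pave', np.nan],
    'LotArea': [8450.0, 9600.0, np.nan, 11250.0],
})

cat_pipe = Pipeline([
    ('si', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])
num_pipe = Pipeline([
    ('si', SimpleImputer(strategy='median')),
    ('ss', StandardScaler()),
])

# Each tuple is (name, transformer, columns); outputs are stacked side by side.
toy_ct = ColumnTransformer(
    [('cat', cat_pipe, ['Street']), ('num', num_pipe, ['LotArea'])],
    sparse_threshold=0,  # always return a dense array
)
out = toy_ct.fit_transform(toy)
print(out.shape)  # (4, 3)
```

The three output columns are the two one-hot columns for `Street` followed by the standardized `LotArea`.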

### Create one last pipeline to add machine learning
A final pipeline is created to connect the transformed data to the Ridge regression model.

In [None]:
final_steps = [
    ('ct', ct),
    ('ridge', Ridge())
]
final_pipe2 = Pipeline(final_steps)

### Fit and predict
We can now pass our data into the final pipeline to transform and ultimately train our ridge regression model. We follow this up by predicting the sale price for the test set and saving a new submission file.

In [None]:
final_pipe2.fit(housing, y)
y_pred = final_pipe2.predict(housing_test)

### Automate submission to Kaggle
We can define a function that creates the csv and submits the predictions to Kaggle.

In [None]:
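def submit_kaggle(model, X_test, file, message):
    # Relies on `pd`, `kaggle`, and a `competition` name defined earlier in the notebook.
    y_pred = model.predict(X_test)
    # Pair each test-set Id with its predicted sale price.
    sub = pd.DataFrame({'Id': X_test['Id'], 'SalePrice': y_pred})
    sub.to_csv(file, index=False)
    kaggle.api.competition_submit(file, message, competition)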
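
Since the submission call needs network access and credentials, it can help to check the csv-building step on its own first. The sketch below is one way to do that, under stated assumptions: `make_submission` and `ConstantModel` are hypothetical names introduced here, and the toy test frame is made up. Any object with a `predict` method, such as the fitted pipeline, could take the place of the stand-in model.

```python
import pandas as pd

def make_submission(model, X_test):
    """Build a frame with the Id and SalePrice columns used for the submission."""
    y_pred = model.predict(X_test)
    return pd.DataFrame({'Id': X_test['Id'], 'SalePrice': y_pred})

class ConstantModel:
    """Stand-in for a fitted pipeline; always predicts the same price."""
    def predict(self, X):
        return [100000.0] * len(X)

# Made-up test rows, just enough to exercise the function.
toy_test = pd.DataFrame({'Id': [1461, 1462], 'LotArea': [11622, 14267]})
sub = make_submission(ConstantModel(), toy_test)
print(list(sub.columns))  # ['Id', 'SalePrice']
print(len(sub))           # 2
```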

In [None]:
file = 'data/submissions/20190710/sub02.csv'
message = '''
One hot encoded all categorical columns and standardized all continuous
columns. Modeled with ridge regression with alpha=1
'''
# submit_kaggle(final_pipe2, housing_test, file, message)