# Ames Housing Prices - Step 4: Modeling
We are now ready to build a regression model to predict sale prices. This notebook shows how to reuse the earlier cleaning and feature-preparation work to quickly build the engineered features we need to train our ML model.

In [18]:
# Basic setup
%run config.ipynb

In [19]:
# Connect to Cortex 5 and create a Builder instance
cortex = Cortex.client()
builder = cortex.builder()

### Training Data
We will start with the training dataset from our previous steps and run the _features_ pipeline to get cleaned and prepared data.

In [20]:
train_ds = cortex.dataset('kaggle/ames-housing-train')

In [21]:
pipeline = train_ds.pipeline('features')
train_df = pipeline.run()

running pipeline [clean] for dataset [kaggle/ames-housing-train]:
> drop_unused 
> drop_outliers 
> fill_zero_cols 
> fill_median_cols 
> fill_na_none 
running pipeline [features] for dataset [kaggle/ames-housing-train]:
> scale_target 


### Feature Framing
We now need to split out our target variable from the training data and convert our categorical values into _dummies_.

In [22]:
y = train_df['SalePrice']

In [23]:
def drop_target(pipeline, df):
    # Drop the target in place; the positional axis argument was removed in pandas 2.0
    df.drop(columns='SalePrice', inplace=True)
    
def get_dummies(pipeline, df):
    return pd.get_dummies(df)

pipeline = train_ds.pipeline('engineer', depends=['features'], clear_cache=True)
pipeline.reset()
pipeline.add_step(drop_target)
pipeline.add_step(get_dummies)

# Run the feature engineering pipeline to prepare for model training
train_df = pipeline.run()

# Remember the full set of engineered columns we need to produce for the model
pipeline.set_context('columns', train_df.columns.tolist())

print('\nTrain shape: (%d, %d)' % train_df.shape)

running pipeline [clean] for dataset [kaggle/ames-housing-train]:
> drop_unused 
> drop_outliers 
> fill_zero_cols 
> fill_median_cols 
> fill_na_none 
running pipeline [features] for dataset [kaggle/ames-housing-train]:
> scale_target 
running pipeline [engineer] for dataset [kaggle/ames-housing-train]:
> drop_target 
> get_dummies 

Train shape: (1458, 303)
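
The column list saved with `pipeline.set_context('columns', ...)` matters because `pd.get_dummies` only creates columns for the categories present in the data it is given. A minimal sketch of how a saved column list can be used to align new data with the training layout is shown below. The toy frames and the `reindex` step are assumptions for illustration; this notebook does not show how Cortex consumes the saved context.

```python
import pandas as pd

# Toy training frame: one numeric and one categorical feature (illustrative values)
train = pd.DataFrame({'LotArea': [8450, 9600, 11250],
                      'Street': ['Pave', 'Grvl', 'Pave']})
train_dummies = pd.get_dummies(train)

# The full set of engineered columns, as remembered after training
columns = train_dummies.columns.tolist()

# New data: the 'Grvl' category is absent and an unseen 'Dirt' category appears
new = pd.DataFrame({'LotArea': [7000, 10000],
                    'Street': ['Pave', 'Dirt']})
new_dummies = pd.get_dummies(new)

# Align to the training layout: missing columns are filled with 0,
# columns the model never saw are dropped
aligned = new_dummies.reindex(columns=columns, fill_value=0)
print(aligned.columns.tolist())
```

Without this alignment, a model trained on 303 columns would reject or misread a frame whose dummy columns differ in number or order.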


## Model Training, Validation, and Experimentation
We are going to try a variety of algorithms and parameters to achieve the best results. This will be an iterative process that Cortex 5 helps us track and reproduce later by recording the data pipeline used, the model parameters, the model metrics, and the model artifacts in Experiments.

In [24]:
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [25]:
def train(x, y, **kwargs):
    alphas = kwargs.get('alphas', [1, 0.1, 0.001, 0.0001])

    # Select algorithm
    mtype = kwargs.get('model_type')
    if mtype == 'Lasso':
        model = LassoCV(alphas=alphas)
    elif mtype == 'Ridge':
        model = RidgeCV(alphas=alphas)
    elif mtype == 'ElasticNet':
        model = ElasticNetCV(alphas=alphas)
    else:
        model = LinearRegression()

    # Train model
    model.fit(x, y)
    
    return model
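
The `RidgeCV`, `LassoCV`, and `ElasticNetCV` estimators used in `train` each pick one value from the supplied `alphas` by cross-validation and expose the winner as `alpha_` after fitting. The sketch below illustrates this on synthetic data; the data and coefficient values are assumptions made up for the example, not the Ames dataset.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic regression problem: only the first three of ten features carry signal
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
coef = np.zeros(10)
coef[:3] = [2.0, -1.0, 0.5]
y = X @ coef + rng.normal(scale=0.1, size=200)

# Same candidate regularization strengths as the default in train()
alphas = [1, 0.1, 0.001, 0.0001]

# Each estimator cross-validates over the candidates and keeps the best one
ridge = RidgeCV(alphas=alphas).fit(X, y)
lasso = LassoCV(alphas=alphas).fit(X, y)

print('Ridge selected alpha:', ridge.alpha_)
print('Lasso selected alpha:', lasso.alpha_)
```

Logging `alpha_` alongside the candidate list would record which regularization strength was actually used, not just which ones were tried.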

In [26]:
def predict_and_score(model, x, y):
    predictions = model.predict(x)
    # mean_squared_error expects (y_true, y_pred); the result is the same either way
    rmse = np.sqrt(mean_squared_error(y, predictions))
    return [predictions, rmse]
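
For reference, the RMSE computed in `predict_and_score` is the square root of the mean squared difference between actual and predicted values, so it is expressed in the units of the (scaled) target. A small sketch with made-up numbers shows that the scikit-learn route and the formula written out by hand agree:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only, not taken from the Ames data
y_true = np.array([12.0, 11.5, 12.3, 11.9])
y_pred = np.array([12.1, 11.4, 12.0, 12.0])

# RMSE via scikit-learn, as in predict_and_score
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

# RMSE written out: sqrt(mean((y - y_hat)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse_sklearn, rmse_manual)
```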

In [27]:
X_train, X_test, y_train, y_test = train_test_split(train_df, y.values, test_size=0.20, random_state=10)
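
As a quick check on what this split does, the sketch below applies the same `test_size` and `random_state` to a placeholder array with 1458 rows (the row count reported for the training frame). The placeholder data is an assumption; only the split behavior is being illustrated.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the training frame
X = np.arange(1458 * 2).reshape(1458, 2)
y = np.arange(1458)

# 20% of the rows are held out; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=10)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.20, random_state=10)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` keeps the held-out rows identical between runs, which makes the testing errors of different runs comparable.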

### Experiment Management
We are ready to run our training and validation loop and select the best model. As the experiment runs, Cortex tracks each run and records the key parameters, metrics, and artifacts needed to reproduce or deploy the model later.

In [34]:
%%time

best_model = None
best_model_type = None
best_rmse = float('inf')  # any real RMSE beats this, so best_model is always set

exp = cortex.experiment('kaggle/ames-housing-regression')
# exp.reset()
exp.set_pipeline(pipeline)
exp.set_meta('style', 'supervised')
exp.set_meta('function', 'regression')

with exp.start_run() as run:
    alphas = [1, 0.1, 0.001, 0.0001]
    for model_type in ['Linear', 'Lasso', 'Ridge', 'ElasticNet']:
        print('---'*30)
        print('Training model using {} regression algorithm'.format(model_type))
        model = train(X_train, y_train, model_type=model_type, alphas=alphas)
        [predictions, rmse] = predict_and_score(model, X_train, y_train)
        print('Training error:', rmse)
        [predictions, rmse] = predict_and_score(model, X_test, y_test)
        print('Testing error:', rmse)
        
        if rmse < best_rmse:
            best_rmse = rmse
            best_model = model
            best_model_type = model_type
    
    r2 = best_model.score(X_test, y_test)
    run.log_metric('r2', r2)
    run.log_metric('rmse', best_rmse)
    run.log_param('model_type', best_model_type)
    run.log_param('alphas', alphas)
    run.log_artifact('model', best_model)

print('---'*30)

------------------------------------------------------------------------------------------
Training model using Linear regression algorithm
Training error: 0.08792096455489082
Testing error: 0.11715496123176918
------------------------------------------------------------------------------------------
Training model using Lasso regression algorithm
Training error: 0.10474725124109076
Testing error: 0.11210731416333446
------------------------------------------------------------------------------------------
Training model using Ridge regression algorithm
Training error: 0.08952814678982611
Testing error: 0.1108089661962949
------------------------------------------------------------------------------------------
Training model using ElasticNet regression algorithm
Training error: 0.09986249373851433
Testing error: 0.10851964458526744
------------------------------------------------------------------------------------------
CPU times: user 1.83 s, sys: 20.8 ms, total: 1.85 s
Wall time: 3

In [35]:
print('Best model: ' + best_model_type)
print('Best testing error: %.6f' % best_rmse)
print('R2 score: %.6f' % r2)

Best model: ElasticNet
Best testing error: 0.108520
R2 score: 0.920501
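
The R2 value reported above comes from the estimator's `score` method, which for scikit-learn regressors returns the coefficient of determination. A minimal sketch on synthetic data (an assumption for illustration, not the Ames data) shows that it matches `1 - SS_res / SS_tot` computed by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic linear data with noise
rng = np.random.RandomState(10)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(model.score(X, y), r2_score(y, pred), r2_manual)
```

An R2 near 1 means the model explains most of the variance in the target; 0 corresponds to always predicting the mean.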


In [36]:
exp

| ID | Date | Took | Params: alphas | Params: model_type | Metrics: r2 | Metrics: rmse |
|---|---|---|---|---|---|---|
| uy8oaok | Tue, 21 Aug 2018 21:48:56 GMT | 1.36 s | [1, 0.1, 0.001, 0.0005] | Lasso | 0.920696 | 0.108386 |
| joaoamy | Tue, 21 Aug 2018 21:49:07 GMT | 1.82 s | [1, 0.1, 0.001, 0.0001] | ElasticNet | 0.920501 | 0.10852 |
