# Model Building

First, define a few functions for reusability.

We will use K-fold cross-validation both in basic model testing and in parameter searching. See the documentation:
 - [KFold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)
 - [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)
 - [GridSearchCV](http://scikit-learn.org/dev/modules/generated/sklearn.model_selection.GridSearchCV.html)

In [None]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error

train_x0, train_y = split_xy(makeStr(featureFix(scalePrice(impData0))))
train_x1, _ = split_xy(makeStr(featureFix(scalePrice(impData1))))

# Keep the splitter itself: .get_n_splits() returns only the integer 10, which would drop the shuffling
kf = KFold(n_splits=10, shuffle=True)

def rmse(model, x, y=train_y):
    # cross_val_score returns negated MSE per fold, so flip the sign before taking the root
    scores = np.sqrt(-cross_val_score(model, x.values, y.values, scoring="neg_mean_squared_error", cv=kf))
    return scores
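
A point that is easy to miss with the `cv` argument: `cross_val_score` accepts either an integer or a splitter object, and `KFold.get_n_splits` returns only the integer. Given an integer, scikit-learn falls back to an unshuffled K-fold for regressors, so a `shuffle=True` setting is lost. The sketch below shows both variants on synthetic data. `Ridge`, `make_regression`, and the name `rmse_cv` are stand-ins chosen for illustration and are not part of this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing features and target (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

splitter = KFold(n_splits=10, shuffle=True, random_state=0)

# get_n_splits() reports the number of folds; it does not return the splitter.
n_folds = splitter.get_n_splits(X)


def rmse_cv(model, X, y, cv):
    """Per-fold RMSE computed from scikit-learn's negated MSE scores."""
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=cv)
    return np.sqrt(-neg_mse)


scores_shuffled = rmse_cv(Ridge(), X, y, cv=splitter)  # shuffled folds
scores_plain = rmse_cv(Ridge(), X, y, cv=n_folds)      # integer cv: unshuffled folds

print(type(n_folds).__name__, n_folds)
print(scores_shuffled.mean(), scores_plain.mean())
```

Passing the splitter without a fixed `random_state` gives a different shuffle on every call, which matters when the same model is scored repeatedly to estimate the spread of the CV error.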

In [2]:
# Note: I'm far more familiar with parallelizing in C++ and Scala, this code is likely not a canonical implementation

from pathos.multiprocessing import ProcessingPool as Pool

def helper_itModel(modelConstructor, data0, data1):
    model0 = modelConstructor
    model1 = modelConstructor
    cv0 = rmse(model0,data0)
    cv1 = rmse(model1,data1)
    return (np.mean(cv0),np.mean(cv1),np.std(cv0),np.std(cv1))

#### This is where the number of threads is used  ####
def itModel(n, modelConstructor, data0, data1, threads = nthreads):
    def tempFunc(modelConstructor):
        return helper_itModel(modelConstructor,data0,data1)
    procList = []
    for i in range(n):
        procList.append(modelConstructor)
    with Pool(processes=threads) as pool:
        multiOut = pool.map(tempFunc,procList)
    means = np.zeros(shape=(2,n))
    stds = np.zeros(shape=(2,n))
    for i in range(n):
        means[0,i] = multiOut[i][0]
        means[1,i] = multiOut[i][1]
        stds[0,i] = multiOut[i][2]
        stds[1,i] = multiOut[i][3]
    return(means,stds)

ModuleNotFoundError: No module named 'pathos'

In [None]:
from scipy.stats import ttest_ind

def ttest_model(n, modelConstructor, data0, data1, threads = nthreads):
    means, _ = itModel(n, modelConstructor, data0, data1, threads)   # pass the argument through, not the global nthreads
    ttest = ttest_ind(means[0,:],means[1,:])
    diff = np.mean(means[0,:]) - np.mean(means[1,:])
    return (ttest,diff)
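
To make the comparison concrete, here is a small sketch of what `ttest_model` computes once the repeated CV means are available. The two arrays are randomly generated placeholders, not results from this notebook, and the Welch variant is shown only as an option when the two sets of runs may have different variances.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical mean CV RMSEs from 50 repeats under each data preparation.
means_a = rng.normal(loc=0.20, scale=0.01, size=50)
means_b = rng.normal(loc=0.18, scale=0.01, size=50)

# Two-sample t-test on the repeated means (equal variances assumed by default).
result = ttest_ind(means_a, means_b)

# Welch's t-test drops the equal-variance assumption.
welch = ttest_ind(means_a, means_b, equal_var=False)

# A positive difference means preparation "a" has the larger error.
diff = means_a.mean() - means_b.mean()

print(diff)
print(result.statistic, result.pvalue)
print(welch.statistic, welch.pvalue)
```

If every repeat returns the same value, the sample variance collapses toward zero and the statistic blows up, so an extreme t value is worth reading as a warning about the repeats rather than as strong evidence.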

# Model and Imputation Comparison

As promised, here is the comparison between imputing zero for LotFrontage (index 0) and the more complicated method of imputing the modal value within each neighborhood (index 1). We'll also look into whether flagging imputed values in the data set helps decision trees, random forests, and gradient-boosted trees using XGBoost.

In [None]:
tree1_ttest, tree1_diff = ttest_model(50,DecisionTreeRegressor(), train_x0, train_x1)

print(tree1_diff)
print(tree1_ttest)

In [None]:
rf_ttest, rf_diff   = ttest_model(50,RandomForestRegressor(), train_x0, train_x1)

print(rf_diff)
print(rf_ttest)

Interestingly, imputing zero for the LotFrontage feature works slightly better (and the difference is statistically significant) for a decision tree regressor, but has no significant effect for a random forest regressor.

In [None]:
def imputeVals_noflag_0(in_df):
    df = in_df.copy()
    for i in fillNone:
        df[i] = df[i].fillna("None")
    for i in fillZero:
        #df["null_%s" % (i)] = df[i].isnull()                           # mark which zeros are imputed
        df[i] = df[i].fillna(0)
    df.Electrical = df.Electrical.fillna("SBrkr")
    df.Functional = df.Functional.fillna("Typ")                        # Documentation instructs to assume "typical" unless otherwise noted
    df.CentralAir = df.CentralAir.fillna("Y")
    #df["null_LotFrontage"] = df.LotFrontage.isnull()
    df.LotFrontage = df.LotFrontage.fillna(0)                             
    df.MSZoning = df.MSZoning.fillna(df.Neighborhood.map(zoning))
    df.Utilities = df.Utilities.fillna(df.Neighborhood.map(utilities))
    df.KitchenQual = df.KitchenQual.fillna("Po")                      #one house missing kitchen data
    df.SaleType = df.SaleType.fillna("Oth")                           # only one missing value, fill the already defined "other"
    df.Exterior1st = df.Exterior1st.fillna("Other")
    df.Exterior2nd = df.Exterior2nd.fillna("Other")                  # the same house is responsible for the missing exterior 1st and 2nd, other is predefined
    df = df.drop(columns=["Id"])
    return(df)

def imputeVals_noflag_1(in_df):
    df = in_df.copy()
    for i in fillNone:
        df[i] = df[i].fillna("None")
    for i in fillZero:
        #df["null_%s" % (i)] = df[i].isnull()
        df[i] = df[i].fillna(0)
    df.Electrical = df.Electrical.fillna("SBrkr")
    df.Functional = df.Functional.fillna("Typ")                        
    df.CentralAir = df.CentralAir.fillna("Y")
    df.LotFrontage = df.LotFrontage.fillna(df.Neighborhood.map(frontage))            # This is the only line different in these two functions, maybe a more elegant solution is possible               
    df.MSZoning = df.MSZoning.fillna(df.Neighborhood.map(zoning))
    df.Utilities = df.Utilities.fillna(df.Neighborhood.map(utilities))
    df.KitchenQual = df.KitchenQual.fillna("Po")                     
    df.SaleType = df.SaleType.fillna("Oth")                           
    df.Exterior1st = df.Exterior1st.fillna("Other")
    df.Exterior2nd = df.Exterior2nd.fillna("Other")                 
    df = df.drop(columns=["Id"])
    return(df)
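
Since the two functions above differ in a single line, one possible tidy-up is a single function that takes the imputation strategy as a parameter. The sketch below covers only the LotFrontage step on a toy frame. The function name, the `strategy` and `add_flag` parameters, and the per-neighborhood mode lookup are assumptions for illustration; the notebook's own `frontage` map is defined elsewhere and may be built differently.

```python
import pandas as pd


def impute_lot_frontage(df, strategy="zero", add_flag=False):
    """Fill missing LotFrontage with 0 or with the modal value of the neighborhood."""
    out = df.copy()
    if add_flag:
        # Record which rows were imputed before the gaps are filled.
        out["null_LotFrontage"] = out["LotFrontage"].isnull()
    if strategy == "zero":
        out["LotFrontage"] = out["LotFrontage"].fillna(0)
    elif strategy == "neighborhood_mode":
        # mode() ignores NaN; a neighborhood with no observed value would raise here.
        modes = out.groupby("Neighborhood")["LotFrontage"].agg(lambda s: s.mode().iloc[0])
        out["LotFrontage"] = out["LotFrontage"].fillna(out["Neighborhood"].map(modes))
    else:
        raise ValueError("unknown strategy: %s" % strategy)
    return out


# Toy data standing in for the training frame.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "LotFrontage": [60.0, 60.0, None, 80.0, None],
})

zero_filled = impute_lot_frontage(toy, strategy="zero", add_flag=True)
mode_filled = impute_lot_frontage(toy, strategy="neighborhood_mode")

print(zero_filled)
print(mode_filled)
```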

In [None]:
train_x_noflag_0, _ = split_xy(makeStr(featureFix(scalePrice(imputeVals_noflag_0(trainData)))))
train_x_noflag_1, _ = split_xy(makeStr(featureFix(scalePrice(imputeVals_noflag_1(trainData)))))

This confirms the suspicion that adding those flags does little to improve the quality of the base decision tree, even though the difference is statistically significant. Maybe we would use them if we were playing for inches.  
How about a random forest?

In [None]:
rf_0_flags_test, rf_0_flags_diff = ttest_model(n = 50, modelConstructor = RandomForestRegressor(), data0 = train_x0, data1 = train_x_noflag_0)
print(rf_0_flags_diff)
print(rf_0_flags_test)

Flags appear to do nothing for the random forest.

The random state for XGBRegressor has to be generated explicitly because it otherwise defaults to zero, in which case every repetition would produce an identical set of values.

In [None]:
from random import randint
def nextRand(): 
    return randint(0,10000)

In [None]:
# LONG RUN TIME
#xgb_noflag_test, xgb_noflag_diff = ttest_model(n = 50, modelConstructor = XGBRegressor(random_state=nextRand()), data0 = train_x_noflag_0, data1 = train_x_noflag_1, threads = nthreads)
#print(xgb_noflag_diff)
#print(xgb_noflag_test)

$\Delta$ = 0.0014741226742583935  
Ttest_indResult(statistic=185888033141677.97, pvalue=0.0)  

This result is likely an artifact of how the XGBoost random state is seeded; a workaround is still a work in progress. Python evaluates `nextRand()` once, when the constructor is called, so every repetition reuses the same seed. In languages with call-by-name evaluation, each use would draw a fresh value. It is also possible that the pickling used for multiprocessing reduces the model to a single state, which would be inconsequential in languages with native parallelization.

Note that exactly the same values are produced whether or not the flags are kept if the random state is not varied.
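
The seeding behaviour can be reproduced without XGBoost. In the sketch below, `FakeModel` is a hypothetical stand-in for `XGBRegressor` that only stores its `random_state`, and `make_model` is an illustrative factory, not something defined in this notebook. Passing a factory (a zero-argument callable) instead of a constructed model is one way to get a fresh seed per repetition.

```python
from random import randint


def next_rand():
    return randint(0, 10000)


class FakeModel:
    """Hypothetical stand-in for XGBRegressor; it only remembers its seed."""

    def __init__(self, random_state=0):
        self.random_state = random_state


# next_rand() runs once here, before the model object is ever reused.
shared_model = FakeModel(random_state=next_rand())
repeats = [shared_model for _ in range(5)]
shared_seeds = {m.random_state for m in repeats}


# A factory defers the call, so each repetition draws its own seed.
def make_model():
    return FakeModel(random_state=next_rand())


fresh = [make_model() for _ in range(5)]
fresh_seeds = [m.random_state for m in fresh]

print(shared_seeds)  # a single seed shared by all five repeats
print(fresh_seeds)   # one independently drawn seed per repeat
```

Applied to the helpers above, this would mean passing something like `make_model` to the repetition loop and calling it inside each worker, at the cost of changing what `modelConstructor` means.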

LightGBM serves the same purpose as XGBoost; it is faster but more sensitive to overfitting. Let's see how it performs on a data set of this size (~1400 items).

In [None]:
lgb_noflag_test, lgb_noflag_diff = ttest_model(n = 50, modelConstructor = lgb.LGBMRegressor(objective='regression'), data0 = train_x_noflag_0, data1 = train_x_noflag_1, threads = nthreads)
print(lgb_noflag_diff)
print(lgb_noflag_test)

Again, it seems we are stuck because of the random seeds. Note, however, that this runs much faster than XGBoost.

In [None]:
from sklearn.model_selection import GridSearchCV

The block below performs the parameter search; it was converted to markdown to avoid being run.

In [None]:
xgb_params = {
         "learning_rate": np.linspace(0.032,0.033,3),
         "max_depth": [3],
         "n_estimators": [2000],
         "colsample_bytree": np.linspace(0.065,0.075,3),
         "gamma": [0.01],
         "silent": [1],
         "min_child_weight": [1],
         "reg_alpha": np.linspace(0.2,0.4,3),
         "reg_lambda": np.linspace(0.55,0.65,3),
         "subsample": [0.5],
         "nthread": [1]
         }

xgb0 = XGBRegressor()

###   Potential LONG RUN TIME    ### 
xgb_grid = GridSearchCV(xgb0,
                        xgb_params,
                        cv = 3,
                        scoring = "neg_mean_squared_error",
                        n_jobs = -1,
                        verbose=50)

xgb_grid.fit(train_x_noflag_1,train_y)

print(np.mean(rmse(XGBRegressor(**xgb_grid.best_params_),train_x_noflag_1,train_y)))
print(xgb_grid.best_score_)
print(xgb_grid.best_params_)

lgbm_train = lgb.Dataset(train_x_noflag_1, label = train_y)

lgbm_params = {'objective' : ['regression'],
               'metric': ['mse'],
               'num_leaves' : [5,8,12],
               'learning_rate' : np.linspace(0.03,0.035,3),
               'lambda': np.linspace(0.01,0.03,3),
               'max_bin': [100],
               'bagging_fraction' : [0.01],
               'feature_fraction' : [0.173],
               'num_rounds' : [500],
               'n_estimators': [750],
               'sub_feature' : [0.5],
               'verbose': [0]
              }

lgbm_test = lgb.LGBMRegressor()

lgbm_grid =  GridSearchCV(lgbm_test,
                        lgbm_params,
                        scoring = "neg_mean_squared_error",
                        cv = 5,
                        n_jobs = -1,
                        verbose=50)

lgbm_grid.fit(train_x_noflag_1,train_y)

print(np.mean(rmse(lgb.LGBMRegressor(**lgbm_grid.best_params_),train_x_noflag_1,train_y)))
print(lgbm_grid.best_score_)
print(lgbm_grid.best_params_)
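
When reading the output of the searches above, keep in mind that `best_score_` is on the negated MSE scale, while `rmse` reports root mean squared error per fold. A minimal sketch of the conversion on synthetic data follows. `DecisionTreeRegressor` and the tiny grid are stand-ins chosen so the example runs quickly; they are not the tuned models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared training data (assumption).
X, y = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=0)

param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X, y)

# best_score_ is the mean negated MSE across folds for the best parameter set.
best_rmse = np.sqrt(-grid.best_score_)

print(grid.best_params_)
print(grid.best_score_, best_rmse)
```

The root of the mean MSE is not identical to the mean of the per-fold RMSE values, so small differences between this number and the output of `rmse` are expected.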

# Prediction and Submission

In [None]:
trainLen = trainData.shape[0]

train_x, train_y = split_xy(featureFix(scalePrice(imputeVals_noflag_1(trainData))))
test_x = featureFix(imputeVals_noflag_1(testData))

withDummy = makeStr(pd.concat([train_x,test_x]))

train_x = withDummy.iloc[0:trainLen]
test_x = withDummy.iloc[trainLen:]

print(train_x.shape)
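
The reason for concatenating train and test before encoding can be shown on a toy example. This assumes that `makeStr` ends in a one-hot encoding step, as the name `withDummy` suggests; that helper is defined elsewhere, so `pd.get_dummies` is used here as a stand-in. The column names and values are invented.

```python
import pandas as pd

# Toy frames: "Flat" occurs only in test, "Gable" only in train.
train = pd.DataFrame({"Roof": ["Gable", "Hip", "Gable"], "Area": [1000, 1500, 1200]})
test = pd.DataFrame({"Roof": ["Hip", "Flat"], "Area": [900, 1100]})

# Encoding each frame separately yields different column sets.
separate_train = pd.get_dummies(train)
separate_test = pd.get_dummies(test)

# Encoding the concatenation and splitting by position keeps the columns aligned.
combined = pd.get_dummies(pd.concat([train, test]))
n_train = train.shape[0]
train_enc = combined.iloc[:n_train]
test_enc = combined.iloc[n_train:]

print(list(separate_train.columns), list(separate_test.columns))
print(list(train_enc.columns))
```

A model fitted on `separate_train` could not score `separate_test` because the feature columns differ, whereas `train_enc` and `test_enc` share one layout.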

In [None]:
xgb_model = XGBRegressor(colsample_bytree = 0.075, gamma = 0.01, learning_rate = 0.0325, max_depth = 3,
                  min_child_weight = 1, n_estimators = 2000, nthread = nthreads, reg_alpha = 0.2, reg_lambda = 0.65,
                  silent = 1, subsample = 0.5)

# Named xgb_model so the estimator does not shadow the xgb name used in the parameter search above
xgb_model.fit(train_x,train_y)

xgb_preds = xgb_model.predict(test_x)

In [None]:
np.mean(rmse(xgb_model,train_x))

In [None]:
lgbm_params = {'bagging_fraction': 0.1, 'feature_fraction': 0.1, 'lambda': 0.1, 'learning_rate': 0.0325, 'max_bin': 60, 'metric': 'mse', 'n_estimators': 750, 'num_leaves': 8, 'objective': 'regression', 'sub_feature': 0.5, 'verbose': 0}
lgbm = lgb.LGBMRegressor(**lgbm_params)
lgbm.fit(train_x,train_y)
lgbm_preds = lgbm.predict(test_x)

In [None]:
lgbm_params = {'bagging_fraction': 0.1, 'feature_fraction': 0.1, 'lambda': 0.1, 'learning_rate': 0.0325, 'max_bin': 60, 'metric': 'mse', 'n_estimators': 750, 'num_leaves': 8, 'objective': 'regression', 'sub_feature': 0.5, 'verbose': 0}
lgbm2 = lgb.LGBMRegressor(**lgbm_params)
np.mean(rmse(lgbm2,train_x))


In [None]:
submit_frame = pd.DataFrame()
submit_frame['Id'] = testID
submit_frame['SalePrice'] = invPrice(xgb_preds)
submit_frame.to_csv('submission.csv',index=False)

# Towards the future!

This performance with a single model is pretty good, but we can do better. To climb the leaderboard, it will be necessary to implement an ensemble model.  
As a point of good practice, we should also have held out a test split to evaluate our models on, in addition to using cross-validation. To some extent this happens when you submit for ranking, but that can't be done rapidly or in a manner that is conducive to a careful study of the model.
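
As a rough sketch of both ideas, the example below holds out 20% of a synthetic data set, cross-validates two models on the remainder, and scores a simple average of their predictions on the held-out part. The models, split size, and equal-weight blend are illustrative assumptions, not choices made in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the prepared training data (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# The holdout set is put aside and not touched during model selection.
X_dev, X_hold, y_dev, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

models = [
    RandomForestRegressor(n_estimators=50, random_state=0),
    GradientBoostingRegressor(random_state=0),
]

for model in models:
    # Cross-validate on the development part only, then fit on all of it.
    neg_mse = cross_val_score(model, X_dev, y_dev, scoring="neg_mean_squared_error", cv=5)
    print(type(model).__name__, np.sqrt(-neg_mse).mean())
    model.fit(X_dev, y_dev)

# Equal-weight average of the two models' holdout predictions.
blend = np.mean([model.predict(X_hold) for model in models], axis=0)
hold_rmse = np.sqrt(mean_squared_error(y_hold, blend))

print("holdout RMSE of the blend:", hold_rmse)
```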

For now, this will suffice until I feel like looking at this data set again.