#### Objective: Predict home sale prices

#### Approach: Use LASSO to generate a model from a dense and minimally pruned feature space

# Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from IPython.display import display
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Loading Data and EDA

In [2]:
df = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')

In [None]:
# Checking relative shapes of datasets. Test set is missing target column as
# expected. Parity between the two DataFrames is imperative to making
# predictions. As such, until I find a better way, parity will be manually
# maintained
print('Training set shape:', df.shape)
print('Test set shape:', test.shape)

In [None]:
df.info() # Plenty of non-numerical data and nulls to clean up

In [None]:
df.describe()

In [None]:
display(df.head())

In [None]:
# Manually determine which categorical features have promise as dummies.
# Criteria are as follows:
# 1) Not too homogenous - if there is one dominant category, it won't be
# significant
# 2) Not too sparse. The data needs to speak to us. To do that, it must be there
# 3) Not too scattered. Too many values generates more columns than it may be
# worth working with
# 4) Must be non-ordinal
# ***************************************************************************** #
# Since we will be using LASSO to cull our feature set, I'm not too worried if
# some of these don't add much. They will be dealt with when the time comes.
# The commented lines were deemed unfit for inclusion.

#df['Street'].value_counts()
df['Year Built'].value_counts()
#df['Land Contour'].value_counts()
#df['Functional'].value_counts()
df['Neighborhood'].value_counts()
#df['Land Slope'].value_counts()
df['MS SubClass'].value_counts()
df['MS Zoning'].value_counts()
#df['Alley'].value_counts() # Mostly null
df['Lot Shape'].value_counts()
#df['Utilities'].value_counts()
df['Lot Config'].value_counts()
df['Condition 1'].value_counts()
#df['Condition 2'].value_counts()
df['Bldg Type'].value_counts()
df['House Style'].value_counts()
df['Year Built'].value_counts() # Non-ordinal
df['Year Remod/Add'].value_counts() # Non-ordinal
#df['Roof Style'].value_counts()
#df['Roof Matl'].value_counts()
df['Exterior 1st'].value_counts()
#df['Exterior 2nd'].value_counts()
df['Mas Vnr Type'].value_counts()
df['Exter Qual'].value_counts()
df['Exter Cond'].value_counts()
df['Foundation'].value_counts()
df['Bsmt Qual'].value_counts()
#df['Bsmt Cond'].value_counts()
df['Bsmt Exposure'].value_counts()
df['BsmtFin Type 1'].value_counts()
#df['BsmtFin Type 2'].value_counts() # Not many with 2 basements
#df['Heating'].value_counts() # Too homogenous
df['Heating QC'].value_counts()
df['Central Air'].value_counts() # Maybe... Only one column...
#df['Electrical'].value_counts() # Too homogenous
df['Kitchen Qual'].value_counts()
#df['Functional'].value_counts() # Too homogenous
#df['Fireplace Qu'].value_counts() # Mostly null
df['Garage Type'].value_counts()
#df['Garage Yr Blt'].value_counts() # Non ordinal, though newer is better I guess
df['Garage Finish'].value_counts()
#df['Garage Qual'].value_counts() # Too homogenous
#df['Garage Cond'].value_counts() # Too homogenous
df['Paved Drive'].value_counts() # Pretty homogenous but also only adds two columns...
#df['Pool QC'].value_counts() # Mostly null values
#df['Fence'].value_counts() # Mostly null values
##df['Misc Feature'].value_counts() # Mostly null but always adds value to property
##df['Misc Val'].value_counts()
df['Mo Sold'].value_counts() # Non ordinal. Is it even relevant?
df['Yr Sold'].value_counts() # Non ordinal; prices do change through the years
df['Sale Type'].value_counts() # No idea what it means but it's worth checking out

In [3]:
# Drop columns composed largely of null values. Still on the fence about
# the Misc Features category
null_heavy_cols = ['Alley', 'Pool QC', 'Fence', 'Misc Feature',
                   'Lot Frontage', 'Fireplace Qu']
df = df.drop(columns=null_heavy_cols)
test = test.drop(columns=null_heavy_cols)
# Drop columns that are not good dummy candidates. Defining the list once
# keeps the training and test sets in step.
weak_dummy_cols = ['Street', 'Land Contour', 'Functional', 'Land Slope',
                   'Utilities', 'Condition 2', 'Roof Style', 'Roof Matl',
                   'Exterior 2nd', 'Bsmt Cond', 'BsmtFin Type 2', 'Heating',
                   'Electrical', 'Garage Yr Blt', 'Garage Qual',
                   'Garage Cond']
df.drop(columns=weak_dummy_cols, inplace=True)
test.drop(columns=weak_dummy_cols, inplace=True)

In [None]:
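# Check correlations between features and target using a heatmap. Only the
# numeric columns are passed in, since string columns cannot be correlated.
fig, ax = plt.subplots(figsize=(30,30))
sns.heatmap(df.select_dtypes('number').corr(), annot=True);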

In [4]:
# Drop remaining columns with low correlation to target before scaling, since
# correlation is independent of scale and origin
df.drop(columns=['Pool Area', 'Screen Porch', '3Ssn Porch', 'Enclosed Porch',
                'Kitchen AbvGr', 'Bedroom AbvGr', 'Bsmt Half Bath',
                'Low Qual Fin SF', 'BsmtFin SF 2', 'PID'], inplace=True)
test.drop(columns=['Pool Area', 'Screen Porch', '3Ssn Porch', 'Enclosed Porch',
                'Kitchen AbvGr', 'Bedroom AbvGr', 'Bsmt Half Bath',
                'Low Qual Fin SF', 'BsmtFin SF 2', 'PID'], inplace=True)

In [5]:
# Generate dummies with potential value
df = pd.get_dummies(df, columns=['Year Built', 'Neighborhood', 'MS SubClass',
               'MS Zoning', 'Lot Shape', 'Lot Config', 'Condition 1',
               'Bldg Type', 'House Style', 'Year Remod/Add', 'Exterior 1st',
               'Mas Vnr Type', 'Foundation', 'Garage Type',
               'Paved Drive', 'Mo Sold', 'Yr Sold', 'Sale Type'],
                    drop_first=True)
test = pd.get_dummies(test, columns=['Year Built', 'Neighborhood', 'MS SubClass',
               'MS Zoning', 'Lot Shape', 'Lot Config', 'Condition 1',
               'Bldg Type', 'House Style', 'Year Remod/Add', 'Exterior 1st',
               'Mas Vnr Type', 'Foundation', 'Garage Type',
               'Paved Drive', 'Mo Sold', 'Yr Sold', 'Sale Type'],
                      drop_first=True)

In [None]:
print(df.info(), '\n')
print(test.info())

In [6]:
# Determine what dummy columns the training set has in excess of our test set
missing_cols = set(df.columns) - set(test.columns)
print(missing_cols)

{'Exterior 1st_ImStucc', 'MS SubClass_150', 'Year Built_1879', 'Year Built_1895', 'MS Zoning_C (all)', 'Neighborhood_GrnHill', 'Year Built_1929', 'Year Built_1898', 'Exterior 1st_CBlock', 'Neighborhood_Landmrk', 'Year Built_1942', 'Year Built_1901', 'Exterior 1st_Stone', 'Year Built_1893', 'Year Built_1896', 'Year Built_1880', 'Year Built_1875', 'Year Built_1911', 'Year Built_1913', 'SalePrice'}


In [7]:
# Determine what additional dummy columns the test set has in relation to the training set
add_cols = set(test.columns) - set(df.columns)
print(add_cols)

{'Year Built_1882', 'Exterior 1st_PreCast', 'Year Built_1904', 'Year Built_1907', 'Year Built_1906', 'Sale Type_VWD', 'Year Built_1902', 'Mas Vnr Type_CBlock'}


In [8]:
# Separate target column before it is dropped
y = df['SalePrice']

In [9]:
# Restore parity between datasets
df.drop(columns=missing_cols, inplace=True)
test.drop(columns=add_cols, inplace=True)
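
The manual parity step above could also be handled with `DataFrame.reindex`. The sketch below is only an illustration on two made-up frames (`train_toy`, `test_toy`, and their values are hypothetical, not taken from the data). It also differs from the cells above in one respect: dummies missing from the test set are added as all-zero columns rather than dropped from the training set.

```python
import pandas as pd

# Toy frames standing in for the real train/test sets (hypothetical values)
train_toy = pd.DataFrame({'Gr Liv Area': [1500, 2000, 1200],
                          'Neighborhood': ['NAmes', 'GrnHill', 'NAmes'],
                          'SalePrice': [200000, 310000, 150000]})
test_toy = pd.DataFrame({'Gr Liv Area': [1700, 900],
                         'Neighborhood': ['NAmes', 'Veenker']})

# Separate the target first so it never has to be matched in the test set
y_toy = train_toy.pop('SalePrice')

train_dum = pd.get_dummies(train_toy, columns=['Neighborhood'])
test_dum = pd.get_dummies(test_toy, columns=['Neighborhood'])

# Reindex the test set to the training columns: dummies the test set lacks
# are created and filled with 0, and dummies unseen in training are dropped
test_aligned = test_dum.reindex(columns=train_dum.columns, fill_value=0)

print(list(train_dum.columns))
print(list(test_aligned.columns))
```

Whether keeping the training-only dummies helps or hurts the model is a separate question; the point here is only that the column bookkeeping does not have to be done by hand.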

In [10]:
# Find location of null vals
df.isnull().sum().sort_values(ascending=False).head()

Garage Finish     114
Bsmt Exposure      58
Bsmt Qual          55
BsmtFin Type 1     55
Mas Vnr Area       22
dtype: int64

In [11]:
# Fill all remaining null cells with 0's in order to model
df.fillna(0, inplace=True)
test.fillna(0, inplace=True)

In [12]:
# Define dictionaries to change ordinal values to be numerically represented
bsqual_dict = {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5} # also use
# for ex_qual, ex_cond, ht_qc, kit_qual
bsexp_dict = {'NA': 0, 'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4}
bsfin_dict = {'NA': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6}
ac_dict = {'N': 0, 'Y': 1}
grfin_dict = {'NA': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}

In [13]:
# Change ordinal columns with strings to numeric ordinal values
df['Bsmt Qual'].replace(bsqual_dict, inplace=True)
df['Bsmt Exposure'].replace(bsexp_dict, inplace=True)
df['BsmtFin Type 1'].replace(bsfin_dict, inplace=True)
df['Exter Qual'].replace(bsqual_dict, inplace=True)
df['Exter Cond'].replace(bsqual_dict, inplace=True)
df['Heating QC'].replace(bsqual_dict, inplace=True)
df['Kitchen Qual'].replace(bsqual_dict, inplace=True)
df['Central Air'].replace(ac_dict, inplace=True)
df['Garage Finish'].replace(grfin_dict, inplace=True)

test['Bsmt Qual'].replace(bsqual_dict, inplace=True)
test['Bsmt Exposure'].replace(bsexp_dict, inplace=True)
test['BsmtFin Type 1'].replace(bsfin_dict, inplace=True)
test['Exter Qual'].replace(bsqual_dict, inplace=True)
test['Exter Cond'].replace(bsqual_dict, inplace=True)
test['Heating QC'].replace(bsqual_dict, inplace=True)
test['Kitchen Qual'].replace(bsqual_dict, inplace=True)
test['Central Air'].replace(ac_dict, inplace=True)
test['Garage Finish'].replace(grfin_dict, inplace=True)
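
A possible alternative to `replace` for the ordinal columns is `Series.map`, which turns any label missing from the dictionary into NaN so that unexpected values are easy to spot. This is a minimal sketch on a made-up column (the `kitchen` values, including the `'Typo'` label, are hypothetical); treating unmapped entries as 0 is an assumption that mirrors the `fillna(0)` step above.

```python
import pandas as pd

# Same ordering as bsqual_dict in the notebook
qual_order = {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# Hypothetical column with a missing value and an unexpected label
kitchen = pd.Series(['Gd', 'TA', None, 'Ex', 'Typo'], name='Kitchen Qual')

# map() turns anything not in the dictionary into NaN, so unmapped labels
# surface explicitly instead of passing through as strings
encoded = kitchen.map(qual_order)
unmapped = kitchen[encoded.isna() & kitchen.notna()]

# Treat missing and unmapped entries as 0 (assumption, see lead-in)
encoded = encoded.fillna(0).astype(int)

print(encoded.tolist())
print(unmapped.tolist())
```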

In [14]:
# Generate promising interaction terms
bsmt_features = df[['BsmtFin Type 1', 'BsmtFin SF 1', 'Bsmt Unf SF', 'Total Bsmt SF',
                'Bsmt Full Bath']]
gr_features = df[['Garage Cars', 'Garage Area', 'Garage Finish']]
int_features = df[['1st Flr SF', '2nd Flr SF', 'Gr Liv Area', 'Full Bath', 'Half Bath',
               'Kitchen Qual', 'TotRms AbvGrd', 'Fireplaces']]

poly = PolynomialFeatures(interaction_only=True)
bsmt_poly = poly.fit_transform(bsmt_features)

poly = PolynomialFeatures(interaction_only=True)
gr_poly = poly.fit_transform(gr_features)

poly = PolynomialFeatures(interaction_only=True)
int_poly = poly.fit_transform(int_features)

# *************************** Test set ************************************** #
bsmt_features = test[['BsmtFin Type 1', 'BsmtFin SF 1', 'Bsmt Unf SF', 'Total Bsmt SF',
                'Bsmt Full Bath']]
gr_features = test[['Garage Cars', 'Garage Area', 'Garage Finish']]
int_features = test[['1st Flr SF', '2nd Flr SF', 'Gr Liv Area', 'Full Bath', 'Half Bath',
               'Kitchen Qual', 'TotRms AbvGrd', 'Fireplaces']]

poly = PolynomialFeatures(interaction_only=True)
bsmt_poly = poly.fit_transform(bsmt_features)

poly = PolynomialFeatures(interaction_only=True)
gr_poly = poly.fit_transform(gr_features)

poly = PolynomialFeatures(interaction_only=True)
int_poly = poly.fit_transform(int_features)
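
Note that in the cell above the test-set arrays overwrite the training-set arrays of the same name, and neither is joined back onto `df` or `test`. One way this might be organized is to fit the transformer once on the training data and reuse it on the test data. The sketch below uses small made-up garage columns (`train_gr` and `test_gr` are hypothetical stand-ins):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical stand-ins for the garage columns of the train and test sets
train_gr = pd.DataFrame({'Garage Cars': [1, 2, 2, 3],
                         'Garage Area': [250, 480, 520, 800],
                         'Garage Finish': [1, 2, 3, 3]})
test_gr = pd.DataFrame({'Garage Cars': [2, 1],
                        'Garage Area': [500, 300],
                        'Garage Finish': [2, 1]})

# include_bias=False drops the constant column; fit once on the training set
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
train_int = poly.fit_transform(train_gr)
# Reuse the fitted transformer so the test columns line up with training
test_int = poly.transform(test_gr)

# 3 original columns + 3 pairwise products = 6 columns
print(train_int.shape, test_int.shape)
```

The resulting arrays would still need to be concatenated with the remaining features before modeling.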

# Modeling Time

In [32]:
# Define X
exclusion = ['SalePrice']
features = [col for col in df.columns if col not in exclusion]
X = df[features]

In [16]:
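# Calculate baseline: relative frequency of the most common sale price.
# Note that this is a classification-style baseline; for a continuous target,
# the RMSE of always predicting the mean price is the more informative reference.
max(y.value_counts(normalize=True))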

0.01218917601170161
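
For a regression target, a common baseline is the error obtained by predicting the mean price for every home. A minimal sketch, using a handful of made-up prices in place of `y` (the `prices` array is hypothetical):

```python
import numpy as np

# Hypothetical sale prices standing in for y
prices = np.array([120000.0, 150000.0, 180000.0, 210000.0, 340000.0])

# Baseline prediction: the mean price for every home
baseline_pred = np.full_like(prices, prices.mean())

# RMSE of the baseline; equals the population standard deviation of prices
baseline_rmse = np.sqrt(np.mean((prices - baseline_pred) ** 2))
print(baseline_rmse)
```

Any model worth keeping should achieve an RMSE well below this figure on held-out data.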

In [33]:
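# Train/test split. No random_state is set, so the split (and every score
# that depends on it) changes each time this cell is run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)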

### Linear Regression Pipe and Grid Search

In [None]:
# Just a basic linear regression to see how good of a fit we can achieve
# without regularization
ss = StandardScaler()
lr = LinearRegression()

In [None]:
lr_pipe = Pipeline([
    ('ss', ss),
    ('lr', lr),
])

In [None]:
# Despite lacking parameters, GridSearch offers integrated CV which makes it
# preferable to pipe.fit() and so on.
lr_params = {}
gs = GridSearchCV(lr_pipe, lr_params, cv=5)
gs.fit(X_train, y_train)

In [None]:
# Linear Regression Scores. It doesn't make sense for the gs.score to be as
# low as it is. That being said, there is no harm in leaving that to Kaggle to
# decide.
### Upon rerunning the code, the RMSE has gone to a more reasonable value.
print(gs.best_score_)
print(gs.score(X_test, y_test))

In [None]:
pred = gs.predict(X)
print('RMSE for Linear Model w/o Regularization:',
      mean_squared_error(y, pred) ** .5)

### kNN Pipe and Grid Search

In [None]:
# Though it's unlikely to perform well at all given the current form of the
# data, it doesn't hurt to try and see if kNN offers value. SalePrice is
# continuous, so the regressor is used rather than the classifier.
from sklearn.neighbors import KNeighborsRegressor
ss = StandardScaler()
knn = KNeighborsRegressor()

In [None]:
knn_pipe = Pipeline([
    ('ss', ss),
    ('knn', knn)
])

In [None]:
knn_params = {
     'knn__n_neighbors': range(7, 25, 2)   
}
gs = GridSearchCV(knn_pipe, param_grid=knn_params, cv=5)
gs.fit(X_train, y_train)

In [None]:
# kNN Scores - not too promising as predicted. Not worth checking RMSE.
print(gs.best_score_)
print(gs.best_params_)
print(gs.score(X_test, y_test))

### LASSO Pipe and Grid Search

In [18]:
# Given the massive number of columns and lack of aggressive feature
# engineering, my hopes are highest for the LASSO model's performance
ss = StandardScaler()
lasso = Lasso()

In [34]:
lasso_pipe = Pipeline([
    ('ss', ss),
    ('lasso', lasso)
])

In [35]:
# Search alpha values from 3500 to 4490 in steps of 10.
lasso_params = {
    'lasso__max_iter': [10000],
    'lasso__alpha': [(x * 10) + 3500 for x in range(100)]
}
gs = GridSearchCV(lasso_pipe, param_grid=lasso_params, cv=3) # CV = 3 for time
gs.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('ss', StandardScaler(copy=True, with_mean=True, with_std=True)), ('lasso', Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'lasso__max_iter': [10000], 'lasso__alpha': [3500, 3510, 3520, 3530, 3540, 3550, 3560, 3570, 3580, 3590, 3600, 3610, 3620, 3630, 3640, 3650, 3660, 3670, 3680, 3690, 3700, 3710, 3720, 3730, 3740, 3750, 3760, 3770, 3780, 3790, 3800, 3810, 3820, 3830, 3840, 3850, 3860, 3870, 3880, 3890, 390...30, 4340, 4350, 4360, 4370, 4380, 4390, 4400, 4410, 4420, 4430, 4440, 4450, 4460, 4470, 4480, 4490]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [36]:
print(gs.best_score_)
print(gs.best_params_)
print(gs.score(X_test, y_test))

0.8113047628600992
{'lasso__alpha': 3500, 'lasso__max_iter': 10000}
0.8813379590898037


In [37]:
pred = gs.predict(X)
print('RMSE for LASSO Model (alpha = {}):'.format(gs.best_params_['lasso__alpha']),
      mean_squared_error(y, pred) ** .5)

RMSE for LASSO Model (alpha = 3500): 31092.381934088353


In [None]:
lasso_pipe = Pipeline([
    ('ss', ss),
    ('lasso', lasso)
])

In [26]:
lasso_params = {
    'lasso__max_iter': [10000],
    'lasso__alpha': [x + 501 for x in range(50)]
}
gs = GridSearchCV(lasso_pipe, param_grid=lasso_params, cv=3) # CV = 3 for time
gs.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('ss', StandardScaler(copy=True, with_mean=True, with_std=True)), ('lasso', Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'lasso__max_iter': [10000], 'lasso__alpha': [501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [27]:
print(gs.best_score_)
print(gs.best_params_)
print(gs.score(X_test, y_test))

0.8395976166452542
{'lasso__alpha': 521, 'lasso__max_iter': 10000}
0.8763544880844494


In [28]:
pred = gs.predict(X)
print('RMSE for LASSO Model (alpha = {}):'.format(gs.best_params_['lasso__alpha']),
      mean_squared_error(y, pred) ** .5)

RMSE for LASSO Model (alpha = 521): 26139.705802079166


In [None]:
# I don't know what exactly to expect, nor what range my alpha values can fall
# in, so I'll cast a wide net and see what I catch.
lasso_params = {
    'lasso__max_iter': [10000],
    'lasso__alpha': [0.1, 0.3, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0,
                    6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0,
                    16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0,
                    25.0, 26.0, 27.0, 29.0, 30.0]
}
gs = GridSearchCV(lasso_pipe, param_grid=lasso_params, cv=3) # CV = 3 for time
gs.fit(X_train, y_train)

In [None]:
# LASSO Scores - The best alpha sits at the upper limit of the range we
# tested, so it makes sense to keep going higher until we hit a ceiling.
print(gs.best_score_)
print(gs.best_params_)
print(gs.score(X_test, y_test))
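
Rather than extending the range by hand one run at a time, a log-spaced grid can locate the right order of magnitude in a single search. This is a minimal sketch on synthetic data, not a rerun of the notebook (`X_demo`, `y_demo`, `pipe`, and `search` are illustrative names, and the grid bounds are an assumption):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X_train / y_train (an assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=0)

pipe = Pipeline([('ss', StandardScaler()),
                 ('lasso', Lasso(max_iter=10000))])

# 13 alphas spaced evenly on a log scale from 0.01 to 10,000
alphas = np.logspace(-2, 4, 13)
search = GridSearchCV(pipe, param_grid={'lasso__alpha': alphas}, cv=3)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_['lasso__alpha']
print(best_alpha)
# If best_alpha lands on either end of the grid, widen the grid; otherwise
# refine with a narrower grid around it
at_edge = best_alpha in (alphas[0], alphas[-1])
print(at_edge)
```

Once the coarse search settles on an interior value, a second, finer grid around it plays the role of the later refinement runs.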

##### Lasso Run 2

In [None]:
lasso_pipe = Pipeline([
    ('ss', ss),
    ('lasso', lasso)
])

In [None]:
lasso_params = {
    'lasso__max_iter': [10000],
    'lasso__alpha': range(30, 50, 1)
}
gs = GridSearchCV(lasso_pipe, param_grid=lasso_params, cv=3)
gs.fit(X_train, y_train)

In [None]:
# Model score is improving slightly, however we are still at our upper limit
print(gs.best_score_)
print(gs.best_params_)
print(gs.score(X_test, y_test))

##### Lasso Run 3

In [None]:
lasso_pipe = Pipeline([
    ('ss', ss),
    ('lasso', lasso)
])

In [None]:
lasso_params = {
    'lasso__max_iter': [10000],
    'lasso__alpha': range(50, 100, 1)
}
gs = GridSearchCV(lasso_pipe, param_grid=lasso_params, cv=3)
gs.fit(X_train, y_train)

In [None]:
# Model score has only improved marginally. Perhaps the scale is too small
print(gs.best_score_)
print(gs.best_params_)
print(gs.score(X_test, y_test))

##### Lasso Run 4

In [None]:
lasso_pipe = Pipeline([
    ('ss', ss),
    ('lasso', lasso)
])

In [None]:
lasso_params = {
    'lasso__max_iter': [10000],
    'lasso__alpha': range(100, 300, 3)
}
gs = GridSearchCV(lasso_pipe, param_grid=lasso_params, cv=3)
gs.fit(X_train, y_train)

In [None]:
# How long can this go on?
print(gs.best_score_)
print(gs.best_params_)
print(gs.score(X_test, y_test))

##### Lasso Run 5

In [None]:
lasso_pipe = Pipeline([
    ('ss', ss),
    ('lasso', lasso)
])

In [None]:
lasso_params = {
    'lasso__max_iter': [10000],
    'lasso__alpha': range(300, 1000, 3)
}
gs = GridSearchCV(lasso_pipe, param_grid=lasso_params, cv=3)
gs.fit(X_train, y_train)

In [None]:
# At last we seem to have found the ballpark for an optimal alpha value!
### On the first run through the code, the ideal alpha given was 413.9. It now
### sits at 561. While it is expected for this value to change, the difference
### is quite dramatic. I will rerun the code from the necessary spot to observe
### to what extent it continues to vary.
##### The third iteration has yielded an ideal alpha of 744! This warrants
##### further examination
print(gs.best_score_)
print(gs.best_params_)
print(gs.score(X_test, y_test))
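
One likely contributor to the shifting alpha is that `train_test_split` is called without a `random_state`, so each rerun fits on a different subset of homes. The sketch below, on made-up arrays (`X_demo` and `y_demo` are hypothetical), shows that fixing the seed makes the split repeatable; the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and target standing in for X and y
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# With random_state fixed, repeated calls return the same partition
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

A fixed seed makes runs comparable, but it does not make the chosen alpha any less sensitive to the particular split; repeating the search over several seeds would show how much it really varies.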

##### Lasso alpha refinement

In [None]:
lasso_pipe = Pipeline([
    ('ss', ss),
    ('lasso', lasso)
])

In [None]:
### Clearly this was intended for the first run when the ideal alpha was in
### the middle of the range given.
lasso_params = {
    'lasso__max_iter': [10000],
    'lasso__alpha': [(x * 0.1) + 411 for x in range(50)] # To produce floats in range
}
gs = GridSearchCV(lasso_pipe, param_grid=lasso_params, cv=3)
gs.fit(X_train, y_train)

In [None]:
# A satisfactory alpha value (alpha = 413.9)
### At least it was on the original run
print(gs.best_score_)
print(gs.best_params_)
print(gs.score(X_test, y_test))

In [22]:
# Oddly enough this model produces a higher RMSE than the plain linear model
### This has since rectified itself.
pred = gs.predict(X)
print('RMSE for LASSO Model (alpha = 413.9):',
      mean_squared_error(y, pred) ** .5)

RMSE for LASSO Model (alpha = 413.9): 26147.046766428633


# Preparing First Submission

In [38]:
# Set X equal to testing dataframe
X = test

# Predict target values
pred = gs.predict(X)

In [30]:
# Generate first submission df
sub_one = pd.DataFrame(data=pred, index=test['Id'])
sub_one.columns = ['SalePrice']

In [39]:
# Generate first submission df
sub_two = pd.DataFrame(data=pred, index=test['Id'])
sub_two.columns = ['SalePrice']

In [40]:
# Export to CSV
sub_one.to_csv('./sub_one.csv') # RMSE = 37882.77012 (24th place)

# Preparing Second Submission

In [None]:
# Out of curiosity, I wanted to see how the multiple linear regression (MLR)
# performed. Below we will rerun the necessary code to regenerate the model.
X = df[features]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

In [None]:
ss = StandardScaler()
lr = LinearRegression()
lr_pipe = Pipeline([
    ('ss', ss),
    ('lr', lr),
])

In [None]:
lr_params = {}
gs = GridSearchCV(lr_pipe, lr_params, cv=10)
gs.fit(X_train, y_train)

In [None]:
# Set X equal to testing dataframe
X = test

# Predict target values
pred = gs.predict(X)

In [None]:
# Generate second submission df
sub_two = pd.DataFrame(data=pred, index=test['Id'])
sub_two.columns = ['SalePrice']

In [None]:
# Export to CSV
sub_two.to_csv('./sub_two.csv') # RMSE = 95203414630844.20000 # As it should be

# Conclusions

While using LASSO for feature selection hurt the interpretability of the model, it succeeded in producing a viable model from an overly complex feature set. To improve the model further, I would feed the interaction terms generated with the polynomial features module into the model and probably try elastic net to take advantage of the ridge penalty as well. Pipelines shorten the time needed to implement such features, but I only learned about them the day before the submission deadline, and other unforeseen complications cut further into the time available for this project. That being said, this project was a fantastic learning experience. Beyond figuring out how to make a workflow like this work for Kaggle competitions, it pushed me to better understand how the LASSO model achieves what it does and why the useful range of alpha values can vary so drastically across applications.

###### Questions and tasks that remain:

1) What is the effect of standardizing the dataset before generating dummy columns vs. after?  
2) What criteria make for good interaction terms?  
3a) When should one generate polynomial terms in addition to the interaction terms?  
3b) What criteria determine what degree of the poly_terms we should generate?  
4) How would elastic net perform against just LASSO in this case?  
5) How can one determine what features were eliminated by LASSO rather than just having the function spit out a model?
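
On question 5, one possible approach is to read the fitted coefficients off the LASSO step of the pipeline: features whose coefficient is exactly zero have been eliminated. This is a minimal sketch on synthetic data (the column names, `X_demo`, `y_demo`, and the alpha of 0.5 are all made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative columns and three pure-noise columns
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 5)),
                      columns=['signal_a', 'signal_b',
                               'noise_a', 'noise_b', 'noise_c'])
y_demo = (3 * X_demo['signal_a'] - 2 * X_demo['signal_b']
          + rng.normal(scale=0.1, size=500))

pipe = Pipeline([('ss', StandardScaler()),
                 ('lasso', Lasso(alpha=0.5))])
pipe.fit(X_demo, y_demo)

# Pair each coefficient with its column name
coefs = pd.Series(pipe.named_steps['lasso'].coef_, index=X_demo.columns)
eliminated = coefs[coefs == 0].index.tolist()
kept = coefs[coefs != 0].index.tolist()
print('Eliminated:', eliminated)
print('Kept:', kept)
```

With a fitted `GridSearchCV`, the same coefficients should be reachable through `gs.best_estimator_.named_steps['lasso'].coef_`, provided `refit=True` (the default).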