Simple, mostly linear regression models with sklearn for the Housing Data Set

In [6]:
import numpy as np
import pandas as pd
#import autosklearn.regression
import sklearn.preprocessing as skp
from subprocess import check_output
import matplotlib.pyplot as plt
%matplotlib inline

Read the training and test data into pandas DataFrames, using the first column (the Id) as the index

In [10]:
train = pd.read_csv('train.csv',index_col=0)
test = pd.read_csv('test.csv',index_col=0)

Preprocess the data so sklearn can use it: one-hot encode the categorical attributes with dummy variables, then fill the remaining NaNs with the column means

In [11]:
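y_train = np.log(train.pop('SalePrice'))  # log-transform the target and remove it from the features
all_df = pd.concat((train, test), axis=0)  # concatenate training and test data so both get the same dummy columns

all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)  # datatype fix: stored as a number, treated as categorical here

all_dummy = pd.get_dummies(all_df)  # one-hot encode all categorical (string) columns

mean_cols = all_dummy.mean()  # column means, computed over training and test rows together

all_dummy = all_dummy.fillna(mean_cols)  # fill the NaNs left in the numeric columns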
#Split the datasets back into training and test set
dummy_train = all_dummy.loc[train.index]
dummy_test = all_dummy.loc[test.index]
#prepare data for sklearn
X_train = dummy_train.values
X_test = dummy_test.values
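
The same steps on a tiny made-up table make it easier to see what each call does. This is only a sketch: the `toy_*` names and values are hypothetical and not taken from the Housing Data Set. Note that, as in the cell above, the fill value is the mean over training and test rows combined.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy_train = pd.DataFrame(
    {"LotArea": [8000.0, np.nan, 10000.0], "Street": ["Pave", "Grvl", "Pave"]},
    index=[1, 2, 3],
)
toy_test = pd.DataFrame(
    {"LotArea": [9000.0], "Street": ["Grvl"]},
    index=[4],
)

toy_all = pd.concat((toy_train, toy_test), axis=0)

# Street becomes two indicator columns: Street_Grvl and Street_Pave
toy_dummy = pd.get_dummies(toy_all)

# The missing LotArea is replaced by the mean of the three known values
toy_dummy = toy_dummy.fillna(toy_dummy.mean())

# Split back by index, exactly as done for the real data
toy_X_train = toy_dummy.loc[toy_train.index]
toy_X_test = toy_dummy.loc[toy_test.index]
print(toy_dummy)
```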

The preprocessed data can now be fed into sklearn estimators. Any further preprocessing or feature engineering should be done before this point.

Import the models and utilities used from sklearn. This notebook mainly uses simple linear models (ridge, lasso, elastic net); a random forest, an AdaBoost and a gradient boosting regressor are fitted as well. The regularization strength alpha of the linear models is chosen with the cross-validated estimators (ElasticNetCV, RidgeCV, LassoCV) provided by sklearn.

In [4]:
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LassoCV,Lasso,ElasticNet, ElasticNetCV,RidgeCV,Ridge
from sklearn.ensemble import RandomForestRegressor , AdaBoostRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge

In [5]:
elnet_cv = ElasticNetCV()
ridge_cv = RidgeCV(alphas=np.arange(1,100,10))
rf = RandomForestRegressor(n_estimators = 500)
ada = AdaBoostRegressor(n_estimators = 500)
gbr = GradientBoostingRegressor(n_estimators = 500)
lasso_cv = LassoCV()
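
To illustrate what the CV estimators do, here is a minimal sketch of RidgeCV choosing alpha from the same grid on synthetic data. The data and the `*_demo` names are made up for the example; the alpha it picks says nothing about the housing data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression problem (hypothetical, not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
true_coef = np.array([1.0, 0.5, 0.0, 0.0, -2.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=100)

# Same grid as in the notebook: 1, 11, 21, ..., 91 (as floats)
alphas = np.arange(1, 100, 10, dtype=float)

# RidgeCV evaluates every alpha by cross-validation and keeps the best one
ridge_demo = RidgeCV(alphas=alphas).fit(X_demo, y_demo)
print(ridge_demo.alpha_)
```

The selected value is always one of the grid points, so a coarse grid like this one only gives a rough estimate of the best alpha.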

Fit the models to the training data and print the alpha selected for each linear model

In [6]:
elnet_cv.fit(X_train,y_train)
print(elnet_cv.alpha_)
ridge_cv.fit(X_train,y_train)
print(ridge_cv.alpha_)
lasso_cv.fit(X_train,y_train)
print(lasso_cv.alpha_)
rf.fit(X_train,y_train)
ada.fit(X_train,y_train)
gbr.fit(X_train,y_train);

2.05048139342
11
1.02524069671


GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=500,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)

Refit the linear models with the selected alphas, then make the predictions

In [7]:
elnet = ElasticNet(alpha=elnet_cv.alpha_)
ridge = Ridge(alpha=ridge_cv.alpha_)
lasso = Lasso(alpha=lasso_cv.alpha_)
elnet.fit(X_train,y_train)
ridge.fit(X_train,y_train)
lasso.fit(X_train,y_train);

Lasso(alpha=1.0252406967113168, copy_X=True, fit_intercept=True,
   max_iter=1000, normalize=False, positive=False, precompute=False,
   random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

In [8]:
# the models predict log(SalePrice); np.exp converts the predictions back to prices
y_1 = np.exp(elnet.predict(X_test))
y_2 = np.exp(ridge.predict(X_test))
y_3 = np.exp(rf.predict(X_test))
y_4 = np.exp(ada.predict(X_test))
y_5 = np.exp(gbr.predict(X_test))
y_6 = np.exp(lasso.predict(X_test))
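
A short sketch of why the target is modelled in log space and transformed back with np.exp. The prices are hypothetical.

```python
import numpy as np

# Hypothetical sale prices
prices = np.array([100000.0, 200000.0, 400000.0])

log_prices = np.log(prices)     # what the models are trained on
recovered = np.exp(log_prices)  # what is applied to the predictions

# An error of 0.1 in log space is the same relative error (about 10.5%)
# for a cheap house and for an expensive one
ratio = np.exp(log_prices + 0.1) / prices
print(recovered)
print(ratio)
```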

Estimate the expected test score (RMSE on the log of the sale price) with 5-fold cross-validation on the training data

In [12]:
score1 = np.mean(np.sqrt(-cross_val_score(elnet,X_train,y_train,cv=5,scoring='neg_mean_squared_error')))
score2 = np.mean(np.sqrt(-cross_val_score(ridge,X_train,y_train,cv=5,scoring='neg_mean_squared_error')))
score3 = np.mean(np.sqrt(-cross_val_score(rf,X_train,y_train,cv=5,scoring='neg_mean_squared_error')))
score4 = np.mean(np.sqrt(-cross_val_score(ada,X_train,y_train,cv=5,scoring='neg_mean_squared_error')))
score5 = np.mean(np.sqrt(-cross_val_score(gbr,X_train,y_train,cv=5,scoring='neg_mean_squared_error')))
score6 = np.mean(np.sqrt(-cross_val_score(lasso,X_train,y_train,cv=5,scoring='neg_mean_squared_error')))
scores = [score1,score2,score3,score4,score5,score6]
print(scores)
print(np.mean(scores))

[0.19814470818553748, 0.13898984113918872, 0.14205099881735989, 0.17157781509512335, 0.12340387488925017, 0.19812150361970698]
0.162048123624
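
The score used here is the RMSE of each fold, averaged over the folds. A small helper makes that explicit; `rmse_cv` is a hypothetical name, and the data below is synthetic, so the printed value is unrelated to the scores above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def rmse_cv(model, X, y, cv=5):
    """Return the per-fold RMSE averaged over the folds (hypothetical helper)."""
    # sklearn scorers follow "greater is better", so the MSE comes back negated
    neg_mse = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    return np.mean(np.sqrt(-neg_mse))


# Synthetic data with noise of standard deviation 0.3
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=200)

score_demo = rmse_cv(Ridge(alpha=1.0), X_demo, y_demo)
print(score_demo)
```

Because y_train is log(SalePrice), an RMSE of 0.12 to 0.14 on the real data corresponds roughly to a typical relative error of 12 to 15 percent in price.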


Since ridge, random forest and gradient boosting have the lowest cross-validated RMSE on the training data (about 0.139, 0.142 and 0.123), the average of their predictions is taken as the final prediction

In [13]:
y_final = (y_2+y_3+y_5)/3
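
The equal-weight average can also be written with a stacked array, which scales more easily if models are added or removed. A minimal sketch with hypothetical predictions:

```python
import numpy as np

# Hypothetical predicted prices from three models for two houses
p_ridge = np.array([150000.0, 210000.0])
p_rf = np.array([156000.0, 204000.0])
p_gbr = np.array([153000.0, 216000.0])

# One column per model, then the mean across columns for each house
stacked = np.column_stack([p_ridge, p_rf, p_gbr])
blend = stacked.mean(axis=1)
print(blend)
```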

In [14]:
submission_df = pd.DataFrame(data={'Id':test.index,'SalePrice':y_final})

In [15]:
submission_df.to_csv('submission_final_ridge_rf_gbr.csv',index=False)