House Price Predictions
Dataset: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview

1. Problem statement (0-10)
	* How well is the problem defined?
	* Does the research address a real-life problem?
	* Does the research solve the correct problem?



	* Every problem deals with "real-world" data in some way. Even if you don't use datasets, you'll likely generate some data
	* How is the data gathered?
	* Is the process statistically valid?
	* Is the process of data acquisition, data cleaning, and data manipulation well documented?
6. Testing (0-10)
	* This can have various meanings: unit testing, hypothesis testing, train / test data set, etc.
	* Is the code thoroughly tested?
	* Are there any comparisons to other implementations / other articles / previous research?
7. Visualization (0-10)
	* All kinds of projects employ some visualization: graphical plots, tables, etc.
	* Are all visualizations correct (i.e. convey the intended meaning without misleading the intended audience)?
	* Are all visualizations clear, and easy to understand?
8. Communication (0-10)
	* Does the project tell the story correctly?
	* Does the project serve the audience it was intended for?

2. Layout (0-20)
	* Are the document sections structured properly?
	* Is the article well-formatted (in terms of readability)?

3. Code quality (0-20)
	* Is the code well-written? Is the code self-documenting?
	* Is the code organized into functions?
	* Is the code generally well-structured?

In [101]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV


from sklearn.linear_model import Lasso, Ridge

from sklearn.metrics import mean_absolute_error,r2_score

In [102]:
train = pd.read_csv('train.csv')
X_test = pd.read_csv('test.csv')

In [103]:
train.shape


(1460, 81)

In [104]:
X_test.shape

(1459, 80)

Split the data into training and validation sets

In [105]:
attributes = train.drop('SalePrice', axis=1)
labels = train.SalePrice

In [106]:
X_train,X_valid,y_train,y_valid = train_test_split(
    attributes,
    labels,
    test_size=0.33,
    random_state=33
)

In [107]:
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

((978, 80), (482, 80), (978,), (482,))

Data cleaning and data manipulation / Preprocessing\
Here are the first ten numerical columns

In [108]:
X_train.describe().T.iloc[:10]

Feature,count,mean,std,min,25%,50%,75%,max
Id,978.0,728.608384,422.557427,1.0,370.25,718.5,1103.75,1458.0
MSSubClass,978.0,57.484663,42.772544,20.0,20.0,50.0,70.0,190.0
LotFrontage,797.0,68.927227,21.549108,21.0,59.0,68.0,80.0,182.0
LotArea,978.0,10452.155419,10424.805743,1300.0,7500.0,9434.5,11494.5,215245.0
OverallQual,978.0,6.07771,1.395304,1.0,5.0,6.0,7.0,10.0
OverallCond,978.0,5.57362,1.13166,1.0,5.0,5.0,6.0,9.0
YearBuilt,978.0,1971.120654,30.204519,1875.0,1954.0,1972.0,2000.0,2010.0
YearRemodAdd,978.0,1984.713701,20.664891,1950.0,1966.0,1994.0,2004.0,2010.0
MasVnrArea,972.0,103.510288,178.343213,0.0,0.0,0.0,168.0,1378.0
BsmtFinSF1,978.0,447.878323,437.727161,0.0,0.0,389.0,727.0,2260.0


Here are the first ten categorical (object) columns

In [109]:
X_train.describe(include=object).T.iloc[:10]

Feature,count,unique,top,freq
MSZoning,978,5,RL,767
Street,978,2,Pave,974
Alley,60,2,Grvl,32
LotShape,978,4,Reg,612
LandContour,978,4,Lvl,880
Utilities,978,2,AllPub,977
LotConfig,978,5,Inside,705
LandSlope,978,3,Gtl,926
Neighborhood,978,25,NAmes,146
Condition1,978,9,Norm,838


Handling null values

In [110]:
more_than_zero_null_values = X_train.isnull().sum()>0

X_train.isnull().sum()[more_than_zero_null_values]

LotFrontage     181
Alley           918
MasVnrType        6
MasVnrArea        6
BsmtQual         29
BsmtCond         29
BsmtExposure     29
BsmtFinType1     29
BsmtFinType2     30
FireplaceQu     460
GarageType       56
GarageYrBlt      56
GarageFinish     56
GarageQual       56
GarageCond       56
PoolQC          974
Fence           796
MiscFeature     945
dtype: int64

18 columns in the training split have at least one null value.\
Let's identify the numerical and the categorical features

In [111]:
num_features = X_train.select_dtypes(include = 'number').columns.tolist()
print(f"The numerical features are {len(num_features)}.")
print(f"They are {', '.join(num_features)}.")

The numerical features are 37.
They are Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, YearBuilt, YearRemodAdd, MasVnrArea, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, TotRmsAbvGrd, Fireplaces, GarageYrBlt, GarageCars, GarageArea, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, MiscVal, MoSold, YrSold.


In [112]:
cat_features =  X_train.select_dtypes(exclude = 'number').columns.tolist()
print(f"The categorical features are {len(cat_features)}.")
print(f"They are {', '.join(cat_features)}.")

The categorical features are 43.
They are MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, Heating, HeatingQC, CentralAir, Electrical, KitchenQual, Functional, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PavedDrive, PoolQC, Fence, MiscFeature, SaleType, SaleCondition.


I will use SimpleImputer for the missing values and MinMaxScaler for feature scaling.\
Missing numerical values will be filled with the column mean, and missing categorical values with the mode (the most frequent value).\
I will use Pipeline to chain these steps because it is a new and interesting technique.

In [113]:
numeric_pipeline = Pipeline(
    steps = [
        ('Numerical imputer', SimpleImputer(strategy='mean')),
        ('numerical scaler', MinMaxScaler()) 
    ])

categorical_pipeline = Pipeline(
    steps = [
        ('categorical imputer', SimpleImputer(strategy='most_frequent')),
        # Dense output is requested here; scikit-learn >= 1.2 renames `sparse` to `sparse_output`.
        ('categorical one-hot', OneHotEncoder(handle_unknown='ignore',sparse = False)) 
    ])
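
To make the two preprocessing pipelines concrete, here is a minimal sketch on toy data. The arrays `toy_numbers` and `toy_categories` are hypothetical and are not taken from `train.csv`; the step names are shortened for illustration. The encoder is left at its default sparse output and converted with `.toarray()` so the sketch does not depend on the `sparse` / `sparse_output` parameter name.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy columns, each with one missing value.
toy_numbers = np.array([[10.0], [20.0], [np.nan], [30.0]])
toy_categories = np.array([["a"], ["b"], [np.nan], ["a"]], dtype=object)

toy_numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # NaN -> mean of 10, 20, 30 = 20
    ("scaler", MinMaxScaler()),                   # (x - min) / (max - min)
])

toy_categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # NaN -> "a"
    ("one-hot", OneHotEncoder(handle_unknown="ignore")),   # one column per category
])

scaled = toy_numeric_pipeline.fit_transform(toy_numbers)
encoded = toy_categorical_pipeline.fit_transform(toy_categories).toarray()

print(scaled.ravel())  # values 0, 0.5, 0.5, 1
print(encoded)         # columns correspond to categories "a" and "b"
```

The imputed value lands in the middle of the scaled range because the mean of the observed values is 20, halfway between the minimum and the maximum.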

A ColumnTransformer applies each pipeline to its own columns, which avoids calling fit and transform separately for the two pipelines above

In [114]:
full_transformer = ColumnTransformer(transformers=[
    ('number', numeric_pipeline, num_features),
    ('category', categorical_pipeline, cat_features)
])

In [115]:
full_transformer.fit_transform(X_train)

array([[0.10089224, 0.23529412, 0.29768464, ..., 0.        , 1.        ,
        0.        ],
       [0.31708991, 0.        , 0.24223602, ..., 0.        , 1.        ,
        0.        ],
       [0.79341112, 0.35294118, 0.39751553, ..., 0.        , 1.        ,
        0.        ],
       ...,
       [0.39670556, 0.82352941, 0.08074534, ..., 0.        , 0.        ,
        0.        ],
       [0.26835964, 0.23529412, 0.31055901, ..., 0.        , 1.        ,
        0.        ],
       [0.71654084, 0.        , 0.36645963, ..., 0.        , 1.        ,
        0.        ]])

Lasso model (L1 regularization)

In [116]:
lasso_model = Lasso(alpha = 0.1)

lasso_pipeline = Pipeline(
    steps = [
        ('preprocessing', full_transformer),
        ('model',lasso_model)
    ])

In [117]:
_ = lasso_pipeline.fit(X_train, y_train)

  model = cd_fast.enet_coordinate_descent(


In [118]:
predictions = lasso_pipeline.predict(X_valid)


In [119]:
mean_absolute_error(y_valid,predictions)

18220.90578423801

In [120]:
r2_score(y_valid,predictions)

0.760180429884422
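
As a reminder of what the two metrics measure, here is a small worked example that computes them by hand and compares the results with scikit-learn. The prices in `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical sale prices and predictions.
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 280000.0, 255000.0])

# MAE: the average absolute difference, in the same unit as the target.
mae_manual = np.mean(np.abs(y_true - y_pred))

# R^2: 1 minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_pred))  # both 11250.0
print(r2_manual, r2_score(y_true, y_pred))              # both approximately 0.95
```

Read this way, the MAE of about 18,221 above means the Lasso predictions are off by roughly that many dollars on average on the validation set.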

Hyperparameter Optimization

In [121]:
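# The grid starts at alpha=0, where Lasso has no regularization; scikit-learn warns about poor convergence in that case.
parameters_grid = {'model__alpha':np.arange(0,1,0.02)}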

grid_search = GridSearchCV(
    lasso_pipeline,
    parameters_grid,
    cv= 10, 
    scoring = 'neg_mean_absolute_error'
)
_ = grid_search.fit(X_train,y_train)

  self._final_estimator.fit(Xt, y, **fit_params_last_step)
  model = cd_fast.enet_coordinate_descent(

In [122]:
print(f'The best alpha value is {grid_search.best_params_}')

The best alpha value is {'model__alpha': 0.98}


In [123]:
print(f'The best score is {abs(grid_search.best_score_)}')

The best score is 18443.656766920954


The best alpha, 0.98, is the largest value in the grid, so the optimum may lie beyond it. Try a wider range of alpha values

In [124]:
parameters_grid = {'model__alpha':np.arange(1,200,5)}

grid_search = GridSearchCV(
    lasso_pipeline,
    parameters_grid,
    cv= 10, 
    scoring = 'neg_mean_absolute_error'
)
_ = grid_search.fit(X_train,y_train)

  model = cd_fast.enet_coordinate_descent(


In [125]:
print(f'The best alpha value is {grid_search.best_params_}')

The best alpha value is {'model__alpha': 96}


In [126]:
print(f'The best score is {abs(grid_search.best_score_)}')

The best score is 16747.366502908117
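
The score above is a cross-validated MAE on the training split, so it is not directly comparable with the validation MAE reported earlier. The sketch below shows one possible way to check whether the best alpha sits on the edge of the grid and to score the refitted model on held-out data. It uses synthetic data from `make_regression` rather than the house price data, and the logarithmic grid and `max_iter` value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=33)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.33, random_state=33)

pipe = Pipeline(steps=[
    ("scaler", MinMaxScaler()),
    ("model", Lasso(max_iter=10_000)),
])

# Strictly positive, logarithmically spaced alphas (alpha=0 is excluded).
alphas = np.logspace(-3, 1, 9)
search = GridSearchCV(pipe, {"model__alpha": alphas}, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X_tr, y_tr)

best_alpha = search.best_params_["model__alpha"]
on_edge = best_alpha in (alphas[0], alphas[-1])  # True suggests widening the grid
cv_mae = -search.best_score_                     # the scorer returns negative MAE

# GridSearchCV refits the best pipeline on all of X_tr by default,
# so it can predict on the held-out split directly.
valid_mae = mean_absolute_error(y_va, search.predict(X_va))

print(f"best alpha: {best_alpha}, on grid edge: {on_edge}")
print(f"cross-validated MAE: {cv_mae:.2f}, validation MAE: {valid_mae:.2f}")
```

Applied to this notebook, the equivalent step would be `mean_absolute_error(y_valid, grid_search.predict(X_valid))` after each search.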


Ridge model (L2 regularization)

In [127]:
ridge_model = Ridge(alpha = 0.1)

ridge_pipeline = Pipeline(
    steps = [
        ('preprocessing', full_transformer),
        ('model',ridge_model)
    ])

In [128]:
_ = ridge_pipeline.fit(X_train, y_train)

In [129]:
predictions = ridge_pipeline.predict(X_valid)

In [130]:
mean_absolute_error(y_valid,predictions)

18185.32918987934

In [131]:
r2_score(y_valid,predictions)

0.7637747258600046
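
Lasso and Ridge score similarly here, but they regularize differently: the L1 penalty can set coefficients exactly to zero, while the L2 penalty only shrinks them. The sketch below illustrates this on synthetic data where only 5 of 20 features carry signal. The data and the `alpha=1.0` setting are assumptions for illustration, and the exact counts depend on both.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 of which influence the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print(f"Lasso zero coefficients: {n_zero_lasso} of {lasso.coef_.size}")
print(f"Ridge zero coefficients: {n_zero_ridge} of {ridge.coef_.size}")
```

With more than 200 one-hot and numerical columns in this dataset, counting the zero entries of `lasso_pipeline['model'].coef_` would show how many features the Lasso model actually uses.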

Hyperparameter Optimization

In [132]:
parameters_grid = {'model__alpha':np.arange(1,200,5)}

grid_search = GridSearchCV(
    ridge_pipeline,
    parameters_grid,
    cv= 10, 
    scoring = 'neg_mean_absolute_error'
)
_ = grid_search.fit(X_train,y_train)

In [133]:
print(f'The best alpha value is {grid_search.best_params_}')

The best alpha value is {'model__alpha': 6}


In [134]:
print(f'The best score is {abs(grid_search.best_score_)}')

The best score is 17444.619350162542
