<a href="https://colab.research.google.com/github/JoanWaweru/ML-Group-5-Tasks/blob/main/Housing_Price_Task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [160]:
import pandas as pd
from sklearn.linear_model import LassoCV
import numpy as np

Mount Google Drive, where the dataset is stored.

In [161]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Load the housing price dataset

In [162]:
data = pd.read_csv('/content/drive/MyDrive/Datasets/modified_data.csv')

In [163]:
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,...,0,0,0,0,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,...,0,0,0,0,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,...,0,0,0,0,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,...,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,...,0,0,0,0,0,12,2008,WD,Normal,250000


Prepare the dataset.
First, identify any irrelevant feature and drop it. The Id column is only a row identifier and carries no information about the sale price, so it is dropped.

In [164]:
data.drop('Id', axis=1, inplace=True)

In [165]:
data.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,9,2008,WD,Normal,223500
3,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,12,2008,WD,Normal,250000


After dropping the Id column, check for the columns with missing data.

isnull().sum() returns the number of missing values in each column. The dtype on the last line of the output describes the resulting Series of counts, not the columns themselves.

In [166]:
data.isnull().sum()

MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
Street             0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 76, dtype: int64
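
Because the printed summary is truncated, most columns with missing values are hidden. A minimal sketch of filtering the counts down to the affected columns follows; the toy frame and its values are made up for illustration and only borrow column names from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are invented)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "FireplaceQu": ["TA", None, "Gd", "TA"],
})

missing_counts = toy.isnull().sum()
# Keep only the columns with at least one missing value, largest count first
missing_only = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_only)
```

The same filter applied to data would list every column that still needs attention, rather than the first and last five.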

Use SimpleImputer from scikit-learn to fill the missing LotFrontage values with the column mean.

In [167]:
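from sklearn.impute import SimpleImputer
# Use np.nan; the np.NaN alias was removed in NumPy 2.0
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# SimpleImputer expects a 2-D array, so reshape the column and take the single output column back
data.LotFrontage = imputer.fit_transform(data['LotFrontage'].values.reshape(-1, 1))[:, 0]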
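
To make the effect of mean imputation concrete, here is a small sketch on an invented column. The numbers are illustrative only and are not taken from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature with two missing entries (invented values)
col = np.array([[60.0], [np.nan], [80.0], [np.nan], [70.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(col)[:, 0]

# The mean of the observed values 60, 80 and 70 is 70,
# so both missing entries become 70
print(filled)
print(imputer.statistics_)
```

statistics_ stores the fill value learned for each column, which is useful when the same imputation has to be applied to new data with transform.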

In [168]:
data.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,9,2008,WD,Normal,223500
3,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,12,2008,WD,Normal,250000


The missing values in LotFrontage have been filled in. The truncated output hides the other columns that still contain missing values.

In [169]:
data.isnull().sum()

MSSubClass       0
MSZoning         0
LotFrontage      0
LotArea          0
Street           0
                ..
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
SalePrice        0
Length: 76, dtype: int64

Retrieve LotFrontage, the column that had missing values, to confirm that they have been filled.

In [170]:
data.LotFrontage

0       65.0
1       80.0
2       68.0
3       60.0
4       84.0
        ... 
1455    62.0
1456    85.0
1457    66.0
1458    68.0
1459    75.0
Name: LotFrontage, Length: 1460, dtype: float64

In [171]:
initialFeatures = list(data.columns)
len(initialFeatures)

76

In [172]:
features_missing_data = list(data.columns[data.isna().any()])
len(features_missing_data)

14

In [173]:
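# Drop columns in which fewer than half of the rows are non-missing; thresh is a count, so pass an int
data.dropna(axis=1, thresh=int(0.5 * len(data)), inplace=True)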
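
The thresh argument is the minimum number of non-missing values a column must have to be kept. A small sketch on an invented frame (the column names are hypothetical) shows the behavior:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_present": [1.0, 2.0, np.nan, 4.0],        # 3 of 4 values present
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],  # 1 of 4 values present
})

# Require at least half of the rows (2 of 4) to be non-missing
min_non_null = int(0.5 * len(toy))
kept = toy.dropna(axis=1, thresh=min_non_null)
print(list(kept.columns))
```

Only the column that meets the threshold survives; in the notebook the column count stays at 76 after this step, so no column fell below the 50% mark.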

In [174]:
mean_fill=['LotFrontage', 'MasVnrArea']
backward_fill_data = ['FireplaceQu']
forward_fill_data = list(set(features_missing_data)-set(mean_fill)-set(backward_fill_data))

In [175]:
forward_fill_data

['GarageCond',
 'GarageQual',
 'BsmtQual',
 'GarageFinish',
 'BsmtExposure',
 'BsmtFinType1',
 'MasVnrType',
 'Electrical',
 'BsmtCond',
 'BsmtFinType2',
 'GarageType',
 'GarageYrBlt']

In [176]:
for a in mean_fill:
  # Assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas
  data[a] = data[a].fillna(data[a].mean())

In [177]:
for b in backward_fill_data:
  # bfill() replaces the deprecated fillna(method='bfill')
  data[b] = data[b].bfill()

In [178]:
for c in forward_fill_data:
  # ffill() replaces the deprecated fillna(method='ffill')
  data[c] = data[c].ffill()
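
The three fill strategies used above behave differently at the edges of a column. The sketch below, on an invented Series, illustrates this; note that a forward fill cannot fill a leading gap and a backward fill cannot fill a trailing one.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

mean_filled = s.fillna(s.mean())  # mean of 1 and 3 is 2
forward = s.ffill()               # copies the previous value forward
backward = s.bfill()              # copies the next value backward

print(mean_filled.tolist())
print(forward.tolist())   # the first entry stays NaN: nothing precedes it
print(backward.tolist())  # the last entry stays NaN: nothing follows it
```

For that reason it is worth rechecking isnull().sum() on the filled columns, as the notebook does next.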

In [179]:
data.isnull().sum()

MSSubClass       0
MSZoning         0
LotFrontage      0
LotArea          0
Street           0
                ..
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
SalePrice        0
Length: 76, dtype: int64

Encode the dataset so that the model receives numeric inputs and does not read a spurious order into unordered categories.

First, list the data type of each column.

In [180]:
data.dtypes

MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 76, dtype: object

Second, collect the non-numerical columns by selecting those with the object data type.

In [181]:
categoricalFeatures = list(data.select_dtypes(include=['object']).copy().columns)

List the non-numerical columns.

In [182]:
categoricalFeatures

['MSZoning',
 'Street',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'SaleType',
 'SaleCondition']

In [183]:
nominalData=['MSZoning', 'LandContour', 'LotConfig','Neighborhood','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','Foundation','BsmtFinType1','CentralAir']
ordinalData = list(set(categoricalFeatures)-set(nominalData))
numericalData = list(set(initialFeatures)-set(categoricalFeatures))
target = ['SalePrice']

In [184]:
data[numericalData]

Unnamed: 0,GarageCars,TotalBsmtSF,HalfBath,GarageArea,BsmtUnfSF,TotRmsAbvGrd,BsmtFinSF2,OverallQual,KitchenAbvGr,MSSubClass,...,PoolArea,ScreenPorch,BsmtFullBath,OverallCond,GrLivArea,WoodDeckSF,YearBuilt,EnclosedPorch,OpenPorchSF,YrSold
0,2,856,1,548,150,8,0,7,1,60,...,0,0,1,5,1710,0,2003,0,61,2008
1,2,1262,0,460,284,6,0,6,1,20,...,0,0,0,8,1262,298,1976,0,0,2007
2,2,920,1,608,434,6,0,7,1,60,...,0,0,1,5,1786,0,2001,0,42,2008
3,3,756,0,642,540,7,0,7,1,70,...,0,0,1,5,1717,0,1915,272,35,2006
4,3,1145,1,836,490,9,0,8,1,60,...,0,0,1,5,2198,192,2000,0,84,2008
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,2,953,1,460,953,7,0,6,1,60,...,0,0,0,5,1647,0,1999,0,40,2007
1456,2,1542,0,500,589,7,163,6,1,20,...,0,0,1,6,2073,349,1978,0,0,2010
1457,1,1152,0,252,877,9,0,7,1,70,...,0,0,0,9,2340,0,1941,0,60,2010
1458,1,1078,0,240,0,5,1029,5,1,20,...,0,0,1,6,1078,366,1950,112,0,2010


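Third, we encode the ordinal data as integer codes. Note that cat.codes numbers the categories in alphabetical order of their labels, which need not match their natural ranking.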

In [185]:
for feature in ordinalData:
  data[feature] = data[feature].astype('category').cat.codes
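
As a sketch of how the integer codes are assigned, the example below compares the default behavior of cat.codes with an explicitly ordered categorical. The label order in order is an assumption made for illustration (worst to best); the notebook itself uses the default.

```python
import pandas as pd

quality = pd.Series(["TA", "Gd", "Ex", "Fa", "TA"])

# Default: categories are sorted alphabetically, so Ex=0, Fa=1, Gd=2, TA=3
alphabetical_codes = quality.astype("category").cat.codes

# Assumed ranking from worst to best, supplied explicitly
order = ["Fa", "TA", "Gd", "Ex"]
ordered_codes = pd.Categorical(quality, categories=order, ordered=True).codes

print(alphabetical_codes.tolist())
print(list(ordered_codes))
```

With the explicit order, a larger code always means a better rating, which is what an ordinal encoding is meant to express.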

In [186]:
df_ordinal = data[ordinalData]

For nominal data, we use one-hot encoding.

In [187]:
df_nominal = pd.get_dummies(data[nominalData])
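
A small sketch of what pd.get_dummies produces, using an invented two-column frame that borrows column names from the dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "CentralAir": ["Y", "N", "Y"],
})

# Each distinct label becomes its own indicator column named <column>_<label>
dummies = pd.get_dummies(toy)
print(list(dummies.columns))
print(dummies.shape)
```

Every nominal column expands into as many indicator columns as it has distinct labels, which is why the feature matrix later grows to 162 columns.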

The numerical data is used as is, with no encoding.

In [188]:
df_numerical = data[numericalData]

Join the data to form a new dataframe.

In [189]:
joinedData = pd.concat([df_numerical, df_nominal, df_ordinal], axis=1)

In [190]:
joinedData.head()

Unnamed: 0,GarageCars,TotalBsmtSF,HalfBath,GarageArea,BsmtUnfSF,TotRmsAbvGrd,BsmtFinSF2,OverallQual,KitchenAbvGr,MSSubClass,...,SaleType,GarageFinish,BsmtExposure,LotShape,LandSlope,BsmtFinType2,Street,Condition2,Heating,ExterCond
0,2,856,1,548,150,8,0,7,1,60,...,8,1,3,3,0,5,1,2,1,4
1,2,1262,0,460,284,6,0,6,1,20,...,8,1,1,3,0,5,1,2,1,4
2,2,920,1,608,434,6,0,7,1,60,...,8,1,2,0,0,5,1,2,1,4
3,3,756,0,642,540,7,0,7,1,70,...,8,2,3,0,0,5,1,2,1,4
4,3,1145,1,836,490,9,0,8,1,60,...,8,1,0,0,0,5,1,2,1,4


Standardize the encoded dataset.

In [191]:
from sklearn.preprocessing import StandardScaler

In [192]:
scaler = StandardScaler()
df_X = joinedData.drop('SalePrice', axis=1)
X = np.array(df_X)
df_X

Unnamed: 0,GarageCars,TotalBsmtSF,HalfBath,GarageArea,BsmtUnfSF,TotRmsAbvGrd,BsmtFinSF2,OverallQual,KitchenAbvGr,MSSubClass,...,SaleType,GarageFinish,BsmtExposure,LotShape,LandSlope,BsmtFinType2,Street,Condition2,Heating,ExterCond
0,2,856,1,548,150,8,0,7,1,60,...,8,1,3,3,0,5,1,2,1,4
1,2,1262,0,460,284,6,0,6,1,20,...,8,1,1,3,0,5,1,2,1,4
2,2,920,1,608,434,6,0,7,1,60,...,8,1,2,0,0,5,1,2,1,4
3,3,756,0,642,540,7,0,7,1,70,...,8,2,3,0,0,5,1,2,1,4
4,3,1145,1,836,490,9,0,8,1,60,...,8,1,0,0,0,5,1,2,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,2,953,1,460,953,7,0,6,1,60,...,8,1,3,3,0,5,1,2,1,4
1456,2,1542,0,500,589,7,163,6,1,20,...,8,2,3,3,0,4,1,2,1,4
1457,1,1152,0,252,877,9,0,7,1,70,...,8,1,3,3,0,5,1,2,1,2
1458,1,1078,0,240,0,5,1029,5,1,20,...,8,2,2,3,0,4,1,2,1,4


In [193]:
df_y = data[target]
y = np.array(df_y)
df_y

Unnamed: 0,SalePrice
0,208500
1,181500
2,223500
3,140000
4,250000
...,...
1455,175000
1456,210000
1457,266500
1458,142125


In [194]:
X.shape

(1460, 162)

In [195]:
y.shape

(1460, 1)

In [196]:
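X = scaler.fit_transform(X)
# Fit a separate scaler on the target so that `scaler` keeps the feature statistics
y_scaler = StandardScaler()
y = y_scaler.fit_transform(y)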

In [197]:
X

array([[ 0.31172464, -0.45930254,  1.22758538, ..., -0.03174026,
        -0.12304604,  0.36420746],
       [ 0.31172464,  0.46646492, -0.76162067, ..., -0.03174026,
        -0.12304604,  0.36420746],
       [ 0.31172464, -0.31336875,  1.22758538, ..., -0.03174026,
        -0.12304604,  0.36420746],
       ...,
       [-1.02685765,  0.21564122, -0.76162067, ..., -0.03174026,
        -0.12304604, -2.36968918],
       [-1.02685765,  0.04690528, -0.76162067, ..., -0.03174026,
        -0.12304604,  0.36420746],
       [-1.02685765,  0.45278362,  1.22758538, ..., -0.03174026,
        -0.12304604,  0.36420746]])

In [198]:
y

array([[ 0.34727322],
       [ 0.00728832],
       [ 0.53615372],
       ...,
       [ 1.07761115],
       [-0.48852299],
       [-0.42084081]])
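
The standardization step can be sketched on invented numbers as follows. The names x_scaler and y_scaler_demo are hypothetical; the point is that keeping one scaler per array makes it possible to map standardized targets back to prices later.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented features and prices, for illustration only
X_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
y_toy = np.array([[100000.0], [200000.0], [300000.0]])

x_scaler = StandardScaler()
y_scaler_demo = StandardScaler()

X_scaled = x_scaler.fit_transform(X_toy)       # each column: mean 0, std 1
y_scaled = y_scaler_demo.fit_transform(y_toy)

# Convert standardized targets back to the original price units
y_back = y_scaler_demo.inverse_transform(y_scaled)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
print(y_back.ravel())
```

If a single scaler is refitted on the target, its stored mean_ and scale_ describe the target only, so it can no longer transform new feature rows.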

Feature selection using L1 (Lasso) regularization

In [199]:
regressor = LassoCV()
# Flatten y from shape (n, 1) to (n,), which avoids scikit-learn's column-vector warning
regressor.fit(X, y.ravel())


LassoCV()
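
The notebook stops at fitting the model. As a hedged sketch of how L1 regularization can be read as feature selection, the example below fits LassoCV on synthetic data (not the housing data) in which only two of ten features drive the target, then lists the features whose coefficients are non-zero. The data, the seed, and the variable names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 10))
# Only features 0 and 3 influence the synthetic target
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 3] + 0.1 * rng.normal(size=200)

model = LassoCV(cv=5, random_state=0).fit(X_syn, y_syn)

# The L1 penalty drives uninformative coefficients to exactly zero,
# so the non-zero entries of coef_ mark the retained features
selected = np.flatnonzero(model.coef_ != 0)
print("chosen alpha:", model.alpha_)
print("selected feature indices:", selected)
```

On the housing data, the same idea would apply to regressor.coef_ together with df_X.columns to recover the names of the retained features.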