
In [2]:
import pandas as pd
import numpy as np

Mount Google Drive, where the dataset is stored.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Load the housing price dataset

In [4]:
data = pd.read_csv('/content/drive/MyDrive/Datasets/modified_data.csv')

In [5]:
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,...,0,0,0,0,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,...,0,0,0,0,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,...,0,0,0,0,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,...,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,...,0,0,0,0,0,12,2008,WD,Normal,250000


Prepare the dataset.
First, drop any feature that carries no predictive information. The Id column is only a row identifier, so we remove it.

In [6]:
data.drop('Id', axis=1, inplace=True)

In [7]:
data.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,9,2008,WD,Normal,223500
3,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,12,2008,WD,Normal,250000


After dropping the Id column, check for the columns with missing data.

isnull().sum() returns the number of missing values in each column. The dtype shown at the end of the output is that of the resulting Series of counts.

In [8]:
data.isnull().sum()

MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
Street             0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 76, dtype: int64
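Because the printed output is truncated, it can help to keep only the columns that actually contain gaps. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and its values are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "MSZoning": ["RL", "RL", None, "RM"],
})

# One missing-value count per column
missing = toy.isnull().sum()

# Keep only the columns with at least one gap, largest count first
with_gaps = missing[missing > 0].sort_values(ascending=False)
print(with_gaps)
```

Applied to the real frame, the same filter would list every column that still needs imputation, not only the ones visible in the truncated display.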

Use SimpleImputer from scikit-learn to fill the missing LotFrontage values with the column mean.

In [9]:
from sklearn.impute import SimpleImputer
# np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
data.LotFrontage = imputer.fit_transform(data['LotFrontage'].values.reshape(-1, 1))[:, 0]
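To make the effect of mean imputation concrete, here is a small sketch on made-up values (the `toy` frame is hypothetical). It selects the column with double brackets, which is an alternative way to obtain the 2-D input that SimpleImputer expects:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with two gaps; the observed values average to 70
toy = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, np.nan, 70.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# toy[["LotFrontage"]] is a one-column frame (2-D), so no reshape is needed
toy["LotFrontage"] = imputer.fit_transform(toy[["LotFrontage"]])[:, 0]

print(imputer.statistics_)          # the mean learned during fit
print(toy["LotFrontage"].tolist())  # gaps replaced by that mean
```

The mean is computed from the observed values only, so the two gaps are both replaced by 70.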

In [10]:
data.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,9,2008,WD,Normal,223500
3,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,12,2008,WD,Normal,250000


The missing LotFrontage values have been filled in.

In [11]:
data.isnull().sum()

MSSubClass       0
MSZoning         0
LotFrontage      0
LotArea          0
Street           0
                ..
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
SalePrice        0
Length: 76, dtype: int64

Display the LotFrontage column to confirm that its values have been filled.

In [12]:
data.LotFrontage

0       65.0
1       80.0
2       68.0
3       60.0
4       84.0
        ... 
1455    62.0
1456    85.0
1457    66.0
1458    68.0
1459    75.0
Name: LotFrontage, Length: 1460, dtype: float64

In [13]:
initialFeatures = list(data.columns)
len(initialFeatures)

76

Encode the categorical columns numerically so that the model can use them without assigning weight to arbitrary string labels.

First, list the data type of each column.

In [14]:
data.dtypes

MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 76, dtype: object

Second, collect the non-numerical (categorical) columns, i.e. those with dtype object.

In [15]:
categoricalFeatures = list(data.select_dtypes(include=['object']).columns)

Display the resulting list of non-numerical columns.

In [16]:
categoricalFeatures

['MSZoning',
 'Street',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'SaleType',
 'SaleCondition']

In [17]:
nominalData=['MSZoning', 'LandContour', 'LotConfig','Neighborhood','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','Foundation','BsmtFinType1','CentralAir']
ordinalData = list(set(categoricalFeatures)-set(nominalData))
numericalData = list(set(initialFeatures)-set(categoricalFeatures))
target = ['SalePrice']
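The split above relies on set differences, and Python sets do not preserve order, so the column order of ordinalData and numericalData can change between runs. A possible alternative, sketched here on a hypothetical four-column frame, uses list comprehensions to keep the original column order. Unlike the cell above, this sketch also leaves the target out of the numerical list; that is a choice made for the example, not the notebook's approach.

```python
import pandas as pd

# Hypothetical miniature of the housing frame
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM"],
    "LotArea": [8450, 9600],
    "ExterQual": ["Gd", "TA"],
    "SalePrice": [208500, 181500],
})

target = "SalePrice"
nominal = ["MSZoning"]  # chosen by hand, as in the cell above

# Everything that is not a number is treated as categorical
categorical = toy.select_dtypes(exclude="number").columns.tolist()

# List comprehensions preserve the frame's column order
ordinal = [c for c in categorical if c not in nominal]
numerical = [c for c in toy.columns if c not in categorical and c != target]

print(categorical, ordinal, numerical)
```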

In [18]:
data[numericalData]

Unnamed: 0,GarageCars,BsmtFullBath,YrSold,LowQualFinSF,BsmtFinSF1,MSSubClass,OverallQual,Fireplaces,YearBuilt,YearRemodAdd,...,FullBath,BsmtFinSF2,GrLivArea,BsmtUnfSF,HalfBath,SalePrice,ScreenPorch,PoolArea,BedroomAbvGr,WoodDeckSF
0,2,1,2008,0,706,60,7,0,2003,2003,...,2,0,1710,150,1,208500,0,0,3,0
1,2,0,2007,0,978,20,6,1,1976,1976,...,2,0,1262,284,0,181500,0,0,3,298
2,2,1,2008,0,486,60,7,1,2001,2002,...,2,0,1786,434,1,223500,0,0,3,0
3,3,1,2006,0,216,70,7,1,1915,1970,...,1,0,1717,540,0,140000,0,0,3,0
4,3,1,2008,0,655,60,8,1,2000,2000,...,2,0,2198,490,1,250000,0,0,4,192
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,2,0,2007,0,0,60,6,1,1999,2000,...,2,0,1647,953,1,175000,0,0,3,0
1456,2,1,2010,0,790,20,6,2,1978,1988,...,2,163,2073,589,0,210000,0,0,3,349
1457,1,0,2010,0,275,70,7,2,1941,2006,...,2,0,2340,877,0,266500,0,0,4,0
1458,1,1,2010,0,49,20,5,0,1950,1996,...,1,1029,1078,0,0,142125,0,0,2,366


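Third, encode the ordinal data as integer category codes. Note that cat.codes numbers the categories in alphabetical order, which need not match their natural ranking, and assigns -1 to missing values.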

In [19]:
for feature in ordinalData:
    data[feature] = data[feature].astype('category').cat.codes
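The difference between alphabetical codes and a ranked encoding can be seen in a short sketch. The quality labels and their ranking (Fa < TA < Gd < Ex) are assumptions made for this illustration; the appropriate order for each real column should be taken from the dataset's data dictionary.

```python
import pandas as pd

# Hypothetical quality ratings
quality = pd.Series(["Gd", "TA", "Ex", "Fa", "TA"])

# Default behaviour: categories are sorted alphabetically (Ex, Fa, Gd, TA)
alphabetical = quality.astype("category").cat.codes

# Explicit ranking, assumed here to run from worst to best
ranked_type = pd.CategoricalDtype(["Fa", "TA", "Gd", "Ex"], ordered=True)
ranked = quality.astype(ranked_type).cat.codes

print(alphabetical.tolist())  # "Ex" gets the lowest code
print(ranked.tolist())        # "Ex" gets the highest code
```

With the explicit dtype, a larger code always means a better rating, which is the property an ordinal encoding is meant to provide.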

In [20]:
df_ordinal = data[ordinalData]

For nominal data, which has no inherent order, we use one-hot encoding.

In [21]:
df_nominal = pd.get_dummies(data[nominalData])
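As a brief illustration of what one-hot encoding produces, the sketch below applies pd.get_dummies to a hypothetical two-column frame. The `dtype=int` argument is an addition for readability, so that the indicators print as 0/1.

```python
import pandas as pd

# Hypothetical nominal columns
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "CentralAir": ["Y", "N", "Y"],
})

# One indicator column per (feature, category) pair
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each original column is replaced by as many indicator columns as it has distinct categories, which is why the feature count grows after this step.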

The numerical data needs no preprocessing at this stage.

In [22]:
df_numerical = data[numericalData]

Join the data to form a new dataframe.

In [23]:
joinedData = pd.concat([df_numerical, df_nominal, df_ordinal], axis=1)

In [24]:
joinedData.head()

Unnamed: 0,GarageCars,BsmtFullBath,YrSold,LowQualFinSF,BsmtFinSF1,MSSubClass,OverallQual,Fireplaces,YearBuilt,YearRemodAdd,...,PavedDrive,GarageType,LandSlope,BsmtQual,HouseStyle,HeatingQC,Street,BsmtExposure,BsmtFinType2,Electrical
0,2,1,2008,0,706,60,7,0,2003,2003,...,2,1,0,2,5,0,1,3,5,4
1,2,0,2007,0,978,20,6,1,1976,1976,...,2,1,0,2,2,0,1,1,5,4
2,2,1,2008,0,486,60,7,1,2001,2002,...,2,1,0,2,5,0,1,2,5,4
3,3,1,2006,0,216,70,7,1,1915,1970,...,2,5,0,3,5,2,1,3,5,4
4,3,1,2008,0,655,60,8,1,2000,2000,...,2,1,0,2,5,0,1,0,5,4


Standardize the encoded dataset so that every column has zero mean and unit variance.

In [25]:
from sklearn.preprocessing import StandardScaler

In [35]:
scaler = StandardScaler()
df_X = joinedData.drop('SalePrice', axis=1)
X = np.array(df_X)
df_X

Unnamed: 0,GarageCars,BsmtFullBath,YrSold,LowQualFinSF,BsmtFinSF1,MSSubClass,OverallQual,Fireplaces,YearBuilt,YearRemodAdd,...,PavedDrive,GarageType,LandSlope,BsmtQual,HouseStyle,HeatingQC,Street,BsmtExposure,BsmtFinType2,Electrical
0,2,1,2008,0,706,60,7,0,2003,2003,...,2,1,0,2,5,0,1,3,5,4
1,2,0,2007,0,978,20,6,1,1976,1976,...,2,1,0,2,2,0,1,1,5,4
2,2,1,2008,0,486,60,7,1,2001,2002,...,2,1,0,2,5,0,1,2,5,4
3,3,1,2006,0,216,70,7,1,1915,1970,...,2,5,0,3,5,2,1,3,5,4
4,3,1,2008,0,655,60,8,1,2000,2000,...,2,1,0,2,5,0,1,0,5,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,2,0,2007,0,0,60,6,1,1999,2000,...,2,1,0,2,5,0,1,3,5,4
1456,2,1,2010,0,790,20,6,2,1978,1988,...,2,1,0,2,2,4,1,3,4,4
1457,1,0,2010,0,275,70,7,2,1941,2006,...,2,1,0,3,5,0,1,3,5,4
1458,1,1,2010,0,49,20,5,0,1950,1996,...,2,1,0,3,2,2,1,2,4,0


In [36]:
df_y = data[target]
y = np.array(df_y)
df_y

Unnamed: 0,SalePrice
0,208500
1,181500
2,223500
3,140000
4,250000
...,...
1455,175000
1456,210000
1457,266500
1458,142125


In [32]:
X.shape

(1460, 162)

In [33]:
y.shape

(1460, 1)

In [37]:
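X = scaler.fit_transform(X)
# Fit a separate scaler on the target so the feature scaler's statistics are not overwritten
y_scaler = StandardScaler()
y = y_scaler.fit_transform(y)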

In [38]:
X

array([[ 0.31172464,  1.10781015,  0.13877749, ...,  0.65814837,
         0.33985263,  0.30361622],
       [ 0.31172464, -0.81996437, -0.61443862, ..., -0.94735976,
         0.33985263,  0.30361622],
       [ 0.31172464,  1.10781015,  0.13877749, ..., -0.1446057 ,
         0.33985263,  0.30361622],
       ...,
       [-1.02685765, -0.81996437,  1.64520971, ...,  0.65814837,
         0.33985263,  0.30361622],
       [-1.02685765,  1.10781015,  1.64520971, ..., -0.1446057 ,
        -0.43181897, -3.47702075],
       [-1.02685765,  1.10781015,  0.13877749, ...,  0.65814837,
        -1.20349056,  0.30361622]])

In [39]:
y

array([[ 0.34727322],
       [ 0.00728832],
       [ 0.53615372],
       ...,
       [ 1.07761115],
       [-0.48852299],
       [-0.42084081]])
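Since the target is now expressed in standard deviations rather than in currency, any later prediction has to be mapped back to prices. The sketch below, on made-up numbers, keeps one scaler for the features and one for the target (the names `x_scaler` and `y_scaler` are hypothetical) and uses inverse_transform to recover the original values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features (lot area, quality) and sale prices
X = np.array([[8450.0, 7.0], [9600.0, 6.0], [11250.0, 7.0], [9550.0, 8.0]])
y = np.array([[208500.0], [181500.0], [223500.0], [140000.0]])

# One scaler per array, so each keeps its own mean and scale
x_scaler = StandardScaler()
y_scaler = StandardScaler()
X_std = x_scaler.fit_transform(X)
y_std = y_scaler.fit_transform(y)

# Each standardized column has mean 0 and standard deviation 1
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))

# Map standardized targets (or model predictions) back to prices
restored = y_scaler.inverse_transform(y_std)
print(restored.ravel())
```

If a single scaler were refitted on the target, it would retain only the target's statistics and could no longer transform new feature rows consistently.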