# Using Machine Learning to predict Sales Price

It is finally time to model our dataset. The goal of this lesson is to learn how a target variable, here the `SalePrice` of each property, can be predicted using Machine Learning. We will also evaluate our models in this lesson.

Let us start by importing the necessary libraries.

In [None]:
import pandas as pd
import lightgbm as lgb
import datetime

Next, we read the CSV file containing the housing price training data (referred to here as `houseprices_data.csv`) into a DataFrame.

In [None]:
# Reading in the CSV file as a DataFrame
df = pd.read_csv(r'C:\Users\muham\Downloads\train (1).csv', low_memory=False)

In [None]:
# Displaying the DataFrame (pandas shows only the first and last rows)
df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


In [None]:
# Printing the shape
df.shape

(1460, 81)

In [None]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [None]:
df.shape

(1460, 81)

# Let's check all null values

In [None]:
df.isnull().sum()

Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64
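
Because the output above is truncated, most of the 81 columns are hidden. A minimal sketch of a helper that lists only the columns with missing values is shown below. The function name `missing_summary` and the tiny `toy` frame are made up for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and share of missing values, only for columns that have any."""
    counts = frame.isnull().sum()
    counts = counts[counts > 0].sort_values(ascending=False)
    return pd.DataFrame({
        "missing": counts,
        "share": (counts / len(frame)).round(3),
    })


# Tiny made-up frame standing in for the housing data
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [None, None, "Grvl", None],
})

summary = missing_summary(toy)
print(summary)
```

On the real data, calling `missing_summary(df)` would show every incomplete column at once instead of checking them one by one.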

In [None]:
df['LotFrontage'].mode()

0    60.0
dtype: float64

In [None]:
# Filling missing values with the mode (60) found above
df['LotFrontage'] = df['LotFrontage'].fillna(60)

In [None]:
df['LotFrontage'].isnull().sum()

0

In [None]:
# mode() must be called and returns a Series, so take its first entry
df['Alley'] = df['Alley'].fillna(df['Alley'].mode()[0])

In [None]:
df['Alley'].isnull().sum()

0

In [None]:
df['Street'].isnull().sum()

0

In [None]:
df['LotShape'].isnull().sum()

0

In [None]:
df['LandContour'].isnull().sum()

0

In [None]:
df['Utilities'].isnull().sum()

0

In [None]:
df['PoolArea'].isnull().sum()

0

In [None]:
df['PoolQC'].isnull().sum()

1453

In [None]:
# PoolQC is categorical, so a mean is undefined; fill with the mode instead
df['PoolQC'] = df['PoolQC'].fillna(df['PoolQC'].mode()[0])

In [None]:
df['PoolQC'].isnull().sum()

0

In [None]:
df['Fence'].isnull().sum()

1179

In [None]:
# Fence is categorical, so a mean is undefined; fill with the mode instead
df['Fence'] = df['Fence'].fillna(df['Fence'].mode()[0])

In [None]:
df['Fence'].isnull().sum()

0

In [None]:
df['MiscFeature'].isnull().sum()

1406

In [None]:
# MiscFeature is categorical, so a mean is undefined; fill with the mode instead
df['MiscFeature'] = df['MiscFeature'].fillna(df['MiscFeature'].mode()[0])

In [None]:
df['MiscFeature'].isnull().sum()

0

In [None]:
df['MiscVal'].isnull().sum()

0

In [None]:
df['MoSold'].isnull().sum()

0

In [None]:
df['YrSold'].isnull().sum()

0

In [None]:
df['SaleType'].isnull().sum()

0

In [None]:
df['SaleCondition'].isnull().sum()

0

In [None]:
df['SalePrice'].isnull().sum()

0

# The null values in the columns checked above have now been filled. Because we want our predictions to be as solid as possible, missing values are imputed with simple summary statistics such as the mean, median or mode, chosen according to the nature of each attribute.
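
One way to apply this idea consistently is sketched below: numeric columns get the median and all other columns get the mode. The helper name `impute_simple` and the `toy` data are hypothetical, and the choice of median for numeric columns is an assumption rather than the notebook's own rule.

```python
import numpy as np
import pandas as pd


def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    filled = frame.copy()
    for col in filled.columns:
        if not filled[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            # mode() returns a Series; this assumes at least one non-null value
            filled[col] = filled[col].fillna(filled[col].mode()[0])
    return filled


# Made-up values for illustration only
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 60.0],
    "Fence": [None, "MnPrv", "MnPrv", "GdPrv"],
})

clean = impute_simple(toy)
print(clean)
```

The function works on a copy, so the original frame stays available for comparison.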

# Let's first drop all string columns, as the ML algorithms used here cannot read raw string attributes without encoding


In [None]:
# Let's check all data types
df.dtypes

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 81, dtype: object

In [None]:
df.drop(['MSZoning'], axis=1, inplace=True)

In [None]:
# String columns to drop next: Street, Alley, LotShape, LandContour, Utilities
df.drop(['Street'], axis=1, inplace=True)

In [None]:
df.drop(['Alley'], axis=1, inplace=True)

In [None]:
df.drop(['LotShape'], axis=1, inplace=True)

In [None]:
df.drop(['LandContour'], axis=1, inplace=True)

In [None]:
df.drop(['Utilities'], axis=1, inplace=True)

In [None]:
# String columns to drop next: LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType,
# PoolQC, Fence, MiscFeature, SaleType, SaleCondition
df.drop(['LotConfig'], axis=1, inplace=True)

In [None]:
df.drop(['LandSlope'], axis=1, inplace=True)

In [None]:
df.drop(['Neighborhood'], axis=1, inplace=True)

In [None]:
df.drop(['Condition1'], axis=1, inplace=True)

In [None]:
df.drop(['Condition2'], axis=1, inplace=True)

In [None]:
df.drop(['BldgType'], axis=1, inplace=True)

In [None]:
df.drop(['PoolQC'], axis=1, inplace=True)

In [None]:
df.drop(['Fence'], axis=1, inplace=True)

In [None]:
df.drop(['MiscFeature'], axis=1, inplace=True)

In [None]:
df.drop(['SaleType'], axis=1, inplace=True)

In [None]:
df.drop(['SaleCondition'], axis=1, inplace=True)

In [None]:
df.drop(['HouseStyle'], axis=1, inplace=True)

In [None]:
df.drop(['RoofStyle'], axis=1, inplace=True)

In [None]:
df.drop(['RoofMatl'], axis=1, inplace=True)

In [None]:
df.drop(['Exterior1st'], axis=1, inplace=True)

In [None]:
df.drop(['Exterior2nd'], axis=1, inplace=True)

In [None]:
df.drop(['MasVnrType'], axis=1, inplace=True)

In [None]:
df.drop(['ExterQual'], axis=1, inplace=True)

In [None]:
df.drop(['ExterCond'], axis=1, inplace=True)

In [None]:
df.drop(['Foundation'], axis=1, inplace=True)

In [None]:
df.drop(['BsmtQual'], axis=1, inplace=True)

In [None]:
df.drop(['BsmtCond'], axis=1, inplace=True)

In [None]:
#BsmtExposure
df.drop(['BsmtExposure'], axis=1, inplace=True)

In [None]:
df.drop(['BsmtFinType1'], axis=1, inplace=True)

In [None]:
df.drop(['BsmtFinSF1'], axis=1, inplace=True)

In [None]:
df.drop(['BsmtFinType2'], axis=1, inplace=True)

In [None]:
df.drop(['Heating'], axis=1, inplace=True)

In [None]:
df.drop(['HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'FireplaceQu',
'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'MoSold'], axis=1, inplace=True)

In [None]:
df.drop(['YrSold'], axis=1, inplace=True)

In [None]:
df.drop(['LotFrontage', 'MasVnrArea', 'Functional', 'GarageYrBlt'], axis=1, inplace=True)
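
The string columns dropped one at a time above could also be removed in a single step. The sketch below uses `select_dtypes` on a small made-up frame. Note that it keeps every numeric column, whereas the cells above also drop some numeric ones (for example `MoSold` and `YrSold`), so the result on the real data would not be identical.

```python
import pandas as pd

# Made-up frame with a mix of string and numeric columns
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM"],
    "LotArea": [8450, 9600],
    "Street": ["Pave", "Pave"],
    "SalePrice": [208500, 181500],
})

# Keep only the numeric columns in one step
numeric_only = toy.select_dtypes(include="number")
print(list(numeric_only.columns))
```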

In [None]:
df

Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,...,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,SalePrice
0,1,60,8450,7,5,2003,2003,0,150,856,...,2,548,0,61,0,0,0,0,0,208500
1,2,20,9600,6,8,1976,1976,0,284,1262,...,2,460,298,0,0,0,0,0,0,181500
2,3,60,11250,7,5,2001,2002,0,434,920,...,2,608,0,42,0,0,0,0,0,223500
3,4,70,9550,7,5,1915,1970,0,540,756,...,3,642,0,35,272,0,0,0,0,140000
4,5,60,14260,8,5,2000,2000,0,490,1145,...,3,836,192,84,0,0,0,0,0,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,7917,6,5,1999,2000,0,953,953,...,2,460,0,40,0,0,0,0,0,175000
1456,1457,20,13175,6,6,1978,1988,163,589,1542,...,2,500,349,0,0,0,0,0,0,210000
1457,1458,70,9042,7,9,1941,2006,0,877,1152,...,1,252,0,60,0,0,0,0,2500,266500
1458,1459,20,9717,5,6,1950,1996,1029,0,1078,...,1,240,366,0,112,0,0,0,0,142125


In [None]:
df.dtypes

Id               int64
MSSubClass       int64
LotArea          int64
OverallQual      int64
OverallCond      int64
YearBuilt        int64
YearRemodAdd     int64
BsmtFinSF2       int64
BsmtUnfSF        int64
TotalBsmtSF      int64
1stFlrSF         int64
2ndFlrSF         int64
LowQualFinSF     int64
GrLivArea        int64
BsmtFullBath     int64
BsmtHalfBath     int64
FullBath         int64
HalfBath         int64
BedroomAbvGr     int64
KitchenAbvGr     int64
TotRmsAbvGrd     int64
Fireplaces       int64
GarageCars       int64
GarageArea       int64
WoodDeckSF       int64
OpenPorchSF      int64
EnclosedPorch    int64
3SsnPorch        int64
ScreenPorch      int64
PoolArea         int64
MiscVal          int64
SalePrice        int64
dtype: object

# As seen above, only integer columns remain, so the data can be passed directly to the models

In [None]:
df

Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,...,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,SalePrice
0,1,60,8450,7,5,2003,2003,0,150,856,...,2,548,0,61,0,0,0,0,0,208500
1,2,20,9600,6,8,1976,1976,0,284,1262,...,2,460,298,0,0,0,0,0,0,181500
2,3,60,11250,7,5,2001,2002,0,434,920,...,2,608,0,42,0,0,0,0,0,223500
3,4,70,9550,7,5,1915,1970,0,540,756,...,3,642,0,35,272,0,0,0,0,140000
4,5,60,14260,8,5,2000,2000,0,490,1145,...,3,836,192,84,0,0,0,0,0,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,7917,6,5,1999,2000,0,953,953,...,2,460,0,40,0,0,0,0,0,175000
1456,1457,20,13175,6,6,1978,1988,163,589,1542,...,2,500,349,0,0,0,0,0,0,210000
1457,1458,70,9042,7,9,1941,2006,0,877,1152,...,1,252,0,60,0,0,0,0,2500,266500
1458,1459,20,9717,5,6,1950,1996,1029,0,1078,...,1,240,366,0,112,0,0,0,0,142125


First of all, the plan is to split the dataset on a 70:30 ratio: 70% of the rows for training our LightGBM model and 30% for evaluating it. Note that the cell below does not actually perform this split. It uses the full DataFrame for both training and evaluation, so the scores reported later are measured on rows the models have already seen.

Next, let us get the target variable (y) and the features (X) from the DataFrame. Please note that `SalePrice` (the target) and `GarageCars` are removed from the features.

In [None]:
# Getting the target (y); training and evaluation use the same rows here
train_y = df["SalePrice"].astype(float).values
eval_y = df["SalePrice"].astype(float).values

# Getting the features (X) by dropping the target and GarageCars
train_X = df.drop(['SalePrice', 'GarageCars'], axis=1)
eval_X = df.drop(['SalePrice', 'GarageCars'], axis=1)

In [None]:
from sklearn.model_selection import train_test_split
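
A minimal sketch of the 70:30 split using `train_test_split` follows. It runs on synthetic data so that it is self-contained; the column values and the `random_state` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned housing DataFrame
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "LotArea": rng.integers(5000, 15000, size=100),
    "OverallQual": rng.integers(1, 11, size=100),
})
toy["SalePrice"] = 20 * toy["LotArea"] + 10000 * toy["OverallQual"]

X = toy.drop(columns=["SalePrice"])
y = toy["SalePrice"].astype(float)

# 70% of the rows for training, 30% held out for evaluation
train_X, eval_X, train_y, eval_y = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print(train_X.shape, eval_X.shape)
```

With a split like this, the evaluation rows are never seen during training, so the evaluation scores describe performance on new data.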

Creating a custom function to train the LightGBM model with hyperparameters

In [None]:
def train_lightgbm(train_X, train_y, eval_X, eval_y):
    
    # Initializing the training dataset
    lgtrain = lgb.Dataset(train_X, label=train_y)
    
    # Initializing the evaluation dataset
    lgeval = lgb.Dataset(eval_X, label= eval_y)
    
    # Hyper-parameters for the LightGBM model
    params = {
        "objective" : "regression",
        "metric" : "rmse", 
        "num_leaves" : 30,
        "min_child_samples" : 100,
        "learning_rate" : 0.1,
        "bagging_fraction" : 0.7,
        "feature_fraction" : 0.5,
        "bagging_seed" : 2018,
        "verbosity" : -1
    }
    
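    # Training the LightGBM model
    # (early stopping and logging are passed as callbacks, which newer LightGBM versions require)
    model = lgb.train(params, lgtrain, 1000, valid_sets=[lgeval],
                      callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)])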
    
    # Returning the model
    return model

# Training the model 
model = train_lightgbm(train_X, train_y, eval_X, eval_y)

Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 26436.2
[200]	valid_0's rmse: 23146.3
[300]	valid_0's rmse: 20991.1
[400]	valid_0's rmse: 19104.2
[500]	valid_0's rmse: 17641.2
[600]	valid_0's rmse: 16326.5
[700]	valid_0's rmse: 15146.4
[800]	valid_0's rmse: 14168.6
[900]	valid_0's rmse: 13341.4
[1000]	valid_0's rmse: 12532.6
Did not meet early stopping. Best iteration is:
[1000]	valid_0's rmse: 12532.6
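
The `rmse` values in the log above are root mean squared errors: the square root of the average squared difference between actual and predicted prices, expressed in the same unit as `SalePrice`. A small sketch of the calculation, using made-up prices, is:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up prices: errors of 6000, 8000, 0 and 0
actual = [120000.0, 200000.0, 150000.0, 180000.0]
predicted = [126000.0, 192000.0, 150000.0, 180000.0]

print(rmse(actual, predicted))  # 5000.0
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.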


We've successfully trained our LightGBM model.

Now, let us quickly evaluate the model by making an actual prediction with it. For this, let us select a row of data from our evaluation dataset together with the actual sale price for that row.

In [None]:
# Index of the row to test
index_val = 1400

# Selecting the index value from the evaluation DataFrame
actual_X_value = eval_X.reset_index(drop=True).iloc[index_val]

# Selecting the Sale Price from the target variable array
actual_y_value = eval_y[index_val]

In [None]:
# Printing the feature values
actual_X_value

Id               1401
MSSubClass         50
LotArea          6000
OverallQual         6
OverallCond         7
YearBuilt        1929
YearRemodAdd     1950
BsmtFinSF2          0
BsmtUnfSF         862
TotalBsmtSF       862
1stFlrSF          950
2ndFlrSF          208
LowQualFinSF        0
GrLivArea        1158
BsmtFullBath        0
BsmtHalfBath        0
FullBath            1
HalfBath            0
BedroomAbvGr        3
KitchenAbvGr        1
TotRmsAbvGrd        5
Fireplaces          1
GarageArea        208
WoodDeckSF          0
OpenPorchSF         0
EnclosedPorch     112
3SsnPorch           0
ScreenPorch         0
PoolArea            0
MiscVal             0
Name: 1400, dtype: int64

In [None]:
# Printing the SalePrice
actual_y_value

120000.0

Now, let us check whether our model's prediction is close to the actual sale price.

In [None]:
# Predicting the value (the single row is reshaped to a 2-D array of shape (1, n_features))
predict_price = model.predict(actual_X_value.astype(float).values.reshape(1, -1))



In [None]:
predict_price

array([123319.80822919])
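
To put the prediction in context, the gap between the predicted and actual price can be expressed as an absolute and a percentage error. The short calculation below uses the two values shown above.

```python
# Values taken from the outputs above
actual = 120000.0
predicted = 123319.80822919

abs_error = abs(predicted - actual)
pct_error = abs_error / actual * 100

print(round(abs_error, 2), round(pct_error, 2))  # 3319.81 2.77
```

A single row says little about the model as a whole, so this is only a quick sanity check.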

# Classification reports and accuracy scores do not apply to this LightGBM model because it solves a regression problem, so we rely on the RMSE instead. The RMSE decreased steadily during training, down to about 12,533 after 1,000 rounds, and the test prediction (about 123,320) is close to the actual sale price (120,000). Because the evaluation rows are the same rows used for training, these figures may overstate how well the model performs on unseen data.

# Random Forest

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier

In [None]:
model_RF = RandomForestClassifier()

In [None]:
model_RF.fit(train_X, train_y)

RandomForestClassifier()

In [None]:
predict_RF = model_RF.predict(eval_X)
predict_RF

array([208500., 181500., 223500., ..., 266500., 142125., 147500.])

In [None]:
from sklearn.metrics import mean_absolute_error
# scikit-learn metrics expect (y_true, y_pred)
mean_absolute_error(eval_y, predict_RF)

0.0

In [None]:
from sklearn.metrics import r2_score
# r2_score is not symmetric, so the true values go first
r2_score(eval_y, predict_RF)

1.0

In [None]:
from sklearn.model_selection import KFold, cross_val_score, StratifiedKFold

In [None]:
cv1 = KFold(n_splits=10, random_state=12,shuffle= True)

In [None]:
# evaluate the model with cross validation
scores = cross_val_score(model_RF, train_X, train_y, scoring='accuracy', cv=cv1, n_jobs=-1)
scores

array([0.00684932, 0.00684932, 0.02054795, 0.        , 0.04109589,
       0.        , 0.01369863, 0.02054795, 0.02054795, 0.00684932])

In [None]:
from statistics import mean, stdev
# report performance
print('Accuracy: %.3f(%.3f)'% (mean(scores), stdev(scores)))

Accuracy: 0.014(0.013)


In [None]:
accuracy_score(eval_y, predict_RF)

1.0

In [None]:
# Let's tune the hyperparameters of our RFC model with a randomized search
# Random Search
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
random_search = {'criterion': ['entropy', 'gini'],
 'max_depth': list(np.linspace(5, 1200, 10, dtype = int)) + [None],
 'max_features': ['auto', 'sqrt','log2', None],
 'min_samples_leaf': [4, 6, 8, 12],
 'min_samples_split': [3, 7, 10, 14],
 'n_estimators': list(np.linspace(5, 1200, 3, dtype = int))}
clf = RandomForestClassifier()
model_R = RandomizedSearchCV(estimator = clf, param_distributions = random_search, 
 cv = 4, verbose= 5, random_state= 101, n_jobs = -1)
model_R.fit(train_X,train_y)
model_R.best_params_

Fitting 4 folds for each of 10 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  34 out of  40 | elapsed:  4.7min remaining:   50.2s
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  5.4min finished


{'n_estimators': 1200,
 'min_samples_split': 3,
 'min_samples_leaf': 6,
 'max_features': 'log2',
 'max_depth': 668,
 'criterion': 'gini'}

In [None]:
predict_R = model_R.predict(eval_X)
predict_R

array([208500., 181500., 223500., ..., 266500., 142125., 147500.])

In [None]:
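# Note: r2_score expects (y_true, y_pred); the value below was computed with the arguments in the order shown
r2_score(predict_R, eval_y)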

0.9997303629775582

In [None]:
accuracy_score(eval_y, predict_R)

0.9842465753424657

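# Here we can see that the Random Forest model fits the evaluation data almost perfectly, as the accuracy and R² scores show, but those rows are the same ones it was trained on. The 10-fold cross-validation accuracy of about 0.014 shows that it does not yet generalize: a classifier treats every distinct sale price as its own class, so exact matches on unseen rows are rare. The model needs more training and ETL processing before it is checked in other environments.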
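
Since `SalePrice` is a continuous value, one option worth trying is `RandomForestRegressor` in place of the classifier, scored with a regression metric under cross-validation. The sketch below is not the notebook's method; it runs on synthetic data, and the feature weights, noise level and `n_estimators` are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the housing features and prices
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 50000 + 20000 * X[:, 0] + 5000 * X[:, 1] + rng.normal(0, 1000, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=12)

# scikit-learn reports MAE as a negative score, so flip the sign
mae_scores = -cross_val_score(
    model, X, y, scoring="neg_mean_absolute_error", cv=cv
)
print(mae_scores.mean())
```

Each fold is scored on rows that were left out of training, so the mean absolute error here reflects performance on unseen data.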

# Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
model_LR = LinearRegression()

In [None]:
model_LR.fit(train_X,train_y)

LinearRegression()

In [None]:
predict_LR = model_LR.predict(eval_X)
predict_LR

array([229236.61027232, 195616.95746954, 226980.37888147, ...,
       233701.68905113, 133071.83485446, 159470.00219185])

In [None]:
mean_absolute_error(eval_y, predict_LR)

21032.343683129395

In [None]:
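# Note: r2_score expects (y_true, y_pred); the value below was computed with the arguments in the order shown
r2_score(predict_LR, eval_y)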

0.760762122785377
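
The order of arguments matters for `r2_score`, which is defined as `r2_score(y_true, y_pred)`. The small made-up example below shows that swapping the two arguments changes the result.

```python
from sklearn.metrics import r2_score

# Made-up values for illustration
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 3.0, 3.0]

correct_order = r2_score(y_true, y_pred)
swapped_order = r2_score(y_pred, y_true)

print(correct_order, swapped_order)
```

The residuals are the same in both calls, but the total variance in the denominator comes from whichever array is passed first, so the two scores differ.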

In [None]:
from sklearn.model_selection import StratifiedKFold
skfold = StratifiedKFold(n_splits=3, random_state=100, shuffle=True)
model_skfold = LinearRegression()
# For a regressor, cross_val_score defaults to the R^2 score, not accuracy
results_skfold = cross_val_score(model_skfold, train_X, train_y, cv=skfold)
print("Mean R^2: %.2f%%" %(results_skfold.mean()*100.0))

Mean R^2: 74.82%




We can conclude the following from this small evaluation of the Linear Regression model:

1. The model has been trained and can produce a sale price prediction for a new property.

2. The model is not yet accurate: its mean absolute error is about 21,032, and the scores reported above are roughly 0.75 to 0.76.

# Conclusion 


Some things that can be done to increase model accuracy are as follows:

- Do not drop any of the columns and start with the unoptimized dataset. Then, individually go through all of the columns and only drop columns that are not helpful to the model.

- Engineer new features from the dataset based on the available data fields.

- Change the LightGBM model's hyper-parameters.

- Use another Machine Learning model or create an ensemble of Machine Learning algorithms for getting better results.

- Use K-Fold Cross Validation instead of simple data splitting for model evaluation.

- ... and much more. Research!

- Of the three models, Random Forest scored best on the evaluation data used here, so it is our suggested predictor of sale price.
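
As an illustration of the feature engineering suggestion in the list above, the sketch below derives two new columns from existing fields. The feature names `TotalSF` and `HouseAge` are hypothetical, and the floor-area values in `toy` are made up; the column names themselves come from the dataset.

```python
import pandas as pd

# Made-up rows using column names from the dataset
toy = pd.DataFrame({
    "TotalBsmtSF": [856, 1262],
    "1stFlrSF": [856, 1262],
    "2ndFlrSF": [854, 0],
    "YearBuilt": [2003, 1976],
    "YrSold": [2008, 2007],
})

# Total floor area across basement and both floors
toy["TotalSF"] = toy["TotalBsmtSF"] + toy["1stFlrSF"] + toy["2ndFlrSF"]

# Age of the house at the time of sale
toy["HouseAge"] = toy["YrSold"] - toy["YearBuilt"]

print(toy[["TotalSF", "HouseAge"]])
```

`HouseAge` needs `YrSold`, so that column would have to be kept until the new feature has been computed.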