**Exploratory Data Analysis**

To start exploring the data, I will first import the libraries I need.

In [1]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
%matplotlib inline

**House Data Info**

The dataframe that I will be working with is a house data set. The first step is to read in the CSV file.

In [2]:
# Read the dataset into a data table using Pandas and show the first 5 rows
house_df = pd.read_csv('Data/house_data_set.csv')
house_df.head()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_type,garage_sqft,carport_sqft,has_fireplace,has_pool,has_central_heating,has_central_cooling,house_number,street_name,unit_number,city,zip_code,sale_price
0,1978,1,4,1,1,1689,1859,attached,508,0,True,False,True,True,42670,Lopez Crossing,,Hallfort,10907,270897.0
1,1958,1,3,1,1,1984,2002,attached,462,0,True,False,True,True,5194,Gardner Park,,Hallfort,10907,302404.0
2,2002,1,3,2,0,1581,1578,none,0,625,False,False,True,True,4366,Harding Islands,,Lake Christinaport,11203,2519996.0
3,2004,1,4,2,0,1829,2277,attached,479,0,True,False,True,True,3302,Michelle Highway,,Lake Christinaport,11203,197193.0
4,2006,1,4,2,0,1580,1749,attached,430,0,True,False,True,True,582,Jacob Cape,,Lake Christinaport,11203,207897.0


I will get additional information on the dataframe to see whether it has any null values and what data type each column holds.

In [3]:
#Basic info on data
house_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42703 entries, 0 to 42702
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   year_built           42703 non-null  int64  
 1   stories              42703 non-null  int64  
 2   num_bedrooms         42703 non-null  int64  
 3   full_bathrooms       42703 non-null  int64  
 4   half_bathrooms       42703 non-null  int64  
 5   livable_sqft         42703 non-null  int64  
 6   total_sqft           42703 non-null  int64  
 7   garage_type          42703 non-null  object 
 8   garage_sqft          42703 non-null  int64  
 9   carport_sqft         42703 non-null  int64  
 10  has_fireplace        42703 non-null  bool   
 11  has_pool             42703 non-null  bool   
 12  has_central_heating  42703 non-null  bool   
 13  has_central_cooling  42703 non-null  bool   
 14  house_number         42703 non-null  int64  
 15  street_name          42703 non-null 

In [4]:
house_df.describe()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_sqft,carport_sqft,house_number,unit_number,zip_code,sale_price
count,42703.0,42703.0,42703.0,42703.0,42703.0,42703.0,42703.0,42703.0,42703.0,42703.0,3088.0,42703.0,42703.0
mean,1990.993209,1.365759,3.209283,1.923659,0.527153,1987.758986,2127.155446,455.8498,41.656324,18211.767347,2027.395402,11030.991476,413507.1
std,19.199987,0.513602,1.043396,0.759699,0.499268,846.76627,922.807342,243.453463,168.715867,27457.109993,1141.38377,573.576228,318549.7
min,1852.0,0.0,0.0,0.0,0.0,-3.0,5.0,-4.0,0.0,0.0,3.0,10004.0,626.0
25%,1980.0,1.0,3.0,1.0,0.0,1380.0,1466.0,412.0,0.0,674.0,1063.0,10537.0,270899.0
50%,1994.0,1.0,3.0,2.0,1.0,1808.0,1937.0,464.0,0.0,4530.0,2033.0,11071.0,378001.0
75%,2005.0,2.0,4.0,2.0,1.0,2486.0,2640.0,606.0,0.0,24844.5,2921.0,11510.0,497697.0
max,2017.0,4.0,31.0,8.0,1.0,12406.0,15449.0,8318.0,9200.0,99971.0,3998.0,11989.0,21042000.0


This data set has 42,703 entries and 19 house features (not counting sale price). I will be using these features in my model to help predict the house sale price.

**Data Cleaning & Preparation**

In this section, I will start by cleaning the house dataframe before drawing any conclusions. It will help inspect the data better and get a more accurate general understanding of the data at hand.

**Step 1:** Check all columns in the dataframe

In [5]:
house_df.columns

Index(['year_built', 'stories', 'num_bedrooms', 'full_bathrooms',
       'half_bathrooms', 'livable_sqft', 'total_sqft', 'garage_type',
       'garage_sqft', 'carport_sqft', 'has_fireplace', 'has_pool',
       'has_central_heating', 'has_central_cooling', 'house_number',
       'street_name', 'unit_number', 'city', 'zip_code', 'sale_price'],
      dtype='object')

Now that I have seen all the columns and what features the dataframe contains, I want to select only the features that will be relevant and useful to my model. In this next section I will be preparing the desired features.

**Step 2:** Remove unwanted columns

Some features will not be useful in my model, so I will delete the columns that I won't need.

The house number isn't going to be useful to include in the model: it's not likely that anyone buys a house because of the street number assigned to it, which is effectively an arbitrary number. So I will drop this field from the model.
The same goes for unit number. As for the street name, city, and zip code columns, the location of a house has a big influence on its value, so we need to include at least some of this information in the model. However, these columns provide overlapping information. For example, if we know the zip code of a house, we already know what city it's in, so I don't need to include both city and zip code in my model. I will keep city and drop zip code, and I will also drop street name.
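The claim that zip code already determines city can be checked rather than assumed. The following is a minimal sketch on a toy frame that mirrors the first rows shown earlier (the name `toy_df` and the four rows are illustrative assumptions, not the full data set): it counts how many distinct cities appear under each zip code.

```python
import pandas as pd

# Toy rows for illustration only; the real check would run on house_df
toy_df = pd.DataFrame({
    "zip_code": [10907, 10907, 11203, 11203],
    "city": ["Hallfort", "Hallfort", "Lake Christinaport", "Lake Christinaport"],
})

# Number of distinct cities observed for each zip code
cities_per_zip = toy_df.groupby("zip_code")["city"].nunique()

# True when every zip code maps to exactly one city
zip_determines_city = bool((cities_per_zip == 1).all())
print(zip_determines_city)
```

If the same check on the full dataframe found a zip code spanning more than one city, the two columns would not be fully redundant.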

In [6]:
# There are 4 columns that I will remove since they will not be included in my model
house_df.drop(['house_number', 'street_name', 'unit_number', 'zip_code'], axis=1, inplace=True)
house_df.head()

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_type,garage_sqft,carport_sqft,has_fireplace,has_pool,has_central_heating,has_central_cooling,city,sale_price
0,1978,1,4,1,1,1689,1859,attached,508,0,True,False,True,True,Hallfort,270897.0
1,1958,1,3,1,1,1984,2002,attached,462,0,True,False,True,True,Hallfort,302404.0
2,2002,1,3,2,0,1581,1578,none,0,625,False,False,True,True,Lake Christinaport,2519996.0
3,2004,1,4,2,0,1829,2277,attached,479,0,True,False,True,True,Lake Christinaport,197193.0
4,2006,1,4,2,0,1580,1749,attached,430,0,True,False,True,True,Lake Christinaport,207897.0


**Step 3:** Check for any duplicates

In [7]:
house_df.duplicated().value_counts()

False    42703
dtype: int64

There are no duplicates, so we're good to proceed.
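For reference, here is a small sketch of what the duplicate check would look like if repeated rows were present. The frame `toy_df` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame where the third row repeats the first one
toy_df = pd.DataFrame({
    "livable_sqft": [1689, 1984, 1689],
    "sale_price": [270897.0, 302404.0, 270897.0],
})

# duplicated() marks a row True when an identical row appeared earlier
dup_mask = toy_df.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence of each repeated row
deduped_df = toy_df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduped_df))
```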

**Step 4:** Check if the features columns need any preparation

The columns with true and false values will be fine to use in my model since they'll be treated as one or zero automatically, so no extra work is needed to prepare them.

I will apply one-hot encoding to the columns with categorical values so that I can use them in my model: garage_type and city. One-hot encoding is the process of creating dummy variables for categorical variables. For every distinct value of a categorical feature, a new numerical column is created that is 1 when the row has that value and 0 otherwise.
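To make the idea concrete, here is a minimal sketch of one-hot encoding on a toy frame (the four rows and the name `toy_df` are illustrative assumptions). Each distinct garage type becomes its own column, and every row has exactly one of those columns switched on.

```python
import pandas as pd

# Toy frame with one categorical column and one numeric column
toy_df = pd.DataFrame({
    "garage_type": ["attached", "none", "detached", "attached"],
    "garage_sqft": [508, 0, 300, 462],
})

# One new column per distinct garage_type value; the original column is removed
encoded_df = pd.get_dummies(toy_df, columns=["garage_type"])
print(encoded_df.columns.tolist())
```

Depending on the pandas version, the new columns hold 0/1 integers or True/False booleans; both carry the same information.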

In [8]:
house_df['city'].value_counts()

Chadstad                4962
Coletown                3739
Jeffreyhaven            2981
North Erinville         2868
Port Andrealand         2669
Hallfort                2448
Lewishaven              2271
South Anthony           1849
Lake Jack               1831
Davidfort               1703
Lake Dariusborough      1441
West Ann                1397
East Lucas              1359
Port Jonathanborough    1344
Scottberg               1009
Lake Christinaport       833
East Amychester          792
Joshuafurt               745
West Lydia               709
Morrisport               654
Lake Carolyn             637
West Gregoryview         615
Wendybury                587
Amystad                  561
Port Adamtown            416
Richardport              297
Jenniferberg             275
Justinport               272
East Janiceville         248
Brownport                209
Clarkberg                174
West Gerald              151
West Brittanyview        120
East Justin              112
West Terrence 

In [9]:
house_df['garage_type'].value_counts()

attached    34079
none         5912
detached     2712
Name: garage_type, dtype: int64

In [10]:
# Replace categorical data with one-hot encoded data
features_df = pd.get_dummies(house_df, columns=['garage_type', 'city'])
features_df

Unnamed: 0,year_built,stories,num_bedrooms,full_bathrooms,half_bathrooms,livable_sqft,total_sqft,garage_sqft,carport_sqft,has_fireplace,...,city_South Anthony,city_South Stevenfurt,city_Toddshire,city_Wendybury,city_West Ann,city_West Brittanyview,city_West Gerald,city_West Gregoryview,city_West Lydia,city_West Terrence
0,1978,1,4,1,1,1689,1859,508,0,True,...,0,0,0,0,0,0,0,0,0,0
1,1958,1,3,1,1,1984,2002,462,0,True,...,0,0,0,0,0,0,0,0,0,0
2,2002,1,3,2,0,1581,1578,0,625,False,...,0,0,0,0,0,0,0,0,0,0
3,2004,1,4,2,0,1829,2277,479,0,True,...,0,0,0,0,0,0,0,0,0,0
4,2006,1,4,2,0,1580,1749,430,0,True,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42698,1982,1,1,1,0,591,627,0,200,False,...,0,0,0,0,0,0,0,0,0,0
42699,1983,1,1,1,0,592,624,0,204,False,...,0,0,0,0,0,0,0,0,0,0
42700,1983,1,1,1,0,594,618,0,197,False,...,0,0,0,0,0,0,0,0,0,0
42701,1981,1,3,2,0,1398,1401,401,0,False,...,0,0,0,0,0,0,0,0,0,0


In [11]:
features_df.columns

Index(['year_built', 'stories', 'num_bedrooms', 'full_bathrooms',
       'half_bathrooms', 'livable_sqft', 'total_sqft', 'garage_sqft',
       'carport_sqft', 'has_fireplace', 'has_pool', 'has_central_heating',
       'has_central_cooling', 'sale_price', 'garage_type_attached',
       'garage_type_detached', 'garage_type_none', 'city_Amystad',
       'city_Brownport', 'city_Chadstad', 'city_Clarkberg', 'city_Coletown',
       'city_Davidfort', 'city_Davidtown', 'city_East Amychester',
       'city_East Janiceville', 'city_East Justin', 'city_East Lucas',
       'city_Fosterberg', 'city_Hallfort', 'city_Jeffreyhaven',
       'city_Jenniferberg', 'city_Joshuafurt', 'city_Julieberg',
       'city_Justinport', 'city_Lake Carolyn', 'city_Lake Christinaport',
       'city_Lake Dariusborough', 'city_Lake Jack', 'city_Lake Jennifer',
       'city_Leahview', 'city_Lewishaven', 'city_Martinezfort',
       'city_Morrisport', 'city_New Michele', 'city_New Robinton',
       'city_North Erinville', 

The target variable that I'm trying to predict is the house's sale price. The other features in my data set will be used as the predictors of the target variable.

Therefore, I will need to separate the sale price column from the features data frame so the ML model doesn't see the sale price in the input data.

In [12]:
# Remove the sale price from the feature data
features_df.drop(['sale_price'], axis = 1, inplace=True)
features_df.columns

Index(['year_built', 'stories', 'num_bedrooms', 'full_bathrooms',
       'half_bathrooms', 'livable_sqft', 'total_sqft', 'garage_sqft',
       'carport_sqft', 'has_fireplace', 'has_pool', 'has_central_heating',
       'has_central_cooling', 'garage_type_attached', 'garage_type_detached',
       'garage_type_none', 'city_Amystad', 'city_Brownport', 'city_Chadstad',
       'city_Clarkberg', 'city_Coletown', 'city_Davidfort', 'city_Davidtown',
       'city_East Amychester', 'city_East Janiceville', 'city_East Justin',
       'city_East Lucas', 'city_Fosterberg', 'city_Hallfort',
       'city_Jeffreyhaven', 'city_Jenniferberg', 'city_Joshuafurt',
       'city_Julieberg', 'city_Justinport', 'city_Lake Carolyn',
       'city_Lake Christinaport', 'city_Lake Dariusborough', 'city_Lake Jack',
       'city_Lake Jennifer', 'city_Leahview', 'city_Lewishaven',
       'city_Martinezfort', 'city_Morrisport', 'city_New Michele',
       'city_New Robinton', 'city_North Erinville', 'city_Port Adamtown',

I will create X and y arrays.

X represents the input features (house features, independent variables).

y represents the expected output to predict (house sale price, dependent variable).

I will create both as NumPy arrays, because scikit-learn works with them directly and vectorized NumPy operations run faster than row-by-row Python code.

In [13]:
# The X array will be the contents of my features dataframe.
# I call the to_numpy method so the data is a NumPy array and not a pandas dataframe
X = features_df.to_numpy()

# the y array will be the sales price column from my original dataset
y = house_df['sale_price'].to_numpy()
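One caveat worth knowing: when a dataframe mixes integer and boolean columns, `to_numpy()` may return an array with the generic `object` dtype rather than a numeric one, depending on the pandas version. scikit-learn generally converts such arrays itself, but casting to float first makes the result explicit. The following is a small sketch on a toy frame (`toy_features` and `X_toy` are illustrative names, not part of the analysis above).

```python
import numpy as np
import pandas as pd

# Toy frame mixing an integer column and a boolean column
toy_features = pd.DataFrame({
    "livable_sqft": [1689, 1984],
    "has_pool": [False, True],
})

# Cast every column to float first so the matrix has a single numeric dtype;
# True becomes 1.0 and False becomes 0.0
X_toy = toy_features.astype(float).to_numpy()
print(X_toy.dtype)
```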

**Train & Test Data**

In this section, I will perform a train-test split. When performing a train-test split, it is important that the data is **randomly** split. Another thing to consider is just **how big** each training and testing set should be. I will split the data into a training data set and a test data set, using scikit-learn to do this in one line of code. I will use 70% of the data for the training set and 30% for the testing set.

Training on one set and testing on another lets us estimate the accuracy of the model and check that it actually learned general rules for predicting house prices, rather than memorizing the training data.

In [14]:
# This command shuffles all of our data so it's in a random order, and then splits it into two groups.
# The test_size=0.3 parameter tells it we want to keep 70% of the data for training and 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
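Because the split is random, each run of the cell above produces a different partition and therefore slightly different error figures. A minimal sketch of how the split could be made repeatable, using toy arrays and an arbitrary seed (`random_state=42` is an assumption for illustration, not a value used in this analysis):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Fixing random_state makes the shuffle, and so the split, reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)
print(X_tr.shape, X_te.shape)
```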

**Creating and Training ML model**

For this machine learning model, I will be using the scikit-learn gradient boosting regressor. Creating it takes just one call!

I will need to set the hyper-parameters that control how the gradient boosting regressor model will run. 

**Gradient boosting** is an ensemble learning algorithm, meaning that it uses many simple machine learning models that work together to give a more accurate answer than any individual model could by itself. The basic building block of gradient boosting is the decision tree: the algorithm creates a sequence of decision trees, each one built to correct the errors of the trees before it.
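The idea of trees that build on each other can be sketched by hand. The following is a simplified illustration, assuming a squared-error loss and synthetic data; it is not how scikit-learn's regressor is implemented internally. Each small tree is fitted to the errors left by the trees before it, and its prediction is added in, scaled by a learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature data: a noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
n_trees = 50

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_toy, y_toy.mean())
initial_mae = np.mean(np.abs(y_toy - prediction))

trees = []
for _ in range(n_trees):
    residuals = y_toy - prediction              # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)                  # each tree learns the remaining error
    prediction += learning_rate * tree.predict(X_toy)
    trees.append(tree)

final_mae = np.mean(np.abs(y_toy - prediction))
print(initial_mae, final_mae)
```

The training error shrinks as trees are added, which is the behaviour the hyper-parameters below control.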



n_estimators: determines how many decision trees to build. Higher numbers usually fit the data more closely but require more time to train the model.

learning_rate: controls how much each additional decision tree influences the overall prediction. Lower rates usually lead to higher accuracy, but only if n_estimators is set to a high value.

max_depth: controls how many layers deep each individual decision tree can be.

min_samples_leaf: the minimum number of training samples that must end up in each leaf of a decision tree. A split is only allowed if it leaves at least this many samples on each side, so the tree does not make decisions based on just a handful of houses.

max_features: the fraction of features in the model that we randomly choose to consider each time we create a branch in our decision tree.

loss: controls how scikit-learn calculates the error rate or cost as it learns.
    huber: a combination of the squared error and the absolute error. It is quadratic for small errors and linear for large ones, which makes it less sensitive to outliers.
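As a rough illustration of how the Huber loss blends the two error types, here is a minimal sketch of the textbook formula. The function name `huber_loss` and the fixed threshold `delta` are assumptions for this example; in scikit-learn the threshold is not set directly but derived from the residuals through the `alpha` parameter.

```python
import numpy as np

def huber_loss(residual, delta):
    """Textbook Huber loss: quadratic below delta, linear above it."""
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2            # squared-error branch
    linear = delta * (abs_r - 0.5 * delta)     # absolute-error branch
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([0.5, 1.0, 10.0])
losses = huber_loss(residuals, delta=1.0)
print(losses)
```

The large residual of 10 contributes 9.5 instead of the 50 it would under half squared error, which is why a single badly mispredicted house has less pull on the model.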

In [15]:
#Fit regression model
model = ensemble.GradientBoostingRegressor(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=6,
    min_samples_leaf=9,
    max_features=0.1,
    loss='huber')

In [16]:
# next I tell the model to train using the training data set by calling scikit-learn's fit function on the model.
model.fit(X_train, y_train)

GradientBoostingRegressor(loss='huber', max_depth=6, max_features=0.1,
                          min_samples_leaf=9, n_estimators=1000)
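The hyper-parameter values above were set by hand. `GridSearchCV`, which is imported at the top of the notebook, is one way to compare candidate values with cross-validation. The following is only a sketch on synthetic data with a deliberately tiny grid; the grid values, `cv=3`, and the toy data set are assumptions for illustration, not settings used in this analysis.

```python
from sklearn import ensemble
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem so the search runs quickly
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Candidate values to try for two of the hyper-parameters
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 4],
}

# Score each combination by (negated) mean absolute error with 3-fold cross-validation
search = GridSearchCV(
    ensemble.GradientBoostingRegressor(loss="huber", random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

On the full house data set, a search like this would take far longer, since every combination is trained once per fold.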

Now that the ML model is trained, the next step is to measure how well it performs.

I will use the **mean absolute error** (MAE) to check the accuracy of my model's predictions.
The mean absolute error expresses how far off the model's predictions are on average.
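A small worked example may help show what the metric measures. The three prices below are made up for illustration. The MAE is the mean of the absolute differences between true and predicted values, so individual errors can be well above or below it.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true and predicted sale prices
y_true = np.array([200000.0, 300000.0, 400000.0])
y_pred = np.array([210000.0, 280000.0, 460000.0])

# Absolute errors are 10,000, 20,000 and 60,000
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)
largest_error = np.max(np.abs(y_true - y_pred))
print(manual_mae, sklearn_mae, largest_error)
```

Here the MAE is 30,000 while the worst single prediction is off by 60,000.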

In [17]:
# Mean absolute error on the training set
mae = mean_absolute_error(y_train, model.predict(X_train))
print("Training Set Mean Absolute Error: %.4f" % mae)

Training Set Mean Absolute Error: 48464.0062


In [18]:
# Mean absolute error on the test set
mae = mean_absolute_error(y_test, model.predict(X_test))
print("Test Set Mean Absolute Error: %.4f" % mae)

Test Set Mean Absolute Error: 59321.4621


For the training set, the MAE is 48,464 dollars. This means that, on average, the model's prediction for a house in the training data set is off by about $48,000. The MAE is an average, not a bound: some predictions are much closer to the real price than that, and others are further off.

For the test set, the MAE was a bit higher at $59,321. This means that the model still works for houses it has never seen before, but not quite as well as for the training houses.

Given the wide range of sale prices in the data set (the summary statistics show a minimum of $626 and a maximum of about $21 million), that is quite good.
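To put the error in proportion, it can be compared with the typical sale price. The arithmetic below uses only figures reported earlier in this notebook: the two MAE values, and the mean and median sale price from the `describe()` output.

```python
# Figures copied from the outputs above
test_mae = 59321.4621
train_mae = 48464.0062
mean_sale_price = 413507.1
median_sale_price = 378001.0

# Test error as a share of the typical sale price
mae_share_of_mean = test_mae / mean_sale_price
mae_share_of_median = test_mae / median_sale_price

# How much worse the model does on unseen houses
train_test_gap = test_mae - train_mae
print(f"{mae_share_of_mean:.1%}, {mae_share_of_median:.1%}, {train_test_gap:,.0f}")
```

The test error comes to roughly 14% of the mean sale price and roughly 16% of the median, and the model is about $10,900 worse on unseen houses than on the training houses.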
