# House Sales in King County, USA - Predictive Analytics Model

The aim of this project is to develop a regression model that predicts house prices before their sale.

The data is drawn from house sales in King County, USA from the year 2014 to 2015.

The dataset has 21 columns (19 house features plus the `id` and `price` columns) and 21,613 rows.
 

# Importing Libraries
These libraries are needed for data analysis, plotting, and the machine learning algorithms used for predictive analytics.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.linear_model import LinearRegression 
from sklearn.model_selection import train_test_split  
from math import sqrt
from sklearn.metrics import mean_squared_error
import mpl_toolkits
from sklearn.ensemble import GradientBoostingRegressor

# Reading the dataset
 More details on the dataset and its columns can be found [here](https://www.kaggle.com/harlfoxem/housesalesprediction).

In [None]:
data=pd.read_csv('../input/kc_house_data.csv')

# Exploratory Data Analysis
Exploratory data analysis (EDA) shows what the data can tell us before any model building or hypothesis testing.

It helps in understanding the dataset and assists in data cleaning by identifying outliers, filling null entries, and guiding feature engineering (the creation of new variables from existing variables).

In [None]:
#Display the first 10 records in the dataset
data.head(10)

In [None]:
#Display a concise summary of the dataset
#There are 21613 records in the dataset. Each column has exactly 21613 entries indicating there are no null entries.

data.info()

In [None]:
#Generate descriptive statistics that summarize the central tendency, dispersion and shape of the dataset’s distribution.

data.describe()

## Plots

### Correlation Matrix
A correlation matrix is a table showing correlation coefficients between variables.

This correlation matrix uses 'price', the value we want to predict, as the target (dependent) variable.

The coefficients indicate how strongly each independent variable is linearly correlated with 'price'.
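
As a minimal sketch of how such coefficients can be read, the snippet below ranks columns by their Pearson correlation with a target. The `toy` frame and its `random_id` column are made-up stand-ins, not the King County data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (all values are invented for illustration).
rng = np.random.default_rng(0)
sqft = rng.normal(2000, 500, 500)
toy = pd.DataFrame({
    "sqft_living": sqft,
    "random_id": rng.normal(0, 1, 500),                 # unrelated to price by construction
    "price": 200 * sqft + rng.normal(0, 50_000, 500),   # price driven mostly by sqft
})

# Pearson correlation of every column with the target, strongest first.
corr_with_price = toy.corr()["price"].drop("price").sort_values(ascending=False)
print(corr_with_price)
```

A coefficient near 1 or -1 points to a strong linear relationship, while a value near 0 (as for `random_id` here) suggests the column carries little linear information about the target.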

In [None]:
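corrmat = data.select_dtypes(include='number').corr() #correlate numeric columns only ('date' is stored as a string)
cols = corrmat.nlargest(21, 'price')['price'].index #up to 21 columns, ordered by correlation with price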
f, ax = plt.subplots(figsize=(18, 10)) #size of matrix
cm = np.corrcoef(data[cols].values.T)
sb.set(font_scale=1.25)
hm = sb.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':12}, yticklabels=cols.values,
                 xticklabels=cols.values)
plt.yticks(rotation=0, size=15)
plt.xticks(rotation=90, size=15)
plt.title("Correlation Matrix",style='oblique', size= 20)

plt.show()

### Scatter Plots
A scatter plot graphs two variables against each other; the pattern of the resulting points reveals any correlation present.

Scatter plots were used here to check for a linear correlation between each independent variable and price.

A line of best fit was also included in some plots to indicate the best linear trend.
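
A line of best fit of this kind can be obtained with a degree-1 least-squares polynomial fit. The sketch below assumes synthetic `x` and `y` values (hypothetical square footage and prices) rather than the notebook's data.

```python
import numpy as np

# Hypothetical data: price grows roughly linearly with square footage.
rng = np.random.default_rng(1)
x = rng.uniform(500, 4000, 200)
y = 150 * x + 50_000 + rng.normal(0, 20_000, 200)

# A degree-1 polyfit returns the least-squares slope and intercept.
slope, intercept = np.polyfit(x, y, 1)
best_fit = np.poly1d([slope, intercept])   # callable line: best_fit(x) = slope * x + intercept

print(slope, intercept, best_fit(2000))
```

Evaluating `best_fit` on the sorted unique x values gives the points needed to draw the line over a scatter plot.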

In [None]:
plt.figure(figsize=(18,10))

plt.scatter(data['grade'],data['price'])
plt.xlabel("Grade")
plt.ylabel("Price")
plt.title("Price v. Grade")


In [None]:
plt.figure(figsize=(18,10))

plt.scatter(data['yr_built'],data['price'])
plt.xlabel("Year Built")
plt.ylabel("Price")
plt.title("Price v. Year Built")

In [None]:
plt.figure(figsize=(18,10))
plt.title("Price v. Bathrooms")

plt.scatter(data['price'],data['bathrooms'])
plt.ylabel("Bathrooms")
plt.xlabel("Price")
plt.plot(np.unique(data['price']), np.poly1d(np.polyfit(data['price'], data['bathrooms'], 1))(np.unique(data['price'])), color='green') #line of best fit



In [None]:
plt.figure(figsize=(18,10))

plt.scatter(data['bedrooms'],data['price'])
plt.ylabel('Price')
plt.xlabel('No. of bedrooms')
plt.title('Bedrooms v. Price')

In [None]:
plt.figure(figsize=(18,10))

plt.scatter(data['sqft_living'],data['price'])
plt.title('Price v. Square footage (living area)')
plt.xlabel('Square Footage')
plt.ylabel("Price")

In [None]:
plt.figure(figsize=(18,10))

plt.scatter(data['price'],data['sqft_above'])
plt.title('Price v. Square footage (above)')
plt.ylabel('Square Footage')
plt.xlabel("Price")
plt.plot(np.unique(data['price']), np.poly1d(np.polyfit(data['price'], data['sqft_above'], 1))(np.unique(data['price'])), color='green') #line of best fit


### Joint Plot

The joint plot below draws a map-like diagram by plotting latitude against longitude, showing how the houses are distributed across the area.

It was also used to show areas with a high density of houses.

In [None]:
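#jointplot creates its own figure, so no separate plt.figure() call is needed
g = sb.jointplot(x=data.lat.values, y=data.long.values, height=12, color='brown') #'size' was renamed to 'height'
g.set_axis_labels('Latitude', 'Longitude', fontsize=12)
g.fig.suptitle("Concentration of Houses by Location")
sb.despine() #despine must be called to take effect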


### Bar graph

The bar graph was used to compare the number of houses for each bedroom count.

As seen, there is a house with 33 bedrooms, which could be a potential outlier.

In [None]:
plt.figure(figsize=(18,10))

data["bedrooms"].value_counts().plot(kind='bar')
plt.title('Count vs bedrooms Bar Graph')
plt.ylabel("Count")
plt.xlabel('Number of bedrooms')

# Identifying and Dropping Outliers
An outlier is an observation point that is distant from other observations.

Outliers can strongly shift the mean and variance of a distribution (the median is far less sensitive), which in turn may distort the predictive model as it tries to fit the training data.
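
To make the effect concrete, the sketch below compares summary statistics with and without one extreme value. The bedroom counts are invented for illustration; only the value 33 echoes the record seen in the bar graph.

```python
import numpy as np

# Invented bedroom counts, then the same sample with one extreme value appended.
bedrooms = np.array([2, 3, 3, 3, 4, 4, 5])
with_outlier = np.append(bedrooms, 33)

print("mean:  ", bedrooms.mean(), "->", with_outlier.mean())
print("median:", np.median(bedrooms), "->", np.median(with_outlier))
```

The mean more than doubles while the median moves by half a bedroom, which is why a model that minimises squared error can be pulled noticeably by a handful of extreme records.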

## Identifying Outliers
Box plots, violin plots, scatter plots, bar graphs and other statistical methods can be used to identify outliers.

Box plots were used in this case because they are a simple, visual way to flag outliers.
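
Box plots conventionally flag points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles, which is also matplotlib's default whisker setting. The helper below, `iqr_outliers`, is a hypothetical name introduced here to sketch that rule on invented bedroom counts.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Invented sample of bedroom counts with one extreme entry.
bedrooms = [2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 33]
print(iqr_outliers(bedrooms))
```

The rule is only a screening aid: a flagged point still needs a judgment call on whether it is a data-entry error or a genuine but unusual house.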

### Box Plots

In [None]:
plt.figure(figsize=(18,10))

plt.boxplot(data['bedrooms'], notch=True, sym='gD') #notched box, outliers drawn as green diamonds

In [None]:
#count number of houses with more than ten bedrooms
data[data['bedrooms']>10].count()

In [None]:
#locate house with 33 bedrooms
data.loc[data['bedrooms'] == 33]

In [None]:
plt.figure(figsize=(18,10))

plt.boxplot(data['price'], notch=True, sym='gD') #notched box, outliers drawn as green diamonds

In [None]:
#count number of houses with prices above 7,000,000
data[data['price']>7000000].count()

In [None]:
#locate houses with a value above 7,000,000
data.loc[data['price'] > 7000000]

## Dropping Outliers

In [None]:
#drop the 33-bedroom record and the few houses priced at 6,000,000 or more
data = data[data.bedrooms != 33]
data = data[data.price < 6000000]

# Probability Distribution and Normalization
A probability distribution describes the values that a random variable may take and how likely each value is.

The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values. In this notebook the term also covers transforming skewed columns so that their distribution is closer to a normal distribution.

A normal distribution is a symmetric distribution in which most of the observations cluster around the central peak, and the mean, median, and mode are approximately equal.

A normal distribution is important because it eases the training of a machine learning model and provides evenness when sampling data.



## Probability distribution

Distribution plots are used to show a probability distribution.
For a normal distribution, both the skewness and the excess kurtosis (the quantity pandas reports with `.kurt()`) are approximately 0.

Skewness is a measure of the asymmetry of the probability distribution of a random variable about its mean.

Kurtosis describes how heavy the tails of a distribution are, that is, how often extreme values occur compared with a normal distribution.
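
Both quantities can be computed from standardised moments. The sketch below does so on a synthetic right-skewed sample (an assumption, not the notebook's data) and compares the result with `scipy.stats.skew` and `scipy.stats.kurtosis` at their default settings.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample standing in for a column such as price.
rng = np.random.default_rng(42)
sample = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)

# Standardise with the population standard deviation (ddof=0).
z = (sample - sample.mean()) / sample.std()

skewness = np.mean(z ** 3)                # third standardised moment
excess_kurtosis = np.mean(z ** 4) - 3.0   # fourth standardised moment minus 3 (the normal value)

print(skewness, stats.skew(sample))
print(excess_kurtosis, stats.kurtosis(sample))
```

pandas' `.skew()` and `.kurt()` apply a small-sample bias correction, so their values differ slightly from these moment-based ones, although the difference is negligible for thousands of rows.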

In [None]:
plt.figure(figsize=(18,10))

from scipy import stats
from scipy.stats import norm
# kernel density plot with a fitted normal curve
sb.distplot(data.condition, fit=norm)
plt.ylabel('Frequency')  # call the function; assigning to plt.ylabel would overwrite it
plt.title('Condition Distribution')
(mu,sigma)= norm.fit(data['condition']);

#QQ plot
plt.figure(figsize=(18,10))
res = stats.probplot(data['condition'], plot=plt)
plt.show()

print("skewness: %f" % data['condition'].skew())
print("kurtosis: %f" % data ['condition'].kurt())

In [None]:
plt.figure(figsize=(18,10))

from scipy import stats
from scipy.stats import norm
# kernel density plot with a fitted normal curve
sb.distplot(data.sqft_above, fit=norm)
plt.ylabel('Frequency')  # call the function rather than assigning to it
plt.title('Square Foot Above Distribution')
(mu,sigma)= norm.fit(data['sqft_above']);

#QQ plot
plt.figure(figsize=(18,10))
res = stats.probplot(data['sqft_above'], plot=plt)
plt.show()

print("skewness: %f" % data['sqft_above'].skew())
print("kurtosis: %f" % data ['sqft_above'].kurt())

In [None]:
plt.figure(figsize=(18,10))

from scipy import stats
from scipy.stats import norm
# kernel density plot with a fitted normal curve
sb.distplot(data.sqft_living15, fit=norm)
plt.ylabel('Frequency')  # call the function rather than assigning to it
plt.title('Square Foot Living(2015) Distribution')
(mu,sigma)= norm.fit(data['sqft_living15']);

#QQ plot
plt.figure(figsize=(18,10))
res = stats.probplot(data['sqft_living15'], plot=plt)
plt.show()

print("skewness: %f" % data['sqft_living15'].skew())
print("kurtosis: %f" % data ['sqft_living15'].kurt())

In [None]:
plt.figure(figsize=(18,10))

from scipy import stats
from scipy.stats import norm
# kernel density plot with a fitted normal curve
sb.distplot(data.sqft_living, fit=norm)
plt.ylabel('Frequency')  # call the function rather than assigning to it
plt.title('Square Foot Living Distribution')
(mu,sigma)= norm.fit(data['sqft_living']);

#QQ plot
plt.figure(figsize=(18,10))
res = stats.probplot(data['sqft_living'], plot=plt)
plt.show()

print("skewness: %f" % data['sqft_living'].skew())
print("kurtosis: %f" % data ['sqft_living'].kurt())

In [None]:
from scipy import stats
from scipy.stats import norm
# kernel density plot with a fitted normal curve
plt.figure(figsize=(18,10))

sb.distplot(data.sqft_lot15, fit=norm)
plt.ylabel('Frequency')  # call the function rather than assigning to it
plt.title('Square Foot Lot(2015) Distribution')
(mu,sigma)= norm.fit(data['sqft_lot15']);

#QQ plot

plt.figure(figsize=(18,10))
res = stats.probplot(data['sqft_lot15'], plot=plt)
plt.show()

print("skewness: %f" % data['sqft_lot15'].skew())
print("kurtosis: %f" % data ['sqft_lot15'].kurt())

In [None]:

from scipy import stats
from scipy.stats import norm
# kernel density plot with a fitted normal curve
plt.figure(figsize=(18,10))

sb.distplot(data.price, fit=norm)
plt.ylabel('Frequency')  # call the function rather than assigning to it
plt.title('Price Distribution')
(mu,sigma)= norm.fit(data['price']);

#QQ plot
plt.figure(figsize=(18,10))
res = stats.probplot(data['price'], plot=plt)
plt.show()


print("skewness: %f" % data['price'].skew())
print("kurtosis: %f" % data ['price'].kurt())

## Normalization

Various data transformation methods such as Box-Cox, arcsine, and log transformations can be used in normalization.

(Natural) log transformation is often used where the data has a positively skewed distribution and therefore was used to normalize columns that were highly skewed.
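
The effect of the transformation can be checked on a synthetic right-skewed series, as in the sketch below. The `prices` values are simulated and stand in for a skewed column; `np.expm1` is shown because it reverses `np.log1p` when predictions need to be mapped back to the original scale.

```python
import numpy as np
import pandas as pd

# Simulated positively skewed "prices" (not the King County data).
rng = np.random.default_rng(7)
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5_000))

log_prices = np.log1p(prices)       # log(1 + x), defined at x = 0 as well
recovered = np.expm1(log_prices)    # exact inverse of log1p

print("skewness before:", prices.skew())
print("skewness after: ", log_prices.skew())
```

The skewness drops towards 0 after the transform, and `recovered` matches the original values up to floating-point error.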



In [None]:
plt.figure(figsize=(18,10))

#log transform the skewed feature (log1p computes log(1 + x))
data["sqft_lot15"] = np.log1p(data["sqft_lot15"])

#Kernel Density plot
sb.distplot(data.sqft_lot15, fit=norm)
plt.ylabel('Frequency')
plt.title('Square Foot Lot(2015) distribution')
#Get the fitted parameters used by the function
(mu,sigma)= norm.fit(data['sqft_lot15']);



#QQ plot
plt.figure(figsize=(18,10))

res = stats.probplot(data['sqft_lot15'], plot=plt)
plt.show()
print("skewness: %f" % data['sqft_lot15'].skew())
print("kurtosis: %f" % data['sqft_lot15'].kurt())

In [None]:
plt.figure(figsize=(18,10))

#log transform the target (log1p computes log(1 + x))
data["price"] = np.log1p(data["price"])

#Kernel Density plot
sb.distplot(data.price, fit=norm)
plt.ylabel('Frequency')
plt.title('Price distribution')
#Get the fitted parameters used by the function
(mu,sigma)= norm.fit(data['price']);
plt.savefig('dist.png')


#QQ plot
plt.figure(figsize=(18,10))
res = stats.probplot(data['price'], plot=plt)
plt.show()
print("skewness: %f" % data['price'].skew())
print("kurtosis: %f" % data['price'].kurt())

# Predictive Analytics

Predictive modeling in machine learning involves selecting the algorithm that builds the most accurate model, the one with the lowest loss, for predicting the dependent variable from the independent variables.

A model is a representation of what an ML system has learned from the training data.

Loss is a measure of how far a model's predictions are from the actual values.

Linear Regression and Gradient Boosting Regression algorithms were both used to build a model, since the value to be predicted is continuous.

## **1. Linear Regression**

In [None]:
#Initialize Linear Regression to a variable reg

reg=LinearRegression()

In [None]:
#Initialize the value to be predicted(label) as price

labels=data['price']

In [None]:
#convert date into a data type the algorithm can read
#since the date column only covers 2014 and 2015, it can be transformed into a binary category with 1 representing 2014 and 0 representing 2015.
#the raw values are strings such as '20141013T000000', so compare the first four characters (the year)

conv_dates = [1 if str(values)[:4] == '2014' else 0 for values in data.date]
data['date']=conv_dates

In [None]:
#drop columns not used in training.
#id, yr_built, condition and long (longitude) are dropped because they have low correlation with the target.
#price is also dropped since it is the target, not one of the independent variables.

train1 = data.drop(['id', 'price','condition','yr_built','long'],axis=1)


### Cross Validation
Cross validation is a method of estimating how well a model will perform on new data by testing the model against one or more non-overlapping data subsets withheld from the training set.

Here its simplest form, a single hold-out split, is used: the dataset is split into a train set and a test set, the model is trained on the train set, and its performance (accuracy and loss) is measured on the test set.
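
For comparison, the sketch below shows k-fold cross validation, where every row is used for testing exactly once. It is not the procedure used in this notebook; the data are synthetic and the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem: three predictors with known coefficients plus noise.
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=300)

# Five non-overlapping folds; each fold serves once as the held-out test set.
cv = KFold(n_splits=5, shuffle=True, random_state=5)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print(scores)
print("mean R^2:", scores.mean())
```

Averaging over folds gives a steadier estimate than one split, at the cost of fitting the model once per fold.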

In [None]:
#70%, 30% train, test split

x_train , x_test , y_train , y_test = train_test_split(train1 , labels , test_size = 0.3,random_state =5)


In [None]:
#Fitting the regression algorithm with data from the train set.
#x_train represents the predictors (independent variables) and y_train represents the target.

reg.fit(x_train,y_train)

In [None]:
#Testing the model: score() returns the coefficient of determination (R^2) on the test set.
acc1=reg.score(x_test,y_test)
print("The accuracy (R^2 score) of the model is: %.2f%%" % (acc1*100))


### RMSE

RMSE (Root Mean Squared Error) is the square root of MSE.

MSE is the average squared loss per example. It is calculated by dividing the sum of the squared losses by the number of examples.

Taking the square root puts the error on the same scale as the target. Because `price` was log-transformed above, the RMSE reported here is in log-price units rather than dollars.
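
The definition can be followed step by step with NumPy. The four `y_true` and `y_pred` values below are made up for illustration and are not model output.

```python
import numpy as np

# Made-up targets and predictions on a log-price-like scale.
y_true = np.array([12.0, 13.0, 12.5, 14.0])
y_pred = np.array([12.5, 12.5, 12.5, 13.0])

squared_errors = (y_true - y_pred) ** 2   # squared loss per example
mse = squared_errors.mean()               # sum of squared losses / number of examples
rmse = np.sqrt(mse)                       # back on the scale of the target

print(mse, rmse)
```

This is the same quantity as `sqrt(mean_squared_error(y_true, y_pred))` from scikit-learn.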

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

In [None]:
y_prediction1 = reg.predict(x_test)

In [None]:
RMSE_lin = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction1))

In [None]:
print(RMSE_lin)

## **2. Gradient Boosting Regressor**


In [None]:
from sklearn.ensemble import GradientBoostingRegressor 

In [None]:
gbr=GradientBoostingRegressor(n_estimators= 400, max_depth = 5, min_samples_split = 2, learning_rate = 0.08, loss = 'squared_error') #'squared_error' was named 'ls' before scikit-learn 1.0

In [None]:
train2 = data.drop(['id', 'price','condition','yr_built','long'],axis=1)

#70%, 30% train, test split

x_train1 , x_test1 , y_train1 , y_test1 = train_test_split(train2 , labels , test_size = 0.3,random_state =5)


In [None]:
gbr.fit(x_train1,y_train1)

In [None]:
acc=gbr.score(x_test1,y_test1)
acc

In [None]:
acc2=("%.2f" % (acc*100))
acc2

In [None]:
print("The accuracy (R^2 score) of the model is: "+str(acc2)+"%")

### **Feature Importance**
Feature importance shows how much each predictor contributes to the model's predictions of the target after training with GBR.

In [None]:
feature_importance = gbr.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5

plt.figure(figsize=(12,6))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, x_train1.columns[sorted_idx]) #columns of the frame gbr was trained on
plt.xlabel('Relative Importance')
plt.title('Feature Importance')
plt.show()

### **RMSE**

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

In [None]:
y_prediction = gbr.predict(x_test1)


In [None]:
RMSE_gbr = sqrt(mean_squared_error(y_true = y_test1, y_pred = y_prediction))

In [None]:
print(RMSE_gbr)