# Predict Home Value using IBM Watson Machine Learning and Db2 on Cloud

In this notebook, we will show you how to create a machine learning model to predict home sale prices. Along the way, we will use IBM's Db2 on Cloud database to store and retrieve the data for this exercise. We will go through the entire data science process of creating a model, which includes: Importing Data, Preprocessing Data, Data Exploration, Data Visualization and Data Modeling. We will try fitting our data with different models and see which one gives us the best results. At the end of this notebook, we will deploy our model using Watson Machine Learning so it can be used in an application to predict the sale price of new homes. Hope you guys have fun!

### Learning Goals

The learning goals of this notebook are:

    1. Load data and create a dataframe by connecting to IBM Db2 on Cloud
    2. Explore and visualize data
    3. Create scikit-learn models
    4. Train different models and evaluate which is the best model to use
    5. Persist a model in a Watson Machine Learning repository
    6. Predict home value using the model


In [None]:
!pip install --upgrade ibmos2spark
!pip install --upgrade watson-machine-learning-client

import numpy as np
import pandas as pd

# 1. Importing Data

The first step as a data scientist is to import our data from our data source. In a real-world application, your data is often too big to be stored locally on your computer, so it will most likely be stored in the cloud or in some other remote system. For the purposes of this notebook, our data is stored on IBM Db2 on Cloud. In order to retrieve data from there, we are going to import IBM's Python package `ibmdbpy`, which allows Python users to import data from IBM's databases. Our data is stored in schema `SKP44849` and the table name is `HOME_SALES`.
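
As a rough sketch of what the connection cell could look like, the helper below builds a JDBC connection string and reads the table through `ibmdbpy`'s `IdaDataBase` and `IdaDataFrame`. The function names `build_jdbc_dsn` and `load_home_sales`, the port `50000`, the database name `BLUDB` and the connection-string layout are assumptions for illustration; take the real values from your own Db2 on Cloud service credentials.

```python
def build_jdbc_dsn(host, port, database, user, password):
    """Build a JDBC-style connection string (layout assumed, check your credentials)."""
    return "jdbc:db2://{}:{}/{}:user={};password={}".format(
        host, port, database, user, password)


def load_home_sales(dsn, table="SKP44849.HOME_SALES"):
    """Read a Db2 table into a pandas DataFrame through ibmdbpy."""
    # Imported here so the helper above can be used without ibmdbpy installed
    from ibmdbpy import IdaDataBase, IdaDataFrame

    idadb = IdaDataBase(dsn=dsn)             # open the connection
    try:
        ida_df = IdaDataFrame(idadb, table)  # handle to the remote table
        return ida_df.as_dataframe()         # download the rows into pandas
    finally:
        idadb.close()


# Hypothetical placeholders: replace with your own service credentials
dsn = build_jdbc_dsn("<hostname>", 50000, "BLUDB", "<username>", "<password>")
# pd_df = load_home_sales(dsn)
```

The last line is commented out because it needs a reachable database; once the placeholders are filled in, it would produce the `pd_df` dataframe used in the rest of the notebook.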

In [None]:
# Connect to datasource


By the end of this step, we should have our data imported from the data source and stored in memory, so we can use it to create our model. In the next step, we are going to explore our data to get it ready for modeling.

# 2. Data Exploration

In this step, we are going to explore our data in order to gain insight. We hope to be able to form some assumptions about our data before we start modeling.

In [None]:
pd_df.describe()

The count, mean, min and max rows are self-explanatory. The std row shows the standard deviation, and the 25%, 50% and 75% rows show the corresponding percentiles.

In [None]:
# Minimum price of the data
minimum_price = np.amin(pd_df['SALEPRICE'])

# Maximum price of the data
maximum_price = np.amax(pd_df['SALEPRICE'])

# Mean price of the data
mean_price = np.mean(pd_df['SALEPRICE'])

# Median price of the data
median_price = np.median(pd_df['SALEPRICE'])

# Standard deviation of prices of the data
std_price = np.std(pd_df['SALEPRICE'])

# Show the calculated statistics
print("Statistics for housing dataset:\n")
print("Minimum price: ${}".format(minimum_price)) 
print("Maximum price: ${}".format(maximum_price))
print("Mean price: ${}".format(mean_price))
print("Median price: ${}".format(median_price))
print("Standard deviation of prices: ${}".format(std_price))

# 3. Data Visualization

As a data scientist, it is important to make assumptions and hypothesize about our data as we continue to explore it. Some assumptions that we can make about the data are:

1. Homes with more rooms will naturally be worth more. Usually homes with more rooms are bigger and can fit more people, so it is reasonable that they cost more money.
2. Homes that have recently been built will cost more, since they are newer and probably have a better design compared to older houses.
3. Having a garage will also increase the price of the house, and the price will increase further as the number of cars the garage can hold increases.
4. House style is usually a matter of personal taste for the buyer, so it shouldn't have that much impact on the sale price of the home.

These are just a few of the assumptions we can make so far about our data. As we move into visualizing our data, we hope to see patterns that are hard to notice just by looking at the numbers.
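
One simple way to check an assumption like the third one is to group the rows by garage capacity and compare the median sale price of each group. The sketch below uses a handful of made-up rows rather than the real `HOME_SALES` table; only the column names `GARAGECARS` and `SALEPRICE` come from this notebook.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from HOME_SALES
sample = pd.DataFrame({
    "GARAGECARS": [0, 1, 1, 2, 2, 3],
    "SALEPRICE": [90000, 120000, 130000, 180000, 200000, 310000],
})

# Median sale price for each garage capacity
median_by_garage = sample.groupby("GARAGECARS")["SALEPRICE"].median()
print(median_by_garage)

# True when the median price never drops as garage capacity grows
print(median_by_garage.is_monotonic_increasing)
```

Running the same `groupby` on `pd_df` would show whether the real data supports the assumption.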

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

pd_df.hist(bins=50, figsize=(30,25))
plt.show()

Some of the distributions are a little skewed, but most of them look roughly normal. This is common for this kind of data.

The variable we are going to predict is SALEPRICE. Let's look at how much each independent variable correlates with this dependent variable.

In [None]:
corr_matrix = pd_df.corr()
corr_matrix["SALEPRICE"].sort_values(ascending=False)

SALEPRICE tends to increase as GARAGECARS and FULLBATH increase. We can also see a negative correlation between SALEPRICE and FOUNDATION and a couple of other attributes. Finally, coefficients close to zero indicate that there is no linear correlation. However, we see hardly any coefficients close to zero, which suggests that most of the attributes have at least some linear relationship with SALEPRICE, the attribute we are predicting.
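
Keep in mind that a coefficient near zero only rules out a linear relationship. The toy example below (made-up numbers, not from the dataset) has `y` fully determined by `x`, yet the Pearson coefficient is essentially zero because the relationship is quadratic.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2  # fully determined by x, but not linearly

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]
print("Pearson r: %.4f" % r)
```

This is one reason why the histograms and other plots are worth looking at alongside the correlation table.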

# 4. Modeling

Now that we have explored our data, we are ready to build our model that will predict the attribute `SALEPRICE`. One of the hardest parts of the process is determining which model to use for a particular problem. However, since we are using Python's machine learning library [scikit-learn](https://scikit-learn.org/stable/), we will be able to build and test different models quickly and determine which one is the best to use. We will be building three models:


1. [Linear Regression Model](https://towardsdatascience.com/linear-regression-using-python-b136c91bf0a2)
2. [Random Forest Model](https://www.distilnetworks.com/glossary/term/random-forest-model/)
3. [Gradient Boosting](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)

IMPORTANT NOTE: THE METRIC RESULTS MAY BE A LITTLE DIFFERENT FROM RUN TO RUN SINCE WE ARE SHUFFLING THE DATA WHEN WE CALL train_test_split()

## 4a. Splitting Our Data 
Before we can build our model, we need to split our data into test and train data. We will also be shuffling our data so that the order of the rows does not bias the model, since any such bias would lower the accuracy of our model on new data.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Separate the target from the features; ID is only a row identifier
y = pd_df['SALEPRICE']
pd_df = pd_df.drop(['SALEPRICE', 'ID'], axis=1)

# Label-encode every column (refit per column) so the models receive integers
le = LabelEncoder()
X_2 = pd_df.apply(le.fit_transform)

# One-hot encode the label-encoded features. The result is only inspected here;
# the models below are trained on the label-encoded frame X_2.
enc = OneHotEncoder(handle_unknown='ignore')
onehotlabels = enc.fit_transform(X_2).toarray()
onehotlabels.shape

x = X_2

# Shuffle and split into training and test sets (the default test size is 25%)
x_train, x_test, y_train, y_test = train_test_split(x, y)
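
If you want the metrics to be repeatable between runs, `train_test_split` accepts a `random_state` argument that fixes the shuffle. The sketch below shows the effect on toy arrays; the seed `42` and the 20% test size are arbitrary choices for illustration, not values used by the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration: 10 rows, 2 features
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

# The same random_state yields the same shuffle on every call
a_train, a_test, _, _ = train_test_split(features, target, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(features, target, test_size=0.2, random_state=42)

print(np.array_equal(a_test, b_test))
print(a_train.shape, a_test.shape)
```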


## 4b. Developing The Model with Linear Regression 

The first model that we are going to build is a linear regression model. This is one of the simplest models to implement, and it gives us a baseline to compare the more complex models against. We will be importing the `LinearRegression` class from the `sklearn` library.

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression(fit_intercept=False)
linear_regression_model = regressor.fit(x_train, y_train)

Now that we have built our model, let's use the model to predict the home sales value.

In [None]:
y_pred = regressor.predict(x_test)

Let's take a look at the predictions our model gave us

In [None]:
print('Predictions: \n', y_pred)

Let's also take a look at the coefficients for this model

In [None]:
# The coefficients
print('Coefficients: \n', regressor.coef_)
# R squared on the test set
print('Linear Regression R squared: %.4f' % regressor.score(x_test, y_test))



Let's calculate root-mean-square error (RMSE) and see if we can gain more information about our model.

In [None]:
from sklearn.metrics import mean_squared_error
# scikit-learn metrics take the true values first, then the predictions
lin_mse = mean_squared_error(y_test, y_pred)
lin_rmse = np.sqrt(lin_mse)
print('Linear Regression RMSE: %.4f' % lin_rmse)

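In our run, the RMSE was $43101.9709. RMSE is the square root of the average squared error, so it describes the typical size of the prediction error on the test set. It is not a bound on the error of every individual prediction.

Let's also calculate the mean absolute error (MAE), which averages the absolute errors and is less sensitive to a few large misses.
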
In [None]:
from sklearn.metrics import mean_absolute_error
lin_mae = mean_absolute_error(y_test, y_pred)
print('Linear Regression MAE: %.4f' % lin_mae)
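
To make the difference between the two metrics concrete, here is a small worked example with made-up prices (not from the dataset), computed directly with NumPy. A single large miss pulls the RMSE up more than the MAE.

```python
import numpy as np

# Made-up actual and predicted prices
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 300.0, 440.0])

errors = predicted - actual              # [10, -10, 0, 40]
toy_mae = np.mean(np.abs(errors))        # average absolute error
toy_rmse = np.sqrt(np.mean(errors ** 2)) # root of the average squared error

print("MAE:  %.4f" % toy_mae)
print("RMSE: %.4f" % toy_rmse)
```

Here the MAE is 15 while the RMSE is about 21.2, because squaring gives the error of 40 much more weight.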

## 4c. Developing The Model with Random Forest

Now that we have tried to fit our dataset with a linear regression model, let's try a more complex model and see if our accuracy improves. We will fit a Random Forest model to our data in this section. We will be importing the `RandomForestRegressor` class from the `sklearn` library.

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(x_train, y_train)

In [None]:
print('Random Forest R squared: %.4f' % forest_reg.score(x_test, y_test))

In [None]:
y_pred = forest_reg.predict(x_test)
forest_mse = mean_squared_error(y_test, y_pred)
forest_rmse = np.sqrt(forest_mse)
print('Random Forest RMSE: %.4f' % forest_rmse)

As the metrics show, we are getting much better results than with the Linear Regression model. In our run, the R2 of the Linear Regression model was around 70%, while the Random Forest model reaches 82%. This means that the Random Forest model explains more of the variation in SALEPRICE on the test set. The RMSE has also decreased significantly, which means that the predictions are, on average, closer to the actual sale prices.
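
For reference, the R2 reported by `score()` is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up numbers (not from the dataset) to show what the percentage measures.

```python
import numpy as np

# Made-up actual and predicted prices
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 300.0, 440.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # squared error left after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
r_squared = 1.0 - ss_res / ss_tot

print("R squared: %.4f" % r_squared)
```

An R2 of 1 would mean perfect predictions, while 0 would mean the model does no better than always predicting the mean price.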

Let's try to fit our dataset to one more model and see if we can improve our metrics.

## 4d. Developing The Model with Gradient Boosting

The last model we are going to fit to our data is Gradient Boosting. We will be importing the `GradientBoostingRegressor` class from the `sklearn` library.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor()
model.fit(x_train, y_train)

In [None]:
print('Gradient Boosting R squared: %.4f' % model.score(x_test, y_test))

In [None]:
y_pred = model.predict(x_test)
model_mse = mean_squared_error(y_test, y_pred)
model_rmse = np.sqrt(model_mse)
print('Gradient Boosting RMSE: %.4f' % model_rmse)

In our run, the metrics are significantly better than those of the Linear Regression model, but not as good as those of the Random Forest model.
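
When comparing several models, it can help to train them in a loop on the same split and collect one metric per model. The sketch below does this on synthetic data from `make_regression`, which stands in for the home sales data; the sample size, noise level, seeds and `n_estimators=50` are arbitrary assumptions, and the ranking on synthetic data says nothing about the ranking on `HOME_SALES`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the home sales data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Fit every candidate on the same training split and record its test RMSE
results = {}
for name, estimator in candidates.items():
    estimator.fit(Xd_train, yd_train)
    rmse = np.sqrt(mean_squared_error(yd_test, estimator.predict(Xd_test)))
    results[name] = rmse

# Print the models from lowest to highest RMSE
for name, rmse in sorted(results.items(), key=lambda item: item[1]):
    print("%-18s RMSE: %.4f" % (name, rmse))
```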

# 5. Deploying Model

From our evaluation, we can see that our Random Forest model performed best out of the models we trained. For this reason, we are going to use that model for deployment. Below are the instructions to deploy our model.

In [None]:
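from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

# Split the raw (unencoded) features so the pipeline handles all preprocessing
x_train, x_test, y_train, y_test = train_test_split(pd_df, y)

# Boolean masks selecting the categorical (object) and numerical columns
categorical_feature_mask = (pd_df.dtypes == object)
numerical_features = ~categorical_feature_mask

# Impute and scale numerical columns; one-hot encode categorical columns.
# handle_unknown='ignore' keeps prediction from failing on categories
# that did not appear in the training split.
preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(), StandardScaler()), numerical_features),
    (OneHotEncoder(handle_unknown='ignore'), categorical_feature_mask))

# Chain preprocessing and the random forest into a single deployable model
model = make_pipeline(preprocess, RandomForestRegressor())

model.fit(x_train, y_train)

model.predict(x_test)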

In [None]:
# Replace the placeholders with the credentials from your Watson Machine Learning service
from watson_machine_learning_client import WatsonMachineLearningAPIClient
wml_credentials = {
    "apikey": "<api key>",
    "instance_id": "<instance id>",
    "url": "<URL>"
}
client = WatsonMachineLearningAPIClient(wml_credentials)

In [None]:
runtimes_meta = {
    client.runtimes.ConfigurationMetaNames.NAME: "Home_Sale_Model", 
    client.runtimes.ConfigurationMetaNames.DESCRIPTION: "Home Sale Model hype", 
    client.runtimes.ConfigurationMetaNames.PLATFORM: { "name": "python", "version": "3.6" }, 
}
runtime_details = client.runtimes.store(runtimes_meta)
runtime_details
runtime_url = client.runtimes.get_url(runtime_details)
runtime_uid = client.runtimes.get_uid(runtime_details)
print("Runtimes URL: " + runtime_url)
print("Runtimes UID: " + runtime_uid)

In [None]:
model_props = {client.repository.ModelMetaNames.NAME: "Home Sale Model hype",
               client.repository.ModelMetaNames.RUNTIME_UID: runtime_uid
              }
published_model = client.repository.store_model(model=model, meta_props=model_props)
import json
published_model_uid = client.repository.get_model_uid(published_model)
model_details = client.repository.get_details(published_model_uid)
print(json.dumps(model_details, indent=2))


In [None]:
created_deployment = client.deployments.create(published_model_uid, name="Home_Sale_Model")

# 6. Predict using the deployed model

Get the URL that is to be used for prediction. The prediction URL is obtained from the deployment details of the deployment created above.

In [None]:
scoring_endpoint = client.deployments.get_scoring_url(created_deployment)
print(scoring_endpoint)
x_train.iloc[0].values

Prepare the payload for prediction. The payload contains the input records for which predictions are to be made.

In [None]:



scoring_payload = {'fields': ['LOTAREA', 'BLDGTYPE', 'HOUSESTYLE', 'OVERALLCOND', 'YEARBUILT',
       'ROOFSTYLE', 'EXTERCOND', 'FOUNDATION', 'BSMTCOND', 'HEATING',
       'HEATINGQC', 'CENTRALAIR', 'ELECTRICAL', 'FULLBATH', 'HALFBATH',
       'BEDROOMABVGR', 'KITCHENABVGR', 'KITCHENQUAL', 'TOTRMSABVGRD',
       'FIREPLACES', 'FIREPLACEQU', 'GARAGETYPE', 'GARAGEFINISH', 'GARAGECARS',
       'GARAGECOND', 'POOLAREA', 'POOLQC', 'FENCE', 'MOSOLD', 'YRSOLD' ], 
                   'values': [[9000, '1Fam', '2Story', 9, 1920, 'Hip', 'Gd', 'PConc', 'TA',
       'GasA', 'Ex', 'Y', 'SBrkr', 1, 0, 3, 1, 'TA', 7, 0, 'NA', 'Detchd',
       'Unf', 2, 'TA', 0, 'NA', 'NA', 7, 2009]]}
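
Instead of typing the payload by hand, it could also be built from a dataframe row. The helper `build_scoring_payload` below is a hypothetical convenience function, not part of the Watson Machine Learning client, and the two-column frame is made up; with the real data you would pass `x_train` or a frame with the same columns.

```python
import json
import pandas as pd


def build_scoring_payload(df, row_index=0):
    """Turn one DataFrame row into a {'fields': [...], 'values': [[...]]} payload."""
    row = df.iloc[row_index]
    # Convert NumPy scalars to plain Python types so the payload is JSON-serializable
    values = [v.item() if hasattr(v, "item") else v for v in row.tolist()]
    return {"fields": list(df.columns), "values": [values]}


# Made-up two-column frame for illustration
demo = pd.DataFrame({"LOTAREA": [9000], "BLDGTYPE": ["1Fam"]})
payload = build_scoring_payload(demo)
print(json.dumps(payload))
```

Building the payload this way keeps the field names and the value order in sync with the columns the model was trained on.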

Execute the method to perform online predictions and display the prediction results.

In [None]:
predictions = client.deployments.score(scoring_endpoint, scoring_payload)

In [None]:
print(json.dumps(predictions, indent=2))

# 7. Conclusion

At the end of our modeling step, we had built three models: `Linear Regression`, `Random Forest` and `Gradient Boosting`. The `Random Forest` model gave us the best results of the three, so it did the best job of describing our dataset.

Hopefully you were able to experience the data science process through these steps, and learned how to use IBM's Db2 on Cloud database as a data source for future machine learning projects and IBM's Watson Studio as a platform to build models. See you guys next time!