# California Housing Prices Data Set

Build a regression model that predicts median house values. See https://www.kaggle.com/camnugent/california-housing-prices for a description of the data set.



## Import packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics

import seaborn as sns
sns.set()

## Load and inspect data set

In [None]:
# Fetch the file
my_file = project.get_file("housing.csv")

# Read the CSV data file from the object storage into a pandas DataFrame
my_file.seek(0)
original_data = pd.read_csv(my_file)

original_data.head()

In [None]:
original_data.describe(include='all') # descriptive statistics for all columns

In [None]:
original_data.isnull().sum() # check for null values

In [None]:
original_data[original_data.duplicated(keep=False)] # check for duplicate rows

There are no duplicates, but "total_bedrooms" has missing values. Decide how to handle these null values:

In [None]:
data_wo_null = # your code
data_wo_null.isnull().sum() # check

In [None]:
# Sample solution: drop every row that contains a null value
data_wo_null = original_data.dropna(axis=0)
data_wo_null.isnull().sum()
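
Dropping rows is only one way to handle the null values. A minimal sketch of an alternative, median imputation, is shown here on a made-up toy frame (the values and the name `toy` are illustrative assumptions, not part of the housing data):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the housing data; the numbers are invented for illustration
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Fill missing values with the column median instead of dropping the rows
median_bedrooms = toy["total_bedrooms"].median()
toy_imputed = toy.fillna({"total_bedrooms": median_bedrooms})

print(toy_imputed.isnull().sum())  # no nulls left, and no rows lost
```

Imputation keeps all rows, at the cost of inventing values. To avoid leaking information from the test set, the median would ideally be computed on the training data only.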

## Select predictors

Create a correlation map:

In [None]:
# your code

In [None]:
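# Sample solution
f, ax = plt.subplots(figsize=(18, 18))
# restrict to numeric columns: ocean_proximity holds strings and cannot be correlated
sns.heatmap(data_wo_null.select_dtypes(include='number').corr(), annot=True, linewidths=0.5, fmt=".1f", ax=ax)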
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Correlation Map')
plt.show()
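
A heatmap can be hard to read when the goal is simply to rank candidate predictors. The following sketch, run on synthetic data (the columns `noise` and the generated values are assumptions for illustration), sorts features by the absolute value of their correlation with the target:

```python
import numpy as np
import pandas as pd

# Synthetic data: one informative feature, one pure-noise feature
rng = np.random.default_rng(0)
n = 200
income = rng.normal(size=n)
toy = pd.DataFrame({
    "median_income": income,
    "noise": rng.normal(size=n),
    "median_house_value": 3.0 * income + rng.normal(scale=0.5, size=n),
})

# Correlation of every numeric feature with the target, strongest first
corr_with_target = (
    toy.select_dtypes(include="number")
    .corr()["median_house_value"]
    .drop("median_house_value")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Features that correlate strongly with each other (for example room and bedroom counts) are candidates for removal, while features that correlate strongly with the target are worth keeping.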

Remove redundant features: 

In [None]:
data_reduced_features = data_wo_null[['<your feature 1>', '<your feature 2>','...']]
data_reduced_features.head()

In [None]:
# Sample solution
# either keep both longitude and latitude or remove both
data_reduced_features = data_wo_null[['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'median_income',
                                      'median_house_value', 'ocean_proximity']]
data_reduced_features.head()

## Remove outliers
The next step is to detect outliers and handle them:

In [None]:
data_reduced_features.hist(figsize=(25,25), bins=50)

In [None]:
q = # your code

data_reduced_features_2 = # your code

In [None]:
# Sample solution: trims the top 1% of total_rooms only; outliers in other columns could be removed as well
q = data_reduced_features['total_rooms'].quantile(0.99)
data_reduced_features_2 = data_reduced_features[data_reduced_features['total_rooms']<q]
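
The same quantile cut can be applied to several columns at once. This is a sketch on synthetic, right-skewed data; the choice of columns in `columns_to_trim` and the 0.99 threshold are assumptions you would adapt after looking at the histograms:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed columns standing in for the housing features
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "total_rooms": rng.lognormal(mean=7.5, sigma=0.7, size=1000),
    "median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000),
})

columns_to_trim = ["total_rooms", "median_income"]  # hypothetical choice
upper_limits = toy[columns_to_trim].quantile(0.99)  # one limit per column

# Keep only rows that lie below the limit in every listed column
mask = (toy[columns_to_trim] < upper_limits).all(axis=1)
toy_trimmed = toy[mask]

print(len(toy), "->", len(toy_trimmed))
```

Each additional column removes up to another 1% of the rows, so it is worth checking how much data remains after trimming.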

## Prepare data for modeling

Get dummies since there is a categorical feature:

In [None]:
dummies = # your code
dummies.head()

In [None]:
# Sample solution
dummies = pd.get_dummies(data_reduced_features_2, drop_first=True)
dummies.head()
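
To see what `drop_first=True` does, compare both variants on a small made-up frame (the rows below are illustrative, not taken from the data set):

```python
import pandas as pd

toy = pd.DataFrame({
    "median_income": [8.3, 3.8, 5.6, 2.1],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "<1H OCEAN"],
})

# One indicator column per category
all_levels = pd.get_dummies(toy)
# The first category (in sorted order) is dropped and becomes the baseline
reduced = pd.get_dummies(toy, drop_first=True)

print(list(all_levels.columns))
print(list(reduced.columns))
```

With all indicator columns present, they always sum to 1, which makes them perfectly collinear with the intercept of a linear model. Dropping one category avoids this; its effect is absorbed into the intercept.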

Set X and y (predictors and target) according to your dataframe:

In [None]:
target = dummies['<your target column>']
predictors = # your code

In [None]:
# Sample solution
target = dummies['median_house_value']
predictors = dummies.drop(['median_house_value'], axis=1)

Split data into training and test sets: 

In [None]:
X_train, X_test, y_train, y_test = # your code

In [None]:
# Sample solution
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, random_state=123) # 80-20 split into training and test data

Use StandardScaler to scale your predictors:

In [None]:
scaler = # your code 
# your code

X_train = # your code
X_test = # your code

In [None]:
# Sample solution: fit the scaler on the training data only, then apply it to both sets

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
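
The scaler is fitted on the training data only so that no information from the test set leaks into preprocessing. A small sketch on random data (all values here are synthetic assumptions) shows the consequence: the training set ends up with mean 0 and standard deviation 1 per column, while the test set is only approximately standardized:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic predictors with a non-zero mean and a large spread
rng = np.random.default_rng(2)
X = rng.normal(loc=50.0, scale=10.0, size=(500, 3))
y = rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=123)

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print("train:", X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
print("test: ", X_te_scaled.mean(axis=0).round(3), X_te_scaled.std(axis=0).round(3))
```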

## Regression model and evaluation

Create a linear regression model: 

In [None]:
# your code

In [None]:
# Sample solution

reg = LinearRegression()
reg.fit(X_train,y_train)


In [None]:
print('training performance')
print(reg.score(X_train,y_train))
print('test performance')
print(reg.score(X_test,y_test))

In [None]:
y_pred = reg.predict(X_test)
# put predictions and actual values side by side; drop the original row index
test = pd.DataFrame({'Predicted': y_pred, 'Actual': y_test}).reset_index(drop=True)
fig = plt.figure(figsize=(16, 8))
plt.plot(test[:50])
plt.legend(['Predicted', 'Actual'])  # must follow the column order of the DataFrame
sns.jointplot(x='Actual', y='Predicted', data=test, kind='reg');

Interpret the result and feel free to try out further analyses.

# Sample solution
Training and test performance are similar, so there is no sign of overfitting, and a simple linear regression model seems adequate for this data.
See https://people.duke.edu/~rnau/rsquared.htm if you want to learn more about evaluating regression models.
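
As one possible further analysis, `reg.score` reports R², which has no unit. Error metrics in the unit of the target are often easier to interpret. The sketch below uses synthetic data and a separately fitted model (both are assumptions for illustration) to compute MAE and RMSE with `sklearn.metrics`:

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic target on a house-price-like scale
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2))
y = 100_000 + 40_000 * X[:, 0] - 15_000 * X[:, 1] + rng.normal(scale=20_000, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=123)
model = LinearRegression().fit(X_tr, y_tr)
y_hat = model.predict(X_te)

mae = metrics.mean_absolute_error(y_te, y_hat)           # average absolute error
rmse = np.sqrt(metrics.mean_squared_error(y_te, y_hat))  # penalizes large errors more
r2 = metrics.r2_score(y_te, y_hat)                       # same value as model.score

print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R^2:  {r2:.3f}")
```

In the notebook, the same three calls can be applied to `y_test` and `y_pred`.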

# Deployment

Deploy the linear regression model via the _Watson Machine Learning_ (WML) service on IBM Cloud. Please refer to the documentation for more details about the [watson-machine-learning-client](https://pypi.org/project/watson-machine-learning-client/) or the [REST API](https://watson-ml-api.mybluemix.net/).

In [None]:
# import the Watson Machine Learning Python client library
from watson_machine_learning_client import WatsonMachineLearningAPIClient

In [None]:
# fill in your credentials
wml_credentials = {
    # your code
}

In [None]:
# instantiate the client
wml_client = WatsonMachineLearningAPIClient(wml_credentials)

In [None]:
# review stored artifacts
wml_client.repository.list()

In [None]:
# delete artifacts that are no longer needed (note that the number of stored artifacts in the free version is limited to 5)
wml_client.repository.delete("...GUID...")

In [None]:
# fill in metadata for your deployment
metadata = {
        wml_client.repository.ModelMetaNames.NAME: 'Housing Prices Deployment',
        wml_client.repository.ModelMetaNames.DESCRIPTION: 'Test deployment for the California housing prices data set.',
        wml_client.repository.ModelMetaNames.AUTHOR_NAME: 'Your Name'
}

In [None]:
# store the (scikit-learn) model in WML
stored_model = wml_client.repository.store_model(reg, meta_props=metadata)

In [None]:
# get the id of the stored model
published_model_uid = wml_client.repository.get_model_uid(stored_model)

In [None]:
# create deployment and fetch scoring endpoint
created_deployment = wml_client.deployments.create(published_model_uid, name="Housing Prices Deployment")
scoring_endpoint = wml_client.deployments.get_scoring_url(created_deployment)

## Deployment validation

Use the stored deployment to make a prediction.

In [None]:
# review test data
X_test[0:2]

In [None]:
y_test[0:2]

In [None]:
# create scoring payload
scoring_values = X_test[0:2].tolist()
scoring_payload = {"values": scoring_values}
print(scoring_payload)

In [None]:
# run prediction
predictions = wml_client.deployments.score(scoring_endpoint, scoring_payload)
print(predictions)

Do the results match your expectations? Are the estimates accurate?

In [None]:
# use the local model to make the same prediction in your notebook and compare the results
reg.predict(X_test[0:2])
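
Note that `reg` was trained on scaled predictors, so it only gives sensible results for inputs that have already been passed through the fitted scaler. One way to make a model accept raw feature values is to bundle both steps in a scikit-learn `Pipeline`. The following is a sketch on synthetic data; the step names `scale` and `reg` are arbitrary, and whether you store the pipeline instead of the bare model in your deployment is a design choice, not something this notebook prescribes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic raw (unscaled) predictors on very different scales
rng = np.random.default_rng(4)
X_raw = rng.normal(loc=[35.0, 2500.0], scale=[2.0, 800.0], size=(300, 2))
y = 50_000 + 3_000 * X_raw[:, 0] + 20 * X_raw[:, 1] + rng.normal(scale=5_000, size=300)

# Scaling and regression as one estimator that takes raw inputs
pipe = Pipeline([("scale", StandardScaler()), ("reg", LinearRegression())])
pipe.fit(X_raw, y)

# The manual two-step variant, for comparison
scaler = StandardScaler().fit(X_raw)
reg_only = LinearRegression().fit(scaler.transform(X_raw), y)

same = np.allclose(
    pipe.predict(X_raw[:2]),
    reg_only.predict(scaler.transform(X_raw[:2])),
)
print(same)
```

Both variants produce the same predictions; the pipeline simply removes the need for the caller to remember the scaling step.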