# Introduction
This is a Jupyter notebook containing the code for the artificial intelligence course for leaders. A Jupyter notebook is a series of cells; executing a cell runs the code in it. In this course we will explore a data set by plotting it with matplotlib. We will then build supervised predictive models: first linear regression models, then an artificial neural network. Throughout, we follow the machine learning pipeline outlined in the theory part of the course. 

Run the following cell if the notebook is opened in Google Colab. It clones the GitHub repository to get all necessary files. To run a cell, select it and press the "Run" button in the menu.

In [None]:
!git clone https://github.com/NordAxon/AI-For-Leaders.git

# Import Libraries
Import all the libraries we need to run the code and perform the analysis. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense

# Reload edited modules automatically
%load_ext autoreload
%autoreload 2

# Show up to 100 columns when displaying data frames
pd.set_option('display.max_columns', 100)

np.random.seed(1)

# 1) Import Raw Data
The data we will be using is a data set of housing sales in Melbourne. Each row is a sale, and the columns are the features of the sold house or apartment. The goal is to predict the price of new sales based on the different features in the data set. 
Data source: https://www.kaggle.com/anthonypino/melbourne-housing-market

In [None]:
housing_df_original = pd.read_csv('AI-For-Leaders/data/melbourne-housing-market/Melbourne_housing.csv')
housing_df_original

In [None]:
# Size of the data frame
housing_df_original.shape

# 2) Pre-Processing

## Fill Empty Data
Some entries in the data are NaN ("Not a Number"), which indicates missing data. Many machine learning algorithms need a value in every row and column, so the NaNs have to be filled with something meaningful, and what is meaningful may differ between columns. For example, "Landsize" will be filled with 0, since a NaN in that case can be assumed to mean that no land is included in the real estate sale. Let's first have a look at how many NaN values are present in each column.

In [None]:
print('Number of NaNs')
print(pd.isnull(housing_df_original).sum())

Let's fill in some of the missing data. 
- Remove rows where there is no price.
- For a number of columns, fill missing values with 0.
- For YearBuilt, fill with the mean of that column, since a house being built in year 0 seems unlikely. 

In [None]:
housing_df_no_nan = housing_df_original.copy()

# Remove all rows with no price data
housing_df_no_nan = housing_df_no_nan[pd.notnull(housing_df_no_nan['Price'])].copy()

# Fill these columns with 0 where data is missing
zero_fill_columns = ['BuildingArea', 'Rooms', 'Landsize', 'Car', 'Bathroom', 'Bedroom2']
housing_df_no_nan[zero_fill_columns] = housing_df_no_nan[zero_fill_columns].fillna(0.0)

# Fill missing build years with the column mean
housing_df_no_nan['YearBuilt'] = housing_df_no_nan['YearBuilt'].fillna(housing_df_no_nan['YearBuilt'].mean())
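
The three fill strategies can be seen in isolation on a small toy data frame. This is a minimal sketch: the values are made up for illustration, and only the column names mirror the housing data.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; column names mirror the housing data
toy = pd.DataFrame({
    'Price': [500000.0, np.nan, 750000.0, 620000.0],
    'Landsize': [200.0, 150.0, np.nan, 0.0],
    'YearBuilt': [1990.0, np.nan, 2010.0, np.nan],
})

# 1) Drop rows without a target value (Price)
toy = toy[toy['Price'].notna()].copy()

# 2) Fill with 0 where a missing value plausibly means "none"
toy['Landsize'] = toy['Landsize'].fillna(0.0)

# 3) Fill with the column mean where 0 would make no sense
toy['YearBuilt'] = toy['YearBuilt'].fillna(toy['YearBuilt'].mean())

print(toy)
```

Note that the mean is computed only from the rows that remain after the rows without a price have been dropped.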

## Plot Variable Correlations and Histograms
- Let's have a look at how different variables relate to each other and how the data is distributed. We pick a couple of columns (variables) we think might be good predictors of price. These columns are plotted as scatter plots against each other, with histograms along the diagonal. 
- The goal is to get a feel for the data, understand how different columns relate to each other and look for outliers.

In [None]:
g = sns.pairplot(housing_df_no_nan[['Price', 'BuildingArea', 'Rooms', 'Landsize', 'Car', 'Bathroom', 'Regionname']], 
                 hue="Regionname", diag_kind='hist')
housing_df_no_nan[['Price', 'BuildingArea', 'Rooms', 'Landsize', 'Car', 'Bathroom', 'Regionname']].describe()

## Remove Outliers
Outliers are data points located far from the rest of the data. They risk skewing the models, so we want to deal with them. The plot above suggests that some columns contain outliers, so we will look more closely at those. Run the cell below to plot price against building area; a few data points are clearly far away from the others. 

In [None]:
housing_df_no_nan.plot.scatter('BuildingArea', 'Price', title='Price vs. Building Area')

We will clip building area at 500, meaning that all values larger than 500 are set to 500. Price is clipped at 4,000,000 in the same way. The data is plotted again after clipping. 

In [None]:
housing_clipped = housing_df_no_nan.copy()
housing_clipped['BuildingArea'] = housing_clipped['BuildingArea'].clip(0, 500)
housing_clipped['Price'] = housing_clipped['Price'].clip(0, 4e6)
housing_clipped.plot.scatter('BuildingArea', 'Price', title='Price vs. Building Area')
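
To make the effect of clipping concrete, here is a small sketch on made-up numbers. It also shows filtering as an alternative: clipping keeps every row but changes the extreme values, while filtering drops the extreme rows entirely. The values are hypothetical.

```python
import pandas as pd

# Hypothetical building areas, including one extreme and one negative value
areas = pd.Series([80.0, 120.0, 450.0, 3000.0, -5.0])

# Clipping: values outside [0, 500] are moved to the nearest bound
clipped = areas.clip(0, 500)
print(clipped.tolist())  # [80.0, 120.0, 450.0, 500.0, 0.0]

# Filtering: rows outside [0, 500] are removed instead
filtered = areas[(areas >= 0) & (areas <= 500)]
print(len(areas), len(clipped), len(filtered))  # 5 5 3
```

Which of the two is preferable depends on whether the extreme rows are believed to be genuine sales or data errors.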

### Assignment 1: 
-  Plot "Rooms" vs "Price" in the following empty cell to see if there are any outliers.
-  If so, what could be a reasonable cut-off?
-  Clip the data set to remove Rooms outliers, i.e. replace the value of the variable rooms_max with a reasonable number.

In [None]:
# ENTER CODE HERE

In [None]:
# Filter out or clip outliers
rooms_max = 200
housing_clipped_r = housing_clipped.copy()
housing_clipped_r['Rooms'] = housing_clipped_r['Rooms'].clip(0, rooms_max)

## Plot Histograms of Interesting Data Columns
Look more closely at some of the variables we think could be interesting by plotting larger histograms.

In [None]:
housing_clipped_r['BuildingArea'].hist(bins=40, figsize=(10,7))
plt.title('Histogram of Building Areas');
# TODO: Make it better

### Assignment 2: Create a histogram of Price in the following cell

In [None]:
# CODE HERE

# Price for Different Regions
We also hypothesise that the property location has an impact on price. The plot below shows the mean price per region, to get a feel for how location might affect the price. 

In [None]:
plt.figure(figsize=(10,7))
housing_clipped_r.groupby('Regionname')['Price'].mean().plot.bar();
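
A mean per group can hide how many sales each group contains and how skewed the prices are. As a hedged sketch, the same `groupby` can report the median and the count alongside the mean. The region names and prices below are invented for illustration.

```python
import pandas as pd

# Hypothetical sales; region names and prices are made up
sales = pd.DataFrame({
    'Regionname': ['North', 'North', 'South', 'South', 'South'],
    'Price': [400000, 600000, 900000, 1000000, 1100000],
})

# Mean, median and number of sales per region
summary = sales.groupby('Regionname')['Price'].agg(['mean', 'median', 'count'])
print(summary)
```

A region with only a handful of sales gives a much less reliable mean than one with thousands.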

### Assignment 3: Plot Price for Different CouncilArea

In [None]:
# CODE HERE

# Simple One Dimensional Linear Regression
We now have a feel for what the data looks like, so we can try a first model for predicting price. Linear regression is a simple but very commonly used model. We pick the building area as the predictor to begin with, since our exploration suggests a correlation between building area and price. 


# 4) Model Training



We'll start off with a single input variable, the building area. Note that the cell below fits the model on the whole cleaned data set rather than on a single council area such as Yarra.
The goal is to find the weights, $w_i$, in the following linear regression model. 
$y = w_0 + w_1x_1$


In [None]:
# Set up input and output variables
y = housing_clipped_r['Price']
x = housing_clipped_r[['BuildingArea']]

# Split into test and train data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

# Set up regression model
regr = linear_model.LinearRegression()

# Train the regression model
regr.fit(x_train, y_train)

# Perform predictions
y_pred = regr.predict(x_test)

# Print regression weights: intercept w0 and slope w1
print(regr.intercept_, regr.coef_)

# Plot results
plt.figure(figsize=(15,10));
plt.plot(x_test, y_pred, 'r');
plt.scatter(x_test.values, y_test.values, alpha=0.1);
plt.title('Simple Linear Regression Model');
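
For a single input variable, the least-squares weights have a closed form: $w_1 = \mathrm{cov}(x, y) / \mathrm{var}(x)$ and $w_0 = \bar{y} - w_1\bar{x}$. The sketch below checks this against scikit-learn on synthetic data. The "true" weights (100000 and 4000) and the noise level are arbitrary assumptions chosen for illustration, not values from the housing data.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data: price = 100000 + 4000 * area + noise (assumed values)
rng = np.random.default_rng(0)
area = rng.uniform(50, 300, size=200)
price = 100000 + 4000 * area + rng.normal(0, 20000, size=200)

# Closed-form least-squares solution for one input variable
w1 = np.cov(area, price, bias=True)[0, 1] / np.var(area)
w0 = price.mean() - w1 * area.mean()

# The same fit with scikit-learn; it expects a 2D input array
regr_check = linear_model.LinearRegression().fit(area.reshape(-1, 1), price)

print('Closed form: ', w0, w1)
print('scikit-learn:', regr_check.intercept_, regr_check.coef_[0])
```

In scikit-learn, `intercept_` holds $w_0$ and `coef_` holds the remaining weights.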

# 5) Model Evaluation
The model's mean absolute error is compared with a baseline: the error obtained when every prediction is simply the mean price of the training data. 

In [None]:
# Evaluate Results
mean_error = (y_pred - y_test).abs().mean()
mean_error_baseline = (y_train.mean() - y_test).abs().mean()
print('\nBaseline Mean Absolute Error: ' + str(mean_error_baseline))
print('Model Mean Absolute Error: ' + str(mean_error))
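
The comparison can be worked through by hand on a few made-up numbers. This is a sketch with hypothetical prices and predictions, small enough to verify the arithmetic.

```python
import numpy as np

# Hypothetical prices and predictions
y_train_toy = np.array([400000.0, 500000.0, 600000.0])
y_test_toy = np.array([450000.0, 700000.0])
y_pred_toy = np.array([480000.0, 650000.0])

# Model error: mean of |prediction - actual| = (30000 + 50000) / 2
mae_model = np.abs(y_pred_toy - y_test_toy).mean()

# Baseline: always predict the training mean (500000)
mae_baseline = np.abs(y_train_toy.mean() - y_test_toy).mean()

print(mae_model)     # 40000.0
print(mae_baseline)  # 125000.0
```

A model is only useful if its error is clearly below the baseline error.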

# Multi-Dimensional Linear Regression
# 4) Model Training
To increase the predictive power, i.e. to get a more accurate model, more information can be given to the model. One way of doing that is to add more input variables. Variables that could be tried are BuildingArea, Rooms, Landsize and Car. <br><br>
$y = w_0 + w_1x_1 + w_2x_2 + \dots$<br>

In [None]:
feature_list = ['BuildingArea', 'Rooms', 'Landsize', 'Car']

In [None]:
x = housing_clipped_r[feature_list]
y = housing_clipped_r['Price']

# Split into test and train data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

# Set up regression model
regr = linear_model.LinearRegression()

# Train the regression model
regr.fit(x_train, y_train)

# Perform predictions
y_pred = regr.predict(x_test)

# Print regression weights: intercept w0, then w1, w2, ... in feature_list order
print('Regression Coefficients, w0, w1, w2, ...')
print(regr.intercept_, regr.coef_)

# 5) Model Evaluation

In [None]:
# Evaluate Results
mean_error = (y_pred - y_test).abs().mean()
print('\nMean Absolute Error Multi-Dimensional Linear Regression: ' + str(mean_error))
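
With several input variables, the weights in `coef_` follow the order of the input columns. The sketch below fits noise-free synthetic data generated from known weights, so the recovered values can be checked directly. The column names `a` and `b` and the weights 10, 2 and 3 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Noise-free synthetic data: y = 10 + 2*a + 3*b (assumed weights)
rng = np.random.default_rng(1)
X = pd.DataFrame({'a': rng.uniform(0, 10, 50), 'b': rng.uniform(0, 5, 50)})
y_toy = 10.0 + 2.0 * X['a'] + 3.0 * X['b']

regr_toy = linear_model.LinearRegression().fit(X, y_toy)

# Pair each weight with the column it belongs to
weights = dict(zip(X.columns, regr_toy.coef_))
print('w0:', regr_toy.intercept_)
print(weights)
```

On real data the weights will not match any "true" values this cleanly, and weights of features on different scales are not directly comparable.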

### Assignment 4: 'Car' is already part of feature_list above. Remove it, or add another column such as 'Bathroom', and see if your results change.

# Neural Network
The code in the following cell transforms data, builds a neural network and evaluates results of predictions from the neural net. 

In [None]:
# Train a small neural network on inputs x and target y, then evaluate it on held-out data

def run_neural_network(x, y):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

    # Scale inputs and target to [0, 1], with one scaler each
    x_scaler = MinMaxScaler()
    y_scaler = MinMaxScaler()
    x_train = x_scaler.fit_transform(x_train)
    x_test = x_scaler.transform(x_test)
    y_train = y_scaler.fit_transform(y_train.values.reshape(-1, 1))
    y_test = y_scaler.transform(y_test.values.reshape(-1, 1))

    # Define the neural network structure
    model = Sequential()
    model.add(Dense(100, input_dim=x_train.shape[1], activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(1, activation='linear'))
    model.compile(loss='mse', optimizer='adam', metrics=['mean_squared_error', 'mean_absolute_error'])

    # Train the model, tracking the error on the test data after each epoch
    history = model.fit(x_train, y_train, epochs=15, verbose=0, validation_data=(x_test, y_test))

    # Predict on the test data
    y_pred = model.predict(x_test)[:,0]

    # Convert predictions and targets back to prices
    y_pred = y_scaler.inverse_transform(y_pred.reshape(-1, 1))
    y_test = y_scaler.inverse_transform(y_test)

    # Evaluate Results
    mean_error = (pd.Series(y_pred[:,0]) - y_test[:,0]).abs().mean()
    print('Mean Absolute Test Error: ' + str(mean_error))

    # Plot error over training time (the loss is the mean squared error)
    plt.figure(figsize=(10,7))
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('Model Loss')
    plt.ylabel('Mean Squared Error')
    plt.xlabel('Epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()
    
run_neural_network(housing_clipped_r[feature_list], housing_clipped_r['Price'])
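
The scaling step inside `run_neural_network` can be looked at in isolation. The sketch below, on made-up prices, shows that `MinMaxScaler` learns its minimum and maximum from the training data only, so test values outside that range end up outside [0, 1], and that `inverse_transform` recovers the original prices.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical prices as column vectors, the shape scikit-learn expects
prices_train = np.array([[200000.0], [600000.0], [1000000.0]])
prices_test = np.array([[400000.0], [1200000.0]])

price_scaler = MinMaxScaler()

# fit_transform learns min and max from the training data only
train_scaled = price_scaler.fit_transform(prices_train)

# transform reuses the training min and max
test_scaled = price_scaler.transform(prices_test)

# inverse_transform maps scaled values back to prices
test_restored = price_scaler.inverse_transform(test_scaled)

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [0.25 1.25]
```

Fitting the scaler on the training data alone keeps information about the test data out of the training step.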

### Assignment 5: Try different numbers of training epochs
-  What happens to the loss with a higher number of training epochs (more training time)?

### Assignment 6: Try different sizes of the network
-  What are the results when more layers are added? 
-  What are the results when more neurons are added to each layer?
-  Why is there a difference in error between train and test data? 

### Assignment 7: Add more features
-  Which columns could be useful for providing more predictive power?

### Questions:
-  Which of the different models performed best? Why?
-  Why does a well-tuned neural network perform better than a linear regression model?
-  What could be done to increase predictive power?
-  Which additional data do you think would make a large difference in predictive power?
-  What was the lowest mean absolute error you got?

# Extra Assignment:
## Add More Features
Let's see how the data can be turned into columns that are easier for a machine learning algorithm to read, and how we can extract more information from the data we have. 
Try to make the predictive model as good as possible by adding more features, such as location and age of the property. 

In [None]:
# One-hot encoding of Region
one_hot_region = pd.get_dummies(housing_clipped_r.Regionname, prefix='Regionname')
housing_df_feature = pd.concat([housing_clipped_r, one_hot_region], axis=1)
one_hot_region
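
One-hot encoding is easiest to see on a tiny example. In this sketch the region names are made up. Each distinct value becomes its own column, and every row has exactly one of those columns set.

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column
toy_sales = pd.DataFrame({
    'Regionname': ['North', 'South', 'North', 'East'],
    'Rooms': [2, 3, 4, 1],
})

# One column per distinct region, named with the given prefix
toy_one_hot = pd.get_dummies(toy_sales['Regionname'], prefix='Regionname')

# Replace the text column with the encoded columns
toy_encoded = pd.concat([toy_sales.drop(columns='Regionname'), toy_one_hot], axis=1)
print(toy_encoded)
```

The text column itself cannot be fed to a regression model or a neural network, but the encoded columns can.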

### Assignment 8: Modify above code to include CouncilArea in addition to Regionname 

### Add house age as a feature
Using the build year of a house directly as a feature is not ideal, since most values lie around 2000. A small relative difference in the feature might correspond to a large difference in actual house value. For example, a house built in 2017 is probably a lot more valuable than a house built in 2007, but the difference between the two years is less than one percent. The relative difference between an age of 1 year and an age of 11 years, on the other hand, is large. We also log-transform the age to make it more convenient for machine learning algorithms. 

In [None]:
# Cap build years at 2018 so that no age is negative
housing_df_feature['YearBuilt'] = housing_df_feature['YearBuilt'].clip(0, 2018)

# Age in years, before the log transform
(2018. - housing_df_feature['YearBuilt']).hist(bins=30)
plt.title('Distribution before logarithm')

# Log-transformed age; the +1 avoids log(0) for houses built in 2018
house_ages = np.log(1 + (2018. - housing_df_feature['YearBuilt']))
housing_df_feature['Age'] = house_ages
plt.figure()
house_ages.hist()
plt.title('Distribution after logarithm')
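
The argument for using age instead of build year can be checked with a little arithmetic. This sketch uses the two example years from the text plus one hypothetical old house (age 118, i.e. built in 1900).

```python
import numpy as np

year_old, year_new = 2007.0, 2017.0

# As raw years, the two houses differ by less than one percent
ratio_year = year_new / year_old

# As ages relative to 2018, they differ by a factor of 11
age_old, age_new = 2018 - year_old, 2018 - year_new
ratio_age = age_old / age_new

# log(1 + age) compresses very old houses; 118 is a hypothetical age
log_ages = np.log1p(np.array([age_new, age_old, 118.0]))

print(ratio_year, ratio_age)
print(log_ages)
```

`np.log1p(x)` computes `log(1 + x)`, the same transform as in the cell above.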

### Try Neural Network with New Features
With the new features added, it is time to train the neural network again. Run the next cell and check the results. 

In [None]:
new_feature_list = ['BuildingArea', 'Rooms', 'Landsize', 'Car', 'Age'] + list(one_hot_region.columns)

run_neural_network(housing_df_feature[new_feature_list], housing_df_feature['Price'])

### Assignment 9
- Check the original data to see if there are other features (columns) which might generate even better predictions if they are part of the model. 
- Write code below to add these new features and train a new network

In [None]:
# CODE HERE