# MultiLayer Perceptron Classification Example


A data set that identifies different types of irises is used to demonstrate the multilayer perceptron in SAP HANA.  This data set is also used in a clustering example, where the objective was to group the flowers into three clusters in the expectation that the clusters would correspond to the three types of iris in the data set.  Since we know the labels (i.e. the types of iris), we can use classification to create a model that predicts the type of flower from the features described below.

## Iris Data Set
The data set used is from the University of California, Irvine (https://archive.ics.uci.edu/ml/datasets/iris). For tutorials use only.  This data set contains attributes of iris plants.  There are three species of iris in the data set.
<table>
<tr><td>Iris Setosa</td><td><img src="images/Iris_setosa.jpg" title="Iris Setosa" style="float:left;" width="300" height="50" /></td>
<td>Iris Versicolor</td><td><img src="images/Iris_versicolor.jpg" title="Iris Versicolor" style="float:left;" width="300" height="50" /></td>
<td>Iris Virginica</td><td><img src="images/Iris_virginica.jpg" title="Iris Virginica" style="float:left;" width="300" height="50" /></td></tr>
</table>

The data contains the following attributes for various flowers:
<table align="left"><tr><td>
<li align="top">sepal length in cm</li>
<li align="left">sepal width in cm</li>
<li align="left">petal length in cm</li>
<li align="left">petal width in cm</li>
</td><td><img src="images/sepal_petal.jpg" style="float:left;" width="200" height="40" /></td></tr></table>

Because the species of each flower is identified in the data set, we can treat this as a supervised learning problem and train a classifier on the labeled data.

This notebook uses a classification algorithm to predict the type of flower based on the sepal and petal dimensions.  A different notebook clusters the same data into three clusters.

In [None]:
%matplotlib inline
from hana_ml import dataframe
from hana_ml.algorithms.pal.neural_network import MLPClassifier, MLPRegressor
from hana_ml.algorithms.pal import metrics

## Load data
The data is loaded into four tables (the full set, training set, test set, and validation set):
<li>IRIS_DATA_FULL_TBL</li>
<li>IRIS_DATA_TRAIN_TBL</li>
<li>IRIS_DATA_TEST_TBL</li>
<li>IRIS_DATA_VALIDATION_TBL</li>

To do that, a connection is created and passed to the loader.

A config file, <b>config/e2edata.ini</b>, controls the connection parameters and whether to reload the data from scratch.  If the data is already loaded, there is no need to load it again.  A sample section is below.  If the config parameter reload_data is true, the test, training, and validation tables are (re-)created and the data is inserted into them.

#########################<br>
[hana]<br>
url=host.sjc.sap.corp<br>
user=username<br>
passwd=userpassword<br>
port=3xx15<br>
<br>

#########################<br>
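
As a rough illustration of how a section like the one above can be read, here is a minimal sketch using Python's standard `configparser`. It is not the loader used by this notebook. The helper name `read_hana_settings` is hypothetical, the values are the placeholders from the sample, and placing `reload_data` in the `[hana]` section is an assumption.

```python
import configparser

# Sample text mirroring the [hana] section above; all values are placeholders.
SAMPLE_INI = """
[hana]
url=host.sjc.sap.corp
user=username
passwd=userpassword
port=3xx15
"""

def read_hana_settings(ini_text):
    """Return (url, port, user, passwd, reload_data) from an INI string."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["hana"]
    # Assumption: reload_data lives in the [hana] section; default to False if absent.
    reload_data = section.getboolean("reload_data", fallback=False)
    # The port stays a string because the sample value is a placeholder.
    return section["url"], section["port"], section["user"], section["passwd"], reload_data

url, port, user, passwd, reload_data = read_hana_settings(SAMPLE_INI)
print(url, port, user, reload_data)
```

With a real configuration the port would be converted to an integer before opening a connection.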
## Define Datasets - Training, validation, and test sets
Data frames are used to keep references to data so that computation on large data sets in HANA can happen in HANA.  Trying to bring the entire data set into the client will likely result in out-of-memory exceptions.

The original/full dataset is split into training, test and validation sets.  In the example below, they reside in different tables.
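
The tables in this notebook are already split, so nothing below is required to run it. The following is only a sketch of the general idea, in plain Python on a list of row IDs. The function name `split_ids`, the 60/20/20 proportions, the seed, and the 150 IDs are all assumptions made for illustration.

```python
import random

def split_ids(ids, train_pct=60, validation_pct=20, seed=42):
    """Shuffle ids and split them into (train, validation, test) lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = len(shuffled) * train_pct // 100
    n_validation = len(shuffled) * validation_pct // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test

# 150 hypothetical row IDs
train_ids, validation_ids, test_ids = split_ids(range(1, 151))
print(len(train_ids), len(validation_ids), len(test_ids))
```

The three lists are disjoint and together cover every ID, which is the property that matters when the test set is later used to estimate performance on unseen data.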

In [None]:
from hana_ml.algorithms.pal.utility import DataSets, Settings
import plotting_utils
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
full_set, training_set, validation_set, test_set = DataSets.load_iris_data(connection_context)

## Simple Exploration
Let us look at the number of rows in each data set.

In [None]:
print('Number of rows in full set: {}'.format(full_set.count()))
print('Number of rows in training set: {}'.format(training_set.count()))
print('Number of rows in validation set: {}'.format(validation_set.count()))
print('Number of rows in test set: {}'.format(test_set.count()))

### Let's look at the columns

In [None]:
print(full_set.columns)

### Let us look at some rows

In [None]:
full_set.head(5).collect()

### Let's look at the data types

In [None]:
full_set.dtypes()

### Let's check how many SPECIES are in the data set.

In [None]:
full_set.distinct("SPECIES").collect()

## Create Model
The lines below show the ease with which classification can be done.

Set up the features and label for the model, then create the model.

In [None]:
%matplotlib inline
from plotting_utils import DrawNN
features = ['SEPALLENGTHCM','SEPALWIDTHCM','PETALLENGTHCM','PETALWIDTHCM']
label = 'SPECIES'

### Neural Network Architecture

In [None]:
network = DrawNN( [4, 10, 3] )  # 4 input features, 10 hidden units, 3 species
network.draw(False)

### Model Creation

In [None]:
mlpc = MLPClassifier(hidden_layer_size=(10,), activation='TANH', output_activation='TANH',
                     training_style='batch', max_iter=100, normalization='z-transform',
                     weight_init='uniform', thread_ratio=1)
mlpc.fit(training_set, 'ID', features, label)
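
To make the settings above more concrete, here is a plain-Python sketch of z-score standardization and a forward pass through one tanh hidden layer of 10 units and a tanh output layer of 3 units. It does not reproduce what PAL does internally. The weights are random rather than trained, the input values are made up, and the use of the population standard deviation is an assumption.

```python
import math
import random

def z_transform(column):
    """Standardize numbers to zero mean and unit (population) standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

def dense_tanh(inputs, weights, biases):
    """One fully connected layer followed by a tanh activation."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_uniform(rng, n_out, n_in, scale=0.5):
    """Draw an n_out x n_in weight matrix uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
w_hidden, b_hidden = init_uniform(rng, 10, 4), [0.0] * 10
w_out, b_out = init_uniform(rng, 3, 10), [0.0] * 3

sample = [0.3, -1.2, 0.8, 0.5]  # four already-standardized feature values (made up)
hidden = dense_tanh(sample, w_hidden, b_hidden)
scores = dense_tanh(hidden, w_out, b_out)
predicted_class = max(range(3), key=lambda i: scores[i])  # index of the largest score
print(z_transform([1.0, 2.0, 3.0]), predicted_class)
```

Training adjusts the weights so that the largest output score lines up with the correct species. The sketch only shows how an input flows through the layers.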

### Model Storage

In [None]:
from hana_ml.model_storage import ModelStorage
model_storage = ModelStorage(connection_context)

mlpc.name = 'MLPC'  # The model name is mandatory
mlpc.version = 1
model_storage.save_model(model=mlpc)
#need to increase version

# Lists models
model_storage.list_models()

In [None]:
model_storage.list_models()['JSON'].iloc[0]

In [None]:
model = model_storage.load_model(name='MLPC', version=1)

In [None]:
model.predict(data=test_set, key='ID', features=features)[0].collect()

A model can be deleted from the SAP HANA database by specifying its name and version:

In [None]:
model_storage.delete_model('MLPC', 1)
model_storage.list_models()

## Evaluation

### Accuracy
Let us compute the accuracy on our training and test sets.
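
For reference, accuracy is the fraction of predictions that match the true labels. A minimal sketch on made-up labels follows. The helper name `accuracy_of` is hypothetical and this is not the implementation behind `score`.

```python
def accuracy_of(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Made-up labels: three of the four predictions are correct.
y_true = ["setosa", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "versicolor", "versicolor", "virginica"]
print("Accuracy: %f" % accuracy_of(y_true, y_pred))
```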

In [None]:
accuracy = mlpc.score(training_set, 'ID', features, label)
print("Training set accuracy: %f" % accuracy)
print("Test set accuracy: %f" % mlpc.score(test_set, 'ID', features, label))

### Precision, Recall, Confusion Matrix
Accuracy alone is often not enough to evaluate a model, for example when the classes are imbalanced.  Above, we see that we do pretty well on both the training and test sets.
Let us look at other metrics.
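
One situation where accuracy can mislead is class imbalance. The sketch below uses made-up labels that are unrelated to the iris data: a trivial model that always predicts the majority class scores high on accuracy while never finding the minority class.

```python
# Made-up imbalanced labels: 95 of class "A" and 5 of class "B".
y_true = ["A"] * 95 + ["B"] * 5
y_pred = ["A"] * 100  # a trivial model that always predicts the majority class

acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for class "B": the share of true "B" rows that were predicted as "B".
true_b = sum(t == "B" for t in y_true)
found_b = sum(t == "B" and p == "B" for t, p in zip(y_true, y_pred))
recall_b = found_b / true_b

print("accuracy:", acc, "recall for B:", recall_b)
```

Per-class precision and recall expose this kind of failure, which is why they are worth inspecting alongside accuracy.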

To do that, we first inspect the results of our test_set predictions.

In [None]:
predictions_df, soft_max_df = mlpc.predict(test_set, 'ID', features)
print(soft_max_df.head(5).collect())
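
The name `soft_max_df` suggests per-class probabilities produced by a softmax. Assuming that reading is right, the following plain-Python sketch shows how a softmax turns raw scores into values that sum to 1. The scores are made up and this is not PAL's implementation.

```python
import math

def softmax(scores):
    """Map raw scores to positive values that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # made-up scores for three classes
best = max(range(len(probs)), key=lambda i: probs[i])  # index of the most likely class
print(probs, best)
```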

The function to get the confusion matrix takes in a single data frame with the true label and the predicted label.
So, let us construct this data frame by joining on the ID column.
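
To show what the join and the resulting metrics amount to, here is a plain-Python sketch on made-up labels keyed by ID. It is not the `metrics.confusion_matrix` implementation, and the helper name `per_class_report` is hypothetical.

```python
from collections import Counter

# Made-up true and predicted labels, keyed by row ID.
true_by_id = {1: "setosa", 2: "setosa", 3: "versicolor",
              4: "versicolor", 5: "virginica", 6: "virginica"}
pred_by_id = {1: "setosa", 2: "setosa", 3: "versicolor",
              4: "virginica", 5: "virginica", 6: "virginica"}

# Inner join on ID: keep only IDs present on both sides.
joined = [(true_by_id[i], pred_by_id[i])
          for i in sorted(true_by_id.keys() & pred_by_id.keys())]

# Confusion matrix as counts of (true, predicted) pairs.
counts = Counter(joined)

def per_class_report(counts):
    """Compute precision, recall and F1 for every class in the counts."""
    classes = sorted({t for t, _ in counts} | {p for _, p in counts})
    report = {}
    for c in classes:
        tp = counts[(c, c)]
        predicted_c = sum(counts[(t, c)] for t in classes)  # column total
        actual_c = sum(counts[(c, p)] for p in classes)     # row total
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return report

report = per_class_report(counts)
for name, row in report.items():
    print(name, row)
```

In this toy data one versicolor row is predicted as virginica, which lowers recall for versicolor and precision for virginica while setosa is unaffected.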

In [None]:
ts = test_set.rename_columns({'ID': 'TID'}) #.cast('SPECIES', 'NVARCHAR(256)')
jsql = '{}."{}"={}."{}"'.format(predictions_df.quoted_name, 'ID', ts.quoted_name, 'TID')
results_df = predictions_df.join(ts, jsql, how='inner')
cm_df, classification_report_df = metrics.confusion_matrix(results_df, key='ID', label_true='SPECIES', label_pred='TARGET') 

In [None]:
print("Confusion Matrix")
cm_df.collect()

In [None]:
import matplotlib.pyplot as plt
from hana_ml.visualizers.metrics import MetricsVisualizer
f, ax1 = plt.subplots(1,1)
mv1 = MetricsVisualizer(ax1)
ax1 = mv1.plot_confusion_matrix(cm_df, normalize=False)

In [None]:
print("Recall, precision, and F-measures")
classification_report_df.collect()

In [None]:
connection_context.close()