#### Copyright IBM All Rights Reserved.
#### SPDX-License-Identifier: Apache-2.0

# Db2 Sample For Tensorflow

In this code sample, we will show how to use the Db2 Python driver to import data from our Db2 database. Then, we will use that data to create a machine learning model with TensorFlow.

Many wine connoisseurs love to taste different wines from all over the world. Most importantly, they want to be able to guess the type of a wine from its taste and ingredients. In this notebook, we will be using a dataset that records attributes of many wine bottles, which together determine the class of each wine. Using this dataset, we will help our wine connoisseurs predict the `class` of a wine.

This notebook will demonstrate how to use Db2 as a data source for creating machine learning models.

Prerequisites:
1. Python 3.6 or above
2. Db2 on Cloud instance (using free-tier option)
3. Data already loaded in your Db2 instance
4. Db2 connection credentials on hand

We will be importing two libraries: `ibm_db` and `ibm_db_dbi`. `ibm_db` is a library with low-level functions that connect directly to our Db2 database. To make things easier, we will also use `ibm_db_dbi`, which communicates with `ibm_db` and gives us an easy interface to interact with our data and import it as a pandas DataFrame.

For this example, we will be using the [wine dataset](../data/wine.csv), which we have loaded into our Db2 instance.

NOTE: This notebook was run within a Docker container. If installing `ibm_db` doesn't work in your regular Jupyter notebook environment, you may need to run this notebook within a Docker container as well.

## 1. Import Data
Let's first install and import all the libraries needed for this notebook. Most importantly, we will be installing and importing the Db2 Python driver `ibm_db`.

In [None]:
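# The model code below uses the TensorFlow 1.x API (tf.placeholder, tf.contrib), so install a 1.x release
!pip install "tensorflow<2"
# ibm_db is published on PyPI, so pip can be used instead of the deprecated easy_install
!pip install ibm_db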

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# The two python ibm db2 drivers we need
import ibm_db
import ibm_db_dbi

In [None]:
# Replace only the <> placeholders with your credentials
dsn = "DRIVER={IBM DB2 ODBC DRIVER};" + \
      "DATABASE=<DATABASE NAME>;" + \
      "HOSTNAME=<HOSTNAME>;" + \
      "PORT=50000;" + \
      "PROTOCOL=TCPIP;" + \
      "UID=<USERNAME>;" + \
      "PWD=<PWD>;"

# Low-level connection handle, then a DB-API wrapper that pandas can read from
hdbc = ibm_db.connect(dsn, "", "")
hdbi = ibm_db_dbi.Connection(hdbc)

sql = 'SELECT * FROM <SCHEMA NAME>.<TABLE NAME>'

# pandas was imported as pd above
wine = pd.read_sql(sql, hdbi)

#colnames = ['Class','Alcohol','Malic acid','Ash','Alcalinity of ash','Magnesium','Total phenols','Flavanoids','Nonflavanoid phenols','Proanthocyanins','Color intensity','Hue','dilute','Proline']
#wine = pd.read_csv('../data/winequality-red.csv', sep=';') 
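
The connection string used above is a sequence of `KEY=value;` pairs. As a minimal sketch, it could be assembled from a dictionary, so that the values can later come from environment variables or a config file instead of the notebook source. The helper name `build_dsn` and the placeholder values are hypothetical and not part of `ibm_db`.

```python
def build_dsn(credentials):
    """Join key/value pairs into a 'KEY=value;' connection string."""
    return "".join("{}={};".format(key, value) for key, value in credentials.items())


# Placeholder values only; replace them with your own credentials.
credentials = {
    "DRIVER": "{IBM DB2 ODBC DRIVER}",
    "DATABASE": "<DATABASE NAME>",
    "HOSTNAME": "<HOSTNAME>",
    "PORT": 50000,
    "PROTOCOL": "TCPIP",
    "UID": "<USERNAME>",
    "PWD": "<PWD>",
}

dsn = build_dsn(credentials)
print(dsn)
```

The resulting string can be passed to `ibm_db.connect(dsn, "", "")` in the same way as the hand-written one.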

In [None]:
# Let's see what our data looks like
wine.head()

## 2. Data Exploration

In this step, we explore our data in order to gain insight. We hope to form some assumptions about the data before we start modeling.

In [None]:
wine.describe()

In [None]:
# Minimum value of the class label
minimum_class = np.amin(wine['Wine'])

# Maximum value of the class label
maximum_class = np.amax(wine['Wine'])

# Mean of the class label
mean_class = np.mean(wine['Wine'])

# Median of the class label
median_class = np.median(wine['Wine'])

# Standard deviation of the class label
std_class = np.std(wine['Wine'])

# Show the calculated statistics
print("Statistics for the class label of the wine dataset:\n")
print("Minimum: {}".format(minimum_class))
print("Maximum: {}".format(maximum_class))
print("Mean: {}".format(mean_class))
print("Median: {}".format(median_class))
print("Standard deviation: {}".format(std_class))

In [None]:
wine.corr()

In [None]:
corr_matrix = wine.corr()
corr_matrix["Wine"].sort_values(ascending=False)
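
By default, `DataFrame.corr()` computes the Pearson correlation coefficient for each pair of columns. The sketch below computes the same coefficient by hand for two made-up feature columns, to show what a strongly positive and a strongly negative value look like. The numbers are invented for illustration and are not taken from the wine table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only.
label = [1, 1, 2, 2, 3, 3]
rising = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]   # grows with the label
falling = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]        # shrinks as the label grows

r_rising = pearson(label, rising)
r_falling = pearson(label, falling)
print(r_rising, r_falling)
```

Columns whose coefficient is close to 1 or -1 move closely with the label, while values near 0 indicate little linear relationship.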

## 3. Data Visualization

In [None]:
wine.hist(bins=50, figsize=(30,25))
plt.show()

In [None]:
boxplot = wine.boxplot(column=['Wine'])

## 4. Pre-Process Data

Before we start creating our model, we first need to pre-process our data for TensorFlow.

In [None]:
import tensorflow as tf

In [None]:
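# First we convert the class labels in the Wine column to one-hot format.
# prefix='Class' names the new columns Class_1, Class_2 and Class_3.
df = pd.get_dummies(wine, columns=['Wine'], prefix='Class')

# Convert labels to numpy array for tensorflow
labels = df.loc[:,['Class_1','Class_2','Class_3']]
labels = labels.values

# Convert features to numpy array for tensorflow (the Ash column is dropped as well)
features = df.drop(['Class_1','Class_2','Class_3','Ash'],axis = 1)
features = features.values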
features = features.values

# Make sure the type is numpy arrays
print(type(labels))
print(type(features))

# Make sure the shape of the array is correct
print(labels.shape)
print(features.shape)
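
To make the one-hot step above concrete, here is a minimal pure-Python sketch of the same idea: each distinct label gets its own column, and every row has a single 1 in the column of its label. The function `one_hot` is a hypothetical helper for illustration, not how pandas implements `get_dummies`.

```python
def one_hot(labels):
    """Return the sorted distinct classes and one 0/1 row per label."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1
        rows.append(row)
    return classes, rows


classes, encoded = one_hot([1, 3, 2, 1])
print(classes)
print(encoded)
```

With three wine classes, every label row therefore has three entries, which is why the label array has three columns.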

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into test and train data
train_x,test_x,train_y,test_y = train_test_split(features,labels)

# Verify the shape of the test and train data
print(train_x.shape,train_y.shape,test_x.shape,test_y.shape)
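
When called without extra arguments, `train_test_split` shuffles the rows and holds out 25% of them for testing; passing `random_state` makes the split reproducible. The sketch below mimics that idea on row indices with the standard library only. `split_indices` is a hypothetical, simplified helper and does not reproduce scikit-learn's exact rounding or shuffling.

```python
import random


def split_indices(n_rows, test_fraction=0.25, seed=0):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(20)
print(len(train_idx), len(test_idx))
```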

In [None]:
from sklearn.preprocessing import MinMaxScaler

# NN in tensorflow works better when the data is scaled between (0,1). So let's scale our data
scale = MinMaxScaler(feature_range = (0,1))

# Fit the scaler on the training data only, then apply the same scaling to the test data
train_x = scale.fit_transform(train_x)
test_x = scale.transform(test_x)

# Take a sneak peek at our data
print(train_x[0])
print(train_y[0])
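
Min-max scaling maps each column to `(value - min) / (max - min)`. The minimum and maximum are normally learned from the training data only and then reused for the test data, so that no information from the test set leaks into preprocessing. Below is a minimal sketch of that idea in plain Python; `fit_min_max` and `apply_min_max` are hypothetical helpers and the numbers are made up.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the given rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale rows with previously learned minimums and maximums."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Made-up values for illustration only.
train = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
test = [[2.0, 40.0]]

mins, maxs = fit_min_max(train)                 # statistics come from train only
scaled_train = apply_min_max(train, mins, maxs)
scaled_test = apply_min_max(test, mins, maxs)   # reuse the same statistics
print(scaled_train)
print(scaled_test)
```

Note that a scaled test value can fall outside the 0 to 1 range when the test set contains a value beyond what was seen during training; that is expected.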

## 5. Creating Machine Learning Model

Now that we have cleaned and explored our data, we are ready to build a model that will predict the attribute `Class`. We will create a basic neural network with TensorFlow to make the prediction.

In [None]:
# Let's first create placeholders for our features and labels
X = tf.placeholder(tf.float32,[None,12]) # Since we have 12 features as input
y = tf.placeholder(tf.float32,[None,3])  # Since we have 3 output labels

We are going to create a simple NN model with 2 hidden layers (3 layers in total, counting the output layer). The hidden layers have 80 and 50 units respectively.
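
As a quick sanity check on this architecture, the sketch below counts the trainable parameters of a fully connected network with 12 inputs, hidden layers of 80 and 50 units, and 3 outputs. Each layer contributes one weight per input/output pair plus one bias per output. The helper `count_parameters` is hypothetical and only illustrates the arithmetic.

```python
layer_sizes = [12, 80, 50, 3]  # inputs, hidden layer 1, hidden layer 2, outputs


def count_parameters(sizes):
    """Sum weights (n_in * n_out) and biases (n_out) over consecutive layers."""
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out + n_out
    return total


total_parameters = count_parameters(layer_sizes)
print(total_parameters)
```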

In [None]:
# Weights and biases for our first hidden layer
weights1 = tf.get_variable("weights1",shape=[12,80],initializer = tf.contrib.layers.xavier_initializer())
biases1 = tf.get_variable("biases1",shape = [80],initializer = tf.zeros_initializer)
layer1out = tf.nn.relu(tf.matmul(X,weights1)+biases1)

# Weights and biases for our second hidden layer
weights2 = tf.get_variable("weights2",shape=[80,50],initializer = tf.contrib.layers.xavier_initializer())
biases2 = tf.get_variable("biases2",shape = [50],initializer = tf.zeros_initializer)
layer2out = tf.nn.relu(tf.matmul(layer1out,weights2)+biases2)

# Weights and biases for our output layer (raw scores, one per class)
weights3 = tf.get_variable("weights3",shape=[50,3],initializer = tf.contrib.layers.xavier_initializer())
biases3 = tf.get_variable("biases3",shape = [3],initializer = tf.zeros_initializer)
prediction = tf.matmul(layer2out,weights3)+biases3

In [None]:
# We also need to define the loss function. We will be using the softmax_cross_entropy_with_logits_v2 function.
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction, labels=y))

# I am keeping our learning rate as 0.001, but you can always change that.
optimizer = tf.train.AdamOptimizer(0.001).minimize(cost)
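
Softmax cross-entropy first turns the raw output scores (logits) into probabilities with softmax, then takes the negative log of the probability assigned to the true class. The following is a minimal pure-Python sketch of that calculation for a single example. It illustrates the formula only and is not TensorFlow's implementation; the logits are made up.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, one_hot_label):
    """Negative log-probability of the true class."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))


# The true class is the first one in all three cases.
confident = cross_entropy([5.0, 0.0, 0.0], [1, 0, 0])  # correct and confident
uniform = cross_entropy([0.0, 0.0, 0.0], [1, 0, 0])    # no preference
wrong = cross_entropy([0.0, 5.0, 0.0], [1, 0, 0])      # confident but wrong
print(confident, uniform, wrong)
```

The loss is small when the network is confidently right, equals `log(3)` when all three classes are scored equally, and grows when the network is confidently wrong.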

In [None]:
acc = []

# Define the accuracy ops once, outside the loop, so the graph does not grow every epoch
matches = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(matches, tf.float32))

# This is where we will run our model
with tf.Session() as sess:

    # Initialize our variables
    sess.run(tf.global_variables_initializer())

    # Train for 201 epochs (0 through 200), feeding the full training set each time
    for epoch in range(201):

        # Run one optimization step and compute the cost
        opt,costval = sess.run([optimizer,cost],feed_dict = {X:train_x,y:train_y})

        # Calculate the accuracy on the test set with the current parameters and store it
        test_acc = accuracy.eval({X:test_x,y:test_y})
        acc.append(test_acc)
        if(epoch % 100 == 0):
            print("Epoch", epoch, "--" , "Cost",costval)
            print("Accuracy on the test set ->",test_acc)
    print("FINISHED !!!")
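
The accuracy used in the training cell compares the index of the largest predicted score with the index of the 1 in the one-hot label, then averages the matches. Here is a minimal pure-Python sketch of that calculation on made-up scores; `argmax` and `accuracy` are hypothetical helpers that mirror what `tf.argmax`, `tf.equal` and `tf.reduce_mean` do together.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(logit_rows, one_hot_rows):
    """Fraction of rows where the predicted class equals the true class."""
    matches = [argmax(l) == argmax(t) for l, t in zip(logit_rows, one_hot_rows)]
    return sum(matches) / len(matches)


# Made-up scores and labels for illustration only.
logits = [[2.0, 0.1, 0.3], [0.2, 0.1, 1.5], [0.9, 1.1, 0.2], [0.4, 0.3, 0.2]]
targets = [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]

score = accuracy(logits, targets)
print(score)
```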

In [None]:
# Let's plot our accuracy over the number of iterations and see how our model did
plt.plot(acc)
plt.ylabel("Accuracy")
plt.xlabel("Epochs")

It looks like our model did really well with our data. It seems, however, that we may not have needed to run it for the full 200 iterations. But that's up to you to decide!
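
One way to decide how many iterations are enough is to look for the epoch after which accuracy stops improving. The sketch below scans a list of per-epoch accuracy values, such as `acc`, and returns the last epoch that brought an improvement once no further gain has been seen for `patience` epochs. The helper `first_plateau`, its parameters, and the sample history are all hypothetical.

```python
def first_plateau(history, patience=20, min_delta=0.0):
    """Return the epoch of the last improvement before a plateau of `patience` epochs."""
    best = float("-inf")
    best_epoch = 0
    for epoch, value in enumerate(history):
        if value > best + min_delta:
            best = value
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch
    return best_epoch


# Made-up accuracy values for illustration only.
history = [0.4, 0.6, 0.8, 0.9, 0.9, 0.9, 0.9, 0.9]
stop_epoch = first_plateau(history, patience=3)
print(stop_epoch)
```

Calling `first_plateau(acc)` after training would give a rough hint of where training could have stopped, under the assumption that test accuracy is the quantity you want to monitor.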