#### Copyright IBM All Rights Reserved.
#### SPDX-License-Identifier: Apache-2.0

# Db2 Sample For Scikit-Learn

In this code sample, we will show how to use the Db2 Python driver to import data from our Db2 database. Then, we will use that data to create a machine learning model with scikit-learn.

Many wine connoisseurs love to taste different wines from all over the world. Most importantly, they want to know how quality differs from wine to wine based on the ingredients. Some of them also want to predict the quality before even tasting the wine. In this notebook, we will use a dataset that records attributes of many wine bottles that determine the quality of the wine. Using this dataset, we will help our wine connoisseurs predict the quality of a wine.

This notebook will demonstrate how to use Db2 as a data source for creating machine learning models.

Prerequisites:
1. Python 3.6 or above
2. Db2 on Cloud instance (using free-tier option)
3. Data already loaded in your Db2 instance
4. Have Db2 connection credentials on hand

We will be importing two libraries: `ibm_db` and `ibm_db_dbi`. `ibm_db` is a library of low-level functions that connects directly to our Db2 database. To make things easier, we will use `ibm_db_dbi`, which communicates with `ibm_db` and gives us a simple interface for interacting with our data and importing it as a pandas DataFrame.

For this example, we will be using the [winequality-red dataset](../data/winequality-red.csv), which we have loaded into our Db2 instance.

NOTE: This notebook was run inside a Docker container. If installing `ibm_db` does not work in your regular Jupyter Notebook environment, you may need to run this notebook inside a Docker container as well.

## 1. Import Data
Let's first install and import all the libraries needed for this notebook. Most importantly, we will install and import the Db2 Python driver `ibm_db`.

In [None]:
!pip install scikit-learn
!pip install ibm_db

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# The two python ibm db2 drivers we need
import ibm_db
import ibm_db_dbi

Now let's import our data from our data source using the Db2 Python driver.

In [None]:
# Replace only the <...> placeholders with your own credentials
dsn = "DRIVER={IBM DB2 ODBC DRIVER};" + \
      "DATABASE=<DATABASE NAME>;" + \
      "HOSTNAME=<HOSTNAME>;" + \
      "PORT=50000;" + \
      "PROTOCOL=TCPIP;" + \
      "UID=<USERNAME>;" + \
      "PWD=<PWD>;"
hdbc  = ibm_db.connect(dsn, "", "")
hdbi = ibm_db_dbi.Connection(hdbc)

sql = 'SELECT * FROM <SCHEMA NAME>.<TABLE NAME>'

# pandas was imported as pd, so call read_sql through that alias
wine = pd.read_sql(sql, hdbi)
#wine = pd.read_csv('../data/winequality-red.csv', sep=';') 
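
The connection string above is a list of `KEY=value;` pairs, so it can also be assembled by a small helper. The sketch below is only an illustration: `build_dsn` is a hypothetical function name, and the database, host, and credential values are placeholders rather than real settings.

```python
def build_dsn(database, hostname, uid, pwd, port=50000):
    """Assemble a Db2 connection string from its parts (hypothetical helper)."""
    parts = {
        "DRIVER": "{IBM DB2 ODBC DRIVER}",
        "DATABASE": database,
        "HOSTNAME": hostname,
        "PORT": port,
        "PROTOCOL": "TCPIP",
        "UID": uid,
        "PWD": pwd,
    }
    # Join every pair as KEY=value; in the order listed above
    return "".join("{}={};".format(key, value) for key, value in parts.items())


# Placeholder values only; substitute your own credentials
example_dsn = build_dsn("MYDB", "db2.example.com", "user", "secret")
print(example_dsn)
```

The resulting string could then be passed to `ibm_db.connect(example_dsn, "", "")` in the same way as the hand-written `dsn`.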

In [None]:
wine.head()

## 2. Data Exploration

In this step, we are going to explore our data in order to gain insight. We hope to be able to make some assumptions about our data before we start modeling.

In [None]:
wine.describe()

In [None]:
# Minimum quality score in the data
min_quality = np.amin(wine['quality'])

# Maximum quality score in the data
max_quality = np.amax(wine['quality'])

# Mean quality score of the data
mean_quality = np.mean(wine['quality'])

# Median quality score of the data
median_quality = np.median(wine['quality'])

# Standard deviation of the quality scores
std_quality = np.std(wine['quality'])

# Show the calculated statistics
print("Statistics for wine quality dataset:\n")
print("Minimum quality: {}".format(min_quality))
print("Maximum quality: {}".format(max_quality))
print("Mean quality: {}".format(mean_quality))
print("Median quality: {}".format(median_quality))
print("Standard deviation of quality: {}".format(std_quality))
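
One detail is worth knowing when comparing these numbers with `wine.describe()`: `np.std` defaults to the population standard deviation (`ddof=0`), while pandas reports the sample standard deviation (`ddof=1`), so the two values differ slightly. The sketch below shows the difference on made-up quality scores, not on the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up quality scores, for illustration only
scores = pd.Series([5, 6, 5, 7, 6, 5, 4, 6])

# NumPy default: population standard deviation (divides by n)
population_std = np.std(scores.to_numpy())

# pandas default: sample standard deviation (divides by n - 1)
sample_std = scores.std()

# describe() reports the same sample standard deviation
describe_std = scores.describe()["std"]

print("np.std (ddof=0):", population_std)
print("Series.std (ddof=1):", sample_std)
print("describe() std:", describe_std)
```

Passing `ddof=1` to `np.std` would make the NumPy value match the one shown by `describe()`.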

In [None]:
wine.corr()

In [None]:
corr_matrix = wine.corr()
corr_matrix["quality"].sort_values(ascending=False)
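
Sorting the signed correlations puts strongly negative features at the bottom of the list, even though they may be just as informative as strongly positive ones. A possible variation, sketched here on a tiny made-up table (the column names and values are assumptions for illustration), ranks features by the absolute value of their correlation with `quality`.

```python
import pandas as pd

# Tiny made-up table; values are not taken from the real dataset
toy = pd.DataFrame({
    "alcohol": [9, 10, 11, 12, 13],
    "volatile acidity": [0.9, 0.7, 0.6, 0.4, 0.3],
    "pH": [3.3, 3.1, 3.4, 3.2, 3.3],
    "quality": [4, 5, 6, 6, 8],
})

# Correlation of every feature with the target, excluding the target itself
signed = toy.corr()["quality"].drop("quality")

# Rank by strength of the relationship, ignoring its direction
ranked = signed.abs().sort_values(ascending=False)

print(signed)
print(ranked)
```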

## 3. Data Visualization

In [None]:
wine.hist(bins=50, figsize=(30,25))
plt.show()

In [None]:
boxplot = wine.boxplot(column=['quality'])

## 4. Creating Machine Learning Model

Now that we have explored our data, we are ready to build a model that predicts the `quality` attribute.

In [None]:
wine_value = wine['quality']
wine_attributes = wine.drop(['quality'],  axis=1)

In [None]:
from sklearn.preprocessing import StandardScaler

# Let us scale our data first 
sc = StandardScaler()
wine_attributes = sc.fit_transform(wine_attributes)
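
To see what the scaling step does, the following sketch applies `StandardScaler` to a small made-up array whose two columns are on very different scales. After the transform, each column has a mean of roughly zero and a standard deviation of roughly one.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the second column is much larger than the first
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

scaled = StandardScaler().fit_transform(toy)

# Per-column mean and standard deviation after scaling
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```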

In [None]:
from sklearn.decomposition import PCA

# Apply PCA to our data
pca = PCA(n_components=8)
x_pca = pca.fit_transform(wine_attributes)
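
The choice of 8 components is taken as given here. One common way to sanity-check such a choice is to look at the cumulative explained variance. The sketch below does this on synthetic data (random numbers with 11 columns, generated only for illustration), so the resulting count says nothing about the wine dataset itself.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 11 columns driven by 3 hidden factors plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 11))
data = latent @ mixing + 0.1 * rng.normal(size=(200, 11))

# Fit PCA with all components and accumulate the variance they explain
pca_full = PCA().fit(data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(cumulative)
print("Components needed for 95% of the variance:", n_for_95)
```

Running the same check on the scaled wine attributes would show how much variance the 8 retained components keep.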

We need to split our data into train and test data.

In [None]:
from sklearn.model_selection import train_test_split

# Split our data into test and train data
# Use the PCA-transformed features so the previous step takes effect
x_train, x_test, y_train, y_test = train_test_split(x_pca, wine_value, test_size=0.25)
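
The split above is random, so the accuracy will change from run to run. Two optional arguments of `train_test_split` can help: `random_state` makes the split reproducible, and `stratify` keeps the class proportions the same in both parts. The sketch below uses made-up labels to show the effect; the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and imbalanced labels (12, 4 and 4 samples per class)
X = np.arange(40).reshape(20, 2)
y = np.array([5] * 12 + [6] * 4 + [7] * 4)

# Fixed seed for reproducibility, stratified to preserve class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print("Test labels:", sorted(y_te.tolist()))
print("Train size:", len(y_tr), "Test size:", len(y_te))
```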

We will be using Logistic Regression to model our data, treating each quality score as a class label.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

lr = LogisticRegression()

# Train our model
lr.fit(x_train, y_train)

# Predict using our trained model and our test data
lr_predict = lr.predict(x_test)
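
In the cells above, the scaler and PCA are fitted on the full dataset before the split, which lets some information from the test rows influence the preprocessing. A possible alternative, sketched below on synthetic data from `make_classification` (not the wine data), chains the steps in a scikit-learn pipeline so that they are fitted on the training rows only. The `max_iter=1000` setting is an assumption made to give the solver room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data: 11 features, 3 classes
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling, PCA and the classifier are fitted together on the training set
model = make_pipeline(
    StandardScaler(), PCA(n_components=8), LogisticRegression(max_iter=1000)
)
model.fit(X_train, y_train)

# The fitted preprocessing is reused, unchanged, on the test set
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
```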

In [None]:
# Print confusion matrix and accuracy score
lr_conf_matrix = confusion_matrix(y_test, lr_predict)
lr_acc_score = accuracy_score(y_test, lr_predict)
print(lr_conf_matrix)
print(lr_acc_score*100)
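
To help read the output, the sketch below builds a confusion matrix from a handful of made-up labels. Rows correspond to the true classes and columns to the predicted classes, in sorted label order, so the diagonal holds the correct predictions and the accuracy is the diagonal sum divided by the total.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted quality scores, for illustration only
y_true = [5, 5, 6, 6, 6, 7, 7, 5]
y_pred = [5, 6, 6, 6, 5, 7, 6, 5]

cm = confusion_matrix(y_true, y_pred)

# Correct predictions sit on the diagonal of the matrix
acc_from_cm = np.trace(cm) / cm.sum()

print(cm)
print("Accuracy from matrix:", acc_from_cm)
print("accuracy_score:", accuracy_score(y_true, y_pred))
```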