<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>ModelOps demo - Jupyter-only Data exploration and experimentation</b>
</header>

![image](images/02_00.png) 

## Introduction

This notebook guides you through the PIMA diabetes prediction use case. It covers the steps a data science team usually implements for data exploration and model experimentation. We use the same DataFrames as the models do, but with Jupyter Notebook as the only interface.

## Steps in this Notebook

<li>1. Configure the Environment</li>
<li>2. Connect to Vantage</li>
<li>3. PIMA Use Case - Data Exploration</li>
<li>4. PIMA Use Case - Model Experimentation</li>


## Step 1. Configure the Environment

Here, we import the required libraries, set environment variables and environment paths (if required).



#### 1.1 Libraries installation

**Restart the kernel after the installation for the changes to take effect.** The `-q` flag keeps the installation log quiet; remove it if you want to see every step of the pip installation.

In [None]:
%pip install -q teradataml==17.20.0.3 aoa==7.0.1 pandas==1.1.5 xgboost==0.90 scikit-learn==0.24.2

#### 1.2 Libraries import

In [None]:
from xgboost import XGBClassifier

In [None]:
from teradataml import create_context, get_context, copy_to_sql, DataFrame
from teradatasqlalchemy.types import *
import pandas as pd
import getpass
import logging
import sys

import os
import numpy as np
import warnings

warnings.filterwarnings(action='ignore', category=DeprecationWarning)

## Step 2. Connect to Vantage

<p style = 'font-size:16px;font-family:Arial'>You will be prompted for the password. Enter your password, press the Enter key, then use the down arrow to go to the next cell. Run each step with the Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
eng.execute('''SET query_band='DEMO=02_ModelOps_PIMA_Experimentation.ipynb;' UPDATE FOR SESSION; ''')

## Step 3. PIMA Use Case - Data Exploration

The **[Pima](https://en.wikipedia.org/wiki/Pima_people)** are a group of **Native Americans** living in Arizona. A genetic predisposition allowed this group to survive for years on a diet poor in carbohydrates. In recent years, a sudden shift from traditional agricultural crops to processed foods, together with a decline in physical activity, has led them to develop **the highest prevalence of type 2 diabetes**, and for this reason they have been the subject of many studies.

## Dataset

The dataset includes data from **768** women with **8** characteristics, in particular:

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)

The last column of the dataset indicates whether the person has been diagnosed with diabetes (1) or not (0).

### Source

The original [dataset](http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes) was hosted at the **UCI Machine Learning Repository** but is no longer available there. An alternative source is https://nrvis.com/data/mldata/pima-indians-diabetes.csv

## The problem

This is a classic **supervised binary classification** problem. Given a number of elements, each with certain characteristics (features), we want to build a machine learning model to identify people affected by type 2 diabetes.

To solve the problem we will have to analyse the data, do any required transformation and normalisation, apply a machine learning algorithm, train a model, check the performance of the trained model and iterate with other algorithms until we find the most performant for our type of dataset.

# Inspect the Dataset

In [None]:
dataset = DataFrame.from_query("""
SELECT 
    F.*, D.hasdiabetes 
FROM pima_patient_features F
JOIN pima_patient_diagnoses D
ON F.patientid = D.patientid
""").to_pandas()

dataset.head()

In [None]:
corr = dataset.corr()
corr

I'm not a doctor and I don't have any knowledge of medicine, but from the data I can guess that **the greater the age or the BMI of a patient, the greater the probability that the patient develops type 2 diabetes**.
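To read such relationships off the correlation matrix more easily, one option is to look only at the label column and sort it. The sketch below uses a small synthetic DataFrame in place of the real data; the column names (`Age`, `BMI`, `Noise`, `HasDiabetes`) and the way the label is generated are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the PIMA data; names and values are assumptions.
rng = np.random.default_rng(0)
n = 200
age = rng.integers(21, 80, size=n)
bmi = rng.normal(32, 6, size=n)

# Label loosely driven by age and BMI plus noise, only for illustration.
score = 0.05 * (age - 40) + 0.1 * (bmi - 32) + rng.normal(0, 1, size=n)
demo = pd.DataFrame({
    "Age": age,
    "BMI": bmi,
    "Noise": rng.normal(0, 1, size=n),
    "HasDiabetes": (score > 0).astype(int),
})

# Correlation of every feature with the label, strongest first.
target_corr = (
    demo.corr()["HasDiabetes"]
    .drop("HasDiabetes")
    .sort_values(ascending=False)
)
print(target_corr)
```

On the real data, the same expression applied to `corr` would rank the features by their linear correlation with the diagnosis. Correlation alone does not establish a causal link.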

# Visualise the Dataset

In [None]:
%matplotlib inline
import seaborn as sns
sns.heatmap(corr, annot = True)

In [None]:
import matplotlib.pyplot as plt
dataset.hist(bins=50, figsize=(20, 15))
plt.show()

An important thing I notice in the dataset (and that wasn't obvious at the beginning) is that some people have **null (zero) values** for some of the features: it is not possible to have 0 as BMI or as blood pressure.

How can we deal with such values? We will see it later during the **data transformation** phase.
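A quick way to quantify the problem is to count the zeros per column. This is a minimal sketch on a toy DataFrame; the column names (`NumTimesPrg`, `BloodP`, `BMI`) and the choice of which columns cannot legitimately be zero are assumptions.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up).
demo = pd.DataFrame({
    "NumTimesPrg": [6, 1, 8, 0],        # zero is a valid value here
    "BloodP": [72, 0, 64, 66],          # zero is implausible
    "BMI": [33.6, 26.6, 0.0, 28.1],     # zero is implausible
})

# Columns where a zero most likely means "missing" (an assumption).
suspect_cols = ["BloodP", "BMI"]

# Number of zero entries in each suspect column.
zero_counts = (demo[suspect_cols] == 0).sum()
print(zero_counts)
```

Note that a zero in a column such as the number of pregnancies is a legitimate value, so only the physiologically implausible columns should be checked.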

# Splitting the Dataset into Train & Test

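We split the dataset into a training set and a test set directly in the queries below, using a deterministic rule on `patientid`: rows where `patientid MOD 5 <> 0` form the training set (roughly 80%), and the remaining rows form the test set (roughly 20%).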

In [None]:
from teradataml.dataframe.dataframe import DataFrame

# take 80% split for training
train_set = DataFrame.from_query("""
SELECT 
    F.*, D.hasdiabetes
FROM pima_patient_features F 
JOIN pima_patient_diagnoses D
ON F.patientid = D.patientid
    WHERE D.patientid MOD 5 <> 0
""").to_pandas()

# take 20% split for test
test_set = DataFrame.from_query("""
SELECT 
    F.*, D.hasdiabetes
FROM pima_patient_features F 
JOIN pima_patient_diagnoses D
ON F.patientid = D.patientid
    WHERE D.patientid MOD 5 = 0
""").to_pandas()

In [None]:
# Separate labels from the rest of the dataset
train_set_labels = train_set["HasDiabetes"]
train_set = train_set.drop("HasDiabetes", axis=1)

test_set_labels = test_set["HasDiabetes"]
test_set = test_set.drop("HasDiabetes", axis=1)
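The `MOD 5` rule used in the split queries can be checked on a small pandas example. This sketch assumes consecutive integer IDs in a hypothetical `PatientId` column; with gaps in the IDs the proportions would only be approximately 80/20.

```python
import pandas as pd

# Hypothetical patients with consecutive IDs 1..100.
demo = pd.DataFrame({"PatientId": range(1, 101)})

# Same rule as the SQL: IDs divisible by 5 go to the test set.
is_test = demo["PatientId"] % 5 == 0
train_demo = demo[~is_test]
test_demo = demo[is_test]

print(len(train_demo), len(test_demo))
```

Because the rule depends only on the ID, the split is reproducible across runs and a patient can never land in both sets.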

# Data cleaning and transformation

We have noticed from the previous analysis that some patients have missing data for some of the features. Machine learning algorithms don't work very well when the data is missing so we have to find a solution to "clean" the data we have.

The easiest option could be to eliminate all those patients with null/zero values, but in this way we would eliminate a lot of important data.

Another option is to calculate the **median** value for a specific column and substitute that value everywhere (in the same column) we have zero or null. Let's see how to apply this second method.
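The following is a minimal sketch of median imputation with pandas, not necessarily the approach used in the original project. The toy data, the column names (`BloodP`, `BMI`) and the list of columns to clean are assumptions. The medians are computed on the training data only and then reused for the test data, so no information leaks from the test set.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real train/test sets (values are made up).
train_demo = pd.DataFrame({
    "BloodP": [72, 0, 64, 66, 80],
    "BMI": [33.6, 26.6, 0.0, 28.1, 30.0],
})
test_demo = pd.DataFrame({"BloodP": [0, 70], "BMI": [25.0, 0.0]})

# Columns where zero is treated as "missing" (an assumption).
suspect_cols = ["BloodP", "BMI"]

# Mark zeros as NaN so they are ignored when computing the medians.
train_nan = train_demo[suspect_cols].replace(0, np.nan)
medians = train_nan.median()

# Fill the gaps in both sets with the training medians.
train_clean = train_demo.copy()
train_clean[suspect_cols] = train_nan.fillna(medians)

test_clean = test_demo.copy()
test_clean[suspect_cols] = (
    test_demo[suspect_cols].replace(0, np.nan).fillna(medians)
)

print(medians)
print(train_clean)
```

In the notebook, the same steps would be applied to `train_set` and `test_set` before scaling.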

# Feature Scaling

One of the most important data transformations we need to apply is **feature scaling**. Most machine learning **algorithms don't work very well if the features have very different ranges of values**. In our case, for example, the age ranges from 20 to 80 years, while the number of times a patient has been pregnant ranges from 0 to 17. For this reason we need to apply a proper transformation.

In [None]:
# Apply a scaler
from sklearn.preprocessing import MinMaxScaler as Scaler

scaler = Scaler()
scaler.fit(train_set)
train_set_scaled = scaler.transform(train_set)
test_set_scaled = scaler.transform(test_set)
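With its default settings, `MinMaxScaler` rescales each column to `(x - min) / (max - min)`, where `min` and `max` come from the data passed to `fit`. The sketch below reproduces that by hand on made-up numbers (an age column and a pregnancies column, both assumptions) and compares the result with the scaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: column 0 is an age, column 1 a number of pregnancies.
train = np.array([[20.0, 0.0], [50.0, 5.0], [80.0, 17.0]])
test = np.array([[35.0, 2.0], [90.0, 1.0]])

# Per-column minimum and maximum, learned from the training data only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# Manual min-max scaling with the training statistics.
manual_train = (train - col_min) / (col_max - col_min)
manual_test = (test - col_min) / (col_max - col_min)

# The same transformation through scikit-learn.
scaler_demo = MinMaxScaler().fit(train)
sk_train = scaler_demo.transform(train)
sk_test = scaler_demo.transform(test)

print(manual_train)
print(manual_test)
```

A test value outside the training range (the age of 90 here) maps outside the interval [0, 1], which is expected because the scaler is fitted on the training set only.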

## Scaled Values

In [None]:
df = pd.DataFrame(data=train_set_scaled)
df.head()

## Step 4. PIMA Use Case - Model Experimentation - Select and train a model

It's not possible to know in advance which algorithm will work better with our dataset. We need to compare a few and select the one with the "best score".

## Comparing multiple algorithms

To compare multiple algorithms on the same dataset, sklearn provides a very handy module called **model_selection**. We create a list of algorithms and then score them all with the same comparison method. At the end we pick the one with the best score.

In [None]:
# Import all the algorithms we want to test
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

In [None]:
# Import the sklearn utility to compare algorithms
from sklearn import model_selection

In [None]:
# Prepare an array with all the algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('RFC', RandomForestClassifier()))
models.append(('XGB', XGBClassifier()))

In [None]:
# Prepare the configuration to run the test
seed = 7
results = []
names = []
X = train_set_scaled
Y = train_set_labels

In [None]:
# Every algorithm is tested and results are
# collected and printed
from sklearn.model_selection import KFold

for name, model in models:
    kfold =  KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(
        model, X, Y, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))

In [None]:
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Using this comparison method, the most performant algorithm appears to be **XGBoost**.
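The comparison above uses a plain `KFold` without shuffling. A possible variant, sketched here on synthetic data, is a shuffled `StratifiedKFold` with a fixed seed, which keeps the class proportions similar in every fold. The data from `make_classification`, the single `LogisticRegression` model and the seed value are assumptions for illustration, so the scores say nothing about the PIMA data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced binary problem with 8 features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=7
)

# Shuffled, stratified folds with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    cv=cv, scoring="accuracy",
)
print("LR: %f (%f)" % (scores.mean(), scores.std()))
```

Passing the same `cv` object to every model in the list keeps the folds identical across algorithms, so the scores stay directly comparable.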

## Find the best parameters for XGB

The default parameters of an algorithm are rarely the best ones for our dataset. Using sklearn we can easily build a parameter grid and try all the possible combinations. At the end, `best_score_` gives the best cross-validated score, `best_params_` the combination that achieved it, and `best_estimator_` the model refitted with those parameters.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'learning_rate': [0.1, 0.2],
    'max_depth': [4, 6, 8]
}

model_xgb = XGBClassifier()

grid_search = GridSearchCV(model_xgb, param_grid, cv=10, scoring='accuracy')
grid_search.fit(train_set_scaled, train_set_labels)

In [None]:
# Print the best score found
grid_search.best_score_
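The best cross-validated score is only part of the picture: we also want to know which parameters won and how the tuned model performs on data it has never seen. The sketch below shows that pattern on synthetic data. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier`, and the data, grid values and seed are assumptions, so the numbers are not results for the PIMA dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary problem with 8 features, split 80/20.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

# Small illustrative grid (values are assumptions).
demo_grid = {"learning_rate": [0.1, 0.2], "max_depth": [2, 4]}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=7),
    demo_grid, cv=5, scoring="accuracy",
)
search.fit(X_tr, y_tr)

# Winning parameter combination and its cross-validated score.
print(search.best_params_, search.best_score_)

# Accuracy of the refitted best model on the held-out test split.
test_accuracy = accuracy_score(y_te, search.best_estimator_.predict(X_te))
print(test_accuracy)
```

In the notebook, the equivalent check would use `grid_search.best_estimator_` with `test_set_scaled` and `test_set_labels`.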

## Credits

https://github.com/andreagrandi/ml-pima-notebook

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">©2023 Teradata. All Rights Reserved</footer>