<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       ModelOps demo (Jupyter-only): Data exploration and experimentation
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

![image](images/02_00.png) 

<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>This notebook walks you through the PIMA diabetes prediction use case. It covers the steps a data science team usually performs for data exploration and model experimentation. We use the same data as the models in the rest of the ModelOps demo, but with Jupyter Notebook as the only interface.</p>

<p style = 'font-size:18px;font-family:Arial'><b>Steps in this Notebook</b></p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Configure the Environment </li>
    <li>Connect to Vantage</li>
    <li>PIMA Use Case - Data Exploration </li>
    <li>PIMA Use Case - Model Experimentation</li>
</ol>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>1. Configure the Environment</b>

<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

<p style = 'font-size:18px;font-family:Arial'><b>1.1 Libraries installation</b></p>

<p style = 'font-size:16px;font-family:Arial'><b>A kernel restart is needed for the changes to take effect</b>. The -q flag keeps the pip output quiet; remove it if you want to see every step of the installation.</p>

In [None]:
#%pip install -q teradataml==17.20.0.6 teradatamodelops==7.0.3

<p style = 'font-size:16px;font-family:Arial'><b>Hint:</b> <i>The easy way to restart the kernel and load the newly installed software into memory is to type zero zero (<b> 0 0 </b>).</i></p>
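
<p style = 'font-size:16px;font-family:Arial'>After the restart, it can be useful to confirm which versions are actually loaded. The cell below is a minimal sketch that uses only the Python standard library. The helper name <code>installed_version</code> is hypothetical; the package names are the ones from the pip command above.</p>

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Package names taken from the pip command above.
for name in ("teradataml", "teradatamodelops"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```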

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>1.2 Libraries import</b></p>

In [None]:
import logging
import os
import sys
import getpass
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold
from sklearn.model_selection import GridSearchCV

from xgboost import XGBClassifier

from teradataml import *
from teradatasqlalchemy.types import *

warnings.filterwarnings(action='ignore', category=DeprecationWarning)

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>2. Connect to Vantage</b>

<p style = 'font-size:16px;font-family:Arial'>You will be prompted for the password. Enter your password, press the Enter key, then use the down arrow to go to the next cell. Run each step with Shift + Enter.</p>

In [None]:
%run -i ../UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=02_ModelOps_PIMA_Experimentation.ipynb;' UPDATE FOR SESSION; ''')
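
<p style = 'font-size:16px;font-family:Arial'>The query band set above is a list of <code>name=value;</code> pairs attached to the session. As a rough sketch, the statement can be assembled from a dictionary. The helper <code>build_query_band_sql</code> is hypothetical and not part of teradataml; it only builds the string in the same shape as the statement above.</p>

```python
def build_query_band_sql(pairs):
    """Build a session-level SET query_band statement from name/value pairs."""
    for key, value in pairs.items():
        # Characters that would break the name=value; syntax or the SQL literal.
        if any(ch in f"{key}{value}" for ch in ";='"):
            raise ValueError(f"unsupported character in query band pair: {key!r}")
    band = "".join(f"{key}={value};" for key, value in pairs.items())
    return f"SET query_band='{band}' UPDATE FOR SESSION;"


statement = build_query_band_sql({"DEMO": "02_ModelOps_PIMA_Experimentation.ipynb"})
print(statement)
```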

<p style = 'font-size:18px;font-family:Arial'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but consumes available storage. The following cell contains both statements, with one commented out. To switch modes, change which line is commented.</p>

In [None]:
#%run -i ../UseCases/run_procedure.py "call get_data('DEMO_ModelOps_cloud');"        # Takes 10 seconds
%run -i ../UseCases/run_procedure.py "call get_data('DEMO_ModelOps_local');"        # Takes 30 seconds

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../UseCases/run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>3. PIMA Use Case - Data Exploration</b>

<p style = 'font-size:16px;font-family:Arial'>The <strong><a href="https://en.wikipedia.org/wiki/Pima_people">Pima</a></strong> are a group of <strong>Native Americans</strong> living in Arizona. A genetic predisposition allowed this group to survive normally on a diet poor in carbohydrates for years. In recent years, a sudden shift from traditional agricultural crops to processed foods, together with a decline in physical activity, has led them to develop <strong>the highest prevalence of type 2 diabetes</strong>, and they have therefore been the subject of many studies.</p>

<p style = 'font-size:18px;font-family:Arial'><b>Dataset</b></p>

<p style = 'font-size:16px;font-family:Arial'>The dataset includes data from <b>768</b> women with <b>8</b> characteristics, in particular:</p>

<ol style = 'font-size:16px;font-family:Arial'>
  <li>Number of times pregnant</li>
  <li>Plasma glucose concentration at 2 hours in an oral glucose tolerance test</li>
  <li>Diastolic blood pressure (mm Hg)</li>
  <li>Triceps skin fold thickness (mm)</li>
  <li>2-Hour serum insulin (mu U/ml)</li>
  <li>Body mass index (weight in kg/(height in m)^2)</li>
  <li>Diabetes pedigree function</li>
  <li>Age (years)</li>
</ol>


<p style = 'font-size:16px;font-family:Arial'>The last column of the dataset indicates whether the person has been diagnosed with diabetes (1) or not (0).</p>


<p style = 'font-size:18px;font-family:Arial'><b>Source</b></p>

<p style = 'font-size:16px;font-family:Arial'>
  The original <a href="http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes">dataset</a> was at
  <strong>UCI Machine Learning Repository</strong> but is no longer available. This is an alternative site:
  <a href="https://nrvis.com/data/mldata/pima-indians-diabetes.csv">https://nrvis.com/data/mldata/pima-indians-diabetes.csv</a>.
</p>



<p style = 'font-size:18px;font-family:Arial'><b>The problem</b></p>

<p style = 'font-size:16px;font-family:Arial'>This is a classic <b>supervised binary classification</b> problem. Given a set of patients, each described by a number of characteristics (features), we want to build a machine learning model that identifies those affected by type 2 diabetes.
<br>
<br>
To solve the problem, we will analyze the data, apply any required transformation and normalization, train a model with a machine learning algorithm, check its performance, and iterate with other algorithms until we find the best performer for this dataset.</p>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>3.1 Inspect the Dataset</b></p>

In [None]:
dataset = DataFrame.from_query("""
    SELECT 
        F.*, D.hasdiabetes 
    FROM Demo_Modelops.pima_patient_features F
    JOIN Demo_Modelops.pima_patient_diagnoses D
    ON F.patientid = D.patientid
    """).to_pandas()

dataset.head()

In [None]:
corr = dataset.corr()
corr

<p style="font-size:16px;font-family:Arial">
    I'm not a doctor and I don't have any knowledge of medicine, but from the data I can guess that
    <strong>the greater the age or the BMI of a patient, the higher the probability that the patient develops type 2 diabetes</strong>.
</p>
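
<p style = 'font-size:16px;font-family:Arial'>Rather than reading the full matrix by eye, one option is to pull out the column of correlations with the label and sort it by absolute value. The sketch below uses a tiny made-up sample. The feature names <code>Age</code>, <code>BMI</code> and <code>BloodPressure</code> are illustrative assumptions, and <code>HasDiabetes</code> is the label name used later in this notebook.</p>

```python
import pandas as pd

# Tiny made-up sample; the values and feature names are illustrative only.
sample = pd.DataFrame({
    "Age": [21, 25, 33, 41, 52, 60],
    "BMI": [22.0, 31.5, 24.3, 35.1, 29.8, 38.2],
    "BloodPressure": [70, 64, 72, 80, 66, 78],
    "HasDiabetes": [0, 0, 0, 1, 1, 1],
})

# Correlation of every feature with the label, strongest first.
target_corr = (
    sample.corr()["HasDiabetes"]
    .drop("HasDiabetes")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

<p style = 'font-size:16px;font-family:Arial'>On the real data, the same expression could be applied to <code>corr</code> directly, assuming the label column is named <code>HasDiabetes</code> there as well.</p>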

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>3.2 Visualise the Dataset</b></p>

In [None]:
plt.figure(figsize=(8, 6))  # Adjust the size of the plot as desired
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')  # Customize color map and annotation format
plt.title('Correlation Heatmap')  # Add a title to the plot
plt.show()

In [None]:
dataset.hist(bins=50, figsize=(20, 15))
plt.suptitle('Histograms of Dataset Features', fontsize=16)
plt.show()

<p style = 'font-size:16px;font-family:Arial'>
    An important thing I notice in the dataset (and that wasn't obvious at the beginning) is that some people have
    <strong>null (zero) values</strong> for some of the features: a BMI or a blood pressure of 0 is not physically possible.
    <br>
    <br>
    How can we deal with such values? We will come back to this in the <strong>data transformation</strong> phase.
</p>
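
<p style = 'font-size:16px;font-family:Arial'>To see how widespread these zeros are, we can count them per column. This is a minimal sketch on made-up rows. The column names and the choice of which columns cannot legitimately be zero are assumptions; a zero count of pregnancies, for instance, is a valid value and is left out.</p>

```python
import pandas as pd

# Made-up rows; column names are illustrative assumptions.
sample = pd.DataFrame({
    "BMI": [33.6, 0.0, 23.3, 28.1],
    "BloodPressure": [72, 66, 0, 0],
    "NumTimesPrg": [6, 1, 0, 3],  # zero is a legitimate value here
})

# Columns where a zero can only mean "not recorded".
suspect_columns = ["BMI", "BloodPressure"]

zero_counts = (sample[suspect_columns] == 0).sum()
zero_share = zero_counts / len(sample)
print(pd.DataFrame({"zeros": zero_counts, "share": zero_share}))
```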

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>3.3 Splitting the Dataset into Train & Test</b></p>

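<p style = 'font-size:16px;font-family:Arial'>The pre-split data is available in PIMA_TRAIN and PIMA_TEST. In this notebook, the queries below build the split directly from the feature and diagnosis tables: rows whose <code>patientid</code> is not divisible by 5 form the training set (about 80%), and the remaining rows form the test set (about 20%).</p>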

In [None]:
# Take 80% split for training
train_set = DataFrame.from_query("""
    SELECT 
        F.*, D.hasdiabetes
    FROM Demo_Modelops.pima_patient_features F 
    JOIN Demo_Modelops.pima_patient_diagnoses D
    ON F.patientid = D.patientid
    WHERE D.patientid MOD 5 <> 0
""").to_pandas()

# Take 20% split for test
test_set = DataFrame.from_query("""
    SELECT 
        F.*, D.hasdiabetes
    FROM Demo_Modelops.pima_patient_features F 
    JOIN Demo_Modelops.pima_patient_diagnoses D
    ON F.patientid = D.patientid
    WHERE D.patientid MOD 5 = 0
""").to_pandas()

In [None]:
# Separate labels from the rest of the dataset

train_set_labels = train_set["HasDiabetes"]
train_set = train_set.drop("HasDiabetes", axis=1)

test_set_labels = test_set["HasDiabetes"]
test_set = test_set.drop("HasDiabetes", axis=1)
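
<p style = 'font-size:16px;font-family:Arial'>The <code>MOD 5</code> rule used in the queries above gives a deterministic, repeatable split. The sketch below reproduces the rule in plain Python to show the resulting proportions. It assumes that <code>patientid</code> runs from 1 to 768 without gaps, which the notebook does not state.</p>

```python
# Hypothetical ids: assumes patientid runs from 1 to 768 without gaps.
patient_ids = range(1, 769)

# Same rule as the SQL: ids divisible by 5 go to test, the rest to train.
train_ids = [pid for pid in patient_ids if pid % 5 != 0]
test_ids = [pid for pid in patient_ids if pid % 5 == 0]

train_share = len(train_ids) / len(patient_ids)
print(f"train: {len(train_ids)}, test: {len(test_ids)}, train share: {train_share:.3f}")
```

<p style = 'font-size:16px;font-family:Arial'>Because the rule depends only on the id, the same patient always lands in the same set, and the two sets cannot overlap.</p>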

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>3.4 Data cleaning and transformation</b></p>

<p style = 'font-size:16px;font-family:Arial'>
    We have noticed from the previous analysis that some patients have missing data for some of the features. Machine learning algorithms don't work very well when the data is missing, so we have to find a solution to "clean" the data we have.
    <br>
    <br>
    The easiest option could be to eliminate all those patients with null/zero values, but in this way, we would eliminate a lot of important data.
    <br>
    <br>
    Another option is to calculate the <strong>median</strong> value of a column and substitute it wherever that column has a zero or null. The remaining cells of this notebook do not apply this step and move directly to feature scaling.
</p>
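
<p style = 'font-size:16px;font-family:Arial'>The median substitution described above can be sketched with pandas as follows. The data and column names are made up, and <code>impute_zeros</code> is a hypothetical helper. Two design choices here are assumptions rather than something the notebook prescribes: zeros are excluded when computing the median, and the medians come from the training set only, so that no information from the test set leaks into the transformation.</p>

```python
import pandas as pd

# Made-up data; column names are illustrative assumptions.
train = pd.DataFrame({"BMI": [30.0, 0.0, 26.0, 40.0], "BloodPressure": [70, 0, 80, 90]})
test = pd.DataFrame({"BMI": [0.0, 35.0], "BloodPressure": [60, 0]})
zero_as_missing = ["BMI", "BloodPressure"]

# Medians from the training set only, with zeros masked out as missing.
medians = train[zero_as_missing].mask(train[zero_as_missing] == 0).median()


def impute_zeros(frame, columns, fill_values):
    """Return a copy of frame with zeros in the given columns replaced by fill_values."""
    out = frame.copy()
    out[columns] = out[columns].mask(out[columns] == 0).fillna(fill_values)
    return out


train_clean = impute_zeros(train, zero_as_missing, medians)
test_clean = impute_zeros(test, zero_as_missing, medians)
print(test_clean)
```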

<p style = 'font-size:18px;font-family:Arial'><b>3.4.1 Feature Scaling</b></p>

<p style = 'font-size:16px;font-family:Arial'>
    One of the most important data transformations we need to apply is <strong>feature scaling</strong>. Many machine learning
    <strong>algorithms don't work very well when the features span very different ranges of values</strong>. In our case, for example, the Age ranges from 20 to 80 years old, while the number of times a patient has been pregnant ranges from 0 to 17. For this reason, we need to apply a proper transformation.
</p>

In [None]:
# Apply a scaler
scaler = MinMaxScaler()
train_set_scaled = scaler.fit_transform(train_set)
test_set_scaled = scaler.transform(test_set)
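
<p style = 'font-size:16px;font-family:Arial'>For reference, <code>MinMaxScaler</code> with its default settings maps each column to <code>(x - min) / (max - min)</code>, where the minimum and maximum are learned from the data passed to <code>fit</code>. The sketch below checks this on a small made-up array whose two columns use the age and pregnancy ranges mentioned above.</p>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: column 0 resembles age, column 1 the number of pregnancies.
train = np.array([[20.0, 0.0], [50.0, 5.0], [80.0, 17.0]])
test = np.array([[35.0, 2.0], [90.0, 1.0]])

# Fit on the training data only, then reuse the learned min/max for the test data.
scaler = MinMaxScaler().fit(train)
sklearn_test = scaler.transform(test)

# The same transformation written out by hand.
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual_test = (test - col_min) / (col_max - col_min)

print(sklearn_test)
```

<p style = 'font-size:16px;font-family:Arial'>Note that a test value outside the range seen during training (age 90 here) is scaled to a value outside the interval from 0 to 1. That is expected when the scaler is fitted on the training set only.</p>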

<p style = 'font-size:18px;font-family:Arial'><b>3.4.2 Scaled Values</b></p>

In [None]:
df = pd.DataFrame(data = train_set_scaled)
df.head()

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>4. PIMA Use Case - Model Experimentation - Select and train a model</b>

<p style = 'font-size:16px;font-family:Arial'>It's not possible to know in advance which algorithm will work better with our dataset. We need to compare a few and select the one with the "best score".</p>

<p style = 'font-size:18px;font-family:Arial'><b>Comparing multiple algorithms</b></p>

<p style = 'font-size:16px;font-family:Arial'>To compare multiple algorithms on the same dataset, sklearn provides the <strong>model_selection</strong> module. We create a list of algorithms and score them all with the same cross-validation procedure. At the end, we pick the one with the best score.</p>

In [None]:
# Prepare a list with all the algorithms
models = [
    ('LR', LogisticRegression()),
    ('RFC', RandomForestClassifier()),
    ('XGB', XGBClassifier())
]

In [None]:
# Prepare the configuration to run the test
seed = 7
results = []
names = []
X = train_set_scaled
Y = train_set_labels

In [None]:
# Every algorithm is tested and results are
# collected and printed

for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)  # shuffle so that the seed takes effect
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
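
<p style = 'font-size:16px;font-family:Arial'>A possible refinement, not used in this notebook, is to put the scaler and the model into a single pipeline so that the scaler is refitted on the training part of every fold. The sketch below shows the idea on synthetic data from <code>make_classification</code>. The dataset, the number of folds and the choice of logistic regression are assumptions made for illustration.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the PIMA features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

# Scaling happens inside each fold, so validation rows never influence the scaler.
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
kfold = KFold(n_splits=5, shuffle=True, random_state=7)

scores = cross_val_score(pipeline, X_demo, y_demo, cv=kfold, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} ({scores.std():.3f})")
```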

In [None]:
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

<p style = 'font-size:16px;font-family:Arial'>With this comparison method, the most performant algorithm appears to be <strong>XGBoost</strong>.</p>

<p style = 'font-size:18px;font-family:Arial'><b>Find the best parameters for XGB</b></p>

<p style = 'font-size:16px;font-family:Arial'>
    The default parameters of an algorithm are rarely the best ones for our dataset. Using sklearn's <code>GridSearchCV</code>, we can build a parameter grid and try every combination. At the end, we inspect <code>best_score_</code> for the best cross-validated score; the <code>best_params_</code> and <code>best_estimator_</code> attributes hold the winning parameters and the refitted model.
</p>
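
<p style = 'font-size:16px;font-family:Arial'>"All the possible combinations" grows quickly, so it helps to estimate the cost before launching a search. The sketch below enumerates the combinations for the grid values and fold count used in this notebook, with plain Python. The name <code>demo_grid</code> is hypothetical and is chosen so that it does not clash with the notebook's own variables.</p>

```python
from itertools import product

# Same values as the grid used in this notebook.
demo_grid = {
    "learning_rate": [0.1, 0.2],
    "max_depth": [4, 6, 8],
}
cv_folds = 10

# Every combination of one value per parameter.
combinations = [dict(zip(demo_grid, values)) for values in product(*demo_grid.values())]
total_fits = len(combinations) * cv_folds

print(f"{len(combinations)} candidates x {cv_folds} folds = {total_fits} fits")
```

<p style = 'font-size:16px;font-family:Arial'>With the default <code>refit=True</code>, <code>GridSearchCV</code> performs one additional fit of the best candidate on the full training set.</p>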

In [None]:
param_grid = {
    'learning_rate': [0.1, 0.2],
    'max_depth': [4, 6, 8]
}

model_xgb = XGBClassifier()

grid_search = GridSearchCV(
    estimator=model_xgb,
    param_grid=param_grid,
    cv=10,
    scoring='accuracy'
)
grid_search.fit(train_set_scaled, train_set_labels)

In [None]:
# Print the best score found
grid_search.best_score_
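
<p style = 'font-size:16px;font-family:Arial'>The score above is a cross-validated score on the training data. A natural follow-up, sketched below, is to read the winning parameters and check the refitted model on held-out data. To keep the example small and self-contained, it uses synthetic data and swaps in logistic regression for XGBoost. Both are assumptions made for illustration, not the notebook's method.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the scaled PIMA data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7
)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# best_estimator_ is already refitted on the whole training set.
test_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))

print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"held-out accuracy: {test_accuracy:.3f}")
```

<p style = 'font-size:16px;font-family:Arial'>In this notebook, the equivalent call would be <code>grid_search.best_estimator_.predict(test_set_scaled)</code>, compared against <code>test_set_labels</code>.</p>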

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>5. Cleanup</b>
<div class="alert alert-block alert-info">
    <p style = 'font-size:16px;font-family:Arial'>If you are done with the ModelOps use case, please uncomment and run the cleanup section below.</p>
</div>

<p style = 'font-size:18px;font-family:Arial'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
# %run -i ../UseCases/run_procedure.py "call remove_data('DEMO_ModelOps');"        # Takes 10 seconds

In [None]:
remove_context()

<p style = 'font-size:18px;font-family:Arial'><b>Credits</b></p>

<a href="https://github.com/andreagrandi/ml-pima-notebook">https://github.com/andreagrandi/ml-pima-notebook</a>

[<< Back to Getting Started](./01_ModelOps_Getting_Started.ipynb) | [Continue to PIMA PMML BYOM >>](./03_ModelOps_BYOM_PIMA_PMML.ipynb)

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023. All Rights Reserved
        </div>
    </div>
</footer>