<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       ModelOps demo: R GBM using Git
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

![image](images/git_meth.png) 

<p style = 'font-size:18px;font-family:Arial'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>This notebook covers the operationalization of the PIMA diabetes use case with the R GBM algorithm. The <strong>gbm</strong> R package implements extensions to Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine.</p>

<p style = 'font-size:16px;font-family:Arial'>In this example, we will use the GBM algorithm to generate both R model formats and operationalize them through ModelOps in the same Model Catalog as trained models based on other libraries and languages.</p>

<p style = 'font-size:18px;font-family:Arial'><b>Prerequisites</b></p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li>Access to Teradata Vantage</li>
    <li>Access to VAL</li>
    <li>Access to BYOM</li>
    <li>Have already gone through Notebook 1 - ModelOps Setup </li>
    <li>Have already gone through Notebook 7 - ModelOps CLI and GIT Setup </li>
</ul>

<p style = 'font-size:18px;font-family:Arial'><b>Steps in this Notebook</b></p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>Configure the Environment </li>
    <li>Connect to Vantage</li>
    <li>Define Training function </li>
    <li>Define Evaluate function </li>
    <li>Define Scoring function</li>
    <li>Define Model Metadata</li>
    <li>Commit and Push to Git to let ModelOps manage</li>
</ol>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>1. Configure the Environment</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

<p style = 'font-size:18px;font-family:Arial'><b>1.1 Libraries installation</b></p>

<p style = 'font-size:16px;font-family:Arial'>Ensure you have the following libraries installed in order to be able to run this notebook.</p>

<p style = 'font-size:16px;font-family:Arial'>Run this in a terminal (File -> New -> Terminal)</p>

```r
$ R

> install.packages(c("gbm", "tdplyr", "getPass", "caret", "e1071", "ids"))
```

<p style = 'font-size:16px;font-family:Arial'>Because you cannot install packages to the base system, you’ll be prompted to create a personal (local user) R library. Type Yes.</p>

<p style = 'font-size:16px;font-family:Arial'>When prompted to select a CRAN mirror, choose “USA (OR)” by typing the number shown to the left of it.</p>

<p style = 'font-size:16px;font-family:Arial'>Restart the kernel afterwards so that the newly installed packages are picked up.</p>

<p style = 'font-size:16px;font-family:Arial'><b>Hint:</b> <i>The easy way to restart the kernel and bring the newly installed software into memory is to type zero zero (<b>0 0</b>).</i></p>
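
<p style = 'font-size:16px;font-family:Arial'>After the restart, it can be useful to confirm that the packages are visible to the kernel. The following is a minimal sketch using only base R; the helper name <code>check_packages</code> is an illustrative assumption and not part of ModelOps.</p>

```r
# Hypothetical helper: report which packages can be found in the library paths.
check_packages <- function(packages) {
    # requireNamespace() loads the namespace without attaching the package
    # and returns TRUE or FALSE instead of raising an error.
    vapply(packages, requireNamespace, logical(1), quietly = TRUE)
}

required <- c("gbm", "tdplyr", "getPass", "caret", "e1071", "ids")
status <- check_packages(required)
print(status)

# List any packages that still need to be installed
missing_packages <- names(status)[!status]
if (length(missing_packages) > 0) {
    message("Missing packages: ", paste(missing_packages, collapse = ", "))
}
```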

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>1.2 Libraries import</b></p>

In [None]:
LoadPackages <- function() {
    # require() only reports whether a package could be attached; install.packages()
    # does not attach it, so call library() explicitly after installing.
    packages <- c('gbm', 'tdplyr', 'getPass', 'caret', 'e1071', 'ids')
    for (pkg in packages) {
        if (!require(pkg, character.only = TRUE)) {
            # caret is installed together with its optional dependencies
            install.packages(pkg, dependencies = if (pkg == 'caret') TRUE else NA)
            library(pkg, character.only = TRUE)
        }
    }
}
suppressPackageStartupMessages(LoadPackages())

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>2. Connect to Vantage</b></p>

In [None]:
# Create Vantage connection using tdplyr

con = NULL
aoa_create_context <- function(connection = con) {
    if (is.null(connection)) {
        # host = readline("Host: ");
        host = 'host.docker.internal'
        # username = readline("Username: ");
        username = 'demo_user'
        db_name <- username;
        password = getPass::getPass("Password: ");
        connection <- td_create_context(host=host, uid=username, pwd=password, dType="native", logmech="TDNEGO");

        # Set connection context
        td_set_context(connection);

        DBI::dbExecute(connection, "SET QUERY_BAND = 'appVersion=7.0;appName=VMO;appFunc=R;org=teradata-internal-telem;' FOR SESSION VOLATILE")
        DBI::dbExecute(connection, paste("DATABASE", db_name))
        message(paste("Using this database for table/views lookup and temp objects:", db_name))
    }
    return(connection)
}

con <- aoa_create_context()

# set the path to the local project repository for this model demo
model_local_path <- '~/modelops-demo-models/model_definitions/pima_r_gbm'
system(sprintf("mkdir -p %s/model_modules", model_local_path))
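
<p style = 'font-size:16px;font-family:Arial'>The connection function above hard-codes the host and username for the demo environment. As a sketch of one alternative, the settings could be read from environment variables with the demo values as fallbacks. The variable names <code>VANTAGE_HOST</code> and <code>VANTAGE_USER</code> and the helper <code>get_connection_settings</code> are assumptions made for illustration.</p>

```r
# Hypothetical helper: read connection settings from environment variables,
# falling back to the demo defaults used in this notebook.
get_connection_settings <- function() {
    list(
        host = Sys.getenv("VANTAGE_HOST", unset = "host.docker.internal"),
        username = Sys.getenv("VANTAGE_USER", unset = "demo_user")
    )
}

settings <- get_connection_settings()
message("Host: ", settings$host, ", user: ", settings$username)
```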

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>3. Define Training Function</b></p>

<p style = 'font-size:16px;font-family:Arial'>The training function takes the following shape</p>

```R
train <- function(data_conf, model_conf, ...) {
    # Connect to Vantage
    con <- aoa_create_context()
    
    # your training code
    
    # save your model
    saveRDS(model, "artifacts/output/model.rds")
}
```

<p style = 'font-size:16px;font-family:Arial'>You can execute this from the CLI or directly within the notebook as shown.</p>

In [None]:
# Save as ~/modelops-demo-models/model_definitions/pima_r_gbm/model_modules/training.R
LoadPackages <- function() {
    library("gbm")
    library("DBI")
    library("dplyr")
    library("tdplyr")

}

suppressPackageStartupMessages(LoadPackages())

train <- function(data_conf, model_conf, ...) {
    # Connect to Vantage
    con <- aoa_create_context()

    table <- tbl(con, sql(data_conf$sql))

    # Create a data frame from the remote table, selecting the necessary columns and casting integer64 to integer
    # select both the feature and target columns (ignoring e.g. the entity key)
    columns <- unlist(c(data_conf$featureNames, data_conf$targetNames), use.names = FALSE)
    data <- table %>% select(all_of(columns)) %>% mutate(
                       NumTimesPrg = as.integer(NumTimesPrg),
                       PlGlcConc = as.integer(PlGlcConc),
                       BloodP = as.integer(BloodP),
                       SkinThick = as.integer(SkinThick),
                       TwoHourSerIns = as.integer(TwoHourSerIns),
                       HasDiabetes = as.integer(HasDiabetes)) %>% as.data.frame()

    # Load hyperparameters from model configuration
    hyperparams <- model_conf[["hyperParameters"]]

    print("Training model...")

    # Train model
    model <- gbm(HasDiabetes~.,
                 data=data,
                 shrinkage=hyperparams$shrinkage,
                 distribution = 'bernoulli',
                 cv.folds=hyperparams$cv.folds,
                 n.trees=hyperparams$n.trees,
                 verbose=FALSE)

    print("Model Trained!")

    # Get optimal number of iterations
    if (hyperparams$cv.folds > 1) {
        best.iter <- gbm.perf(model, plot.it=FALSE, method="cv")
    }

    # clean the model (R stores the training dataset on the model object)
    model$data <- NULL

    # how to save only best.iter tree?
    # model$best.iter <- best.iter
    # model$trees <- light$trees[best.iter]

    # Save trained model
    print("Saving trained model...")
    # Check for NULL first: comparing NULL with "" yields logical(0), which && cannot evaluate reliably
    output_path <- if (!is.null(model_conf$outputPath) && model_conf$outputPath != "") model_conf$outputPath else "artifacts/output/"
    saveRDS(model, paste0(output_path, "model.rds"))
}
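
<p style = 'font-size:16px;font-family:Arial'>The training function writes the model to <code>model_conf$outputPath</code> when that entry is set and to <code>artifacts/output/</code> otherwise. The same fallback can be sketched as a small reusable helper; <code>resolve_path</code> and <code>example_conf</code> are hypothetical names used only for this illustration.</p>

```r
# Hypothetical helper: return `path` when it is a non-empty string, else `default`.
resolve_path <- function(path, default) {
    # is.null() is checked first so that a missing entry never reaches nzchar()
    if (!is.null(path) && nzchar(path)) path else default
}

example_conf <- list(outputPath = "artifacts/")

# Entry present: the configured directory is used
print(paste0(resolve_path(example_conf$outputPath, "artifacts/output/"), "model.rds"))

# Entry missing: the default directory is used
print(paste0(resolve_path(NULL, "artifacts/output/"), "model.rds"))
```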

In [None]:
# Define the ModelContext to test with. The ModelContext (data_conf and model_conf) is created and managed automatically by ModelOps 
# when it executes your code via the CLI / UI. However, for testing in the notebook, you can define it as follows

# define the training dataset
sql <- "
SELECT 
    F.*, D.hasdiabetes
FROM DEMO_ModelOps.PIMA_PATIENT_FEATURES F 
JOIN DEMO_ModelOps.PIMA_PATIENT_DIAGNOSES D
ON F.patientid = D.patientid
    WHERE D.patientid MOD 5 <> 0
";

feature_metadata <- list(
    # "database" = td_get_context()$default.database,
    "database" = 'DEMO_ModelOps',
    "table" = "aoa_feature_metadata"
);

hyperParameters <- list(
    "shrinkage" = 0.01,
    "cv.folds"=1,  # cv.folds has been reduced to the minimum to avoid performance issues in Jupyter (should be 5)
    "n.trees"=3000
);

entityKey = "PatientId"

targetNames <- list("HasDiabetes");

featureNames <- list("NumTimesPrg", "PlGlcConc", "BloodP", "SkinThick", "TwoHourSerIns", "BMI", "DiPedFunc", "Age");

data_conf <- list(
    "sql" = sql,
    "featureNames" = featureNames,
    "targetNames" = targetNames
)

model_conf = list(
    "hyperParameters" = hyperParameters,
    "outputPath" = "artifacts/"
)

# Make sure the output directory exists, then execute training
dir.create("artifacts", showWarnings = FALSE)
train(
    data_conf = data_conf,
    model_conf = model_conf,
    model_version = "rgbm_v1"
)

In [None]:
# Check the generated files
res <- system("ls -lh artifacts", intern=TRUE)
cat(res, sep = "\n")
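
<p style = 'font-size:16px;font-family:Arial'>The training function casts each integer column with a separate <code>as.integer</code> call inside <code>mutate</code>. Once the data is in a local data frame, the same casts can be applied to a vector of column names in one step. The sketch below uses made-up values standing in for the PIMA features; <code>example_data</code> and <code>integer_columns</code> are illustrative names.</p>

```r
# Made-up rows standing in for the PIMA feature table (values are illustrative only)
example_data <- data.frame(
    NumTimesPrg = c(1, 0, 3),
    PlGlcConc = c(85, 137, 78),
    BMI = c(26.6, 43.1, 31.0)
)

# Columns that should be stored as integers; BMI stays numeric
integer_columns <- c("NumTimesPrg", "PlGlcConc")
example_data[integer_columns] <- lapply(example_data[integer_columns], as.integer)

str(example_data)
```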

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>4. Define Evaluation Function</b></p>

<p style = 'font-size:16px;font-family:Arial'>The evaluation function takes the following shape</p>

```R
evaluate <- function(data_conf, model_conf, ...) {
    # Connect to Vantage
    con <- aoa_create_context()
    
    # Load model
    model <- readRDS("artifacts/input/model.rds")
    
    # your evaluation logic here
    
    # Save metrics
    write(jsonlite::toJSON(metrics, auto_unbox = TRUE, null = "null", keep_vec_names=TRUE), "artifacts/output/metrics.json")
}
```

<p style = 'font-size:16px;font-family:Arial'>You can execute this from the CLI or directly within the notebook as shown.</p>

In [None]:
# Save as ~/modelops-demo-models/model_definitions/pima_r_gbm/model_modules/evaluation.R
LoadPackages <- function() {
    library("methods")
    library("jsonlite")
    library("caret")
    library("gbm")
    library("DBI")
    library("dplyr")
    library("tdplyr")
}

evaluate <- function(data_conf, model_conf, ...) {
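    # Check for NULL first: comparing NULL with "" yields logical(0), which && cannot evaluate reliably
    input_path <- if (!is.null(model_conf$inputPath) && model_conf$inputPath != "") model_conf$inputPath else "artifacts/input/"
    model <- readRDS(paste0(input_path, "model.rds"))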
    print("Evaluating model...")

    suppressPackageStartupMessages(LoadPackages())

    # Connect to Vantage
    con <- aoa_create_context()

    table <- tbl(con, sql(data_conf$sql))

    # Create dataframe from tibble, selecting the necessary columns and mutating integer64 to integers
    data <- table %>% mutate(NumTimesPrg = as.integer(NumTimesPrg),
                                PlGlcConc = as.integer(PlGlcConc),
                                BloodP = as.integer(BloodP),
                                SkinThick = as.integer(SkinThick),
                                TwoHourSerIns = as.integer(TwoHourSerIns),
                                HasDiabetes = as.integer(HasDiabetes)) %>% as.data.frame()

    probs <- predict(model, data, na.action = na.pass, type = "response")
    preds <- as.integer(ifelse(probs > 0.5, 1, 0))

    cm <- confusionMatrix(table(preds, data$HasDiabetes))

    # Check for NULL first: comparing NULL with "" yields logical(0), which && cannot evaluate reliably
    output_path <- if (!is.null(model_conf$outputPath) && model_conf$outputPath != "") model_conf$outputPath else "artifacts/output/"

    png(paste0(output_path, "confusion_matrix.png"), width = 860, height = 860)
    fourfoldplot(cm$table)
    dev.off()

    # Overall statistics (accuracy, kappa, ...) as a named numeric vector
    metrics <- cm$overall

    # Save metrics
    write(jsonlite::toJSON(metrics, auto_unbox = TRUE, null = "null", keep_vec_names=TRUE), paste0(output_path, "metrics.json"))
}
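
<p style = 'font-size:16px;font-family:Arial'>The evaluation function thresholds the predicted probabilities at 0.5 and passes the resulting contingency table to <code>confusionMatrix</code>. The following base R sketch, with made-up probabilities and labels, shows the thresholding step and how the overall accuracy follows from the table. All <code>example_</code> names are illustrative.</p>

```r
# Made-up predicted probabilities and true labels (illustrative only)
example_probs <- c(0.9, 0.2, 0.6, 0.4, 0.7, 0.1)
example_truth <- c(1L, 0L, 0L, 0L, 1L, 1L)

# Same thresholding rule as in the evaluation function
example_preds <- as.integer(ifelse(example_probs > 0.5, 1, 0))

# Contingency table: rows are predictions, columns are true labels
cm_table <- table(preds = example_preds, truth = example_truth)
print(cm_table)

# Accuracy is the share of observations on the diagonal (4 of 6 here)
accuracy <- sum(diag(cm_table)) / sum(cm_table)
print(accuracy)
```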

In [None]:
# Define the ModelContext to test with. The ModelContext (data_conf and model_conf) is created and managed automatically by ModelOps 
# when it executes your code via the CLI / UI. However, for testing in the notebook, you can define it as follows

# define the evaluation dataset
sql <- "
SELECT 
    F.*, D.hasdiabetes 
FROM DEMO_ModelOps.PIMA_PATIENT_FEATURES F 
JOIN DEMO_ModelOps.PIMA_PATIENT_DIAGNOSES D
ON F.patientid = D.patientid
    WHERE D.patientid MOD 5 = 0
";

data_conf <- list(
    "sql" = sql,
    "featureNames" = featureNames,
    "targetNames" = targetNames
)

model_conf <- list(
    "outputPath" = "artifacts/",
    "inputPath" = "artifacts/"
)

# Execute evaluation
evaluate(
    data_conf = data_conf,
    model_conf = model_conf,
    model_version = "rgbm_v1"
)

In [None]:
# Check the generated files
res <- system("ls -lh artifacts", intern=TRUE)
cat(res, sep = "\n")
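
<p style = 'font-size:16px;font-family:Arial'>The training query keeps rows where <code>patientid MOD 5 &lt;&gt; 0</code>, while the evaluation query keeps rows where <code>patientid MOD 5 = 0</code>, so the two sets do not overlap. The sketch below mirrors that rule in R with hypothetical sequential ids; the real ids come from the PIMA tables and the exact proportions depend on them.</p>

```r
# Hypothetical patient ids; the real ids come from PIMA_PATIENT_FEATURES
patient_ids <- 1:100

train_ids <- patient_ids[patient_ids %% 5 != 0]  # mirrors "MOD 5 <> 0"
eval_ids <- patient_ids[patient_ids %% 5 == 0]   # mirrors "MOD 5 = 0"

print(length(train_ids))                        # 80
print(length(eval_ids))                         # 20
print(length(intersect(train_ids, eval_ids)))   # 0, the sets are disjoint
```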

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>5. Define Scoring Function</b></p>

<p style = 'font-size:16px;font-family:Arial'>The scoring function takes the following shape</p>

```R
score.batch <- function(data_conf, model_conf, model_version, job_id, ...) {
    # Connect to Vantage
    con <- aoa_create_context()
    
    # Load model
    model <- readRDS("artifacts/input/model.rds")
    
    # your scoring logic here
    
    # your scoring result saving logic here
}
```

<p style = 'font-size:16px;font-family:Arial'>You can execute this from the CLI or directly within the notebook as shown.</p>

In [None]:
# Save as ~/modelops-demo-models/model_definitions/pima_r_gbm/model_modules/scoring.R
library(methods)
library(gbm)
library(jsonlite)
library(caret)

LoadBatchScoringPackages <- function() {
    library("gbm")
    library("DBI")
    library("dplyr")
    library("tdplyr")
}

score.batch <- function(data_conf, model_conf, model_version, job_id, ...) {
    # Pass model_conf explicitly so the loader does not depend on a global variable
    model <- initialise_model(model_conf)
    print("Batch scoring model...")

    suppressPackageStartupMessages(LoadBatchScoringPackages())

    # Connect to Teradata Vantage
    con <- aoa_create_context()

    table <- tbl(con, sql(data_conf$sql))

    # Create dataframe from tibble, selecting the necessary columns and mutating integer64 to integers
    data <- table %>% mutate(PatientId = as.integer(PatientId),
                             NumTimesPrg = as.integer(NumTimesPrg),
                             PlGlcConc = as.integer(PlGlcConc),
                             BloodP = as.integer(BloodP),
                             SkinThick = as.integer(SkinThick),
                             TwoHourSerIns = as.integer(TwoHourSerIns)) %>% as.data.frame()

    # The model object was loaded above by 'initialise_model'
    probs <- predict(model, data, na.action = na.pass, type = "response")
    score <- as.integer(ifelse(probs > 0.5, 1, 0))
    print("Finished batch scoring model...")

    # create result dataframe and store in Teradata Vantage
    pred_df <- as.data.frame(unlist(score))
    colnames(pred_df) <- c("HasDiabetes")
    pred_df$PatientId <- data$PatientId
    pred_df$job_id <- job_id

    # tdplyr doesn't match column names on append, so to use the same table schema as the byom predict
    # example (see README.md), we must add an empty json_report column and change the column order manually (v17.0.0.4)
    # CREATE MULTISET TABLE pima_patient_predictions
    # (
    #     job_id VARCHAR(255), -- comes from airflow on job execution
    #     PatientId BIGINT,    -- entity key as it is in the source data
    #     HasDiabetes BIGINT,   -- if model automatically extracts target
    #     json_report CLOB(1048544000) CHARACTER SET UNICODE  -- output of
    # )
    # PRIMARY INDEX ( job_id );
    pred_df$json_report <- ""
    pred_df <- pred_df[, c("job_id", "PatientId", "HasDiabetes", "json_report")]

    copy_to(con, pred_df,
            name=dbplyr::in_schema(data_conf$predictions$database, data_conf$predictions$table),
            types = c("varchar(255)", "bigint", "bigint", "clob"),
            append=TRUE)
    print("Saved batch predictions...")
}

initialise_model <- function(model_conf = list()) {
    print("Loading model...")
    # Check for NULL first: comparing NULL with "" yields logical(0), which && cannot evaluate reliably
    input_path <- if (!is.null(model_conf$inputPath) && model_conf$inputPath != "") model_conf$inputPath else "artifacts/input/"
    readRDS(paste0(input_path, "model.rds"))
}
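
<p style = 'font-size:16px;font-family:Arial'>Because the append relies on column order rather than column names, the scoring function reorders the result columns before writing them. The sketch below shows that step in isolation with made-up scores, ids and a placeholder job id; all <code>example_</code> names are illustrative.</p>

```r
# Made-up scores and entity keys (illustrative only)
example_scores <- c(1L, 0L, 1L)
example_ids <- c(5L, 10L, 15L)

example_pred_df <- data.frame(HasDiabetes = example_scores)
example_pred_df$PatientId <- example_ids
example_pred_df$job_id <- "example-job-id"   # placeholder, normally supplied by the caller
example_pred_df$json_report <- ""

# Reorder columns to match the target table: job_id, PatientId, HasDiabetes, json_report
example_pred_df <- example_pred_df[, c("job_id", "PatientId", "HasDiabetes", "json_report")]
str(example_pred_df)
```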

In [None]:
# Define the ModelContext to test with. The ModelContext (data_conf and model_conf) is created and managed automatically by ModelOps 
# when it executes your code via the CLI / UI. However, for testing in the notebook, you can define it as follows

# define the scoring dataset
sql <- "
SELECT 
    F.*
FROM DEMO_ModelOps.PIMA_PATIENT_FEATURES F 
    WHERE F.patientid MOD 5 = 0
";

# where to store predictions
predictions <- list(
    "database" = td_get_context()$default.database,
    "table" = "pima_patient_predictions_tmp"
)

data_conf <- list(
    "sql" = sql,
    "featureNames" = featureNames,
    "targetNames" = targetNames,
    "predictions" = predictions
)

model_conf <- list(
    "inputPath" = "artifacts/"
)

job_id <- uuid::UUIDgenerate(1)

# Execute batch scoring
score.batch(
    data_conf = data_conf,
    model_conf = model_conf,
    model_version = "rgbm_v1",
    job_id = job_id
)

In [None]:
# Using tibble
tbl(con, sql(sprintf("SELECT * FROM %s.pima_patient_predictions_tmp WHERE job_id = '%s'", td_get_context()$default.database, job_id)))

In [None]:
# Using DBI
DBI::dbGetQuery(con, sprintf("SELECT * FROM %s.pima_patient_predictions_tmp WHERE job_id = '%s'", td_get_context()$default.database, job_id))

In [None]:
# Clean up

system('rm -f artifacts/*')
# dbExecute() runs the statement and releases the result, unlike dbSendQuery()
tryCatch(DBI::dbExecute(con, sprintf("DROP TABLE %s.rgbm_v1", td_get_context()$default.database)), error=function(cond) return(NA))
tryCatch(DBI::dbExecute(con, sprintf("DROP TABLE %s.pima_patient_predictions_tmp", td_get_context()$default.database)), error=function(cond) return(NA))

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>6. Define Model Metadata</b></p>

<p style = 'font-size:16px;font-family:Arial'>Now let's create the configuration files.</p>

<p style = 'font-size:16px;font-family:Arial'>Requirements file with the dependencies and versions:</p>

In [None]:
# Save as ~/modelops-demo-models/model_definitions/pima_r_gbm/model_modules/requirements.R
message('Installing packages')
if(!require('gbm')){install.packages('gbm')}
if(!require('devtools')){install.packages('devtools')}
if(!require('caret')){install.packages('caret')}

<p style = 'font-size:16px;font-family:Arial'>The hyperparameter configuration (default values):</p>

In [None]:
# Save as ~/modelops-demo-models/model_definitions/pima_r_gbm/config.json
{
    "hyperParameters": {
        "shrinkage": 0.01,
        "cv.folds": 5,
        "n.trees": 3000
    }
}
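
<p style = 'font-size:16px;font-family:Arial'>The notebook test run in section 3 uses <code>cv.folds = 1</code>, while the defaults here keep the value at 5. One way to express such an override on top of default values is base R's <code>modifyList</code>. This is a sketch for local experimentation only and makes no claim about how ModelOps itself merges configuration; the variable names are illustrative.</p>

```r
# Default values, copied from the config.json content above
default_hyperparameters <- list("shrinkage" = 0.01, "cv.folds" = 5, "n.trees" = 3000)

# Override used for quick notebook tests
example_overrides <- list("cv.folds" = 1)

# Entries in example_overrides replace the matching defaults; the rest are kept
example_hyperparams <- modifyList(default_hyperparameters, example_overrides)
str(example_hyperparams)
```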

<p style = 'font-size:16px;font-family:Arial'>The model configuration:</p>

In [None]:
# Save as ~/modelops-demo-models/model_definitions/pima_r_gbm/model.json
{
    "id": "d0d58b07-15f1-4421-8e56-3f30cc7c679c",
    "name": "R PIMA GBM",
    "description": "R PIMA GBM for Diabetes Prediction",
    "language": "R",
    "automation": {
        "training": {
            "resources": {
                "cpu": "1",
                "memory": "1Gi"
            }
        },
        "evaluation": {
            "resources": {
                "cpu": "1",
                "memory": "1Gi"
            }
        },
        "deployment": {
            "resources": {
                "cpu": "1",
                "memory": "1Gi"
            }
        }
    }
}

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>7. Commit and Push Changes</b></p>

<p style = 'font-size:16px;font-family:Arial'>Run the command below to commit and push changes to our forked repository, so ModelOps can fetch the changes to the model.</p>

In [None]:
res <- system(sprintf('cd %s/../.. && git add . && git commit -m "Added R PIMA GBM demo model 🎢" && git push', model_local_path), intern=TRUE)
cat(res, sep = "\n")

[<< Back to Git PIMA Python H2O AutoML](./08_ModelOps_GIT_PIMA_Python_H2OAutoML.ipynb) | [Continue to Git PIMA Python In database XGBoost >>](./11_ModelOps_GIT_PIMA_Python_indb_XGboost.ipynb)

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023. All Rights Reserved
        </div>
    </div>
</footer>