## Notebook: Contextual Anomaly Detection (CAD) Training

This is the first of a series of three notebooks that show how the Contextual Anomaly Detection (CAD) Accelerator can be used to train and deploy a prediction interval model into Monitor using the Model Factory service endpoints.
1. Training: cookbooks/contextual_anomaly_train.ipynb
2. Monitor Device Creation: cookbooks/contextual_anomaly_create_device.ipynb
3. Model Deployment: cookbooks/contextual_anomaly_deploy_model.ipynb

### CAD Description

The CAD training job produces three prediction interval models that capture the normal operation (non-anomalous) behaviour of a given target variable based on a set of input features. 

Each model is a multivariate regression model that produces a point estimate (the base regressor), wrapped in a conformal prediction procedure that produces a lower and an upper bound. Under normal operation (non-anomalous) conditions, the interval between these bounds contains the target variable with probability 95%, so the probability of observing the target outside the interval is 5%.
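
To make the idea of a conformal wrapper concrete, here is a minimal sketch of split conformal prediction in plain Python. It is an illustration under stated assumptions, not the Accelerator's implementation: the `base_regressor` function, the synthetic data, and the helper names are all hypothetical. The half-width of the interval is taken as a high quantile of the absolute residuals on held-out calibration data.

```python
import math
import random

def conformal_quantile(abs_residuals, alpha=0.05):
    """Quantile of absolute calibration residuals used as the interval half-width."""
    n = len(abs_residuals)
    # Finite-sample rank ceil((n + 1) * (1 - alpha)); capped at n for simplicity
    # (strictly, a rank above n would mean an unbounded interval).
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(abs_residuals)[rank - 1]

def base_regressor(x):
    """Stand-in for a fitted point-estimate model (assumption for this sketch)."""
    return 2.0 * x

def make_data(rng, n):
    """Synthetic 'normal operation' samples: y = 2x + Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 0.3)) for x in xs]

rng = random.Random(0)
calibration, test = make_data(rng, 500), make_data(rng, 2000)

# Calibrate the half-width on held-out data the regressor was not fitted on.
q = conformal_quantile([abs(y - base_regressor(x)) for x, y in calibration])

# The interval for each test point is [prediction - q, prediction + q].
covered = sum(abs(y - base_regressor(x)) <= q for x, y in test)
coverage = covered / len(test)
print(f"half-width={q:.3f}, empirical coverage={coverage:.3f}")
```

In this variant every interval has the same width, `2 * q`. Other conformal methods produce intervals whose width varies with the input.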


The three CAD models are saved in ONNX format for compatibility with MAS. The user can choose which model to deploy based on the performance summary provided by the training job.

### Wind Turbine Dataset Description: 

In this notebook we use the CAD Accelerator to learn a prediction interval model that covers the normal operation behaviour of the Average Reactive Power of a wind turbine asset with probability 95%. We use four input features to predict the target variable: Average Active Power, Average Generator Bearing 1 Temperature, Average Generator Bearing 2 Temperature, and Average Wind Speed.







<a id='notebook_workflow'></a>
### Notebook Workflow
- [Imports](#imports)
- [Load KPI specification file (yaml)](#load_kpiyaml)
- [Load Model Factory config file (yaml)](#load_mfyaml)
- [Load Dataset](#load_dataset)
- [Generate Training Payload](#cad_payload)
- [Post Training Job](#cad_post)
- [Request Training Summary](#cad_summary)
- [Save Model Information](#cad_modelinfo)


<a id='imports'></a>
### Imports

In [1]:
import pandas as pd
import requests
import time
from time import sleep
import yaml

!pwd

/Users/nataliamartinezgil/Documents/GitHub/supervised_anomaly_accelerator/cookbooks


<a id='load_kpiyaml'></a>
### Load KPI specification file (yaml)



This file is a dictionary containing:

    - asset_id_column: column corresponding to asset id
    - data_name: train dataset file name
    - device_description: device description (optional)
    - mas_device_name: monitor device name (needed for device creation and/or deployment)
    - feature_columns: feature column names as they appear in the first row of the dataset csv file, separated by ',' (e.g., P_avg,Gb1t_avg,Gb2t_avg,Ws_avg)
    - feature_names: feature columns interpretable names separated by ',' (same order as in feature_columns)
    - target_columns: target column name as in first row of dataset
    - target_names: target column interpretable name
    - timestamp_column: time stamp column name as in first row of dataset
    - timestamp_format: timestamp format string (e.g., '%m/%d/%Y %H:%M')
    - inference_data_name: test dataset file name (optional for inference)
    - feature_map: dictionary mapping feature columns to descriptions (optional)

In [2]:
input_file_name = "../config/Pavg_kpi.yml"

with open(input_file_name, 'r') as file:
    input_data = yaml.safe_load(file)

print('KPI specification file: ')
print(input_data)

KPI specification file: 
{'asset_id_column': 'Wind_turbine_name', 'data_name': 'Wind_Turbine_train.csv', 'device_description': 'Wind Turbine', 'mas_device_name': 'Wind_Turbine_Test_1', 'feature_columns': 'P_avg,Gb1t_avg,Gb2t_avg,Ws_avg', 'feature_names': 'P_average,Gblt_average,Gb2t_average,Ws_average', 'target_columns': 'Rs_avg', 'target_names': 'Rs_avg', 'timestamp_column': 'Date_time', 'timestamp_format': '%m/%d/%Y %H:%M', 'inference_data_name': 'Wind_Turbine_test.csv', 'feature_map': {'P_avg': 'Average Active Power', 'Rs_avg': 'Average Reactive Power', 'Gb1t_avg': 'Average Generator Bearing 1 Temperature', 'Gb2t_avg': 'Average Generator Bearing 2 Temperature', 'Gost_avg': 'Average Generator Outer Stator Temperature', 'Git_avg': 'Average Generator Inner Stator Temperature', 'Yt_avg': 'Average Yaw System Temperature', 'Ot_avg': 'Average Outdoor Temperature', 'Ws_avg': 'Average Wind Speed', 'Wa_avg': 'Average Wind Direction'}}
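
Because `feature_columns` and `feature_names` are comma-separated strings that must be listed in the same order, it can help to split and cross-check them before building a payload. The sketch below shows one way to do this. The helper name `parse_column_lists` is hypothetical, and the dictionary is a hand-copied subset of the specification printed above.

```python
def parse_column_lists(spec):
    """Split the comma-separated fields of a KPI spec and check that they align."""
    features = [c.strip() for c in spec["feature_columns"].split(",")]
    names = [c.strip() for c in spec["feature_names"].split(",")]
    targets = [c.strip() for c in spec["target_columns"].split(",")]
    if len(features) != len(names):
        raise ValueError(
            f"{len(features)} feature columns but {len(names)} feature names"
        )
    overlap = set(features) & set(targets)
    if overlap:
        raise ValueError(f"columns used as both feature and target: {sorted(overlap)}")
    # Map each raw column name to its interpretable name, preserving order.
    return dict(zip(features, names)), targets

# Subset of the KPI specification shown above.
spec = {
    "feature_columns": "P_avg,Gb1t_avg,Gb2t_avg,Ws_avg",
    "feature_names": "P_average,Gblt_average,Gb2t_average,Ws_average",
    "target_columns": "Rs_avg",
}
feature_map, targets = parse_column_lists(spec)
print(feature_map)
print(targets)
```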


<a id='load_mfyaml'></a>
### Load Model Factory config file (yaml)


This file is a dictionary containing:

    - endpoint_url: <ACTION: Replace with Model Factory endpoint URL>
    - train_recipe_endpoint: recipe/supervised-anomaly (DON'T CHANGE)
    - deploy_recipe_endpoint: deployment/monitor/model/create (DON'T CHANGE)
    - create_device_recipe_endpoint: deployment/monitor/device/create (DON'T CHANGE)

In [3]:
model_factory_config_file_name = "../config/model_factory_config.yml"

with open(model_factory_config_file_name, 'r') as file:
    model_factory_config = yaml.safe_load(file)

print(model_factory_config)

{'endpoint_url': 'http://127.0.0.1:8000/ibm/modelfactory/service/', 'train_recipe_endpoint': 'recipe/supervised-anomaly', 'deploy_recipe_endpoint': 'deployment/monitor/model/create', 'create_device_recipe_endpoint': 'deployment/monitor/device/create'}
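
The cells below build request URLs by concatenating `endpoint_url` with a recipe path, which only works when `endpoint_url` ends with a slash. As a hedged alternative, the sketch below joins the two parts with the standard library so that a missing or doubled slash does not matter. The helper name `build_endpoint` is an assumption made for illustration.

```python
from urllib.parse import urljoin

def build_endpoint(base_url, path):
    """Join a service base URL and a relative path with exactly one slash between them."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))

base = "http://127.0.0.1:8000/ibm/modelfactory/service"
train_url = build_endpoint(base, "recipe/supervised-anomaly")
same_url = build_endpoint(base + "/", "/recipe/supervised-anomaly")
print(train_url)
```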


<a id='load_dataset'></a>
### Load Dataset


In [4]:
prefix = "../data/"
data_df_1 = pd.read_csv(prefix + input_data['data_name'])
data_df_1.head(10)


Unnamed: 0,Wind_turbine_name,Date_time,P_avg,Rs_avg,Gb1t_avg,Gb2t_avg,Gost_avg,Git_avg,Yt_avg,Ot_avg,Ws_avg,Wa_avg
0,R80711,1/1/2017 3:10,152.44,10.73,55.68,53.849998,44.98,43.419998,22.48,-3.12,5.04,181.52
1,R80711,1/1/2017 3:20,187.89,11.36,56.759998,55.959999,45.66,44.220001,19.690001,-3.19,5.13,178.67999
2,R80711,1/1/2017 3:40,82.32,9.42,53.84,52.27,45.439999,42.419998,7.25,-3.62,4.49,182.61
3,R80711,1/1/2017 4:10,165.08,11.0,53.75,52.68,45.57,43.650002,16.68,-3.75,5.11,183.46001
4,R80711,1/1/2017 4:20,227.57001,11.96,55.82,55.77,46.330002,44.91,18.620001,-3.32,5.84,189.21001
5,R80711,1/1/2017 4:30,240.19,12.17,57.48,57.82,47.119999,45.900002,20.0,-3.15,5.83,187.58
6,R80711,1/1/2017 4:40,237.63,12.11,58.029999,58.110001,47.66,46.450001,20.93,-3.0,5.89,208.32001
7,R80711,1/1/2017 4:50,261.67999,12.44,59.25,59.84,48.349998,47.299999,21.99,-2.62,5.85,187.23
8,R80711,1/1/2017 5:00,198.02,11.54,59.040001,59.040001,48.720001,47.48,22.83,-2.33,5.55,186.36
9,R80711,1/1/2017 5:10,169.78,11.04,58.259998,57.689999,48.91,47.400002,23.110001,-2.38,5.45,186.24001
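
Before submitting a training job, it may be worth confirming that the dataset contains every column named in the KPI specification and that the timestamps match `timestamp_format`. The following sketch does this with the standard library only. The helper name `check_dataset` is hypothetical, and `SAMPLE` is a two-row excerpt of the table above with a subset of its columns.

```python
import csv
import io
from datetime import datetime

# Excerpt of the dataset shown above (subset of columns, first two rows).
SAMPLE = """Wind_turbine_name,Date_time,P_avg,Rs_avg,Gb1t_avg,Gb2t_avg,Ws_avg
R80711,1/1/2017 3:10,152.44,10.73,55.68,53.849998,5.04
R80711,1/1/2017 3:20,187.89,11.36,56.759998,55.959999,5.13
"""

def check_dataset(handle, required_columns, timestamp_column, timestamp_format):
    """Verify required columns exist and return the parsed timestamps."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in required_columns if c not in header]
    if missing:
        raise ValueError(f"dataset is missing columns: {missing}")
    # strptime raises ValueError on the first timestamp that does not match.
    return [datetime.strptime(row[timestamp_column], timestamp_format) for row in reader]

required = ["P_avg", "Gb1t_avg", "Gb2t_avg", "Ws_avg", "Rs_avg"]
timestamps = check_dataset(io.StringIO(SAMPLE), required, "Date_time", "%m/%d/%Y %H:%M")
print(timestamps)
```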


<a id='cad_payload'></a>
### Generate Training Payload



Training payload required information (read from the input_data dictionary loaded above):

- Payload:
    - "feature_columns": feature column names separated by ','
    - "target_columns": target column name
    - "feature_names": descriptive feature names separated by ',' (same order as "feature_columns")
    - "target_names": descriptive target name
- files:
    - "data_file": .csv data file with column names matching "feature_columns" and "target_columns" as specified in the payload.
    

In [5]:
payload = {'feature_columns': input_data['feature_columns'],
'feature_names': input_data['feature_names'],
'target_columns': input_data['target_columns'],
'target_names': input_data['target_names']}

data_folder = "../data/"

files=[
  ('data_file',(input_data['data_name'],open(data_folder + input_data['data_name'],'rb'),'text/csv'))
]

print('Payload :')
print(payload)

Payload :
{'feature_columns': 'P_avg,Gb1t_avg,Gb2t_avg,Ws_avg', 'feature_names': 'P_average,Gblt_average,Gb2t_average,Ws_average', 'target_columns': 'Rs_avg', 'target_names': 'Rs_avg'}


<a id='cad_post'></a>
### Post Training Job

In [6]:
endpoint_url = model_factory_config["endpoint_url"]
training_endpoint_url = endpoint_url + model_factory_config["train_recipe_endpoint"]

headers = {
  'accept': 'application/json'
}
start_time = time.time()

response = requests.request("POST", training_endpoint_url, headers=headers, data=payload, files=files)

In [7]:
response.raise_for_status()  # stop here if the service returned an HTTP error
post_r_json = response.json()
print(post_r_json)
job_id = post_r_json['job_id']
print()
print('job id:',job_id )

{'job_id': '932740d6-9ac6-4bce-8adf-9db53fe146ae', 'message': 'Job 932740d6-9ac6-4bce-8adf-9db53fe146ae was submitted.', 'status': 'INITIALIZING'}

job id: 932740d6-9ac6-4bce-8adf-9db53fe146ae


<a id='cad_summary'></a>
### Request Training Summary

The following code polls the service until the training job is complete and then prints a summary. The summary reports the performance of the three trained prediction interval models on the train, validation and test data.
We report the following metrics:
- R2 score: the R2 score of the base regressor; higher is better.
- RMSE: the root mean square error of the base regressor; lower is better.
- Coverage: the probability of finding the ground truth target inside the produced interval (this value should be greater than or equal to 0.95).
- Mean Width: the average prediction interval width.
- Median Width: the median prediction interval width.
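
For reference, the metrics listed above can be computed from arrays of targets, point predictions and interval bounds as sketched below. These are the standard textbook definitions. Whether the service computes them in exactly this way is an assumption, and the function name and toy numbers are illustrative only.

```python
import math
import statistics

def interval_metrics(y_true, y_pred, lower, upper):
    """Compute RMSE, R2, coverage, and mean/median interval width."""
    n = len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_y = sum(y_true) / n
    sst = sum((t - mean_y) ** 2 for t in y_true)
    widths = [u - l for l, u in zip(lower, upper)]
    # A sample is covered when its true target lies inside [lower, upper].
    covered = sum(l <= t <= u for t, l, u in zip(y_true, lower, upper))
    return {
        "rmse": math.sqrt(sse / n),
        "r2": 1.0 - sse / sst,
        "coverage": covered / n,
        "mean_width": statistics.fmean(widths),
        "median_width": statistics.median(widths),
    }

# Toy values chosen so that two of the four targets fall outside their interval.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
lower = [1.1, 1.6, 2.1, 3.6]
upper = [1.9, 2.4, 2.9, 4.4]
metrics = interval_metrics(y_true, y_pred, lower, upper)
print(metrics)
```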

In [8]:

log_url = endpoint_url + "log/"
summary_url = endpoint_url + "summary/"

print('Summary URL :: ')
print(summary_url + job_id)
print()
while True:
    get_response = requests.get(summary_url + job_id, headers={})
    json_data = get_response.json()
    print("json_data",json_data)
    if 'status' in json_data:
        print('status :', json_data['status'])
        print()
        if json_data['status'] == 'DONE':
            print()
            print('-------------  Trained Models -------------  ')
            print()
            performance_dictionary = json_data['summary']['performance_dictionary']
            for model in performance_dictionary.keys():
                print('model name : ',model)
                for key in performance_dictionary[model]:
                    if key != 'performance':
                        print(key, ' : ',performance_dictionary[model][key])
                print()
                print('Performance : ')
                for dataset in performance_dictionary[model]['performance'].keys():
                    print(dataset + ' set')
                    print(performance_dictionary[model]['performance'][dataset])
                    print()


                print('----------------')
                print()
            finish_time = time.time()
            break

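        else:
            print('STATUS : ',json_data['status'] )
            print('retry later .... ')
    # Pause between polls; placed at loop level so that a response without a
    # 'status' field does not turn the loop into a busy loop.
    sleep(2)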


Summary URL :: 
http://127.0.0.1:8000/ibm/modelfactory/service/summary/932740d6-9ac6-4bce-8adf-9db53fe146ae

json_data {'job_id': '932740d6-9ac6-4bce-8adf-9db53fe146ae', 'status': 'UNKNOWN', 'error': 'The job is not found in the database. It is either because (1) job_id is incorrect, OR (2) the job was just created. Please try this endpoint a few seconds later.'}
status : UNKNOWN

STATUS :  UNKNOWN
retry later .... 
json_data {'job_id': '932740d6-9ac6-4bce-8adf-9db53fe146ae', 'status': 'UNKNOWN', 'error': 'The job is not found in the database. It is either because (1) job_id is incorrect, OR (2) the job was just created. Please try this endpoint a few seconds later.'}
status : UNKNOWN

STATUS :  UNKNOWN
retry later .... 
json_data {'job_id': '932740d6-9ac6-4bce-8adf-9db53fe146ae', 'status': 'UNKNOWN', 'error': 'The job is not found in the database. It is either because (1) job_id is incorrect, OR (2) the job was just created. Please try this endpoint a few seconds later.'}
status : UNK
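
The polling loop above runs until the service reports `DONE`, so it never exits if the job does not reach that status. A bounded variant is sketched below under stated assumptions: `poll_until_done` and `fake_fetch` are hypothetical names, the status sequence is simulated rather than fetched from the service, and the 600 second default timeout is an arbitrary choice.

```python
import time

def poll_until_done(fetch_status, interval_s=2.0, timeout_s=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it reports 'DONE' or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        data = fetch_status()
        if data.get("status") == "DONE":
            return data
        if clock() >= deadline:
            raise TimeoutError(f"job not DONE after {timeout_s} s; last response: {data}")
        sleep(interval_s)  # always pause, even when 'status' is missing

# Simulated responses standing in for requests.get(summary_url + job_id).json().
responses = iter([
    {"status": "UNKNOWN"},
    {"status": "INITIALIZING"},
    {"status": "DONE", "summary": {}},
])
calls = []

def fake_fetch():
    calls.append(1)
    return next(responses)

# The sleep is replaced with a no-op so the demonstration finishes immediately.
result = poll_until_done(fake_fetch, sleep=lambda seconds: None)
print(result["status"], "after", len(calls), "polls")
```

In the notebook, `fetch_status` could be a small function that wraps the `requests.get` call from the cell above and returns its parsed JSON.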

In [9]:
print('Trained Model Names')
print(performance_dictionary.keys())
print('-----------------')
print()

# !!ACTION :: Here you can choose any of the models in performance_dictionary.keys()
model_selected = list(performance_dictionary.keys())[0]

print('Selected Model is : ', model_selected)
for key in performance_dictionary[model_selected]:
    if key != 'performance':
        print(key, ' : ',performance_dictionary[model_selected][key])
print()
print('Performance : ')
for dataset in performance_dictionary[model_selected]['performance'].keys():
    print(dataset + ' set')
    print(performance_dictionary[model_selected]['performance'][dataset])
    print()
print('----------------')
print()

Trained Model Names
dict_keys(['PI_model_base_LGBM1_best', 'PI_model_cv_base_LGBM1_best', 'PI_model_prefit_LGBM1_best'])
-----------------

Selected Model is :  PI_model_base_LGBM1_best
base_model  :  LGBM1_best.pkl
PI_method  :  base
model_uri  :  s3://testdataupload/48/f5b935499e3e488cbd83235cb376343b/artifacts/PI_model_base_LGBM1_best.onnx

Performance : 
test set
{'rmse': 0.23386329971115025, 'r2': 0.96644103313591, 'coverage': 0.9575, 'mean_width': 0.642869746819347, 'median_width': 0.6428697468193469}

val set
{'rmse': 0.20541198294790505, 'r2': 0.974986008857764, 'coverage': 0.956, 'mean_width': 0.642869746819347, 'median_width': 0.6428697468193469}

train set
{'rmse': 0.16270706709836444, 'r2': 0.9824199140604926, 'coverage': 0.9836666666666667, 'mean_width': 0.6428697468193469, 'median_width': 0.6428697468193469}

----------------
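
Instead of always taking the first model, the choice can be made from the summary itself. One possible rule, offered here as an assumption rather than a recommendation from the Accelerator, is to keep only models whose validation coverage reaches the target and then pick the narrowest interval. The model names and numbers in `example` are made up for illustration; only the nesting of the dictionary follows the summary printed above.

```python
def select_model(performance_dictionary, split="val", min_coverage=0.95):
    """Pick the model with the smallest mean width among those meeting the coverage target."""
    candidates = {
        name: info["performance"][split]
        for name, info in performance_dictionary.items()
        if info["performance"][split]["coverage"] >= min_coverage
    }
    if not candidates:
        raise ValueError(f"no model reaches coverage {min_coverage} on the {split} set")
    return min(candidates, key=lambda name: candidates[name]["mean_width"])

# Hypothetical summary with illustrative numbers (not results from the service).
example = {
    "model_a": {"performance": {"val": {"coverage": 0.96, "mean_width": 0.70}}},
    "model_b": {"performance": {"val": {"coverage": 0.95, "mean_width": 0.60}}},
    "model_c": {"performance": {"val": {"coverage": 0.93, "mean_width": 0.50}}},
}
chosen = select_model(example)
print("Selected model:", chosen)
```

Here `model_c` has the narrowest interval but is excluded because its coverage is below the target.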



<a id='cad_modelinfo'></a>
### Save Model Information

Information about the selected model is saved in 'config/model_info.yml'. It is needed later to deploy the model into Monitor.

In [10]:
if 'status' in json_data:
    if json_data['status'] == 'DONE':
        output_data = {
            # Use the model chosen above, not the leftover loop variable `model`.
            "onnx_model_uri" : performance_dictionary[model_selected]['model_uri'],
            "train_job_id" : job_id,
            "mas_device_name" : input_data['mas_device_name']
        }
        with open("../config/model_info.yml","w") as file:
            yaml.dump(output_data, file)
        print('Model Information: ')
        print(output_data)

Model Information: 
{'onnx_model_uri': 's3://testdataupload/48/f5b935499e3e488cbd83235cb376343b/artifacts/PI_model_base_LGBM1_best.onnx', 'train_job_id': '932740d6-9ac6-4bce-8adf-9db53fe146ae', 'mas_device_name': 'Wind_Turbine_Test_1'}
