<img src="https://cybersecurity-excellence-awards.com/wp-content/uploads/2017/06/366812.png">


<h1><center>Darwin Forecasting Model Building </center></h1>


# Prior to getting started:

Darwin notebook will no longer support 'Register User' starting from 2.0. As a user, you must have credentials ready before using this notebook. 

In order to proceed, in the Environment Variables cell: 
1. Set your username and password to ensure that you're able to log in successfully
2. Set the path to the location of your datasets if you are using your own data.  The path is set for examples.
  <br><b>NOTE:</b> We provide two ways to analyze feature importance. One is to use the entire dataset; the other one is to analyze a few samples to understand individual samples. In the latter case, we advise users to use a small dataset (<=500) because it takes long time to process individual samples. 

Here are a few things to be mindful of:
1. For every run, check the job status (i.e. requested, failed, running, completed) and wait for job to complete before proceeding. 
2. If you're not satisfied with your model and think that Darwin can benefit from extra training, use the resume function.

## Import Necessary Libraries

In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import os
import datetime
from IPython.display import Image
from time import sleep
from amb_sdk.sdk import DarwinSdk

## Set Darwin SDK

In [10]:
ds = DarwinSdk()
ds.set_url('https://darwin-api.sparkcognition.com/v1/')

(True, 'https://amb-api.sparkcognition.com/v1/')

## Environment Variables

In [11]:
#Set your user id and password accordingly
USER="[your Darwin user id]"
PW="[your Darwin password]"

# Set path to datasets - The default below assumes Jupyter was started from amb-sdk/examples/Enterprise/
# Modify accordingly if you wish to use your own data
PATH_TO_DATASET = '../../sets/'
TRAIN_DATASET = 'sine_train.csv'
TEST_DATASET = 'sine_test.csv'

# A timestamp is used to create a unique name in the event you execute the workflow multiple times or with 
# different datasets.  File names must be unique in Darwin.
ts = '{:%Y%m%d%H%M%S}'.format(datetime.datetime.now())

# User Login

In [12]:
status, msg = ds.auth_login_user(USER, PW)
if not status:
    print(msg)
else:
    print('You are logged in.')

You are logged in.


# Data Upload

**Read dataset and view a file snippet**

In [13]:
df = pd.read_csv(os.path.join(PATH_TO_DATASET, TRAIN_DATASET))
df.head()

Unnamed: 0,x,y
0,0.0,0.0
1,0.591959,0.591959
2,1.130872,1.130872
3,1.568445,1.568445
4,1.865468,1.865468


**Upload training dataset to Darwin**

In [14]:
status, dataset = ds.upload_dataset(os.path.join(PATH_TO_DATASET, TRAIN_DATASET))
if not status:
    print(dataset)
else:
    print('Data is successfully uploaded!')

400: BAD REQUEST - {"message": "Dataset already exists"}



#### **Upload testing dataset to Darwin**

In [16]:
status, dataset = ds.upload_dataset(os.path.join(PATH_TO_DATASET, TEST_DATASET))
if not status:
    print(dataset)
else:
    print('Data is successfully uploaded!')

400: BAD REQUEST - {"message": "Dataset already exists"}



# Analyze Data
Analyze data is a necessary step before cleaning data and creating model. 

In [17]:
status, analyze_id = ds.analyze_data(TRAIN_DATASET, 
                                     job_name = 'Darwin_analyze_data_job' + "-" + ts, 
                                     artifact_name = 'Darwin_analyze_data_artifact' + "-" + ts)
sleep(1)
if status:
    ds.wait_for_job('Darwin_analyze_data_job' + "-" + ts)
else:
    print(analyze_id)

{'status': 'Complete', 'starttime': '2019-07-19T11:28:36.811098', 'endtime': None, 'percent_complete': 100, 'job_type': 'AnalyzeData', 'loss': None, 'generations': None, 'dataset_names': ['sine_train.csv'], 'artifact_names': ['Darwin_analyze_data_artifact-20190719112807'], 'model_name': None, 'job_error': None}


In [18]:
ds.lookup_job_status_name(analyze_id['job_name'])

(True,
 {'status': 'Complete',
  'starttime': '2019-07-19T11:28:36.811098',
  'endtime': None,
  'percent_complete': 100,
  'job_type': 'AnalyzeData',
  'loss': None,
  'generations': None,
  'dataset_names': ['sine_train.csv'],
  'artifact_names': ['Darwin_analyze_data_artifact-20190719112807'],
  'model_name': None,
  'job_error': None})

# Clean Data

Starting Version 1.6, Darwin SDK offers a way to clean your data outside of model training. Every dataset needs to be cleaned before creating a model. There is no need to save the cleaned data and upload it, but users need to specify the target name before running.

In [19]:
target = 'y'
status, job_id = ds.clean_data(dataset_name=TRAIN_DATASET, target=target)
if not status:
    print(job_id)
else:
    print('Data has been successfully cleaned!')

Data has been successfully cleaned!


# Create and Train Model 

In the cell below, specify the parameters used to create the forecasting model:
- model: the name of your model
- forecast_horizon: the forecast length, which is how far you need to predict in the future, int type
- max_train_time: the amount of time used for training (possibly shorter with early stopping)

In [20]:
ts = '{:%Y%m%d%H%M%S}'.format(datetime.datetime.now())
model = "model" + "-" + ts
forecast_horizon = 2
max_train_time = '00:02'
status, job_id = ds.create_model(dataset_names=TRAIN_DATASET,
                                 model_name=model,
                                 forecast_horizon=forecast_horizon,
                                 fit_profile_name=job_id['profile_name'],
                                 max_train_time=max_train_time)
sleep(1)
if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

{'status': 'Running', 'starttime': '2019-07-19T11:29:01.594042', 'endtime': None, 'percent_complete': 0, 'job_type': 'TrainModel', 'loss': None, 'generations': 0, 'dataset_names': ['sine_train.csv'], 'artifact_names': None, 'model_name': 'model-20190719112807', 'job_error': ''}
{'status': 'Running', 'starttime': '2019-07-19T11:29:01.594042', 'endtime': None, 'percent_complete': 0, 'job_type': 'TrainModel', 'loss': None, 'generations': 0, 'dataset_names': ['sine_train.csv'], 'artifact_names': None, 'model_name': 'model-20190719112807', 'job_error': ''}
{'status': 'Running', 'starttime': '2019-07-19T11:29:01.594042', 'endtime': None, 'percent_complete': 0, 'job_type': 'TrainModel', 'loss': None, 'generations': 0, 'dataset_names': ['sine_train.csv'], 'artifact_names': None, 'model_name': 'model-20190719112807', 'job_error': ''}
{'status': 'Running', 'starttime': '2019-07-19T11:29:01.594042', 'endtime': None, 'percent_complete': 0, 'job_type': 'TrainModel', 'loss': None, 'generations': 0, 

In [12]:
# look up job status
ds.lookup_job_status_name(job_id['job_name'])

(True,
 {'status': 'Complete',
  'starttime': '2019-07-16T14:29:45.649993',
  'endtime': '2019-07-16T14:54:58.737844',
  'percent_complete': 100,
  'job_type': 'TrainModel',
  'loss': 1.0151908758082584e-09,
  'generations': 145,
  'dataset_names': ['sine_train.csv'],
  'artifact_names': None,
  'model_name': 'model-20190716142903',
  'job_error': ''})

In [13]:
# look up the model details
ds.lookup_model_name(job_id['model_name'])

(True,
 {'id': '115fcfa4-a800-11e9-9c2d-5f98d0e68267',
  'name': 'model-20190716142903',
  'type': 'Supervised',
  'problem_type': None,
  'updated_at': '2019-07-16T14:54:58.691581',
  'trained_on': ['sine_train.csv'],
  'trained_on_id': ['00ac11ea-a800-11e9-9c2d-73227e8ef7ae'],
  'loss': 1.0151908758082584e-09,
  'complete': True,
  'generations': 145,
  'parameters': {'target': 'y',
   'impute': 'mean',
   'recurrent': None,
   'big_data': False,
   'max_unique_values': 50,
   'forecast_horizon': 2,
   'max_int_uniques': 15,
   'train_time': '00:20'},
  'description': {'recurrent': None,
   'best_genome': [{'layer 1': {'type': 'LinearGene',
      'parameters': {'activation': 'relu', 'numunits': 2}}},
    {'layer 2': {'type': 'LinearGene',
      'parameters': {'activation': 'identity', 'numunits': 1}}}],
   'genome_type': 'DeepNet'},
  'train_time_seconds': 1513,
  'algorithm': None,
  'running_job_id': None})

## Extra Training (Optional)
Run the following cell for extra training, specify the amount of time for extra training using `max_train_time` 

In [14]:
max_train_time = '00:01'
status, job_id = ds.resume_training_model(dataset_names=TRAIN_DATASET,
                                          model_name=model,
                                          max_train_time=max_train_time)
sleep(1)
if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

{'status': 'Running', 'starttime': '2019-07-15T17:02:56.793773', 'endtime': None, 'percent_complete': 0, 'job_type': 'UpdateModel', 'loss': 2.263362148369197e-06, 'generations': 17, 'dataset_names': ['sine_train.csv'], 'artifact_names': None, 'model_name': 'model-20190715165920', 'job_error': ''}
{'status': 'Running', 'starttime': '2019-07-15T17:02:56.793773', 'endtime': None, 'percent_complete': 11, 'job_type': 'UpdateModel', 'loss': 2.263362148369197e-06, 'generations': 26, 'dataset_names': ['sine_train.csv'], 'artifact_names': None, 'model_name': 'model-20190715165920', 'job_error': ''}
{'status': 'Running', 'starttime': '2019-07-15T17:02:56.793773', 'endtime': None, 'percent_complete': 37, 'job_type': 'UpdateModel', 'loss': 1.5302066458389163e-06, 'generations': 30, 'dataset_names': ['sine_train.csv'], 'artifact_names': None, 'model_name': 'model-20190715165920', 'job_error': ''}
{'status': 'Running', 'starttime': '2019-07-15T17:02:56.793773', 'endtime': None, 'percent_complete': 6

## Predict
Run the following cell for prediction

In [None]:
# clean dataset
status, job_id = ds.clean_data(TEST_DATASET, model_name=model)

if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

In [29]:
# Test model
status, artifact = ds.run_model(dataset_name=TEST_DATASET, 
                                model_name=model, 
                                forecast_horizon=forecast_horizon)
sleep(1)
if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

{'status': 'Complete', 'starttime': '2019-07-16T16:50:13.469606', 'endtime': None, 'percent_complete': 100, 'job_type': 'CleanData', 'loss': None, 'generations': None, 'dataset_names': ['sine_test.csv'], 'artifact_names': None, 'model_name': None, 'job_error': None}


In [31]:
# Get predictions
status, prediction = ds.download_artifact(artifact['artifact_name'])

In [15]:
# View prediction
prediction.head()

Unnamed: 0,anomaly_score,predict_proba,prediction
0,-16.743278,"[0.9989219908860879, 0.0010780091139121865]",0
1,-16.743278,"[0.9997870887479495, 0.0002129112520504335]",0
2,-16.743278,"[0.9954133220147088, 0.004586677985291236]",0
3,-16.743278,"[0.99987644529621, 0.00012355470379009663]",0
4,-5.883411,"[0.3316821039130491, 0.6683178960869509]",1


## Visualization

In [None]:
seq_len = model_info['description']['best_genome'][0]['layer 1']['parameters']['seqlength']
preds = prediction.iloc[::seq_len, :].values.reshape(-1)
plt.rcParams['figure.figsize']= 20,20
plt.plot(df[target][seq_len:].reset_index(drop=True), color='b')
plt.plot(preds, color='r')