<img src="https://cybersecurity-excellence-awards.com/wp-content/uploads/2017/06/366812.png">


<h1><center>Darwin Unsupervised Model Building </center></h1>


Before getting started, do the following:
1. Enter your username and password and confirm that you can log in successfully.
2. Set the path to your dataset. If you leave it unset, you will be testing an example dataset on the server.
3. Set the dataset path for feature importance:
  - For global feature importance, the dataset path is the same as for your original dataset.
  - For per-row feature importance, specify a path to a dataset that contains no more than 500 rows.

Once you're up and running, here are a few things to be mindful of:
1. For every run, check the job status (i.e. requested, failed, running, completed) and wait for the job to complete before proceeding.
2. If you're not satisfied with your model and think that Darwin can benefit from extra training, use the resume function.

## Import libraries

In [39]:
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import Image
from time import sleep
import os

from amb_sdk.sdk import DarwinSdk

# A timestamp is used to create a unique name in the event you execute the workflow multiple times or with 
# different datasets.  File names must be unique in Darwin.
import datetime
ts = '{:%Y%m%d%H%M%S}'.format(datetime.datetime.now())

## Setup
In the cell below, you need to set up the following variables:
 - path: the folder that contains your dataset. It defaults to the Darwin SDK /sets folder, which holds the example dataset "pulsars.csv".
 - user, password: the credentials you use to log in to the Darwin SDK. You can find them in your trial email.

In [40]:
# set local data path to files
path = '../../sets/'

# View data snippet
df = pd.read_csv(os.path.join(path, 'pulsars.csv'))
df.head()

Unnamed: 0,mean_profile,std_profile,kurt_profile,skew_profile,mean_dmsnr,std_dmsnr,kurt_dmsnr,skew_dmsnr,class
0,111.09375,47.341089,0.435469,0.471339,2.386288,15.867173,9.327098,103.545876,0
1,105.0,49.203341,0.563215,0.38215,1.601171,14.657767,11.381829,148.33435,0
2,115.304688,43.653207,0.448319,0.614359,3.158027,21.378754,8.34743,76.310271,0
3,108.554688,52.559016,0.138068,-0.44234,1.787625,12.108555,11.262459,180.074252,0
4,136.429688,49.552164,-0.180418,0.370338,9.066054,37.284742,4.270014,17.700441,0


Log in with the username and password from your trial email.

In [41]:
ds = DarwinSdk()
ds.set_url('https://amb-trial-api.sparkcognition.com/v1/')

status, msg = ds.auth_login_user('username','password')

if not status:
    print(msg)
else:
    print("You are now logged in!")

You are now logged in!
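
The login cell above passes placeholder strings directly to `auth_login_user`. One way to keep real credentials out of the notebook is to read them from environment variables. The sketch below assumes two variable names, `DARWIN_USER` and `DARWIN_PASSWORD`, which are hypothetical and not defined by the SDK.

```python
import os


def read_credentials(env=os.environ):
    """Return (user, password) from the environment, or raise if either is missing."""
    # DARWIN_USER and DARWIN_PASSWORD are hypothetical names chosen for this sketch.
    user = env.get("DARWIN_USER")
    password = env.get("DARWIN_PASSWORD")
    if not user or not password:
        raise RuntimeError("Set DARWIN_USER and DARWIN_PASSWORD before logging in")
    return user, password
```

The returned pair can then be passed on, for example `ds.auth_login_user(*read_credentials())`.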


## Upload Data
After setting up the dataset path, the next step is to upload the dataset from your local device to the server. In the cell below, you need to specify the dataset_name.

In [42]:
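# Upload dataset
# Caution: as their names indicate, the next two calls remove every model and
# dataset already stored under your account. Skip them if you want to keep earlier work.
ds.delete_all_models()
ds.delete_all_datasets()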
dataset_name = 'pulsars'
status, dataset = ds.upload_dataset(os.path.join(path, 'pulsars.csv'), dataset_name)
if not status:
    print(dataset)
else:
    print("Data uploaded successfully")

Data uploaded successfully


## Create and Train Model 

To build unsupervised models, which cluster data and perform anomaly detection, Darwin goes through the following steps:
1. Determines an approximate number of clusters to start with using a single pass with a hierarchical method
2. Iterates on subsets of the data using a Spectral-Net algorithm to determine the ideal number of clusters
3. Proceeds to cluster the data using a Spectral-Net approach

In the cell below, specify the parameters used to create the model:
- model: the name of your model
- max_epochs: the number of epochs to train the model for. One epoch is one pass over the entire dataset.
- n_clusters: the number of clusters, either an integer or 'auto'. With 'auto', the unsupervised algorithm computes the number for you.
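
Because a training job takes time to fail, it can help to check the two numeric parameters locally before submitting. The helper below is a minimal sketch and not part of the SDK. The name `check_model_params` is hypothetical, and the requirement that both integers be at least 1 is an assumption.

```python
def check_model_params(max_epochs, n_clusters):
    """Validate the training parameters described above and return them as a dict."""
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(max_epochs, bool) or not isinstance(max_epochs, int) or max_epochs < 1:
        raise ValueError("max_epochs must be a positive integer")
    if n_clusters != "auto":
        if isinstance(n_clusters, bool) or not isinstance(n_clusters, int) or n_clusters < 1:
            raise ValueError("n_clusters must be a positive integer or 'auto'")
    return {"max_epochs": max_epochs, "n_clusters": n_clusters}
```

The returned dict could be unpacked into the call, for example `ds.create_model(dataset_names=dataset_name, model_name=model, **check_model_params(20, 2))`.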

In [43]:
# Build model
model = "model-" + ts

max_epochs = 20
n_clusters = 2
status, job_id = ds.create_model(dataset_names=dataset_name,
                                 model_name=model,
                                 max_epochs=max_epochs,
                                 n_clusters=n_clusters)
sleep(1)
if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

model-20181022132213
{'status': 'Running', 'starttime': '2018-10-22T13:22:16.022224', 'endtime': None, 'percent_complete': 0, 'job_type': 'TrainModel', 'loss': None, 'generations': 0, 'dataset_names': ['pulsars'], 'artifact_names': None, 'model_name': 'model-20181022132213', 'job_error': ''}
{'status': 'Running', 'starttime': '2018-10-22T13:22:16.022224', 'endtime': None, 'percent_complete': 0, 'job_type': 'TrainModel', 'loss': None, 'generations': 0, 'dataset_names': ['pulsars'], 'artifact_names': None, 'model_name': 'model-20181022132213', 'job_error': ''}
{'status': 'Running', 'starttime': '2018-10-22T13:22:16.022224', 'endtime': None, 'percent_complete': 0, 'job_type': 'TrainModel', 'loss': None, 'generations': 0, 'dataset_names': ['pulsars'], 'artifact_names': None, 'model_name': 'model-20181022132213', 'job_error': ''}
{'status': 'Running', 'starttime': '2018-10-22T13:22:16.022224', 'endtime': None, 'percent_complete': 0, 'job_type': 'TrainModel', 'loss': None, 'generations': 0, 

In [44]:
# look up job status
ds.lookup_job_status_name(job_id['job_name'])

(True,
 {'status': 'Complete',
  'starttime': '2018-10-22T13:22:16.022224',
  'endtime': '2018-10-22T13:23:11.223161',
  'percent_complete': 100,
  'job_type': 'TrainModel',
  'loss': None,
  'generations': 0,
  'dataset_names': ['pulsars'],
  'artifact_names': None,
  'model_name': 'model-20181022132213',
  'job_error': ''})
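
The status dictionary above is enough to build a polling loop with a timeout. The sketch below is independent of the SDK: it accepts any callable that returns an `(ok, info)` pair shaped like the output above. The function name `wait_until_done` is hypothetical, and treating 'Failed' as a terminal status string is an assumption based on the status list in the introduction.

```python
import time


def wait_until_done(lookup, job_name, poll_seconds=5.0, timeout=600.0):
    """Poll lookup(job_name) until the job reports a terminal status or the timeout expires."""
    # Assumption: lookup returns (ok, info) and info carries a 'status' key,
    # as in the lookup output shown above. 'failed' is a guessed terminal status.
    terminal = {"complete", "completed", "failed"}
    deadline = time.monotonic() + timeout
    while True:
        ok, info = lookup(job_name)
        if not ok:
            raise RuntimeError(f"status lookup failed: {info}")
        if str(info.get("status", "")).lower() in terminal:
            return info
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_name!r} did not finish within {timeout} seconds")
        time.sleep(poll_seconds)
```

A call such as `wait_until_done(ds.lookup_job_status_name, job_id['job_name'])` would return the final status dictionary, which you can inspect before moving on.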

In [45]:
# look up the model
ds.lookup_model_name(job_id['model_name'])

(True,
 {'type': 'Unsupervised',
  'updated_at': '2018-10-22T13:23:11.215241',
  'trained_on': ['pulsars'],
  'loss': None,
  'generations': 0,
  'parameters': {'n_clusters': 2, 'max_generation': 20},
  'description': {'model': "UnsupervisedModel(anomaly_prior=0.05, auto_save_per=10, clustering=True,\n         clustermethod='GaussianMixture', impute='ffill',\n         max_generation=20, max_time=600,\n         model_file='models/8f7a8372-52e1-11e8-b5c4-b3789b6189ff_model-20181022132213',\n         n_clusters=2, recurrent=False, verbose=2)"},
  'train_time_seconds': 55,
  'algorithm': 'NA',
  'running_job_id': '6751073a-d627-11e8-b232-c75d47fe44df'})

## Extra Training (Optional)
Run the following cell to train the model further. It reuses the dataset, model name, and n_clusters from the previous step; only the number of extra epochs is set here.

In [46]:
# Train some more
extra_epochs = 10
status, job_id = ds.resume_training_model(dataset_names=dataset_name,
                                          model_name=model,
                                          max_epochs=extra_epochs,
                                          n_clusters=n_clusters)
sleep(1)
if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

{'status': 'Running', 'starttime': '2018-10-22T13:23:18.590892', 'endtime': None, 'percent_complete': 0, 'job_type': 'UpdateModel', 'loss': None, 'generations': 0, 'dataset_names': ['pulsars'], 'artifact_names': None, 'model_name': 'model-20181022132213', 'job_error': ''}
{'status': 'Complete', 'starttime': '2018-10-22T13:23:18.590892', 'endtime': '2018-10-22T13:23:33.712502', 'percent_complete': 100, 'job_type': 'UpdateModel', 'loss': None, 'generations': 0, 'dataset_names': ['pulsars'], 'artifact_names': None, 'model_name': 'model-20181022132213', 'job_error': ''}


## Predict
Run the following cell to generate predictions for the uploaded dataset.

In [47]:
# Test model
status, artifact = ds.run_model(dataset_name, 
                                model, 
                                supervised=False)
sleep(1)
# Only wait on the job if the request was accepted; otherwise show the error message
if status:
    ds.wait_for_job(artifact['job_name'])
else:
    print(artifact)

{'status': 'Running', 'starttime': '2018-10-22T13:23:35.806026', 'endtime': None, 'percent_complete': 0, 'job_type': 'RunModel', 'loss': None, 'generations': 0, 'dataset_names': ['pulsars'], 'artifact_names': ['416377c1b0d74ec4bd5f815ad8513c7b'], 'model_name': 'model-20181022132213', 'job_error': ''}
{'status': 'Complete', 'starttime': '2018-10-22T13:23:35.806026', 'endtime': '2018-10-22T13:23:39.337109', 'percent_complete': 100, 'job_type': 'RunModel', 'loss': None, 'generations': 0, 'dataset_names': ['pulsars'], 'artifact_names': ['416377c1b0d74ec4bd5f815ad8513c7b'], 'model_name': 'model-20181022132213', 'job_error': ''}

In [48]:
# Get predictions
status, pred_file = ds.download_artifact(artifact['artifact_name'])

In [49]:
# View prediction
df = pd.read_csv(pred_file['filename'])
df.head()

Unnamed: 0,anomaly,predict_proba,prediction
0,-9.613625,"[0.921194873115461, 0.07880512688453897]",0
1,-9.613625,"[0.9160670047272838, 0.08393299527271617]",0
2,-9.613625,"[0.8731624947460095, 0.1268375052539906]",0
3,-9.613625,"[0.909575160047408, 0.09042483995259198]",0
4,-9.613625,"[0.8996014254411605, 0.10039857455883959]",0
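
In the table above, `predict_proba` holds a bracketed list stored as text, so it needs decoding before the probabilities can be used as numbers. The sketch below uses only the standard library and two rows copied from the output. The CSV layout, with an empty header for the index column, is an assumption about how the prediction file is written.

```python
import csv
import io
import json

# Two rows copied from the prediction output above, laid out as CSV text.
sample = io.StringIO(
    ',anomaly,predict_proba,prediction\n'
    '0,-9.613625,"[0.921194873115461, 0.07880512688453897]",0\n'
    '2,-9.613625,"[0.8731624947460095, 0.1268375052539906]",0\n'
)


def parse_predictions(handle):
    """Yield (prediction, probabilities) pairs, decoding the bracketed probability list."""
    for row in csv.DictReader(handle):
        probs = json.loads(row["predict_proba"])  # text "[a, b]" becomes a list of floats
        yield int(row["prediction"]), probs


rows = list(parse_predictions(sample))
```

Each entry of `rows` pairs the predicted cluster with one probability per cluster, so the probabilities can be compared or thresholded directly.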


## Analyze Data

In [50]:
status, analyze_id = ds.analyze_data(dataset_name, 
                                     job_name = 'Darwin_analyze_data_job-' + ts, 
                                     artifact_name = 'Darwin_analyze_data_artifact-' + ts)
sleep(1)
if status:
    ds.wait_for_job('Darwin_analyze_data_job-' + ts)
else:
    print(analyze_id)

{'status': 'Running', 'starttime': '2018-10-22T13:23:52.59637', 'endtime': None, 'percent_complete': 0, 'job_type': 'AnalyzeData', 'loss': None, 'generations': None, 'dataset_names': ['pulsars'], 'artifact_names': ['Darwin_analyze_data_artifact-20181022132213'], 'model_name': None, 'job_error': None}
{'status': 'Running', 'starttime': '2018-10-22T13:23:52.59637', 'endtime': None, 'percent_complete': 0, 'job_type': 'AnalyzeData', 'loss': None, 'generations': None, 'dataset_names': ['pulsars'], 'artifact_names': ['Darwin_analyze_data_artifact-20181022132213'], 'model_name': None, 'job_error': None}
{'status': 'Running', 'starttime': '2018-10-22T13:23:52.59637', 'endtime': None, 'percent_complete': 10, 'job_type': 'AnalyzeData', 'loss': None, 'generations': None, 'dataset_names': ['pulsars'], 'artifact_names': ['Darwin_analyze_data_artifact-20181022132213'], 'model_name': None, 'job_error': None}
{'status': 'Running', 'starttime': '2018-10-22T13:23:52.59637', 'endtime': None, 'percent_com

In [51]:
ds.lookup_job_status_name('Darwin_analyze_data_job-' + ts)

(True,
 {'status': 'Complete',
  'starttime': '2018-10-22T13:23:52.59637',
  'endtime': '2018-10-22T13:24:55.971',
  'percent_complete': 100,
  'job_type': 'AnalyzeData',
  'loss': None,
  'generations': None,
  'dataset_names': ['pulsars'],
  'artifact_names': ['Darwin_analyze_data_artifact-20181022132213'],
  'model_name': None,
  'job_error': None})

In [52]:
status, analyze_results = ds.download_artifact('Darwin_analyze_data_artifact-' + ts)
analyze_results

Unnamed: 0,col_name,num_uniques,mean,stddev,min,max,col_type,uniques,missing,num_with_str_outlier,drop,is_cat,scalable
0,mean_profile,4131,95.386872844363,36.45879164320172,5.8125,180.21875,DoubleType,,0.0,False,False,False,True
1,std_profile,4643,44.39843999534597,8.03584733035067,24.77204176,91.8086279,DoubleType,,0.0,False,False,False,True
2,kurt_profile,4489,1.240831394071567,1.8062319389936876,-1.604829088,8.069522046,DoubleType,,0.0,False,False,False,True
3,skew_profile,4645,5.725651713966373,11.072265713013318,-1.781888301,68.10162173,DoubleType,,0.0,False,False,False,True
4,mean_dmsnr,3897,23.786014458090968,39.28215532852028,0.273411371,222.4214047,DoubleType,,0.0,False,False,False,True
5,std_dmsnr,4580,35.27200189459302,23.989199233584703,7.565681088,110.64221059999998,DoubleType,,0.0,False,False,False,True
6,kurt_dmsnr,4961,6.66097918719681,4.877371432518144,-3.1392696110000005,32.19858411,DoubleType,,0.0,False,False,False,True
7,skew_dmsnr,4516,79.4015215256139,103.01049301033186,-1.976975603,1072.957979,DoubleType,,0.0,False,False,False,True
8,class,2,0.3533089027807717,0.4780491449433773,0.0,1.0,IntegerType,"['0', '1']",0.0,False,False,False,True


## Analyze Model
Analyze Model returns the feature importances as ranked by the model. It gives a global view of which features have the largest impact on the model's output.

In [53]:
status, analyze_id = ds.analyze_model(job_id['model_name'], 
                                      job_name='Darwin_analyze_model_job-' + ts, 
                                      artifact_name='Darwin_analyze_model_artifact-' + ts)
sleep(1)
if status:
    ds.wait_for_job('Darwin_analyze_model_job-' + ts)
else:
    print(analyze_id)

{'status': 'Running', 'starttime': '2018-10-22T13:25:10.852169', 'endtime': None, 'percent_complete': 0, 'job_type': 'AnalyzeModel', 'loss': None, 'generations': 0, 'dataset_names': None, 'artifact_names': ['Darwin_analyze_model_artifact-20181022132213'], 'model_name': 'model-20181022132213', 'job_error': ''}
{'status': 'Complete', 'starttime': '2018-10-22T13:25:10.852169', 'endtime': '2018-10-22T13:25:15.167323', 'percent_complete': 100, 'job_type': 'AnalyzeModel', 'loss': None, 'generations': 0, 'dataset_names': None, 'artifact_names': ['Darwin_analyze_model_artifact-20181022132213'], 'model_name': 'model-20181022132213', 'job_error': ''}


In [54]:
ds.lookup_job_status_name('Darwin_analyze_model_job-' + ts)

(True,
 {'status': 'Complete',
  'starttime': '2018-10-22T13:25:10.852169',
  'endtime': '2018-10-22T13:25:15.167323',
  'percent_complete': 100,
  'job_type': 'AnalyzeModel',
  'loss': None,
  'generations': 0,
  'dataset_names': None,
  'artifact_names': ['Darwin_analyze_model_artifact-20181022132213'],
  'model_name': 'model-20181022132213',
  'job_error': ''})

Download and print the feature importances, ranked from most to least important

In [55]:
status, feature_importance = ds.download_artifact('Darwin_analyze_model_artifact-' + ts)
feature_importance

skew_profile    0.279141
kurt_dmsnr      0.131902
std_profile     0.111963
skew_dmsnr      0.108896
kurt_profile    0.101227
std_dmsnr       0.088957
mean_dmsnr      0.072086
mean_profile    0.062883
class = 1       0.042945
dtype: float64
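
The nine scores above sum to 1.0, so each one can be read as a share of the total importance. The sketch below copies the values into a plain dict and shows two ways to summarise them: the top `k` features, and the smallest ranked set that reaches a given share. Both helper names are hypothetical.

```python
# Importance scores copied from the output above.
importance = {
    "skew_profile": 0.279141,
    "kurt_dmsnr": 0.131902,
    "std_profile": 0.111963,
    "skew_dmsnr": 0.108896,
    "kurt_profile": 0.101227,
    "std_dmsnr": 0.088957,
    "mean_dmsnr": 0.072086,
    "mean_profile": 0.062883,
    "class = 1": 0.042945,
}


def top_features(scores, k):
    """Return the k highest-scoring (name, score) pairs, largest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


def features_covering(scores, share):
    """Return the shortest ranked list of features whose scores sum to at least `share`."""
    picked, total = [], 0.0
    for name, score in top_features(scores, len(scores)):
        picked.append(name)
        total += score
        if total >= share:
            break
    return picked


top3 = top_features(importance, 3)
half = features_covering(importance, 0.5)
```

Here the three highest-ranked features, `skew_profile`, `kurt_dmsnr`, and `std_profile`, together account for just over half of the total importance.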

## Analyze Prediction
Unlike Analyze Model, Analyze Prediction reports feature importance for each individual data point. The output estimates how much each feature added to or subtracted from a known base value to produce the overall prediction. <br>
**You need to set the path to a dataset that contains all the samples you want to analyze (max rows = 500)**
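
If the rows you want to analyze sit in a larger file, you can write a subset that respects the 500-row limit before uploading. This is a minimal standard-library sketch. The helper name `write_row_subset` is hypothetical, and it keeps the first rows of the file rather than a random sample.

```python
import csv
from itertools import islice


def write_row_subset(src_path, dst_path, max_rows=500):
    """Copy the header and at most max_rows data rows from src_path to dst_path."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return 0  # empty input file: nothing to copy
        writer.writerow(header)
        count = 0
        for row in islice(reader, max_rows):
            writer.writerow(row)
            count += 1
        return count
```

The function returns the number of data rows written, which gives a quick check that the new file is within the limit.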

In [56]:
# Upload the rows whose feature importance you want to analyze (max: 500 rows)
dataset_name = 'pulsars_predict'
path = '../../sets/'
status, response = ds.upload_dataset(os.path.join(path, 'pulsars_predict.csv'), dataset_name)
print(status)
print(response)
if status:
    dataset_by_row=response['dataset_name']
else:
    print("Upload data failed!")

True
{'dataset_name': 'pulsars_predict'}


In [57]:
status, analyze_id = ds.analyze_predictions(job_id['model_name'], 
                                            'pulsars_predict',
                                            job_name='Analyze_prediction_job-' + ts, 
                                            artifact_name='Analyze_prediction_artifact-' + ts)
sleep(1)
if status:
    ds.wait_for_job('Analyze_prediction_job-' + ts)
else:
    print(analyze_id)

{'status': 'Running', 'starttime': '2018-10-22T13:25:29.111522', 'endtime': None, 'percent_complete': 0, 'job_type': 'AnalyzePredictions', 'loss': None, 'generations': 0, 'dataset_names': None, 'artifact_names': ['Analyze_prediction_artifact-20181022132213'], 'model_name': 'model-20181022132213', 'job_error': ''}
{'status': 'Complete', 'starttime': '2018-10-22T13:25:29.111522', 'endtime': '2018-10-22T13:25:34.905241', 'percent_complete': 100, 'job_type': 'AnalyzePredictions', 'loss': None, 'generations': 0, 'dataset_names': None, 'artifact_names': ['Analyze_prediction_artifact-20181022132213'], 'model_name': 'model-20181022132213', 'job_error': ''}


In [58]:
ds.lookup_job_status_name('Analyze_prediction_job-' + ts)

(True,
 {'status': 'Complete',
  'starttime': '2018-10-22T13:25:29.111522',
  'endtime': '2018-10-22T13:25:34.905241',
  'percent_complete': 100,
  'job_type': 'AnalyzePredictions',
  'loss': None,
  'generations': 0,
  'dataset_names': None,
  'artifact_names': ['Analyze_prediction_artifact-20181022132213'],
  'model_name': 'model-20181022132213',
  'job_error': ''})

Download the per-row feature contributions and show the first five rows

In [59]:
status, feature_importance = ds.download_artifact('Analyze_prediction_artifact-' + ts)
feature_importance.head()

Unnamed: 0,mean_profile_shap,std_profile_shap,kurt_profile_shap,skew_profile_shap,mean_dmsnr_shap,std_dmsnr_shap,kurt_dmsnr_shap,skew_dmsnr_shap,class = 1_shap,base_value,predicted_proba,predicted_class
0,0.070299,0.10265,0.072198,0.069924,0.007757,0.020696,0.052016,0.056087,0.019862,0.448936,0.719071,0
1,0.075286,0.115519,0.079864,0.079485,0.014563,0.020417,0.029793,0.027319,0.029671,0.448936,0.846829,0
2,-0.076678,-0.097932,-0.074503,-0.071541,-0.002461,-0.017659,-0.050975,-0.062609,-0.016513,0.551064,0.808995,1
3,0.0712,0.125838,0.078611,0.074969,0.019367,0.022947,0.032403,0.01553,0.031386,0.448936,0.923996,0
4,-0.019629,-0.090178,-0.040499,-0.067282,-0.047057,-0.035739,-0.057256,-0.057172,-0.05241,0.551064,0.947637,1
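
To see which features moved a single prediction the most, you can rank the `*_shap` values of one row by absolute size while keeping the sign. The sketch below does this for row 0 of the table above with the values copied by hand. The helper name `strongest_contributions` is hypothetical.

```python
# Row 0 of the output above; only the *_shap columns are kept.
row0 = {
    "mean_profile_shap": 0.070299,
    "std_profile_shap": 0.10265,
    "kurt_profile_shap": 0.072198,
    "skew_profile_shap": 0.069924,
    "mean_dmsnr_shap": 0.007757,
    "std_dmsnr_shap": 0.020696,
    "kurt_dmsnr_shap": 0.052016,
    "skew_dmsnr_shap": 0.056087,
    "class = 1_shap": 0.019862,
}


def strongest_contributions(shap_row, k=3):
    """Return the k features with the largest absolute contribution, sign preserved."""
    ranked = sorted(shap_row.items(), key=lambda item: abs(item[1]), reverse=True)
    # Strip the trailing "_shap" so the result uses the original feature names.
    return [(name[: -len("_shap")], value) for name, value in ranked[:k]]


top3 = strongest_contributions(row0)
```

For this row the largest contributions come from `std_profile`, `kurt_profile`, and `mean_profile`, all with a positive sign.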
