<img src="https://cybersecurity-excellence-awards.com/wp-content/uploads/2017/06/366812.png">

<h1><center>Darwin Supervised Classification Model Building </center></h1>

Before getting started, do the following:
1. Set the dataset path.
2. Enter your username and password to ensure that you can log in successfully.

Once you're up and running, keep the following in mind:
1. For every run, look up the job status (i.e. requested, failed, running, completed) and wait for the job to complete before proceeding.
2. If you're not satisfied with your model and think that Darwin can do better by exploring a larger search space, use the resume function.
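
The notebook relies on `ds.wait_for_job` to block until a job finishes. The pattern behind "look up the status and wait" can be sketched roughly as below. This is an illustration only, not the SDK's implementation: `get_status` is a hypothetical callable, and the exact spelling of the failure state (`"Failed"`) is an assumption, since only `Requested`, `Running` and `Complete` appear in this notebook's logs.

```python
import time

# States after which polling should stop. "Complete" appears in the job logs
# of this notebook; "Failed" is an assumed spelling for the failure state.
TERMINAL_STATES = {"Complete", "Failed"}


def wait_until_done(get_status, poll_seconds=1.0, timeout_seconds=600.0):
    """Poll get_status() until it reports a terminal state or the timeout expires.

    get_status is a hypothetical zero-argument callable returning a status string.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_seconds} s")
        time.sleep(poll_seconds)


# Usage with a fake job that moves Requested -> Running -> Complete.
fake_states = iter(["Requested", "Running", "Complete"])
final_status = wait_until_done(lambda: next(fake_states), poll_seconds=0.0)
print(final_status)
```

Returning the final state, rather than only waiting, lets the caller decide whether to continue or to stop on a failed job.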

## Import libraries

In [1]:
# Import necessary libraries
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import Image
from time import sleep
import os
import numpy as np
from sklearn.metrics import classification_report

from amb_sdk.sdk import DarwinSdk

## Setup

**Log in to Darwin**<br>
Enter your registered username and password below to log in to Darwin.

In [2]:
# Login

ds = DarwinSdk()

ds.set_url('https://amb-demo-api.sparkcognition.com/v1/')

# Replace the placeholders with your registered Darwin username and password
status, msg = ds.auth_login_user('YOUR_USERNAME', 'YOUR_PASSWORD')

if not status:
    print(msg)
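
To keep credentials out of a shared notebook, they can be read from the environment or typed at a prompt instead of being written into the login cell. A minimal sketch, assuming two hypothetical environment variables named `DARWIN_USERNAME` and `DARWIN_PASSWORD` (these names are not defined by the SDK):

```python
import os
from getpass import getpass


def load_credentials(env=None, ask_user=input, ask_password=getpass):
    """Return (username, password) without hard-coding either value.

    DARWIN_USERNAME and DARWIN_PASSWORD are hypothetical variable names;
    any value that is not set is requested interactively instead.
    """
    env = os.environ if env is None else env
    username = env.get("DARWIN_USERNAME") or ask_user("Darwin username: ")
    password = env.get("DARWIN_PASSWORD") or ask_password("Darwin password: ")
    return username, password


# Demo with an in-memory mapping instead of the real environment.
demo_env = {"DARWIN_USERNAME": "alice", "DARWIN_PASSWORD": "not-a-real-password"}
demo_user, demo_password = load_credentials(env=demo_env)
print(demo_user)
```

In the login cell, the pair could then be passed on as `ds.auth_login_user(*load_credentials())`.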

**Data Path** <br>
Set the path to your dataset before loading it. The default is Darwin's example datasets; this notebook reads `outbreak.csv` from the working directory.
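
One way to set the path is to join a folder and a file name and fail early if the file is missing. This is a sketch under assumptions: `dataset_file` is a hypothetical helper, and the demo uses a throwaway folder so that it runs anywhere. In the notebook you would pass your own folder (later cells refer to it as `path`) and file name.

```python
import os
import tempfile


def dataset_file(path, dataset_name):
    """Join folder and file name, raising early if the file does not exist."""
    full_path = os.path.join(path, dataset_name)
    if not os.path.isfile(full_path):
        raise FileNotFoundError(f"dataset not found: {full_path}")
    return full_path


# Demo in a temporary folder holding a tiny stand-in for outbreak.csv.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "outbreak.csv"), "w") as handle:
        handle.write("year,month\n2009,1\n")
    found_name = os.path.basename(dataset_file(demo_dir, "outbreak.csv"))
    try:
        dataset_file(demo_dir, "missing.csv")
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True

print(found_name, missing_detected)
```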

## Data Upload and Clean

**Read dataset and view a file snippet**

After setting up the dataset path, the next step is to upload the dataset from your local device to the server. <br> In the cell below, you need to specify the dataset_name if you want to use your own data.

In [4]:
#dataset_name = 'cancer_train.csv'
#df = pd.read_csv(os.path.join(path, dataset_name))
df = pd.read_csv('outbreak.csv')
df.head()

Unnamed: 0,year,month,state,genus_species,contaminated_ingredient,serotype_or_genotype,etiology_status,location_of_preparation,illnesses,deaths,hospitalizations,food_vehicle
0,2009,1,Minnesota,Norovirus,,,Suspected,Restaurant - Sit-down dining,2,0.0,0.0,
1,2009,1,Minnesota,Norovirus,,,Confirmed,,16,0.0,0.0,
2,2009,1,Minnesota,Norovirus,,,Suspected,Restaurant - Sit-down dining,5,0.0,0.0,
3,2009,1,Minnesota,Norovirus,,,Confirmed,"Restaurant - ""Fast-food""(drive up service or p...",3,0.0,0.0,
4,2009,1,Minnesota,Norovirus,,,Confirmed,Restaurant - other or unknown type,21,0.0,0.0,cookies


**Upload dataset to Darwin**

In [5]:
# Upload dataset
status, dataset = ds.upload_dataset('outbreak.csv','outbreak')
if not status:
    print(dataset)

400: BAD REQUEST - {"message": "Dataset already exists"}
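
The `400: BAD REQUEST` response above means a dataset named `outbreak` is already on the server, so later cells can still use it. A re-runnable notebook may want to treat that specific rejection as acceptable and stop on anything else. A minimal sketch: `upload_succeeded_or_exists` is a hypothetical helper, and matching on the message text is an assumption about how the error is reported.

```python
def upload_succeeded_or_exists(status, payload):
    """Return True if the upload worked or the dataset was already present.

    status and payload mirror the (status, dataset) pair returned by
    ds.upload_dataset in the cell above.
    """
    if status:
        return True
    return "already exists" in str(payload).lower()


# The response printed above, followed by a made-up unrelated error.
already_there = '400: BAD REQUEST - {"message": "Dataset already exists"}'
other_error = "some other error message (hypothetical)"

print(upload_succeeded_or_exists(False, already_there))
print(upload_succeeded_or_exists(False, other_error))
```

If the copy on the server may be out of date, uploading under a new dataset name avoids silently training on stale data.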



**Clean dataset**

In [6]:
ds.analyze_data('outbreak')
#should we exclude labels?

(True,
 {'job_name': '2b429de3a75f48e3b9b2ec07af2bab73',
  'artifact_name': 'e6c48bc53a0b42508a337dcd11d86c44'})

In [7]:
ds.download_artifact('e6c48bc53a0b42508a337dcd11d86c44')

(True,                    col_name num_uniques                mean  \
 0                      year          18  2005.5623725090225   
 1                     month          12   6.433443171714002   
 2                     state          55                None   
 3             genus_species         213                None   
 4   contaminated_ingredient         365                None   
 5      serotype_or_genotype         265                None   
 6           etiology_status          22                None   
 7   location_of_preparation         220                None   
 8                 illnesses         291                None   
 9                    deaths          14                None   
 10         hospitalizations          59  0.9505615076803924   
 11             food_vehicle        3088                None   
 
                stddev   min   max     col_type   missing num_with_str_outlier  \
 0   5.158403941746534  1998  2015  IntegerType  0.000000                False

In [16]:
# Clean dataset

# Candidate feature columns (not passed to clean_data below; kept for reference)
features = ['genus_species', 'location_of_preparation', 'food_vehicle']

target = 'illnesses'
status, job_id = ds.clean_data('outbreak', target = target)

if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)


{'status': 'Requested', 'starttime': '2019-04-17T17:35:42.758501', 'endtime': None, 'percent_complete': 0, 'job_type': 'CleanDataTiny', 'loss': None, 'generations': None, 'dataset_names': ['outbreak'], 'artifact_names': ['1004902ca53d4c5b9878b50c525d9f01'], 'model_name': None, 'job_error': None}
{'status': 'Complete', 'starttime': '2019-04-17T17:35:42.758501', 'endtime': '2019-04-17T17:35:48.263794', 'percent_complete': 100, 'job_type': 'CleanDataTiny', 'loss': None, 'generations': None, 'dataset_names': ['outbreak'], 'artifact_names': ['1004902ca53d4c5b9878b50c525d9f01'], 'model_name': None, 'job_error': ''}


In [17]:
ds.download_dataset('outbreak',artifact_path='C:/data/')

(True,
 {'filename': 'C:\\data\\outbreak-part0-v4yetsr2.csv',
  'part': 0,
  'note': 'part 0 of 0'})

## Create and Train Model 

We will now build a model that learns to predict the target column.<br> In the default cancer dataset, the target column is "Diagnosis"; in this notebook it is `illnesses` in the `outbreak` dataset. <br> You will have to specify your own target name for your custom dataset. <br> You can also increase max_train_time for longer training.


In [19]:
model = 'illnesses' + "_model1"
status, job_id = ds.create_model(dataset_names = 'outbreak',
                                 model_name = model,
                                 max_train_time = '00:02')
if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

{'status': 'Requested', 'starttime': '2019-04-17T17:37:14.580578', 'endtime': None, 'percent_complete': 0, 'job_type': 'TrainModel', 'loss': None, 'generations': 0, 'dataset_names': ['outbreak'], 'artifact_names': None, 'model_name': 'illnesses_model1', 'job_error': None}
{'status': 'Running', 'starttime': '2019-04-17T17:37:14.580578', 'endtime': None, 'percent_complete': 7, 'job_type': 'TrainModel', 'loss': 0.6584872603416443, 'generations': 2, 'dataset_names': ['outbreak'], 'artifact_names': None, 'model_name': 'illnesses_model1', 'job_error': ''}
{'status': 'Running', 'starttime': '2019-04-17T17:37:14.580578', 'endtime': None, 'percent_complete': 7, 'job_type': 'TrainModel', 'loss': 0.6584872603416443, 'generations': 2, 'dataset_names': ['outbreak'], 'artifact_names': None, 'model_name': 'illnesses_model1', 'job_error': ''}
{'status': 'Running', 'starttime': '2019-04-17T17:37:14.580578', 'endtime': None, 'percent_complete': 7, 'job_type': 'TrainModel', 'loss': 0.6584872603416443, 'g

## Extra Training (Optional)
Run the following cell to resume training. The parameters can be left as they are.

In [None]:
# Train some more
status, job_id = ds.resume_training_model(dataset_names = 'outbreak',
                                          model_name = model,
                                          max_train_time = '00:05')

if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

## Analyze Model
Analyzing the model yields a feature-importance ranking. <br> It gives a general view of which features have the greatest impact on the model's predictions.

In [21]:
# Retrieve feature importance of built model
status, artifact = ds.analyze_model('illnesses_model1')
sleep(1)
if status:
    ds.wait_for_job(artifact['job_name'])
else:
    print(artifact)


{'status': 'Running', 'starttime': '2019-04-17T17:40:28.7781', 'endtime': None, 'percent_complete': 0, 'job_type': 'AnalyzeModel', 'loss': 0.6584872603416443, 'generations': 2, 'dataset_names': None, 'artifact_names': ['12542baef8bf4a15b862cd415fb1dae1'], 'model_name': 'illnesses_model1', 'job_error': ''}
{'status': 'Complete', 'starttime': '2019-04-17T17:40:28.7781', 'endtime': '2019-04-17T17:40:39.205932', 'percent_complete': 100, 'job_type': 'AnalyzeModel', 'loss': 0.6584872603416443, 'generations': 2, 'dataset_names': None, 'artifact_names': ['12542baef8bf4a15b862cd415fb1dae1'], 'model_name': 'illnesses_model1', 'job_error': ''}


Show the 10 most important features of the model.

In [22]:
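# Download the feature-importance artifact produced by analyze_model above
status, feature_importance = ds.download_artifact(artifact['artifact_name'])
feature_importance[:10]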

hospitalizations                          0.232533
year                                      0.178419
illnesses                                 0.168933
etiology_status = Suspected               0.148510
etiology_status = Suspected; Suspected    0.083955
etiology_status = Confirmed; Confirmed    0.062698
month = 9                                 0.012465
month = 7                                 0.009659
month = 12                                0.009618
etiology_status = Confirmed; Suspected    0.008575
dtype: float64
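
Several entries above are individual levels of one categorical column (for example `etiology_status = Suspected`). To get a rough per-column view, the level importances can be summed back onto their source column. This is a common heuristic rather than something Darwin reports, and the totals below cover only the ten rows printed above. `importance_by_column` is a hypothetical helper.

```python
from collections import defaultdict

# The ten importances printed above, copied verbatim.
top_features = {
    "hospitalizations": 0.232533,
    "year": 0.178419,
    "illnesses": 0.168933,
    "etiology_status = Suspected": 0.148510,
    "etiology_status = Suspected; Suspected": 0.083955,
    "etiology_status = Confirmed; Confirmed": 0.062698,
    "month = 9": 0.012465,
    "month = 7": 0.009659,
    "month = 12": 0.009618,
    "etiology_status = Confirmed; Suspected": 0.008575,
}


def importance_by_column(importances, separator=" = "):
    """Sum importances of 'column = level' entries under their column name."""
    totals = defaultdict(float)
    for name, value in importances.items():
        column = name.split(separator, 1)[0]
        totals[column] += value
    # Largest total first.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


grouped = importance_by_column(top_features)
for column, total in grouped.items():
    print(f"{column:<20} {total:.6f}")
```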

## Predictions
**Perform model prediction on the training dataset.**

In [None]:
status, artifact = ds.run_model('outbreak', model)
sleep(1)
ds.wait_for_job(artifact['job_name'])

Download predictions from Darwin's server.

In [None]:
status, prediction = ds.download_artifact(artifact['artifact_name'])
prediction.head()

Create plots comparing predictions with the actual target.

In [None]:
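# Map each distinct predicted value to an integer index so both series share one y-axis
unq = prediction[target].unique()[::-1]
p = np.zeros((len(prediction),))
a = np.zeros((len(prediction),))
for i,q in enumerate(unq):
    p += i*(prediction[target] == q).values  # index of each predicted value
    a += i*(df[target] == q).values          # index of each actual value (stays 0 if never predicted)
# Plot predictions vs actual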
plt.plot(a)
plt.plot(p)
plt.legend(['Actual','Predicted'])
plt.yticks([i for i in range(len(unq))],[q for q in unq]);
print(classification_report(df[target], prediction[target]))
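
`classification_report` treats every distinct value of the target as its own class. Because `illnesses` is a numeric count, and the best genome reported at the end of this notebook is a `RandomForestRegressor`, error-based metrics may be a useful complement. The sketch below uses the standard library only. `regression_summary` is a hypothetical helper, the `actual` values are the `illnesses` column of the five rows shown by `df.head()`, and the `predicted` values are made up for illustration and are not output from this run.

```python
from math import sqrt


def regression_summary(actual, predicted):
    """Return mean absolute error, root mean squared error and R^2."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equally long")
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot if ss_tot else float("nan"),
    }


actual = [2, 16, 5, 3, 21]      # illnesses in the df.head() rows
predicted = [4, 12, 5, 6, 18]   # made-up predictions, for illustration only
summary = regression_summary(actual, predicted)
print(summary)
```

In the notebook itself, `sklearn.metrics.mean_absolute_error` and `sklearn.metrics.r2_score` compute the same quantities directly from `df[target]` and `prediction[target]`.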

**Perform model prediction on a test dataset that wasn't used in training.** <br>
Upload test dataset

In [None]:
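# Template default from the cancer example: point this at a held-out file with the same columns as the training data.
# `path` must be set to the folder that contains the file.
test_data = 'cancer_test.csv'
status, dataset = ds.upload_dataset(os.path.join(path, test_data))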
if not status:
    print(dataset)

Clean the test data.

In [None]:
# clean test dataset
status, job_id = ds.clean_data(test_data, target = target, model_name = model)

if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

Run model on test dataset.

In [None]:
status, artifact = ds.run_model(test_data, model)
sleep(1)
ds.wait_for_job(artifact['job_name'])

Create plots comparing predictions with the actual target.

In [None]:
# Create plots comparing predictions with actual target
status, prediction = ds.download_artifact(artifact['artifact_name'])
df = pd.read_csv(os.path.join(path,test_data))
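# Map each distinct predicted value to an integer index so both series share one y-axis
unq = prediction[target].unique()[::-1]
p = np.zeros((len(prediction),))
a = np.zeros((len(prediction),))
for i,q in enumerate(unq):
    p += i*(prediction[target] == q).values  # index of each predicted value
    a += i*(df[target] == q).values          # index of each actual value (stays 0 if never predicted)
# Plot predictions vs actual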
plt.plot(a)
plt.plot(p)
plt.legend(['Actual','Predicted'])
plt.yticks([i for i in range(len(unq))],[q for q in unq]);
print(classification_report(df[target], prediction[target]))

## Find out which machine learning model Darwin used

In [23]:
status, model_type = ds.lookup_model_name('illnesses_model1')
print(model_type['description']['best_genome'])

{'type': 'RandomForestRegressor', 'parameters': {'bootstrap': True, 'criterion': 'mse', 'max_depth': 3, 'max_features': 1.0, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1, 'n_estimators': 100}}
