<img src="https://www.sparkcognition.com/wp-content/uploads/2019/11/SparkCognition-Logo-Color-e1573238635285.png">

<h1><center>Darwin Supervised Regression Model Building </center></h1>

# Prior to getting started:

Starting with version 2.0, the Darwin notebook no longer supports 'Register User'. You must have your credentials ready before using this notebook.

Before proceeding, update the Environment Variables cell:
1. Set your username and password so that you can log in successfully.
2. Set the path to your datasets if you are using your own data. The default path is already set for the examples.
3. Set the dataset names accordingly.

Here are a few things to be mindful of:
1. For every run, check the job status (i.e. requested, failed, running, completed) and wait for the job to complete before proceeding.
2. If you're not satisfied with your model and think that Darwin can benefit from extra training, use the resume function.
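
The first point above can be sketched as a small polling loop. This only illustrates the wait-then-proceed pattern: `get_status` is a hypothetical callable standing in for whatever reports the job's current state, and the cells in this notebook rely on the SDK's `ds.wait_for_job` instead.

```python
import time

# Terminal states, using the status names listed above.
TERMINAL_STATES = {"completed", "failed"}


def wait_until_done(get_status, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll get_status() until it returns a terminal state.

    get_status is a hypothetical zero-argument callable returning one of
    'requested', 'running', 'completed' or 'failed'.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_status()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "job still '{}' after {} s".format(state, timeout_seconds)
            )
        time.sleep(poll_seconds)


# Simulated job that moves through the states in order.
_states = iter(["requested", "running", "running", "completed"])
final_state = wait_until_done(lambda: next(_states), poll_seconds=0.01)
print(final_state)
```

A `failed` result is returned rather than raised, so the caller decides whether to stop or retry.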

## Import Necessary Libraries

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import os
import datetime
import numpy as np
from IPython.display import Image
from time import sleep
from sklearn.metrics import r2_score
from amb_sdk.sdk import DarwinSdk

## Set Darwin SDK

In [None]:
ds = DarwinSdk()
ds.set_url('https://darwin-api.sparkcognition.com/v1/')

## Environment Variables

In [None]:
#Set your user id and password accordingly
USER="[your Darwin user id]"
PW="[your Darwin password]"

# Set path to datasets - The default below assumes Jupyter was started from amb-sdk/examples/Enterprise/
# Modify accordingly if you wish to use your own data
# PATH_TO_DATASET='../../sets/'
PATH_TO_DATASET='../../sets/'
TRAIN_DATASET='power_train.csv'
TEST_DATASET='power_test.csv'

# A timestamp is used to create a unique name in the event you execute the workflow multiple times or with 
# different datasets.  File names must be unique in Darwin.
ts = '{:%Y%m%d%H%M%S}'.format(datetime.datetime.now())
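
To make the naming scheme concrete, here is a minimal sketch of how the timestamp produces distinct names. `unique_name` is a hypothetical helper, not part of the SDK; it only wraps the same format string used in the cell above.

```python
import datetime


def unique_name(prefix, now=None):
    """Append a second-resolution timestamp to prefix (hypothetical helper)."""
    now = now or datetime.datetime.now()
    return "{}-{:%Y%m%d%H%M%S}".format(prefix, now)


# A fixed datetime makes the output reproducible.
example = unique_name("Darwin_analyze_data_job", datetime.datetime(2020, 1, 2, 3, 4, 5))
print(example)  # Darwin_analyze_data_job-20200102030405
```

Because the timestamp has one-second resolution, two names generated within the same second would still collide.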

# User Login 

In [None]:
status, msg = ds.auth_login_user(USER,PW)
if not status:
    print(msg)
else:
    print("Logged in successfully!")
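
If you would rather not type credentials into the notebook, one option is to read them from environment variables. The sketch below assumes two variable names, `DARWIN_USER` and `DARWIN_PASSWORD`, which are hypothetical and chosen only for this example; the SDK does not define them.

```python
import os


def read_credentials(env=os.environ):
    """Return (user, password) from env, or raise if either is missing.

    DARWIN_USER and DARWIN_PASSWORD are hypothetical variable names
    chosen for this example.
    """
    missing = [k for k in ("DARWIN_USER", "DARWIN_PASSWORD") if not env.get(k)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["DARWIN_USER"], env["DARWIN_PASSWORD"]


# Demonstration with an in-memory mapping instead of the real environment.
user, password = read_credentials({"DARWIN_USER": "alice", "DARWIN_PASSWORD": "s3cret"})
print(user)
```

The returned pair could then be passed to `ds.auth_login_user(user, password)` as in the cell above.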

# Data Upload

**Read dataset and view a file snippet**

In [None]:
df = pd.read_csv(os.path.join(PATH_TO_DATASET, TRAIN_DATASET))
df.head()

In [None]:
df.shape

**Upload dataset to Darwin**

In [None]:
# Upload dataset
status, dataset = ds.upload_dataset(os.path.join(PATH_TO_DATASET, TRAIN_DATASET))
if not status:
    print(dataset)
else:
    print("Dataset uploaded successfully!")

## Analyze Data
Analyzing the data is a required step before cleaning the data and creating a model.

In [None]:
status, analyze_id = ds.analyze_data(TRAIN_DATASET, 
                                     job_name = 'Darwin_analyze_data_job' + "-" + ts, 
                                     artifact_name = 'Darwin_analyze_data_artifact' + "-" + ts)
sleep(1)
if status:
    ds.wait_for_job('Darwin_analyze_data_job' + "-" + ts)
else:
    print(analyze_id)

## Clean Dataset

In [None]:
# clean dataset
target = "PE"
status, job_id = ds.clean_data(TRAIN_DATASET, target = target)

if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)
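
The cells in this notebook repeat one pattern: a call returns a `(status, result)` pair, and `result` holds either the payload or an error message. A small wrapper can turn failures into exceptions so that a run stops at the first problem. This is a sketch only: `DarwinCallError` and `unwrap` are hypothetical names, and `fake_clean_data` stands in for a real SDK call.

```python
class DarwinCallError(RuntimeError):
    """Raised when a (status, result) pair reports failure (hypothetical)."""


def unwrap(pair, what="call"):
    """Return the result of a (status, result) pair, or raise on failure."""
    status, result = pair
    if not status:
        raise DarwinCallError("{} failed: {}".format(what, result))
    return result


def fake_clean_data(ok):
    # Stand-in for an SDK call; mimics the (status, result) convention.
    if ok:
        return True, {"job_name": "clean-job", "profile_name": "clean-profile"}
    return False, "dataset not found"


job = unwrap(fake_clean_data(True), what="clean_data")
print(job["job_name"])
```

With a real call, the same idea would read `job_id = unwrap(ds.clean_data(TRAIN_DATASET, target=target), what="clean_data")`.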

# Create and Train Model 

We will now build a model that learns to predict the numeric values in the target column.<br> In the default power dataset, the target column is "PE". <br> You will have to specify your own target name for a custom dataset. <br> You can also increase max_train_time to train for longer.


#### NOTE (New Feature in Darwin 2.0):
Darwin now supports cross-validation. Simply set `cv_kfold` to the number of folds (k) to run.
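
As a reminder of what k-fold cross-validation does in general, the sketch below splits row indices into k folds and uses each fold once for validation. It illustrates the generic technique only; how Darwin partitions the data internally is not described here, and `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_rows, k):
    """Yield (train_idx, val_idx) lists for k contiguous, near-equal folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    base, extra = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional row.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, val_idx
        start += size


for train_idx, val_idx in kfold_indices(n_rows=7, k=3):
    print("validate on", val_idx, "train on", train_idx)
```

Every row is used for validation exactly once. In practice the rows are usually shuffled before splitting, which this sketch omits for clarity.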

In [None]:
cv_kfold = 3

In [None]:
model = target + "_model3" + ts
max_train_time = '00:01'
status, job_id = ds.create_model(dataset_names = TRAIN_DATASET,
                                 model_name =  model,
                                 max_train_time = max_train_time,
                                 fit_profile_name = job_id['profile_name'],
                                 cv_kfold = cv_kfold)
if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

In [None]:
ds.lookup_job_status_name(job_id['job_name'])

# Extra Training (Optional)
Run the following cell for extra training. It reuses the dataset and model names defined above, so no parameters need to be changed.

In [None]:
status, job_id = ds.resume_training_model(dataset_names = TRAIN_DATASET,
                                          model_name = model,
                                          max_train_time = '00:01')
                                          
if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

# Analyze Model
Analyze Model returns the feature importances as ranked by the model. <br> They give a general view of which features have the greatest impact on the model's predictions.

In [None]:
# Retrieve feature importance of built model
status, artifact = ds.analyze_model(model)
sleep(1)
if status:
    ds.wait_for_job(artifact['job_name'])
    # Download only on success; on failure, artifact holds an error message
    status, feature_importance = ds.download_artifact(artifact['artifact_name'])
else:
    print(artifact)

Show the 10 most important features of the model.

In [None]:
feature_importance[:10]
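
Slicing with `[:10]` returns the ten most important features only if the values are already sorted in descending order. If that is not guaranteed, sorting explicitly is a safe option. The sketch below uses a plain dictionary with made-up feature names and scores purely for illustration; the artifact returned by `download_artifact` may be a different type, and `top_features` is a hypothetical helper.

```python
def top_features(importances, n=10):
    """Return the n (feature, score) pairs with the highest scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Made-up scores for illustration only.
example_importances = {
    "feature_c": 0.05,
    "feature_a": 0.61,
    "feature_d": 0.10,
    "feature_b": 0.24,
}
for name, score in top_features(example_importances, n=3):
    print("{:<10} {:.2f}".format(name, score))
```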

# Predictions
**Perform model prediction on the test dataset.**

In [None]:
ts = '{:%Y%m%d%H%M%S}'.format(datetime.datetime.now())
df = pd.read_csv(os.path.join(PATH_TO_DATASET, TEST_DATASET))
df.head()

In [None]:
# Upload dataset
status, dataset = ds.upload_dataset(os.path.join(PATH_TO_DATASET, TEST_DATASET))
if not status:
    print(dataset)
else:
    print("Dataset uploaded successfully!")

In [None]:
# analyze dataset
status, analyze_id = ds.analyze_data(TEST_DATASET, 
                                     job_name = 'Darwin_analyze_test_job' + "-" + ts, 
                                     artifact_name = 'Darwin_analyze_test_artifact' + "-" + ts)
sleep(1)
if status:
    ds.wait_for_job('Darwin_analyze_test_job' + "-" + ts)
else:
    print(analyze_id)

In [None]:
# clean dataset
status, job_id = ds.clean_data(TEST_DATASET, model_name=model)

if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

In [None]:
status, artifact = ds.run_model(TEST_DATASET, model)
sleep(1)
if status:
    ds.wait_for_job(artifact['job_name'])
else:
    print(artifact)

Download predictions from Darwin's server.

In [None]:
status, prediction = ds.download_artifact(artifact['artifact_name'])
prediction.head()

Plot the predictions against the actual target values.

In [None]:
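# Plot predictions vs. actual values
plt.plot(df[target], prediction[target], '.')
# Reference line y = x, spanning the observed range of the target
lims = [df[target].min(), df[target].max()]
plt.plot(lims, lims, '--k')
plt.xlabel('Actual ' + target)
plt.ylabel('Predicted ' + target)
print('R^2 : ', r2_score(df[target], prediction[target]))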
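
For reference, the coefficient of determination printed above is defined as R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below recomputes it in plain Python on a tiny made-up example; `r2_manual` is a hypothetical helper and is not a replacement for `r2_score`.

```python
def r2_manual(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot for two equal-length lists."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only.
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]
print(round(r2_manual(actual, predicted), 4))  # 0.9486
```

A value of 1 means the predictions match the actual values exactly, while a value near 0 means the model does no better than always predicting the mean.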

## Find out which type of model Darwin used:

In [None]:
status, model_type = ds.lookup_model_name(model)
print(model_type['description']['best_genome'])