# Introduction

This sample notebook shows how to interact programmatically with the AutoML visual interface in Dataiku DSS. Most of the content was pieced together from articles within Dataiku's [Knowledge Base](https://doc.dataiku.com/).

Though not exhaustive, I hope this notebook introduces some of the key functions you are likely to need, along with a few that are useful when setting these tasks to run automatically through [Scenarios](https://doc.dataiku.com/dss/latest/scenarios/index.html).

For more information on the topic and the APIs, I recommend the [Python API documentation for ML tasks](https://doc.dataiku.com/dss/latest/python-api/ml.html#obtaining-a-handle-to-an-existing-ml-task).

---

# Importing Libraries

In [62]:
import dataiku
import datetime

# # from outside DSS
# import dataikuapi

# Establishing connection to DSS Instance

In [2]:
# establish connection to instance
client = dataiku.api_client()

# # from outside DSS
# host = "http://localhost:11200"
# apiKey = "BCtZV0kLIxHAWCPZTtZM8vgbj2Yzst9F"
# client = dataikuapi.DSSClient(host, apiKey)
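
When connecting from outside DSS, a hardcoded API key is easy to leak along with the notebook. Below is a minimal sketch that reads the host and key from environment variables instead. The variable names `DSS_HOST` and `DSS_API_KEY` and the helper `load_dss_credentials` are hypothetical choices for this example, not a Dataiku convention.

```python
import os


def load_dss_credentials(env=None):
    """Return (host, api_key) read from environment-style variables.

    DSS_HOST and DSS_API_KEY are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("DSS_HOST", "DSS_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Drop a trailing slash so the host can be joined with API paths cleanly
    return env["DSS_HOST"].rstrip("/"), env["DSS_API_KEY"]


# From outside DSS, the result could then be passed to the client:
# host, api_key = load_dss_credentials()
# client = dataikuapi.DSSClient(host, api_key)
```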

# Retrieving Project and ML Tasks

In [3]:
# Get project in DSS
proj = client.get_project("CCFRAUDAVDCORESTART")

In [5]:
# # List ML task that has been created for project
# proj.list_ml_tasks()

# Get ML task that has been created
ml_task = proj.get_ml_task("IryrkUVj", "BA05g5C8") # provide visual_analysis_id & ml_task_id which can be found in the URL


# # Otherwise, you can also create your own ML training task --
# # Create a new ML Task to predict the variable "target" from "trainset"
# ml_task = proj.create_prediction_ml_task( # use .create_clustering_ml_task() for clustering tasks
#     input_dataset="trainset",
#     target_variable="target",
#     ml_backend_type='PY_MEMORY', # ML backend to use
#     guess_policy='DEFAULT' # Template to use for setting default parameters
# )

# # Wait for the ML task to be ready
# ml_task.wait_guess_complete()
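
Since the two identifiers passed to `get_ml_task` are copied from the browser URL, a small parser can save some manual work. The sketch below assumes the ML task page URL has the layout `.../projects/<PROJECT_KEY>/analysis/<analysis_id>/ml/<p|c>/<ml_task_id>/...`. That layout is an assumption, not something this notebook confirms, so check it against a URL from your own instance. `parse_ml_task_url` is a hypothetical helper.

```python
import re

# Assumed URL layout (verify against your own instance):
#   .../projects/<PROJECT_KEY>/analysis/<analysis_id>/ml/<p|c>/<ml_task_id>/...
_ML_TASK_URL = re.compile(
    r"/projects/(?P<project_key>[^/]+)"
    r"/analysis/(?P<analysis_id>[^/]+)"
    r"/ml/[pc]/(?P<ml_task_id>[^/?#]+)"
)


def parse_ml_task_url(url):
    """Extract (project_key, analysis_id, ml_task_id) from an ML task page URL."""
    match = _ML_TASK_URL.search(url)
    if match is None:
        raise ValueError("URL does not look like an ML task page: " + url)
    return match.group("project_key"), match.group("analysis_id"), match.group("ml_task_id")


# Possible use inside the notebook:
# _, analysis_id, ml_task_id = parse_ml_task_url(url_copied_from_browser)
# ml_task = proj.get_ml_task(analysis_id, ml_task_id)
```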

# Changing settings within ML Tasks

## 1. Retrieve settings

In [0]:
settings = ml_task.get_settings()

## 2. Feature Selection

In [0]:
settings.reject_feature("not_useful")
settings.use_feature("useful")

## 3. Feature handling

In [0]:
# Use impact coding rather than dummy-coding
fs = settings.get_feature_preprocessing("mycategory")
fs["category_handling"] = "IMPACT"

# Impute missing with most frequent value
fs["missing_handling"] = "IMPUTE"
fs["missing_impute_with"] = "MODE"
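
The same handling often needs to be applied to several columns. Here is a rough sketch of a helper that does so, assuming only what the cell above shows: that `get_feature_preprocessing(name)` returns a mutable dict. `apply_preprocessing` and `FakeSettings` are hypothetical names, and the fake class exists solely so the example runs without a DSS instance.

```python
def apply_preprocessing(settings, feature_names, overrides):
    """Apply the same preprocessing overrides to several features.

    `settings` is any object exposing get_feature_preprocessing(name) that
    returns a mutable dict, as the ML task settings object does above.
    """
    for name in feature_names:
        fs = settings.get_feature_preprocessing(name)
        fs.update(overrides)


class FakeSettings:
    """Hypothetical stand-in used only to exercise the helper without DSS."""

    def __init__(self, features):
        self._features = {name: {} for name in features}

    def get_feature_preprocessing(self, name):
        return self._features[name]


fake = FakeSettings(["country", "channel"])
apply_preprocessing(
    fake,
    ["country", "channel"],
    {"missing_handling": "IMPUTE", "missing_impute_with": "MODE"},
)
print(fake.get_feature_preprocessing("country"))
```

With a real task, `settings` from `ml_task.get_settings()` would take the place of `fake`, followed by `settings.save()`.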

## 4. Algorithm selection

In [0]:
settings.set_algorithm_enabled("GBT_CLASSIFICATION", True)
# use .get_all_possible_algorithm_names() to find the str names of algorithms

## 5. Algorithm-specific hyperparameter tuning

In [0]:
rf_settings = settings.get_algorithm_settings("RANDOM_FOREST_CLASSIFICATION")

# rf_settings is an object representing the settings for this algorithm.
# The 'enabled' attribute indicates whether this algorithm will be trained.
# Other attributes are the various hyperparameters of the algorithm.

# The precise hyperparameters for each algorithm are not all documented, so let's
# print the dictionary keys to see available hyperparameters.
# Alternatively, tab completion will provide relevant hints to available hyperparameters.
print(rf_settings.keys())

# Let's first have a look at rf_settings.n_estimators which is a multi-valued hyperparameter
# represented as a NumericalHyperparameterSettings object
print(rf_settings.n_estimators)

# Set multiple explicit values for "n_estimators" to be explored during the search
rf_settings.n_estimators.definition_mode = "EXPLICIT"
rf_settings.n_estimators.values = [100, 200]
# Alternatively use the set_values setter
rf_settings.n_estimators.set_values([100, 200])

# Set a range of values for "n_estimators" to be explored during the search
rf_settings.n_estimators.definition_mode = "RANGE"
rf_settings.n_estimators.range.min = 10
rf_settings.n_estimators.range.max = 100
rf_settings.n_estimators.range.nb_values = 5  # Only relevant for grid-search
# Alternatively, use the set_range setter
rf_settings.n_estimators.set_range(min=10, max=100, nb_values=5)

# Let's now have a look at rf_settings.selection_mode which is a single-valued hyperparameter
# represented as a SingleCategoryHyperparameterSettings object.
# The object stores the valid options for this hyperparameter.
print(rf_settings.selection_mode)

# Feature selection mode is not multi-valued, so it is not actually explored during the
# hyperparameter search
rf_settings.selection_mode = "sqrt"

## 6. Saving changes made to settings

In [0]:
settings.save()

# Begin training session

In [0]:
training_time = datetime.datetime.now().strftime('%Y/%m/%d %H:%M')
ml_task.start_train(session_name=f'Programmatic Run @ {training_time}')
ml_task.wait_train_complete()

# Interact with training session

## 1. Get latest training session id

In [43]:
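# Compare session numbers numerically (assuming ids like "s12") so that "s10" ranks above "s9"
latest_session_id = max({i.split('-')[4] for i in ml_task.get_trained_models_ids()}, key=lambda s: int(s[1:]))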
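
The session id is taken from the fifth dash-separated field of a full model id. If sessions are ranked as plain strings, "s10" sorts before "s9", so the ordering goes wrong once a task has ten or more sessions. The sketch below contrasts the two orderings. The example ids are hypothetical, and the `s<number>` shape of the session field is an assumption.

```python
def latest_session_id(model_ids):
    """Return the highest session id (for example "s12") among full model ids.

    Assumes the session is the fifth dash-separated field of each id and
    that it looks like "s<number>".
    """
    sessions = {model_id.split("-")[4] for model_id in model_ids}
    return max(sessions, key=lambda session: int(session[1:]))


# Hypothetical ids, shaped like the ones this notebook splits on "-"
example_ids = [
    "A-PROJ-IryrkUVj-BA05g5C8-s9-pp1-m1",
    "A-PROJ-IryrkUVj-BA05g5C8-s10-pp1-m1",
    "A-PROJ-IryrkUVj-BA05g5C8-s10-pp1-m2",
]
print(latest_session_id(example_ids))                        # s10 (numeric comparison)
print(sorted(example_ids, reverse=True)[0].split("-")[4])    # s9 (plain string sort)
```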

## 2. Compare results of latest training session & store best performing model

In [56]:
# Get the identifiers of the models trained in the latest session
# The list has several entries if multiple models were trained in the same session
ids = ml_task.get_trained_models_ids(session_id=latest_session_id)

algorithm = ''
algorithm_id = ''
best_auc = None

# Iterate through the trained models and keep the one with the highest AUC
for model_id in ids:
    details = ml_task.get_trained_model_details(model_id)
    auc = details.get_performance_metrics()["auc"]

    # The first model always becomes the initial best; later ones must beat it
    if best_auc is None or auc > best_auc:
        algorithm = details.get_modeling_settings()["algorithm"]
        algorithm_id = model_id
        best_auc = auc
        
print(f'''
Best Performing Model : {algorithm}
Metric Achieved : {best_auc}
''')


Best Performing Model : LIGHTGBM_CLASSIFICATION
Metric Achieved : 0.7358589241485487
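
The comparison above is tied to AUC. As a sketch of a more general version, the helper below picks the best model from a plain mapping of model id to metrics dict, such as one built from `get_performance_metrics()` for each id. `pick_best_model` is a hypothetical name, and the ids, values, and the `logLoss` key in the example are made up for illustration.

```python
def pick_best_model(metrics_by_model, metric="auc", higher_is_better=True):
    """Return (model_id, value) for the best model, skipping missing metrics.

    `metrics_by_model` maps a model id to a dict of metric values.
    """
    scored = {
        model_id: metrics[metric]
        for model_id, metrics in metrics_by_model.items()
        if metrics.get(metric) is not None
    }
    if not scored:
        raise ValueError(f"No model reports the metric {metric!r}")
    chooser = max if higher_is_better else min
    best_id = chooser(scored, key=scored.get)
    return best_id, scored[best_id]


# Hypothetical values for illustration only
example = {
    "model-a": {"auc": 0.71, "logLoss": 0.52},
    "model-b": {"auc": 0.74, "logLoss": 0.49},
    "model-c": {"logLoss": 0.60},
}
print(pick_best_model(example))
print(pick_best_model(example, metric="logLoss", higher_is_better=False))
```

Skipping models that lack the metric avoids a `KeyError` when a session mixes models that do not all report the same metrics.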



## 3. Deploy Best Model in Flow

In [0]:
# # Deploying best model as a new Saved Model in the flow
# ret = ml_task.deploy_to_flow(algorithm_id, "my_model", "trainset")

# print("Deployed to saved model id = %s train recipe = %s" % (ret["savedModelId"], ret["trainRecipeName"]))

# Redeploy best model to an existing Saved Model in the flow
saved_model_id = proj.list_saved_models()[0]['id'] # Assuming there is only one Saved Model in the flow
ml_task.redeploy_to_flow(algorithm_id, saved_model_id=saved_model_id)
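
Taking the first entry of `list_saved_models()` only works while the flow holds a single Saved Model. A minimal sketch of a lookup by name follows. It assumes each listed item is a dict with a `"name"` key alongside the `"id"` key used above, which you should verify on your instance. `find_saved_model_id` is a hypothetical helper.

```python
def find_saved_model_id(saved_models, name):
    """Return the id of the single Saved Model called `name`.

    `saved_models` is a list of dicts with "id" and "name" keys; the "name"
    key is an assumption about what list_saved_models() returns.
    """
    matches = [sm["id"] for sm in saved_models if sm.get("name") == name]
    if len(matches) != 1:
        raise LookupError(f"Expected one Saved Model named {name!r}, found {len(matches)}")
    return matches[0]


# Possible use inside the notebook ("my_model" is a placeholder name):
# saved_model_id = find_saved_model_id(proj.list_saved_models(), "my_model")
# ml_task.redeploy_to_flow(algorithm_id, saved_model_id=saved_model_id)
```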