# Automated ML

Import the dependencies used throughout the project and connect to a compute cluster.

In [1]:
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

amlcompute_cluster_name = "ds12-compute"
ws = Workspace.from_config()
# Reuse the cluster if it already exists; otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

Found existing cluster, use it.


## Dataset

### Overview
The dataset consists of employment applications, along with whether each applicant was hired and how long they worked for the company. The dataset is not publicly available. The task is to train a model on these applications to predict which future applicants are likely to be hired and to stay with the company the longest, so that recruiting effort can be focused on them. To make this part of the notebook work, the data is currently hosted on my Google Drive. I had to pare the data down to stay under the 100 MB limit so that it can be downloaded directly.


In [2]:

datapath = "https://drive.google.com/uc?export=download&id=19-V0WLC9o10S2ncmhqB7mcKchG7W6v7J"
# choose a name for experiment
experiment_name = 'Application-Scoring'

experiment=Experiment(ws, experiment_name)
key = "Trucker Applications"

if key in ws.datasets.keys():
    dataset = ws.datasets[key]
else:
    # Load the delimited file straight from the download URL
    dataset = Dataset.Tabular.from_delimited_files(path=datapath)


df = dataset.to_pandas_dataframe()
df.describe()

Unnamed: 0,app.cre_AppId,app.cre_Experience,app.cre_MonthsExperienceinPast36,app.cre_Veteran,app.cre_WantTeamDriver,app.cre_CDLType,app.cre_ScoreCurrent,app.cre_ScoreInitial,app.cre_VettingStatus,app.cre_AccidentCount,...,cre_washonorablydischarged,cre_recklessdrivingcount,cre_driverchewtobacco,cre_driversmoker,cre_drivervapeuser,cre_teamchewtobaccousers,cre_teamsmokers,cre_teamvapeusers,cre_teamgender,workmonths
count,285415.0,220.0,218.0,285233.0,357.0,285381.0,200877.0,285149.0,283786.0,10316.0,...,10785.0,1368.0,10699.0,10699.0,10699.0,10470.0,10478.0,10484.0,10424.0,285415.0
mean,4046480.0,11.040909,3.733945,171140000.0,171140000.0,171140000.0,3.418679,3.744193,171140000.0,171140000.0,...,170902000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,2.632771,0.466493
std,82382.45,27.337117,9.471901,0.285483,0.2778276,1.072979,0.986843,0.69265,0.3435073,0.6942623,...,6378301.0,0.2293318,0.2112776,0.4451481,0.2855365,0.7431314,0.7543622,0.7601721,0.650777,3.849076
min,3903790.0,0.0,0.0,171140000.0,171140000.0,171140000.0,0.0,1.0,171140000.0,171140000.0,...,0.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,1.0,0.0
25%,3975136.0,0.0,0.0,171140000.0,171140000.0,171140000.0,3.0,3.0,171140000.0,171140000.0,...,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,2.0,0.0
50%,4046480.0,0.0,0.0,171140000.0,171140000.0,171140000.0,4.0,4.0,171140000.0,171140000.0,...,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,3.0,0.0
75%,4117822.0,2.0,0.0,171140000.0,171140000.0,171140000.0,4.0,4.0,171140000.0,171140000.0,...,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,3.0,0.0
max,4189172.0,120.0,36.0,171140000.0,171140000.0,171140000.0,5.0,5.0,171140000.0,171140000.0,...,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,171140000.0,3.0,83.0


## AutoML Configuration

The data is heavily skewed towards applications that were never hired, so I use the weighted area under the curve (`AUC_weighted`) as the primary metric, which should help account for the natural skew towards zero months worked.  
I am attempting a classification that buckets the number of months worked, although a regression approach could be used as well.
I am running this on my own instance, and the timeout needed to be higher because of the large number of records in the dataset.
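The bucketing of months worked is described above but not shown in the notebook. A minimal sketch of one way it could be done follows. The bucket edges, the labels, and the function name `bucket_workmonths` are hypothetical and not taken from the project.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical inclusive upper bounds for the first three buckets, in months.
# These edges are illustrative assumptions, not the author's actual choices.
BUCKET_EDGES = [0, 6, 12]
BUCKET_LABELS = ["never_hired", "1-6_months", "7-12_months", "13+_months"]


def bucket_workmonths(months):
    """Map a non-negative months-worked value to a class label."""
    if months < 0:
        raise ValueError("months worked cannot be negative")
    # bisect_left returns the index of the first edge that is >= months,
    # which is also the index of the bucket the value belongs to.
    return BUCKET_LABELS[bisect_left(BUCKET_EDGES, months)]


# Made-up sample values; 83 is the maximum workmonths seen in df.describe().
sample = [0, 0, 0, 1, 6, 7, 12, 13, 83]
bucket_counts = Counter(bucket_workmonths(m) for m in sample)
print(bucket_counts)
```

Fewer, wider buckets give each class more examples, which may matter here because so few applications have a non-zero value.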

In [3]:
automl_settings = {
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 4,
    "primary_metric" : 'AUC_weighted'
}
project_folder = './automl-trucking-dataset'
automl_config = AutoMLConfig(compute_target=compute_target,
                             task = "classification",
                             training_data=dataset,
                             label_column_name="workmonths",   
                             path = project_folder,
                             enable_early_stopping= True,
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )

In [8]:
remote_run = experiment.submit(automl_config)

Running on remote.


## Run Details

Some models handled the data better than others, and AutoML tried quite a few different preprocessing methods. I chose classification for this set of runs. Given how sparse the data is and how much automatic imputation is applied, it will likely be better to correct some of the data manually before running this again.
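To get a feel for how sparse the columns are, the non-null counts from the `df.describe()` output can be turned into fill rates. This is a minimal sketch: the counts are copied from that output, while the 5% cut-off and the helper name `fill_rates` are assumptions made for illustration.

```python
# Non-null counts copied from the df.describe() output (right-hand columns).
TOTAL_ROWS = 285415  # count reported for workmonths

non_null_counts = {
    "cre_washonorablydischarged": 10785,
    "cre_recklessdrivingcount": 1368,
    "cre_driverchewtobacco": 10699,
    "cre_teamgender": 10424,
    "workmonths": 285415,
}

# Hypothetical cut-off: flag columns populated in fewer than 5% of rows.
SPARSE_THRESHOLD = 0.05


def fill_rates(counts, total_rows):
    """Return {column: fraction of rows with a value}, sparsest first."""
    rates = {name: count / total_rows for name, count in counts.items()}
    return dict(sorted(rates.items(), key=lambda item: item[1]))


rates = fill_rates(non_null_counts, TOTAL_ROWS)
sparse_columns = [name for name, rate in rates.items() if rate < SPARSE_THRESHOLD]

for name, rate in rates.items():
    print(f"{name:30s} {rate:7.2%}")
print("Candidates for manual review:", sparse_columns)
```

Columns flagged this way are the ones where imputation fills in most of the values, so they are natural candidates for manual correction or removal.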

In [10]:

from azureml.widgets import RunDetails
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

## Best Model
The best model found was an ensemble, which combines several different models so that the result holds up across a variety of data.
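Soft voting is one common way to build an ensemble: each model outputs class probabilities and the ensemble averages them, optionally with weights. The following is a minimal sketch of that idea with made-up probabilities. It is not the ensemble produced in this run, and the function name `soft_vote` is hypothetical.

```python
def soft_vote(probabilities_per_model, weights=None):
    """Average per-model class probabilities and pick the most likely class.

    probabilities_per_model: one probability vector per model, all same length.
    weights: optional per-model weights; defaults to equal weighting.
    Returns (predicted_class_index, averaged_probabilities).
    """
    n_models = len(probabilities_per_model)
    if n_models == 0:
        raise ValueError("at least one model is required")
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")

    total_weight = sum(weights)
    n_classes = len(probabilities_per_model[0])
    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, probabilities_per_model))
        / total_weight
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return predicted, averaged


# Made-up probabilities from three models over three classes.
model_outputs = [
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
]

equal_class, equal_probs = soft_vote(model_outputs)
weighted_class, weighted_probs = soft_vote(model_outputs, weights=[1, 3, 1])
print(equal_class, equal_probs)
print(weighted_class, weighted_probs)
```

With equal weights the first class wins; giving the second model three times the weight flips the prediction to the second class, which shows how the weighting shapes the combined result.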


In [4]:
from azureml.train.automl.run import AutoMLRun

# Re-attach to the completed AutoML run by its run ID
remote_run = AutoMLRun(experiment, "AutoML_6d003079-c9d5-47b8-a7a1-b0de97860854")
best_run, model = remote_run.get_output()
metrics_output = best_run.get_metrics()
metrics_output

{'f1_score_micro': 0.970044145469834,
 'precision_score_macro': 0.0273621817332399,
 'recall_score_micro': 0.970044145469834,
 'AUC_macro': 0.9467556500780206,
 'balanced_accuracy': 0.01414141414141414,
 'norm_macro_recall': 0.00025608194622279116,
 'precision_score_micro': 0.970044145469834,
 'average_precision_score_macro': 0.13015300182908726,
 'recall_score_macro': 0.01414141414141414,
 'matthews_correlation': 0.03577364799338233,
 'log_loss': 0.23556484145540318,
 'accuracy': 0.970044145469834,
 'weighted_accuracy': 0.9999587565664301,
 'f1_score_weighted': 0.9553447502551431,
 'recall_score_weighted': 0.970044145469834,
 'AUC_weighted': 0.9816504975828168,
 'average_precision_score_weighted': 0.9788042356698312,
 'precision_score_weighted': 0.9429105938471918,
 'average_precision_score_micro': 0.9912906506703967,
 'f1_score_macro': 0.01417396643026245,
 'AUC_micro': 0.9990943482243209,
 'accuracy_table': 'aml://artifactId/ExperimentRun/dcid.AutoML_6d003079-c9d5-47b8-a7a1-b0de9786
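
The metrics above combine a high `accuracy` (about 0.97) with a very low `balanced_accuracy` (about 0.014). That pattern is what a heavily skewed label distribution can produce when most predictions fall on the majority class. The sketch below reproduces the effect on made-up labels; the labels and the always-predict-zero model are illustrative assumptions, not the actual run.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Made-up labels: 97 applications with class 0, two with class 1, one with class 2.
y_true = [0] * 97 + [1, 1, 2]
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

toy_accuracy = accuracy(y_true, y_pred)
toy_balanced = balanced_accuracy(y_true, y_pred)
print(f"accuracy={toy_accuracy:.3f} balanced_accuracy={toy_balanced:.3f}")
```

The more classes there are, the lower the balanced accuracy of such a model becomes, so it is worth reading the macro-averaged metrics alongside `AUC_weighted` before trusting the headline score.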

In [7]:
import joblib

joblib.dump(model, 'automl_model.joblib')

['automl_model.joblib']

## Model Deployment

I deployed this model because it scored better than the hyperparameter-tuned models. With AutoML I can see the AUC and optimize for it directly, whereas it was difficult to find a model that would finish in the required time using manual training scripts.


In [5]:
description = 'Trucking Application Scoring'
model = remote_run.register_model(description=description,
                                  tags={'area': 'trucking'},
                                  model_name='trucking-app-model')

print(model.name, model.id, model.version, sep='\t')



trucking-app-model	trucking-app-model:2	2


In [6]:
%%writefile score.py
# ---------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# ---------------------------------------------------------
import json
import logging
import os
import pickle
import numpy as np
import pandas as pd
import joblib

import azureml.automl.core
from azureml.automl.core.shared import logging_utilities, log_server
from azureml.telemetry import INSTRUMENTATION_KEY

from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType
from inference_schema.parameter_types.pandas_parameter_type import PandasParameterType


input_sample = pd.DataFrame({"app.cre_AppId": pd.Series([0], dtype="int64"), "app.cre_Experience": pd.Series([0.0], dtype="float64"), "app.cre_MonthsExperienceinPast36": pd.Series([0.0], dtype="float64"), "app.cre_PardotScore": pd.Series(["example_value"], dtype="object"), "app.cre_Veteran": pd.Series([0.0], dtype="float64"), "app.cre_WantTeamDriver": pd.Series([0.0], dtype="float64"), "app.cre_DriverApplicationSource": pd.Series(["example_value"], dtype="object"), "app.cre_RecordSource": pd.Series(["example_value"], dtype="object"), "app.cre_CDLType": pd.Series([0.0], dtype="float64"), "app.cre_AccidentInformationProvided": pd.Series([False], dtype="bool"), "app.cre_ContactInformationProvided": pd.Series([False], dtype="bool"), "app.cre_CriminalInformationProvided": pd.Series([False], dtype="bool"), "app.cre_TicketInformationProvided": pd.Series([False], dtype="bool"), "app.cre_ScoreCurrent": pd.Series([0.0], dtype="float64"), "app.cre_ScoreInitial": pd.Series([0.0], dtype="float64"), "app.cre_VettingStatus": pd.Series([0.0], dtype="float64"), "app.cre_AccidentCount": pd.Series([0.0], dtype="float64"), "app.cre_DUICount": pd.Series([0.0], dtype="float64"), "app.cre_MovingViolationCount": pd.Series([0.0], dtype="float64"), "app.cre_SoftFicoScore": pd.Series([0.0], dtype="float64"), "app.cre_CDLCLPExp": pd.Series(["example_value"], dtype="object"), "app.cre_FelonyCount": pd.Series([0.0], dtype="float64"), "address1_postalcode": pd.Series([0.0], dtype="float64"), "cre_referralcode": pd.Series(["example_value"], dtype="object"), "cre_referralestimatedexperience": pd.Series([0.0], dtype="float64"), "cre_referralsourceid": pd.Series(["example_value"], dtype="object"), "cre_accidentcount": pd.Series([0.0], dtype="float64"), "cre_canpassdrugtest": pd.Series(["example_value"], dtype="object"), "cre_cdlclass": pd.Series(["example_value"], dtype="object"), "cre_cdlexp": pd.Series([0.0], dtype="float64"), "cre_duicount": pd.Series(["example_value"], dtype="object"), 
"cre_hascdl": pd.Series(["example_value"], dtype="object"), "cre_honorablydischarged": pd.Series(["example_value"], dtype="object"), "cre_movingviolationcount": pd.Series([0.0], dtype="float64"), "cre_recordsource": pd.Series([0.0], dtype="float64"), "cre_veteran": pd.Series(["example_value"], dtype="object"), "cre_washonorablydischarged": pd.Series([0.0], dtype="float64"), "cre_minsoftficoscore": pd.Series(["example_value"], dtype="object"), "cre_softficoscore": pd.Series(["example_value"], dtype="object"), "cre_militarydischargedon": pd.Series(["2000-1-1"], dtype="datetime64[ns]"), "cre_recklessdrivingcount": pd.Series([0.0], dtype="float64"), "cre_driverchewtobacco": pd.Series([0.0], dtype="float64"), "cre_driversmoker": pd.Series([0.0], dtype="float64"), "cre_drivervapeuser": pd.Series([0.0], dtype="float64"), "cre_teamchewtobaccousers": pd.Series([0.0], dtype="float64"), "cre_teamoppositegender": pd.Series(["example_value"], dtype="object"), "cre_teamsmokers": pd.Series([0.0], dtype="float64"), "cre_teamvapeusers": pd.Series([0.0], dtype="float64"), "cre_teamgender": pd.Series([0.0], dtype="float64"), "cre_donottext": pd.Series([False], dtype="bool")})
output_sample = np.array([0])
# Create the logger first so it exists even if telemetry setup fails
logger = logging.getLogger('azureml.automl.core.scoring_script')
try:
    log_server.enable_telemetry(INSTRUMENTATION_KEY)
    log_server.set_verbosity('INFO')
except Exception:
    pass


def init():
    global model
    # Locate the deployed model file and deserialize it back into
    # a scikit-learn model
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'model.pkl')
    path = os.path.normpath(model_path)
    path_split = path.split(os.sep)
    log_server.update_custom_dimensions({'model_name': path_split[1], 'model_version': path_split[2]})
    try:
        logger.info("Loading model from path.")
        model = joblib.load(model_path)
        logger.info("Loading successful.")
    except Exception as e:
        logging_utilities.log_traceback(e, logger)
        raise


@input_schema('data', PandasParameterType(input_sample))
@output_schema(NumpyParameterType(output_sample))
def run(data):
    try:
        result = model.predict(data)
        return json.dumps({"result": result.tolist()})
    except Exception as e:
        result = str(e)
        return json.dumps({"error": result})


Overwriting score.py


In [7]:
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment

env = Environment.get(ws, "AzureML-AutoML").clone('Custom-AutoML')


inf_config = InferenceConfig(entry_script='score.py', environment=env)

In [9]:
from azureml.core.webservice import AciWebservice
from azureml.core import Model
service_name = 'trucking-app-scoring'
deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1)

service = Model.deploy(workspace=ws, name=service_name, models=[model], inference_config=inf_config, deployment_config=deployment_config, overwrite=True)
service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running.............................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


The request below is based on a record from the dataset and returns a result of zero.

In [13]:
import json
#dataString = dataset.take(1).to_pandas_dataframe().to_json()
#service = azureml.core.Webservice(ws,'best-model-service-309')

data = [
        {"app.cre_AppId":4054803,
        "app.cre_Experience":0.0,
        "app.cre_MonthsExperienceinPast36":0.0,
        "app.cre_PardotScore":"object",
        "app.cre_Veteran":171140000,
        "app.cre_WantTeamDriver":0,
        "app.cre_DriverApplicationSource":"object",
        "app.cre_RecordSource":"{F13D1BBF-06FE-E611-80DA-0050569526E6}",
        "app.cre_CDLType":171140000,
        "app.cre_AccidentInformationProvided":False,
        "app.cre_ContactInformationProvided":False,
        "app.cre_CriminalInformationProvided":False,
        "app.cre_TicketInformationProvided":False,
        "app.cre_ScoreCurrent":2.0,
        "app.cre_ScoreInitial":4,
        "app.cre_VettingStatus":171140002,
        "app.cre_AccidentCount":0,
        "app.cre_DUICount":0,
        "app.cre_MovingViolationCount":0,
        "app.cre_SoftFicoScore":0,
        "app.cre_CDLCLPExp":"object",
        "app.cre_FelonyCount":0,
        "address1_postalcode":45011,
        "cre_referralcode":"object",
        "cre_referralestimatedexperience":0,
        "cre_referralsourceid":"object",
        "cre_accidentcount":0,
        "cre_canpassdrugtest":"true",
        "cre_cdlclass":"object",
        "cre_cdlexp":0,
        "cre_duicount":"example_value",
        "cre_hascdl":"false",
        "cre_honorablydischarged":"object",
        "cre_movingviolationcount":0,
        "cre_recordsource":7088,
        "cre_veteran":"false",
        "cre_washonorablydischarged":1,
        "cre_minsoftficoscore":"example_value",
        "cre_softficoscore":"example_value",
        "cre_militarydischargedon":"2000-1-1",
        "cre_recklessdrivingcount":0,
        "cre_driverchewtobacco":0,
        "cre_driversmoker":0,
        "cre_drivervapeuser":0,
        "cre_teamchewtobaccousers":0,
        "cre_teamoppositegender":"example_value",
        "cre_teamsmokers":0,
        "cre_teamvapeusers":0,
        "cre_teamgender":0,
        "cre_donottext":True}
    ]
input_payload = json.dumps({
    'data': data
})
print(input_payload)
output = service.run(input_payload)

print(output)

{"data": [{"app.cre_AppId": 4054803, "app.cre_Experience": 0.0, "app.cre_MonthsExperienceinPast36": 0.0, "app.cre_PardotScore": "object", "app.cre_Veteran": 171140000, "app.cre_WantTeamDriver": 0, "app.cre_DriverApplicationSource": "object", "app.cre_RecordSource": "{F13D1BBF-06FE-E611-80DA-0050569526E6}", "app.cre_CDLType": 171140000, "app.cre_AccidentInformationProvided": false, "app.cre_ContactInformationProvided": false, "app.cre_CriminalInformationProvided": false, "app.cre_TicketInformationProvided": false, "app.cre_ScoreCurrent": 2.0, "app.cre_ScoreInitial": 4, "app.cre_VettingStatus": 171140002, "app.cre_AccidentCount": 0, "app.cre_DUICount": 0, "app.cre_MovingViolationCount": 0, "app.cre_SoftFicoScore": 0, "app.cre_CDLCLPExp": "object", "app.cre_FelonyCount": 0, "address1_postalcode": 45011, "cre_referralcode": "object", "cre_referralestimatedexperience": 0, "cre_referralsourceid": "object", "cre_accidentcount": 0, "cre_canpassdrugtest": "true", "cre_cdlclass": "object", "cre_
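
Because `run()` in `score.py` returns either `{"result": [...]}` or `{"error": "..."}` as a JSON string, the caller has to decode the response and check for the error key. A minimal sketch of such a helper follows, assuming the service hands back the string produced by `run()`; the name `parse_scoring_response` is hypothetical.

```python
import json


def parse_scoring_response(raw):
    """Decode a score.py response and return the list of predictions.

    Accepts either the JSON string produced by run() or an already
    decoded dict, and raises if the service reported an error.
    """
    payload = json.loads(raw) if isinstance(raw, str) else raw
    if "error" in payload:
        raise RuntimeError(f"Scoring service reported an error: {payload['error']}")
    return payload["result"]


# Responses shaped like the two branches of run() in score.py.
predictions = parse_scoring_response('{"result": [0]}')
print(predictions)

try:
    parse_scoring_response('{"error": "example failure"}')
except RuntimeError as exc:
    error_message = str(exc)
    print(error_message)
```

Raising on the error branch keeps a failed request from being mistaken for a prediction, since `run()` reports failures inside the response body rather than by raising.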

Print the logs from the service, then delete the service to save money.

In [14]:
#from azureml.core.webservice import Webservice
#service = Webservice(name=best_run.model_id, workspace=ws)
logs = service.get_logs()

for line in logs.split('\n'):
    print(line)

service.delete()

2021-02-03T03:01:25,162370467+00:00 - iot-server/run 
2021-02-03T03:01:25,163954167+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_265db83b0c6014ce472c5de2f0b97e04/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_265db83b0c6014ce472c5de2f0b97e04/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_265db83b0c6014ce472c5de2f0b97e04/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_265db83b0c6014ce472c5de2f0b97e04/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_265db83b0c6014ce472c5de2f0b97e04/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
2021-02-03T03:01:25,164727367+00:00 - gunicorn/run 
2021-02-03T03:01:25,163217067+00:00 - rsyslog/run 
rsyslogd
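
Most of the log output above consists of repeated `no version information available` warnings from nginx. A minimal sketch for filtering them out before reading the logs follows. Treating those lines as ignorable noise is an assumption, and `filter_service_logs` is a hypothetical helper, not part of the Azure ML SDK.

```python
# Substring that marks the repeated nginx library warnings (assumed to be noise).
NOISE_MARKER = "no version information available"


def filter_service_logs(logs):
    """Return the non-empty log lines that do not contain the noise marker."""
    kept = []
    for line in logs.split("\n"):
        line = line.rstrip()
        if line and NOISE_MARKER not in line:
            kept.append(line)
    return kept


# A few lines copied from the service log output above.
sample_logs = "\n".join([
    "2021-02-03T03:01:25,163954167+00:00 - nginx/run ",
    "/usr/sbin/nginx: /azureml-envs/azureml_265db83b0c6014ce472c5de2f0b97e04/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)",
    "/usr/sbin/nginx: /azureml-envs/azureml_265db83b0c6014ce472c5de2f0b97e04/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)",
    "2021-02-03T03:01:25,164727367+00:00 - gunicorn/run ",
])

filtered = filter_service_logs(sample_logs)
for line in filtered:
    print(line)
```

In the notebook, the same helper could be applied to the string returned by `service.get_logs()` before the service is deleted.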