# Automated ML


A recent update to the xgboost framework changed how models are saved.
The azureml core package has not yet taken this change into account, so problems arose
when trying to retrieve and save the trained model. The xgboost version is therefore pinned to 0.90, the last release that still handled models in `.pkl` format.

In [1]:
# Pin the xgboost dependency to an older version
%conda remove xgboost
%pip install xgboost==0.90

import xgboost
print(xgboost.__version__)

Collecting package metadata (repodata.json): done
Solving environment: / failed

PackagesNotFoundError: The following packages are missing from the target environment:
  - xgboost



Note: you may need to restart the kernel to use updated packages.
Collecting xgboost==0.90
  Downloading xgboost-0.90-py2.py3-none-manylinux1_x86_64.whl (142.8 MB)
     |████████████████████████████████| 142.8 MB 26 kB/s
Installing collected packages: xgboost
  Attempting uninstall: xgboost
    Found existing installation: xgboost 1.3.3
    Uninstalling xgboost-1.3.3:
      Successfully uninstalled xgboost-1.3.3
Successfully installed xgboost-0.90
Note: you may need to restart the kernel to use updated packages.
0.90


TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [2]:
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.core.environment import Environment 
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset
from azureml.core.model import InferenceConfig 
from azureml.core.webservice import AciWebservice, Webservice
from azureml.core.model import Model


import os
import joblib
import pandas as pd
import json
import logging
from train import data_process

## Workspace Configuration

In [3]:
ws = Workspace.from_config()
exp = Experiment(workspace=ws, name="udacity-project")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = exp.start_logging()

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code EEHW8F8RD to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.
Workspace name: quick-starts-ws-140005
Azure region: southcentralus
Subscription id: 9e65f93e-bdd8-437b-b1e8-0647cd6098f7
Resource group: aml-quickstarts-140005


## Create compute cluster

In [5]:
# Choose a name for your cluster.
cpu_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Dataset

### Overview

In this project, the *lending club* dataset from LendingClub, an American peer-to-peer lending company, is used. The purpose is to use the data for risk analytics and risk minimization in a banking and financial context. To that end, statistical information about past loan applicants is used to build a supervised learning model whose labels indicate whether or not the applicant fully repaid the loan, so that it can predict whether a new applicant is likely to repay. The aim is for the model to identify patterns in the dataset that can be used to determine the outcome of a new application based on the applicant's financial history. In this way, the probability of defaulting on the loan can be assessed, and lenders can make an informed decision that may reduce credit loss for the company, e.g. by denying the loan, raising the interest rate, or offering a different loan amount.

The task at hand is therefore not only to train an accurate predictive model using a logistic regression algorithm, but also to gain insight into the most important features that determine the result yielded by the model. This allows the company to understand which variables are strong indicators of loan default and to apply this knowledge in future risk assessment.
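
To make the modelling idea concrete, here is a minimal sketch of how a logistic regression turns applicant features into a repayment probability, and how the magnitude of each weight hints at feature importance. The weights, bias, and feature values are made up for illustration and assume standardized features; they are not taken from the trained model.

```python
import math

# Hypothetical weights for standardized features (illustration only,
# not values learned from the lending club data).
WEIGHTS = {"int_rate": -0.9, "dti": -0.4, "annual_inc": 0.6}
BIAS = 1.2


def repayment_probability(features):
    """Logistic model: sigmoid of the weighted feature sum."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_features(weights):
    """Order features by absolute weight, largest first."""
    return sorted(weights, key=lambda name: abs(weights[name]), reverse=True)


# Hypothetical applicant described by standardized feature values.
applicant = {"int_rate": 1.5, "dti": 0.8, "annual_inc": -0.3}
probability = repayment_probability(applicant)
label = 1 if probability >= 0.5 else 0  # 1 = repaid, 0 = defaulted
print(f"P(repaid) = {probability:.3f}, predicted label = {label}")
print("Features by |weight|:", rank_features(WEIGHTS))
```

Comparing weights by magnitude is only meaningful when the features share a comparable scale, which is why the sketch assumes standardized inputs.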

Please note that here only an overview of the dataset is provided and the actual EDA process is carried out in the `train.py` script.

Below is a table with all the information available in the dataset for training the model.


|      LoanStatNew     |                                                                                                Description                                                                                               |
|:--------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| loan_amnt            | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.                             |
| term                 | The number of payments on the loan. Values are in months and can be either 36 or 60.                                                                                                                     |
| int_rate             | Interest Rate on the loan                                                                                                                                                                                |
| installment          | The monthly payment owed by the borrower if the loan originates.                                                                                                                                         |
| grade                | LC assigned loan grade                                                                                                                                                                                   |
| sub_grade            | LC assigned loan subgrade                                                                                                                                                                                |
| emp_title            | The job title supplied by the Borrower when applying for the loan.*                                                                                                                                      |
| emp_length           | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.                                                                        |
| home_ownership       | The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER                                                    |
| annual_inc           | The self-reported annual income provided by the borrower during registration.                                                                                                                            |
| verification_status  | Indicates if income was verified by LC, not verified, or if the income source was verified                                                                                                               |
| issue_d              | The month which the loan was funded                                                                                                                                                                      |
| loan_status          | Current status of the loan                                                                                                                                                                               |
| purpose              | A category provided by the borrower for the loan request.                                                                                                                                                |
| title                | The loan title provided by the borrower                                                                                                                                                                  |
| zip_code             | The first 3 numbers of the zip code provided by the borrower in the loan application.                                                                                                                    |
| addr_state           | The state provided by the borrower in the loan application                                                                                                                                               |
| dti                  | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income. |
| earliest_cr_line     | The month the borrower's earliest reported credit line was opened                                                                                                                                        |
| open_acc             | The number of open credit lines in the borrower's credit file.                                                                                                                                           |
| pub_rec              | Number of derogatory public records                                                                                                                                                                      |
| revol_bal            | Total credit revolving balance                                                                                                                                                                           |
| revol_util           | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.                                                                               |
| total_acc            | The total number of credit lines currently in the borrower's credit file                                                                                                                                 |
| initial_list_status  | The initial listing status of the loan. Possible values are – W, F                                                                                                                                       |
| application_type     | Indicates whether the loan is an individual application or a joint application with two co-borrowers                                                                                                     |
| mort_acc             | Number of mortgage accounts.                                                                                                                                                                             |
| pub_rec_bankruptcies | Number of public record bankruptcies                                                                                                                                                                     |

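As a worked example of the `dti` column, the sketch below computes a debt-to-income ratio as a percentage. The function name and the sample figures are hypothetical, and the percentage scale is an assumption based on the magnitude of the `dti` values that appear later in this notebook (for example 33.95).

```python
def debt_to_income(monthly_debt_payments, annual_income):
    """Return monthly debt payments as a percentage of monthly income."""
    if annual_income <= 0:
        raise ValueError("annual_income must be positive")
    monthly_income = annual_income / 12
    return 100 * monthly_debt_payments / monthly_income


# Hypothetical applicant: 1,500 in monthly debt payments, 60,000 annual income.
dti = debt_to_income(monthly_debt_payments=1500, annual_income=60000)
print(f"dti = {dti:.2f}")
```

Per the table, the debt payments exclude the mortgage and the requested LC loan, so those would have to be left out of `monthly_debt_payments`.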

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [6]:
# Data was uploaded from a local file
df = pd.read_csv('./Data/lending_club_loan.csv')

# Use the data_process function from train.py to clean the data.
x, y = data_process(df)
ds_df = pd.concat([x, y], axis=1)

# Save the data to a CSV file and upload it for the AutoML run
ds_df.to_csv('./data.csv', index=False)

datastore = ws.get_default_datastore()
datastore.upload(src_dir='./', overwrite=True, show_progress=True)
train_data = Dataset.Tabular.from_delimited_files(
    path=datastore.path('./data.csv')
)

Uploading an estimated of 11 files
Uploading ./.amlignore
Uploaded ./.amlignore, 1 files out of an estimated total of 11
Uploading ./.amlignore.amltmp
Uploaded ./.amlignore.amltmp, 2 files out of an estimated total of 11
Uploading ./automl.ipynb
Uploaded ./automl.ipynb, 3 files out of an estimated total of 11
Uploading ./automl.ipynb.amltmp
Uploaded ./automl.ipynb.amltmp, 4 files out of an estimated total of 11
Uploading ./endpoint.py
Uploaded ./endpoint.py, 5 files out of an estimated total of 11
Uploading ./hyperparameter_tuning.ipynb
Uploaded ./hyperparameter_tuning.ipynb, 6 files out of an estimated total of 11
Uploading ./train.py
Uploaded ./train.py, 7 files out of an estimated total of 11
Uploading ./.ipynb_aml_checkpoints/automl-checkpoint2021-2-7-16-48-27.ipynb
Uploaded ./.ipynb_aml_checkpoints/automl-checkpoint2021-2-7-16-48-27.ipynb, 8 files out of an estimated total of 11
Uploading ./__pycache__/train.cpython-36.pyc
Uploaded ./__pycache__/train.cpython-36.pyc, 9 files out o

## AutoML Configuration

The AutoML settings and configuration, in order of appearance, are:

`enable_early_stopping`: allows the run to be stopped early if the model's score is no longer improving from one iteration to the next.

`max_concurrent_iterations`: the maximum number of iterations that may be executed in parallel.

`max_cores_per_iteration`: the number of cores of the compute target to use for each iteration during training (-1 means all available).

`verbosity`: the amount of information included in the training logs.

`compute_target`: the compute to be used for training.

`experiment_timeout_minutes`: how long the experiment is allowed to run. If it is too low, the experiment might not run; if it is too high, experiment timeout failures may occur and lead to unnecessary expenses.

`task`: the type of experiment, in this case classification.

`primary_metric`: the main metric the model is optimized for during training.

`training_data`: the data to use for training the model.

`label_column_name`: the name of the column containing the labels for supervised training.

`n_cross_validations`: the number of cross-validations to perform. This is useful when no validation set is provided: the metric values are averaged over (in this case 3) different splits of the data.
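
The idea behind `n_cross_validations` can be sketched in plain Python: split the rows into k folds, hold out each fold once for validation, and average the metric over the folds. This only illustrates k-fold averaging and is not AutoML's internal implementation; the majority-class "model" and the toy labels are stand-ins.

```python
from collections import Counter


def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, valid
        start += size


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (1 = repaid, 0 = defaulted); purely illustrative.
labels = [1, 1, 0, 1, 0, 1, 1, 0, 1]

fold_scores = []
for train_idx, valid_idx in k_fold_indices(len(labels), k=3):
    # Stand-in "model": always predict the majority class of the training rows.
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    y_valid = [labels[i] for i in valid_idx]
    fold_scores.append(accuracy(y_valid, [majority] * len(y_valid)))

mean_accuracy = sum(fold_scores) / len(fold_scores)
print("Per-fold accuracy:", fold_scores)
print("Mean accuracy:", mean_accuracy)
```

Every row is used for validation exactly once, so the averaged score depends less on one particular split than a single hold-out set would.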



In [7]:
# TODO: Put your automl settings here
automl_settings = {
    'enable_early_stopping': True,
    'max_concurrent_iterations': 4,
    'max_cores_per_iteration': -1,
    'verbosity' : logging.INFO
}

# TODO: Put your automl config here
automl_config = AutoMLConfig(
    compute_target = compute_target,
    experiment_timeout_minutes=60,
    task='classification',
    primary_metric = 'accuracy',
    training_data=train_data,
    label_column_name='loan_repaid',
    n_cross_validations = 3,
    **automl_settings
)

In [8]:
# Submit your automl run

remote_run = exp.submit(config= automl_config, show_output=True)
remote_run.wait_for_completion()

Running on remote.
No run_configuration provided, running on cpu-cluster with default configuration
Running on remote compute: cpu-cluster
Parent Run ID: AutoML_41dee557-4822-4d8a-8bba-d12b26a0af4f

Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

******************************************************

{'runId': 'AutoML_41dee557-4822-4d8a-8bba-d12b26a0af4f',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-03-07T15:55:36.7088Z',
 'endTimeUtc': '2021-03-07T17:02:45.806493Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '3',
  'target': 'cpu-cluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"8611c307-e3eb-426a-b644-527cec71a470\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"./data.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"aml-quickstarts-140005\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"9e65f93e-bdd8-437b-b1e8-0647cd6098f7\\\\\\", \\\\\\"workspaceName\\

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [10]:
RunDetails(remote_run).show()
for child_run in remote_run.get_children():
    print(child_run)

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

Run(Experiment: udacity-project,
Id: AutoML_41dee557-4822-4d8a-8bba-d12b26a0af4f_37,
Type: azureml.scriptrun,
Status: Completed)
Run(Experiment: udacity-project,
Id: AutoML_41dee557-4822-4d8a-8bba-d12b26a0af4f_36,
Type: azureml.scriptrun,
Status: Completed)
Run(Experiment: udacity-project,
Id: AutoML_41dee557-4822-4d8a-8bba-d12b26a0af4f_35,
Type: azureml.scriptrun,
Status: Canceled)
Run(Experiment: udacity-project,
Id: AutoML_41dee557-4822-4d8a-8bba-d12b26a0af4f_34,
Type: azureml.scriptrun,
Status: Completed)
Run(Experiment: udacity-project,
Id: AutoML_41dee557-4822-4d8a-8bba-d12b26a0af4f_33,
Type: azureml.scriptrun,
Status: Completed)
Run(Experiment: udacity-project,
Id: AutoML_41dee557-4822-4d8a-8bba-d12b26a0af4f_32,
Type: azureml.scriptrun,
Status: Completed)
Run(Experiment: udacity-project,
Id: AutoML_41dee557-4822-4d8a-8bba-d12b26a0af4f_31,
Type: azureml.scriptrun,
Status: Canceled)
Run(Experiment: udacity-project,
Id: AutoML_41dee557-4822-4d8a-8bba-d12b26a0af4f_30,
Type: azureml.

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.

In [11]:
# Retrieve and save your best automl model.

best_run, fitted_model = remote_run.get_output()
print('Best Run Id: ', best_run.id)
print('\nAccuracy: ', best_run.get_metrics()['accuracy'])
print(fitted_model)

Package:azureml-automl-runtime, training version:1.23.0, current version:1.22.0
Package:azureml-core, training version:1.23.0, current version:1.22.0
Package:azureml-dataprep, training version:2.10.1, current version:2.9.1
Package:azureml-dataprep-native, training version:30.0.0, current version:29.0.0
Package:azureml-dataprep-rslex, training version:1.8.0, current version:1.7.0
Package:azureml-dataset-runtime, training version:1.23.0, current version:1.22.0
Package:azureml-defaults, training version:1.23.0, current version:1.22.0
Package:azureml-interpret, training version:1.23.0, current version:1.22.0
Package:azureml-mlflow, training version:1.23.0, current version:1.22.0
Package:azureml-pipeline-core, training version:1.23.0, current version:1.22.0
Package:azureml-telemetry, training version:1.23.0, current version:1.22.0
Package:azureml-train-automl-client, training version:1.23.0, current version:1.22.0
Package:azureml-train-automl-runtime, training version:1.23.0, current versio

Best Run Id:  AutoML_41dee557-4822-4d8a-8bba-d12b26a0af4f_36

Accuracy:  0.8898079309185954
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                    min_samples_leaf=0.01,
                                                                                                    min_samples_split=0.01,
                                                     

In [12]:
# Download the model file

best_run.download_file('outputs/model.pkl', 'Automl_model.pkl')

## Model Deployment

Remember that you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [13]:
# Register the best run's output folder and download its conda environment file
best_run.register_model(model_name = "best_run_automl.pkl", model_path = './outputs/')
best_run.get_file_names()
best_run.download_file('outputs/conda_env_v_1_0_0.yml', 'env.yml')

# Register the model
model = remote_run.register_model(model_name = 'best_run_automl.pkl')

# Create the inference configuration from the best run's environment and scoring script
environment = best_run.get_environment()
entry_script='inference/scoring.py'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', entry_script)
inference_config = InferenceConfig(entry_script = entry_script, environment = environment)

# Deploy model as web service (ACI)
deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, 
                                                    memory_gb = 1, 
                                                    auth_enabled= True, 
                                                    enable_app_insights= True)

service = Model.deploy(ws, "aciservice", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output = True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running............................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [14]:
# Getting the service parameters

primary, secondary = service.get_keys()

print('Service state: ' + service.state)
print('Service scoring URI: ' + service.scoring_uri)
print('Service Swagger URI: ' + service.swagger_uri)
print('Service primary authentication key: ' + primary)

Service state: Healthy
Service scoring URI: http://08dc83d0-246a-4281-a39b-82285f763de5.southcentralus.azurecontainer.io/score
Service Swagger URI: http://08dc83d0-246a-4281-a39b-82285f763de5.southcentralus.azurecontainer.io/swagger.json
Service primary authentication key: Qa3dicRCEK5IRa91sY3pMXm5BAdR210X


In [18]:
# Get two different rows from the data to test the deployed model.
# The data in endpoint.py has to be updated with these new dictionaries.

# Loan defaulted
test_1 = x.iloc[4]
result_1 = test_1.to_json()
parsed_1 = json.loads(result_1)

# Loan repaid
test_2 = x.iloc[278]
result_2 = test_2.to_json()
parsed_2 = json.loads(result_2)

data = {'data': [parsed_1,parsed_2]}

print('data = {}'.format(data))


data = {'data': [{'loan_amnt': 24375.0, 'term': 60, 'int_rate': 17.27, 'installment': 609.33, 'annual_inc': 55000.0, 'dti': 33.95, 'earliest_cr_line': 1999, 'open_acc': 13.0, 'pub_rec': 0.0, 'revol_bal': 24584.0, 'revol_util': 69.8, 'total_acc': 43.0, 'mort_acc': 1.0, 'pub_rec_bankruptcies': 0.0, 'zip_code': '11650', 'sub_grade_A2': 0, 'sub_grade_A3': 0, 'sub_grade_A4': 0, 'sub_grade_A5': 0, 'sub_grade_B1': 0, 'sub_grade_B2': 0, 'sub_grade_B3': 0, 'sub_grade_B4': 0, 'sub_grade_B5': 0, 'sub_grade_C1': 0, 'sub_grade_C2': 0, 'sub_grade_C3': 0, 'sub_grade_C4': 0, 'sub_grade_C5': 1, 'sub_grade_D1': 0, 'sub_grade_D2': 0, 'sub_grade_D3': 0, 'sub_grade_D4': 0, 'sub_grade_D5': 0, 'sub_grade_E1': 0, 'sub_grade_E2': 0, 'sub_grade_E3': 0, 'sub_grade_E4': 0, 'sub_grade_E5': 0, 'sub_grade_F1': 0, 'sub_grade_F2': 0, 'sub_grade_F3': 0, 'sub_grade_F4': 0, 'sub_grade_F5': 0, 'sub_grade_G1': 0, 'sub_grade_G2': 0, 'sub_grade_G3': 0, 'sub_grade_G4': 0, 'sub_grade_G5': 0, 'verification_status_Source Verifie
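
The `sub_grade_*` keys in the payload look like a one-hot encoding of `sub_grade` in which the first category (`A1`) has no column of its own. Below is a minimal sketch of that encoding, assuming this is what `data_process` in `train.py` does. The script is not shown here, so this is a guess at the scheme rather than the actual code, and `one_hot_sub_grade` is a hypothetical helper.

```python
# Sub-grades are assumed to run from A1 to G5 (grades A-G, levels 1-5).
SUB_GRADES = [f"{grade}{level}" for grade in "ABCDEFG" for level in range(1, 6)]


def one_hot_sub_grade(sub_grade, drop_first=True):
    """Return {'sub_grade_XX': 0 or 1} columns for a single sub-grade."""
    if sub_grade not in SUB_GRADES:
        raise ValueError(f"unknown sub-grade: {sub_grade!r}")
    # Dropping the first category means A1 is encoded as all zeros.
    categories = SUB_GRADES[1:] if drop_first else SUB_GRADES
    return {f"sub_grade_{c}": int(c == sub_grade) for c in categories}


encoded = one_hot_sub_grade("C5")
print(encoded["sub_grade_C5"], sum(encoded.values()), len(encoded))
```

Dropping one category avoids a redundant column, because the dropped sub-grade is implied whenever all remaining columns are 0.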

TODO: In the cell below, send a request to the web service you deployed to test it.

In [19]:
# Consuming model endpoint
# Send request to deployed service.
# endpoint.py has been modified to contain the REST endpoint URL,
# the primary key and the two new JSON strings for testing

%run endpoint.py

If result is 1, loan has been repaid
{"result": [0, 1]}
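
`endpoint.py` itself is not shown in this notebook. As a rough sketch of what such a script might do, the code below builds a POST request for the scoring URI using only the standard library. The function name, the placeholder URI and key, and the single-field record are hypothetical, and key-based authentication is assumed to be passed as a `Bearer` token in the `Authorization` header.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, key, records):
    """Build (but do not send) a JSON POST request for a scoring endpoint."""
    body = json.dumps({"data": records}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {key}",
    }
    return urllib.request.Request(
        scoring_uri, data=body, headers=headers, method="POST"
    )


# Placeholder values; in practice use service.scoring_uri and the primary key.
request = build_scoring_request(
    scoring_uri="http://example.invalid/score",
    key="<primary-key>",
    records=[{"loan_amnt": 24375.0}],
)
print(request.get_method(), request.full_url)

# Sending the request needs a live service, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

A real request would need to include every feature column the model was trained on, as in the `data` dictionary printed above.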


TODO: In the cell below, print the logs of the web service and delete the service

In [20]:
# Logging

print(service.get_logs())

2021-03-07T17:13:31,212359000+00:00 - rsyslog/run 
2021-03-07T17:13:31,219495300+00:00 - nginx/run 
2021-03-07T17:13:31,215019000+00:00 - gunicorn/run 
2021-03-07T17:13:31,214124800+00:00 - iot-server/run 
/usr/sbin/nginx: /azureml-envs/azureml_661474bbe74e96b5d8added5888dfc85/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_661474bbe74e96b5d8added5888dfc85/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_661474bbe74e96b5d8added5888dfc85/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_661474bbe74e96b5d8added5888dfc85/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_661474bbe74e96b5d8added5888dfc85/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd

In [None]:
# Deleting service

service.delete()

In [17]:
# Cluster clean-up

compute_target.delete()

In [18]:
compute_target

AmlCompute(workspace=Workspace.create(name='quick-starts-ws-126905', subscription_id='82648f26-b738-43a4-9ebb-f954c9f1ff3a', resource_group='aml-quickstarts-126905'), name=cpu-cluster, id=/subscriptions/82648f26-b738-43a4-9ebb-f954c9f1ff3a/resourceGroups/aml-quickstarts-126905/providers/Microsoft.MachineLearningServices/workspaces/quick-starts-ws-126905/computes/cpu-cluster, type=AmlCompute, provisioning_state=Deleting, location=southcentralus, tags=None)

Current provisioning state of AmlCompute is "Deleting"

Current provisioning state of AmlCompute is "Deleting"

