# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [2]:
import json
import logging

import azureml.core
import joblib
import pandas as pd
from azureml.core import Environment, Experiment, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice, Webservice
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.train.automl import AutoMLConfig
from azureml.widgets import RunDetails

In [3]:
# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.20.0


In [4]:
ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name="Detection")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = experiment.start_logging()

Workspace name: quick-starts-ws-138190
Azure region: southcentralus
Subscription id: 976ee174-3882-4721-b90a-b5fef6b72f24
Resource group: aml-quickstarts-138190


In [6]:
# TODO: Create compute cluster

# Choose a name for your CPU cluster
cpu_cluster_name = "auto-ml-cluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 2)


Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Dataset

### Overview
The original dataset is hosted on Kaggle Datasets and is licensed under the Open Database License (ODbL) 1.0.

This data is about fraud detection in credit card transactions. It contains transactions made with credit cards in September 2013 by European cardholders. The dataset is highly unbalanced: the positive class, which represents fraudulent transactions, accounts for 0.17% of all transactions.
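To make the imbalance concrete, the sketch below works out what a trivial classifier that always predicts the majority class would score. The count of 100,000 transactions is a hypothetical round number chosen for illustration, not the size of this dataset; only the 0.17% fraud rate comes from the description above.

```python
# Hypothetical round number of transactions, for illustration only.
n_transactions = 100_000
fraud_rate = 0.0017  # 0.17% positive class, as stated above

n_fraud = round(n_transactions * fraud_rate)
n_legit = n_transactions - n_fraud

# A classifier that always predicts "not fraud" (Class = 0) gets every
# legitimate transaction right and every fraudulent one wrong.
baseline_accuracy = n_legit / n_transactions
baseline_recall = 0 / n_fraud  # no fraud is ever caught

print(f"fraud cases: {n_fraud}")
print(f"always-0 accuracy: {baseline_accuracy:.4f}")
print(f"always-0 recall on fraud: {baseline_recall:.1f}")
```

An accuracy close to 99.8% is therefore reachable without detecting a single fraud, which is worth keeping in mind when reading accuracy-based results on this dataset.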

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and further background information about the data are not available. Features V1, V2, ... V28 are the principal components obtained with principal component analysis (PCA); the only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; it can be used for example-dependent cost-sensitive learning. The feature 'Class' is the response variable; it takes the value 1 in case of fraud and 0 otherwise.
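As a quick reference, the column layout described above can be written out in a few lines. This is only a sketch of the naming scheme; the variable names are illustrative and not part of the dataset or the SDK.

```python
# Column layout as described above: Time, V1..V28, Amount, then the label.
pca_features = [f"V{i}" for i in range(1, 29)]
feature_columns = ["Time", *pca_features, "Amount"]
label_column = "Class"
all_columns = [*feature_columns, label_column]

print(len(feature_columns), "input features")
print(len(all_columns), "columns including the label")
```

The count of 30 input features matches the length of each row sent to the deployed web service later in this notebook.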


This project aims to detect potential fraud cases in credit card transactions; the task is to differentiate fraudulent transactions from legitimate ones. My ultimate intent is to tackle this by building classification models that classify and distinguish fraud transactions.

In [7]:
path = 'https://media.githubusercontent.com/media/Tekhunt/Creditcard-fraud-detection/master/fraud-data.csv'
data = TabularDatasetFactory.from_delimited_files(path = path)
data.to_pandas_dataframe().head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


## AutoML Settings and Configuration Explained

### n_cross_validations
How many cross validations to perform when user validation data is not specified.

### enable_early_stopping
Whether to enable early termination if the score is not improving in the short term. The default is False but it is set to True here.

### experiment_timeout_minutes
Maximum amount of time in minutes that all iterations combined can take before the experiment terminates. The `automl_settings` dictionary in this notebook sets it to 5 minutes.

### verbosity
This is the verbosity level for writing to the log file. It is set to logging.INFO here.

### training_data
The training data to be used within the experiment. It can be a DataFrame, Dataset, DatasetDefinition, or TabularDataset, and it should contain both the training features and a label column (optionally also a sample weights column). If training_data is specified, the label_column_name parameter must also be specified.

### label_column_name
This is the name of the label column. If the input data is from a pandas.DataFrame that doesn't have column names, column indices can be used instead, expressed as integers. Here we have column headers, and our target column is the Class column, which we aim to predict in this project.

### max_cores_per_iteration
The maximum number of threads to use for a given training iteration. The value used here is -1, which means that all available cores are used per iteration per child run.

### max_concurrent_iterations
The maximum number of iterations that are executed in parallel. The value used here is 4.

### compute_target
The Azure Machine Learning compute target to run the Automated Machine Learning experiment on.

### primary_metric
The metric that Automated Machine Learning will optimize for model selection. Accuracy is the primary_metric here.

### task
The type of task to run. The value used here is 'classification'.
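Settings like the ones described above are often collected in a dictionary and unpacked into `AutoMLConfig` as keyword arguments. The sketch below uses the values from this notebook and shows one way to merge such a dictionary with explicit arguments while reporting conflicting values. `build_automl_kwargs` is a hypothetical helper written for this illustration; it is not part of the Azure ML SDK.

```python
import logging


def build_automl_kwargs(settings, **explicit):
    """Merge a settings dict with explicit keyword arguments.

    Explicit arguments win on conflict. Conflicting keys are returned as
    well, so that an overridden value does not go unnoticed.
    """
    conflicts = sorted(
        key for key in explicit
        if key in settings and settings[key] != explicit[key]
    )
    merged = {**settings, **explicit}
    return merged, conflicts


automl_settings = {
    "n_cross_validations": 3,
    "enable_early_stopping": True,
    "experiment_timeout_minutes": 5,
    "max_concurrent_iterations": 4,
    "max_cores_per_iteration": -1,
    "verbosity": logging.INFO,
}

kwargs, conflicts = build_automl_kwargs(
    automl_settings,
    task="classification",
    primary_metric="accuracy",
    label_column_name="Class",
    n_cross_validations=5,
)
print("conflicting keys:", conflicts)

# The merged dictionary could then be unpacked into the constructor, e.g.
# AutoMLConfig(compute_target=cpu_cluster, training_data=data, **kwargs)
```

Merging first avoids the `TypeError` that Python raises when the same keyword is supplied both explicitly and through `**automl_settings`.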


In [10]:
# TODO: Put your automl settings here
automl_settings = {
    "n_cross_validations": 3,
    "enable_early_stopping": True,
    "experiment_timeout_minutes": 5,
    "max_concurrent_iterations": 4,
    "max_cores_per_iteration": -1,
    "verbosity": logging.INFO,
}

# TODO: Put your experiment name here

# NOTE: automl_settings is not passed to AutoMLConfig below (for example via
# **automl_settings), so only the arguments listed here take effect. The run
# output accordingly reports 5 cross-validations.
automl_config = AutoMLConfig(
    compute_target=cpu_cluster,
    task='classification',
    primary_metric='accuracy',
    training_data=data,
    label_column_name='Class',
    n_cross_validations=5)

In [11]:
# Submit the experiment and wait for it to finish
remote_run = experiment.submit(automl_config, show_output=True)
remote_run.wait_for_completion()

Running on remote.
No run_configuration provided, running on auto-ml-cluster with default configuration
Running on remote compute: auto-ml-cluster
Parent Run ID: AutoML_26986b18-0c16-4670-b009-817d60da1c62

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetBalancing. Performing class balancing sweeping
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a model's accuracy because the input data has bias t

{'runId': 'AutoML_26986b18-0c16-4670-b009-817d60da1c62',
 'target': 'auto-ml-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-02-08T19:38:32.920955Z',
 'endTimeUtc': '2021-02-08T22:44:23.732322Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'auto-ml-cluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"5c9cf99e-bf70-449d-ad1d-3c9057be00cb\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"isArchive\\\\\\": false, \\\\\\"path\\\\\\": {\\\\\\"target\\\\\\": 4, \\\\\\"resourceDetails\\\\\\": [{\\\\\\"path\\\\\\": \\\\\\"https://media.githubusercontent.com/media/Tekhunt/Creditcard-fraud-detection/master/fraud-data.csv\\\\\\"}]}}, \\\\\\"localData\\\\\\": {}, \\\\\\"isEnabled\\\\\\": tr
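The `startTimeUtc` and `endTimeUtc` fields in the run details above can be turned into a duration with the standard library. This is a small sketch that uses the two timestamps exactly as printed; the format string is an assumption based on how they appear in this output.

```python
from datetime import datetime

# Timestamps copied from the run details output above.
start = "2021-02-08T19:38:32.920955Z"
end = "2021-02-08T22:44:23.732322Z"

# Format assumed from the printed values: ISO date, microseconds, literal "Z".
fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
duration = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)

print("run duration:", duration)
print(f"in minutes: {duration.total_seconds() / 60:.1f}")
```

The run lasted a little over three hours, which is consistent with the experiment timeout from `automl_settings` not having been applied to this run.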

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [12]:
RunDetails(remote_run).show()


_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [13]:
best_run, fitted_model = remote_run.get_output()

Package:azureml-automl-runtime, training version:1.21.0, current version:1.20.0
Package:azureml-core, training version:1.21.0.post1, current version:1.20.0
Package:azureml-dataprep, training version:2.8.2, current version:2.7.3
Package:azureml-dataprep-native, training version:28.0.0, current version:27.0.0
Package:azureml-dataprep-rslex, training version:1.6.0, current version:1.5.0
Package:azureml-dataset-runtime, training version:1.21.0, current version:1.20.0
Package:azureml-defaults, training version:1.21.0, current version:1.20.0
Package:azureml-interpret, training version:1.21.0, current version:1.20.0
Package:azureml-pipeline-core, training version:1.21.0, current version:1.20.0
Package:azureml-telemetry, training version:1.21.0, current version:1.20.0
Package:azureml-train-automl-client, training version:1.21.0, current version:1.20.0
Package:azureml-train-automl-runtime, training version:1.21.0, current version:1.20.0
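The messages above indicate that the model was trained with newer package versions than the ones installed in the notebook environment. A minimal sketch for collecting such mismatches from the message text follows; the regular expression is written against the format shown in this output and is an assumption, not an SDK feature.

```python
import re

# Two lines copied from the output above; the others follow the same pattern.
log_lines = [
    "Package:azureml-core, training version:1.21.0.post1, current version:1.20.0",
    "Package:azureml-dataprep, training version:2.8.2, current version:2.7.3",
]

pattern = re.compile(
    r"Package:(?P<name>[^,]+), training version:(?P<trained>[^,]+), "
    r"current version:(?P<current>\S+)"
)

# Map each package name to its (training version, current version) pair
# whenever the two differ.
mismatches = {}
for line in log_lines:
    match = pattern.match(line)
    if match and match["trained"] != match["current"]:
        mismatches[match["name"]] = (match["trained"], match["current"])

for name, (trained, current) in mismatches.items():
    print(f"{name}: trained with {trained}, running {current}")
```

Such a list can help decide which packages to align before loading the fitted model locally.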


In [17]:
parameter_values = best_run.get_details()['runDefinition']['arguments']


In [18]:
# Retrieve the best AutoML model.
print(best_run)
print(fitted_model)

# Register the model
description = 'This model predicts whether a transaction is fraudulent or not'
model_name = 'creditcard-fraud-detection-model'
tags = None
model = remote_run.register_model(model_name=model_name, description=description, tags=tags)
print(remote_run.model_id)

Run(Experiment: Detection,
Id: AutoML_26986b18-0c16-4670-b009-817d60da1c62_45,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                reg_alpha=0.5789473684210527,
                                                                                                reg_lambda=0.42105263157894735,
               

In [20]:
print('Best Run Id: ', best_run.id)

Best Run Id:  AutoML_26986b18-0c16-4670-b009-817d60da1c62_45
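In this notebook's output, the best run ID is the parent run ID followed by an underscore and a number. The sketch below splits the two parts; treating the suffix as the iteration number is an inference from this output, not a documented guarantee.

```python
# Best run ID as printed above.
best_run_id = "AutoML_26986b18-0c16-4670-b009-817d60da1c62_45"

# Split on the last underscore: everything before it is the parent run ID,
# the remainder is assumed to be the child iteration number.
parent_run_id, _, suffix = best_run_id.rpartition("_")
iteration = int(suffix)

print("parent run:", parent_run_id)
print("iteration:", iteration)
```

The parent ID obtained this way matches the `Parent Run ID` reported when the experiment was submitted.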


In [None]:
#TODO: Save the best model

## Model Deployment

Remember that you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [21]:
# Save the best model for the deployment
import os
os.makedirs('./remote_model', exist_ok=True)

# Download the model under a separate name
best_run.download_file('outputs/model.pkl', os.path.join('./remote_model', 'best_remote_model.pkl'))

# Download every file from the run's outputs folder
for f in best_run.get_file_names():
    if f.startswith('outputs'):
        output_file_path = os.path.join('./remote_model', f.split('/')[-1])
        print(f'Downloading from {f} to {output_file_path} ...')
        best_run.download_file(name=f, output_file_path=output_file_path)

Downloading from outputs/conda_env_v_1_0_0.yml to ./remote_model/conda_env_v_1_0_0.yml ...
Downloading from outputs/env_dependencies.json to ./remote_model/env_dependencies.json ...
Downloading from outputs/internal_cross_validated_models.pkl to ./remote_model/internal_cross_validated_models.pkl ...
Downloading from outputs/model.pkl to ./remote_model/model.pkl ...
Downloading from outputs/pipeline_graph.json to ./remote_model/pipeline_graph.json ...
Downloading from outputs/scoring_file_v_1_0_0.py to ./remote_model/scoring_file_v_1_0_0.py ...
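The mapping from run file names to local paths used by the loop above can be checked without a workspace. The sketch below reproduces only the path logic; `local_targets` is a hypothetical helper, and the `logs/...` entry is an invented file name included to show the filtering.

```python
import posixpath


def local_targets(file_names, target_dir="./remote_model", prefix="outputs/"):
    """Map run files under `prefix` to flat paths inside `target_dir`."""
    return {
        name: posixpath.join(target_dir, posixpath.basename(name))
        for name in file_names
        if name.startswith(prefix)
    }


run_files = [
    "outputs/model.pkl",
    "outputs/scoring_file_v_1_0_0.py",
    "logs/azureml/example.log",  # hypothetical file, skipped by the filter
]

for source, target in local_targets(run_files).items():
    print(f"{source} -> {target}")
```

Because only the base name is kept, two run files with the same name in different subfolders would overwrite each other locally.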


In [23]:
# Register the best model for the deployment
model = best_run.register_model(
    model_name='creditcard-fraud-detection-model',
    model_path='outputs/model.pkl',
    model_framework=Model.Framework.SCIKITLEARN,
    description='The model detects whether a card transaction is fraudulent or not'
)

In [26]:
# Download the conda environment file and define the environment
best_run.download_file('outputs/conda_env_v_1_0_0.yml', 'conda_env.yml')
myenv = Environment.from_conda_specification(name = 'myenv',
                                             file_path = 'conda_env.yml')

In [28]:
# download the scoring file produced by AutoML
best_run.download_file('outputs/scoring_file_v_1_0_0.py', 'score_remote_run.py')

# set inference config
inference_config = InferenceConfig(entry_script= 'score_remote_run.py',
                                    environment=myenv)

In [31]:
# set Aci Webservice config
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, auth_enabled=False)

In [32]:
# Deploy the model as a web service
service_name =  'automl-prediction'
service = Model.deploy(workspace=ws, 
                       name= service_name,
                       models=[model], 
                       inference_config=inference_config,
                       deployment_config=aci_config,
                       overwrite=True)

TODO: In the cell below, send a request to the web service you deployed to test it.

In [34]:
service

AciWebservice(workspace=Workspace.create(name='quick-starts-ws-138190', subscription_id='976ee174-3882-4721-b90a-b5fef6b72f24', resource_group='aml-quickstarts-138190'), name=automl-prediction, image_id=None, compute_type=None, state=ACI, scoring_uri=Transitioning, tags=None, properties={}, created_by={})

In [35]:
# wait for deployment to finish and display the scoring uri and swagger uri
service.wait_for_deployment(show_output=True)

print('Service state:')
print(service.state)

print('Scoring URI:')
print(service.scoring_uri)

print('Swagger URI:')
print(service.swagger_uri)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running..........................................................................................................................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"
Service state:
Healthy
Scoring URI:
http://2beb2e57-4e0a-4beb-889f-4da2184753a5.southcentralus.azurecontainer.io/score
Swagger URI:
http://2beb2e57-4e0a-4beb-889f-4da2184753a5.southcentralus.azurecontainer.io/swagger.json


In [42]:
import requests
import json

# URL for the web service
scoring_uri = 'http://2beb2e57-4e0a-4beb-889f-4da2184753a5.southcentralus.azurecontainer.io/score'


data = {"data":
        [
            [
                0.0,
                1.359807,
                -0.072781,
                2.536347,
                1.378155,
                -0.338321,
                0.462388,
                0.239599,
                0.098698,
                0.363787,
                0.090794172,
                -0.551599533,
                -0.617800856,
                -0.991389847,
                -0.311169354,
                1.468176972,
                -0.470400525,
                0.207971242,
                0.02579058,
                0.40399296,
                0.251412098,
                -0.018306778,
                0.277837576,
                -0.11047391,
                0.066928075,
                0.128539358,
                -0.189114844,
                0.133558377,
                -0.021053053,
                149.62
            ],
            [
                0.0,
                1.191857111,
                0.266150712,
                0.166480113,
                0.448154078,
                0.060017649,
                -0.082360809,
                -0.078802983,
                0.085101655,
                -0.255425128,
                -0.166974414,
                1.612726661,
                1.065235311,
                0.489095016,
                -0.143772296,
                0.635558093,
                0.463917041,
                -0.114804663,
                -0.18336127,
                -0.145783041,
                -0.069083135,
                -0.225775248,
                -0.638671953,
                0.101288021,
                -0.339846476,
                0.167170404,
                0.125894532,
                -0.008983099,
                0.014724169,
                2.69
            ]
        ]
        }
        
# Convert to JSON string
input_data = json.dumps(data)

# Set the content type
headers = {'Content-Type': 'application/json'}
# If authentication is enabled, set the authorization header
#headers['Authorization'] = f'Bearer {key}'

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.text)

"{\"result\": [0, 0]}"

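The response printed above is a JSON string that itself contains JSON, so it has to be decoded twice to reach the predictions. The sketch below shows one way to build the request body and to decode such a response; `build_payload` and `parse_response` are hypothetical helpers, and the expected row length of 30 is taken from the dataset description (Time, V1 to V28, Amount).

```python
import json

N_FEATURES = 30  # Time, V1..V28, Amount


def build_payload(rows):
    """Serialize feature rows into the {"data": [...]} body used above."""
    for i, row in enumerate(rows):
        if len(row) != N_FEATURES:
            raise ValueError(
                f"row {i} has {len(row)} values, expected {N_FEATURES}"
            )
    return json.dumps({"data": rows})


def parse_response(text):
    """Return the list of predictions from the service response text."""
    decoded = json.loads(text)
    # The response shown above is a JSON-encoded string that wraps another
    # JSON document, so decode a second time in that case.
    if isinstance(decoded, str):
        decoded = json.loads(decoded)
    return decoded["result"]


# Response text exactly as printed by the cell above.
response_text = '"{\\"result\\": [0, 0]}"'
print(parse_response(response_text))
```

Both transactions in the request were classified as 0, that is, not fraudulent.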

TODO: In the cell below, print the logs of the web service and delete the service

In [43]:
print(service.get_logs())

2021-02-08T23:01:24,088599066+00:00 - iot-server/run 
2021-02-08T23:01:24,090402808+00:00 - gunicorn/run 
2021-02-08T23:01:24,090566212+00:00 - rsyslog/run 
2021-02-08T23:01:24,092109648+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_7785023fceb74e4facc1b1a577b1faf9/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7785023fceb74e4facc1b1a577b1faf9/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7785023fceb74e4facc1b1a577b1faf9/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7785023fceb74e4facc1b1a577b1faf9/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7785023fceb74e4facc1b1a577b1faf9/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd

In [44]:

service.delete()
cpu_cluster.delete()

Current provisioning state of AmlCompute is "Deleting"

