# Automated ML

Import the dependencies needed to complete the project.

In [1]:
from azureml.core import Workspace, Experiment
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.data.dataset_factory import TabularDatasetFactory
from train import split_data
from sklearn.model_selection import train_test_split
from azureml.core import ScriptRunConfig 
import os

## Dataset

### Overview
For this project, the dataset chosen is the **[Wisconsin Breast Cancer](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data)** dataset from Kaggle. It has a total of 32 columns, which include<br>
Attribute information:
<ul>
    <li>ID Number</li>
    <li>Diagnosis (M = malignant, B = benign)</li>
</ul>
and real-valued features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, which describe the characteristics of the cell nuclei present in the image:
<ul>
    <li>radius (mean of distances from center to points on the perimeter)</li>
    <li>texture (standard deviation of gray-scale values)</li>
    <li>perimeter</li>
    <li>area</li>
    <li>smoothness (local variation in radius lengths) </li>
    <li>compactness (perimeter^2 / area - 1.0) </li>
    <li>concavity (severity of concave portions of the contour) </li>
    <li>concave points (number of concave portions of the contour)</li>
    <li>symmetry</li>
    <li>fractal dimension ("coastline approximation" - 1)</li>
</ul>
The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features.
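
The column count above can be checked with a short sketch: ten measurements times three statistics gives 30 features, plus the ID and diagnosis columns for 32 in total. The `_mean`, `_se`, and `_worst` suffixes and the exact spellings below are assumptions about how the CSV names its columns, so adjust them to match the file.

```python
from itertools import product

# The ten per-nucleus measurements listed above (spellings are assumed).
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Assumed suffixes for the mean, standard error, and "worst" statistics.
STATISTICS = ["mean", "se", "worst"]


def feature_columns():
    """Return the 30 feature column names, grouped by statistic."""
    return [f"{base}_{stat}" for stat, base in product(STATISTICS, BASE_FEATURES)]


columns = feature_columns()
print(len(columns))      # 30 feature columns
print(len(columns) + 2)  # 32 once the ID and diagnosis columns are added
```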

The task is to classify each FNA image, given its features, as malignant or benign, so a **binary classification** algorithm is required. All features except the *ID Number* are used to train the model, and the *diagnosis* column is the target variable.
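
As a rough illustration of that preparation step, the sketch below drops the ID column and encodes the diagnosis as 1 (malignant) or 0 (benign). The function name, the `id` column name, and the sample rows are hypothetical; the notebook's own `split_data` helper lives in `train.py` and may work differently.

```python
# Hypothetical rows; the values are illustrative and not taken from the dataset.
rows = [
    {"id": 1, "diagnosis": "M", "radius_mean": 18.0, "texture_mean": 10.4},
    {"id": 2, "diagnosis": "B", "radius_mean": 12.5, "texture_mean": 17.9},
]

# Encoding used in this project: malignant -> 1, benign -> 0.
LABELS = {"M": 1, "B": 0}


def split_features_and_target(rows, id_column="id", target_column="diagnosis"):
    """Separate the encoded target from the features and drop the ID column."""
    features, target = [], []
    for row in rows:
        target.append(LABELS[row[target_column]])
        features.append(
            {k: v for k, v in row.items() if k not in (id_column, target_column)}
        )
    return features, target


x, y = split_features_and_target(rows)
print(y)     # [1, 0]
print(x[0])  # {'radius_mean': 18.0, 'texture_mean': 10.4}
```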


The dataset is accessed from the workspace through this [URL](https://raw.githubusercontent.com/JoanneJons/azure-machine-learning-capstone/main/breast-cancer-dataset.csv?token=AJ5V2OGXYLJ22BGYXN4EUODAC6P4K)

In [2]:
ws = Workspace.from_config()

# Choose a name for the experiment
experiment_name = 'capstone-project-automl'

experiment = Experiment(ws, experiment_name)

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

web_path = "https://raw.githubusercontent.com/JoanneJons/azure-machine-learning-capstone/main/breast-cancer-dataset.csv?token=AJ5V2OGXYLJ22BGYXN4EUODAC6P4K"
dataset = TabularDatasetFactory.from_delimited_files(path=web_path)

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code F2ZE8NUFZ to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.
Workspace name: quick-starts-ws-136132
Azure region: southcentralus
Subscription id: 9e65f93e-bdd8-437b-b1e8-0647cd6098f7
Resource group: aml-quickstarts-136132


In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cpu_cluster_name = "cpucluster"
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
cpu_cluster.wait_for_completion(show_output=True)

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [4]:
import pandas as pd
x, y = split_data(dataset)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)

# Create the local directory that will hold the CSV files.
try:
    os.makedirs('./data', exist_ok=True)
except OSError as error:
    print(f'New directory cannot be created: {error}')

# Copy each split before adding the label column so that pandas does not
# emit a SettingWithCopyWarning.
train_df = X_train.copy()
train_df['diagnosis'] = y_train

train_path = 'data/train-data.csv'
train_df.to_csv(train_path)

test_df = X_test.copy()
test_df['diagnosis'] = y_test

test_path = 'data/test-data.csv'
test_df.to_csv(test_path)

# Upload both CSV files to the workspace's default datastore.
datastore = ws.get_default_datastore()
datastore.upload(src_dir='data', target_path='data')

Uploading an estimated of 2 files
Uploading data/test-data.csv
Uploaded data/test-data.csv, 1 files out of an estimated total of 2
Uploading data/train-data.csv
Uploaded data/train-data.csv, 2 files out of an estimated total of 2
Uploaded 2 files


$AZUREML_DATAREFERENCE_d915dbd50a414905bda133a94e1bf0c7

In [5]:
train_data = TabularDatasetFactory.from_delimited_files(path=[(datastore, 'data/train-data.csv')])
test_data = TabularDatasetFactory.from_delimited_files(path=[(datastore, 'data/test-data.csv')])

In [6]:
train_data

{
  "source": [
    "('workspaceblobstore', 'data/train-data.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ]
}

## AutoML Configuration

For this project, AutoML was configured using an instance of the `AutoMLConfig` class. The following parameters were set:<br>
1. `experiment_timeout_minutes = 30`<br>
*Maximum amount of time in minutes that all iterations combined can take before the experiment terminates.*<br>
For this project, this is set to 30 because of the time limits on Udacity labs.<br><br>
2. `task = 'classification'`<br>
*The type of task to run depending on the automated ML problem to solve.*<br>
This project handles a binary classification task.<br><br>
3. `compute_target=cpu_cluster`<br>
*The Azure Machine Learning compute target to run the AutoML experiment on.*<br>For this experiment, a compute cluster named `cpucluster` (held in the variable `cpu_cluster`) is created before configuring AutoML. The cluster uses the *STANDARD_D2_V2* VM size with a maximum of 4 nodes.<br><br>
4. `training_data = train_data`<br>
*The training data to be used within the experiment.*<br>Here `train_data` is a TabularDataset loaded from a CSV file.<br><br>
5. `primary_metric = 'accuracy'`<br>
*The metric that AutoML will optimize for model selection.*<br><br>
6. `label_column_name = 'diagnosis'`<br>
*The name of the label column.*<br>Here the target column is 'diagnosis' which specifies whether the instance is malignant (1) or benign (0).<br><br>
7. `n_cross_validations = 5`<br>
*The number of cross validations to perform when user validation data is not specified.*<br><br>
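
To make `n_cross_validations = 5` concrete, the following sketch partitions sample indices into five folds, each used once for validation while the rest are used for training. It is a plain-Python illustration of k-fold splitting, not AutoML's internal implementation, which may shuffle or stratify the data.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, validation_indices) for each of the folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Twelve samples are used here only to keep the printout short.
folds = list(k_fold_indices(12, n_folds=5))
for fold, (train, validation) in enumerate(folds):
    print(fold, len(train), validation)
```

Every sample lands in exactly one validation fold, so each model is scored on data it did not see during that fold's training.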

In [7]:
from azureml.train.automl import AutoMLConfig

# Define Automl config 
automl_config = AutoMLConfig(
    experiment_timeout_minutes=30,
    task='classification',
    primary_metric='accuracy',
    training_data=train_data,
    label_column_name='diagnosis',
    n_cross_validations=5,
    compute_target=cpu_cluster
)

In [8]:
# Submit your experiment
remote_run = experiment.submit(automl_config, show_output=True)
RunDetails(remote_run).show()
remote_run.wait_for_completion(show_output=True)

Running on remote.
No run_configuration provided, running on cpucluster with default configuration
Running on remote compute: cpucluster
Parent Run ID: AutoML_306a6b47-7940-4c6b-afd5-bcf5eedce08e

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…



****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

****************************************************************************************************

TYPE:         High cardinality feature detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and no high cardinality features were detected.
              Learn more abo

{'runId': 'AutoML_306a6b47-7940-4c6b-afd5-bcf5eedce08e',
 'target': 'cpucluster',
 'status': 'Completed',
 'startTimeUtc': '2021-01-26T14:38:00.83882Z',
 'endTimeUtc': '2021-01-26T15:20:11.24408Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'cpucluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"8cba299f-f23c-4199-9b28-684537d33036\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"data/train-data.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"aml-quickstarts-136132\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"9e65f93e-bdd8-437b-b1e8-0647cd6098f7\\\\\\", \\\\\\"workspac

## Best Model

In [12]:
best_run, fitted_model = remote_run.get_output()

print(best_run)

Run(Experiment: capstone-project-automl,
Id: AutoML_306a6b47-7940-4c6b-afd5-bcf5eedce08e_22,
Type: azureml.scriptrun,
Status: Completed)


In [13]:
print(fitted_model)

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('StandardScalerWrapper',
                 <azureml...el_wrappers.StandardScalerWrapper object at 0x7fb25c463080>),
                ('LogisticRegression',
                 LogisticRegression(C=0.8286427728546842, class_weight=None,
                                    dual=False, fit_intercept=True,
                                    intercept_scaling=1, l1_ratio=None,
                                    max_iter=100, multi_c
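
The printed pipeline standardizes the features and then applies logistic regression. A minimal plain-Python sketch of those two steps is given below, assuming the population standard deviation is used for scaling; the weights and bias are made-up values, since the fitted coefficients are not shown in the output above.

```python
import math


def standardize(column):
    """Scale values to zero mean and unit variance (population std)."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(variance) or 1.0  # fall back to 1.0 for constant columns
    return [(v - mean) / std for v in column]


def predict_proba(features, weights, bias):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


scaled = standardize([2.0, 4.0, 6.0])
print(scaled)

# Hypothetical weights and bias, for illustration only.
probability = predict_proba([scaled[2], scaled[0]], weights=[0.7, -0.3], bias=0.0)
print(probability)
```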

In [14]:
best_run.get_tags()

{'_aml_system_azureml.automlComponent': 'AutoML',
 '_aml_system_ComputeTargetStatus': '{"AllocationState":"steady","PreparingNodeCount":0,"RunningNodeCount":0,"CurrentNodeCount":1}',
 '_aml_system_automl_is_child_run_end_telemetry_event_logged': 'True'}

In [15]:
metrics = best_run.get_metrics()
metrics['accuracy']

0.9766038454216638
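
For reference, accuracy is the fraction of predictions that match the true labels. The sketch below computes it for hypothetical labels; how AutoML aggregates the metric across the cross-validation folds is not shown in this notebook.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions equal to the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: three of the four predictions are correct.
score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
print(score)  # 0.75
```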

In [16]:
# Save the best model
import joblib
from azureml.core.model import Model

description = "AutoML model trained on Wisconsin Breast Cancer dataset"

os.makedirs('outputs', exist_ok=True)
joblib.dump(fitted_model, filename="outputs/automl-model.pkl")
automl_model = remote_run.register_model(model_name='automl-breast-cancer', description=description)

## Model Deployment

In [17]:
from azureml.core.webservice import AciWebservice

aci_config = AciWebservice.deploy_configuration(
    cpu_cores=1,
    memory_gb=1,
    description='Predict tumor as Malignant(1) or Benign(0)',
    auth_enabled=True
)

In [19]:
from azureml.core.webservice import Webservice
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment
from azureml.core import Workspace
from azureml.core.model import Model
from azureml.automl.core.shared import constants

model = Model(ws, 'automl-breast-cancer')


myenv = best_run.get_environment()
entry_script = 'score.py'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', entry_script)
best_run.download_file(constants.CONDA_ENV_FILE_PATH, 'myenv.yml')

inference_config = InferenceConfig(entry_script=entry_script, environment=myenv)

service = Model.deploy(workspace=ws, 
                       name='automl-breast-cancer', 
                       models=[model], 
                       inference_config=inference_config, 
                       deployment_config=aci_config)

service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running.....................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [20]:
service.update(enable_app_insights=True)

In [21]:
print("State "+ service.state)
print("Key " + service.get_keys()[0])
print("Swagger URI " + service.swagger_uri)
print("Scoring URI " + service.scoring_uri)

State Healthy
Key wLNKqhKi4Fr2f0vsmMfMxnPoCPMgzrQz
Swagger URI http://a6d9f881-3dba-4b47-a901-a13f96fef301.southcentralus.azurecontainer.io/swagger.json
Scoring URI http://a6d9f881-3dba-4b47-a901-a13f96fef301.southcentralus.azurecontainer.io/score


In [22]:
# Model endpoint is consumed using a script which contains 2 datapoints
# Expected result [0, 1]
%run endpoint.py

{"result": [0, 1]}
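
`endpoint.py` is not reproduced in this notebook. As a hedged sketch of how such a script might call the service, the code below builds an authenticated POST request using only the standard library. The `{"data": [...]}` payload shape, the placeholder URI and key, and the shortened feature record are assumptions; a real request should carry every feature column the model expects.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, key, records):
    """Build a POST request carrying the records as JSON with key-based auth."""
    body = json.dumps({"data": records}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {key}",
    }
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


request = build_scoring_request(
    "http://example.invalid/score",                 # placeholder scoring URI
    "<service-key>",                                # placeholder key
    [{"radius_mean": 14.2, "texture_mean": 19.1}],  # shortened, illustrative record
)
print(request.get_method(), request.full_url)

# Sending is left commented out because the placeholder URI does not resolve:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Reading the key from an environment variable or from `service.get_keys()` at run time, rather than hard-coding it in the script, keeps it out of version control.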


## Print Logs and Delete Resources

In [23]:
print(service.get_logs())


2021-01-26T15:41:07,083143717+00:00 - gunicorn/run 
2021-01-26T15:41:07,082865008+00:00 - rsyslog/run 
2021-01-26T15:41:07,085280586+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_265db83b0c6014ce472c5de2f0b97e04/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_265db83b0c6014ce472c5de2f0b97e04/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_265db83b0c6014ce472c5de2f0b97e04/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_265db83b0c6014ce472c5de2f0b97e04/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_265db83b0c6014ce472c5de2f0b97e04/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
2021-01-26T15:41:07,083662834+00:00 - iot-server/run 
rsyslogd

In [47]:
service.delete()
cpu_cluster.delete()

No service with name automl-breast-cancer found to delete.
Current provisioning state of AmlCompute is "Deleting"

