# Automated ML

## Introduction
Asteroids are minor planets, especially those of the inner Solar System. Larger asteroids have also been called planetoids.
Studying asteroids is important because historical events show that some of them can be hazardous.

For this Capstone project, I use machine learning to predict whether an asteroid is hazardous or not.

## Azure Machine Learning SDK-specific Imports
In the cell below, we are importing all the dependencies that we will need to complete the project.

In [18]:
from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from train import clean_data
from azureml.core.datastore import Datastore
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.train.automl import AutoMLConfig
from azureml.widgets import RunDetails
import pandas as pd
from azureml.core.model import InferenceConfig
import json
import requests
from azureml.core.model import Model

## Initialize Workspace and Create an Azure ML Experiment
Let's initialize a workspace object from persisted configuration and create an experiment named "capstone_automl".

In [2]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'capstone_automl'

exp = Experiment(ws, experiment_name)

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = exp.start_logging()

Workspace name: quick-starts-ws-137123
Azure region: southcentralus
Subscription id: a24a24d5-8d87-4c8a-99b6-91ed2d2df51f
Resource group: aml-quickstarts-137123


## Create or Attach an AmlCompute Cluster
We need to create a compute target for our AutoML run. The compute target previously created for the HyperDrive run can also be reused for executing this AutoML run.

In [3]:
aml_compute_target = "cpu-cluster"
try:
  aml_compute = AmlCompute(ws, aml_compute_target)
  print("Found existing compute target!")
except ComputeTargetException:
  print("Creating new compute cluster...")
  provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2", min_nodes = 1, max_nodes = 6)
  aml_compute = ComputeTarget.create(ws, aml_compute_target, provisioning_config)
  aml_compute.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

print("Azure Machine Learning Compute Cluster Created!")

Found existing compute target!
Azure Machine Learning Compute Cluster Created!


## Dataset

### Overview
The data describes asteroids and is provided by NEOWS (Near-Earth Object Web Service). It is a NASA dataset available on Kaggle and can be downloaded from [this](https://www.kaggle.com/shrutimehta/nasa-asteroids-classification/download) link.

The dataset contains various measurements for each asteroid and labels each asteroid as hazardous (1) or non-hazardous (0). The dataset consists of ***4687 data instances (rows) and 40 features (columns)***. There are no null values in the dataset.

The task here is to use AutoML to perform Classification on this dataset and predict whether a given asteroid is hazardous or not.

For this project, I downloaded the dataset, saved it in the project's GitHub repository, and access it using [this](https://raw.githubusercontent.com/Anupriya-S/Capstone-Azure-Machine-Learning-Engineer/main/nasa.csv) link.

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/Anupriya-S/Capstone-Azure-Machine-Learning-Engineer/main/nasa.csv')

In [5]:
# Use clean_data() (defined in 'train.py') for processing the dataframe

df = clean_data(df)

In [6]:
found = False
key = "NASA Asteroids"
description_text = "The data is about Asteroids and is provided by NEOWS(Near-Earth Object Web Service)."

if key in ws.datasets.keys(): 
        found = True
        dataset = ws.datasets[key] 

if not found:
        # Create AML Dataset and register it into Workspace
        data_store = Datastore.get_default(ws)
        dataset = TabularDatasetFactory.register_pandas_dataframe(df, target=data_store, name=key, description=description_text, tags=None, show_progress=True)

Method register_pandas_dataframe: This is an experimental method, and may change at any time.
For more information, see https://aka.ms/azuremlexperimental.


Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/df3c3c5e-debf-4e25-9109-00a5dbb754c4/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


## AutoML Configuration
For the AutoML run we use the following settings and configuration:
1. Experiment timeout is set to 30 minutes so that the AutoML run completes within a fixed timeframe.
2. Maximum concurrent iterations is set to 5, which stays below the compute cluster's upper limit of 6 nodes.
3. Primary metric is set to accuracy, which is the metric AutoML optimizes when selecting the best model.
4. Task is set to classification because the target label is binary (hazardous or non-hazardous).
5. The number of cross-validation folds is set to 5 so that each model is validated on held-out data.
6. Early stopping is enabled so that the experiment terminates when the score stops improving instead of using up the remaining time budget.
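
To make the cross-validation setting concrete, here is a minimal sketch of how 5 folds partition the 4687 rows of this dataset. It uses plain Python and assumes a simple k-fold split in which the first folds absorb the remainder; the splitting that AutoML actually performs may differ (for example, it may shuffle or stratify). The function name `kfold_sizes` is hypothetical.

```python
def kfold_sizes(n_samples, n_splits):
    """Return the validation-fold sizes for a plain k-fold split.

    The first (n_samples % n_splits) folds get one extra row, so every
    row lands in exactly one validation fold.
    """
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_rows = 4687  # number of rows stated in the dataset overview
sizes = kfold_sizes(n_rows, 5)

for fold, val_size in enumerate(sizes):
    train_size = n_rows - val_size
    print(f"fold {fold}: train on {train_size} rows, validate on {val_size} rows")
```

Under this assumption, each model is trained on roughly 80% of the data and validated on the remaining 20%, five times over.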

In [7]:
# AutoML settings: time budget, parallelism, and the metric to optimize
automl_settings = {
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 5,
    "primary_metric" : 'accuracy'
}

# AutoML configuration: task, data, label column, and validation strategy
automl_config = AutoMLConfig(compute_target=aml_compute,
                             task = "classification",
                             training_data=dataset,
                             label_column_name="Hazardous",  
                             n_cross_validations=5,
                             enable_early_stopping= True,
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )

In [8]:
# Submit the experiment and stream the run status

automl_run = exp.submit(config=automl_config, show_output=True)

Running on remote.
No run_configuration provided, running on cpu-cluster with default configuration
Running on remote compute: cpu-cluster
Parent Run ID: AutoML_14a7c021-4812-4a2f-805c-07d30f839ab4

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a model's accuracy because the input

## Run Details
In the cell below, we use the `RunDetails` widget to show the training logs in near real time.

In [9]:
RunDetails(automl_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [10]:
automl_run.wait_for_completion(show_output=False)



****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a model's accuracy because the input data has bias towards one class.
+---------------------------------+---------------------------------+--------------------------------------+
|Size of the smallest class       |Name/Label of the smallest class |Number of samples in the training data|
|755                              |1                                |4687                                  |
+---------------------------------+---------------------------------+--------------------------------------+

********************************************

{'runId': 'AutoML_14a7c021-4812-4a2f-805c-07d30f839ab4',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-02-03T20:05:59.325206Z',
 'endTimeUtc': '2021-02-03T20:27:57.746342Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'cpu-cluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"e362512e-bf00-43d0-bff9-ff792de028f6\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"managed-dataset/df3c3c5e-debf-4e25-9109-00a5dbb754c4/\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"aml-quickstarts-137123\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"a24a24d5-8d87-4c8a-99b

## Best Model
In the cell below, we retrieve the best model from the AutoML experiment using the `get_output()` method. We then display the details associated with the best AutoML run:
1. Run ID
2. Status
3. Accuracy
4. Model Steps
5. Files uploaded during the run

In [11]:
# Retrieve the best run and its fitted model

best_automl_run, model = automl_run.get_output()

print(best_automl_run)

print(best_automl_run.get_metrics()['accuracy'])


Package:azureml-automl-runtime, training version:1.21.0, current version:1.20.0
Package:azureml-core, training version:1.21.0.post1, current version:1.20.0
Package:azureml-dataprep, training version:2.8.2, current version:2.7.3
Package:azureml-dataprep-native, training version:28.0.0, current version:27.0.0
Package:azureml-dataprep-rslex, training version:1.6.0, current version:1.5.0
Package:azureml-dataset-runtime, training version:1.21.0, current version:1.20.0
Package:azureml-defaults, training version:1.21.0, current version:1.20.0
Package:azureml-interpret, training version:1.21.0, current version:1.20.0
Package:azureml-pipeline-core, training version:1.21.0, current version:1.20.0
Package:azureml-telemetry, training version:1.21.0, current version:1.20.0
Package:azureml-train-automl-client, training version:1.21.0, current version:1.20.0
Package:azureml-train-automl-runtime, training version:1.21.0, current version:1.20.0


Run(Experiment: capstone_automl,
Id: AutoML_14a7c021-4812-4a2f-805c-07d30f839ab4_38,
Type: azureml.scriptrun,
Status: Completed)
0.9961599989077332
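
Because the data guardrails flagged class imbalance (755 of the 4687 training samples belong to class 1), it helps to compare the reported accuracy against a trivial baseline. The sketch below computes the accuracy of a hypothetical classifier that always predicts the majority class (0, non-hazardous), using only the counts reported in the guardrail table. The variable names are illustrative.

```python
n_total = 4687     # samples in the training data (from the guardrail table)
n_minority = 755   # samples labelled 1 (hazardous)

n_majority = n_total - n_minority

# Accuracy obtained by always predicting class 0 (non-hazardous)
baseline_accuracy = n_majority / n_total
minority_share = n_minority / n_total

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Share of hazardous samples:       {minority_share:.4f}")
```

The baseline comes out at roughly 0.84, so the accuracy reported above is well beyond what the imbalance alone would produce. Accuracy by itself still does not show how well the minority class is recognized; the `confusion_matrix` artifact of the run is the place to check that.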


In [12]:
model.steps

[('datatransformer',
  DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                  feature_sweeping_config=None, feature_sweeping_timeout=None,
                  featurization_config=None, force_text_dnn=None,
                  is_cross_validation=None, is_onnx_compatible=None, logger=None,
                  observer=None, task=None, working_dir=None)),
 ('prefittedsoftvotingclassifier',
  PreFittedSoftVotingClassifier(classification_labels=None,
                                estimators=[('0',
                                             Pipeline(memory=None,
                                                      steps=[('maxabsscaler',
                                                              MaxAbsScaler(copy=True)),
                                                             ('lightgbmclassifier',
                                                              LightGBMClassifier(boosting_type='gbdt',
                                                          

In [13]:
# list the model files uploaded during the run

print("\n", best_automl_run.get_file_names())


 ['accuracy_table', 'automl_driver.py', 'azureml-logs/55_azureml-execution-tvmps_eda92d709739e22f2e8334dabd6a328b41d96759deed3ad03f3b041003188629_d.txt', 'azureml-logs/65_job_prep-tvmps_eda92d709739e22f2e8334dabd6a328b41d96759deed3ad03f3b041003188629_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_eda92d709739e22f2e8334dabd6a328b41d96759deed3ad03f3b041003188629_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'confusion_matrix', 'logs/azureml/104_azureml.log', 'logs/azureml/azureml_automl.log', 'logs/azureml/job_prep_azureml.log', 'logs/azureml/job_release_azureml.log', 'outputs/conda_env_v_1_0_0.yml', 'outputs/env_dependencies.json', 'outputs/internal_cross_validated_models.pkl', 'outputs/model.pkl', 'outputs/pipeline_graph.json', 'outputs/scoring_file_v_1_0_0.py']


## Model Deployment

Since this is the better of the two models we have, we proceed with its deployment.

In the cell below, we register the model.

In [14]:
registered_model = best_automl_run.register_model(model_name='automl-model', model_path='outputs/model.pkl')

In the cell below, we download `scoring_file_v_1_0_0.py` as `score.py` and `model.pkl` as `automl_model.pkl` so that we can use them for inference in the next step.

In [15]:
best_automl_run.download_file('outputs/scoring_file_v_1_0_0.py', 'score.py')
best_automl_run.download_file('outputs/model.pkl', 'automl_model.pkl')

In the cell below, we create an inference config.

In [16]:
inference_config = InferenceConfig(entry_script='score.py', environment=best_automl_run.get_environment())

In the cell below, we deploy the model as a web service named `asteroid-classification` and retrieve the deployment state along with the scoring URI used to interact with the deployed model.

In [19]:
service_name = 'asteroid-classification'

service = Model.deploy(ws, service_name, [registered_model], inference_config, overwrite=True)
service.wait_for_deployment(show_output=True)

print(service.state)
print(service.scoring_uri)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running..............................................
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy
http://e372db09-a692-4219-abb9-6efd2dcb9ef1.southcentralus.azurecontainer.io/score


### ***In the following cell we implement one of the standout suggestions: this line of code enables logging (Application Insights) for our deployed web service.***

In [20]:
service.update(enable_app_insights=True)

In the cell below, we create sample data for testing our deployed service.

In [29]:
df=dataset.to_pandas_dataframe()

Hazardous=df.pop('Hazardous')

# Column names become the keys of the request record
keys = list(df.columns)

data={'data':[dict(zip(keys, df.iloc[0].tolist()))]}
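
The request body built above has the shape `{'data': [{column: value, ...}]}`: a list of records with one dictionary per row. A minimal stand-alone sketch of the same construction follows. The column names and values are made up for illustration and are not taken from the dataset, and the helper name `build_payload` is hypothetical.

```python
import json


def build_payload(columns, rows):
    """Pack rows into the {'data': [record, ...]} structure used above.

    Each row is zipped with the column names to form one record.
    """
    return {"data": [dict(zip(columns, row)) for row in rows]}


# Hypothetical column names and values, for illustration only
columns = ["feature_a", "feature_b"]
rows = [[1.5, 20.0], [0.3, 7.0]]

payload = build_payload(columns, rows)
body = json.dumps(payload)  # JSON string, ready to be sent as a request body
print(body)
```

Passing several rows at once yields several records in the same request, which is how more than one asteroid could be scored in a single call.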

### In the following two cells we test our deployed model in two different ways.

**This is the first method.** We send a request to the deployed web service using the `requests.post()` method.

In [32]:
scoring_uri=service.scoring_uri

input_data = json.dumps(data)
with open("data.json", "w") as _f:
    _f.write(input_data)

# Set the content type
headers = {'Content-Type': 'application/json'}

# Make the request and display the response
response = requests.post(scoring_uri, input_data, headers=headers)
print(response.json())

{"result": [1]}


**This is the second method.** We send a request to the deployed web service using the `service.run()` method.

In [33]:
prediction = service.run(input_data)

print(prediction)

{"result": [1]}
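
Both calls print `{"result": [1]}` with double quotes, which suggests that the service returns a JSON-encoded string rather than a dictionary. Assuming that is the case, a small helper (hypothetical, not part of the Azure ML SDK) can decode the response and map each label to its meaning as defined in the dataset overview (1 = hazardous, 0 = non-hazardous).

```python
import json

# Label meanings as stated in the dataset overview
LABELS = {0: "non-hazardous", 1: "hazardous"}


def decode_prediction(response):
    """Return readable labels from a scoring response.

    Accepts either a dict or a JSON-encoded string, because the exact
    return type of the service is an assumption here.
    """
    if isinstance(response, str):
        response = json.loads(response)
    return [LABELS[int(label)] for label in response["result"]]


print(decode_prediction('{"result": [1]}'))
```

With the response shown above, the first asteroid in the dataset is classified as hazardous.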


In the cell below, we print the logs of the web service.

In [34]:
logs = service.get_logs()

for line in logs.split('\n'):
    print(line)

2021-02-03T20:46:23,970089479+00:00 - iot-server/run 
2021-02-03T20:46:23,970347292+00:00 - gunicorn/run 
2021-02-03T20:46:24,070191333+00:00 - rsyslog/run 
2021-02-03T20:46:24,072157935+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd

In the cell below, we delete the service.

In [35]:
service.delete()

In the cell below, we delete the previously created compute cluster to avoid unnecessary consumption of resources.

In [36]:
aml_compute.delete()

Current provisioning state of AmlCompute is "Deleting"

Current provisioning state of AmlCompute is "Deleting"

Current provisioning state of AmlCompute is "Deleting"

