# Automated ML

Install and import dependencies

In [1]:
!pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.9.tar.gz (58 kB)
Collecting python-slugify
  Downloading python-slugify-4.0.1.tar.gz (11 kB)
Collecting slugify
  Downloading slugify-0.0.1.tar.gz (1.2 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle, python-slugify, slugify
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.9-py3-none-any.whl size=73265 sha256=92bdfd135e21394b41b0f85d3ae6ad8f3a4bb9db3007d9689ec372dcb623851f
  Stored in directory: /home/azureuser/.cache/pip/wheels/9d/50/3d/2644504bb1e8c782f3fef5984f03d76fc4a74698fdec128b29
  Building wheel for python-slugify (setup.py) ... done
  Created wheel for python-slugify: filename=python_slugify-4.0.1-py2.py3-non

If importing `data_prep` fails, see the README for instructions on downloading the `kaggle.json` credentials file and copying it into place.

In [3]:
from azureml.core.workspace import Workspace
from azureml.core import Experiment, Model, Webservice
from azureml.widgets import RunDetails
from azureml.train.automl import AutoMLConfig
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

import data_prep

## Dataset

### Overview
The Portable Executable (PE) format is a file format for executables, object code, DLLs, and other files used in 32-bit and 64-bit versions of Windows operating systems. The header of a PE file contains a number of fields describing things such as the size of the file and the imported libraries. This dataset from [Kaggle](https://www.kaggle.com/divg07/malware-analysis-dataset) contains data extracted from the PE headers of both known malware samples and benign software samples.
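
To make the idea of header fields concrete, here is a minimal sketch of how a few of them could be read from the raw bytes of a file with Python's `struct` module. It is an illustration only: how the Kaggle dataset was actually extracted is not described in this notebook, the helper names `parse_coff_header` and `build_minimal_pe` are hypothetical, and the synthetic bytes stand in for a real executable. The dictionary keys mirror column names that appear in the dataset.

```python
import struct

PE_OFFSET_FIELD = 0x3C  # the DOS header stores the offset of the PE signature here


def parse_coff_header(data: bytes) -> dict:
    """Read a few COFF header fields from the bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing 'MZ' DOS signature")
    (pe_offset,) = struct.unpack_from("<I", data, PE_OFFSET_FIELD)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing 'PE' signature")
    # The 20-byte COFF header follows the 4-byte signature (little-endian).
    (machine, n_sections, _timestamp, _symbol_table, _n_symbols,
     optional_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    return {
        "Machine": machine,
        "SectionsNb": n_sections,
        "SizeOfOptionalHeader": optional_size,
        "Characteristics": characteristics,
    }


def build_minimal_pe(machine, n_sections, optional_size, characteristics) -> bytes:
    """Build synthetic bytes with just enough structure for the parser (hypothetical helper)."""
    dos_header = bytearray(64)
    dos_header[:2] = b"MZ"
    struct.pack_into("<I", dos_header, PE_OFFSET_FIELD, len(dos_header))
    coff = struct.pack("<HHIIIHH", machine, n_sections, 0, 0, 0,
                       optional_size, characteristics)
    return bytes(dos_header) + b"PE\x00\x00" + coff


header_fields = parse_coff_header(build_minimal_pe(332, 5, 224, 258))
print(header_fields)
```

The values passed to `build_minimal_pe` are taken from the sample row printed later in this notebook. A `Machine` value of 332 (0x14c) identifies a 32-bit x86 executable.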

### Task
The task for this project is to train models that classify an executable as malware or benign using features extracted from its PE header. The `legitimate` column in the dataset is 1 when the executable file is from a legitimate source (also known as benign software or goodware), and 0 when it is malware.


In [4]:
ws = Workspace.from_config()
experiment_name = 'automl_experiment'

experiment = Experiment(ws, experiment_name)

dataset = data_prep.get_dataset(ws)

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code RNBVNFGUW to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.
Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/b601c0b5-a16e-4b95-996f-5150c0a8d98f/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.




## Find or Create Compute Cluster

In [5]:
cpu_cluster_name = "cpu-cluster"

# Reuse the cluster if it already exists; otherwise provision a new one.
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2',
                                                           max_nodes=10)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## AutoML Configuration

Below, we set the AutoML settings and configuration. Since I am using the Azure lab provided by Udacity, I have limited time and resources for each run, so `max_concurrent_iterations` is set to 5 and `experiment_timeout_minutes` to 30. On a different instance with more resources, these could be increased.

We also define a `classification` task, enable automatic featurization, use `accuracy` as the primary metric, evaluate with 3-fold cross-validation, enable early stopping, and specify the dataset with its `legitimate` column as the label column.

In [7]:
automl_settings = {
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 5,
    "primary_metric": 'accuracy',
    "n_cross_validations": 3,
    "enable_early_stopping": True,
    "featurization": 'auto'
}

automl_config = AutoMLConfig(
                    task="classification",
                    training_data=dataset,
                    label_column_name="legitimate",
                    compute_target=cpu_cluster,
                    **automl_settings)
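
As a rough illustration of what `"n_cross_validations": 3` asks for, the sketch below splits row indices into three folds so that every row is used for validation exactly once. This is a plain-Python picture of the general k-fold idea, not AutoML's own splitting logic, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_rows: int, k: int):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_rows=10, k=3))
for fold_number, (train, validation) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train on {len(train)} rows, validate on {validation}")
```

A model is trained once per fold and the metric is averaged over the folds, which gives a steadier estimate than a single train/validation split.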

In [8]:
automl_run = experiment.submit(automl_config)

Running on remote.


## Run Details

In the cell below, the `RunDetails` widget shows the progress of the AutoML run and its child runs.

In [9]:
RunDetails(automl_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

## Best Model

In the cell below, we get the best model from the AutoML run and display its properties.



In [10]:
best_automl_run, best_automl_model = automl_run.get_output()
best_run_metrics = best_automl_run.get_metrics()

print('Best Run Id: ', best_automl_run.id)
print('\n Accuracy:', best_run_metrics['accuracy'])
print(best_automl_model._final_estimator)
print(best_automl_run.get_tags())

Best Run Id:  AutoML_0ecfca36-e82f-497d-b214-dae4e4aade01_38

 Accuracy: 0.9999637807139545
PreFittedSoftVotingClassifier(classification_labels=None,
                              estimators=[('1',
                                           Pipeline(memory=None,
                                                    steps=[('maxabsscaler',
                                                            MaxAbsScaler(copy=True)),
                                                           ('xgboostclassifier',
                                                            XGBoostClassifier(base_score=0.5,
                                                                              booster='gbtree',
                                                                              colsample_bylevel=1,
                                                                              colsample_bynode=1,
                                                                              colsample_bytree=1,
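
The best model printed above is a soft voting ensemble. As a minimal sketch of the general soft-voting idea (not the internals of `PreFittedSoftVotingClassifier`), the snippet below averages per-class probabilities from several estimators and picks the class with the highest average. The function name `soft_vote`, the probabilities, and the weights are all hypothetical.

```python
def soft_vote(probabilities, weights=None):
    """Average class probabilities across estimators and return (class, averages).

    probabilities: one list per estimator, each holding a probability per class.
    weights: optional per-estimator weights; equal weights are used if omitted.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total_weight = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total_weight
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return predicted, averaged


# Three hypothetical estimators scoring one file as [P(malware), P(benign)].
estimator_probabilities = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
predicted_class, averaged = soft_vote(estimator_probabilities)
print(predicted_class, averaged)
```

In this made-up example two of the three estimators lean towards class 1, yet the averaged probabilities favour class 0 because the first estimator is much more confident. That is the difference between soft voting and a simple majority vote.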
         

In [11]:
model = best_automl_run.register_model(
    model_name='best_automl_model',
    model_path='outputs/model.pkl',
    model_framework=Model.Framework.SCIKITLEARN,
)

## Model Deployment

Remember that you only have to deploy one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

In [13]:
from azureml.core import Environment
from azureml.core.model import InferenceConfig

env = best_automl_run.get_environment()

script_name = 'score.py'

best_automl_run.download_file('outputs/scoring_file_v_1_0_0.py', script_name)

inference_config = InferenceConfig(entry_script=script_name,
                                   environment=env)

In [35]:
rest_service = Model.deploy(ws, "best-model-service", [model], inference_config=inference_config, overwrite=True)

rest_service.wait_for_deployment(show_output=True)
rest_service.update(enable_app_insights=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running.......................................

In [15]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

Grab a random sample of 3 rows from the dataset to test the endpoint.

In [31]:
import json

data_sample = dataset.to_pandas_dataframe().sample(3)
# Remove the label column so that only the features are sent to the endpoint.
y_true = data_sample.pop('legitimate')
sample_json = json.dumps({'data': data_sample.to_dict(orient='records')})
print(sample_json)

{"data": [{"Name": 45980, "md5": 64759, "Machine": 332, "SizeOfOptionalHeader": 224, "Characteristics": 258, "MajorLinkerVersion": 11, "MinorLinkerVersion": 0, "SizeOfCode": 71168, "SizeOfInitializedData": 80384, "SizeOfUninitializedData": 0, "AddressOfEntryPoint": 16077, "BaseOfCode": 4096, "BaseOfData": 77824, "ImageBase": 4194304.0, "SectionAlignment": 4096, "FileAlignment": 512, "MajorOperatingSystemVersion": 5, "MinorOperatingSystemVersion": 1, "MajorImageVersion": 0, "MinorImageVersion": 0, "MajorSubsystemVersion": 5, "MinorSubsystemVersion": 1, "SizeOfImage": 163840, "SizeOfHeaders": 1024, "CheckSum": 148039, "Subsystem": 2, "DllCharacteristics": 33088, "SizeOfStackReserve": 1048576, "SizeOfStackCommit": 4096, "SizeOfHeapReserve": 1048576, "SizeOfHeapCommit": 4096, "LoaderFlags": 0, "NumberOfRvaAndSizes": 16, "SectionsNb": 5, "SectionsMeanEntropy": 4.46048566183, "SectionsMinEntropy": 2.74553733754, "SectionsMaxEntropy": 6.660474843459999, "SectionsMeanRawsize": 28467.2, "Sectio
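
The same JSON payload could also be posted to the service over plain HTTP, for example at the address in `rest_service.scoring_uri`. The sketch below only builds the request with the standard library and does not send it. The URI is a placeholder, the row is shortened to two fields, the function name `build_scoring_request` is hypothetical, and whether an `Authorization` header is required depends on how the service was deployed.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, rows, extra_headers=None):
    """Build (but do not send) a POST request carrying rows as {"data": [...]}."""
    body = json.dumps({"data": rows}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    headers.update(extra_headers or {})
    return urllib.request.Request(scoring_uri, data=body,
                                  headers=headers, method="POST")


# Placeholder URI and a shortened row: both are illustrative assumptions.
request = build_scoring_request(
    "http://example.invalid/score",
    [{"Machine": 332, "SizeOfOptionalHeader": 224}],
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. That step is left out here so the sketch runs without a deployed service.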

In [32]:
output = rest_service.run(sample_json)
print('Prediction: ', output)
print('True: ', y_true)

{"result": [0, 1, 1]}
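
To check the response against the known labels, the predictions can be pulled out of the `result` field and compared element by element. The sketch below assumes the service returns JSON shaped like the output above. The true labels in it are hypothetical stand-ins for `y_true`, and the function name `compare_predictions` is an assumption.

```python
import json

LABEL_NAMES = {0: "malware", 1: "benign"}  # meaning of the 'legitimate' column


def compare_predictions(output, true_labels):
    """Return (predictions, fraction_correct) for a {"result": [...]} response."""
    result = json.loads(output) if isinstance(output, str) else output
    predictions = list(result["result"])
    if len(predictions) != len(true_labels):
        raise ValueError("prediction and label counts differ")
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return predictions, correct / len(predictions)


# The true labels here are made up for illustration.
predictions, accuracy = compare_predictions('{"result": [0, 1, 1]}', [0, 1, 1])
for predicted in predictions:
    print(predicted, LABEL_NAMES[predicted])
print("accuracy on sample:", accuracy)
```

With only three rows this is a smoke test of the endpoint rather than a meaningful accuracy estimate.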


Run the cell below to see the logs from the web service.

In [33]:
logs = rest_service.get_logs()

for line in logs.split('\n'):
    print(line)

2020-11-18T14:27:30.8728945Z stdout F 2020-11-18T14:27:30,866083300+00:00 - gunicorn/run 
2020-11-18T14:27:30.8876924Z stdout F 2020-11-18T14:27:30,882383300+00:00 - iot-server/run 
2020-11-18T14:27:30.9726689Z stdout F 2020-11-18T14:27:30,966470800+00:00 - nginx/run 
2020-11-18T14:27:30.9806922Z stderr F /usr/sbin/nginx: /azureml-envs/azureml_0e3a8a6dba181476a2523c12c58dfc97/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
2020-11-18T14:27:31.0008346Z stderr F /usr/sbin/nginx: /azureml-envs/azureml_0e3a8a6dba181476a2523c12c58dfc97/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
2020-11-18T14:27:31.0008346Z stderr F /usr/sbin/nginx: /azureml-envs/azureml_0e3a8a6dba181476a2523c12c58dfc97/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
2020-11-18T14:27:31.0083562Z stderr F /usr/sbin/nginx: /azureml-envs/azureml_0e3a8a6dba181476a2523c12c58dfc97/lib/libssl.so.1.0.0: no version i
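
When the log is long, it can help to filter it by stream. The sketch below assumes each line has the shape seen above (timestamp, stream name, a one-letter flag, then the message) and keeps only the lines from one stream. The function names are hypothetical and the sample lines are copied from the output above.

```python
def parse_log_line(line):
    """Split a log line into timestamp, stream and message; return None if malformed."""
    parts = line.split(" ", 3)
    if len(parts) < 4:
        return None
    timestamp, stream, _flag, message = parts
    return {"timestamp": timestamp, "stream": stream, "message": message.strip()}


def lines_from_stream(logs, stream):
    """Return parsed entries whose stream matches, for example 'stderr'."""
    parsed = (parse_log_line(line) for line in logs.split("\n"))
    return [entry for entry in parsed if entry and entry["stream"] == stream]


sample_logs = "\n".join([
    "2020-11-18T14:27:30.8728945Z stdout F 2020-11-18T14:27:30,866083300+00:00 - gunicorn/run ",
    "2020-11-18T14:27:30.9726689Z stdout F 2020-11-18T14:27:30,966470800+00:00 - nginx/run ",
    "2020-11-18T14:27:30.9806922Z stderr F /usr/sbin/nginx: /azureml-envs/azureml_0e3a8a6dba181476a2523c12c58dfc97/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)",
])

stderr_entries = lines_from_stream(sample_logs, "stderr")
for entry in stderr_entries:
    print(entry["timestamp"], entry["message"])
```

In the notebook, `sample_logs` would be replaced by the string returned from `rest_service.get_logs()`.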

Run the cell below to delete the web service and the compute cluster, cleaning up the lab when you are finished.

In [34]:
rest_service.delete()
cpu_cluster.delete()