# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [None]:
%pip install azureml-train-automl
%pip install azureml-widgets
%pip install azureml-pipeline
%pip install azureml-pipeline-steps
%pip install kaggle


In [None]:
from azureml.core import Workspace, Experiment, ScriptRunConfig, Environment
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.webservice.aci import AciWebservice
from azureml.data import TabularDataset
from azureml.exceptions import ComputeTargetException
from azureml.train.automl import AutoMLConfig
from azureml.pipeline.steps import AutoMLStep, PythonScriptStep



In [None]:
!kaggle datasets download -d rezaunderfit/instagram-fake-and-real-accounts-dataset -p ./kaggle_data_ig_accounts --unzip

## Dataset

### Overview
TODO: In this markdown cell, give an overview of the dataset you are using. Also mention the task you will be performing.


TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [None]:
ws = Workspace.from_config()

# Choose a name for the experiment
experiment_name = 'udacity-instagram-users-train'
project_folder = './automl-pipeline-project'
experiment = Experiment(ws, experiment_name)


In [None]:
import pandas as pd
from azureml.core import Dataset

key = "instagram-fake-real-accounts"
if key in ws.datasets:
    # Reuse the dataset if it is already registered in the workspace
    dataset = ws.datasets[key]
else:
    # Otherwise load the Kaggle CSV and register it as a tabular dataset
    df = pd.read_csv("./kaggle_data_ig_accounts/final-v1.csv")
    dataset = Dataset.Tabular.register_pandas_dataframe(
        dataframe=df,
        target=ws.get_default_datastore(),
        name=key,
        tags={"description": "Kaggle dataset of Instagram accounts, both real and fake."},
    )

In [None]:
compute_cluster_name = "my-udacity-compute"

try:
    # Reuse the cluster if it already exists in the workspace
    compute_cluster = ComputeTarget(ws, compute_cluster_name)
    print("Found an existing cluster, using it")
except ComputeTargetException:
    # Otherwise provision a new cluster with up to four nodes
    compute_cluster_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", max_nodes=4)
    compute_cluster = ComputeTarget.create(ws, compute_cluster_name, compute_cluster_config)

compute_cluster.wait_for_completion(show_output=True, timeout_in_minutes=10)

## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

### Configuration Settings Rationale

- `task`: Since the dataset schema contains a label column `is_fake` with Boolean values, a classification model is the obvious choice for training.
- `max_concurrent_iterations`: Determined by the size of the largest VM we are permitted to use for compute in the Udacity environment.
- `primary_metric`: To get a baseline model for evaluation, I chose to optimize accuracy for the first AutoML run. However, that metric "may not optimize as well for datasets that are small, have very large class skew (class imbalance), or when the expected metric value is very close to 0.0 or 1.0" ([Microsoft Learn, "Set up AutoML training with Python: Metrics for classification scenarios"](https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-configure-auto-train-v1#metrics-for-classification-scenarios)). My dataset meets at least one and possibly two of these criteria (fewer than 800 records, and ~88% of them labeled as fake). After obtaining a baseline model and its performance metrics, I therefore intend to run a second AutoML training job with weighted AUC as the primary metric and compare its performance to the first job's output.
- `featurization`: Intentionally left at the default value, `auto`.
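
To make the concern about accuracy concrete, here is a small worked example. The counts are hypothetical, chosen only to match the rough figures quoted above (about 800 records, about 88% labeled fake); they are not taken from the real dataset. It scores a trivial classifier that always predicts "fake".

```python
# Hypothetical label counts: 704 fake (1) and 96 real (0), i.e. 88% fake.
y_true = [1] * 704 + [0] * 96
# A trivial "model" that predicts fake for every account.
y_pred = [1] * len(y_true)


def recall_for(label, truth, predicted):
    """Fraction of examples with the given true label that were predicted correctly."""
    relevant = [p for t, p in zip(truth, predicted) if t == label]
    return sum(1 for p in relevant if p == label) / len(relevant)


# Plain accuracy: share of all predictions that match the true label.
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
# Balanced accuracy: mean of the per-class recalls, so both classes count equally.
balanced_accuracy = (recall_for(1, y_true, y_pred) + recall_for(0, y_true, y_pred)) / 2

print(f"accuracy={accuracy:.2f}, balanced accuracy={balanced_accuracy:.2f}")
```

Under these assumed counts the trivial classifier reaches 88% accuracy while detecting no real accounts at all, which is why a baseline optimized for accuracy should be read with caution on this data.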

### AutoML Step Encapsulation Rationale

I originally wanted to parameterize the `primary_metric` setting so that this experiment could become a reusable pipeline for retraining the model with different metrics (such as weighted AUC). However, after struggling with type-checking issues when passing a `PipelineParameter` into the `AutoMLConfig`, I discovered that Azure ML does not support this approach in an `AutoMLStep`: https://stackoverflow.com/a/62382471

It took a couple of days, but I eventually succeeded in implementing the workaround suggested by the Microsoft AML product group. My pipeline now contains a `PythonScriptStep` rather than an `AutoMLStep`. The Python script receives the `primary_metric` pipeline parameter value, builds an `AutoMLConfig` object from it, and submits an AutoML job as a child run of the pipeline. The only drawback of this approach is that submitting an AutoML run programmatically from the script requires interactive authentication: once the step has started, the user must open the details window for the step, navigate to the `user_logs/std_log.txt` log file in the Outputs tab, and follow the instructions there by opening a new browser tab to https://microsoft.com/devicelogin and entering the login code from the log. Removing this interactive authentication requirement requires an Azure Service Principal; because the Udacity lab environment does not allow creating one, this piece is left unimplemented here.
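
For reference, the Service Principal route mentioned above could look roughly like the sketch below. It is untested here, since the lab environment does not allow creating a Service Principal. The environment variable names (`AML_TENANT_ID`, `AML_SP_ID`, `AML_SP_SECRET`) and both helper functions are hypothetical; `ServicePrincipalAuthentication` and the `auth` argument of `Workspace.from_config` come from the `azureml-core` SDK.

```python
import os

# Hypothetical environment variable names; adapt them to however secrets are managed.
REQUIRED_VARS = {
    "tenant_id": "AML_TENANT_ID",
    "service_principal_id": "AML_SP_ID",
    "service_principal_password": "AML_SP_SECRET",
}


def load_sp_credentials(env=None):
    """Read Service Principal credentials, failing early if any variable is missing."""
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {arg: env[var] for arg, var in REQUIRED_VARS.items()}


def get_workspace_noninteractive(env=None):
    """Load the workspace without a device-login prompt (assumes a config.json is present)."""
    # Imported lazily so the credential helper works without the SDK installed.
    from azureml.core import Workspace
    from azureml.core.authentication import ServicePrincipalAuthentication

    auth = ServicePrincipalAuthentication(**load_sp_credentials(env))
    return Workspace.from_config(auth=auth)
```

Inside a remote step there may be no `config.json`; in that case the same `auth` object could instead be passed to `Workspace.get` together with the workspace name, subscription ID, and resource group.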

Although the pipeline has run successfully at least once, it has not done so reliably. Missing package version dependencies in the `PythonScriptStep`'s Conda environment continue to cause runtime exceptions, and the specific packages required appear to change from one pipeline run to the next. Since this has become an unproductive use of time, I am stopping further work on this implementation, but I have left the code here for review and posterity.
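
The training script itself is not reproduced in this notebook. As a rough sketch of the argument handling described above, the script might parse `--primary_metric` and assemble the AutoML settings as follows. The function names are hypothetical, and the `max_concurrent_iterations` value of 4 is an assumption chosen to match the cluster's `max_nodes`; the metric names are the classification primary metrics listed in the AutoML v1 documentation.

```python
import argparse

# Primary metrics that AutoML (SDK v1) accepts for classification tasks.
CLASSIFICATION_METRICS = (
    "accuracy",
    "AUC_weighted",
    "average_precision_score_weighted",
    "norm_macro_recall",
    "precision_score_weighted",
)


def parse_args(argv=None):
    """Parse the argument that the PythonScriptStep forwards from the pipeline parameter."""
    parser = argparse.ArgumentParser(description="Submit an AutoML run as a child of the pipeline step")
    parser.add_argument("--primary_metric", default="accuracy", choices=CLASSIFICATION_METRICS)
    return parser.parse_args(argv)


def build_automl_settings(primary_metric):
    """Collect the keyword arguments that would be passed on to AutoMLConfig."""
    return {
        "task": "classification",
        "primary_metric": primary_metric,
        "label_column_name": "is_fake",
        "featurization": "auto",
        "max_concurrent_iterations": 4,  # assumption: matches max_nodes of the cluster
    }


# Simulate the pipeline passing a non-default metric to the script.
args = parse_args(["--primary_metric", "AUC_weighted"])
settings = build_automl_settings(args.primary_metric)
print(settings)
# In the real script these settings, plus the training data and compute target,
# would be used to construct the AutoMLConfig that is submitted as a child run.
```

Restricting `--primary_metric` with `choices` means a mistyped metric fails at argument parsing, before any compute time is spent on the AutoML run.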

In [None]:
from azureml.pipeline.core import Pipeline, PipelineData, TrainingOutput
from azureml.pipeline.core import PipelineParameter

automl_pipeline_param_primarymetric = PipelineParameter(name="primary_metric", default_value="accuracy")

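ds = ws.get_default_datastore()

# Named outputs for the AutoMLStep variant (commented out below);
# the PythonScriptStep does not currently declare them
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'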

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

# automl_step = AutoMLStep(
#     name = 'automl_module', 
#     automl_config = automl_config,
#     outputs = [metrics_data, model_data],
#     allow_reuse = True
# )

# Build the step's environment from the Conda specification file
script_env = Environment.from_conda_specification(name="ig-pipeline-env", file_path="./conda_dependencies.yml")
script_env.python.user_managed_dependencies = False

#script_env.register(workspace=ws)

# Set up the script run configuration; its run_config is passed to the step below
script_config = ScriptRunConfig(
    source_directory='./train_scripts',
    script='automl_train_script.py',
    compute_target=compute_cluster,
    environment=script_env
)


script_step = PythonScriptStep(name="train script step", 
                                script_name='automl_train_script.py', 
                                source_directory="./train_scripts",
                                compute_target=compute_cluster,
                                runconfig=script_config.run_config,
                                # outputs = [metrics_data, model_data],
                                arguments=["--primary_metric", automl_pipeline_param_primarymetric],
                                allow_reuse=True
                                )

In [None]:
# NOTE: After submitting, you must open the script step in Designer and complete the interactive authentication!
automl_pipeline = Pipeline(
    workspace=ws,
    description="An automated ML pipeline for training a classifier to detect fake IG accounts",
    steps=[script_step]
)
pipeline_run = experiment.submit(automl_pipeline)

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [None]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [None]:
for step in pipeline_run.get_steps():
    print("Outputs of step " + step.name)
    
    # Get a dictionary of StepRunOutputs with the output name as the key 
    output_dict = step.get_outputs()
    
    for name, output in output_dict.items():
        
        output_reference = output.get_port_data_reference() # Get output port data reference
        print("\tname: " + name)
        print("\tdatastore: " + output_reference.datastore_name)
        print("\tpath on datastore: " + output_reference.path_on_datastore)

In [None]:
train_step = pipeline_run.find_step_run('train script step')

if train_step:
    train_step_obj = train_step[0]  # find_step_run returns a list; only one step is named 'train script step'
    # Download the step outputs into ./outputs (requires the step to declare these outputs)
    train_step_obj.get_output_data('metrics_data').download("./outputs")
    train_step_obj.get_output_data('model_data').download("./outputs")

In [None]:
#TODO: Save the best model

## Model Deployment

Remember that you only have to deploy one of the two models you trained, but you still need to register both models. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

TODO: In the cell below, send a request to the web service you deployed to test it.

TODO: In the cell below, print the logs of the web service and delete the service

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shut down all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.
