# Automated ML

In [175]:
import requests
import json

from azureml.core import Workspace, Experiment, Dataset, Model, Environment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.train.automl import AutoMLConfig
from azureml.widgets import RunDetails

import pandas as pd

## Dataset

### Overview
I am using the [NHL Game Data dataset](https://www.kaggle.com/datasets/martinellis/nhl-game-data) from Kaggle, compiled by Martin Ellis. It is a series of CSV files containing NHL hockey game and player data from 2001 to 2020.

![Kaggle Dataset](screens/Kaggle.png)

The dataset is stored in multiple CSV files structured like a relational database. In fact, the author of the dataset was so kind as to provide this schema diagram of the dataset below:

![Schema Diagram](data/table_relationships.JPG)

Now, I don't care about all of that information for my experiment, and giving all of that to a model training process would be difficult.

I care about penalties per game and which teams are playing each other (in human-readable form). I want to provide a cleansed dataset to the model training process that looks something like this:

![Training Data](screens/TrainingData.png)

This requires the use of the following tables:

- game
- game_team_stats
- team_info

The `dataprep.ipynb` notebook consolidates all of this data into a single tabular data source, using Pandas dataframes to aggregate and join. Once the dataset is compiled, it is registered as a dataset on Azure using the Azure ML SDK's `Dataset.Tabular.register_pandas_dataframe` function.

![Registering the Dataset](screens/RegisterDataset.png)

See `dataprep.ipynb` for more details on the registration process.

Once that operation completes, the dataset is available on Azure:

![Datasets](screens/Datasets.png)
![Dataset Details](screens/DatasetDetails.png)

A profiling run can also be done to visualize the distribution of that data:
![Dataset Profile](screens/DatasetProfile.png)
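
To make the consolidation step described above more concrete, here is a minimal, dependency-free sketch of the aggregate-and-join logic. It is not the code from `dataprep.ipynb` (which uses Pandas). The column names (`game_id`, `team_id`, `pim`, `home_team_id`, `away_team_id`, `teamName`), the sample rows, and the choice to sum both teams' penalty minutes per game are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical miniature versions of the three tables. Column names and
# values are assumptions for illustration, not taken from dataprep.ipynb.
games = [
    {"game_id": 1, "type": "R", "home_team_id": 25, "away_team_id": 21},
    {"game_id": 2, "type": "P", "home_team_id": 21, "away_team_id": 25},
]
game_team_stats = [
    {"game_id": 1, "team_id": 25, "pim": 8},
    {"game_id": 1, "team_id": 21, "pim": 4},
    {"game_id": 2, "team_id": 21, "pim": 10},
    {"game_id": 2, "team_id": 25, "pim": 19},
]
team_info = [
    {"team_id": 25, "teamName": "Stars"},
    {"team_id": 21, "teamName": "Avalanche"},
]

# Aggregate: total penalty minutes per game across both teams (assumed definition)
pim_by_game = defaultdict(float)
for row in game_team_stats:
    pim_by_game[row["game_id"]] += row["pim"]

# Join: replace team ids with human-readable team names
names = {team["team_id"]: team["teamName"] for team in team_info}
training_rows = [
    {
        "penaltyMinutes": pim_by_game[game["game_id"]],
        "type": game["type"],
        "homeTeam": names[game["home_team_id"]],
        "awayTeam": names[game["away_team_id"]],
    }
    for game in games
]

for row in training_rows:
    print(row)
```

The result has the same shape as the training data shown earlier: one row per game with the penalty minutes, the game type, and both team names.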

### Preparing for the Experiment
But before we can grab our training data from Azure, we need a reference to our workspace.

In [176]:
# Load the workspace information from config.json using the Azure ML SDK
ws = Workspace.from_config()
ws.name

'DataScience'

In [177]:
# Next, grab our dataset from Azure
ds = Dataset.get_by_name(workspace=ws, name='NHL-Penalties-2020')
print(ds.name + ' v' + str(ds.version) + ': ' + ds.description)

# Display the data structure here for verification
ds.to_pandas_dataframe().head()

NHL-Penalties-2020 v4: A breakdown of penalty minutes per game matchup


Unnamed: 0,penaltyMinutes,type,homeTeam,awayTeam
0,12.0,R,Stars,Avalanche
1,29.0,R,Stars,Avalanche
2,18.0,R,Stars,Avalanche
3,24.0,R,Stars,Avalanche
4,4.0,R,Stars,Avalanche


In [178]:
# Now let's make sure we have a compute resource
cluster_name = "Low-End-Compute-Cluster"
max_nodes = 4

# Fetch or create the compute resource
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cluster_name) # This will throw a ComputeTargetException if this doesn't exist
    print('Using existing compute: ' + cluster_name)
except ComputeTargetException:
    # Create the cluster
    print('Provisioning cluster...')
    compute_config = AmlCompute.provisioning_configuration(vm_size="Standard_D2DS_V4", min_nodes=0, max_nodes=max_nodes)
    cpu_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

# Ensure the cluster is ready to go
cpu_cluster.wait_for_completion(show_output=True)

Using existing compute: Low-End-Compute-Cluster
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## AutoML Experiment

Next, I ran a machine learning experiment using Azure's Automated ML features to build a regression model that predicts how many penalty minutes to expect when two teams play each other.

In [179]:
# Create a Machine Learning Experiment
experiment_name = 'NHL-Penalty-Minute-Prediction'

experiment=Experiment(ws, experiment_name)

We're doing a regression run to predict a single numerical value, the number of penalty minutes we can expect to encounter in a game between two different teams.

Regression metrics tend to revolve around the error, or distance, between the predicted value and the true value. Mean Absolute Error is a reasonable default when there is no specific reason to choose another metric; the configuration below uses its normalized variant, `normalized_mean_absolute_error`.
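
As a rough illustration of the metric, here is a small sketch that computes Mean Absolute Error and a normalized version of it by hand. It assumes the normalized variant divides by the range of the true labels, which is my understanding of how Azure defines it. The `actual` values are the first five `penaltyMinutes` rows shown earlier; the constant `predicted` values are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def normalized_mean_absolute_error(y_true, y_pred):
    """MAE divided by the range of the true labels (assumed normalization)."""
    label_range = max(y_true) - min(y_true)
    return mean_absolute_error(y_true, y_pred) / label_range


actual = [12.0, 29.0, 18.0, 24.0, 4.0]       # first rows of the dataset preview
predicted = [20.0, 20.0, 20.0, 20.0, 20.0]   # hypothetical constant prediction

mae = mean_absolute_error(actual, predicted)
nmae = normalized_mean_absolute_error(actual, predicted)
print(f"MAE: {mae:.2f} minutes, normalized MAE: {nmae:.3f}")
```

The normalized value is unitless, which makes it easier to compare across datasets with different label scales, while the raw MAE stays in penalty minutes.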

Because the dataset is not huge (~9100 rows and 5 columns), I'm comfortable using only a small number of cross-validation passes. I also want to allow at least 20 iterations so that AutoML has a chance to find a good solution; the configuration below allows up to 40.
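
To show what a handful of cross-validation passes means for a dataset of this size, here is a minimal sketch of k-fold splitting by row index. It is only an illustration: `k_fold_indices` is a hypothetical helper, it does not shuffle, and AutoML's own splitting logic may differ.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for each contiguous fold."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        # Spread any remainder rows over the first `extra` folds
        stop = start + base + (1 if fold < extra else 0)
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_rows))
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_rows=9100, n_folds=5))
for i, (train, validation) in enumerate(folds):
    print(f"fold {i}: train={len(train)} rows, validation={len(validation)} rows")
```

With roughly 9100 rows and 5 folds, each model is trained on about 7280 rows and validated on the remaining 1820, and every row is used for validation exactly once.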

In [180]:
max_runs = 40

# Set up the experiment
automl_config = AutoMLConfig(
    task='regression',
    primary_metric='normalized_mean_absolute_error',
    compute_target=cpu_cluster,
    max_concurrent_iterations=max_nodes,
    iterations=max_runs,
    iteration_timeout_minutes=5,
    training_data=ds, # The pre-registered cleaned version of the Kaggle dataset
    n_cross_validations=5,
    label_column_name='penaltyMinutes')

# Submit the experiment
run = experiment.submit(automl_config)

# Wait for the experiment to complete
RunDetails(run).show()
run.wait_for_completion(show_output=False)

Submitting remote run.


Experiment,Id,Type,Status,Details Page,Docs Page
NHL-Penalty-Minute-Prediction,AutoML_c7d50e29-c3b8-4d08-a028-83410237a818,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation


_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

{'runId': 'AutoML_c7d50e29-c3b8-4d08-a028-83410237a818',
 'target': 'Low-End-Compute-Cluster',
 'status': 'Completed',
 'startTimeUtc': '2022-04-19T03:19:30.343159Z',
 'endTimeUtc': '2022-04-19T03:39:22.870252Z',
 'services': {},
   'message': 'No scores improved over last 10 iterations, so experiment stopped early. This early stopping behavior can be disabled by setting enable_early_stopping = False in AutoMLConfig for notebook/python SDK runs.'}],
 'properties': {'num_iterations': '40',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'normalized_mean_absolute_error',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'Low-End-Compute-Cluster',
  'DataPrepJsonString': '{\\"training_data\\": {\\"datasetId\\": \\"3eec7f38-eba2-49e7-abf7-d18ef0ad00ee\\"}, \\"datasets\\": 0}',
  'EnableSubsampling': 'False',
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'regression',
  'd

In [181]:
# Grab the resulting model and best run
best_auto_run, automl_model = run.get_output()

# Display details about the best run
best_auto_run.id

'AutoML_c7d50e29-c3b8-4d08-a028-83410237a818_36'

### Run Details

Display the run details and aggregate the results

In [182]:
RunDetails(best_auto_run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

### Best Model

Get the best model from the automl experiments and display all the properties of the model.



In [183]:
automl_model

INFO:interpret_community.common.explanation_utils:Using default datastore for uploads


RegressionPipeline(pipeline=Pipeline(memory=None,
                                     steps=[('datatransformer',
                                             DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=False, observer=None, task='regression', working_dir='c:\\Dev\\Udac...
                                             PreFittedSoftVotingRegressor(estimators=[('0', Pipeline(memory=None, steps=[('maxabsscaler', MaxAbsScaler(copy=True)), ('lightgbmregressor', LightGBMRegressor(min_data_in_leaf=20, n_jobs=1, problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'cpu'}), random_state=None))], verbose=False)), ('26', Pipeline(memory=None, steps=[('maxabsscaler', MaxAbsScaler(copy=True)), ('extratreesregressor', ExtraTreesRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse', max_depth=None, max_feat

In [184]:
best_auto_run.get_metrics()



{'root_mean_squared_log_error': 0.6594224016005603,
 'spearman_correlation': 0.18930846939686735,
 'mean_absolute_percentage_error': 75.65962047677264,
 'median_absolute_error': 10.693619095032048,
 'normalized_root_mean_squared_error': 0.09487061798468881,
 'normalized_root_mean_squared_log_error': 0.12343338520199051,
 'normalized_mean_absolute_error': 0.06515187684409296,
 'mean_absolute_error': 13.551590383571334,
 'r2_score': 0.05877984637348486,
 'explained_variance': 0.05931413987570337,
 'root_mean_squared_error': 19.73308854081527,
 'normalized_median_absolute_error': 0.05141163026457716,
 'predicted_true': 'aml://artifactId/ExperimentRun/dcid.AutoML_c7d50e29-c3b8-4d08-a028-83410237a818_36/predicted_true',
 'residuals': 'aml://artifactId/ExperimentRun/dcid.AutoML_c7d50e29-c3b8-4d08-a028-83410237a818_36/residuals'}
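
The raw and normalized metrics above can be related with a quick calculation. Assuming each normalized metric is the raw metric divided by the range of the label (my understanding of Azure's definition, not something this notebook confirms), the ratio of the two recovers that range. The numbers below are copied from the output above.

```python
# Metric values copied from the get_metrics() output above
metrics = {
    "mean_absolute_error": 13.551590383571334,
    "normalized_mean_absolute_error": 0.06515187684409296,
    "root_mean_squared_error": 19.73308854081527,
    "normalized_root_mean_squared_error": 0.09487061798468881,
}

# raw / normalized gives the normalization constant
# (assumed to be max(penaltyMinutes) - min(penaltyMinutes))
implied_range_mae = (
    metrics["mean_absolute_error"] / metrics["normalized_mean_absolute_error"]
)
implied_range_rmse = (
    metrics["root_mean_squared_error"] / metrics["normalized_root_mean_squared_error"]
)

print(f"Implied label range (from MAE):  {implied_range_mae:.2f}")
print(f"Implied label range (from RMSE): {implied_range_rmse:.2f}")
```

Both ratios come out to about 208, so a normalized MAE of roughly 0.065 corresponds to being off by about 13.5 penalty minutes per game on average.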

In [185]:
# Save the best model locally
best_auto_run.download_files(output_directory='automl-output')

## Model Deployment

Register and deploy the model to Azure

In [186]:
# Register the model in Azure
best_auto_run.register_model(model_name='NHL-Penalties-AutoML', model_path='outputs/model.pkl', description='NHL Game Penalty Prediction Best AutoML Run')

Model(workspace=Workspace.create(name='DataScience', subscription_id='efba8785-116c-4443-9a05-764c75c7bb0d', resource_group='datascience'), name=NHL-Penalties-AutoML, id=NHL-Penalties-AutoML:2, version=2, tags={}, properties={})



In [187]:
# Grab the model back from Azure (useful if you want to do this later without re-training)
model = Model(ws, name='NHL-Penalties-AutoML')
model

Model(workspace=Workspace.create(name='DataScience', subscription_id='efba8785-116c-4443-9a05-764c75c7bb0d', resource_group='datascience'), name=NHL-Penalties-AutoML, id=NHL-Penalties-AutoML:2, version=2, tags={}, properties={})



In [188]:
# Build the inference environment from the conda spec that AutoML generated
env = Environment.from_conda_specification("AutoML-env", "automl-output/outputs/conda_env_v_1_0_0.yml")
inference_config = InferenceConfig(environment=env,
                                   source_directory='./automl-output/outputs',
                                   entry_script='./scoring_file_v_2_0_0.py')



In [189]:
# Replace these with something stronger from a config file for sustained production use
key1 = 'this-is-a-test-key' 
key2 = key1 + '-2'

deployment_config = AciWebservice.deploy_configuration(
    cpu_cores = 1, 
    memory_gb = 1, 
    auth_enabled=True, 
    primary_key=key1, 
    secondary_key=key2)

# Deploy the model
service = Model.deploy(ws, "penalty-predictor", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output = True)

# Turn on app insights for debugging
service.update(enable_app_insights=True)

# Grab our scoring endpoint for testing
scoring_uri = service.scoring_uri
print('Endpoint active at ' + scoring_uri)



Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2022-04-18 23:42:57-04:00 Creating Container Registry if not exists.
2022-04-18 23:42:57-04:00 Registering the environment.
2022-04-18 23:42:58-04:00 Use the existing image.
2022-04-18 23:42:59-04:00 Generating deployment configuration.
2022-04-18 23:42:59-04:00 Submitting deployment to compute.
2022-04-18 23:43:04-04:00 Checking the status of deployment penalty-predictor.


.
2022-04-18 23:45:50-04:00 Checking the status of inference endpoint penalty-predictor.
Succeeded
ACI service creation operation finished, operation "Succeeded"
Endpoint active at http://1df374ab-7948-493a-a2a6-6e0f452550e7.northcentralus.azurecontainer.io/score
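
The test keys in the deployment cell are fine for a short-lived demo. For anything longer-lived, one option is to read the keys from the environment and fall back to randomly generated ones. The sketch below is only an illustration: the `load_or_create_key` helper and the environment variable names are hypothetical, and it is worth verifying that the service accepts whatever key format you choose.

```python
import os
import secrets


def load_or_create_key(env_var):
    """Return the key stored in env_var, or a fresh random hex key if unset."""
    return os.environ.get(env_var) or secrets.token_hex(16)


# Hypothetical environment variable names
key1 = load_or_create_key("PENALTY_PREDICTOR_PRIMARY_KEY")
key2 = load_or_create_key("PENALTY_PREDICTOR_SECONDARY_KEY")

# Avoid printing the keys themselves; only report their lengths
print(f"primary key length: {len(key1)}, secondary key length: {len(key2)}")
```

Keeping the keys out of the notebook also means they do not end up in source control alongside the outputs.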


Request all matchup predictions from the web service and store them in an array for visualization
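
The matchup enumeration in the next cell can also be expressed with `itertools.permutations`, which yields every ordered (home, away) pair of distinct teams. This is a sketch of an alternative, not the code that produced the results below; `build_payload` and the three-team sample list are hypothetical, and the payload shape simply mirrors the one used in this notebook.

```python
import json
from itertools import permutations


def build_payload(teams):
    """Return (matchups, JSON request body) for every ordered pair of teams."""
    matchups = [
        {"homeTeam": home, "awayTeam": away}
        for home, away in permutations(teams, 2)
    ]
    # Two scenarios per matchup: regular season ("R") first, then playoffs ("P")
    scenarios = [
        {"type": game_type, **matchup}
        for matchup in matchups
        for game_type in ("R", "P")
    ]
    payload = {"Inputs": {"data": scenarios}, "GlobalParameters": 1.0}
    return matchups, json.dumps(payload)


sample_matchups, sample_body = build_payload(["Stars", "Avalanche", "Flyers"])
print(len(sample_matchups), "matchups")
print(len(json.loads(sample_body)["Inputs"]["data"]), "scenarios")
```

Because the regular season scenario always precedes the playoff scenario for the same matchup, the predictions come back in pairs that line up with the matchup list.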

In [190]:
# We need a list of teams to iterate through
df = ds.to_pandas_dataframe()
teams = df['awayTeam'].unique()

# Loop over all home and away matchups to build a list of the penalties we should expect
matchups = []
scenarios = []
for homeTeam in teams:
  for awayTeam in teams:

    # Don't allow home teams to play themselves
    if homeTeam == awayTeam:
      continue

    matchups.append({
        "homeTeam": homeTeam,
        "awayTeam": awayTeam
    })
    scenarios.append(      
      {
        "type": "R", # Regular Season
        "homeTeam": homeTeam,
        "awayTeam": awayTeam
      }) 
    scenarios.append(      
      {
        "type": "P", # Playoff game
        "homeTeam": homeTeam,
        "awayTeam": awayTeam
      }) 

data = {
  "Inputs": {
    "data": scenarios
  },
  "GlobalParameters": 1.0
}

# Convert to JSON string
input_data = json.dumps(data)

# Set the content type
headers = {'Content-Type': 'application/json'}

# Authentication is enabled on the service, so send the primary key as a bearer token
headers['Authorization'] = f'Bearer {key1}'

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
results = resp.json()['Results']

results

[28.26241329399898,
 33.89451580583235,
 29.14531660434827,
 31.173701238682987,
 29.178170666423828,
 30.898110583130983,
 30.025261845613436,
 35.98643692522054,
 27.474695512269548,
 39.628416529901244,
 28.112093060139706,
 35.77135756916986,
 27.88406342814549,
 29.937774793699333,
 31.098648891776044,
 42.094079052535776,
 33.74297653223903,
 58.38100953676972,
 29.952657661413284,
 35.69312192105473,
 27.989538756589614,
 32.26099830968236,
 29.991291484964393,
 34.51055108177385,
 35.425481280635395,
 39.621162982044765,
 26.66526400039694,
 36.48845674823335,
 31.08239405631389,
 33.01805994040492,
 25.36954852062424,
 27.902637687407484,
 28.743573041821143,
 31.254617156225578,
 27.95180099936803,
 35.20444027632374,
 28.02940745139786,
 33.71242633552588,
 27.29367254794393,
 32.50626611873658,
 26.442508771255266,
 33.01033461110987,
 28.44114527423943,
 33.24050987283628,
 27.989091090780455,
 34.98936526172824,
 26.410595810636988,
 33.62020061907359,
 30.929722699207943

In [191]:
# Aggregate the predictions into the matchups objects
i = 0
for matchup in matchups:
    matchup['penaltyMinutesRegularSeason'] = results[i]
    matchup['penaltyMinutesPlayoffs'] =  results[i + 1]
    i += 2

matchups

[{'homeTeam': 'Avalanche',
  'awayTeam': 'Flyers',
  'penaltyMinutesRegularSeason': 28.26241329399898,
  'penaltyMinutesPlayoffs': 33.89451580583235},
 {'homeTeam': 'Avalanche',
  'awayTeam': 'Capitals',
  'penaltyMinutesRegularSeason': 29.14531660434827,
  'penaltyMinutesPlayoffs': 31.173701238682987},
 {'homeTeam': 'Avalanche',
  'awayTeam': 'Sharks',
  'penaltyMinutesRegularSeason': 29.178170666423828,
  'penaltyMinutesPlayoffs': 30.898110583130983},
 {'homeTeam': 'Avalanche',
  'awayTeam': 'Kings',
  'penaltyMinutesRegularSeason': 30.025261845613436,
  'penaltyMinutesPlayoffs': 35.98643692522054},
 {'homeTeam': 'Avalanche',
  'awayTeam': 'Canucks',
  'penaltyMinutesRegularSeason': 27.474695512269548,
  'penaltyMinutesPlayoffs': 39.628416529901244},
 {'homeTeam': 'Avalanche',
  'awayTeam': 'Blue Jackets',
  'penaltyMinutesRegularSeason': 28.112093060139706,
  'penaltyMinutesPlayoffs': 35.77135756916986},
 {'homeTeam': 'Avalanche',
  'awayTeam': 'Canadiens',
  'penaltyMinutesRegularS
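
The pairing of predictions with matchups relies on the results arriving as alternating regular season and playoff values. A slightly more defensive way to express the same step is to slice the results into the two series and check the length first. The sketch below is an illustration only; `attach_predictions` is a hypothetical helper, and the sample values are the rounded predictions for the first two matchups.

```python
def attach_predictions(matchups, results):
    """Merge alternating [regular, playoff, ...] predictions into matchups."""
    if len(results) != 2 * len(matchups):
        raise ValueError("expected exactly two predictions per matchup")
    regular_season = results[0::2]  # even positions: regular season
    playoffs = results[1::2]        # odd positions: playoffs
    return [
        {
            **matchup,
            "penaltyMinutesRegularSeason": regular,
            "penaltyMinutesPlayoffs": playoff,
        }
        for matchup, regular, playoff in zip(matchups, regular_season, playoffs)
    ]


sample = attach_predictions(
    [
        {"homeTeam": "Avalanche", "awayTeam": "Flyers"},
        {"homeTeam": "Avalanche", "awayTeam": "Capitals"},
    ],
    [28.26, 33.89, 29.15, 31.17],
)
for row in sample:
    print(row)
```

The length check catches a partial or failed response before it silently misaligns predictions with the wrong teams.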



In [193]:
# Display all logs from the server
logs = service.get_logs()

for line in logs.split('\n'):
    print(line)

2022-04-19T03:45:30,702472600+00:00 - rsyslog/run 
2022-04-19T03:45:30,708485800+00:00 - iot-server/run 
2022-04-19T03:45:30,717721300+00:00 - nginx/run 
2022-04-19T03:45:30,700906400+00:00 - gunicorn/run 
Dynamic Python package installation is disabled.
Starting HTTP server
rsyslogd: /azureml-envs/azureml_c6d7f1a6dd29b67f708ef6a71b939b7a/lib/libuuid.so.1: no version information available (required by rsyslogd)
EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
2022-04-19T03:45:31,245563700+00:00 - iot-server/finish 1 0
2022-04-19T03:45:31,247069200+00:00 - Exit code 1 is normal. Not restarting iot-server.
Starting gunicorn 20.1.0
Listening at: http://127.0.0.1:31311 (73)
Using worker: sync
worker timeout is set to 300
Booting worker with pid: 100
SPARK_HOME not set. Skipping PySpark Initialization.
Initializing logger
2022-04-19 03:45:33,278 | root | INFO | Starting up app insights client
logging socket was found. logging is available.
logging socket was found.

In [199]:
# Delete the service
service.delete()



In [192]:
# Move our matchup predictions to a dataframe for visualization in visualization.ipynb
df = pd.DataFrame(matchups)

df.to_csv('data/matchup_predictions.csv')

# Display the top few rows
df.head()

Unnamed: 0,homeTeam,awayTeam,penaltyMinutesRegularSeason,penaltyMinutesPlayoffs
0,Avalanche,Flyers,28.26,33.89
1,Avalanche,Capitals,29.15,31.17
2,Avalanche,Sharks,29.18,30.9
3,Avalanche,Kings,30.03,35.99
4,Avalanche,Canucks,27.47,39.63
