## Problem Statement

We use a campus placement dataset from Kaggle that describes students' placement status. For each student, the dataset contains the following details:

1. Gender
2. 10th percentage
3. 10th Board of Education
4. HSC (Higher Secondary Education) percentage
5. HSC Board of Education
6. Specialisation in HSC (Science, Commerce, and Others)
7. Degree specialisation (Science & Tech, Commerce & Management, and Others)
8. Degree percentage
9. Work experience
10. Employability test percentage (conducted by the college)
11. Post-graduation (MBA) specialisation (Marketing & Finance, Marketing & HR)
12. MBA percentage
13. Salary offered by the corporate to candidates
14. Placement status (Placed / Not Placed)







Because we are using Azure ML Studio, we created a workspace manually and created a compute instance in it to run the code. The code below loads the workspace from the saved config file.

In [1]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

Ready to use Azure ML 1.48.0 to work with ml-workspace


###### The placement dataset was uploaded to Data assets manually. The code below retrieves the registered dataset, converts it into a pandas DataFrame, and prints the head of the DataFrame.

In [2]:
from azureml.core import Dataset
from azureml.data.datapath import DataPath
import pandas as pd

default_ds = ws.get_default_datastore()
dataset = Dataset.get_by_name(ws, name='placement')
df_data = dataset.to_pandas_dataframe()
df_data.head()

Unnamed: 0,gender,ssc_p,ssc_b,hsc_p,hsc_b,degree_p,workex,etest_p,specialisation,mba_p,status,salary,hsc_s_Commerce,hsc_s_Science,degree_t_Others,degree_t_Sci&Tech
0,1,67.0,1,91.0,1,58.0,0,55.0,1,58.8,1,270000.0,1,0,0,1
1,1,79.33,0,78.33,1,77.48,1,86.5,0,66.28,1,200000.0,0,1,0,1
2,1,65.0,0,68.0,0,64.0,0,75.0,0,57.8,1,250000.0,0,0,0,0
3,1,56.0,0,52.0,0,52.0,0,66.0,1,59.43,0,288655.4054,0,1,0,1
4,1,85.8,0,73.6,0,73.3,0,96.8,0,55.5,1,425000.0,1,0,0,0


###### Create an experiment folder to store all the .py and .yml files

In [4]:
import os
# Create a folder for all the step files, including the pipeline and hyperdrive scripts
experiment_folder = 'placement-exp'
os.makedirs(experiment_folder, exist_ok=True)

print(experiment_folder)

placement-exp


###### The script includes an argument named --prepped-data, which references the folder where the resulting data should be saved.

In [5]:
%%writefile $experiment_folder/prep_placement.py
# Import libraries
import os
import argparse
import pandas as pd
from azureml.core import Run
from sklearn.preprocessing import MinMaxScaler

# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument("--input-data", type=str, dest='raw_dataset_id', help='raw dataset')
parser.add_argument('--prepped-data', type=str, dest='prepped_data', default='prepped_data', help='Folder for results')
args = parser.parse_args()
save_folder = args.prepped_data

# Get the experiment run context
run = Run.get_context()

# load the data (passed as an input dataset)
print("Loading Data...")
placement = run.input_datasets['raw_data'].to_pandas_dataframe()

# Log raw row count
row_count = (len(placement))
run.log('raw_rows', row_count)

# remove nulls if any
placement = placement.dropna()

# Normalize the numeric columns
scaler = MinMaxScaler()
num_cols = ['ssc_p','hsc_p','degree_p','etest_p','mba_p','salary']
placement[num_cols] = scaler.fit_transform(placement[num_cols])

# Log processed rows
row_count = (len(placement))
run.log('processed_rows', row_count)

# Save the prepped data
print("Saving Data...")
os.makedirs(save_folder, exist_ok=True)
save_path = os.path.join(save_folder,'data.csv')
placement.to_csv(save_path, index=False, header=True)

# End the run
run.complete()

Writing placement-exp/prep_placement.py
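
The way the `dest` names in the script above map command-line flags to attributes can be checked locally, without Azure ML. This is a minimal sketch: the values passed to `parse_args` are made up for illustration, whereas in the pipeline they come from the step's `arguments` list.

```python
import argparse

# Same two arguments as prep_placement.py
parser = argparse.ArgumentParser()
parser.add_argument('--input-data', type=str, dest='raw_dataset_id', help='raw dataset')
parser.add_argument('--prepped-data', type=str, dest='prepped_data',
                    default='prepped_data', help='Folder for results')

# Hypothetical values, for illustration only
args = parser.parse_args(['--input-data', 'some-dataset-id',
                          '--prepped-data', '/tmp/prepped'])
print(args.raw_dataset_id)   # value passed after --input-data
print(args.prepped_data)     # folder the script would write data.csv into

# Without a --prepped-data flag, the default folder name is used
defaults = parser.parse_args([])
print(defaults.prepped_data)
```

Note that the hyphen in `--prepped-data` cannot appear in an attribute name, which is why the script sets `dest='prepped_data'` explicitly.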


In [6]:
%%writefile $experiment_folder/train_placement.py
# Import libraries
from azureml.core import Run, Model
import argparse
import pandas as pd
import numpy as np
import joblib
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument("--training-data", type=str, dest='training_data', help='training data')
args = parser.parse_args()
training_data = args.training_data

# Get the experiment run context
run = Run.get_context()

# load the prepared data file in the training folder
print("Loading Data...")
file_path = os.path.join(training_data,'data.csv')
placement = pd.read_csv(file_path)

# Separate features and labels
X, y = placement[['ssc_p','hsc_p','degree_p','etest_p','mba_p','salary','ssc_b','hsc_b','workex','specialisation','hsc_s_Commerce','hsc_s_Science','degree_t_Others','degree_t_Sci&Tech']].values, placement['status'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model...')
model = LogisticRegression().fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))  # np.float was removed in NumPy 1.24; the built-in float is equivalent

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
fig = plt.figure(figsize=(6, 4))
# Plot the diagonal 50% line
plt.plot([0, 1], [0, 1], 'k--')
# Plot the FPR and TPR achieved by our model
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
run.log_image(name = "ROC", plot = fig)
plt.show()

# Save the trained model in the outputs folder
print("Saving model...")
os.makedirs('outputs', exist_ok=True)
model_file = os.path.join('outputs', 'placement_model.pkl')
joblib.dump(value=model, filename=model_file)

# Register the model
print('Registering model...')
Model.register(workspace=run.experiment.workspace,
               model_path = model_file,
               model_name = 'placement_model',
               tags={'Training context':'Pipeline'},
               properties={'AUC': float(auc), 'Accuracy': float(acc)})


run.complete()
     

Writing placement-exp/train_placement.py
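
The two metrics logged by the training script can be reproduced by hand on a toy example. The sketch below is not the scikit-learn implementation: it computes accuracy as the fraction of matching labels, and AUC as the probability that a randomly chosen positive is scored above a randomly chosen negative (ties count as half). The labels and scores are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def roc_auc(y_true, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities of class 1
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc_value = accuracy(y_true, y_pred)
auc_value = roc_auc(y_true, scores)
print('Accuracy:', acc_value)
print('AUC:', auc_value)
```

Accuracy depends on the 0.5 threshold used to turn scores into labels, while AUC depends only on how the scores are ordered, so the two can differ noticeably on a small test set.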


###### A compute cluster named 'your-compute-cluster' is used. The code checks whether a compute target with this name already exists and reuses it; otherwise it creates a new compute target.

In [7]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "your-compute-cluster"

try:
    # Check for existing compute target
    pipeline_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If it doesn't already exist, create it
    try:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', max_nodes=2)
        pipeline_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
        pipeline_cluster.wait_for_completion(show_output=True)
    except Exception as ex:
        print(ex)
    

Found existing cluster, use it.
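
The get-or-create pattern in the cell above does not depend on Azure ML. A minimal sketch with stand-in names shows the control flow: look the resource up, and create it only if the lookup raises. `FakeRegistry`, `NotFoundError`, and `get_or_create` are hypothetical and are not part of the SDK.

```python
class NotFoundError(Exception):
    """Hypothetical stand-in for ComputeTargetException."""

class FakeRegistry:
    """Hypothetical in-memory stand-in for a workspace's compute targets."""
    def __init__(self):
        self._items = {}

    def get(self, name):
        if name not in self._items:
            raise NotFoundError(name)
        return self._items[name]

    def create(self, name, config):
        self._items[name] = dict(config, name=name)
        return self._items[name]

def get_or_create(registry, name, config):
    """Return (target, created) without creating a duplicate."""
    try:
        target = registry.get(name)
        created = False
    except NotFoundError:
        target = registry.create(name, config)
        created = True
    return target, created

registry = FakeRegistry()
config = {'vm_size': 'STANDARD_DS11_V2', 'max_nodes': 2}
first, created_first = get_or_create(registry, 'your-compute-cluster', config)
second, created_second = get_or_create(registry, 'your-compute-cluster', config)
print(created_first, created_second)
```

Catching only the specific "not found" exception around the lookup, as the notebook cell does, keeps unrelated failures from silently triggering a new cluster.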


###### An environment is created to run the pipeline and hyperdrive steps


In [8]:
%%writefile $experiment_folder/experiment_env.yml
name: experiment_env
dependencies:
- python=3.6.2
- scikit-learn
- ipykernel
- matplotlib
- pandas
- pip
- pip:
  - azureml-defaults
  - azureml-interpret
  - pyarrow

Writing placement-exp/experiment_env.yml


###### Create an environment from the Conda configuration file and use it in the run configuration for the pipeline.


In [9]:
from azureml.core import Environment
from azureml.core.runconfig import RunConfiguration

# Create a Python environment for the experiment (from a .yml file)
experiment_env = Environment.from_conda_specification("experiment_env", experiment_folder + "/experiment_env.yml")

# Register the environment 
experiment_env.register(workspace=ws)
registered_env = Environment.get(ws, 'experiment_env')

# Create a new runconfig object for the pipeline
pipeline_run_config = RunConfiguration()

# Use the compute you created above. 
pipeline_run_config.target = pipeline_cluster

# Assign the environment to the run configuration
pipeline_run_config.environment = registered_env

print ("Run configuration created.")

Run configuration created.


###### The first step must write the prepared data to a folder that the second step can read. The OutputFileDatasetConfig class is used for interim storage locations that can be passed between pipeline steps. The output of step 1 is the input to step 2.

In [10]:
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep
from azureml.train.hyperdrive import GridParameterSampling, HyperDriveConfig, PrimaryMetricGoal, choice
from azureml.core import Experiment, ScriptRunConfig, Environment

# Get the training dataset
placement_ds = ws.datasets.get("placement")

# Create an OutputFileDatasetConfig (temporary Data Reference) for data passed from step 1 to step 2
prepped_data = OutputFileDatasetConfig("prepped_data")

# Step 1, Run the data prep script
prep_step = PythonScriptStep(name = "Prepare Data",
                                source_directory = experiment_folder,
                                script_name = "prep_placement.py",
                                arguments = ['--input-data', placement_ds.as_named_input('raw_data'),
                                             '--prepped-data', prepped_data],
                                compute_target = pipeline_cluster,
                                runconfig = pipeline_run_config,
                                allow_reuse = True)

# Step 2, run the training script
train_step = PythonScriptStep(name = "Train and Register Model",
                                source_directory = experiment_folder,
                                script_name = "train_placement.py",
                                arguments = ['--training-data', prepped_data.as_input()],
                                compute_target = pipeline_cluster,
                                runconfig = pipeline_run_config,
                                allow_reuse = True)

print("Pipeline steps defined")

Pipeline steps defined
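
The hand-off between the two steps can be imitated locally, with a temporary folder standing in for the `OutputFileDatasetConfig` location. This is a sketch under that assumption: `prep_step_local` and `train_step_local` are hypothetical names, and the rows are made up.

```python
import csv
import os
import tempfile

def prep_step_local(rows, save_folder):
    """Step 1 stand-in: write prepared rows to <save_folder>/data.csv."""
    os.makedirs(save_folder, exist_ok=True)
    save_path = os.path.join(save_folder, 'data.csv')
    with open(save_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return save_path

def train_step_local(training_data):
    """Step 2 stand-in: read data.csv from the folder produced by step 1."""
    file_path = os.path.join(training_data, 'data.csv')
    with open(file_path, newline='') as f:
        return list(csv.DictReader(f))

# Made-up prepared rows
rows = [{'ssc_p': 0.37, 'status': 1}, {'ssc_p': 0.0, 'status': 0}]

with tempfile.TemporaryDirectory() as shared_folder:
    prep_step_local(rows, shared_folder)      # plays the role of --prepped-data
    loaded = train_step_local(shared_folder)  # plays the role of --training-data

print(len(loaded), loaded[0]['status'])
```

Both functions agree only on a folder path and a file name (`data.csv`), which mirrors how the two pipeline scripts are coupled.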


###### Run the pipeline as an experiment

In [11]:
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.widgets import RunDetails

# Construct the pipeline
pipeline_steps = [prep_step, train_step]
pipeline = Pipeline(workspace=ws, steps=pipeline_steps)
print("Pipeline is built.")

# Create an experiment and run the pipeline
experiment = Experiment(workspace=ws, name = 'placement-exp')
pipeline_run = experiment.submit(pipeline, regenerate_outputs=True)
print("Pipeline submitted for execution.")
RunDetails(pipeline_run).show()
pipeline_run.wait_for_completion(show_output=True)

Pipeline is built.
Created step Prepare Data [b6dfc060][36d349bf-e00b-4598-8bd4-ce179ae57101], (This step will run and generate new outputs)
Created step Train and Register Model [bf6aeaee][6b0d0069-b104-4cbd-94b3-a67730c78d46], (This step will run and generate new outputs)
Submitted PipelineRun 11b9a7fb-68b1-4f14-aa0f-13ae6da440b1
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/11b9a7fb-68b1-4f14-aa0f-13ae6da440b1?wsid=/subscriptions/625feea7-c3f9-4427-b9c3-60e6e77b59dc/resourcegroups/ml-ba09-dp100/workspaces/ml-workspace&tid=474565c1-bca4-4295-a2f5-b0c7dbf2591c
Pipeline submitted for execution.


_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

PipelineRunId: 11b9a7fb-68b1-4f14-aa0f-13ae6da440b1
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/11b9a7fb-68b1-4f14-aa0f-13ae6da440b1?wsid=/subscriptions/625feea7-c3f9-4427-b9c3-60e6e77b59dc/resourcegroups/ml-ba09-dp100/workspaces/ml-workspace&tid=474565c1-bca4-4295-a2f5-b0c7dbf2591c
PipelineRun Status: Running


StepRunId: f58ce692-2fc1-4f98-9c64-eb36089ba9f4
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/f58ce692-2fc1-4f98-9c64-eb36089ba9f4?wsid=/subscriptions/625feea7-c3f9-4427-b9c3-60e6e77b59dc/resourcegroups/ml-ba09-dp100/workspaces/ml-workspace&tid=474565c1-bca4-4295-a2f5-b0c7dbf2591c
StepRun( Prepare Data ) Status: NotStarted
StepRun( Prepare Data ) Status: Running

StepRun(Prepare Data) Execution Summary
StepRun( Prepare Data ) Status: Finished
{'runId': 'f58ce692-2fc1-4f98-9c64-eb36089ba9f4', 'target': 'your-compute-cluster', 'status': 'Completed', 'startTimeUtc': '2023-02-13T18:07:20.036051Z', 'endTimeUtc': '2023-02-13T18:09:20.226299Z', 

'Finished'

###### Check the metrics from the previous pipeline run

In [24]:
for run in pipeline_run.get_children():
    print(run.name, ':')
    metrics = run.get_metrics()
    for metric_name in metrics:
        print('\t',metric_name, ":", metrics[metric_name])

Train and Register Model :
	 Accuracy : 0.8148148148148148
	 AUC : 0.904610492845787
	 ROC : aml://artifactId/ExperimentRun/dcid.a5da375b-4f37-42db-a08b-43be31ee7c88/ROC_1676283061.png
Prepare Data :
	 raw_rows : 215
	 processed_rows : 215


In [13]:
%%writefile $experiment_folder/train_hyperdrive_placement.py
# Import libraries
from azureml.core import Run, Model
import argparse
import pandas as pd
import numpy as np
import joblib
import os
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt


# Get parameters
parser = argparse.ArgumentParser()

parser.add_argument("--input-data", type=str, dest='training_dataset_id', help='training dataset')

# Hyperparameters
parser.add_argument('--learning_rate', type=float, dest='learning_rate', default=0.1, help='learning rate')
parser.add_argument('--n_estimators', type=int, dest='n_estimators', default=100, help='number of estimators')

args = parser.parse_args()

# Get the experiment run context
run = Run.get_context()

# Log Hyperparameter values
# np.float and np.int were removed in NumPy 1.24; the built-in types are equivalent
run.log('learning_rate', float(args.learning_rate))
run.log('n_estimators', int(args.n_estimators))


# load the prepared data file in the training folder
print("Loading Data...")
# file_path = os.path.join(training_data,'data.csv')
# placement = pd.read_csv(file_path)

placement = run.input_datasets['training_data'].to_pandas_dataframe() # Get the training data from the estimator input

# Separate features and labels
X, y = placement[['ssc_p','hsc_p','degree_p','etest_p','mba_p','salary','ssc_b','hsc_b','workex','specialisation','hsc_s_Commerce','hsc_s_Science','degree_t_Others','degree_t_Sci&Tech']].values, placement['status'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train a Gradient Boosting classification model with the specified hyperparameters
print('Training a classification model')
model = GradientBoostingClassifier(learning_rate=args.learning_rate,
                                   n_estimators=args.n_estimators).fit(X_train, y_train)


# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
fig = plt.figure(figsize=(6, 4))
# Plot the diagonal 50% line
plt.plot([0, 1], [0, 1], 'k--')
# Plot the FPR and TPR achieved by our model
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
run.log_image(name = "ROC", plot = fig)
plt.show()


# Save the trained model in the outputs folder
print("Saving model...")
os.makedirs('outputs', exist_ok=True)
model_file = os.path.join('outputs', 'placement_model.pkl')
joblib.dump(value=model, filename=model_file)


# Register the model
print('Registering model...')
Model.register(workspace=run.experiment.workspace,
               model_path = model_file,
               model_name = 'placement_model',
               tags={'Training context':'Pipeline'},
               properties={'AUC': float(auc), 'Accuracy': float(acc)})


run.complete()

Overwriting placement-exp/train_hyperdrive_placement.py


In [14]:
from azureml.core import Experiment, ScriptRunConfig, Environment
from azureml.train.hyperdrive import GridParameterSampling, HyperDriveConfig, PrimaryMetricGoal, choice
from azureml.widgets import RunDetails

# Create a Python environment for the experiment
hyper_env = Environment.from_conda_specification("placement-exp", experiment_folder + "/experiment_env.yml")

# Get the training dataset
placement_ds = ws.datasets.get("placement")

# Create a script config
script_config = ScriptRunConfig(source_directory=experiment_folder,
                                script='train_hyperdrive_placement.py',
                                # Add non-hyperparameter arguments (in this case, the training dataset)
                                arguments = ['--input-data', placement_ds.as_named_input('training_data')],
                                environment=hyper_env,
                                compute_target = pipeline_cluster)

# Sample a range of parameter values
params = GridParameterSampling(
    {
        # Hyperdrive will try 6 combinations, adding these as script arguments
        '--learning_rate': choice(0.01, 0.1, 1.0),
        '--n_estimators' : choice(10, 100)
    }
)

# Configure hyperdrive settings
hyperdrive = HyperDriveConfig(run_config=script_config, 
                          hyperparameter_sampling=params, 
                          policy=None, # No early stopping policy
                          primary_metric_name='AUC', # Find the highest AUC metric
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                          max_total_runs=6, # Restrict the experiment to 6 iterations
                          max_concurrent_runs=2) # Run up to 2 iterations in parallel

# Run the experiment
experiment = Experiment(workspace=ws, name='placement-exp')
run = experiment.submit(config=hyperdrive)

# Show the status in the notebook as the experiment runs
RunDetails(run).show()
run.wait_for_completion()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

{'runId': 'HD_e9bf02e1-1975-4253-ad86-bc1662208c5e',
 'target': 'your-compute-cluster',
 'status': 'Completed',
 'startTimeUtc': '2023-02-13T18:11:37.099854Z',
 'endTimeUtc': '2023-02-13T18:16:53.097319Z',
 'services': {},
 'properties': {'primary_metric_config': '{"name":"AUC","goal":"maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '6e935781-a44a-4250-b27f-7ada9eb16b9c',
  'user_agent': 'python/3.8.10 (Linux-5.15.0-1031-azure-x86_64-with-glibc2.17) msrest/0.7.1 Hyperdrive.Service/1.0.0 Hyperdrive.SDK/core.1.48.0',
  'space_size': '6',
  'score': '0.9896661367249603',
  'best_child_run_id': 'HD_e9bf02e1-1975-4253-ad86-bc1662208c5e_4',
  'best_metric_status': 'Succeeded',
  'best_data_container_id': 'dcid.HD_e9bf02e1-1975-4253-ad86-bc1662208c5e_4'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'configuration': None,
  'attribution': None,
  'telemetryValues': {'amlCli
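
Grid sampling enumerates the Cartesian product of the `choice` values, which is why the search space above has 6 members (3 learning rates times 2 estimator counts). A plain-Python sketch of that enumeration, which is not the HyperDrive implementation:

```python
from itertools import product

# The same discrete choices as in the GridParameterSampling cell
search_space = {
    '--learning_rate': [0.01, 0.1, 1.0],
    '--n_estimators': [10, 100],
}

names = list(search_space)
combinations = [dict(zip(names, values))
                for values in product(*search_space.values())]

for combo in combinations:
    print(combo)
print(len(combinations))
```

Because the grid is exhaustive, setting `max_total_runs` below the size of the grid would leave some combinations untried.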

In [15]:
# Print all child runs, sorted by the primary metric
for child_run in run.get_children_sorted_by_primary_metric():
    print(child_run)

# Get the best run, and its metrics and arguments
best_run = run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
script_arguments = best_run.get_details()['runDefinition']['arguments']
print('Best Run Id: ', best_run.id)
print(' -AUC:', best_run_metrics['AUC'])
print(' -Accuracy:', best_run_metrics['Accuracy'])
print(' -Arguments:',script_arguments)

{'run_id': 'HD_e9bf02e1-1975-4253-ad86-bc1662208c5e_5', 'hyperparameters': '{"--learning_rate": 1.0, "--n_estimators": 100}', 'best_primary_metric': 0.9896661367249603, 'status': 'Completed'}
{'run_id': 'HD_e9bf02e1-1975-4253-ad86-bc1662208c5e_4', 'hyperparameters': '{"--learning_rate": 1.0, "--n_estimators": 10}', 'best_primary_metric': 0.9896661367249603, 'status': 'Completed'}
{'run_id': 'HD_e9bf02e1-1975-4253-ad86-bc1662208c5e_1', 'hyperparameters': '{"--learning_rate": 0.01, "--n_estimators": 100}', 'best_primary_metric': 0.958664546899841, 'status': 'Completed'}
{'run_id': 'HD_e9bf02e1-1975-4253-ad86-bc1662208c5e_3', 'hyperparameters': '{"--learning_rate": 0.1, "--n_estimators": 100}', 'best_primary_metric': 0.9523052464228935, 'status': 'Completed'}
{'run_id': 'HD_e9bf02e1-1975-4253-ad86-bc1662208c5e_2', 'hyperparameters': '{"--learning_rate": 0.1, "--n_estimators": 10}', 'best_primary_metric': 0.9523052464228935, 'status': 'Completed'}
{'run_id': 'HD_e9bf02e1-1975-4253-ad86-bc1

In [16]:
from azureml.core import Model

# Register model
best_run.register_model(model_path='outputs/placement_model.pkl', model_name='placement_model',
                        tags={'Training context':'Hyperdrive'},
                        properties={'AUC': best_run_metrics['AUC'], 'Accuracy': best_run_metrics['Accuracy']})

# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

placement_model version: 38
	 Training context : Hyperdrive
	 AUC : 0.9896661367249603
	 Accuracy : 0.9814814814814815


placement_model version: 37
	 Training context : Pipeline
	 AUC : 0.9896661367249603
	 Accuracy : 0.9814814814814815


placement_model version: 36
	 Training context : Pipeline
	 AUC : 0.9896661367249603
	 Accuracy : 0.9814814814814815


placement_model version: 35
	 Training context : Pipeline
	 AUC : 0.9523052464228935
	 Accuracy : 0.9444444444444444


placement_model version: 34
	 Training context : Pipeline
	 AUC : 0.9523052464228935
	 Accuracy : 0.9444444444444444


placement_model version: 33
	 Training context : Pipeline
	 AUC : 0.8585055643879174
	 Accuracy : 0.6851851851851852


placement_model version: 32
	 Training context : Pipeline
	 AUC : 0.958664546899841
	 Accuracy : 0.9444444444444444


placement_model version: 31
	 Training context : Pipeline
	 AUC : 0.904610492845787
	 Accuracy : 0.8148148148148148


placement_model version: 30
	 Training context :

# Batch Inference Pipeline

In [17]:
from azureml.core import Datastore, Dataset
import pandas as pd
import os

# Set default data store
ws.set_default_datastore('workspaceblobstore')
default_ds = ws.get_default_datastore()

# Enumerate all datastores, indicating which is the default
for ds_name in ws.datastores:
    print(ds_name, "- Default =", ds_name == default_ds.name)

# Load the placement data
placement =  Dataset.get_by_name(ws, name='placement').to_pandas_dataframe()
# Get a 100-item sample of the feature columns (not the placement label)
sample = placement[['ssc_p','hsc_p','degree_p','etest_p','mba_p','salary','ssc_b','hsc_b','workex','specialisation','hsc_s_Commerce','hsc_s_Science','degree_t_Others','degree_t_Sci&Tech']].sample(n=100).values

# Create a folder
batch_folder = './batch-data'
os.makedirs(batch_folder, exist_ok=True)
print("Folder created!")

# Save each sample as a separate file
print("Saving files...")
for i in range(100):
    fname = str(i+1) + '.csv'
    sample[i].tofile(os.path.join(batch_folder, fname), sep=",")
print("files saved!")

# Upload the files to the default datastore
print("Uploading files to datastore...")
default_ds = ws.get_default_datastore()
default_ds.upload(src_dir="batch-data", target_path="batch-data", overwrite=True, show_progress=True)

# Register a dataset for the input data
batch_data_set = Dataset.File.from_files(path=(default_ds, 'batch-data/'), validate=False)
try:
    batch_data_set = batch_data_set.register(workspace=ws, 
                                             name='batch-data',
                                             description='batch data',
                                             create_new_version=True)
except Exception as ex:
    print(ex)

print("Done!")

workspaceworkingdirectory - Default = False
workspaceartifactstore - Default = False
workspacefilestore - Default = False
workspaceblobstore - Default = True
Folder created!
Saving files...
files saved!
Uploading files to datastore...
Uploading an estimated of 100 files
Uploading batch-data/1.csv
Uploaded batch-data/1.csv, 1 files out of an estimated total of 100
Uploading batch-data/10.csv
Uploaded batch-data/10.csv, 2 files out of an estimated total of 100
Uploading batch-data/100.csv
Uploaded batch-data/100.csv, 3 files out of an estimated total of 100
Uploading batch-data/11.csv
Uploaded batch-data/11.csv, 4 files out of an estimated total of 100
Uploading batch-data/12.csv
Uploaded batch-data/12.csv, 5 files out of an estimated total of 100
Uploading batch-data/13.csv
Uploaded batch-data/13.csv, 6 files out of an estimated total of 100
Uploading batch-data/14.csv
Uploaded batch-data/14.csv, 7 files out of an estimated total of 100
Uploading batch-data/15.csv
Uploaded batch-data/15

###### Next, the Python code for the batch inferencing pipeline is defined. First, create a folder to hold all the files used by the pipeline.

In [18]:
import os
# Create a folder for the experiment files
experiment_folder = 'batch_pipeline'
os.makedirs(experiment_folder, exist_ok=True)

print(experiment_folder)

batch_pipeline


###### In the code below, the first line (the %%writefile magic) saves the Python script, in which the actual batch inference steps are defined, to the batch_pipeline folder.


In [19]:
%%writefile $experiment_folder/batch_placement.py
import os
import numpy as np
from azureml.core import Model
import joblib

def init():
    # Runs when the pipeline step is initialized
    global model

    # load the model
    model_path = Model.get_model_path('placement_model')
    model = joblib.load(model_path)


def run(mini_batch):
    # This runs for each batch
    resultList = []

    # process each file in the batch
    for f in mini_batch:
        # Read the comma-delimited data into an array
        data = np.genfromtxt(f, delimiter=',')
        # Reshape into a 2-dimensional array for prediction (model expects multiple items)
        prediction = model.predict(data.reshape(1, -1))
        # Append prediction to results
        resultList.append("{}: {}".format(os.path.basename(f), prediction[0]))
    return resultList

Writing batch_pipeline/batch_placement.py
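
The entry script follows a two-function contract: `init()` runs when the step is initialized, and `run(mini_batch)` receives a list of file paths and returns one result per file. The sketch below imitates that contract locally and rests on several assumptions: `StubModel` replaces the registered model, `chunk` is a hypothetical helper that splits the file list the way a mini-batch size of 5 would, and the input files are written to a temporary folder.

```python
import os
import tempfile

class StubModel:
    """Hypothetical stand-in for the loaded placement model."""
    def predict(self, rows):
        # Predict 1 when the first feature is at least 0.5, otherwise 0
        return [1 if row[0] >= 0.5 else 0 for row in rows]

def init():
    # Runs once before any mini-batch is processed
    global model
    model = StubModel()

def run(mini_batch):
    # Runs for each mini-batch; returns one result string per file
    results = []
    for path in mini_batch:
        with open(path) as f:
            row = [float(v) for v in f.read().split(',')]
        prediction = model.predict([row])
        results.append("{}: {}".format(os.path.basename(path), prediction[0]))
    return results

def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

with tempfile.TemporaryDirectory() as folder:
    paths = []
    for i in range(12):
        path = os.path.join(folder, str(i + 1) + '.csv')
        with open(path, 'w') as f:
            f.write("{},{}".format(i / 11, 0.5))  # one made-up row per file
        paths.append(path)

    init()
    batches = chunk(paths, 5)
    all_results = []
    for mini_batch in batches:
        all_results.extend(run(mini_batch))

print(len(batches), len(all_results))
print(all_results[0], all_results[-1])
```

Returning exactly one entry per input file matters when the step's output action is `append_row`, because the entries from all mini-batches end up in a single results file.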


###### The environment is created from a Conda specification that lists the packages the code uses.


In [20]:
%%writefile $experiment_folder/batch_environment.yml
name: batch_environment
dependencies:
- python=3.6.2
- scikit-learn
- pip
- pip:
  - azureml-defaults
     

Writing batch_pipeline/batch_environment.yml


In [21]:
from azureml.core import Environment
from azureml.core.runconfig import DEFAULT_CPU_IMAGE

# Create an Environment for the experiment
batch_env = Environment.from_conda_specification("experiment_env", experiment_folder + "/batch_environment.yml")
batch_env.docker.base_image = DEFAULT_CPU_IMAGE
print('Configuration ready.')

Configuration ready.


In [22]:
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep
from azureml.data import OutputFileDatasetConfig

output_dir = OutputFileDatasetConfig(name='inferences')

parallel_run_config = ParallelRunConfig(
    source_directory=experiment_folder,
    entry_script="batch_placement.py",
    mini_batch_size="5",
    error_threshold=10,
    output_action="append_row",
    environment=batch_env,
    compute_target=pipeline_cluster,
    node_count=2)

parallelrun_step = ParallelRunStep(
    name='batch-score-placement',
    parallel_run_config=parallel_run_config,
    inputs=[batch_data_set.as_named_input('placement_batch')],
    output=output_dir,
    arguments=[],
    allow_reuse=True
)

print('Steps defined')

Steps defined


In [30]:
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[parallelrun_step])
pipeline_run = Experiment(ws, 'placement-batch').submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)
     

Created step batch-score-placement [ec01136c][273c633c-0b09-4fba-98de-5f381b4fb0ce], (This step is eligible to reuse a previous run's output)
Submitted PipelineRun 23a915ee-89ae-4153-917f-7ebaf5933881
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/23a915ee-89ae-4153-917f-7ebaf5933881?wsid=/subscriptions/625feea7-c3f9-4427-b9c3-60e6e77b59dc/resourcegroups/ml-ba09-dp100/workspaces/ml-workspace&tid=474565c1-bca4-4295-a2f5-b0c7dbf2591c
PipelineRunId: 23a915ee-89ae-4153-917f-7ebaf5933881
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/23a915ee-89ae-4153-917f-7ebaf5933881?wsid=/subscriptions/625feea7-c3f9-4427-b9c3-60e6e77b59dc/resourcegroups/ml-ba09-dp100/workspaces/ml-workspace&tid=474565c1-bca4-4295-a2f5-b0c7dbf2591c
PipelineRun Status: NotStarted
PipelineRun Status: Running

PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': '23a915ee-89ae-4153-917f-7ebaf5933881', 'status': 'Completed', 'startTimeUtc': '2023-02-13T18:47:51.747013Z', 'e

'Finished'

###### The code below retrieves the predictions saved in the output of the run for the first step in the pipeline.

In [40]:
import pandas as pd
import shutil

# Remove the local results folder if left over from a previous run
shutil.rmtree('placement-results', ignore_errors=True)

# Get the run for the first step and download its output
prediction_run = next(pipeline_run.get_children())
prediction_output = prediction_run.get_output_data('inferences')
prediction_output.download(local_path='placement-results')

# Traverse the folder hierarchy and find the results file
for root, dirs, files in os.walk('placement-results'):
    for file in files:
        if file.endswith('parallel_run_step.txt'):
            result_file = os.path.join(root,file)

# cleanup output format
df = pd.read_csv(result_file, delimiter=":", header=None)
df.columns = ["File", "Prediction"]

# Display the first 20 results
df.head(20)

Unnamed: 0,File,Prediction
0,1.csv,1
1,10.csv,1
2,100.csv,1
3,11.csv,1
4,12.csv,1
5,13.csv,1
6,14.csv,1
7,15.csv,1
8,16.csv,1
9,17.csv,1
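
Each line of the results file has the form `<file name>: <prediction>`, which is why the cell above reads it with `:` as the delimiter. A standard-library sketch of the same parsing, using made-up lines in place of the downloaded `parallel_run_step.txt`:

```python
import csv
import io
from collections import Counter

# Made-up lines in the format written by batch_placement.py
raw = "1.csv: 1\n10.csv: 0\n100.csv: 1\n"

reader = csv.reader(io.StringIO(raw), delimiter=':')
# int() tolerates the space that follows the colon
records = [(name, int(pred)) for name, pred in reader]

counts = Counter(pred for _, pred in records)
print(records)
print(counts)
```

Splitting on `:` works here because the file names contain no colons; a file name with a colon in it would need a different separator in the entry script.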


# Model Interpretability

###### The azureml-interpret library is included in the training environment in which the script runs, so the script can create a TabularExplainer and use the ExplanationClient class.

In [28]:
%%writefile placement-exp/placement_interpret.py
# Import libraries
import pandas as pd
import numpy as np
import joblib
import os
from azureml.core import Dataset
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Import Azure ML run library
from azureml.core.run import Run

# Import libraries for model explanation
from azureml.interpret import ExplanationClient
from interpret.ext.blackbox import TabularExplainer


# Get the experiment run context
run = Run.get_context()

# load the placement dataset
# print("Loading Data...")
data = pd.read_csv('placement.csv')

features = ['ssc_p','hsc_p','degree_p','etest_p','mba_p','salary','ssc_b','hsc_b','workex','specialisation','hsc_s_Commerce','hsc_s_Science','degree_t_Others','degree_t_Sci&Tech']
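# Class names must follow the label order; this assumes status is encoded as 0 = not placed, 1 = placed
labels = ['Not-placed', 'Placed']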

# Separate features and labels
X, y = data[features].values, data['status'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model')
model = LogisticRegression().fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
# np.float was removed in NumPy 1.24; the built-in float is equivalent
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
run.log('AUC', float(auc))

os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=model, filename='outputs/placement.pkl')

# Get explanation
explainer = TabularExplainer(model, X_train, features=features, classes=labels)
explanation = explainer.explain_global(X_test)

# Get an Explanation Client and upload the explanation
explain_client = ExplanationClient.from_run(run)
explain_client.upload_model_explanation(explanation, comment='Tabular Explanation')

# Complete the run
run.complete()

Overwriting placement-exp/placement_interpret.py
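
The script logs two metrics, Accuracy and AUC. As a rough illustration of what they measure, the sketch below computes both in plain Python. The script itself relies on NumPy and scikit-learn's `roc_auc_score`. The labels, scores, and the 0.5 decision threshold here are hypothetical values chosen for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities for the positive class
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2]

# Turn probabilities into class predictions with an assumed 0.5 threshold
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc_demo = accuracy(y_true, y_pred)
auc_demo = roc_auc(y_true, scores)
print(acc_demo, auc_demo)
```

Accuracy depends on the chosen threshold, while AUC depends only on how the scores rank positives against negatives, which is why the script logs both.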


###### Run the experiment

In [29]:
from azureml.core import Experiment, ScriptRunConfig, Environment
from azureml.core.runconfig import DockerConfiguration
from azureml.widgets import RunDetails


# Create a Python environment for the experiment
experiment_folder = 'placement-exp'
explain_env = Environment.from_conda_specification("placement-exp", experiment_folder + "/experiment_env.yml")

# Create a script config
script_config = ScriptRunConfig(source_directory=experiment_folder,
                      script='placement_interpret.py',
                      environment=explain_env,
                      docker_runtime_config=DockerConfiguration(use_docker=True),
                      compute_target = pipeline_cluster) 

# submit the experiment
experiment_name = 'placement-exp'
experiment = Experiment(workspace=ws, name=experiment_name)
run = experiment.submit(config=script_config)
RunDetails(run).show()
run.wait_for_completion()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

{'runId': 'placement-exp_1676313621_1cfd9fa4',
 'target': 'your-compute-cluster',
 'status': 'Completed',
 'startTimeUtc': '2023-02-13T18:45:39.084789Z',
 'endTimeUtc': '2023-02-13T18:47:03.876204Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'amlctrain',
  'ContentSnapshotId': '0466c444-ffda-4d6f-8abd-4f80d48f3009',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'placement_interpret.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': [],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'your-compute-cluster',
  'dataReferences': {},
  'data': {},
  'outputData': {},
  'datacaches': [],
  'jobName': None,
  'maxRunDurationSeconds': 2592000,
  'nodeCount': 1,
  'instanceTypes': [],
  'priority': None,
  'credentialPassthrough': False,
  'identity': None,
  'environm

KeyError: 'log_files'

###### ExplanationClient: this class lets us retrieve the feature importances uploaded by the run.
###### The code below lists the features in descending order of importance. The features that follow salary can be ignored, as they have very little effect on the model. Here the class Status = 'Placed' is considered when computing the feature importance.

In [31]:
from azureml.interpret import ExplanationClient

# Get the feature explanations
client = ExplanationClient.from_run(run)
engineered_explanations = client.download_model_explanation()
feature_importances = engineered_explanations.get_feature_importance_dict()

# Overall feature importance
print('Feature\tImportance')
for key, value in feature_importances.items():
    print(key, '\t', value)
     

Feature	Importance
ssc_p 	 2.4855128475351584
mba_p 	 1.2813546910693512
etest_p 	 0.7430705717376743
hsc_p 	 0.7042301068678086
salary 	 0.06297910664203765
degree_p 	 0.033640794633517664
specialisation 	 0.00817199535309511
workex 	 0.0058274987181733675
hsc_s_Commerce 	 0.002433665614885705
hsc_b 	 0.0014206787176340014
ssc_b 	 0.0007736451351135847
degree_t_Others 	 0.00024274803096136627
hsc_s_Science 	 7.930635378788702e-05
degree_t_Sci&Tech 	 1.8121213130645295e-05
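
One way to make "which features can be ignored" more concrete is to look at the cumulative share of total importance. The sketch below uses the values printed above, rounded to four decimals. The `top_features` helper and the 95% coverage cut-off are assumptions made for illustration, not part of the notebook or of azureml-interpret.

```python
# Importances copied from the output above, rounded to four decimals
importances = {
    "ssc_p": 2.4855, "mba_p": 1.2814, "etest_p": 0.7431, "hsc_p": 0.7042,
    "salary": 0.0630, "degree_p": 0.0336, "specialisation": 0.0082,
    "workex": 0.0058, "hsc_s_Commerce": 0.0024, "hsc_b": 0.0014,
    "ssc_b": 0.0008, "degree_t_Others": 0.0002, "hsc_s_Science": 0.0001,
    "degree_t_Sci&Tech": 0.0000,
}


def top_features(importances, coverage=0.95):
    """Return the shortest ranked prefix whose importance share reaches `coverage`."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    selected, running = [], 0.0
    for name, value in ranked:
        selected.append(name)
        running += value
        if running / total >= coverage:
            break
    return selected, running / total


selected, share = top_features(importances)
print(selected, round(share, 3))
```

With a different cut-off the selected set changes, so the threshold should be chosen to suit the use case rather than read as a fixed rule.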


In [33]:
published_pipeline = pipeline_run.publish_pipeline(
    name='placement-batch-pipeline', description='Batch scoring of placement data', version='1.0')

published_pipeline
     

Name,Id,Status,Endpoint
placement-batch-pipeline,6ebb25c8-704d-4da3-920c-25d1c3f4e257,Active,REST Endpoint


###### The published pipeline has an endpoint, which can be seen in the Azure portal. 

In [34]:
rest_endpoint = published_pipeline.endpoint
print(rest_endpoint)

https://centralindia.api.azureml.ms/pipelines/v1.0/subscriptions/625feea7-c3f9-4427-b9c3-60e6e77b59dc/resourceGroups/ml-ba09-dp100/providers/Microsoft.MachineLearningServices/workspaces/ml-workspace/PipelineRuns/PipelineSubmit/6ebb25c8-704d-4da3-920c-25d1c3f4e257


In [35]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()
print('Authentication header ready.')

Authentication header ready.


###### The pipeline runs asynchronously, so the REST call returns an identifier that can be used to track the pipeline experiment as it runs.

In [42]:
import requests

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": "placement-batch"})
run_id = response.json()["Id"]
run_id

'2040ecbe-9f77-4b6f-8b63-2a2f3eda2251'
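
The cell above reads `"Id"` from the response without checking whether the request succeeded. A slightly more defensive variant might look like the sketch below. The `submit_pipeline` helper, the `FakeResponse` class, and the example URL are hypothetical. The fake objects only stand in for `requests.post` so that the example runs without network access or credentials.

```python
def submit_pipeline(post, endpoint, headers, experiment_name):
    """POST to a published pipeline endpoint and return the new run ID.

    `post` is any callable with the same calling convention as requests.post.
    """
    response = post(endpoint, headers=headers, json={"ExperimentName": experiment_name})
    response.raise_for_status()  # surface HTTP errors such as an expired token
    payload = response.json()
    if "Id" not in payload:
        raise ValueError(f"No run ID in response: {payload}")
    return payload["Id"]


class FakeResponse:
    """Stand-in for requests.Response, used only for this offline demonstration."""

    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        pass  # the fake response always succeeds

    def json(self):
        return self._payload


def fake_post(url, headers=None, json=None):
    """Pretend to submit the pipeline and return a made-up run ID."""
    return FakeResponse({"Id": "hypothetical-run-id"})


run_id_demo = submit_pipeline(fake_post, "https://example.invalid/pipeline", {}, "placement-batch")
print(run_id_demo)
```

In the notebook, the same helper could be called as `submit_pipeline(requests.post, rest_endpoint, auth_header, "placement-batch")`.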

###### Using the run ID, we can retrieve the pipeline run and track the experiment as it runs, for example with the RunDetails widget.

In [59]:
from azureml.pipeline.core.run import PipelineRun
from azureml.widgets import RunDetails
from azureml.core import Experiment

experiment = Experiment(ws, "placement-batch")
pipeline_run = PipelineRun(experiment, run_id)


# Block until the run completes
pipeline_run.wait_for_completion(show_output=True)

PipelineRunId: 2040ecbe-9f77-4b6f-8b63-2a2f3eda2251
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/2040ecbe-9f77-4b6f-8b63-2a2f3eda2251?wsid=/subscriptions/625feea7-c3f9-4427-b9c3-60e6e77b59dc/resourcegroups/ml-ba09-dp100/workspaces/ml-workspace&tid=474565c1-bca4-4295-a2f5-b0c7dbf2591c

PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': '2040ecbe-9f77-4b6f-8b63-2a2f3eda2251', 'status': 'Completed', 'startTimeUtc': '2023-02-13T18:54:26.517383Z', 'endTimeUtc': '2023-02-13T18:54:27.655029Z', 'services': {}, 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'Unavailable', 'runType': 'HTTP', 'azureml.parameters': '{}', 'azureml.continue_on_step_failure': 'False', 'azureml.continue_on_failed_optional_input': 'True', 'azureml.pipelineid': '6ebb25c8-704d-4da3-920c-25d1c3f4e257', 'azureml.pipelineComponent': 'pipelinerun', 'azureml.pipelines.stages': '{"Initialization":null,"Execution":{"StartTime":"2023-02-13T18:54:26.9681706+00:00","En

'Finished'

###### The code below retrieves the output of the first pipeline step.

In [41]:
import os
import pandas as pd
import shutil

# Remove the local results folder if left over from a previous run
shutil.rmtree('placement-results', ignore_errors=True)

# Get the run for the first step and download its output
prediction_run = next(pipeline_run.get_children())
prediction_output = prediction_run.get_output_data('inferences')
prediction_output.download(local_path='placement-results')

# Traverse the folder hierarchy and find the results file
for root, dirs, files in os.walk('placement-results'):
    for file in files:
        if file.endswith('parallel_run_step.txt'):
            result_file = os.path.join(root,file)

# cleanup output format
df = pd.read_csv(result_file, delimiter=":", header=None)
df.columns = ["File", "Prediction"]

# Display the first 20 results
df.head(20)

Unnamed: 0,File,Prediction
0,11.csv,0
1,12.csv,0
2,13.csv,1
3,14.csv,0
4,15.csv,0
5,16.csv,1
6,17.csv,1
7,18.csv,1
8,19.csv,0
9,2.csv,1
