# The "Azure ML SDK" for SMS Spam Inference 

## Introduction

In this notebook, we use the Azure ML SDK to train a model with HyperDrive, deploy it as a web service, and consume it through Azure ML.


Steps:

1. Load an existing workspace and create an experiment in it.
2. Create a compute cluster.
3. Load the dataset.
4. Select and train a model.
5. Configure HyperDrive.
6. Run the HyperDrive experiment using HyperDriveConfig.
7. Explore the results and get the best model.
8. Register the best model.
9. Deploy the best model and create an endpoint.
10. Consume the endpoint.
11. Delete the service.


## Azure Machine Learning SDK-specific imports

In [1]:
from azureml.core import Workspace, Experiment
from azureml.core.compute import AmlCompute
from azureml.train.automl import AutoMLConfig
from azureml.widgets import RunDetails
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

In [2]:
import numpy as np
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

## Initialize Workspace
Initialize a workspace object from persisted configuration. Make sure the config file is present at .\config.json
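
As a quick sanity check before calling `Workspace.from_config()`, the sketch below verifies that a config file carries the three keys the SDK reads: `subscription_id`, `resource_group`, and `workspace_name`. The helper name `check_workspace_config` and the placeholder values are hypothetical and used only for illustration; the file is written to a temporary folder so the example runs anywhere.

```python
import json
import tempfile
from pathlib import Path

# Keys that Workspace.from_config() reads from config.json.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(path):
    """Return the parsed config, or raise ValueError naming any missing keys."""
    config = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("config.json is missing: " + ", ".join(missing))
    return config


# Hypothetical placeholder values, written to a temporary folder for the demo.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps({
        "subscription_id": "<your-subscription-id>",
        "resource_group": "<your-resource-group>",
        "workspace_name": "<your-workspace-name>",
    }, indent=2))
    checked = check_workspace_config(config_path)
    print(sorted(checked))
```

In practice you would point `check_workspace_config` at the real `config.json` downloaded from the workspace overview page.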

In [3]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

nahmed30-azureml-workspace
epe-poc-nazeer
centralus
16bc73b5-82be-47f2-b5ab-f2373344794c


## Create an Azure ML experiment

Let's create an experiment named 'nb_lr_exp_0920_v10' in the workspace we just initialized.

In [4]:
experiment_name = 'nb_lr_exp_0920_v10'
experiment = Experiment(ws, experiment_name)
experiment

| Name | Workspace | Report Page | Docs Page |
|---|---|---|---|
| nb_lr_exp_0920_v10 | nahmed30-azureml-workspace | Link to Azure Machine Learning studio | Link to Documentation |


## Create a Compute Cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/concept-azure-machine-learning-architecture#compute-target) for your HyperDrive run.

In [5]:
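from azureml.core.compute_target import ComputeTargetException

aml_name = "cpu-cluster"
try:
    # Reuse the cluster if it already exists in the workspace
    aml_compute = AmlCompute(ws, aml_name)
    print('Found existing AML compute context.')
except ComputeTargetException: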
    print('Creating new AML compute context.')
    aml_config = AmlCompute.provisioning_configuration(vm_size = "Standard_D2_v2", min_nodes=1, max_nodes=3)
    aml_compute = AmlCompute.create(ws, name = aml_name, provisioning_configuration = aml_config)
    aml_compute.wait_for_completion(show_output = True)

cts = ws.compute_targets
compute_target = cts[aml_name]

Found existing AML compute context.


## Load Data
Make sure the dataset has been uploaded and registered in the Azure ML workspace, and that `key` matches the registered dataset name.

In [6]:
key = 'UdacityPrjEmailSpamDataSet'
smsspam_ds = ws.datasets[key]
df = smsspam_ds.to_pandas_dataframe()
df.describe()

| | v1 | v2 | Column3 | Column4 | Column5 |
|---|---|---|---|---|---|
| count | 5572 | 5572 | 50 | 12 | 6 |
| unique | 2 | 5169 | 43 | 10 | 5 |
| top | ham | Sorry, I'll call later | bt not his girlfrnd... G o o d n i g h t . . .@" | GE | GNT:-)" |
| freq | 4825 | 30 | 3 | 2 | 2 |
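
The summary above shows that the labels are imbalanced, which matters when reading the Accuracy values logged later in this notebook. A minimal sketch, using only the counts from the `describe()` output, computes the accuracy of a trivial classifier that always predicts "ham"; any trained model should be judged against that baseline.

```python
# Counts taken from the df.describe() output above.
total_messages = 5572
ham_messages = 4825          # frequency of the most common label, "ham"
spam_messages = total_messages - ham_messages

# Accuracy of a trivial classifier that always predicts "ham".
majority_baseline = ham_messages / total_messages

print(f"spam messages: {spam_messages}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

Note that this baseline is computed on the full dataset, whereas the training script scores a 30% test split, so the two numbers are comparable only approximately.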


### Create workspace folders

In [7]:
import os

experiment_folder = 'nb_lr_exp_0920_v10'
os.makedirs(experiment_folder, exist_ok=True)

## Environment

In [8]:
%%writefile $experiment_folder/hyperdrive_env_v10.yml
name: batch_environment
dependencies:
- python=3.8.5
- scikit-learn
- pandas
- numpy
- regex
- nltk
- pip
- pip:
  - azureml-defaults

Writing nb_lr_exp_0920_v10/hyperdrive_env_v10.yml


## Python script to prepare the data and train the model

In [9]:
%%writefile $experiment_folder/train_0920_v10.py

# Import libraries (duplicates and unused imports removed)
import argparse
import os
import re
import string

import joblib
import nltk
import numpy as np
import pandas as pd
from azureml.core import Run
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


# Get the experiment run context
run = Run.get_context()

# Get script arguments
parser = argparse.ArgumentParser()

# Input dataset
parser.add_argument("--input-data", type=str, dest='input_data', help='training dataset')

# Hyperparameters
parser.add_argument('--C', type=float, default=1.0, help="indicates regularization")
parser.add_argument('--max_iter', type=int, default=50, help="Maximum number of iterations")

# Add arguments to args collection
args = parser.parse_args()

# Log hyperparameter values
# (built-in float/int are used because np.float and np.int were removed from recent NumPy releases)
run.log("Regularization Strength:", float(args.C))
run.log("Max iterations:", int(args.max_iter))

 
# load the sms spam dataset -- Get the training data from the input
print("Loading SMS Spam Data...")
df = run.input_datasets['training_data'].to_pandas_dataframe() 

#--------------------------Prepare Data-------------------------------------------------
# Cleanup and Prepare Data # Find and eliminate stop words 
nltk.download('stopwords')
stop_words= set(stopwords.words("english"))
stop_words.update(['https', 'http', 'amp', 'CO', 't', 'u', 'new', "I'm", "would"])


spam = df.query("v1=='spam'").v2.str.cat(sep=" ")
ham = df.query("v1=='ham'").v2.str.cat(sep=" ")

# convert spam to 1 and ham to 0
df = df.replace('spam', 1)
df = df.replace('ham', 0)

# Clean the text
def clean_text(text):
    whitespace = re.compile(r"\s+")
    web_address = re.compile(r"(?i)http(s):\/\/[a-z0-9.~_\-\/]+")
    user = re.compile(r"(?i)@[a-z0-9_]+")
    text = text.replace('.', '')
    text = whitespace.sub(' ', text)
    text = web_address.sub('', text)
    text = user.sub('', text)
    text = re.sub(r"\[[^()]*\]", "", text)
    text = re.sub(r"\d+", "", text)
    text = re.sub(r'[^\w\s]','',text)
    text = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", text)
    return text.lower()

df.v2 = [clean_text(item) for item in df.v2]

#---------------------More Data Prep-----------#
df = df.drop(['Column3', 'Column4', 'Column5'], axis = 1)

df_msg_copy = df['v2'].copy()

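# Build the stop-word set once; calling stopwords.words() for every word is slow
english_stopwords = set(stopwords.words('english'))

def text_preprocess(text):
    # Strip punctuation, then drop English stop words
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = [word for word in text.split() if word.lower() not in english_stopwords]
    return " ".join(words)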

df_msg_copy = df_msg_copy.apply(text_preprocess)

# Create the stemmer once and reuse it for every message
snowball = SnowballStemmer("english")

def stemmer(text):
    # Reduce each word to its stem and rejoin with single spaces
    return " ".join(snowball.stem(word) for word in text.split())

df_msg_copy = df_msg_copy.apply(stemmer)
vectorizer = TfidfVectorizer(stop_words='english')
train_corpus = vectorizer.fit_transform(df_msg_copy)


# Split Train and Test
xTrain, xTest, yTrain, yTest = train_test_split(train_corpus, df.v1, test_size=0.3, random_state=20)

# --------------------------End Prepare Data--------------------------------------------


# --------------------------Start Training----------------------------------------------
# Train a Logistic Regression classification model with the specified hyperparameters
print('Training a classification model')

model = LogisticRegression(solver='liblinear', penalty='l1', C=args.C, max_iter=args.max_iter )
model.fit(xTrain, yTrain)
pred = model.predict(xTest)
acc = accuracy_score(yTest,pred)

# ---------------------------End Training------------------------------------------------

run.log('Accuracy', float(acc))

# Calculate AUC from the predicted probability of the positive (spam) class
y_scores = model.predict_proba(xTest)
auc = roc_auc_score(yTest,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# Save the model in the run outputs
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/model_v10.pkl')

run.complete()


Writing nb_lr_exp_0920_v10/train_0920_v10.py


## HyperDrive Configuration

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#primary-metric
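
Grid sampling evaluates every combination of the candidate values. The sketch below reproduces that enumeration with `itertools.product` for the values used in this notebook's `GridParameterSampling` definition. It is a plain-Python stand-in for illustration, not the SDK's own scheduling logic.

```python
from itertools import product

# Same candidate values as the GridParameterSampling definition in this notebook.
search_space = {
    "--C": [0.01, 0.1, 1.0],
    "--max_iter": [10, 50],
}

# Grid sampling evaluates every combination (the Cartesian product).
names = list(search_space)
grid = [dict(zip(names, values)) for values in product(*search_space.values())]

for index, combination in enumerate(grid):
    print(index, combination)
print("total combinations:", len(grid))
```

Three values of `--C` times two values of `--max_iter` give six combinations, which is why a larger `max_total_runs` does not produce more than six child runs.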

## Run a hyperparameter tuning experiment

In [10]:
from azureml.core import Experiment, ScriptRunConfig, Environment
from azureml.train.hyperdrive import GridParameterSampling, HyperDriveConfig, PrimaryMetricGoal, choice
from azureml.widgets import RunDetails
from azureml.train.hyperdrive.policy import BanditPolicy

# Create a Python environment for the experiment
hyper_env = Environment.from_conda_specification("experiment_env", experiment_folder + "/hyperdrive_env_v10.yml")

# Early termination policy
early_termination_policy = BanditPolicy(evaluation_interval=2,slack_factor=0.2)

# Create a script config
script_config = ScriptRunConfig(source_directory=experiment_folder,
                                script='train_0920_v10.py',
                                # Add non-hyperparameter arguments -in this case, the training dataset
                                arguments = ['--input-data', smsspam_ds.as_named_input('training_data')],
                                environment=hyper_env,
                                compute_target = aml_compute)

# Sample a range of parameter values
params = GridParameterSampling(
    {
        # Hyperdrive will try 6 combinations, adding these as script arguments
        '--C': choice(0.01, 0.1, 1.0),
        '--max_iter' : choice(10, 50)
        # '--C': choice(1.0),
        # '--max_iter' : choice(10)
    }
)

# Configure hyperdrive settings
hyperdrive = HyperDriveConfig(run_config=script_config,
                          hyperparameter_sampling=params,
                          policy=early_termination_policy, # Bandit early-termination policy
                          primary_metric_name='Accuracy', # Optimize the Accuracy metric logged by the training script
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                          max_total_runs=36, # Upper bound only; the grid above yields 6 combinations
                          max_concurrent_runs=2) # Run up to 2 iterations in parallel

# Run the experiment
experiment = Experiment(workspace=ws, name=experiment_name)
run = experiment.submit(config=hyperdrive)

# Show the status in the notebook as the experiment runs
RunDetails(run).show()
run.wait_for_completion()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

{'runId': 'HD_40388996-9492-4ad7-94f0-d14a82bb1887',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2022-09-21T16:31:39.252786Z',
 'endTimeUtc': '2022-09-21T16:38:12.395925Z',
 'services': {},
 'properties': {'primary_metric_config': '{"name":"Accuracy","goal":"maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '90bfb86a-0de3-47cf-805d-2b9b567fabeb',
  'user_agent': 'python/3.8.5 (Linux-5.4.0-1077-azure-x86_64-with-glibc2.10) msrest/0.6.21 Hyperdrive.Service/1.0.0 Hyperdrive.SDK/core.1.41.0',
  'space_size': '6',
  'score': '0.9497607655502392',
  'best_child_run_id': 'HD_40388996-9492-4ad7-94f0-d14a82bb1887_4',
  'best_metric_status': 'Succeeded',
  'best_data_container_id': 'dcid.HD_40388996-9492-4ad7-94f0-d14a82bb1887_4'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'configuration': None,
  'attribution': None,
  'telemetryValues': {'amlClientTy

KeyError: 'log_files'

## Select the best performing run

In [11]:
# Print all child runs, sorted by the primary metric
for child_run in run.get_children_sorted_by_primary_metric():
    print(child_run)


{'run_id': 'HD_40388996-9492-4ad7-94f0-d14a82bb1887_4', 'hyperparameters': '{"--C": 1.0, "--max_iter": 10}', 'best_primary_metric': 0.9497607655502392, 'status': 'Completed'}
{'run_id': 'HD_40388996-9492-4ad7-94f0-d14a82bb1887_5', 'hyperparameters': '{"--C": 1.0, "--max_iter": 50}', 'best_primary_metric': 0.9497607655502392, 'status': 'Completed'}
{'run_id': 'HD_40388996-9492-4ad7-94f0-d14a82bb1887_3', 'hyperparameters': '{"--C": 0.1, "--max_iter": 50}', 'best_primary_metric': 0.8726076555023924, 'status': 'Completed'}
{'run_id': 'HD_40388996-9492-4ad7-94f0-d14a82bb1887_2', 'hyperparameters': '{"--C": 0.1, "--max_iter": 10}', 'best_primary_metric': 0.8726076555023924, 'status': 'Completed'}
{'run_id': 'HD_40388996-9492-4ad7-94f0-d14a82bb1887_1', 'hyperparameters': '{"--C": 0.01, "--max_iter": 50}', 'best_primary_metric': 0.8606459330143541, 'status': 'Completed'}
{'run_id': 'HD_40388996-9492-4ad7-94f0-d14a82bb1887_0', 'hyperparameters': '{"--C": 0.01, "--max_iter": 10}', 'best_primary_
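
To see how the Bandit policy's `slack_factor=0.2` relates to these scores, the sketch below applies the documented cut-off for a maximized metric, `best / (1 + slack_factor)`, to the child-run scores listed above. It is simplified arithmetic on final scores: the service applies the policy at evaluation intervals as metrics are reported, and the short `run_N` keys are labels chosen here for readability.

```python
# Scores copied from the child-run listing above. The last entry is cut off
# in that output, so it is left out here.
child_scores = {
    "run_4": 0.9497607655502392,
    "run_5": 0.9497607655502392,
    "run_3": 0.8726076555023924,
    "run_2": 0.8726076555023924,
    "run_1": 0.8606459330143541,
}

slack_factor = 0.2  # same value as in the BanditPolicy above

# For a maximized metric, runs scoring below best / (1 + slack_factor)
# are candidates for early termination.
best_score = max(child_scores.values())
cutoff = best_score / (1 + slack_factor)
would_stop = [name for name, score in child_scores.items() if score < cutoff]

print(f"best score: {best_score:.4f}")
print(f"cut-off:    {cutoff:.4f}")
print("runs below the cut-off:", would_stop)
```

With these numbers, every listed run scores above the cut-off, so the policy would not have stopped any of them.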

In [12]:
# Get the best run, and its metrics and arguments
best_run = run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
script_arguments = best_run.get_details() ['runDefinition']['arguments']
print('Best Run Id: ', best_run.id)
print(' -AUC:', best_run_metrics['AUC'])
print(' -Accuracy:', best_run_metrics['Accuracy'])
print(' -Arguments:',script_arguments)

Best Run Id:  HD_40388996-9492-4ad7-94f0-d14a82bb1887_4
 -AUC: 0.9534294499935874
 -Accuracy: 0.9497607655502392
 -Arguments: ['--input-data', 'DatasetConsumptionConfig:training_data', '--C', '1', '--max_iter', '10']
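
The argument list printed above is exactly what the training script receives on its command line: the fixed `--input-data` argument followed by the hyperparameters HyperDrive appended. A minimal sketch, assuming the same `argparse` definitions as `train_0920_v10.py`, shows how those strings are converted into typed values (the `build_parser` helper is introduced here only for illustration).

```python
import argparse


def build_parser():
    # Mirrors the arguments defined in train_0920_v10.py.
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-data", type=str, dest="input_data", help="training dataset")
    parser.add_argument("--C", type=float, default=1.0, help="regularization strength")
    parser.add_argument("--max_iter", type=int, default=50, help="maximum number of iterations")
    return parser


# Argument list reported for the best run above.
best_run_arguments = [
    "--input-data", "DatasetConsumptionConfig:training_data",
    "--C", "1",
    "--max_iter", "10",
]

args = build_parser().parse_args(best_run_arguments)
print(args.input_data, args.C, args.max_iter)
```

Because `--C` is declared with `type=float`, the string `'1'` arrives in the script as `1.0`.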


## Register Best Model

Now that you've found the best run, you can register the model it trained.

In [13]:
# Register the best model
reg_model = best_run.register_model(model_name='sms-spam-hd-0920-v10-model',
                                        model_path='outputs/model_v10.pkl', 
                                        tags={'Method':'LogisticRegression Hyperdrive'}, 
                                        # sample_input_dataset = smsspam_ds,
                                        properties={'Accuracy': best_run_metrics['Accuracy'],
                                                    'AUC': best_run_metrics['AUC']})
print(reg_model)

Model(workspace=Workspace.create(name='nahmed30-azureml-workspace', subscription_id='16bc73b5-82be-47f2-b5ab-f2373344794c', resource_group='epe-poc-nazeer'), name=sms-spam-hd-0920-v10-model, id=sms-spam-hd-0920-v10-model:7, version=7, tags={'Method': 'LogisticRegression Hyperdrive'}, properties={'Accuracy': '0.9497607655502392', 'AUC': '0.9534294499935874'})


In [14]:
reg_model.id

'sms-spam-hd-0920-v10-model:7'

In [15]:
for child_run in run.get_children():
    print(child_run,"\n")

Run(Experiment: nb_lr_exp_0920_v10,
Id: HD_40388996-9492-4ad7-94f0-d14a82bb1887_4,
Type: azureml.scriptrun,
Status: Completed) 

Run(Experiment: nb_lr_exp_0920_v10,
Id: HD_40388996-9492-4ad7-94f0-d14a82bb1887_5,
Type: azureml.scriptrun,
Status: Completed) 

Run(Experiment: nb_lr_exp_0920_v10,
Id: HD_40388996-9492-4ad7-94f0-d14a82bb1887_3,
Type: azureml.scriptrun,
Status: Completed) 

Run(Experiment: nb_lr_exp_0920_v10,
Id: HD_40388996-9492-4ad7-94f0-d14a82bb1887_2,
Type: azureml.scriptrun,
Status: Completed) 

Run(Experiment: nb_lr_exp_0920_v10,
Id: HD_40388996-9492-4ad7-94f0-d14a82bb1887_1,
Type: azureml.scriptrun,
Status: Completed) 

Run(Experiment: nb_lr_exp_0920_v10,
Id: HD_40388996-9492-4ad7-94f0-d14a82bb1887_0,
Type: azureml.scriptrun,
Status: Completed) 



In [16]:
reg_model.name

'sms-spam-hd-0920-v10-model'

In [30]:
model_name = reg_model.name
script_file = "scripts/score_v10.py"
description = "aml SMS spam lr hd project sdk"


In [31]:
# Reuse the environment of the best run for inference
env = best_run.get_environment()

## Deploy Model as Webservice on Azure Container Instance

Run the following code to deploy the best model. You can follow the state of the deployment in Azure Machine Learning studio. This step can take a few minutes.

In [32]:
inference_config = InferenceConfig(entry_script=script_file, environment=best_run.get_environment())


In [33]:
inference_config

InferenceConfig(entry_script=scripts/score_v10.py, runtime=None, conda_file=None, extra_docker_file_steps=None, source_directory=None, enable_gpu=None, base_image=None, base_image_registry=<azureml.core.container_registry.ContainerRegistry object at 0x7fbf6d9b42b0>)

In [34]:
aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1,
                                               memory_gb = 1,
                                               tags = {'type': "automl-SMS-spam-prediction"},
                                               description = 'Sample service for AutoML SMS Spam Prediction')

aci_service_name = 'smsspam-lrhd-v10-7'

## Create Web Service

In [35]:
aci_service = Model.deploy(ws, aci_service_name, [reg_model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2022-09-21 03:26:32-04:00 Creating Container Registry if not exists.
2022-09-21 03:26:32-04:00 Registering the environment.
2022-09-21 03:26:33-04:00 Use the existing image.
2022-09-21 03:26:34-04:00 Submitting deployment to compute.
2022-09-21 03:26:39-04:00 Checking the status of deployment smsspam-lrhd-v10-7..
2022-09-21 03:28:46-04:00 Checking the status of inference endpoint smsspam-lrhd-v10-7.
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy


## Service test

***REST Endpoint***

In [36]:
scoring_url=aci_service.scoring_uri
print(scoring_url)

http://f179cabe-8ce5-48b1-b153-924873c6816b.centralus.azurecontainer.io/score


In [37]:
import urllib.request
import json
import os
import ssl

def allowSelfSignedHttps(allowed):
    # bypass the server certificate verification on client side
    if allowed and not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None):
        ssl._create_default_https_context = ssl._create_unverified_context

allowSelfSignedHttps(True) # this line is needed if you use self-signed certificate in your scoring service.

# Request data goes here
# The example below assumes JSON formatting which may be updated
# depending on the format your endpoint expects.
# More information can be found here:
# https://docs.microsoft.com/azure/machine-learning/how-to-deploy-advanced-entry-script

data =  {
  "data": [
    {
      "v2": "Is that seriously how you spell his name"
    }
  ],
  "method": "predict"
}

body = str.encode(json.dumps(data))

url = scoring_url
api_key = '' # Replace this with the API key if authentication is enabled on the web service

headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key)}

req = urllib.request.Request(url, body, headers)

try:
    response = urllib.request.urlopen(req)

    result = response.read()
    print(result)
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))

    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(error.read().decode("utf8", 'ignore'))


b'"{\\"result\\": [0]}"'
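
The response body above is JSON that has been encoded twice: the outer layer is a JSON string whose content is itself a JSON object. A minimal sketch for decoding it, assuming the `{"result": [...]}` shape shown in the output and the 0 = ham, 1 = spam encoding from the training script (the helper name `decode_prediction` is hypothetical):

```python
import json

# Label encoding used in the training script: ham -> 0, spam -> 1.
LABELS = {0: "ham", 1: "spam"}


def decode_prediction(raw_body):
    """Decode a scoring response that may be JSON wrapped in a JSON string."""
    outer = json.loads(raw_body)
    # If the first pass yields a string, the payload was encoded twice.
    payload = json.loads(outer) if isinstance(outer, str) else outer
    return [LABELS.get(value, "unknown") for value in payload["result"]]


print(decode_prediction(b'"{\\"result\\": [0]}"'))
```

The `isinstance` check keeps the helper usable if the entry script is later changed to return a plain JSON object.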


In [38]:
import urllib.request
import json
import os
import ssl

def allowSelfSignedHttps(allowed):
    # bypass the server certificate verification on client side
    if allowed and not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None):
        ssl._create_default_https_context = ssl._create_unverified_context

allowSelfSignedHttps(True) # this line is needed if you use self-signed certificate in your scoring service.

# Request data goes here
# The example below assumes JSON formatting which may be updated
# depending on the format your endpoint expects.
# More information can be found here:
# https://docs.microsoft.com/azure/machine-learning/how-to-deploy-advanced-entry-script

data =  {
  "data": [
    {
      "v2": "XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here"
    }
  ],
  "method": "predict"
}

body = str.encode(json.dumps(data))

url = scoring_url
api_key = '' # Replace this with the API key if authentication is enabled on the web service

headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key)}

req = urllib.request.Request(url, body, headers)

try:
    response = urllib.request.urlopen(req)

    result = response.read()
    print(result)
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))

    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(error.read().decode("utf8", 'ignore'))


b'"{\\"result\\": [1]}"'
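
The two request cells above differ only in the message text, so the request construction can be factored into a small helper. The sketch below builds the same JSON payload and headers; the function name `build_scoring_request` and the localhost URL are hypothetical placeholders, and the `Authorization` header is added only when a key is supplied.

```python
import json
import urllib.request


def build_scoring_request(url, message, api_key=""):
    """Build a POST request carrying one SMS message in the notebook's payload format."""
    body = json.dumps({"data": [{"v2": message}], "method": "predict"}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when authentication is enabled on the web service.
        headers["Authorization"] = "Bearer " + api_key
    return urllib.request.Request(url, data=body, headers=headers)


# Hypothetical placeholder URL; in the notebook this would be aci_service.scoring_uri.
request = build_scoring_request(
    "http://localhost:8000/score",
    "Is that seriously how you spell his name",
)
print(request.get_method(), request.full_url)
print(json.loads(request.data))
```

Sending the request is then a single `urllib.request.urlopen(request)` call inside the same `try`/`except` block used above.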


## Delete Service

In [39]:
# Call the method; without parentheses the service is not deleted
aci_service.delete()