# The "Azure ML SDK" for SMS Spam Inference 

## Introduction

In this notebook, we use the Azure ML SDK to train a logistic regression SMS spam classifier, tune its hyperparameters with HyperDrive, and deploy and consume the resulting model through Azure ML.


Steps:

1. Connect to an existing workspace and create an experiment in it.
2. Create a compute cluster.
3. Load the dataset.
4. Write the training environment and the training script.
5. Configure hyperparameter tuning using HyperDriveConfig.
6. Run the HyperDrive experiment.
7. Explore the results and get the best model.
8. Register the best model.
9. Deploy the best model.
10. Consume the endpoint.

## Azure Machine Learning SDK-specific imports

In [13]:
from azureml.core import Workspace, Experiment
from azureml.core.compute import AmlCompute
from azureml.widgets import RunDetails
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

In [14]:
import numpy as np
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

## Initialize Workspace
Initialize a workspace object from the persisted configuration. Make sure the config file is present at `.\config.json`.
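Before calling `Workspace.from_config()`, it can help to confirm that the config file carries the three fields the SDK reads from it. The following is a minimal sketch, not part of the original notebook: the helper name `missing_config_keys` and the placeholder values are assumptions made for illustration, and the file is written to a temporary directory rather than to the real `config.json`.

```python
import json
import tempfile
from pathlib import Path

# Fields that a workspace config.json is expected to contain.
REQUIRED_KEYS = {"subscription_id", "resource_group", "workspace_name"}

def missing_config_keys(path):
    """Return the required keys that are absent from the config file (hypothetical helper)."""
    config = json.loads(Path(path).read_text())
    return sorted(REQUIRED_KEYS - set(config))

# Placeholder values for illustration only; substitute your own identifiers.
example = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example, indent=2))
    print(missing_config_keys(config_path))  # [] when nothing is missing
```

An empty list means the file has every field; any names returned are the ones to add.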

In [15]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

nahmed30-azureml-workspace
epe-poc-nazeer
centralus
16bc73b5-82be-47f2-b5ab-f2373344794c


## Create an Azure ML experiment

Let's create an experiment named 'nb_lr_exp_0918_v7' in the workspace we just initialized.

In [16]:
experiment_name = 'nb_lr_exp_0918_v7'
experiment = Experiment(ws, experiment_name)
experiment

| Name | Workspace | Report Page | Docs Page |
|---|---|---|---|
| nb_lr_exp_0918_v7 | nahmed30-azureml-workspace | Link to Azure Machine Learning studio | Link to Documentation |


## Create a Compute Cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/concept-azure-machine-learning-architecture#compute-target) for your HyperDrive run.

In [17]:
from azureml.core.compute_target import ComputeTargetException

aml_name = "cpu-cluster"
try:
    # Reuse the cluster if it already exists in the workspace
    aml_compute = AmlCompute(ws, aml_name)
    print('Found existing AML compute context.')
except ComputeTargetException:
    # Otherwise provision a new cluster and wait until it is ready
    print('Creating new AML compute context.')
    aml_config = AmlCompute.provisioning_configuration(vm_size = "Standard_D2_v2", min_nodes=1, max_nodes=3)
    aml_compute = AmlCompute.create(ws, name = aml_name, provisioning_configuration = aml_config)
    aml_compute.wait_for_completion(show_output = True)

cts = ws.compute_targets
compute_target = cts[aml_name]

Found existing AML compute context.


## Data
Make sure you have uploaded the dataset to Azure ML and that `key` matches the name under which the dataset is registered.

In [18]:
key = 'UdacityPrjEmailSpamDataSet'
smsspam_ds = ws.datasets[key]
df = smsspam_ds.to_pandas_dataframe()
df.describe()

| | v1 | v2 | Column3 | Column4 | Column5 |
|---|---|---|---|---|---|
| count | 5572 | 5572 | 50 | 12 | 6 |
| unique | 2 | 5169 | 43 | 10 | 5 |
| top | ham | Sorry, I'll call later | bt not his girlfrnd... G o o d n i g h t . . .@" | GE | GNT:-)" |
| freq | 4825 | 30 | 3 | 2 | 2 |
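The summary above shows that the labels are imbalanced: 4825 of the 5572 messages are `ham`. A short calculation, using only the counts reported by `df.describe()`, gives the share of each class. This is an illustrative sketch; the variable names are not from the notebook.

```python
# Counts taken from the df.describe() output above.
total_messages = 5572
ham_messages = 4825                      # frequency of the most common label, "ham"
spam_messages = total_messages - ham_messages

ham_share = ham_messages / total_messages
spam_share = spam_messages / total_messages

print(f"spam messages: {spam_messages}")  # spam messages: 747
print(f"ham share:  {ham_share:.3f}")     # ham share:  0.866
print(f"spam share: {spam_share:.3f}")    # spam share: 0.134
```

A classifier that always predicts `ham` would therefore already reach about 0.866 accuracy on the full dataset, which is a useful baseline when reading the accuracy values reported later.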


In [19]:
df.head()

| | v1 | v2 | Column3 | Column4 | Column5 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


### Create workspace folders

In [20]:
import os

experiment_folder = 'nb_lr_exp_0918_v7'
os.makedirs(experiment_folder, exist_ok=True)

## Environment

In [21]:
%%writefile $experiment_folder/hyperdrive_env.yml
name: batch_environment
dependencies:
- python=3.8.5
- scikit-learn
- pandas
- numpy
- regex
- nltk
- pip
- pip:
  - azureml-defaults

Overwriting nb_lr_exp_0918_v7/hyperdrive_env.yml


## Create a Python script to train the model

In [22]:
%%writefile $experiment_folder/train.py

# Import libraries
import argparse
import os
import string

import joblib
import nltk
import numpy as np
import regex as re
from azureml.core import Run
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Get the experiment run context
run = Run.get_context()

# Get script arguments
parser = argparse.ArgumentParser()

# Input dataset
parser.add_argument("--input-data", type=str, dest='input_data', help='training dataset')

# Hyperparameters
parser.add_argument('--C', type=float, default=1.0, help="Inverse of regularization strength; smaller values mean stronger regularization")
parser.add_argument('--max_iter', type=int, default=100, help="Maximum number of iterations")

# Add arguments to args collection
args = parser.parse_args()

# Log hyperparameter values
# (np.float and np.int were removed in NumPy 1.24; use the built-in types)
run.log("Regularization Strength:", float(args.C))
run.log("Max iterations:", int(args.max_iter))

 
# load the sms spam dataset -- Get the training data from the input
print("Loading SMS Spam Data...")
df = run.input_datasets['training_data'].to_pandas_dataframe() 

#--------------------------Prepare Data-------------------------------------------------
# Cleanup and Prepare Data # Find and eliminate stop words 
nltk.download('stopwords')
stop_words= set(stopwords.words("english"))
stop_words.update(['https', 'http', 'amp', 'CO', 't', 'u', 'new', "I'm", "would"])


spam = df.query("v1=='spam'").v2.str.cat(sep=" ")
ham = df.query("v1=='ham'").v2.str.cat(sep=" ")

# Convert labels in the v1 column only: spam -> 1, ham -> 0
df['v1'] = df['v1'].map({'spam': 1, 'ham': 0})

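# Patterns compiled once and reused by clean_text
WHITESPACE = re.compile(r"\s+")
WEB_ADDRESS = re.compile(r"(?i)https?:\/\/[a-z0-9.~_\-\/]+")  # matches http:// and https://
USER = re.compile(r"(?i)@[a-z0-9_]+")

# Clean the text
def clean_text(text):
    text = text.replace('.', '')                # drop periods
    text = WHITESPACE.sub(' ', text)            # collapse runs of whitespace
    text = WEB_ADDRESS.sub('', text)            # remove web addresses
    text = USER.sub('', text)                   # remove @mentions
    text = re.sub(r"\[[^()]*\]", "", text)      # remove bracketed spans without parentheses
    text = re.sub(r"\d+", "", text)             # remove digits
    text = re.sub(r'[^\w\s]', '', text)         # remove remaining punctuation
    text = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", text)
    return text.lower()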

df.v2 = [clean_text(item) for item in df.v2]

#---------------------More Data Prep-----------#
df = df.drop(['Column3', 'Column4', 'Column5'], axis = 1)

df_msg_copy = df['v2'].copy()

# vectorizer = TfidfVectorizer(stop_words='english')

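# Build the stop-word set once instead of reloading the list for every word
english_stopwords = set(stopwords.words('english'))

def text_preprocess(text):
    # Strip punctuation, then drop English stop words
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = [word for word in text.split() if word.lower() not in english_stopwords]
    return " ".join(words)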

df_msg_copy = df_msg_copy.apply(text_preprocess)

# text2 = vectorizer.fit_transform(df_msg_copy)

# Create the stemmer once instead of once per word
snowball = SnowballStemmer("english")

def stemmer(text):
    # Stem each word; every stem is followed by a space, as before
    return "".join(snowball.stem(word) + " " for word in text.split())

df_msg_copy = df_msg_copy.apply(stemmer)
vectorizer = TfidfVectorizer(stop_words='english')
msg_mat = vectorizer.fit_transform(df_msg_copy)


# Split Train and Test
xTrain, xTest, yTrain, yTest = train_test_split(msg_mat, df.v1, test_size=0.3, random_state=20)

# Debug output: confirm the types produced by the split
print("xTrain type:", type(xTrain))
print("xTest type:", type(xTest))
print("yTest type:", type(yTest))
print("yTest:", yTest)

# --------------------------End Prepare Data--------------------------------------------
# --------------------------Start Training----------------------------------------------
# Train a Logistic Regression classification model with the specified hyperparameters
print('Training a classification model')

model = LogisticRegression(solver='liblinear', penalty='l1', C=args.C, max_iter=args.max_iter)
model.fit(xTrain, yTrain)
pred = model.predict(xTest)
acc = accuracy_score(yTest,pred)

# ---------------------------End Training------------------------------------------------


# Log accuracy on the test split
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(xTest)
auc = roc_auc_score(yTest,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))



# Save the model in the run outputs
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/model_v7.pkl')


run.complete()


Overwriting nb_lr_exp_0918_v7/train.py
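Because the script only runs on the remote cluster, it can be useful to try the text-cleaning idea locally on a single message first. The sketch below is a condensed stand-in for `clean_text`, not a copy of it: it uses the standard-library `re` module instead of `regex`, applies the steps in a slightly different order, and assumes that both `http` and `https` links should be stripped. The sample message and URL are made up.

```python
import re

WHITESPACE = re.compile(r"\s+")
WEB_ADDRESS = re.compile(r"https?://[a-z0-9.~_\-/]+", re.IGNORECASE)
USER = re.compile(r"@[a-z0-9_]+", re.IGNORECASE)

def clean_text(text):
    """Simplified cleaning: strip URLs, mentions, digits and punctuation."""
    text = WEB_ADDRESS.sub("", text)       # remove web addresses
    text = USER.sub("", text)              # remove @mentions
    text = re.sub(r"\d+", "", text)        # remove digits
    text = re.sub(r"[^\w\s]", "", text)    # remove punctuation
    text = WHITESPACE.sub(" ", text)       # collapse whitespace
    return text.strip().lower()

sample = "Free entry in 2 a wkly comp! Visit http://example.com now @winner"
print(clean_text(sample))  # free entry in a wkly comp visit now
```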


## HyperDrive Configuration

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#primary-metric

### Run a hyperparameter tuning experiment

In [23]:
from azureml.core import Experiment, ScriptRunConfig, Environment
from azureml.train.hyperdrive import GridParameterSampling, HyperDriveConfig, PrimaryMetricGoal, choice
from azureml.widgets import RunDetails

# Create a Python environment for the experiment
hyper_env = Environment.from_conda_specification("experiment_env", experiment_folder + "/hyperdrive_env.yml")

# The training dataset (smsspam_ds) was loaded in the Data section above

# Create a script config
script_config = ScriptRunConfig(source_directory=experiment_folder,
                                script='train.py',
                                # Add non-hyperparameter arguments -in this case, the training dataset
                                arguments = ['--input-data', smsspam_ds.as_named_input('training_data')],
                                environment=hyper_env,
                                compute_target = aml_compute)

# Sample a range of parameter values
params = GridParameterSampling(
    {
        # Hyperdrive will try 6 combinations, adding these as script arguments
        '--C': choice(0.01, 0.1, 1.0),
        '--max_iter' : choice(10, 50)
    }
)

# Configure hyperdrive settings
hyperdrive = HyperDriveConfig(run_config=script_config, 
                          hyperparameter_sampling=params, 
                          policy=None, # No early stopping policy
                          primary_metric_name='Accuracy', # Must match the metric name logged by train.py
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                          max_total_runs=36, # Upper bound on runs; the grid above has only 6 combinations
                          max_concurrent_runs=2) # Run up to 2 iterations in parallel

# Run the experiment
experiment = Experiment(workspace=ws, name=experiment_name)
run = experiment.submit(config=hyperdrive)

# Show the status in the notebook as the experiment runs
# RunDetails(run).show()
run.wait_for_completion()

{'runId': 'HD_c1e3f1b9-bc57-4e4e-9d9b-edc635a3982e',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2022-09-19T06:38:10.234783Z',
 'endTimeUtc': '2022-09-19T06:42:14.028572Z',
 'services': {},
 'properties': {'primary_metric_config': '{"name":"Accuracy","goal":"maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '34b37696-085e-47f2-ad29-1b51099c84cf',
  'user_agent': 'python/3.8.12 (macOS-10.15.7-x86_64-i386-64bit) msrest/0.6.21 Hyperdrive.Service/1.0.0 Hyperdrive.SDK/core.1.44.0',
  'space_size': '6',
  'score': '0.9503588516746412',
  'best_child_run_id': 'HD_c1e3f1b9-bc57-4e4e-9d9b-edc635a3982e_4',
  'best_metric_status': 'Succeeded',
  'best_data_container_id': 'dcid.HD_c1e3f1b9-bc57-4e4e-9d9b-edc635a3982e_4'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'configuration': None,
  'attribution': None,
  'telemetryValues': {'amlClientType': 'azurem
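The run details report a `space_size` of 6, which matches the grid defined above: three values of `--C` times two values of `--max_iter`. The sketch below enumerates that grid with `itertools.product` and parses each combination with the same two arguments that `train.py` declares. How HyperDrive actually lays out the command line is not shown in this notebook, so the token format used here is an assumption.

```python
import argparse
from itertools import product

# Search space copied from the GridParameterSampling definition.
search_space = {
    "--C": [0.01, 0.1, 1.0],
    "--max_iter": [10, 50],
}

# Every combination of values, flattened into script-argument tokens
# (assumed layout: name followed by value).
names = list(search_space)
combinations = [
    [token for name, value in zip(names, values) for token in (name, str(value))]
    for values in product(*search_space.values())
]

# The two hyperparameter arguments that train.py declares.
parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0)
parser.add_argument("--max_iter", type=int, default=100)

for argv in combinations:
    args = parser.parse_args(argv)
    print(argv, "->", args.C, args.max_iter)

print(len(combinations), "combinations")  # 6 combinations
```

Since the grid is exhausted after six runs, a larger `max_total_runs` acts only as an upper bound.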

## Determine the best performing run

In [None]:
# Print all child runs, sorted by the primary metric
for child_run in run.get_children_sorted_by_primary_metric():
    print(child_run)
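Sorting by the primary metric means the ranking depends on which metric was configured, here `Accuracy`. The sketch below ranks a few runs by two different metrics to show that the order can change. The metric values and the helper name `rank_runs` are hypothetical and invented for illustration; they are not results from this experiment.

```python
# Hypothetical metrics for three child runs, invented for illustration only.
child_metrics = [
    {"run": "child_0", "Accuracy": 0.93, "AUC": 0.97},
    {"run": "child_1", "Accuracy": 0.95, "AUC": 0.96},
    {"run": "child_2", "Accuracy": 0.90, "AUC": 0.98},
]

def rank_runs(runs, metric, maximize=True):
    """Return the runs ordered from best to worst on the given metric."""
    return sorted(runs, key=lambda r: r[metric], reverse=maximize)

by_accuracy = rank_runs(child_metrics, "Accuracy")
by_auc = rank_runs(child_metrics, "AUC")

print([r["run"] for r in by_accuracy])  # ['child_1', 'child_0', 'child_2']
print([r["run"] for r in by_auc])       # ['child_2', 'child_0', 'child_1']
```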


In [None]:
# Get the best run, and its metrics and arguments
best_run = run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
script_arguments = best_run.get_details()['runDefinition']['arguments']
print('Best Run Id: ', best_run.id)
print(' -AUC:', best_run_metrics['AUC'])
print(' -Accuracy:', best_run_metrics['Accuracy'])
print(' -Arguments:',script_arguments)
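The reported accuracy can be turned back into message counts. The sketch below assumes that all 5572 rows from `df.describe()` reach `train_test_split` in `train.py`, and uses the `score` value from the HyperDrive run details shown earlier. scikit-learn rounds the test-set size up when `test_size` is a fraction.

```python
import math

total_messages = 5572                # row count from df.describe()
test_fraction = 0.3                  # test_size passed to train_test_split
best_accuracy = 0.9503588516746412   # 'score' in the HyperDrive run details

# For a fractional test_size, the test set gets ceil(fraction * n) rows.
test_size = math.ceil(total_messages * test_fraction)
train_size = total_messages - test_size

# Accuracy is correct / test_size, so the correct count can be recovered.
correct = round(best_accuracy * test_size)
wrong = test_size - correct

print(f"train: {train_size}, test: {test_size}")   # train: 3900, test: 1672
print(f"correct: {correct}, wrong: {wrong}")       # correct: 1589, wrong: 83
```

Under these assumptions the best run classified 1589 of 1672 test messages correctly.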

In [None]:
best_run

In [None]:
best_run.download_files()

Now that you've found the best run, you can register the model it trained.

In [None]:
# Register the best model
reg_model = best_run.register_model(model_name='sms-spam-hd-0917-v7-model',
                                        model_path='outputs/model_v7.pkl', 
                                        tags={'Method':'LogisticRegression Hyperdrive'}, 
                                        properties={'Accuracy': best_run_metrics['Accuracy'],
                                                    'AUC': best_run_metrics['AUC']})
print(reg_model)

In [None]:
type(reg_model)

In [None]:
reg_model.id

## Save the best model

In [None]:
best_run.get_properties()

In [None]:
for child_run in run.get_children():
    print(child_run,"\n")

In [None]:
reg_model.name

In [None]:
model_name = reg_model.name
script_file = "scripts/score_v1_2.py"
description = "aml SMS spam lr hd project sdk"


In [None]:
# Reuse the environment of the best run for inference
env = best_run.get_environment()

## Deploy the Best Model

Run the following code to deploy the best model to Azure Container Instances (ACI). You can follow the state of the deployment in Azure Machine Learning studio. This step can take a few minutes.

In [None]:
inference_config = InferenceConfig(entry_script=script_file, environment=best_run.get_environment())


In [None]:
inference_config

In [None]:
aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1,
                                               memory_gb = 1,
                                               tags = {'type': "hyperdrive-SMS-spam-prediction"},
                                               description = 'Sample service for HyperDrive-tuned SMS Spam Prediction')

aci_service_name = 'smsspam-lrhd-v7-30'

In [None]:
aci_service = Model.deploy(ws, aci_service_name, [reg_model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)

## Consume the Endpoint
You can add more records to the `data` list in the following input sample.

In [None]:
scoring_uri = aci_service.scoring_uri
print(scoring_uri)

In [None]:
import requests
import json

 
data = {
  "data": [
    {
      "v2": "Click link below to collect $10000",
      "Column4": "example_value",
      "Column5": "example_value",
      "Column6": "example_value"
    }
  ],
  "method": "predict"
}
    
# Convert to JSON string
input_data = json.dumps(data)
with open("data.json", "w") as _f:
    _f.write(input_data)

# Set the content type
headers = {'Content-Type': 'application/json'}
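# If authentication is enabled, set the authorization header to the service key
# (not the dataset `key` defined earlier); aci_service.get_keys() returns the keys
# headers['Authorization'] = f'Bearer {service_key}'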

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print("prediction is :" , resp.json())

In [None]:
import requests
import json

 
data = {
  "data": [
    {
      "v2": "I'm waiting here see you soon",
      "Column4": "example_value",
      "Column5": "example_value",
      "Column6": "example_value"
    }
  ],
  "method": "predict"
}
    
# Convert to JSON string
input_data = json.dumps(data)
with open("data.json", "w") as _f:
    _f.write(input_data)

# Set the content type
headers = {'Content-Type': 'application/json'}
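# If authentication is enabled, set the authorization header to the service key
# (not the dataset `key` defined earlier); aci_service.get_keys() returns the keys
# headers['Authorization'] = f'Bearer {service_key}'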

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print("prediction is :" , resp.json())
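The two request cells above differ only in the message text, so the request body can be built by a small helper. This is a sketch: the name `build_payload` is hypothetical, and the extra `Column4` to `Column6` fields are copied from the cells above without knowing whether the scoring script actually requires them.

```python
import json

# Placeholder fields copied from the request cells above; whether the
# scoring script needs them is not shown in this notebook.
EXTRA_COLUMNS = {
    "Column4": "example_value",
    "Column5": "example_value",
    "Column6": "example_value",
}

def build_payload(messages):
    """Return the JSON request body for a list of SMS texts (hypothetical helper)."""
    records = [{"v2": message, **EXTRA_COLUMNS} for message in messages]
    return json.dumps({"data": records, "method": "predict"})

payload = build_payload([
    "Click link below to collect $10000",
    "I'm waiting here see you soon",
])
print(payload)
```

The resulting string can be passed as the body of `requests.post(scoring_uri, payload, headers=headers)`, sending both messages in one call.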