# How to be a Data Scientist on Azure

## or

# Adventures with Pythons, Pandas, and Penguins

![animals](./images/zoo.png)


### A 30 minute exploration of data analysis and machine learning for beginners

#### Graeme Malcolm - Principal Content PM - Data Science and AI, Microsoft Worldwide Learning

## Data Preparation and Exploration

Let's try to see how characteristics (*features*) of penguin observations might relate to their species (*label*).

In [None]:
import pandas as pd

# load the training dataset
penguins = pd.read_csv('./data/penguin-data.csv')

# Display a random sample of 10 observations
sample = penguins.sample(10)
sample

Let's match the species IDs to the actual species names.

In [None]:
penguin_classes = ['Adelie', 'Gentoo', 'Chinstrap']
species_names = sample['Species'].apply(lambda x: penguin_classes[x])
sample['SpeciesName'] = species_names
sample

Often, we need to begin by cleaning up the data. For example, are there any missing values?

In [None]:
# Show rows containing nulls
penguins[penguins.isnull().any(axis=1)]

OK, let's just get rid of those rows so we have nice, clean data to work with.

In [None]:
# Drop rows containing NaN values
penguins=penguins.dropna()
# Confirm there are now no nulls
penguins.isnull().sum()
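
To see what these two clean-up steps do in isolation, here's a minimal sketch on a made-up four-row table. The values and the `toy` DataFrame are assumptions for illustration only; just the column names are borrowed from the penguin data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not real penguin measurements)
toy = pd.DataFrame({
    'CulmenLength': [39.1, np.nan, 46.1, 50.4],
    'BodyMass': [3750.0, 3800.0, np.nan, 5550.0],
    'Species': [0, 0, 1, 1],
})

# Count the missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Select the rows that have at least one missing value
rows_with_nulls = toy[toy.isnull().any(axis=1)]
print(rows_with_nulls)

# Drop those rows and report how many were removed
clean = toy.dropna()
print('Dropped', len(toy) - len(clean), 'of', len(toy), 'rows')
```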

There's lots we can do to explore the data. For example, let's visualize the distribution of the features for each species, so we can see how they compare.

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

penguin_features = ['CulmenLength','CulmenDepth','FlipperLength','BodyMass']
penguin_label = 'Species'
for col in penguin_features:
    penguins.boxplot(column=col, by=penguin_label, figsize=(6,6))
    plt.title(col)
plt.show()

It looks like there are some relationships that might help us identify a species from the measurements:

- Species 0 (Adelie) tends to have a short but deep culmen (bill), and generally has short flippers and a low body mass.
- Species 1 (Gentoo) tends to have a medium-length, thin culmen, with long flippers and a high body mass.
- Species 2 (Chinstrap) tends to have a long and deep culmen, medium-length flippers, and a low body mass.
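
Observations like these can also be checked numerically by summarizing each feature per species. The sketch below uses a tiny made-up table (the numbers are assumptions, not real measurements) and reports the median, which is the center line a box plot draws.

```python
import pandas as pd

# Hypothetical toy data: two observations per species ID
toy = pd.DataFrame({
    'Species': [0, 0, 1, 1, 2, 2],
    'CulmenLength': [38.0, 40.0, 46.0, 48.0, 49.0, 51.0],
    'FlipperLength': [185, 195, 215, 220, 195, 200],
})

# Median of every feature, grouped by species
summary = toy.groupby('Species').median()
print(summary)
```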

## Machine Learning

Now that we know the features in the data might help distinguish the species, we can apply an algorithm to analyze the relationships between the features (measurements) and known labels (species), and encapsulate them in a model that can be used to predict the species for a new penguin observation based on its measurements.

In machine learning terms, this kind of model is a *classification* model (because it predicts the *class* or *category* of something), and it's a form of *supervised* machine learning, which means that we use data for which we have known feature *and* label values to train a model that can predict unknown *labels* from known *features*.

The first thing we'll do is reserve (or *hold-back*) some of the data. That way we can use some of the data to train the model, and then use the data we held back to test how well it predicts.

In [None]:
from sklearn.model_selection import train_test_split

# Separate features and labels
penguins_X, penguins_y = penguins[penguin_features].values, penguins[penguin_label].values

# Split data 70%-30% into training set and test set
x_penguin_train, x_penguin_test, y_penguin_train, y_penguin_test = train_test_split(penguins_X, penguins_y,
                                                                                    test_size=0.30,
                                                                                    random_state=0,
                                                                                    stratify=penguins_y)

print ('Training Set: %d, Test Set: %d \n' % (x_penguin_train.shape[0], x_penguin_test.shape[0]))
print('\nTraining data:')
print('Features:')
print(x_penguin_train[0:5])
print('Labels:')
print(y_penguin_train[0:5])
print('\nTest data:')
print('Features:')
print(x_penguin_test[0:5])
print('Labels:')
print(y_penguin_test[0:5])
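
The `stratify` argument used in the split keeps the class proportions the same in both subsets. Here's a small sketch of that effect, assuming a made-up, deliberately imbalanced label array (`y`) and a placeholder feature column (`X`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 of class 0, 30 of class 1, 10 of class 2
y = np.array([0] * 60 + [1] * 30 + [2] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y)

# Share of each class in the training and test subsets
train_share = np.bincount(y_train) / len(y_train)
test_share = np.bincount(y_test) / len(y_test)
print('Train shares:', train_share)
print('Test shares: ', test_share)
```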

OK, now we're ready to fit a classification algorithm to the training dataset in order to create a predictive model.

In this case, we're using a logistic regression algorithm, which determines the probability of an observation belonging to each class. There are lots of algorithms you can use, and data scientists generally experiment with lots of them to find the best model for their needs.

In [None]:
from sklearn.linear_model import LogisticRegression

# Set regularization rate
reg = 0.1

# train a logistic regression model on the training set
# (multi_class is omitted: 'auto' was its default, and recent scikit-learn releases deprecate the argument)
model = LogisticRegression(C=1/reg, solver='lbfgs', max_iter=10000).fit(x_penguin_train, y_penguin_train)
print (model)
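
To make the "probability of belonging to each class" idea concrete, here's a minimal sketch that calls `predict_proba` on a logistic regression model. It assumes synthetic three-class data from `make_blobs` rather than the penguin data, and the names `clf`, `probs`, and `preds` are just for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 3 classes, 4 features (not penguin measurements)
X, y = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)

# C is the inverse of the regularization rate, as in the cell above
reg = 0.1
clf = LogisticRegression(C=1 / reg, max_iter=10000).fit(X, y)

# One row per observation, one column per class; each row sums to 1
probs = clf.predict_proba(X[:3])
preds = clf.predict(X[:3])
print(np.round(probs, 3))
print('Predicted classes:', preds)
```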

With the model trained, let's see how well it predicts the classes for the test data we held back.

In [None]:
# Get predictions from test data
penguin_predictions = model.predict(x_penguin_test)

# Show me the predictions!
df_predictions = pd.DataFrame({
    'Predicted label':penguin_predictions,
    'Actual label':y_penguin_test
})
df_predictions

It's difficult to tell from just the raw predicted and actual values like this, so we generally calculate some standard metrics that quantify the model's performance.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_penguin_test, penguin_predictions))

The classification report includes the following metrics for each class (0, 1, and 2):

> note that the header row may not line up with the values!

* *Precision*: Of the predictions the model made for this class, what proportion were correct?
* *Recall*: Out of all of the instances of this class in the test dataset, what proportion did the model identify?
* *F1-Score*: The harmonic mean of precision and recall, which combines both into a single value.
* *Support*: How many instances of this class are there in the test dataset?

The classification report also includes overall metrics for the model as a whole.
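
The per-class metrics can be worked out by hand from counts of true positives, false positives, and false negatives. This sketch does that in plain Python for made-up label lists; `per_class_metrics`, `y_true`, and `y_pred` are hypothetical names, not part of scikit-learn.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1, tp + fn

# Made-up actual and predicted labels for ten observations
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]

for label in (0, 1, 2):
    p, r, f1, support = per_class_metrics(y_true, y_pred, label)
    print('class {}: precision={:.2f} recall={:.2f} f1={:.2f} support={}'.format(
        label, p, r, f1, support))
```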

Another common way to evaluate a classification model is to generate a *confusion matrix* that tabulates true and false predictions for each class. 

In [None]:
from sklearn.metrics import confusion_matrix
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Plot the confusion matrix
c_matrix = confusion_matrix(y_penguin_test, penguin_predictions)
plt.imshow(c_matrix, interpolation="nearest", cmap=plt.cm.Blues)
plt.colorbar()
tick_marks = np.arange(len(penguin_classes))
plt.xticks(tick_marks, penguin_classes, rotation=45)
plt.yticks(tick_marks, penguin_classes)
plt.xlabel("Predicted Species")
plt.ylabel("Actual Species")
plt.show()
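
If the colors in the plot are hard to read, the raw counts tell the same story. In scikit-learn's `confusion_matrix`, rows are actual classes and columns are predicted classes, so the diagonal holds the correct predictions. A small sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels for ten observations
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]

toy_matrix = confusion_matrix(y_true, y_pred)
print(toy_matrix)

# Correct predictions sit on the diagonal
accuracy = np.trace(toy_matrix) / toy_matrix.sum()
# Recall divides the diagonal by row totals (actual counts)
recall_per_class = np.diag(toy_matrix) / toy_matrix.sum(axis=1)
# Precision divides the diagonal by column totals (predicted counts)
precision_per_class = np.diag(toy_matrix) / toy_matrix.sum(axis=0)
print('Accuracy:', accuracy)
print('Recall per class:', recall_per_class)
print('Precision per class:', precision_per_class)
```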

Now that we have a trained model, we can use it to predict the species for a new penguin, based on its measurements.

In [None]:
x_new = np.array([[50.4,15.3,224,5550]])
print ('New sample: {}'.format(x_new[0]))

# The model returns an array of predictions - one for each set of features submitted
# In our case, we only submitted one penguin, so our prediction is the first one in the resulting array.
penguin_pred = model.predict(x_new)[0]
print('Predicted class is', penguin_classes[penguin_pred])

## Azure Machine Learning

So far, we've looked at machine learning using fairly common open source frameworks.

When you need to work as a team to perform machine learning tasks at scale, Azure Machine Learning can help.

One way in which Azure Machine Learning helps is to enable us to manage *experiments*, logging metrics and storing outputs. Commonly, these experiments consist of Python scripts used to train machine learning models - just like we did previously.

Let's start by creating a folder to contain the files for our training experiment.

In [None]:
import os, shutil

# Create a folder for the experiment files
training_folder = 'penguin-training'
os.makedirs(training_folder, exist_ok=True)

# Copy the data file into the experiment folder
shutil.copy('data/penguin-data.csv', os.path.join(training_folder, "penguin-data.csv"))

Now we'll create a Python training script, using more or less the same code we used previously, but with the addition of a few Azure Machine Learning commands to track details of the experiment.

In [None]:
%%writefile $training_folder/train-penguins.py
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
import numpy as np
import matplotlib.pyplot as plt
import joblib
import os

# Import Azure ML run
from azureml.core import Run

# Get the experiment run context
run = Run.get_context()

# load the training dataset
print('Loading data...')
penguins = pd.read_csv('penguin-data.csv')
# Drop rows containing NaN values
penguins=penguins.dropna()
# Separate features and labels
penguin_features = ['CulmenLength','CulmenDepth','FlipperLength','BodyMass']
penguin_label = 'Species'
penguins_X, penguins_y = penguins[penguin_features].values, penguins[penguin_label].values
# Split data 70%-30% into training set and test set
x_penguin_train, x_penguin_test, y_penguin_train, y_penguin_test = train_test_split(penguins_X, penguins_y,
                                                                                    test_size=0.30,
                                                                                    random_state=0,
                                                                                    stratify=penguins_y)
# train a logistic regression model on the training set
print('Training model...')
reg = 0.1
model = LogisticRegression(C=1/reg, solver='lbfgs', max_iter=10000).fit(x_penguin_train, y_penguin_train)
# Evaluate the model
print('Evaluating model')
penguin_predictions = model.predict(x_penguin_test)
# Log metrics
# (np.float was removed in NumPy 1.24, so use the built-in float)
run.log("Accuracy", float(accuracy_score(y_penguin_test, penguin_predictions)))
run.log("Precision", float(precision_score(y_penguin_test, penguin_predictions, average='macro')))
run.log("Recall", float(recall_score(y_penguin_test, penguin_predictions, average='macro')))
# Log the confusion matrix
c_matrix = confusion_matrix(y_penguin_test, penguin_predictions)
fig = plt.figure(figsize=(8, 8))
plt.imshow(c_matrix, interpolation="nearest", cmap=plt.cm.Blues)
plt.colorbar()
penguin_classes = ['Adelie', 'Gentoo', 'Chinstrap']
tick_marks = np.arange(len(penguin_classes))
plt.xticks(tick_marks, penguin_classes, rotation=45)
plt.yticks(tick_marks, penguin_classes)
plt.xlabel("Predicted Species")
plt.ylabel("Actual Species")
plt.show()
run.log_image(name = 'Confusion Matrix', plot = fig)
# Save the trained model in the outputs folder
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/penguin-model.pkl')
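
The last step of the script saves the model with `joblib`. As a quick local check of that idea (separate from the training script), this sketch dumps a model to a temporary folder, loads it back, and confirms the predictions match. The data, the `toy-model.pkl` file name, and the variable names are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up, well-separated data: 3 classes, 4 features
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)
features = rng.normal(size=(60, 4)) + labels[:, None] * 3

toy_model = LogisticRegression(max_iter=10000).fit(features, labels)

# Save the model to a temporary folder and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_file = os.path.join(tmp_dir, 'toy-model.pkl')
    joblib.dump(value=toy_model, filename=model_file)
    restored_model = joblib.load(model_file)

# The restored model should predict exactly what the original does
same_predictions = np.array_equal(toy_model.predict(features),
                                  restored_model.predict(features))
print('Predictions match after reload:', same_predictions)
```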


To run this experiment, we need to connect to an Azure Machine Learning workspace.

In [None]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

And now we can run the experiment - note that I'm using Python code to submit an experiment in which more Python code is run. The real magic here though is that I can ask Azure Machine Learning to run the experiment script on remote compute, creating a dedicated Python environment that includes the packages I need, and taking advantage of cloud-scale compute resources that I only pay for while I'm using them.

In [None]:
from azureml.core import Environment, Experiment, ScriptRunConfig
from azureml.core.conda_dependencies import CondaDependencies
from azureml.widgets import RunDetails

# Create a Python environment for the experiment
python_env = Environment("penguin-env")
packages = CondaDependencies.create(conda_packages=['scikit-learn','ipykernel','matplotlib','pandas','pip'],
                                    pip_packages=['azureml-sdk','pyarrow'])
python_env.python.conda_dependencies = packages

# Create a script config
script_config = ScriptRunConfig(source_directory=training_folder,
                                script='train-penguins.py',
                                environment=python_env) 

# submit the experiment
experiment_name = 'train-penguins'
experiment = Experiment(workspace=ws, name=experiment_name)
run = experiment.submit(config=script_config)
RunDetails(run).show()
run.wait_for_completion()

After the experiment has finished, I can go and view its details in [Azure Machine Learning Studio](https://ml.azure.com).

One of the outputs from the experiment was a trained model, which I can register in my workspace.

In [None]:
from azureml.core import Model

# Register the model
run.register_model(model_path='outputs/penguin-model.pkl', model_name='penguin_model',
                   tags={'Training context':'Ignite Demo'}, properties={'Accuracy': run.get_metrics()['Accuracy']})

With the model registered, I can deploy it to a web service so that software developers can use it in their applications.

(there's code to do that here, but I've already done it to save some time)

In [None]:
import os
from azureml.core.conda_dependencies import CondaDependencies 

folder_name = 'penguin_service'

# Create a folder for the web service files
service_folder = './' + folder_name
os.makedirs(service_folder, exist_ok=True)

print(folder_name, 'folder created.')

# Add the dependencies for our model (AzureML defaults are already included)
myenv = CondaDependencies()
myenv.add_conda_package('scikit-learn')

# Save the environment config as a .yml file
env_file = os.path.join(service_folder,"penguin_env.yml")
with open(env_file,"w") as f:
    f.write(myenv.serialize_to_string())
print("Saved dependency info in", env_file)

# Print the .yml file
with open(env_file,"r") as f:
    print(f.read())


# Set path for scoring script
script_file = os.path.join(service_folder,"score_penguins.py")

In [None]:
%%writefile $script_file
import json
import joblib
import numpy as np
from azureml.core.model import Model

# Called when the service is loaded
def init():
    global model
    # Get the path to the deployed model file and load it
    model_path = Model.get_model_path('penguin_model')
    model = joblib.load(model_path)

# Called when a request is received
def run(raw_data):
    # Get the input data as a numpy array
    data = np.array(json.loads(raw_data)['data'])
    # Get a prediction from the model
    predictions = model.predict(data)
    return predictions.tolist()
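
Before deploying, it can help to exercise the same `init()`/`run()` contract locally. The sketch below is a stand-in, not the deployed script: it assumes a model trained on made-up data in place of `Model.get_model_path` and `joblib.load`, which need a registered model, and it keeps the same JSON-in, list-out shape.

```python
import json

import numpy as np
from sklearn.linear_model import LogisticRegression


def init():
    global model
    # Assumption: train a stand-in model on made-up data instead of
    # loading the registered model from the workspace
    rng = np.random.default_rng(0)
    labels = np.repeat([0, 1, 2], 20)
    features = rng.normal(size=(60, 4)) + labels[:, None] * 3
    model = LogisticRegression(max_iter=10000).fit(features, labels)


def run(raw_data):
    # Same contract as the scoring script: a JSON string with a 'data' key
    data = np.array(json.loads(raw_data)['data'])
    predictions = model.predict(data)
    # Convert to a plain list so the result can be serialized as JSON
    return predictions.tolist()


init()
request_body = json.dumps({"data": [[0.1, 0.2, -0.1, 0.0],
                                    [6.1, 5.9, 6.2, 6.0]]})
result = run(request_body)
print(result)
```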

In [None]:
from azureml.core.webservice import AciWebservice
from azureml.core.model import InferenceConfig
from azureml.core import Model

model = ws.models['penguin_model']
print(model.name, 'version', model.version)

# Configure the scoring environment
inference_config = InferenceConfig(runtime= "python",
                                   entry_script=script_file,
                                   conda_file=env_file)

deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1)

service_name = "penguin-service"

service = Model.deploy(ws, service_name, [model], inference_config, deployment_config)

service.wait_for_deployment(True)
print(service.state)

The model is deployed to a web service, which has an HTTP endpoint that applications can use to consume it.

In [None]:
from azureml.core.webservice import Webservice

service = Webservice(workspace=ws, name='penguin-service')
endpoint = service.scoring_uri
print(endpoint)

Now that I have the endpoint, I can create a simple application that submits some observations of penguins, and gets back a predicted species for each observation.

In [None]:
import requests
import json

x_new = [[41.5,18.5,201,4000],
         [46.1,13.2,211,4500]]

# Convert the array to a serializable list in a JSON document
input_json = json.dumps({"data": x_new})

# Set the content type
headers = { 'Content-Type':'application/json' }

predictions = requests.post(endpoint, input_json, headers = headers)
predicted_classes = predictions.json()
penguin_classes = ['Adelie', 'Gentoo', 'Chinstrap']
for i in range(len(x_new)):
    print ("Penguin {}".format(x_new[i]), penguin_classes[predicted_classes[i]] )
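
A client application would probably want to guard against unexpected responses before indexing into the class names. Here's a minimal sketch of that, assuming the service returns a JSON list of class IDs as the scoring script does. `label_predictions` is a hypothetical helper, and `fake_response` stands in for the text of a real response. With `requests`, calling `raise_for_status()` on the response first is one way to surface HTTP errors.

```python
import json


def label_predictions(response_text, class_names):
    """Map a JSON list of class IDs to their class names."""
    predicted = json.loads(response_text)
    # An error from the service may arrive as a string or object, not a list
    if not isinstance(predicted, list):
        raise ValueError('Unexpected response: {!r}'.format(predicted))
    return [class_names[class_id] for class_id in predicted]


penguin_classes = ['Adelie', 'Gentoo', 'Chinstrap']
fake_response = json.dumps([0, 1])  # made-up stand-in for a service response
labels = label_predictions(fake_response, penguin_classes)
print(labels)
```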

## Learn More

All of the topics discussed in this session, and more, are covered in the Microsoft Azure Data Scientist certification track. You can learn more about the free learning options, and how to get certified at [https://docs.microsoft.com/learn/certifications/azure-data-scientist/](https://docs.microsoft.com/learn/certifications/azure-data-scientist/)