<h1>Fast Experimentation in Amazon SageMaker Studio Notebooks</h1>

In this notebook, we demonstrate how to train a machine learning model using SageMaker Studio and familiar libraries such as pandas and scikit-learn. We also show how to experiment quickly and track your experiments using SageMaker Studio capabilities.

We will be using the "AI4I 2020 Predictive Maintenance Dataset" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset). The dataset contains information about machines, which we will use to train a model that predicts whether a machine will fail (binary classification).

## Environment set up 

Let's start with the initial setup steps.

In [None]:
!pip install xgboost

In [None]:
!pip install sagemaker-experiments

In [None]:
import sagemaker
import sys
import IPython

In [None]:
import sagemaker, pandas, numpy
print(sagemaker.__version__)
print(pandas.__version__)
print(numpy.__version__)

Next, we retrieve the default Amazon S3 bucket for storing training data and the IAM role that provides the required permissions.

In [None]:
import boto3
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.Session()
bucket_name = sagemaker_session.default_bucket()
prefix = 'sm-fast-iteration'

print(region)
print(role)
print(bucket_name)

Now let's download the dataset

In [None]:
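import urllib.request  # import the submodule explicitly so urlretrieve is available
import os

data_dir = '/opt/ml/data'
# Create the local data directory if it does not exist yet.
os.makedirs(data_dir, exist_ok=True)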
file_path = os.path.join(data_dir, 'predmain_raw_data_header.csv')
dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00601/ai4i2020.csv"
urllib.request.urlretrieve(dataset_url, file_path)

We can also optionally upload our data to the Amazon S3 bucket we retrieved earlier so that other AWS Services and notebooks have access to the data.

In [None]:
raw_data_key = '{0}/data/raw'.format(prefix)
s3_raw_data = sagemaker_session.upload_data(file_path, bucket_name, key_prefix=raw_data_key)

## Data Preprocessing & Feature Engineering

### Data Exploration

Let's take a look at the shape of our dataset

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv(file_path)

print('The shape of the dataset is:', df.shape)

Now we know how many samples we have. Next, let's take a look at the records we have by printing the first 8 rows.

In [None]:
df.head(8)

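We also want to look at summary statistics for the numeric columns. The `count` row shows how many non-missing values each column has; the data type of every column is listed by `df.info()` further below.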

In [None]:
df.describe()

Let's see which values the "Machine failure" field takes and how often each occurs across the dataset.

In [None]:
df['Machine failure'].value_counts()

In [None]:
import matplotlib.pyplot as plt

df['Machine failure'].value_counts().plot.bar()
plt.show()

We have discovered that the dataset is quite imbalanced; however, we are not going to try to balance it at this point.
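
To put a number on the imbalance, one can compare the class counts directly. The sketch below uses a small made-up label column (not the AI4I data). It also computes the negative-to-positive ratio, which is a common starting value for XGBoost's `scale_pos_weight` parameter should you decide to reweight the classes later.

```python
import pandas as pd

# Toy labels for illustration only: 1 = failure, 0 = no failure.
labels = pd.Series([0] * 95 + [1] * 5, name="Machine failure")

counts = labels.value_counts()                # absolute class counts
shares = labels.value_counts(normalize=True)  # class proportions

# Ratio of negatives to positives, a common starting point for
# XGBoost's scale_pos_weight on an imbalanced problem.
neg_to_pos = counts.loc[0] / counts.loc[1]

print(counts.to_dict())
print(shares.round(2).to_dict())
print(neg_to_pos)
```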

In [None]:
import seaborn
import matplotlib.pyplot as plt

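# Work on a 10% sample to keep the pair plot fast.
df1 = df.sample(frac=0.1)
# Drop the identifier and the individual failure-mode flags, then keep numeric columns only.
df1 = df1.drop(['UDI', 'TWF', 'HDF', 'PWF', 'OSF', 'RNF'], axis=1).select_dtypes(include='number')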
df1.head()

In [None]:
df.info()

In [None]:
seaborn.pairplot(df1, hue='Machine failure', corner=True)
plt.show()

For the purpose of keeping the data exploration step short during the workshop, we are not going to execute additional queries. However, feel free to explore the dataset more if you have time.

<h2>Preprocessing and Feature Engineering</h2>

### Experiment set up

We will use Amazon SageMaker Experiments to track the experiments we run during training. To do so, we need to create an _experiment_ and a new _trial_ for that experiment. A trial is a collection of the steps involved in a single training job, such as preprocessing, training, and model evaluation. A trial also contains metadata for inputs (e.g. algorithm, parameters, data sets) and outputs (e.g. models, checkpoints, metrics). Each stage in a trial constitutes a trial component. To read more about SageMaker Experiments, see https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html
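
The relationship between experiments, trials, and trial components can be pictured as a simple nested structure. The sketch below is a plain-Python illustration of that hierarchy only; the class names and fields are hypothetical and are not the SageMaker SDK's own types.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrialComponent:
    """One stage of a trial, e.g. preprocessing or training (hypothetical type)."""
    name: str
    parameters: Dict[str, float] = field(default_factory=dict)
    metrics: Dict[str, float] = field(default_factory=dict)


@dataclass
class Trial:
    """One end-to-end run, made up of ordered trial components (hypothetical type)."""
    name: str
    components: List[TrialComponent] = field(default_factory=list)


@dataclass
class Experiment:
    """A group of related trials that can be compared (hypothetical type)."""
    name: str
    trials: List[Trial] = field(default_factory=list)


exp = Experiment("sm-fast-iteration-exp")
trial = Trial("exp-tracking-trial-xgboost")
trial.components.append(TrialComponent("preprocess", parameters={"ratio": 0.1}))
trial.components.append(TrialComponent("xgboost", parameters={"eta": 0.3}))
exp.trials.append(trial)

# Walk the hierarchy: experiment -> trial -> trial component.
for t in exp.trials:
    for c in t.components:
        print(exp.name, "/", t.name, "/", c.name, c.parameters)
```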

Let's import our utility script, which will help us easily manage our experiments.

In [None]:
import sys
sys.path.append("source_dir")
from experimentutils import *

We begin with creating an experiment, or loading one if it already exists.

In [None]:
experiment_name = createExperiment("sm-fast-iteration-exp", "ML development and fast iteration with SageMaker")

From now on, we will use the above experiment to start tracking our processing and training trials. Let's create a new trial and associate it with our experiment.

In [None]:
trial_name = createTrial(experiment_name, "exp-tracking-trial-xgboost",prefix)
print(trial_name)

### Data Processing

We are now ready to continue with the data processing and feature engineering tasks. We will one-hot encode some of the categorical columns and fill in some NaN values based on domain knowledge. Once the scikit-learn fit() and transform() steps are done, we split our dataset into training and validation sets and save the outputs to Amazon S3. We will capture this step as the first trial component of our trial. For more details on the CreateTrialComponent API call, see https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrialComponent.html

In [None]:
output_path = "/opt/ml/processing"
model_path = "/opt/ml/model"

In [None]:
ratio = 0.1
%run -i source_dir/preprocessor.py --train-val-split-ratio $ratio --file-path $file_path --output-path $output_path --model-path $model_path  --s3-prefix $prefix

# Record the split ratio as a numeric parameter of this trial component.
parameters = {'ratio': {'NumberValue': ratio}}
process_trial_comp = createTrialComponent(
    trial_name, "trial-comp-preprocess", prefix, file_path,
    train_features_output_path, val_features_output_path, model_joblib_path, parameters
)

Let's also take a look at our processed training dataset.

In [None]:
import pandas as pd
df = pd.read_csv(train_features_output_path)
df.head(10)

We can see that the categorical variables have been one-hot encoded, and you can verify that no NaN values remain, as expected.
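
For readers who want to see the kind of transformation involved, here is a pandas-only sketch on a made-up frame. It is not the code in `preprocessor.py`; the column names, the median fill value, and the split logic are all assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows for illustration; column names and values are assumptions.
raw = pd.DataFrame({
    "Type": ["L", "M", "H", "L", "M", "L", "H", "M", "L", "M"],
    "Torque": [40.1, None, 38.5, 42.0, 39.9, None, 41.2, 40.0, 37.8, 43.3],
    "Machine failure": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})

# Fill missing numeric values (here with the column median).
filled = raw.fillna({"Torque": raw["Torque"].median()})

# One-hot encode the categorical column into Type_H / Type_L / Type_M.
encoded = pd.get_dummies(filled, columns=["Type"], dtype=int)

# Hold out a fraction of rows for validation; the rest is used for training.
ratio = 0.2
val = encoded.sample(frac=ratio, random_state=0)
train = encoded.drop(val.index)

print(encoded.columns.tolist())
print(len(train), len(val))
print(int(encoded.isna().sum().sum()))  # number of remaining NaN values
```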


### Experiment Analytics

We can visualize the experiment analytics either from the Amazon SageMaker Studio Experiments plug-in or with the SageMaker Python SDK, as shown below.

In [None]:
from sagemaker.analytics import ExperimentAnalytics
experiment = ExperimentAnalytics(experiment_name=experiment_name)
experiment.dataframe()

## Model Training

In this part, we will use xgboost to train a simple binary classification model, using the pre-processed data generated in the previous step by the processing job. We will create a new trial component each time we start the training and will record the hyperparameter values and the results.

In [None]:
eta = 0.3
%run -i source_dir/xgboost_training.py --eta $eta
# Record the learning rate as a numeric parameter of this trial component.
parameters = {'eta': {'NumberValue': eta}}
training_trial_comp = createTrialComponent(
    trial_name, "trial-comp-xgboost", prefix, file_path,
    train_features_output_path, val_features_output_path, model_path, parameters
)
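
When recording more than one hyperparameter, writing the nested `NumberValue` dictionaries by hand gets tedious. Below is a minimal sketch of a helper that wraps plain values in the `NumberValue` / `StringValue` shape used by the CreateTrialComponent API. The helper name `to_trial_parameters` is hypothetical, and whether the `createTrialComponent` utility used in this notebook accepts string parameters is an assumption.

```python
from numbers import Number


def to_trial_parameters(params):
    """Wrap plain values in the NumberValue / StringValue shape (hypothetical helper)."""
    wrapped = {}
    for name, value in params.items():
        # bool is a Number subclass in Python, so record it as a string instead.
        if isinstance(value, Number) and not isinstance(value, bool):
            wrapped[name] = {"NumberValue": value}
        else:
            wrapped[name] = {"StringValue": str(value)}
    return wrapped


# Hypothetical set of learning rates one might try in later runs.
for candidate_eta in (0.1, 0.3, 0.5):
    print(to_trial_parameters({"eta": candidate_eta, "objective": "binary:logistic"}))
```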


### Experiment analytics

Again, you can visualize your latest experiment analytics either from Amazon SageMaker Studio Experiments plug-in or using the SDK from a notebook

In [None]:
from sagemaker.analytics import ExperimentAnalytics
experiment = ExperimentAnalytics(experiment_name=experiment_name)
experiment.dataframe()

### Using your model to generate predictions

Let's now use our model for inference.

In [None]:
import numpy as np
import xgboost

df_test_features = pd.read_csv(test_features_output_path, header=None)
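# Labels must come from the test split so that they line up with the test features.
# NOTE: test_labels_output_path is assumed to be defined by the scripts run above.
df_test_labels = pd.read_csv(test_labels_output_path, header=None)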
test_X = df_test_features.values
test_y = df_test_labels.values.reshape(-1)
dtest = xgboost.DMatrix(test_X, label=test_y)

model_xgb_trial = xgboost.Booster()
model_xgb_trial.load_model(model_path)
test_predictions = model_xgb_trial.predict(dtest)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Round the predicted probabilities to 0/1 class labels.
rounded_predict = np.round(test_predictions)

print("===Metrics for Test Set===")
print('')
print(pd.crosstab(index=test_y, columns=rounded_predict,
                  rownames=['Actuals'],
                  colnames=['Predictions'],
                  margins=True))
print('')

accuracy = accuracy_score(test_y, rounded_predict)
precision = precision_score(test_y, rounded_predict)
recall = recall_score(test_y, rounded_predict)
print('')

print("Accuracy Model A: %.2f%%" % (accuracy * 100.0))
print("Precision Model A: %.2f" % (precision))
print("Recall Model A: %.2f" % (recall))

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(test_y, test_predictions)
print("AUC A: %.2f" % (auc))
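
On an imbalanced dataset, accuracy alone can be misleading, which is why precision and recall are reported as well. The following sketch computes the three metrics from confusion-matrix counts on made-up labels (not the model's actual predictions) and compares a classifier that never predicts a failure with one that catches most failures.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # failures correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed failures
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy machines left alone
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy data: 100 machines, 5 of which fail.
y_true = [1] * 5 + [0] * 95

# A classifier that never predicts a failure: high accuracy, zero recall.
always_ok = [0] * 100
print(binary_metrics(y_true, always_ok))

# A classifier that catches 4 of the 5 failures and raises 2 false alarms.
model_pred = [1, 1, 1, 1, 0] + [1, 1] + [0] * 93
print(binary_metrics(y_true, model_pred))
```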


SageMaker Experiments supports common chart types for visualizing model training results, so we can log these metrics to our experiment.

In [None]:
import smexperiments
from smexperiments.tracker import Tracker
 
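with Tracker.load(trial_component_name=training_trial_comp) as tracker:
    # Precision-recall and ROC curves are built from predicted probabilities,
    # while the confusion matrix needs the rounded class labels.
    tracker.log_precision_recall(test_y, test_predictions)
    tracker.log_roc_curve(test_y, test_predictions)
    tracker.log_confusion_matrix(test_y, rounded_predict)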
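
An ROC curve is traced by sweeping a threshold over continuous scores, so it carries the most information when it is built from predicted probabilities rather than already-rounded labels. The sketch below illustrates this on made-up scores with a small pure-Python AUC function (a pairwise-ranking formulation, written here for illustration rather than taken from any library).

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities.
y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.4, 0.3, 0.9]
rounded = [round(s) for s in scores]  # collapses the scores to 0/1

print(roc_auc(y_true, scores))   # ranking information preserved
print(roc_auc(y_true, rounded))  # ties introduced by rounding change the result
```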

### Clean up step (Optional)

In [None]:
#  cleanup('ENTER_YOUR_EXPERIMENT_HERE')