# MLOps workshop with Amazon SageMaker

In this workshop we will demonstrate a journey to cloud-native machine learning, starting from a more traditional approach to model development and training directly in Jupyter notebooks and moving to managed data transformation and training with Amazon SageMaker. Once we have learned the concepts of cloud-based processing and training, we will build fully automated pipelines with SageMaker Pipelines.

## Module 01: Transform the data and train a model inside a Jupyter notebook

In this first notebook we will predict house prices based on the well-known [California housing dataset](http://lib.stat.cmu.edu/datasets/) with a simple regression model in TensorFlow 2. This public dataset contains 8 features describing the housing stock of census block groups in California, plus the target (median house value). Features include location (longitude and latitude), median house age, total rooms and bedrooms, population, number of households, and median income.

To begin, we'll import some necessary packages and set up directories for training and test data.  We'll also set up a SageMaker Session to perform various operations, and specify an Amazon S3 bucket to hold input data and output.  The default bucket used here is created by SageMaker if it doesn't already exist, and named in accordance with the AWS account ID and AWS Region.  

In [None]:
!pip install matplotlib seaborn scikit-learn -q

In [None]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import glob
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import sklearn.model_selection
from sklearn.preprocessing import StandardScaler

In [None]:
data_dir = os.path.join(os.getcwd(), 'data')
os.makedirs(data_dir, exist_ok=True)

train_dir = os.path.join(os.getcwd(), 'data/train')
os.makedirs(train_dir, exist_ok=True)

test_dir = os.path.join(os.getcwd(), 'data/test')
os.makedirs(test_dir, exist_ok=True)

raw_dir = os.path.join(os.getcwd(), 'data/raw')
os.makedirs(raw_dir, exist_ok=True)

batch_dir = os.path.join(os.getcwd(), 'data/batch')
os.makedirs(batch_dir, exist_ok=True)

## Exploratory Data Analysis (EDA)

According to the [State of Data Science 2020](https://www.anaconda.com/state-of-data-science-2020) survey, data management, exploratory data analysis (EDA), feature selection, and feature engineering account for more than 66% of a data scientist’s time.

Exploratory Data Analysis is an approach to analyzing data sets that summarizes their main characteristics, often using statistical graphics and other data visualization methods.
EDA helps data science professionals in several ways:

- Getting a better understanding of data.
- Identifying various data patterns.
- Getting a better understanding of the problem statement.

Numerical EDA gives you basic but important information, such as the names and data types of the columns and the dimensions of the DataFrame.
Visual EDA, on the other hand, gives you insight into how each feature is distributed and how the features relate to the target.
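
As a minimal sketch of the numerical side, the snippet below inspects a small toy DataFrame. The three rows are illustrative only and the variable names (`toy`, `mean_age`) are assumptions made for this example, not part of the workshop code.

```python
import pandas as pd

# A few illustrative rows using three of the notebook's column names.
toy = pd.DataFrame(
    {
        "housingMedianAge": [41.0, 21.0, 52.0],
        "medianIncome": [8.3252, 8.3014, 7.2574],
        "medianHouseValue": [452600.0, 358500.0, 352100.0],
    }
)

# Numerical EDA: dimensions, column names, and data types.
n_rows, n_cols = toy.shape
print("rows:", n_rows, "columns:", n_cols)
print(toy.dtypes)

# A summary statistic for a single column.
mean_age = toy["housingMedianAge"].mean()
print("mean age:", mean_age)
```

The same calls work unchanged on the full DataFrame once it is loaded.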

First we'll load the California Housing dataset and explore the data.

## Download California Housing dataset

For this workshop, we will use the California housing dataset.

More info on the dataset:

- This dataset was obtained from the StatLib repository. http://lib.stat.cmu.edu/datasets/
- The target variable is the median house value for California districts.
- This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

In [None]:
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/california_housing/cal_housing.tgz .

In [None]:
!tar -zxf cal_housing.tgz 2>/dev/null

In [None]:
columns = [
    "longitude",
    "latitude",
    "housingMedianAge",
    "totalRooms",
    "totalBedrooms",
    "population",
    "households",
    "medianIncome",
    "medianHouseValue",
]
df = pd.read_csv("CaliforniaHousing/cal_housing.data", names=columns, header=None)

In [None]:
df.head()

### Numerical EDA

Check how big the dataset is, how many features it has and of what type, and which column is the target.

In [None]:
df.info()

Each record in the dataset has 9 attributes:

1. longitude - block group longitude
2. latitude - block group latitude
3. housingMedianAge - median house age in block group
4. totalRooms - total number of rooms in block group
5. totalBedrooms - total number of bedrooms in block group
6. population - block group population
7. households - number of households in block group
8. medianIncome - median income in block group
9. medianHouseValue - median value of owner-occupied homes
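
If per-household figures are wanted, they can be derived from the block-group totals. The sketch below assumes toy values and hypothetical derived column names (`roomsPerHousehold`, `bedroomsPerRoom`, `peoplePerHousehold`) that do not appear in the workshop notebook.

```python
import pandas as pd

# Two made-up block groups using the notebook's column names.
toy = pd.DataFrame(
    {
        "totalRooms": [880.0, 7099.0],
        "totalBedrooms": [129.0, 1106.0],
        "population": [322.0, 2401.0],
        "households": [126.0, 1138.0],
    }
)

# Hypothetical derived columns (not part of the original notebook).
toy["roomsPerHousehold"] = toy["totalRooms"] / toy["households"]
toy["bedroomsPerRoom"] = toy["totalBedrooms"] / toy["totalRooms"]
toy["peoplePerHousehold"] = toy["population"] / toy["households"]

print(toy[["roomsPerHousehold", "bedroomsPerRoom", "peoplePerHousehold"]].round(2))
```

Ratios like these are often easier to compare across block groups of very different sizes than the raw totals.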

Now, let's compute summary statistics to see how each column is distributed.

In [None]:
df.describe()

From the output of `df.describe()` we can see that the mean housing median age is about 28.6 years.

#### Analyze median house age in block group

We will use `df.value_counts`, which counts how many rows share each distinct value of the feature `housingMedianAge`. This shows how house ages are distributed in our dataset.

In [None]:
df.value_counts("housingMedianAge", sort=True)
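
To make the output easier to interpret, here is a small sketch of the same idea on a toy Series. The ages are made up, and `by_frequency` and `by_age` are names chosen for this example.

```python
import pandas as pd

# Made-up ages; the real column has one value per block group.
ages = pd.Series([52.0, 36.0, 52.0, 17.0, 36.0, 52.0], name="housingMedianAge")

# Number of rows per distinct age, most frequent first (the default ordering).
by_frequency = ages.value_counts()

# The same counts ordered by age, which reads more naturally as a distribution.
by_age = by_frequency.sort_index()

print(by_frequency)
print(by_age)
```

Sorting by index instead of by count is a design choice: it trades "which age is most common" for "how do counts change with age".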

### Visual EDA

Let's begin exploring the data visually.
We will plot a histogram of each feature, using 50 bins per plot.

In [None]:
import matplotlib.pyplot as plt
df.hist(bins=50, figsize=(20, 15))
plt.show()
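
As a rough illustration of what `bins=50` means, the sketch below bins synthetic values with NumPy directly. The data is randomly generated for this example and does not come from the housing dataset.

```python
import numpy as np

# Synthetic values standing in for one numeric column.
rng = np.random.default_rng(0)
values = rng.normal(loc=30.0, scale=10.0, size=1000)

# bins=50 splits the range [min, max] into 50 equal-width intervals.
counts, edges = np.histogram(values, bins=50)

print("number of bins:", len(counts))
print("number of bin edges:", len(edges))
print("bin width:", edges[1] - edges[0])
```

Each bar in a histogram corresponds to one entry of `counts`, and 50 bins always have 51 edges.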

Let's focus on the `medianHouseValue` column, which is our target.

In [None]:
plt.figure(figsize=(16,7))

df['medianHouseValue'].hist(bins=100)
plt.xlabel("Median House Value", fontsize=14)
plt.ylabel("Block groups", fontsize=13)
plt.xticks(rotation=0)
plt.title("Median House Value across the state of California (CA)", fontsize=15)
plt.show() 

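We can see a pronounced spike at the top of the range rather than a smooth tail. A spike like this may indicate that values were capped at an upper limit rather than that a single outlier exists, so it is worth exploring further. We will stop the EDA here to leave time for the rest of the workshop.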
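
One way to explore the top of the range further, sketched here on made-up values, is to count how many rows sit exactly at the maximum. The numbers and the names `top`, `at_top`, and `share_at_top` are assumptions for this example.

```python
import pandas as pd

# Made-up values; 500001.0 plays the role of the largest value in the column.
values = pd.Series([120000.0, 250000.0, 500001.0, 340000.0, 500001.0, 500001.0])

top = values.max()
at_top = (values == top).sum()
share_at_top = at_top / len(values)

print("max value:", top)
print("rows at max:", at_top, "share:", round(share_at_top, 3))
```

A large share of rows at exactly the same maximum would be consistent with a capped variable; a single row would look more like a true outlier.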

## Dataset transformation

Next, we'll transform the dataset. In a typical SageMaker workflow, notebooks are only used for prototyping and can be run on relatively inexpensive and less powerful instances, while processing, training and model hosting tasks are run on separate, more powerful SageMaker-managed instances. 

We'll now save the raw feature data, and also save the labels for training and testing.

In [None]:
X = df[['longitude','latitude','housingMedianAge','totalRooms','totalBedrooms','population','households','medianIncome']]
Y = df[['medianHouseValue']]

In [None]:
print("Features:", list(X.columns))
print("Dataset shape:", X.shape)
print("Dataset Type:", type(X))
print("Label set shape:", Y.shape)
print("Label set Type:", type(Y))

# We partition the dataset into 2/3 training and 1/3 test set.
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, Y, test_size=0.33)

np.save(os.path.join(raw_dir, 'x_train.npy'), x_train)
np.save(os.path.join(raw_dir, 'x_test.npy'), x_test)
np.save(os.path.join(raw_dir, 'y_train.npy'), y_train)
np.save(os.path.join(raw_dir, 'y_test.npy'), y_test)
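
Because `train_test_split` shuffles the rows, each run produces a different partition unless a seed is fixed. The sketch below, on synthetic arrays, assumes an arbitrary seed of 42 passed through the `random_state` parameter; the workshop code itself does not set one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels with the same column counts as the notebook.
X_demo = np.arange(100 * 8, dtype=float).reshape(100, 8)
y_demo = np.arange(100, dtype=float).reshape(100, 1)

# random_state=42 is an arbitrary seed chosen for this sketch.
split_a = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

xa_train, xa_test, ya_train, ya_test = split_a
print("train rows:", xa_train.shape[0], "test rows:", xa_test.shape[0])

# The same seed yields the same partition on every call.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same)
```

A fixed seed makes later metrics comparable between runs, at the cost of always evaluating on the same rows.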

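Next, we'll preprocess the data: we fit a `StandardScaler` on the training features only, then apply it to both the training and test features. The labels are saved unchanged.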

In [None]:
scaler = StandardScaler()
x_train = np.load(os.path.join(raw_dir, 'x_train.npy'))
scaler.fit(x_train)
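
To show what the fitted scaler does, here is a NumPy-only sketch of standardization on synthetic data. It mirrors the calculation `StandardScaler` performs with its default settings (per-column mean and population standard deviation), but it is an illustration, not the library's implementation.

```python
import numpy as np

# Synthetic two-column data with very different scales.
rng = np.random.default_rng(0)
train = rng.normal(loc=[5.0, 100.0], scale=[2.0, 30.0], size=(200, 2))
test = rng.normal(loc=[5.0, 100.0], scale=[2.0, 30.0], size=(50, 2))

# Fit: per-column mean and population standard deviation (ddof=0),
# computed from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)

# Transform: apply the training statistics to both splits.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print("train mean after scaling:", train_scaled.mean(axis=0).round(6))
print("test mean after scaling:", test_scaled.mean(axis=0).round(3))
```

Using only training statistics keeps information about the test set out of preprocessing, which is why the test columns end up close to, but not exactly at, zero mean.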

In [None]:
input_files = glob.glob('{}/raw/*.npy'.format(data_dir))
print('\nINPUT FILE LIST: \n{}\n'.format(input_files))
for file in input_files:
    raw = np.load(file)
    # match on the file name only, so directory names in the path cannot cause false matches
    name = os.path.basename(file)
    # only transform feature columns
    if 'y_' not in name:
        transformed = scaler.transform(raw)
    if 'train' in name:
        if 'y_' in name:
            output_path = os.path.join(train_dir, 'y_train.npy')
            np.save(output_path, raw)
            print('Saved labeled training data in {}\n'.format(output_path))
        else:
            output_path = os.path.join(train_dir, 'x_train.npy')
            np.save(output_path, transformed)
            print('Saved transformed training data in {}\n'.format(output_path))
    else:
        if 'y_' in name:
            output_path = os.path.join(test_dir, 'y_test.npy')
            np.save(output_path, raw)
            print('Saved labeled test data in {}\n'.format(output_path))
        else:
            output_path = os.path.join(test_dir, 'x_test.npy')
            np.save(output_path, transformed)
            print('Saved transformed test data in {}\n'.format(output_path))

## Training

Now that we've prepared a dataset, we can move on to model training.

In [None]:
import numpy as np
import os
import tensorflow as tf

tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)


def get_train_data(train_dir):
    x_train = np.load(os.path.join(train_dir, 'x_train.npy'))
    y_train = np.load(os.path.join(train_dir, 'y_train.npy'))
    print('x train', x_train.shape,'y train', y_train.shape)

    return x_train, y_train


def get_test_data(test_dir):
    x_test = np.load(os.path.join(test_dir, 'x_test.npy'))
    y_test = np.load(os.path.join(test_dir, 'y_test.npy'))
    print('x test', x_test.shape,'y test', y_test.shape)

    return x_test, y_test


def get_model():
    inputs = tf.keras.Input(shape=(8,))
    hidden_1 = tf.keras.layers.Dense(8, activation='tanh')(inputs)
    hidden_2 = tf.keras.layers.Dense(4, activation='sigmoid')(hidden_1)
    outputs = tf.keras.layers.Dense(1)(hidden_2)
    return tf.keras.Model(inputs=inputs, outputs=outputs)
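
To get a feel for the size of this network, the trainable parameters can be counted by hand. The sketch below uses a hypothetical helper, `dense_params`, written for this example; it relies only on the layer sizes in `get_model()` and does not need TensorFlow.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Layer sizes taken from get_model(): 8 inputs -> 8 -> 4 -> 1.
layer_sizes = [8, 8, 4, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print("per layer:", per_layer)               # [72, 36, 5]
print("total trainable parameters:", total)  # 113
```

With only 113 parameters, the model trains quickly on a CPU, which suits an in-notebook prototype.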


In [None]:
x_train, y_train = get_train_data(train_dir)
x_test, y_test = get_test_data(test_dir)

device = '/cpu:0'
print(device)
batch_size = 128
epochs = 10
learning_rate = 0.01
print('batch_size = {}, epochs = {}, learning rate = {}'.format(batch_size, epochs, learning_rate))

with tf.device(device):
    model = get_model()
    optimizer = tf.keras.optimizers.SGD(learning_rate)
    model.compile(optimizer=optimizer, loss='mse')
    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
              validation_data=(x_test, y_test))

    # evaluate on test set
    scores = model.evaluate(x_test, y_test, batch_size, verbose=2)
    print("\nTest MSE :", scores)
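
Because the labels are house values in dollars, the reported MSE is in dollars squared and can look very large. The sketch below, on made-up numbers, shows how the root of the MSE (RMSE) brings the error back to dollars; the arrays are assumptions for this example, not model output.

```python
import numpy as np

# Made-up true and predicted house values in dollars.
y_true = np.array([200000.0, 350000.0, 120000.0, 480000.0])
y_pred = np.array([210000.0, 330000.0, 150000.0, 450000.0])

mse = np.mean((y_true - y_pred) ** 2)  # dollars squared
rmse = np.sqrt(mse)                    # dollars

print("MSE :", mse)
print("RMSE:", rmse)
```

The same conversion can be applied to the `scores` value printed above by taking its square root.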


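Next, we save the trained model. Assuming a TensorFlow 2 release in which `model.save` writes the SavedModel format for a path without a file extension, the saved directory should include the assets required by TensorFlow Serving to load and serve the model, including a .pb file: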

In [None]:
model.save('model/1')

Let's inspect the model files.

In [None]:
!ls -R model
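
Where shell commands are not available, a similar listing can be produced in Python. This is a minimal sketch: `list_files` is a hypothetical helper, and the demo runs on a temporary directory whose file names only imitate a saved model layout for illustration.

```python
import os
import tempfile


def list_files(root):
    """Return all file paths under root, relative to root, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)


# Demo on a temporary directory with a made-up layout (not a real saved model).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "1", "variables"))
    for rel in ["1/saved_model.pb", "1/variables/variables.index"]:
        open(os.path.join(tmp, *rel.split("/")), "w").close()
    listing = list_files(tmp)

print(listing)
```

Calling `list_files('model')` after the save step would list the files actually written to disk.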

## Scoring the model

In [None]:
import os
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('model/1')

x_test = np.load(os.path.join(test_dir, 'x_test.npy'))
y_test = np.load(os.path.join(test_dir, 'y_test.npy'))
scores = model.evaluate(x_test, y_test, verbose=2)
print("\nTest MSE :", scores)
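
A test MSE is easier to judge next to a simple baseline. The sketch below, on made-up numbers, compares a hypothetical model's predictions with a baseline that always predicts the mean of the training labels; all arrays and names here are assumptions for this example.

```python
import numpy as np

y_train_demo = np.array([100000.0, 200000.0, 300000.0, 400000.0])
y_test_demo = np.array([150000.0, 250000.0, 350000.0])
y_pred_demo = np.array([160000.0, 240000.0, 330000.0])  # hypothetical model output

# Baseline: always predict the mean of the training labels.
baseline_pred = np.full_like(y_test_demo, y_train_demo.mean())

model_mse = np.mean((y_test_demo - y_pred_demo) ** 2)
baseline_mse = np.mean((y_test_demo - baseline_pred) ** 2)

print("model MSE   :", model_mse)
print("baseline MSE:", baseline_mse)
print("model beats baseline:", model_mse < baseline_mse)
```

If a trained model does not clearly beat this baseline, the training setup likely needs another look before the model is worth deploying.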