# Kubeflow Introduction

![Kubeflow Overview](img/kubeflow-overview.png)

# Kubeflow on AWS


### [Blog Post: Securing and Scaling Kubeflow on AWS](https://aws.amazon.com/blogs/opensource/enterprise-ready-kubeflow-securing-and-scaling-ai-and-machine-learning-pipelines-with-aws/)

[![Kubeflow on AWS](img/kubeflow-aws-blog-post.png)](https://aws.amazon.com/blogs/opensource/enterprise-ready-kubeflow-securing-and-scaling-ai-and-machine-learning-pipelines-with-aws/)

# Kubeflow Fairing Introduction

Kubeflow Fairing is a Python package that streamlines the process of `building`, `training`, and `deploying` machine learning (ML) models in a hybrid cloud environment. By using Kubeflow Fairing and adding a few lines of code, you can run your ML training job locally or in the cloud, directly from Python code or a Jupyter notebook. After your training job is complete, you can use Kubeflow Fairing to deploy your trained model as a prediction endpoint.


# How Does Kubeflow Fairing Work?

Kubeflow Fairing:

1. Packages your Jupyter notebook, Python function, or Python file as a Docker image.
2. Deploys and runs the training job on Kubeflow or AI Platform.
3. Deploys your trained model as a prediction endpoint on Kubeflow after the training job is complete.


# Goals of the Kubeflow Fairing Project

- Easily package ML training jobs: Enable ML practitioners to easily package their ML model training code, and their code’s dependencies, as a Docker image.
- Easily train ML models in a hybrid cloud environment: Provide a high-level API for training ML models to make it easy to run training jobs in the cloud, without needing to understand the underlying infrastructure.
- Streamline the process of deploying a trained model: Make it easy for ML practitioners to deploy trained ML models to a hybrid cloud environment.

In [1]:
!pip install kubeflow-fairing==0.7.1







You are using pip version 19.0.1, however version 20.2.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [None]:
# Restart the kernel so that the newly pip-installed libraries are picked up
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [1]:
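import boto3

# Query the EC2 instance metadata service for the availability zone (e.g. us-west-2a),
# then strip the trailing zone letter with sed to obtain the region (e.g. us-west-2).
AWS_REGION_AS_SLIST=!curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/'
# The `!` shell capture returns an IPython SList; `.s` joins it into a single string.
AWS_REGION = AWS_REGION_AS_SLIST.s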
print('Region: {}'.format(AWS_REGION))

AWS_ACCOUNT_ID=boto3.client('sts').get_caller_identity().get('Account')
print('Account ID: {}'.format(AWS_ACCOUNT_ID))

S3_BUCKET='sagemaker-{}-{}'.format(AWS_REGION, AWS_ACCOUNT_ID)
print('S3 Bucket: {}'.format(S3_BUCKET))

Region: us-west-2
Account ID: 032934710550
S3 Bucket: sagemaker-us-west-2-032934710550
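
The `sed` expression in the cell above obtains the region by removing the trailing zone letter from the availability zone name. A minimal pure-Python sketch of the same derivation follows. The helper names `region_from_availability_zone` and `default_bucket_name` are hypothetical (they are not part of boto3 or Kubeflow Fairing), the account ID is a placeholder, and the sketch assumes standard availability zone names such as `us-west-2a`.

```python
import re


def region_from_availability_zone(availability_zone):
    """Drop the trailing zone letter: 'us-west-2a' -> 'us-west-2'."""
    return re.sub(r"[a-z]$", "", availability_zone.strip())


def default_bucket_name(region, account_id):
    """Build the bucket name using the same pattern as the notebook cell."""
    return "sagemaker-{}-{}".format(region, account_id)


# Hypothetical example values; 123456789012 is a placeholder account ID.
region = region_from_availability_zone("us-west-2a")
print(region)
print(default_bucket_name(region, "123456789012"))
```

This form may be convenient outside a notebook, where the `!` shell capture syntax is not available.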


# Train in the Notebook

In [2]:
import os
import sys
from kubeflow import fairing
import tensorflow as tf
import numpy as np

def train():
    # Generate linear data:
    # 50 evenly spaced points ranging from 0 to 50
    x = np.linspace(0, 50, 50) 
    y = np.linspace(0, 50, 50) 

    # Adding noise to the random linear data 
    x += np.random.uniform(-4, 4, 50) 
    y += np.random.uniform(-4, 4, 50) 

    n = len(x) # Number of data points 

    X = tf.placeholder("float") 
    Y = tf.placeholder("float")
    W = tf.Variable(np.random.randn(), name = "W") 
    b = tf.Variable(np.random.randn(), name = "b") 
    learning_rate = 0.01
    training_epochs = 1000
    
    # Hypothesis 
    y_pred = tf.add(tf.multiply(X, W), b) 

    # Mean Squared Error Cost Function 
    cost = tf.reduce_sum(tf.pow(y_pred-Y, 2)) / (2 * n)

    # Gradient Descent Optimizer 
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost) 

    # Global Variables Initializer 
    init = tf.global_variables_initializer() 

    sess = tf.Session()
    sess.run(init) 
      
    # Iterating through all the epochs 
    for epoch in range(training_epochs): 
          
        # Feeding each data point into the optimizer using Feed Dictionary 
        for (_x, _y) in zip(x, y): 
            sess.run(optimizer, feed_dict = {X : _x, Y : _y}) 
          
        # Display the result every 50 epochs
        if (epoch + 1) % 50 == 0:
            # Calculate the cost over the full data set
            c = sess.run(cost, feed_dict = {X : x, Y : y}) 
            print("Epoch", (epoch + 1), ": cost =", c, "W =", sess.run(W), "b =", sess.run(b)) 
      
    # Storing necessary values to be used outside the Session 
    training_cost = sess.run(cost, feed_dict ={X: x, Y: y}) 
    weight = sess.run(W) 
    bias = sess.run(b) 

    print('Weight: ', weight, 'Bias: ', bias)
    
train()

[W 200823 00:26:43 deprecation:323] From /opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Colocations handled automatically by placer.
[W 200823 00:26:43 deprecation:323] From /opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use tf.cast instead.


Epoch 50 : cost = 6.9106646 W = 0.9939265 b = -1.3805451
Epoch 100 : cost = 6.729674 W = 0.9885039 b = -1.1166812
Epoch 150 : cost = 6.5867496 W = 0.9836412 b = -0.880068
Epoch 200 : cost = 6.474154 W = 0.9792808 b = -0.66789377
Epoch 250 : cost = 6.385714 W = 0.97537076 b = -0.4776305
Epoch 300 : cost = 6.316477 W = 0.9718645 b = -0.30701768
Epoch 350 : cost = 6.262489 W = 0.9687204 b = -0.15402502
Epoch 400 : cost = 6.2205887 W = 0.96590096 b = -0.01683316
Epoch 450 : cost = 6.18825 W = 0.9633727 b = 0.10618996
Epoch 500 : cost = 6.163463 W = 0.9611056 b = 0.21650802
Epoch 550 : cost = 6.1446204 W = 0.9590726 b = 0.31543303
Epoch 600 : cost = 6.1304455 W = 0.9572496 b = 0.40414116
Epoch 650 : cost = 6.119924 W = 0.9556148 b = 0.48368692
Epoch 700 : cost = 6.112249 W = 0.9541489 b = 0.55501795
Epoch 750 : cost = 6.106786 W = 0.95283437 b = 0.6189822
Epoch 800 : cost = 6.103023 W = 0.9516556 b = 0.67633915
Epoch 850 : cost = 6.1005616 W = 0.9505986 b = 0.72777486
Epoch 900 : cost = 6.0
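
The training cell relies on TensorFlow to compute gradients. To make the update rule explicit, here is a minimal pure-Python sketch of the same idea: one gradient step per data point on the cost `(error ** 2) / (2 * n)`. It is an illustration under stated assumptions, not the notebook's implementation: the function names are hypothetical, the weights start at zero instead of a random value, and the data is noise-free so that the ideal fit is known (`w = 1`, `b = 0`).

```python
def sgd_linear_regression(xs, ys, learning_rate=0.01, epochs=10):
    """Fit y ~ w * x + b with one gradient step per data point."""
    n = len(xs)
    w, b = 0.0, 0.0  # assumption: start from zero rather than a random value
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = (w * x + b) - y
            # Gradients of (error ** 2) / (2 * n) with respect to w and b
            w -= learning_rate * error * x / n
            b -= learning_rate * error / n
    return w, b


def cost(xs, ys, w, b):
    """Sum of squared errors divided by 2 * n, as in the training cell."""
    n = len(xs)
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)


# Noise-free data on the line y = x: 50 evenly spaced points from 0 to 50.
xs = [i * 50.0 / 49.0 for i in range(50)]
ys = list(xs)

w, b = sgd_linear_regression(xs, ys)
print("initial cost:", cost(xs, ys, 0.0, 0.0))
print("final cost:", cost(xs, ys, w, b), "w =", w, "b =", b)
```

As in the notebook output, the weight settles quickly while the bias moves slowly, because the bias gradient is not scaled by `x`.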

# Train on the Kubeflow Cluster

This section runs the same training job on the Kubeflow cluster hosted on Amazon EKS, using Amazon ECR as the container image registry.

In [3]:
# Authenticate Docker with ECR.
# `aws ecr get-login` retrieves an authorization token that is valid for 12 hours
# for the specified registry and prints a `docker login` command containing that token.
# `eval` then executes the printed command to log in to ECR.
!eval $(aws ecr get-login --no-include-email --region=$AWS_REGION)

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
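
The `aws ecr get-login` command used above belongs to AWS CLI version 1; version 2 replaces it with `aws ecr get-login-password`. As an alternative that does not depend on the CLI version, the same token can be requested through boto3's `get_authorization_token` call. The sketch below is a minimal illustration: the helper names `decode_ecr_token` and `get_ecr_credentials` are hypothetical, and the sample token is fabricated so the decoding step can run without AWS access.

```python
import base64


def decode_ecr_token(authorization_token):
    """Split a base64-encoded 'user:password' token into its two parts."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    user, password = decoded.split(":", 1)
    return user, password


def get_ecr_credentials(region):
    """Return (user, password, registry endpoint) for the default registry."""
    import boto3  # imported here so decode_ecr_token works without boto3

    client = boto3.client("ecr", region_name=region)
    auth = client.get_authorization_token()["authorizationData"][0]
    user, password = decode_ecr_token(auth["authorizationToken"])
    return user, password, auth["proxyEndpoint"]


# Fabricated token for illustration only; a real one comes from get_ecr_credentials.
sample_token = base64.b64encode(b"AWS:example-password").decode("ascii")
print(decode_ecr_token(sample_token))
```

The returned user name, password, and endpoint correspond to the arguments that `docker login` expects.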


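The command below first checks whether the ECR repository already exists and creates it only if it does not. A `RepositoryNotFoundException` from the initial check is expected on the first run and can be ignored.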

In [4]:
# Create an ECR repository in the same region
!aws ecr describe-repositories --repository-names fairing-job --region=$AWS_REGION || aws ecr create-repository --repository-name fairing-job --region=$AWS_REGION


An error occurred (RepositoryNotFoundException) when calling the DescribeRepositories operation: The repository with name 'fairing-job' does not exist in the registry with id '032934710550'
{
    "repository": {
        "repositoryArn": "arn:aws:ecr:us-west-2:032934710550:repository/fairing-job",
        "registryId": "032934710550",
        "repositoryName": "fairing-job",
        "repositoryUri": "032934710550.dkr.ecr.us-west-2.amazonaws.com/fairing-job",
        "createdAt": 1598142420.0,
        "imageTagMutability": "MUTABLE",
        "imageScanningConfiguration": {
            "scanOnPush": false
        }
    }
}
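
The same check-then-create logic can be written with boto3, which avoids the error message in the cell output by handling the exception explicitly. This is a sketch under assumptions: `ensure_ecr_repository` is a hypothetical helper name, and the client is expected to be created with `boto3.client('ecr', region_name=AWS_REGION)`.

```python
def ensure_ecr_repository(ecr_client, name):
    """Return the repository description, creating the repository if needed."""
    try:
        response = ecr_client.describe_repositories(repositoryNames=[name])
        return response["repositories"][0]
    except ecr_client.exceptions.RepositoryNotFoundException:
        # The repository does not exist yet, so create it.
        return ecr_client.create_repository(repositoryName=name)["repository"]


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# ecr = boto3.client("ecr", region_name=AWS_REGION)
# print(ensure_ecr_repository(ecr, "fairing-job")["repositoryUri"])
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object.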


In [None]:
# Setting up AWS Elastic Container Registry (ECR) for storing output containers
# You can use any docker container registry instead of ECR
DOCKER_REGISTRY = '{}.dkr.ecr.{}.amazonaws.com'.format(AWS_ACCOUNT_ID, AWS_REGION)

fairing.config.set_builder('append', base_image='tensorflow/tensorflow:1.14.0-py3', registry=DOCKER_REGISTRY, push=True)
fairing.config.set_deployer('job')
    
if __name__ == '__main__':
    cluster_train = fairing.config.fn(train)
    cluster_train()
    

[I 200823 00:26:59 config:125] Using preprocessor: <kubeflow.fairing.preprocessors.function.FunctionPreProcessor object at 0x7f93d3c58f60>
[I 200823 00:26:59 config:127] Using builder: <kubeflow.fairing.builders.append.append.AppendBuilder object at 0x7f93d3c58c88>
[I 200823 00:26:59 config:129] Using deployer: <kubeflow.fairing.deployers.job.job.Job object at 0x7f93d3c58ac8>
[W 200823 00:26:59 append:50] Building image using Append builder...
[I 200823 00:26:59 base:107] Creating docker context: /tmp/fairing_context_v51ulj41
[W 200823 00:26:59 base:94] /opt/conda/lib/python3.6/site-packages/kubeflow/fairing/__init__.py already exists in Fairing context, skipping...
[I 200823 00:26:59 docker_creds_:234] Loading Docker credentials for repository 'tensorflow/tensorflow:1.14.0-py3'
[W 200823 00:27:00 append:54] Image successfully built in 0.9988130710007681s.
[W 200823 00:27:00 append:94] Pushing image 032934710550.dkr.ecr.us-west-2.amazonaws.com/fairing-job:C6B4B72F...
[I 200823 00:27:00

# See the Completed Job in the Kubeflow Cluster

If you do not see the `fairing-job` in the output below, re-run the cell above. The job is cleaned up a few seconds after it completes.

In [7]:
!kubectl get pod

NAME                          READY   STATUS      RESTARTS   AGE
fairing-builder-vcnzc-g5rzg   0/1     Completed   0          12m
notebook-0                    2/2     Running     0          21m
