# Training on the CHURN Dataset Using the TensorFlow Operator

## Prerequisites
Before we proceed, let's make sure the notebook image has what we need, that is, [TensorFlow](https://www.tensorflow.org/api_docs/) and the supporting packages are installed:

In [None]:
#! pip3 list | grep tensorflow 
! pip3 install --user tensorflow==2.4.0
! pip3 install --user ipywidgets nbconvert
!python -m pip install --user --upgrade pip
!pip3 install pandas scikit-learn keras tensorflow-datasets --user

To package the trainer in a container image, we need a file (on our cluster) that contains the code, as well as a file with the resource definition of the job for the Kubernetes cluster:

In [None]:
TRAINER_FILE = "tfjobchurn.py"
KUBERNETES_FILE = "tfjob-churn.yaml"

We also want to capture the output of a cell with [`%%capture`](https://ipython.readthedocs.io/en/stable/interactive/magics.html#cellmagic-capture). For `kubectl` commands that create a resource, it usually looks like `some-resource created`.
To that end, let's define a helper function that extracts the resource name:

In [None]:
import re

from IPython.utils.capture import CapturedIO


def get_resource(captured_io: CapturedIO) -> str:
    """
    Gets a resource name from `kubectl apply -f <configuration.yaml>`.

    :param CapturedIO captured_io: Output captured by using `%%capture` cell magic
    :return: Name of the Kubernetes resource
    :rtype: str
    :raises Exception: if the resource could not be created
    """
    out = captured_io.stdout
    matches = re.search(r"^(.+)\s+created", out)
    if matches is not None:
        return matches.group(1)
    else:
        raise Exception(f"Cannot get resource as its creation failed: {out}. It may already exist.")
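
To see what the helper extracts, here is a minimal sketch that applies the same regular expression to plain strings. The sample output line is made up for illustration, `extract_created_resource` is a hypothetical name, and `SimpleNamespace` is only a stand-in for the object that `%%capture` produces.

```python
import re
from types import SimpleNamespace


def extract_created_resource(stdout: str) -> str:
    """Applies the same pattern as `get_resource` to a plain string."""
    match = re.search(r"^(.+)\s+created", stdout)
    if match is None:
        raise ValueError(f"No 'created' line found in: {stdout!r}")
    return match.group(1)


# Hypothetical stand-in for the captured cell output.
fake_capture = SimpleNamespace(stdout="tfjob.kubeflow.org/churn created\n")
resource = extract_created_resource(fake_capture.stdout)
print(resource)

# When creation fails, kubectl reports the error on stderr, so with
# `--no-stderr` the captured stdout is empty and nothing matches.
try:
    extract_created_resource("")
    failure_detected = False
except ValueError:
    failure_detected = True
print(failure_detected)
```

The pattern is anchored at the start of the captured text, so it expects the `created` message on the first line of the output.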

## How to Load and Inspect the Data

In [None]:
import pandas as pd

data = pd.read_csv("https://raw.githubusercontent.com/AdeloreSimiloluwa/Artificial-Neural-Network/master/data/Churn_Modelling.csv")
data.head()
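
The trainer below selects its features by column position, so it helps to check the column layout first. The following sketch uses a tiny synthetic frame instead of the real file; the column names and their order are an assumption about the CSV, and the values are invented. It one-hot encodes both categorical columns directly, which is a simplification of the encoding used in the trainer.

```python
import pandas as pd

# Synthetic rows; column names and order are assumed to match the CSV.
sample = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [101, 102, 103],
    "Surname": ["A", "B", "C"],
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Female", "Male"],
    "Age": [42, 41, 42],
    "Tenure": [2, 1, 8],
    "Balance": [0.0, 83807.86, 159660.8],
    "NumOfProducts": [1, 1, 3],
    "HasCrCard": [1, 0, 1],
    "IsActiveMember": [1, 1, 0],
    "EstimatedSalary": [101348.88, 112542.58, 113931.57],
    "Exited": [1, 0, 1],
})

# Drop the three identifier columns and split off the label column.
features = sample.iloc[:, 3:-1]
label = sample.iloc[:, -1:]

# One-hot encode the categorical columns, dropping one level of each.
encoded = pd.get_dummies(features, columns=["Geography", "Gender"], drop_first=True)
print(encoded.shape)
print(list(label.columns))
```

With three countries and two genders, the ten raw feature columns become eleven numeric inputs, which is the `input_dim` the model expects.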

## How to Train the Model in the Notebook
Because we want to train the model in a distributed fashion, we put all the code in a single cell.
That way we can save it to a file and include that file in a container image:

In [None]:
%%writefile $TRAINER_FILE
import argparse
import logging
import json
import os
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import numpy as np
import pandas as pd
# splitting the data
from sklearn.model_selection import train_test_split
# Standardization - feature scaling
from sklearn.preprocessing import StandardScaler
# data encoding
from sklearn.preprocessing import LabelEncoder

import tensorflow_datasets as tfds
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.layers import Dense, Flatten 
from tensorflow.keras.optimizers import SGD, Adam, RMSprop

logging.getLogger().setLevel(logging.INFO)




def make_datasets_unbatched():
  data = pd.read_csv("https://raw.githubusercontent.com/AdeloreSimiloluwa/Artificial-Neural-Network/master/data/Churn_Modelling.csv")

  #preprocessing
  X = data.iloc[:, 3:-1].copy()  # copy so the in-place encoding below does not touch `data`
  y = data.iloc[:,-1:]

  # encoding country
  encoder_X_1= LabelEncoder()
  X.iloc[:,1] = encoder_X_1.fit_transform(X.iloc[:,1])

  # encoding gender
  encoder_X_2= LabelEncoder()
  X.iloc[:,2] = encoder_X_2.fit_transform(X.iloc[:,2])

  # Geography is a nominal variable, so we also one-hot encode it with dummy variables
  dummy = pd.get_dummies(X["Geography"], prefix = ['Geography'],drop_first=True)
  X=pd.concat([X,dummy], axis = 1)
  X=X.drop(columns = ['Geography'], axis = 1)
    
  # split the data
  X_train,X_test,y_train,y_test = train_test_split( X,y, test_size=0.2, random_state = 10)
  train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
  test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test))
  train = train_dataset.cache().shuffle(2000).repeat()
  return train, test_dataset


def model(args):
  model = models.Sequential()
  model.add(Dense(units =9, activation='relu', input_dim=11))
  model.add(Dense(units =9, activation='relu'))
  model.add(Dense(units =1, activation='sigmoid'))

  model.summary()
  opt = args.optimizer
  model.compile(optimizer=opt,
                loss='binary_crossentropy',
                metrics=['accuracy'])
  tf.keras.backend.set_value(model.optimizer.learning_rate, args.learning_rate)
  return model


def main(args):
  # MultiWorkerMirroredStrategy creates copies of all variables in the model's
  # layers on each device across all workers
  strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
      communication=tf.distribute.experimental.CollectiveCommunication.AUTO)
  # Logged at INFO because the root logger level is INFO; DEBUG would be dropped
  logging.info(f"num_replicas_in_sync: {strategy.num_replicas_in_sync}")
  BATCH_SIZE_PER_REPLICA = args.batch_size
  BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

  # Datasets need to be created after instantiation of `MultiWorkerMirroredStrategy`
  train_dataset, test_dataset = make_datasets_unbatched()
  train_dataset = train_dataset.batch(batch_size=BATCH_SIZE)
  test_dataset = test_dataset.batch(batch_size=BATCH_SIZE)

  # See: https://www.tensorflow.org/api_docs/python/tf/data/experimental/DistributeOptions
  options = tf.data.Options()
  options.experimental_distribute.auto_shard_policy = \
        tf.data.experimental.AutoShardPolicy.DATA

  train_datasets_sharded  = train_dataset.with_options(options)
  test_dataset_sharded = test_dataset.with_options(options)

  with strategy.scope():
    # Model building/compiling need to be within `strategy.scope()`.
    multi_worker_model = model(args)

  # Keras' `model.fit()` trains the model with specified number of epochs and
  # number of steps per epoch. 
  multi_worker_model.fit(train_datasets_sharded,
                         epochs=50,
                         steps_per_epoch=30)
  
  eval_loss, eval_acc = multi_worker_model.evaluate(test_dataset_sharded, 
                                                    verbose=0, steps=10)

  # Log metrics for Katib
  logging.info("loss={:.4f}".format(eval_loss))
  logging.info("accuracy={:.4f}".format(eval_acc))


if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument("--batch_size",
                      type=int,
                      default=32,
                      metavar="N",
                      help="Batch size for training (default: 32)")
  parser.add_argument("--learning_rate", 
                      type=float,  
                      default=0.1,
                      metavar="LR",
                      help='Initial learning rate')
  parser.add_argument("--optimizer", 
                      type=str, 
                      default='adam',
                      metavar="OPT",
                      help='optimizer')

  parsed_args, _ = parser.parse_known_args()
  main(parsed_args)

That saves the file as defined by `TRAINER_FILE` but it does not run it.

Let's see if our code is correct by running it from within our notebook:

In [None]:
%run $TRAINER_FILE --optimizer 'adam'
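
The trainer scales the batch size with the number of replicas, and because the training dataset repeats indefinitely, `steps_per_epoch=30` decides how many batches make up one epoch. The arithmetic can be sketched as follows. The function names are hypothetical, and the two-worker setup with one device per worker is an assumption, not a measured value.

```python
def global_batch_size(per_replica_batch: int, num_replicas: int) -> int:
    """Mirrors `BATCH_SIZE = BATCH_SIZE_PER_REPLICA * num_replicas_in_sync`."""
    return per_replica_batch * num_replicas


def samples_per_epoch(global_batch: int, steps_per_epoch: int) -> int:
    """Number of examples consumed by one epoch of `model.fit()`."""
    return global_batch * steps_per_epoch


# Running in the notebook: one replica and the default --batch_size of 32.
local_batch = global_batch_size(per_replica_batch=32, num_replicas=1)
local_seen = samples_per_epoch(local_batch, steps_per_epoch=30)

# Assumed distributed setup: two single-device workers and --batch_size=64.
dist_batch = global_batch_size(per_replica_batch=64, num_replicas=2)
dist_seen = samples_per_epoch(dist_batch, steps_per_epoch=30)

print(local_batch, local_seen)
print(dist_batch, dist_seen)
```

If an epoch should cover the whole training split, `steps_per_epoch` would need to be derived from the number of training rows rather than fixed at 30.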

## How to Create a Docker Image Manually


The Dockerfile looks as follows:

```dockerfile
FROM tensorflow/tensorflow:2.4.0
RUN pip install tensorflow_datasets pandas scikit-learn keras
COPY tfjobchurn.py /
ENTRYPOINT ["python", "/tfjobchurn.py", "--batch_size", "64", "--learning_rate", "0.1", "--optimizer", "adam"]
```
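
The flags in the `ENTRYPOINT` map onto the options the trainer defines with `argparse`. The sketch below re-creates those three options to show how such an argument list is parsed; `build_parser` is a hypothetical helper and `--some_extra_flag` is an invented flag, included only to show that `parse_known_args` sets unknown arguments aside instead of failing.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Re-creates the three options defined in the trainer script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--optimizer", type=str, default="adam")
    return parser


# Arguments as listed in the ENTRYPOINT, plus one hypothetical extra flag.
argv = ["--batch_size", "64", "--learning_rate", "0.1",
        "--optimizer", "adam", "--some_extra_flag=1"]

args, unknown = build_parser().parse_known_args(argv)
print(args.batch_size, args.learning_rate, args.optimizer)
print(unknown)
```

Both `--batch_size 64` and `--batch_size=64` are accepted by `argparse`, so the space-separated form here and the `=` form are interchangeable.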


Build the image and push it to your container registry:

```bash
docker build -t <docker_image_name_with_tag> .
docker push <docker_image_name_with_tag>
```

The image is available as `mavencodev/tf_jobchurn:1.0` in case you want to skip building it for now.

## How to Create a Distributed `TFJob`
For large training jobs, we wish to run our trainer in a distributed mode.
Once the cluster that hosts the notebook server can pull the Docker image from the registry, we can launch a distributed TensorFlow job.

The specification for a distributed `TFJob` is defined using YAML:

In [None]:
%%writefile $KUBERNETES_FILE
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "churn"
  namespace: demo01 # your-user-namespace
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - name: tensorflow
            # modify this property if you would like to use a custom image
            image: mavencodev/tf_jobchurn:1.0
            command:
                - "python"
                - "/tfjobchurn.py"
                - "--batch_size=64"
                - "--learning_rate=0.1"
                - "--optimizer=adam"

Let's deploy the distributed training job:

In [None]:
%%capture tf_output --no-stderr
! kubectl create -f $KUBERNETES_FILE

In [None]:
TF_JOB = get_resource(tf_output)

To see the job status, use the following command:

In [None]:
! kubectl describe $TF_JOB

You should now be able to see the created pods matching the specified number of workers.

In [None]:
! kubectl get pods -l job-name=churn
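
Each worker pod learns about its peers through the `TF_CONFIG` environment variable, which the operator sets and `MultiWorkerMirroredStrategy` reads. The sketch below parses an illustrative value to show its shape. The host names and port are assumptions, and `describe_task` is a hypothetical helper; inside a pod the string would come from `os.environ["TF_CONFIG"]`.

```python
import json

# Illustrative value only: host names and port are assumptions.
example_tf_config = json.dumps({
    "cluster": {
        "worker": [
            "churn-worker-0.demo01.svc:2222",
            "churn-worker-1.demo01.svc:2222",
        ],
    },
    "task": {"type": "worker", "index": 0},
})


def describe_task(raw: str):
    """Returns (task type, task index, number of workers) from a TF_CONFIG string."""
    config = json.loads(raw)
    task = config.get("task", {})
    workers = config.get("cluster", {}).get("worker", [])
    return task.get("type"), task.get("index"), len(workers)


task_type, task_index, num_workers = describe_task(example_tf_config)
print(f"{task_type} {task_index} of {num_workers}")
```

The number of entries under `worker` should match the `replicas` value in the job specification.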

In case of issues, it may be helpful to see the last ten events within the cluster:

```bash
kubectl get events --sort-by='.lastTimestamp' | tail
```

To stream logs from the worker-0 pod to check the training progress, run the following command:

In [None]:
! kubectl logs -f churn-worker-0
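
At the end of training, the trainer logs its evaluation metrics as `loss=…` and `accuracy=…`. A minimal sketch for pulling those values out of saved log text follows. The sample lines and their values are made up, the line prefix assumes the default `logging` format, and `parse_metrics` is a hypothetical helper.

```python
import re

# Made-up log lines, assuming the default `logging` format (LEVEL:logger:message).
sample_logs = """\
INFO:root:loss=0.4321
INFO:root:accuracy=0.7950
"""

# Matches the `name=value` form written by the trainer's final log calls.
METRIC_PATTERN = re.compile(r"\b(loss|accuracy)=([0-9]*\.?[0-9]+)")


def parse_metrics(log_text: str) -> dict:
    """Collects the last reported value for each metric name."""
    metrics = {}
    for name, value in METRIC_PATTERN.findall(log_text):
        metrics[name] = float(value)
    return metrics


metrics = parse_metrics(sample_logs)
print(metrics)
```

The pattern only matches the `=` form, so progress lines that report `loss: ...` with a colon are ignored.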

To delete the job, run the following command:

In [None]:
! kubectl delete $TF_JOB

Check whether the pod is still up and running:

In [None]:
! kubectl -n demo01 logs -f churn-worker-0