## Tabular Classifier with auto Deep Learning
This solution evaluates several deep learning models of different architectures on user-provided data and identifies the best-performing architecture for tabular classification based on the validation metric. By automating several deep learning tasks, it reduces the time and effort a data scientist spends on model building.

### Contents

1. [Set up the environment](#Set-up-the-environment)
1. [Usage Instructions](#Usage-Instructions)
1. [Upload the data for training](#Upload-the-data-for-training)
1. [Run Training Job](#Run-Training-Job)
1. [Live Inference Endpoint](#Live-Inference)
1. [Batch Transform Job](#Batch-Transform-Job)
1. [Output Interpretation](#Output-Interpretation)



<img src="images/Flow_diagram.JPG">

### Prerequisite

To run this algorithm you need to have access to the following AWS Services:
- Access to AWS SageMaker and the model package.
- An S3 bucket to specify input/output.
- Role for AWS SageMaker to access input/output from S3.

### Input format
#### Input:
Name of the file: <b>train.csv</b><br>
This file contains historical incidents that have been resolved. The solution uses the following incident-specific inputs to derive productivity measures for incident managers, such as efficiency, experience, and workload management across incident types, and uses these measures to make its predictions.<br><br>

<ul>
<li>ID: Unique alphanumeric identifier for the request, e.g. INC0001029696</li>
<li>Reported_Day: Day of the week as a number (preferred format: 1-7)</li>
<li>prod_cat: First-level category for the request, e.g. Miscellaneous_Instance_Database_SQL Server Database</li>
<li>Country: Country of origin of the request (preferred format: USA)</li>
<li>Detailed_Description: Free text describing the problem in the user's words</li>
<li>Priority: Status of the request, e.g. Low/Medium/High</li>
<li>Impact: High/Medium/Low</li>
</ul><br>
NOTE:
<ul>
<li>Not all fields are mandatory. Optional fields: prod_cat, Detailed_Description</li>
</ul>
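
Before uploading, it can help to check a file against the field list above. The following is a minimal sketch using only the standard library; the function name `check_train_csv` is hypothetical, and treating `ID`, `Reported_Day`, `Country`, `Priority`, and `Impact` as required is an assumption drawn from the NOTE, not a documented rule of the model package.

```python
import csv
import io

# Assumption: every field not listed as optional in the NOTE is required.
REQUIRED = ["ID", "Reported_Day", "Country", "Priority", "Impact"]


def check_train_csv(text):
    """Return a list of human-readable problems found in train.csv content."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = ["missing column: {}".format(c) for c in REQUIRED if c not in header]
    if problems:
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not row["ID"]:
            problems.append("line {}: empty ID".format(line_no))
        if row["Reported_Day"] not in {"1", "2", "3", "4", "5", "6", "7"}:
            problems.append("line {}: Reported_Day must be 1-7".format(line_no))
    return problems


sample = (
    "ID,Reported_Day,Country,Priority,Impact\n"
    "INC0001029696,3,USA,Low,Medium\n"
    "INC0001029697,9,USA,High,High\n"
)
print(check_train_csv(sample))
```

The sample deliberately contains one row with an out-of-range `Reported_Day`, so the check reports a single problem for line 3.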




## Set up the environment
Here we specify the S3 prefix to use and the role that will be used for working with SageMaker.

In [58]:
# S3 prefix
prefix = 'tabular-classifier'

# Define IAM role
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

## Create the session
The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [59]:
import sagemaker as sage
from time import gmtime, strftime

sess = sage.Session()

## Upload the data for training
When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. For the purposes of this example, we're using a sample classification dataset, which we have included.

We can use the tools provided by the SageMaker Python SDK to upload the data to a default bucket. In this example, the sample training data is already available at the S3 location below.

In [60]:
data_location= 's3://mphasis-marketplace/tabular_data/input/sample_train_data.zip'
data_location

's3://mphasis-marketplace/tabular_data/input/sample_train_data.zip'
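
The sample training data is referenced as a `.zip` object. As a rough sketch of how such an archive could be built for your own data, the snippet below packs CSV text into an in-memory zip. The helper name `zip_csv` is hypothetical, and placing `train.csv` at the root of the archive is an assumption; the layout the model package expects is not stated here.

```python
import io
import zipfile


def zip_csv(csv_text, member_name="train.csv"):
    """Return the bytes of a zip archive holding one CSV file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, csv_text)
    return buffer.getvalue()


archive_bytes = zip_csv("ID,Reported_Day,Country,Priority,Impact\n")

# Re-open the archive to confirm it contains exactly the expected member.
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    print(archive.namelist())

# With a SageMaker session available, an archive written to disk could then be
# uploaded with sess.upload_data(<local path>, key_prefix=prefix), which
# returns the S3 URI of the uploaded file.
```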

## Create an estimator and fit the model
In order to use SageMaker to fit our algorithm, we'll create an Estimator that defines how to use the container to train. This includes the configuration we need to invoke SageMaker training:
- The container name. This is the ECR image URI, built from the account ID and region.
- The role. As defined above.
- The instance count, which is the number of machines to use for training.
- The instance type, which is the type of machine to use for training.
- The output path, which determines where the model artifact will be written.
- The session, which is the SageMaker session object that we defined above.

Then we use fit() on the estimator to train against the data that we uploaded above.
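
Two of the settings listed above are plain strings assembled from account details. The sketch below shows that string construction in isolation so it can be checked without AWS access. The helper names `ecr_image_uri` and `model_output_path` are hypothetical, and the account ID, region, and bucket are placeholder values.

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Build an ECR image URI following the pattern used in the training cell."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account, region, repository, tag)


def model_output_path(bucket, folder="output"):
    """Build the S3 location where the model artifact is written."""
    return "s3://{}/{}".format(bucket, folder)


# Placeholder values (hypothetical); the notebook reads the real ones from the session.
print(ecr_image_uri("123456789012", "us-east-1", "tabular-data-f"))
print(model_output_path("example-bucket"))
```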

In [61]:
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/tabular-data-f:latest'.format(account, region)
# Positional arguments: image URI, IAM role, instance count, instance type
tree = sage.estimator.Estimator(image,
                                role, 3, 'ml.c4.2xlarge',
                                output_path="s3://{}/output".format(sess.default_bucket()),
                                sagemaker_session=sess)

tree.fit(data_location)


2021-03-24 14:35:57 Starting - Starting the training job...
2021-03-24 14:36:21 Starting - Launching requested ML instances
ProfilerReport-1616596557: InProgress
......
2021-03-24 14:37:22 Starting - Preparing the instances for training......
2021-03-24 14:38:26 Downloading - Downloading input data...
2021-03-24 14:38:42 Training - Downloading the training image......
2021-03-24 14:39:52 Training - Training image download completed. Training in progress.
2021-03-24 14:39:48.480924: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-03-24 14:39:48.480958: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Starting the training.
2021-03-24 14:39:53.248668: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices

## Hosting your model
You can use a trained model to get real-time predictions through an HTTP endpoint. The following steps walk you through the process.


In [62]:
training_job_name = tree.latest_training_job.name
attached_tree = sage.estimator.Estimator.attach(training_job_name)


2021-03-24 14:40:42 Starting - Preparing the instances for training
2021-03-24 14:40:42 Downloading - Downloading input data
2021-03-24 14:40:42 Training - Training image download completed. Training in progress.
2021-03-24 14:40:42 Uploading - Uploading generated training model
2021-03-24 14:40:42 Completed - Training job completed



### Deploy the model
Deploying the model to SageMaker hosting just requires a deploy call on the fitted model. This call takes an instance count, instance type, and optionally serializer and deserializer functions. These are used when the resulting predictor is created on the endpoint.
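
To make the role of the serializer and deserializer concrete, here is a small standard-library sketch of turning rows into a CSV request body and parsing a CSV response back into rows. It is an illustration of the idea only, not the SDK's implementation; `rows_to_csv` and `csv_to_rows` are hypothetical names.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows into a CSV string, one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def csv_to_rows(payload):
    """Deserialize a CSV byte string back into a list of rows."""
    return list(csv.reader(io.StringIO(payload.decode("utf-8"))))


body = rows_to_csv([["INC0001029696", 3, "USA", "Low", "Medium"]])
print(repr(body))
print(csv_to_rows(body.encode("utf-8")))
```

Note that every value comes back as a string after the round trip, so numeric fields such as `Reported_Day` need converting again if they are used as numbers.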

In [63]:

# In sagemaker>=2, csv_serializer was renamed to the CSVSerializer class
from sagemaker.serializers import CSVSerializer
predictor = attached_tree.deploy(4, 'ml.m4.xlarge', serializer=CSVSerializer(), endpoint_name='tabular-classifier')

-------------!

## Choose some data and use it for a prediction


In [64]:
test_data  = 's3://mphasis-marketplace/tabular_data/input/sample_test_data.zip'




In [44]:
predictions = predictor.predict(test_data)





In [45]:
print(predictions)

b'Please send a proper format file'
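
The response above is an error message rather than predictions. A small guard such as the following sketch can tell the two apart before any parsing is attempted. It assumes that a successful response is CSV whose header includes the `Predicted Group` column shown in the View Output section; the function name `is_prediction_csv` is hypothetical.

```python
def is_prediction_csv(body, marker="Predicted Group"):
    """Return True when the first line of the response looks like a prediction header."""
    text = body.decode("utf-8", errors="replace")
    first_line = text.splitlines()[0] if text.strip() else ""
    return marker in first_line.split(",")


print(is_prediction_csv(b"Please send a proper format file"))
print(is_prediction_csv(b"ID,Reported_Day,Predicted Group\nINC1,3,Target-4\n"))
```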


### Output

The output file contains a `Predicted Group` column, which holds the predicted class.

In [65]:
transform_output_folder = "batch-transform-output"
output_path="s3://{}/{}".format(sess.default_bucket(), transform_output_folder)

transformer = tree.transformer(instance_count=1,
                               instance_type='ml.m4.xlarge',
                               output_path=output_path)

In [66]:
transformer.transform(test_data, content_type='application/zip')
transformer.wait()
print("Batch Transform output saved to " + transformer.output_path)

.................................
Starting the inference server with 4 workers.
[2021-03-24 17:36:00 +0000] [13] [INFO] Starting gunicorn 20.0.4
[2021-03-24 17:36:00 +0000] [13] [INFO] Listening at: unix:/tmp/gunicorn.sock (13)
[2021-03-24 17:36:00 +0000] [13] [INFO] Using worker: gevent
[2021-03-24 17:36:00 +0000] [16] [INFO] Booting worker with pid: 16
[2021-03-24 17:36:00 +0000] [17] [INFO] Booting worker with pid: 17
[2021-03-24 17:36:00 +0000] [24] [INFO] Booting worker with pid: 24
[2021-03-24 17:36:00 +0000] [25] [INFO] Booting worker with pid: 25
2021-03-24 17:36:00.908501: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-03-24 17:36:00.908558: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not 

#### Inspect the Batch Transform Output in S3

In [48]:
from urllib.parse import urlparse

parsed_url = urlparse(transformer.output_path)
bucket_name = parsed_url.netloc
# Batch transform names each output object after its input file, plus ".out"
file_key = '{}/{}.out'.format(parsed_url.path[1:], "sample_test_data.zip")

s3_client = sess.boto_session.client('s3')

# Read the object from the bucket parsed out of the transformer's output path
response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
response_bytes = response['Body'].read()
print(response_bytes)

b'Please send a proper format file'
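
The object key used when inspecting the output has to match the name of the input file. Batch transform stores each result under the input object's name with `.out` appended, so the key can be derived rather than typed by hand. The sketch below assumes a single input object; `transform_output_location` is a hypothetical helper and `example-bucket` is a placeholder.

```python
import posixpath
from urllib.parse import urlparse


def transform_output_location(output_path, input_uri):
    """Return (bucket, key) of the .out object for a single input file."""
    parsed = urlparse(output_path)
    input_name = posixpath.basename(urlparse(input_uri).path)
    key = posixpath.join(parsed.path.lstrip("/"), input_name + ".out")
    return parsed.netloc, key


bucket, key = transform_output_location(
    "s3://example-bucket/batch-transform-output",
    "s3://mphasis-marketplace/tabular_data/input/sample_test_data.zip",
)
print(bucket, key)
```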


### View Output
Let's read the results of the above transform job from S3 and print the output.

In [30]:
s3_client = sess.boto_session.client('s3')
s3_client.download_file(sess.default_bucket(), "{}/test.csv.out".format(transform_output_folder), '/tmp/test.csv.out')
with open('/tmp/test.csv.out') as f:
    results = f.readlines()
string_final = ''.join(results)

print(string_final)

with open("Output.txt", "w") as text_file:
    text_file.write(string_final)

ID,Reported_Day,prod_cat,Country,Priority,Impact,Incident_Type,Reported_Source,Predicted Group
INC000014022289,3,prod_cat -1,Country-1,Low,4-Minor,Incident_Type-1,Phone,Target-7
INC000014060316,2,prod_cat -1,Country-1,Low,1-Minor,Incident_Type-2,Phone,Target-4
INC000013880496,3,prod_cat -2,Country-1,Low,2-Minor,Incident_Type-1,Phone,Target-4
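
As a short sketch of how the output can be used downstream, the snippet below parses the three rows shown above and counts how many requests were assigned to each predicted class. It uses only the standard library and copies the sample rows verbatim; with a real run you would read the downloaded `.out` file instead.

```python
import csv
import io
from collections import Counter

# The three sample output rows shown above, copied as-is.
output_text = (
    "ID,Reported_Day,prod_cat,Country,Priority,Impact,Incident_Type,Reported_Source,Predicted Group\n"
    "INC000014022289,3,prod_cat -1,Country-1,Low,4-Minor,Incident_Type-1,Phone,Target-7\n"
    "INC000014060316,2,prod_cat -1,Country-1,Low,1-Minor,Incident_Type-2,Phone,Target-4\n"
    "INC000013880496,3,prod_cat -2,Country-1,Low,2-Minor,Incident_Type-1,Phone,Target-4\n"
)

rows = list(csv.DictReader(io.StringIO(output_text)))

# Count the predicted class across all rows.
counts = Counter(row["Predicted Group"] for row in rows)
print(counts.most_common())
```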

