## <center>Tutorial: ML Model Deployment on AWS</center> 
<center>By: You Chen</center> 


For details about AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html  
For details about AWS Lambda: https://docs.aws.amazon.com/en_us/lambda/?id=docs_gateway  
For details about deploying Lambda in Java: https://docs.aws.amazon.com/lambda/latest/dg/lambda-java.html

----------------------

<h5>Building a deployment package with Maven</h5>

<br>Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting, and documentation from a central piece of information. For details on setting up **Maven**, see http://maven.apache.org/.</br> 

<br>A deployment package is a ZIP or JAR archive that contains your compiled function code and its dependencies. You can upload the package directly to Lambda, or upload it to an Amazon S3 bucket first and then load it into Lambda from there. If the deployment package is larger than 50 MB, you must use Amazon S3.  
AWS Lambda provides the following libraries for Java functions:</br>

* [<strong>com.amazonaws:aws-lambda-java-core (required)</strong>](https://github.com/aws/aws-lambda-java-libs/tree/master/aws-lambda-java-core) – Defines handler method interfaces and the context object that the runtime passes to the handler. If you define your own input types, this is the only library you need.

* [<strong>com.amazonaws:aws-lambda-java-events</strong>](https://github.com/aws/aws-lambda-java-libs/tree/master/aws-lambda-java-events) – Input types for events from services that invoke Lambda functions.

* [<strong>com.amazonaws:aws-lambda-java-log4j2</strong>](https://github.com/aws/aws-lambda-java-libs/tree/master/aws-lambda-java-log4j2) – An appender library for Log4j 2 that you can use to add the request ID for the current invocation to your function logs.

These libraries are available through the [Maven Central repository](https://search.maven.org/search?q=g:com.amazonaws). Add them to your build definition ( _pom.xml_ ) as follows.
```xml
  <dependencies>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-lambda-java-core</artifactId>
      <version>X.X.X</version>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-lambda-java-events</artifactId>
      <version>X.X.X</version>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-lambda-java-log4j2</artifactId>
      <version>X.X.X</version>
    </dependency>
  </dependencies>
```

<br>Other required dependencies (Gson and the H2O <code>h2o-genmodel</code> library):</br>
``` xml
<dependencies>
		<dependency>
			<groupId>com.google.code.gson</groupId>
			<artifactId>gson</artifactId>
			<version>X.X.X</version>
		</dependency>
        <dependency>
            <groupId>ai.h2o</groupId>
            <artifactId>h2o-genmodel</artifactId>
            <version>X.X.X</version>
        </dependency>
</dependencies>
```
<br>Use the [Maven Shade plugin](https://maven.apache.org/plugins/maven-shade-plugin/). The plugin creates a JAR file that contains the compiled function code and all of its dependencies.</br>

``` xml
    <plugins>
        <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.2.2</version>
        <configuration>
          <createDependencyReducedPom>false</createDependencyReducedPom>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
```
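Once the Shade plugin has produced the JAR, the 50 MB rule mentioned earlier decides how it can be uploaded. As a rough illustration, the sketch below checks a package size against that limit. The class name `PackageSizeCheck`, its method names, and the reading of "50 MB" as 50 × 1024 × 1024 bytes are assumptions made for this example and are not part of the project.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: decides whether a deployment package can be uploaded
// directly to Lambda or has to go through Amazon S3 first.
final class PackageSizeCheck {
    // Assumption: "50 MB" is read as 50 * 1024 * 1024 bytes.
    static final long DIRECT_UPLOAD_LIMIT_BYTES = 50L * 1024 * 1024;

    // True when the package is larger than the direct-upload limit.
    static boolean requiresS3(long packageSizeBytes) {
        return packageSizeBytes > DIRECT_UPLOAD_LIMIT_BYTES;
    }

    // Same check, reading the size from a file on disk.
    static boolean requiresS3(Path jar) throws IOException {
        return requiresS3(Files.size(jar));
    }

    public static void main(String[] args) throws IOException {
        Path jar = Paths.get(args.length > 0 ? args[0] : "target/mlModel-1.0.0.jar");
        if (!Files.exists(jar)) {
            System.out.println("No JAR found at " + jar + "; run mvn package first.");
            return;
        }
        System.out.println(requiresS3(jar) ? "Upload through Amazon S3" : "Direct upload is possible");
    }
}
```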


<h5>ML Model Project structure</h5>
<br>The model Lambda function takes a JSON string as input. The JSON (a RequestList) is a list of flight records. The function sets up the generated H2O model, feeds each record (an H2O RowData) into it to compute a prediction, writes the prediction back into that RowData, and returns the updated RequestList.</br>

![Project1](https://github.com/AnnaChenU/AWS-lambda-MLmodel/raw/AnnaChenU-patch-imgs/mvn_struct.PNG)


``` java
GrossVolumeModel     // package (namespace), can be any customized name
    GBM_4_AutoML_*   // the H2O output POJO model file (can be a different name)
    Predictor        // Java class, configures the ML model and provides the predict function that updates data in RequestList
    RequestList      // Java class, the input and also the output of the handleRequest method
    VolumeHandler    // class that implements com.amazonaws.services.lambda.runtime.RequestHandler (the Lambda function)

pom.xml          // project configuration file
```
---------------------------------

#### Predictor.java
```java
package GrossVolumeModel;

import hex.genmodel.GenModel;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.exception.PredictException;
import hex.genmodel.easy.prediction.RegressionModelPrediction;


import java.util.List;

public class Predictor {
    private RequestList requestList;
    EasyPredictModelWrapper model;

    public Predictor(RequestList requestList) {
        this.requestList = requestList;

        // TODO: instantiate the model object via reflection instead of hard-coding the class
        GenModel rawModel = new GBM_4_AutoML_20200414_192014();  
        this.model = new EasyPredictModelWrapper(new EasyPredictModelWrapper.Config()
                .setModel(rawModel).setConvertUnknownCategoricalLevelsToNa(true));
    }

    public EasyPredictModelWrapper getModel() {
        return model;
    }

    public RequestList getRequestList() {
        return requestList;
    }

    public void predictVolume() {
        List<RowData> inputs = this.requestList.getRequestItem();
        for (RowData input : inputs) {
            try {
                RegressionModelPrediction p = model.predictRegression(input);
                // TODO: move the output key name into a configuration module
                input.put("finalVolume_pred", "" + p.value);
            } catch (PredictException e) {
                e.printStackTrace();
            }
        }
    }
}

```
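The comment in the `Predictor` constructor suggests creating the model object through reflection. A minimal sketch of that idea, using only the standard library, might look like the following. `ModelLoader` and `newInstance` are hypothetical names. In the project, the expected type would presumably be `hex.genmodel.GenModel` and the class name would be package-qualified; the sketch also assumes the generated POJO has a public no-argument constructor.

```java
// Hypothetical helper: creates an object from a class name supplied at runtime,
// so the model class does not have to be hard-coded in Predictor.
final class ModelLoader {
    static <T> T newInstance(String className, Class<T> expectedType) {
        try {
            Class<?> clazz = Class.forName(className);
            // Assumes the class has an accessible no-argument constructor.
            Object instance = clazz.getDeclaredConstructor().newInstance();
            // Fails with ClassCastException if the class is not of the expected type.
            return expectedType.cast(instance);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the model class: any class with a public no-arg constructor works.
        java.util.List<?> list = newInstance("java.util.ArrayList", java.util.List.class);
        System.out.println(list.getClass().getName());
    }
}
```

With a helper like this, the model class name could come from an environment variable or a configuration file, so a retrained model would not require a change to `Predictor` itself.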

----------------------------------------------------------------------------------------------------------------

#### RequestList.java
```java
package GrossVolumeModel;
import java.util.List;
import com.google.gson.annotations.Expose;
import com.google.gson.annotations.SerializedName;
import hex.genmodel.easy.RowData;

public class RequestList {
    @SerializedName("requestItem")
    @Expose
    private List<RowData> requestItem = null;


    public RequestList() { }

    /**
     * @param requestItem
     */
    public RequestList(List<RowData> requestItem) {
        super();
        this.requestItem = requestItem;
    }

    public List<RowData> getRequestItem() {
        return requestItem;
    }

    public void setRequestItem(List<RowData> requestItem) {
        this.requestItem = requestItem;
    }
}
```
----------------------------------------------

#### VolumeHandler.java
```java
package GrossVolumeModel;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.LambdaLogger;

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.FieldNamingPolicy;


public class VolumeHandler implements RequestHandler<RequestList, RequestList> {
    Gson gson = new GsonBuilder().setFieldNamingPolicy(FieldNamingPolicy.UPPER_CAMEL_CASE).create();
    //private static final String modelClassName = "GBM_4_AutoML_20200414_192014";
    //private static final String targetName = "FinalVolume_predicted";

    @Override
    public RequestList handleRequest(RequestList event, Context context) {
        LambdaLogger logger = context.getLogger();
        logger.log("EVENT: " + gson.toJson(event));
        logger.log("EVENT TYPE: " + event.getClass().toString());

        Predictor predictor = new Predictor(event);
        predictor.predictVolume();                                            // update RequestList
        logger.log("RESULT: " + gson.toJson(predictor.getRequestList()));

        return predictor.getRequestList();
    }
}
```
---------------------

### pom.xml
``` xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>mlModel</artifactId>
    <packaging>jar</packaging>
    <name>mlModel</name>
    <version>1.0.0</version>

    <properties>
        <java.version>1.8</java.version>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <sl4j.version>1.6.1</sl4j.version>
        <environment>local</environment>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <build.version>1.0.0</build.version>
        <maven.build.timestamp.format>yyyy-MM-dd'T'HH:mm:ss.SSS'Z'</maven.build.timestamp.format>
        <build-number>1.0.0</build-number>
    </properties>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.22.2</version>
                <configuration>
                    <argLine>-Xmx1024m</argLine>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.2</version>
                <configuration>
                    <createDependencyReducedPom>false</createDependencyReducedPom>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="com.github.edwgiz.maven_shade_plugin.log4j2_cache_transformer.PluginsCacheFileTransformer">
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
                <dependencies>
                    <dependency>
                        <groupId>com.github.edwgiz</groupId>
                        <artifactId>maven-shade-plugin.log4j2-cachefile-transformer</artifactId>
                        <version>2.13.0</version>
                    </dependency>
                </dependencies>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
            </plugin>
        </plugins>

        <extensions>
            <extension>
                <groupId>org.springframework.build</groupId>
                <artifactId>aws-maven</artifactId>
                <version>5.0.0.RELEASE</version>
            </extension>
        </extensions>
    </build>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-lambda-java-core -->
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-lambda-java-core</artifactId>
            <version>1.2.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-lambda-java-log4j2 -->
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-lambda-java-log4j2</artifactId>
            <version>1.1.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-lambda-java-events -->
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-lambda-java-events</artifactId>
            <version>2.2.7</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
		<dependency>
			<groupId>com.google.code.gson</groupId>
			<artifactId>gson</artifactId>
			<version>2.8.6</version>
		</dependency>
        <!-- https://mvnrepository.com/artifact/ai.h2o/h2o-genmodel -->
        <dependency>
            <groupId>ai.h2o</groupId>
            <artifactId>h2o-genmodel</artifactId>
            <version>3.24.0.2</version>
        </dependency>
    </dependencies>
</project>
```


<h5>Deploy the ML model Lambda function and set up the API Gateway trigger</h5>  


_<strong>1. Deploy on the AWS website:</strong>_
<ul>

* Run the <code><strong>mvn package</strong></code> command in the project root folder (the folder that contains _pom.xml_). This builds the JAR file in the <code>target</code> folder.
* Go to the AWS S3 website
    * <strong>Create bucket</strong>, then upload the <code>mlModel/target/mlModel-1.0.0.jar</code> file
* Go to the AWS Lambda website
    * <strong>Create function</strong>, set the runtime to Java 8, and create or link an execution role (CloudWatch access recommended)
    * Choose <strong>Action -> Upload a file from S3 bucket</strong>, then paste the S3 link URL of your model JAR file
    * Choose <strong>Add trigger</strong>, select <strong>API Gateway</strong>, and add your API (API Gateway supports two types of RESTful APIs: HTTP APIs and REST APIs)
* Test configuration (a RequestList object): all keys should begin with an <strong>uppercase</strong> letter except 'id'. Any keys that are not used by the model POJO are <strong>ignored</strong> during prediction.
    ```json
{
  "requestItem": [
    {
      "id": "1",
      "BookedWeight": "3000.0",
      "BookedVolume": 10.00,
      "Hamster": "小土豆(˘•ω•˘)",      // this will be ignored during prediction
      ...
    },
    {
      "id": "2",
      "BookedWeight": "2020.0",
      "BookedVolume": 7.11,
      ...
    }]
}
    ```
 </ul>
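The key-naming rule for the test event can be checked before the event is sent. The sketch below lists the keys of one request row that break the rule. `RequestKeyCheck` and `invalidKeys` are hypothetical names, and a plain `Map` stands in for the H2O `RowData` type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reports keys that do not follow the naming rule
// "every key except 'id' starts with an uppercase letter".
final class RequestKeyCheck {
    static List<String> invalidKeys(Map<String, ?> row) {
        List<String> invalid = new ArrayList<>();
        for (String key : row.keySet()) {
            if (key.equals("id")) {
                continue; // 'id' is the one allowed lowercase key
            }
            if (key.isEmpty() || !Character.isUpperCase(key.charAt(0))) {
                invalid.add(key);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", "1");
        row.put("BookedWeight", "3000.0");
        row.put("bookedVolume", 10.00); // lowercase first letter, so it is reported
        System.out.println(invalidKeys(row)); // prints [bookedVolume]
    }
}
```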
 
Set up API Gateway:
<ul>
    
* Go to AWS API Gateway website
    * <strong>Build HTTP API or REST API</strong>, click <strong>Actions</strong> to create a resource or method, and finally deploy it
</ul>

To call the API locally, you need the API key and the API endpoint. Both can be found under the API Gateway trigger details on your Lambda function's page.
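A local call could be sketched as follows, assuming Java 11 or later for `java.net.http` (the Lambda functions themselves target Java 8). The endpoint and key are placeholders to be replaced with the values from the trigger details, and the use of `POST` is an assumption about how the API method was configured. `x-api-key` is the header in which API Gateway expects an API key. `ModelApiClient`, `buildRequest`, and `call` are hypothetical names.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client for calling the deployed model through API Gateway.
final class ModelApiClient {
    // Builds a JSON POST request that carries the API key in the x-api-key header.
    static HttpRequest buildRequest(String endpoint, String apiKey, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .header("x-api-key", apiKey)
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    // Sends the request and returns the response body as a string.
    static String call(String endpoint, String apiKey, String jsonBody) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(endpoint, apiKey, jsonBody), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

The request body would be a RequestList in the same JSON shape as the test configuration shown earlier.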
 
 
<br></br>
 
_<strong>2. Deploy the function with the AWS CLI Lambda API (recommended):</strong>_

<br>Install the <code><strong>AWS CLI</strong></code> first: [AWS Command Line Interface](https://aws.amazon.com/cli/)</br>
<br>See the AWS documentation section [Uploading a deployment package with the Lambda API](https://docs.aws.amazon.com/lambda/latest/dg/java-package.html#java-package-maven)</br>
<br>When updating a Lambda function's code, note that the code is locked when you publish a version. You can't modify the code of a published version, only the unpublished version. See the [UpdateFunctionCode documentation](https://docs.aws.amazon.com/lambda/latest/dg/API_UpdateFunctionCode.html).</br>  
 
 In the future, a command script that updates the function with the AWS CLI could run as the last step of the training pipeline.
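One way such an update step could look, sketched in Java to match the rest of the project: the helper below assembles an `aws lambda update-function-code` command and can hand it to `ProcessBuilder`. The function name and JAR path are placeholders, `UpdateFunctionCommand` is a hypothetical class, and running it assumes the AWS CLI is installed and configured with credentials.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: wraps the AWS CLI call that replaces a function's code.
final class UpdateFunctionCommand {
    // Builds the CLI arguments for uploading a local JAR as the new function code.
    static List<String> build(String functionName, String jarPath) {
        return Arrays.asList(
                "aws", "lambda", "update-function-code",
                "--function-name", functionName,
                "--zip-file", "fileb://" + jarPath);
    }

    // Runs the command, forwarding its output to the console,
    // and returns the CLI exit code (0 on success).
    static int run(String functionName, String jarPath) throws Exception {
        Process process = new ProcessBuilder(build(functionName, jarPath))
                .inheritIO()
                .start();
        return process.waitFor();
    }
}
```

For a package that has to go through S3, the same CLI command accepts `--s3-bucket` and `--s3-key` in place of `--zip-file`.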

 

<h4>FileParser project structure</h4>
<br>This Lambda function is triggered when a new input CSV file is uploaded to a specific source bucket. It fetches the CSV file, reads the input stream, and converts it to a RequestList object. It then invokes the model Lambda function with the RequestList as the invocation payload, receives the response, saves the result to another CSV file, and finally uploads that file to the destination bucket.</br>

![Project2](https://github.com/AnnaChenU/AWS-lambda-MLmodel/raw/AnnaChenU-patch-imgs/fileParser.PNG)



``` java
VolumeInvoker     // package (namespace), can be any customized name
    FileConverter    // Java class, converts an InputStream into a RequestList
    LambdaInvoker    // Java class, configures the invocation, invokes the mlModel Lambda, and gets the response JSON string
    LambdaInvokerConfiguration   // (not complete yet) Java class, 
                                 // in the future it can be used to configure different models as well as 
                                 // different API Gateways and S3 buckets (based on the prediction target)
    RequestList      // Java class, the input and also the output of the handleRequest method
    S3Reader         // class that implements com.amazonaws.services.lambda.runtime.RequestHandler  (the Lambda function)

pom.xml          // project configuration file
```
-----------------------


#### FileConverter.java
```java
package VolumeInvoker;

import com.opencsv.CSVReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/*
FileConverter converts an InputStream of CSV data into a RequestList
 */
public class FileConverter {
    private InputStream objectData;
    private RequestList requestList;

    public FileConverter(InputStream objectData) {
        this.objectData = objectData;
        // start with an empty list: RequestList's default constructor leaves requestItem null
        this.requestList = new RequestList(new ArrayList<>());
    }

    public InputStream getObjectData() {
        return objectData;
    }

    public RequestList getRequestList() {
        return requestList;
    }

    public void setObjectData(InputStream objectData) {
        this.objectData = objectData;
    }

    // convert data in csv, to RequestList
    public void convertCSV2List() {
        try {
            Map<Integer, String> header = new HashMap<>();

            InputStreamReader isr = new InputStreamReader(this.objectData, "UTF-8");
            CSVReader csvReader = new CSVReader(isr);

            String[] line = csvReader.readNext();                       // file header (column's names)
            for (int i = 0; i < line.length; i++) {
                header.put(i, line[i]);                                 // map column name to index
            }

            while ((line = csvReader.readNext()) != null) {
                Map<String, Object> row = new HashMap<>();              // {"Column_name" : "data"}

                for (int i = 0; i < line.length; i++) {
                    String key = header.get(i).substring(0,1).toUpperCase() + header.get(i).substring(1);
                    row.put(key, line[i]);                      // map data to associated column name
                }
                this.requestList.getRequestItem().add(row);
            }

            csvReader.close();

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

```
---------

#### LambdaInvoker.java
```java
package VolumeInvoker;

import com.amazonaws.AmazonClientException;
import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
import com.amazonaws.services.lambda.model.InvocationType;
import com.amazonaws.services.lambda.model.InvokeRequest;
import com.amazonaws.services.lambda.model.InvokeResult;
import com.amazonaws.services.lambda.runtime.LambdaLogger;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.event.S3EventNotification;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.google.gson.Gson;

import java.io.InputStream;
import java.nio.charset.Charset;

public class LambdaInvoker {
    private String response;
    LambdaInvokerConfiguration config = new LambdaInvokerConfiguration();
    Gson gson = new Gson();

    public LambdaInvoker() {
        response = null;
    }

    public String getResponse() {
        return response;
    }

    public void setResponse(String response) {
        this.response = response;
    }

    public LambdaInvokerConfiguration getConfig() {
        return config;
    }

    public void setConfig(LambdaInvokerConfiguration config) {
        this.config = config;
    }

    // Fetch the data from newly updated file in S3,
    // parse the data and generate RequestList object,
    // Invoke ML lambda with serialized RequestList,
    public String getResponse(S3Event s3event, LambdaLogger logger) {
        // use the first record of the S3 event
        S3EventNotification.S3EventNotificationRecord record = s3event.getRecords().get(0);                              
        String srcBucket = record.getS3().getBucket().getName();
        String srcKey = record.getS3().getObject().getKey().replace('+', ' ');       // file name

        logger.log("srcBucket: " + srcBucket);
        logger.log("srcKey (file): " + srcKey);

        if (srcBucket.equals(config.SRC_BUCKET)) {
            try {
                AmazonS3 s3Client = AmazonS3ClientBuilder.standard().build();
                
                // get object file using source bucket and srcKey name
                S3Object s3Object = s3Client.getObject(new GetObjectRequest(srcBucket, srcKey));    
                InputStream objectData = s3Object.getObjectContent();     // get content of the file

                logger.log("Lambda function s3Reader is invoked:" + s3event.toJson());
                try {
                    FileConverter converter = new FileConverter(objectData);
                    
                    // generate ML lambda function input, call convertCSV2List() method
                    converter.convertCSV2List();                          
                    
                    RequestList requestItems = converter.getRequestList();

                    logger.log("REQUEST ITEM: " + gson.toJson(requestItems));

                    // invoke ML lambda function, call invokeLambda() method
                    this.response = invokeLambda(gson.toJson(requestItems));   

                    logger.log("RESPONSE ITEM: " + response);
                } catch (Exception e) {
                    logger.log(e.getMessage());
                    e.printStackTrace();
                }
            } catch (AmazonClientException e) {
                logger.log(e.getLocalizedMessage());
                e.printStackTrace();
            }
        }
        return this.response;      // response is serialized RequestList string
    }


    /** Invoke the ML model Lambda function
     * @param payload the serialized RequestList, i.e. gson.toJson(requestItems)
     * @return a JSON string in the same format as RequestList, containing the prediction results
     */
    private String invokeLambda(String payload) {

        InvokeRequest lmbRequest = new InvokeRequest()
                .withFunctionName(config.FUNCTION_NAME)
                .withPayload(payload)
                .withInvocationType(InvocationType.RequestResponse);

        AWSLambda lambda = AWSLambdaClientBuilder.standard().build();

        InvokeResult lmbResult = lambda.invoke(lmbRequest);

        return new String(lmbResult.getPayload().array(), Charset.forName("UTF-8"));
    }
}

```
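Object keys arrive URL-encoded in S3 event notifications, which is why `getResponse` replaces `+` with a space. A slightly more general sketch also resolves percent-escapes. `S3KeyDecoder` is a hypothetical helper that uses only the standard library, and the sample key is made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: turns the URL-encoded key from an S3 event record
// back into the object's real key.
final class S3KeyDecoder {
    static String decode(String encodedKey) {
        try {
            // URLDecoder converts '+' to a space and resolves %XX escapes.
            return URLDecoder.decode(encodedKey, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("input/flight+data%282020%29.csv")); // prints input/flight data(2020).csv
    }
}
```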
------------

#### LambdaInvokerConfiguration.java
```java
package VolumeInvoker;

// This class is incomplete.
// The values are hard-coded here only for convenience.

public class LambdaInvokerConfiguration {

    public final String REGION = "US_WEST_2";

    public final String AWS_ACCESS_KEY_ID = "";

    public final String AWS_SECRET_ACCESS_KEY = "";

    public final String FUNCTION_NAME = "arn:aws:lambda:us-west-2:460908697650:function:GrossVolumePredict";

    public final String DST_BUCKET = "cargo.ml.resource";

    public final String SRC_BUCKET = "cargo.ml.resource";

    public final String API_ENDPOINT = "https://h6ofj21659.execute-api.us-west-2.amazonaws.com/Dev/GrossVolumePredict";

    public final String API_KEY = "hwPdf0Leum9pJce3g5wSZ9rmjptbEpI091hwsFDX";

    public final String volume_outputKey = "_Volume_Prediction.csv";   // object file to upload

}

```
----------------

#### RequestList.java
(Same as RequestList.java in the ML model project above.)  
```java
```
-----------

#### S3Reader.java
```java
package VolumeInvoker;

import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;


import com.amazonaws.services.lambda.runtime.LambdaLogger;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.lambda.runtime.Context;

import org.joda.time.DateTime;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class S3Reader implements RequestHandler<S3Event, String> {
    LambdaInvoker invoker = new LambdaInvoker();

    @Override
    public String handleRequest(S3Event s3event, Context context) {
        LambdaLogger logger = context.getLogger();

        // parse input file, invoke model lambda with serialized payload, call getResponse() method
        String textToUpload = invoker.getResponse(s3event, logger);
        logger.log("Response " + textToUpload);

        // getResponse() may return null if the bucket does not match or the invocation failed
        if (textToUpload == null) {
            logger.log("Nothing to upload.");
            return "No response from model Lambda";
        }

        byte[] bytes = textToUpload.getBytes(StandardCharsets.UTF_8);
        InputStream is = new ByteArrayInputStream(bytes);

        // set meta information about the text to be uploaded
        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(bytes.length);
        meta.setContentType("text/csv");

        // upload the file by specifying destination bucket name, file name, 
        // input stream having content to be uploaded along with meta information
        AmazonS3 s3Client = AmazonS3ClientBuilder.standard().build();
        s3Client.putObject(invoker.config.DST_BUCKET, 
                           DateTime.now().toString() + invoker.config.volume_outputKey,
                           is, 
                           meta); 
        logger.log("File uploaded.");

        return "200 OK";
    }
}

```
---------

### pom.xml
``` xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.rms</groupId>
    <artifactId>predictionInvoker</artifactId>
    <packaging>jar</packaging>
    <name>ModelInvoker</name>
    <version>1.0.0</version>

    <properties>
        <java.version>1.8</java.version>
        <sl4j.version>1.6.1</sl4j.version>
        <environment>local</environment>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <build.version>1.0.0</build.version>
        <maven.build.timestamp.format>yyyy-MM-dd'T'HH:mm:ss.SSS'Z'</maven.build.timestamp.format>
        <build-number>1.0.0</build-number>
    </properties>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.22.2</version>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.2</version>
                <configuration>
                    <createDependencyReducedPom>false</createDependencyReducedPom>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="com.github.edwgiz.maven_shade_plugin.log4j2_cache_transformer.PluginsCacheFileTransformer">
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
                <dependencies>
                    <dependency>
                        <groupId>com.github.edwgiz</groupId>
                        <artifactId>maven-shade-plugin.log4j2-cachefile-transformer</artifactId>
                        <version>2.13.0</version>
                    </dependency>
                </dependencies>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
            </plugin>
        </plugins>

        <extensions>
            <extension>
                <groupId>org.springframework.build</groupId>
                <artifactId>aws-maven</artifactId>
                <version>5.0.0.RELEASE</version>
            </extension>
        </extensions>
    </build>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-lambda-java-core -->
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-lambda-java-core</artifactId>
            <version>1.2.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-lambda-java-log4j2 -->
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-lambda-java-log4j2</artifactId>
            <version>1.1.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-lambda-java-events -->
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-lambda-java-events</artifactId>
            <version>2.2.7</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3 -->
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk-s3</artifactId>
            <version>1.11.779</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.opencsv/opencsv -->
        <dependency>
            <groupId>com.opencsv</groupId>
            <artifactId>opencsv</artifactId>
            <version>5.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-lambda -->
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk-lambda</artifactId>
            <version>1.11.24</version>
        </dependency>
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.8.6</version>
        </dependency>
    </dependencies>
</project>
```

<h5>Deploy the FileParser Lambda function and set up the S3 event trigger</h5>  

* Deploy it the same way as above. The execution role needs the <code>CloudWatch</code> and <code>S3FullAccess</code> policies (full access is used only for convenience... TBD).
* Add the S3 event trigger:
    * Choose <strong>Add trigger</strong>, select <strong>S3</strong>, and under <code>Event Type</code> select <strong>All object create events / PUT</strong>. Optionally customize <code>Prefix</code> and <code>Suffix</code>. The function needs the source bucket name to fetch the correct file, so the bucket chosen here must exactly match the bucket name used in your project code.
* Add an AWS Lambda destination:
    * Under <code>Destination type</code> select <strong>Lambda function</strong> and choose the destination function (here, your mlModel Lambda).
 

* Test configuration: refer to <code>s3-put Event template</code>.  
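
To get a feel for what the <code>s3-put</code> test event hands to a handler, here is a minimal Python sketch (the FileParser function itself is a Java project) that pulls the bucket name and object key out of an S3 PUT event. The helper name `extract_s3_objects` and the sample bucket and key are hypothetical; the `Records[].s3.bucket.name` and `Records[].s3.object.key` fields follow the S3 event notification format, where object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs for every record in an S3 event."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Keys in S3 notifications are URL-encoded (spaces arrive as '+').
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs


# Trimmed-down stand-in for the console's s3-put template (values are made up).
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "incoming/train+volume.csv"},
            },
        }
    ]
}

print(extract_s3_objects(sample_event))
```

Comparing the extracted bucket name with the one hard-coded in your project is a quick way to catch a mismatched trigger configuration.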

_<strong>Deploying the function with the AWS CLI (Lambda API) is also recommended:</strong>_ [Uploading a deployment package with the Lambda API](https://docs.aws.amazon.com/lambda/latest/dg/java-package.html#java-package-maven)


<h4> The relationship between the above two Lambda functions: </h4>

![Pipeline](https://github.com/AnnaChenU/AWS-lambda-MLmodel/raw/AnnaChenU-patch-imgs/2lambda.PNG)

--------------------------------------------------------------------------------------------------------------------------------

## <center>Tutorial: Build Training Pipeline With AWS</center> 

#### <strong>Training pipeline:</strong>
* 1. Upload new training input data to the designated source <code>training input S3 bucket</code>;


* 2. The <code>S3 PUT event</code> triggers the <code>ML_training_ec2Trigger</code> Lambda function;


* 3. <code><strong>ML_training_ec2Trigger</strong></code> starts the specified <code><strong>ec2 instance</strong></code> (a Linux instance that contains your training script as well as your Maven model project);
    * Once the <code>ec2 instance</code> has finished initializing, get its _IP address_
    
    * Invoke the destination <code>MLtraining</code> Lambda function with the _IP address_ as payload
    
    * After invoking the <code>MLtraining</code> Lambda, return an "invocation success" message without waiting for its response


* 4. <code><strong>ML_training_ec2Trigger</strong></code> invokes <code><strong>MLtraining</strong></code> lambda
    * (This function needs a <code>paramikoPackage</code> <strong>function layer</strong> to run <code>ssh</code> commands)
    
    * (This function also needs the <code>private key</code> of your key pair to connect to the ec2 instance over ssh, so you need to keep your key in a private <code>key S3 bucket</code>)
    
    * Once this function is invoked, it downloads your private key file and uses <code>paramiko</code> to open an <code>ssh</code> connection to your running <code>ec2</code> instance at the given <code>IP address</code>
    
    * After the connection is established, it runs a command to initialize the shell session, starts a bash script on that ec2 instance, and then detaches from the ssh session. (This keeps the script running on ec2 while the Lambda function simply returns its response and shuts down.)
    
    * The bash script <code><strong>deployModel.sh</strong></code> runs the following commands:
        * a). Download the new input training data file from the <code>training input S3 bucket</code>;
        
        * b). Run the <code>*training.py</code> file with Python, passing the new input csv file as an <code>argument</code>;
        
        * c). Once the POJO file is generated, add _"package XX;"_ to its first line (the package name is the same as your model Lambda package name);
        
        * d). Move the POJO into your <code>*/{projectFolder}/src/main/java/{packageName}/</code> folder (which is also where your Lambda function is located);
        
        * e). Run <code>mvn clean package -f */{projectFolder}/pom.xml > {AnyWhereYouLike}/mvn_log.txt </code> (This packages your Lambda function project and saves the Maven execution log; checking the log file later tells you whether the build succeeded.)
        
        * f). Run <code>aws s3 cp */{projectFolder}/target/{Lambda}.jar s3://{modelBucket}/{lambda_Name}.jar</code> (This uploads the Lambda function jar file to your <code>model S3 bucket</code>). You can also upload any other files you want here.
        
        
* 5. When <code><strong>deployModel.sh</strong></code> finishes, it uploads some log files to the <code>Log files S3 Bucket</code>; this <code>S3 PUT event</code> triggers the <code><strong>stop_ec2</strong></code> Lambda function, which stops the running ec2 instance;
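
Step c) of the bash script can be illustrated in Python. This is only a sketch of the idea (the pipeline itself uses <code>sed</code>), and the helper name `add_package_declaration` is hypothetical. Unlike a plain prepend, it skips files that already start with the declaration, so running it twice on the same POJO does no harm.

```python
def add_package_declaration(java_source, package_name):
    """Prepend 'package <name>;' unless the source already starts with it."""
    declaration = "package {};".format(package_name)
    if java_source.startswith(declaration):
        return java_source  # already patched; leave the source untouched
    return declaration + "\n" + java_source


# Tiny stand-in for a generated POJO (a real one is much longer).
pojo = "public class GrossVolumeModel_gbm {\n}\n"
patched = add_package_declaration(pojo, "GrossVolumeModel")
print(patched.splitlines()[0])
```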
        

![Pipeline](https://github.com/AnnaChenU/AWS-lambda-MLmodel/blob/AnnaChenU-patch-imgs/pipeline.PNG?raw=true)

### The following 3 Lambda functions need the <code>Python 3.6</code> runtime
--------------

### <strong><h4>ML_training_ec2Trigger Lambda</h4></strong>
* Execution role: AmazonEC2FullAccess, AmazonS3FullAccess  (TBD) 
* Trigger:
    * Event type: S3 ObjectCreated
* Destination:
    * arn: aws:lambda:us-west-2:{awsSourceAccountNumber}:function:<code>MLtraining</code>
    
``` json
{
  "Version": "2012-10-17",
  "Id": "default",
  "Statement": [
    {
      "Sid": "lambda-6dd23a55-4102-41a6-b159-9c9a08a3edda",
      "Effect": "Allow",
      "Principal": {
        "Service": "s3.amazonaws.com"
      },
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:us-west-2:460908697650:function:ML_training_ec2Trigger",
      "Condition": {
        "StringEquals": {
          "AWS:SourceAccount": "460908697650"
        },
        "ArnLike": {
          "AWS:SourceArn": "arn:aws:s3:::cargo.ml.training"
        }
      }
    }
  ]
}
```
----------------

##### ec2_trigger.py (script name)
``` python
import boto3
import json
import time

region = 'us-west-2'
instances = ['i-025d579ae9136aa40']       # your Linux EC2 instance id list

def trigger_handler(event, context):    
    ec2 = boto3.client('ec2', region_name=region)
    ec2.start_instances(InstanceIds=instances)   
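    # Wait until the instances are running; a fixed 3-second sleep may be too short.
    # (The Lambda timeout must be long enough to cover this wait.)
    ec2.get_waiter('instance_running').wait(InstanceIds=instances)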
    print ("instances: ", str(instances))
    
    # get IP addresses of EC2 instances
    client = boto3.client('ec2')
    
    # you need to define a tag {"Environment": "Dev"} before calling this
    instDict=client.describe_instances(Filters=[{'Name':'tag:Environment','Values':['Dev']}]) 
    print("Instances Dictionary: ", instDict)
    hostList=[]

    for r in instDict['Reservations']:
        for inst in r['Instances']:
            print ("Instance Details: ", inst)
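            ipaddress = inst.get('PublicIpAddress')
            if ipaddress is None:
                # stopped or still-pending instances have no public IP yet
                print("No public IP address for instance " + inst['InstanceId'])
                continue
            hostList.append(ipaddress)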
            
    print("Host List: ", hostList)

    #Invoke worker function for each IP address
    print("Invoking......")
    client = boto3.client('lambda')
    for host in hostList:
        print ("Invoking worker_function on " + host)
        payload=json.dumps({"IP": host})    # serialize with json instead of string concatenation
        print(payload)
        
        invokeResponse=client.invoke(
            FunctionName='MLtraining',      # destination function 
            InvocationType='Event',
            LogType='Tail',
            Payload=payload                 # IP address
        )
        print ("response: ", invokeResponse)

    return{
        'message' : "Trigger function finished "
    }
```
------------------
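
The IP collection and payload steps of the trigger can be tried locally without any AWS access. The following is a minimal sketch; the helper names `collect_public_ips` and `build_payload` and the hand-written response are hypothetical, while the `Reservations` / `Instances` / `PublicIpAddress` keys are the ones the script above reads from `describe_instances`.

```python
import json


def collect_public_ips(describe_response):
    """Return the public IPs of all instances that currently have one."""
    hosts = []
    for reservation in describe_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            ip_address = instance.get("PublicIpAddress")
            if ip_address:  # stopped or pending instances have no public IP
                hosts.append(ip_address)
    return hosts


def build_payload(host):
    """Serialize the payload that the MLtraining function reads as event['IP']."""
    return json.dumps({"IP": host})


# Hand-written stand-in for a describe_instances response (addresses are made up).
fake_response = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-aaa", "PublicIpAddress": "203.0.113.10"}]},
        {"Instances": [{"InstanceId": "i-bbb"}]},  # no public IP yet
    ]
}

for host in collect_public_ips(fake_response):
    print(build_payload(host))
```

Keeping this logic in small functions makes it possible to test the trigger's behavior for instances without a public IP before deploying it.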

### <strong><h4>MLtraining Lambda</h4></strong>

* Execution role: AmazonEC2FullAccess, AmazonS3FullAccess  (TBD)
* Layers:
    * paramikoPackage (use for ssh command)
* Trigger:
    * None
* Destination:
    * None
    
For how to create the <code>paramiko</code> function layer, see: https://www.linuxschoolonline.com/how-to-solve-unable-to-load-module-when-using-paramiko-package-in-a-lambda-function/  
I chose to install <code>paramiko</code> on my Amazon Linux EC2 instance, package it there, and deploy it from the instance with the AWS CLI, so you also need to install the AWS CLI on your EC2 instance first. (https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html#cliv2-linux-install)  
You can also consider using a paramiko layer that someone else has already deployed: https://github.com/jetbridge/paramiko-lambda-layer (Region "us-west-2")


--------------

##### TrainingInvoker.py (script name)
``` python
import boto3
import paramiko
import sys

def ec2_training_invoker(event, context):
    print (event)  # event here is the IP address of running ec2
    s3_client = boto3.client('s3')
    # download private key file from secure S3 bucket
    s3_client.download_file('ec2.invoker', 'mykey.pem', '/tmp/mykey.pem')  # (bucketName, fileName, downloadTargetName)

    k = paramiko.RSAKey.from_private_key_file("/tmp/mykey.pem")
    c = paramiko.SSHClient()
    c.set_missing_host_key_policy(paramiko.AutoAddPolicy())

    host=event['IP']
    print("Connecting to " + host)
    c.connect( hostname = host, username = "ec2-user", pkey = k )
    print("Connected to " + host)

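    commands = [
        # aws s3 cp s3://{scriptBucket}/deploy.sh /Application/cargoRM/dev/deployModel.sh
        # Each exec_command call runs in its own shell, so source ~/.bashrc on the same command line.
        # Output is redirected (deployModel.out is just an example name) so that the reads below
        # do not wait for the script to finish.
        "source ~/.bashrc && nohup ./Application/cargoRM/dev/deployModel.sh > deployModel.out 2>&1 &"
        ]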

    for command in commands:
        print("Executing {}".format(command))
        stdin , stdout, stderr = c.exec_command(command)
        print(stdout.read())
        print(stderr.read())

    # the dict must start on the same line as `return`, otherwise the function returns None
    return {
        'message' : "Script execution completed. See Cloudwatch logs for complete output"
    }
    
```
-----------------
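
Whether the bash script keeps running after the Lambda function returns depends on how the remote command line is written. The sketch below shows one common way to build such a command: run the script under <code>nohup</code>, send its output to a file, and background it with <code>&</code>. The helper name `build_detached_command` and the log file name `deployModel.out` are hypothetical; only the script path comes from this tutorial.

```python
import shlex


def build_detached_command(script_path, log_path):
    """Build a shell command that runs a script in the background via nohup."""
    return "nohup {} > {} 2>&1 &".format(
        shlex.quote(script_path), shlex.quote(log_path)
    )


print(build_detached_command("./Application/cargoRM/dev/deployModel.sh", "deployModel.out"))
```

Quoting with `shlex.quote` keeps the command valid even if a path contains spaces.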

### <strong><h4>stop_ec2 Lambda</h4></strong>

* Execution role: AmazonEC2FullAccess, AmazonS3FullAccess  (TBD)

* Trigger:
    * arn:aws:s3:::training.log ----ObjectCreatedByPut
* Destination:
    * None

-----------

##### stop_ec2_handler.py
``` python
import json
import boto3 

instances = ['i-025d579ae9136aa40']     # EC2 instance id list

def lambda_handler(event, context):
    ec2 = boto3.client('ec2', region_name='us-west-2')    
    ec2.stop_instances(InstanceIds=instances)    
    print ('stopped instances: ' + str(instances))
    return {
        'statusCode': 200,
        'body': json.dumps('Shut down running EC2!')
    }
```
----------

## Set up Amazon Linux EC2 instance

#### 1. Launch Amazon Linux EC2 instance:
Refer to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/launching-instance.html  
In Step 5: _"Add Tags"_  
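* Add a tag with key <code>Environment</code> and value <code>Dev</code>; <code>ec2_trigger.py</code> finds the instance with the filter <code>{'Name':'tag:Environment','Values':['Dev']}</code>.  (You can define the tag now or edit it after launching.)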

In Step 6: _"Configure Security Group"_  
* ![inbound](https://github.com/AnnaChenU/AWS-lambda-MLmodel/raw/AnnaChenU-patch-imgs/inbound.PNG)

In Step 7: _"Review Instance Launch and Select Key Pair"_
* Remember to save your private key in a safe location. 

#### 2. Connect to Linux instance
Refer to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstances.html  


#### 3. Configure EC2 instance
Details can be found at: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Configure_Instance.html

* <strong>a).</strong> Install <code><strong>python</strong></code> 3.6 on your EC2 (python 3+ needed, not python 2)
    ``` bash
    [ec2-user@ip-****~]$ sudo yum install python36 -y
    ```
    If you need to make a symbolic link, refer to: https://sixbyseven.dev/how-to-install-python-3-x-on-amazon-ec2-instance/  
    
    Add the executable path, <code>~/.local/bin</code>, to your <code>PATH</code> variable. First, find your shell's profile script in your user folder:
    ``` bash
    [ec2-user@ip-****~]$ ls -a ~
    ```  
    
    Add an export command to your profile script (<code>.bash_profile</code>). The following example adds the path represented by <code>LOCAL_PATH</code> to the current <code>PATH</code> variable.
    ```shell
    export PATH=LOCAL_PATH:$PATH
    ```  
    
    Load the profile script described in the first step into your current session:
    ```shell
    [ec2-user@ip-****~]$ source ~/{PROFILE_SCRIPT}
    ```
    
    Install and use virtualenv <strong>(optional)</strong>:
    ``` bash
    [ec2-user@ip-****~]$ sudo pip3 install virtualenv
    ```  
    
    Create virtual environment:
    ```shell
    [ec2-user@ip-****~]$ virtualenv {your_project_name}
    ```
    
    Activate environment:  
    ```shell
    [ec2-user@ip-****~]$ source {your_project_name}/bin/activate
    ```
    
    To activate the virtual environment automatically when you log in, add it to the <code>~/.bashrc</code> file:
    
    ```shell
    [ec2-user@ip-****~]$ echo "source ${HOME}/{your_project_name}/bin/activate" >> ${HOME}/.bashrc
    ```
    
    Source the <code>~/.bashrc</code> file in your home directory to reload your shell environment. Reloading automatically activates your virtual environment, and the prompt changes to show the environment name. This change also applies to any future SSH sessions.
    
    ```shell
    [ec2-user@ip-****~]$ source ~/.bashrc
    ```
    
    To deactivate your environment:
    ```shell
    [ec2-user@ip-****~]$ deactivate
    ```

* Install the required Python packages: 
    Create your requirements.txt file first:
    ``` shell
    [ec2-user@ip-****~]$ vim requirements.txt
    ```  
    
    Insert the following into your <code>requirements.txt</code> (versions are optional):
    ```
    appdirs==1.4.4
    bcrypt==3.1.7
    boto3==1.13.24
    botocore==1.16.24
    certifi==2020.4.5.1
    cffi==1.14.0
    chardet==3.0.4
    colorama==0.4.3
    cryptography==2.9.2
    distlib==0.3.0
    docutils==0.15.2
    filelock==3.0.12
    future==0.18.2
    h2o==3.30.0.3
    idna==2.9
    importlib-metadata==1.6.0
    importlib-resources==1.5.0
    jmespath==0.10.0
    joblib==0.15.1
    numpy==1.18.4
    paramiko==2.7.1
    pycparser==2.20
    PyNaCl==1.4.0
    python-dateutil==2.8.1
    requests==2.23.0
    s3transfer==0.3.3
    scikit-learn==0.23.1
    scipy==1.4.1
    six==1.15.0
    tabulate==0.8.7
    threadpoolctl==2.0.0
    urllib3==1.25.9
    virtualenv==20.0.21
    zipp==3.1.0
    ```
    
    Install all packages with pip:
    
    ``` bash
    [ec2-user@ip-****~]$ pip3 install -r requirements.txt 
    ```

* <strong>b).</strong> Install <strong>JDK</strong> and <strong>Maven</strong> on EC2:

    Refer to https://docs.aws.amazon.com/neptune/latest/userguide/iam-auth-connect-prerq.html
    
    (The exact installation steps were not recorded.)
    
##### .bash_profile
```bash
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs

PATH=$PATH:$HOME/.local/bin:$HOME/bin

# java env variable

JAVA_HOME="/usr/java/jdk-14.0.1"

PATH=$JAVA_HOME/bin:$PATH

export PATH

```  
---------

##### .bashrc

```bash
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

# User specific aliases and functions

export JAVA_HOME="/usr/lib/jvm/java-1.8.0"

PATH=$JAVA_HOME/bin:$PATH

export M2_HOME=/usr/share/apache-maven/apache-maven-3.6.3
export M2=$M2_HOME/bin
export MAVEN_OPTS="-Xmx1048m -Xms256m -XX:MaxPermSize=312M"
export PATH=$M2:$PATH

```
---------
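
Because the training script is started over a non-interactive ssh session, it can help to confirm that the tools and variables configured above are actually visible before running the pipeline. The following is a small sketch using only the Python standard library; the helper name `missing_build_requirements` and its default lists are assumptions chosen to match the profile files above.

```python
import os
import shutil


def missing_build_requirements(tools=("java", "mvn", "aws"),
                               variables=("JAVA_HOME", "M2_HOME")):
    """Return the executables not on PATH and the environment variables not set."""
    missing_tools = [tool for tool in tools if shutil.which(tool) is None]
    missing_vars = [name for name in variables if not os.environ.get(name)]
    return missing_tools, missing_vars


tools, variables = missing_build_requirements()
print("missing executables:", tools)
print("missing variables:", variables)
```

Two empty lists mean the session can see Java, Maven, and the AWS CLI.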

* <strong>c).</strong> Create the <code>GrossVolumeTraining.py</code> script:
    ``` bash
    [ec2-user@ip-****~]$ cd ./Application/cargoRM/dev/
    [ec2-user@ip-**** dev]$ vim GrossVolumeTraining.py
    ```
    ---------

#### GrossVolumeTraining.py
``` python
try:
    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    import random
    import sys
    import json
    import boto3
    import logging
    from botocore.exceptions import ClientError
    from botocore.exceptions import NoCredentialsError
    from datetime import date

except Exception as e:
    print("some modules missing {}".format(e))

for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)
logging.basicConfig(filename="./training_logs.log",
                    level=logging.DEBUG,
                    format='%(asctime)s:%(levelname)s:%(message)s',
                    datefmt='%m/%d/%Y %I:%M:%S %p')
logger = logging.getLogger()

ACCESS_KID = ""
ACCESS_KEY = ""


class GrossVolumeTraining(object):
    def __init__(self, file_path):
        
        # start H2O and generated log files, set up the max memory size
        h2o.init(log_dir='./h2o_logs', log_level='DEBUG', max_mem_size='1g')  

        h2o.remove_all()
        
        self.split_seed = random.randrange(sys.maxsize)

        # It would be better to initialize each training process from a configuration module and
        # configuration files, which would improve code reusability
        
        self.file_path = file_path  # input file path
        
        self.y = 'FinalVolume'  # response column
        
        self.col_headers = ['Id', 'FinalWeight', 'FinalVolume', 'BookedWeight', 'BookedVolume', 'DepartureDayOfYear',
                            'ArrivalDayOfYear', 'DepartureWeek', 'ArrivalWeek', 'DepartureWeekDay', 'ArrivalWeekDay',
                            'Origin', 'Destination', 'FlightNumber', 'Suffix', 'Equipment', 'EquipmentInHouse',
                            'TransportMode', 'Distance', 'CaptureWeightCapacity', 'CaptureVolumeCapacity', 'LegNumber',
                            'Piecese', 'ChargeableWeight', 'Ndo', 'StatusCode', 'PartShipmentIndicator', 'NetCharge',
                            'AllotmentCode', 'SpaceAllocationCode', 'Agent', 'POS', 'ProductCode', 'Density', 'Days']

        self.col_types = ['numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
                          'numeric', 'numeric', 'numeric', 'enum', 'enum', 'enum', 'enum', 'enum', 'enum', 'enum',
                          'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'enum',
                          'numeric', 'numeric', 'enum', 'enum', 'enum', 'enum', 'enum', 'numeric', 'numeric']

        self.gbm_params = {'model_id': 'GrossVolumeModel_gbm',    # POJO target folder
                           'nfolds': 0,
                           'keep_cross_validation_models': False,
                           'keep_cross_validation_predictions': True,
                           'keep_cross_validation_fold_assignment': False,
                           'score_each_iteration': False,
                           'score_tree_interval': 5,
                           'fold_assignment': 'AUTO',
                           'fold_column': None,
                           'response_column': 'FinalVolume',
                           'ignored_columns': ['C1',
                                               'TransportMode',
                                               'ProductCode',
                                               'SpaceAllocationCode'],
                           'ignore_const_cols': True,
                           'offset_column': None,
                           'weights_column': None,
                           'balance_classes': False,
                           'class_sampling_factors': None,
                           'max_after_balance_size': 5.0,
                           'max_confusion_matrix_size': 20,
                           'max_hit_ratio_k': 0,
                           'min_rows': 1.0,
                           'nbins': 20,
                           'nbins_top_level': 1024,
                           'nbins_cats': 1024,
                           'r2_stopping': 1.7976931348623157e+308,
                           'stopping_rounds': 3,
                           'stopping_metric': 'RMSE',
                           'stopping_tolerance': 0.0010019164953871014,
                           'max_runtime_secs': 658812317797974.0,
                           'seed': 6785859302972478466,
                           'build_tree_one_node': False,
                           'learn_rate': 0.05,
                           'learn_rate_annealing': 1.0,
                           'distribution': 'gaussian',
                           'quantile_alpha': 0.5,
                           'tweedie_power': 1.5,
                           'huber_alpha': 0.9,
                           'checkpoint': None,
                           'sample_rate': 0.7,
                           'sample_rate_per_class': None,
                           'col_sample_rate_change_per_level': 1.0,
                           'col_sample_rate_per_tree': 0.7,
                           'min_split_improvement': 1e-05,
                           'histogram_type': 'AUTO',
                           'max_abs_leafnode_pred': 1.7976931348623157e+308,
                           'pred_noise_bandwidth': 0.0,
                           'categorical_encoding': 'AUTO',
                           'calibrate_model': False,
                           'calibration_frame': None,
                           'custom_metric_func': None,
                           'custom_distribution_func': None,
                           'export_checkpoints_dir': None,
                           'monotone_constraints': None,
                           'check_constant_response': True,
                           'max_depth': 6,
                           'ntrees': 1235}

    def get_data(self):     # another way is getting input file directly from S3 bucket using boto3
        df_raw = h2o.import_file(self.file_path, parse=False)
        setup = h2o.parse_setup(df_raw,
                                destination_frame="training.hex",
                                header=1,
                                column_names=self.col_headers,
                                column_types=self.col_types)
        df = h2o.parse_raw(h2o.parse_setup(df_raw),
                           id='training.csv',
                           first_line_is_header=1)

        logger.info("Input dataframe: %s", df)   # logging needs a %s placeholder for the extra argument
        return df
    

    def split_dataframe(self, df, ratios=[.9], seed=None):   
        train, test = df.split_frame(ratios=ratios, seed=seed)
        return train, test

    def train_gbm(self):
        dt_all = self.get_data()

        dt_train, dt_test = self.split_dataframe(dt_all, seed=self.split_seed)

        y = self.y  # response column

        final_gbm = H2OGradientBoostingEstimator(**self.gbm_params)

        logger.debug("# Start training......")
        final_gbm.train(y=y, training_frame=dt_train, validation_frame=dt_test)

        logger.debug('# Downloading model as pojo file......')
        final_gbm.download_pojo(self.gbm_params['model_id'])  # POJO target folder

        logger.info("Final gbm model: %s", final_gbm)

        logger.debug("# Training done.")

        #self.upload_model()

    def upload_model(self, dst_bucket='cargo.volume.ml.model', file_name=None):    # I uploaded by AWS CLI in bash script
        logger.debug("# Uploading model to %s bucket......", dst_bucket)

        if file_name is None:
            file_name = self.gbm_params['model_id'] + date.today().strftime('%m-%d-%Y')
        try:
            boto3.setup_default_session(region_name='us-west-2')
            s3_client = boto3.client('s3', aws_access_key_id=ACCESS_KID, aws_secret_access_key=ACCESS_KEY)
            response = s3_client.upload_file(self.gbm_params['model_id'], dst_bucket, file_name)
        except (ClientError, NoCredentialsError) as ex:   # a tuple catches both exception types
            logger.error(ex)
            return False
        return True


def main():
    # validate the command-line argument before starting H2O
    if len(sys.argv) < 2:
        raise ValueError("VALUE ERROR: Missing training dataset, which must be a csv file.")
    obj = GrossVolumeTraining(sys.argv[1])
    obj.train_gbm()
    h2o.shutdown()


if __name__ == "__main__":
    main()

```
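
The comment in `__init__` suggests moving the training settings into configuration files. A minimal sketch of that idea follows, assuming a JSON file that overrides a few defaults; the helper name `load_gbm_params`, the file name `gbm_config.json`, and the shortened default dictionary are hypothetical and only stand in for the full `gbm_params` above.

```python
import json
import os
import tempfile

# Shortened stand-in for the full gbm_params dictionary.
DEFAULT_GBM_PARAMS = {"model_id": "GrossVolumeModel_gbm", "learn_rate": 0.05, "max_depth": 6}


def load_gbm_params(config_path, defaults=DEFAULT_GBM_PARAMS):
    """Merge overrides from a JSON file into a copy of the default parameters."""
    params = dict(defaults)
    with open(config_path) as config_file:
        overrides = json.load(config_file)
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError("Unknown GBM parameters: {}".format(sorted(unknown)))
    params.update(overrides)
    return params


# Demonstration with a throwaway config file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "gbm_config.json")
    with open(path, "w") as handle:
        json.dump({"max_depth": 8}, handle)
    print(load_gbm_params(path))
```

Rejecting unknown keys catches typos in the config file before a long training run starts.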

--------------

<strong>
    This <code>Application/cargoRM/dev/deploy.sh</code> script still needs improvement.  
    
A better strategy is to keep <code>deploy.sh</code> (and perhaps <code>training.py</code> as well) in an S3 bucket and download it when your "MLtraining" Lambda function runs.  
    
The model deployment command also needs modification. Refer to "Uploading a deployment package with the Lambda API" above.
</strong>
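
As a rough sketch of what such a deployment command could look like, the snippet below assembles an <code>aws lambda update-function-code</code> call that points the function at a jar already uploaded to S3. The helper name `build_update_function_command` is hypothetical, the function name `mlModel` is assumed from the earlier deployment section, and the bucket and key are the ones used in this tutorial's script. The snippet only prints the command; running it requires a configured AWS CLI.

```python
def build_update_function_command(function_name, bucket, key):
    """Return the argument list for 'aws lambda update-function-code' from S3."""
    return [
        "aws", "lambda", "update-function-code",
        "--function-name", function_name,
        "--s3-bucket", bucket,
        "--s3-key", key,
    ]


command = build_update_function_command(
    "mlModel", "cargo.volume.ml.model", "VolumePredictModel/mlModel-1.0.0.1.jar"
)
print(" ".join(command))
# To actually run it on a machine with the AWS CLI configured:
# import subprocess; subprocess.run(command, check=True)
```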

#### deploy.sh
``` bash
#!/bin/bash

# Download input file
aws s3 cp s3://{DataBucket}/train_volume.csv Application/cargoRM/dev/data/train_volume.csv &&

# Training and updating POJO model
python Application/cargoRM/dev/GrossVolumeTraining.py Application/cargoRM/dev/data/train_volume.csv &&

# Move POJO into maven src folder.
sed -i '1s/^/package GrossVolumeModel;/' Application/cargoRM/dev/GrossVolumeModel_gbm/GrossVolumeModel_gbm.java &&
cp Application/cargoRM/dev/GrossVolumeModel_gbm/GrossVolumeModel_gbm.java Application/cargoRM/dev/mlModel/src/main/java/GrossVolumeModel/ &&

# Maven packaging whole model
mvn clean package -f Application/cargoRM/dev/mlModel/pom.xml > Application/cargoRM/dev/mvn_log.txt &&

# Uploading model to S3
aws s3 cp Application/cargoRM/dev/mlModel/target/mlModel-1.0.0.jar s3://cargo.volume.ml.model/VolumePredictModel/mlModel-1.0.0.1.jar  &&

# Uploading training logs to S3 (zip needs -r to include the directory contents)
zip -r Application/cargoRM/dev/h2o_logs.zip Application/cargoRM/dev/h2o_logs
aws s3 cp Application/cargoRM/dev/h2o_logs.zip s3://{trainingLogBucket}/h2o_logs.zip
aws s3 cp Application/cargoRM/dev/mvn_log.txt s3://{trainingLogBucket}/mvn_log.txt
# GrossVolumeTraining.py writes its log to training_logs.log
aws s3 cp Application/cargoRM/dev/training_logs.log s3://{trainingLogBucket}/training_logs.log
```
------------
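
Since the script saves the Maven output to <code>mvn_log.txt</code>, a short check can tell whether the build succeeded. This is a minimal sketch, assuming the usual `BUILD SUCCESS` / `BUILD FAILURE` summary lines that Maven prints at the end of a run; the helper name `maven_build_succeeded` and the two sample logs are made up for illustration.

```python
def maven_build_succeeded(log_text):
    """Return True if the Maven log reports BUILD SUCCESS and no BUILD FAILURE."""
    return "BUILD SUCCESS" in log_text and "BUILD FAILURE" not in log_text


# Hand-written log excerpts, not real build output.
good_log = "[INFO] BUILD SUCCESS\n[INFO] Total time: 12 s\n"
bad_log = "[INFO] BUILD FAILURE\n[ERROR] Failed to execute goal\n"
print(maven_build_succeeded(good_log), maven_build_succeeded(bad_log))
```

In practice you would read the text from the downloaded `mvn_log.txt` and pass it to the function.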

## <center>Tutorial: ML Model Deployment on Valohai</center> 

#### 1. Install and configure Docker:
https://docs.docker.com/get-started/

#### 2. Build docker image:
([A good tutorial about how to build a machine learning docker image](https://towardsdatascience.com/build-a-docker-container-with-your-machine-learning-model-3cf906f5e07e))

Docker official document about how to build and run image: https://docs.docker.com/get-started/part2/

Build your image from the app root folder. (We only need a runtime environment to run our training script.)
```
- app-name
     |-- requirements.txt  
     |-- Dockerfile
```

##### Dockerfile
``` dockerfile
FROM python:3.6-stretch
MAINTAINER Hamster Potato

# install build utilities
RUN apt-get update && \
	apt-get install -y gcc make apt-transport-https ca-certificates build-essential &&\
	apt-get install -y openjdk-8-jdk && \
	apt-get install -y ant && \
	apt-get clean;

# Fix certificate issues
RUN apt-get update && \
	apt-get install -y ca-certificates-java && \
	apt-get clean && \
	update-ca-certificates -f;

# Setup JAVA_HOME -- useful for docker commandline
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
RUN export JAVA_HOME

# check our python environment
RUN python3 --version
RUN pip3 --version

# set the working directory for containers
WORKDIR  /usr/src/CargoML-img

# Installing python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```

##### requirements.txt

```text
requests==2.22.0
colorama==0.4.3
future==0.18.2
tabulate >=0.7.5
pandas==1.0.1
numpy==1.18.1
boto3==1.12.39
botocore==1.15.39
h2o==3.30.0.3
```

#### 3. Push Docker image onto Docker Hub:
https://docs.docker.com/get-started/part3/

Push the above image:   (it can be found at: https://hub.docker.com/repository/docker/chenuuu5/cargo-ml-runtime)
```shell
docker push chenuuu5/cargo-ml-runtime:{tagname}
```

#### 4. Install and configure Valohai:
https://valohai.com/get-started/?tab=local

#### 5. Create execution:

* <strong>Create Execution</strong>

    1. <strong>Settings</strong> --> <strong>Repository</strong> 
    
    Link to your GitHub repository "https://github.com/AnnaChenU/AWS-lambda-MLmodel.git "
    
    This repository contains the training script (<code>GrossVolumeTraining.py</code>), the Valohai configuration file (<code>valohai.yaml</code>), and the model package (<code>{root_folder}</code>).  
    
    2. <strong>Data</strong> --> <strong>Upload</strong> 
    
    Upload your training data. 
    
    After uploading the csv file, copy its <code>datum url</code> and add it to your <code>valohai.yml</code> file.
    
    3. <strong>Execution</strong> --> <strong>Fetch repository</strong> 
    
    Configure the execution based on the following <code>valohai.yml</code> file.
    
    (Step 2, "package maven project", does not work correctly yet; step 1, "Execute ml", runs without problems.)
    
##### (Details about valohai configuration: https://docs.valohai.com/valohai-yaml/index.html)
    
##### valohai.yml
```yml
- step:
    name: Execute ml
    image: chenuuu5/cargo-ml-runtime:latest
    command: python3 GrossVolumeTraining.py
    inputs:
      - name: training_sample_input
        default: datum://0172fe4a-987e-bbce-1651-b19a23d28f79
    environment: aws-eu-west-1-g2-2xlarge


- step:
    name: package maven project
    image: maven
    command:
      - cp ${VH_INPUTS_DIR}/model/GrossVolumeModel_gbm.java ${VH_REPOSITORY_DIR}/src/main/java/GrossVolumeModel/GrossVolumeModel_gbm.java 
      - sed -i '1s/^/package GrossVolumeModel;/' ${VH_REPOSITORY_DIR}/src/main/java/GrossVolumeModel/GrossVolumeModel_gbm.java
      - mvn clean package -f ${VH_REPOSITORY_DIR}/pom.xml > ${VH_OUTPUTS_DIR}/mvn_log.txt
      - cp ${VH_REPOSITORY_DIR}/target/mlModel-1.0.0.jar ${VH_OUTPUTS_DIR}/mlModel-1.0.0.jar
    inputs:
        - name: model
          default: datum://0172fe69-525e-ba45-fb26-af80212c92b0
    environment: aws-eu-west-1-g2-2xlarge
       
- pipeline:
    name: Training pipeline
    nodes:
      - name: execute-ml
        type: execution
        step: Execute ml
      - name: package-jar
        type: execution
        step: package maven project
    edges:
      - [execute-ml.output.*, package-jar.input.model]
```
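
Inside an execution, a script has to locate the file that the <code>training_sample_input</code> input points to. The sketch below shows one way to do that, assuming Valohai places each input in a sub-directory of <code>VH_INPUTS_DIR</code> named after the input (check the Valohai documentation for your setup). The helper name `find_input_csv` is hypothetical, and the demonstration uses a temporary directory in place of the real inputs directory.

```python
import glob
import os
import tempfile


def find_input_csv(input_name, inputs_dir=None):
    """Return the first csv file under <inputs_dir>/<input_name>/, or None."""
    if inputs_dir is None:
        inputs_dir = os.getenv("VH_INPUTS_DIR", ".inputs/")
    matches = sorted(glob.glob(os.path.join(inputs_dir, input_name, "*.csv")))
    return matches[0] if matches else None


# Local demonstration with a temporary directory standing in for VH_INPUTS_DIR.
with tempfile.TemporaryDirectory() as fake_inputs:
    input_dir = os.path.join(fake_inputs, "training_sample_input")
    os.makedirs(input_dir)
    open(os.path.join(input_dir, "train_volume.csv"), "w").close()
    print(find_input_csv("training_sample_input", fake_inputs))
```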

------------

#### GrossVolumeTraining.py

(To run on Valohai, I made some modifications to my training script. The differences are in the "get_data" and "download_pojo" parts.)
``` python
try:
    import os
    import os.path
    from os import path
    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    import random
    import sys
    import json
    import boto3
    from botocore.exceptions import ClientError
    from botocore.exceptions import NoCredentialsError
   

except Exception as e:
    print("some modules missing {}".format(e))

INPUT_PATH = os.getenv('VH_INPUTS_DIR', '.inputs/')    # valohai environment variable
OUTPUT_PATH = os.getenv('VH_OUTPUTS_DIR', '.outputs/')


class GrossVolumeTraining(object):
    def __init__(self):
        h2o.init(max_mem_size='1G')
        h2o.remove_all()
        # self.file_path = file_path
        self.split_seed = random.randrange(sys.maxsize)
        self.y = 'FinalVolume'
        self.col_headers = ['C1', 'FinalWeight', 'FinalVolume', 'BookedWeight', 'BookedVolume', 'DepartureDayOfYear',
                            'ArrivalDayOfYear', 'DepartureWeek', 'ArrivalWeek', 'DepartureWeekDay', 'ArrivalWeekDay',
                            'Origin', 'Destination', 'FlightNumber', 'Suffix', 'Equipment', 'EquipmentInHouse',
                            'TransportMode', 'Distance', 'CaptureWeightCapacity', 'CaptureVolumeCapacity', 'LegNumber',
                            'Piecese', 'ChargeableWeight', 'Ndo', 'StatusCode', 'PartShipmentIndicator', 'NetCharge',
                            'AllotmentCode', 'SpaceAllocationCode', 'Agent', 'POS', 'ProductCode', 'Density', 'Days']
        self.col_types = ['numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
                          'numeric', 'numeric', 'numeric', 'enum', 'enum', 'enum', 'enum', 'enum', 'enum', 'enum',
                          'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'enum',
                          'numeric', 'numeric', 'enum', 'enum', 'enum', 'enum', 'enum', 'numeric', 'numeric']
        self.gbm_params = {'model_id': 'GrossVolumeModel_gbm',
                           'nfolds': 0,
                           'keep_cross_validation_models': False,
                           'keep_cross_validation_predictions': True,
                           'keep_cross_validation_fold_assignment': False,
                           'score_each_iteration': False,
                           'score_tree_interval': 5,
                           'fold_assignment': 'AUTO',
                           'fold_column': None,
                           'response_column': 'FinalVolume',
                           'ignored_columns': ['C1',
                                               'TransportMode',
                                               'ProductCode',
                                               'SpaceAllocationCode'],
                           'ignore_const_cols': True,
                           'offset_column': None,
                           'weights_column': None,
                           'balance_classes': False,
                           'class_sampling_factors': None,
                           'max_after_balance_size': 5.0,
                           'max_confusion_matrix_size': 20,
                           'max_hit_ratio_k': 0,
                           'min_rows': 1.0,
                           'nbins': 20,
                           'nbins_top_level': 1024,
                           'nbins_cats': 1024,
                           'r2_stopping': 1.7976931348623157e+308,
                           'stopping_rounds': 3,
                           'stopping_metric': 'RMSE',
                           'stopping_tolerance': 0.0010019164953871014,
                           'max_runtime_secs': 658812317797974.0,
                           'seed': 6785859302972478466,
                           'build_tree_one_node': False,
                           'learn_rate': 0.05,
                           'learn_rate_annealing': 1.0,
                           'distribution': 'gaussian',
                           'quantile_alpha': 0.5,
                           'tweedie_power': 1.5,
                           'huber_alpha': 0.9,
                           'checkpoint': None,
                           'sample_rate': 0.7,
                           'sample_rate_per_class': None,
                           'col_sample_rate_change_per_level': 1.0,
                           'col_sample_rate_per_tree': 0.7,
                           'min_split_improvement': 1e-05,
                           'histogram_type': 'AUTO',
                           'max_abs_leafnode_pred': 1.7976931348623157e+308,
                           'pred_noise_bandwidth': 0.0,
                           'categorical_encoding': 'AUTO',
                           'calibrate_model': False,
                           'calibration_frame': None,
                           'custom_metric_func': None,
                           'custom_distribution_func': None,
                           'export_checkpoints_dir': None,
                           'monotone_constraints': None,
                           'check_constant_response': True,
                           'max_depth': 6,
                           'ntrees': 1235}

        
    # Unverified: whether the input CSV can be downloaded directly with boto3 and stored in {VH_INPUTS_DIR}.
    def get_data(self, src_bucket="cargo.ml.training", obj_name="training_sample.csv"):
        # boto3.setup_default_session(region_name='us-west-2')
        # s3_client = boto3.client('s3', aws_access_key_id=ACCESS_KID, aws_secret_access_key=ACCESS_KEY)
        input_path = os.path.join(INPUT_PATH, 'training_sample_input/training_sample.csv')
        # s3_client.download_file(src_bucket, obj_name, input_path)

        df_raw = h2o.import_file(input_path, parse=False)
        setup = h2o.parse_setup(df_raw,
                                destination_frame="training.hex",
                                header=1,
                                column_names=self.col_headers,
                                column_types=self.col_types)
        # Parse with the customized setup so the column names and types defined above are applied.
        df = h2o.parse_raw(setup,
                           id='training.csv',
                           first_line_is_header=1)
        print("Input dataframe: ", df)
        return df

    def split_dataframe(self, df, ratios=None, seed=None):
        # Avoid a mutable default argument; fall back to a 90/10 split.
        if ratios is None:
            ratios = [.9]
        train, test = df.split_frame(ratios=ratios, seed=seed)
        return train, test

    def train_gbm(self):
        dt_all = self.get_data()
        dt_train, dt_test = self.split_dataframe(dt_all, seed=self.split_seed)
        y = self.y  # response column
        final_gbm = H2OGradientBoostingEstimator(**self.gbm_params)
        print("# Start training......")
        final_gbm.train(y=y, training_frame=dt_train, validation_frame=dt_test)
        print('# Downloading model as POJO file......')
        try:
            print("Downloading POJO.")
            # H2O documents the `path` argument as a directory, so the file name checked below may need adjusting.
            final_gbm.download_pojo(os.path.join(OUTPUT_PATH, 'gbm_model'))
            if os.path.exists(os.path.join(OUTPUT_PATH, 'gbm_model.java')):
                print("POJO downloaded successfully.")
            else:
                print("POJO file not found.")
        except Exception as ex:
            print("POJO downloading error: {}".format(ex))
        print(final_gbm)
        print("# Training done.")


def main():
    print("Start!")
    obj = GrossVolumeTraining()
    obj.train_gbm()
    # Shut down the local H2O cluster once training has finished.
    h2o.cluster().shutdown()


if __name__ == "__main__":
    main()
```
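
The `col_headers` and `col_types` lists in `__init__` have to stay aligned position by position, and a mismatch is easy to introduce when a column is added or removed. The following is a minimal sketch of a check that could run before the lists are handed to H2O. It assumes that only the `numeric` and `enum` types are used, as in the lists above; `pair_columns` is a hypothetical helper introduced here for illustration and is not part of H2O.

```python
def pair_columns(headers, types, allowed=("numeric", "enum")):
    """Return a {column name: column type} dict after basic sanity checks."""
    # Both lists must describe the same number of columns.
    if len(headers) != len(types):
        raise ValueError(
            "got {} headers but {} types".format(len(headers), len(types)))
    # A repeated header would silently overwrite an entry in the dict.
    duplicates = sorted({h for h in headers if headers.count(h) > 1})
    if duplicates:
        raise ValueError("duplicate column names: {}".format(duplicates))
    # Catch misspelled type names early (assumption: only these two are used).
    unknown = sorted(set(types) - set(allowed))
    if unknown:
        raise ValueError("unexpected column types: {}".format(unknown))
    return dict(zip(headers, types))


# Shortened, illustrative lists; in the class above these would be
# self.col_headers and self.col_types.
headers = ["C1", "FinalVolume", "Origin", "Destination"]
types = ["numeric", "numeric", "enum", "enum"]
schema = pair_columns(headers, types)
print(schema["Origin"])
```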
---------