Toby Adjuik
adjuiktoby@gmail.com

Department of Biosystems and Agricultural Engineering

University of Kentucky

February 10, 2022

# Random Forest Model Deployment in Amazon SageMaker

In this notebook, I show how I trained and deployed a machine learning model using Amazon SageMaker.
The notebook was created and run in an Amazon SageMaker notebook instance. The demonstration comes from a project I worked on titled "Machine Learning Approach to Simulating CO2 Fluxes in Cropping Systems".

#### Resources needed
1. A dataset: The dataset used in this project was the GRACEnet dataset. The main purpose of the GRACEnet database is to aggregate information from many studies so that methods for quantifying GHG emissions and other environmental impacts of cropped and grazed systems can be developed, and to provide scientific evidence for carbon trading programs that can help reduce GHG emissions.

The original uncleaned and unprocessed data used in this project can be found at: https://data.nal.usda.gov/dataset/gracenet-greenhouse-gas-reduction-through-agricultural-carbon-enhancement-network

2. An algorithm: I used the Random Forest algorithm in scikit-learn provided by Amazon SageMaker to train the model using the GRACEnet dataset to predict the CO2 flux in the cropping systems.

#### Resources from Amazon SageMaker

A few resources needed for storing your data and running the code in Amazon SageMaker:

1. An Amazon Simple Storage Service (Amazon S3) bucket to store the training data and the model artifacts that Amazon SageMaker creates when it trains the model.

2. An Amazon SageMaker notebook instance to prepare and process data and to train and deploy a machine learning model.

3. A Jupyter notebook to use with the notebook instance to prepare your training data and train and deploy the model.

A detailed description of this work is available in our paper: Adjuik, T.A., Davis, S.C. Machine Learning Approach to Simulate Soil CO2 Fluxes under Cropping Systems. Agronomy 2022, 12, 197, doi: https://doi.org/10.3390/agronomy12010197

### Fetching the dataset

In [29]:
import pandas as pd # library for data manipulation and analysis.
import numpy as np # library for numerical computing; provides fast N-dimensional arrays and math functions
import matplotlib.pyplot as plt # Plotting library
import seaborn as sns # data visualization library
sns.set(style="darkgrid")

# Load a cleaned version of the dataset. To learn more about the data cleaning and manipulation steps for this data, see
# my repository titled "MLSoilCO2Flux"
data = pd.read_csv("New_GHG_Data_Deployment.csv")

In [30]:
data.shape

(7863, 11)

In [31]:
# Create dataframes for Target and Predictors

x_data = pd.DataFrame(
    np.c_[data['Air_Temp_DEGC'], data['Soil_Temp_DEGC'], data['Soil_Classification'],
          data['Crop'], data['Fert_Ammend_Class']],
    columns=['Air_Temp_DEGC', 'Soil_Temp_DEGC', 'Soil_Classification', 'Crop', 'Fert_Ammend_Class'])
y_data = data['Carbon_dioxide'] # Create dataset with CO2 flux

### Preparing the data

In [32]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.20, random_state=10)                                                    

In [33]:
print("The shape of the X_trainset is",x_train.shape) # The shape of the X_trainset
print("The shape of the y_trainset is", y_train.shape) # The shape of the y_trainset

print("The shape of the X_testset is", x_test.shape) # The shape of the X_testset
print("The shape of the y_testset is", y_test.shape) # The shape of the y_testset

The shape of the X_trainset is (6290, 5)
The shape of the y_trainset is (6290,)
The shape of the X_testset is (1573, 5)
The shape of the y_testset is (1573,)
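
The split sizes printed above follow from how train_test_split handles a fractional test_size: the test count is rounded up and the remainder goes to training. A minimal sketch, using a synthetic stand-in with the same number of rows rather than the GRACEnet data (the names x_demo and y_demo are placeholders introduced here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the predictors (7863 rows, 5 columns).
# The values are placeholders, not GRACEnet data.
n_rows, n_features = 7863, 5
x_demo = np.zeros((n_rows, n_features))
y_demo = np.zeros(n_rows)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.20, random_state=10
)

# The test count is ceil(0.20 * 7863) = 1573; the remaining 6290 rows are training data.
print(x_tr.shape, x_te.shape)
```

The random_state only changes which rows land in each set, not how many.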


In [34]:
x_train.head()

| | Air_Temp_DEGC | Soil_Temp_DEGC | Soil_Classification | Crop | Fert_Ammend_Class |
|---|---|---|---|---|---|
| 16 | 0.0 | 0.0 | 6.0 | 0.0 | 3.0 |
| 1858 | 12.97 | 0.0 | 1.0 | 1.0 | 2.0 |
| 211 | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 |
| 4241 | 0.0 | 20.388889 | 8.0 | 1.0 | 3.0 |
| 6214 | 0.0 | 0.0 | 4.0 | 4.0 | 3.0 |


In [35]:
#Get variable/feature names
feature_names=list(x_data.columns.values)
feature_names

['Air_Temp_DEGC',
 'Soil_Temp_DEGC',
 'Soil_Classification',
 'Crop',
 'Fert_Ammend_Class']

In [36]:
trainX = pd.DataFrame(x_train, columns=feature_names) # Assign feature names to the training dataset
trainX['target'] = y_train

testX = pd.DataFrame(x_test, columns=feature_names) # Assign feature names to the testing dataset
testX['target'] = y_test

In [37]:
trainX.head()

| | Air_Temp_DEGC | Soil_Temp_DEGC | Soil_Classification | Crop | Fert_Ammend_Class | target |
|---|---|---|---|---|---|---|
| 16 | 0.0 | 0.0 | 6.0 | 0.0 | 3.0 | 896.902655 |
| 1858 | 12.97 | 0.0 | 1.0 | 1.0 | 2.0 | 2453.816056 |
| 211 | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 | 636.547423 |
| 4241 | 0.0 | 20.388889 | 8.0 | 1.0 | 3.0 | 4494.138 |
| 6214 | 0.0 | 0.0 | 4.0 | 4.0 | 3.0 | 0.0 |


In [38]:
trainX.to_csv('GHG_train.csv')# Save the training dataset
testX.to_csv('GHG_test.csv')# Save the testing dataset
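
Because to_csv is called without index=False, the row index is written as an extra, unnamed first column. This is harmless as long as the file is later read by column name, but it is worth knowing when inspecting the CSV. A small sketch with made-up values (the frame demo is illustrative only):

```python
import io
import pandas as pd

# Small illustrative frame; the values are made up, the column names mirror the notebook's.
demo = pd.DataFrame(
    {"Air_Temp_DEGC": [0.0, 12.97], "target": [896.9, 2453.8]},
    index=[16, 1858],
)

# Default to_csv() writes the index as an extra, unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False omits it, so the file holds only the named columns.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # the index comes back as a column named 'Unnamed: 0'
print(cols_without_index)
```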

### Data Ingestion
The training and testing files are then copied to S3 for Amazon SageMaker's managed training to pick up. The bucket should be in the same region as the notebook instance, training, and hosting.

In [39]:
import datetime
import tarfile

import boto3 # AWS SDK for python. Provides low-level access to AWS services
from sagemaker import get_execution_role
import sagemaker

m_boto3 = boto3.client('sagemaker') 

sess = sagemaker.Session()

region = sess.boto_session.region_name

bucket = 'bucket-name'  #  Bucket is a logical unit of storage in AWS S3

print('Using bucket ' + bucket)

Using bucket toby-s3-bucket-name


In [40]:
# send data to S3. SageMaker will take training data from s3
trainpath = sess.upload_data(
    path='GHG_train.csv', bucket=bucket,
    key_prefix='sagemaker/sklearncontainer')

testpath = sess.upload_data(
    path='GHG_test.csv', bucket=bucket,
    key_prefix='sagemaker/sklearncontainer')
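
For a single file, upload_data returns the S3 URI of the uploaded object, composed of the bucket, the key prefix, and the file name. The helper below is a hypothetical illustration of that layout (it does not call AWS and is not part of the SDK), handy for checking what trainpath and testpath should look like:

```python
import os


def expected_s3_uri(bucket: str, key_prefix: str, local_path: str) -> str:
    """Hypothetical helper: mirror the URI layout used when uploading a single file."""
    filename = os.path.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{filename}"


# 'bucket-name' is the placeholder bucket used in this notebook.
print(expected_s3_uri("bucket-name", "sagemaker/sklearncontainer", "GHG_train.csv"))
```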

### Prepare a Scikit-learn Training Script

Here, we write the scikit-learn training script that will be used to train the random forest model.
The training script is similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables. For example:

- SM_MODEL_DIR:  A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.

- SM_OUTPUT_DATA_DIR: A string representing the filesystem path to write output artifacts to. Output artifacts may include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed and uploaded to S3 to the same S3 prefix as the model artifacts.

Supposing two input channels, ‘train’ and ‘test’, were used in the call to the Scikit-learn estimator’s fit() method, the following will be set, following the format “SM_CHANNEL_[channel_name]”:

- SM_CHANNEL_TRAIN: A string representing the path to the directory containing data in the ‘train’ channel

- SM_CHANNEL_TEST: Same as above, but for the ‘test’ channel.
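
The variables listed above can be read with os.environ and used as defaults for command-line flags, so the same script runs both inside SageMaker and locally. A minimal sketch, assuming a fallback of './' for local runs; the helper names channel_env_var and parse_dirs are introduced here for illustration:

```python
import argparse
import os


def channel_env_var(channel_name: str) -> str:
    """Name of the environment variable for an input channel (SM_CHANNEL_<NAME>)."""
    return "SM_CHANNEL_" + channel_name.upper()


def parse_dirs(argv=None):
    """Read directories from flags, falling back to SageMaker's variables, then to './'."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./"))
    parser.add_argument("--train", default=os.environ.get(channel_env_var("train"), "./"))
    parser.add_argument("--test", default=os.environ.get(channel_env_var("test"), "./"))
    args, _ = parser.parse_known_args(argv)
    return args


# Local run: an explicit flag takes precedence over any environment variable.
local_args = parse_dirs(["--train", "data"])
print(channel_env_var("train"), local_args.train)
```

An explicit flag always wins over the environment variable, which is what makes the local run shown later in this notebook possible.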

In [43]:
%%writefile script.py

import argparse
import os
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import joblib
from sklearn.metrics import explained_variance_score, r2_score



# inference functions ---------------
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf

if __name__ =='__main__':

    print('extracting arguments')
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--n-estimators', type=int, default=100)
    parser.add_argument('--max_leaf_nodes', type=int, default=10)

    # Data, model, and output directories
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train-file', type=str, default='GHG_train.csv')
    parser.add_argument('--test-file', type=str, default='GHG_test.csv')
    
    
    args, _ = parser.parse_known_args()
    
    print('reading data')
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))


    
    print('building training and testing datasets')
    attributes = ['Air_Temp_DEGC', 'Soil_Temp_DEGC', 'Soil_Classification',
       'Crop', 'Fert_Ammend_Class']
    X_train = train_df[attributes]
    X_test = test_df[attributes]
    y_train = train_df['target']
    y_test = test_df['target']
    
    # train
    print('training model')
    model = RandomForestRegressor(
        n_estimators=args.n_estimators,
        max_leaf_nodes =args.max_leaf_nodes,
        n_jobs=-1)
    
    model.fit(X_train, y_train)
     
    # persist model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print('model persisted at ' + path)
    
    # print explained_variance_score 
    print('validating model')
    predictions = model.predict(X_test)
    print("Explained Variance Score is " + str(explained_variance_score(y_test, predictions).round(2)))
    print("R2 score : %.2f" % r2_score(y_test,predictions))

Overwriting script.py


### Local Training
Script arguments allow us to keep SageMaker-specific configuration out of the script and run it locally.

In [73]:
! python script.py --n-estimators 100 \
                   --max_leaf_nodes 10 \
                   --model-dir ./ \
                   --train ./ \
                   --test ./

extracting arguments
reading data
building training and testing datasets
training model
model persisted at ./model.joblib
validating model
Explained Variance Score is 0.72
R2 score : 0.72
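
A detail worth knowing when passing flags to the script: argparse treats hyphens and underscores in flag names as different spellings, and parse_known_args silently sets aside flags it does not recognize instead of raising an error. The sketch below reuses the two hyperparameter flags declared in script.py; the values 200 and 32 are arbitrary:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--n-estimators", type=int, default=100)
parser.add_argument("--max_leaf_nodes", type=int, default=10)

# The second flag is spelled with hyphens, although it was declared with underscores.
args, extras = parser.parse_known_args(
    ["--n-estimators", "200", "--max-leaf-nodes", "32"]
)

print(args.n_estimators)    # hyphens in the declared flag become underscores in the attribute
print(args.max_leaf_nodes)  # still the default, because the hyphenated spelling was not recognized
print(extras)               # unrecognized tokens are returned here rather than raising an error
```

A misspelled flag therefore leaves the default in place without any warning, so it is worth checking the spelling against the add_argument calls.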


### SageMaker Training - Launching a training job with the Python SDK
### Create an Estimator
You run scikit-learn training scripts on SageMaker by creating an SKLearn Estimator and calling its fit method, which starts a SageMaker training job. The following code trains the custom scikit-learn script “script.py”, passing in two hyperparameters ('n-estimators', 'max_leaf_nodes') and using two input channel directories (‘train’ and ‘test’).

In [74]:
# We use the SKLearn Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn
sklearn_estimator = SKLearn(
    entry_point='script.py',
    role=get_execution_role(),
    # sagemaker>=2 parameter names; formerly train_instance_count / train_instance_type
    # (see https://sagemaker.readthedocs.io/en/stable/v2.html)
    instance_count=1,
    instance_type='ml.m4.xlarge',
    framework_version='0.23-1',
    base_job_name='rf-scikit',
    hyperparameters={'n-estimators': 500,
                     'max_leaf_nodes': 16})


In [75]:
# launch training job, with asynchronous call
sklearn_estimator.fit({'train':trainpath, 'test': testpath}, wait=False)
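
As the comment in script.py notes, the hyperparameters given to the estimator reach the script as command-line arguments. The function below is a hypothetical illustration of that mapping, not SDK code; it shows why the dictionary keys must match the flag names declared with add_argument exactly:

```python
def hyperparameters_to_argv(hyperparameters: dict) -> list:
    """Hypothetical illustration of how hyperparameters reach the script as flags."""
    argv = []
    for name, value in hyperparameters.items():
        argv.extend([f"--{name}", str(value)])
    return argv


# Same keys and values as passed to the SKLearn estimator in this notebook.
print(hyperparameters_to_argv({"n-estimators": 500, "max_leaf_nodes": 16}))
```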

### Deploy to a real-time endpoint
#### Deploy with Python SDK
An estimator can be deployed directly after training with Estimator.deploy(), which is what this notebook does. Before deploying, we also retrieve the S3 location of the model artifact: a model can be created from that artifact to deploy a model that was trained in a different session, or even outside SageMaker.

After the estimator finishes training, we deploy it to a SageMaker endpoint:

In [84]:
sklearn_estimator.latest_training_job.wait(logs='None')
artifact = m_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']

print('Model artifact persisted at ' + artifact)



2022-02-08 19:54:25 Starting - Preparing the instances for training
2022-02-08 19:54:25 Downloading - Downloading input data
2022-02-08 19:54:25 Training - Training image download completed. Training in progress.
2022-02-08 19:54:25 Uploading - Uploading generated training model
2022-02-08 19:54:25 Completed - Training job completed
Model artifact persisted at s3://sagemaker-us-east-1-422337765573/rf-scikit-2022-02-08-19-49-16-202/output/model.tar.gz


In [90]:
predictor = sklearn_estimator.deploy(instance_type='ml.m4.xlarge',initial_instance_count=1)

------!
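
The alternative route, building a model from the S3 artifact instead of calling deploy on the estimator, might look like the sketch below. It assumes the SKLearnModel class from the SageMaker Python SDK, reuses script.py for its model_fn, and wraps everything in a hypothetical function deploy_from_artifact; it was not run as part of this notebook and needs AWS credentials and an existing artifact to do anything:

```python
def deploy_from_artifact(artifact_s3_uri, role, instance_type="ml.m4.xlarge"):
    """Sketch: build a model from a saved model.tar.gz in S3 and deploy it."""
    if not artifact_s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI to a model.tar.gz artifact")

    # Imported lazily so the sketch can be read and imported without the SDK installed.
    from sagemaker.sklearn.model import SKLearnModel

    model = SKLearnModel(
        model_data=artifact_s3_uri,   # e.g. the artifact location printed earlier
        role=role,                    # e.g. get_execution_role()
        entry_point="script.py",      # supplies model_fn for loading model.joblib
        framework_version="0.23-1",   # same version as the training estimator
    )
    return model.deploy(initial_instance_count=1, instance_type=instance_type)
```

Since only the artifact location and an execution role are needed, this route also works in a fresh session where the original estimator object no longer exists.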

### Invoke with the Python SDK

In [86]:
print(predictor.predict(testX[feature_names]))

[4067.34661881 7211.95004853  385.64980311 ... 4104.57903344  385.64980311
  385.64980311]


In [87]:
predictions = predictor.predict(testX[feature_names])# Making predictions with the trained model

In [88]:
from sklearn.metrics import r2_score
print("R2 score : %.2f" % r2_score(testX['target'],predictions))

R2 score : 0.77
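
The training script reports both the explained variance score and R2. The two coincide when the residuals average to zero and differ when the predictions carry a constant offset. A small pure-Python sketch with made-up numbers, in which every prediction is too high by 0.5:

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true); a constant offset in the predictions is ignored."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_res = sum(residuals) / len(residuals)
    mean_true = sum(y_true) / len(y_true)
    var_res = sum((r - mean_res) ** 2 for r in residuals)
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - var_res / var_true


# Made-up values: each prediction overshoots the true value by exactly 0.5.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]
print(round(r2(y_true, y_pred), 2))
print(round(explained_variance(y_true, y_pred), 2))
```

Here R2 penalizes the offset (0.8) while explained variance does not (1.0), so equal values for the two metrics suggest the model's predictions are not systematically biased.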


As the evaluation metrics show, the deployed model attained an R2 of 0.77 on the test set. I set only two hyperparameters ('n-estimators' and 'max_leaf_nodes'), so further hyperparameter tuning could improve this score. Automatic hyperparameter search can be implemented with the GridSearchCV or RandomizedSearchCV classes provided in scikit-learn.
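
A grid search over the same two hyperparameters could be sketched as follows with scikit-learn's GridSearchCV. The data here is synthetic (make_regression), and the grid values are illustrative assumptions, not settings used in this project:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the five predictors; not GRACEnet data.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Illustrative grid over the same two hyperparameters used in this notebook.
param_grid = {"n_estimators": [20, 50], "max_leaf_nodes": [8, 16]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    scoring="r2",  # same metric reported in this notebook
    cv=3,          # 3-fold cross-validation on the training data
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 2))
```

On the real data, the search would be fitted on the training set only, keeping the test set for the final R2.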

### Deleting the endpoint

In [89]:
# endpoint_name replaces the endpoint attribute, which was renamed in sagemaker>=2
# (see https://sagemaker.readthedocs.io/en/stable/v2.html)
sagemaker.Session().delete_endpoint(predictor.endpoint_name)
# bucket_to_delete = boto3.resource('s3').Bucket(bucket)
# bucket_to_delete.objects.all().delete()


If you want to return to this notebook after deploying a SageMaker endpoint, you can invoke the endpoint with the following snippet:

In [None]:
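# from sagemaker.predictor import Predictor
# from sagemaker.serializers import NumpySerializer
# from sagemaker.deserializers import NumpyDeserializer
# sklearn_predictor = Predictor(endpoint_name='randomforestregressor-endpoint',
#                               sagemaker_session=sess,
#                               serializer=NumpySerializer(),
#                               deserializer=NumpyDeserializer())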
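
An endpoint can also be invoked without the SageMaker SDK, through the low-level sagemaker-runtime client in boto3. The sketch below assumes the endpoint accepts text/csv input with the five features in training order; the functions rows_to_csv and invoke_endpoint_csv are hypothetical helpers, and the call itself requires AWS credentials and a live endpoint:

```python
def rows_to_csv(rows):
    """Serialize feature rows (lists of numbers) as headerless CSV text."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def invoke_endpoint_csv(endpoint_name, rows):
    """Sketch: send CSV rows to a deployed endpoint with the low-level runtime client."""
    import boto3  # imported lazily; requires AWS credentials and a live endpoint

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",  # assumption: the serving container accepts CSV input
        Body=rows_to_csv(rows),
    )
    return response["Body"].read().decode("utf-8")


# Feature order: Air_Temp_DEGC, Soil_Temp_DEGC, Soil_Classification, Crop, Fert_Ammend_Class
print(rows_to_csv([[0.0, 20.388889, 8.0, 1.0, 3.0]]))
```

The format of the returned text depends on how the serving container encodes its output, so it should be inspected before being parsed.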


### Conclusions

In this project, I demonstrated the application of Random Forest to predict soil CO2 fluxes with open-source data from the GRACEnet database. The prediction R2 value was 0.77.

#### Summary steps

1. Uploaded and prepared the data
2. Ingested the data
3. Trained the RF model
4. Launched a training job with the Python SDK
5. Deployed the model to Amazon SageMaker
6. Validated the model



#### References
1. Sriramya Kannepalli (February 10, 2022) Random Forest and XGBoost on Amazon SageMaker and AWS Lambda:
https://medium.com/analytics-vidhya/random-forest-and-xgboost-on-amazon-sagemaker-and-aws-lambda-29abd9467795


2. Amazon SageMaker Examples (February 10, 2022) Develop, Train, Optimize and Deploy Scikit-Learn Random Forest:
https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-python-sdk/scikit_learn_randomforest/Sklearn_on_SageMaker_end2end.ipynb