# Custom Classification Model with SKLearn using SageMaker
---
## Introduction
This notebook demonstrates how to run a training job with a custom script in SageMaker using the SKLearn framework. The custom script also contains *helper functions* that handle model loading, input parsing, prediction, and output formatting when the model is served.

We will use the [Iris Dataset](https://archive.ics.uci.edu/ml/datasets/iris). The goal is to classify each iris flower into one of three classes based on four features: sepal length, sepal width, petal length, and petal width.

---
## Prerequisites
- Retrieve the default execution role assigned to SageMaker Studio. We will use it to interact with other SageMaker and AWS services.
- Set the S3 bucket used to store the training data and artifacts.

In [None]:
!pip install -U sagemaker

In [None]:
pass
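The two prerequisites above could be covered by something like the sketch below. It assumes the SageMaker Python SDK (v2-style API) and a SageMaker Studio or notebook environment where an execution role is attached. The function name `get_role_and_bucket` and the `PREFIX` value are hypothetical, chosen only for illustration.

```python
# Hypothetical S3 key prefix under which this notebook's files would be stored
PREFIX = "sklearn-iris"


def get_role_and_bucket():
    """Return the SageMaker session, the execution role ARN, and the default bucket."""
    # Imported inside the function so this sketch can be loaded without the SDK installed
    import sagemaker

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()  # role attached to the Studio / notebook environment
    bucket = session.default_bucket()      # the session's default bucket for the current region
    return session, role, bucket
```

Calling `session, role, bucket = get_role_and_bucket()` once at the top of the notebook would make these values available to the later cells.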

## Data Ingestion
Let's download the dataset and display the first 20 rows.

In [None]:
%%time
pass

In [None]:
# Print size of dataset
pass
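As a rough illustration of this step, the sketch below loads the same data into a DataFrame and inspects it. Note the assumption: instead of downloading the file from the UCI repository, it uses the copy of the Iris data bundled with scikit-learn, so it runs without network access. The column name `species` is a hypothetical choice, and the class names differ slightly from the raw UCI file.

```python
from sklearn.datasets import load_iris

# Load the bundled copy of the Iris data as a pandas DataFrame (assumption: stands in for the download)
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})

# Replace the integer codes with class names so the labels are strings, as in the raw data
df["species"] = df["species"].map(dict(enumerate(iris.target_names)))

# Show the first 20 rows, then the size of the dataset
print(df.head(20))
print(df.shape)
```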

## Data Preprocessing
The labels are strings, so we first convert them to integers. We then split the data into training and test sets using an 80-20 split, save each set to its own CSV file, and upload the files to S3 for the training script to use.

In [None]:
%%time
pass

Split the dataset, write the splits to CSV files, and upload them to S3.

In [None]:
%%time
pass
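The preprocessing described above might look like the following sketch. It again assumes the scikit-learn copy of the Iris data, writes the CSV files to a temporary local directory, and leaves out the S3 upload. The names `train_features.csv` and `train_labels.csv` match what `train.py` reads; the test file names, the `label` column name, and the alphabetical label encoding are hypothetical choices.

```python
import os
import tempfile

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
features = iris.data
# Rebuild string labels so the encoding step has something to convert
species = iris.target.map(dict(enumerate(iris.target_names)))

# Map each class name to an integer code (alphabetical order)
class_names = sorted(str(name) for name in species.unique())
label_map = {name: code for code, name in enumerate(class_names)}
labels = species.map(label_map).rename("label")

# 80-20 split; stratify keeps the class proportions the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Keep the header row and drop the index: train.py reads these files with
# pd.read_csv defaults, which treat the first line as column names
out_dir = tempfile.mkdtemp()
X_train.to_csv(os.path.join(out_dir, "train_features.csv"), index=False)
y_train.to_csv(os.path.join(out_dir, "train_labels.csv"), index=False)
X_test.to_csv(os.path.join(out_dir, "test_features.csv"), index=False)
y_test.to_csv(os.path.join(out_dir, "test_labels.csv"), index=False)
```

With the SageMaker Python SDK, the directory could then be uploaded with `Session.upload_data(path=out_dir, bucket=..., key_prefix=...)`, which returns the S3 URI to pass to the training job.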

## Training
Now we'll run the actual training. First, we create a `train.py` file, which we'll pass to the Estimator as the entry point. This file contains the custom SKLearn training script as well as the functions used to serve predictions.

In [None]:
%%writefile train.py

import argparse
import json
import os

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

import joblib

if __name__ == "__main__":
    # Define arguments. These are passed to the script via the 'hyperparameters' property of the Estimator
    parser = argparse.ArgumentParser()
    
    parser.add_argument("--solver", type=str, default="lbfgs")
    args, _ = parser.parse_known_args()
    
    # Get environment variables
    data_dir = os.environ.get("SM_CHANNEL_TRAIN")
    model_dir = os.environ.get("SM_MODEL_DIR")
    
    # Read datasets
    X_train = pd.read_csv(os.path.join(data_dir, "train_features.csv"))
    y_train = pd.read_csv(os.path.join(data_dir, "train_labels.csv"))
    y_train = y_train.values.ravel()
    
    # Train model
    model = LogisticRegression(solver=args.solver)
    print("Training Logistic Regression Model")
    model.fit(X_train, y_train)
    
    # Evaluate on the training data (no held-out set is passed to this script)
    score = model.score(X_train, y_train)
    print("SCORE: %.2f" % (score))
    
    # Print summary report
    print(classification_report(y_train, model.predict(X_train)))
    
    # Save model
    print("Saving model")
    model_path = os.path.join(model_dir, "model.joblib")
    joblib.dump(model, model_path)

##############################################
#  MODEL SERVING FUNCTIONS
##############################################
    
def model_fn(model_dir):
    """
        This function loads the trained model from model_dir. It is executed
        before any predictions are served
    """
    model = joblib.load(os.path.join(model_dir, "model.joblib"))
    return model

def input_fn(request_body, request_content_type):
    """
        This function parses the input data passed when predicting
        
        Currently, we only support JSON content type
    """
    if request_content_type == "application/json":
        payload = json.loads(request_body)
        return payload["input"]
    else:
        raise ValueError(f"'{request_content_type}' not supported by script! Only 'application/json' contents are supported")

def predict_fn(input_data, model):
    """
        This function is executed to get a model prediction when .predict is called
    """
    predictions = model.predict(input_data)
    return predictions

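def output_fn(prediction, content_type):
    """
        This function performs any post-processing on the predictions before being returned
        to the endpoint invoker. Only the first prediction is returned, serialized as JSON
    """

    resp = {"prediction": int(prediction[0])}
    # Serialize explicitly so the response body is valid JSON
    return json.dumps(resp)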
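To make the request format expected by `input_fn` concrete, here is a small sketch that uses only the standard library. It builds a JSON body with an `"input"` key and parses it the same way `input_fn` does. The sample values are illustrative measurements, not taken from the notebook.

```python
import json

# One flower: sepal length, sepal width, petal length, petal width
sample = [5.1, 3.5, 1.4, 0.2]

# model.predict expects a 2-D structure (one row per sample), so wrap the sample in a list
request_body = json.dumps({"input": [sample]})

# This mirrors what input_fn does with an "application/json" request
parsed = json.loads(request_body)["input"]
print(parsed)
```

Because `output_fn` only reads `prediction[0]`, sending several rows in one request would still yield a single prediction in the response.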

Create an SKLearn Estimator instance to run the training job with our custom script.

In [None]:
%%time
pass
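A minimal sketch of this step follows, assuming the v2-style `SKLearn` estimator from the SageMaker Python SDK. The function name `train_iris_model`, the instance type, and the framework version are assumptions made for illustration; `role` and `train_s3_uri` are expected to come from the earlier setup and upload steps.

```python
def train_iris_model(role, train_s3_uri, solver="lbfgs"):
    """Run train.py as a SageMaker training job and return the fitted estimator."""
    # Imported inside the function so this sketch can be loaded without the SDK installed
    from sagemaker.sklearn.estimator import SKLearn

    estimator = SKLearn(
        entry_point="train.py",
        role=role,
        instance_count=1,
        instance_type="ml.m5.large",         # assumed instance type
        framework_version="1.2-1",           # assumed scikit-learn container version
        py_version="py3",
        hyperparameters={"solver": solver},  # reaches train.py as the --solver argument
    )

    # The channel name "train" is what sets SM_CHANNEL_TRAIN inside the container
    estimator.fit({"train": train_s3_uri})
    return estimator
```

The channel name matters here: `train.py` reads its data directory from `SM_CHANNEL_TRAIN`, so a differently named channel would leave `data_dir` set to `None`.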

## Deploy Model as Endpoint
Create an endpoint to host the model. We deploy directly from the Estimator so that the endpoint uses the custom serving functions defined in `train.py` when making predictions.

In [None]:
pass

Test the endpoint with one sample to check that it works.

In [None]:
%%time
pass
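Deployment and a single test request could be sketched as below, assuming the v2-style SageMaker Python SDK. The function name `deploy_and_predict` and the instance type are hypothetical. JSON serializers are used because `input_fn` in `train.py` only accepts `application/json`.

```python
def deploy_and_predict(estimator, sample):
    """Deploy the trained estimator and send one sample to the new endpoint."""
    # Imported inside the function so this sketch can be loaded without the SDK installed
    from sagemaker.deserializers import JSONDeserializer
    from sagemaker.serializers import JSONSerializer

    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.large",      # assumed instance type
        serializer=JSONSerializer(),      # sends the request as application/json
        deserializer=JSONDeserializer(),  # parses the JSON response into a dict
    )

    # input_fn expects the rows under the "input" key, one inner list per sample
    response = predictor.predict({"input": [sample]})
    return predictor, response
```

For example, `deploy_and_predict(estimator, [5.1, 3.5, 1.4, 0.2])` would be expected to return the predictor together with a dictionary containing a `prediction` key.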

## Cleanup
If the model endpoints are no longer in use, make sure to delete them, since a running endpoint continues to incur charges.

In [None]:
pass
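One way to do the cleanup is sketched below. It assumes `predictor` is the object returned by the Estimator's `deploy` call in the v2-style SageMaker Python SDK; the helper name `cleanup` is hypothetical.

```python
def cleanup(predictor):
    """Delete the model and then the endpoint behind the given predictor."""
    # Delete the model first, while the endpoint configuration that references it still exists
    predictor.delete_model()
    # Then remove the endpoint so it stops running
    predictor.delete_endpoint()
```

Training data and model artifacts stored in S3 are not touched by this helper and would need to be removed separately if they are no longer needed.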