## Writing a SageMaker SkLearn Estimator 

To deploy a custom scikit-learn model through SageMaker, we write a script, `train.py`, that the SKLearn Estimator runs as its entry point. The following code is heavily based on the code from the project in the Case Studies section.

```python
import argparse
import os
import json 
import pandas as pd
import joblib

from sklearn.linear_model import BayesianRidge


# Provided model load function (Taken from Case Studies Project)
def model_fn(model_dir):
    """Load model from the model_dir. This is the same model that is saved
    in the main if statement.
    """
    print("Loading model.")
    
    # load using joblib
    model = joblib.load(os.path.join(model_dir, "model.joblib"))
    print("Done loading model.")
    
    return model

if __name__ == '__main__':
    
    
    # Initialize an ArgumentParser
    parser = argparse.ArgumentParser()

    # SageMaker parameters, like the directories for training data and saving models; set automatically
    # Do not need to change
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    
    # Add model parameters, passed as a JSON string and decoded with json.loads below
    parser.add_argument('-p', '--classifier_params', type=str, default='{}', required=True,
                        help='Model parameters as a JSON string (default: "{}")')
    # NOTE: A dictionary should be passed as a string. See more here:
    # https://stackoverflow.com/questions/18608812/accepting-a-dictionary-as-an-argument-with-argparse-and-python
                        
    # args holds all passed-in arguments
    args = parser.parse_args()

    # Read in csv training file
    training_dir = args.data_dir
    train_data = pd.read_csv(os.path.join(training_dir, "train.csv"), header=None, names=None)

    # Labels are in the first column
    train_y = train_data.iloc[:,0]
    train_x = train_data.iloc[:,1:]
    
    # Load params 
    mdl_args = json.loads(args.classifier_params)
    
    # Define model 
    model = BayesianRidge(**mdl_args)    
    
    # Train Model
    model.fit(train_x, train_y)

    # Save the trained model
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))
```
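The script above reads `train.csv` with `header=None` and takes the label from the first column. The following is a minimal sketch of writing a file in that layout and reading it back the same way. The toy data and the column names (`feature_a`, `feature_b`, `target`) are hypothetical and only serve as an illustration.

```python
import io

import pandas as pd

# Hypothetical toy data; the column names are illustrative only
df = pd.DataFrame({
    "feature_a": [0.1, 0.2],
    "feature_b": [3, 4],
    "target": [0.5, 0.7],
})

# Put the label first, then write without header or index,
# which is the layout train.py expects
ordered = df[["target"] + [c for c in df.columns if c != "target"]]
buffer = io.StringIO()  # in-memory stand-in for train.csv
ordered.to_csv(buffer, header=False, index=False)

# Read it back exactly as train.py does
buffer.seek(0)
train_data = pd.read_csv(buffer, header=None, names=None)
train_y = train_data.iloc[:, 0]
train_x = train_data.iloc[:, 1:]

print(train_y.tolist())  # [0.5, 0.7]
print(train_x.shape)     # (2, 2)
```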

## Train and deploy the Model

```python
from sagemaker.sklearn.estimator import SKLearn
import sagemaker

# Define session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

sklearn_estimator = SKLearn(
    entry_point="train.py",
    source_dir='source_sklearn',
    role=role,
    train_instance_count=1,
    train_instance_type='ml.c4.xlarge',
    sagemaker_session=sagemaker_session,
    framework_version='0.23-1',
    hyperparameters={
        'p': '{"compute_score":true,"normalize":true}'
    },
    output_path=f's3://{bucket}/{prefix}'
)
```
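The model parameters travel to `train.py` as a JSON string under the hyperparameter key `p`. The sketch below shows one way to build that string with `json.dumps` instead of typing it by hand, and how the script's `argparse` and `json.loads` steps turn it back into a dictionary. It assumes the training container forwards the hyperparameter as a `-p <value>` pair on the command line; `parse_model_params` is a hypothetical helper that mirrors the parsing in `train.py`, not part of any library.

```python
import argparse
import json


def parse_model_params(argv):
    """Hypothetical helper mirroring the argument parsing in train.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-p', '--classifier_params', type=str, default='{}')
    args = parser.parse_args(argv)
    return json.loads(args.classifier_params)


# Notebook side: encode the dictionary as a JSON string
model_params = {"compute_score": True, "normalize": True}
encoded = json.dumps(model_params)

# Script side: decode it again (assumes it arrives as `-p <value>`)
params = parse_model_params(['-p', encoded])
print(params)  # {'compute_score': True, 'normalize': True}
```

Generating the string with `json.dumps` avoids quoting mistakes such as Python's `True` where JSON expects `true`.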

```python
%%time

# Train estimator with data uploaded to S3
sklearn_estimator.fit({'train': f's3://{bucket}/{prefix}/train.csv'})
```

```
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.

2020-08-12 00:15:59 Starting - Starting the training job...
2020-08-12 00:16:03 Starting - Launching requested ML instances......
2020-08-12 00:17:21 Starting - Preparing the instances for training......
2020-08-12 00:18:07 Downloading - Downloading input data...
2020-08-12 00:18:37 Training - Downloading the training image...
2020-08-12 00:19:30 Uploading - Uploading generated training model
2020-08-12 00:19:30 Completed - Training job completed
2020-08-12 00:19:17,887 sagemaker-training-toolkit INFO     Imported framework sagemaker_sklearn_container.training
2020-08-12 00:19:17,889 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2020-08-12 00:19:17,898 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-08-12 00:19:18,234 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2020-08-12 00:19:18,250 sagemaker-training-toolkit INFO     No GPUs detected 
```

```python
%%time
# Deploy model
predictor = sklearn_estimator.deploy(instance_type="ml.t2.medium", initial_instance_count=1)
```

```
Parameter image will be renamed to image_uri in SageMaker Python SDK v2.

---------------------!
CPU times: user 386 ms, sys: 8.39 ms, total: 394 ms
Wall time: 10min 32s
```


## Testing different Offers 

Suppose that a Starbucks Project Manager comes to us with three different offers, each targeting a different customer segment, and wants to assess their completion rate (`CR`). Once our model is deployed, we can use it to predict which offer has the highest `CR` and make a recommendation to the Project Manager.

Based on what they told us about the offers, we summarize each one below and represent each offer's characteristics as a one-dimensional array.

* **Bogo for Top Income and Recent Users**. Because we are focusing on Top Income users, we assume that the group captured by the variable `OC_T` is slightly more represented in our completion population, say above 12%, and that users have a median antiquity of 15 months. Additionally, since they are Top Income users, we assume that the difficulty rate `DFR` is near 60%. Our data vector will look like this:

```python
X_test_offer1 = np.array([1,0.60,np.sqrt(14),1.0,80,35,58,30,12,15])
```

* **Bogo for Standard Income and Old Users**. We consider old users to be those with an average antiquity of 24 months or more.

```python
X_test_offer2 = np.array([1,0.80,np.sqrt(7),1.0,50,45,55,35,10,24])
```

* **Bogo for High Income and all users**. We have seen before that High Income users tend to complete offers more often than Top or Standard Income users, so we assume that this group represents nearly 70% of the completion sample.

```python
X_test_offer3 = np.array([1,0.60,np.sqrt(12),1.0,50,45,65,27,8,20])
```
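scikit-learn models expect two-dimensional input of shape `(n_samples, n_features)`, so a single offer vector has to be reshaped into one row before prediction. The sketch below shows that reshape and, as an alternative, stacking the three offers into one matrix. Passing the stacked matrix to `predictor.predict` in a single call is an assumption about the deployed endpoint and is not something the original notebook does.

```python
import numpy as np

# The three offer vectors, with the values that are sent to the endpoint
offer1 = np.array([1, 0.60, np.sqrt(14), 1.0, 80, 35, 58, 30, 12, 15])
offer2 = np.array([1, 0.80, np.sqrt(7), 1.0, 50, 45, 55, 35, 10, 24])
offer3 = np.array([1, 0.60, np.sqrt(12), 1.0, 50, 45, 65, 27, 8, 20])

# One sample with ten features: shape (10,) becomes (1, 10)
single = offer1.reshape(1, -1)

# All three offers as one batch: shape (3, 10)
batch = np.vstack([offer1, offer2, offer3])

print(single.shape, batch.shape)  # (1, 10) (3, 10)
```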

```python
# Build a one-row feature matrix for each offer
X_test_offer1 = np.array([1,0.60,np.sqrt(14),1.0,80,35,58,30,12,15]).reshape(1, -1)
X_test_offer2 = np.array([1,0.80,np.sqrt(7),1.0,50,45,55,35,10,24]).reshape(1, -1)
X_test_offer3 = np.array([1,0.60,np.sqrt(12),1.0,50,45,65,27,8,20]).reshape(1, -1)

# Make predictions
offer1_pred = predictor.predict(X_test_offer1)
offer2_pred = predictor.predict(X_test_offer2)
offer3_pred = predictor.predict(X_test_offer3)

# Print results
print(f'Offer 1  predicted CR: {offer1_pred[0]*100}', '\n',
      f'Offer 2  predicted CR: {offer2_pred[0]*100}', '\n',
      f'Offer 3  predicted CR: {offer3_pred[0]*100}')
```

```
Offer 1  predicted CR: 33.992025821141816 
 Offer 2  predicted CR: 78.07046826678777 
 Offer 3  predicted CR: 75.67366728846704
```


We see that **offer 2 has the highest predicted CR**. Recall from the data that the highest completion rate among past offers was above 65%, so if we go with offer 2 or offer 3, we can expect it to be a clear success compared to prior offers.
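The comparison can be sketched in a few lines: rank the offers by predicted CR and keep those that beat the historical benchmark. The predicted values are copied (rounded) from the output above. The benchmark of exactly 65.0 is an assumption, since the text only says the best past completion rate was "above 65%".

```python
# Predicted completion rates in percent, rounded from the endpoint output
predicted_cr = {"offer 1": 33.99, "offer 2": 78.07, "offer 3": 75.67}

# Assumed benchmark: the text only states "above 65%"
HISTORICAL_BEST_CR = 65.0

# Rank offers from highest to lowest predicted CR
ranked = sorted(predicted_cr.items(), key=lambda item: item[1], reverse=True)
best_offer, best_cr = ranked[0]

# Offers expected to outperform past offers
beats_history = [name for name, cr in ranked if cr > HISTORICAL_BEST_CR]

print(best_offer)      # offer 2
print(beats_history)   # ['offer 2', 'offer 3']
```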

Before recommending offer 2, we should take into account limitations that are inherent to the offers. For example, offer 3 may be cheaper and faster to deploy than offer 2, or an offer may not be valid for all Starbucks stores: imagine that only High Income customers go to a certain Starbucks because it is in an exclusive residential area. We should weigh these limitations before giving a final recommendation.

## Going Further

Naturally, there are other things that could be tested and tried out: for example, more feature engineering, approaching the problem by predicting the Attractiveness Rate (`AR`), or testing other models.

In the end, the data spoke for itself: it was not difficult to visualize the hidden relationships between the variables and, most importantly, to frame the problem as predicting the Completion Rate.

## Clean Up Resources

```python
# Delete endpoint
predictor.delete_endpoint()
```

```python
import boto3

# Delete all objects in the bucket
bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()
```

[{'ResponseMetadata': {'RequestId': 'D676A6717FC15C00',
   'HostId': 'DTtJda5ii8u1hc82vTXvv1KxlNp/wJT7A43agaJBc+loYayilYS006nhZh2p+xhv0cKjyLHaBCE=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'DTtJda5ii8u1hc82vTXvv1KxlNp/wJT7A43agaJBc+loYayilYS006nhZh2p+xhv0cKjyLHaBCE=',
    'x-amz-request-id': 'D676A6717FC15C00',
    'date': 'Wed, 12 Aug 2020 01:53:56 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker-scikit-learn-2020-08-12-00-15-59-775/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-scikit-learn-2020-08-11-23-49-14-814/source/sourcedir.tar.gz'},
   {'Key': 'capstone_project/sagemaker-scikit-learn-2020-08-12-00-15-59-775/debug-output/training_job_end.ts'},
   {'Key': 'capstone_project/user_purchase.csv'},
   {'Key': 'capstone_project/transcript.json'},
   {'Key': 'sagemaker-scikit-learn-2020-08-12-00-06-28-056/source/sourcedir.t