In [None]:
import dask.dataframe as dd
import pandas as pd
import boto3
import os

from dask_cloudprovider.aws import FargateCluster
from dask.distributed import Client

# modeling
import dask_ml
# from dask_ml. import 

# Building Our Model

Now that we have our imports in, let's get started on the fun stuff!

## Defining AWS ECS+Fargate Cluster
We'll need to define the underlying resources our code will run on. I'm using a temporary cluster, much like the one that will run our managed workloads via Prefect once automated. Here, we will define a cluster, spin up the resources, and use it for the duration of the model building process.

One important thing to note: if we don't specify items such as the ECS cluster, the security groups, or the IAM roles, dask-cloudprovider will create those resources itself without requiring our input. While that's convenient for quick projects, I'm going to intentionally avoid it and use some elbow grease for security's sake. I plan to reuse this setup in the future, and the fewer gaps that could create risk, the better. Also note that because we use AWS S3 to store and version our models and feature engineering pipelines, we need the correct permissions defined in our AWS infrastructure. This takes the form of an additional "Task IAM Role" that gives the process the right to connect to our buckets. It will also be useful when our workload in Prefect needs to fetch those serialized objects.
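To make the "Task IAM Role" idea a bit more concrete, here is a rough sketch of the kind of policy document such a role could carry. The helper name `s3_task_policy` and the bucket name `my-model-bucket` are placeholders I made up for illustration, and the action list is only one plausible least-privilege starting point, not the exact policy used in this project.

```python
import json


def s3_task_policy(bucket: str) -> dict:
    """Build an IAM policy document scoped to a single bucket (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading and writing apply to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:GetObjectVersion", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# "my-model-bucket" is a placeholder, not a bucket from this project.
policy_json = json.dumps(s3_task_policy("my-model-bucket"), indent=2)
```

The resulting JSON could then be attached to the task role in the IaC; `s3:GetObjectVersion` is included only because the bucket is meant to be versioned.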

In [None]:
# define our model development cluster
cluster = FargateCluster(
    n_workers = 4,
    fargate_use_private_ip = False, # I don't really feel like going through the trouble of making this private
    worker_cpu = 1024,
    worker_mem = 4096,
    vpc = os.getenv('AWS_VPC_ID'),
    subnets = [os.getenv('AWS_PRIVATE_SUBNET_1'), os.getenv('AWS_PRIVATE_SUBNET_2')], # subnets created in the IaC
    security_groups = [os.getenv('AWS_DASK_SECURITY_GROUP'), os.getenv('AWS_PRIVATE_SUBNET_SECURITY_GROUP')],
    cluster_arn = os.getenv('AWS_ECS_CLUSTER_ARN'),
    execution_role_arn = os.getenv('AWS_ECS_EXEC_ROLE_ARN'),
    task_role_arn = os.getenv('AWS_ECS_TASK_ROLE_ARN'),
    scheduler_timeout = '15 minutes', # a bit longer because we're just in "dev" right now
    image = 'daskdev/dask:2021.2.0', 
    environment = {
        'EXTRA_CONDA_PACKAGES': '',
        'EXTRA_PIP_PACKAGES': 'dask-ml boto3' # space-separated, handed to pip install at container start
    }
)
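Since the cluster definition leans on a handful of environment variables, and `os.getenv` quietly returns `None` when one is missing, it can help to check them up front. The sketch below is just one way to do that; `require_env` is a hypothetical helper, not part of dask-cloudprovider, and the example values are made up.

```python
import os


def require_env(names, environ=None):
    """Return the requested variables, or raise listing every missing one."""
    environ = os.environ if environ is None else environ
    # Treat unset and empty values the same way: both are unusable here.
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Demonstration against a fake environment (values are placeholders).
example_env = {"AWS_VPC_ID": "vpc-123"}
try:
    require_env(["AWS_VPC_ID", "AWS_ECS_CLUSTER_ARN"], environ=example_env)
except RuntimeError as err:
    message = str(err)
```

In the notebook, calling `require_env([...])` with the variable names used above before building the cluster would fail fast with a readable message instead of a confusing AWS error later on.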

We'll need to create our dask.distributed Client object to establish a connection to the Fargate cluster we're spinning up. Make sure to shut everything down when you're done: `client.close()` disconnects the client and `cluster.close()` tears down the cluster.

In [None]:
client = Client(cluster)

## Loading Data

## Feature Engineering

## Modeling Data

## Testing Inference

# MLOps

It's like DevOps, but with Machine Learning! Now comes the more boring part for a lot of folks. To have a robust and mature data science practice, you must invest time and thought into how your team will handle advanced, and sometimes convoluted, processes such as deploying models and monitoring each model's predictive accuracy across your systems. The gradual loss of that accuracy is known as [model decay](https://towardsdatascience.com/concept-drift-and-model-decay-in-machine-learning-a98a809ea8d4?gi=1cb3decbf414).

This is where tools like Amazon SageMaker, SAS VDMML, Azure Machine Learning, etc. come into play. Tools like these can be indispensable for established, large teams that need extra capabilities to support more agile approaches across several projects and sub-teams. For my project, I've decided to avoid them for now, opting instead for simpler and more budget-friendly methods, like using an S3 bucket for my feature engineering and model repository needs. It's important to understand all the tradeoffs of seemingly simple decisions like that before settling on a strategy for your team. Some things to keep in mind are the following:
1. Cost
2. Data Science Dept/Team Size, Structure, Available Skillsets, and Goals
3. Infrastructural Requirements -> Staffing, Security, and Integration Needs.
4. Model Deployment Frequencies and Types -> Batch, Web Service, Event Stream, etc.



## Serializing Feature Processing & Trained Model
To save a trained model, we'll need to use [serialization](https://en.wikipedia.org/wiki/Serialization). On this project I will be using joblib to get the "job" done. Just had to sneak that pun in, I hope I didn't lose anyone with it! Other popular serialization libraries in the Python ecosystem are [marshmallow](https://marshmallow.readthedocs.io/en/stable/) and [pickle](https://docs.python.org/3/library/pickle.html), if you'd like to explore other options. Pickle is the simplest one out there, but it's a hoot to optimize for given its underlying C code.
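As a small illustration of the round trip, here is a sketch that serializes an object to bytes and restores it. To keep it runnable without extra installs I'm using the standard-library pickle here rather than joblib; `joblib.dump` and `joblib.load` can be pointed at a file object in much the same way. The `fake_model` dictionary is just a stand-in for a trained estimator.

```python
import io
import pickle


def serialize(obj) -> bytes:
    """Serialize an object into an in-memory byte string."""
    buffer = io.BytesIO()
    pickle.dump(obj, buffer, protocol=pickle.HIGHEST_PROTOCOL)
    return buffer.getvalue()


def deserialize(payload: bytes):
    """Rebuild the object from bytes produced by serialize()."""
    return pickle.load(io.BytesIO(payload))


# Stand-in for a trained model; a real estimator would go here instead.
fake_model = {"coef": [0.5, -1.2], "intercept": 0.1}
restored = deserialize(serialize(fake_model))
```

Keeping the payload in memory as bytes is handy when the next stop is an S3 upload rather than a local file. Only load serialized objects from sources you trust, since unpickling can execute arbitrary code.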

## Writing Model To Versioned S3 Bucket

Using buckets created within the Pulumi IaC, we're going to save both our feature engineering objects and our model.

One thing that's important for a rock-solid workflow is to define a standard model versioning practice. To do so, we'll need to be deliberate about how we save these serialized objects to the cloud, whatever we are using as our model "repository".
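One possible versioning convention, sketched below, is to bake a UTC timestamp and a short content hash into the object key so every saved model gets a unique, sortable path. The `model_key` helper, the `models` prefix, the `churn` model name, and the key layout are all assumptions for illustration, not a convention this project has settled on.

```python
import hashlib
from datetime import datetime, timezone


def model_key(name: str, payload: bytes, when: datetime, prefix: str = "models") -> str:
    """Build an S3 object key that encodes when and what was saved."""
    # A UTC timestamp keeps keys sortable by creation time.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # A short content hash distinguishes different artifacts saved close together.
    digest = hashlib.sha256(payload).hexdigest()[:12]
    return f"{prefix}/{name}/{stamp}-{digest}/model.joblib"


# Placeholder model name and payload for demonstration.
key = model_key(
    "churn",
    b"serialized-bytes",
    datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc),
)

# Uploading would then look roughly like this (bucket name is a placeholder):
# boto3.client("s3").put_object(Bucket="my-model-bucket", Key=key, Body=payload)
```

With S3 bucket versioning turned on, S3 also keeps its own version IDs per key, so an explicit key scheme like this is a complement to that feature, not a replacement.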

## Clean Up
Don't forget to close out the temporary Dask cluster!

In [None]:
cluster.close()
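If this ever moves out of a notebook and into a script, it may be worth making the cleanup a bit more defensive so the cluster still gets torn down when closing the client fails. Here's a minimal sketch; `close_all` is a hypothetical helper and `_Stub` only stands in for the real client and cluster objects so the example runs on its own.

```python
def close_all(*resources):
    """Call close() on each resource in order, collecting errors instead of stopping."""
    errors = []
    for resource in resources:
        try:
            resource.close()
        except Exception as exc:  # keep going so later resources still get closed
            errors.append(exc)
    return errors


class _Stub:
    """Stand-in for a client or cluster; only used for this demonstration."""

    def __init__(self, fail=False):
        self.fail = fail
        self.closed = False

    def close(self):
        if self.fail:
            raise RuntimeError("close failed")
        self.closed = True


flaky_client, cluster_stub = _Stub(fail=True), _Stub()
errors = close_all(flaky_client, cluster_stub)
```

In real use this would be `close_all(client, cluster)`, ideally inside a `finally` block, with any returned errors logged afterwards.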