# Scheduling Machine Learning Pipelines using Apache Airflow

In this workshop, you will use Airflow to schedule a basic machine learning pipeline. The workshop consists of 3 assignments.

1. Schedule a basic 'hello world' example on Airflow
2. Schedule a machine learning pipeline on Airflow
3. Improve the pipeline by creating your own custom Airflow operator

## Assignment 1: Hello World

In this assignment, we are going to schedule a simple workflow on Airflow to get used to the concepts. The code below defines a DAG (directed acyclic graph) in Airflow. Inspect it and learn what the individual components mean.

In [1]:
import datetime
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator

# Defines default arguments, which are added to every operator of a DAG (if not overridden)
default_args = {
    'start_date': datetime.datetime(2019, 10, 1)
}

# defines a DAG
# Any operator initialized inside this context manager will be added to the DAG
with DAG(
    dag_id='hello_world', # Each DAG has a unique ID
    schedule_interval='0 0 * * *', # Each DAG has a schedule defined using cron syntax. This DAG will run once a day at midnight.
    default_args=default_args
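) as dag: # Binding the DAG to a module-level name lets the scheduler discover it when it imports this file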
    # Defines a Python operator, which can execute any Python callable
    print_hello_operator = PythonOperator(
        task_id='print_hello', # Every task within a DAG should have a unique ID
        python_callable=print, # The python callable that is used by the operator
        op_args=['hello'] # The arguments passed to the python callable
    )
    
    def print_world(): 
        print('world')

    print_world_operator = PythonOperator(
        task_id='print_world',
        python_callable=print_world, # In this operator, we used our own custom function
    )

    # Airflow offers many different types of operators. 
    # In this task, we use a Bash operator to, again, print something.
    print_airflow_operator = BashOperator(
        task_id='print_airflow',
        bash_command='echo airflow'
    )

    # The bitshift operator '>>' is overloaded by Airflow.
    # It is used to define the dependencies between tasks in a DAG.
    # In this case, print_world will run when print_hello is finished.
    # print_airflow will run as well when print_hello is finished.
    # Airflow could be configured to run these tasks in parallel, as they do not need to wait for each other.
    print_hello_operator >> print_world_operator
    print_hello_operator >> print_airflow_operator

This is everything we need to do to define a DAG that can be scheduled on Airflow. Because DAGs are defined using Python, we have a lot of freedom in how we design our DAG. We could, for example, dynamically create tasks by looping over lists. Furthermore, defining your DAGs as code makes it easy to track their versions in a source code management system.
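To make the idea of looping over a list and the overloaded `>>` operator more concrete, here is a minimal sketch that runs without an Airflow installation. The `Task` class is a hypothetical stand-in, not part of Airflow; it only mimics how `>>` records a downstream dependency. In a real DAG file you would create `PythonOperator` or `BashOperator` instances inside the `with DAG(...)` block instead.

```python
class Task:
    """Hypothetical stand-in for an Airflow operator (not Airflow's own class)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []  # tasks that may start once this one has finished

    def __rshift__(self, other):
        # `a >> b` records b as downstream of a and returns b,
        # which makes chains such as `a >> b >> c` possible.
        self.downstream.append(other)
        return other


start = Task('print_hello')

# Create one task per word by looping over a list,
# and make each of them depend on the first task.
words = ['world', 'airflow', 'workshop']
tasks = [Task('print_{}'.format(word)) for word in words]
for task in tasks:
    start >> task

print([t.task_id for t in start.downstream])
# prints ['print_world', 'print_airflow', 'print_workshop']
```

The same loop pattern carries over to real operators: build the `task_id` from the list element so that every task in the DAG keeps a unique ID.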

The next step is to actually run this example on Airflow. The Airflow scheduler periodically scans a folder, dubbed 'the DagBag', for files that define DAGs. This Jupyter notebook server has a folder called `dags`, which is also available to the Airflow scheduler via a network file system. Any Python file that we put there will be picked up by the scheduler.

- Copy the code snippet above. Go to the dags folder in the file explorer, press the 'New' button in the upper right corner, and create a **Text File**. Name it `hello_world.py`. The name of the file is flexible, but should end with `.py`. Paste the copied code inside.
- Go to port 8080 of your personal load balancer (you are now on 8888). Your DAG should appear within a few minutes.
- The DAG is turned off at first. Turn it on and see what happens.
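When interpreting what you see, it helps to know which runs the schedule implies. The sketch below is plain Python, not Airflow code, and rests on two assumptions: that Airflow's default catch-up behaviour is enabled, and that a run for a daily interval only becomes due once that interval has ended. The helper `daily_execution_dates` and the value of `now` are hypothetical and only serve the illustration.

```python
import datetime


def daily_execution_dates(start_date, now):
    """Return the start of every daily interval that has fully ended before `now`."""
    one_day = datetime.timedelta(days=1)
    dates = []
    current = start_date
    # The run for [current, current + 1 day) is due only after the interval is over.
    while current + one_day <= now:
        dates.append(current)
        current += one_day
    return dates


start = datetime.datetime(2019, 10, 1)       # same start_date as in default_args
now = datetime.datetime(2019, 10, 4, 9, 30)  # hypothetical current time

for date in daily_execution_dates(start, now):
    print(date.isoformat())
# prints:
# 2019-10-01T00:00:00
# 2019-10-02T00:00:00
# 2019-10-03T00:00:00
```

Under these assumptions, a DAG with a `start_date` far in the past gets many runs as soon as it is turned on, one for each interval that has already passed.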

## Assignment 2: Machine Learning Pipeline

In this assignment, we will create an Airflow DAG to schedule a basic machine learning pipeline. The pipeline will use the famous Iris dataset and consists of 2 steps:

1. Preprocess the dataset by adding some new features
2. Train a predictive model on the dataset

The goal is to schedule this training pipeline at a regular interval, so that your model is updated frequently with the latest data. In this example, we will use the same dataset for every run, but in practice you would want to use a new dataset every time the pipeline runs. This can be done using Airflow's [templating](https://airflow.apache.org/macros.html) mechanism, but is beyond the scope of this workshop.

You are provided with 2 scripts, `preprocess.py` and `train.py`, located in the folder `transform_scripts`. Each script transforms an input file and stores the result in an output file. The locations of the input and output files are provided as arguments to the scripts. The first script, `preprocess.py`, takes a CSV with raw training data, and outputs a CSV with preprocessed training data. The second script, `train.py`, takes a CSV with preprocessed training data, and outputs a pickled machine learning model. You are also provided with an S3 bucket containing our raw training data. Our DAG should do the following:

1. Retrieve the raw training data from S3, apply the `preprocess.py` transform script to it, and send the preprocessed CSV back to S3
2. Retrieve the preprocessed training data from S3, apply the `train.py` transform script to it, and send the pickled model to S3
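The contract of the transform scripts (input path and output path passed as arguments) can be sketched as follows. This is not the content of the provided `preprocess.py`; the column names `sepal_length` and `sepal_width` and the added `sepal_ratio` feature are assumptions made for the illustration.

```python
import csv


def preprocess(input_path, output_path):
    """Read a raw CSV, add one derived feature, and write the result to a new CSV."""
    with open(input_path, newline='') as src, open(output_path, 'w', newline='') as dst:
        reader = csv.DictReader(src)
        # 'sepal_ratio' is a hypothetical new feature, appended as the last column.
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ['sepal_ratio'])
        writer.writeheader()
        for row in reader:
            row['sepal_ratio'] = float(row['sepal_length']) / float(row['sepal_width'])
            writer.writerow(row)


def main(argv):
    # argv holds exactly two entries: the input location and the output location.
    input_path, output_path = argv
    preprocess(input_path, output_path)

# In a standalone script you would end with:
#     if __name__ == '__main__':
#         main(sys.argv[1:])   # requires `import sys`
```

Keeping the file locations out of the script itself means the same script works on any pair of local files, which is what allows a scheduler to download data from S3 to a temporary location, run the script, and upload the result.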