## Airflow Introduction 

#### 1. Design Concepts
Airflow is a platform to programmatically author, schedule and monitor workflows.  

Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap.

Airflow also has a rich user interface that makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

#### 2. DAG
Airflow relies on the concept of a DAG. In mathematics and computer science, a directed acyclic graph (DAG) is a finite directed graph with no directed cycles.

A DAG is **one-directional**: it consists of finitely many vertices and edges, with each edge directed from one vertex to another, such that there is no way to start at any vertex v and follow a consistently directed sequence of edges that eventually loops back to v. Equivalently, a DAG is a directed graph that has a topological ordering, a sequence of the vertices such that every edge is directed from earlier to later in the sequence.

A DAG has many desirable properties when working on graph problems. For instance, a DAG can be topologically sorted, which gives a valid order in which to run tasks so that every task runs after the tasks it depends on.
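As a small illustration of topological ordering, the sketch below sorts a made-up set of task dependencies using Python's standard `graphlib` module (Python 3.9+). The task names are hypothetical and this is not how Airflow itself is implemented; it only shows the graph property described above.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# A topological order lists every task after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

# A graph with a cycle has no topological order, so it is not a DAG.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

Because the dependencies above form a single chain, there is only one valid order. With branching dependencies, several orders can be valid.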

#### 3. Operator 

In Airflow, all workflows are DAGs. A DAG consists of operators. An operator describes a single task in a workflow and determines what actually gets done; different types of operators are available. Once an operator is instantiated, it is referred to as a "task".

Categories of operator
 
1. **Sensor**: A type of operator that keeps running until a certain criterion is met. Examples include waiting for a certain time, an external file, or an upstream data source.  
    e.g. HdfsSensor: waits for a file or folder to land in HDFS  
    e.g. NamedHivePartitionSensor: checks whether the most recent partition of a Hive table is available
2. **Operator**: Triggers a certain action (e.g. run a bash command, execute a Python function, or execute a Hive query).  
    e.g. PythonOperator  
    e.g. BashOperator  
3. **Transfer**: Moves data from one location to another  
    e.g. MySqlToHiveTransfer
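To make the sensor idea more concrete, here is a minimal plain-Python sketch of "keep checking until a criterion is met". It does not use Airflow, and `wait_until` and `file_has_landed` are hypothetical names; the `poke_interval` and `timeout` parameter names are borrowed loosely from Airflow's sensors, but the logic is only an illustration.

```python
import time


def wait_until(criterion, poke_interval=0.01, timeout=1.0):
    """Call criterion() repeatedly until it returns True or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if criterion():
            return True
        time.sleep(poke_interval)  # wait before checking again
    return False


# Hypothetical criterion that only becomes true on the third check,
# standing in for "the file has landed".
checks = {"count": 0}


def file_has_landed():
    checks["count"] += 1
    return checks["count"] >= 3


succeeded = wait_until(file_has_landed)
print(succeeded, checks["count"])
```

A real sensor would replace `file_has_landed` with an actual check, such as looking for a path in HDFS.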


#### 4. Example
Install Airflow using:
```
pip install apache-airflow

```
First, import the modules we need:
```
import datetime as dt  # default_args below uses dt.datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

```
Airflow lets you supply default arguments that apply to every task in a DAG. In this example, we include ```owner```, ```start_date```, ```concurrency``` and ```retries```.

```
default_args = {
    'owner': 'airflow',
    'start_date': dt.datetime(2018, 9, 24, 10, 00, 00),
    'concurrency': 1,
    'retries': 0
}
```

```start_date``` specifies when the workflow starts.  
```concurrency``` limits how many task instances may run at the same time (in Airflow this is normally set on the ```DAG``` itself rather than per task).  
```retries``` sets how many times a task is retried after it fails. More detailed documentation on default arguments can be found at https://airflow.apache.org/docs/stable/concepts.html
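The idea behind default arguments can be sketched in plain Python: every task starts from the shared defaults, and any value given explicitly for a task wins. The helper `resolve_task_args` and the `flaky` task are hypothetical, and Airflow's real mechanism differs in detail, so treat this only as a mental model.

```python
import datetime as dt

default_args = {
    "owner": "airflow",
    "start_date": dt.datetime(2018, 9, 24, 10, 0, 0),
    "retries": 0,
}


def resolve_task_args(defaults, **task_args):
    """Return the effective arguments: shared defaults first, explicit values win."""
    return {**defaults, **task_args}


# This task inherits everything from the defaults.
hello_args = resolve_task_args(default_args, task_id="say_Hi")

# This (hypothetical) task overrides retries but keeps the other defaults.
flaky_args = resolve_task_args(default_args, task_id="flaky", retries=3)

print(hello_args["retries"], flaky_args["retries"], flaky_args["owner"])
```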

Then, we can create a DAG using this simple default argument set:
```
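# Placeholder callables for the two PythonOperators; the original example
# does not show them, so substitute your own functions.
def greet():
    print('Hello from greet')


def respond():
    print('Hello from respond')


with DAG('my_simple_dag',
         default_args=default_args,
         schedule_interval='*/10 * * * *',  # cron expression: every 10 minutes
         ) as dag:
    opr_hello = BashOperator(task_id='say_Hi',
                             bash_command='echo "Hi!!"')

    opr_greet = PythonOperator(task_id='greet',
                               python_callable=greet)

    opr_sleep = BashOperator(task_id='sleep_me',
                             bash_command='sleep 5')

    opr_respond = PythonOperator(task_id='respond',
                                 python_callable=respond)

# Run the tasks in sequence: say_Hi, then greet, then sleep_me, then respond
opr_hello >> opr_greet >> opr_sleep >> opr_respond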
```

Here in the example we have four operators, namely:

```opr_hello``` A Bash operator  
```opr_greet``` A Python operator  
```opr_sleep``` A Bash operator  
```opr_respond``` A Python operator  

Each operator corresponds to a task in the workflow. Once they are instantiated, they become part of the workflow.

They are then connected with the ```>>``` operator (Python's right-shift operator, which Airflow overloads) to indicate the order of execution. Here ```opr_hello >> opr_greet >> opr_sleep >> opr_respond``` means the tasks run in the order  
opr_hello -> opr_greet -> opr_sleep -> opr_respond
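To see why a chain like this works, the following plain-Python sketch overloads `>>` through the `__rshift__` method. The `Task` class and `walk` helper are hypothetical stand-ins, not Airflow's operator classes; the key point is that each `>>` returns its right-hand operand, so the next `>>` in the chain attaches to it.

```python
class Task:
    """Hypothetical stand-in for an operator; only tracks downstream tasks."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # a >> b records b as downstream of a and returns b,
        # so a >> b >> c is evaluated as (a >> b) >> c, i.e. b >> c.
        self.downstream.append(other)
        return other


hello, greet, sleep, respond = (
    Task(name) for name in ("say_Hi", "greet", "sleep_me", "respond")
)
hello >> greet >> sleep >> respond


def walk(task):
    """Follow the first downstream link from task and collect the task ids."""
    order = [task.task_id]
    while task.downstream:
        task = task.downstream[0]
        order.append(task.task_id)
    return order


chain = walk(hello)
print(" -> ".join(chain))  # say_Hi -> greet -> sleep_me -> respond
```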
