# Airflow

<a href=https://airflow.apache.org><img src=images/Airflow_logo.png width=400></a>

Up to this point, you have learnt a lot of things about data: Extracting the data, Transforming the data, and Loading the data. As mentioned multiple times, this "ETL" is the "core" of Data Engineering. However, all these operations work in tandem to create a workflow.

A workflow is a series of steps that are executed in a specific order. For example, to extract data from a source, transform it, and load it into a target, the following steps are required in order:

1. __Step 1:__ __Extract__ data using, for example, the web-scraping skills you have acquired.
2. __Step 2:__ __Transform__ the data using, for example, data-cleaning skills in pandas.
3. __Step 3:__ __Load__ the data into a target, for example, a database located in your local environment or a remote environment.<br><br>

<p align="center">
    <img src="images/WorkFlow1.png" width="500"/>
</p>
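
As a rough illustration of the three steps above, the sketch below chains three plain Python functions. The function names, the sample rows, and the in-memory `database` list are all made up for this example; a real pipeline would scrape a website and write to an actual database.

```python
# A toy ETL workflow: each step is a function, and the order of the calls is the workflow.
def extract():
    # Stand-in for web scraping: return raw, messy records.
    return [" London,12 ", "Paris,15", "  Madrid,21"]


def transform(raw_rows):
    # Clean the whitespace and convert the temperature to an integer.
    cleaned = []
    for row in raw_rows:
        city, temperature = row.strip().split(",")
        cleaned.append({"city": city, "temperature": int(temperature)})
    return cleaned


def load(rows, target):
    # Stand-in for a database insert: append to an in-memory list.
    target.extend(rows)
    return len(rows)


database = []
loaded = load(transform(extract()), database)
print(f"Loaded {loaded} rows: {database}")
```

Each function only needs the output of the previous one, which is exactly the kind of ordering a workflow manager keeps track of for you.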

Workflows can also be very helpful when developing an ML model. For example, you might want to train a model with data, but not know exactly which algorithm to use. In that case, these steps should be followed in order:

1. __Extract__ data using, for example, the webscraping skills you learnt
2. __Transform__ the data using, for example, the data cleaning skills in pandas
3. __Train__ multiple models using the data, and obtain the accuracy of each model
4. __Choose__ the model with the highest accuracy

<p align="center">
    <img src="images/WorkFlow2.png" width="600"/>
</p>

# Airflow as a workflow manager

Apache Airflow is a task-orchestration tool that allows you to define a series of tasks to be executed in a specific order. The tasks can be run in a distributed manner using Airflow's scheduler.

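In Airflow, you use Directed Acyclic Graphs (DAGs) to define a workflow. Each node in the DAG corresponds to a task, and each edge is a dependency that says which task must finish before another can start. Because the graph is acyclic, a task can never end up depending on itself.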
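
To make the idea of a DAG more concrete, here is a small sketch that uses only the Python standard library (`graphlib`, available from Python 3.9), not Airflow itself. The task names are hypothetical and mirror the ML workflow described earlier; the sketch simply computes one valid execution order from the dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each task to the set of tasks it depends on (hypothetical task names).
workflow = {
    "transform": {"extract"},
    "train_model_a": {"transform"},
    "train_model_b": {"transform"},
    "choose_model": {"train_model_a", "train_model_b"},
}

# static_order() yields the tasks so that every task comes after its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

The two training tasks do not depend on each other, so either may come first. If the dependencies formed a cycle, `TopologicalSorter` would raise `graphlib.CycleError`, which is why the graph has to be acyclic.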

Installing Airflow could be as simple as running `pip install apache-airflow`; however, that might cause dependency errors. To prevent those errors, run the following commands in your __`terminal`__.
<details>
  <summary><font size="+2">IMPORTANT: For Windows Users</font></summary>
  
  If you are on Windows make sure to download Ubuntu from the Microsoft store and install it. Then, update everything: `sudo apt update && sudo apt upgrade` and install python3-pip: `sudo apt-get install python3-pip`. Then you can follow the instructions below.

</details>

At the time of writing, the version of Airflow is 2.1.3. If you are going to use a different version, change it in the following code:

```
export AIRFLOW_HOME=~/airflow

AIRFLOW_VERSION=2.1.3
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
```

Now you are storing the values of Airflow and your Python version in two variables that are going to be used in the following command:

```CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"```
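
If it helps to see how that URL is assembled, the same logic can be sketched in Python. The helper name `constraint_url` is made up for this illustration; the URL pattern is the one used in the shell command above.

```python
import sys


def constraint_url(airflow_version, python_version=None):
    """Build the constraints-file URL used by the pip command."""
    if python_version is None:
        # Same major.minor value that the shell pipeline extracts from `python --version`.
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


print(constraint_url("2.1.3", "3.8"))
```

Only the major and minor parts of the Python version are used, which is why the shell version cuts the output of `python --version` down to two fields.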


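Now install that version of Airflow with pip. The constraints file, which is hosted in the Airflow GitHub repository, pins the dependencies to versions that are known to work together: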

```pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"```


Airflow is now installed. Before using it, initialise its database, which stores metadata about the DAGs you create, their runs, and your users:

`airflow db init`

Next, create an admin user, replacing the placeholders with your own details. You will be prompted to choose a password:

```
airflow users create \
    --username <your_username> \
    --firstname <your_firstname> \
    --lastname <your_lastname> \
    --role Admin \
    --email <your_email>
```

To confirm that everything works, run

`airflow webserver --port 8080`

This will start a web server on your localhost at port 8080. Note that the web server alone does not run or monitor your scheduled DAGs; for that, the scheduler is required. Open a new terminal and run

`airflow scheduler`

If you receive a warning message, don't worry; it won't affect how Airflow behaves in this lesson. Now we are ready to start using the UI.

If you go to your browser and visit `localhost:8080`, your output should be similar to that in the figure:

![](images/Airflow.png)

The image above depicts the Airflow UI. Here, you can see the DAGs that have been created, and so far, you will only see some examples and tutorials created by the Airflow team. Let's explore it a little bit.

# Airflow UI

Inside the UI, you can explore the metadata of the DAGs, such as the name (or ID), owner, status of previous runs of the whole DAG or of specific tasks inside the DAG, frequency (in the Schedule column), and run history.

For more details, click on the DAG. As an example, we observe the `example_bash_operator` DAG.

<p align="center">
    <img src="images/AirFlow2.png" width="500"/>
</p>

In the DAG, we can observe the structure, average run time of each task, Gantt chart of the DAG to determine if there are overlapping tasks, details of the DAG, and code that generated the DAG. Since this DAG has not been run, there is no info about previous runs. We can, however, take a look at the code. Before that, we observe the `Graph View` tab, which displays the same information as the `Tree View` tab, but rearranged:

<p align="center">
    <img src="images/AirFlow4.png" width="500"/>
</p>



Observe that there are several nodes, each one representing a task. The arrows between them are their dependencies; e.g. `run_after_loop` will not start until all `runme_x` tasks have finished.

To understand the working mechanism, we run a single task.

1. In the Airflow UI, enable the `example_bash_operator` DAG. 
2. Click the DAG to view its status. You should see two runs. As we will see later in the code, this example has a start date of two days ago and a schedule of once a day, so two runs are due.
3. Inside the runs, the tasks have different statuses. In this case, we see 'success' and 'skipped'. Note that the skipped task is meant to be skipped.
<p align="center">
    <img src="images/AirFlow4.png" width="500"/>
</p>

4. Next, we examine its flow by triggering a run manually. First, click Auto-refresh to see updates in real time, and then click the Play button:

<p align="center">
    <img src="images/AirFlow_clip.gif" width="500"/>
</p>

Pretty cool, isn't it? We can also see the duration of each task and when each run took place, but I will let you explore that in the UI. For now, let's take a look at the code. If you click on the `Code` tab, you will see this:


```python
"""Example DAG demonstrating the usage of the BashOperator."""

from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago

args = {
    'owner': 'airflow',
}

with DAG(
    dag_id='example_bash_operator',
    default_args=args,
    schedule_interval='0 0 * * *',
    start_date=days_ago(2),
    dagrun_timeout=timedelta(minutes=60),
    tags=['example', 'example2'],
    params={"example_key": "example_value"},
) as dag:

    run_this_last = DummyOperator(
        task_id='run_this_last',
    )

    # [START howto_operator_bash]
    run_this = BashOperator(
        task_id='run_after_loop',
        bash_command='echo 1',
    )
    # [END howto_operator_bash]

    run_this >> run_this_last

    for i in range(3):
        task = BashOperator(
            task_id='runme_' + str(i),
            bash_command='echo "{{ task_instance_key_str }}" && sleep 1',
        )
        task >> run_this

    # [START howto_operator_bash_template]
    also_run_this = BashOperator(
        task_id='also_run_this',
        bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
    )
    # [END howto_operator_bash_template]
    also_run_this >> run_this_last

# [START howto_operator_bash_skip]
this_will_skip = BashOperator(
    task_id='this_will_skip',
    bash_command='echo "hello world"; exit 99;',
    dag=dag,
)
# [END howto_operator_bash_skip]
this_will_skip >> run_this_last

if __name__ == "__main__":
    dag.cli()
```
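
The two runs mentioned earlier follow from two settings in the code above: `start_date=days_ago(2)` and the daily schedule `'0 0 * * *'`. Airflow starts a scheduled run once the interval it covers has ended, so a start date of midnight two days ago leaves two completed daily intervals. The sketch below reproduces that count with plain `datetime` arithmetic; the function name and the example dates are made up, and it is a simplification rather than Airflow's scheduler logic.

```python
from datetime import datetime, timedelta


def completed_daily_runs(start_date, now):
    """Count the daily intervals that began at start_date or later and have fully ended by `now`."""
    interval = timedelta(days=1)
    runs = 0
    interval_start = start_date
    while interval_start + interval <= now:
        runs += 1
        interval_start += interval
    return runs


now = datetime(2021, 9, 3, 10, 30)   # hypothetical "current" time
start_date = datetime(2021, 9, 1)    # midnight two days before `now`, like days_ago(2)
print(completed_daily_runs(start_date, now))  # prints 2
```

The interval that starts on the current day has not ended yet, so it does not produce a run until the following midnight.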

If a task prints something to the console, we can check it in the Log tab of that task. For example, consider the `also_run_this` task. It is a BashOperator that echoes `run_id={{ run_id }} | dag_run={{ dag_run }}`, with the two templated values filled in at run time. Go to the `Graph View` tab, and click on the `also_run_this` task. In the next window, click `Log`. Observe the output:

<p align="center">
    <img src="images/AirFlow_log.png" width="500"/>
</p>

### Summary
This was a simple walkthrough of the Airflow UI. 
We saw that

- The workflow is represented by a DAG
- Each node in the DAG corresponds to a task
- Each DAG has a schedule that sets the frequency of runs
- Tasks can be triggered by previous tasks
- Each task corresponds to an operator object
- We saw BashOperator, which executes a Bash command
- We saw DummyOperator, which according to the documentation _'Operator that does literally nothing. It can be used to group tasks in a DAG.'_

We will see how to create more operators, for example, a PythonOperator, in the next section. First, let's get some practice defining a DAG with the operators we have seen so far.

# Creating Your First DAG

First of all, make sure you have followed all the steps so far. If that's the case, you should have a folder in your home directory named airflow. _Check it by running the following cell. If no error is thrown, you are good to go._
<details>
  <summary><font size="+1">IMPORTANT: For Windows Users</font></summary>
  
  If you are on Windows make sure to check it on the wsl terminal. You can simply type `ls ~` and check if there is a folder

</details>

```python
from os.path import expanduser
import os

home = expanduser("~")
airflow_dir = os.path.join(home, 'airflow')
# Raises an AssertionError if ~/airflow does not exist
assert os.path.isdir(airflow_dir)
```

Inside that directory, you have to add a new folder named `dags`. Airflow scans that folder for the DAGs you define in Python. The example DAGs you have been using so far ship with the Airflow package itself, but new DAGs you create should be placed in `~/airflow/dags/`. _You can change the path where Airflow looks for new DAGs through the `dags_folder` setting in the airflow.cfg file._

<details>
  <summary><font size="+1">IMPORTANT: For Windows Users</font></summary>
  
  If you are on Windows, go to the WSL console, move to the Airflow directory with `cd ~/airflow`, and create the `dags` folder.
</details>

```python
import os
from os.path import expanduser
from pathlib import Path

home = expanduser("~")
airflow_dir = os.path.join(home, 'airflow')
# Create ~/airflow/dags; do nothing if it already exists
Path(f"{airflow_dir}/dags").mkdir(parents=True, exist_ok=True)
```

The Python files you create must be stored in that folder. Each file should contain a DAG with the desired arguments. The arguments can be passed directly to the `DAG` context manager or grouped in a dictionary that is passed as `default_args`.

Inside the context manager, simply define the tasks; don't implement any logical flow there. As seen above, tasks are defined by operators.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    'owner': 'Ivan',
    'depends_on_past': False,
    'email': ['ivan@theaicore.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'start_date': datetime(2020, 1, 1),
    'retry_delay': timedelta(minutes=5),
    # No runs are scheduled after end_date; move it forward if it has already passed
    'end_date': datetime(2022, 1, 1),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'wait_for_downstream': False,
    # 'dag': dag,
    # 'trigger_rule': 'all_success'
}

with DAG(dag_id='test_dag',
         default_args=default_args,
         schedule_interval='*/1 * * * *',  # cron expression: every minute
         catchup=False,  # do not backfill the runs missed since start_date
         tags=['test']
         ) as dag:
    # Define the tasks. Here we define a single BashOperator.
    # Tasks created inside the `with` block are attached to the DAG automatically.
    test_task = BashOperator(
        task_id='write_date_file',
        bash_command='cd ~/Desktop && date >> ai_core.txt')
```

This example can be found in the `examples` folder, under the name `dag_test.py`. Copy the example to your `dags` folder in your airflow directory.
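
The `schedule_interval` values used in this lesson (`'*/1 * * * *'` here and `'0 0 * * *'` in the Airflow example DAG) are standard five-field cron expressions. The sketch below only labels the fields so they are easier to read; `describe_cron` is a made-up helper and does not validate or evaluate the expression.

```python
# The five fields of a standard cron expression, in order.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Split a cron expression and label each field by position."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


print(describe_cron("*/1 * * * *"))  # every minute
print(describe_cron("0 0 * * *"))    # at minute 0 of hour 0, i.e. once a day at midnight
```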

<details>
  <summary><font size="+1">IMPORTANT: For Windows Users</font></summary>
  
  If you are on Windows, use the `cp` command in the command line to copy the file to `~/airflow/dags`. If you struggle with these commands, and you want to copy everything manually, follow these instructions to find the folder that stores the files from the Ubuntu console.
</details>

Once the file is in the `dags` folder, you can activate the DAG by running the following command (if you haven't started the scheduler yet, run `airflow scheduler -D`):

`airflow dags unpause test_dag`

If you also want these DAGs to appear in the UI, so that you can manage them from there, run the following command:

`airflow db init`

## Tasks Dependencies

We have created a task; however, since a workflow is composed of more than one task, let's add more. If you only define the tasks, Airflow makes no promise about the order in which they run (how many can run at the same time depends on the executor, which you can change in the airflow.cfg file). To control their order, set dependencies between them.

Setting dependencies is quite simple. You define the tasks and then 'connect' them using the bit-shift operators. For example, to run `task_1` after `task_0`, use any of the following:

`task_0 >> task_1` or `task_0.set_downstream(task_1)` or `task_1 << task_0` or `task_1.set_upstream(task_0)`.

Evidently, there are many ways to set the dependencies. Thus, simply pick the one that works for you.

If you intend to run both `task_1` and `task_2` after `task_0`, do the following:

`task_0 >> [task_1, task_2]`.

If you intend to run `task_2` only after the completion of `task_0` and `task_1`, do the following:
```
task_0 >> task_2
task_1 >> task_2
```

Finally, it is also possible to set sequential dependencies between tasks. For example, if you intend to run `task_2` after `task_1`, and `task_1` after `task_0`, do the following:

`task_0 >> task_1 >> task_2`
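
To see why the `>>` and `<<` syntax works, here is a toy class that overloads the bit-shift operators to record dependencies. It is only an illustration written for this lesson; `ToyTask` is a made-up name and this is not how Airflow's operators are implemented internally.

```python
class ToyTask:
    """A stand-in for an operator that only records its dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []  # tasks that run after this one

    def set_downstream(self, other):
        # Accept a single task or a list of tasks.
        others = other if isinstance(other, list) else [other]
        for task in others:
            if task not in self.downstream:
                self.downstream.append(task)

    def __rshift__(self, other):
        # task_a >> task_b
        self.set_downstream(other)
        return other  # returning the right-hand side is what makes chaining work

    def __lshift__(self, other):
        # task_b << task_a
        others = other if isinstance(other, list) else [other]
        for task in others:
            task.set_downstream(self)
        return other


task_0, task_1, task_2 = (ToyTask(f"task_{i}") for i in range(3))
task_0 >> task_1 >> task_2
print([t.task_id for t in task_0.downstream])  # ['task_1']
print([t.task_id for t in task_1.downstream])  # ['task_2']
```

Because `task_0 >> task_1` evaluates to `task_1`, the expression `task_0 >> task_1 >> task_2` is read as two separate dependencies, one after the other.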

The example below shows a DAG with four tasks:

1. date_task: A BashOperator that appends the current date into a file
2. add_task: A BashOperator that stages the file created by date_task
3. commit_task: A BashOperator that commits the file staged by add_task
4. push_task: A BashOperator that pushes the committed file to a remote repository

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    'owner': 'Ivan',
    'depends_on_past': False,
    'email': ['ivan@theaicore.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    # A start_date in the past would trigger backfilling unless catchup=False (set below)
    'start_date': datetime(2020, 1, 1),
    'retry_delay': timedelta(minutes=5),
    # No runs are scheduled after end_date; move it forward if it has already passed
    'end_date': datetime(2022, 1, 1),
}
with DAG(dag_id='test_dag_dependencies',
         default_args=default_args,
         schedule_interval='*/1 * * * *',
         catchup=False,
         tags=['test']
         ) as dag:
    # Define the tasks. Here we define four BashOperators.
    date_task = BashOperator(
        task_id='write_date',
        bash_command='cd ~/Desktop/Weather_Airflow && date >> date.txt')
    add_task = BashOperator(
        task_id='add_files',
        bash_command='cd ~/Desktop/Weather_Airflow && git add .')
    commit_task = BashOperator(
        task_id='commit_files',
        bash_command='cd ~/Desktop/Weather_Airflow && git commit -m "Update date"')
    push_task = BashOperator(
        task_id='push_files',
        bash_command='cd ~/Desktop/Weather_Airflow && git push')

    # Chained form: write the date, then stage, then commit
    date_task >> add_task >> commit_task
    # One pair at a time: push only after both staging and committing
    add_task >> push_task
    commit_task >> push_task
```

In the last part of the DAG, you can observe the dependencies between the tasks. You could set them all in a single chain; however, in this case, they are split across several lines to show different ways of setting dependencies.

<p align="center">
    <img src="images/AirFlow_Dependencies.png" width="500"/>
</p>

After running it, you will find that your repo updates every minute.

<p align="center">
    <img src="images/AirFlow_GitHub.png" width="500"/>
</p>

## Try it out

1. Create a new remote repository in your GitHub account. 
2. You will eventually use it for storing weather data, so name your repository accordingly.
3. Clone the repository to your local machine.
4. Copy the DAG file `dag_test_dependencies.py` to the folder `dags` in your local machine.
5. Change the file according to the name of your repository and the directory you cloned it to.
6. Unpause the DAG by running `airflow dags unpause test_dag_dependencies` (the command takes the `dag_id` defined in the file, not the file name) or by going to the `DAGs` tab in the UI and switching on the `test_dag_dependencies` DAG.

As you start creating DAGs, you might forget which ones are active. The good news is that Airflow has many commands for checking your work from the command line. If you type `airflow -h`, you can see all the commands.

```
airflow -h

usage: airflow [-h] GROUP_OR_COMMAND ...

positional arguments:
  GROUP_OR_COMMAND

    Groups:
      celery         Celery components
      config         View configuration
      connections    Manage connections
      dags           Manage DAGs
      db             Database operations
      jobs           Manage jobs
      kubernetes     Tools to help run the KubernetesExecutor
      pools          Manage pools
      providers      Display providers
      roles          Manage roles
      tasks          Manage tasks
      users          Manage users
      variables      Manage variables

    Commands:
      cheat-sheet    Display cheat sheet
      info           Show information about current Airflow and environment
      kerberos       Start a kerberos ticket renewer
      plugins        Dump information about loaded plugins
      rotate-fernet-key
                     Rotate encrypted connection credentials and variables
      scheduler      Start a scheduler instance
      sync-p
```

For example, you can list your DAGs by running

```
airflow dags list
```


# Airflow Variables

As you probably noticed in the dependencies example, we had to repeat the directory path in every BashOperator, which is repetitive and error-prone. To avoid this, we define a variable that contains the path to the directory in which the file is stored.

Airflow provides a channel to define variables from the UI or the command line. In this case, we will only use the UI. The variables you include in the UI will then be available in the Python code.

Hence, open your UI, click on 'Admin' and subsequently on 'Variables'.

<p align="center">
    <img src="images/AirFlow_variables.png" width="500"/>
</p>

In the next window, you can add your variables. You can import a file from your computer or click the `+` sign to add a variable manually.


<p align="center">
    <img src="images/AirFlow_variables2.png" width="500"/>
</p>

In the next window, you can add the name of the variable in the Key and the value in the Value.

<p align="center">
    <img src="images/AirFlow_variables3.png" width="500"/>
</p>

After clicking `Save`, the variable will be stored in the Airflow database, and you will be able to use it in your Python code. To access it, you must import the class Variable:
```
from airflow.models import Variable

weather_dir = Variable.get("weather_dir")
```

Now, you will be able to use that variable in your Python code. If you look at the script in `dag_test_variables.py`, you will see that we are using the variable `weather_dir` to define the path to the file.
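
As a sketch of how a single variable removes the repeated paths, the helper below builds the four Bash commands of the dependencies example from one base directory. The function name `git_commands` is made up for this illustration and is not taken from `dag_test_variables.py`. Inside a DAG file, the directory would come from `Variable.get("weather_dir")`; a plain string is used here so the sketch runs without Airflow.

```python
def git_commands(repo_dir):
    """Build the Bash command of each task from one base directory."""
    prefix = f"cd {repo_dir} && "
    return {
        "write_date": prefix + "date >> date.txt",
        "add_files": prefix + "git add .",
        "commit_files": prefix + 'git commit -m "Update date"',
        "push_files": prefix + "git push",
    }


# Assumption: in a DAG file this would be weather_dir = Variable.get("weather_dir")
weather_dir = "~/Desktop/Weather_Airflow"
for task_id, command in git_commands(weather_dir).items():
    print(task_id, "->", command)
```

With this arrangement, moving the repository only requires changing the variable in the UI, not editing every task.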

# Python Operators

As mentioned above, it is possible to run Bash commands in each task. However, Airflow is not limited to these commands. You can also create PythonOperators for any task, as long as its logic is contained in a Python function. As an example, we create a PythonOperator that will extract information about events that occurred 'On this day' in the past.

The first thing you have to do is create a function that uses requests and bs4 to download that information from Wikipedia.

```python
import os
from os.path import expanduser

import requests
from bs4 import BeautifulSoup

def get_ul(url: str):
    # Download the page and return the first <ul> element it contains
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup.find('ul')

def get_today_events(url: str, file_dir: str):
    # Write the text of every list item to the output file
    ul_soup = get_ul(url)
    for li in ul_soup.find_all('li'):
        write_file(li.text, file_dir)

def write_file(li: str, file_dir: str):
    # Append one event per line
    with open(file_dir, 'a') as f:
        f.write(li)
        f.write('\n')

home = expanduser("~")
desktop_dir = os.path.join(home, 'Desktop/test_2.txt')
get_today_events('https://en.wikipedia.org/wiki/Wikipedia:On_this_day/Today', desktop_dir)
```

These functions can be passed to the PythonOperator as arguments. Here, we will pass the function `get_today_events` to the PythonOperator. Note that two fields must be included in the PythonOperator, and a third is optional:

1. The task id (`task_id`)
2. The Python function to be executed (`python_callable`)
3. The arguments of the function (optional; `op_args` for positional arguments or `op_kwargs` for keyword arguments)

Notably, although functions are conventionally defined at the top of a script, in this case they are defined right before the PythonOperator that uses them.
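
The PythonOperator accepts the function through its `python_callable` parameter and the function's arguments through `op_args` (a list) or `op_kwargs` (a dictionary). Conceptually, running the task amounts to calling the function with those arguments. The sketch below imitates that step without Airflow; `run_task` and `describe_download` are made-up names for this illustration, and the file path is only a placeholder.

```python
def run_task(python_callable, op_args=None, op_kwargs=None):
    """Toy stand-in for what happens when a Python task runs: call the function."""
    return python_callable(*(op_args or []), **(op_kwargs or {}))


def describe_download(url, file_dir):
    # Hypothetical function standing in for get_today_events; it does no network access.
    return f"would download {url} into {file_dir}"


result = run_task(
    describe_download,
    op_kwargs={
        "url": "https://en.wikipedia.org/wiki/Wikipedia:On_this_day/Today",
        "file_dir": "/tmp/events.txt",
    },
)
print(result)
```

Note that the function itself is passed, without parentheses; the call only happens when the task runs.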

### Example

Create a PythonOperator that will download the events that occurred 'on this day', as shown above. The file will be uploaded to a remote repository.

1. Create a new remote repository on GitHub.
2. Clone the repository to your local machine.
3. Add a variable in the Airflow UI that stores the path to your local clone of the repository.
4. Create the DAG, where you will call the function. Thereafter, stage the changes, commit them, and push them to the remote repository. The DAG should run daily.
5. Move the DAG file to the `dags` folder in your local machine.
6. Test the DAG by running `airflow dags test <dag_id> <execution_date>`, where the execution date is a date such as `2021-09-01`.

You have a small template in the examples folder.

# XCom: Connecting Tasks

When working with a Python script, you may need to pass information from one function to another. When working with tasks in Airflow, you can achieve this using the XCom (short for "cross-communication") feature.

XCom values are stored in Airflow's metadata database. You can read more about XCom in the [official documentation](https://airflow.apache.org/concepts.html#xcom). A task can push values while it is running, and later tasks can pull them once it has finished. Take a look at the following code (contained in `dag_test_xcom.py`), which passes information between PythonOperators.


```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from random import uniform
from datetime import datetime

default_args = {
    'start_date': datetime(2020, 1, 1)
}


def training_model(ti):
    accuracy = uniform(0.1, 10.0) # In this case, the accuracy is a random number
                                  # but when you use it with your ML models, you
                                  # can call your real models inside
    print(f"model's accuracy: {accuracy}")
    # Store the value under a key so that later tasks can retrieve it
    ti.xcom_push(key='model_accuracy', value=accuracy)


def choose_best_model(ti):
    # Only model A is pulled here; add the other task ids to the list
    # to fetch the accuracy of all three models
    fetched_accuracy = ti.xcom_pull(
                            key='model_accuracy',
                            task_ids=['training_model_A'])
    print(f'choose best model: {fetched_accuracy}')


with DAG('test_dag_xcom',
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False) as dag:

    downloading_data = BashOperator(
        task_id='downloading_data',
        bash_command='sleep 3'
    )
    # One training task per model: training_model_A, training_model_B, training_model_C
    training_model_task = [
        PythonOperator(
            task_id=f'training_model_{task}',
            python_callable=training_model
        ) for task in ['A', 'B', 'C']]

    choose_model = PythonOperator(
        task_id='choose_model',
        python_callable=choose_best_model
    )
    downloading_data >> training_model_task >> choose_model
```

This is just an example of a use case. Eventually, you will employ it for actual data retrieval. As you can observe, in the functions you create, `ti.xcom_push` is employed to make a value available to later tasks, while `ti.xcom_pull` is employed to retrieve it. As shown in the following graph, the input is passed to each of the models, and their results are passed to a model chooser.

<p align="center">
    <img src="images/AirFlow_Xcom.png" width="500"/>
</p>

Once you run it, you will see these Xcoms in the UI:

<p align="center">
    <img src="images/AirFlow_Xcom2.png" width="500"/>
</p>

Finally, in the next window, you should see the results of the Xcoms.

<p align="center">
    <img src="images/AirFlow_Xcom3.png" width="500"/>
</p>

Observe the names of the XComs. There is an entry for each of the models that have been run, and there is one that is called `return_value` (in fact, there are many `return_value` entries). These XComs correspond to the BashOperator, which _by default_ pushes the last line of its output, so any task in the DAG can retrieve it.
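
The push and pull mechanics can be imitated with a small in-memory store, which also shows how a chooser task could compare all three models. This is a toy written for this lesson: `ToyXCom` is a made-up class, the accuracy values are invented, and real XComs are stored in the metadata database rather than in a dictionary.

```python
class ToyXCom:
    """In-memory stand-in for the XCom store; not Airflow's implementation."""

    def __init__(self):
        self._store = {}

    def push(self, task_id, key, value):
        # Values are identified by the task that pushed them and by a key.
        self._store[(task_id, key)] = value

    def pull(self, task_ids, key):
        # One task id returns a single value; a list returns a list of values.
        if isinstance(task_ids, str):
            return self._store.get((task_ids, key))
        return [self._store.get((task_id, key)) for task_id in task_ids]


xcom = ToyXCom()
# Invented accuracies, one per training task
accuracies = {"training_model_A": 0.71, "training_model_B": 0.88, "training_model_C": 0.64}
for task_id, accuracy in accuracies.items():
    xcom.push(task_id, "model_accuracy", accuracy)

pulled = xcom.pull(list(accuracies), "model_accuracy")
# Pair each accuracy with its task id and keep the task with the highest accuracy
best_task = max(zip(pulled, accuracies))[1]
print(pulled, best_task)
```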

# Summary

- When dealing with tasks working in parallel or sequence, you will need to establish a workflow.
- Airflow allows you to define your tasks in a way that is easy to understand and maintain.
- Airflow orchestrates these tasks using DAGs.
- You can define your DAGs in a Python script. Each task is defined by an Operator, which can be a PythonOperator, BashOperator, etc.
- The Airflow UI allows you to configure how these tasks will run. 
- The UI can also show the progress of the tasks.
