# 4. Airflow

Up to this point, you have learnt a lot of things about data: Extracting the data, Transforming the data, and Loading the data. As mentioned multiple times, this is the "core" of Data Engineering (ETL). However, all these operations work in tandem to create a workflow.

A workflow is a series of steps that are executed in a specific order. For example, if you want to extract data from a source, transform it, and load it into a target, you will need to follow the steps in the following order:

1. __Extract__ data using, for example, the webscraping skills you learnt
2. __Transform__ the data using, for example, the data cleaning skills in pandas
3. __Load__ the data into a target, for example, a database located in your local environment or in a remote environment.

<p align="center">
    <img src="images/WorkFlow1.png" width="500"/>
</p>
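To make the idea of ordered steps concrete, here is a minimal sketch of the same workflow written as three plain Python functions. The records and function bodies are made up for illustration; in a real pipeline, `extract` would scrape a page and `load` would write to a database.

```python
# Toy ETL workflow: each step is a plain function, and the order of the
# calls is the workflow. The records below are made up for illustration.

def extract():
    # Stand-in for web scraping: returns raw, untidy records
    return [" London,12 ", "Paris,15", " Madrid,21"]

def transform(raw_rows):
    # Clean whitespace and convert the temperature to an integer
    cleaned = []
    for row in raw_rows:
        city, temp = row.strip().split(",")
        cleaned.append({"city": city, "temp_c": int(temp)})
    return cleaned

def load(rows, target):
    # Stand-in for a database insert: append to an in-memory list
    target.extend(rows)
    return len(rows)

database = []
n_loaded = load(transform(extract()), database)
print(n_loaded, database[0])
```

Each function only needs the output of the previous one, which is exactly the kind of dependency a workflow manager keeps track of for you.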

Workflows can also be very helpful when you want to develop a ML model. For example, you might want to train a model on the data, but don't exactly know which algorithm to use. In this case, you can follow the steps in the following order:

1. __Extract__ data using, for example, the webscraping skills you learnt
2. __Transform__ the data using, for example, the data cleaning skills in pandas
3. __Train__ multiple models using the data, and obtain the accuracy of each model
4. __Choose__ the model with the highest accuracy

<p align="center">
    <img src="images/WorkFlow2.png" width="600"/>
</p>

# Airflow as a workflow manager

Apache Airflow is a task orchestration tool that allows you to define a series of tasks that are executed in a specific order. Tasks can be run in a distributed manner using Airflow's scheduler.

Remember DAGs? In this case we are going to use them to define a workflow. Each node in the DAG corresponds to a task, and they will be connected to one another.

Installing Airflow could be as simple as running `pip install apache-airflow`. However, that might cause dependency errors. To prevent them, run the following commands in your __`terminal`__ (gitbash if you are on Windows). At the time of writing, the version of Airflow is 2.1.2. If you are going to use a different version, change it in the following code:

```
export AIRFLOW_HOME=~/airflow

AIRFLOW_VERSION=2.1.2
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
```

Now the Airflow version and your Python version are stored in two variables that are going to be used in the following command:

```CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"```
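
If you prefer to check the URL from Python before installing anything, the following sketch builds the same string. It assumes the Airflow version used in this lesson (2.1.2) and reads the Python version from the interpreter that runs it.

```python
import sys

# Assumption: same Airflow version as in the shell snippet above
airflow_version = "2.1.2"

# Equivalent of the `python --version | cut ...` pipeline: major.minor only
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"

constraint_url = (
    "https://raw.githubusercontent.com/apache/airflow/"
    f"constraints-{airflow_version}/constraints-{python_version}.txt"
)
print(constraint_url)
```

If the printed URL does not open in your browser, there is probably no constraints file for that combination of Airflow and Python versions.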


Now, you can install the corresponding version of Airflow with pip, using the constraints file from the Airflow GitHub repo to pin compatible dependencies:

```pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"```


Now, you can start using Airflow. It is a good idea to initialize your database now. This database stores Airflow's metadata, such as the state of your DAGs and their runs:

`airflow db init`

Then create an admin user with your details:

```
airflow users create \
    --username <your_username> \
    --firstname <your_firstname> \
    --lastname <your_lastname> \
    --role Admin \
    --email <your_email>
```

Let's check everything went right. In the terminal, run:

`airflow webserver --port 8080`

This will start a new server at your localhost at port 8080. It is important to notice that, even if you start the server, your scheduled DAGs won't be monitored. To do so, we need to kick off the scheduler, so open a new terminal and run:

`airflow scheduler`

If you receive a warning message, don't worry: it won't affect how Airflow runs for now. We are now ready to start using the UI.

Now, if you go to your browser and visit: `localhost:8080`, you should be able to see something like this:

![](images/Airflow.png)

The image above depicts the Airflow UI. Here, you can see the DAGs that have been created, and so far, you will only see some examples and tutorials created by the Airflow team. Let's explore it a little bit.

# Airflow UI

Inside the UI, you can explore the metadata of the DAGs, such as the name (or ID), the owner, the status of previous runs of the whole DAG or of specific tasks inside the DAG, its frequency (in the Schedule column), and when it was last run.

You can see more details by clicking on the DAG. Let's observe the `example_bash_operator` DAG:

<p align="center">
    <img src="images/AirFlow2.png" width="500"/>
</p>

In the DAG you can see its structure, the average time it took to run each task, the Gantt chart of the DAG to check if there are overlapping tasks, the details of the DAG, and the code that generated the DAG. We haven't run this DAG yet, so there is no info about previous runs. We can, however, take a look at the code. Before that, let's observe the `Graph View` tab, which contains the same information as the `Tree View` tab, but rearranged:

<p align="center">
    <img src="images/AirFlow4.png" width="500"/>
</p>



Observe that we have several nodes, each one representing a task. You can also observe their dependencies. For example, `run_after_loop` won't start until all the `runme_x` tasks have finished.

Let's run a single task to see how it works.

1. In the Airflow UI enable the `example_bash_operator` DAG. 
2. Click the DAG to see its status. You should see two runs. This is because (as we will see later) the start date of these examples is set to two days ago. The schedule runs the DAG once every day, so two runs make sense!
3. Inside those runs, tasks can have different statuses. In this case, we can see 'success' and 'skipped'. Don't worry, those tasks are meant to be skipped.
<p align="center">
    <img src="images/AirFlow4.png" width="500"/>
</p>

4. Let's see its flow by triggering an event. First click Auto-refresh to see updates in real time, and then, click the Play button:

<p align="center">
    <img src="images/AirFlow_clip.gif" width="500"/>
</p>

Pretty cool, isn't it? We can also see the duration of each task, and when each run took place, but I will let you explore that in the UI. For now, let's take a look at the code. If you click on the `Code` tab, you will see this:


```python
"""Example DAG demonstrating the usage of the BashOperator."""

from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago

args = {
    'owner': 'airflow',
}

with DAG(
    dag_id='example_bash_operator',
    default_args=args,
    schedule_interval='0 0 * * *',
    start_date=days_ago(2),
    dagrun_timeout=timedelta(minutes=60),
    tags=['example', 'example2'],
    params={"example_key": "example_value"},
) as dag:

    run_this_last = DummyOperator(
        task_id='run_this_last',
    )

    # [START howto_operator_bash]
    run_this = BashOperator(
        task_id='run_after_loop',
        bash_command='echo 1',
    )
    # [END howto_operator_bash]

    run_this >> run_this_last

    for i in range(3):
        task = BashOperator(
            task_id='runme_' + str(i),
            bash_command='echo "{{ task_instance_key_str }}" && sleep 1',
        )
        task >> run_this

    # [START howto_operator_bash_template]
    also_run_this = BashOperator(
        task_id='also_run_this',
        bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
    )
    # [END howto_operator_bash_template]
    also_run_this >> run_this_last

# [START howto_operator_bash_skip]
this_will_skip = BashOperator(
    task_id='this_will_skip',
    bash_command='echo "hello world"; exit 99;',
    dag=dag,
)
# [END howto_operator_bash_skip]
this_will_skip >> run_this_last

if __name__ == "__main__":
    dag.cli()
```

If a task prints something to the console, we can check it in the Log tab of that task. For example, look at the `also_run_this` task: it is a BashOperator that prints `run_id={{ run_id }} | dag_run={{ dag_run }}`, with the two templates in double curly braces replaced by actual values at run time. Go to the `Graph View` tab and click on the `also_run_this` task, and in the next window click `Log`. Observe the output:

<p align="center">
    <img src="images/AirFlow_log.png" width="500"/>
</p>

This was a simple walkthrough to show the Airflow UI. You saw that:

- The workflow is represented by a DAG
- Each node in the DAG corresponds to a task
- Each DAG has a schedule that sets the frequency of runs
- Tasks can be triggered by previous tasks
- Each task corresponds to an operator object
- We saw BashOperator, which executes a bash command
- We saw DummyOperator, which according to the documentation _'Operator that does literally nothing. It can be used to group tasks in a DAG.'_
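
To see why a daily schedule with a start date of two days ago produced two runs, here is a small sketch that only uses the standard library. It assumes, as is the case in Airflow 2.1, that a run is created once its schedule interval has fully elapsed. The value of `now` is a made-up timestamp so the result is reproducible, and `completed_intervals` is a hypothetical helper, not an Airflow function.

```python
from datetime import datetime, timedelta

def completed_intervals(start_date, now, interval):
    """Return the start of every schedule interval that has fully elapsed."""
    runs = []
    current = start_date
    while current + interval <= now:
        runs.append(current)
        current += interval
    return runs

# Hypothetical "now", fixed so the result is reproducible
now = datetime(2021, 8, 10, 9, 30)
# Rough equivalent of days_ago(2): midnight, two days before "now"
start_date = datetime(now.year, now.month, now.day) - timedelta(days=2)

runs = completed_intervals(start_date, now, timedelta(days=1))
print(runs)
```

The interval that started today has not finished yet, so it does not produce a run until tomorrow at midnight.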

We will see how to create more operators, for example, a PythonOperator, in the next section. First, let's get some practice defining a DAG with the operators we have seen so far.

# Creating Your First DAG

First of all, make sure you followed all the steps so far. If that's the case, you should have a folder in your home directory named `airflow`. _Check it by running the following cell. If no error is thrown, you are good to go._

In [None]:
from os.path import expanduser
import os

home = expanduser("~")
airflow_dir = os.path.join(home, 'airflow')
assert os.path.isdir(airflow_dir)

Inside that directory, you have to add a new folder named `dags`. Airflow scans that directory for the DAGs you define in Python files. The example DAGs you have been using live inside the installed `airflow` package, but any new DAG you create should be placed in `~/airflow/dags/`. _You can change the path where Airflow looks for DAGs in the `airflow.cfg` file._

In [1]:
from airflow.models import DAG

In [2]:
import os
from os.path import expanduser
from pathlib import Path

home = expanduser("~")
airflow_dir = os.path.join(home, 'airflow')
# Create ~/airflow/dags if it does not exist yet
Path(f"{airflow_dir}/dags").mkdir(parents=True, exist_ok=True)

The Python files you create have to be stored in that folder. The file should contain the DAG with the desired arguments. The arguments can be passed in the context manager and in a dictionary.

In the context manager, simply define the tasks; don't implement any logical flow. As seen above, tasks are defined by operators.

In [None]:
from airflow.models import DAG
from datetime import datetime
from datetime import timedelta
from airflow.operators.bash import BashOperator  # Airflow 2 import path

In [None]:

default_args = {
    'owner': 'Ivan',
    'depends_on_past': False,
    'email': ['ivan@theaicore.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'start_date': datetime(2020, 1, 1),
    'retry_delay': timedelta(minutes=5),
    'end_date': datetime(2022, 1, 1),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'wait_for_downstream': False,
    # 'dag': dag,
    # 'trigger_rule': 'all_success'
}


In [None]:
with DAG(dag_id='test_dag',
         default_args=default_args,
         schedule_interval='*/1 * * * *',
         catchup=False,
         tags=['test']
         ) as dag:
    # Define the tasks. Here we are going to define only one bash operator
    test_task = BashOperator(
        task_id='write_date_file',
        bash_command='cd ~/Desktop && date >> ai_core.txt',
        dag=dag)
    

This example can be found in the `examples` folder, under the name `dag_test.py`. Copy the example to your `dags` folder in your airflow directory.

Once the file is in the `dags` folder, you can activate the DAG with the following command (if you haven't started the scheduler yet, run `airflow scheduler -D`):

`airflow dags unpause test_dag`

New DAGs should appear in the UI once the scheduler has parsed the `dags` folder, which can take a few minutes. If they do not show up, you can try running the following command:

`airflow db init`

You can then manage them in the UI.

## Task Dependencies

You just created a task, but in a workflow you will probably need more than one, so let's add a few more. If you only define the tasks, they will be executed one after another in no specific order (you can change how they are executed by changing the executor in the `airflow.cfg` file). However, you can specify how they are ordered by setting the dependencies between them.

Setting dependencies is quite simple. You can define the tasks and then 'connect' them using the bit-shift operator. For example, if you want to run the task `task_1` after the task `task_0`, you can do it like this:

`task_0 >> task_1` or `task_0.set_downstream(task_1)` or `task_1 << task_0` or `task_1.set_upstream(task_0)`.

You can see that there are many ways to set the dependencies, so just pick the one that works for you.

Let's say that you want to run both `task_1` and `task_2` after `task_0` has finished. You can do it like this:

`task_0 >> [task_1, task_2]`

Another thing you may want to do is run `task_2` only when both `task_0` and `task_1` have finished. You can do it like this:
```
task_0 >> task_2
task_1 >> task_2
```

Finally, you can also set sequential dependencies between tasks. For example, if you want to run `task_1` after `task_0` has finished, and `task_2` after `task_1` has finished, you can do it like this:

`task_0 >> task_1 >> task_2`
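
If you are curious about how `>>` can connect tasks at all, the sketch below shows the general Python mechanism: operator overloading with `__rshift__` and `__lshift__`. `ToyTask` is a made-up class for illustration and is not how Airflow is implemented internally; it only reproduces the behaviour described above.

```python
class ToyTask:
    """Toy stand-in for an Airflow operator; NOT Airflow code."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []  # task_ids that must wait for this task

    def set_downstream(self, other):
        # Accept a single task or a list of tasks
        others = other if isinstance(other, list) else [other]
        for task in others:
            self.downstream.append(task.task_id)

    def __rshift__(self, other):
        # task_a >> task_b  means  task_a.set_downstream(task_b)
        self.set_downstream(other)
        return other  # returning `other` is what makes chaining work

    def __lshift__(self, other):
        # task_b << task_a  means  task_a.set_downstream(task_b)
        others = other if isinstance(other, list) else [other]
        for task in others:
            task.set_downstream(self)
        return other


task_0, task_1, task_2 = ToyTask("task_0"), ToyTask("task_1"), ToyTask("task_2")
task_0 >> task_1 >> task_2
print(task_0.downstream, task_1.downstream)
```

Because `task_0 >> task_1` returns `task_1`, the second `>>` in the chain is applied to `task_1`, which is why the one-line form is equivalent to writing the two dependencies separately.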

The example below shows a DAG with four tasks:

1. date_task: A BashOperator that appends the current date into a file
2. add_task: A BashOperator that stages the file created by date_task
3. commit_task: A BashOperator that commits the file staged by add_task
4. push_task: A BashOperator that pushes the committed file to a remote repository

In [None]:
from airflow.models import DAG
from datetime import datetime
from datetime import timedelta
from airflow.operators.bash import BashOperator  # Airflow 2 import path

default_args = {
    'owner': 'Ivan',
    'depends_on_past': False,
    'email': ['ivan@theaicore.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'start_date': datetime(2020, 1, 1), # If you set a datetime previous to the current date, it will try to backfill
    'retry_delay': timedelta(minutes=5),
    'end_date': datetime(2022, 1, 1),
}
with DAG(dag_id='test_dag_dependencies',
         default_args=default_args,
         schedule_interval='*/1 * * * *',
         catchup=False,
         tags=['test']
         ) as dag:
    # Define the tasks. Here we are going to define four bash operators
    date_task = BashOperator(
        task_id='write_date',
        bash_command='cd ~/Desktop/Weather_Airflow && date >> date.txt',
        dag=dag)
    add_task = BashOperator(
        task_id='add_files',
        bash_command='cd ~/Desktop/Weather_Airflow && git add .',
        dag=dag)
    commit_task = BashOperator(
        task_id='commit_files',
        bash_command='cd ~/Desktop/Weather_Airflow && git commit -m "Update date"',
        dag=dag)
    push_task = BashOperator(
        task_id='push_files',
        bash_command='cd ~/Desktop/Weather_Airflow && git push',
        dag=dag)
    
    date_task >> add_task >> commit_task
    add_task >> push_task
    commit_task >> push_task

Observe the last part of the DAG, where you can see the dependencies between the tasks. Of course, you could simply chain them all in sequence, but this example shows how to set the dependencies in different ways.

<p align="center">
    <img src="images/AirFlow_Dependencies.png" width="500"/>
</p>
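
You can double-check the order implied by those three dependency lines without running Airflow. The sketch below uses `graphlib` from the standard library (Python 3.9 or later) and the task ids from the DAG above; the `waits_for` dictionary is just a hand-written copy of the dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each task to the set of tasks it waits for, mirroring:
#   date_task >> add_task >> commit_task
#   add_task >> push_task
#   commit_task >> push_task
waits_for = {
    "write_date": set(),
    "add_files": {"write_date"},
    "commit_files": {"add_files"},
    "push_files": {"add_files", "commit_files"},
}

order = list(TopologicalSorter(waits_for).static_order())
print(order)
```

Since `push_files` already waits for `commit_files`, which in turn waits for `add_files`, the extra `add_task >> push_task` line does not change the order. It is only there to show another way of writing dependencies.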

After running it, you will see that your repo is being updated every minute (which might be confusing, but this is a demo, so we don't care)

<p align="center">
    <img src="images/AirFlow_GitHub.png" width="500"/>
</p>

## Try it out

1. Create a new remote repository in your GitHub account. 
2. You will eventually use it for storing weather data, so name your repository accordingly.
3. Clone the repository to your local machine.
4. Copy the DAG file `dag_test_dependencies.py` to the folder `dags` in your local machine.
5. Change the file according to the name of your repository and the directory you cloned it to.
6. Unpause the DAG by running `airflow dags unpause test_dag_dependencies` (the `dag_id` defined in the file) or by switching on the toggle next to the `test_dag_dependencies` DAG in the `DAGs` view of the UI.

As you start creating DAGs, you might forget which ones are active. The good thing is that Airflow has many commands to check your work from the command line. If you type `airflow -h` you can see all commands.

In [1]:
%%bash
airflow -h

usage: airflow [-h] GROUP_OR_COMMAND ...

positional arguments:
  GROUP_OR_COMMAND

    Groups:
      celery         Celery components
      config         View configuration
      connections    Manage connections
      dags           Manage DAGs
      db             Database operations
      jobs           Manage jobs
      kubernetes     Tools to help run the KubernetesExecutor
      pools          Manage pools
      providers      Display providers
      roles          Manage roles
      tasks          Manage tasks
      users          Manage users
      variables      Manage variables

    Commands:
      cheat-sheet    Display cheat sheet
      info           Show information about current Airflow and environment
      kerberos       Start a kerberos ticket renewer
      plugins        Dump information about loaded plugins
      rotate-fernet-key
                     Rotate encrypted connection credentials and variables
      scheduler      Start a scheduler instance
      sync-p

Thus, you can look at the dags by running

In [1]:
%%bash
airflow dags list

dag_id                                  | filepath                                                                                                    | owner   | paused
aicore_test                             | ai_test.py                                                                                                  | airflow | True  
aicore_test2                            | ai_test2.py                                                                                                 | airflow | True  
example_bash_operator                   | /opt/miniconda3/lib/python3.9/site-packages/airflow/example_dags/example_bash_operator.py                   | airflow | False 
example_branch_datetime_operator_2      | /opt/miniconda3/lib/python3.9/site-packages/airflow/example_dags/example_branch_datetime_operator.py        | airflow | True  
example_branch_dop_operator_v3          | /opt/miniconda3/lib/python3.9/site-packages/airflow/example_dags/example_branch_python_dop_operator_3.py    | air

# Airflow Variables

One thing you might notice in the dependencies is that we had to constantly pass the file path to the BashOperator. This is not very efficient, so let's change that. One way to do it is by defining a variable that contains the path to the directory in which the file is stored.

Airflow has a way to define variables from the UI or from the command line. In this case we are going to only use the UI. The variables you include in the UI will be then available in the Python code.

So, open your UI, click on 'Admin' and then on 'Variables'.

<p align="center">
    <img src="images/AirFlow_variables.png" width="500"/>
</p>

In the next window, you can add the variables you will need. You can import a file from your computer, or you can click the `+` sign to add a variable manually.


<p align="center">
    <img src="images/AirFlow_variables2.png" width="500"/>
</p>

In the next window, you can add the name of the variable in the Key and the value in the Value.

<p align="center">
    <img src="images/AirFlow_variables3.png" width="500"/>
</p>

After pressing `Save`, the variable will be stored in the Airflow database, and you will be able to use it in your Python code. To do so, you have to import the class Variable:
```
from airflow.models import Variable

weather_dir = Variable.get("weather_dir")
```

Now, you will be able to use that variable in your Python code. If you look at the script in `dag_test_variables.py`, you will see that we are using the variable `weather_dir` to define the path to the file.
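
As a rough illustration of how the variable removes the repetition, the sketch below builds the four bash commands from the previous DAG using a single path. The hard-coded string is an assumption that stands in for `Variable.get("weather_dir")` so the snippet runs without Airflow, and it is not necessarily how `dag_test_variables.py` is written.

```python
# Assumption: this string stands in for Variable.get("weather_dir"),
# so the sketch runs without an Airflow installation.
weather_dir = "~/Desktop/Weather_Airflow"

# Build every bash command from the single variable
commands = {
    "write_date": f"cd {weather_dir} && date >> date.txt",
    "add_files": f"cd {weather_dir} && git add .",
    "commit_files": f'cd {weather_dir} && git commit -m "Update date"',
    "push_files": f"cd {weather_dir} && git push",
}

for task_id, bash_command in commands.items():
    print(task_id, "->", bash_command)
```

Each string can then be passed as the `bash_command` of a BashOperator, and moving the repository only requires changing the variable in the UI.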

# Python Operators

We have seen that you can run bash commands in each task. Airflow is not limited to these commands: you can also create PythonOperators to do anything you want, as long as it is contained in a Python function. Let's say that you want to create a PythonOperator that will extract information about events that took place 'On this day' in the past.

The first thing you have to do is create a function that uses requests and bs4 to download that information from Wikipedia.

In [6]:
import os  # needed for os.path.join below
from bs4 import BeautifulSoup
from os.path import expanduser
import requests

def get_ul(url: str):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup.find('ul')

def get_today_events(url: str, file_dir: str):
    ul_soup = get_ul(url)
    for li in ul_soup.find_all('li'):
        write_file(li.text, file_dir)

def write_file(li: str, file_dir: str):
    with open(file_dir, 'a') as f:
        f.write(li)
        f.write('\n')

home = expanduser("~")
desktop_dir = os.path.join(home, 'Desktop/test_2.txt')
get_today_events('https://en.wikipedia.org/wiki/Wikipedia:On_this_day/Today', desktop_dir)

These functions can be passed to the PythonOperator as arguments. In this case, we will pass the function `get_today_events` to the PythonOperator. You have to give the PythonOperator two things, plus an optional third:

1. The task id
2. The Python function that will be executed
3. Optional: the arguments of the function (if any)

One thing to note is that, even if functions are usually at the top of your code, in this case, that convention is not applied, and you usually specify the function right before the PythonOperator.
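
The sketch below mimics those three ingredients with a toy class, so you can run it without Airflow. `ToyPythonOperator` and `count_events` are made up for illustration. In Airflow 2 the real PythonOperator takes the same `task_id`, `python_callable` and `op_kwargs` arguments, but its internals are more involved than this.

```python
class ToyPythonOperator:
    """Toy stand-in that mimics the PythonOperator arguments; NOT Airflow code."""

    def __init__(self, task_id, python_callable, op_kwargs=None):
        self.task_id = task_id
        self.python_callable = python_callable
        self.op_kwargs = op_kwargs or {}

    def execute(self):
        # When the task runs, the callable receives the stored keyword arguments
        return self.python_callable(**self.op_kwargs)


def count_events(events, keyword):
    # Hypothetical function: count the events that mention a keyword
    return sum(keyword.lower() in event.lower() for event in events)


events_task = ToyPythonOperator(
    task_id="count_events",
    python_callable=count_events,  # pass the function itself, do not call it
    op_kwargs={
        "events": ["Treaty signed", "Bridge opened", "treaty ratified"],
        "keyword": "treaty",
    },
)
result = events_task.execute()
print(result)
```

The important detail is that the function is passed without parentheses. It is only called later, when the task actually runs.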

## Try it out

For the following example, you are going to create a PythonOperator that will download the events that took place on this day as shown above. That file is going to be uploaded to a remote repository:

1. Create a new remote repository in your GitHub account.
2. Clone the repository to your local machine.
3. Add a variable in the Airflow UI to set the path to the remote repository.
4. Create the DAG, where you will call for the function, then stage the changes, commit them, and push them to the remote repository. The DAG should run daily.
5. Move the DAG file to the folder `dags` in your local machine.
6. Test the DAG by running `airflow dags test <dag_id> <execution_date>` (in Airflow 2.1 this command also expects an execution date, for example `2021-01-01`).

You have a small template in the examples folder

# Xcom: Connecting Tasks

When you are working in a Python script, you might want to pass information from one function to another. When you work with tasks in Airflow, you can do the same by using the Xcom feature.

XComs are stored in Airflow's metadata database. You can read more about XCom in the [official documentation](https://airflow.apache.org/concepts.html#xcom). You can store those values while the tasks are running, and when they are finished, you can retrieve them and pass them to the next tasks. Take a look at the next code (contained in `dag_test_xcom.py`), which passes information between PythonOperators.


In [None]:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from random import uniform
from datetime import datetime

default_args = {
    'start_date': datetime(2020, 1, 1)
}


def training_model(ti):
    accuracy = uniform(0.1, 10.0) # In this case, the accuracy is a random number
                                  # but when you use it with your ML models, you
                                  # can call your real models inside
    print(f'model\'s accuracy: {accuracy}')
    ti.xcom_push(key='model_accuracy', value=accuracy)


def choose_best_model(ti):
    fetched_accuracy = ti.xcom_pull(
                            key='model_accuracy',
                            task_ids=['training_model_A'])
    print(f'choose best model: {fetched_accuracy}')


with DAG('test_dag_xcom',
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False) as dag:

    downloading_data = BashOperator(
        task_id='downloading_data',
        bash_command='sleep 3'
    )
    training_model_task = [
        PythonOperator(
            task_id=f'training_model_{task}',
            python_callable=training_model
        ) for task in ['A', 'B', 'C']]

    choose_model = PythonOperator(
        task_id='choose_model',
        python_callable=choose_best_model
    )
    downloading_data >> training_model_task >> choose_model

This is just an example of how to use it. You will eventually use it for something that will actually retrieve data. Observe that, in the functions you create, you are using ti.xcom_push to pass information to the next task, and ti.xcom_pull to retrieve it. As you can see in the following graph, the input is passed to each of the models, and their results are passed to a model chooser.

<p align="center">
    <img src="images/AirFlow_Xcom.png" width="500"/>
</p>
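
Note that `choose_best_model` in the DAG above only pulls the accuracy of `training_model_A`. The following sketch shows one way the chooser could compare all three models. It uses a toy class with a dictionary in place of Airflow's task instance and XCom storage, and fixed, made-up accuracies, so it runs without Airflow. The ordering of the pulled values is a property of this toy class, so check the Airflow documentation before relying on it in a real DAG.

```python
class ToyTaskInstance:
    """Toy stand-in for Airflow's task instance (`ti`); NOT Airflow code."""

    store = {}  # shared by all instances, playing the role of the XCom table

    def __init__(self, task_id):
        self.task_id = task_id

    def xcom_push(self, key, value):
        ToyTaskInstance.store[(self.task_id, key)] = value

    def xcom_pull(self, key, task_ids):
        # With a list of task ids, return one value per task, in order
        return [ToyTaskInstance.store[(task_id, key)] for task_id in task_ids]


# Made-up accuracies, fixed so the outcome is reproducible
accuracies = {"A": 0.71, "B": 0.88, "C": 0.65}
for name, accuracy in accuracies.items():
    ToyTaskInstance(f"training_model_{name}").xcom_push("model_accuracy", accuracy)

task_ids = [f"training_model_{name}" for name in ["A", "B", "C"]]
fetched = ToyTaskInstance("choose_model").xcom_pull("model_accuracy", task_ids)
best_accuracy = max(fetched)
best_task = task_ids[fetched.index(best_accuracy)]
print(best_task, best_accuracy)
```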

Once you run it, you will see these Xcoms in the UI:

<p align="center">
    <img src="images/AirFlow_Xcom2.png" width="500"/>
</p>

Finally, in the next window, you can see the results of the Xcoms.

<p align="center">
    <img src="images/AirFlow_Xcom3.png" width="500"/>
</p>

Observe the names of the Xcoms. There is a name for each of the models that have been run, and there is one that is called `return_value` (In fact, there are many `return_value`s). These Xcoms correspond to the BashOperators that have been created, which _by default_ will push their output, so any task in the DAG can retrieve it.

# Airflow and PostgreSQL

Airflow allows you to run SQL queries in your tasks. You can connect to virtually any database, but in this example we will connect to a PostgreSQL database. Connecting to a database is very easy in Airflow, especially because the UI has a nice way to do it.

Before you start, you have to install the PostgreSQL provider package for Airflow. To do so, run the following command:

`pip install apache-airflow-providers-postgres`

From now on, Airflow will know how to connect to a postgres database. _If you want to connect to a different database, you can do it by installing a different provider. Check out [this webpage](https://www.astronomer.io/guides/connections) to get more info_

As mentioned, the Airflow UI has a nice way to connect to a database. Open your UI, click on 'Admin' and then on 'Connections'.

<p align="center">
    <img src="images/AirFlow_SQL.png" width="500"/>
</p>

In the next window, you can select the connection you want to configure. Take a look at all possibilities Airflow offers! Let's click on the `PostgreSQL` connection. You will see the following:

<p align="center">
    <img src="images/AirFlow_SQL2.png" width="500"/>
</p>

Fill the fields with your data, but take into account that `schema` here is the name of your database and `login` is your username. Just for demonstration purposes, we will use the `Pagila` database.

You are ready to use PostgreSQL from Airflow. In the next example, we are creating a new table with animal names. You will find the same code in `dag_test_sql.py`.

In [None]:
import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="test_dag_postgre",
    start_date=datetime.datetime(2020, 2, 2),
    schedule_interval="@once",
    catchup=False,
) as dag:

    create_pet_table = PostgresOperator(
        task_id="create_pet_table",
        postgres_conn_id="postgres_default",
        sql="""
            CREATE TABLE IF NOT EXISTS pet (
            pet_id SERIAL PRIMARY KEY,
            name VARCHAR NOT NULL,
            pet_type VARCHAR NOT NULL,
            birth_date DATE NOT NULL,
            OWNER VARCHAR NOT NULL);
          """,
    )

    populate_pet_table = PostgresOperator(
        task_id="populate_pet_table",
        postgres_conn_id="postgres_default",
        sql="""
            INSERT INTO pet VALUES ( 1, 'Max', 'Dog', '2018-07-05', 'Jane');
            INSERT INTO pet VALUES ( 2, 'Susie', 'Cat', '2019-05-01', 'Phil');
            INSERT INTO pet VALUES ( 3, 'Lester', 'Hamster', '2020-06-23', 'Lily');
            INSERT INTO pet VALUES ( 4, 'Quincy', 'Parrot', '2013-08-11', 'Anne');
            """,
    )

    get_all_pets = PostgresOperator(
        task_id="get_all_pets", postgres_conn_id="postgres_default", sql="SELECT * FROM pet;"
    )

    get_birth_date = PostgresOperator(
        task_id="get_birth_date",
        postgres_conn_id="postgres_default",
        sql="""
            SELECT * FROM pet
            WHERE birth_date
            BETWEEN '{{ params.begin_date }}' AND '{{ params.end_date }}';
            """,
        params={'begin_date': '2020-01-01', 'end_date': '2020-12-31'},
    )


    create_pet_table >> populate_pet_table >> get_all_pets >> get_birth_date

Defining the tasks is very similar to defining PythonOperators. The only difference is that you have to include the `postgres_conn_id` in the task definition. Then, the `sql` argument will contain the query, and the params argument will contain the variables you might want to add to the query.
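
If you want to try the queries before wiring up a PostgreSQL connection, the sketch below runs the same logic against an in-memory SQLite database from the standard library. SQLite is only a stand-in here, so the column types differ slightly from the PostgreSQL version. The named placeholders are bound by the database driver, whereas Airflow's `params` are rendered into the SQL text before it is sent.

```python
import sqlite3

# Assumption: SQLite (in memory) is used only as a quick stand-in for PostgreSQL
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS pet (
        pet_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        pet_type TEXT NOT NULL,
        birth_date TEXT NOT NULL,
        owner TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO pet VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Max", "Dog", "2018-07-05", "Jane"),
        (2, "Susie", "Cat", "2019-05-01", "Phil"),
        (3, "Lester", "Hamster", "2020-06-23", "Lily"),
        (4, "Quincy", "Parrot", "2013-08-11", "Anne"),
    ],
)

# Same date range as the `params` dictionary in the DAG
params = {"begin_date": "2020-01-01", "end_date": "2020-12-31"}
rows = conn.execute(
    "SELECT name, birth_date FROM pet "
    "WHERE birth_date BETWEEN :begin_date AND :end_date",
    params,
).fetchall()
print(rows)
conn.close()
```

Only Lester was born in 2020, so that is the single row you should also expect from the `get_birth_date` task.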

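If you copy that code to the `dags` folder, you can test it by running `airflow dags test test_dag_postgre <execution_date>`, where `test_dag_postgre` is the `dag_id` defined above and `<execution_date>` is a date such as `2021-01-01`. Then, in pgAdmin, you can see the table `pet`: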

<p align="center">
    <img src="images/AirFlow_SQL3.png" width="500"/>
</p>

In the same way you connected to your localhost, you can also connect to your AWS RDS. We are slowly integrating everything from Python!

# Final Note

Everything we have done so far in the UI can also be done from the command line. For example, to add a connection you can run the following command:

`airflow connections add <connection_name> --conn-uri "<conn-type>://<user>:<password>@<host>:<port>/<database>"`
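
Passwords often contain characters such as `@` or `/` that would break the URI, so it helps to percent-encode them first. The sketch below shows one way to build the string in Python. `build_conn_uri` is a hypothetical helper and all the values are placeholders.

```python
from urllib.parse import quote

def build_conn_uri(conn_type, user, password, host, port, database):
    """Build a connection URI, percent-encoding the user and password."""
    return (
        f"{conn_type}://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )

# All values below are hypothetical placeholders
uri = build_conn_uri("postgres", "airflow_user", "p@ss/word", "localhost", 5432, "pagila")
print(uri)
```

The printed string can then be pasted between the quotes after `--conn-uri`.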

Another thing you can do from the command line is create a new variable:

`airflow variables set <var_name> <var_value>`

Check out the official documentation for more information. 

With this in mind, you can now start creating your own Airflow DAGs in your EC2 instances without having to worry about the Airflow UI. Just leave it running as shown in `Module 4. Cloud Basics` and you will have a schedule running nonstop!

# Summary

- When dealing with tasks working in parallel or sequence, you will need to establish a workflow.
- Airflow allows you to define your tasks in a way that is easy to understand and maintain.
- Airflow orchestrates these tasks using DAGs.
- You can define your DAGs in a python script. Each task is defined by an Operator, which can be a PythonOperator, BashOperator, etc.
- The Airflow UI allows you to configure how these tasks will run. 
- The UI can also show the progress of the tasks.
- You can set the configuration of Airflow in the UI, so you can connect to different databases, or even AWS RDS.

# Challenge

Using the repo you created in this lesson for storing weather data:
### 1. Create a python function or script that will connect to the BBC weather webpage, and retrieve data about the temperature and humidity every hour in London. If you create the python function you will need to use the PythonOperator, but if you create the script you will need to use the BashOperator.
### 2. Clean the data, so both the temperature and the humidity are stored as integers.
### 3. The task created in the previous step will be connected to two different tasks:
### $\qquad$ 3.1 The first task is __appending__ the data to a csv file.
### $\qquad$ 3.1.1 Then, you will stage and commit the data, and finally push it to your remote repository.
### $\qquad$ 3.2 The second task is creating a table in your database. (Use any database you want)
### $\qquad$ 3.2.1 The table will have three columns: `time`, `temperature`, and `humidity`. Make sure that, if the table exists, it won't throw an error.
### $\qquad$ 3.2.2 Insert the retrieved data into the table.
