Doc: Add FAQ to speed up parsing with tons of dag files (#17519)
This feature was added in apache/airflow#16075. This PR adds it to docs to avoid situations like apache/airflow#17437

closes apache/airflow#17437

GitOrigin-RevId: 7dfc52068c75b01a309bf07be3696ad1f7f9b9e2
kaxil authored and Cloud Composer Team committed Jan 27, 2023
1 parent c8d973e commit dc4a986
Showing 1 changed file with 48 additions and 2 deletions.
50 changes: 48 additions & 2 deletions docs/apache-airflow/faq.rst
@@ -20,8 +20,8 @@
FAQ
========

-Scheduling
-^^^^^^^^^^
+Scheduling / DAG file parsing
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Why isn't my task getting scheduled?
------------------------------------
@@ -119,6 +119,52 @@ How do I trigger tasks based on another task's failure?

You can achieve this with :ref:`concepts:trigger-rules`.

When there are a lot (>1000) of DAG files, how can parsing of new files be sped up?
-------------------------------------------------------------------------------------

(only valid for Airflow >= 2.1.1)

Change :ref:`config:scheduler__file_parsing_sort_mode` to ``modified_time``, and raise
:ref:`config:scheduler__min_file_process_interval` to ``600`` (10 minutes), ``6000`` (100 minutes),
or a higher value.
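
A minimal sketch of the corresponding ``airflow.cfg`` entries (the standard
``AIRFLOW__SCHEDULER__*`` environment variables work as well):

.. code-block:: ini

    [scheduler]
    file_parsing_sort_mode = modified_time
    min_file_process_interval = 600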

The DAG parser will skip the ``min_file_process_interval`` check for a file that was recently modified.
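
Conceptually, ``modified_time`` sort mode means recently changed files are parsed first.
A minimal sketch of that ordering (an illustration only, not Airflow's actual parser code):

.. code-block:: python

    import os


    def order_for_parsing(paths):
        """Return DAG file paths sorted by modification time, newest first,
        mirroring ``file_parsing_sort_mode = modified_time``."""
        return sorted(paths, key=os.path.getmtime, reverse=True)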

This might not help when the DAG is imported or created from a separate file. For example,
suppose ``dag_file.py`` imports ``dag_loader.py``, which holds the actual DAG-creation logic, as shown below.
In that case, if ``dag_loader.py`` is updated but ``dag_file.py`` is not, the change won't be picked up
until ``min_file_process_interval`` has elapsed, because the DAG parser only checks the modification time
of ``dag_file.py``.

.. code-block:: python
    :caption: dag_file.py
    :name: dag_file.py

    from dag_loader import create_dag

    # ``dag_id``, ``schedule``, ``dag_number`` and ``default_args`` are
    # illustrative parameters defined elsewhere in the real file.
    dag = create_dag(dag_id, schedule, dag_number, default_args)
    globals()[dag.dag_id] = dag
.. code-block:: python
    :caption: dag_loader.py
    :name: dag_loader.py

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator


    def create_dag(dag_id, schedule, dag_number, default_args):
        def hello_world_py(*args):
            print("Hello World")
            print("This is DAG: {}".format(str(dag_number)))

        dag = DAG(dag_id, schedule_interval=schedule, default_args=default_args)

        with dag:
            t1 = PythonOperator(task_id="hello_world", python_callable=hello_world_py)

        return dag

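
In that setup, one pragmatic workaround (an assumption of this note, not part of the original
doc, with a hypothetical path) is to bump the wrapper file's modification time whenever the
loader changes, so the parser re-reads it promptly:

.. code-block:: shell

    # Hypothetical DAGs folder location; adjust to your deployment.
    touch "$AIRFLOW_HOME/dags/dag_file.py"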
DAG construction
^^^^^^^^^^^^^^^^

