Configuration

All configuration can be done by adding a configuration file named client.cfg to your current working directory or /etc/luigi (although this is further configurable). The config file is broken into sections, each controlling a different part of the config. Example /etc/luigi/client.cfg:

[hadoop]
version: cdh4
streaming-jar: /usr/lib/hadoop-xyz/hadoop-streaming-xyz-123.jar

[core]
default-scheduler-host: luigi-host.mycompany.foo
error-email: foo@bar.baz

By default, all parameters will be overridden by matching values in the configuration file. For instance if you have a Task definition:

class DailyReport(luigi.hadoop.JobTask):
    date = luigi.DateParameter(default=datetime.date.today())
    # ...

Then you can override the default value for date by providing it in the configuration:

[DailyReport]
date: 2012-01-01

You can also use config_path as an argument to the Parameter if you want to use a specific section in the config.

Configurable options

Luigi comes with a lot of configurable options. Below, we describe each section and the parameters available within it.

[core]

These parameters control core luigi behavior, such as error e-mails and interactions between the worker and scheduler.

default-scheduler-host

Hostname of the machine running the scheduler. Defaults to localhost.

default-scheduler-port

Port of the remote scheduler api process. Defaults to 8082.

email-prefix

Optional prefix to add to the subject line of all e-mails. For example, setting this to "[LUIGI]" would change the subject line of an e-mail from "Luigi: Framework error" to "[LUIGI] Luigi: Framework error"

email-sender

User name in from field of error e-mails. Default value: luigi-client@<server_name>

email-type

Type of e-mail to send. Valid values are "plain" and "html". When set to html, tracebacks are wrapped in <pre> tags to get fixed-width font. Default value is plain.

error-email

Recipient of all error e-mails. If this is not set, no error e-mails are sent when luigi crashes. If luigi is run from the command line, no e-mails will be sent unless output is redirected to a file.

hdfs-tmp-dir

Base directory in which to store temporary files on hdfs. Defaults to tempfile.gettempdir()

history-filename

If set, specifies a filename for Luigi to write stuff (currently just job id) to in mapreduce job's output directory. Useful in a configuration where no history is stored in the output directory by Hadoop.

logging_conf_file

Location of the logging configuration file.

max-reschedules

The maximum number of times that a job can be automatically rescheduled by a worker before it will stop trying. Workers will reschedule a job if it is found to not be done when attempting to run a dependent job. This defaults to 1.

max-shown-tasks

.. versionadded:: 1.0.20

The maximum number of tasks returned in a task_list api call. This will restrict the number of tasks shown in any section in the visualiser. Small values can alleviate frozen browsers when there are too many done tasks. This defaults to 100000 (one hundred thousand).

no_configure_logging

If true, logging is not configured. Defaults to false.

parallel-scheduling

If true, the scheduler will compute complete functions of tasks in parallel using multiprocessing. This can significantly speed up scheduling, but requires that all tasks can be pickled.

retry-external-tasks

If true, incomplete external tasks (i.e. tasks where the run() method is NotImplemented) will be retested for completion while Luigi is running. This means that if external dependencies are satisfied after a workflow has started, any tasks dependent on that resource will be eligible for running. Note: Every time the task remains incomplete, it will count as FAILED, so normal retry logic applies (see: disable-num-failures and retry-delay). This setting works best with worker-keep-alive: true. If false, external tasks will only be evaluated when Luigi is first invoked. In this case, Luigi will not check whether external dependencies are satisfied while a workflow is in progress, so dependent tasks will remain PENDING until the workflow is reinvoked. Defaults to false for backwards compatibility.

rpc-connect-timeout

Number of seconds to wait before timing out when making an API call. Defaults to 10.0

smtp_host

Hostname for sending mail throug smtp. Defaults to localhost.

smtp_local_hostname

If specified, overrides the FQDN of localhost in the HELO/EHLO command.

smtp_login

Username to log in to your smtp server, if necessary.

smtp_password

Password to log in to your smtp server. Must be specified for smtp_login to have an effect.

smtp_port

Port number for smtp on smtp_host. Defaults to 0.

smtp_ssl

If true, connects to smtp through SSL. Defaults to false.

smtp_timeout

Optionally sets the number of seconds after which smtp attempts should time out.

tmp-dir

DEPRECATED - use hdfs-tmp-dir instead

worker-count-uniques

If true, workers will only count unique pending jobs when deciding whether to stay alive. So if a worker can't get a job to run and other workers are waiting on all of its pending jobs, the worker will die. worker-keep-alive must be true for this to have any effect. Defaults to false.

worker-keep-alive

If true, workers will stay alive when they run out of jobs to run, as long as they have some pending job waiting to be run. Defaults to false.

worker-ping-interval

Number of seconds to wait between pinging scheduler to let it know that the worker is still alive. Defaults to 1.0.

worker-timeout

.. versionadded:: 1.0.20

Number of seconds after which to kill a task which has been running for too long. This provides a default value for all tasks, which can be overridden by setting the worker-timeout property in any task. This only works when using multiple workers, as the timeout is implemented by killing worker subprocesses. Default value is 0, meaning no timeout.

worker-wait-interval

Number of seconds for the worker to wait before asking the scheduler for another job after the scheduler has said that it does not have any available jobs.

[elasticsearch]

These parameters control use of elasticsearch

marker-index: Defaults to "update_log".
marker-doc-type: Defaults to "entry".

[email]

These parameters control sending error e-mails through Amazon SES.

AWS_ACCESS_KEY: Your AWS access key
AWS_SECRET_KEY: Your AWS secret key
region: Your AWS region. Defaults to us-east-1.
type: If set to "ses", error e-mails will be send through Amazon SES. Otherwise, e-mails are sent via smtp.

[hadoop]

Parameters controlling basic hadoop tasks

command: Name of command for running hadoop from the command line. Defaults to "hadoop"
python-executable: Name of command for running python from the command line. Defaults to "python"
scheduler: Type of scheduler to use when scheduling hadoop jobs. Can be "fair" or "capacity". Defaults to "fair".
streaming-jar: Path to your streaming jar. Must be specified to run streaming jobs.
version: Version of hadoop used in your cluster. Can be "cdh3", "chd4", or "apache1". Defaults to "cdh4".

[hdfs]

Parameters controlling the use of snakebite to speed up hdfs queries.

client: Client to use for most hadoop commands. Options are "snakebite", "snakebite_with_hadoopcli_fallback", and "hadoopcli". Snakebite is much faster, so use of it is encouraged. Using snakebite requires it to be installed separately on the machine. Defaults to "hadoopcli".
client_version: Optionally specifies hadoop client version for snakebite.
effective_user: Optionally specifies the effective user for snakebite.
namenode_host: The hostname of the namenode. Needed for snakebite if snakebite_autoconfig is not set.
namenode_port: The port used by snakebite on the namenode. Needed for snakebite if snakebite_autoconfig is not set.
snakebite_autoconfig: If true, attempts to automatically detect the host and port of the namenode for snakebite queries. Defaults to false.
use_snakebite: DEPRECATED - use client instead

[hive]

Parameters controlling hive tasks

command: Name of the command used to run hive on the command line. Defaults to "hive".
hiverc-location: Optional path to hive rc file.
metastore_host: Hostname for metastore.
metastore_port: Port for hive to connect to metastore host.
release: If set to "apache", uses a hive client that better handles apache hive output. All other values use the standard client Defaults to "cdh4".

[mysql]

Parameters controlling use of MySQL targets

marker-table: Table in which to store status of table updates. This table will be created if it doesn't already exist. Defaults to "table_updates".

[postgres]

Parameters controlling the use of Postgres targets

local-tmp-dir: Directory in which to temporarily store data before writing to postgres. Uses system default if not specified.
marker-table: Table in which to store status of table updates. This table will be created if it doesn't already exist. Defaults to "table_updates".

[redshift]

Parameters controlling the use of Redshift targets

marker-table: Table in which to store status of table updates. This table will be created if it doesn't already exist. Defaults to "table_updates".

[resources]

This section can contain arbitrary keys. Each of these specifies the amount of a global resource that the scheduler can allow workers to use. The scheduler will prevent running jobs with resources specified from exceeding the counts in this section. Unspecified resources are assumed to have limit 1. Example resources section for a configuration with 2 hive resources and 1 mysql resource:

[resources]
hive: 2
mysql: 1

Note that it was not necessary to specify the 1 for mysql here, but it is good practice to do so when you have a fixed set of resources.

[scalding]

Parameters controlling running of scalding jobs

scala-home: Home directory for scala on your machine. Defaults to either SCALA_HOME or /usr/share/scala if SCALA_HOME is unset.
scalding-home: Home directory for scalding on your machine. Defaults to either SCALDING_HOME or /usr/share/scalding if SCALDING_HOME is unset.
scalding-provided: Provided directory for scalding on your machine. Defaults to either SCALDING_HOME/provided or /usr/share/scalding/provided
scalding-libjars: Libjars directory for scalding on your machine. Defaults to either SCALDING_HOME/libjars or /usr/share/scalding/libjars

[scheduler]

Parameters controlling scheduler behavior

disable-num-failures

Number of times a task can fail within disable-window-seconds before the scheduler will automatically disable it. If not set, the scheduler will not automatically disable jobs.

disable-persist-seconds

Number of seconds for which an automatic scheduler disable lasts. Defaults to 86400 (1 day).

disable-window-seconds

Number of seconds during which disable-num-failures failures must occur in order for an automatic disable by the scheduler. The scheduler forgets about disables that have occurred longer ago than this amount of time. Defaults to 3600 (1 hour).

record_task_history

If true, stores task history in a database. Defaults to false.

remove-delay

Number of seconds to wait before removing a task that has no stakeholders. Defaults to 600 (10 minutes).

retry-delay

Number of seconds to wait after a task failure to mark it pending again. Defaults to 900 (15 minutes).

state-path

Path in which to store the luigi scheduler's state. When the scheduler is shut down, its state is stored in this path. The scheduler must be shut down cleanly for this to work, usually with a kill command. If the kill command includes the -9 flag, the scheduler will not be able to save its state. When the scheduler is started, it will load the state from this path if it exists. This will restore all scheduled jobs and other state from when the scheduler last shut down.

Sometimes this path must be deleted when restarting the scheduler after upgrading luigi, as old state files can become incompatible with the new scheduler. When this happens, all workers should be restarted after the scheduler both to become compatible with the updated code and to reschedule the jobs that the scheduler has now forgotten about.

This defaults to /var/lib/luigi-server/state.pickle

worker-disconnect-delay

Number of seconds to wait after a worker has stopped pinging the scheduler before removing it and marking all of its running tasks as failed. Defaults to 60.

[spark]

Parameters controlling the running of Spark jobs

spark-jar: Location of the spark jar. Sets SPARK_JAR environment variable when running spark. Example: /usr/share/spark/jars/spark-assembly-0.8.1-incubating-hadoop2.2.0.jar
hadoop-conf-dir: Location of hadoop conf dir. Sets HADOOP_CONF_DIR environment variable when running spark. Example: /etc/hadoop/conf
spark-class: Location of script to invoke. Example: /usr/share/spark/spark-class
spark-submit: Command to run in order to submit spark jobs. Default: spark-submit

[task_history]

Parameters controlling storage of task history in a database

db_connection: Connection string for connecting to the task history db using sqlalchemy.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

configuration.rst

configuration.rst

Configuration

Configurable options

[core]

[elasticsearch]

[email]

[hadoop]

[hdfs]

[hive]

[mysql]

[postgres]

[redshift]

[resources]

[scalding]

[scheduler]

[spark]

[task_history]

Files

configuration.rst

Latest commit

History

configuration.rst

File metadata and controls

Configuration

Configurable options

[core]

[elasticsearch]

[email]

[hadoop]

[hdfs]

[hive]

[mysql]

[postgres]

[redshift]

[resources]

[scalding]

[scheduler]

[spark]

[task_history]