## Cloud Composer

Cloud Composer is a fully managed workflow orchestration service, built on Apache Airflow, that lets you author, schedule, and monitor workflows. The DAGs it runs are stored in the environment's `/dags` folder.

## Apache Airflow

Apache Airflow is an open-source tool for programmatically authoring, scheduling, and monitoring workflows. Its key terms are:
- `DAG` (Directed Acyclic Graph), also called a workflow, is a collection of tasks you schedule and run, organized in a way that reflects their relationships and dependencies.
- `Operator` describes a single task in a workflow.
- `Task` is a parameterised instance of an `Operator`.
- `Task Instance` is a specific run of a `Task`, characterized by the combination of a `DAG`, a `Task`, and a point in time.
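
To make these terms concrete, the following stdlib-only sketch models a workflow as a directed acyclic graph and derives a valid run order. It does not use Airflow and is not how Airflow's scheduler is implemented. The task names mirror the Dataproc workflow in the next section; `run_order` is a hypothetical helper, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks that must finish before it (its upstream tasks).
# Task names are taken from the workflow in the next section.
DEPENDENCIES = {
    "check_file_existence": set(),
    "create_dataproc_cluster": {"check_file_existence"},
    "run_dataproc_hadoop": {"create_dataproc_cluster"},
    "delete_dataproc_cluster": {"run_dataproc_hadoop"},
}


def run_order(dependencies):
    """Return the tasks in an order that respects every dependency.

    Raises graphlib.CycleError if the graph contains a cycle, i.e. is not a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for task in run_order(DEPENDENCIES):
        print(task)
```

The "acyclic" requirement is what makes such an ordering possible: if two tasks depended on each other, no task could be scheduled first.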

## Workflow

In [None]:
%%writefile codelab.py
"""Example Airflow DAG that checks if a local file exists, creates a Cloud Dataproc cluster, runs the Hadoop
wordcount example, and deletes the cluster.
This DAG relies on three Airflow variables
https://airflow.apache.org/concepts.html#variables
* gcp_project - Google Cloud Project to use for the Cloud Dataproc cluster.
* gce_zone - Google Compute Engine zone where Cloud Dataproc cluster should be
  created.
* gcs_bucket - Google Cloud Storage bucket to use for result of Hadoop job.
  See https://cloud.google.com/storage/docs/creating-buckets for creating a
  bucket.
"""
import datetime
import os
from airflow import models
from airflow.contrib.operators import dataproc_operator
from airflow.operators.bash_operator import BashOperator
from airflow.utils import trigger_rule

# Output file for Cloud Dataproc job.
output_file = os.path.join(
    models.Variable.get('gcs_bucket'), 'wordcount',
    datetime.datetime.now().strftime('%Y%m%d-%H%M%S')) + os.sep
# Path to Hadoop wordcount example available on every Dataproc cluster.
WORDCOUNT_JAR = (
    'file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar'
)
# Path to input file for Hadoop job.
input_file = '/home/airflow/gcs/data/rose.txt'
# Arguments to pass to Cloud Dataproc job.
wordcount_args = ['wordcount', input_file, output_file]

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': models.Variable.get('gcp_project')
}

with models.DAG(
    'Composer_sample_quickstart',
    # Continue to run DAG once per day
    schedule_interval=datetime.timedelta(days=1),
    default_args=default_dag_args
) as dag:
    # Check if the input file exists.
    check_file_existence = BashOperator(
        task_id='check_file_existence',
        bash_command='if [ ! -f "{}" ]; then exit 1; fi'.format(input_file)
    )
    # Create a Cloud Dataproc cluster.
    create_dataproc_cluster = dataproc_operator.DataprocClusterCreateOperator(
        task_id='create_dataproc_cluster',
        # Give the cluster a unique name by appending the date scheduled.
        # See https://airflow.apache.org/code.html#default-variables.
        # {{ ds_nodash }} gets replaced with the execution_date of 
        # the DAG in YYYYMMDD format.
        cluster_name='quickstart-cluster-{{ ds_nodash }}',
        num_workers=2,
        image_version='2.0',
        zone=models.Variable.get('gce_zone'),
        region='us-central1',
        master_machine_type='n1-standard-2',
        worker_machine_type='n1-standard-2'
    )
    # Run the Hadoop wordcount example installed on the Cloud Dataproc cluster
    # master node.
    run_dataproc_hadoop = dataproc_operator.DataProcHadoopOperator(
        task_id='run_dataproc_hadoop',
        region='us-central1',
        main_jar=WORDCOUNT_JAR,
        cluster_name='quickstart-cluster-{{ ds_nodash }}',
        arguments=wordcount_args
    )
    # Delete Cloud Dataproc cluster.
    delete_dataproc_cluster = dataproc_operator.DataprocClusterDeleteOperator(
        task_id='delete_dataproc_cluster',
        cluster_name='quickstart-cluster-{{ ds_nodash }}',
        region='us-central1',
        # Setting trigger_rule to ALL_DONE causes the cluster to be deleted
        # even if the Dataproc job fails.
        trigger_rule=trigger_rule.TriggerRule.ALL_DONE
    )
    # Define DAG dependencies.
    check_file_existence >> create_dataproc_cluster >> run_dataproc_hadoop \
        >> delete_dataproc_cluster
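
The `{{ ds_nodash }}` placeholder in `cluster_name` is a Jinja template that Airflow renders at run time. As a rough illustration, the plain-Python sketch below reproduces the `YYYYMMDD` format described in the DAG's comments using `strftime`. `render_cluster_name` is a hypothetical helper, not an Airflow function, and the date is an arbitrary example.

```python
import datetime


def render_cluster_name(execution_date, prefix="quickstart-cluster-"):
    """Mimic what 'quickstart-cluster-{{ ds_nodash }}' renders to.

    ds_nodash is the execution date formatted as YYYYMMDD.
    This helper is illustrative only; Airflow does the rendering itself.
    """
    return prefix + execution_date.strftime("%Y%m%d")


if __name__ == "__main__":
    # Arbitrary example date, chosen only for illustration.
    print(render_cluster_name(datetime.datetime(2019, 3, 7)))
```

Because each daily run has a different execution date, each run gets its own cluster name, and the create, run, and delete tasks can all refer to the same cluster by repeating the same templated string.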

In [None]:
%%bash
# Create Cloud Composer environment
gcloud composer environments create my-composer-environment \
--location us-central1 --zone us-central1-a

# Create Cloud Storage bucket
gsutil mb gs://<project-id>

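# Set Airflow variables, one per command
gcloud composer environments run my-composer-environment \
--location us-central1 variables -- \
--set gcp_project <your-project-id>

gcloud composer environments run my-composer-environment \
--location us-central1 variables -- \
--set gcs_bucket gs://<your-bucket-name>

gcloud composer environments run my-composer-environment \
--location us-central1 variables -- \
--set gce_zone us-central1-a

gcloud composer environments run my-composer-environment \
--location us-central1 variables -- \
--set dags_folder <your-dags-folder>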

# View each variable, one per command
for key in gcp_project gcs_bucket gce_zone dags_folder; do
  gcloud composer environments run my-composer-environment \
  --location us-central1 variables -- \
  --get "$key"
done

# Copy DAG into /dags folder
gsutil cp gs://cloud-training/composer-academy/codelab.py <your-dags-folder>

# Upload the input file to the environment's data folder,
# which Airflow workers see as /home/airflow/gcs/data
gcloud composer environments storage data import \
--environment my-composer-environment \
--location us-central1 \
--source gs://pub/shakespeare/rose.txt
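
The DAG in `codelab.py` checks for its input at `/home/airflow/gcs/data/rose.txt`. Cloud Composer mounts the environment's bucket on Airflow workers under `/home/airflow/gcs`, so that path corresponds to the `data/` folder of the bucket. The sketch below shows the mapping in plain Python; `local_to_gcs_uri` and the bucket name `example-bucket` are hypothetical and used only for illustration.

```python
import posixpath

# Local mount point of the environment bucket on Airflow workers.
GCS_MOUNT = "/home/airflow/gcs"


def local_to_gcs_uri(local_path, bucket):
    """Translate a worker-local path under /home/airflow/gcs to its gs:// URI."""
    relative = posixpath.relpath(local_path, GCS_MOUNT)
    # A relative path that climbs out of the mount point is not in the bucket.
    if relative == ".." or relative.startswith("../"):
        raise ValueError("%s is not under %s" % (local_path, GCS_MOUNT))
    return "gs://%s/%s" % (bucket, relative)


if __name__ == "__main__":
    # 'example-bucket' is a placeholder, not a real environment bucket.
    print(local_to_gcs_uri("/home/airflow/gcs/data/rose.txt", "example-bucket"))
```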

The `codelab.py` workflow verifies that a data file exists, creates a Cloud Dataproc cluster, runs an Apache Hadoop wordcount job on that cluster, and then deletes the cluster. The variant below omits the file check and reads its input directly from Cloud Storage.

In [None]:
%%writefile composer_hadoop_tutorial.py
# [START composer_hadoop_tutorial]
"""
Example Airflow DAG that creates a Cloud Dataproc cluster, runs the Hadoop wordcount 
example, and deletes the cluster. This DAG relies on three Airflow variables
https://airflow.apache.org/concepts.html#variables
* gcp_project - Google Cloud Project to use for the Cloud Dataproc cluster.
* gce_zone - Google Compute Engine zone where Cloud Dataproc cluster should be
  created.
* gcs_bucket - Google Cloud Storage bucket to use as output for the Hadoop jobs from
  Dataproc. See https://cloud.google.com/storage/docs/creating-buckets for creating a
  bucket.
"""
import datetime
import os
from airflow import models
from airflow.contrib.operators import dataproc_operator
from airflow.utils import trigger_rule

# Output file for Cloud Dataproc job.
output_file = os.path.join(
    models.Variable.get('gcs_bucket'), 'wordcount',
    datetime.datetime.now().strftime('%Y%m%d-%H%M%S')) + os.sep
# Path to Hadoop wordcount example available on every Dataproc cluster.
WORDCOUNT_JAR = (
    'file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar'
)
# Arguments to pass to Cloud Dataproc job.
wordcount_args = ['wordcount', 'gs://pub/shakespeare/rose.txt', output_file]

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': models.Variable.get('gcp_project')
}

# [START composer_hadoop_schedule]
with models.DAG(
        'composer_sample_quickstart',
        # Continue to run DAG once per day
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:
    # [END composer_hadoop_schedule]
    
    # Create a Cloud Dataproc cluster.
    create_dataproc_cluster = dataproc_operator.DataprocClusterCreateOperator(
        task_id='create_dataproc_cluster',
        # Give the cluster a unique name by appending the date scheduled.
        # See https://airflow.apache.org/code.html#default-variables
        cluster_name='composer-hadoop-tutorial-cluster-{{ ds_nodash }}',
        num_workers=2,
        region='us-central1',
        zone=models.Variable.get('gce_zone'),
        image_version='2.0',
        master_machine_type='n1-standard-2',
        worker_machine_type='n1-standard-2')
    
    # Run the Hadoop wordcount example installed on the Cloud Dataproc cluster
    # master node.
    run_dataproc_hadoop = dataproc_operator.DataProcHadoopOperator(
        task_id='run_dataproc_hadoop',
        region='us-central1',
        main_jar=WORDCOUNT_JAR,
        cluster_name='composer-hadoop-tutorial-cluster-{{ ds_nodash }}',
        arguments=wordcount_args)
    
    # Delete Cloud Dataproc cluster to avoid incurring ongoing Compute Engine charges.
    delete_dataproc_cluster = dataproc_operator.DataprocClusterDeleteOperator(
        task_id='delete_dataproc_cluster',
        region='us-central1',
        cluster_name='composer-hadoop-tutorial-cluster-{{ ds_nodash }}',
        # Setting trigger_rule to ALL_DONE causes the cluster to be deleted
        # even if the Dataproc job fails.
        trigger_rule=trigger_rule.TriggerRule.ALL_DONE)
    
    # [START composer_hadoop_steps]
    # Define DAG dependencies.
    create_dataproc_cluster >> run_dataproc_hadoop >> delete_dataproc_cluster
    # [END composer_hadoop_steps]
    
# [END composer_hadoop_tutorial]
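
Both DAGs set `trigger_rule` to `ALL_DONE` on the delete task so the cluster is removed even when the Hadoop job fails. The sketch below is a simplified plain-Python model of that decision, covering only two rules and a handful of task states; `should_run` is a hypothetical helper and not part of Airflow.

```python
def should_run(trigger_rule, upstream_states):
    """Decide whether a task may run, given the states of its upstream tasks.

    Simplified model covering two rules only:
    - 'all_success': every upstream task succeeded (Airflow's default).
    - 'all_done': every upstream task finished, whatever the outcome.
    """
    finished = {"success", "failed", "skipped", "upstream_failed"}
    if trigger_rule == "all_success":
        return all(state == "success" for state in upstream_states)
    if trigger_rule == "all_done":
        return all(state in finished for state in upstream_states)
    raise ValueError("unsupported trigger rule: %r" % trigger_rule)


if __name__ == "__main__":
    # The Hadoop job failed: the default rule would skip cleanup, all_done would not.
    print(should_run("all_success", ["failed"]))
    print(should_run("all_done", ["failed"]))
```

With the default rule, a failed wordcount job would leave the cluster running and accruing charges, which is why the delete task overrides it.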