# End-to-End ML Pipelines with Cloud Composer

In this advanced lab you will learn how to create and run an [Apache Airflow](http://airflow.apache.org/) workflow in Cloud Composer. The finished pipeline completes the following tasks:
- Watches for new CSV data uploaded to a [Cloud Storage](https://cloud.google.com/storage/docs/) bucket
- Calls a [Cloud Function](https://cloud.google.com/composer/docs/how-to/using/triggering-with-gcf#getting_the_client_id) that triggers the [Cloud Composer Airflow DAG](https://cloud.google.com/composer/docs/how-to/using/writing-dags) when a new file is detected
- Finds the input file that triggered the workflow and executes a [Cloud Dataflow](https://cloud.google.com/dataflow/) job to transform the data and write it to BigQuery
- Moves the original input file to a different Cloud Storage bucket for storing processed files

## Create Cloud Composer environment
First, create a Cloud Composer environment by doing the following:
1. In the Navigation menu under Big Data, select **Composer**
2. Select **Create**
3. Set the following parameters:
    - Name: composer-workflow
    - Location: us-central1
    - Other values at defaults
4. Select **Create**

The environment creation process is completed when the green checkmark displays to the left of the environment name on the Environments page in the GCP Console.
It can take up to 20 minutes for the environment to complete the setup process. Move on to the next section - Create Cloud Storage buckets and BigQuery dataset.


## Create Cloud Storage buckets
Create two Cloud Storage Multi-Regional buckets in your project. 
- project-id_input
- project-id_output

Run the below to automatically create the buckets:

In [1]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'

INPUT = PROJECT + '_input'
OUTPUT = PROJECT + '_output'

In [3]:
# Export the settings so the %%bash cells below can read them
import os
os.environ.update(BUCKET=BUCKET, PROJECT=PROJECT, REGION=REGION,
                  INPUT=INPUT, OUTPUT=OUTPUT)

In [None]:
%%bash
# Check whether each bucket exists and create it if it does not
if ! gsutil ls | grep -q gs://${INPUT}; then
  gsutil mb -l ${REGION} gs://${INPUT}
fi

if ! gsutil ls | grep -q g.... output

## Create BigQuery Destination Dataset and Table
Next, we'll create a data sink to store the ingested data from GCS<br><br>

### Create a new Dataset
1. In the Navigation menu, select **BigQuery**
2. Then click on your qwiklabs project ID
3. Click **Create Dataset**
4. Name your dataset **ml_pipeline** and leave other values at defaults
5. Click **Create Dataset**


### Create a new empty table
1. Click on the newly created dataset
2. Click **Create Table**
3. For Destination Table name specify **ingest_table**
4. For schema click **Edit as Text** and paste in the below schema

    state:	STRING,<br>
    gender:	STRING,<br>
    year:	STRING,<br>
    name:	STRING,<br>
    number:	STRING,<br>
    created_date:	STRING,<br>
    filename:	STRING,<br>
    load_dt:	DATE<br><br>

5. Click **Create Table**
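
To make the schema concrete, here is a minimal Python sketch of how one CSV line could map onto the table's columns. It assumes the first six columns come from the file and that `filename` and `load_dt` are added by the pipeline; the `to_row` helper and the sample line are hypothetical and are not taken from process_delimited.py.

```python
import csv
import datetime
import io

# Column names from the schema above. The first six are read from the CSV;
# filename and load_dt are assumed to be filled in by the pipeline.
CSV_FIELDS = ["state", "gender", "year", "name", "number", "created_date"]


def to_row(line, filename, load_dt):
    """Map one delimited line to a dict keyed by the table's column names."""
    values = next(csv.reader(io.StringIO(line)))
    row = dict(zip(CSV_FIELDS, values))
    row["filename"] = filename
    row["load_dt"] = load_dt.isoformat()  # DATE values use YYYY-MM-DD
    return row


# Made-up sample line, for illustration only.
row = to_row("KY,F,1910,Mary,77,11/28/2016", "usa_names.csv",
             datetime.date(2018, 1, 1))
print(row)
```

Every CSV column stays a string, matching the STRING types in the schema; only `load_dt` is typed as a DATE.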

***
## Review of Airflow concepts
While your Cloud Composer environment is building, let’s discuss the sample file you’ll be using in this lab.
<br><br>
[Airflow](https://airflow.apache.org/) is a platform to programmatically author, schedule, and monitor workflows.
<br><br>
Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
<br><br>
### Core concepts
- [DAG](https://airflow.apache.org/concepts.html#dags) - A Directed Acyclic Graph  is a collection of tasks, organised to reflect their relationships and dependencies.
- [Operator](https://airflow.apache.org/concepts.html#operators) - The description of a single task; it is usually atomic. For example, the BashOperator is used to execute a bash command.
- [Task](https://airflow.apache.org/concepts.html#tasks) - A parameterised instance of an Operator;  a node in the DAG.
- [Task Instance](https://airflow.apache.org/concepts.html#task-instances) - A specific run of a task; characterised as: a DAG, a Task, and a point in time. It has an indicative state: *running, success, failed, skipped, …*<br><br>
The rest of the Airflow concepts can be found [here](https://airflow.apache.org/concepts.html#).
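
To illustrate what "following the specified dependencies" means, the sketch below orders three tasks with Python's standard-library `graphlib` (Python 3.9+). It is not Airflow code, and the task names are hypothetical; they only loosely follow the workflow in this lab.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a task; its value is the set of tasks it depends on.
# Task names are hypothetical and do not come from simple_load_dag.py.
dependencies = {
    "find_input_file": set(),
    "run_dataflow_job": {"find_input_file"},
    "move_processed_file": {"run_dataflow_job"},
}

# A scheduler may only start a task after everything it depends on has run.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Because the graph is acyclic, a valid order always exists; a cycle would make such an ordering impossible, which is why workflows are modeled as DAGs.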



## Complete the code in the Cloud Composer workflow
Cloud Composer workflows are composed of [DAGs (Directed Acyclic Graphs)](https://airflow.incubator.apache.org/concepts.html#dags). The code shown in simple_load_dag.py is the workflow code, also referred to as the DAG. 
<br><br>
Open the file now to see how it is built. Next will be a detailed look at some of the key components of the file.
<br><br>
To orchestrate all the workflow tasks, the DAG imports the following operators:
- DataFlowPythonOperator
- PythonOperator
<br><br>
Action: <span style="color:red">**Complete the # TODOs in the [simple_load_dag.py](simple_load_dag.py)**</span> file while you wait for your Composer environment to be set up. 

## Viewing environment information
Now that you have a completed DAG, it's time to upload it to your Cloud Composer environment and finish the setup of your workflow.<br><br>
1. Go back to **Composer** to check on the status of your environment.
2. Once your environment has been created, click the **name of the environment** to see its details.
<br><br>
The Environment details page provides information such as the Airflow web UI URL, the Google Kubernetes Engine cluster ID, and the name of the Cloud Storage bucket connected to the DAGs folder.
<br><br>
Cloud Composer uses Cloud Storage to store Apache Airflow DAGs, also known as workflows. Each environment has an associated Cloud Storage bucket. Cloud Composer schedules only the DAGs in the Cloud Storage bucket.

## Setting Airflow variables
Our DAG relies on variables to pass in values like the GCP Project. We can set these in the Admin UI.

Airflow variables are an Airflow-specific concept that is distinct from [environment variables](https://cloud.google.com/composer/docs/how-to/managing/environment-variables). In this step, you'll set the following six [Airflow variables](https://airflow.apache.org/concepts.html#variables) used by the DAG we will deploy.


**Key**|**Value**|**Example**
:-----:|:-----:|:-----:
gcp\_project|your-gcp-project-id|qwiklabs-gcp-123123
gcp\_input\_location|gcs-bucket-for-dataflow-input-files|gs://qwiklabs-gcp-123123_input
gcp\_temp\_location|gcs-bucket-for-dataflow-temp-files|gs://qwiklabs-gcp-123123_output/tmp
gcs\_completion\_bucket|output-gcs-bucket|gs://qwiklabs-gcp-123123_output
input\_field\_names|comma-separated-field-names-for-delimited-file|state,gender,year,name,number,created\_date
bq\_output\_table|bigquery-output-table|ml\_pipeline.ingest\_table

### Option 1: Set the variables using the Airflow webserver UI
1. In your Airflow environment, select **Admin** > **Variables**
2. Add each key from the table above, along with its corresponding value

### Option 2: Set the variables using the Airflow CLI
The following gcloud composer command executes the Airflow CLI sub-command [variables](https://airflow.apache.org/cli.html#variables). The gcloud command line tool passes the arguments after `--` to that sub-command.<br><br>
To set all six variables, run the gcloud composer command once for each row in the table above. 

In [None]:
%%bash
gcloud composer environments run ENVIRONMENT_NAME \
 --location LOCATION variables -- \
 --set KEY VALUE

## Uploading the DAG and dependencies to Cloud Storage
1. After you have completed the # TODOs in simple_load_dag.py, **upload the file** to the DAGs folder in your Airflow environment, inside the /dags/ sub-folder
2. Next, in the same /dags/ folder create a subfolder titled **dataflow**
3. Upload process_delimited.py from your repository into /dags/dataflow/ 
<br><br>
Cloud Composer registers the DAG in your Airflow environment automatically; DAG changes take effect within 3-5 minutes. You can see task status in the Airflow web interface and confirm that the DAG is not scheduled, as per its settings. 
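
The folder layout described in the steps above can be sketched in Python as follows. The bucket name is a placeholder (use the DAGs bucket shown on your Environment details page), and the printed `gsutil cp` commands are one possible way to perform the upload rather than a required step of this lab.

```python
import posixpath

# Placeholder bucket name; replace it with the bucket for your environment.
DAGS_BUCKET = "us-central1-composer-workflow-example-bucket"

# Local file -> folder inside the bucket, as described in the steps above.
uploads = {
    "simple_load_dag.py": "dags",
    "process_delimited.py": "dags/dataflow",
}


def destination(filename, folder):
    """Build the Cloud Storage URL a file should be copied to."""
    return "gs://" + posixpath.join(DAGS_BUCKET, folder, filename)


targets = {name: destination(name, folder) for name, folder in uploads.items()}
for name, target in targets.items():
    print(f"gsutil cp {name} {target}")
```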


***
## Navigating Using the Airflow UI
To access the Airflow web interface using the GCP Console:
1. Go back to the **Composer Environments** page.
2. In the **Airflow webserver** column for the environment, click the new window icon. 
3. The Airflow web UI opens in a new browser window. 

### Trigger DAG run manually
Running your DAG manually ensures that it operates successfully even in the absence of triggered events. 
1. To trigger the DAG manually, **click the play button** under Links


***
# Trigger DAG run automatically from a file upload to GCS
Now that your manual workflow runs successfully, you will trigger it based on an external event. 

## Create a Cloud Function to trigger your workflow
We will be following this [reference guide](https://cloud.google.com/composer/docs/how-to/using/triggering-with-gcf) to set up our Cloud Function.
1. In the code block below, replace the project ID in the service account address with your project ID
2. Run the code to grant blob-signing **permissions** to your service account 

In [6]:
%%bash
gcloud iam service-accounts add-iam-policy-binding \
qwiklabs-gcp-77c0e3b62eaf4101@appspot.gserviceaccount.com \
--member=serviceAccount:qwiklabs-gcp-77c0e3b62eaf4101@appspot.gserviceaccount.com \
--role=roles/iam.serviceAccountTokenCreator

bindings:
- members:
  - serviceAccount:qwiklabs-gcp-77c0e3b62eaf4101@appspot.gserviceaccount.com
  role: roles/iam.serviceAccountTokenCreator
etag: BwV5aLqo9Ck=


3. In the code block below, set project_id, location, and composer_environment to match your environment
4. Run the code to get your **CLIENT_ID** (needed later)

In [7]:
import google.auth
import google.auth.transport.requests
import requests
import six.moves.urllib.parse

# Authenticate with Google Cloud.
# See: https://cloud.google.com/docs/authentication/getting-started
credentials, _ = google.auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])
authed_session = google.auth.transport.requests.AuthorizedSession(
    credentials)

project_id = 'qwiklabs-gcp-123123'
location = 'us-central1'
composer_environment = 'composer-workflow'  # the environment name chosen earlier in this lab

environment_url = (
    'https://composer.googleapis.com/v1beta1/projects/{}/locations/{}'
    '/environments/{}').format(project_id, location, composer_environment)
composer_response = authed_session.request('GET', environment_url)
environment_data = composer_response.json()
airflow_uri = environment_data['config']['airflowUri']

# The Composer environment response does not include the IAP client ID.
# Make a second, unauthenticated HTTP request to the web server to get the
# redirect URI.
redirect_response = requests.get(airflow_uri, allow_redirects=False)
redirect_location = redirect_response.headers['location']

# Extract the client_id query parameter from the redirect.
parsed = six.moves.urllib.parse.urlparse(redirect_location)
query_string = six.moves.urllib.parse.parse_qs(parsed.query)
print(query_string['client_id'][0])


5741407462-6inpo3s2jsfuat4vanffe8ho6tlfmj5q.apps.googleusercontent.com
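
The last step of that cell, extracting `client_id` from the redirect URL, can be tried offline with the Python 3 standard library alone. The redirect URL below is made up for illustration and the client ID in it is a placeholder.

```python
from urllib.parse import parse_qs, urlparse

# Made-up redirect URL shaped like an OAuth redirect; not a real client ID.
redirect_location = (
    "https://accounts.google.com/o/oauth2/v2/auth"
    "?client_id=1234567890-example.apps.googleusercontent.com"
    "&response_type=code&scope=openid+email"
)

# parse_qs returns a dict that maps each parameter to a list of values.
query = parse_qs(urlparse(redirect_location).query)
client_id = query["client_id"][0]
print(client_id)
```

On Python 3, `urllib.parse` provides the same functions that the cell above reaches through `six.moves`.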


## Create the Cloud Function

1. Navigate to Compute > **Cloud Functions**
2. Select **Create function**
3. For name specify **'gcs-dag-trigger-function'**
4. For trigger type select **'Cloud Storage'**
5. For event type select '**Finalize/Create'**
6. For bucket, **specify the input bucket** you created earlier 

Important: be sure to select the input bucket and not the output bucket, to avoid an endless triggering loop.

### Populate index.js
Complete the four required constants in the index.js code below and **paste it into the Cloud Function editor** (the JavaScript code will not run in this notebook). The constants are: 
- PROJECT_ID
- CLIENT_ID (from earlier)
- WEBSERVER_ID (part of Airflow webserver URL) 
- DAG_NAME (GcsToBigQueryTriggered)

In [None]:
'use strict';

const fetch = require('node-fetch');
const FormData = require('form-data');

/**
 * Triggered from a message on a Cloud Storage bucket.
 *
 * IAP authorization based on:
 * https://stackoverflow.com/questions/45787676/how-to-authenticate-google-cloud-functions-for-access-to-secure-app-engine-endpo
 * and
 * https://cloud.google.com/iap/docs/authentication-howto
 *
 * @param {!Object} event The Cloud Functions event.
 * @param {!Function} callback The callback function.
 */
exports.triggerDag = function triggerDag (event, callback) {
  // Fill in your Composer environment information here.

  // The project that holds your function
  const PROJECT_ID = 'qwiklabs-gcp-123123'; 
  // example: qwiklabs-gcp-97d55fb651b04b20

  // Navigate to your webserver's login page and get this from the URL
  const CLIENT_ID = '';
  // example: 954510698485-gde6id87qtdn9itl7809uj8s6a60n9gl

  // This should be part of your webserver's URL:
  // {tenant-project-id}.appspot.com
  const WEBSERVER_ID = '';
  // example: b93193d731fd74d3f-tp

  // The name of the DAG you wish to trigger
  const DAG_NAME = 'GcsToBigQueryTriggered';
  // example: GcsToBigQueryTriggered

  ///////////////////////
  // DO NOT EDIT BELOW //

  // Other constants
  const WEBSERVER_URL = `https://${WEBSERVER_ID}.appspot.com/api/experimental/dags/${DAG_NAME}/dag_runs`;
  const USER_AGENT = 'gcf-event-trigger';
  const BODY = {'conf': JSON.stringify(event.data)};

  // Make the request
  authorizeIap(CLIENT_ID, PROJECT_ID, USER_AGENT)
    .then(function iapAuthorizationCallback (iap) {
      // Return the promise so the callback fires only after the request completes
      return makeIapPostRequest(WEBSERVER_URL, BODY, iap.idToken, USER_AGENT, iap.jwt);
    })
    .then(_ => callback(null))
    .catch(callback);
};

/**
   * @param {string} clientId The client id associated with the Composer webserver application.
   * @param {string} projectId The id for the project containing the Cloud Function.
   * @param {string} userAgent The user agent string which will be provided with the webserver request.
   */
function authorizeIap (clientId, projectId, userAgent) {
  const SERVICE_ACCOUNT = `${projectId}@appspot.gserviceaccount.com`;
  const JWT_HEADER = Buffer.from(JSON.stringify({alg: 'RS256', typ: 'JWT'}))
    .toString('base64');

  var jwt = '';
  var jwtClaimset = '';

  // Obtain an Oauth2 access token for the appspot service account
  return fetch(
    `http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/${SERVICE_ACCOUNT}/token`,
    {
      headers: {'User-Agent': userAgent, 'Metadata-Flavor': 'Google'}
    })
    .then(res => res.json())
    .then(function obtainAccessTokenCallback (tokenResponse) {
      if (tokenResponse.error) {
        return Promise.reject(tokenResponse.error);
      }
      var accessToken = tokenResponse.access_token;
      var iat = Math.floor(new Date().getTime() / 1000);
      var claims = {
        iss: SERVICE_ACCOUNT,
        aud: 'https://www.googleapis.com/oauth2/v4/token',
        iat: iat,
        exp: iat + 60,
        target_audience: clientId
      };
      jwtClaimset = Buffer.from(JSON.stringify(claims)).toString('base64');
      var toSign = [JWT_HEADER, jwtClaimset].join('.');

      return fetch(
        `https://iam.googleapis.com/v1/projects/${projectId}/serviceAccounts/${SERVICE_ACCOUNT}:signBlob`,
        {
          method: 'POST',
          body: JSON.stringify({'bytesToSign': Buffer.from(toSign).toString('base64')}),
          headers: {
            'User-Agent': userAgent,
            'Authorization': `Bearer ${accessToken}`
          }
        });
    })
    .then(res => res.json())
    .then(function signJsonClaimCallback (body) {
      if (body.error) {
        return Promise.reject(body.error);
      }
      // Request service account signature on header and claimset
      var jwtSignature = body.signature;
      jwt = [JWT_HEADER, jwtClaimset, jwtSignature].join('.');
      var form = new FormData();
      form.append('grant_type', 'urn:ietf:params:oauth:grant-type:jwt-bearer');
      form.append('assertion', jwt);
      return fetch(
        'https://www.googleapis.com/oauth2/v4/token', {
          method: 'POST',
          body: form
        });
    })
    .then(res => res.json())
    .then(function returnJwt (body) {
      if (body.error) {
        return Promise.reject(body.error);
      }
      return {
        jwt: jwt,
        idToken: body.id_token
      };
    });
}

/**
   * @param {string} url The url that the post request targets.
   * @param {string} body The body of the post request.
   * @param {string} idToken Bearer token used to authorize the iap request.
   * @param {string} userAgent The user agent to identify the requester.
   * @param {string} jwt A Json web token used to authenticate the request.
   */
function makeIapPostRequest (url, body, idToken, userAgent, jwt) {
  var form = new FormData();
  form.append('grant_type', 'urn:ietf:params:oauth:grant-type:jwt-bearer');
  form.append('assertion', jwt);

  return fetch(
    url, {
      method: 'POST',
      body: form
    })
    .then(function makeIapPostRequestCallback () {
      return fetch(url, {
        method: 'POST',
        headers: {
          'User-Agent': userAgent,
          'Authorization': `Bearer ${idToken}`
        },
        body: JSON.stringify(body)
      });
    });
}
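
To see how the four constants are used, the Python sketch below mirrors the parts of index.js that need no network access: the webserver URL, the request body, and the unsigned `header.claimset` string that the function asks the IAM signBlob endpoint to sign. All identifiers are placeholders, `b64_json` and `build_unsigned_jwt` are hypothetical helper names, and the sketch is an illustration of the structure rather than a replacement for the function.

```python
import base64
import json

# Placeholder values for illustration only; none are real identifiers.
PROJECT_ID = "qwiklabs-gcp-123123"
CLIENT_ID = "1234567890-example.apps.googleusercontent.com"
WEBSERVER_ID = "example-tp"
DAG_NAME = "GcsToBigQueryTriggered"


def b64_json(obj):
    """Serialize to compact JSON, then base64-encode it as index.js does."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def build_unsigned_jwt(issued_at):
    """Return the 'header.claimset' string that is submitted for signing."""
    service_account = f"{PROJECT_ID}@appspot.gserviceaccount.com"
    claims = {
        "iss": service_account,
        "aud": "https://www.googleapis.com/oauth2/v4/token",
        "iat": issued_at,
        "exp": issued_at + 60,  # the assertion is valid for 60 seconds
        "target_audience": CLIENT_ID,
    }
    header = {"alg": "RS256", "typ": "JWT"}
    return ".".join([b64_json(header), b64_json(claims)])


webserver_url = (
    f"https://{WEBSERVER_ID}.appspot.com"
    f"/api/experimental/dags/{DAG_NAME}/dag_runs"
)

# Hypothetical subset of the Cloud Storage event payload.
event_data = {"bucket": "qwiklabs-gcp-123123_input", "name": "usa_names.csv"}

# The event is serialized into the "conf" field, then the body is serialized.
request_body = json.dumps({"conf": json.dumps(event_data)})

unsigned_jwt = build_unsigned_jwt(issued_at=1500000000)
print(webserver_url)
print(request_body)
print(unsigned_jwt)
```

Passing `issued_at` explicitly keeps the output reproducible; index.js uses the current time instead.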

### Populate package.json
Copy and paste the below into **package.json**

In [None]:
{
  "name": "nodejs-docs-samples-functions-composer-storage-trigger",
  "version": "0.0.1",
  "dependencies": {
    "form-data": "^2.3.2",
    "node-fetch": "^2.2.0"
  },
  "engines": {
    "node": ">=4.3.2"
  },
  "private": true,
  "license": "Apache-2.0",
  "author": "Google Inc.",
  "repository": {
    "type": "git",
    "url": "https://github.com/GoogleCloudPlatform/nodejs-docs-samples.git"
  },
  "devDependencies": {
    "@google-cloud/nodejs-repo-tools": "^2.2.5",
    "ava": "0.25.0",
    "proxyquire": "2.0.0",
    "semistandard": "^12.0.1",
    "sinon": "4.4.2"
  },
  "scripts": {
    "lint": "repo-tools lint",
    "test": "ava -T 20s --verbose test/*.test.js"
  }
}

10. For **Function to execute**, specify **triggerDag** (note: case sensitive)
11. Select **Create**

## Upload CSVs and Monitor
1. Practice uploading and editing CSVs in your input bucket (note: the DAG only ingests CSVs whose file path is 'usa_names.csv'. Adjust this as needed in the DAG code.)
2. Troubleshoot Cloud Function call errors by monitoring the [logs](https://console.cloud.google.com/logs/viewer?)
3. Troubleshoot Airflow workflow errors by monitoring **Browse** > **DAG Runs** 
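
The filename filter mentioned in step 1 can be pictured with the following Python sketch. It assumes an exact match on the object's `name` field in the Cloud Storage event; the check in simple_load_dag.py may be written differently, and `should_process` is a hypothetical name.

```python
def should_process(event_data, expected_name="usa_names.csv"):
    """Return True when the uploaded object's path matches the expected file."""
    return event_data.get("name") == expected_name


# Hypothetical event payloads, reduced to the fields used here.
matching = {"bucket": "qwiklabs-gcp-123123_input", "name": "usa_names.csv"}
ignored = {"bucket": "qwiklabs-gcp-123123_input", "name": "other_file.csv"}

print(should_process(matching))
print(should_process(ignored))
```

If an upload does not start the Dataflow job, checking the object name against the filter is a quick first troubleshooting step.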

## Congratulations! 
You’ve completed this advanced lab on triggering a workflow with a Cloud Function.