# Creating components from command-line programs

This tutorial shows how to quickly author a set of reusable components based on existing command-line programs.

The components are used to compose a pipeline.

In this example, the goal of the pipeline is to dynamically build a container image from a directory in a Git repository.

The main program is the Kaniko executor, which can take a context archive file from a Google Cloud Storage location, then build and push a container image. To run that program, we first need to get the files from Git, select the needed files, archive them, and upload the archive to Google Cloud Storage.

Pipeline steps:

* Get files from Git. (`git clone`)
* Choose needed files. (`cp`)
* Archive files. (`tar`)
* Upload archive. (`gsutil cp`)
* Build container image. (Kaniko)
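
For orientation, the five steps above correspond roughly to the command lines below. This is a minimal sketch that only builds the argument lists and runs nothing; the function name `plan_commands`, the working paths, and the example arguments are hypothetical and not part of the tutorial's components.

```python
def plan_commands(repo_uri, subpath, archive_uri, image_name, workdir="/tmp/work"):
    """Return one argv list per pipeline step. Nothing is executed."""
    repo_dir = f"{workdir}/repo"               # hypothetical local paths
    context_dir = f"{workdir}/context"
    archive = f"{workdir}/context.tar.gz"
    return [
        # 1. Get files from Git.
        ["git", "clone", "--depth=1", repo_uri, repo_dir],
        # 2. Choose needed files.
        ["cp", "-r", f"{repo_dir}/{subpath}", context_dir],
        # 3. Archive files (paths inside the archive are relative to context_dir).
        ["tar", "-c", "-z", "-f", archive, "-C", context_dir, "."],
        # 4. Upload archive.
        ["gsutil", "cp", archive, archive_uri],
        # 5. Build and push the container image.
        ["/kaniko/executor", "--dockerfile", "Dockerfile",
         "--context", archive_uri, "--destination", image_name],
    ]


steps = plan_commands(
    repo_uri="https://github.com/kubeflow/pipelines.git",
    subpath="components/sample/keras/train_classifier",
    archive_uri="gs://<bucket-name>/context.tar.gz",
    image_name="gcr.io/<project>/image:tag1",
)
for argv in steps:
    print(" ".join(argv))
```

Each component defined in the rest of the tutorial wraps one of these command lines and declares which arguments are inputs and which are outputs.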

In [None]:
# Install Kubeflow Pipelines SDK
!PIP_DISABLE_PIP_VERSION_CHECK=1 pip3 install 'kfp>=0.1.32.2' --quiet

In [None]:
# GCP project ID which will be used to push the new image
GCP_PROJECT_ID = '...'
target_image_name = 'gcr.io/' + GCP_PROJECT_ID + '/tmp-kfp-106-build-image:tag1'

GCP_STORAGE_DIR = 'gs://<bucket-name>/tmp_kfp_106_build_image/'

import kfp
image_context_scratch_uri = GCP_STORAGE_DIR + kfp.dsl.RUN_ID_PLACEHOLDER + '.tar.gz'
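
The staging URI above embeds a run ID placeholder, so every pipeline run writes its archive to a different object. The sketch below illustrates the idea with plain strings. The stand-in token `<run-id>`, the helper `staging_uri_for_run`, and the run IDs are hypothetical; the real placeholder is substituted by the pipeline backend at run time, and its literal value is not shown here.

```python
RUN_ID_STAND_IN = "<run-id>"  # hypothetical stand-in, not the SDK's actual placeholder text


def staging_uri_for_run(storage_dir, run_id):
    """Mimic run-time substitution of the placeholder with a concrete run ID."""
    template = storage_dir + RUN_ID_STAND_IN + ".tar.gz"
    return template.replace(RUN_ID_STAND_IN, run_id)


storage_dir = "gs://<bucket-name>/tmp_kfp_106_build_image/"
uri_a = staging_uri_for_run(storage_dir, "run-aaa")  # hypothetical run IDs
uri_b = staging_uri_for_run(storage_dir, "run-bbb")
print(uri_a)
print(uri_b)
```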

In [None]:
# Initializing the client
client = kfp.Client()

# When working from GCP notebooks or local notebooks, pass the endpoint explicitly:
# client = kfp.Client(host='https://xxxxx.notebooks.googleusercontent.com/')

In [None]:
from pathlib import Path

from kfp.components import InputPath, OutputPath, load_component_from_file, load_component_from_url

In [None]:
Path('git_clone.component.yaml').write_text('''\
name: Git clone
inputs:
- {name: Repo URI, type: URI}
- {name: Branch, type: String, default: 'master'}
outputs:
- {name: Repo dir, type: Directory}
implementation:
  container:
    image: alpine/git
    command:
    - git
    - clone
    - --depth=1
    - --branch
    - inputValue: Branch
    - inputValue: Repo URI
    - outputPath: Repo dir
'''
)

git_clone_op = load_component_from_file('git_clone.component.yaml')
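
To see what the `inputValue` and `outputPath` entries in the `command` list mean, it can help to resolve them by hand. The sketch below is a simplified stand-in for what happens at run time, not the SDK's implementation: the helper `resolve_command` and the output path `/tmp/outputs/Repo_dir/data` are hypothetical.

```python
def resolve_command(command, input_values, output_paths):
    """Replace placeholder dicts in a command list with concrete strings."""
    resolved = []
    for item in command:
        if isinstance(item, str):
            resolved.append(item)                                   # literal argument
        elif "inputValue" in item:
            resolved.append(str(input_values[item["inputValue"]]))  # value passed inline
        elif "outputPath" in item:
            resolved.append(output_paths[item["outputPath"]])       # path the program writes to
        else:
            raise ValueError(f"Unsupported placeholder: {item!r}")
    return resolved


# The command list of the "Git clone" component, written as Python data.
git_clone_command = [
    "git", "clone", "--depth=1", "--branch",
    {"inputValue": "Branch"},
    {"inputValue": "Repo URI"},
    {"outputPath": "Repo dir"},
]

argv = resolve_command(
    git_clone_command,
    input_values={"Branch": "master", "Repo URI": "https://github.com/kubeflow/pipelines.git"},
    output_paths={"Repo dir": "/tmp/outputs/Repo_dir/data"},  # hypothetical path
)
print(argv)
```

The program itself never learns about placeholders: it receives an ordinary argument list, which is what makes wrapping existing command-line tools straightforward.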

In [None]:
Path('get_subdir.component.yaml').write_text('''\
name: Get subdirectory
description: Get subdirectory items (files or directories).
inputs:
- {name: Directory, type: Directory}
- {name: Subpath, type: String}
outputs:
- {name: Subdir, type: Directory}
implementation:
  container:
    image: alpine
    command:
    - sh
    - -ex
    - -c
    - |
      # $0 = input directory, $1 = subpath, $2 = output path
      mkdir -p "$(dirname "$2")"
      cp -r "$0/$1" "$2"
    - inputPath: Directory
    - inputValue: Subpath
    - outputPath: Subdir
'''
)

get_subdir_op = load_component_from_file('get_subdir.component.yaml')
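
The inline script refers to its arguments as `$0`, `$1`, and `$2`. With `sh -c`, the first argument after the script text becomes `$0`, not `$1`. The following check demonstrates this locally; it assumes a POSIX `sh` is available on the machine running the notebook, and the three argument values are made up for illustration.

```python
import subprocess

# Same calling convention as the component: sh -c <script> <arg0> <arg1> <arg2>
script = 'echo "first=$0 second=$1 third=$2"'
result = subprocess.run(
    ["sh", "-c", script, "/in/dir", "sub/path", "/out/dir"],  # illustrative values
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```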

In [None]:
Path('tar_gzip.component.yaml').write_text('''\
name: Compress directory using TAR GZIP
inputs:
- {name: Data, type: Directory}
outputs:
- {name: Gzipped data, type: GzippedTar}
implementation:
  container:
    image: alpine
    command:
    - sh
    - -ex
    - -c
    - |
      # $0 = input directory, $1 = output archive path
      mkdir -p "$(dirname "$1")"
      cd "$0"
      tar c -z -f "$1" .
    - inputPath: Data
    - outputPath: Gzipped data
'''
)

tar_gzip_op = load_component_from_file('tar_gzip.component.yaml')
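
The script changes into the input directory before archiving `.`, so the paths stored in the archive are relative to the directory root. This matters later because the build step looks for `Dockerfile` relative to the context root. The sketch below reproduces that layout locally with Python's `tarfile` module as a rough equivalent of the shell commands; the temporary directory and the one-line Dockerfile are made up for the check.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # A tiny stand-in for the build context directory.
    context_dir = os.path.join(workdir, "context")
    os.makedirs(context_dir)
    with open(os.path.join(context_dir, "Dockerfile"), "w") as f:
        f.write("FROM alpine\n")

    archive_path = os.path.join(workdir, "context.tar.gz")
    # arcname="." mirrors `cd "$0"` followed by `tar c -z -f "$1" .`
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(context_dir, arcname=".")

    # Read the archive back and list its members without the leading "./".
    with tarfile.open(archive_path, "r:gz") as tar:
        member_names = sorted(os.path.normpath(name) for name in tar.getnames())

print(member_names)
```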

In [None]:
Path('google_cloud_storage_upload.component.yaml').write_text('''\
name: Upload to GCS
inputs:
- {name: Data}
- {name: GCS path, type: URI}
outputs:
- {name: GCS path, type: URI}
implementation:
  container:
    image: google/cloud-sdk
    command:
    - sh
    - -ex
    - -c
    - |
      # $0 = local data path, $1 = destination GCS URI, $2 = output file for the URI
      gcloud auth activate-service-account --key-file="${GOOGLE_APPLICATION_CREDENTIALS}"
      gsutil cp "$0" "$1"
      mkdir -p "$(dirname "$2")"
      echo "$1" > "$2"
    - inputPath: Data
    - inputValue: GCS path
    - outputPath: GCS path
'''
)

upload_to_gcs_op = load_component_from_file('google_cloud_storage_upload.component.yaml')

In [None]:
Path('containers_build_image_from_context_uri.component.yaml').write_text('''\
name: Build container image
inputs:
- {name: Context archive URI, type: URI}
- {name: Target image name, type: String}
outputs:
- {name: Image digest, type: String}
implementation:
  container:
    image: gcr.io/kaniko-project/executor@sha256:78d44ec4e9cb5545d7f85c1924695c89503ded86a59f92c7ae658afa3cff5400
    command:
    - /kaniko/executor
    - --cache=true
    - --dockerfile
    - Dockerfile
    - --context
    - inputValue: Context archive URI
    - --destination
    - inputValue: Target image name
    - --digest-file
    - /tmp/digest.txt
    fileOutputs:
      Image digest: /tmp/digest.txt
'''
)

build_image_from_context_uri_op = load_component_from_file('containers_build_image_from_context_uri.component.yaml')

In [None]:
print_component_text = '''
name: Print text
inputs:
- {name: Text}
implementation:
  container:
    image: alpine
    command:
    - cat
    - inputPath: Text
'''
print_op = kfp.components.load_component_from_text(print_component_text)

In [None]:
def build_image_with_staging_pipeline(
    target_image_name,
    staging_gcs_path='gs://avolkov/tmp_git_pipeline/tmp123.tgz',
    repo_uri='https://github.com/kubeflow/pipelines.git',
    repo_subpath='components/sample/keras/train_classifier',
):
    git_clone_task = git_clone_op(
        repo_uri=repo_uri,
    )

    get_subdir_task = get_subdir_op(
        directory=git_clone_task.output,
        subpath=repo_subpath,
    )

    tar_gzip_task = tar_gzip_op(
        data=get_subdir_task.output,
    )

    upload_to_gcs_task = upload_to_gcs_op(
        data=tar_gzip_task.output,
        gcs_path=staging_gcs_path,
    )

    build_image_from_context_uri_task = build_image_from_context_uri_op(
        context_archive_uri=upload_to_gcs_task.output,
        target_image_name=target_image_name,
    )

    print_op(
        build_image_from_context_uri_task.outputs['image_digest']
    )
    

# Adding GCP credential secrets
# Needed to get permissions in Kubeflow < 0.7
from kfp import gcp
pipeline_conf = kfp.dsl.PipelineConf()
pipeline_conf.add_op_transformer(gcp.use_gcp_secret('user-gcp-sa'))

client.create_run_from_pipeline_func(
    build_image_with_staging_pipeline,
    arguments={
        'target_image_name': target_image_name,
        'staging_gcs_path': image_context_scratch_uri,
    },
    pipeline_conf=pipeline_conf,
)
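
The keyword arguments and output keys used in the pipeline function (`repo_uri`, `gcs_path`, `outputs['image_digest']`) are Python-friendly forms of the names declared in the component files (`Repo URI`, `GCS path`, `Image digest`). The sketch below approximates that mapping so the correspondence is easy to check. The helper `to_python_name` is hypothetical and is not the SDK's own function; the real conversion may treat edge cases differently.

```python
import re


def to_python_name(component_name):
    """Approximate a component input/output name as a Python identifier."""
    # Lowercase, turn each run of other characters into "_", trim stray underscores.
    return re.sub(r"[^a-z0-9]+", "_", component_name.lower()).strip("_")


for declared in ["Repo URI", "Subpath", "GCS path", "Context archive URI", "Image digest"]:
    print(f"{declared!r} -> {to_python_name(declared)!r}")
```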

### Modifying the "build image" component to remove the staging location in Google Cloud Storage
After the pipeline was completed, it was discovered that the Kaniko builder can also build images from a local context directory.

This can greatly simplify our pipeline and remove the need for the staging Google Cloud Storage location.

Let's rewrite the component to accept the context as pipeline data passed by local path (`inputPath`) instead of a Google Cloud Storage URI (`inputValue`).

In [None]:
Path('containers_build_image_from_context.component.yaml').write_text('''\
name: Build container image
inputs:
- {name: Context}
- {name: Target image name, type: String}
outputs:
- {name: Image digest, type: String}
implementation:
  container:
    image: gcr.io/kaniko-project/executor@sha256:78d44ec4e9cb5545d7f85c1924695c89503ded86a59f92c7ae658afa3cff5400
    command:
    - /kaniko/executor
    - --cache=true
    - --dockerfile
    - Dockerfile
    - --context
    - inputPath: Context
    - --destination
    - inputValue: Target image name
    - --digest-file
    - /tmp/digest.txt
    fileOutputs:
      Image digest: /tmp/digest.txt
'''
)

build_image_op = load_component_from_file('containers_build_image_from_context.component.yaml')

In [None]:
def build_image_pipeline(
    target_image_name,
    repo_uri='https://github.com/kubeflow/pipelines.git',
    repo_subpath='components/sample/keras/train_classifier',
):
    git_clone_task = git_clone_op(
        repo_uri=repo_uri,
    )

    get_subdir_task = get_subdir_op(
        directory=git_clone_task.output,
        subpath=repo_subpath,
    )

    build_image_task = build_image_op(
        context=get_subdir_task.output,
        target_image_name=target_image_name,
    )

    print_op(
        build_image_task.outputs['image_digest']
    )


# Adding GCP credential secrets
# Needed to get permissions in Kubeflow < 0.7
from kfp import gcp
pipeline_conf = kfp.dsl.PipelineConf()
pipeline_conf.add_op_transformer(gcp.use_gcp_secret('user-gcp-sa'))

client.create_run_from_pipeline_func(
    build_image_pipeline,
    arguments={'target_image_name': target_image_name},
    pipeline_conf=pipeline_conf,
)

**Exercise 1**: The build image component returns only the hash digest of the container image. Create a new component that can merge the target image name with the hash digest to get a full image name with digest.
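
As a starting point for the exercise, here is a minimal sketch of the string logic only; wrapping it in a component is still left to the reader. It assumes the digest has the form `sha256:<hex>` and that any tag in the image name should be dropped. The function name `image_name_with_digest` and the sample digest are hypothetical.

```python
def image_name_with_digest(image_name, digest):
    """Combine an image name (optionally tagged) with a digest."""
    # Drop an existing digest suffix, if any.
    name = image_name.split("@", 1)[0]
    # A ':' after the last '/' starts a tag; a ':' before it is a registry port.
    last_slash = name.rfind("/")
    last_colon = name.rfind(":")
    if last_colon > last_slash:
        name = name[:last_colon]
    # The digest is read from a file, so strip surrounding whitespace.
    return f"{name}@{digest.strip()}"


full_name = image_name_with_digest(
    "gcr.io/my-project/tmp-kfp-106-build-image:tag1",  # hypothetical project ID
    "sha256:0123abcd\n",                               # hypothetical, shortened digest
)
print(full_name)
```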