<a href="https://colab.research.google.com/github/ShaswataJash/kfpcomponent/blob/main/KaggleDatasetFetcher.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is the development workflow for the Kubeflow Pipelines component of the same name. Refer to https://github.com/ShaswataJash/kfpcomponent

#Install required software

In [None]:
!uname -a

In [None]:
!lsb_release -a

In [None]:
!python --version

In [None]:
#install fuse as dependency for rclone. Additionally, install curl, unzip for rclone installer to work
!apt-get update \
    && apt-get install --no-install-recommends -y curl fuse unzip \
    && echo "user_allow_other" >> /etc/fuse.conf \
    && curl https://rclone.org/install.sh | bash \
    && apt-get -y remove --purge curl unzip \
    && rm -rf /var/lib/apt/lists/* \
    && rclone --version

In [None]:
!pip install kaggle==1.5.12

#Develop source code files

In [None]:
%%writefile kaggle_download.py
#!/usr/bin/env python3

import os
import sys
import argparse
import logging
import json
import tempfile
import subprocess

for arg in sys.argv:
    print(arg)
sys.stdout.flush()

parser = argparse.ArgumentParser(description='kubeflow pipeline component to download competition or dataset files from kaggle')
parser.add_argument('--log-level', default='INFO', choices=['ERROR', 'INFO', 'DEBUG'])
parser.add_argument('--bypass-rclone-for-output-data', default=False, action="store_true", help='whether output csv file should be written like local file - rclone is completely bypassed')
parser.add_argument('--rclone-environment-var', type=str, default= '{}', help='json formatted key-value pairs of strings which will be set as environment variables before executing rclone commands')
parser.add_argument('--kaggle-environment-var', type=str, default= '{}', help='json formatted key-value pairs of strings which will be set as environment variables before executing kaggle commands')
parser.add_argument('--kaggle-resource-type', choices=['competitions', 'datasets'])
parser.add_argument('--kaggle-resource-name', type=str, help='name of the kaggle resource (competition or dataset) to download') #refer: https://github.com/Kaggle/kaggle-api
parser.add_argument('--output-datasource-directory-mountable', default=False, action="store_true", help='whether output csv file will be written in mountable remote location when rclone is used')
parser.add_argument('--output-datasource-directory', type=str, help='the directory/bucket path holding the kaggle downloaded files')

args = parser.parse_args()

#keeping the log format same as used in pycaret for consistency (refer: https://github.com/pycaret/pycaret/blob/master/pycaret/internal/logging.py)
logging.basicConfig(level=args.log_level, format='%(asctime)s:%(levelname)s:%(message)s')

#sanity check of arguments: bypassing rclone excludes every rclone-specific option
if args.bypass_rclone_for_output_data:
    assert not args.output_datasource_directory_mountable
    assert args.rclone_environment_var == '{}'

#setting rclone related env
try:
    rclone_config = json.loads(args.rclone_environment_var)
    logging.info("rclone_config: type=%s content=%s", type(rclone_config), rclone_config)
    for item in rclone_config.items():
        #explicitly converting item[1] to str because rclone config can have nested json. In that case, item[1] will be of dictionary type
        #replacing single quotes with double quotes to make the values json compatible (note: for a string without ', the replacement below has no effect)
        os.environ[item[0]] = str(item[1]).replace('\'', '"')
        logging.debug('%s => %s', item[0], os.getenv(item[0]))
except BaseException as err:
    logging.error("rclone configuration loading related error", exc_info=True)
    sys.stdout.flush()
    sys.exit("Forceful exit as exception encountered while loading rclone_config")  

#setting kaggle related env
try:
    kaggle_config = json.loads(args.kaggle_environment_var)
    logging.info("kaggle_config: type=%s content=%s", type(kaggle_config), kaggle_config)
    for item in kaggle_config.items():
        #explicitly converting item[1] to str because kaggle config can have nested json. In that case, item[1] will be of dictionary type
        #replacing single quotes with double quotes to make the values json compatible (note: for a string without ', the replacement below has no effect)
        os.environ[item[0]] = str(item[1]).replace('\'', '"')
        logging.debug('%s => %s', item[0], os.getenv(item[0]))
except BaseException as err:
    logging.error("kaggle configuration loading related error", exc_info=True)
    sys.stdout.flush()
    sys.exit("Forceful exit as exception encountered while loading kaggle_config")  

#temporary directory creation
try:
    if not args.bypass_rclone_for_output_data:
        local_datastore_write_dir = tempfile.mkdtemp(prefix="my_local_write-")
        logging.debug('local_datastore_write_dir:%s',local_datastore_write_dir)
except BaseException as err:
    logging.error("temporary directory creation related error", exc_info=True)
    sys.stdout.flush()
    sys.exit("Forceful exit as exception encountered while creating temporary directories")

#output file handling
if not args.bypass_rclone_for_output_data:
    if args.output_datasource_directory_mountable:
        output_data_write_cmd = "rclone -v mount remotewrite:" + args.output_datasource_directory + ' ' + local_datastore_write_dir + ' --daemon'
        logging.info(output_data_write_cmd)
        output_data_write_call = subprocess.run(output_data_write_cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
        logging.info(output_data_write_call.stdout)
        if output_data_write_call.returncode != 0:
            logging.error("Error in rclone, errorcode=%s", output_data_write_call.returncode)
            sys.stdout.flush()
            sys.exit("Forceful exit as rclone returned error in context of mounted writing")

#handling of kaggle interaction
try:
    if args.bypass_rclone_for_output_data:
        os.makedirs(args.output_datasource_directory, exist_ok=True)
    kaggle_files_to_download_dir = args.output_datasource_directory if args.bypass_rclone_for_output_data else local_datastore_write_dir
    kaggle_write_cmd = "kaggle " + args.kaggle_resource_type + ' download -p ' + kaggle_files_to_download_dir + ' ' + args.kaggle_resource_name
    logging.info(kaggle_write_cmd)
    kaggle_write_call = subprocess.run(kaggle_write_cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    logging.info(kaggle_write_call.stdout)
    if kaggle_write_call.returncode != 0:
        logging.error("Error in kaggle download, errorcode=%s", kaggle_write_call.returncode)
        sys.stdout.flush()
        sys.exit("Forceful exit as kaggle download returned error")
except Exception as err: #Exception (not BaseException) so that the SystemExit raised above is not caught and re-reported here
    logging.error("kaggle download related error", exc_info=True)
    sys.stdout.flush()
    sys.exit("Forceful exit as exception encountered while kaggle download")

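The sanity checks in kaggle_download.py rely on `assert`, which Python skips when run with `-O`. A minimal sketch of the same checks using `parser.error` is shown below; the helper names `build_parser` and `parse_and_validate` are hypothetical and only a subset of the script's options is declared.

```python
import argparse


def build_parser():
    # Only the options involved in the sanity checks are declared here.
    parser = argparse.ArgumentParser(description="sanity-check sketch")
    parser.add_argument("--bypass-rclone-for-output-data", action="store_true")
    parser.add_argument("--output-datasource-directory-mountable", action="store_true")
    parser.add_argument("--rclone-environment-var", type=str, default="{}")
    return parser


def parse_and_validate(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.bypass_rclone_for_output_data:
        # parser.error prints the usage text and exits with status 2.
        if args.output_datasource_directory_mountable:
            parser.error("--output-datasource-directory-mountable cannot be "
                         "combined with --bypass-rclone-for-output-data")
        if args.rclone_environment_var != "{}":
            parser.error("--rclone-environment-var must be left unset when "
                         "--bypass-rclone-for-output-data is given")
    return args


args = parse_and_validate(["--bypass-rclone-for-output-data"])
print(args.bypass_rclone_for_output_data)
```

With this variant an invalid combination produces a usage message instead of a bare AssertionError traceback.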

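The environment-variable loading used for `--rclone-environment-var` and `--kaggle-environment-var` can be tried in isolation. The sketch below repeats the script's conversion (`str()` followed by replacing single quotes with double quotes); the function name `export_env_from_json` and the variable `RCLONE_CONFIG_EXAMPLE_OPTS` are hypothetical and chosen only to show a nested value.

```python
import json
import os


def export_env_from_json(json_text):
    """Parse a JSON object and export each key as an environment variable."""
    config = json.loads(json_text)
    exported = {}
    for key, value in config.items():
        # Same conversion as kaggle_download.py: a nested dict becomes its
        # Python repr, then single quotes are swapped for double quotes.
        exported[key] = str(value).replace("'", '"')
        os.environ[key] = exported[key]
    return exported


result = export_env_from_json(
    '{"KAGGLE_CONFIG_DIR": "/mnt", "RCLONE_CONFIG_EXAMPLE_OPTS": {"type": "s3"}}'
)
print(result)
```

Note that this conversion is not a full JSON round trip: a nested boolean, for example, would come out as `True` rather than `true`. Using `json.dumps` for non-string values would avoid that, at the cost of changing the script's behaviour.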

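The script builds the kaggle command as one string and splits it on whitespace, which breaks if the download directory contains a space. A possible alternative, shown only as a sketch, is to build the argument list directly; `build_kaggle_command` is a hypothetical helper and the directory with a space is a made-up example.

```python
import shlex


def build_kaggle_command(resource_type, download_dir, resource_name):
    """Return the kaggle CLI invocation as an argument list for subprocess.run."""
    if resource_type not in ("competitions", "datasets"):
        raise ValueError("unsupported resource type: %s" % resource_type)
    # Each element is passed to the process as-is, so no shell quoting is needed.
    return ["kaggle", resource_type, "download", "-p", download_dir, resource_name]


cmd = build_kaggle_command(
    "datasets", "/tmp/my local dir", "anushonkar/network-anamoly-detection"
)
# shlex.quote is used only to produce a readable, copy-pasteable log line.
print(" ".join(shlex.quote(part) for part in cmd))
```

The list can be handed to `subprocess.run(cmd, ...)` unchanged, exactly where the script currently passes `kaggle_write_cmd.split()`.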

Docker size reduction tips:


*   https://devopscube.com/reduce-docker-image-size/
*   https://www.ecloudcontrol.com/best-practices-to-reduce-docker-images-size/



In [None]:
%%writefile Dockerfile
FROM python:3.7.13-slim

#install fuse as dependency for rclone. Additionally, install curl, unzip for rclone installer to work
RUN apt-get update \
    && apt-get install --no-install-recommends -y curl fuse unzip \
    && echo "user_allow_other" >> /etc/fuse.conf \
    && curl https://rclone.org/install.sh | bash \
    && apt-get -y remove --purge curl unzip \
    && apt-get -y autoremove \
    && rm -rf /var/lib/apt/lists/* \
    && rclone --version

#install kaggle client lib
RUN python3 -m pip install kaggle==1.5.12
    
COPY src/kaggle_download.py /tmp

In [None]:
%%writefile run_tests.sh
#!/bin/bash

#In production, kaggle.json should be created through a kubernetes secret and mounted at the directory that the KAGGLE_CONFIG_DIR environment var points to
#For quick testing, KAGGLE_USERNAME and KAGGLE_KEY environment vars can be passed through --kaggle-environment-var (not recommended for production)

#Test: download kaggle dataset
python3 /tmp/kaggle_download.py --kaggle-resource-type 'datasets' --kaggle-resource-name 'anushonkar/network-anamoly-detection' \
    --kaggle-environment-var '{"KAGGLE_CONFIG_DIR":"/mnt"}' \
    --bypass-rclone-for-output-data --output-datasource-directory '/tmp/my_local_dir_for_test/' --log-level 'DEBUG'

if [ $? -ne 0 ]
then
    exit 1
else
    echo "============ test related to download kaggle-dataset done ==============="
fi

#Test: download kaggle competitions
#NOTE: the kaggle user needs to accept the competition rules before the competition files can be downloaded
python3 /tmp/kaggle_download.py --kaggle-resource-type 'competitions' --kaggle-resource-name 'tabular-playground-series-aug-2022' \
    --kaggle-environment-var '{"KAGGLE_CONFIG_DIR":"/mnt"}' \
    --bypass-rclone-for-output-data --output-datasource-directory '/tmp/my_local_dir_for_test/' --log-level 'DEBUG'

if [ $? -ne 0 ]
then
    exit 1
else
    echo "============ test related to download kaggle-competitions done ==============="
fi

python /tmp/test_validation.py
if [ $? -ne 0 ]
then
    exit 1
else
    exit 0
fi

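run_tests.sh expects a kaggle.json inside the directory named by KAGGLE_CONFIG_DIR (`/mnt` above). For local experiments, such a file could be produced as sketched below. The helper `write_kaggle_json` is hypothetical, the credentials are placeholders, and a temporary directory is used instead of `/mnt`.

```python
import json
import os
import stat
import tempfile


def write_kaggle_json(config_dir, username, key):
    """Write kaggle.json with owner-only permissions and return its path."""
    os.makedirs(config_dir, exist_ok=True)
    path = os.path.join(config_dir, "kaggle.json")
    with open(path, "w") as fh:
        json.dump({"username": username, "key": key}, fh)
    # Restrict access to the owner, since the file holds an API key.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return path


config_dir = tempfile.mkdtemp(prefix="kaggle-config-")
kaggle_json_path = write_kaggle_json(config_dir, "placeholder-user", "placeholder-key")
os.environ["KAGGLE_CONFIG_DIR"] = config_dir
print(kaggle_json_path)
```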
In [None]:
%%writefile test_validation.py
#!/usr/bin/env python3

from os.path import exists
assert exists('/tmp/my_local_dir_for_test/network-anamoly-detection.zip') == True
assert exists('/tmp/my_local_dir_for_test/tabular-playground-series-aug-2022.zip') == True

import zipfile
with zipfile.ZipFile('/tmp/my_local_dir_for_test/network-anamoly-detection.zip', 'r') as zip_ref:
    zip_ref.extractall('/tmp/my_local_dir_for_test/network-anamoly-detection')

with zipfile.ZipFile('/tmp/my_local_dir_for_test/tabular-playground-series-aug-2022.zip', 'r') as zip_ref:
    zip_ref.extractall('/tmp/my_local_dir_for_test/tabular-playground-series-aug-2022')


import pandas

df = pandas.read_csv(filepath_or_buffer = '/tmp/my_local_dir_for_test/network-anamoly-detection/Train.txt')
print (df.shape)
assert len(df.index) > 10000 #check whether more than 10000 rows are present

df = pandas.read_csv(filepath_or_buffer = '/tmp/my_local_dir_for_test/tabular-playground-series-aug-2022/train.csv')
print (df.shape)
assert len(df.index) > 10000 #check whether more than 10000 rows are present

print ('test-validation done successfully')

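The row-count check in test_validation.py extracts each archive and loads it with pandas. When only the number of rows matters, a lighter variant can read the CSV member straight from the zip. This is a sketch using only the standard library; `count_csv_rows_in_zip` is a hypothetical helper and the small archive is generated on the fly purely for demonstration.

```python
import csv
import io
import os
import tempfile
import zipfile


def count_csv_rows_in_zip(zip_path, member):
    """Count data rows (header excluded) of a CSV member without extracting it."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        with archive.open(member) as raw:
            reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
            next(reader, None)  # skip the header row
            return sum(1 for _ in reader)


# Build a tiny demo archive so the sketch runs without any kaggle download.
demo_dir = tempfile.mkdtemp(prefix="zip-demo-")
demo_zip = os.path.join(demo_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("train.csv", "id,value\n1,a\n2,b\n3,c\n")

row_count = count_csv_rows_in_zip(demo_zip, "train.csv")
print(row_count)
```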

*   https://www.kubeflow.org/docs/components/pipelines/sdk/component-development/#designing-a-pipeline-component
*   https://github.com/kubeflow/pipelines/blob/sdk/release-1.8/sdk/python/kfp/dsl/types.py
*   https://kubeflow-pipelines.readthedocs.io/en/stable/_modules/kfp/components/_structures.html



In [None]:
%%writefile component_output_as_artifact.yaml
name: KaggleDatasetFetcherWhereOutputAsArtifact
description: |
    Download kaggle competition and dataset files. Downloads are delivered in zipped form.
    For the kaggle API, refer to https://github.com/Kaggle/kaggle-api
    In production, kaggle.json should be created through a kubernetes secret and mounted at the directory that the KAGGLE_CONFIG_DIR environment var points to.
    For quick testing, KAGGLE_USERNAME and KAGGLE_KEY environment vars can be passed through --kaggle-environment-var (not recommended for production).
    The downloaded files are stored in output artifacts, so they are written like locally mounted POSIX files.
metadata:
  annotations:
    author: Shaswata Jash <29448766+ShaswataJash@users.noreply.github.com>
    canonical_location: https://raw.githubusercontent.com/ShaswataJash/kfpcomponent/main/KaggleDatasetFetcher/component_output_as_artifact.yaml

inputs:
- name: log_level
  type: String
  description: 'choice amongst ERROR, INFO, DEBUG'
  optional: true
- name: kaggle_environment_var
  type: String
  description: 'json formatted key-value pairs of strings which will be set as environment variables before executing kaggle commands'
- name: kaggle_resource_type
  type: String 
  description: 'choice amongst competitions, datasets'
- name: kaggle_resource_name
  type: String
  description: 'name of the competition or dataset that will be downloaded'

outputs:
- name: output_datasource_directory
  description: 'absolute local directory where downloaded file will be stored when rclone is NOT used i.e. when output file is stored in output artifact of pipeline engine (e.g. argo)'

implementation:
  container:
    image: shasjash/kfpcomponents:KaggleDatasetFetcher_devlatest
    command:
    - python3 
    - /tmp/kaggle_download.py
    args:
    - --bypass-rclone-for-output-data
    - if:
        cond: {isPresent: log_level}
        then:
        - --log-level
        - {inputValue: log_level}
    - --kaggle-environment-var
    - {inputValue: kaggle_environment_var}
    - --kaggle-resource-type
    - {inputValue: kaggle_resource_type}
    - --kaggle-resource-name
    - {inputValue: kaggle_resource_name}
    - --output-datasource-directory
    - {outputPath: output_datasource_directory}

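To see which command line the container receives, the `command` and `args` template of the component can be mimicked in plain Python. This is a simplified illustration and not the resolver that kfp uses; `resolve_args` is a hypothetical helper and the output path is a placeholder, since the real path is chosen by the pipeline engine.

```python
def resolve_args(inputs, output_path):
    """Expand the component's command/args template for the given inputs."""
    argv = ["python3", "/tmp/kaggle_download.py", "--bypass-rclone-for-output-data"]
    # Mirrors the `if: cond: {isPresent: log_level}` block of the YAML.
    if "log_level" in inputs:
        argv += ["--log-level", inputs["log_level"]]
    argv += [
        "--kaggle-environment-var", inputs["kaggle_environment_var"],
        "--kaggle-resource-type", inputs["kaggle_resource_type"],
        "--kaggle-resource-name", inputs["kaggle_resource_name"],
        # {outputPath: ...} is replaced by a path provided at run time.
        "--output-datasource-directory", output_path,
    ]
    return argv


argv = resolve_args(
    {
        "kaggle_environment_var": '{"KAGGLE_CONFIG_DIR":"/mnt"}',
        "kaggle_resource_type": "datasets",
        "kaggle_resource_name": "anushonkar/network-anamoly-detection",
    },
    "/tmp/outputs/output_datasource_directory/data",  # placeholder path
)
print(argv)
```

Because `log_level` is optional, `--log-level` is omitted here and the script falls back to its default of INFO.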
#Software testing

In [None]:
!rm -rf /tmp/my_local_dir_for_test
!chmod 544 run_tests.sh
!cp kaggle_download.py /tmp
!cp test_validation.py /tmp
!./run_tests.sh

In [None]:
! pip3 install kfp==1.8.12

First validate the component YAML file (component_output_as_artifact.yaml) at http://www.yamllint.com/. Once the file is valid, execute the cell below as a final check: it loads the component with kfp and compiles a sample pipeline.

In [None]:
import kfp
from kubernetes import client as k8s_client

kaggle_download_op_out_to_artifact = kfp.components.load_component_from_file('component_output_as_artifact.yaml')

@kfp.dsl.pipeline(name="testpipeline1")
def my_sample_pipeline():
    op = kaggle_download_op_out_to_artifact(
                                kaggle_environment_var = '{"KAGGLE_CONFIG_DIR":"/mnt"}', 
                                kaggle_resource_type = 'datasets',
                                kaggle_resource_name = 'anushonkar/network-anamoly-detection',
                                )
    op.add_volume(k8s_client.V1Volume(
        name="kaggle-json-volume", #kubernetes volume names must be DNS labels: lowercase alphanumerics and '-' only, no underscores
        secret=k8s_client.V1SecretVolumeSource(secret_name="kaggle-json-secrets")) #kaggle-json-secrets should contain kaggle.json
    )
    op.add_volume_mount(k8s_client.V1VolumeMount(
                                          mount_path='/mnt',
                                          name='kaggle-json-volume')
    )
    directory_where_files_downloaded = op.outputs['output_datasource_directory']


kfp.compiler.Compiler().compile(pipeline_func=my_sample_pipeline,package_path='my_sample_pipeline.yaml')
kfp.v2.compiler.Compiler().compile(pipeline_func=my_sample_pipeline,package_path='my_sample_pipeline_v2.json')

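The pipeline above mounts a Kubernetes secret that holds kaggle.json. One way to prepare such a secret is to generate its manifest and apply it with kubectl, which accepts JSON as well as YAML. The sketch below uses only the standard library; `build_kaggle_secret_manifest` is a hypothetical helper, the credentials are placeholders, and the hyphenated secret name is an assumption (Kubernetes object names cannot contain underscores).

```python
import base64
import json


def build_kaggle_secret_manifest(secret_name, username, key):
    """Return a Secret manifest whose single entry is a kaggle.json file."""
    kaggle_json = json.dumps({"username": username, "key": key})
    # Values under `data` must be base64-encoded.
    encoded = base64.b64encode(kaggle_json.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": secret_name},
        "type": "Opaque",
        "data": {"kaggle.json": encoded},
    }


manifest = build_kaggle_secret_manifest(
    "kaggle-json-secrets", "placeholder-user", "placeholder-key"
)
print(json.dumps(manifest, indent=2))
```

When the secret is mounted at `/mnt`, the `kaggle.json` key appears as the file `/mnt/kaggle.json`, which is where the kaggle client looks when KAGGLE_CONFIG_DIR is set to `/mnt`.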

In [None]:
import kfp
from kubernetes import client as k8s_client

kaggle_download_op_out_to_artifact = kfp.components.load_component_from_file('component_output_as_artifact.yaml')

@kfp.dsl.pipeline(name="testpipeline2")
def my_sample_pipeline():
    op = kaggle_download_op_out_to_artifact(
                                kaggle_environment_var = '{}', #empty json object; an empty string would fail json.loads in kaggle_download.py
                                kaggle_resource_type = 'datasets',
                                kaggle_resource_name = 'anushonkar/network-anamoly-detection',
                                )
    op.add_volume(k8s_client.V1Volume(
        name="kaggle-json-volume", #kubernetes object names cannot contain underscores
        secret=k8s_client.V1SecretVolumeSource(secret_name="kaggle-json-secrets")) #kaggle-json-secrets should contain KAGGLE_USERNAME and KAGGLE_KEY
    )
    envs = [
        ("KAGGLE_USERNAME", "KAGGLE_USERNAME"),
        ("KAGGLE_KEY", "KAGGLE_KEY")
    ]
    for env_name, key in envs:
        op.add_env_variable(
            k8s_client.V1EnvVar(
                name=env_name,
                value_from=k8s_client.V1EnvVarSource(secret_key_ref=k8s_client.V1SecretKeySelector(
                    name="kaggle-json-secrets",
                    key=key
                    )
                )
            )
        )
    directory_where_files_downloaded = op.outputs['output_datasource_directory']


kfp.compiler.Compiler().compile(pipeline_func=my_sample_pipeline,package_path='my_sample_pipeline2.yaml')
kfp.v2.compiler.Compiler().compile(pipeline_func=my_sample_pipeline,package_path='my_sample_pipeline2_v2.json')


#Push the code to github

Before committing code to GitHub, install the GitHub CLI (gh) by following the instructions at https://github.com/cli/cli/blob/trunk/docs/install_linux.md (choose the Debian/Ubuntu Linux installation method).

Use Colab's 'Terminal' icon in the left vertical pane to open a Linux terminal. Once 'gh' is installed, run **gh auth login** (refer https://docs.github.com/en/get-started/getting-started-with-git/caching-your-github-credentials-in-git) and follow the on-screen prompts. For Colab, choose the **Paste an authentication token** option. Personal access tokens can be generated at https://github.com/settings/tokens

You can use the Shift+Ctrl+V shortcut to paste any string into the Colab console.

In [None]:
!pwd

In [None]:
!rm -Rf kfpcomponent

In [None]:
!git clone https://github.com/ShaswataJash/kfpcomponent.git

Follow the directory structure described in https://www.kubeflow.org/docs/components/pipelines/sdk/component-development/#organizing-the-component-files

In [None]:
!mkdir -p kfpcomponent/KaggleDatasetFetcher
!mkdir -p kfpcomponent/KaggleDatasetFetcher/src
!mkdir -p kfpcomponent/KaggleDatasetFetcher/tests

In [None]:
#rsync -c compares file checksums, so a file is copied into the git repo only if its content has changed
!rsync -c kaggle_download.py kfpcomponent/KaggleDatasetFetcher/src
!rsync -c component_output_as_artifact.yaml kfpcomponent/KaggleDatasetFetcher/component_output_as_artifact.yaml
!rsync -c test_validation.py kfpcomponent/KaggleDatasetFetcher/tests
!rsync -c Dockerfile kfpcomponent/KaggleDatasetFetcher/
!rsync -c run_tests.sh kfpcomponent/KaggleDatasetFetcher/

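The checksum-based copy done with `rsync -c` above can be approximated in Python when rsync is not available. The sketch below is not rsync's algorithm: it simply compares SHA-256 digests of source and destination. `file_digest` and `copy_if_changed` are hypothetical helpers, and the demo runs in a temporary directory.

```python
import hashlib
import os
import shutil
import tempfile


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_if_changed(src, dst):
    """Copy src to dst only when the content differs; return True if copied."""
    if os.path.isdir(dst):
        dst = os.path.join(dst, os.path.basename(src))
    if os.path.exists(dst) and file_digest(src) == file_digest(dst):
        return False
    shutil.copy2(src, dst)
    return True


work_dir = tempfile.mkdtemp(prefix="copy-if-changed-")
src = os.path.join(work_dir, "kaggle_download.py")
dst_dir = os.path.join(work_dir, "repo_src")
os.makedirs(dst_dir)
with open(src, "w") as fh:
    fh.write("print('v1')\n")

first = copy_if_changed(src, dst_dir)   # destination missing, so it is copied
second = copy_if_changed(src, dst_dir)  # identical content, so it is skipped
print(first, second)
```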
In [None]:
%cd kfpcomponent

In [None]:
!git add -A

In [None]:
!git status

For git users who have set their email visibility to private, GitHub provides an alternate (noreply) email address to use in web-based Git operations, e.g., edits and merges. The alias email can be viewed at https://github.com/settings/emails

In [None]:
!git config --global user.email "29448766+ShaswataJash@users.noreply.github.com"

In [None]:
!git commit -a -m "corrected component.yaml"

In [None]:
!git push origin main

In [None]:
%cd ..