
dataflow-tfrecord

This repository is a reference ETL pipeline for creating TF-Records using the Apache Beam Python SDK on Google Cloud Dataflow. You may find the blog post for this repo here.

To run this pipeline:

Step 1:

First, have a CSV file in the following format in the GCS bucket,

gs://path/img1.png,label1
gs://path/img2.png,label2
...

and corresponding dummy square images of the same size stored in the GCS bucket at the matching paths.
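
If you just need placeholder data to exercise the pipeline, the dummy images and the CSV can be generated locally and then copied to GCS (e.g. with gsutil -m cp -r dummy_data gs://<bucket-name>/). A minimal sketch, assuming Pillow is installed; the bucket name, file names, and labels below are placeholders, not values from this repo:

# generate_dummy_data.py -- hypothetical helper, not part of this repo.
# Creates a few square dummy images plus a matching CSV, ready to copy to GCS.
import csv
import os

from PIL import Image

IMG_SIZE = 28                    # must match IMG_SIZE in create_tfrecords.py
LABELS = ['label1', 'label2', 'label3']
OUT_DIR = 'dummy_data'
BUCKET_PREFIX = 'gs://<bucket-name>/dummy_data'  # where the images will live in GCS

os.makedirs(OUT_DIR, exist_ok=True)
rows = []
for i, label in enumerate(LABELS):
    name = 'img%d.png' % i
    # Solid-color square image; the pixel content doesn't matter for a dry run.
    Image.new('RGB', (IMG_SIZE, IMG_SIZE), color=(i * 60, 0, 0)).save(
        os.path.join(OUT_DIR, name))
    rows.append(['%s/%s' % (BUCKET_PREFIX, name), label])

with open(os.path.join(OUT_DIR, 'images.csv'), 'w', newline='') as f:
    csv.writer(f).writerows(rows)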

Step 2:

Before running the pipeline, make sure you initialize the following variables in create_tfrecords/create_tfrecords.py:

# TODO: Initialize the variables below
LABEL_DICT = {
    'label1': 0,
    'label2': 1,
    'label3': 2,
}
NUM_CLASS = len(LABEL_DICT)
IMG_SIZE = 28  # TODO: Enter your own int value for a square image

PROJ_NAME = 'Your Project Name'

CSV_PATH = 'gs://<bucket-name>/path-to.csv'
RUNNER = 'DataflowRunner'
STAGING_LOCATION = 'gs://<bucket-name>/staging/'
TEMP_LOCATION = 'gs://<bucket-name>/temp/'
TEMPLATE_LOCATION = 'gs://<bucket-name>/path/to/template_location/template_name'
JOB_NAME = 'random-job-name'
OUTPUT_PATH = 'gs://<bucket-name>/output_path/'
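
For orientation, these variables typically feed into Beam's pipeline options, and a pipeline like this boils down to: read the CSV, fetch and decode each image, serialize it as a tf.train.Example, and write the results with WriteToTFRecord. A minimal sketch of that shape, assuming TensorFlow and apache-beam[gcp] are installed and the variables above are in scope; the function names here are illustrative, not the repo's actual code:

import apache_beam as beam
import tensorflow as tf
from apache_beam.options.pipeline_options import PipelineOptions

# Wire the variables above into Dataflow pipeline options.
options = PipelineOptions(
    runner=RUNNER,
    project=PROJ_NAME,
    job_name=JOB_NAME,
    staging_location=STAGING_LOCATION,
    temp_location=TEMP_LOCATION,
    template_location=TEMPLATE_LOCATION,
)

def to_tf_example(line):
    """Turn one 'gs://path,label' CSV row into a serialized tf.train.Example."""
    img_path, label = line.split(',')
    with tf.io.gfile.GFile(img_path, 'rb') as f:
        img_bytes = f.read()
    example = tf.train.Example(features=tf.train.Features(feature={
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(
            value=[LABEL_DICT[label.strip()]])),
    }))
    return example.SerializeToString()

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadCSV' >> beam.io.ReadFromText(CSV_PATH)
     | 'ToTFExample' >> beam.Map(to_tf_example)
     | 'WriteTFRecord' >> beam.io.WriteToTFRecord(OUTPUT_PATH,
                                                  file_name_suffix='.tfrecord'))

Note that with TEMPLATE_LOCATION set, running the script stages a reusable Dataflow template at that GCS path instead of executing the job immediately.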

Step 3:

Now, in order to run the pipeline on a Google Cloud VM instance, you may run:

bash run.sh
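
The contents of run.sh are not reproduced here; a plausible sketch, assuming it installs the dependencies and launches the pipeline script directly (both lines are assumptions, so check the actual script in this repo):

#!/bin/bash
# Hypothetical run.sh sketch -- see the actual script in this repo.
pip install "apache-beam[gcp]" tensorflow   # assumed dependencies
python create_tfrecords/create_tfrecords.py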
