# Data Ingestion

The data ingestion step obtains the data for an ML process. In this step we need to consider the training/test sets, the type of data we are ingesting (text, structured, images, etc.), whether multiple sources must be combined, and so on.

In the ingestion step, before passing the data to the next step, we need to split it into training/validation sets and then convert those into `TFRecord` files containing the data represented as `tf.Example` data structures.

`TFRecord` is a lightweight format optimized for streaming large datasets. In practice, many TensorFlow users store serialized `tf.Example` Protocol Buffers in TFRecord files, but the format supports any binary data, as the example below shows.

In [6]:
import tensorflow as tf

with tf.io.TFRecordWriter("data/test.tfrecord") as w:
    w.write(b"First Record")
    w.write(b"Second Record")

for record in tf.data.TFRecordDataset("data/test.tfrecord"):
    print(record)


tf.Tensor(b'First Record', shape=(), dtype=string)
tf.Tensor(b'Second Record', shape=(), dtype=string)


TFRecord files usually contain tf.Example records (each of which acts like a row, in my view). More details can be found in the [TensorFlow Docs](https://www.tensorflow.org/tutorials/load_data/tfrecord).

In general, storing our data as TFRecords of tf.Examples provides several benefits: system independence, since it is implemented using `Protocol Buffers`, a cross-platform, cross-language library for serializing data; optimizations for downloading and writing large amounts of data quickly; and compatibility with the TensorFlow ecosystem in general.

> The process of ingesting, splitting and converting datasets is performed using the `ExampleGen` component of TFX.

### Ingesting Local data files

The ExampleGen components of TFX support ingesting CSVs, precomputed TFRecord files and serialized output from Apache Avro and Parquet files.

The following code segment demonstrates the ingestion of a folder containing CSVs.

In [8]:
import tensorflow as tf
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

context = InteractiveContext()



In [9]:
import os
from tfx.components import CsvExampleGen


base_dir = os.getcwd()
data_dir = os.path.join(base_dir, "data", "CSVs")

example_gen = CsvExampleGen(input_base=data_dir)

context.run(example_gen)



Output (abridged from the interactive TFX widget):

    .execution_id: 1
    .component: CsvExampleGen
    .component.inputs: {}
    .component.outputs['examples']: Channel of type 'Examples' (1 artifact)
        .uri: C:\Users\ddsdi\AppData\Local\Temp\tfx-interactive-2022-07-20T19_17_59.640495-11c9ykx4\CsvExampleGen\examples\1
        .span: 0
        .split_names: ["train", "eval"]
        .version: 0
    .exec_properties:
        ['input_base']: c:\Users\ddsdi\Desktop\Machine Learning Pipelines\data\CSVs
        ['input_config']: {"splits": [{"name": "single_split", "pattern": "*"}]}
        ['output_config']: {"split_config": {"splits": [{"hash_buckets": 2, "name": "train"}, {"hash_buckets": 1, "name": "eval"}]}}
        ['output_data_format']: 6
        ['output_file_format']: 5
        ['span']: 0
        ['input_fingerprint']: split:single_split,num_files:1,total_bytes:78956236,xor_checksum:1658167881,sum_checksum:1658167881


But in many cases, such as images, our data cannot be expressed as CSVs. In such cases it is recommended to convert the data to the TFRecord format and then load it with the TFX `ImportExampleGen` component. We'll look at converting data to TFRecords later; loading such data works as follows.

In [None]:
import os
from tfx.components import ImportExampleGen

base_dir = os.getcwd()
data_dir = os.path.join(base_dir, "data", "TFRecords")

example_gen = ImportExampleGen(input_base=data_dir)  # To load TFRecord formatted data.

context.run(example_gen)

Since TFRecords already contain data in the tf.Example format, no conversion is needed, hence the direct `ImportExampleGen`.

There may also be cases where your data is in a format for which TFX provides no dedicated ExampleGen component. In such cases, we can override the 'executor' part of the ExampleGen component (remember the architecture of components!).

We will use the generic file loader `FileBasedExampleGen`, which allows us to override the executor class, here to read Parquet files.

In [None]:
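import os
from tfx.components import FileBasedExampleGen
from tfx.components.example_gen.custom_executors import parquet_executor
from tfx.dsl.components.base import executor_spec

base_dir = os.getcwd()
data_dir = os.path.join(base_dir, "data", "Parquets")

# Override the default executor with the Parquet one. The executor class
# must be wrapped in an executor spec rather than passed directly.
example_gen = FileBasedExampleGen(
    input_base=data_dir,
    custom_executor_spec=executor_spec.BeamExecutorSpec(parquet_executor.Executor))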


Sometimes it is more convenient to convert our data to TFRecord files before using it. Assume our data is in JSON or XML format. We can convert it to the TFRecord format and use it with the ImportExampleGen component. But first we need to understand the tf.Example/TFRecord data structures.

Generally speaking, a tf.Example is a key-value mapping (like a Python dictionary). A tf.Example (`tf.train.Example`) holds a `tf.train.Features` object, which contains a dictionary of features as key-value pairs. The key is always a string naming the feature and the value is a `tf.train.Feature` object. A TFRecord file consists of such tf.Example records.

e.g. the structure of a TFRecord file:

- Record 1:
    - tf.train.Example
        * tf.train.Features
            * 'column A': tf.train.Feature
            * 'column B': tf.train.Feature
            * 'column C': tf.train.Feature

Here a tf.train.Feature holds exactly one of three list types: tf.train.BytesList, tf.train.FloatList or tf.train.Int64List.
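
To make the nesting above concrete, here is a minimal sketch that builds one tf.train.Example in memory and inspects which list type each feature carries. The feature names and values are made up for illustration; only the `tf.train` classes come from TensorFlow.

```python
import tensorflow as tf

# One record with three hypothetical columns, one per supported list type.
example = tf.train.Example(features=tf.train.Features(feature={
    "column_a": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"text"])),
    "column_b": tf.train.Feature(float_list=tf.train.FloatList(value=[0.5])),
    "column_c": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
}))

# Each Feature is a protobuf "oneof": exactly one of the three lists is set.
kinds = {
    name: feature.WhichOneof("kind")
    for name, feature in example.features.feature.items()
}
print(kinds)

# Serializing and parsing again gives back an equal message.
restored = tf.train.Example.FromString(example.SerializeToString())
print(restored == example)
```

Because the value is always a list, a single scalar is stored as a one-element list, which is why the helper functions in this notebook wrap their argument in `[value]`.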

With those in mind, we can convert our data into the TFRecord format as below.

In [18]:
import tensorflow as tf

def _bytes_features(value):
    if isinstance(value, str):
        value = value.encode('utf8')
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _floats_features(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_features(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

# Some zip codes end in a masked "XX" suffix. Replace it with "00" so the code parses as an integer.
def convert_zipcode_to_int(zipcode):
    if isinstance(zipcode, str) and "XX" in zipcode:
        zipcode = zipcode.replace("XX", "00")
    int_zipcode = int(zipcode)
    return int_zipcode

# To fix empty zip code issue
def clean_rows(row):
    if not row["zip_code"]:
        row["zip_code"] = "99999"
    return row

import csv
import os

original_data = os.path.join('data', 'consumer_complaints_with_narrative.csv')

tfrecord_file_name = 'data/consumer_complaints.tfrecord'

# The context managers close both files even if a row fails to convert.
with open(original_data, newline='') as csv_file, \
        tf.io.TFRecordWriter(tfrecord_file_name) as tf_record_writer:
    reader = csv.DictReader(csv_file, delimiter=',', quotechar='"')

    for row in reader:
        row = clean_rows(row)
        # One tf.train.Example per CSV row, one tf.train.Feature per column.
        example_row = tf.train.Example(features=tf.train.Features(feature={
            "product": _bytes_features(row["product"]),
            "sub_product": _bytes_features(row["sub_product"]),
            "issue": _bytes_features(row["issue"]),
            "sub_issue": _bytes_features(row["sub_issue"]),
            "state": _bytes_features(row["state"]),
            "zip_code": _int64_features(convert_zipcode_to_int(row["zip_code"])),
            "company": _bytes_features(row["company"]),
            "company_response": _bytes_features(row["company_response"]),
            "consumer_complaint_narrative": _bytes_features(row["consumer_complaint_narrative"]),
            "timely_response": _bytes_features(row["timely_response"]),
            "consumer_disputed": _bytes_features(row["consumer_disputed"]),
        }))

        tf_record_writer.write(example_row.SerializeToString())


Notice how we have used the previously mentioned data structures to build a TFRecord dataset. More details can be found in the [TensorFlow Documentation](https://www.tensorflow.org/tutorials/load_data/tfrecord).

The converted data can then be imported directly using the ImportExampleGen component.
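
Before handing a converted file to TFX, it can help to read a few records back and confirm they parse. The sketch below is self-contained: it writes one small record to a temporary file (the path and the two feature values are made up for illustration) and parses it again with a feature spec that mirrors the types used when writing.

```python
import os
import tempfile

import tensorflow as tf

# Write a single illustrative record to a temporary TFRecord file.
example = tf.train.Example(features=tf.train.Features(feature={
    "product": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"Debt collection"])),
    "zip_code": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[12345])),
}))
path = os.path.join(tempfile.mkdtemp(), "sample.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# The feature spec must match the types that were written.
feature_spec = {
    "product": tf.io.FixedLenFeature([], tf.string),
    "zip_code": tf.io.FixedLenFeature([], tf.int64),
}

parsed_rows = []
for serialized in tf.data.TFRecordDataset(path):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    parsed_rows.append({
        "product": parsed["product"].numpy().decode("utf8"),
        "zip_code": int(parsed["zip_code"].numpy()),
    })

print(parsed_rows)
```

The same loop should work on `data/consumer_complaints.tfrecord` if the feature spec is extended to list every column written above.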

> TFX ExampleGen supports reading files directly from remote cloud storage using their URLs. The exact method depends on the security requirements, so it is best to check the documentation in such cases.

> ExampleGen components also support reading data directly from databases. These require credentials to access the data sources and therefore more configuration. We can also write custom executors if needed.

### Data Preparation using TFX

As we know, ExampleGen components allow us to configure input and output settings. We can define a `span` if we need to ingest datasets incrementally, and we can define how our data should be split. Below are some examples of such scenarios.

**Splitting data to subsets (train, val and test)**

Here we show how to use TFX to split a dataset.

In [22]:
from tfx.components import CsvExampleGen
from tfx.proto import example_gen_pb2
import os

data_dir = os.path.join('data', 'CSVs')

output = example_gen_pb2.Output(split_config= example_gen_pb2.SplitConfig(splits=[
    example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=6), # Hash buckets define ratios
    example_gen_pb2.SplitConfig.Split(name='test', hash_buckets=2),
    example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=2)
] ))

example_gen = CsvExampleGen(data_dir, output_config=output)
context.run(example_gen)

Output (abridged from the interactive TFX widget):

    .execution_id: 4
    .component: CsvExampleGen
    .component.inputs: {}
    .component.outputs['examples']: Channel of type 'Examples' (1 artifact)
        .uri: C:\Users\ddsdi\AppData\Local\Temp\tfx-interactive-2022-07-20T19_17_59.640495-11c9ykx4\CsvExampleGen\examples\4
        .span: 0
        .split_names: ["train", "test", "eval"]
        .version: 0
    .exec_properties:
        ['input_base']: data\CSVs
        ['input_config']: {"splits": [{"name": "single_split", "pattern": "*"}]}
        ['output_config']: {"split_config": {"splits": [{"hash_buckets": 6, "name": "train"}, {"hash_buckets": 2, "name": "test"}, {"hash_buckets": 2, "name": "eval"}]}}
        ['output_data_format']: 6
        ['output_file_format']: 5
        ['span']: 0
        ['input_fingerprint']: split:single_split,num_files:1,total_bytes:78956236,xor_checksum:1658167881,sum_checksum:1658167881

In the code above, the hash_buckets parameter defines the ratio in which the dataset is split, in this case 6:2:2. (More details are in the [Documentation](https://www.tensorflow.org/tfx/guide/examplegen#custom_inputoutput_split).)

There may also be cases where the splits were already created externally and we need to preserve them while ingesting. In such cases we can change the input config as follows.
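
The arithmetic behind the ratio can be sketched in plain Python. The `split_fractions` helper is straightforward; `assign_split` is only an illustration of the general idea of hashing a record into one of the buckets. It uses SHA-256 as an assumption and is not the hash function TFX uses internally.

```python
import hashlib


def split_fractions(hash_buckets):
    """Turn bucket counts into the fraction of data each split receives."""
    total = sum(hash_buckets.values())
    return {name: count / total for name, count in hash_buckets.items()}


def assign_split(record_key, hash_buckets):
    """Illustrative only: hash a record into a bucket, then map it to a split."""
    total = sum(hash_buckets.values())
    digest = hashlib.sha256(record_key.encode("utf8")).hexdigest()
    bucket = int(digest, 16) % total
    upper = 0
    for name, count in hash_buckets.items():
        upper += count  # each split owns a contiguous range of buckets
        if bucket < upper:
            return name
    raise AssertionError("bucket index is always below the total")


buckets = {"train": 6, "test": 2, "eval": 2}
fractions = split_fractions(buckets)
print(fractions)
print(assign_split("row-1", buckets))
```

Because the assignment depends only on the record's hash, the same record lands in the same split on every run, which keeps splits stable across re-ingestion.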

In [None]:
from tfx.components import CsvExampleGen
from tfx.proto import example_gen_pb2
import os

data_dir = os.path.join('data', 'CSVs')

input_config = example_gen_pb2.Input(splits=[
    example_gen_pb2.Input.Split(name='train', pattern="train/*"),
    example_gen_pb2.Input.Split(name='test', pattern="test/*"),
    example_gen_pb2.Input.Split(name='eval', pattern="eval/*")
] )

example_gen = CsvExampleGen(input_base=data_dir, input_config=input_config)
# context.run(example_gen)
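
As a rough illustration of what the patterns above select, the sketch below matches paths relative to the input directory against each split's pattern using Python's `fnmatch`. The file names are hypothetical, and TFX's own file matching may differ in details; this only shows the intent of the configuration.

```python
from fnmatch import fnmatchcase

# Same split names and patterns as in the input config above.
split_patterns = {"train": "train/*", "test": "test/*", "eval": "eval/*"}


def split_for(relative_path):
    """Return the first split whose pattern matches, or None if no split does."""
    for name, pattern in split_patterns.items():
        if fnmatchcase(relative_path, pattern):
            return name
    return None


# Hypothetical contents of data/CSVs, given relative to that directory.
files = [
    "train/part-0.csv",
    "train/part-1.csv",
    "test/part-0.csv",
    "eval/part-0.csv",
    "README.md",
]
assignment = {path: split_for(path) for path in files}
print(assignment)
```

Files that match no pattern, such as the README here, are simply not part of any split.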

### Spanning Datasets

In many ML scenarios new data arrives over time rather than all at once, and you need to process it once it is available. To help in these use cases, the ExampleGen component allows us to use `spans`. We can think of a span as a snapshot of the data, such as an hourly or weekly export or a batch from an ETL process.

From a practical standpoint, this means pattern-based data ingestion that picks up the latest span. Check the [documentation](https://www.tensorflow.org/tfx/guide/examplegen#span) for details.

Example usage is below.

In [None]:
from tfx.components import CsvExampleGen
from tfx.proto import example_gen_pb2
import os

data_dir = os.path.join('data', 'CSVs')

input_config = example_gen_pb2.Input(splits=[example_gen_pb2.Input.Split(pattern='export-{SPAN}/*')])

example_gen = CsvExampleGen(input_base=data_dir, input_config=input_config)
# context.run(example_gen)

Note the pattern given to the Input.Split() object. It describes how the data versions are laid out in the input location. For example, with the code above we expect the data directories to look like this:

- data/CSVs/
    - export-0/*.csv
    - export-1/*.csv
    - export-2/*.csv

This is important because, in ML, data versioning helps us keep track of which model used which dataset. We can also use that metadata to make results reproducible end to end.
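
The idea of picking the latest span can be sketched in plain Python. This is only an illustration of how a `{SPAN}` placeholder could be turned into a regular expression; the `find_spans` helper and the directory names are hypothetical and this is not TFX's implementation.

```python
import re


def find_spans(dir_names, pattern="export-{SPAN}"):
    """Map each span number to the directory name that matches the pattern."""
    prefix, suffix = pattern.split("{SPAN}")
    regex = re.compile(re.escape(prefix) + r"(\d+)" + re.escape(suffix))
    spans = {}
    for name in dir_names:
        match = regex.fullmatch(name)
        if match:
            spans[int(match.group(1))] = name
    return spans


# Hypothetical directory listing of the input location.
dirs = ["export-0", "export-1", "export-2", "notes"]
spans = find_spans(dirs)
latest = max(spans)
print(latest, spans[latest])
```

Comparing span numbers as integers rather than strings matters once there are ten or more spans, since `"export-10"` sorts before `"export-2"` alphabetically.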


### Ingestion Strategies

Depending on the type of data we are using, we need to choose the proper ExampleGen component, or at least convert the data to an efficient format that TFX can read directly.

For example, structured data can be stored as CSV or in a database such as Presto, both of which TFX can read directly. Text corpora for NLP can be converted to the TFRecord format or to Apache Parquet. Image data can be stored as byte strings, as below.




In [None]:
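# Example code. Do not run!
# Uses the _bytes_features and _int64_features helpers defined earlier.

tfrecord_path = "path to the output TFRecord file"
img_paths_list = []  # paths of the images to convert

def get_image_lbl(img_path):
    # Placeholder: return the integer label for the image at img_path.
    raise NotImplementedError

with tf.io.TFRecordWriter(tfrecord_path) as writer:
    for img_path in img_paths_list:
        try:
            raw_file = tf.io.read_file(img_path)
        except tf.errors.NotFoundError:
            print("File could not be found!", img_path)
            continue
        # Store the raw image bytes together with the image's label.
        example = tf.train.Example(features=tf.train.Features(feature={
            'image_raw': _bytes_features(raw_file.numpy()),
            'label': _int64_features(get_image_lbl(img_path))
        }))
        writer.write(example.SerializeToString())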