# Data Ingestion

The data ingestion step obtains the data for an ML pipeline. At this step we need to consider how the data is split into training and test sets, what type of data we are ingesting (text, structured, images, etc.), and whether multiple sources need to be combined.

Before passing the data to the next step, we need to split it into training and validation sets and then convert those into `TFRecord` files containing the data represented as `tf.Example` data structures.

`TFRecord` is a lightweight format optimized for streaming large datasets. In practice, many TensorFlow users store serialized example Protocol Buffers in TFRecord files, but the format supports any binary data, as the example below shows.

In [6]:
import tensorflow as tf

with tf.io.TFRecordWriter("data/test.tfrecord") as w:
    w.write(b"First Record")
    w.write(b"Second Record")

for record in tf.data.TFRecordDataset("data/test.tfrecord"):
    print(record)


tf.Tensor(b'First Record', shape=(), dtype=string)
tf.Tensor(b'Second Record', shape=(), dtype=string)


TFRecord files usually contain tf.Example records (which act as rows, IMO). More details can be found in the [TensorFlow docs](https://www.tensorflow.org/tutorials/load_data/tfrecord).

In general, storing our data as TFRecord files of tf.Example records provides several benefits: system independence, since the format is built on `Protocol Buffers`, a cross-platform, cross-language library for serializing data; optimizations for downloading or writing large amounts of data quickly; and compatibility with the TensorFlow ecosystem in general.

> The process of ingesting, splitting and converting datasets is performed by the `ExampleGen` component of TFX.
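
To give a feel for the splitting part, here is a minimal pure-Python sketch of hash-based splitting. It is an illustration only, not TFX's implementation: the `assign_split` helper, the `SPLIT_BUCKETS` configuration, the 2:1 ratio and the choice of MD5 as the hash are all assumptions made for this example.

```python
import hashlib
from collections import Counter

# Hypothetical split configuration: 2 hash buckets for train, 1 for eval (a 2:1 ratio).
SPLIT_BUCKETS = [("train", 2), ("eval", 1)]


def assign_split(record: bytes, split_buckets=SPLIT_BUCKETS) -> str:
    """Map a serialized record to a split name using a stable hash."""
    total = sum(n for _, n in split_buckets)
    # A stable digest (unlike Python's built-in hash()) yields the same bucket on every run.
    bucket = int.from_bytes(hashlib.md5(record).digest(), "big") % total
    upper = 0
    for name, n in split_buckets:
        upper += n
        if bucket < upper:
            return name
    raise AssertionError("unreachable: bucket is always below the total")


# Made-up records standing in for serialized rows.
records = [f"row-{i}".encode() for i in range(3000)]
counts = Counter(assign_split(r) for r in records)
print(counts)
```

Because the assignment depends only on the record's content, the same record always lands in the same split, and the split sizes approach the configured ratio as the number of records grows.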

### Ingesting Local data files

The ExampleGen component of TFX supports ingesting CSVs, precomputed TFRecord files, and serialized output from Apache Avro and Parquet files.

The following code segment demonstrates the ingestion of a folder containing CSVs.

In [8]:
import tensorflow as tf
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

context = InteractiveContext()



In [9]:
import os
from tfx.components import CsvExampleGen


base_dir = os.getcwd()
data_dir = os.path.join(base_dir, "data", "CSVs")

example_gen = CsvExampleGen(input_base=data_dir)  # Reads every CSV file in data_dir.

context.run(example_gen)



.execution_id: 1
.component: CsvExampleGen
.component.inputs: {}
.component.outputs['examples']: Channel of type 'Examples' (1 artifact)
    .type: <class 'tfx.types.standard_artifacts.Examples'>
    .uri: C:\Users\ddsdi\AppData\Local\Temp\tfx-interactive-2022-07-20T19_17_59.640495-11c9ykx4\CsvExampleGen\examples\1
    .span: 0
    .split_names: ["train", "eval"]
    .version: 0
.exec_properties:
    ['input_base']: c:\Users\ddsdi\Desktop\Machine Learning Pipelines\data\CSVs
    ['input_config']: {"splits": [{"name": "single_split", "pattern": "*"}]}
    ['output_config']: {"split_config": {"splits": [{"hash_buckets": 2, "name": "train"}, {"hash_buckets": 1, "name": "eval"}]}}
    ['output_data_format']: 6
    ['output_file_format']: 5
    ['custom_config']: None
    ['range_config']: None
    ['span']: 0
    ['version']: None
    ['input_fingerprint']: split:single_split,num_files:1,total_bytes:78956236,xor_checksum:1658167881,sum_checksum:1658167881


But in many cases our data, such as images, cannot be expressed as CSVs. In such cases it is recommended to convert the data to the TFRecord format and then load it with the TFX `ImportExampleGen` component. We'll look at converting data to TFRecords later; loading the data works as follows.

In [None]:
import os
from tfx.components import ImportExampleGen

base_dir = os.getcwd()
data_dir = os.path.join(base_dir, "data", "TFRecords")

example_gen = ImportExampleGen(input_base=data_dir)  # Load data already stored as TFRecords.

context.run(example_gen)

Since TFRecord files already contain data in tf.Example format, no conversion is needed, hence the direct `ImportExampleGen`.

But there may be cases where you have a file type that has no dedicated ExampleGen component. In such cases, we can override the executor part of the ExampleGen component (remember the architecture of components!).

We will use the generic file loader `FileBasedExampleGen`, which allows us to override the executor, here with one that reads Parquet files.

In [None]:
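import os

from tfx.components import FileBasedExampleGen
from tfx.components.example_gen.custom_executors import parquet_executor
from tfx.dsl.components.base import executor_spec

base_dir = os.getcwd()
data_dir = os.path.join(base_dir, "data", "Parquets")

# Override the default executor with the Parquet executor.
# The executor class has to be wrapped in an ExecutorClassSpec.
example_gen = FileBasedExampleGen(
    input_base=data_dir,
    custom_executor_spec=executor_spec.ExecutorClassSpec(parquet_executor.Executor),
)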


Sometimes it is more convenient to convert our data to TFRecord files before using it. Suppose our data is in JSON or XML format. We can convert it to the TFRecord format and load it with the ImportExampleGen component. But first we need to understand the tf.Example and TFRecord data structures.

Generally speaking, a tf.Example is a key-value mapping (like a Python dictionary). A tf.Example holds a tf.train.Features object, which contains a dictionary of features as key-value pairs. The `key` is always a string naming the feature, and the value is a tf.train.Feature object. A TFRecord file consists of such tf.Example objects.

For example, a TFRecord file is structured as follows:

- Record 1:
    - tf.Example
        * tf.train.Features
            * 'column A': tf.train.Feature
            * 'column B': tf.train.Feature
            * 'column C': tf.train.Feature

Here, each tf.train.Feature wraps one of three list types: tf.train.BytesList, tf.train.FloatList or tf.train.Int64List.
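
The nesting described above can be sketched in a few lines. This is a minimal illustration rather than a full conversion routine: the helper functions `bytes_feature`, `float_feature` and `int64_feature` and the sample values are made up for this example, while the column names are taken from the outline above.

```python
import tensorflow as tf


def bytes_feature(value: bytes) -> tf.train.Feature:
    """Wrap a bytes value in a Feature holding a BytesList."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def float_feature(value: float) -> tf.train.Feature:
    """Wrap a float value in a Feature holding a FloatList."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))


def int64_feature(value: int) -> tf.train.Feature:
    """Wrap an integer value in a Feature holding an Int64List."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


# One record: an Example holding a Features object, which maps string keys to Feature values.
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "column A": bytes_feature(b"some text"),  # illustrative value
            "column B": float_feature(0.5),           # illustrative value
            "column C": int64_feature(42),            # illustrative value
        }
    )
)

# Serialize to the binary form that is stored in a TFRecord file, then parse it back.
serialized = example.SerializeToString()
restored = tf.train.Example.FromString(serialized)
print(restored.features.feature["column C"].int64_list.value[0])
```

Each feature value is a list, even when it holds a single element, which is why every helper wraps its argument in `[value]`.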

With those in mind, we can convert our data into the TFRecord format as below.