# **DataPipeline Guide**

This notebook is a guide to DataPipeline, a configurable streaming dataset loader. DataPipeline's core job is to convert each row of a tabular dataset into a vector that Bolt can consume. What makes DataPipeline unique is that it lets you configure which features to extract from a dataset and how to encode them as vectors. We do this with what we call "blocks": objects that read specific columns of each row in a dataset and add a feature segment to that row's vector.
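
To make the block idea concrete, here is a minimal pure-Python sketch of how blocks might compose a row's vector. It is not the thirdai implementation: `ToyCategoricalBlock`, `ToyTextBlock`, and `featurize` are hypothetical names, the sparse vector is modeled as a plain dict from index to value, and hashing words with CRC32 is an assumption made for illustration. Each block reads one column and fills its own segment, and the segments are laid out back to back.

```python
import zlib
from typing import Dict, List


class ToyCategoricalBlock:
    """Hypothetical block: one-hot encodes an integer ID from one column."""

    def __init__(self, col: int, dim: int):
        self.col, self.dim = col, dim

    def encode(self, row: List[str]) -> Dict[int, float]:
        class_id = int(row[self.col])
        if not 0 <= class_id < self.dim:
            raise ValueError(f"ID {class_id} out of range for dim={self.dim}")
        return {class_id: 1.0}


class ToyTextBlock:
    """Hypothetical block: hashes each word of one column into `dim` buckets."""

    def __init__(self, col: int, dim: int):
        self.col, self.dim = col, dim

    def encode(self, row: List[str]) -> Dict[int, float]:
        segment: Dict[int, float] = {}
        for word in row[self.col].split():
            # crc32 is deterministic across runs, unlike Python's built-in hash().
            index = zlib.crc32(word.encode("utf-8")) % self.dim
            segment[index] = segment.get(index, 0.0) + 1.0
        return segment


def featurize(row: List[str], blocks: list) -> Dict[int, float]:
    """Concatenate each block's segment into one sparse vector."""
    vector: Dict[int, float] = {}
    offset = 0
    for block in blocks:
        for index, value in block.encode(row).items():
            vector[offset + index] = value
        offset += block.dim  # the next block's segment starts here
    return vector


row = "2,neutral stuff".split(",")
vector = featurize(row, [ToyCategoricalBlock(col=0, dim=3), ToyTextBlock(col=1, dim=1000)])
```

In this sketch, indices 0 to 2 of `vector` belong to the categorical segment and indices 3 to 1002 belong to the text segment, so adding or removing a block only shifts the offsets of the segments that follow it.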

To illustrate, let's take a look at an example use case. First, we generate a mock dataset. Each row of this dataset has two comma-delimited columns. The first column contains a class ID and the second column contains a sentence. This is typical of a text classification dataset, where the task is to predict the class ID from the given text.

In [None]:
n_classes = 3
n_samples = 10_000

with open("test.csv", "w") as f:  # open in write mode; the default mode is read-only
    for i in range(n_samples):
        sentiment = i % n_classes  # cycle through class IDs 0, 1, 2
        if sentiment == 0:
            f.write("0,bad stuff\n")
        elif sentiment == 1:
            f.write("1,good stuff\n")
        else:
            f.write("2,neutral stuff\n")

Given the shape of our dataset and the task it is used for, we want to feed our Bolt model vectors that represent the text in the second column as input, and vectors that represent the class ID in the first column as labels. This is what that looks like with DataPipeline.

In [None]:
from thirdai.dataset import DataPipeline, blocks

pipeline = DataPipeline(
    filename="test.csv",  # the mock dataset generated above
    input_blocks=[blocks.Text(col=1, dim=100_000)],
    label_blocks=[blocks.Categorical(col=0, dim=n_classes)],  # one dimension per class ID (0, 1, 2)
    batch_size=256)
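
To illustrate what "streaming" with a `batch_size` means, here is a minimal pure-Python sketch of a loader that parses rows lazily and hands them out in fixed-size batches, so the whole file never has to sit in memory. This is an assumption about the general technique, not a description of DataPipeline's internals; `stream_batches` is a hypothetical helper, and whether DataPipeline also emits a smaller final batch is not stated in this guide.

```python
import csv
import io
from typing import Iterable, Iterator, List


def stream_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[List[str]]]:
    """Lazily yield lists of parsed rows, at most `batch_size` rows each."""
    batch: List[List[str]] = []
    for row in csv.reader(lines):
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


# An in-memory stand-in for a file handle; an open file works the same way.
mock_file = io.StringIO("".join(f"{i % 3},row {i}\n" for i in range(600)))
batch_sizes = [len(batch) for batch in stream_batches(mock_file, batch_size=256)]
```

With 600 rows and a batch size of 256, the sketch produces two full batches and one remainder batch of 88 rows.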

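Now that we've covered the main ideas, let's take a closer look at the configuration options. Each block can be given an explicit encoding, several blocks can read the same column, and the pipeline itself accepts options for headers and delimiters.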

In [None]:
from thirdai.dataset import text_encodings, categorical_encodings

pipeline = DataPipeline(
    filename="path/to/file.csv", 
    input_blocks=[
        blocks.Text(col=1, encoding=text_encodings.UniGram(dim=100_000)), # Default encoding for text block
        blocks.Text(col=1, encoding=text_encodings.PairGram(dim=100_000)),
        blocks.Text(col=1, encoding=text_encodings.CharKGram(k=3, dim=100_000)), # Character trigrams
        blocks.Categorical(col=0, encoding=categorical_encodings.ContiguousNumericId(dim=2)), # Default encoding for categorical block
    ],
    label_blocks=[blocks.Categorical(col=0, dim=2)],
    batch_size=256,
    has_header=True, # Tells the pipeline that the dataset has a header row, which it discards.
    delimiter='\t') # Any character is a valid delimiter.
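
For intuition about the three text encodings configured above, here is a rough pure-Python sketch of the tokens each one might extract before hashing them into `dim` buckets. The exact definitions used by `text_encodings` are not given in this guide, so treat these as assumptions: `unigrams`, `pairgrams`, `char_kgrams`, and `hash_to_indices` are hypothetical helpers, pairs are taken over all word pairs in order of appearance, and CRC32 stands in for whatever hash function the library uses.

```python
import zlib
from itertools import combinations
from typing import List


def unigrams(text: str) -> List[str]:
    """Individual whitespace-separated words."""
    return text.split()


def pairgrams(text: str) -> List[str]:
    """Assumed definition: every pair of words, kept in order of appearance."""
    return [f"{a} {b}" for a, b in combinations(text.split(), 2)]


def char_kgrams(text: str, k: int) -> List[str]:
    """Every run of k consecutive characters (k=3 gives character trigrams)."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def hash_to_indices(tokens: List[str], dim: int) -> List[int]:
    """Map tokens to sorted, de-duplicated bucket indices in [0, dim)."""
    return sorted({zlib.crc32(token.encode("utf-8")) % dim for token in tokens})


sentence = "good stuff"
word_tokens = unigrams(sentence)
pair_tokens = pairgrams(sentence)
trigram_tokens = char_kgrams("stuff", k=3)
indices = hash_to_indices(word_tokens + pair_tokens, dim=100_000)
```

Word-level tokens capture vocabulary, pairs capture co-occurrence, and character k-grams are more tolerant of misspellings, which is one reason to combine several text blocks on the same column.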

# **Use case**
As you can see, DataPipeline is quite flexible; in fact, too flexible to be given to customers. That flexibility is useful, however, when we don't yet know how best to encode features for a new dataset or task. We can try different combinations of features without writing new data processing scripts or recompiling C++ code after each iteration, while keeping the speed of a C++ implementation.

# **Limitations**
You may also have noticed that we don't have many blocks yet. This is because we have only implemented the bare minimum for a specific use case. The good news is that the block interface is highly extensible, so we can keep adding new blocks as the need arises.