# Multimodal Lhotse Dataloading

This tutorial explains how NeMo uses Lhotse for multimodal dataloading.
The modalities supported as of the time of writing are audio and text.
The intended audience of this tutorial is NeMo developers and anyone who builds or modifies NeMo models.
After finishing this tutorial, you should understand how to use the various Lhotse building blocks in NeMo to design the kind of model you want.

We cover the following topics:
* What are data types?
* What data types are available in NeMo?
* How do we read them from files?
* How to apply prompt formatting to various data types?
* How to create tensors for training with these examples?
* How to optimize training by stratifying data sampling on sequence lengths, and how are these lengths measured for different examples and models?
* How to train on multiple data types together?

## Data types

A data type represents examples of your training data: speech recordings, text sentences, text sentence pairs, conversations, etc.

A data type consists of:
* a class that represents a single sample
  * includes properties allowing sequence length measurement for sampling purposes
* a parser class that's initialized with a config (e.g. paths to data) and acts as an iterator of examples
* extension functions that define how to apply prompt formatting to a given data type

NeMo uses Lhotse Cuts as a basic data type for audio, and defines several data types for text. We'll go over them below.

External references:
* [Lhotse documentation](https://lhotse.readthedocs.io/en/latest/getting-started.html)
* [Lhotse in NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#lhotse-dataloading)

### Audio examples (Lhotse cuts)

In [1]:
from lhotse import MonoCut, Recording, SupervisionSegment, AudioSource
from lhotse.testing.dummies import dummy_cut


# A basic audio example: recording with transcription
cut = MonoCut(
    id="utt-0",
    start=0.0,
    duration=10.0,
    channel=0,
    supervisions=[SupervisionSegment(id="utt-0", recording_id="rec-0", start=0.0, duration=10.0, text="Welcome to Lhotse!")],
    recording=Recording(
        id="rec-0",
        sources=[AudioSource(type="file", channels=[0], source="/path/to/recording.wav")],
        sampling_rate=16000,
        duration=10.0,
        num_samples=160000,
    ),
)

### Single text examples

In [2]:
from nemo.collections.common.data.lhotse.text_adapters import TextExample

# A basic text example: single line of text.
text = TextExample(
    text="This is a single sentence, which may be used in language modeling.",
    language="en"
)
print(text)

TextExample(text='This is a single sentence, which may be used in language modeling.', language='en', tokens=None, custom=None)


### Pairs of text examples

In [3]:
from nemo.collections.common.data.lhotse.text_adapters import SourceTargetTextExample

# A pair of text examples, usable e.g. in machine translation.
text_pair = SourceTargetTextExample(
    source=TextExample(
        text="Some machine translation example.",
        language="en",
    ),
    target=TextExample(
        text="Algunos ejemplos de traducción automática.",
        language="es",
    ),
)

### Conversations: text, audio, and multimodal

In [4]:
from nemo.collections.common.data.lhotse.text_adapters import NeMoMultimodalConversation, TextTurn, AudioTurn

# A text-only conversation, useful for chat LLM training.
text_conversation = NeMoMultimodalConversation(
    id="convo-text-0",
    turns=[
        TextTurn(value="Is this a text-only conversation?", role="user"),
        TextTurn(value="Yes, but we can do more than that.", role="assistant"),
        TextTurn(value="Tell me more.", role="user"),
        TextTurn(value="Of course! Let's move on to the next example.", role="assistant"),
    ]
)

# An audio-only conversation, useful for chat speech LLM training.
# We'll explain the [audio] tag and token_equivalent_duration later in this tutorial.
audio_conversation = NeMoMultimodalConversation(
    id="convo-audio-0",
    turns=[
        AudioTurn(cut=dummy_cut(0, duration=7.18, with_data=True), role="user", audio_locator_tag="[audio]"),
        AudioTurn(cut=dummy_cut(0, duration=21.64, with_data=True), role="assistant", audio_locator_tag="[audio]"),
    ],
    token_equivalent_duration=0.08,
)

# A multimodal conversation.
multimodal_conversation = NeMoMultimodalConversation(
    id="convo-multimodal-0",
    turns=[
        TextTurn(value="Is this a text-only conversation?", role="user"),
        TextTurn(value="No, feel free to speak to me.", role="assistant"),
        AudioTurn(cut=dummy_cut(0, duration=5.87, with_data=True), role="user", audio_locator_tag="[audio]"),
        TextTurn(value="Should I respond in voice too?", role="assistant"),
        TextTurn(value="Yes", role="user"),
        TextTurn(value="Certainly!", role="assistant"),
        AudioTurn(cut=dummy_cut(0, duration=14.62, with_data=True), role="assistant", audio_locator_tag="[audio]"),
    ],
    token_equivalent_duration=0.08,
)

As you can see, these data structures serve as a complete description of training examples of different types, 
as they contain both the data (audio) and various metadata.

## Parsing data types from files

Related: for an overview of NeMo data configuration format, please see these docs: 
* [Extended multi-dataset configuration format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#extended-multi-dataset-configuration-format)
* [Configuring multi-modal dataloading](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#configuring-multi-modal-dataloading)

The goal of a data type parser is to read a configuration specifying where the data is located and how to read it,
create an iterable over the corresponding data type, and wrap it into a Lhotse CutSet.

Adding support for a new data type parser requires two components:
* An adapter/iterator class dedicated to your data type.
* A function that instantiates this adapter/iterator, registered with a `@data_type_parser("name")` decorator to make it auto-detectable by NeMo.

We'll take a deeper look at how parsing of source-target text pairs is implemented by writing a custom parser for `SourceTargetTextExample` that reads examples from JSONL files.

In [5]:
from lhotse.serialization import load_jsonl
import random
from typing import Literal, Iterator
from dataclasses import dataclass

from lhotse import CutSet
from lhotse.dataset.dataloading import resolve_seed
from omegaconf import DictConfig
from nemo.collections.common.data.lhotse.nemo_adapters import expand_sharded_filepaths
from nemo.collections.common.data.lhotse.cutset import data_type_parser


@dataclass
class LhotseTextPairAdapterFromJsonl:
    manifest_path: str | list[str]
    shuffle_shards: bool = False
    shard_seed: int | Literal["trng", "randomized"] = "trng"

    def __post_init__(self):
        self.manifest_path = expand_sharded_filepaths(self.manifest_path)

    def __iter__(self) -> Iterator[SourceTargetTextExample]:
        seed = resolve_seed(self.shard_seed)
        rng = random.Random(seed)
        paths = self.manifest_path
        if self.shuffle_shards:
            rng.shuffle(paths)
        for p in paths:
            for item in load_jsonl(p):
                yield SourceTargetTextExample(
                    source=TextExample(item["source"], item.get("source_lang")),
                    target=TextExample(item["target"], item.get("target_lang")),
                    question=(
                        TextExample(item["prompt"], language=item.get("prompt_lang"))
                        if "prompt" in item
                        else None
                    ),
                )


@data_type_parser("txt_pair_jsonl")
def read_txt_pair_paths(config: DictConfig) -> tuple[CutSet, bool]:
    cuts = CutSet(
        LhotseTextPairAdapterFromJsonl(
            manifest_path=config.manifest_path,
            shuffle_shards=config.shuffle,
            shard_seed=config.shard_seed,
        )
    )
    if not config.get("force_finite", False):
        cuts = cuts.repeat()
    return cuts, True

Note that there is a bit of boilerplate (`expand_sharded_filepaths`, `force_finite`, `shuffle_shards`, `shard_seed`) - we might reduce the amount of necessary boilerplate in the future, but for now it is required.

Let's test that it works. We'll first create two JSONL files (shards) with one entry each, and later use NeMo's path expansion mechanism to provide them as the input configuration.

Then, we'll read it using the high-level API `read_cutset_from_config` that's actually used by NeMo+Lhotse dataloader to show that the auto-registration mechanism works as expected.

In [6]:
!echo '{"source": "A", "target": "B"}' > _tutorial_nmt_0.jsonl
!echo '{"source": "C", "target": "D"}' > _tutorial_nmt_1.jsonl

from nemo.collections.common.data.lhotse.cutset import read_cutset_from_config

data, use_iterable_dataset = read_cutset_from_config(
    {
        "input_cfg": [
            {
                "type": "txt_pair_jsonl", 
                "manifest_path": "_tutorial_nmt__OP_0..1_CL_.jsonl", 
            }
        ]
    }
)

example = next(iter(data))
assert isinstance(example, SourceTargetTextExample)
assert example.source.text == "A"
assert example.target.text == "B"
print(example)

    


SourceTargetTextExample(source=TextExample(text='A', language=None, tokens=None, custom=None), target=TextExample(text='B', language=None, tokens=None, custom=None), question=None, custom=None)
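
The `_OP_0..1_CL_` marker in `manifest_path` above stands for a numeric range that is expanded into one path per shard. The sketch below is a simplified stand-in for that expansion, assuming a single `_OP_<start>..<end>_CL_` range per path; `expand_range_pattern` is a hypothetical helper, and the real `expand_sharded_filepaths` handles more cases.

```python
import re

# Matches a single numeric range written as _OP_<start>..<end>_CL_.
_RANGE = re.compile(r"_OP_(\d+)\.\.(\d+)_CL_")


def expand_range_pattern(path: str) -> list[str]:
    """Expand one _OP_a..b_CL_ range into a list of concrete paths."""
    match = _RANGE.search(path)
    if match is None:
        return [path]
    start, end = match.group(1), match.group(2)
    # Keep zero-padding if the start index was written with leading zeros.
    width = len(start) if start.startswith("0") and len(start) > 1 else 0
    prefix, suffix = path[: match.start()], path[match.end():]
    return [
        f"{prefix}{str(i).zfill(width)}{suffix}"
        for i in range(int(start), int(end) + 1)
    ]


shards = expand_range_pattern("_tutorial_nmt__OP_0..1_CL_.jsonl")
print(shards)
```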


## Prompt formatting and conversion of data types to tensors

Since we now understand how data types are read, let's see how to convert them to actual training examples.
Because this tutorial is focused on multimodal LLM / speech LLM training, we'll use prompt templates suited to various LLMs to prepare the training data. In this example, we'll use the Llama2 prompt template to format each data type.

We'll need to initialize a prompt formatter and a tokenizer; we'll just train a dummy BPE tokenizer for the purpose of the tutorial.

In [7]:
import string
import shlex
from nemo.collections.common.tokenizers.sentencepiece_tokenizer import SentencePieceTokenizer, create_spt_model
from nemo.collections.common.prompts.formatter import PromptFormatter

!echo {shlex.quote(' '.join(string.printable))} > _tutorial_train_text.txt

tok_path, vocab_path = create_spt_model(
    data_file="_tutorial_train_text.txt", 
    output_dir="_tutorial_spt",
    vocab_size=512, 
    sample_size=-1, 
    do_lower_case=False, 
    bos=True, 
    eos=True, 
    pad=True, 
    user_defined_symbols=["[INST]", "[/INST]", "<<SYS>>", "<</SYS>>", "[audio]"]
)

tokenizer = SentencePieceTokenizer(tok_path)
prompt = PromptFormatter.resolve("llama2")(tokenizer)

[NeMo I 2024-10-18 14:12:19 sentencepiece_tokenizer:333] tokenizer model _tutorial_spt/tokenizer.model already exists


Now, we'll convert the data types to a training/inference friendly format. Specifically, we want to have 4 tensors:
* `context_ids`: token IDs that serve as the input for LLM (e.g. user query, conversation history, etc.)
* `answer_ids`: token IDs that serve as the answer for LLM (assistant response)
* `input_ids`: concatenated `context_ids` and `answer_ids`
* `mask`: loss mask that's only set to `True` for each token belonging to each of assistant's turns. Same length as `input_ids`.
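
For a single-turn example, the relationship between these four tensors can be sketched with plain Python lists. The token IDs below are hypothetical and chosen only for illustration; in a multi-turn conversation the mask would contain several `True` spans, one per assistant turn.

```python
# Hypothetical token IDs, for illustration only.
context_ids = [1, 4, 17, 23, 5]  # e.g. BOS, [INST], user text..., [/INST]
answer_ids = [42, 43, 2]         # e.g. assistant text..., EOS

# input_ids is the concatenation of context and answer.
input_ids = context_ids + answer_ids

# The loss mask selects only the assistant (answer) tokens.
mask = [False] * len(context_ids) + [True] * len(answer_ids)

# Tokens that would contribute to the training loss.
supervised = [tok for tok, keep in zip(input_ids, mask) if keep]
print(input_ids, mask, supervised)
```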

Let's first go through Cut, SourceTargetTextExample, and NeMoMultimodalConversation to see what happens with them.

In [8]:
from nemo.collections.common.data.prompt_fn import apply_prompt_format_fn

cut.context = "Repeat after me:"
print("Cut:")
formatted = apply_prompt_format_fn(cut, prompt)
for name in ["input_ids", "context_ids", "answer_ids"]:
    print("\t*", name, tokenizer.ids_to_text(formatted[name]))
print("loss mask", formatted["mask"])
print()

print("SourceTargetTextExample:")
formatted = apply_prompt_format_fn(text_pair, prompt)
for name in ["input_ids", "context_ids", "answer_ids"]:
    print("\t*", name, tokenizer.ids_to_text(formatted[name]))
print("loss mask", formatted["mask"])
print()

print("NeMoMultimodalConversation:")
formatted = apply_prompt_format_fn(multimodal_conversation, prompt)
for name in ["input_ids", "context_ids", "answer_ids"]:
    print("\t*", name, tokenizer.ids_to_text(formatted[name]))
print("loss mask", formatted["mask"])
print()

Cut:
	* input_ids [INST] Repeat after me: [/INST] Welcome to Lhotse!
	* context_ids [INST] Repeat after me: [/INST]
	* answer_ids Welcome to Lhotse!
loss mask tensor([False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True])

SourceTargetTextExample:
	* input_ids [INST] Some machine translation example. [/INST] Algunos ejemplos de traducci ⁇ n autom ⁇ tica.
	* context_ids [INST] Some machine translation example. [/INST]
	* answer_ids Algunos ejemplos de traducci ⁇ n autom ⁇ tica.
loss mask tensor([False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
  

Note how each example got converted into the same prompt format. 

For multimodal conversation we have a special mechanism that replaces audio turns with an `audio_locator_tag`. 
We expect that the tokenizer contains this tag as a special token.
The user will later replace these special tokens with audio representations (tokenized, or not) in the training step of the model. 
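
One way to picture this replacement step is shown below, using plain Python lists instead of tensors. This is a sketch under stated assumptions, not NeMo model code: `splice_audio`, the placeholder ID `8`, and the frame labels are all hypothetical.

```python
def splice_audio(token_ids, audio_token_id, audio_segments):
    """Replace each placeholder token with the frames of the next audio segment."""
    segments = iter(audio_segments)
    merged = []
    for tok in token_ids:
        if tok == audio_token_id:
            try:
                frames = next(segments)
            except StopIteration:
                raise ValueError("More audio placeholders than audio segments.") from None
            merged.extend(("audio", frame) for frame in frames)
        else:
            merged.append(("text", tok))
    if next(segments, None) is not None:
        raise ValueError("More audio segments than audio placeholders.")
    return merged


AUDIO_ID = 8  # hypothetical ID of the "[audio]" special token
token_ids = [1, 4, AUDIO_ID, 5, 42, 2]
audio_segments = [["f0", "f1", "f2"]]  # one audio turn with three frames

merged = splice_audio(token_ids, AUDIO_ID, audio_segments)
print(merged)
```

In a real model, the `("text", id)` entries would be token embeddings and the `("audio", frame)` entries would be audio encoder outputs, concatenated along the sequence dimension.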

If you create a new prompt format or a new data type, or want to specialize how a given data type is formatted with a given prompt, you can customize this by defining a single function with the `@registered_prompt_format_fn(DataType, PromptFormatterType)` decorator. For example, suppose we create a new data type called `TextTriplet` and add a default prompt format function along with another one specialized for Llama2:

In [9]:
from nemo.collections.common.prompts import Llama2PromptFormatter
from nemo.collections.common.data.prompt_fn import registered_prompt_format_fn
from nemo.collections.common.data.lhotse.text_adapters import Formattable, CustomFieldMixin


@dataclass
class TextTriplet(Formattable, CustomFieldMixin):
    # Note: we will explain Formattable and CustomFieldMixin in the next sections.
    text1: str
    text2: str
    text3: str


@registered_prompt_format_fn(TextTriplet)
def text_triplets_generic(example: TextTriplet, prompt: PromptFormatter):
    return prompt.encode_dialog(turns=[
        {"role": "user", "slots": {"message": f"{example.text1} {example.text2}"}},
        {"role": "assistant", "slots": {"message": f"{example.text3}"}},
    ])

    
@registered_prompt_format_fn(TextTriplet, Llama2PromptFormatter)
def text_triplets_llama2(example: TextTriplet, prompt: Llama2PromptFormatter):
    return prompt.encode_dialog(turns=[
        {"role": "system_and_user", "slots": {"system": example.text1 , "message": example.text2}},
        {"role": "assistant", "slots": {"message": example.text3}},
    ])


formatted = apply_prompt_format_fn(TextTriplet("A", "B", "C"), prompt)
for k, v in formatted.items():
    print(k, v)

input_ids tensor([ 1,  9,  4,  9,  6,  9, 42,  9,  7,  9, 43,  9,  5,  9, 44,  2])
context_ids tensor([ 1,  9,  4,  9,  6,  9, 42,  9,  7,  9, 43,  9,  5])
answer_ids tensor([ 9, 44,  2])
mask tensor([False, False, False, False, False, False, False, False, False, False,
        False, False, False,  True,  True,  True])


If we also created a data type parser for `TextTriplet`, like we did for `SourceTargetTextExample` in the previous section, we would have complete dataloading support for a new data type.

## Support for sequence length stratification / dynamic bucketing

References: 
* [EMMeTT: Efficient Multimodal Machine Translation Training](https://arxiv.org/abs/2409.13523) 

We found that dynamic bucketing with [OOMptimizer](https://github.com/NVIDIA/NeMo/blob/main/docs/source/asr/datasets.rst#pushing-gpu-utilization-to-the-limits-with-bucketing-and-oomptimizer) can significantly accelerate multimodal LLM training.
In order to ensure that all data types can benefit from this acceleration, we introduced the `Formattable` concept.
It indicates that a given data type supports prompt formatting and provides properties to measure input and output sequence length.

Let's see this in action with the previously formatted data types:

In [10]:
print("SourceTargetTextPair:")
text_pair = text_pair.apply_prompt_format(prompt)
print("\t*", "input_length", text_pair.input_length)
print("\t*", "output_length", text_pair.output_length)
print("\t*", "total_length", text_pair.total_length)
print("\t*", "len(context_ids)", len(text_pair.context_ids))
print("\t*", "len(answer_ids)", len(text_pair.answer_ids))
print("\t*", "len(input_ids)", len(text_pair.input_ids))

print("NeMoMultimodalConversation")
multimodal_conversation = multimodal_conversation.apply_prompt_format(prompt)
print("\t*", "input_length", multimodal_conversation.input_length)
print("\t*", "output_length", multimodal_conversation.output_length)
print("\t*", "total_length", multimodal_conversation.total_length)
print("\t*", "len(context_ids)", len(multimodal_conversation.context_ids))
print("\t*", "len(answer_ids)", len(multimodal_conversation.answer_ids))
print("\t*", "len(input_ids)", len(multimodal_conversation.input_ids))


SourceTargetTextPair:
	* input_length 39
	* output_length 44
	* total_length 83
	* len(context_ids) 39
	* len(answer_ids) 44
	* len(input_ids) 83
NeMoMultimodalConversation
	* input_length 191
	* output_length 196
	* total_length 387
	* len(context_ids) 118
	* len(answer_ids) 14
	* len(input_ids) 132


Note that for `NeMoMultimodalConversation` the length is much greater than the number of text tokens.
This is where `token_equivalent_duration` comes in: we want to factor the audio turns into sequence lengths.
Since we know the duration of the audio, we only need to know how much duration is covered by each audio "token" or "frame".
A typical setup would be with NeMo FastConformer as an audio encoder, which uses 10ms frames at the input and subsamples them by a factor of 8 in the output. 
The resulting `token_equivalent_duration` is therefore `0.08`, i.e., a single token created from audio is worth 80ms of duration. 
For length computation, we sum the number of text tokens and the equivalent number of audio tokens.
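
The numbers printed above can be reproduced with a short calculation. Rounding fractional audio tokens down is an assumption made here because it matches the printed lengths; NeMo's exact rounding rule may differ. The helper name `audio_tokens` is hypothetical.

```python
import math

token_equivalent_duration = 0.08  # seconds of audio per audio "token"

# Audio durations (seconds) from multimodal_conversation.
user_audio = 5.87        # part of the context
assistant_audio = 14.62  # part of the answer

# Text token counts printed above.
num_context_tokens = 118  # len(context_ids)
num_answer_tokens = 14    # len(answer_ids)


def audio_tokens(duration: float) -> int:
    # Assumption: fractional tokens are rounded down.
    return math.floor(duration / token_equivalent_duration)


input_length = num_context_tokens + audio_tokens(user_audio)
output_length = num_answer_tokens + audio_tokens(assistant_audio)
total_length = input_length + output_length
print(input_length, output_length, total_length)
```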

We can see that Lhotse's `DynamicBucketingSampler` is able to process this data using NeMo multimodal sampling strategies:

In [11]:
from lhotse.dataset import DynamicBucketingSampler
from nemo.collections.common.data.lhotse.sampling import MultimodalFixedBucketBatchSizeConstraint2D

cuts = CutSet([multimodal_conversation]).repeat()  # repeat makes iterable infinite
sampler = DynamicBucketingSampler(
    cuts, 
    constraint=MultimodalFixedBucketBatchSizeConstraint2D(
        max_seq_len_buckets=[32, 64, 128, 256, 512, 1024, 1536, 2048],
        batch_sizes=[8, 7, 6, 5, 4, 3, 2, 1],
        token_equivalent_duration=0.08, 
        measure_total_length=True,
    ),
    buffer_size=10,
)

batch = next(iter(sampler))
assert len(batch) == 4
# Our conversation example (total length 387) fell into the bucket at index 4 (0-based; min: 256, max: 512), which has an assigned batch size of 4.
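
The bucket lookup in the cell above can be approximated with a binary search over the bucket upper bounds. This is a sketch of the idea only: `pick_bucket` is a hypothetical helper, and treating each upper bound as inclusive is an assumption rather than a statement about the constraint's exact behavior.

```python
from bisect import bisect_left

max_seq_len_buckets = [32, 64, 128, 256, 512, 1024, 1536, 2048]
batch_sizes = [8, 7, 6, 5, 4, 3, 2, 1]


def pick_bucket(length: int) -> tuple[int, int]:
    """Return (bucket index, batch size) for a given sequence length."""
    # Find the first bucket whose upper bound is >= length (assumed inclusive).
    idx = bisect_left(max_seq_len_buckets, length)
    if idx == len(max_seq_len_buckets):
        raise ValueError(f"Length {length} exceeds the largest bucket.")
    return idx, batch_sizes[idx]


# The multimodal conversation has a total length of 387.
print(pick_bucket(387))
```

Longer sequences land in later buckets with smaller batch sizes, which keeps the memory footprint of each mini-batch roughly comparable.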

## Putting it all together to configure joint audio, text, and conversation dataloading

We'll showcase some higher level APIs here. First, we'll create data examples on disk for three distinct types: audio to text, text to text, and multimodal conversations.

In [12]:
from pathlib import Path
from lhotse.serialization import save_to_jsonl
from lhotse.testing.dummies import dummy_recording

# Prepare dummy ASR data
d = Path("_tutorial_data")
!mkdir -p {d}/asr_shar
cut = dummy_recording(0, duration=17.11, with_data=True).to_cut()
cut.supervisions = [SupervisionSegment(id=cut.id, recording_id=cut.id, start=0.0, duration=cut.duration, text="Welcome to Lhotse!")]
cut.context = "Repeat after me"
CutSet([cut.save_audio(d / "rec.flac")]).to_shar(d / "asr_shar", fields={"recording": "flac"})

# Prepare dummy translation data
(d / "src.txt").write_text("A")
(d / "tgt.txt").write_text("B")

# Prepare dummy multimodal conversation
save_to_jsonl(
    [
        {
            "id": "convo-1",
            "conversations": [
                {"from": "user", "value": "tell me what you hear", "type": "text"},
                {"from": "user", "value": str(d / "rec.flac"), "duration": cut.duration, "type": "audio"},
                {"from": "assistant", "value": "somebody just welcomed me to a himalayan mountain", "type": "text"},
            ]
        }
    ],
    d / "conv.jsonl"
)

Now we'll configure a Lhotse dataloader to yield mini-batches with different data types in a round-robin fashion.

In [13]:
import torch
from omegaconf import OmegaConf
from nemo.collections.common.data.lhotse.dataloader import get_lhotse_dataloader_from_config

# This configuration is typically present in NeMo training configs under `model.train_ds` key.
cfg = OmegaConf.create({
    # Note that we have several sampler groups under keys: "asr", "nmt", and "chat".
    # Each group has its own data source and sampling settings, i.e., you can define
    # completely different batch sizes, sequence length filters, etc. for each type of data.
    # To enable this behaviour, set multi_config to True.
    "multi_config": True,
    
    # The following fields are shared by all groups.
    # sampler_fusion key determines how to yield batches from different samplers:
    # * "round_robin" will just yield one type at a time
    # * "zip" will sample a batch for each type and concatenate them, yielding a larger multimodal batch
    # * "randomized_round_robin" expects an extra "sampler_weights" option which will define sampling probs for each group.
    "sampler_fusion": "round_robin",
    "shuffle": True,
    "num_workers": 0,
    "seed": 0,
    "shard_seed": "trng",
    
    "asr": {
        "input_cfg": [
            {
                "type": "lhotse_shar", 
                "shar_path": d / "asr_shar"
            }
        ],
        "min_duration": 0.5,
        "max_duration": 40,
        "use_bucketing": True,
        "bucket_duration_bins": [5, 10, 20, 40],
        "bucket_batch_size": [4, 3, 2, 1],
        "prompt_format": "llama2",

        # Simplified settings for quick tutorial running (don't use these in real applications).
        "concurrent_bucketing": False,
        "bucket_buffer_size": 50,
        "shuffle_buffer_size": 50,
    },

    "nmt": {
        "input_cfg": [
            {
                "type": "txt_pair", 
                "source_paths": d / "src.txt", 
                "target_paths": d / "tgt.txt"
            }
        ],
        "use_multimodal_sampling": True,  # will count tokens instead of seconds
        "min_tokens": 1,
        "max_tokens": 32,
        "measure_total_length": False,  # filters by input length instead of total length
        "use_bucketing": True,
        "bucket_duration_bins": [[16, 16], [16, 32], [32, 16], [32, 32]],  # 2D buckets
        "bucket_batch_size": [4, 3, 2, 1],
        "prompt_format": "llama2",
        
        # Simplified settings for quick tutorial running (don't use these in real applications).
        "concurrent_bucketing": False,
        "bucket_buffer_size": 50,
        "shuffle_buffer_size": 50,
    },

    "chat": {
        "input_cfg": [
            {
                "type": "multimodal_conversation", 
                "manifest_filepath": d / "conv.jsonl", 
                "audio_locator_tag": "[audio]"
            }
        ],
        "use_multimodal_sampling": True,  # will count tokens instead of seconds
        "min_tokens": 1,
        "max_tokens": 1024,
        "measure_total_length": True,
        "token_equivalent_duration": 0.08,
        "use_bucketing": True,
        "bucket_duration_bins": [128, 256, 512, 1024],
        "bucket_batch_size": [4, 3, 2, 1],
        "prompt_format": "llama2",

        # Simplified settings for quick tutorial running (don't use these in real applications).
        "concurrent_bucketing": False,
        "bucket_buffer_size": 50,
        "shuffle_buffer_size": 50,
    },
})


# A no-op PyTorch Dataset class that will just return the data structures.
# In a real training setup, you'll want to implement conversion of a list of examples to a tensor mini-batch
# that is adequate for your model. 
# Note that you can handle multiple types of examples to create appropriate mini-batch schema for each.
class Identity(torch.utils.data.Dataset):
    def __getitem__(self, examples: CutSet):
        return examples

dloader = get_lhotse_dataloader_from_config(cfg, global_rank=0, world_size=1, dataset=Identity(), tokenizer=tokenizer)

[NeMo I 2024-10-18 14:12:19 dataloader:481] Creating a Lhotse DynamicBucketingSampler (max_batch_duration=None max_batch_size=None)
[NeMo I 2024-10-18 14:12:19 dataloader:481] Creating a Lhotse DynamicBucketingSampler (max_batch_duration=None max_batch_size=None)
[NeMo I 2024-10-18 14:12:19 dataloader:481] Creating a Lhotse DynamicBucketingSampler (max_batch_duration=None max_batch_size=None)
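
The difference between the `round_robin` and `zip` fusion modes described in the configuration comments can be pictured with plain Python iterators. This is a conceptual sketch, not the Lhotse or NeMo implementation: `round_robin`, `zip_fusion`, and the toy batches are hypothetical, and stopping as soon as any sampler is exhausted is a simplifying assumption.

```python
from itertools import cycle


def round_robin(*samplers):
    """Yield one batch at a time, alternating between samplers."""
    iterators = [iter(s) for s in samplers]
    for it in cycle(iterators):
        try:
            yield next(it)
        except StopIteration:
            # Simplification: stop as soon as any sampler runs out.
            return


def zip_fusion(*samplers):
    """Take one batch from every sampler and concatenate them."""
    for batches in zip(*samplers):
        yield [example for batch in batches for example in batch]


# Toy "samplers": each is a list of mini-batches of example names.
asr = [["a0", "a1"], ["a2", "a3"]]
nmt = [["n0"], ["n1"]]
chat = [["c0"], ["c1"]]

rr_batches = list(round_robin(asr, nmt, chat))
zip_batches = list(zip_fusion(asr, nmt, chat))
print(rr_batches)
print(zip_batches)
```

With round-robin fusion each mini-batch contains a single data type, while zip fusion produces larger mini-batches that mix all types.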


In [14]:
for idx, batch in enumerate(dloader):
    if idx == 5:
        break
    print(f"Step {idx}. Examples:")
    for item in batch:
        print("\t*", item)
    print()

Step 0. Examples:
	* MonoCut(id='dummy-recording-0000_repeat10', start=0, duration=17.11, channel=0, supervisions=[SupervisionSegment(id='dummy-recording-0000', recording_id='dummy-recording-0000', start=0.0, duration=17.11, channel=0, text='Welcome to Lhotse!', language=None, speaker=None, gender=None, custom=None, alignment=None)], features=None, recording=Recording(id='rec', sources=[AudioSource(type='memory', channels=[0], source='<binary-data>')], sampling_rate=16000, num_samples=273760, duration=17.11, channel_ids=[0], transforms=None), custom={'context': 'Repeat after me', 'shard_origin': PosixPath('_tutorial_data/asr_shar/cuts.000000.jsonl.gz'), 'shar_epoch': 10, 'input_ids': tensor([ 1,  9,  4,  9, 59, 78, 89, 78, 74, 93,  9, 74, 79, 93, 78, 91,  9, 86,
        78,  9,  5,  9, 64, 78, 85, 76, 88, 86, 78,  9, 93, 88,  9, 53, 81, 88,
        93, 92, 78, 10,  2]), 'context_ids': tensor([ 1,  9,  4,  9, 59, 78, 89, 78, 74, 93,  9, 74, 79, 93, 78, 91,  9, 86,
        78,  9,  5]), 