In [1]:
# In this notebook, you learn:
# 
# 1) How to create dataset iterators for English-Telugu translation data to be used by Transformers.
# 2) How to create DataLoaders that group the data based on the length of the sentences.

In [2]:
# Resources to go through to understand this notebook:
#
# 1) https://www.youtube.com/watch?v=IGu7ivuy1Ag
#       -- Gives a high level walk through of how the input and the output are structured and transformed in a transformer model.
#       -- This is important to understand so that we know how to batch our input data.
# 2) https://www.youtube.com/watch?v=dZzVA6VbAR8
#       -- Decent video that gives an overview about the Data Preparation process in transformers.

In [27]:
import datasets
import random
import spacy
import torch

from abc import ABC, abstractmethod
from tokenizers import ByteLevelBPETokenizer  # type: ignore
from torch import Tensor
from torch.utils.data import Dataset, DataLoader, Sampler
from torchtext.vocab import build_vocab_from_iterator
from typing import Callable, Dict, List, Iterator, Optional, Tuple

In the previous notebook, we created a PyTorch DataLoader on a simple dataset of tuples. However, the \
input data for the transformer model is a set of English-Telugu sentence pairs. In this notebook, we \
build on the concepts from step_3 and show how to build a PyTorch DataLoader on the English-Telugu \
sentences.

In [3]:
# Path (directory) where all the data used by this project is stored.
AI4_BHARAT_DATA_PATH = "../../Data/AI4Bharat"
# Number of tokens in the vocabulary of the tokenizer.
VOCAB_SIZE = 30000
START_TOKEN = "<sos>"
END_TOKEN = "<eos>"
PAD_TOKEN = "<pad>"
UNK_TOKEN = "<unk>"

The Transformer model expects a sequence of tokens as input and outputs one vector per token.   \
The DataLoaders should just provide batches of token sequences to the transformer model  \
during training given the training dataset as input. So, the tokenization should be done within \
the collate function of the Dataloaders. 
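
As a rough illustration of that idea, the sketch below builds a collate function from two encoder callables, so that raw dictionaries go in and token ids come out. `make_collate_fn` and `toy_encode` are hypothetical names used only for this illustration, and the ids 0 and 1 for `<sos>` and `<eos>` are assumptions. The actual tokenizers and collate function are built later in this notebook.

```python
from typing import Callable, Dict, List, Tuple

SOS_ID, EOS_ID = 0, 1  # assumed ids for <sos> and <eos>


def make_collate_fn(encode_src: Callable[[str], List[int]],
                    encode_tgt: Callable[[str], List[int]]):
    """Returns a collate function that tokenizes raw text pairs (hypothetical helper)."""
    def collate(batch: List[Dict[str, str]]) -> Tuple[List[List[int]], List[List[int]]]:
        # Tokenization happens here, at batching time, not when the dataset is created.
        src = [[SOS_ID] + encode_src(item["src"]) + [EOS_ID] for item in batch]
        tgt = [[SOS_ID] + encode_tgt(item["tgt"]) + [EOS_ID] for item in batch]
        return src, tgt
    return collate


def toy_encode(text: str) -> List[int]:
    # Stand-in encoder: maps each whitespace-separated word to (word length + 10).
    return [len(word) + 10 for word in text.split()]


collate = make_collate_fn(toy_encode, toy_encode)
src_ids, tgt_ids = collate([{"src": "Installed Software", "tgt": "a bc"}])
print(src_ids)  # [[0, 19, 18, 1]]
print(tgt_ids)  # [[0, 11, 12, 1]]
```

Passing the encoders in as arguments keeps the collate function independent of any particular tokenizer, which is one possible alternative to reading tokenizers from global variables.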

In [4]:
# Load the train dataset we have created and saved to disk in 'step_1_data_exploration.ipynb' notebook. 
en_te_translation_dataset = datasets.load_from_disk(f"{AI4_BHARAT_DATA_PATH}/train_dataset")
print(type(en_te_translation_dataset))
print(en_te_translation_dataset[0])
print(len(en_te_translation_dataset))
print("-" * 150)
en_te_debug_dataset = datasets.load_from_disk(f"{AI4_BHARAT_DATA_PATH}/debug_dataset")
print(en_te_debug_dataset[0])
print(len(en_te_debug_dataset))

<class 'datasets.arrow_dataset.Dataset'>
{'idx': 0, 'src': 'Have you heard about Foie gras?', 'tgt': 'ఇక ఫ్రూట్ ఫ్లైస్ గురించి మీరు విన్నారా?'}
250000
------------------------------------------------------------------------------------------------------------------------------------------------------
{'idx': 0, 'src': 'Have you heard about Foie gras?', 'tgt': 'ఇక ఫ్రూట్ ఫ్లైస్ గురించి మీరు విన్నారా?'}
200


## Setting up the Tokenizers.

We first train tokenizers and pass these tokenizers to the DataLoader so that tokenization is    <br>
handled within the DataLoader.

In [5]:
def text_extractor(data_point: dict[str, str], language: str) -> str:
    """Extracts the appropriate text from the example in the dataset based on the language.

    Args:
        data_point (dict[str, str]): A single example from the dataset containing the text in the form of 
                                     a dictionary. The source sentence is stored in the key 'src' and the 
                                     target sentence is stored in the key 'tgt'.
        language (str): Language of the text to be extracted from the data_point.

    Raises:
        ValueError: Raises an error if the language is not 'english' or 'telugu'.

    Returns:
        str: The text in the data_point.
    """
    if language == "english":
        return data_point["src"]
    elif language == "telugu":
        return data_point["tgt"]
    raise ValueError("Language should be either 'english' or 'telugu'.")

It's easier to create classes to create tokenizers and keep track of the required data. We will also    <br>
use these classes in the actual implementation of the model. The code used within the tokenizer <br>
classes is the same as what we have seen in step_2, just better organized.

In [11]:
# This is a base class that will be inherited by the actual tokenizer classes.
class BaseTokenizer(ABC):
    """A class created to hold different kinds of tokenizers and handle the token encoding in a common way.
       Here, we only use SpacyTokenizer and HuggingFaceTokenizer."""
    def __init__(self, language: str, tokenizer_type: str):
        self.language = language
        self.tokenizer_type = tokenizer_type
        self.special_tokens = [START_TOKEN, END_TOKEN, PAD_TOKEN, UNK_TOKEN]

    # Abstract methods need to be overridden by the child class. It raises TypeError if not overridden.
    @abstractmethod
    def initialize_tokenizer_and_build_vocab(self, 
                                             data_iterator: datasets.arrow_dataset.Dataset, 
                                             text_extractor: Callable[[dict[str, str], str], str], 
                                             max_vocab_size: Optional[int] = VOCAB_SIZE):
        """Initializes the tokenizers and builds the vocabulary for the given dataset.

        Args:
            data_iterator (datasets.arrow_dataset.Dataset): An iterator that gives input sentences (text) when iterated upon.
            text_extractor (Callable[[dict[str, str], str], str]): A function that extracts the appropriate text from the input 
                dataset. This parameter is added to make the tokenizer class independent of the input dataset format. If not 
                provided as an argument, we will have to extract the text from the dataset within the tokenizer class, 
                which makes it dependent on the dataset format. 
            max_vocab_size (int, optional): Maximum size of the vocabulary to create from the input data corpus. Defaults 
                                            to VOCAB_SIZE (30000).
        """
        pass

    # Abstract methods need to be overridden by the child class. It raises TypeError if not overridden.
    @abstractmethod
    def tokenize(self, text: str) -> list[str]:
        """Returns the individual tokens (possibly readable text) for the given text."""
        pass

    # Abstract methods need to be overridden by the child class. It raises TypeError if not overridden.
    @abstractmethod
    def encode(self, text: str) -> list[int]:
        """Returns the encoded token ids for the given text."""
        pass
    
    @abstractmethod
    def decode(self, token_ids: List[int]) -> str:
        """Converts the series of token ids back to the original text.

        Args:
            token_ids (List[int]): A list of token ids corresponding to some text.

        Returns:
            str: The original text corresponding to the token ids.
        """
        pass

    # Abstract methods need to be overridden by the child class. It raises TypeError if not overridden.
    @abstractmethod
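    def get_token_id(self, token: str) -> Optional[int]:
        """Returns the token id for the given token. If the token is not present in the vocabulary, the result depends 
           on the child class: SpacyTokenizer returns the <unk> token id and BPETokenizer returns None."""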
        pass
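
The comments above say that an abstract method raises `TypeError` if it is not overridden. A minimal sketch of when that error actually appears: it is raised when the incomplete child class is instantiated, not when it is defined. `MiniTokenizer`, `IncompleteTokenizer` and `WhitespaceTokenizer` are hypothetical classes used only for this illustration.

```python
from abc import ABC, abstractmethod


class MiniTokenizer(ABC):
    """Hypothetical stand-in for BaseTokenizer with a single abstract method."""
    @abstractmethod
    def tokenize(self, text: str) -> list[str]:
        ...


class IncompleteTokenizer(MiniTokenizer):
    # Does not override 'tokenize', so the class is still abstract.
    pass


class WhitespaceTokenizer(MiniTokenizer):
    def tokenize(self, text: str) -> list[str]:
        return text.split()


try:
    IncompleteTokenizer()
    instantiation_failed = False
except TypeError:
    # Raised at instantiation time because 'tokenize' was never overridden.
    instantiation_failed = True

print(instantiation_failed)  # True
print(WhitespaceTokenizer().tokenize("Have you heard"))  # ['Have', 'you', 'heard']
```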

In [12]:
# Refer 'step_2_training_bpe_tokenizer.ipynb' and 'step_2_alternate_tokenization_with_spacy.ipynb' notebooks to understand this class better.
class SpacyTokenizer(BaseTokenizer):
    """Creates a tokenizer that tokenizes the text using the Spacy tokenizer models."""
    def __init__(self, language: str):
        super().__init__(language, "spacy")
    
    def initialize_tokenizer_and_build_vocab(self, 
                                             data_iterator: datasets.arrow_dataset.Dataset, 
                                             text_extractor: Callable[[dict[str, str], str], str], 
                                             max_vocab_size: int = 40000):
        # Load spacy models for English text tokenization.
        if self.language == "english":
            self.tokenizer = spacy.load("en_core_web_sm").tokenizer          
        elif self.language == "telugu":
            # Load spacy model for Telugu text tokenization.
            self.tokenizer = spacy.blank("te").tokenizer            
        else:
            # Raise an error for unknown language.
            raise ValueError("Language should be either 'english' or 'telugu'.")
        self.max_vocab_size = max_vocab_size
        self.__build_vocab(data_iterator=data_iterator, text_extractor=text_extractor)

    def tokenize(self, text: str) -> list[str]:
        return [token.text for token in self.tokenizer(text)]

    def encode(self, text: str) -> list[int]:
        return self.vocab(self.tokenize(text))
    
    def decode(self, token_ids: List[int]) -> str:
        """Converts the series of token ids back to the original text.

        Args:
            token_ids (List[int]): A list of token ids corresponding to some text.

        Returns:
            str: The original text corresponding to the token ids.
        """
        token_strings = self.vocab.lookup_tokens(token_ids)
        # Here we are just attaching all the individual token strings with a space in between to form 
        # the original text. This does not give the exact original text but a close approximation. 
        # Using this here for simplicity. This might not always be the correct way to decode the text.
        return " ".join(token_strings)

    def get_token_id(self, token: str) -> int:
        return self.vocab([token])[0]

    def __yield_tokens(self, data_iterator: datasets.arrow_dataset.Dataset, text_extractor: Callable[[dict[str, str], str], str]):
        """Returns a generator object that emits tokens for each sentence in the dataset"""
        for data_point in data_iterator:
            yield self.tokenize(text_extractor(data_point, self.language))

    def __build_vocab(self, data_iterator: datasets.arrow_dataset.Dataset, text_extractor: Callable[[dict[str, str], str], str]):
        """Builds the vocabulary for the given dataset"""
        self.vocab = build_vocab_from_iterator(iterator=self.__yield_tokens(data_iterator=data_iterator, text_extractor=text_extractor), 
                                               min_freq=2, 
                                               specials=self.special_tokens, 
                                               special_first=True, 
                                               max_tokens=self.max_vocab_size)
        self.vocab.set_default_index(self.vocab[UNK_TOKEN])

In [13]:
spacy_telugu_tokenizer: BaseTokenizer = SpacyTokenizer(language="telugu")
spacy_telugu_tokenizer.initialize_tokenizer_and_build_vocab(data_iterator=en_te_translation_dataset, text_extractor=text_extractor, max_vocab_size=VOCAB_SIZE)
spacy_english_tokenizer: BaseTokenizer = SpacyTokenizer(language="english")
spacy_english_tokenizer.initialize_tokenizer_and_build_vocab(data_iterator=en_te_translation_dataset, text_extractor=text_extractor, max_vocab_size=VOCAB_SIZE)

In [None]:
# Note that we cannot add special tokens to the sentence and ask the spacy tokenizer to tokenize it. If we do that, it 
# will break the special tokens into smaller tokens. spacy's default tokenizer is rule-based and we haven't added any 
# rules that tell it to keep our special tokens intact. However, since we added the special tokens to the vocabulary, 
# 'get_token_id' maps them to the corresponding token ids when prompted for ids.
telugu_sentence = "ఇక ఫ్రూట్ ఫ్లైస్ గురించి మీరు విన్నారా?"
print(spacy_telugu_tokenizer.tokenize(text=telugu_sentence))
# This might print <unk> token id (3) for some of the tokens. This is because the spacy tokenizer might not have some
# of these tokens in its vocabulary. We limited the vocabulary to 30000 tokens which probably left out a huge number
# of tokens from the vocabulary.
print(spacy_telugu_tokenizer.encode(text=telugu_sentence))
print(spacy_telugu_tokenizer.decode(token_ids=spacy_telugu_tokenizer.encode(text=telugu_sentence)))  # type: ignore
english_sentence = "Have you heard about Foie gras?"
print(spacy_english_tokenizer.tokenize(text=english_sentence))
print(spacy_english_tokenizer.encode(text=english_sentence))
print(spacy_english_tokenizer.decode(token_ids=spacy_english_tokenizer.encode(text=english_sentence)))  # type: ignore
print("token(<sos>) =", spacy_english_tokenizer.get_token_id(START_TOKEN), ", token(<eos>) =", spacy_english_tokenizer.get_token_id(END_TOKEN))
print("token(<unk>) =", spacy_english_tokenizer.get_token_id(UNK_TOKEN), ", token(<pad>) =", spacy_english_tokenizer.get_token_id(PAD_TOKEN))
print("token(NON_EXISTENT_TOKEN) =", spacy_english_tokenizer.get_token_id("ఫ్రూట్"))

['ఇక', 'ఫ్రూట్', 'ఫ్లైస్', 'గురించి', 'మీరు', 'విన్నారా', '?']
[94, 23082, 3, 72, 40, 18178, 7]
ఇక ఫ్రూట్ <unk> గురించి మీరు విన్నారా ?
['Have', 'you', 'heard', 'about', 'Foie', 'gras', '?']
[1017, 46, 1003, 76, 3, 3, 20]
Have you heard about <unk> <unk> ?
token(<sos>) = 0 , token(<eos>) = 1
token(<unk>) = 3 , token(<pad>) = 2
token(NON_EXISTENT_TOKEN) = 3
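
The `<unk>` ids in the output above come from the way the vocabulary is built: rare tokens are dropped and every missing token falls back to a default index. The sketch below imitates that behaviour with plain Python. `build_toy_vocab` and `encode` are hypothetical helpers, not the torchtext implementation, and the tiny corpus is made up for illustration.

```python
from collections import Counter

SPECIALS = ["<sos>", "<eos>", "<pad>", "<unk>"]


def build_toy_vocab(token_lists, min_freq=2, max_tokens=6):
    """Hypothetical helper: specials first, then the most frequent tokens."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Drop tokens that appear fewer than 'min_freq' times.
    frequent = [tok for tok, freq in counts.most_common() if freq >= min_freq]
    # The size limit includes the special tokens.
    kept = frequent[: max_tokens - len(SPECIALS)]
    return {tok: idx for idx, tok in enumerate(SPECIALS + kept)}


def encode(tokens, vocab):
    # Tokens missing from the vocabulary fall back to the <unk> id.
    unk_id = vocab["<unk>"]
    return [vocab.get(tok, unk_id) for tok in tokens]


corpus = [["you", "heard", "?"], ["you", "?"], ["you", "gras"]]
vocab = build_toy_vocab(corpus)
print(vocab)
print(encode(["you", "heard", "gras", "?"], vocab))  # [4, 3, 3, 5]
```

Here "heard" and "gras" each occur only once, so they are left out of the vocabulary and encode to 3, the id of `<unk>`.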


In [18]:
# Refer 'step_2_training_bpe_tokenizer_and_vocab.ipynb' notebook to understand this class better.
# We train our own tokenizer in HuggingFace since it doesn't provide a tokenizer for Telugu by default.
class BPETokenizer(BaseTokenizer):
    """Trains a tokenizer using HuggingFace libraries"""
    def __init__(self, language: str):
        super().__init__(language, "hugging_face")

    def initialize_tokenizer_and_build_vocab(self, 
                                             data_iterator: datasets.arrow_dataset.Dataset, 
                                             text_extractor: Callable[[dict[str, str], str], str], 
                                             max_vocab_size: Optional[int] = VOCAB_SIZE):
        self.max_vocab_size = max_vocab_size
        self.tokenizer = self.__train_tokenizer(data_iterator=data_iterator, text_extractor=text_extractor, max_vocab_size=max_vocab_size)

    def tokenize(self, text: str) -> list[str]:
        encoded_text = self.tokenizer.encode(text)
        return encoded_text.tokens

    def encode(self, text: str) -> list[int]:
        encoded_text = self.tokenizer.encode(text)
        return encoded_text.ids

    # Spacy tokenizer doesn't have a built-in mechanism to generate text from token ids. It tokenizes the text 
    # based on a bunch of rules and doesn't know how to join the tokens back to form the original text. This 
    # is one of the additional advantages of using a BPE tokenizer over a spacy tokenizer because BPE algorithm 
    # is designed to both encode and decode text.
    def decode(self, token_ids: List[int]) -> str:
        """Converts the series of token ids back to the original text.

        Args:
            token_ids (List[int]): A list of token ids corresponding to some text.

        Returns:
            str: The original text corresponding to the token ids.
        """
        return self.tokenizer.decode(token_ids)

    def get_token_id(self, token: str) -> int:
        return self.tokenizer.token_to_id(token)

    # We need an iterator to train the tokenizer. Using an iterator ensures that not all 
    # the data is loaded into memory at once.
    def __get_data_iterator(self, data_iterator: datasets.arrow_dataset.Dataset, text_extractor: Callable[[dict[str, str], str], str]):
        for data_point in data_iterator:
            yield text_extractor(data_point=data_point, language=self.language)

    def __train_tokenizer(self, data_iterator: datasets.arrow_dataset.Dataset, 
                          text_extractor: Callable[[dict[str, str], str], str], 
                          max_vocab_size: Optional[int]=VOCAB_SIZE) -> ByteLevelBPETokenizer:
        # Use BPE to train a ByteLevel BPE tokenizer.
        tokenizer = ByteLevelBPETokenizer()
        # train_from_iterator is used so that the entire dataset is not loaded into memory at once.
        tokenizer.train_from_iterator(iterator=self.__get_data_iterator(data_iterator=data_iterator, text_extractor=text_extractor), 
                                      vocab_size=max_vocab_size, 
                                      special_tokens=self.special_tokens)
        return tokenizer

In [19]:
bpe_telugu_tokenizer: BaseTokenizer = BPETokenizer(language="telugu")
bpe_telugu_tokenizer.initialize_tokenizer_and_build_vocab(data_iterator=en_te_translation_dataset, text_extractor=text_extractor, max_vocab_size=30000)
bpe_english_tokenizer: BaseTokenizer = BPETokenizer(language="english")
bpe_english_tokenizer.initialize_tokenizer_and_build_vocab(data_iterator=en_te_translation_dataset, text_extractor=text_extractor, max_vocab_size=30000)

In [21]:
# This tokenizer conveniently handles special tokens by default. It can keep the special
# tokens intact during tokenization. It knows about special tokens by default.
telugu_sentence = "ఇక ఫ్రూట్ ఫ్లైస్ గురించి మీరు విన్నారా?"
print(bpe_telugu_tokenizer.tokenize(text=telugu_sentence))
print(bpe_telugu_tokenizer.encode(text=telugu_sentence))
print(bpe_telugu_tokenizer.decode(token_ids=bpe_telugu_tokenizer.encode(text=telugu_sentence)))
print("-" * 150)
english_sentence = "Have you heard about Foie gras?"
print(bpe_english_tokenizer.tokenize(text=english_sentence))
print(bpe_english_tokenizer.encode(text=english_sentence))
print(bpe_english_tokenizer.decode(token_ids=bpe_english_tokenizer.encode(text=english_sentence)))
print("-" * 150)
print("token(<sos>) =", bpe_english_tokenizer.get_token_id(START_TOKEN), ", token(<eos>) =", bpe_english_tokenizer.get_token_id(END_TOKEN))
print("token(<unk>) =", bpe_english_tokenizer.get_token_id(UNK_TOKEN), ", token(<pad>) =", bpe_english_tokenizer.get_token_id(PAD_TOKEN))
# Notice that 'get_token_id' returns None here instead of the token id 3 (<unk>) that the spacy tokenizer returned. 
# 'token_to_id' does an exact lookup in the vocabulary, and the string "NON_EXISTENT_TOKEN" is not a single token in it. 
# During 'encode', byte-level BPE breaks unknown words down into smaller known tokens, so it never needs the <unk> id.
print("token(NON_EXISTENT_TOKEN) =", bpe_english_tokenizer.get_token_id("NON_EXISTENT_TOKEN"))

['à°ĩà°ķ', 'Ġà°«', 'à±į', 'à°°', 'à±Ĥ', 'à°Ł', 'à±į', 'Ġà°«', 'à±į', 'à°²', 'à±Ī', 'à°¸', 'à±į', 'Ġà°Ĺ', 'à±ģ', 'à°°', 'à°¿à°Ĥ', 'à°ļ', 'à°¿', 'Ġà°®', 'à±Ģ', 'à°°', 'à±ģ', 'Ġà°µ', 'à°¿', 'à°¨', 'à±į', 'à°¨', 'à°¾', 'à°°', 'à°¾?']
[539, 360, 263, 267, 298, 277, 263, 360, 263, 270, 305, 278, 263, 322, 266, 267, 294, 286, 264, 293, 283, 267, 266, 291, 264, 268, 263, 268, 265, 267, 440]
ఇక ఫ్రూట్ ఫ్లైస్ గురించి మీరు విన్నారా?
------------------------------------------------------------------------------------------------------------------------------------------------------
['Have', 'Ġyou', 'Ġheard', 'Ġabout', 'ĠF', 'o', 'ie', 'Ġgras', '?']
[3519, 452, 3123, 629, 510, 82, 485, 26172, 34]
Have you heard about Foie gras?
------------------------------------------------------------------------------------------------------------------------------------------------------
token(<sos>) = 0 , token(<eos>) = 1
token(<unk>) = 3 , token(<pad>) = 2
token(NON_EXISTENT_TOKEN) = None
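
One way to see why a byte-level tokenizer can encode any text without `<unk>`, and why the Telugu sentence above expands into so many tokens, is to look at the UTF-8 bytes of a Telugu word. This is a minimal sketch using only the Python standard library; it does not use the trained tokenizer, and it assumes the tokenizer starts from individual bytes and merges them into larger units during training.

```python
telugu_word = "ఫ్రూట్"
utf8_bytes = telugu_word.encode("utf-8")

# Each Telugu code point occupies three bytes in UTF-8.
print(len(telugu_word), len(utf8_bytes))

# Every byte value lies in 0..255, so a base alphabet of 256 byte symbols
# can represent any input string, whether or not it was seen in training.
print(all(0 <= b <= 255 for b in utf8_bytes))

# The bytes decode back to the original text, so nothing is lost.
print(utf8_bytes.decode("utf-8") == telugu_word)
```

Because frequent byte sequences in English text are merged into whole words more readily than the three-byte sequences of Telugu characters, the Telugu sentence ends up with many more tokens than the English one.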


## Creating pytorch Dataset from HuggingFace dataset.

Let's create a PyTorch Dataset based on the HuggingFace translation dataset. This is necessary <br>
to create the DataLoader in the following steps.

In [22]:
# We need to wrap our HuggingFace dataset into the torch Dataset to integrate with pytorch DataLoader.
class HuggingFaceDatasetWrapper(torch.utils.data.Dataset):
    def __init__(self, hf_dataset: datasets.arrow_dataset.Dataset):
        """Initializes the HuggingFaceDatasetWrapper with the given dataset.

        Args:
            hf_dataset (datasets.arrow_dataset.Dataset): The hugging face dataset to be wrapped.
        """
        self.dataset = hf_dataset
    
    def __len__(self):
        """Extracts the length of the dataset.

        Returns:
            int: Length of the dataset.
        """
        return self.dataset.num_rows
    
    def __getitem__(self, index):
        """Extracts the data_point at a particular index in the dataset.

        Args:
            index (int): Index of the data_point to be extracted from the dataset.

        Returns:
            dict: Data_point at the given index in the dataset. This turns out to be a dictionary for our dataset but
                  it could be any type in general.
        """
        # Return the data_point at a particular index.
        # The index provided will always be less than the length returned by the __len__ function.
        return self.dataset[index]

In [23]:
wrapped_translation_dataset = HuggingFaceDatasetWrapper(hf_dataset=en_te_translation_dataset)
wrapped_debug_dataset = HuggingFaceDatasetWrapper(hf_dataset=en_te_debug_dataset)

In [24]:
# A sample iteration on the Pytorch dataset to show the output.
for idx, data_point in enumerate(wrapped_debug_dataset):
    if idx > 10:
        break
    print(data_point)
    

{'idx': 0, 'src': 'Have you heard about Foie gras?', 'tgt': 'ఇక ఫ్రూట్ ఫ్లైస్ గురించి మీరు విన్నారా?'}
{'idx': 1, 'src': 'I never thought of acting in films.', 'tgt': 'సూర్య సినిమాల్లో నటించాలని ఎప్పుడూ అనుకోలేదు.'}
{'idx': 2, 'src': 'Installed Software', 'tgt': 'స్థాపించబడిన సాఫ్ట్\u200dవేర్'}
{'idx': 3, 'src': 'A case has been registered under Sections 302 and 376, IPC.', 'tgt': 'నిందితులపై సెక్షన్ 376 మరియు 302ల కింద కేసు నమోదు చేశాం.'}
{'idx': 4, 'src': 'Of this, 10 people succumbed to the injuries.', 'tgt': 'అందులో 10 మంది తీవ్రంగా గాయపడ్డారు.'}
{'idx': 5, 'src': 'Her acting has been praised by critics.', 'tgt': 'నటనకు గాను విమర్శకుల నుంచి ప్రశంసలు పొందింది.'}
{'idx': 6, 'src': 'The Bibles viewpoint on this is clearly indicated at Colossians 3: 9: Do not be lying to one another.', 'tgt': 'ఈ విషయంపై బైబిలు దృక్కోణం కొలొస్సయులు 3 :\u2060 9లో “ఒకనితో ఒకడు అబద్ధమాడకుడి ” అని స్పష్టంగా సూచించబడింది.'}
{'idx': 7, 'src': 'The incident was recorded in the CCTV footage.', 'tgt': 'ఈ ప్రమాద 

In [25]:
# 'len' function returns the number of examples in the dataset. 'len' internally calls the '__len__'
# method which is implemented above to return the 'num_rows' which is the number of examples in the
# dataset.
print(len(wrapped_debug_dataset))

200
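
The wrapper above defines only `__len__` and `__getitem__`, yet the `for` loop over it works. This is presumably because Python falls back to calling `__getitem__` with 0, 1, 2, ... until an `IndexError` is raised. The sketch below shows that fallback on a hypothetical `ToyDataset` backed by a plain list.

```python
class ToyDataset:
    """Hypothetical map-style dataset exposing only __len__ and __getitem__."""
    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        # Lists raise IndexError for out-of-range indices, which ends the iteration.
        return self.rows[index]


toy = ToyDataset([{"idx": 0}, {"idx": 1}, {"idx": 2}])
print(len(toy))                      # 3
print([row["idx"] for row in toy])   # [0, 1, 2]
```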


## Creating basic pytorch DataLoaders

Let's create a DataLoader object using the above translation dataset as shown in the `step_3_datasets_and_dataloaders.ipynb` notebook.

In [32]:
# The role of collate_fn is to iterate on each batch of the dataset and convert the data in the batch into the format 
# that is required by the model (transformers in this case). A more extensively commented collate_fn is implemented 
# below in the same notebook. You will understand better there if something is not clear in this function.
def transformer_collate_fn(batch: List[Dict]):
    """Converts the raw data in the batch into the format required by the model. In this case, it
       converts the text to token ids, adds start, end, and padding tokens and batches the converted 
       data to be used by the model. 

    Args:
        batch (List[Dict]): A batch of raw data points from the dataset.

    Returns:
        Tuple: A tuple containing the source and target tensors in the batch.
    """
    # Maximum length of the sentence. Every sentence in the batch will be brought to this length by padding.
    MAX_LEN = 100
    # Holds the list of processed source sentences (English sentences) from the batch. 
    src_list = []
    # Holds the list of processed target sentences (Telugu sentences) from the batch. The data at corresponding indices 
    # in src_list and tgt_list are a pair i.e., tgt sentence at index 'j' is translation of src sentence at index 'j'.
    tgt_list = []
    # start of sentence token id to be added at the beginning of the sentence.
    sos_id = torch.tensor([0])
    # end of sentence token id to be added at the end of the sentence.
    eos_id = torch.tensor([1])
    index = 0
    for data_point in batch:
        src_sentence = data_point["src"]
        tgt_sentence = data_point["tgt"]
        # This if block is just to print the first sentence pair in the batch to understand how the batch looks.
        if index == 0:
            print(f"src_sentence: {src_sentence}")
            print(f"tgt_sentence: {tgt_sentence}")
            index += 1
        # List of tensors to concatenate and dimension along which we want to concatenate the list of tensors.
        processed_src = torch.cat([sos_id, torch.tensor(spacy_english_tokenizer.encode(src_sentence)), eos_id], dim=0)
        processed_tgt = torch.cat([sos_id, torch.tensor(spacy_telugu_tokenizer.encode(tgt_sentence)), eos_id], dim=0)
        # value parameter corresponds to the index we use to represent the padding token.
        processed_src = torch.nn.functional.pad(processed_src, (0, MAX_LEN - len(processed_src)), value=2)
        processed_tgt = torch.nn.functional.pad(processed_tgt, (0, MAX_LEN - len(processed_tgt)), value=2)
        src_list.append(processed_src)
        tgt_list.append(processed_tgt)
    src = torch.stack(src_list)
    tgt = torch.stack(tgt_list)
    return (src, tgt)

In [None]:
dataloader = DataLoader(dataset=wrapped_debug_dataset, batch_size=32, collate_fn=transformer_collate_fn)
# Debug dataset has 200 examples. We used the batch_size of 32. So, the number of batches or the length
# of the dataloader is ceil(200/32) = 7
print(len(dataloader))
# See that all the English and Telugu sentences are converted into token ids. Also note that start token id (0)
# is added at the beginning of every sequence, end token id (1) is added at the end (before padding tokens)
# of every sequence, and pad token id (2) is optionally added to a few sequences depending on the length of 
# the sequence.
for src, tgt in dataloader:
    print(f"src: {src}")
    print(f"tgt: {tgt}")
    print("-" * 150)

7
src_sentence: Have you heard about Foie gras?
tgt_sentence: ఇక ఫ్రూట్ ఫ్లైస్ గురించి మీరు విన్నారా?
src: tensor([[   0, 1017,   46,  ...,    2,    2,    2],
        [   0,   33,  375,  ...,    2,    2,    2],
        [   0,    3, 5208,  ...,    2,    2,    2],
        ...,
        [   0,   12, 1107,  ...,    2,    2,    2],
        [   0,  125,   54,  ...,    2,    2,    2],
        [   0, 1684, 1076,  ...,    2,    2,    2]])
tgt: tensor([[    0,    94, 23082,  ...,     2,     2,     2],
        [    0,  3189,  1262,  ...,     2,     2,     2],
        [    0, 14009,     3,  ...,     2,     2,     2],
        ...,
        [    0,  7898,   290,  ...,     2,     2,     2],
        [    0,    15,     3,  ...,     2,     2,     2],
        [    0,   648,  1622,  ...,     2,     2,     2]])
------------------------------------------------------------------------------------------------------------------------------------------------------
src_sentence: Will you enter politics?
tgt_senten
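
The batch count printed above follows from simple arithmetic, sketched here for the 200 examples of the debug dataset and a batch size of 32. The last batch is smaller than the others because 200 is not a multiple of 32.

```python
import math

num_examples, batch_size = 200, 32
num_batches = math.ceil(num_examples / batch_size)
# Every batch except the last one is full.
last_batch_size = num_examples - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)  # 7 8
```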

## Creating length aware pytorch DataLoaders

The above created DataLoader has a few issues and is not optimal to be used with Transformers. We will discuss <br>
the issues below and show how to handle these issues.

In [None]:
# Issues with the DataLoader above:
#
# ISSUE 1: The above DataLoader batches sentences in dataset order, without looking at their lengths. So, it is 
#          possible that a sentence of length 1 is batched together with a sentence of length 100. This causes the 
#          data pre-processor to add 97 pad tokens (excluding <sos>, <eos> ids) to a sentence of length 1 which 
#          results in a huge waste of computation during training. 
#
# ISSUE 2: In the above DataLoader, every batch contains sentences of length 100 exactly. This is not optimal.
#          Ideally, different batches can contain sentences of different lengths but every sentence within a batch 
#          should be of same length.
#
# ISSUE 3: This is a minor issue but note that, if any sentence results in more than 100 tokens, the above code 
#          will fail badly.
#
# We will handle the above 3 issues in the next parts.
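
A minimal sketch of one way to address issues 2 and 3, written with plain Python lists instead of tensors: truncate each sequence to a maximum length, then pad only up to the longest sequence in the batch. `pad_batch` is a hypothetical helper, and the pad id 2 is assumed to match the tokenizers above.

```python
from typing import List

PAD_ID = 2  # assumed id of <pad>


def pad_batch(sequences: List[List[int]], max_len: int = 100) -> List[List[int]]:
    """Hypothetical helper: truncates to max_len, then pads to the longest sequence in the batch."""
    truncated = [seq[:max_len] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    return [seq + [PAD_ID] * (longest - len(seq)) for seq in truncated]


batch = [[0, 7, 1], [0, 5, 6, 8, 1], [0, 9, 9, 1]]
padded = pad_batch(batch)
print(padded)  # [[0, 7, 1, 2, 2], [0, 5, 6, 8, 1], [0, 9, 9, 1, 2]]

# Pad slots with per-batch padding versus padding everything to 100.
pad_count = sum(len(row) for row in padded) - sum(len(seq) for seq in batch)
fixed_pad_count = 100 * len(batch) - sum(len(seq) for seq in batch)
print(pad_count, fixed_pad_count)  # 3 288
```

Note that plain truncation like this drops the trailing `<eos>` id of an over-long sequence; whether that is acceptable depends on how the model is trained.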


In [33]:
# DataLoader expects the sampler to provide the indices of the data points from the Dataset. These indices are used 
# to construct the batches internally that are later passed to 'collate_fn'. We are creating a sampler which groups 
# sentences of similar lengths into batches. We still need to be aware that the grouping is not perfect and it is 
# done here based on the number of words separated by 'spaces' in the sentence and not the actual number of tokens
# we obtain while using the tokenization algorithms. The actual number of tokens could be very different depending on 
# the tokenization algorithm which splits sentences (and words into tokens) based on a number of other factors.
class LengthAwareSampler(Sampler):
    def __init__(self, dataset: HuggingFaceDatasetWrapper, batch_size: int):
        # dataset is the Dataset wrapper we created on top of HuggingFace dataset.
        self.dataset = dataset
        self.batch_size = batch_size
        self.sorted_indices = self.extract_lengths()
        print("sorted_indices: ", self.sorted_indices)
    
    # We don't want the entire dataset to be loaded into the memory at once. So, we first iterate over the entire 
    # dataset, extract the lengths of the sentences and sort the indices of the sentences according to the sentence 
    # lengths. This is to ensure that sentences of similar lengths are grouped together in a batch to minimize the 
    # overall padding necessary. When we iterate over the dataset, we only load the necessary data from the dataset 
    # into memory and not the entire dataset at once --> This loading logic could be a little different based on the 
    # hugging face implementations. Please look into the hugging face documentation for more details.
    def extract_lengths(self) -> list[int]:
        """Sorts the indices of the dataset based on the lengths of the sentences in the dataset.

        Returns:
            list[int]: Indices of the dataset sorted in ascending order (small to big) based on the lengths of 
                       the sentences in the dataset.
        """
        # Note that the lengths are calculated based on the number of words separated by space in the sentence and 
        # not the number of tokens.
        self.lengths = [len(data_point["src"].split(" ")) + len(data_point["tgt"].split(" ")) for data_point in self.dataset]
        print("calculated (src + tgt) sentence lengths: ", self.lengths)
        # Create an indices list in the first step i.e., [0, 1, 2, ..., 199] --> For debug_dataset.
        # Sort the indices list based on the lengths calculated in the previous step i.e., after sorting
        # value at 0th index is the example for which the len(src_sentence) + len(tgt_sentence) is minimum and
        # value at 199th index is the example for which the len(src_sentence) + len(tgt_sentence) is maximum. 
        return sorted(range(len(self.dataset)), key=lambda index: self.lengths[index])

    # The __iter__ function is called once per epoch. The returned iterator is iterated on to get the list of indices 
    # for the data points in a batch.
    def __iter__(self) -> Iterator[int]:
        """Provides an iterator that yields the indices of the dataset grouped into batches of similar sentence lengths.

        Returns:
            Iterator[int]: An iterator over the dataset indices. Indices that fall in the same batch have similar 
                           sentence lengths, and the order of the batches is shuffled on every call.
        """
        # Create the batches of indices based on the sentence lengths. 
        # batches look like: [[0, 5, 90], [23, 4, 7], ...] if batch_size is 3.
        # [0, 5, 90] is a batch corresponding to the sentences at indices 0, 5 and 90 in the original dataset.
        # [23, 4, 7] is a batch corresponding to the sentences at indices 23, 4 and 7 in the original dataset.
        batches = [self.sorted_indices[index: index + self.batch_size] for index in range(0, len(self.dataset), self.batch_size)]
        # Shuffle the batches to ensure that the order of batches is different in every epoch. We want the model to 
        # see the data in different order in every epoch. So, we shuffle the order of the batches within the dataset.
        random.shuffle(batches)
        # Flatten the list of batches to get an iterable of indices. At the end, the dataloader expects an iterable of
        # indices to get the data points from the dataset. So, we convert the list of batches back to an iterable of 
        # indices.
        print("batches: ", batches)
        return iter([index for batch in batches for index in batch])
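
The batching logic in `__iter__` can be tried out without a dataset. The sketch below is a standalone illustration and not part of the notebook's classes: `make_batches`, `rechunk` and the toy lengths are hypothetical names and values. `rechunk` mimics how a DataLoader that is given `sampler=` together with `batch_size=` groups the flat index stream into consecutive chunks.

```python
import random


def make_batches(lengths, batch_size, seed=0):
    """Sort indices by length, cut them into batches, and shuffle the batch order."""
    sorted_indices = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [sorted_indices[i: i + batch_size] for i in range(0, len(sorted_indices), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches


def rechunk(flat_indices, batch_size):
    """Group a flat index stream into consecutive chunks of batch_size."""
    return [flat_indices[i: i + batch_size] for i in range(0, len(flat_indices), batch_size)]


toy_lengths = [12, 4, 20, 13, 34, 15, 4, 8]  # made-up (src + tgt) word counts
batches = make_batches(toy_lengths, batch_size=4)
flat = [index for batch in batches for index in batch]

# 8 examples with batch_size=4: every batch is full, so re-chunking the flat
# stream gives back exactly the shuffled batches.
print(rechunk(flat, batch_size=4) == batches)

# If a short batch ends up in the middle, the chunk boundaries shift and the
# re-chunked batches no longer match the length-sorted ones.
uneven_batches = [[0, 1, 2, 3], [8, 9], [4, 5, 6, 7]]
uneven_flat = [index for batch in uneven_batches for index in batch]
print(rechunk(uneven_flat, batch_size=4))
```

When the dataset size is not a multiple of the batch size, one option (an assumption about how this could be handled, not something the notebook does) is to keep the short batch last instead of shuffling it with the others.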

In [34]:
wrapped_debug_dataset = HuggingFaceDatasetWrapper(hf_dataset=en_te_debug_dataset)

In [35]:
# Create a dummy length aware sampler object and play around with it.
length_aware_sampler = LengthAwareSampler(dataset=wrapped_debug_dataset, batch_size=10)

calculated (src + tgt) sentence lengths:  [12, 12, 4, 20, 13, 13, 34, 15, 4, 8, 4, 24, 14, 9, 10, 15, 18, 5, 40, 21, 7, 27, 15, 13, 21, 7, 7, 6, 6, 15, 48, 21, 7, 45, 23, 8, 30, 25, 27, 23, 5, 40, 7, 27, 6, 11, 14, 42, 14, 18, 25, 30, 12, 11, 11, 25, 33, 13, 8, 23, 13, 6, 31, 15, 11, 18, 31, 13, 16, 19, 15, 9, 6, 9, 25, 11, 49, 15, 21, 7, 6, 18, 8, 6, 22, 12, 9, 22, 9, 12, 7, 18, 11, 22, 8, 23, 23, 41, 5, 5, 7, 15, 8, 7, 14, 21, 80, 19, 13, 4, 4, 17, 15, 43, 27, 22, 9, 12, 10, 25, 22, 32, 10, 5, 6, 10, 17, 6, 29, 10, 7, 14, 4, 7, 22, 52, 23, 5, 68, 7, 12, 13, 8, 21, 30, 34, 9, 8, 61, 43, 29, 20, 9, 40, 31, 8, 10, 10, 6, 7, 35, 10, 5, 14, 48, 15, 19, 22, 31, 16, 6, 28, 45, 19, 36, 6, 9, 9, 9, 8, 15, 5, 43, 17, 15, 14, 32, 11, 25, 6, 6, 42, 7, 15, 7, 24, 8, 63, 14, 14]
sorted_indices:  [2, 8, 10, 109, 110, 132, 17, 40, 98, 99, 123, 137, 162, 181, 27, 28, 44, 61, 72, 80, 83, 124, 127, 158, 170, 175, 189, 190, 20, 25, 26, 32, 42, 79, 90, 100, 103, 130, 133, 139, 159, 192, 194, 9, 35, 58, 8

In [36]:
# Iterate on the sampler object to retrieve the indices of the data points (examples).
for data_point_index in length_aware_sampler:
    print(data_point_index)

batches:  [[87, 93, 115, 120, 134, 167, 34, 39, 59, 95], [96, 136, 11, 195, 37, 50, 55, 74, 119, 188], [57, 60, 67, 108, 141, 12, 46, 48, 104, 131], [77, 101, 112, 165, 180, 184, 193, 68, 169, 111], [163, 185, 198, 199, 7, 15, 22, 29, 63, 70], [21, 38, 43, 114, 171, 128, 150, 36, 51, 144], [173, 3, 151, 19, 24, 31, 78, 105, 143, 84], [123, 137, 162, 181, 27, 28, 44, 61, 72, 80], [0, 1, 52, 85, 89, 117, 140, 4, 5, 23], [174, 18, 41, 153, 97, 47, 191, 113, 149, 182], [83, 124, 127, 158, 170, 175, 189, 190, 20, 25], [126, 183, 16, 49, 65, 81, 91, 69, 107, 166], [2, 8, 10, 109, 110, 132, 17, 40, 98, 99], [26, 32, 42, 79, 90, 100, 103, 130, 133, 139], [159, 192, 194, 9, 35, 58, 82, 94, 102, 142], [147, 155, 179, 196, 13, 71, 73, 86, 88, 116], [146, 152, 176, 177, 178, 14, 118, 122, 125, 129], [62, 66, 154, 168, 121, 186, 56, 6, 145, 160], [156, 157, 161, 45, 53, 54, 64, 75, 92, 187], [33, 172, 30, 164, 76, 135, 148, 197, 138, 106]]
87
93
115
120
134
167
34
39
59
95
96
136
11
195
37
50
55
74
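
Why group by length at all? Every batch is padded to its longest sequence, so mixing short and long sentences in one batch wastes positions on padding. The sketch below compares the number of pad positions for randomly ordered batches and for length-sorted batches. It is a simplified illustration: the lengths are randomly generated stand-ins, a single length per example is used (the notebook pads src and tgt separately), and `padded_slots` is a hypothetical helper.

```python
import random


def padded_slots(lengths, order, batch_size):
    """Count the pad positions needed when each batch is padded to its longest example."""
    total = 0
    for start in range(0, len(order), batch_size):
        batch_lengths = [lengths[i] for i in order[start: start + batch_size]]
        total += sum(max(batch_lengths) - n for n in batch_lengths)
    return total


rng = random.Random(0)
toy_lengths = [rng.randint(4, 80) for _ in range(200)]  # made-up sentence lengths

# Batches taken in a random order versus batches taken in length-sorted order.
random_order = list(range(len(toy_lengths)))
rng.shuffle(random_order)
sorted_order = sorted(range(len(toy_lengths)), key=lambda i: toy_lengths[i])

print("padding with random batches:       ", padded_slots(toy_lengths, random_order, 10))
print("padding with length-sorted batches:", padded_slots(toy_lengths, sorted_order, 10))
```

Shuffling only the order of the batches, as the sampler does, leaves the padding count of the length-sorted case unchanged because the contents of each batch stay the same.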

In [37]:
def length_aware_collate_fn(batch: list, 
                            english_tokenizer: BaseTokenizer, 
                            telugu_tokenizer: BaseTokenizer, 
                            sos_id: int = 0, 
                            eos_id: int = 1, 
                            pad_id: int = 2) -> Tuple[Tensor, Tensor]:
    """Converts the raw data in the batch into the format required by the MachineTranslationTransformer model. It encodes the
       sentences into token ids, adds start, end and padding tokens and batches the converted data back to be used by the 
       model.

    Args:
        batch (list): Holds the raw data points (the actual english (src) and telugu (tgt) sentences batched) from the dataset.
        english_tokenizer (BaseTokenizer): Tokenizer to tokenize and encode the english sentences into corresponding token ids.
        telugu_tokenizer (BaseTokenizer): Tokenizer to tokenize and encode the telugu sentences into corresponding token ids.
        sos_id (int, optional): start of sentence token id. Defaults to 0.
        eos_id (int, optional): end of sentence token id. Defaults to 1.
        pad_id (int, optional): padding token id. Defaults to 2.

    Returns:
        Tuple[Tensor, Tensor]: Returns the encoded and padded source and target tensors for the batch, which can be used by the 
                               transformer model as input.
    """
    print("batch_type: ", type(batch))
    # Holds all the encoded src sentences (english sentences) from the batch. An encoded sentence is a sentence split 
    # into tokens, with each token converted into its integer id.
    # [[223, 4345, 545], [23, 234, 67, 12]] is an example for the processed_src_sentences variable where
    # [223, 4345, 545] represents one encoded sentence from the batch. No <sos> or <eos> ids are added on the src side.
    processed_src_sentences = []
    # Holds all the encoded tgt sentences (telugu sentences) from the batch, e.g., [[0, 91, 17, 1], ...] where 0 at 
    # the start is <sos> and 1 at the end is <eos>.
    processed_tgt_sentences = []
    index = 0
    for data_point in batch:
        # src is english sentence.
        src_sentence = data_point["src"]
        # tgt is telugu sentence.
        tgt_sentence = data_point["tgt"]
        # start of sentence id to append at the start of every sentence.
        sos_tensor = torch.tensor([sos_id], dtype=torch.int64)
        # end of sentence id to append at the end of every sentence.
        eos_tensor = torch.tensor([eos_id], dtype=torch.int64)
        if index < 10:
            # This conditional block is to print during experiments and can be removed from the actual code later.
            print(f"src_sentence: {src_sentence}")
            print(f"tgt_sentence: {tgt_sentence}")
            index += 1
        # It is important to set the dtype to 'torch.int64' because we map token_ids (integers) to their embeddings in the 
        # transformer model. '<sos>' and '<eos>' tokens are not added to the src sentences. They are only added to the tgt 
        # sentences.
        encoded_src_sentence = torch.tensor(english_tokenizer.encode(src_sentence), dtype=torch.int64)
        # prepares the tensor in the format 'token_id(<sos>) token_id1 token_id2 ... last_token_id token_id(<eos>)'. 
        encoded_tgt_sentence = torch.cat([sos_tensor, torch.tensor(telugu_tokenizer.encode(tgt_sentence), dtype=torch.int64), eos_tensor], dim=0)
        processed_src_sentences.append(encoded_src_sentence)
        processed_tgt_sentences.append(encoded_tgt_sentence)
    # Finds the maximum length of the src sequences in the batch so that all the src sequences are padded to the 
    # same length i.e., max_src_seq_len.
    max_src_seq_len = max(src_ids.size(0) for src_ids in processed_src_sentences)
    # Finds the maximum length of the tgt sequences in the batch so that all the tgt sequences are padded to the 
    # same length i.e., max_tgt_seq_len.
    max_tgt_seq_len = max(tgt_ids.size(0) for tgt_ids in processed_tgt_sentences)
    # We pad the sentences with the pad token so that every sentence in the batch is of the same length. Padding is 
    # added on the right, so for the tgt sentences the pad tokens come after (not before) the <eos> token.
    padded_src = [torch.nn.functional.pad(input=src_ids, pad=(0, max_src_seq_len - src_ids.size(0)), mode="constant", value=pad_id) for src_ids in processed_src_sentences]
    padded_tgt = [torch.nn.functional.pad(input=tgt_ids, pad=(0, max_tgt_seq_len - tgt_ids.size(0)), mode="constant", value=pad_id) for tgt_ids in processed_tgt_sentences]
    # Stack the padded tensors along dimension 0. src becomes a 2D tensor of shape (BATCH_SIZE, max_src_seq_len) and 
    # tgt becomes a 2D tensor of shape (BATCH_SIZE, max_tgt_seq_len).
    src = torch.stack(tensors=padded_src, dim=0)
    tgt = torch.stack(tensors=padded_tgt, dim=0)
    return (src, tgt)
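
The padding step is easier to follow on a tiny example. The sketch below reproduces it with plain Python lists instead of tensors so that it runs without any dependencies. The token ids are made up, `pad_batch` is a hypothetical helper, and the special token ids are the defaults of `length_aware_collate_fn` (0, 1, 2).

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


SOS_ID, EOS_ID, PAD_ID = 0, 1, 2  # same defaults as length_aware_collate_fn

# Made-up token ids standing in for the tokenizer output.
src_batch = [[223, 4345, 545], [23, 234]]
tgt_batch = [[91, 17], [301, 52, 8, 44]]

# Only the target side gets <sos> ... <eos>; the source side is left as it is.
tgt_batch = [[SOS_ID] + seq + [EOS_ID] for seq in tgt_batch]

src = pad_batch(src_batch, PAD_ID)
tgt = pad_batch(tgt_batch, PAD_ID)
print(src)  # [[223, 4345, 545], [23, 234, 2]]
print(tgt)  # [[0, 91, 17, 1, 2, 2], [0, 301, 52, 8, 44, 1]]
```

For tensors, `torch.nn.utils.rnn.pad_sequence` with `batch_first=True` and `padding_value=pad_id` is a possible shorter alternative to padding each tensor by hand; the notebook does not use it.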

In [38]:
def collate_fn(batch):
    """The collate_fn is called by the DataLoader with just the batch of data points from the dataset. So, we wrap the 
       length_aware_collate_fn function with the required parameters to create a collate_fn that can be used by the 
       DataLoader.

    Args:
        batch (list): Batch of raw data points from the dataset.

    Returns:
        Tuple[Tensor, Tensor]: Returns the batch of data points in the format required by the MachineTranslationTransformer model.
    """
    # <sos> and <eos> are added to the telugu (tgt) sentences but are looked up in the english tokenizer here, and the 
    # single <pad> id is used for both sides. This only works if both tokenizers assign the same ids to the special 
    # tokens.
    sos_id = spacy_english_tokenizer.get_token_id(token="<sos>")
    eos_id = spacy_english_tokenizer.get_token_id(token="<eos>")
    pad_id = spacy_telugu_tokenizer.get_token_id(token="<pad>")
    return length_aware_collate_fn(batch=batch, english_tokenizer=spacy_english_tokenizer, telugu_tokenizer=spacy_telugu_tokenizer, sos_id=sos_id, eos_id=eos_id, pad_id=pad_id)
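
An alternative to writing a wrapper function by hand is to bind the extra arguments with `functools.partial`. The sketch below shows the idea on a hypothetical stand-in function, `toy_collate`, so that it runs on its own; it is not the notebook's `length_aware_collate_fn`.

```python
from functools import partial


def toy_collate(batch, sos_id=0, eos_id=1, pad_id=2):
    """Hypothetical stand-in for length_aware_collate_fn: wraps each item in <sos>/<eos>."""
    return [[sos_id] + item + [eos_id] for item in batch]


# Bind the extra arguments once. The result takes only `batch`, which is the
# call signature a DataLoader uses for its collate_fn.
bound_collate = partial(toy_collate, sos_id=7, eos_id=8, pad_id=9)

print(bound_collate([[10, 11], [12]]))  # [[7, 10, 11, 8], [7, 12, 8]]
```

With this approach the special token ids are looked up once when the partial is created, rather than on every batch.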

In [39]:
length_aware_sampler = LengthAwareSampler(dataset=wrapped_debug_dataset, batch_size=10)
dataloader = DataLoader(dataset=wrapped_debug_dataset, batch_size=10, sampler=length_aware_sampler, num_workers=0, collate_fn=collate_fn)

calculated (src + tgt) sentence lengths:  [12, 12, 4, 20, 13, 13, 34, 15, 4, 8, 4, 24, 14, 9, 10, 15, 18, 5, 40, 21, 7, 27, 15, 13, 21, 7, 7, 6, 6, 15, 48, 21, 7, 45, 23, 8, 30, 25, 27, 23, 5, 40, 7, 27, 6, 11, 14, 42, 14, 18, 25, 30, 12, 11, 11, 25, 33, 13, 8, 23, 13, 6, 31, 15, 11, 18, 31, 13, 16, 19, 15, 9, 6, 9, 25, 11, 49, 15, 21, 7, 6, 18, 8, 6, 22, 12, 9, 22, 9, 12, 7, 18, 11, 22, 8, 23, 23, 41, 5, 5, 7, 15, 8, 7, 14, 21, 80, 19, 13, 4, 4, 17, 15, 43, 27, 22, 9, 12, 10, 25, 22, 32, 10, 5, 6, 10, 17, 6, 29, 10, 7, 14, 4, 7, 22, 52, 23, 5, 68, 7, 12, 13, 8, 21, 30, 34, 9, 8, 61, 43, 29, 20, 9, 40, 31, 8, 10, 10, 6, 7, 35, 10, 5, 14, 48, 15, 19, 22, 31, 16, 6, 28, 45, 19, 36, 6, 9, 9, 9, 8, 15, 5, 43, 17, 15, 14, 32, 11, 25, 6, 6, 42, 7, 15, 7, 24, 8, 63, 14, 14]
sorted_indices:  [2, 8, 10, 109, 110, 132, 17, 40, 98, 99, 123, 137, 162, 181, 27, 28, 44, 61, 72, 80, 83, 124, 127, 158, 170, 175, 189, 190, 20, 25, 26, 32, 42, 79, 90, 100, 103, 130, 133, 139, 159, 192, 194, 9, 35, 58, 8

In [40]:
print(dataloader, type(dataloader))

<torch.utils.data.dataloader.DataLoader object at 0x7fd54910be20> <class 'torch.utils.data.dataloader.DataLoader'>


In [41]:
data_iterator = iter(dataloader)
print(data_iterator, type(data_iterator))

<torch.utils.data.dataloader._SingleProcessDataLoaderIter object at 0x7fd536bbc640> <class 'torch.utils.data.dataloader._SingleProcessDataLoaderIter'>


In [43]:
batched_data = next(data_iterator)
print("-" * 150)
print("shapes: ", batched_data[0].shape, batched_data[1].shape)
# Notice that the <sos> and <eos> token ids are only added to the target sentences and not to the source sentences.
# This is in line with what will be fed to the transformer model during training.
print("batched_data: ", batched_data)

batch_type:  <class 'list'>
src_sentence: Its called illeism.
tgt_sentence: ఇది దిగజారుడుతనానికి గా పిలవబడింది.
src_sentence: Everyone accepted that.
tgt_sentence: దానికి అందరూ వత్తాసు పలికారు.
src_sentence: Heavy rainfall...
tgt_sentence: ఏకధాటిగా కురుస్తున్న భారీ వర్షాలకు .
src_sentence: 5 lakh would be provided.
tgt_sentence: 5లక్షలు సాయం అందజేశారు.
src_sentence: The children were terrified.
tgt_sentence: పిల్లలు తీవ్ర భయాందోళనకు గురయ్యారు.
src_sentence: Doctors, too, are delighted.
tgt_sentence: వైద్యులు సంతోషం వ్యక్తం చేస్తున్నారు.
src_sentence: But they will disappear soon.
tgt_sentence: కానీ వెంటనే అదృశ్యమవుతుంది.
src_sentence: From culture to commerce.
tgt_sentence: సంస్కృతి నుంచి వాణిజ్యం వ‌ర‌కు
src_sentence: It means elder sister.
tgt_sentence: అంటే పెద్దన్నయ్య అని అర్థం.
src_sentence: Peoples lifestyle started changing.
tgt_sentence: మనుషుల లైఫ్ స్టైల్ మారింది.
-------------------------------------------------------------------------------------------------------------------

In [44]:
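# Let's verify the correctness of the dataloader.
# The token ids printed here by the tokenizers should match the ids in the row of the batch that holds this
# sentence pair, for both the English sentence and the Telugu sentence. Since the batch order is shuffled on
# every run, pick a sentence pair from the batch printed above when re-running this check.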
english_sentence = "The smartphone was recently launched in Indonesia."
print(spacy_english_tokenizer.encode(text=english_sentence), "\n")
telugu_sentence = "ఈ స్మార్ట్ ఫోన్ ఇప్పటికే ఇండోనేషియాలో లాంచ్ అయింది."
print(spacy_telugu_tokenizer.encode(text=telugu_sentence))

[12, 1697, 18, 345, 520, 10, 4726, 4] 

[6, 1628, 456, 149, 15006, 1997, 432, 4]
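
To compare a row of the padded target batch against the tokenizer output, the special tokens have to be removed first. The sketch below shows one way to do that on a plain list. `strip_special_tokens` is a hypothetical helper, and the special token ids (0, 1, 2) are assumed values that do not occur among the regular token ids; in the notebook they would come from `get_token_id`.

```python
def strip_special_tokens(row, sos_id, eos_id, pad_id):
    """Drop padding, then a leading <sos> and a trailing <eos> if present."""
    ids = [token_id for token_id in row if token_id != pad_id]
    if ids and ids[0] == sos_id:
        ids = ids[1:]
    if ids and ids[-1] == eos_id:
        ids = ids[:-1]
    return ids


SOS_ID, EOS_ID, PAD_ID = 0, 1, 2  # assumed special token ids

# Telugu token ids taken from the tokenizer output above.
expected = [6, 1628, 456, 149, 15006, 1997, 432, 4]
# A target row as the collate function would lay it out: <sos> ids <eos> <pad> <pad>.
padded_row = [SOS_ID] + expected + [EOS_ID] + [PAD_ID, PAD_ID]

print(strip_special_tokens(padded_row, SOS_ID, EOS_ID, PAD_ID) == expected)
```

A tensor row from the batch can be converted with `.tolist()` before it is passed to the helper.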
