Helps prepare and load data; implemented using the `DataLoader` class in PyTorch

Efficient batching and shuffling of data

Efficient loading and preprocessing of data


## Pipeline

1. Get the dataset
2. Tokenize the data
3. Numericalize the data (map each token to an integer index)
4. Fix a constant Batch size
5. Turn it into a Tensor

If we set `batch_first=True`:
    the first dimension of the output tensor is the `batch size`
    and the second dimension is the `sequence length`.
Otherwise:
    the order is reversed: `sequence length` first, then `batch size`.
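
The five pipeline steps can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual implementation: the toy sentences, the whitespace tokenizer (`str.split`), the hand-built `toy_vocab` dictionary, and the `<pad>` entry at index 0 are all assumptions made for the example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# 1. Get the dataset (toy sentences, assumed for illustration)
toy_sentences = ["the cat sat", "the dog sat down", "a cat"]

# 2. Tokenize (str.split is a stand-in for a real tokenizer)
tokenized = [sentence.split() for sentence in toy_sentences]

# 3. Numericalize: map each token to an integer index; 0 is reserved for padding
toy_vocab = {"<pad>": 0}
for tokens in tokenized:
    for token in tokens:
        toy_vocab.setdefault(token, len(toy_vocab))
indexed = [torch.tensor([toy_vocab[t] for t in tokens]) for tokens in tokenized]

# 4. Fix a constant batch size and group the samples accordingly
toy_batch_size = 2
batches = [indexed[i:i + toy_batch_size]
           for i in range(0, len(indexed), toy_batch_size)]

# 5. Turn each batch into one tensor, padding shorter sequences with 0
first_batch = pad_sequence(batches[0], batch_first=True, padding_value=0)
print(first_batch.shape)  # torch.Size([2, 4])
```

The last batch may hold fewer samples than `toy_batch_size` when the dataset size is not a multiple of it.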

## Installing required Libraries

In [2]:
!pip install nltk
!pip install transformers==4.42.1
!pip install sentencepiece
!pip install spacy
!pip install numpy==1.26.0
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm
!pip install torch==2.2.2 torchtext==0.17.2
!pip install torchdata==0.7.1
!pip install portalocker
!pip install numpy pandas
!pip install numpy scikit-learn

Collecting numpy>=1.19.0 (from spacy)
  Obtaining dependency information for numpy>=1.19.0 from https://files.pythonhosted.org/packages/2b/3e/e7247c1d4f15086bb106c8d43c925b0b2ea20270224f5186fa48d4fb5cbd/numpy-2.2.4-cp311-cp311-macosx_14_0_arm64.whl.metadata
  Using cached numpy-2.2.4-cp311-cp311-macosx_14_0_arm64.whl.metadata (62 kB)
Using cached numpy-2.2.4-cp311-cp311-macosx_14_0_arm64.whl (5.4 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.0
    Uninstalling numpy-1.26.0:
      Successfully uninstalled numpy-1.26.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tables 3.8.0 requires blosc2~=2.0.0, which is not installed.
tables 3.8.0 requires cython>=0.29.21, which is not installed.
gensim 4.3.0 requires FuzzyTM>=0.4.0, which is not installed.
numba 0.57.1 requires numpy<1.25,>=1.21, b

## Importing Required Libraries

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random

import torchtext
from transformers import BertTokenizer,XLNetTokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer
from torch.utils.data import Dataset,DataLoader
from torchtext.datasets import multi30k,Multi30k
from typing import Iterable,List
from torch.nn.utils.rnn import pad_sequence


## Dataset

## **Data set**

A data set in **PyTorch** is an object that represents a collection of data samples. Each data sample typically consists of one or more input features and their corresponding target labels. You can also use your data set to transform your data as needed.

## **Data loader**

A data loader in **PyTorch** is responsible for efficiently loading and batching data from a data set. It abstracts away the process of iterating over a data set, shuffling it, and dividing it into batches for training. In NLP applications, text processing and transformation are often done in the data loader rather than only in the data set.

Data loaders have several key parameters, including the data set to load from, batch size (determining how many samples per batch), shuffle (whether to shuffle the data for each epoch), and more. Data loaders also provide an iterator interface, making it easy to iterate over batches of data during training.

Now, you may ask, '**What is an iterator?**'

An iterator is an object that can be looped over. It contains elements that can be iterated through and typically includes two methods, `__iter__()` and `__next__()`. When there are no more elements to iterate over, it raises a **`StopIteration`** exception.

Iterators are commonly used to traverse large data sets without loading all elements into memory simultaneously, making the process more memory-efficient. In PyTorch, data sets are not necessarily iterators, but every data loader is iterable: calling `iter()` on a data loader returns an iterator over its batches.
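
As a small illustration of the iterator protocol described above, the sketch below steps through a data loader by hand. The three-element list `toy_data` is an assumed stand-in for a real data set (a plain list works here because it supports `len()` and indexing).

```python
from torch.utils.data import DataLoader

# A plain list can serve as a data set: it has __len__ and __getitem__
toy_data = ["a", "b", "c"]
toy_loader = DataLoader(toy_data, batch_size=2, shuffle=False)

batch_iterator = iter(toy_loader)   # calls toy_loader.__iter__()
first = next(batch_iterator)        # calls batch_iterator.__next__()
second = next(batch_iterator)
print(first, second)

exhausted = False
try:
    next(batch_iterator)            # no batches left
except StopIteration:
    exhausted = True
    print("iterator exhausted")
```

A `for batch in toy_loader:` loop performs the same calls and handles `StopIteration` automatically.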

In PyTorch, the data loader processes data in batches, loading and processing one batch at a time into memory efficiently. The batch size, which you specify when creating the data loader, determines how many samples are processed together in each batch. The data loader's purpose is to convert input data and labels into batches of tensors with the same shape for deep learning models to interpret.

Finally, a data loader can be used for tasks such as tokenizing, sequencing, converting your samples to the same size, and transforming your data into tensors that your model can understand.
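
One way to do this text processing inside the data loader is through a collate function, sketched below. The names `raw_texts`, `word_to_index`, and `text_collate` are hypothetical, and `str.split` stands in for a real tokenizer; the data set itself only holds raw strings.

```python
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

raw_texts = ["hello world", "hello there big world", "hello"]  # toy data

# Hypothetical vocabulary; index 0 is reserved for padding
word_to_index = {"<pad>": 0, "hello": 1, "world": 2, "there": 3, "big": 4}

def text_collate(batch):
    # batch is a list of raw strings drawn from the data set
    token_lists = [text.split() for text in batch]            # tokenize
    index_tensors = [
        torch.tensor([word_to_index[token] for token in tokens])  # numericalize
        for tokens in token_lists
    ]
    # pad every sequence to the length of the longest one in the batch
    return pad_sequence(index_tensors, batch_first=True, padding_value=0)

text_loader = DataLoader(raw_texts, batch_size=3, shuffle=False,
                         collate_fn=text_collate)
only_batch = next(iter(text_loader))
print(only_batch)
```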

--- 



## Custom Data set and data loader in PyTorch

Define a `CustomDataset` class that inherits from `torch.utils.data.Dataset` and is initialized with a list of sentences.

The data set implements the following methods:
* `__init__(self, sentences)`: Initializes the data set with a list of sentences.
* `__len__(self)`: Returns the number of sentences in the data set.
* `__getitem__(self, idx)`: Retrieves an item (in this case, a sentence) at a specific index, `idx`.


Next, create an instance of the custom data set (`custom_dataset`) by passing in the list of sentences. You can also specify a batch size (`batch_size`), which determines how many sentences are grouped together in each batch during data loading.




In [11]:
sentences = ["If you want to know what a man's like, take a good look at how he treats his inferiors, not his equals.",
    "Fame's a fickle friend, Harry.",
    "It is our choices, Harry, that show what we truly are, far more than our abilities.",
    "Soon we must all face the choice between what is right and what is easy.",
    "Youth can not know how age thinks and feels. But old men are guilty if they forget what it was to be young.",
    "You are awesome!"]

from torch.utils.data import Dataset
# Creation of a Custom Dataset
class CustomDataset(Dataset): # Dataset is the parent class
    def __init__(self,sentences):
        self.sentences = sentences

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self,idx): # retrieves the item at a specific index
        return self.sentences[idx]

# Create an instance of your custom dataset
custom_dataset = CustomDataset(sentences)

# Define Batch Size
batch_size = 2 # two sentences are grouped together in each batch

# Create a dataloader
dataloader = DataLoader(dataset=custom_dataset,
                       batch_size=batch_size,
                       shuffle=True)
    

In [10]:
# Iterate through the Dataloader
for batch in dataloader:
    print(f"Batch:{batch}\n")

Batch:["Fame's a fickle friend, Harry.", "If you want to know what a man's like, take a good look at how he treats his inferiors, not his equals."]

Batch:['Youth can not know how age thinks and feels. But old men are guilty if they forget what it was to be young.', 'Soon we must all face the choice between what is right and what is easy.']

Batch:['You are awesome!', 'It is our choices, Harry, that show what we truly are, far more than our abilities.']



In [17]:
custom_dataset.__getitem__(5)

'You are awesome!'

## Creating Tensor for custom data set

Pipeline:
1. Create a custom dataset
2. Tokenize each input sentence
3. Convert the tokens to vocabulary indices, ready to pass through the neural model
4. Pass the dataset through the dataloader to create batches of input data


In [28]:
sentences=["If you want to know what a man's like, take a good look at how he treats his inferiors, not his equals.",
    "Fame's a fickle friend, Harry.",
    "It is our choices, Harry, that show what we truly are, far more than our abilities.",
    "Soon we must all face the choice between what is right and what is easy.",
    "Youth can not know how age thinks and feels. But old men are guilty if they forget what it was to be young.",
    "You are awesome!"]

from typing import Callable

class CustomDataset(Dataset):
    def __init__(self,
                 sentences: list,
                 tokenizer: Callable,            # maps a string to a list of tokens
                 vocab: torchtext.vocab.Vocab):  # maps tokens to integer indices
        self.sentences = sentences
        self.tokenizer = tokenizer
        self.vocab = vocab

    def __len__(self): # returns the total number of samples in the dataset
        return len(self.sentences)

    def __getitem__(self,idx):
        tokens = self.tokenizer(self.sentences[idx])

        # now turn the tokens into indices
        tensor_indices = [self.vocab[token] for token in tokens]
        return torch.tensor(tensor_indices)


# Tokenizer
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('basic_english')

# Vocab
from torchtext.vocab import build_vocab_from_iterator
vocab = build_vocab_from_iterator(map(tokenizer,sentences))

# The full process
custom_dataset_2 = CustomDataset(sentences=sentences,
                                tokenizer=tokenizer,
                                vocab=vocab)


        

In [29]:
# Print the dataset length and each item as a tensor of vocabulary indices
print(f"Custom Dataset Length: {len(custom_dataset_2)}")
print("Sample Items:")

for i in range(len(custom_dataset_2)):
    sample_item = custom_dataset_2[i]
    print(f"Item{i+1}:{sample_item}")
    

Custom Dataset Length: 6
Sample Items:
Item1:tensor([11, 19, 63, 17, 13,  2,  3, 47,  6, 16, 45,  0, 55,  3, 41, 46, 24, 10,
        43, 61,  9, 44,  0, 14,  9, 33,  1])
Item2:tensor([35,  6, 16,  3, 38, 40,  0,  8,  1])
Item3:tensor([12,  5, 15, 31,  0,  8,  0, 57, 53,  2, 18, 62,  4,  0, 36, 49, 56, 15,
        21,  1])
Item4:tensor([54, 18, 50, 23, 34, 58, 30, 27,  2,  5, 52,  7,  2,  5, 32,  1])
Item5:tensor([66, 29, 14, 13, 10, 22, 60,  7, 37,  1, 28, 51, 48,  4, 42, 11, 59, 39,
         2, 12, 64, 17, 26, 65,  1])
Item6:tensor([19,  4, 25, 20])


## DataLoader

In [30]:
custom_data_2 = CustomDataset(sentences=sentences,
                                tokenizer=tokenizer,
                                vocab=vocab)

batch_size = 2

# Create an instance of DataLoader
batched_data = DataLoader(dataset=custom_data_2,
                         batch_size=batch_size,
                         shuffle=True)

In [31]:
for batch in batched_data:
    print(batch)

RuntimeError: stack expects each tensor to be equal size, but got [9] at entry 0 and [20] at entry 1

The error occurs because the tensors in a batch have different lengths; tensors can only be stacked when they all have the same size.

By default, the data loader uses PyTorch's default collate function, which stacks the samples of a batch and therefore requires them to have equal lengths. You can define your own collate function and pass it to the data loader through the `collate_fn` argument to establish your own batching rules. Typically, unequal tensor lengths are handled by padding, as demonstrated in the following section.
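
The stacking failure can be reproduced in isolation with `torch.stack`, the operation named in the error message. The two tensors below are made-up examples, and the manual zero padding is only meant to show why equal lengths resolve the problem.

```python
import torch

shorter = torch.tensor([1, 2, 3])
longer = torch.tensor([4, 5, 6, 7, 8])

try:
    torch.stack([shorter, longer])   # sizes differ, so this raises
    stack_failed = False
except RuntimeError as err:
    stack_failed = True
    print(f"stack failed: {err}")

# Padding the shorter tensor with zeros by hand makes the shapes agree
missing = len(longer) - len(shorter)
padded_shorter = torch.cat([shorter, torch.zeros(missing, dtype=shorter.dtype)])
stacked = torch.stack([padded_shorter, longer])
print(stacked.shape)  # torch.Size([2, 5])
```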


## Custom Collate Function

A collate function is used during data loading and batching in machine learning, particularly when dealing with variable-length data such as sequences (e.g., text, time series, sequences of events). Its primary purpose is to prepare and format individual data samples into batches that machine learning models can process efficiently.

* Use a custom collate function.
* `pad_sequence`: This PyTorch function pads the sequences in a batch to a uniform length. It takes a batch of sequences as input and pads them to match the length of the longest sequence. The `padding_value=0` argument specifies the value used for padding.

In [32]:
# Create a custom collate function
def collate_fn(batch):

    # Padding the batch to have equal lengths
    padded_batch = pad_sequence(batch,batch_first=True,padding_value=0)

    return padded_batch

In the cell above, `batch_first=True`, so the output has shape `[batch_size, seq_len]`. With `batch_first=False`, the shape is `[seq_len, batch_size]`.
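
The effect of `batch_first` on the output shape can be checked directly with two short, made-up sequences:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]

batch_major = pad_sequence(seqs, batch_first=True, padding_value=0)
time_major = pad_sequence(seqs, batch_first=False, padding_value=0)

print(batch_major.shape)  # torch.Size([2, 3]) -> [batch_size, seq_len]
print(time_major.shape)   # torch.Size([3, 2]) -> [seq_len, batch_size]

# The two layouts hold the same values; one is the transpose of the other
print(torch.equal(batch_major, time_major.T))  # True
```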

In [33]:
# Create a dataloader with this custom collate_fn
data_loader_custom_collate = DataLoader(dataset=custom_data_2,
                                       batch_size=batch_size,
                                       shuffle=True,
                                       collate_fn=collate_fn)

for batch in data_loader_custom_collate:
    print(batch)

tensor([[12,  5, 15, 31,  0,  8,  0, 57, 53,  2, 18, 62,  4,  0, 36, 49, 56, 15,
         21,  1,  0,  0,  0,  0,  0],
        [66, 29, 14, 13, 10, 22, 60,  7, 37,  1, 28, 51, 48,  4, 42, 11, 59, 39,
          2, 12, 64, 17, 26, 65,  1]])
tensor([[35,  6, 16,  3, 38, 40,  0,  8,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0],
        [11, 19, 63, 17, 13,  2,  3, 47,  6, 16, 45,  0, 55,  3, 41, 46, 24, 10,
         43, 61,  9, 44,  0, 14,  9, 33,  1]])
tensor([[54, 18, 50, 23, 34, 58, 30, 27,  2,  5, 52,  7,  2,  5, 32,  1],
        [19,  4, 25, 20,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])


In [35]:
# Look up the index-to-token table once, then decode each padded row
itos = vocab.get_itos()
for batch in data_loader_custom_collate:
    for row in batch:
        words = [itos[idx] for idx in row.tolist()]
        print(words)

['if', 'you', 'want', 'to', 'know', 'what', 'a', 'man', "'", 's', 'like', ',', 'take', 'a', 'good', 'look', 'at', 'how', 'he', 'treats', 'his', 'inferiors', ',', 'not', 'his', 'equals', '.']
['soon', 'we', 'must', 'all', 'face', 'the', 'choice', 'between', 'what', 'is', 'right', 'and', 'what', 'is', 'easy', '.', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',']
['you', 'are', 'awesome', '!', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',']
['it', 'is', 'our', 'choices', ',', 'harry', ',', 'that', 'show', 'what', 'we', 'truly', 'are', ',', 'far', 'more', 'than', 'our', 'abilities', '.']
['fame', "'", 's', 'a', 'fickle', 'friend', ',', 'harry', '.', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',']
['youth', 'can', 'not', 'know', 'how', 'age', 'thinks', 'and', 'feels', '.', 'but', 'old', 'men', 'are', 'guilty', 'if', 'they', 'forget', 'what', 'it', 'was', 'to', 'be', 'young', '.']
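
In the decoded output above, the padded positions appear as `','` because index 0, the padding value, is also the index of the comma in this vocabulary. One possible remedy is to reserve index 0 for a dedicated padding token. The sketch below uses a hypothetical hand-built table (`index_to_token`) rather than the notebook's vocabulary; with torchtext, the `specials` argument of `build_vocab_from_iterator` can be used to reserve such a token.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical index-to-token table with a dedicated padding token at index 0
index_to_token = ["<pad>", ",", ".", "you", "are", "awesome", "!"]
token_to_index = {token: i for i, token in enumerate(index_to_token)}
PAD_IDX = token_to_index["<pad>"]

rows = [
    torch.tensor([token_to_index[t] for t in ["you", "are", "awesome", "!"]]),
    torch.tensor([token_to_index[t]
                  for t in ["you", ",", "you", "are", "awesome", "."]]),
]
padded = pad_sequence(rows, batch_first=True, padding_value=PAD_IDX)

# Decode each row, skipping padded positions; real commas are preserved
decoded = [[index_to_token[i] for i in row.tolist() if i != PAD_IDX]
           for row in padded]
print(decoded)
```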


## Full Process

### Create Custom Dataset


In [37]:
from torch.utils.data import Dataset,DataLoader
class CustomDataset(Dataset): # Parent Class Dataset
    def __init__(self,
                 sentence,
                tokenizer,
                vocab):
        self.sentence = sentence
        self.tokenizer = tokenizer
        self.vocab = vocab

    def __len__(self): # get to know the length of the dataset
        return len(self.sentence)

    def __getitem__(self,idx): # grab each item of the custom dataset
        tokens = self.tokenizer(self.sentence[idx]) # Grab each sample and convert it into token

        # now convert into vocab indices
        tensor_indices = [self.vocab[token] for token in tokens]
        return torch.tensor(tensor_indices)


In [40]:
# Sample dataset:
corpus = [
    "Ceci est une phrase.",
    "C'est un autre exemple de phrase.",
    "Voici une troisième phrase.",
    "Il fait beau aujourd'hui.",
    "J'aime beaucoup la cuisine française.",
    "Quel est ton plat préféré ?",
    "Je t'adore.",
    "Bon appétit !",
    "Je suis en train d'apprendre le français.",
    "Nous devons partir tôt demain matin.",
    "Je suis heureux.",
    "Le film était vraiment captivant !",
    "Je suis là.",
    "Je ne sais pas.",
    "Je suis fatigué après une longue journée de travail.",
    "Est-ce que tu as des projets pour le week-end ?",
    "Je vais chez le médecin cet après-midi.",
    "La musique adoucit les mœurs.",
    "Je dois acheter du pain et du lait.",
    "Il y a beaucoup de monde dans cette ville.",
    "Merci beaucoup !",
    "Au revoir !",
    "Je suis ravi de vous rencontrer enfin !",
    "Les vacances sont toujours trop courtes.",
    "Je suis en retard.",
    "Félicitations pour ton nouveau travail !",
    "Je suis désolé, je ne peux pas venir à la réunion.",
    "À quelle heure est le prochain train ?",
    "Bonjour !",
    "C'est génial !"
]


# Set the tokenizer
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('basic_english')

# set the vocab to convert the token into indices
from torchtext.vocab import build_vocab_from_iterator
vocab = build_vocab_from_iterator(iterator = map(tokenizer,corpus))

# Create an instance of CustomDataset
custom_data = CustomDataset(sentence=corpus,
                           tokenizer=tokenizer,
                           vocab=vocab)

In [42]:
# Show the outputs
print(f"Length of the dataset: {len(custom_data)}")

Length of the dataset: 30


In [45]:
# Retrieve the first item as a tensor of vocabulary indices
custom_data[0]

tensor([43,  5, 12, 11,  0])

In [47]:
for i in range(len(custom_data)):
    print(custom_data.__getitem__(i))    

tensor([43,  5, 12, 11,  0])
tensor([ 13,   4,   5, 106,  38,  59,   7,  11,   0])
tensor([111,  12, 102,  11,   0])
tensor([16, 60, 39, 37,  4, 69,  0])
tensor([70,  4, 30,  9, 10, 48, 64,  0])
tensor([91,  5, 21, 86, 89,  8])
tensor([  1, 100,   4,  28,   0])
tensor([40, 32,  2])
tensor([ 1,  3, 15, 22, 49,  4, 31,  6, 63,  0])
tensor([ 81,  53,  84, 105,  51,  75,   0])
tensor([ 1,  3, 68,  0])
tensor([  6,  62, 116, 113,  42,   2])
tensor([ 1,  3, 74,  0])
tensor([ 1, 18, 98, 19,  0])
tensor([ 1,  3, 61, 33, 12, 73, 71,  7, 23,  0])
tensor([ 57,  90, 104,  35,  52,  88,  20,   6, 114,   8])
tensor([  1, 108,  46,   6,  79,  44,  34,   0])
tensor([10, 78, 29, 17, 80,  0])
tensor([ 1, 54, 27, 14, 83, 58, 14, 72,  0])
tensor([ 16, 115,  26,   9,   7,  77,  50,  45, 110,   0])
tensor([76,  9,  2])
tensor([36, 96,  2])
tensor([  1,   3,  93,   7, 112,  94,  56,   2])
tensor([ 17, 107,  99, 101, 103,  47,   0])
tensor([ 1,  3, 15, 95,  0])
tensor([65, 20, 21, 82, 23,  2])
tensor([  1,   

In [53]:
custom_data.vocab.get_itos()

['.',
 'je',
 '!',
 'suis',
 "'",
 'est',
 'le',
 'de',
 '?',
 'beaucoup',
 'la',
 'phrase',
 'une',
 'c',
 'du',
 'en',
 'il',
 'les',
 'ne',
 'pas',
 'pour',
 'ton',
 'train',
 'travail',
 'à',
 ',',
 'a',
 'acheter',
 'adore',
 'adoucit',
 'aime',
 'apprendre',
 'appétit',
 'après',
 'après-midi',
 'as',
 'au',
 'aujourd',
 'autre',
 'beau',
 'bon',
 'bonjour',
 'captivant',
 'ceci',
 'cet',
 'cette',
 'chez',
 'courtes',
 'cuisine',
 'd',
 'dans',
 'demain',
 'des',
 'devons',
 'dois',
 'désolé',
 'enfin',
 'est-ce',
 'et',
 'exemple',
 'fait',
 'fatigué',
 'film',
 'français',
 'française',
 'félicitations',
 'génial',
 'heure',
 'heureux',
 'hui',
 'j',
 'journée',
 'lait',
 'longue',
 'là',
 'matin',
 'merci',
 'monde',
 'musique',
 'médecin',
 'mœurs',
 'nous',
 'nouveau',
 'pain',
 'partir',
 'peux',
 'plat',
 'prochain',
 'projets',
 'préféré',
 'que',
 'quel',
 'quelle',
 'ravi',
 'rencontrer',
 'retard',
 'revoir',
 'réunion',
 'sais',
 'sont',
 't',
 'toujours',
 'troisièm

In [54]:
custom_data.vocab.get_stoi()

{'week-end': 114,
 'vraiment': 113,
 'voici': 111,
 'y': 115,
 'ville': 110,
 'venir': 109,
 'vacances': 107,
 'trop': 103,
 'troisième': 102,
 'toujours': 101,
 'sais': 98,
 'réunion': 97,
 'j': 70,
 'revoir': 96,
 'retard': 95,
 "'": 4,
 'rencontrer': 94,
 'que': 90,
 'projets': 88,
 'cuisine': 48,
 'nous': 81,
 'plat': 86,
 'peux': 85,
 'autre': 38,
 'pain': 83,
 'ceci': 43,
 'mœurs': 80,
 'médecin': 79,
 'là': 74,
 'longue': 73,
 'de': 7,
 'heure': 67,
 'une': 12,
 'félicitations': 65,
 'nouveau': 82,
 'français': 63,
 'fatigué': 61,
 'exemple': 59,
 'merci': 76,
 'enfin': 56,
 'désolé': 55,
 'dois': 54,
 'prochain': 87,
 'française': 64,
 'adoucit': 29,
 'dans': 50,
 'courtes': 47,
 'quelle': 92,
 'chez': 46,
 'pour': 20,
 'des': 52,
 'ton': 21,
 'cette': 45,
 'suis': 3,
 'cet': 44,
 'bon': 40,
 'beau': 39,
 'matin': 75,
 'au': 36,
 'après': 33,
 'vais': 108,
 'journée': 71,
 'après-midi': 34,
 'apprendre': 31,
 'adore': 28,
 'acheter': 27,
 'musique': 78,
 'ne': 18,
 'vous': 112,