## `Dataset` and `DataLoader`: The Why?
> Loading an entire dataset into memory at once can cause out-of-memory errors and slow processing. The `Dataset` and `DataLoader` classes address these issues by:
> 1. Serving the data in smaller batches to avoid running out of memory
> 2. Optionally using parallel worker processes to speed up data loading.

These tools help us work with large datasets more efficiently.
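
To make the batching idea concrete, here is a minimal sketch on toy data. It uses `TensorDataset`, a ready-made `Dataset` from `torch.utils.data`; the tensor sizes and the names `toy_dataset` and `toy_loader` are illustrative assumptions and are not part of the student data used later.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 10 samples with 3 features each (values are arbitrary)
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.zeros(10, dtype=torch.long)

toy_dataset = TensorDataset(features, labels)
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

# The loader yields ceil(10 / 4) = 3 batches; the last one is smaller
batch_shapes = [tuple(x.shape) for x, _ in toy_loader]
print(batch_shapes)     # [(4, 3), (4, 3), (2, 3)]
print(len(toy_loader))  # 3
```

Only one batch is handed to the loop at a time, which is what keeps the per-step memory footprint small.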


In [20]:
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import torch
from torch.utils.data import Dataset, DataLoader

In [28]:
# Set random seed for reproducibility
np.random.seed(42)

# Generate random data for 1000 students
n_students = 1000

# Generate features
cgpa = np.random.uniform(5.0, 10.0, n_students)  # CGPA between 5.0 and 10.0
iq = np.random.normal(100, 15, n_students)  # IQ with mean 100 and std 15
marks_12th = np.random.uniform(60, 100, n_students)  # 12th marks between 60 and 100
marks_10th = np.random.uniform(60, 100, n_students)  # 10th marks between 60 and 100

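# Generate placement status (binary)
# Higher probability of placement for students with better scores
placement_prob = (cgpa/10 * 0.3 + iq/150 * 0.3 + marks_12th/100 * 0.2 + marks_10th/100 * 0.2)
# Keep probabilities within [0, 1]; an unusually high IQ draw could otherwise push a value above 1
placement_prob = np.clip(placement_prob, 0.0, 1.0)
placed = np.random.binomial(1, placement_prob)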

# Create DataFrame
student_data = pd.DataFrame({
    'CGPA': np.round(cgpa, 2),
    'IQ': np.round(iq),
    'Marks_12th': np.round(marks_12th, 2),
    'Marks_10th': np.round(marks_10th, 2),
    'Placed': placed
})

print("Sample of generated student data:")
student_data.head()

Sample of generated student data:


| | CGPA | IQ | Marks_12th | Marks_10th | Placed |
|---|---|---|---|---|---|
| 0 | 6.87 | 103.0 | 98.86 | 75.51 | 0 |
| 1 | 9.75 | 80.0 | 73.25 | 92.14 | 1 |
| 2 | 8.66 | 106.0 | 79.28 | 96.07 | 1 |
| 3 | 7.99 | 109.0 | 67.84 | 68.14 | 0 |
| 4 | 5.78 | 108.0 | 84.43 | 62.68 | 1 |


In [29]:
# Shape of the dataset
student_data.shape

(1000, 5)

In [31]:
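# Splitting the data into features (X) and target (y)
# NOTE: the trailing comma makes X a one-element tuple, so the feature DataFrame is X[0]
X = student_data.drop(columns=['Placed']),
y = student_data['Placed']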
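
The trailing comma on the `X = ...` line is easy to miss: in Python it wraps the value in a one-element tuple, which is why the features are accessed as `X[0]` when the dataset object is created. A small sketch on a made-up DataFrame (the column names `a`, `b`, and `target` are hypothetical) shows the difference:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "target": [0, 1]})

with_comma = df.drop(columns=["target"]),    # trailing comma -> one-element tuple
without_comma = df.drop(columns=["target"])  # plain DataFrame

print(type(with_comma).__name__)             # tuple
print(type(without_comma).__name__)          # DataFrame
print(with_comma[0].equals(without_comma))   # True
```

Dropping the comma and passing `X` directly would work just as well, as long as both places are changed together.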

In [None]:
# Building a Dataset class
class CustomDataset(Dataset):
    def __init__(self, features: np.ndarray, labels: np.ndarray) -> None:
        """
            In this method, you write the operation that loads the data from local or cloud storage.
        """
        # Here the data is provided directly by the user through features and labels.
        # StandardScaler is a dataset-level transformation (it needs the mean and standard
        # deviation of each whole column), so it is fitted once here. Fitting it inside
        # __getitem__ would rescale the entire dataset on every single row access.
        scaler = StandardScaler()
        self.features = torch.tensor(scaler.fit_transform(features), device='cpu', requires_grad=False)
        self.labels = torch.tensor(labels, device='cpu', requires_grad=False)

    def __len__(self) -> int:
        return len(self.features)

    def __getitem__(self, index):
        """
            Per-sample transformations can be applied here.
            Transformations:
                Tabular data:
                    - Scaling (with statistics fitted beforehand)
                    - Encoding
                    - Missing-value handling
                    - Others

                Image data:
                    - Normalizing the pixels
                    - Resizing the image
                    - Data augmentation
                    - RGB to black-and-white conversion

                Textual data:
                    - Embedding / encoding
                    - Stop-word removal
                    - Lemmatization
                    - Other internal preprocessing steps
        """
        # You can return the data in any structure you like, but you must unpack it
        # in the same structure when loading the data for training
        return self.features[index], self.labels[index]
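
One common convention for per-sample transformations (used, for example, by torchvision datasets) is to pass an optional `transform` callable to the dataset and apply it inside `__getitem__`. The sketch below illustrates that pattern under stated assumptions: the class name `TransformedDataset`, the helper `standardize`, and the toy arrays are hypothetical and are not part of this notebook. The mean and standard deviation are computed once, outside the dataset, so each row access stays cheap.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class TransformedDataset(Dataset):  # hypothetical name, for illustration only
    def __init__(self, features, labels, transform=None):
        self.features = torch.as_tensor(features, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.long)
        self.transform = transform  # optional callable applied to one sample

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        x = self.features[index]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.labels[index]


# Toy data (made up): 3 samples, 2 features
raw = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
targets = np.array([0, 1, 1])

# Column statistics are computed once, up front
mean = torch.tensor(raw.mean(axis=0), dtype=torch.float32)
std = torch.tensor(raw.std(axis=0), dtype=torch.float32)


def standardize(x):
    # Row-wise standardization using the precomputed statistics
    return (x - mean) / std


demo_dataset = TransformedDataset(raw, targets, transform=standardize)
x0, y0 = demo_dataset[1]
print(x0, y0)  # the middle row equals the column means, so it maps to zeros
```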

In [50]:
# Creating an object of dataset class
dataset = CustomDataset(
    features=np.array(X[0]),
    labels=np.array(y)
)

In [51]:
# Finding the length of the dataset
len(dataset)

1000

In [52]:
# Accessing a row
index = int(input("Enter the index: "))
print(f"Features at index {index}:", dataset[index][0])
print(f"Labels at index {index}:", dataset[index][1])

Features at index 400: tensor([-1.3228,  0.1017, -1.2777,  1.5235], dtype=torch.float64)
Labels at index 400: tensor(1, dtype=torch.int32)
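
The output above shows `float64` features and `int32` labels. PyTorch layers such as `nn.Linear` create `float32` parameters by default, and feeding them `float64` inputs typically raises a dtype-mismatch error, so the tensors are usually cast before training. A minimal sketch, assuming a single linear layer on four features (the layer and the sample values are illustrative, not part of this notebook):

```python
import torch
from torch import nn

# Tensors with the same dtypes as the dataset output above (values are made up)
x64 = torch.tensor([[0.5, -1.2, 0.3, 1.1]], dtype=torch.float64)
y32 = torch.tensor([1], dtype=torch.int32)

layer = nn.Linear(in_features=4, out_features=1)  # parameters are float32 by default

x = x64.float()  # cast features to float32
y = y32.float()  # cast labels to float32, e.g. for a binary cross-entropy loss

logits = layer(x)
print(layer.weight.dtype, x.dtype, y.dtype, logits.dtype)
```

The cast could equally be done once inside the dataset's `__init__`, which avoids repeating it for every batch.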


In [53]:
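# Creating a DataLoader object
dataloader = DataLoader(
    dataset=dataset,
    batch_size=2,
    shuffle=True,     # Set to False for ordered data such as time series
    pin_memory=True,  # Speeds up host-to-GPU copies; gives no benefit on a CPU-only machine
    num_workers=0     # 0 loads data in the main process; values > 0 use worker processes, which can fail or hang in notebooks (notably on Windows) when the Dataset class is defined in the notebook itself
)

# The DataLoader is an iterable, not a generator; displaying it shows only the object's repr
dataloader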

<torch.utils.data.dataloader.DataLoader at 0x21004c3c050>
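
Since the repr above says little, a few quick checks help to see what a `DataLoader` will produce. The sketch below uses a small `TensorDataset` with random values (an assumption made for illustration, not the student data) to show the number of batches, how to pull a single batch, and the effect of `drop_last`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 7 samples, 4 features, binary labels (random values)
toy = TensorDataset(torch.randn(7, 4), torch.randint(0, 2, (7,)))

loader = DataLoader(toy, batch_size=2, shuffle=True)
print(len(loader))           # 4 batches: 2 + 2 + 2 + 1

# Pull one batch without writing a full loop
first_features, first_labels = next(iter(loader))
print(first_features.shape)  # torch.Size([2, 4])

# drop_last=True discards the final incomplete batch
loader_drop = DataLoader(toy, batch_size=2, shuffle=True, drop_last=True)
print(len(loader_drop))      # 3
```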

In [54]:
for batch_features, batch_labels in dataloader:
    print(batch_features, batch_labels)
    print("-" * 50)

tensor([[-0.1104, -0.2355, -0.2085,  0.3630],
        [-0.9529, -0.4378,  0.9058,  0.6808]], dtype=torch.float64) tensor([0, 0], dtype=torch.int32)
--------------------------------------------------
tensor([[-0.8502,  0.1691, -1.0793, -0.7931],
        [-0.3776,  0.9110, -1.1477, -0.5084]], dtype=torch.float64) tensor([0, 0], dtype=torch.int32)
--------------------------------------------------
tensor([[ 1.2116, -0.7750, -0.9606, -1.5466],
        [ 0.0882,  0.9110,  0.2066,  0.2609]], dtype=torch.float64) tensor([1, 1], dtype=torch.int32)
--------------------------------------------------
tensor([[ 0.8828, -0.7750, -1.5186, -0.0317],
        [ 1.4239,  2.5295, -1.4467,  0.2967]], dtype=torch.float64) tensor([1, 1], dtype=torch.int32)
--------------------------------------------------
tensor([[-0.4598, -0.3029, -0.4597,  0.2224],
        [-1.3571, -0.7750, -0.9736,  0.4268]], dtype=torch.float64) tensor([0, 1], dtype=torch.int32)
--------------------------------------------------
tenso