#### **Problems in last notebook**

1. We used Batch Gradient Descent: the entire dataset goes through the forward pass at once, and the weights are updated in the backward pass only after all of it has been processed.
2. Batch gradient descent is memory inefficient, because the whole dataset has to be held in memory at once.
3. Convergence is slow, because the parameters are updated only once per epoch.

Instead of loading the entire dataset at once, we divide it into, say, 10 batches and update the weights (backward pass) after the forward pass of each batch. The parameters are then updated more often, which usually speeds up convergence, and only one batch has to be processed at a time, which saves memory. This is **Mini-Batch Gradient Descent**.
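To make the difference concrete, here is a small sketch that counts weight updates per epoch. The helper name `updates_per_epoch` and the sample count of 1000 are made up for illustration.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one epoch (one update per batch)."""
    return math.ceil(n_samples / batch_size)

n_samples = 1000  # hypothetical dataset size

# Batch gradient descent: the whole dataset is a single batch.
full_batch = updates_per_epoch(n_samples, batch_size=n_samples)

# Mini-batch gradient descent: 10 batches of 100 samples each.
mini_batch = updates_per_epoch(n_samples, batch_size=100)

print(full_batch, mini_batch)
```

With these numbers, batch gradient descent makes a single update per epoch, while mini-batch gradient descent makes ten.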

#### The Dataset and DataLoader Classes

Dataset and DataLoader are core abstractions in PyTorch that decouple how you define your data from how you efficiently iterate over it in training loops.

Your `data` sits in memory (or on disk). The `Dataset` class knows where the data is located and retrieves it one sample at a time. The `DataLoader` class then takes these samples and groups them into batches.

In short, the `Dataset` class loads individual samples and the `DataLoader` class creates batches from them.

**Dataset Class**
The Dataset class is essentially a blueprint. When you create a custom dataset, you decide how data is loaded and returned.

It defines :    
  - `__init__()` : which tells how the data should be loaded.
  - `__len__()` : which returns the total number of samples
  - `__getitem__(index)` : which returns the data (and label) at the given index. Any per-sample transformations should also be done here (e.g. augmentation, resizing, stemming, lemmatization, etc.)
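The three methods above can be sketched as follows. This is a minimal illustration, not a required layout: the class name `TransformedDataset`, the `transform` argument, and the toy tensors are all assumptions made for the example.

```python
import torch
from torch.utils.data import Dataset

class TransformedDataset(Dataset):
    """Hypothetical dataset that applies an optional per-sample transform."""

    def __init__(self, features, labels, transform=None):
        # Store references to the data; nothing is batched here.
        self.features = features
        self.labels = labels
        self.transform = transform

    def __len__(self):
        # Total number of samples.
        return len(self.features)

    def __getitem__(self, index):
        # Fetch one sample and apply the transform, if any.
        x = self.features[index]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.labels[index]

features = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
labels = torch.tensor([0, 1, 0])

# The lambda is a stand-in for a real transformation such as resizing.
toy_dataset = TransformedDataset(features, labels, transform=lambda x: x / 10)
x0, y0 = toy_dataset[0]
print(len(toy_dataset), x0, y0)
```

Because the transform runs inside `__getitem__`, it is applied lazily, one sample at a time, whenever a sample is requested.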

**DataLoader Class**
The DataLoader wraps a Dataset and handles batching, shuffling, and parallel loading for you.

**DataLoader Control Flow** :
  - At the start of each epoch, the DataLoader (if shuffle=True) shuffles indices (using a sampler)
  - It divides the indices into chunks of batch_size
  - For each index in the chunk, data samples are fetched from the Dataset object
  - The samples are then collected and combined into a batch (using collate_fn)
  - The batch is returned to the main training loop
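The steps above can be imitated in plain Python. This is a toy re-implementation for intuition only, not PyTorch's actual code; the function `iterate_batches` and the toy dataset are hypothetical.

```python
import random

def iterate_batches(dataset, batch_size, shuffle=True, seed=None):
    """Toy version of the DataLoader control flow (illustration only)."""
    # 1. Build the index list and optionally shuffle it (the sampler's job).
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)

    # 2. Split the indices into chunks of batch_size.
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]

        # 3. Fetch each sample of the chunk from the dataset.
        samples = [dataset[i] for i in chunk]

        # 4. Combine the samples into one batch (the collate_fn's job).
        features, labels = zip(*samples)

        # 5. Hand the batch to the training loop.
        yield list(features), list(labels)

# Five (features, label) samples; the last batch will hold a single sample.
toy_dataset = [([i, i * 10], i % 2) for i in range(5)]
batches = list(iterate_batches(toy_dataset, batch_size=2, seed=0))
for batch_features, batch_labels in batches:
    print(batch_features, batch_labels)
```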

#### **Parallelization**

The DataLoader class can also load data in parallel. When `num_workers` is greater than 0, it starts that many worker processes and assigns batches to them, so several batches are fetched and collated simultaneously while the main process trains on the current one.

#### **Sampler**

In PyTorch, the sampler in the DataLoader determines the strategy for selecting samples from the dataset during data loading. It controls the order in which dataset indices are drawn for each batch.

**Types of Sampler**
PyTorch provides several predefined samplers, and you can create custom ones:  
  1. SequentialSampler :    
    - Samples elements sequentially, in the order they appear in the dataset.
    - Default when shuffle = False
  2. RandomSampler :     
    - Samples elements randomly without replacement
    - Default when shuffle = True
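Both samplers can be inspected directly, since a sampler is just an iterable of indices. The sketch below uses a plain list as the data source and a seeded `torch.Generator` (both chosen only for the example) to show the index order each one produces.

```python
import torch
from torch.utils.data import SequentialSampler, RandomSampler

data = list(range(6))  # any object with a length works as the data source

# Indices in dataset order: what DataLoader uses when shuffle=False.
sequential_order = list(SequentialSampler(data))

# A random permutation of the indices (no replacement): shuffle=True.
generator = torch.Generator().manual_seed(0)
random_order = list(RandomSampler(data, generator=generator))

print(sequential_order)
print(random_order)
```

Each index appears exactly once in `random_order`, only in a different order. A sampler instance can also be passed to the DataLoader through its `sampler` argument, in which case `shuffle` must be left at `False`.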

#### `collate_fn`

The `collate_fn` in PyTorch's DataLoader is a function that specifies how to combine a list of samples from a dataset into a single batch. By default, the DataLoader stacks the samples into tensors, which only works when every sample has the same shape. A custom `collate_fn` lets you control how the data is processed and batched.
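A common reason to write a custom `collate_fn` is variable-length sequences, which the default stacking cannot handle. The sketch below pads each batch to its longest sequence with `torch.nn.utils.rnn.pad_sequence`; the function name `pad_collate` and the toy samples are assumptions for the example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy samples: (sequence, label) pairs with different sequence lengths.
samples = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4, 5]), 1),
    (torch.tensor([6]), 0),
    (torch.tensor([7, 8, 9, 10]), 1),
]

def pad_collate(batch):
    """Pad the sequences of one batch to the same length with zeros."""
    sequences, labels = zip(*batch)
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    lengths = torch.tensor([len(s) for s in sequences])
    return padded, lengths, torch.tensor(labels)

# A list of tuples is enough here: it supports indexing and len().
loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=pad_collate)

first_padded, first_lengths, first_labels = next(iter(loader))
print(first_padded)
print(first_lengths, first_labels)
```

Returning the original lengths alongside the padded tensor lets the model ignore the padding later if it needs to.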



#### DataLoader Important Parameters

The DataLoader class in PyTorch comes with several parameters that allow you to customize how data is loaded, batched, and preprocessed. Some of the most commonly used and important parameters are:

  1. dataset (mandatory) :    
    - The Dataset from which the DataLoader will pull data
    - Typically a subclass of torch.utils.data.Dataset that implements `__getitem__` and `__len__` (a map-style dataset)
  2. batch_size :    
    - How many samples per batch to load
    - Default is 1
    - Larger batch sizes can speed up training on GPUs but require more memory
  3. shuffle :    
    - If True, the DataLoader will shuffle the dataset indices at the start of each epoch
  4. num_workers :    
    - The number of worker processes used to load data in parallel
    - Setting num_workers > 0 can speed up data loading by leveraging multiple CPU cores, especially if I/O or preprocessing is a bottleneck
  5. pin_memory :    
    - If True, the DataLoader will copy tensors into pinned (page-locked) memory before returning them.
    - This can improve CPU-to-GPU transfer speed and thus overall training throughput, particularly on CUDA systems
  6. drop_last :    
    - If True, the DataLoader will drop the last incomplete batch if the total number of samples is not divisible by the batch_size
    - Useful when exact batch sizes are required (e.g. in some batch normalization scenarios)
  7. collate_fn :   
    - A callable that processes a list of samples into a batch (the default simply stacks tensors).
    - A custom collate_fn can handle variable-length sequences, perform custom batching logic, or handle complex data structures
  8. sampler / batch_sampler :     
    - sampler defines the strategy for drawing samples (e.g. for handling imbalanced classes, or custom sampling strategies)
    - batch_sampler works at the batch level, controlling how batches are formed.
    - Typically, you don't need to specify these if you are using batch_size and shuffle. However, they provide lower-level control if you have advanced requirements
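As a quick check of how `batch_size` and `drop_last` interact, the sketch below builds two loaders over the same 10-sample toy dataset (a `TensorDataset` created only for this example) and compares the batch sizes they produce.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 toy samples with one feature each; 10 is not divisible by 3.
toy_features = torch.arange(10).float().unsqueeze(1)
toy_labels = torch.arange(10)
toy_dataset = TensorDataset(toy_features, toy_labels)

keep_loader = DataLoader(toy_dataset, batch_size=3, shuffle=False, drop_last=False)
drop_loader = DataLoader(toy_dataset, batch_size=3, shuffle=False, drop_last=True)

# Record how many samples each batch contains.
keep_sizes = [len(labels) for _, labels in keep_loader]
drop_sizes = [len(labels) for _, labels in drop_loader]

print(len(keep_loader), keep_sizes)
print(len(drop_loader), drop_sizes)
```

With `drop_last=False` the final batch holds the single leftover sample; with `drop_last=True` that sample is skipped for the epoch, so every batch has exactly 3 samples.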

#### Dataset Creation

In [1]:
# imports
from sklearn.datasets import make_classification
import torch

In [2]:
# Step 1 : Create a synthetic classification dataset using sklearn
X, y = make_classification(
    n_samples = 10, # Number of samples
    n_features = 2, # Number of features
    n_informative = 2, # Number of informative features
    n_redundant = 0, # Number of redundant features
    n_classes = 2, # Number of classes
    random_state = 42
)

In [3]:
X

array([[ 1.06833894, -0.97007347],
       [-1.14021544, -0.83879234],
       [-2.8953973 ,  1.97686236],
       [-0.72063436, -0.96059253],
       [-1.96287438, -0.99225135],
       [-0.9382051 , -0.54304815],
       [ 1.72725924, -1.18582677],
       [ 1.77736657,  1.51157598],
       [ 1.89969252,  0.83444483],
       [-0.58723065, -1.97171753]])

In [4]:
X.shape

(10, 2)

In [5]:
y

array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

In [6]:
y.shape

(10,)

#### Converting NumPy arrays into PyTorch tensors

In [8]:
X = torch.tensor(X, dtype = torch.float32)
y = torch.tensor(y, dtype = torch.long)


In [9]:
X

tensor([[ 1.0683, -0.9701],
        [-1.1402, -0.8388],
        [-2.8954,  1.9769],
        [-0.7206, -0.9606],
        [-1.9629, -0.9923],
        [-0.9382, -0.5430],
        [ 1.7273, -1.1858],
        [ 1.7774,  1.5116],
        [ 1.8997,  0.8344],
        [-0.5872, -1.9717]])

In [10]:
y

tensor([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

#### Creating Dataset Class

In [12]:
from torch.utils.data import Dataset, DataLoader

In [13]:
class CustomDataset(Dataset):
  def __init__(self, features, labels):
    self.features = features
    self.labels = labels
  def __len__(self):
    return self.features.shape[0]
  def __getitem__(self, index):
    return self.features[index], self.labels[index]

In [14]:
dataset = CustomDataset(X, y)

In [15]:
len(dataset)

10

In [16]:
dataset[0]

(tensor([ 1.0683, -0.9701]), tensor(1))

In [17]:
dataset[8]

(tensor([1.8997, 0.8344]), tensor(1))

#### Creating DataLoader class

In [18]:
dataloader = DataLoader(dataset, batch_size = 2, shuffle = True)

In [19]:
for batch_features, batch_labels in dataloader:
  print(batch_features)
  print(batch_labels)
  print('-'*50)

tensor([[-0.9382, -0.5430],
        [ 1.7273, -1.1858]])
tensor([1, 1])
--------------------------------------------------
tensor([[-2.8954,  1.9769],
        [-1.1402, -0.8388]])
tensor([0, 0])
--------------------------------------------------
tensor([[ 1.8997,  0.8344],
        [-0.5872, -1.9717]])
tensor([1, 0])
--------------------------------------------------
tensor([[ 1.0683, -0.9701],
        [-0.7206, -0.9606]])
tensor([1, 0])
--------------------------------------------------
tensor([[-1.9629, -0.9923],
        [ 1.7774,  1.5116]])
tensor([0, 1])
--------------------------------------------------


### Applying Mini Batch Gradient Descent using Dataset and DataLoader classes on Breast Cancer Project

#### Imports

In [21]:
# Imports
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

In [22]:
df = pd.read_csv("https://raw.githubusercontent.com/gscdit/Breast-Cancer-Detection/refs/heads/master/data.csv")

#### Preprocessing, Scaling, Encoding and converting into PyTorch Tensors

In [23]:
df.head()
df.drop(columns=["Unnamed: 32","id"],inplace=True)

In [24]:
X_train,X_test,y_train,y_test = train_test_split(df.iloc[:,1:],df.iloc[:,0],test_size=0.2,random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)

In [25]:
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

#### Creating CustomDataset class

In [26]:
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

In [27]:
train_dataset = CustomDataset(X_train_tensor, y_train_tensor)
test_dataset = CustomDataset(X_test_tensor, y_test_tensor)

In [28]:
train_dataset[10]

(tensor([-0.4976,  0.6137, -0.4981, -0.5310, -0.5769, -0.1749, -0.3622, -0.2849,
          0.4335,  0.1782, -0.3684,  0.5531, -0.3167, -0.4052,  0.0403, -0.0380,
         -0.1804,  0.1648, -0.1217,  0.2308, -0.5004,  0.8194, -0.4692, -0.5331,
         -0.0491, -0.0416, -0.1491,  0.0968,  0.1062,  0.4904]),
 tensor(0.))

In [29]:
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

#### Defining the Model

In [30]:
import torch
import torch.nn as nn

# Defining the Model

class Model(nn.Module):
  def __init__(self, num_features):
    super().__init__()
    self.linear1 = nn.Linear(num_features, 3)
    self.relu = nn.ReLU()
    self.linear2 = nn.Linear(3, 1)
    self.sigmoid = nn.Sigmoid()

  def forward(self, features):
    out = self.linear1(features)
    out = self.relu(out)
    out = self.linear2(out)
    out = self.sigmoid(out)
    return out

#### Important Parameters

In [31]:
learning_rate = 0.1
epochs = 25

In [35]:
model = Model(X_train_tensor.shape[1])

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

loss_function = nn.BCELoss()

#### Training

In [36]:
for epoch in range(epochs):
  for batch_features, batch_labels in train_loader:
    y_pred = model(batch_features)
    loss = loss_function(y_pred, batch_labels.view(-1, 1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  print(f"Epoch {epoch+1}, Loss: {loss.item()}")

Epoch 1, Loss: 0.5787898302078247
Epoch 2, Loss: 0.32075998187065125
Epoch 3, Loss: 0.2856273651123047
Epoch 4, Loss: 0.2120477259159088
Epoch 5, Loss: 0.0909612700343132
Epoch 6, Loss: 0.23282615840435028
Epoch 7, Loss: 0.25627651810646057
Epoch 8, Loss: 0.1917082518339157
Epoch 9, Loss: 0.14033189415931702
Epoch 10, Loss: 0.18099470436573029
Epoch 11, Loss: 0.08367836475372314
Epoch 12, Loss: 0.07366105169057846
Epoch 13, Loss: 0.6025298237800598
Epoch 14, Loss: 0.1660197228193283
Epoch 15, Loss: 0.11129027605056763
Epoch 16, Loss: 0.08485733717679977
Epoch 17, Loss: 0.07812585681676865
Epoch 18, Loss: 0.08550560474395752
Epoch 19, Loss: 0.1536358743906021
Epoch 20, Loss: 0.0030576311983168125
Epoch 21, Loss: 0.08321792632341385
Epoch 22, Loss: 0.06259900331497192
Epoch 23, Loss: 0.2605583667755127
Epoch 24, Loss: 0.011140079237520695
Epoch 25, Loss: 0.033462852239608765
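Note that the loop above prints the loss of the last mini-batch of each epoch, which is one reason the values jump around from epoch to epoch. A smoother signal is the average loss over all samples in the epoch. The sketch below shows one way to track it; it is self-contained, so it uses synthetic data and an `nn.Sequential` stand-in for the model, both of which are assumptions for the example.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic binary classification data (not the breast cancer dataset).
X_toy = torch.randn(64, 4)
y_toy = (X_toy[:, 0] > 0).float()
toy_loader = DataLoader(TensorDataset(X_toy, y_toy), batch_size=16, shuffle=True)

# Stand-in model with the same structure as the Model class above.
toy_model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 1), nn.Sigmoid())
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
toy_loss_function = nn.BCELoss()

epoch_losses = []
for epoch in range(5):
    running_loss = 0.0
    for batch_features, batch_labels in toy_loader:
        y_pred = toy_model(batch_features)
        loss = toy_loss_function(y_pred, batch_labels.view(-1, 1))
        toy_optimizer.zero_grad()
        loss.backward()
        toy_optimizer.step()
        # BCELoss returns the batch mean, so weight it by the batch size.
        running_loss += loss.item() * batch_features.size(0)
    # Average over every sample seen in this epoch.
    epoch_loss = running_loss / len(toy_loader.dataset)
    epoch_losses.append(epoch_loss)
    print(f"Epoch {epoch + 1}, Average loss: {epoch_loss:.4f}")
```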


#### Evaluation

In [38]:
model.eval()
accuracy_list = []
with torch.no_grad():
  for batch_features, batch_labels in test_loader:
    y_pred = model(batch_features)
    y_pred = (y_pred > 0.5).float()
    batch_accuracy = (y_pred.view(-1) == batch_labels).float().mean()
    accuracy_list.append(batch_accuracy.item())

overall_accuracy = sum(accuracy_list) / len(accuracy_list)
print(f"Accuracy : {overall_accuracy}")

Accuracy : 0.9782986044883728
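The value above is the unweighted mean of the per-batch accuracies, so a smaller final batch counts as much as a full one. Assuming the test split has 114 rows (20% of 569), `batch_size=32` gives batches of 32, 32, 32 and 18 samples. The per-batch correct counts below are hypothetical, chosen only because they are consistent with the printed value; they illustrate how the batch mean can differ from the sample-weighted accuracy.

```python
# Hypothetical per-batch results: (number correct, batch size).
batch_results = [(31, 32), (32, 32), (32, 32), (17, 18)]

# Unweighted mean of per-batch accuracies (what the loop above computes).
mean_of_batches = sum(c / n for c, n in batch_results) / len(batch_results)

# Sample-weighted accuracy: total correct divided by total samples.
total_correct = sum(c for c, _ in batch_results)
total_samples = sum(n for _, n in batch_results)
weighted_accuracy = total_correct / total_samples

print(f"Mean of batch accuracies : {mean_of_batches:.4f}")
print(f"Sample-weighted accuracy : {weighted_accuracy:.4f}")
```

In the evaluation loop, the same effect can be obtained by accumulating the number of correct predictions and the number of samples across batches, then dividing once at the end.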
