# **Using PyTorch Dataset and DataLoader for Breast Cancer Detection with a Simple Neural Network**
---

## **Overview**
>This code demonstrates the use of PyTorch's Dataset and DataLoader classes for managing and batching data in a binary classification task. It focuses on classifying breast cancer cases as malignant or benign using a simple neural network. The dataset is preprocessed by cleaning irrelevant columns, splitting it into training and test sets, scaling the features, and encoding labels for compatibility with PyTorch. A custom Dataset class is implemented to encapsulate the features and labels, while DataLoader is used to handle batching and shuffling of the data during training and evaluation. A single-layer neural network is then defined, trained using Binary Cross-Entropy Loss, and optimized with Stochastic Gradient Descent. Finally, the model is evaluated for accuracy, showcasing the efficient data handling capabilities provided by Dataset and DataLoader.

---
## **Importing Libraries**
>Import the libraries needed for:
 - Handling data (`pandas`, `numpy`).
 - Splitting and preprocessing (`train_test_split`, `StandardScaler`, `LabelEncoder`).
 - Batching data and building the neural network (`torch`, `torch.utils.data`, `torch.nn`).

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

---
## **Loading and Cleaning the Dataset**
>- **Dataset**: A CSV file containing breast cancer data is loaded directly from GitHub. <br>
- **Cleaning**: Irrelevant columns (`id` and `Unnamed: 32`) are removed to retain only meaningful features and the target label.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/gscdit/Breast-Cancer-Detection/refs/heads/master/data.csv")

In [None]:
df.drop(columns=["id", "Unnamed: 32"], inplace=True)
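
The cleaning step can be tried offline on a small stand-in table. This is a minimal sketch, not the real data: `toy_df` and the column names `diagnosis` and `radius_mean` are assumptions made for illustration. It also passes `errors="ignore"`, which keeps `drop` from raising if the columns are already gone, for example when the cell is run a second time.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the real CSV (assumed column layout).
toy_df = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Drop the identifier and the empty trailing column without modifying toy_df.
cleaned = toy_df.drop(columns=["id", "Unnamed: 32"], errors="ignore")

# Dropping again is harmless because missing columns are ignored.
cleaned_again = cleaned.drop(columns=["id", "Unnamed: 32"], errors="ignore")

print(cleaned.columns.tolist())
print(int(cleaned.isna().sum().sum()))
```

Checking `isna()` after cleaning is a quick way to confirm that no empty column survived.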

---
## **Splitting the Data into Training and Test Sets**
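>- **Splitting**: The data is divided into:
 - Features (`X`): all columns except the first.
 - Labels (`y`): the first column, which holds the target label once `id` has been dropped. <br>
- **Proportions**: 80% of the data is used for training and 20% for testing, with `random_state=42` so the split is reproducible.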

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:], df.iloc[:, 0],
                                                    test_size=0.2, random_state=42)
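
The split above does not control the class balance in each part. As an optional variation, `train_test_split` accepts a `stratify` argument that keeps the label proportions the same in both parts. The sketch below uses synthetic data (`X_toy`, `y_toy` and the 40/60 class mix are assumptions, not values from the dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 4 features, 40 "M" and 60 "B" labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = np.array(["M"] * 40 + ["B"] * 60)

# stratify=y_toy preserves the 40/60 ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(int((y_te == "M").sum()), int((y_te == "B").sum()))
```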

---
## **Data Scaling and Encoding**
>- **Scaling**: Standardizes the feature values for better performance during training.
- **Encoding**: Converts the string labels (`M` for malignant, `B` for benign) into numeric values for compatibility with PyTorch. `LabelEncoder` assigns codes in sorted order, so `B` becomes 0 and `M` becomes 1.

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)
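
The fit/transform pattern used here can be checked on a tiny example. This is a sketch with made-up numbers (all `toy_` names are hypothetical). The scaler and encoder are fitted on the training part only and then reused on the test part, so no test statistics leak into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up training and test features (two columns on different scales).
toy_X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_X_test = np.array([[2.0, 40.0]])

toy_scaler = StandardScaler()
toy_X_train_scaled = toy_scaler.fit_transform(toy_X_train)  # learns mean and std
toy_X_test_scaled = toy_scaler.transform(toy_X_test)        # reuses them

# Each training column now has mean 0; the test row is scaled with train stats.
print(toy_X_train_scaled.mean(axis=0))
print(toy_X_test_scaled)

toy_encoder = LabelEncoder()
toy_y_train = toy_encoder.fit_transform(np.array(["M", "B", "B"]))

# classes_ lists the labels in the order of their integer codes.
print(toy_encoder.classes_.tolist(), toy_y_train.tolist())
```

`toy_encoder.inverse_transform` maps the integer codes back to the original strings when predictions need to be reported.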

## **Converting Data to PyTorch Tensors**
> Converts numpy arrays for features and labels into PyTorch tensors. Tensors are necessary for PyTorch operations.


In [None]:
X_train_tensor = torch.from_numpy(X_train).float()
X_test_tensor = torch.from_numpy(X_test).float()

y_train_tensor = torch.from_numpy(y_train).float()
y_test_tensor = torch.from_numpy(y_test).float()
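
The `.float()` calls matter because `torch.from_numpy` keeps the NumPy dtype, while `nn.Linear` parameters are `float32` by default and the binary cross-entropy losses expect floating-point targets. A small sketch with made-up arrays (`features_np` and `labels_np` are hypothetical names):

```python
import numpy as np
import torch

features_np = np.array([[0.5, -1.2], [1.5, 0.3]])  # float64 by default
labels_np = np.array([1, 0])                        # integer dtype

# Without .float(), the tensor keeps the NumPy dtype.
raw_features = torch.from_numpy(features_np)

# With .float(), both tensors become float32.
features_t = torch.from_numpy(features_np).float()
labels_t = torch.from_numpy(labels_np).float()

print(raw_features.dtype, features_t.dtype, labels_t.dtype)
```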

## **Creating a Custom Dataset Class**
>- A custom dataset class inherits from `torch.utils.data.Dataset`.
- Provides methods to retrieve:
 - The total number of samples (`__len__`).
 - A specific sample and label by index (`__getitem__`).

In [None]:
class CustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
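
A quick way to see what the two methods do is to index a small dataset by hand. The sketch below uses a hypothetical `PairDataset` class with the same structure and toy tensors. It also compares the result with PyTorch's built-in `TensorDataset`, which offers the same behavior for plain tensors.

```python
import torch
from torch.utils.data import Dataset, TensorDataset


class PairDataset(Dataset):
    """Hypothetical minimal dataset holding a feature tensor and a label tensor."""

    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        # Number of samples is the size of the first dimension of X.
        return len(self.X)

    def __getitem__(self, idx):
        # One (features, label) pair.
        return self.X[idx], self.y[idx]


toy_X = torch.arange(10, dtype=torch.float32).reshape(5, 2)
toy_y = torch.tensor([0.0, 1.0, 0.0, 1.0, 1.0])

toy_dataset = PairDataset(toy_X, toy_y)
x0, y0 = toy_dataset[0]
print(len(toy_dataset), x0.tolist(), y0.item())

# The built-in TensorDataset returns the same pairs.
builtin_dataset = TensorDataset(toy_X, toy_y)
bx0, by0 = builtin_dataset[0]
print(torch.equal(x0, bx0), torch.equal(y0, by0))
```

A custom class becomes worthwhile once samples need per-item work, such as loading from disk or applying transforms.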

## **Preparing DataLoaders**
>- Wraps the datasets in `DataLoader` objects for:
 - Batch processing (batch size = 32).
 - Shuffling the training data each epoch. The test data is left unshuffled, since evaluation does not benefit from random ordering.

In [None]:
train_dataset = CustomDataset(X_train_tensor, y_train_tensor)
test_dataset = CustomDataset(X_test_tensor, y_test_tensor)

In [None]:
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
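
Iterating over a loader shows how batching works in practice. In this sketch the dataset size (70) and feature count (4) are arbitrary assumptions. When the dataset size is not a multiple of the batch size, the last batch is smaller.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Arbitrary toy data: 70 samples with 4 features each.
toy_ds = TensorDataset(torch.randn(70, 4), torch.zeros(70))
toy_loader = DataLoader(toy_ds, batch_size=32, shuffle=False)

# Each iteration yields a (features, labels) pair for one batch.
batch_sizes = [features.shape[0] for features, labels in toy_loader]

print(batch_sizes)       # [32, 32, 6]
print(len(toy_loader))   # 3 batches per pass
```

Passing `drop_last=True` to `DataLoader` discards the incomplete final batch if uniform batch sizes are required.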

## **Defining the Neural Network**
>- Implements a single-layer neural network with:
 - One linear layer (`nn.Linear`) for mapping features to the output.
 - A forward method defining the computation for the input data.

In [None]:
class Neuron(nn.Module):
  def __init__(self, num_features):
    super().__init__()
    self.linear = nn.Linear(num_features, 1)

  def forward(self, features):
    return self.linear(features)
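
The `forward` method returns raw scores (logits), one per sample, not probabilities. The sketch below uses a bare `nn.Linear(4, 1)` as a stand-in for the model's only layer. The feature count of 4 and the batch of 8 are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

toy_layer = nn.Linear(4, 1)   # stand-in for the single linear layer
batch = torch.randn(8, 4)     # 8 samples, 4 features

logits = toy_layer(batch)     # raw scores, any real value
probs = torch.sigmoid(logits) # squashed into the range [0, 1]

# One weight per feature plus one bias.
n_params = sum(p.numel() for p in toy_layer.parameters())

print(logits.shape, n_params)
print(probs.min().item(), probs.max().item())
```

With a sigmoid applied to its output, this single linear layer is equivalent to logistic regression.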

## **Initializing the Model, Loss, and Optimizer**
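>- **Loss Function**: `BCEWithLogitsLoss`, which applies the sigmoid internally and then computes Binary Cross-Entropy. `Neuron.forward` returns raw scores (logits), so plain `BCELoss`, which expects probabilities in [0, 1], would not fit here.
- **Model**: An instance of the Neuron class initialized with the number of features.
- **Optimizer**: Stochastic Gradient Descent (`SGD`) with a learning rate of 0.1.

In [None]:
# The model outputs logits, so use the loss that includes the sigmoid.
loss = nn.BCEWithLogitsLoss()
model = Neuron(X_train_tensor.shape[1])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)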
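
Because the model returns raw scores without a sigmoid, the choice of loss has to account for that. The sketch below compares two equivalent options on made-up values: `nn.BCELoss` applied to sigmoid outputs, and `nn.BCEWithLogitsLoss` applied directly to the raw scores. The tensors are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Made-up raw model outputs and their binary targets.
logits = torch.tensor([[2.0], [-1.0], [0.5]])
targets = torch.tensor([[1.0], [0.0], [1.0]])

# Option 1: convert to probabilities first, then use BCELoss.
loss_on_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Option 2: let the loss apply the sigmoid internally.
loss_on_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_on_probs.item(), loss_on_logits.item())
```

Both give the same value here. The second form is generally preferred because it is more numerically stable for large positive or negative scores.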

## **Training the Model**
>- **Epoch Loop**: Runs the training for 25 epochs (full passes over the training data).
- **Batch Loop**: Iterates through batches of training data.
- **Forward Pass**: Computes the model's predictions.
- **Loss Calculation**: Compares predictions to actual labels.
- **Backward Pass**: Clears old gradients, then computes the gradient of the loss with respect to the parameters.
- **Optimizer Step**: Updates the model parameters using those gradients.
- **Logging**: Prints the loss of the last batch in each epoch.

In [None]:
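for epoch in range(25):
  for features, labels in train_loader:
    # Forward pass: one raw score per sample, shape (batch_size, 1).
    output = model(features)

    # Labels have shape (batch_size,), so add a dimension to match the output.
    batch_loss = loss(output, labels.unsqueeze(1))
    optimizer.zero_grad()   # clear gradients left from the previous step
    batch_loss.backward()   # compute new gradients
    optimizer.step()        # update the parameters

  # Loss of the last batch in this epoch.
  print(f"Epoch: {epoch} Loss: {batch_loss.item()}")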
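
The loss of a single batch can be noisy. One possible refinement is to log the average loss over the whole epoch, weighting each batch by its size. The sketch below does this on synthetic data. The data, the `toy_` names, the 5 epochs, and the use of `nn.BCEWithLogitsLoss` on raw outputs are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic, linearly separable data: label is 1 when the first feature is positive.
toy_X = torch.randn(200, 4)
toy_y = (toy_X[:, 0] > 0).float()
toy_loader = DataLoader(TensorDataset(toy_X, toy_y), batch_size=32, shuffle=True)

toy_model = nn.Linear(4, 1)
criterion = nn.BCEWithLogitsLoss()
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(5):
    running_loss, n_seen = 0.0, 0
    for features, labels in toy_loader:
        logits = toy_model(features)
        batch_loss = criterion(logits, labels.unsqueeze(1))

        toy_optimizer.zero_grad()
        batch_loss.backward()
        toy_optimizer.step()

        # Weight by batch size so a short final batch does not skew the mean.
        running_loss += batch_loss.item() * features.size(0)
        n_seen += features.size(0)

    epoch_losses.append(running_loss / n_seen)
    print(f"Epoch: {epoch} Mean loss: {epoch_losses[-1]:.4f}")
```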

## **Evaluating the Model**
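>- **Evaluation Mode**: `model.eval()` switches layers such as dropout and batch normalization to inference behavior (this model has neither, so it is a safeguard), while `torch.no_grad()` disables gradient tracking to save time and memory.
- **Predictions**: Applies a sigmoid function to map outputs to probabilities, then rounds to binary values.
- **Accuracy Calculation**: Compares predictions with actual labels.
- **Result**: Prints the average of the per-batch accuracies.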

In [None]:
model.eval()
accuracy_list = []

with torch.no_grad():
  for features, labels in test_loader:
    output = model(features)
    output = torch.round(torch.sigmoid(output))
    accuracy = (output == labels.unsqueeze(1)).float().mean()
    accuracy_list.append(accuracy.item())  # store a plain float, not a tensor

print(f"Accuracy: {sum(accuracy_list) / len(accuracy_list):.4f}")
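
Averaging per-batch accuracies gives every batch equal weight, so a short final batch counts as much as a full one. An alternative is to count correct predictions across all samples. The sketch below contrasts the two on made-up predictions split into batches of 4 and 1. All values are illustrative assumptions.

```python
import torch

# Made-up binary predictions and true labels for five samples.
preds = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])
labels = torch.tensor([1.0, 0.0, 1.0, 1.0, 1.0])

# Two uneven batches: the first four samples, then the last one.
batches = [(preds[:4], labels[:4]), (preds[4:], labels[4:])]

# Mean of per-batch accuracies: each batch counts equally.
batch_accs = [(p == t).float().mean().item() for p, t in batches]
mean_of_batch_means = sum(batch_accs) / len(batch_accs)

# Sample-weighted accuracy: each sample counts equally.
correct = sum((p == t).sum().item() for p, t in batches)
total = sum(t.numel() for p, t in batches)
overall_accuracy = correct / total

print(mean_of_batch_means)  # 0.5
print(overall_accuracy)     # 0.8
```

The two numbers coincide only when all batches have the same size.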