# **Binary Classification of Breast Cancer using PyTorch**
---

## **Overview**
>This notebook demonstrates a step-by-step implementation of a binary classification task using PyTorch to predict whether breast cancer is malignant or benign based on clinical features. The dataset is preprocessed by scaling the features using `StandardScaler` and encoding the target labels into a numerical format. A simple single-layer neural network, equivalent to logistic regression, is built using PyTorch's `nn.Module`. The model is trained using the Binary Cross-Entropy Loss with logits (`BCEWithLogitsLoss`) and optimized with Stochastic Gradient Descent (SGD). A training loop is implemented to perform forward and backward passes, calculate the loss, and update model parameters iteratively. Finally, the model is evaluated on the test dataset to compute its accuracy, showcasing the entire workflow of training and testing a binary classifier with PyTorch.

---
## **Install Required Libraries**
>Installs the `torchinfo` library, which is useful for summarizing PyTorch models.

In [42]:
!pip install torchinfo



---
## **Import Required Libraries**
>- `torch`: Core PyTorch library.
- `torch.nn`: Module for defining and training neural networks.
- `pandas`: For loading and preprocessing the dataset.
- `numpy`: To handle numerical operations.
- `sklearn`: Provides tools for splitting data, scaling features, and encoding labels.
- `torchinfo`: To summarize the PyTorch model architecture (not used in this notebook).

In [43]:
import torch
import torch.nn as nn
from torchinfo import summary
import pandas as pd

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

---
## **Load and Preprocess the Dataset**
>- Loads the Breast Cancer Wisconsin dataset from a GitHub link.
- Drops unnecessary columns (`id` and `Unnamed: 32`) since they are not relevant for model training.

In [44]:
df = pd.read_csv("https://raw.githubusercontent.com/gscdit/Breast-Cancer-Detection/refs/heads/master/data.csv")

In [45]:
df.drop(columns=["id", "Unnamed: 32"], inplace=True)

---
## **Split Data into Training and Testing Sets**
>Splits the dataset into training (80%) and testing (20%) sets.
 - `X`: Feature columns.
 - `y`: Target column (diagnosis).
 - `random_state=42`: Ensures reproducibility.

In [46]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:], df.iloc[:, 0],
                                                    test_size=0.2, random_state=42)

---
## **Feature Scaling and Label Encoding**
>- **Feature Scaling**: Standardizes the feature values to have a mean of 0 and a standard deviation of 1 using StandardScaler. This is important to ensure faster and stable convergence.
- **Label Encoding**: Converts the categorical labels (e.g., "M" and "B") into numerical format (0 and 1).

In [47]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

In [48]:
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)

---
## **Convert Data into PyTorch Tensors**
>- Converts numpy arrays into PyTorch tensors.
- Adds an extra dimension to `y_train_tensor` and `y_test_tensor` using `.unsqueeze(1)` to match the model's output shape (`batch_size, 1`).
- Data type is explicitly set to `torch.float`.

In [55]:
X_train_tensor = torch.from_numpy(X_train).float()
X_test_tensor = torch.from_numpy(X_test).float()

y_train_tensor = torch.from_numpy(y_train).float()
y_test_tensor = torch.from_numpy(y_test).float()

---
## **Define the Neural Network Model**
>Defines a simple single-layer neural network (logistic regression).
 - The `forward()` method computes the linear transformation without applying a sigmoid activation.
 - **Input**: `num_features` (number of input features).
 - **Output**: A single value (logit).

In [56]:
class Neuron(nn.Module):
  def __init__(self, num_features):
    super().__init__()
    self.linear = nn.Linear(num_features, 1)
    self.sigmoid = nn.Sigmoid()

  def forward(self, features):
    # return self.sigmoid(self.linear(features))
    return self.linear(features)

---
## **Initialize Model, Loss Function, and Optimizer**
>- **Model**: Instantiates the Neuron model with input size equal to the number of features.
- **Loss Function**: BCEWithLogitsLoss combines the sigmoid activation and binary cross-entropy loss in a numerically stable way.
- **Optimizer**: Stochastic Gradient Descent (SGD) with a learning rate of 0.1.

In [63]:
loss = nn.BCELoss()

In [64]:
model = Neuron(X_train_tensor.shape[1])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

In [65]:
summary(model, input_data=X_train_tensor)

Layer (type:depth-idx)                   Output Shape              Param #
Neuron                                   [455, 1]                  --
├─Linear: 1-1                            [455, 1]                  31
├─Sigmoid: 1-2                           [455, 1]                  --
Total params: 31
Trainable params: 31
Non-trainable params: 0
Total mult-adds (M): 0.01
Input size (MB): 0.05
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.06

---
## **Training Loop**
>- Loops over 25 epochs to train the model.
- **Forward Pass**: Computes predictions for the training data.
- **Loss Calculation**: Computes the binary cross-entropy loss with logits.
- **Backward Pass**: Calculates gradients using `.backward()`.
- **Optimizer Step**: Updates model parameters using `optimizer.step()`.
- Prints the loss at each epoch to monitor training progress.


In [66]:
for epoch in range(25):
    y_pred = model(X_train_tensor)
    l = loss(y_pred, y_train_tensor.unsqueeze(1).float())

    optimizer.zero_grad()
    l.backward()
    optimizer.step()

    print(f"Epoch: {epoch+1}, Loss: {l.item()}")

Epoch: 1, Loss: 0.7291861772537231
Epoch: 2, Loss: 0.5418397188186646
Epoch: 3, Loss: 0.44454148411750793
Epoch: 4, Loss: 0.3867309093475342
Epoch: 5, Loss: 0.3481771647930145
Epoch: 6, Loss: 0.3203520178794861
Epoch: 7, Loss: 0.2991158664226532
Epoch: 8, Loss: 0.28222930431365967
Epoch: 9, Loss: 0.26837706565856934
Epoch: 10, Loss: 0.2567354738712311
Epoch: 11, Loss: 0.24676178395748138
Epoch: 12, Loss: 0.23808303475379944
Epoch: 13, Loss: 0.23043392598628998
Epoch: 14, Loss: 0.22362013161182404
Epoch: 15, Loss: 0.21749570965766907
Epoch: 16, Loss: 0.2119486927986145
Epoch: 17, Loss: 0.20689138770103455
Epoch: 18, Loss: 0.20225408673286438
Epoch: 19, Loss: 0.19798055291175842
Epoch: 20, Loss: 0.19402477145195007
Epoch: 21, Loss: 0.19034862518310547
Epoch: 22, Loss: 0.18692027032375336
Epoch: 23, Loss: 0.18371275067329407
Epoch: 24, Loss: 0.18070323765277863
Epoch: 25, Loss: 0.17787204682826996


---
## **Model Evaluation on Test Data**
>- **Disables Gradient Calculation**: Using `torch.no_grad()` reduces memory usage during evaluation.
- **Model Prediction**: Passes test data through the model and applies the sigmoid activation to convert logits into probabilities.
- **Thresholding**: Converts probabilities into binary predictions using a threshold of `0.5`.
- **Accuracy Calculation**: Computes the fraction of correct predictions.
- Prints the test accuracy as a percentage.

In [67]:
with torch.no_grad():
  y_pred = model.forward(X_test_tensor)
  y_pred_class = (y_pred > 0.5).float()
  accuracy = (y_pred_class == y_test_tensor).float().mean()

  print(f"Test Accuracy: {accuracy.item()}")

Test Accuracy: 0.5301631093025208
