### Importing libraries, loading data, and preprocessing

This section imports the core Python libraries used in the notebook (NumPy, Pandas, PyTorch, and scikit-learn), loads the breast cancer dataset into a DataFrame, and drops the two columns that carry no predictive information (`id` and the empty `Unnamed: 32`). Run these cells first so that the environment and data are available for the later preprocessing and model steps.

In [None]:
import numpy as np # Import NumPy for numerical operations
import pandas as pd # Import Pandas for data manipulation and analysis
import torch # Import PyTorch for building and training neural networks
from sklearn.model_selection import train_test_split # Import for splitting data into training and testing sets
from sklearn.preprocessing import StandardScaler # Import for feature scaling
from sklearn.preprocessing import LabelEncoder # Import for encoding categorical labels

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/gscdit/Breast-Cancer-Detection/refs/heads/master/data.csv') # Load the dataset from the specified URL into a Pandas DataFrame
df.head() # Display the first 5 rows of the DataFrame to inspect the data

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [None]:
df.shape # Display the dimensions (number of rows, number of columns) of the DataFrame

(569, 33)

In [None]:
df.drop(columns=['id', 'Unnamed: 32'], inplace=True) # Drop the 'id' (identifier) and 'Unnamed: 32' (empty) columns. inplace=True modifies the DataFrame directly.

In [None]:
df.head() # Display the first 5 rows of the DataFrame again to confirm column removal

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
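Before dropping columns by name, it can help to confirm which ones are actually empty. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the dataset) to show one way of detecting all-NaN columns and checking the class balance.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded DataFrame
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 11.42, 12.30],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

# Drop the identifier plus any fully empty columns
cleaned = toy.drop(columns=["id"] + empty_cols)
print(cleaned.columns.tolist())

# Class balance of the target column
print(cleaned["diagnosis"].value_counts())
```

Detecting empty columns programmatically avoids hard-coding a name such as `Unnamed: 32`, which depends on how the CSV happened to be exported.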


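### Train-test split

Separates the features from the target column and splits the dataset into training and test sets. The first column (`diagnosis`) is the target; the remaining 30 columns are the features. `test_size=0.2` holds out 20% of the rows for testing; adjust it to change the ratio. Because no `random_state` is passed, the split differs on every run. Train on the training set and evaluate on the test set to estimate generalization performance.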

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:], df.iloc[:, 0], test_size=0.2) # Split features (all columns except the first) and target (first column) into training and testing sets. 20% of the data is used for testing.
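If reproducible results are needed, `train_test_split` also accepts `random_state` and `stratify`. The following is a minimal sketch on synthetic data (the arrays `X` and `y` and the seed value are assumptions, not part of the notebook) showing that stratifying keeps the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 features, 30% "M" and 70% "B"
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array(["M"] * 30 + ["B"] * 70)

# random_state fixes the shuffle; stratify=y preserves the M/B ratio in each split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)
print((y_tr == "M").mean(), (y_te == "M").mean())
```

With an imbalanced target, an unstratified 20% test set can end up with a noticeably different class ratio from the training set, which makes accuracy harder to interpret.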

### Scaling

Standardizes each feature to zero mean and unit variance using `StandardScaler`. Scaling numerical features helps gradient-based models converge faster, because no single feature dominates the updates. Fit the scaler on the training data only and apply the same transform to the test data, so that no information from the test set leaks into preprocessing.

In [None]:
scaler = StandardScaler() # Initialize a StandardScaler object
X_train = scaler.fit_transform(X_train) # Fit the scaler on the training data and transform it
X_test = scaler.transform(X_test) # Transform the test data using the scaler fitted on the training data
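To see what "fit on train, transform test" does in practice, here is a small sketch on synthetic data (the arrays `train` and `test` are assumptions for illustration). The training columns come out with mean 0 and standard deviation 1, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Training columns: mean 0 and std 1 up to floating-point error
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))

# Test columns: close to 0 and 1, but not exact
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))

# transform() is just (x - mean_) / scale_ with the stored training values
manual = (test - scaler.mean_) / scaler.scale_
print(np.allclose(manual, test_scaled))
```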

In [None]:
X_train # Display the scaled training features

array([[-6.50565842e-01, -1.48908266e-01, -5.76498693e-01, ...,
         6.15575005e-01,  2.86214223e+00,  2.99362329e+00],
       [ 5.97634634e-01,  9.02319666e-01,  4.91580943e-01, ...,
        -2.20860954e-01,  3.81162965e-02,  4.72656715e-06],
       [-1.58044534e+00, -8.13451034e-01, -1.52901703e+00, ...,
        -1.40800994e-01,  3.84791784e-03,  9.23682295e-01],
       ...,
       [-3.54939414e-01,  6.91610984e-01, -2.60429911e-01, ...,
         1.27282915e+00,  7.53079287e-01,  2.13876454e+00],
       [ 6.96176777e-01,  9.13897066e-01,  7.63236241e-01, ...,
         1.68609104e+00,  2.96027440e+00,  6.24750555e-01],
       [-2.95216903e-01, -2.13741707e-01, -3.74697615e-01, ...,
        -9.77541944e-01, -1.45878879e+00, -1.22744340e+00]])

In [None]:
y_train # Display the training labels (before encoding)

Unnamed: 0,diagnosis
31,M
10,M
114,B
487,M
28,M
...,...
534,B
318,B
229,M
370,M


### Label Encoding

Converts the categorical labels to numeric form using `LabelEncoder`. The encoder assigns integers to the classes in sorted order, so `B` (benign) becomes 0 and `M` (malignant) becomes 1. Binary classification losses in PyTorch expect numeric labels such as 0/1. Keep the encoder if you need to inverse-transform predictions back to the original labels.

In [None]:
encoder = LabelEncoder() # Initialize a LabelEncoder object
y_train = encoder.fit_transform(y_train) # Fit the encoder on training labels and transform them
y_test = encoder.transform(y_test) # Transform test labels using the fitted encoder
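The mapping and the round trip back to the original labels can be checked on a handful of values. This is a small sketch with made-up labels (the array `labels` is an assumption for illustration).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["M", "B", "B", "M", "B"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order; the index is the code
print(encoder.classes_)
print(encoded)

# Map numeric predictions back to the original labels
decoded = encoder.inverse_transform(np.array([0, 1, 1]))
print(decoded)
```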

In [None]:
y_train # Display the training labels after label encoding

array([1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,

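### NumPy arrays to PyTorch tensors

Converts the NumPy arrays into `torch.Tensor` objects so they can be used with PyTorch models and automatic differentiation. `torch.from_numpy` preserves the array's dtype, so the scaled features become `float64` tensors; the model parameters defined later use `float64` to match. Ensure the tensor `dtype` and device (CPU/GPU) match model expectations before training.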

In [None]:
X_train_tensor = torch.from_numpy(X_train) # Convert X_train NumPy array to a PyTorch tensor
X_test_tensor = torch.from_numpy(X_test) # Convert X_test NumPy array to a PyTorch tensor
y_train_tensor = torch.from_numpy(y_train) # Convert y_train NumPy array to a PyTorch tensor
y_test_tensor = torch.from_numpy(y_test) # Convert y_test NumPy array to a PyTorch tensor

In [None]:
X_train_tensor.shape # Display the shape of the X_train PyTorch tensor

torch.Size([455, 30])

In [None]:
y_train_tensor.shape # Display the shape of the y_train PyTorch tensor

torch.Size([455])
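As a rough illustration of the dtype and device points, the sketch below converts synthetic arrays (`features` and `labels` are assumptions, not the notebook's data) and casts them to `float32`, the default dtype of `torch.nn` layers. The notebook itself keeps `float64` throughout, which also works.

```python
import numpy as np
import torch

features = np.random.default_rng(0).normal(size=(4, 3))  # NumPy default: float64
labels = np.array([1, 0, 0, 1])                          # integer class labels

# from_numpy keeps the NumPy dtype
x64 = torch.from_numpy(features)
print(x64.dtype)

# Cast explicitly when a model expects float32 inputs and targets
x32 = torch.from_numpy(features).to(torch.float32)
y32 = torch.from_numpy(labels).to(torch.float32)

# Move tensors to the GPU if one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x32 = x32.to(device)
y32 = y32.to(device)
print(x32.dtype, y32.dtype, x32.device)
```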

### Defining the model

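Defines a minimal single-layer network as a plain Python class: the parameters (`weights`, `bias`), a `forward` method that computes predictions, and a `loss_function` that computes binary cross-entropy. With one linear layer followed by `torch.sigmoid`, the model is equivalent to logistic regression and outputs a probability between 0 and 1 for each sample. You can replace the architecture or activation function to experiment with different models.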

In [None]:
class MySimpleNN():

  def __init__(self, X):
    # Initialize weights with random values and bias with zeros. Both are PyTorch tensors requiring gradient calculation.
    self.weights = torch.rand(X.shape[1], 1, dtype=torch.float64, requires_grad=True)
    self.bias = torch.zeros(1, dtype=torch.float64, requires_grad=True)

  def forward(self, X):
    # Perform a linear transformation (matrix multiplication with weights and addition of bias)
    z = torch.matmul(X, self.weights) + self.bias
    # Apply the sigmoid activation function to squash the output between 0 and 1
    y_pred = torch.sigmoid(z)
    return y_pred

  def loss_function(self, y_pred, y):
    # Clamp predictions away from 0 and 1 so that log() never receives 0
    epsilon = 1e-7
    y_pred = torch.clamp(y_pred, epsilon, 1 - epsilon)

    # Reshape the labels from (n,) to (n, 1) so they align element-wise with y_pred;
    # without this, broadcasting would produce an (n, n) matrix
    y = y.view(-1, 1)

    # Calculate the Binary Cross-Entropy loss from the labels passed in as `y`.
    # The formula is -(y * log(y_pred) + (1 - y) * log(1 - y_pred)).mean()
    loss = -(y * torch.log(y_pred) + (1 - y) * torch.log(1 - y_pred)).mean()
    return loss
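One thing worth checking with a hand-written loss is the shape of its inputs: `forward` returns predictions of shape `(n, 1)`, while label tensors are often of shape `(n,)`. The sketch below uses a standalone helper `bce` (a hypothetical function with made-up values, not the class method) and compares it against PyTorch's built-in `binary_cross_entropy`, with and without aligned shapes.

```python
import torch
import torch.nn.functional as F

y_pred = torch.tensor([[0.9], [0.2], [0.7]], dtype=torch.float64)  # shape (3, 1)
y = torch.tensor([1.0, 0.0, 1.0], dtype=torch.float64)             # shape (3,)

def bce(y_pred, y, eps=1e-7):
    """Hand-written binary cross-entropy, averaged over all elements."""
    y_pred = torch.clamp(y_pred, eps, 1 - eps)
    return -(y * torch.log(y_pred) + (1 - y) * torch.log(1 - y_pred)).mean()

aligned = bce(y_pred, y.view(-1, 1))   # (3, 1) against (3, 1): element-wise
broadcast = bce(y_pred, y)             # (3, 1) against (3,): silently becomes (3, 3)
reference = F.binary_cross_entropy(y_pred, y.view(-1, 1))

print(aligned.item(), reference.item())  # these two agree
print(broadcast.item())                  # this one does not
```

No error is raised in the broadcast case, so a shape assertion or an explicit `view(-1, 1)` is a cheap safeguard.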

### Important Parameters

Defines key hyperparameters used by the training loop, such as `learning_rate` and `epochs`. Tune these values to control optimization speed and training duration; consider using smaller learning rates with more epochs or learning rate schedules for better convergence.

In [None]:
learning_rate = 0.1 # Set the learning rate, which determines the step size during gradient descent
epochs = 25 # Set the number of training epochs (complete passes through the training dataset)
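A learning rate schedule can be as simple as a function of the epoch number. The following sketch shows step decay; `step_decay` and its `drop` and `every` parameters are hypothetical names chosen for illustration, not something the notebook defines.

```python
def step_decay(initial_lr, epoch, drop=0.5, every=10):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return initial_lr * (drop ** (epoch // every))

# Learning rate at a few epochs, starting from 0.1
for epoch in (0, 9, 10, 20):
    print(epoch, step_decay(0.1, epoch))
```

Inside a training loop, the returned value would replace the fixed `learning_rate` in the parameter update for that epoch.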

### Training Pipeline

Implements the training loop: forward pass to compute predictions, loss computation, backward pass to compute gradients, and parameter updates using gradient descent. Gradients are then zeroed to avoid accumulation. Monitor the printed loss to track training progress.

In [None]:
# Create an instance of the MySimpleNN model, passing the training features tensor
model = MySimpleNN(X_train_tensor)

# Define the training loop, iterating for the specified number of epochs
for epoch in range(epochs):

  # Forward pass: calculate predicted outputs (y_pred) based on current weights and bias
  y_pred = model.forward(X_train_tensor)

  # Loss calculation: compute the binary cross-entropy loss between predictions and actual labels
  loss = model.loss_function(y_pred, y_train_tensor)

  # Backward pass: compute gradients of the loss with respect to weights and bias
  loss.backward()

  # Parameters update: update weights and bias using gradient descent
  # torch.no_grad() ensures that these operations are not included in the gradient calculation
  with torch.no_grad():
    model.weights -= learning_rate * model.weights.grad # Update weights
    model.bias -= learning_rate * model.bias.grad # Update bias

  # Zero gradients: reset gradients to zero to prevent accumulation across epochs
  model.weights.grad.zero_()
  model.bias.grad.zero_()

  # Print loss in each epoch to monitor training progress
  print(f'Epoch: {epoch + 1}, Loss: {loss.item()}')

Epoch: 1, Loss: 3.8734547280441722
Epoch: 2, Loss: 3.747472466032905
Epoch: 3, Loss: 3.6134186400812918
Epoch: 4, Loss: 3.470439804709892
Epoch: 5, Loss: 3.322739972220773
Epoch: 6, Loss: 3.1694648115531123
Epoch: 7, Loss: 3.0100440752635733
Epoch: 8, Loss: 2.844157902489334
Epoch: 9, Loss: 2.674609873272524
Epoch: 10, Loss: 2.4944661028022916
Epoch: 11, Loss: 2.310878252311848
Epoch: 12, Loss: 2.1262855749988945
Epoch: 13, Loss: 1.9355652547438864
Epoch: 14, Loss: 1.7455326544884913
Epoch: 15, Loss: 1.5644006244127746
Epoch: 16, Loss: 1.3939717578405724
Epoch: 17, Loss: 1.240793872530291
Epoch: 18, Loss: 1.1079115183532575
Epoch: 19, Loss: 0.9976743132283318
Epoch: 20, Loss: 0.9111688924041779
Epoch: 21, Loss: 0.8476079601805322
Epoch: 22, Loss: 0.8040883970545488
Epoch: 23, Loss: 0.776156415200584
Epoch: 24, Loss: 0.759000203855423
Epoch: 25, Loss: 0.7485464460500285


In [None]:
model.bias # Display the final learned bias value after training

tensor([-0.0940], dtype=torch.float64, requires_grad=True)
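The manual parameter update and gradient reset can also be delegated to an optimizer. Here is a minimal sketch on synthetic data, with `torch.optim.SGD` and `torch.nn.BCELoss` swapped in for the hand-written update and loss; the data, the seed, and the zero initialization are assumptions for illustration.

```python
import torch

torch.manual_seed(0)

# Synthetic, linearly separable binary problem
X = torch.randn(64, 3, dtype=torch.float64)
true_w = torch.tensor([[1.5], [-2.0], [0.5]], dtype=torch.float64)
y = (X @ true_w > 0).to(torch.float64)  # shape (64, 1), matches the predictions

weights = torch.zeros(3, 1, dtype=torch.float64, requires_grad=True)
bias = torch.zeros(1, dtype=torch.float64, requires_grad=True)

optimizer = torch.optim.SGD([weights, bias], lr=0.1)
loss_fn = torch.nn.BCELoss()

losses = []
for epoch in range(25):
    optimizer.zero_grad()                        # reset gradients from the previous step
    y_pred = torch.sigmoid(X @ weights + bias)   # forward pass
    loss = loss_fn(y_pred, y)                    # binary cross-entropy
    loss.backward()                              # compute gradients
    optimizer.step()                             # apply w -= lr * grad
    losses.append(loss.item())

print(losses[0], losses[-1])
```

With plain SGD and no momentum, `optimizer.step()` performs the same `param -= lr * grad` update as the manual loop; the optimizer mainly removes the need to list each parameter by hand.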

### Evaluation

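Evaluates the trained model on the test set. Gradient tracking is disabled during evaluation because no parameters are updated, which saves memory and computation. The predicted probabilities are thresholded to obtain binary outputs. The threshold used here, 0.9, is stricter than the common default of 0.5: fewer samples are labeled positive, which tends to raise precision and lower recall. Adjust it depending on your precision/recall tradeoff.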

In [None]:
# Model evaluation: Disable gradient calculations during evaluation
with torch.no_grad():
  y_pred = model.forward(X_test_tensor) # Perform a forward pass on the test data; the output has shape (n, 1)
  y_pred = (y_pred > 0.9).float().squeeze(1) # Convert to binary (0 or 1) using a threshold of 0.9, then flatten to shape (n,) to match the labels
  accuracy = (y_pred == y_test_tensor).float().mean() # Calculate the accuracy by comparing predictions to actual test labels element-wise
  print(f'Accuracy: {accuracy.item()}') # Print the calculated accuracy

Accuracy: 0.6378886103630066
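To make the threshold tradeoff concrete, the sketch below computes accuracy, precision, and recall at two thresholds. The probabilities, labels, and the `metrics` helper are made up for illustration and are not outputs of the trained model.

```python
import torch

probs = torch.tensor([[0.95], [0.60], [0.30], [0.92], [0.10]])  # shape (5, 1)
labels = torch.tensor([1, 1, 0, 0, 0])                           # shape (5,)

def metrics(probs, labels, threshold):
    """Return (accuracy, precision, recall) at the given threshold."""
    # Flatten (n, 1) to (n,) so the comparison with labels is element-wise
    preds = (probs.squeeze(1) > threshold).long()
    tp = ((preds == 1) & (labels == 1)).sum().item()
    fp = ((preds == 1) & (labels == 0)).sum().item()
    fn = ((preds == 0) & (labels == 1)).sum().item()
    accuracy = (preds == labels).float().mean().item()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

acc_05, prec_05, rec_05 = metrics(probs, labels, 0.5)
acc_09, prec_09, rec_09 = metrics(probs, labels, 0.9)
print(acc_05, prec_05, rec_05)
print(acc_09, prec_09, rec_09)
```

On this toy input the higher threshold misses one of the two positives, so recall drops from 1.0 to 0.5. For a medical screening task, a missed malignant case is usually the costlier error, which is worth weighing when picking the threshold.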
