<a href="https://colab.research.google.com/github/KedarPanchal/Breast-Cancer-Detector/blob/main/tumor_detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Python Version
This neural network runs on Python 3.12 to ensure compatibility with its dependencies. If you are running this notebook in a virtual environment, confirm that the correct runtime is selected by running the cell below.

In [None]:
!python --version

#### Install Dependencies
Installs the following dependencies for use in the notebook:
* **Torch:** The model is built using the PyTorch framework (this is also what limits the Python version to <= 3.12)
* **Torchvision:** Has functions for handling and preparing datasets for PyTorch models
* **Opendatasets:** Download datasets from the Kaggle online repository

In [None]:
%pip install torch
%pip install torchvision
%pip install opendatasets

#### Download and Prepare Datasets for Use
> Before running this code block, make sure you have your Kaggle username and API key on hand, as the download will prompt you for them. Visit the Kaggle website for information on how to acquire an API key.

This neural network combines data from two datasets:
* The Breast Ultrasound Images (BUSI) Dataset (Al-Dhabyani W, Gomaa M, Khaled H, Fahmy A. Dataset of breast ultrasound images. Data in Brief. 2020 Feb;28:104863. DOI: 10.1016/j.dib.2019.104863.)
* Vuppala Adithya Sairam's Ultrasound Breast Images for Breast Cancer dataset, whose only stated source is that it was aggregated from various open breast cancer ultrasound datasets

The BUSI dataset has an additional "normal" class of ultrasounds with no tumors, but these images are deleted because the purpose of this model is to identify whether a detected tumor is malignant or benign. The BUSI segmentation masks (files with `_mask` in their names) are deleted as well, since only the ultrasound images are used for classification. Both datasets have "benign" and "malignant" images, which are aggregated together. Sairam's dataset was already split into training and validation sets, but these are combined because this notebook randomly splits the data later on.

In [None]:
import opendatasets
opendatasets.download("https://www.kaggle.com/datasets/aryashah2k/breast-ultrasound-images-dataset")

!mkdir data
!rm -rf breast-ultrasound-images-dataset/Dataset_BUSI_with_GT/normal
!mv breast-ultrasound-images-dataset/Dataset_BUSI_with_GT/* data
!rm -rf breast-ultrasound-images-dataset
!find data -type f -name "*_mask*.png" -delete

opendatasets.download("https://kaggle.com/datasets/vuppalaadithyasairam/ultrasound-breast-images-for-breast-cancer")
!mv "ultrasound-breast-images-for-breast-cancer/ultrasound breast classification/train/benign"/* data/benign
!mv "ultrasound-breast-images-for-breast-cancer/ultrasound breast classification/val/benign"/* data/benign
!mv "ultrasound-breast-images-for-breast-cancer/ultrasound breast classification/train/malignant"/* data/malignant
!mv "ultrasound-breast-images-for-breast-cancer/ultrasound breast classification/val/malignant"/* data/malignant
!rm -rf "ultrasound-breast-images-for-breast-cancer"

#### Import Necessary Dependencies
The following dependencies are imported:
* `torch`: Contains various functions for identifying GPUs and manipulating Tensors
* `torch.nn`: Contains various prebuilt layers and classes to develop a deep learning network
* `torch.nn.functional`: Contains implementations of activation functions that add non-linearity to a deep learning network
* `torch.optim`: Contains optimizers used in model training
* `time`: Used in displaying metrics while training the model

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import models
import time

#### Select Device for Training
The following cell selects the best available device for training, testing, and running inference with the model. If a CUDA GPU is available, all calculations are performed on the GPU. If an M-series Mac is used, PyTorch's MPS backend is selected instead. Otherwise, all calculations are done on the CPU.

In [None]:
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available(): # This won't really work with 90% of the features on here but oh well!
    device = "mps"

print(f"Device: {device}")

#### Load and Transform Datasets
The `torchvision` library is used to load and transform the data. The data is turned into a labeled dataset with the following labels:
* Images in `data/benign` will have a label `0`
* Images in `data/malignant` will have a label `1`

Images in the dataset are also transformed. They are converted to grayscale (ultrasounds are in black and white anyway, so training on 3 color channels is a waste of computation power), randomly flipped horizontally, randomly rotated by up to 30 degrees in either direction, resized to 256 by 256 pixels, converted to tensors, and normalized with a mean and standard deviation of 0.5, which maps pixel values from $[0, 1]$ to $[-1, 1]$. The dataset is then randomly split into a training and a test dataset, with the training dataset containing `80%` of the original dataset and the test dataset containing the remaining `20%`. Because both splits share the same underlying dataset, the same transform (including the random flip and rotation) is applied to each. These two datasets are then loaded, with the training dataset being randomly shuffled every epoch.

PyTorch expects every image in a batch to have the same size. Because the transform resizes all images to 256 by 256 pixels, the training loader can use a batch size of `64`. The test loader uses a batch size of `1`, which also avoids size mismatches when the model is later evaluated on images that have not been resized.
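
To make the normalization and split arithmetic concrete, here is a minimal pure-Python sketch. The helper names `normalize` and `split_sizes` and the dataset size of 1,000 are hypothetical and chosen only for illustration; they are not part of `torchvision`.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Apply the per-pixel normalization formula: (pixel - mean) / std."""
    return (pixel - mean) / std


def split_sizes(total, train_fraction=0.8):
    """Return (train_size, test_size); the training size is truncated by int()."""
    train_size = int(train_fraction * total)
    return train_size, total - train_size


# Pixel values scaled to [0, 1] are mapped onto [-1, 1]
print(normalize(0.0), normalize(0.5), normalize(1.0))

# Hypothetical dataset of 1,000 images
print(split_sizes(1000))
```

Computing the test size as the remainder, rather than as `int(0.2 * total)`, guarantees the two sizes always add up to the full dataset.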

In [None]:
from torchvision import datasets
from torchvision.transforms import v2
from torch.utils.data import DataLoader, random_split

transform = v2.Compose([
    v2.Grayscale(num_output_channels=1),
    v2.RandomHorizontalFlip(0.5),
    v2.RandomRotation(30),
    v2.Resize((256, 256)),
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.5], std=[0.5])
])

dataset = datasets.ImageFolder(root="data", transform=transform)

train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size

train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

train_loader = DataLoader(train_dataset, batch_size=64, num_workers=4, shuffle=True, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False, pin_memory=True)

#### Identifying Dataset Biases
In the source images, there are fewer malignant tumor images than benign tumor images. Since the training and test datasets are randomly split fractions of the total ultrasound dataset, it's safe to assume that in the training data there are more instances of benign tumors than malignant. This cell is meant to verify that assumption.
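
As a complementary sketch, the same counts can in principle be read from label metadata without decoding any images. `ImageFolder` exposes a `targets` list and the subsets returned by `random_split` expose `indices`; the plain lists below are hypothetical stand-ins for `dataset.targets` and `train_dataset.indices`, so the snippet runs without the downloaded data.

```python
from collections import Counter

# Hypothetical stand-ins: in the notebook these would be dataset.targets
# and train_dataset.indices (assumption: both attributes are available).
targets = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # 0 = benign, 1 = malignant
train_indices = [0, 1, 3, 4, 6, 7, 8, 9]  # samples selected for training


def count_labels(targets, indices):
    """Count how many of the selected samples carry each label."""
    return Counter(targets[i] for i in indices)


counts = count_labels(targets, train_indices)
print(f"Benign Image Count: {counts[0]}")
print(f"Malignant Image Count: {counts[1]}")
```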

In [None]:
benign_count = 0
malignant_count = 0
for _, label in train_loader:
    benign_count += label[label == 0].size(0)
    malignant_count += label[label == 1].size(0)

print(f"Benign Image Count: {benign_count}")
print(f"Malignant Image Count: {malignant_count}")

#### Addressing Dataset Biases
Because there tend to be fewer examples of malignant tumors, it's harder for the neural network to learn to identify malignant tumors than benign ones. Regular cross-entropy and binary cross-entropy loss functions don't address this issue, but focal loss, a variant of cross-entropy, does. The focal loss formula looks like this:
$$
FL(p_t) = -\alpha_t(1-p_t)^\gamma\log(p_t)
$$

$p_t =$ Predicted probability of the true label $t$

$\alpha_t =$ Hyperparameter in $[0, 1]$ that weights the loss of each label so that the label with fewer training instances is not drowned out. In binary classification tasks $\alpha_t = \alpha$ if $p_t = p$ (the true label is $1$) and $\alpha_t = (1 - \alpha)$ if $p_t = (1 - p)$ (the true label is $0$)

$\gamma =$ Hyperparameter $\geq 0$ that scales down the loss of easily classified examples so that training focuses on harder ones

When adapted for binary classification tasks, the focal loss formula can look like the following:

$$
FL(p, y) = -\alpha y(1-p)^\gamma\log(p) - (1 - \alpha) (1-y)(p)^\gamma\log(1-p)
$$
$y =$ Whether the label is $1$ (malignant) or $0$ (benign)
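
As a small worked example of the binary formula, the pure-Python sketch below evaluates the loss for single predictions. The function names and the probabilities `0.9` and `0.1` are illustrative assumptions, not values taken from the model.

```python
import math


def binary_cross_entropy(p, y):
    """Plain binary cross-entropy for one predicted probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def binary_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p and label y."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true label
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class weight of the true label
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A malignant image (y = 1) predicted confidently right vs. confidently wrong
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.1, 1)
print(f"easy example: {easy:.6f}, hard example: {hard:.6f}")

# With gamma = 0 the modulating factor disappears and only the alpha weight remains
print(binary_focal_loss(0.9, 1, alpha=0.5, gamma=0.0), 0.5 * binary_cross_entropy(0.9, 1))
```

With $\gamma = 2$, the confident correct prediction is scaled by $(1 - 0.9)^2 = 0.01$, while the confident wrong one is scaled by $(1 - 0.1)^2 = 0.81$, so hard examples dominate the loss.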


In [None]:
class BinaryFocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0, reduction="mean"):
        super(BinaryFocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction

    def forward(self, inputs, targets):
        # Convert to float for Binary Cross Entropy
        targets = targets.float()
        cross_entropy_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
        # Sigmoid the inputs to convert them into probabilities
        p = torch.sigmoid(inputs)
        # Since targets is either 0 or 1, this returns the probability for each possible outcome as either targets or (1 - targets) is 0 in these equations
        p_t = p * targets + (1 - p) * (1 - targets)
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        # Actually apply the focal loss formula
        focal_loss = alpha_t * (1 - p_t) ** self.gamma * cross_entropy_loss

        if self.reduction == "mean":
            return focal_loss.mean()
        elif self.reduction == "sum":
            return focal_loss.sum()
        else:
            return focal_loss

#### Ensemble Part 1: Initialize ResNet18 Model

In [None]:
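# `pretrained=True` is deprecated in recent torchvision releases; request the weights explicitly
resnet_component = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# The stock ResNet18 stem expects 3-channel input, but the transform produces 1-channel grayscale images.
# Swap in a 1-channel first convolution (this layer starts untrained, unlike the rest of the backbone).
resnet_component.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Replace the 1000-class ImageNet head with a small head that outputs a single logit
resnet_component.fc = nn.Sequential(
    nn.Linear(resnet_component.fc.in_features, 128),
    nn.LeakyReLU(0.01),
    nn.Dropout(0.3),
    nn.Linear(128, 1)
)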

#### Initialize ResNet18 Model Loss, Optimizer, and Epoch Counts

In [None]:
num_epochs = 10
resnet_component = resnet_component.to(device)
loss_fn = BinaryFocalLoss(alpha=0.8, gamma=2)
optimizer = optim.AdamW(resnet_component.parameters(), lr=1e-3, weight_decay=1e-4)

#### Train the ResNet18 Component

In [None]:
for epoch in range(num_epochs):
    i = 0
    start_time = time.time()
    resnet_component.train()
    current_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = resnet_component(inputs)
        loss = loss_fn(outputs.view(-1), labels)

        loss.backward()
        optimizer.step()

        current_loss += loss.item()
        if i % 100 == 99 or i == len(train_loader) - 1:
            end_time = time.time()
            print(f"[Epoch: {epoch + 1}/{num_epochs}, Batch: {i + 1}/{len(train_loader)}] Loss: {current_loss:0.5f}, Time Elapsed: {end_time - start_time:0.5f}s")
            current_loss = 0.0
            start_time = end_time
        i += 1

print("Training Complete!")

#### Save ResNet18 Weights

In [None]:
torch.save(resnet_component.state_dict(), "resnet18_weights.pth")

#### Evaluate the ResNet18 Model
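
The metrics reported in this section follow the usual definitions for binary classification, treating malignant (`1`) as the positive class. The sketch below computes them in plain Python from hypothetical probability and label lists; the helper name `binary_metrics` and the sample values are assumptions made for illustration.

```python
def binary_metrics(probabilities, labels, threshold=0.3):
    """Return (accuracy, precision, recall, f1) with label 1 as the positive class."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    true_positive = sum(1 for pred, y in zip(predictions, labels) if pred == 1 and y == 1)
    predicted_positive = sum(predictions)
    actual_positive = sum(labels)
    correct = sum(1 for pred, y in zip(predictions, labels) if pred == y)

    accuracy = correct / len(labels)
    # Fall back to 0.0 when a denominator would be zero
    precision = true_positive / predicted_positive if predicted_positive else 0.0
    recall = true_positive / actual_positive if actual_positive else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Hypothetical sigmoid outputs and ground-truth labels
probabilities = [0.9, 0.4, 0.2, 0.35, 0.1]
labels = [1, 1, 1, 0, 0]

print(binary_metrics(probabilities, labels, threshold=0.3))
print(binary_metrics(probabilities, labels, threshold=0.5))
```

On this toy data, lowering the threshold from `0.5` to `0.3` raises recall at the cost of precision, a trade-off that is often preferred when missing a malignant tumor is costlier than a false alarm.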

In [None]:
threshold = 0.3

def evaluate_model(model, data_loader, threshold, data_name="dataset"):
    total = 0
    correct = 0
    total_positive = 0
    predicted_positive = 0
    predicted_positive_correct = 0
    with torch.no_grad():
        model.eval()
        for images, labels in data_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            # Flatten the logits from (N, 1) to (N,) so they line up with the labels
            predicted = torch.sigmoid(outputs.view(-1))
            predicted = (predicted > threshold).long()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            total_positive += (labels == 1).sum().item()
            predicted_positive += (predicted == 1).sum().item()
            # Use the elementwise `&`; Python's `and` fails on tensors with more than one element
            predicted_positive_correct += ((predicted == labels) & (predicted == 1)).sum().item()

    # Guard each ratio against division by zero (e.g. when nothing is predicted positive)
    accuracy = correct / total if total else 0.0
    precision = predicted_positive_correct / predicted_positive if predicted_positive else 0.0
    recall = predicted_positive_correct / total_positive if total_positive else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    print(f"{data_name.title()} Accuracy: {correct}/{total} => {accuracy:0.7f}")
    print(f"{data_name.title()} Precision: {predicted_positive_correct}/{predicted_positive} => {precision:0.7f}")
    print(f"{data_name.title()} Recall: {predicted_positive_correct}/{total_positive} => {recall:0.7f}")
    print(f"{data_name.title()} F1 Score: {f1:0.7f}")

evaluate_model(resnet_component, test_loader, threshold, "test data")

#### Load Cross-Validation Dataset

In [None]:
opendatasets.download("https://kaggle.com/datasets/sayedmeeralishah/breast-cancer-segmentation-dataset-preprocessed")
!mkdir "cross_validation"
!mv "breast-cancer-segmentation-dataset-preprocessed/Breast-canser_preprocessed dataset/benign" cross_validation
!mv "breast-cancer-segmentation-dataset-preprocessed/Breast-canser_preprocessed dataset/malignant" cross_validation
!find cross_validation -type f -name "*_mask.png" -delete

!rm -rf breast-cancer-segmentation-dataset-preprocessed

#### Evaluate Against Cross-Validation Data

In [None]:
## Change for ensemble cross-validation once ensemble is finalized
cross_validation_transform = v2.Compose([
    v2.Grayscale(num_output_channels=1),
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.5], std=[0.5])
])
cross_validation_dataset = datasets.ImageFolder(root="cross_validation", transform=cross_validation_transform)
cross_validation_loader = DataLoader(cross_validation_dataset, batch_size=1, shuffle=False)
# Evaluate the trained ResNet18 component until the ensemble is finalized
evaluate_model(resnet_component, cross_validation_loader, threshold, "cross validation")