<a href="https://colab.research.google.com/github/KedarPanchal/Breast-Cancer-Detector/blob/main/tumor_detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Python Version
This neural network runs on Python 3.12 to ensure compatibility with its dependencies. If you are running this notebook in a virtual environment, ensure you have the correct runtime selected by running the cell below.

In [None]:
!python --version

#### Install Dependencies
The following dependencies are installed for use in the notebook:
* **Torch:** The model is built using the PyTorch framework (this is also what limits the Python version to <= 3.12)
* **Torchvision:** Provides functions for loading and preparing datasets for PyTorch models
* **Opendatasets:** Downloads datasets from the Kaggle online repository
* **Scikit-learn:** Provides the `KFold` splitter used for cross-validation

In [None]:
%pip install torch
%pip install torchvision
%pip install opendatasets
%pip install scikit-learn

#### Download and Prepare Datasets for Use
> Prior to running this code block, ensure you have access to your Kaggle username and API key, as the download will prompt you to enter this information. Visit the Kaggle website for information on how to acquire an API key.

This neural network combines data from two datasets:
* The Breast Ultrasound Images (BUSI) Dataset (Al-Dhabyani W, Gomaa M, Khaled H, Fahmy A. Dataset of breast ultrasound images. Data in Brief. 2020 Feb;28:104863. DOI: 10.1016/j.dib.2019.104863.)
* Vuppala Adithya Sairam's Ultrasound Breast Images for Breast Cancer dataset, for which he has not provided a source other than the fact that it was aggregated from various open breast cancer ultrasound datasets

The BUSI dataset has an additional "normal" class of ultrasounds with no tumors, but these are deleted because the purpose of this model is to identify whether a detected tumor is malignant or benign. Both datasets have "benign" and "malignant" images, which are aggregated together. Sairam's dataset was already split into test and evaluation datasets, but these were combined because this notebook randomly splits the data later on.

In [None]:
import opendatasets
opendatasets.download("https://www.kaggle.com/datasets/aryashah2k/breast-ultrasound-images-dataset")

!mkdir data
!rm -rf breast-ultrasound-images-dataset/Dataset_BUSI_with_GT/normal
!mv breast-ultrasound-images-dataset/Dataset_BUSI_with_GT/* data
!rm -rf breast-ultrasound-images-dataset
!find data -type f -name "*_mask*.png" -delete

#### Import Necessary Dependencies
The following dependencies are imported:
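* `torch`: Contains various functions for identifying GPUs and manipulating Tensors
* `torch.nn`: Contains various prebuilt layers and classes to develop a deep learning network
* `torch.nn.functional`: Contains implementations of activation functions that add non-linearity to a deep learning network
* `torch.optim`: Contains optimizers used in model training
* `torch.optim.lr_scheduler`: Contains learning rate schedulers used during training
* `torchvision.models`: Contains pretrained model architectures
* `time`: Used in displaying metrics while training the model
* `copy`: Used to snapshot the model's initial weights
* `os`: Used to count the images in each class folder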

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler
from torchvision import models
import time
import copy
import os

#### Select Device for Training
The following cell selects the best available device for training, testing, and performing inferences with the AI model. If a CUDA GPU is available, all calculations will be performed on the GPU. If an M-series Mac is used, PyTorch's MPS backend is used. Otherwise, all calculations will be done on the CPU.

In [None]:
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available(): # Some operations used in this notebook may not be supported on MPS
    device = "mps"

print(f"Device: {device}")

#### Delete .DS_Store Files

In [None]:
!find . -name ".DS_Store" -print -delete

#### Initialize and Transform Datasets
The `torchvision` library is used to load and transform the data. The data is turned into a labeled dataset with the following labels:
* Images in `data/benign` will have a label `0`
* Images in `data/malignant` will have a label `1`

Images in the dataset are also transformed. They are converted to grayscale (ultrasounds are black and white anyway, so training on 3 color channels wastes computation), randomly augmented (horizontal flips, rotations of up to 20 degrees, autocontrast, sharpness adjustment, and histogram equalization), resized to `224x224`, converted to float Tensors scaled to `[0, 1]`, and normalized with a mean and standard deviation of `0.5`, which maps pixel values to `[-1, 1]`. The dataset is split later in the notebook using 5-fold cross-validation, so each fold trains on `80%` of the original dataset and tests on the remaining `20%`. The training subset is randomly reshuffled every epoch.

A batch size of `32` is used for training. This works because PyTorch expects batches to contain data of the same size, and every training image is resized to `224x224`. A batch size of `1` is used for evaluation, which avoids any errors regarding data size mismatches when images are not resized.
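
As a rough illustration of the normalization step described above, the sketch below applies the per-pixel formula `(x - mean) / std` in plain Python. The helper name `normalize_pixel` is hypothetical and exists only for this example; the notebook itself relies on torchvision's `Normalize` transform.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Apply (value - mean) / std to one pixel already scaled to [0, 1]."""
    return (value - mean) / std


# Black, mid-gray, and white pixels after scaling to [0, 1]
for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalize_pixel(pixel))
```

With a mean and standard deviation of `0.5`, inputs in `[0, 1]` end up in `[-1, 1]`, centered on zero.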

In [None]:
from torchvision import datasets
from torchvision.transforms import v2

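transform = v2.Compose([
    v2.Grayscale(num_output_channels=1),
    # Random augmentations, applied before the image is converted to a tensor
    v2.RandomHorizontalFlip(p=0.5),
    v2.RandomRotation(degrees=20),
    v2.RandomAutocontrast(p=0.3),
    v2.RandomAdjustSharpness(sharpness_factor=0.3),  # first argument is the sharpness factor, not a probability
    v2.RandomEqualize(p=0.2),

    v2.Resize((224, 224)),

    # Convert to a float32 tensor scaled to [0, 1], then normalize to [-1, 1]
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.5], std=[0.5])
])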

dataset = datasets.ImageFolder(root="data", transform=transform)
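
`ImageFolder` derives labels from the class folder names in sorted order, which is why `benign` maps to `0` and `malignant` maps to `1`. The sketch below mimics that mapping in plain Python; `class_to_index` is a hypothetical helper written for this example, not part of torchvision.

```python
def class_to_index(folder_names):
    """Map class folder names to integer labels in sorted order."""
    return {name: index for index, name in enumerate(sorted(folder_names))}


# The order in which the folders are listed does not matter
print(class_to_index(["malignant", "benign"]))
```

On the real dataset, the same mapping can be inspected through `dataset.class_to_idx`.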

#### Define Training Function

In [None]:
def train_model(model, data_loader, optimizer, loss_fn, scheduler, current_fold, num_epochs=20, device=device):
    model.train()
    for epoch in range(num_epochs):
        i = 0
        start_time = time.time()
        current_loss = 0.0

        for inputs, labels in data_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs.view(-1), labels.float())

            loss.backward()
            optimizer.step()
            scheduler.step()

            current_loss += loss.item()
            if i % 10 == 9 or i == len(data_loader) - 1:
                end_time = time.time()
                print(f"[Fold: {current_fold + 1}, Epoch: {epoch + 1}/{num_epochs}, Batch: {i + 1}/{len(data_loader)}] Loss: {current_loss:0.5f}, Time Elapsed: {end_time - start_time:0.5f}s")
                current_loss = 0.
                start_time = end_time
            i += 1


    print("Training Complete!")

#### Define Evaluation Function

In [None]:
def evaluate_model(model, data_loader, threshold, device=device):
    total = 0
    correct = 0
    total_positive = 0
    predicted_positive = 0
    predicted_positive_correct = 0
    model.eval()
    with torch.no_grad():
        for images, labels in data_loader:
            images, labels = images.to(device), labels.to(device)
            # Flatten the (batch, 1) logits to (batch,) so they line up with the labels
            outputs = model(images).view(-1)
            predicted = (torch.sigmoid(outputs) > threshold).long()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            total_positive += (labels == 1).sum().item()
            predicted_positive += (predicted == 1).sum().item()
            # Element-wise AND; Python's `and` fails on tensors with more than one element
            predicted_positive_correct += ((predicted == labels) & (predicted == 1)).sum().item()

    # Guard against division by zero when a class is never predicted or never present
    accuracy = correct / total if total else 0.0
    precision = predicted_positive_correct / predicted_positive if predicted_positive else 0.0
    recall = predicted_positive_correct / total_positive if total_positive else 0.0
    f1_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return (accuracy, precision, recall, f1_score)
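
To make the four returned metrics concrete, here is a minimal sketch that computes them from confusion-matrix counts. The function `binary_metrics` and the counts passed to it are invented for illustration and are not results from this model.

```python
def binary_metrics(true_pos, false_pos, false_neg, true_neg):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = true_pos + false_pos + false_neg + true_neg
    accuracy = (true_pos + true_neg) / total if total else 0.0
    # Precision: of everything predicted malignant, how much really was
    predicted_pos = true_pos + false_pos
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    # Recall: of everything truly malignant, how much was caught
    actual_pos = true_pos + false_neg
    recall = true_pos / actual_pos if actual_pos else 0.0
    # F1: harmonic mean of precision and recall
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives, 6 true negatives
accuracy, precision, recall, f1 = binary_metrics(8, 2, 4, 6)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For a tumor classifier, recall on the malignant class is usually the number to watch, since a false negative is a missed cancer.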


#### Initialize EfficientNet-B1 Model

In [None]:
cancer_net = models.efficientnet_b1(weights=models.EfficientNet_B1_Weights.DEFAULT)
cancer_net.features[0] = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.SiLU(inplace=True)
)
cancer_net.classifier = nn.Sequential(
    nn.Dropout(0.5),
    nn.Linear(cancer_net.classifier[1].in_features, 512),
    nn.SiLU(inplace=True),
    nn.Dropout(0.5),
    nn.Linear(512, 128),
    nn.SiLU(inplace=True),
    nn.Dropout(0.5),
    nn.Linear(128, 1)    
)


#### Initialize EfficientNet-B1 Model Loss and State Dict

In [None]:
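num_epochs = 10
cancer_net = cancer_net.to(device)
# Weight the malignant class by 0.8 * (benign count / malignant count)
num_benign = len(os.listdir("./data/benign"))
num_malignant = len(os.listdir("./data/malignant"))
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([0.8 * num_benign / num_malignant], device=device))
# Snapshot the initial weights so every fold starts from the same state
state_dict = copy.deepcopy(cancer_net.state_dict())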
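
The `pos_weight` passed to the loss above is `0.8` times the ratio of benign to malignant images. The sketch below reproduces that arithmetic in plain Python; `positive_class_weight` is a hypothetical helper, and the image counts are placeholders rather than the real dataset sizes.

```python
def positive_class_weight(num_benign, num_malignant, scale=0.8):
    """Scaled ratio of negative (benign) to positive (malignant) examples."""
    return scale * num_benign / num_malignant


# Placeholder counts: twice as many benign images as malignant ones
weight = positive_class_weight(num_benign=400, num_malignant=200)
print(f"pos_weight = {weight:.2f}")
```

In `BCEWithLogitsLoss`, `pos_weight` multiplies the loss term for positive (malignant) labels, so a value above `1` penalizes missed malignant cases more heavily.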

#### Train the EfficientNet-B1 Model Using K-Fold Cross Validation

In [None]:
from sklearn.model_selection import KFold
from torch.utils.data import DataLoader, SubsetRandomSampler

folds = 5
batch_size = 32

k_fold = KFold(n_splits=folds, shuffle=True)
for fold, (train_i, test_i) in enumerate(k_fold.split(dataset)):
    cancer_net.load_state_dict(state_dict)
    optimizer = optim.AdamW(cancer_net.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = lr_scheduler.CyclicLR(optimizer, base_lr=3e-4, max_lr=1e-3, step_size_up=2, mode="exp_range")
    train_loader = DataLoader(dataset=dataset, batch_size=batch_size, sampler=SubsetRandomSampler(train_i))
    test_loader = DataLoader(dataset=dataset, batch_size=1, sampler=SubsetRandomSampler(test_i))

    train_model(cancer_net, train_loader, optimizer, loss_fn, scheduler, current_fold=fold, num_epochs=num_epochs)
    accuracy, precision, recall, f1_score = evaluate_model(cancer_net, data_loader=test_loader, threshold=0.3)
    print(f"Test Accuracy: {accuracy:0.7f}")
    print(f"Test Precision: {precision:0.7f}")
    print(f"Test Recall: {recall:0.7f}")
    print(f"Test F1 Score: {f1_score:0.7f}")
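
For readers unfamiliar with K-fold splitting, the sketch below shows the general idea in plain Python: shuffle the indices once, cut them into `k` folds, and let each fold serve as the test set exactly once. The generator `k_fold_indices` is a simplified stand-in written for this example, not scikit-learn's implementation.

```python
import random


def k_fold_indices(num_samples, num_folds, seed=None):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one
    base, extra = divmod(num_samples, num_folds)
    start = 0
    for fold in range(num_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# With 5 folds, each fold trains on 80% of the samples and tests on 20%
for fold, (train, test) in enumerate(k_fold_indices(10, 5, seed=0)):
    print(f"Fold {fold + 1}: train={len(train)} test={len(test)}")
```

Because every sample lands in exactly one test fold, the per-fold metrics together cover the whole dataset.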

#### Save EfficientNet-B1 Weights

In [None]:
# Note: despite the file name, these are the weights of the EfficientNet-B1 model defined above
torch.save(cancer_net.state_dict(), "shufflenet_weights.pth")

#### Load Cross-Validation Dataset

In [None]:
import opendatasets
opendatasets.download("https://kaggle.com/datasets/fhabibimoghaddam/breast-ultrasound-images")
!mkdir "cross_validation"
!mv "breast-ultrasound-images/Breast Ultrasound Images Dataset/benign" cross_validation
!mv "breast-ultrasound-images/Breast Ultrasound Images Dataset/malignant" cross_validation

!rm -rf breast-ultrasound-images

#### Evaluate Against Cross-Validation Data

In [None]:
## Change for ensemble cross-validation once ensemble is finalized
cross_validation_transform = v2.Compose([
    v2.Grayscale(num_output_channels=1),
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.5], std=[0.5])
])
cross_validation_dataset = datasets.ImageFolder(root="cross_validation", transform=cross_validation_transform)
cross_validation_loader = DataLoader(cross_validation_dataset, batch_size=1, shuffle=False)
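threshold = 0.3
# evaluate_model takes no data_name argument, so the results are labeled when printed instead
accuracy, precision, recall, f1_score = evaluate_model(cancer_net, cross_validation_loader, threshold)
print(f"Cross-Validation Accuracy: {accuracy:0.7f}")
print(f"Cross-Validation Precision: {precision:0.7f}")
print(f"Cross-Validation Recall: {recall:0.7f}")
print(f"Cross-Validation F1 Score: {f1_score:0.7f}")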
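
`evaluate_model` passes the raw model output through a sigmoid and labels an image malignant when the result exceeds `threshold`. As a small worked example, the sketch below converts a probability threshold into the equivalent cutoff on the raw output (logit). Both helper functions are hypothetical and use only the standard library.

```python
import math


def sigmoid(logit):
    """Squash a raw model output into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))


def logit_cutoff(threshold):
    """Raw output value at which sigmoid(output) equals the threshold."""
    return math.log(threshold / (1.0 - threshold))


cutoff = logit_cutoff(0.3)
print(f"threshold 0.3 corresponds to a logit cutoff of {cutoff:.4f}")
print(f"sigmoid(cutoff) = {sigmoid(cutoff):.4f}")
```

A threshold below `0.5` gives a negative cutoff, so more images are flagged as malignant. This trades some precision for higher recall.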