# COGS 118A - Final Project

## Group members
- Michael Nodini
- Alex Cagle
- Saransh Malik
- Arthur Hewig
- Maryam Usman

# Abstract 
Our goal for this project is to develop a deep learning / computer vision model that identifies breast cancer in pathology images from the PatchCamelyon (PCam) dataset using binary classification. The PatchCamelyon dataset consists of 327,680 color images of lymph node samples, each 96x96 pixels. Each image is labeled with one of two classes, indicating the presence or absence of metastatic (cancerous) tissue. We will first pre-process the data by binarizing the pixels to reduce noise. Then, we will train the last layer of a convolutional neural network on this data so that it assigns each image a binary label (whether cancerous tissue is detected). Success will be measured using sensitivity and specificity. Our main goal is to reduce the false negative rate (by maximizing sensitivity), since a false negative (a missed cancer) is far more dangerous than a false positive in this scenario.

# Background

The article Artificial intelligence and computational pathology [Nature], published in 2021 by researchers Miao Cui and David Y. Zhang, discusses the immense potential that artificial intelligence has in the field of clinical pathology. Computational pathology uses technology to improve patient care in pathology and lab medicine. It combines different types of medical data, like images and patient information, to make better diagnoses and treatment plans. However, there are many challenges to overcome, like integrating the data and making sure the technology is used ethically. By applying modern computational methods such as genomics, bioinformatics, and machine learning to pathology, computational pathology has the potential to “create personalized diagnosis and treatment plans for patients,” “improve clinical workflow efficiency, diagnostic quality,” and offer a “better-integrated solution to whole-slide images, multi-omics data, and clinical informatics.” Most importantly, computational pathology can “reduce errors in diagnosis and classification.” The article states that machine learning algorithms evaluated in CAMELYON16, a challenge that measures how well algorithms can detect cancer, “[have] achieved encouraging results with a 92.4% sensitivity in tumor detection rate. In contrast, a pathologist could only achieve 73.2% sensitivity.” Artificial intelligence is useful for detecting cancer in pathology slide images because it can be trained to find patterns in data and can autonomously learn to solve novel problems. Currently, machine learning methods are being used to assist pathologic diagnosis by looking at cancer cells, cell nuclei, cell divisions, ducts, blood vessels, and more. Artificial neural networks (ANNs), the basis of deep learning, are loosely modeled on the brain: they are composed of nodes (i.e. neurons), which make up an input layer, hidden layers, and an output layer. Convolutional neural networks (CNNs) are particularly adept at handling image classification tasks.

The article Histopathological Cancer Detection with Deep Neural Networks [TowardsDataScience] discusses the use of deep neural networks for histopathological cancer detection. It explains how these algorithms can analyze tissue samples to identify cancer cells with high accuracy, and it highlights their potential to improve cancer diagnosis and treatment through automated, precise detection methods.

The article The detection of cancer cells in histopathology based on machine vision [ScienceDirect-1], published July 2022 in Computers in Biology and Medicine, discusses why breast cancer detection is so important and how machine vision can improve on traditional breast cancer detection. Breast cancer is an incredibly pressing issue facing the modern world. It is estimated that 1 in 8 women will develop breast cancer at some point, and by some metrics, it has overtaken lung cancer as the leading cause of death among all cancers worldwide. Furthermore, discovering it in its early stages raises the possibility of a successful cure to 80%. Traditional cancer detection through histopathological images is time consuming, difficult, hard to perform on large sets of data, and inconsistent from pathologist to pathologist. That’s why we are turning to machine image classification: it has the potential to be more accurate, more consistent, and much more efficient than individual pathologists could ever be.

The article Current applications and challenges of artificial intelligence in pathology [ScienceDirect-2] explores the role of AI and ML in computational pathology. It discusses how AI algorithms can analyze medical images such as histopathology slides to assist in diagnosing diseases. The article emphasizes the potential of AI in improving accuracy, efficiency, and personalized medicine in pathology, while acknowledging the challenges and ethical considerations associated with its implementation. 

[Nature]: https://www.nature.com/articles/s41374-020-00514-0
[TowardsDataScience]: https://towardsdatascience.com/histopathological-cancer-detection-with-deep-neural-networks-3399be879671 
[ScienceDirect-1]: https://www.sciencedirect.com/science/article/pii/S0010482522004280 
[ScienceDirect-2]: https://www.sciencedirect.com/science/article/pii/S2772736X22000081 

# Problem Statement

The problem we are facing is a binary classification problem: determining whether a 96x96 RGB image shows breast cancer or not. Traditionally, this decision is made by a pathologist drawing on their expertise and training; we hope to train a convolutional neural network to make the decision instead.

# Data

We are using the PatchCamelyon ([PCam](https://patchcamelyon.grand-challenge.org/)) dataset, which was created by Bas Veeling and is used for training and benchmarking machine learning models. It contains 327,680 color images (96x96) of pathology slides with a binary classification that denotes the presence (labeled 1) or absence of cancer (labeled 0). The dataset is in the form of a .h5 file and consists of arrays with the corresponding pixel values (from 0 to 255) for each image.

To load the data, we first needed to unzip the files, since they came in .gz format. Then, we read the .h5 files using their corresponding keys. The x_train, x_test, and x_valid files and the y_train, y_test, and y_valid files were each loaded as numpy arrays. Files with the prefix “x” contain the pixel values for each image, and files with the prefix “y” contain the binary label for each image. After loading the files into memory, we used a dataloader to pass the data into our model. We ran into problems on Datahub when trying to load all the files at once because we exceeded the storage limit, likely because the gzipped and unzipped x_train files together take up 13.5 GB. To work around this, we unzipped the data locally and then uploaded the .h5 files to Datahub.
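
A minimal sketch of these loading steps, under stated assumptions: the helper names `gunzip` and `load_h5_array` are hypothetical, and the reading step assumes the third-party `h5py` package is installed. Decompressing in chunks keeps the whole file from being held in memory at once.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dst=None, chunk_size=1 << 20):
    """Stream-decompress a .gz file to disk and return the output path."""
    src = Path(src)
    # By default, "name.h5.gz" is written next to the source as "name.h5".
    dst = Path(dst) if dst is not None else src.with_suffix("")
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out, chunk_size)
    return dst


def load_h5_array(path, key):
    """Read one dataset from an .h5 file into a numpy array.

    Assumption: `key` is whatever dataset name the file actually uses.
    """
    import h5py  # third-party; imported here so gunzip() works without it

    with h5py.File(path, "r") as f:
        return f[key][:]
```

For example, `x_train = load_h5_array(gunzip("x_train.h5.gz"), key="x")`, where both the file name and the key `"x"` are placeholders for whatever the downloaded files actually use.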

# Proposed Solution

Our solution to this problem will be to train a convolutional neural network on the breast cancer image dataset to determine whether a tumor exists in the given image or not. Convolutional neural networks are typically applied to image classification tasks because they learn features from the training images in service of the given task (here, binary classification). This means we won’t have to manually derive which pixels or groups of pixels in the image are indicative of the positive or negative class; the model will learn this through gradient descent. We plan on building our own CNN model using PyTorch / PyTorch Lightning, as well as loading a pre-trained ResNet50 model and retraining the last layer to map to our labels. As our benchmark, we plan on using a naive majority-class baseline, that is, a model that just predicts the most popular class every time. Our solution will be tested by splitting our dataset into training, validation, and test sets. The training and validation sets will be used during model training, and the test set will be used to report final evaluation metrics.
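
The benchmark that always predicts the most popular class could be sketched as below. This is an illustration only: the class name `MajorityClassBaseline` and the `accuracy` helper are hypothetical, and the labels are assumed to be a flat sequence of 0/1 values.

```python
from collections import Counter


class MajorityClassBaseline:
    """Predicts the most frequent training label for every input."""

    def fit(self, labels):
        # most_common(1) returns [(label, count)] for the top label.
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_samples):
        return [self.majority_] * n_samples


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

A baseline like this never predicts the minority class, so if the majority class is "no cancer" its sensitivity is zero. Any trained model has to beat it to be useful.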

# Evaluation Metrics

One evaluation metric we plan on using is ROC-AUC, which measures performance across all classification thresholds. Breast cancer diagnosis is not something to be taken lightly, so we want to choose a classification threshold that balances the true positive rate and false positive rate appropriately. ROC-AUC also lets us compare different models directly using a single scalar value. In addition, we will report the F-score and accuracy of our models to cover all aspects of true and false diagnoses.
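
To make these metrics concrete, here is a small pure-Python sketch. The function names are hypothetical, and labels are assumed to be 0/1 with 1 meaning cancer is present. ROC-AUC is computed from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. In practice a library routine such as scikit-learn's `roc_auc_score` would likely be used instead, since the pairwise loop is quadratic in the number of examples.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = cancer present."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def summary_metrics(y_true, y_pred):
    """Sensitivity, specificity, F1 and accuracy from hard 0/1 predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "accuracy": (tp + tn) / len(y_true),
    }


def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that `summary_metrics` depends on the chosen threshold (it takes hard predictions), while `roc_auc` takes raw scores and is threshold-free, which is why the two are reported together.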

# Results

## Initial Tests

Starting with transfer learning, we imported many ImageNet pre-trained models from the PyTorch library. For each, we replaced the last layer with a new fully connected layer, froze all other layers, and trained only that single layer. We tried different image sizes, batch sizes, optimisers, etc. These models yielded a wide variety of results, but none cracked the 80% test accuracy boundary, and all seemed to stagnate after 5 or so epochs.
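
The freeze-and-replace step can be sketched as follows. This is an assumption-laden illustration rather than the exact training code: the helper names are hypothetical, and the function assumes the model exposes its final layer as an attribute called `fc`, as torchvision's ResNet models do. Other architectures name their head differently.

```python
import torch
from torch import nn


def freeze_all_but_new_head(model: nn.Module, num_outputs: int = 1) -> nn.Module:
    """Freeze every existing parameter, then attach a fresh trainable head."""
    for param in model.parameters():
        param.requires_grad = False
    # Assumption: the final layer is an nn.Linear stored as `model.fc`.
    in_features = model.fc.in_features
    # Newly created layers have requires_grad=True by default.
    model.fc = nn.Linear(in_features, num_outputs)
    return model


def trainable_parameters(model: nn.Module):
    """Parameters to hand to the optimizer (only the new head)."""
    return [p for p in model.parameters() if p.requires_grad]
```

With torchvision installed, this might be called as `freeze_all_but_new_head(torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT))`, after which only `trainable_parameters(model)` would be passed to the optimizer.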

<img src="grid_search_1.png"  width="500">

Models such as ResNet50 are optimised for images of size 224x224 or larger, so they did not perform well on this dataset's 96x96 images (of which only the central 32x32 region determines the label). [Geert Litjens](https://geertlitjens.nl/post/getting-started-with-camelyon/), one of the publishers of this dataset, recommends a network with 6 convolutional layers, 3 dense layers, and 2 dropout layers: a relatively shallow architecture suited to small images. Following this strategy without using any transfer learning, we started getting relatively high training accuracies, breaking the 80% boundary. The code for this model is provided below this writeup.
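
To make the layer counts concrete, here is a hedged sketch of what a network with that shape could look like in PyTorch. It is not the model code referenced above: the class name `SmallPatchCNN`, the channel widths, kernel sizes, pooling layout, and hidden-layer sizes are all assumptions chosen for illustration. Only the counts (6 convolutional, 3 dense, 2 dropout layers) and the 96x96 input size come from the text.

```python
import torch
from torch import nn


def conv_block(c_in, c_out):
    """Two 3x3 convolutions followed by 2x2 max pooling (halves H and W)."""
    return [
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
    ]


class SmallPatchCNN(nn.Module):
    """Illustrative shallow CNN: 6 conv, 3 dense, 2 dropout layers."""

    def __init__(self, in_channels=3, dropout=0.4):
        super().__init__()
        # Spatial size: 96 -> 48 -> 24 -> 12 after the three pooling steps.
        self.features = nn.Sequential(
            *conv_block(in_channels, 16),
            *conv_block(16, 32),
            *conv_block(32, 64),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, 1),  # single logit: cancer present vs. absent
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

A single output logit pairs naturally with `nn.BCEWithLogitsLoss`; a two-logit head with cross-entropy would be an equally reasonable choice. Passing `in_channels=1` would adapt the sketch to grayscale input.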

After trying over 50 models and parameter combinations over 3 weeks, we started fine tuning our model.

## Fine Tuning

Using this custom-built model, we experimented with a wide range of factors. First, we found that small batch sizes were preferable to large ones for this model. Although large batches converged to a better minimum on the training data, they led to significant overfitting. In line with [research](https://arxiv.org/abs/1609.04836) recommending the use of small batch sizes, we decided to use batches of 32-64 images instead of 256-512.

### Model 1

**Stochastic Gradient Descent** _(10 epochs, 5e-3 rate, 0.9 momentum, 0.4 dropout (2 dropout layers), 64 image batch size)_: 83.32% test accuracy

<img src="model1.png" width="300">

As the plot shows, however, the validation accuracy does not improve significantly after 3 epochs, which indicates some overfitting on the training data. Although 83% is a good accuracy, the model likely won't generalise well to similar images from other datasets, since it has probably overfit to this dataset's colours, level of detail, etc. We decided to try more techniques to deal with this.

### Model 2

**Stochastic Gradient Descent** _(10 epochs, 1e-3 rate, 0.9 momentum, 0.3 dropout (3 dropout layers), 64 image batch size, random flips, grayscale)_

<img src="model2.png" width="300">

Here, to deal with the overfitting, we introduced a third dropout layer (between the convolutional layers and the first dense layer) and added data augmentation. The first augmentation is random vertical and horizontal flipping of the training images, which effectively quadruples the "amount" of training data and so reduces overfitting. The second is converting the images to grayscale to reduce overfitting on colour saturation, since some [experts](https://cs230.stanford.edu/projects_winter_2019/reports/15813329.pdf) claim colour holds no meaning in cancer detection. This test showed that, with SGD, grayscale does not significantly reduce test accuracy while it does reduce overfitting (as seen over the first 10 epochs).
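
The two augmentations can be illustrated with plain NumPy. This is a sketch under assumptions, not the pipeline used for training: the function names are hypothetical, images are assumed to be height x width x channel arrays, each flip is applied with probability 0.5, and the grayscale conversion uses the common ITU-R BT.601 luminance weights.

```python
import numpy as np


def random_flip(image, rng):
    """Independently apply a horizontal and a vertical flip, each with p=0.5.

    The two coin tosses give four equally likely orientations of each image.
    """
    if rng.random() < 0.5:
        image = image[:, ::-1]  # horizontal flip: reverse the width axis
    if rng.random() < 0.5:
        image = image[::-1, :]  # vertical flip: reverse the height axis
    return image


def to_grayscale(image):
    """Collapse an H x W x 3 RGB image to H x W using luminance weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return image[..., :3].astype(np.float32) @ weights
```

In a PyTorch data pipeline, the equivalent would probably be torchvision's `RandomHorizontalFlip`, `RandomVerticalFlip`, and `Grayscale` transforms, with the flips applied to training data only.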

### Model 3

**Stochastic Gradient Descent** _(30 epochs, 1e-3 rate, 0.92 momentum, 0.35 dropout (3 dropout layers), 32 image batch size, random flips, grayscale, scheduler (2 patience, 0.4 factor) on validation accuracy)_: 84.87% test accuracy

<img src="model4.png" width="300">

At almost 85% test accuracy, this is our most accurate model at identifying cancerous patches, and it does not significantly overfit the training data. It trained in about 1 hour on a 1080Ti, so it is both quick to train and accurate on unseen data.
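
The optimizer and scheduler settings listed for Model 3 could be wired up roughly like this. The helper names are hypothetical, and the sketch assumes the scheduler in question is PyTorch's `ReduceLROnPlateau`, which matches the "patience" and "factor" terminology but is not confirmed by the text. `mode="max"` is used because the monitored quantity is validation accuracy, where higher is better.

```python
import torch
from torch import nn


def build_optimizer_and_scheduler(model: nn.Module):
    """SGD with momentum plus a reduce-on-plateau learning rate schedule."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.92)
    # Multiply the learning rate by 0.4 once validation accuracy has failed
    # to improve for more than 2 consecutive epochs.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.4, patience=2
    )
    return optimizer, scheduler


def current_lr(optimizer):
    """Learning rate of the first (here, only) parameter group."""
    return optimizer.param_groups[0]["lr"]
```

At the end of each epoch, the training loop would call `scheduler.step(val_accuracy)` so the scheduler can track whether the metric has plateaued.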

