## Predicting breast cancer from digitized images of breast mass

by Tiffany A. Timbers & Melissa Lee
2023/11/09

In [296]:
import numpy as np
import pandas as pd
import requests
import os
import zipfile
import altair as alt
from sklearn import set_config
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer

# Summary

Here we attempt to build a classification model using the k-nearest neighbours algorithm which can use breast cancer tumour image measurements to predict whether a newly discovered breast cancer tumour is benign (i.e., is not harmful and does not require treatment) or malignant (i.e., is harmful and requires treatment intervention). Our final classifier performed fairly well on an unseen test data set, with a Cohen’s Kappa score of 0.9 and an overall accuracy of 0.97. Of the 142 test data cases, it correctly predicted 138. However, it incorrectly predicted 4 cases, and importantly these cases were false negatives: predicting that a tumour is benign when in fact it is malignant. These kinds of incorrect predictions could have a severely negative impact on a patient's health outcome, so we recommend continued study to improve this prediction model before it is put into production in the clinic.


# Introduction

Women have a 12.1% lifetime probability of developing breast cancer, and although cancer treatment has improved over the last 30 years, the projected death rate for women's breast cancer is 22.4 deaths per 100,000 in 2019 (Canadian Cancer Statistics Advisory Committee 2019). Early detection has been shown to improve outcomes (Canadian Cancer Statistics Advisory Committee 2019), and thus methods, assays and technologies that help to improve diagnosis may be beneficial for improving outcomes further. 

Here we ask if we can use a machine learning algorithm to predict whether a newly discovered tumour is benign or malignant given tumour image measurements. Answering this question is important because traditional methods for tumour diagnosis are quite subjective and can depend on the diagnosing physician's skill as well as experience (Street, Wolberg, and Mangasarian 1993). Furthermore, benign tumours are not normally dangerous; the cells stay in the same place and the tumour stops growing before it gets very large. By contrast, in malignant tumours, the cells invade the surrounding tissue and spread into nearby organs where they can cause serious damage. Thus, if a machine learning algorithm can accurately and effectively predict whether a newly discovered tumour is benign or malignant given tumour image measurements, this could lead to less subjective and more scalable breast cancer tumour diagnosis, which could contribute to better patient outcomes.

# Methods

## Data
The data set used in this project is of digitized breast cancer image features created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin, Madison (Street, Wolberg, and Mangasarian 1993). It was sourced from the UCI Machine Learning Repository (Dua and Graff 2017) and can be found [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)), specifically [this file](http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data). Each row in the data set represents summary statistics from measurements of an image of a tumour sample, including the diagnosis (benign or malignant) and several other measurements (e.g., nucleus texture, perimeter, area, etc.). Diagnosis for each image was conducted by physicians.

## Analysis
The k-nearest neighbours (k-nn) algorithm was used to build a classification model to predict whether a tumour mass was benign or malignant (found in the class column of the data set). All variables included in the original data set, with the exception of the standard error of fractal dimension, smoothness, symmetry and texture, were used to fit the model. The hyperparameter $K$ was chosen using 10-fold cross validation with the F-beta score (beta = 2, with malignant as the positive class) as the classification metric. The R and Python programming languages (R Core Team 2019; Van Rossum and Drake 2009) and the following R and Python packages were used to perform the analysis: caret (Jed Wing et al. 2019), docopt (de Jonge 2018), feather (Wickham 2019), knitr (Xie 2014), tidyverse (Wickham 2017), docopt (Keleshev 2014), os (Van Rossum and Drake 2009), feather (McKinney 2019), and Pandas (McKinney 2010). The code used to perform the analysis and create this report can be found here: https://github.com/ttimbers/breast_cancer_predictor_py.


# Results & Discussion

To look at whether each of the predictors might be useful to predict the tumour class, we plotted the distributions of each predictor from the training data set and coloured the distribution by class (benign: blue and malignant: orange). In doing this we see that the class distributions for all of the mean and max predictors overlap somewhat, but do show quite a difference in their centres and spreads. This is less so for the standard error (se) predictors. In particular, the standard errors of fractal dimension, smoothness, symmetry and texture look very similar in both the distribution centre and spread. Thus, we chose to omit these from our model.

In [297]:
# download data as zip and extract
url = "https://archive.ics.uci.edu/static/public/15/breast+cancer+wisconsin+original.zip"
zip_path = "../data/raw/breast+cancer+wisconsin+original.zip"

# make sure the target directory exists before writing to it
os.makedirs("../data/raw", exist_ok=True)

request = requests.get(url)
request.raise_for_status()  # stop early if the download failed
with open(zip_path, 'wb') as f:
    f.write(request.content)

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall("../data/raw")

In [298]:
# pre-process data (e.g., scale and split into train & test)
# read in data
colnames = ["id",
            "class",
            "mean_radius",
            "mean_texture",
            "mean_perimeter", 
            "mean_area",
            "mean_smoothness",
            "mean_compactness",
            "mean_concavity",
            "mean_concave_points",
            "mean_symmetry",
            "mean_fractal_dimension",
            "se_radius",
            "se_texture",
            "se_perimeter", 
            "se_area",
            "se_smoothness",
            "se_compactness",
            "se_concavity",
            "se_concave_points",
            "se_symmetry",
            "se_fractal_dimension",
            "max_radius",
            "max_texture",
            "max_perimeter", 
            "max_area",
            "max_smoothness",
            "max_compactness",
            "max_concavity",
            "max_concave_points",
            "max_symmetry",
            "max_fractal_dimension"]

cancer = pd.read_csv("../data/raw/wdbc.data", names=colnames, header=None)
cancer = cancer.drop(['id'], axis=1)


In [299]:
np.random.seed(522)
set_config(transform_output="pandas")

# re-label Class 'M' as 'Malignant', and Class 'B' as 'Benign'
cancer['class'] = cancer['class'].replace({
    'M' : 'Malignant',
    'B' : 'Benign'
})

# create the split
cancer_train, cancer_test = train_test_split(
    cancer, train_size=0.75, stratify=cancer["class"]
)

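# make sure the output directory exists before saving the splits
os.makedirs("../data/processed", exist_ok=True)
cancer_train.to_csv("../data/processed/cancer_train.csv")
cancer_test.to_csv("../data/processed/cancer_test.csv")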

In [300]:
cancer_preprocessor = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include='number')),
    remainder='passthrough',
    verbose_feature_names_out=False
)

cancer_preprocessor.fit(cancer_train)
scaled_cancer_train = cancer_preprocessor.transform(cancer_train)
scaled_cancer_test = cancer_preprocessor.transform(cancer_test)

scaled_cancer_train.to_csv("../data/processed/scaled_cancer_train.csv")
scaled_cancer_test.to_csv("../data/processed/scaled_cancer_test.csv")

In [301]:
# gather for plotting via facets 
# and make columns names nicer for plotting
cancer_train_melted = scaled_cancer_train.melt(
    id_vars=['class'],
    #value_vars=['mean_radius', 'mean_texture', 'se_concavity'],
    var_name='predictor', value_name='value'
)

cancer_train_melted['predictor'] = cancer_train_melted['predictor'].str.replace('_',' ')

In [302]:
# exploratory data analysis - visualize predictor distributions across classes
alt.data_transformers.enable('vegafusion')

alt.Chart(cancer_train_melted).transform_density(
    'value',
    groupby=['class', 'predictor']
).mark_area(opacity=0.7).encode(
    x="value:Q",
    y=alt.Y('density:Q').stack(False),
    color='class:N'
).facet(
    'predictor:N',
    columns=4
).resolve_axis(
    x='independent',
    y='independent'
).resolve_scale(
    x='independent', 
    y='independent'
)

Figure 1. Comparison of the empirical distributions of training data predictors between benign and malignant tumour masses.

In [305]:
# drop se_smoothness, se_symmetry, se_texture, se_fractal_dimension

cancer_train = cancer_train.drop(columns=["se_smoothness", "se_symmetry", "se_texture", "se_fractal_dimension"])

We chose to use a simple classification model using the k-nearest neighbours algorithm. To find the model that best predicted whether a tumour was benign or malignant, we performed 10-fold cross validation using the F-beta score (beta = 2, with malignant as the positive class) as our metric of model prediction performance to select K (number of nearest neighbours). Among the values of K we tried, K = 1 had the highest mean cross-validation score (0.914), closely followed by K = 10 (0.913).
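
The grid search in this report scores each candidate K with `fbeta_score` using `beta=2` and `pos_label='Malignant'`. As a rough illustration of why that choice suits this problem, the sketch below computes the F-beta score by hand from confusion counts. The function name `fbeta` and all counts are hypothetical and are not taken from the analysis; they only show that, with beta = 2, missing malignant tumours is penalized more heavily than raising the same number of false alarms.

```python
def fbeta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score for the positive class, computed from confusion counts."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    # beta > 1 weights recall more heavily than precision
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, with "Malignant" as the positive class.
# Classifier A misses 4 malignant tumours (false negatives).
score_a = fbeta(tp=46, fp=0, fn=4)
# Classifier B instead flags 4 benign tumours as malignant (false positives).
score_b = fbeta(tp=50, fp=4, fn=0)

print(f"A (4 false negatives): F2 = {score_a:.3f}")
print(f"B (4 false positives): F2 = {score_b:.3f}")
```

Both hypothetical classifiers make four mistakes, but classifier A scores lower because its mistakes are false negatives, which is the kind of error this report is most concerned about.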

In [306]:
# tune model (here, find K for k-nn using 10 fold cv with the F-beta score, beta = 2)
knn = KNeighborsClassifier()
cancer_tune_pipe = make_pipeline(cancer_preprocessor, knn)

parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 100, 3),
}

cancer_tune_grid = GridSearchCV(
    estimator=cancer_tune_pipe,
    param_grid=parameter_grid,
    cv=10,
    scoring=make_scorer(fbeta_score, pos_label='Malignant', beta=2)
)

In [358]:
accuracies_grid = pd.DataFrame(
    cancer_tune_grid.fit(
        cancer_train.drop("class", axis=1),
        cancer_train["class"]
    ).cv_results_
)

In [359]:
accuracies_grid = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "std_test_score"
    ]]
    .assign(sem_test_score=accuracies_grid["std_test_score"] / 10**(1/2))
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
    .drop(columns=["std_test_score"]))



In [360]:
# error-bar limits: half a standard error below and above the mean score
accuracies_grid = accuracies_grid.assign(
    sem_test_score_lower=accuracies_grid["mean_test_score"] - accuracies_grid["sem_test_score"] / 2,
    sem_test_score_upper=accuracies_grid["mean_test_score"] + accuracies_grid["sem_test_score"] / 2,
)


In [362]:
accuracies_grid.sort_values("mean_test_score", ascending=False).head(10)

| | n_neighbors | mean_test_score | sem_test_score | sem_test_score_lower | sem_test_score_upper |
|---|---|---|---|---|---|
| 0 | 1 | 0.913579 | 0.021552 | 0.902803 | 0.924355 |
| 3 | 10 | 0.912678 | 0.02322 | 0.901068 | 0.924288 |
| 2 | 7 | 0.908672 | 0.023103 | 0.89712 | 0.920223 |
| 4 | 13 | 0.90699 | 0.024199 | 0.89489 | 0.919089 |
| 1 | 4 | 0.90435 | 0.024899 | 0.891901 | 0.916799 |
| 11 | 34 | 0.903597 | 0.025382 | 0.890906 | 0.916289 |
| 9 | 28 | 0.902657 | 0.026055 | 0.88963 | 0.915685 |
| 7 | 22 | 0.902411 | 0.025193 | 0.889814 | 0.915007 |
| 6 | 19 | 0.901797 | 0.023788 | 0.889903 | 0.91369 |
| 8 | 25 | 0.901471 | 0.025866 | 0.888538 | 0.914404 |


In [368]:
line_n_point = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False) 
        .title("Fbeta score")
)

bar = alt.Chart(accuracies_grid).mark_errorbar().encode(
    alt.Y("sem_test_score_upper:Q").scale(zero=False).title("Fbeta score"),
    alt.Y2("sem_test_score_lower:Q"),
    alt.X("n_neighbors:Q").title("Neighbors")
)

line_n_point + bar

Figure 2. Results from 10-fold cross validation to choose K. Fbeta score (with beta = 2) was used as the classification metric as K was varied.
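
The error bars in Figure 2 come from the standard error of the mean (SEM) across folds: the code divides `std_test_score` by the square root of 10 and then draws a band of half an SEM on either side of the mean. The sketch below reproduces that arithmetic in plain Python. The helper name `sem_band` and the fold scores are hypothetical, and it assumes the standard deviation is the population form (dividing by n), which is our understanding of how `std_test_score` is computed.

```python
import math


def sem_band(fold_scores: list[float]) -> tuple[float, float, float]:
    """Return (mean, lower, upper), where the band is one SEM wide in total."""
    n = len(fold_scores)
    mean = sum(fold_scores) / n
    # population standard deviation (divide by n, not n - 1); an assumption
    std = math.sqrt(sum((s - mean) ** 2 for s in fold_scores) / n)
    sem = std / math.sqrt(n)
    return mean, mean - sem / 2, mean + sem / 2


# Hypothetical F-beta scores from 10 cross-validation folds
scores = [0.90, 0.95, 0.88, 0.92, 0.97, 0.85, 0.93, 0.91, 0.89, 0.94]
mean, lower, upper = sem_band(scores)
print(f"mean = {mean:.3f}, band = [{lower:.3f}, {upper:.3f}]")
```

Because each band spans only one SEM in total, overlapping bands for neighbouring values of K suggest that the differences between them may not be meaningful.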

Our prediction model performed quite well on test data, with a final Cohen’s Kappa score of 0.9 and an overall accuracy of 0.97. The confusion matrix also indicates that the model performed well: it made only 4 mistakes. However, all 4 mistakes were predicting a malignant tumour as benign. Given the implications this has for patients' health, this model is not yet good enough to implement in the clinic.
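
Both accuracy and Cohen's Kappa can be derived from the four counts of a 2x2 confusion matrix. The sketch below shows the calculation by hand; the function name `accuracy_and_kappa` and the counts are hypothetical and are not the test results of this report. In practice, scikit-learn's `accuracy_score` and `cohen_kappa_score` could be applied to the predicted and true labels instead.

```python
def accuracy_and_kappa(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float]:
    """Accuracy and Cohen's Kappa from the counts of a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n  # observed agreement, i.e. accuracy

    # agreement expected by chance, from the row and column totals
    actual_pos, actual_neg = tp + fn, fp + tn
    pred_pos, pred_neg = tp + fp, fn + tn
    expected = (actual_pos * pred_pos + actual_neg * pred_neg) / n ** 2

    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical test set of 100 tumours: 50 malignant (positive) and 50 benign,
# with 5 malignant tumours wrongly predicted as benign.
acc, kappa = accuracy_and_kappa(tp=45, fn=5, fp=0, tn=50)
print(f"accuracy = {acc:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than accuracy here because it discounts the agreement a classifier would reach by chance given the class proportions.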

Table 1. Confusion matrix of model performance on test data.

In [None]:
# test model on unseen data

To further improve this model in future, with hopes of arriving at one that could be used in the clinic, there are several things we can suggest. First, we could look closely at the 4 misclassified observations and compare them to several observations that were classified correctly (from both classes). The goal of this would be to see which feature(s) may be driving the misclassification and explore whether any feature engineering could be used to help the model better predict on observations that it currently is making mistakes on. Additionally, we would try seeing whether we can get improved predictions using other classifiers. One classifier we might try is random forest, because it automatically allows for feature interaction, where k-nn does not. Finally, we also might improve the usability of the model in the clinic if we output and report the probability estimates for predictions. If we cannot prevent misclassifications through the approaches suggested above, at least reporting probability estimates for predictions would allow the clinician to know how confident the model was in its prediction. The clinician may then be able to perform additional diagnostic assays if the probability estimate for the predicted tumour class is not very high.
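
For k-nn, one simple probability estimate is the fraction of the K nearest neighbours that belong to each class; to our understanding this is what scikit-learn's `KNeighborsClassifier.predict_proba` returns with its default uniform weights. The sketch below illustrates the idea in plain Python. The function name `knn_class_probabilities`, the two standardized features and all data points are hypothetical and are not drawn from the data set.

```python
import math
from collections import Counter


def knn_class_probabilities(train_points, train_labels, query, k=5):
    """Estimate class probabilities as the share of votes among the k nearest points."""
    # sort training points by Euclidean distance to the query point
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return {label: votes[label] / k for label in sorted(set(train_labels))}


# Hypothetical training data with two standardized features
train_points = [
    (-1.0, -0.8), (-0.9, -1.1), (-1.2, -0.7), (-0.6, -0.9),
    (1.0, 0.9), (1.2, 1.1), (0.8, 1.3), (0.2, 0.1),
]
train_labels = ["Benign"] * 4 + ["Malignant"] * 4

clear_case = knn_class_probabilities(train_points, train_labels, (-1.0, -0.9), k=3)
borderline_case = knn_class_probabilities(train_points, train_labels, (0.0, 0.0), k=3)
print(clear_case)
print(borderline_case)
```

In this toy example both queries would be predicted benign by majority vote, but the second one has a malignant neighbour among its three nearest, which is the kind of low-confidence prediction a clinician might want to follow up with additional assays.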


# References

Canadian Cancer Statistics Advisory Committee. 2019. “Canadian Cancer Statistics.” Canadian Cancer Society. http://cancer.ca/Canadian-Cancer-Statistics-2019-EN.

de Jonge, Edwin. 2018. Docopt: Command-Line Interface Specification Language. https://CRAN.R-project.org/package=docopt.

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.

Jed Wing, Max Kuhn. Contributions from, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, et al. 2019. Caret: Classification and Regression Training. https://CRAN.R-project.org/package=caret.

Keleshev, Vladimir. 2014. Docopt: Command-Line Interface Description Language. https://github.com/docopt/docopt.

McKinney, Wes. 2010. “Data Structures for Statistical Computing in Python.” In Proceedings of the 9th Python in Science Conference, edited by Stéfan van der Walt and Jarrod Millman, 51–56.

———. 2019. Feather: Simple Wrapper Library to the Apache Arrow-Based Feather File Format. https://github.com/wesm/feather.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Street, W. Nick, W. H. Wolberg, and O. L. Mangasarian. 1993. “Nuclear feature extraction for breast tumor diagnosis.” In Biomedical Image Processing and Biomedical Visualization, edited by Raj S. Acharya and Dmitry B. Goldgof, 1905:861–70. International Society for Optics; Photonics; SPIE. https://doi.org/10.1117/12.148698.

Van Rossum, Guido, and Fred L. Drake. 2009. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.

Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.

———. 2019. Feather: R Bindings to the Feather ’Api’. https://CRAN.R-project.org/package=feather.

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.