# Programming for Data Analytics

<br/>

---

## Project II


# Wisconsin Breast Cancer Dataset

<br/>

Author: Jamie Tohall<br/>
Student Number: G00411380<br/>
Lecturer: Brian McGinley<br/>

<br/>

---

Contents

Problem Statement<br/>
Introduction<br/>
Dataset Description<br/>
Different types of Classifiers<br/>
Literature Review<br/>
Importing Relevant Modules<br/>
Reading in the Dataset<br/>
Preprocessing of the Dataset<br/>
Cleaning and Preparing the Dataset<br/>
Statistical Analysis<br/>
Training a Set of Classifiers<br/>
Review of Results<br/>
Investigation of Dataset Extension<br/>
References<br/>

---

<br/>

## Problem Statement
<br/>

This project will investigate the Wisconsin Breast Cancer dataset. The following list presents the requirements of the project:
<br/>

* Undertake an analysis/review of the dataset and present an overview and background.
* Provide a literature review on classifiers which have been applied to the dataset and compare their performance
* Present a statistical analysis of the dataset
* Using a range of machine learning algorithms, train a set of classifiers on the dataset (using SKLearn etc.) and present classification performance results. Detail your rationale for the parameter selections you made while training the classifiers.
* Compare, contrast and critique your results with reference to the literature
* Discuss and investigate how the dataset could be extended – using data synthesis of new tumour datapoints
* Document your work in a Jupyter notebook. 
* As a suggestion, you could use Pandas, Seaborn, SKLearn, etc. to perform your analysis. 
* Please use GitHub to demonstrate research, progress and consistency.

<br/>

---

<br/>

## Introduction

Breast cancer occurs in women in every country of the world at any age after puberty, with rates increasing in later life. <br/>
The World Health Organisation states that in 2020 there were 2.3 million women diagnosed with breast cancer and 685,000 deaths globally, making breast cancer the most common cancer in women. One in eight women is diagnosed with breast cancer in her lifetime. Fortunately, this rate is decreasing with awareness, but many women are still not aware of the early signs or symptoms.
<br/>

Breast cancer arises in the lining cells of the ducts or lobules in the glandular tissue of the breast. Initially, the cancerous growth is confined to the duct or lobule ("in situ"), where it generally causes no symptoms and has minimal potential for spread. Over time and without treatment, these in situ (stage 0) cancers may progress and invade the surrounding breast tissue, developing into an invasive breast cancer. This may then spread to the nearby lymph nodes (regional metastasis) or to other organs in the body (distant metastasis). If a woman dies from breast cancer, it is because of widespread metastasis.

Breast cancer treatment can be highly effective, especially when the disease is identified early. Treatment of breast cancer often consists of a combination of surgical removal, radiation therapy and medication (hormonal therapy, chemotherapy and/or targeted biological therapy) to treat the microscopic cancer that has spread from the breast tumor through the blood. Such treatment, which can prevent cancer growth and spread, thereby saves lives.

<br/>

For this project, I will analyse the Wisconsin breast cancer dataset. The dataset was developed in 1995 by researchers at the University of Wisconsin, and includes the measurements from digitized images of fine-needle aspirate of a breast mass. A fine-needle aspirate is a type of biopsy performed using a small needle to obtain samples of tissue and fluid from solid or cystic breast lesions. It is one of the many different modalities for diagnosing breast masses.

The breast cancer dataset includes 569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis and 30 are numeric-valued laboratory measurements. The diagnosis is coded as "M" to indicate malignant or "B" to indicate benign.

The other 30 numeric measurements comprise the mean, standard error and worst (i.e. largest) value for each of 10 different characteristics of the digitized cell nuclei, which are as follows:

Radius<br/>
Texture<br/>
Perimeter<br/>
Area<br/>
Smoothness<br/>
Compactness<br/>
Concavity<br/>
Concave Points<br/>
Symmetry<br/>
Fractal dimension<br/>





---

<br/>

## Dataset Description
<br/>

https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names

#### 1. Title: Wisconsin Diagnostic Breast Cancer (WDBC)
<br/>

#### 2. Source Information

a) Creators: 

	Dr. William H. Wolberg, General Surgery Dept., University of
	Wisconsin,  Clinical Sciences Center, Madison, WI 53792
	wolberg@eagle.surgery.wisc.edu

	W. Nick Street, Computer Sciences Dept., University of
	Wisconsin, 1210 West Dayton St., Madison, WI 53706
	street@cs.wisc.edu  608-262-6619

	Olvi L. Mangasarian, Computer Sciences Dept., University of
	Wisconsin, 1210 West Dayton St., Madison, WI 53706
	olvi@cs.wisc.edu 

b) Donor: Nick Street

c) Date: November 1995

#### 3. Past Usage:

first usage:

	W.N. Street, W.H. Wolberg and O.L. Mangasarian 
	Nuclear feature extraction for breast tumor diagnosis.
	IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science
	and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.

OR literature:

	O.L. Mangasarian, W.N. Street and W.H. Wolberg. 
	Breast cancer diagnosis and prognosis via linear programming. 
	Operations Research, 43(4), pages 570-577, July-August 1995.

Medical literature:

	W.H. Wolberg, W.N. Street, and O.L. Mangasarian. 
	Machine learning techniques to diagnose breast cancer from
	fine-needle aspirates.  
	Cancer Letters 77 (1994) 163-171.

	W.H. Wolberg, W.N. Street, and O.L. Mangasarian. 
	Image analysis and machine learning applied to breast cancer
	diagnosis and prognosis.  
	Analytical and Quantitative Cytology and Histology, Vol. 17
	No. 2, pages 77-87, April 1995. 

	W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. 
	Computerized breast cancer diagnosis and prognosis from fine
	needle aspirates.  
	Archives of Surgery 1995;130:511-516.

	W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. 
	Computer-derived nuclear features distinguish malignant from
	benign breast cytology.  
	Human Pathology, 26:792--796, 1995.

See also:
	http://www.cs.wisc.edu/~olvi/uwmp/mpml.html
	http://www.cs.wisc.edu/~olvi/uwmp/cancer.html

Results:

	- predicting field 2, diagnosis: B = benign, M = malignant
	- sets are linearly separable using all 30 input features
	- best predictive accuracy obtained using one separating plane
		in the 3-D space of Worst Area, Worst Smoothness and
		Mean Texture.  Estimated accuracy 97.5% using repeated
		10-fold crossvalidations.  Classifier has correctly
		diagnosed 176 consecutive new patients as of November
		1995. 

#### 4. Relevant information

	Features are computed from a digitized image of a fine needle
	aspirate (FNA) of a breast mass.  They describe
	characteristics of the cell nuclei present in the image.
	A few of the images can be found at
	http://www.cs.wisc.edu/~street/images/

	Separating plane described above was obtained using
	Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
	Construction Via Linear Programming." Proceedings of the 4th
	Midwest Artificial Intelligence and Cognitive Science Society,
	pp. 97-101, 1992], a classification method which uses linear
	programming to construct a decision tree.  Relevant features
	were selected using an exhaustive search in the space of 1-4
	features and 1-3 separating planes.

	The actual linear program used to obtain the separating plane
	in the 3-dimensional space is that described in:
	[K. P. Bennett and O. L. Mangasarian: "Robust Linear
	Programming Discrimination of Two Linearly Inseparable Sets",
	Optimization Methods and Software 1, 1992, 23-34].


	This database is also available through the UW CS ftp server:

	ftp ftp.cs.wisc.edu
	cd math-prog/cpo-dataset/machine-learn/WDBC/

#### 5. Number of instances: 569 
<br/>

#### 6. Number of attributes: 32 (ID, diagnosis, 30 real-valued input features)
<br/>

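#### 7. Attribute information:
<br/>

Two related datasets exist, and their attributes should not be confused. The original Wisconsin Breast Cancer dataset, which is used in some of the papers reviewed later, has 11 attributes:

1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)

The diagnostic dataset (WDBC) analysed in this project has 32 fields:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Thirty real-valued features, described below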

Ten real-valued features are computed for each cell nucleus:

	a) radius (mean of distances from center to points on the perimeter)
	b) texture (standard deviation of gray-scale values)
	c) perimeter
	d) area
	e) smoothness (local variation in radius lengths)
	f) compactness (perimeter^2 / area - 1.0)
	g) concavity (severity of concave portions of the contour)
	h) concave points (number of concave portions of the contour)
	i) symmetry 
	j) fractal dimension ("coastline approximation" - 1)

Several of the papers listed above contain detailed descriptions of
how these features are computed. 

The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features.  For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.

All feature values are recoded with four significant digits.
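The field numbering described above can be sketched in a few lines of Python. The short names used here (`points`, `dimension`, and the `_mean`, `_se`, `_worst` suffixes) are assumed to follow the column labels of the CSV file loaded later in this notebook; the ordering itself comes from the description above.

```python
# Ten nucleus characteristics, in the order listed above (a to j).
features = ["radius", "texture", "perimeter", "area", "smoothness",
            "compactness", "concavity", "points", "symmetry", "dimension"]
# Three statistics are recorded for each characteristic.
stats = ["mean", "se", "worst"]

# Fields 1 and 2 are the ID and the diagnosis. Fields 3-32 are the 30
# measurements, grouped by statistic: all means, then all standard errors,
# then all worst values.
fields = ["id", "diagnosis"] + [f"{feat}_{stat}" for stat in stats for feat in features]

def field_name(number):
    """Return the name of a 1-based field number."""
    return fields[number - 1]

print(len(fields))
print(field_name(3), field_name(13), field_name(23))
```

Fields 3, 13 and 23 come out as the mean, standard error and worst value of the radius, which matches the example given in the description.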

#### 8. Missing attribute values: none
<br/>

#### 9. Class distribution: 357 benign, 212 malignant
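The class distribution matters when judging classifier accuracy later on. As a rough sketch using only the two counts above, the following computes each class's share and the accuracy of a trivial model that always predicts the majority class. The variable names are illustrative.

```python
# Class counts taken from the dataset description above.
counts = {"benign": 357, "malignant": 212}
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(counts.values()) / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline: {baseline_accuracy:.1%}")
```

Any trained classifier should be compared against this baseline rather than against 50%, because the two classes are not equally represented.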



<br/>




---

<br/>

## Different types of Classifiers


A classifier is an algorithm that maps input data to a specific category. Common types of classifiers include:

Perceptron<br/>
Naive Bayes<br/>
Decision Tree<br/>
Logistic Regression<br/>
K-Nearest Neighbor<br/>
Artificial Neural Networks/Deep Learning<br/>
Support Vector Machine<br/>

Then there are the ensemble methods: Random Forest, Bagging, AdaBoost, etc.
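As a minimal sketch of how several of these classifier types can be tried side by side, the example below uses the copy of the diagnostic dataset that ships with scikit-learn (`load_breast_cancer`) rather than the CSV file used elsewhere in this notebook. The hyperparameters are defaults or near-defaults chosen for illustration, not tuned values, and the 5-fold cross-validation setup is an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# scikit-learn bundles the diagnostic dataset (569 rows, 30 features).
X, y = load_breast_cancer(return_X_y=True)

# Illustrative settings only; none of these values has been tuned.
classifiers = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbor": KNeighborsClassifier(n_neighbors=5),
    "Support Vector Machine": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    # Scaling sits inside the pipeline so that each fold is standardised
    # using its own training portion only.
    model = make_pipeline(StandardScaler(), clf)
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the comparison fair, since distance-based methods such as K-Nearest Neighbor and SVM are sensitive to feature scale.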

---

<br/>

## Literature Reviews

<br/>


### <u>Cluster Analysis of Wisconsin Breast Cancer Dataset Using Self-Organizing Maps</u>
Authors: Stefan PANTAZI, Yuri KAGOLOVSKY, Jochen R. MOEHR

<br/>
 
**Introduction**

This study deals with multidimensional data analysis, specifically cluster analysis applied to the Wisconsin Breast Cancer dataset. The paper shows how both data visualization and cluster analysis of a dataset can be achieved using the Kohonen model of self-organizing artificial neural networks, and how the results of this analysis can be used to compare the performance of classification processes.<br/>
Cluster analysis is a method of exploratory data analysis that aims at partitioning a set of data items into groups, based on a measure of distance (or dissimilarity). The groups are called clusters and their number may be pre-assigned or determined by the algorithms.
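To make the idea of distance-based partitioning concrete, here is a minimal sketch of k-means, a simpler clustering method than the Kohonen self-organizing map used in the paper. It is not the authors' method. The toy points and starting centroids are hypothetical, and the number of clusters is pre-assigned as two.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Partition points into len(centroids) clusters by Euclidean distance."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Toy two-dimensional points (hypothetical values, not taken from the dataset).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, centroids=[points[0], points[-1]])
print(labels)
```

The first three points end up in one cluster and the last three in the other, because each group lies much closer to its own centroid than to the other one.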

<br/>

**Findings**

For the 342 cases used in this study as a testing subset, the accuracy of the linear distance-based classifier was 95.32%, i.e. 326 patterns were correctly classified. The authors also found that other classifiers employing more sophisticated decision techniques (such as fuzzy rule extraction, hybrid-system rule extraction and Bayesian networks) reported comparable accuracies on the same dataset.

<br/>
    
**Conclusion**

Pantazi, Kagolovsky and Moehr concluded that exploratory data analysis should be the first step when analyzing a new dataset because, besides giving a first impression of the data itself, it can also point to data errors.<br/>
They also concluded that the Wisconsin Breast Cancer dataset should not be used as a benchmark for classification algorithms. Any linear distance classifier will probably achieve an accuracy of over 90%, and the accuracies of non-linear classifiers will not differ notably, because any differences derive only from the sparse, potentially erroneous, remaining non-linearity in the data.

<br/>
<br/>


### <u>Analysis of Wisconsin Breast Cancer original dataset using data mining and machine learning algorithms for breast cancer prediction</u>
Authors: Md. Toukir Ahmed, Md. Niaz Imtiaz and Animesh Karmakar
    
<br/>
    
**Introduction**
    
The main focus of this paper is to apply different machine learning classification algorithms to correctly predict the target class, and to improve the prediction by checking the effectiveness of particular attributes of the original Wisconsin Breast Cancer dataset for breast cancer diagnosis.<br/>
The algorithms used are Naïve Bayes, Support Vector Machine (SVM), Multilayer Perceptron (MLP), J48 and Random Forest. To compare the results, the study used the following performance metrics: accuracy, Kappa statistic, precision, recall, F-measure, MCC, ROC area and PRC area.
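Several of these metrics can be computed directly from the four cells of a confusion matrix. The sketch below uses hypothetical counts, not figures from the paper, and treats malignant as the positive class. ROC area and PRC area are left out because they need the classifier's scores rather than its final predictions.

```python
import math

# Hypothetical confusion-matrix counts, with malignant as the positive class.
tp, fn = 45, 5    # malignant cases predicted malignant / predicted benign
fp, tn = 10, 40   # benign cases predicted malignant / predicted benign
n = tp + fn + fp + tn

accuracy = (tp + tn) / n
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # also called sensitivity
specificity = tn / (tn + fp)
f_measure = 2 * precision * recall / (precision + recall)

# Matthews correlation coefficient (MCC).
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Cohen's kappa: observed agreement corrected for the agreement expected by chance.
expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
kappa = (accuracy - expected) / (1 - expected)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"specificity={specificity:.3f} F-measure={f_measure:.3f}")
print(f"MCC={mcc:.3f} kappa={kappa:.3f}")
```

Reporting several metrics together is useful in a medical setting, where a missed malignant case (a false negative) usually costs more than a false alarm, and accuracy alone hides that difference.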


**Findings**
    
The Naïve Bayes classifier performed best in terms of accuracy and specificity, while the Multilayer Perceptron gave the highest sensitivity. On the other performance metrics, Naïve Bayes was also the strongest. Because Naïve Bayes worked best among the proposed classifiers, the authors tried to optimise its result further. They assessed the effectiveness of each feature by removing the attributes one at a time and checking the classifier's performance. They found that “Single Epithelial Cell Size” had little impact on the dataset and a negative effect on accuracy; removing it gave better accuracy and better results on the other metrics as well.

**Conclusion**
    
This study applied five classification algorithms and concluded that Naïve Bayes is superior to the others when compared on standard metrics. After analysing the dataset to establish the effectiveness of the individual attributes, the authors reviewed the performance again and found the result to be better than prior results.

<br/>
<br/>
    

### <u>Machine Learning Classifiers on Breast Cancer Recurrences</u>
Authors: Vincent Peter C. Magboo, Ma. Sheila A. Magboo

<br/>
    
**Introduction**
    
This study used a data mining approach to compare four popular machine learning models (Logistic Regression, Naïve Bayes, K-Nearest Neighbors, and Support Vector Machines). The study uses the Wisconsin Prognostic Breast Cancer Data Set to classify breast cancer recurrences in four different configurations: a) only scaling applied, b) scaling with PCA, c) scaling with PCA and oversampling of the minority class, and d) only selected features.
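Oversampling of the minority class can be illustrated with plain random duplication, sketched below on hypothetical toy data. This is an assumption about the general idea only; the oversampling method the authors actually used is not described here and may well differ. The function name `random_oversample` and the labels "N" and "R" are made up for the example.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Rows belonging to this class, from which duplicates are drawn.
        pool = [r for r, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical toy data: 6 non-recurrent ("N") and 2 recurrent ("R") cases.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.9], [1.0]]
labels = ["N", "N", "N", "N", "N", "N", "R", "R"]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Oversampling should be applied to the training portion only; duplicating rows before the train/test split would let copies of the same case appear on both sides and inflate the measured performance.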

<br/>

**Findings**
    
Study results showed that Logistic Regression provided the best scores in almost all metrics (precision, recall, accuracy, weighted F1 score, AUROC, AUPROC, and Cohen's Kappa statistic) in all four configurations, followed by Support Vector Machines and then by K-Nearest Neighbors.<br/>
Naïve Bayes performed poorly, especially in the scaling with PCA configuration. However, when only one of each group of highly correlated features was retained, Naïve Bayes performance improved. KNN improved its recall with oversampling, while SVM's accuracy score was fairly constant across all four configurations.
    
<br/>

**Conclusion**
    
Based on this study, the Logistic Regression model can serve as a potential model for predicting breast cancer recurrence, enabling clinicians to propose treatment options based on whether a patient's features correspond to a good or bad prognosis (recurrence). This indicates the clinical utility of data mining methods for the early detection of breast cancer recurrence in post-surgical patients, which can help to save lives.

    
<br/>
<br/>
    

### <u>Study and Analysis of Breast Cancer Data</u>
Authors: Sonali Nandish Manoli, Padma S.K

<br/>

**Introduction**

In this study, the analysis uses the original data with all of its features, where each missing attribute value is replaced by the mean of the other values of that attribute. Two classifiers, Naive Bayes and SVM, are used to analyse the data. Feature extraction is then used to eliminate redundant features by reducing the dimensionality of the data with Principal Component Analysis (PCA). The reduced data is classified again with the Naive Bayes and SVM classifiers to improve the accuracy of the prediction.

<br/> 

**Findings**

The study reports a classification accuracy of 95.71% for Naive Bayes and 97.14% for SVM on the whole data. On the PCA-reduced data, using only 2 features, the accuracy rises to 97.14% for Naive Bayes and 97.86% for SVM.<br/>
The study concludes that SVM is the better classifier: its accuracy of 97.86% is 1.96 percentage points above the 95.90% reported as the highest in the UCI Machine Learning Repository.
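The general approach of reducing the data to two principal components before classifying can be sketched as follows. This example uses the diagnostic dataset bundled with scikit-learn, not the original dataset used in the paper, so its scores are not comparable with the figures quoted above. The standardisation step and the 5-fold cross-validation are assumptions of this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

results = {}
for name, clf in {"Naive Bayes": GaussianNB(), "SVM": SVC()}.items():
    # Standardise, project onto the first two principal components, then classify.
    model = make_pipeline(StandardScaler(), PCA(n_components=2), clf)
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name} on 2 principal components: {results[name]:.3f}")
```

Fitting PCA inside the pipeline means the components are learned from each training fold only, which avoids leaking information from the held-out fold.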

<br/>

**Conclusion**

From the analysis it was concluded that the SVM model is useful in predicting breast cancer from tumour data. There is also scope for analysis using other classifiers and dimensionality reduction techniques, which may help in better understanding larger datasets with many more features in the near future.
    
<br/>
<br/>

---

<br/>

## Importing relevant modules

In [30]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as sk
import matplotlib.pyplot as plt

---

<br/>

## Reading in the dataset

In [31]:
# Read the CSV file into a pandas DataFrame named db

db = pd.read_csv("wisc_bc_data.csv")

---

<br/>

## Preprocessing of the Dataset

In [32]:
# Shape will show the number of rows and columns the dataset contains

db.shape

(569, 32)

In [33]:
# Columns will give an index of all 32 columns

db.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'points_mean', 'symmetry_mean', 'dimension_mean', 'radius_se',
       'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'points_se', 'symmetry_se',
       'dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
       'area_worst', 'smoothness_worst', 'compactness_worst',
       'concavity_worst', 'points_worst', 'symmetry_worst', 'dimension_worst'],
      dtype='object')

---

In [34]:
# Head prints the first 5 rows by default; here 20 rows are requested. This gives a good example of how the
# dataset is formatted and how the information is displayed.

db.head(20)

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | points_worst | symmetry_worst | dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 87139402 | B | 12.32 | 12.39 | 78.85 | 464.1 | 0.1028 | 0.06981 | 0.03987 | 0.037 | ... | 13.5 | 15.64 | 86.97 | 549.1 | 0.1385 | 0.1266 | 0.1242 | 0.09391 | 0.2827 | 0.06771 |
| 1 | 8910251 | B | 10.6 | 18.95 | 69.28 | 346.4 | 0.09688 | 0.1147 | 0.06387 | 0.02642 | ... | 11.88 | 22.94 | 78.28 | 424.8 | 0.1213 | 0.2515 | 0.1916 | 0.07926 | 0.294 | 0.07587 |
| 2 | 905520 | B | 11.04 | 16.83 | 70.92 | 373.2 | 0.1077 | 0.07804 | 0.03046 | 0.0248 | ... | 12.41 | 26.44 | 79.93 | 471.4 | 0.1369 | 0.1482 | 0.1067 | 0.07431 | 0.2998 | 0.07881 |
| 3 | 868871 | B | 11.28 | 13.39 | 73.0 | 384.8 | 0.1164 | 0.1136 | 0.04635 | 0.04796 | ... | 11.92 | 15.77 | 76.53 | 434.0 | 0.1367 | 0.1822 | 0.08669 | 0.08611 | 0.2102 | 0.06784 |
| 4 | 9012568 | B | 15.19 | 13.21 | 97.65 | 711.8 | 0.07963 | 0.06934 | 0.03393 | 0.02657 | ... | 16.2 | 15.73 | 104.5 | 819.1 | 0.1126 | 0.1737 | 0.1362 | 0.08178 | 0.2487 | 0.06766 |
| 5 | 906539 | B | 11.57 | 19.04 | 74.2 | 409.7 | 0.08546 | 0.07722 | 0.05485 | 0.01428 | ... | 13.07 | 26.98 | 86.43 | 520.5 | 0.1249 | 0.1937 | 0.256 | 0.06664 | 0.3035 | 0.08284 |
| 6 | 925291 | B | 11.51 | 23.93 | 74.52 | 403.5 | 0.09261 | 0.1021 | 0.1112 | 0.04105 | ... | 12.48 | 37.16 | 82.28 | 474.2 | 0.1298 | 0.2517 | 0.363 | 0.09653 | 0.2112 | 0.08732 |
| 7 | 87880 | M | 13.81 | 23.75 | 91.56 | 597.8 | 0.1323 | 0.1768 | 0.1558 | 0.09176 | ... | 19.2 | 41.85 | 128.5 | 1153.0 | 0.2226 | 0.5209 | 0.4646 | 0.2013 | 0.4432 | 0.1086 |
| 8 | 862989 | B | 10.49 | 19.29 | 67.41 | 336.1 | 0.09989 | 0.08578 | 0.02995 | 0.01201 | ... | 11.54 | 23.31 | 74.22 | 402.8 | 0.1219 | 0.1486 | 0.07987 | 0.03203 | 0.2826 | 0.07552 |
| 9 | 89827 | B | 11.06 | 14.96 | 71.49 | 373.9 | 0.1033 | 0.09097 | 0.05397 | 0.03341 | ... | 11.92 | 19.9 | 79.76 | 440.0 | 0.1418 | 0.221 | 0.2299 | 0.1075 | 0.3301 | 0.0908 |


In [35]:
# Describe outputs basic statistical details of the dataset (count, mean, std, min, quartiles and max).
# Note the parentheses: db.describe without them only returns the bound method, not the statistics.

db.describe()
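The difference between referring to `describe` and calling it can be seen on a small self-contained sample. The sketch below copies two columns from the first ten rows of the `head()` output shown earlier, so that it runs without the CSV file; the variable names `sample`, `summary` and `by_class` are illustrative.

```python
import pandas as pd

# First ten rows of two columns, copied from the head() output shown earlier.
sample = pd.DataFrame({
    "diagnosis": ["B", "B", "B", "B", "B", "B", "B", "M", "B", "B"],
    "radius_mean": [12.32, 10.60, 11.04, 11.28, 15.19, 11.57, 11.51, 13.81, 10.49, 11.06],
})

# Without parentheses this is only a reference to the method, not a result.
print(callable(sample.describe))

# With parentheses it returns count, mean, std, min, quartiles and max
# for the numeric columns.
summary = sample.describe()
print(summary)

# The same summary, split by diagnosis class.
by_class = sample.groupby("diagnosis")["radius_mean"].describe()
print(by_class)
```

Splitting the summary by diagnosis is a natural next step for the statistical analysis, since it shows how each measurement differs between benign and malignant cases.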

---

<br/>

## Clean and prepare the dataset

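Drop the columns that are not required for classification, such as id. Some copies of this dataset also contain an empty column named Unnamed: 32; the file used here does not (see the column index above), so that column is dropped only if it is present.
Map the diagnosis column into numeric form: it holds the letters M and B, which have to be converted to numbers so that the algorithms can use the data.

In [36]:
# Drop the columns that are not needed; errors='ignore' skips any that are absent
db = db.drop(columns=['id', 'Unnamed: 32'], errors='ignore')

# Encode the diagnosis as numbers: malignant (M) = 1, benign (B) = 0
db['diagnosis'] = db['diagnosis'].map({'M': 1, 'B': 0})

db.head()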

## Statistical Analysis

---

## Training a Set of Classifiers

---

## Review of Results

---

## Investigation of Dataset Extension

---

## References

[1] https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29<br/>
[2] https://www.researchgate.net/publication/311950799_Analysis_of_the_Wisconsin_Breast_Cancer_Dataset_and_Machine_Learning_for_Breast_Cancer_Detection<br/>
[3] https://data.world/health/breast-cancer-wisconsin/workspace/file?filename=DatasetDescription.txt<br/>
[4] https://pubmed.ncbi.nlm.nih.gov/15460731/<br/>
[5] https://1library.net/document/download/q5wodwwq?page=1<br/>
[6] https://node1.123dok.com/dt01pdf/123dok_us/000/791/791788.pdf.pdf?X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=7PKKQ3DUV8RG19BL%2F20230109%2F%2Fs3%2Faws4_request&X-Amz-Date=20230109T203933Z&X-Amz-SignedHeaders=host&X-Amz-Expires=600&X-Amz-Signature=05680f6a49547c0f47ffeb4662633159d5433730fc1692feb29475f0a6318434<br/>
[7] https://www.youtube.com/watch?v=2ncx2q5GHbQ<br/>
[8] http://dx.doi.org/10.4236/oalib.1100660<br/>
[9] https://www.researchgate.net/publication/355027167_Machine_Learning_Classifiers_on_Breast_Cancer_Recurrences<br/>
[10] https://www.who.int/news-room/fact-sheets/detail/breast-cancer<br/>
[11] https://www.cancer.org/healthy/cancer-facts/cancer-facts-for-women.html<br/>
[12] https://machinemantra.in/breast-cancer-wisconsin-dataset/<br/>
[13] https://www.ijert.org/research/study-and-analysis-of-breast-cancer-data-IJERTCONV5IS21015.pdf<br/>
[14] https://www.geeksforgeeks.org/ml-kaggle-breast-cancer-wisconsin-diagnosis-using-knn/?ref=gcse