#                                                Project II

##                                       Programming for Data Analysis

###                                 Investigation of the Wisconsin Breast Cancer dataset

This project investigates the Wisconsin Breast Cancer dataset. The requirements of the project are:
* Undertake an analysis/review of the dataset and present an overview and background.
* Provide a literature review on classifiers which have been applied to the dataset and
compare their performance
* Present a statistical analysis of the dataset
* Using a range of machine learning algorithms, train a set of classifiers on the dataset (using
SKLearn etc.) and present classification performance results. Detail your rationale for the
parameter selections you made while training the classifiers.
* Compare, contrast and critique your results with reference to the literature
* Discuss and investigate how the dataset could be extended – using data synthesis of new
tumour datapoints
* Document your work in a Jupyter notebook.
* As a suggestion, you could use Pandas, Seaborn, SKLearn, etc. to perform your analysis.
* Please use GitHub to demonstrate research, progress and consistency.
---


## Review of the Wisconsin Breast Cancer dataset
---
The experimental data used in this notebook was obtained from the Breast Cancer Wisconsin subdirectory of the University of California Irvine Machine Learning Repository, available at https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29; specifically, I use the original Wisconsin Breast Cancer dataset. The dataset comprises 699 samples, 683 of which are complete; the remaining 16 are missing the Bare Nuclei attribute. Each sample belongs to one of 2 classes: 458 were classified as benign and 241 as malignant. The 9 integer-valued attributes range from 1 to 10, where 10 is the most abnormal.

|S/N|Attribute|Domain|
|:---:|:----:|:---:|
|1|Clump Thickness|1–10|
|2|Uniformity of Cell Size|1–10|
|3|Uniformity of Cell Shape|1–10|
|4|Marginal Adhesion|1–10|
|5|Single Epithelial Cell Size|1–10|
|6|Bare Nuclei|1–10|
|7|Bland Chromatin|1–10|
|8|Normal Nucleoli|1–10|
|9|Mitoses|1–10|
|10|Class|2 for benign, 4 for malignant|

Modified from Divyavani, M., and G. Kalpana. "An analysis on SVM & ANN using breast cancer dataset." Aegaeum J 8 (2021): 369-379.
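
The layout described above can be read with pandas. The following is a minimal sketch rather than the loading code of this notebook: it assumes the raw file has no header row, starts each record with a sample ID, and writes missing Bare Nuclei values as `?`. The column names and the three inline rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical column names; assumed layout: sample ID, 9 attributes, class.
COLUMNS = [
    "sample_id", "clump_thickness", "cell_size_uniformity",
    "cell_shape_uniformity", "marginal_adhesion", "single_epithelial_cell_size",
    "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]


def load_wbc(source):
    """Read WBC-style CSV data, treating '?' as a missing value."""
    df = pd.read_csv(source, header=None, names=COLUMNS, na_values="?")
    # Map the class codes from the table above to readable labels.
    df["diagnosis"] = df["class"].map({2: "benign", 4: "malignant"})
    return df


# Three made-up rows in the assumed format, one with a missing Bare Nuclei value.
sample = io.StringIO(
    "1000001,5,1,1,1,2,1,3,1,1,2\n"
    "1000002,8,10,10,8,7,10,9,7,1,4\n"
    "1000003,3,1,1,1,2,?,3,1,1,2\n"
)
df = load_wbc(sample)
complete = df.dropna()  # keep only rows where every attribute is present
print(len(df), len(complete))
print(df["diagnosis"].value_counts().to_dict())
```

Dropping incomplete rows is only one option; imputing the missing values is another, and the choice affects how many samples reach the classifiers.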

## Literature Review

Breast cancer is the second leading cause of death among women worldwide [^1]. Irrespective of the cancer type, early detection is the best way to increase the chance of treatment and survivability [^2]. 


A surgical biopsy can confirm malignancy accurately and with high sensitivity, but it is a costly and painful procedure.
Modern classification techniques therefore attempt to match the accuracy of a biopsy without its drawbacks.

Machine learning is a data analysis technique in which a computer learns to produce an output from its input using different algorithms. Decision trees, k-means clustering, and neural networks are among the most common algorithms for machine learning applications. While there is no better way to diagnose breast cancer, early diagnosis can be regarded as the first step in treatment and in assessing and minimizing risk factors. It allows a person to control risk factors, although some breast cancer risk factors cannot be changed [^4].

There are many algorithms for classifying and predicting breast cancer outcomes. Support Vector Machine (SVM), Decision Tree (C4.5), Naive Bayes (NB) and k Nearest Neighbors (k-NN) are among the most influential in the research community and rank among the top 10 data mining algorithms [^3].
The main objective is to assess how correctly each algorithm classifies the data, comparing their efficiency and effectiveness in terms of accuracy, precision, sensitivity and specificity.
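
The four measures named above can all be derived from the counts in a confusion matrix. The sketch below assumes malignant is treated as the positive class; the function name and the counts are made up for illustration and are not results from this notebook.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, sensitivity and specificity from counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),  # recall, or true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Made-up counts for illustration only (malignant = positive class).
metrics = classification_metrics(tp=45, fp=5, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a diagnostic setting sensitivity usually matters most, because a false negative means a malignant tumour is missed.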

### Breast cancer detection research using different machine learning algorithms

Adapted from Mohammed SA et al. 2020 [^2]

|Paper title |Datasets |Algorithms |Results |
| :--------: | :------------: |:--------: | :--------:|
|A study on prediction of breast cancer recurrence using data mining techniques [4], 2017 |WPBC |Classification: KNN, SVM, NB and C5.0; Clustering: K-means, EM, PAM and Fuzzy c-means |Classification accuracy is better than clustering; SVM & C5.0: 81% |
|Predicting breast cancer recurrence using effective classification and feature selection technique [5], 2016 |WPBC |NB, C4.5, SVM |NB: 67.17%, C4.5: 73.73%, SVM: 75.75% |
|Using machine learning algorithms for breast cancer risk prediction and diagnosis [^3], 2016 |WBC |SVM, C4.5, NB, KNN |SVM outperforms the others: 97.13% |
|Study and analysis of breast cancer cell detection using Naïve Bayes, SVM and ensemble algorithms [7], 2016 |WDBC |NB, SVM, Ensemble |SVM: 98.5%, NB & Ensemble: 97.3% |
|Analysis of Wisconsin breast cancer dataset and machine learning for breast cancer detection [8], 2015 |WDBC |NB, J48 |NB: 97.51%, J48: 96.5% |
|A novel approach for breast cancer detection using data mining techniques [10], 2014 |WBC |SMO, IBK, BF Tree |SMO: 96.19%, IBK: 95.90%, BF Tree: 95.46% |
|Experiment comparison of classification for breast cancer diagnosis [11], 2012 |WBC, WDBC, WPBC |J48, SMO, MLP, NB, IBK |In WBC: MLP & J48: 97.2818%. In WDBC: SMO: 97.7% or fusion of SMO & MLP: 97.7%. In WPBC: fusion of MLP, J48, SMO and IBK: 77% |
|Analysis of feature selection with classification: breast cancer datasets [12], 2011 |WBC, WDBC |Decision Tree with and without feature selection |Feature selection enhances the results. WBC: 96.99%, WDBC: 94.77% |










[^1]: U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control


[^2]: [Mohammed SA, Darrab S, Noaman SA, Saake G. Analysis of Breast Cancer Detection Using Different Machine Learning Techniques. Data Mining and Big Data. 2020 Jul 11;1234:108–17. doi: 10.1007/978-981-15-7205-0_10. PMCID: PMC7351679.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7351679/)


[^3]: [Asri, Hiba, et al. "Using machine learning algorithms for breast cancer risk prediction and diagnosis." Procedia Computer Science 83 (2016): 1064-1069.](https://www.sciencedirect.com/science/article/pii/S1877050916302575)

[^4]: [Ak, M.F. A Comparative Analysis of Breast Cancer Detection and Diagnosis Using Data Visualization and Machine Learning Applications. Healthcare 2020, 8, 111.](https://doi.org/10.3390/healthcare8020111)

[Silva, Jesús, et al. "Integration of data mining classification techniques and ensemble learning for predicting the type of breast cancer recurrence." International Conference on Green, Pervasive, and Cloud Computing. Springer, Cham, 2019.](https://link.springer.com/chapter/10.1007/978-3-030-19223-5_2)

Ojha U., Goel, S.: A study on prediction of breast cancer recurrence using data mining techniques. In: 7th International Conference on Cloud Computing, Data Science & Engineering-Confluence, IEEE, pp. 527–530, 201

Pritom, A.I., Munshi, M.A.R., Sabab, S.A., Shihab, S.: Predicting breast cancer recurrence using effective classification and feature selection technique. In: 19th International Conference on Computer and Information Technology (ICCIT), pp. 310–314. IEEE (2016)



[Vig, L. (2014). Comparative analysis of different classifiers for the Wisconsin breast cancer dataset. Open Access Library Journal, 1(06), 1.](https://file.scirp.org/pdf/OALibJ_2016031015403611.pdf)


### K-Nearest Neighbor (KNN)
KNN is a supervised learning technique, meaning that the labels of the training data are known before predictions are made. It can be used for both classification and regression. K is the number of nearest neighbors considered. The KNN algorithm has no explicit training phase: it stores the training data and makes predictions based on the Euclidean distance to the k nearest neighbors. This technique suits the breast cancer dataset because its samples are already labelled as malignant or benign.
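
The prediction step described above can be sketched in a few lines of plain Python. This is an illustrative implementation under simple assumptions (majority vote, unweighted neighbors), not the classifier trained in this notebook; the function name and the two-attribute toy points are made up.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training sample.
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up samples with two attributes on the dataset's 1-10 scale.
train_X = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (10, 10)]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

print(knn_predict(train_X, train_y, (2, 2), k=3))
print(knn_predict(train_X, train_y, (9, 9), k=3))
```

An odd value of k avoids tied votes in a two-class problem such as this one.
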
### Support Vector Machine
Support vector machine is one of the most common machine learning techniques. The objective of the algorithm is to find a hyperplane in N-dimensional space that separates the classes, where N is the number of features. The central task is finding the hyperplane that maximizes the margin. With two features the separation is easy to visualize; with several features it is less straightforward. Maximizing the margin generally gives more accurate predictions. SVM involves a tradeoff between a large margin and accurate classification of the training samples. If every sample must be classified correctly, the margin can become very narrow, which can lower accuracy. On the other hand, when the margin between classes is widened to improve accuracy, the support vectors closest to the hyperplane may be treated as members of the other class.
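
The tradeoff can be made concrete with the usual soft-margin objective, which adds the squared weight norm (small norm means wide margin) to a penalty for samples that violate the margin. The sketch below is an assumption-laden illustration: the symbols `w`, `b` and the penalty weight `C` follow the standard formulation and do not appear in the text above, and the one-dimensional data points are made up.

```python
import math


def soft_margin_objective(w, b, X, y, C):
    """Soft-margin SVM objective: 0.5 * ||w||^2 + C * sum of hinge losses."""
    norm_sq = sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, yi in zip(X, y)
    )
    return 0.5 * norm_sq + C * hinge


def margin_width(w):
    """Distance between the two margin hyperplanes, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Made-up 1-D data: -1 = benign, +1 = malignant.
X = [(1,), (2,), (5,), (6,), (9,)]
y = [-1, -1, -1, 1, 1]

narrow = ((2.0,), -11.0)  # separates every sample, but the margin is narrow
wide = ((0.5,), -2.0)     # wide margin, but the sample at x=5 violates it

for C in (0.1, 10.0):
    j_narrow = soft_margin_objective(*narrow, X, y, C)
    j_wide = soft_margin_objective(*wide, X, y, C)
    print(C, "narrow:", j_narrow, "wide:", j_wide)
```

With a small `C` the wide-margin hyperplane has the lower objective; with a large `C` the penalty dominates and the narrow, exactly separating hyperplane is preferred.
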
### Naïve Bayes
Naïve Bayes is a simple and fast classification algorithm. It is based on Bayes' theorem, which is given below:
$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$
This algorithm assumes that each variable contributes to the outcome independently and equally: the features do not depend on one another and affect the output with the same weight. This assumption rarely holds in real-life problems, so the algorithm can yield low accuracy. Gaussian Naïve Bayes is one variant of naïve Bayes. It assumes that the features follow a normal distribution, so the conditional probability of each feature given the class is Gaussian. The Gaussian naïve Bayes likelihood is given below:
$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i-\mu_y)^2}{2\sigma_y^2}\right)$$
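
The two formulas combine as follows: the Gaussian likelihood of each feature is multiplied by the class prior, and the result is normalized over the classes. The sketch below is illustrative only. The priors use the class counts quoted earlier (458 benign, 241 malignant), while the per-class means and variances, the query sample and the function names are made-up assumptions.

```python
import math


def gaussian_likelihood(x, mean, var):
    """P(x_i | y) for a normal distribution with the class mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def posterior(x, priors, params):
    """Posterior P(y | x) per class, assuming independent Gaussian features."""
    joint = {}
    for label, prior in priors.items():
        likelihood = 1.0
        for xi, (mean, var) in zip(x, params[label]):
            likelihood *= gaussian_likelihood(xi, mean, var)
        joint[label] = prior * likelihood
    evidence = sum(joint.values())  # denominator of Bayes' theorem
    return {label: value / evidence for label, value in joint.items()}


priors = {"benign": 458 / 699, "malignant": 241 / 699}
# Made-up (mean, variance) pairs for two features per class.
params = {
    "benign": [(3.0, 2.0), (1.5, 1.0)],
    "malignant": [(7.0, 6.0), (6.5, 7.0)],
}

result = posterior((8, 7), priors, params)
print(result)
```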

### Decision Tree
A decision tree (DT) is one of the most common supervised learning techniques. It can be used for both regression and classification. It solves problems by building a tree structure in which features are decision nodes and outputs are leaf nodes. Feature values are treated as categorical in the decision tree algorithm. The algorithm starts by choosing the best attribute, placing it at the top of the tree, and then splitting the tree. Gini index and information gain are two methods for selecting features.
The randomness or uncertainty of a feature x is defined as entropy and can be calculated as follows:
$$H(x) = E_x[I(x)] = -\sum p(x)\log p(x)$$
Entropy values are calculated for each variable, and information values are obtained by subtracting these values from one (more generally, from the entropy of the data before the split). The attribute with the highest information gain is the best choice and is placed at the top of the tree.
The Gini index measures how often a randomly chosen element would be incorrectly identified. Therefore, a lower Gini index indicates a better attribute. It can be found with the following formula:
$$G = \sum_{i=1}^{n} p_i\,(1-p_i)$$
A decision tree is easy to understand. However, if the data contain many features, the tree may overfit. It is therefore crucial to know when to stop growing the tree. Two methods are typically used to limit overfitting: pre-pruning, which stops growth early but makes it hard to choose a stopping point; and post-pruning, which uses cross-validation to check whether expanding the tree improves results or leads to overfitting [42,43]. A DT structure consists of a root node, splitting, decision nodes, terminal nodes, sub-trees, and parent nodes [4].
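
The entropy and Gini formulas above can be checked on a toy split. The sketch below makes two assumptions not stated in the text: logarithms are taken to base 2, and information gain is computed in its common form, the parent entropy minus the size-weighted entropy of the child nodes (for a balanced two-class parent, whose entropy is 1, this matches subtracting from one). The labels and function names are made up for illustration.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Proportion of each class among the labels."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def entropy(labels):
    """H = -sum p * log2(p) over the class proportions."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def gini(labels):
    """G = sum p * (1 - p) over the class proportions."""
    return sum(p * (1 - p) for p in class_probabilities(labels))


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["B"] * 4 + ["M"] * 4
perfect_split = [["B"] * 4, ["M"] * 4]
useless_split = [["B", "B", "M", "M"], ["B", "B", "M", "M"]]

print(entropy(parent), gini(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that separates the classes completely recovers all of the parent's entropy, while a split that leaves both children as mixed as the parent gains nothing.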

### Random and Rotation Forest
Random forest is an ensemble learning model that can be used for both regression and classification. A random forest consists of many decision trees, so in some cases it is preferable to a single decision tree.
The rotation forest algorithm builds classifiers based on attribute extraction. The attribute set is randomly split into K different subsets. It aims to create accurate and significant classifiers [45].
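
The random grouping step can be sketched as below. This covers only the partitioning of attributes into K subsets; the later steps of rotation forest (in its original formulation, a principal component analysis on each subset before a tree is trained) are not shown. The function name, the `seed` parameter and the round-robin dealing are assumptions made for illustration; the attribute names are the nine from the dataset table.

```python
import random


def split_attributes(attributes, k, seed=None):
    """Randomly partition the attribute list into k roughly equal subsets."""
    rng = random.Random(seed)
    shuffled = list(attributes)
    rng.shuffle(shuffled)
    # Deal the shuffled attributes round-robin into k groups.
    return [shuffled[i::k] for i in range(k)]


attributes = [
    "Clump thickness", "Uniformity of cell size", "Uniformity of cell shape",
    "Marginal Adhesion", "Single Epithelial cell", "Bare Nuclei",
    "Bland Chromatin", "Normal Nucleoli", "Mitoses",
]

subsets = split_attributes(attributes, k=3, seed=0)
for subset in subsets:
    print(subset)
```

Each tree in the ensemble would use a different random grouping, which is what gives the ensemble its diversity.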

[^4]


