## <center>Title: Investigation of the Original Wisconsin Breast Cancer Dataset</center>

https://www.kaggle.com/datasets/mariolisboa/breast-cancer-wisconsin-original-data-set

The original Wisconsin breast cancer dataset was created by Dr. William H. Wolberg of the University of Wisconsin Hospitals. It is composed of 10 attributes (a sample code number and nine cytological features) plus the class label, and 699 instances. Every instance belongs to one of two classes: benign (non-cancerous), coded as 2, or malignant (cancerous), coded as 4. The 16 data points with missing values were removed, leaving 683 instances in this version of the dataset. It is used for classification problems. Each feature is scored on a scale from 1 to 10 that indicates the degree of abnormality, with 10 the most abnormal.
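As a rough illustration of the cleaning step described above, the sketch below drops every row that contains a missing-value marker. It assumes the raw file marks missing entries with `?` (as the UCI distribution of this dataset does) and works on a few made-up rows rather than the real file; `drop_incomplete_rows` is a hypothetical helper name.

```python
import csv
import io

# Made-up rows in a simplified layout; "?" is the assumed missing-value marker.
RAW = """id,Clump Thickness,Bare Nuclei,Class
101,5,1,2
102,8,?,4
103,8,10,4
104,6,?,2
"""

def drop_incomplete_rows(text, missing="?"):
    """Return (kept_rows, n_dropped), skipping rows with any missing field."""
    rows = list(csv.DictReader(io.StringIO(text)))
    kept = [r for r in rows if missing not in r.values()]
    return kept, len(rows) - len(kept)

kept, n_dropped = drop_incomplete_rows(RAW)
print(len(kept), "rows kept,", n_dropped, "dropped")
```

Applied to the full dataset, the same idea would take 699 rows down to the 683 used in this notebook.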

In [30]:
import pandas as pd
datafr = pd.read_csv("tumor.csv")
datafr.head(30)

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2
5,1017122,8,10,10,8,7,10,9,7,1,4
6,1018099,1,1,1,1,2,10,3,1,1,2
7,1018561,2,1,2,1,2,1,3,1,1,2
8,1033078,2,1,1,1,2,1,1,1,5,2
9,1033078,4,2,1,1,2,1,2,1,1,2


In [7]:
datafr.dtypes

Sample code number             int64
Clump Thickness                int64
Uniformity of Cell Size        int64
Uniformity of Cell Shape       int64
Marginal Adhesion              int64
Single Epithelial Cell Size    int64
Bare Nuclei                    int64
Bland Chromatin                int64
Normal Nucleoli                int64
Mitoses                        int64
Class                          int64
dtype: object

In [8]:
datafr.describe()

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
count,683.0,683.0,683.0,683.0,683.0,683.0,683.0,683.0,683.0,683.0,683.0
mean,1076720.0,4.442167,3.150805,3.215227,2.830161,3.234261,3.544656,3.445095,2.869693,1.603221,2.699854
std,620644.0,2.820761,3.065145,2.988581,2.864562,2.223085,3.643857,2.449697,3.052666,1.732674,0.954592
min,63375.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
25%,877617.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,2.0
50%,1171795.0,4.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,2.0
75%,1238705.0,6.0,5.0,5.0,4.0,4.0,6.0,5.0,4.0,1.0,4.0
max,13454350.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,4.0


In [9]:
datafr["Clump Thickness"].count()

683

<b>Bayesian Network Classifier:</b>

In its simplest form (naive Bayes), this classifier assumes that the features are independent of one another given the class; more general Bayesian networks relax this and encode only some conditional independencies. "A Bayesian network classifier is simply a Bayesian network applied to classification, that is, the prediction of the probability $P(c \mid x)$ of some discrete (class) variable $C$ given some features $X$."
"A Bayesian network classifier is a Bayesian network used for predicting a discrete class variable $C$. It assigns $x$, an observation of $n$ predictor variables (features) $X = (X_1, \ldots, X_n)$, to the most probable class: $c^* = \arg\max_c P(c \mid x) = \arg\max_c P(x, c)$."
"The classifier factorizes $P(x, c)$ according to a Bayesian network $B = \langle G, \theta \rangle$. $G$ is a directed acyclic graph with a node for each variable in $(X, C)$, encoding conditional independencies: a variable $X$ is independent of its nondescendants in $G$ given the values $pa(x)$ of its parents. $G$ thus factorizes the joint into local (conditional) distributions over subsets of variables: $P(x, c) = P(c \mid pa(c)) \prod_{i=1}^{n} P(x_i \mid pa(x_i))$." [1]

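The factorization quoted above can be made concrete with the simplest Bayesian network classifier, naive Bayes, in which the class has no parents and is the only parent of every feature, so that $P(x, c) = P(c) \prod_i P(x_i \mid c)$. The sketch below is only an illustration: the probability tables are invented rather than estimated from the dataset, the two features are reduced to `low`/`high`, and `joint`, `posterior` and `classify` are hypothetical helper names.

```python
# Toy naive Bayes: P(x, c) = P(c) * prod_i P(x_i | c).
# All probabilities below are invented for illustration.
PRIOR = {"benign": 0.65, "malignant": 0.35}

# COND[feature][class][value] = P(feature = value | class)
COND = {
    "cell_size": {
        "benign": {"low": 0.90, "high": 0.10},
        "malignant": {"low": 0.20, "high": 0.80},
    },
    "bare_nuclei": {
        "benign": {"low": 0.85, "high": 0.15},
        "malignant": {"low": 0.25, "high": 0.75},
    },
}

def joint(x, c):
    """Joint probability P(x, c) under the naive Bayes factorization."""
    p = PRIOR[c]
    for feature, value in x.items():
        p *= COND[feature][c][value]
    return p

def posterior(x):
    """Normalize the joint over classes to obtain P(c | x)."""
    scores = {c: joint(x, c) for c in PRIOR}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

def classify(x):
    """Return the class maximizing P(x, c), which also maximizes P(c | x)."""
    return max(PRIOR, key=lambda c: joint(x, c))

sample = {"cell_size": "high", "bare_nuclei": "high"}
print(classify(sample), posterior(sample))
```

Because $P(x)$ is the same for every class, maximizing the joint and maximizing the posterior pick the same class, which is why `classify` never needs to normalize.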

Using the Bayesian network classifier, one study reported an accuracy of 97.14% on the original data, 97.28% when discretization was applied, and 97.42% when discretization with equal-frequency mode was applied. Discretization sorts continuous variables into discrete bins, which can improve the performance of a classifier. Equal-frequency mode chooses the bins so that each one holds roughly the same number of data points. [2]
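Equal-frequency discretization can be sketched in a few lines. The version below ranks the values and cuts the ranking into equally sized groups; it is a minimal illustration under that assumption, not the procedure used in the cited study, and `equal_frequency_bins` is a hypothetical helper name.

```python
from collections import Counter

def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold (nearly) equal counts."""
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # Map rank 0..n-1 onto bin 0..n_bins-1 in equal-sized chunks.
        bins[i] = rank * n_bins // len(values)
    return bins

values = [5, 1, 10, 3, 8, 2, 9, 4, 7]
bins = equal_frequency_bins(values, 3)
print(bins)           # [1, 0, 2, 0, 2, 0, 2, 1, 1]
print(Counter(bins))  # three values in each bin
```

One caveat of this rank-based approach is that tied values can end up in different bins. Quantile-based tools such as `pandas.qcut` derive bin edges from the data instead, so ties always share a bin.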

Across the discretization settings, the J48 classifier performed best on the original data, with an accuracy of 94.56%. After discretization was applied the accuracy was 94.42%, and with discretization and equal-frequency mode applied it was 93.56%. When missing values were replaced with the mean obtained from the training data, the accuracy was 95.14%, and when the instances with missing values were removed, accuracy rose to 96.05%. Discretizing the replaced and the removed versions produced accuracies of 94.42% and 93.41% respectively. Removing attributes reduced the accuracy. The "Select Attributes" function was used to calculate the worth of each attribute by measuring its information gain with respect to the class, and this ranking determined the order in which attributes were removed from the data that had the missing values removed. With "8 plus the class" the accuracy was 95.75%, with "7 plus the class" 95.9%, and with "3 plus the class" 95.61%. Interestingly, the false-negative rate of 1.61% was the same with all attributes included as with "3 plus the class".
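Information gain with respect to the class is the drop in class entropy once an attribute's value is known. The sketch below shows the calculation on invented toy columns; it illustrates the measure itself rather than the "Select Attributes" tool, and `entropy` and `information_gain` are hypothetical helper names.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(feature, labels):
    """Class entropy minus the expected class entropy after splitting on feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Invented toy data: class 2 = benign, 4 = malignant.
labels = [2, 2, 2, 2, 4, 4, 4, 4]
columns = {
    "informative": [1, 1, 1, 1, 10, 10, 10, 10],   # separates the classes perfectly
    "uninformative": [1, 10, 1, 10, 1, 10, 1, 10],  # says nothing about the class
}

gains = {name: information_gain(col, labels) for name, col in columns.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(gains, ranking)
```

Attributes at the bottom of such a ranking are the natural first candidates for removal, which matches the order of removal described above.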

In [10]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn import preprocessing
%matplotlib inline
import matplotlib.pyplot as plt

In [11]:
datafr.drop('Sample code number',axis=1,inplace=True)

In [12]:
datafr.columns

Index(['Clump Thickness', 'Uniformity of Cell Size',
       'Uniformity of Cell Shape', 'Marginal Adhesion',
       'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
       'Normal Nucleoli', 'Mitoses', 'Class'],
      dtype='object')

In [13]:
# Caution: columns[1:31] skips 'Clump Thickness' and keeps the 'Class' label among the features.
data_df = list(datafr.columns[1:31])
data_df_main = datafr.loc[:,data_df]

In [14]:
data_df_main

Unnamed: 0,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1,1,1,2,1,3,1,1,2
1,4,4,5,7,10,3,2,1,2
2,1,1,1,2,2,3,1,1,2
3,8,8,1,3,4,3,7,1,2
4,1,1,3,2,1,3,1,1,2
...,...,...,...,...,...,...,...,...,...
678,1,1,1,3,2,1,1,1,2
679,1,1,1,2,1,1,1,1,2
680,10,10,3,7,3,8,10,2,4
681,8,6,4,3,4,10,6,1,4


In [15]:
datafr['Class'].unique()

array([2, 4], dtype=int64)

In [16]:
datafr['Mitoses'].unique()

array([ 1,  5,  4,  2,  3,  7, 10,  8,  6], dtype=int64)

In [17]:
X = data_df_main
y = datafr['Class']

In [18]:
svm_model = SVC()

parameters = [
              {'C': [1, 10, 100, 1000], 
               'kernel': ['linear']
              },
              
 ]

In [19]:
grid_svm = GridSearchCV(svm_model, parameters, cv=20, scoring="accuracy")
grid_svm.fit(X,y)

GridSearchCV(cv=20, estimator=SVC(),
             param_grid=[{'C': [1, 10, 100, 1000], 'kernel': ['linear']}],
             scoring='accuracy')

In [20]:
print(grid_svm.best_score_)

1.0
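A cross-validated accuracy of exactly 1.0 is usually worth a second look. The likely cause here is that `datafr.columns[1:31]` keeps the `Class` column in the feature matrix, so the model can read the answer directly from its inputs. The sketch below checks this using only the column names printed by `datafr.columns`; `leaky_features` and `safe_features` are names introduced for illustration.

```python
# Column names as printed by datafr.columns after dropping the sample code number.
columns = ['Clump Thickness', 'Uniformity of Cell Size',
           'Uniformity of Cell Shape', 'Marginal Adhesion',
           'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
           'Normal Nucleoli', 'Mitoses', 'Class']
label = 'Class'

# The slice used for the grid search: the end index 31 is simply clipped to the length.
leaky_features = columns[1:31]
print(label in leaky_features)              # True: the label is among the features
print('Clump Thickness' in leaky_features)  # False: a real feature was dropped

# Selecting by name avoids both problems.
safe_features = [c for c in columns if c != label]
print(len(safe_features))                   # 9
```

The later cells, which select `datafr.columns[0:9]`, use the nine features without the label, so their scores are the ones to trust.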


In [21]:
data_df = list(datafr.columns[0:9]) 
data_df_main = datafr.loc[:,data_df]

In [22]:
data_df_main

Unnamed: 0,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses
0,5,1,1,1,2,1,3,1,1
1,5,4,4,5,7,10,3,2,1
2,3,1,1,1,2,2,3,1,1
3,6,8,8,1,3,4,3,7,1
4,4,1,1,3,2,1,3,1,1
...,...,...,...,...,...,...,...,...,...
678,3,1,1,1,3,2,1,1,1
679,2,1,1,1,2,1,1,1,1
680,5,10,10,3,7,3,8,10,2
681,4,8,6,4,3,4,10,6,1


In [23]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(data_df_main, datafr['Class'], test_size=0.33, random_state=42)

In [24]:
svc = SVC()

In [25]:
svc.fit(xtrain, ytrain)

SVC()

Predictions:

In [26]:
preds = svc.predict(xtest)

In [27]:
from sklearn.metrics import classification_report, confusion_matrix

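The confusion matrix contains the true positives, false positives, false negatives and true negatives. In scikit-learn's layout, rows are the true classes and columns are the predicted classes, in sorted label order (2, then 4):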

In [28]:
confusion_matrix(ytest, preds)

array([[139,   3],
       [  8,  76]], dtype=int64)

An accuracy of $95\%$ was achieved using support vector machines, meaning that $95\%$ of the test-set predictions were correct. The F1-score is the harmonic mean of precision and recall, so it is lowered by both false positives and false negatives:

In [35]:
classification_report(ytest, preds, output_dict=True)

{'2': {'precision': 0.9455782312925171,
  'recall': 0.9788732394366197,
  'f1-score': 0.9619377162629758,
  'support': 142},
 '4': {'precision': 0.9620253164556962,
  'recall': 0.9047619047619048,
  'f1-score': 0.9325153374233128,
  'support': 84},
 'accuracy': 0.9513274336283186,
 'macro avg': {'precision': 0.9538017738741067,
  'recall': 0.9418175720992623,
  'f1-score': 0.9472265268431443,
  'support': 226},
 'weighted avg': {'precision': 0.9516913071938756,
  'recall': 0.9513274336283186,
  'f1-score': 0.9510019648358443,
  'support': 226}}
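The figures in the report can be re-derived by hand from the confusion matrix shown earlier, which is a useful sanity check on how each metric is defined. The sketch below hard-codes that matrix; `per_class_metrics` is a hypothetical helper name.

```python
# Confusion matrix from the notebook: rows are true classes (2, 4),
# columns are predicted classes (2, 4).
cm = [[139, 3],
      [8, 76]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 when class index k is treated as the positive class."""
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp   # predicted k, but truly another class
    fn = sum(cm[k]) - tp                  # truly k, but predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

print(round(accuracy, 4))
for k, name in enumerate(["2 (benign)", "4 (malignant)"]):
    print(name, [round(m, 4) for m in per_class_metrics(cm, k)])
```

The recall for class 4 is the lowest number in the report because 8 of the 84 malignant cases in the test set were predicted as benign.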

In one study, the SVM model applied to the original dataset achieved an accuracy of $97.1\%$ when missing values were replaced with the mean and $97.8\%$ when they were replaced with the median [3].

The recall and precision results from the study:

| Metric    | Mean  | Median |
|-----------|-------|--------|
| Accuracy  | 97.1% | 97.8%  |
| Recall    | 97.0% | 98.0%  |
| Precision | 97.0% | 97.0%  |
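Mean and median replacement differ only in the value used to fill the gaps. The sketch below shows both on an invented, skewed column in which `None` stands for a missing entry; it is a minimal illustration, not the pipeline used in the cited study, and `impute` is a hypothetical helper name.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError("strategy must be 'mean' or 'median'")
    return [fill if v is None else v for v in values]

# Invented, skewed sample: mostly low scores with a few 10s.
bare_nuclei = [1, 10, None, 1, 2, None, 10, 1]

mean_filled = impute(bare_nuclei, "mean")
median_filled = impute(bare_nuclei, "median")
print(mean_filled[2], median_filled[2])
```

On a skewed feature the mean is pulled towards the few extreme scores while the median stays with the bulk of the data, which is one plausible reason the two strategies can lead to slightly different accuracies.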

### References:

[1] https://cran.r-project.org/web/packages/bnclassify/vignettes/overview.pdf

[2] https://www.researchgate.net/publication/311950799_Analysis_of_the_Wisconsin_Breast_Cancer_Dataset_and_Machine_Learning_for_Breast_Cancer_Detection

[3] biomedinformatics-02-00022-v2.pdf (cited from a local copy; no public URL was given)