# Breast Cancer Diagnosis - Ensemble and Pipelines

## Dataset Information

This is the "Breast cancer wisconsin (diagnostic) dataset". It contains 569 examples, each described by 30 features that capture various properties of the identified tumour.

The 30 features are derived from the following ten measured properties:

- radius (mean of distances from center to points on the perimeter)

- texture (standard deviation of gray-scale values)

- perimeter

- area

- smoothness (local variation in radius lengths)

- compactness (perimeter^2 / area - 1.0)

- concavity (severity of concave portions of the contour)

- concave points (number of concave portions of the contour)

- symmetry

- fractal dimension (“coastline approximation” - 1)
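
To see how the listed properties map onto the 30 feature columns, the dataset can be inspected directly. This is a minimal sketch that only uses attributes exposed by `load_breast_cancer()`; in the scikit-learn copy of the data, each property appears three times (as a mean, a standard error, and a "worst" value).

```python
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer()

# Number of examples and number of features.
print(breast_cancer.data.shape)

# The first ten columns hold the "mean" variant of each property.
print(breast_cancer.feature_names[:10])

# Class names in label order: index 0 first, index 1 second.
print(breast_cancer.target_names)
```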

## Classification Task

**We use the given information about a tumour to predict whether it is malignant (0) or benign (1).**
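
The label encoding and the class balance can be checked with a short sketch like the one below. Knowing the class balance is useful context for reading the accuracy figures later on, since the two classes are not equally frequent.

```python
import numpy as np
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer()

# Count how many examples carry label 0 and label 1.
counts = np.bincount(breast_cancer.target)

for label, (name, count) in enumerate(zip(breast_cancer.target_names, counts)):
    print(label, name, count)
```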

## Separation into Training/Test Sets

In [70]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

breast_cancer = datasets.load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target
### 0 - malignant, 1 - benign

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
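
The effect of `stratify=y` can be verified by comparing the share of benign examples in the full data, the training set, and the test set. The sketch below repeats the split above with the same arguments; the loop and its printout are only illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# With labels 0/1, the mean of the label vector is the benign fraction.
for name, labels in [('full', y), ('train', y_train), ('test', y_test)]:
    print(name, labels.shape[0], round(float(np.mean(labels)), 3))
```

With stratification, the three fractions should agree up to rounding, so neither split is skewed toward one class.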

## Classifier and Pipeline Initializations 

#### Feature Scaling
Feature scaling brings all features into a similar range. It is performed with **StandardScaler**, which standardizes each feature to zero mean and unit variance.
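
As a small illustration of what standardization does, the sketch below scales a toy array whose two columns live on very different scales. The array `X_demo` is made up for this example and is not part of the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the second column is 100x larger than the first.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# After scaling, every column has mean 0 and standard deviation 1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

Inside a pipeline, the scaler is fitted only on the data passed to `fit`, so during cross-validation the held-out fold does not influence the scaling statistics.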

#### Feature Transformers
The feature transformer we use is PCA (Principal Component Analysis), which performs dimensionality reduction.
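
To get a feel for how much information survives a reduction to 5 components, one can look at the explained variance ratio. This is a rough sketch that, for simplicity, fits the scaler and PCA on the full dataset; the pipelines in this notebook fit them on training data only.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = datasets.load_breast_cancer(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=5).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

print(X_reduced.shape)

# Fraction of the total variance retained by the 5 components.
explained = pca.explained_variance_ratio_.sum()
print(round(float(explained), 3))
```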

#### Classifiers and Pipelines

##### Perceptron
The perceptron pipeline uses the Perceptron model with a learning rate of 1.0, after feature scaling and dimensionality reduction to 5 principal components.

##### Decision Tree
The decision tree pipeline uses the Decision Tree model with a maximum tree depth of 6 and Gini impurity as the split criterion, after feature scaling and dimensionality reduction to 5 principal components.

##### SVM
The SVM pipeline uses the Support Vector Machine model with an RBF kernel (`C=10.0`, `gamma=0.1`) after feature scaling; no dimensionality reduction is applied here.

In [71]:
import numpy as np

## transformers
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## classifiers
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC 

## pipeline
from sklearn.pipeline import make_pipeline

## cross validation
from sklearn.model_selection import cross_val_score

## perceptron pipeline
perceptron_pipe = make_pipeline(StandardScaler(), PCA(n_components=5), Perceptron(eta0=1.0, random_state=1))

## decision tree pipeline
tree_pipe = make_pipeline(StandardScaler(),
                          PCA(n_components=5),
                          DecisionTreeClassifier(max_depth=6, criterion='gini', random_state=1))

## SVM pipeline
svm_pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=10.0, gamma=0.1, random_state=1))
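
`make_pipeline` names each step automatically after its lowercased class name. Those names matter later, for example when addressing a step's hyperparameters. The sketch below rebuilds the perceptron pipeline under the local name `pipe` just to show how the steps can be inspected.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import Perceptron
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=5),
                     Perceptron(eta0=1.0, random_state=1))

# Step names are generated from the class names.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Individual steps can be looked up by name.
print(pipe.named_steps['pca'].n_components)
```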

## 10-Fold Cross Validation

Note: I ran the cross-validation process several times beforehand to tune some hyperparameters.
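
The tuning itself is not shown in this notebook. One way such a search could be automated is with `GridSearchCV`, sketched below for the SVM pipeline. The candidate values in `param_grid` are hypothetical and chosen only for illustration; they are not the values that were actually tried.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

svm_pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', random_state=1))

# Hypothetical search grid; parameters are addressed as <step name>__<parameter>.
param_grid = {
    'svc__C': [1.0, 10.0, 100.0],
    'svc__gamma': [0.01, 0.1, 1.0],
}

# 10-fold cross-validation on the training set only; the test set stays untouched.
grid = GridSearchCV(svm_pipe, param_grid, cv=10, scoring='accuracy')
grid.fit(X_train, y_train)

print(grid.best_params_)
print(round(float(grid.best_score_), 3))
```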

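According to 10-fold cross-validation, the Perceptron pipeline is expected to generalize best to the test set (mean accuracy 0.97), followed by the SVM pipeline (0.96) and the Decision Tree pipeline (0.95). The differences are small, however, and lie within one standard deviation of the fold scores.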

In [72]:
## labels of classifiers used
clf_labels = ['Perceptron', 'Decision tree', 'SVM']

print('10-fold cross validation:\n')

for clf, label in zip([perceptron_pipe, tree_pipe, svm_pipe], clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='accuracy')
    print("Accuracy: " + str(round(scores.mean(), 2)) + 
          " Stdev: " + str(round(scores.std(), 3)) +
          " [" + label + "]")

10-fold cross validation:

Accuracy: 0.97 Stdev: 0.028 [Perceptron]
Accuracy: 0.95 Stdev: 0.036 [Decision tree]
Accuracy: 0.96 Stdev: 0.039 [SVM]
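
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds. The sketch below makes this explicit by passing a `StratifiedKFold` object and comparing the per-fold scores with those from `cv=10`; the variable names `scores_int` and `scores_explicit` are introduced only for this comparison.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

svm_pipe = make_pipeline(StandardScaler(),
                         SVC(kernel='rbf', C=10.0, gamma=0.1, random_state=1))

# Integer cv: the splitter is chosen automatically.
scores_int = cross_val_score(svm_pipe, X_train, y_train, cv=10, scoring='accuracy')

# Explicit splitter: 10 stratified folds without shuffling.
scores_explicit = cross_val_score(svm_pipe, X_train, y_train,
                                  cv=StratifiedKFold(n_splits=10), scoring='accuracy')

print(np.allclose(scores_int, scores_explicit))
```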


## Creating the Ensemble Model

Here we combine the three pipelines into an ensemble model using the majority voting rule.
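
The majority voting rule can be illustrated by hand with a few made-up predictions. In the sketch below, the array `predictions` is hypothetical: each row stands for one classifier and each column for one example. This mirrors hard voting, which is the default mode of `VotingClassifier`.

```python
import numpy as np

# Hypothetical predictions from three classifiers for four examples.
predictions = np.array([[0, 1, 1, 0],   # classifier 1
                        [0, 1, 0, 0],   # classifier 2
                        [1, 1, 1, 0]])  # classifier 3

# With binary 0/1 labels, the column sum is the number of votes for class 1.
votes_for_benign = predictions.sum(axis=0)

# Class 1 wins when it gets more than half of the votes.
n_classifiers = predictions.shape[0]
majority = (votes_for_benign > n_classifiers / 2).astype(int)

print(majority)
```

Using an odd number of classifiers on a binary task avoids ties. Hard voting also only needs class predictions, which suits the Perceptron since it does not provide `predict_proba`.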

Cross-validating the ensemble alongside the individual pipelines shows that it is expected to generalize best to the test data (mean accuracy 0.98).

In [73]:
## ensemble model
from sklearn.ensemble import VotingClassifier

ensemble_clf = VotingClassifier(estimators=[('p', perceptron_pipe), ('dt', tree_pipe), ('svm', svm_pipe)])

clf_labels += ['Majority voting']
all_clf = [perceptron_pipe, tree_pipe, svm_pipe, ensemble_clf]

for clf, label in zip(all_clf, clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='accuracy')
    print("Accuracy: " + str(round(scores.mean(), 2)) + 
          " Stdev: " + str(round(scores.std(), 3)) +
          " [" + label + "]")

Accuracy: 0.97 Stdev: 0.028 [Perceptron]
Accuracy: 0.95 Stdev: 0.036 [Decision tree]
Accuracy: 0.96 Stdev: 0.039 [SVM]
Accuracy: 0.98 Stdev: 0.03 [Majority voting]


In [74]:
## final testing of all pipelines and the ensemble on the held-out test set

for clf, label in zip(all_clf, clf_labels):
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print('------' + label + '------')
    print('Misclassified test set examples:', (y_test != y_pred).sum())
    print('Out of a total of:', y_test.shape[0])
    print('Accuracy:', clf.score(X_test, y_test))

------Perceptron------
Misclassified test set examples: 9
Out of a total of: 171
Accuracy: 0.9473684210526315
------Decision tree------
Misclassified test set examples: 13
Out of a total of: 171
Accuracy: 0.9239766081871345
------SVM------
Misclassified test set examples: 10
Out of a total of: 171
Accuracy: 0.9415204678362573
------Majority voting------
Misclassified test set examples: 5
Out of a total of: 171
Accuracy: 0.9707602339181286
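
Accuracy alone does not show which kind of error a model makes. As an optional extra, the sketch below rebuilds the ensemble and prints its confusion matrix, separating malignant tumours predicted as benign from benign tumours predicted as malignant. The names `missed_malignant` and `false_alarms` are introduced here for illustration.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import Perceptron
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

perceptron_pipe = make_pipeline(StandardScaler(), PCA(n_components=5),
                                Perceptron(eta0=1.0, random_state=1))
tree_pipe = make_pipeline(StandardScaler(), PCA(n_components=5),
                          DecisionTreeClassifier(max_depth=6, criterion='gini', random_state=1))
svm_pipe = make_pipeline(StandardScaler(),
                         SVC(kernel='rbf', C=10.0, gamma=0.1, random_state=1))

ensemble_clf = VotingClassifier(
    estimators=[('p', perceptron_pipe), ('dt', tree_pipe), ('svm', svm_pipe)])
ensemble_clf.fit(X_train, y_train)
y_pred = ensemble_clf.predict(X_test)

# Rows are true labels, columns are predicted labels, in the order [0, 1].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
missed_malignant = cm[0, 1]  # true malignant (0) predicted as benign (1)
false_alarms = cm[1, 0]      # true benign (1) predicted as malignant (0)

print(cm)
print('Malignant predicted as benign:', missed_malignant)
print('Benign predicted as malignant:', false_alarms)
```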


## Results

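As 10-fold cross-validation predicted, the majority voting ensemble performs best on the test set (5 of 171 examples misclassified, accuracy 0.971),
followed by the Perceptron (9 misclassified, 0.947), SVM (10 misclassified, 0.942), and Decision Tree (13 misclassified, 0.924) pipelines.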