# Ensembles Assignment

## Read data

In [15]:
from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset
breast_cancer = load_breast_cancer()

# Print the number of instances and features in the dataset
print("Number of instances:", breast_cancer.data.shape[0])
print("Number of features:", breast_cancer.data.shape[1])
print("Number of classes:", len(set(breast_cancer.target)))

Number of instances: 569
Number of features: 30
Number of classes: 2


### Description of the data:
- The code above loads the Breast Cancer Wisconsin (Diagnostic) dataset that ships with scikit-learn. It contains 569 tumor samples, each described by 30 numeric features.
- The dataset has 2 classes.
- The target is binary and indicates whether a breast mass is malignant or benign.
<br>
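To make the class description concrete, the short sketch below prints the class names and how many samples belong to each. It only uses `target_names` and `target` from the loaded dataset. The variable names `class_names` and `class_counts` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()

# target_names maps the integer labels to class names (label 0 first, then label 1)
class_names = breast_cancer.target_names

# Count how many samples carry each integer label
class_counts = np.bincount(breast_cancer.target)

for name, count in zip(class_names, class_counts):
    share = count / len(breast_cancer.target)
    print(f"{name}: {count} samples ({share:.1%})")
```

The classes are moderately imbalanced, which is one reason a stratified split is a sensible choice in the next step.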

### Description of classification task:

- I am performing <strong>binary classification</strong> as there are only two possible classes for each instance - malignant or benign.
<br>
<br>

## Divide data into training & testing sets

In [16]:
from sklearn.model_selection import train_test_split

X, y = breast_cancer.data, breast_cancer.target

X_train, X_test, y_train, y_test =\
        train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

<br>
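As a quick sanity check of the split, the sketch below repeats it with the same parameters and compares the class proportions in the full, training, and testing sets. With `stratify=y` the proportions should be nearly identical. The loop and the variable `share` are illustrative additions, not part of the assignment code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same split as above: 30% test data, fixed seed, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    # Fraction of samples with label 0 and label 1
    share = np.bincount(labels) / len(labels)
    print(f"{name:5s} n={len(labels):3d}  label0={share[0]:.3f}  label1={share[1]:.3f}")
```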

## Comparing classification algorithms using Pipelines & Cross-Validation

In [17]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
import time


pipe1 = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0, random_state=1))

pipe2 = make_pipeline(DecisionTreeClassifier(max_depth=8,
                                             criterion='entropy',
                                             random_state=0))

pipe3 = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15,
                                                             p=2,
                                                             metric='minkowski'))


clf_labels = ['SVM', 'Decision tree', 'KNN']

print('20-fold cross validation:\n')
for clf, label in zip([pipe1, pipe2, pipe3], clf_labels):
    
    start_time = time.time()

    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=20,
                             scoring='accuracy')
    
    end_time = time.time()

    print("Accuracy: " + str(round(scores.mean(), 2)) + 
          " Stdev: " + str(round(scores.std(), 3)) +
          " Time: " + str(round(end_time - start_time, 3)) +
          " [" + label + "]")

20-fold cross validation:

Accuracy: 0.97 Stdev: 0.029 Time: 0.166 [SVM]
Accuracy: 0.91 Stdev: 0.059 Time: 0.285 [Decision tree]
Accuracy: 0.95 Stdev: 0.044 Time: 0.445 [KNN]


### Descriptions of the models used in each pipeline:

- Pipe 1: <strong>SVM classifier</strong> with a linear kernel, a regularization parameter C of 1.0, and a random state of 1.
- Pipe 2: <strong>Decision Tree Classifier</strong> with a maximum depth of 8, entropy as the splitting criterion, and a random state of 0.
- Pipe 3: <strong>K-Nearest Neighbors (KNN) classifier</strong> with n_neighbors set to 15 and the 'minkowski' metric with p=2, which is the Euclidean distance.

Note that pipe 1 and pipe 3 include a preprocessing step that standardizes the features with <strong>StandardScaler</strong>. Pipe 2 has no scaling step because decision tree splits do not depend on the scale of the features.
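The sketch below illustrates what putting the scaler inside the pipeline means in practice: calling `fit` on the pipeline fits the scaler on the training data only, so no statistics from the test data leak into preprocessing. The step names `'standardscaler'` and `'svc'` are the names `make_pipeline` generates automatically from the class names. The variable `scaler` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

pipe = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0, random_state=1))
pipe.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
print('Steps:', list(pipe.named_steps))

# The fitted scaler holds the per-feature means of the training data only
scaler = pipe.named_steps['standardscaler']
print('Scaler fitted on training data only:',
      np.allclose(scaler.mean_, X_train.mean(axis=0)))
```

During cross-validation the same thing happens inside every fold: the scaler is refitted on that fold's training part before the classifier is trained.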

<br>

### Compare the performance of three models:

- The code snippet above runs <strong>20-fold cross-validation</strong> on the training data for three different models (SVM, Decision Tree, and KNN).
- The <strong>cross_val_score()</strong> function evaluates the current model (clf) on each fold, using 'accuracy' as the scoring parameter.
- The <strong>mean</strong> and <strong>standard deviation</strong> of the 20 accuracy scores are then calculated.
- The <strong>evaluation time</strong> for all 20 folds is also measured (it varies between runs and depends on the machine).
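To show what `cross_val_score()` does with `cv=20`, here is a minimal sketch that reproduces the scores with an explicit loop for the SVM pipeline. It assumes the documented scikit-learn behavior that an integer `cv` with a classifier uses `StratifiedKFold` without shuffling. The names `skf`, `manual_scores`, and `library_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

pipe = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0, random_state=1))

# Split the training data into 20 stratified folds (no shuffling)
skf = StratifiedKFold(n_splits=20)

manual_scores = []
for train_idx, val_idx in skf.split(X_train, y_train):
    # Fit a fresh copy of the pipeline on 19 folds, score it on the held-out fold
    model = clone(pipe)
    model.fit(X_train[train_idx], y_train[train_idx])
    manual_scores.append(model.score(X_train[val_idx], y_train[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(pipe, X_train, y_train, cv=20, scoring='accuracy')

print('Manual mean accuracy: ', round(manual_scores.mean(), 2))
print('Library mean accuracy:', round(library_scores.mean(), 2))
```

Each of the 20 scores is the accuracy on one held-out fold, so the reported mean and standard deviation describe how the accuracy varies across folds.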

<br>

### Initial Results:

- <strong>SVM</strong> achieved the highest accuracy of 0.97 with the lowest standard deviation of 0.029. It was also the fastest, taking 0.166 seconds in total for all 20 folds in the run shown above (timings vary between runs and machines).
<br>

- <strong>Decision Tree</strong> achieved an accuracy of 0.91, which is lower than SVM, with a higher standard deviation of 0.059. It took longer than SVM, 0.285 seconds in total for all 20 folds.
<br>

- <strong>KNN</strong> achieved an accuracy of 0.95 with a standard deviation of 0.044, which is better than the Decision Tree but worse than SVM. It took the longest, 0.445 seconds in total for all 20 folds.
<br>

In general, the SVM algorithm seems to have the best overall performance, with the highest accuracy, lowest standard deviation, and shortest training and evaluation time. 
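The timings reported above are wall-clock totals around the whole `cross_val_score()` call. As a possible alternative, scikit-learn's `cross_validate()` returns the fit time and score time of every fold separately, which shows where the time is spent. The sketch below applies it to the KNN pipeline; the variable `results` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=15, p=2, metric='minkowski'))

# cross_validate returns a dict with one array entry per fold for each key
results = cross_validate(knn, X_train, y_train, cv=20, scoring='accuracy')

print('Mean accuracy:      ', round(results['test_score'].mean(), 2))
print('Total fit time (s): ', round(results['fit_time'].sum(), 3))
print('Total score time (s):', round(results['score_time'].sum(), 3))
```

For KNN, fitting mostly stores the training data and the distance computations happen at prediction time, so it is worth looking at the score time and not only the fit time.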

<br>

## Ensemble Learning with Voting Classifier

In [18]:
from sklearn.ensemble import VotingClassifier

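# voting='hard' is the default: the ensemble predicts the majority of the class labels
mv_clf = VotingClassifier(estimators=[('p', pipe1), ('dt', pipe2), ('kn', pipe3)],
                          voting='hard')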

clf_labels += ['Majority voting']
all_clf = [pipe1, pipe2, pipe3, mv_clf]

print('20-fold cross validation:\n')
for clf, label in zip(all_clf, clf_labels):
    
    start_time = time.time()

    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=20,
                             scoring='accuracy')

    end_time = time.time()

    print("Accuracy: " + str(round(scores.mean(), 2)) + 
          " Stdev: " + str(round(scores.std(), 3)) +
          " Time: " + str(round(end_time - start_time, 3)) +          
          " [" + label + "]")

20-fold cross validation:

Accuracy: 0.97 Stdev: 0.029 Time: 0.172 [SVM]
Accuracy: 0.91 Stdev: 0.059 Time: 0.267 [Decision tree]
Accuracy: 0.95 Stdev: 0.044 Time: 0.42 [KNN]
Accuracy: 0.96 Stdev: 0.042 Time: 0.814 [Majority voting]


### Description of the ensemble method:
- I used an ensemble method called <strong>Majority Voting</strong>. 
- The <strong>VotingClassifier</strong> class combines the predictions of the three pipelines into a single prediction. With its default setting (voting='hard'), it takes the majority vote of their predicted class labels.
- By combining them, the ensemble can benefit from the strengths of each individual model, and an error made by one model can be outvoted by the other two.
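The majority vote can be sketched by hand to see what the ensemble computes. The example below fits the three pipelines, counts their votes for each test example, and compares the result with `VotingClassifier` using `voting='hard'`. The names `svm`, `tree`, `knn`, `votes`, and `manual_majority` are illustrative, and the vote-counting shortcut assumes the labels are 0 and 1 with three voters.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

svm = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0, random_state=1))
tree = make_pipeline(DecisionTreeClassifier(max_depth=8, criterion='entropy',
                                            random_state=0))
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=15, p=2, metric='minkowski'))

# One row of predicted labels per model, one column per test example
votes = np.array([m.fit(X_train, y_train).predict(X_test) for m in (svm, tree, knn)])

# With labels 0/1 and three voters, the majority is 1 when at least two models predict 1
manual_majority = (votes.sum(axis=0) >= 2).astype(int)

ensemble = VotingClassifier(estimators=[('svm', svm), ('dt', tree), ('kn', knn)],
                            voting='hard')
ensemble_pred = ensemble.fit(X_train, y_train).predict(X_test)

print('Manual vote matches VotingClassifier:',
      np.array_equal(manual_majority, ensemble_pred))
```

With three voters and two classes a tie cannot occur, so the manual count and the ensemble agree on every example.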

### Summary of cross-validation results:
- The results show the accuracy and standard deviation of each algorithm, as well as the time it took to run each algorithm. 
<br>

- <strong>SVM</strong> has the highest accuracy of 0.97 with a standard deviation of 0.029 and took the least time to run, 0.172 seconds.
<br>

- <strong>Decision Tree</strong> has an accuracy of 0.91 with a higher standard deviation of 0.059 and took longer to run, 0.267 seconds.
<br>

- <strong>KNN</strong> has an accuracy of 0.95 with a standard deviation of 0.044 and took longer to run than SVM, 0.42 seconds.
<br>

- <strong>Majority Voting</strong> has an accuracy of 0.96 with a standard deviation of 0.042 and took the longest time to run, 0.814 seconds.
<br>

In summary, SVM is the best-performing algorithm in terms of accuracy, standard deviation, and computation time, while Decision Tree has the lowest accuracy and the highest standard deviation. KNN and Majority Voting fall in between on accuracy and standard deviation. Majority Voting does not improve on the best single model here, and it is the slowest because it has to fit all three pipelines in every fold.

## Train each pipeline individually & evaluate on the testing data

### SVM

In [19]:
pipe1.fit(X_train, y_train)

y_pred = pipe1.predict(X_test)

print('SVM:')

print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', y_test.shape[0])
print('Accuracy:', pipe1.score(X_test, y_test))

SVM:
Misclassified test set examples: 8
Out of a total of: 171
Accuracy: 0.9532163742690059


### Decision Tree

In [20]:
pipe2.fit(X_train, y_train)

y_pred = pipe2.predict(X_test)

print('Decision Tree:')

print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', y_test.shape[0])
print('Accuracy:', pipe2.score(X_test, y_test))

Decision Tree:
Misclassified test set examples: 9
Out of a total of: 171
Accuracy: 0.9473684210526315


### KNN

In [21]:
pipe3.fit(X_train, y_train)

y_pred = pipe3.predict(X_test)

print('KNN:')

print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', y_test.shape[0])
print('Accuracy:', pipe3.score(X_test, y_test))

KNN:
Misclassified test set examples: 8
Out of a total of: 171
Accuracy: 0.9532163742690059


### Majority Voting

In [22]:
mv_clf.fit(X_train, y_train)

y_pred = mv_clf.predict(X_test)

print('Majority Voting:')

print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', y_test.shape[0])
print('Accuracy:', mv_clf.score(X_test, y_test))

Majority Voting:
Misclassified test set examples: 8
Out of a total of: 171
Accuracy: 0.9532163742690059


### Summary of the results of testing on the testing data:

The results of testing on the testing data show that all four models (SVM, decision tree, KNN, and majority voting) have high accuracy in predicting the test set examples.
<br>

- The <strong>SVM</strong> model achieved an accuracy of 0.953, with only 8 misclassified test set examples out of a total of 171. 
<br>

- The <strong>KNN</strong> and <strong>majority voting</strong> models also achieved an accuracy of 0.953, with only 8 misclassified examples each. 
<br>

- The <strong>decision tree</strong> model achieved a slightly lower accuracy of 0.947, with 9 misclassified examples.

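Overall, the testing results are broadly <strong>consistent</strong> with those of cross-validation: all four models reach an accuracy of about 0.95. The ranking is less clear-cut on the testing data, though. SVM drops from 0.97 in cross-validation to 0.953, while the decision tree rises from 0.91 to 0.947. With 171 test examples, one misclassification changes the accuracy by about 0.006, so the single-example gap between the decision tree and the other models is too small to separate them. Nonetheless, all models performed well on the testing data, indicating their potential usefulness for predicting new data.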
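Accuracy alone does not show which kind of error a model makes. As a possible follow-up, the sketch below builds a confusion matrix for the SVM pipeline on the testing data and checks that its off-diagonal entries add up to the misclassified count. Using `confusion_matrix` here is an addition for illustration and was not part of the original analysis; the names `cm` and `n_errors` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

pipe = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0, random_state=1))
y_pred = pipe.fit(X_train, y_train).predict(X_test)

# Rows are true labels, columns are predicted labels (0 = malignant, 1 = benign)
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Off-diagonal entries are the misclassified examples
n_errors = cm.sum() - np.trace(cm)
print('Misclassified:', n_errors, 'out of', cm.sum())
print('Accuracy:', np.trace(cm) / cm.sum())
```

In this dataset label 0 is malignant, so the top-right entry counts malignant tumors predicted as benign, which is usually the more costly mistake.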