# Papers With[out] code

This is the notebook from our in-class attempt to replicate the Kumar *et al.* (2022) paper [Optimized Stacking Ensemble Learning Model for Breast Cancer Detection and Classification Using Machine Learning](https://www.mdpi.com/2071-1050/14/21/13998).


This paper uses the breast_cancer data we've looked at before:

In [9]:
from sklearn.datasets import load_breast_cancer

# Load the data to remind ourselves what we have
breast_cancer = load_breast_cancer(as_frame=True)  # Loaded from sklearn rather than seaborn
breast_cancer.frame.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


In [10]:
# Load the data as X (features) and y (target) arrays
X, y = load_breast_cancer(return_X_y=True)

In [11]:
# Split into training/testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [12]:
# Run a Knearest neighbors model
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)

# Train the KNN classifier
knn.fit(X_train, y_train)

# Predict the labels for the test set
y_pred = knn.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9370629370629371


In [13]:
# Run logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Single-step pipeline around a logistic regression with the liblinear solver
model = Pipeline([
    ('logistic', LogisticRegression(solver='liblinear'))])

model.fit(X_train, y_train)

# Compare accuracy on the training and testing splits
print(f'Model score on training data: {model.score(X_train,y_train)}')
print(f'Model score on testing data: {model.score(X_test,y_test)}')

Model score on training data: 0.9577464788732394
Model score on testing data: 0.951048951048951


In [14]:
# Run SVM model
from sklearn.svm import SVC # "Support vector classifier"
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

model = Pipeline([
    ('svc', SVC(kernel='rbf', class_weight='balanced'))
])
# Search over the regularization strength C and the RBF kernel coefficient gamma
param_grid = {'svc__C': [0.1, 0.5, 0.75, 1, 1.25],
              'svc__gamma': [0.003, 0.004, 0.005, 0.006]}
grid = GridSearchCV(model, param_grid)

%time grid.fit(X_train, y_train)
print(grid.best_params_)

# Evaluate the best estimator from the search on the held-out test set
model = grid.best_estimator_
y_fit = model.predict(X_test)

print(classification_report(y_test, y_fit))

CPU times: user 1.58 s, sys: 0 ns, total: 1.58 s
Wall time: 1.59 s
{'svc__C': 1.25, 'svc__gamma': 0.004}
              precision    recall  f1-score   support

           0       0.87      0.96      0.91        55
           1       0.98      0.91      0.94        88

    accuracy                           0.93       143
   macro avg       0.92      0.94      0.93       143
weighted avg       0.93      0.93      0.93       143



In [15]:
# Run Random Forest model
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# accuracy_score expects (y_true, y_pred)
print(f'Train Accuracy: {accuracy_score(y_train, model.predict(X_train))}')
print(f'Test Accuracy: {accuracy_score(y_test, model.predict(X_test))}')

Train Accuracy: 1.0
Test Accuracy: 0.951048951048951


## Individual models

Each of our individual models did reasonably well with the hyperparameters we used. We didn't have much time to test other settings, but in most cases we are within a few percentage points of what the paper reported for their individual models.
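The accuracies above come from a single train/test split, so small differences between models may just be noise. A minimal sketch of one way to put the four models on a common footing is to score them all with the same cross-validation folds. The 5-fold setup, the `models` dictionary, and the `cv_results` name are our own assumptions for illustration, not something taken from the paper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Same settings as the cells above; the SVM uses the grid-search winner.
models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logistic": LogisticRegression(solver="liblinear"),
    "svm": SVC(kernel="rbf", class_weight="balanced", C=1.25, gamma=0.004),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Assumption: 5 stratified folds, computed on the training split only
# so the test split stays untouched.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_results = {}
for name, clf in models.items():
    scores = cross_val_score(clf, X_train, y_train, scoring="accuracy", cv=cv)
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} ({scores.std():.3f})")
```

The standard deviation across folds gives a rough sense of how much of the gap between two models is meaningful.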

## Stack the models

We had little time left for this step, and we did not dig into how the paper implemented its stacking. We went with the approach we used earlier: a logistic regression meta-learner on top of the base models.

The results are a bit underwhelming: worse than many of the individual models. This needs more work, but it should be a decent start.

In [8]:
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from numpy import mean
from numpy import std

def get_stacking():
    # define the base models
    level0 = list()
    level0.append(('lr', LogisticRegression(solver = 'liblinear')))
    level0.append(('knn', KNeighborsClassifier(n_neighbors=5)))
    level0.append(('svm', SVC(kernel='rbf', class_weight='balanced', C=1.25, gamma=0.005)))
    level0.append(('randfor',  RandomForestClassifier(n_estimators=100, random_state=0)))
    # define meta learner model
    level1 = LogisticRegression(max_iter=100000)
    # define the stacking ensemble
    model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
    return model

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores

# Build the stacking ensemble explicitly; otherwise `model` still refers to the random forest from the previous cell
model = get_stacking()
scores = evaluate_model(model, X_test, y_test)
print('>%s %.3f (%.3f)' % ("Stacking classifier: ", mean(scores), std(scores)))

>Stacking classifier:  0.933 (0.065)
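One thing we did not try is scaling the features. KNN and the RBF SVM both depend on distances between samples, so they can be sensitive to features measured on very different scales. The sketch below is a hypothetical variant of the stack, not the paper's method: `get_scaled_stacking` and the `StandardScaler` pipelines are our own assumptions, and we have not verified that it improves on the result above. It also cross-validates on the full dataset with 5 folds, so its score is not directly comparable to the number printed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def get_scaled_stacking():
    # Hypothetical variant: scale-sensitive base models get their own scaler.
    level0 = [
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
        # gamma is left at scikit-learn's default, since the values tuned
        # earlier were chosen for unscaled features.
        ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))),
        # Tree-based models do not need scaling.
        ("randfor", RandomForestClassifier(n_estimators=100, random_state=0)),
    ]
    level1 = LogisticRegression(max_iter=100000)
    return StackingClassifier(estimators=level0, final_estimator=level1, cv=5)

# Assumption: 5 stratified folds on the full dataset, to keep the run short.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scaled_scores = cross_val_score(get_scaled_stacking(), X, y, scoring="accuracy", cv=cv)
print(f"Scaled stacking: {scaled_scores.mean():.3f} ({scaled_scores.std():.3f})")
```

Putting the scaler inside each pipeline means it is refit on every training fold, so no information from the held-out fold leaks into the scaling.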


## Take home

1. You can read papers and implement their methods
1. You can build ML models that perform similarly to published results from only a few years ago, and do it in class with 10 minutes to work!
1. Hyperparameters matter! We would need to play with these to do better.
1. Publish your code!! People can't replicate your work without your code!
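As a small illustration of the hyperparameter point, here is a sketch of a grid search for the KNN model, which we only ran with `n_neighbors=5`. The candidate values in `knn_grid` and the names `knn_pipe` and `knn_search` are arbitrary choices of ours for illustration; the paper may have tuned its models differently.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

knn_pipe = Pipeline([("knn", KNeighborsClassifier())])

# Assumption: a small, hand-picked grid; widen it if time allows.
knn_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11, 15],
    "knn__weights": ["uniform", "distance"],
}

# 5-fold cross-validation on the training split picks the best combination.
knn_search = GridSearchCV(knn_pipe, knn_grid, cv=5, scoring="accuracy")
knn_search.fit(X_train, y_train)

print(knn_search.best_params_)
print(f"Test accuracy: {knn_search.score(X_test, y_test):.3f}")
```

The same pattern works for the other base models by swapping the estimator and the parameter grid.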