In [28]:
# Imports as always...
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Show all outputs.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

The aim of this notebook is to motivate and derive the model that we will be using. It will be a binary classification model that predicts the diagnosis (malignant or benign).

Once we have derived the model, we will use an evolutionary algorithm (EA) to select a subset of the features to run the model with. Each candidate solution is one such selection of features, and its fitness is the accuracy of the model trained on that selection. These fitness scores are used to evolve the population. In the next generation, the model is trained again to score each new candidate solution, and the cycle repeats.
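The loop described above can be sketched roughly as follows. This is a minimal illustration, not the algorithm used in this research: the function name `evolve`, the choice of truncation selection, one-point crossover and bit-flip mutation, and all parameter values are assumptions made for brevity. The `fitness` argument stands for whatever scores a feature selection (here, the model's accuracy).

```python
import random


def evolve(fitness, n_features, pop_size=20, generations=10,
           mutation_rate=0.05, seed=0):
    """Evolve boolean feature masks; `fitness` maps a mask (tuple of bools) to a score.

    Hypothetical sketch: assumes n_features >= 2 and pop_size >= 4.
    """
    rng = random.Random(seed)

    def random_mask():
        mask = tuple(rng.random() < 0.5 for _ in range(n_features))
        # Retry until at least one feature is selected, so a model has some input.
        return mask if any(mask) else random_mask()

    population = [random_mask() for _ in range(pop_size)]
    for _ in range(generations):
        # Score every candidate and rank from fittest to least fit.
        # (A real run would cache scores, since each call retrains the model.)
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation: each bit flips with probability mutation_rate.
            child = tuple(bit != (rng.random() < mutation_rate) for bit in child)
            if any(child):
                children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Toy fitness (made up for this example): reward selecting the first three
# features and penalise selecting any of the others.
def toy_fitness(mask):
    return sum(mask[:3]) - 0.5 * sum(mask[3:])


best = evolve(toy_fitness, n_features=8)
print(best, toy_fitness(best))
```

Because the fitter half of each generation is carried over unchanged, the best score in the population never decreases from one generation to the next.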

For this reason, the model needs to be small and efficient: it is retrained for every candidate solution in every generation, so training should be quick and computationally cheap. The purpose of this research is to demonstrate the effectiveness of EAs as a means to make feature selection a more efficient process, and so we do not need our model to be brilliant! Showing that an EA can do the job is all that we concern ourselves with.

See https://www.learndatasci.com/glossary/binary-classification/ and https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html for more information about our simple model and implementation.

In [37]:
# Read in the data (ignoring the 'Unnamed: 32' feature, and removing the id).
data = pd.read_csv(r'data.csv').drop(columns=['id', 'Unnamed: 32'])

In [38]:
# Convert diagnoses into binary output (1 = malignant 'M', 0 = benign 'B').
data['diagnosis'] = (data['diagnosis'] == 'M').astype(int)

In [22]:
# Our X and y are the selected features and the diagnosis.
X = data.drop(columns=['diagnosis'])
y = data['diagnosis']

# Split the data into training and testing sets (arbitrarily an 80-20% split).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [24]:
# Normalise the data for numerical stability.
# We normalise after splitting to prevent data leakage: the scaler is fitted
# on the training set only, and the same fitted scaler transforms the test set.
ss_train = StandardScaler()
X_train = ss_train.fit_transform(X_train)
X_test = ss_train.transform(X_test)

In [26]:
# Define and train a logistic regression model.
model = LogisticRegression()

model.fit(X_train, y_train)

In [30]:
# Determine the accuracy of the model (and hence the fitness of the candidate solution).
y_pred = model.predict(X_test)
accuracy_score(y_true=y_test, y_pred=y_pred)

0.956140350877193
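To use this accuracy as the fitness of a candidate solution, the steps above can be wrapped in a single function of a feature mask. The following is only a sketch under stated assumptions: the name `fitness` and the boolean-mask representation are hypothetical, `max_iter=1000` is an arbitrary choice, and scikit-learn's bundled breast cancer dataset is used as a stand-in so that the example runs without `data.csv`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): used only so this sketch is self-contained.
# Note that its target encoding is 0 = malignant, 1 = benign.
X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0
)


def fitness(mask):
    """Return the test accuracy of a logistic regression on the masked features."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return 0.0  # no features selected: treat as the worst candidate
    # Fit the scaler on the training columns only, then reuse it on the test set.
    scaler = StandardScaler().fit(X_tr[:, mask])
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X_tr[:, mask]), y_tr)
    y_hat = model.predict(scaler.transform(X_te[:, mask]))
    return float(accuracy_score(y_te, y_hat))


all_features = np.ones(X_all.shape[1], dtype=bool)
first_ten = np.arange(X_all.shape[1]) < 10  # keep only the first ten columns
print(fitness(all_features), fitness(first_ten))
```

Each call retrains the model from scratch, which is why a cheap model matters: an EA calls a function like this once per candidate per generation.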