# Cancer Prediction Lab for Supervised Machine Learning: Classification

Notebook Author: Tan Song Xin Alastair

Dataset Source: Kaggle

Dataset Source URL: https://www.kaggle.com/datasets/rabieelkharoua/cancer-prediction-dataset

Accessed Date: 02 February 2025

## Pip Requirements:

### Python Version: 3.13.1

anyio==4.8.0

argon2-cffi==23.1.0

argon2-cffi-bindings==21.2.0

arrow==1.3.0

asttokens==3.0.0

async-lru==2.0.4

attrs==25.1.0

babel==2.17.0

beautifulsoup4==4.12.3

bleach==6.2.0

certifi==2025.1.31

cffi==1.17.1

charset-normalizer==3.4.1

comm==0.2.2

contourpy==1.3.1

cycler==0.12.1

debugpy==1.8.12

decorator==5.1.1

defusedxml==0.7.1

executing==2.2.0

fastjsonschema==2.21.1

fonttools==4.55.8

fqdn==1.5.1

h11==0.14.0

httpcore==1.0.7

httpx==0.28.1

idna==3.10

ipykernel==6.29.5

ipython==8.32.0

ipywidgets==8.1.5

isoduration==20.11.0

jedi==0.19.2

Jinja2==3.1.5

joblib==1.4.2

json5==0.10.0

jsonpointer==3.0.0

jsonschema==4.23.0

jsonschema-specifications==2024.10.1

jupyter==1.1.1

jupyter-console==6.6.3

jupyter-events==0.11.0

jupyter-lsp==2.2.5

jupyter_client==8.6.3

jupyter_core==5.7.2

jupyter_server==2.15.0

jupyter_server_terminals==0.5.3

jupyterlab==4.3.5

jupyterlab_pygments==0.3.0

jupyterlab_server==2.27.3

jupyterlab_widgets==3.0.13

kiwisolver==1.4.8

MarkupSafe==3.0.2

matplotlib==3.10.0

matplotlib-inline==0.1.7

mistune==3.1.1

nbclient==0.10.2

nbconvert==7.16.6

nbformat==5.10.4

nest-asyncio==1.6.0

notebook==7.3.2

notebook_shim==0.2.4

numpy==2.2.2

overrides==7.7.0

packaging==24.2

pandas==2.2.3

pandocfilters==1.5.1

parso==0.8.4

pexpect==4.9.0

pillow==11.1.0

platformdirs==4.3.6

prometheus_client==0.21.1

prompt_toolkit==3.0.50

psutil==6.1.1

ptyprocess==0.7.0

pure_eval==0.2.3

pycparser==2.22

Pygments==2.19.1

pyparsing==3.2.1

python-dateutil==2.9.0.post0

python-json-logger==3.2.1

pytz==2025.1

PyYAML==6.0.2

pyzmq==26.2.1

referencing==0.36.2

requests==2.32.3

rfc3339-validator==0.1.4

rfc3986-validator==0.1.1

rpds-py==0.22.3

scikit-learn==1.6.1

scipy==1.15.1

seaborn==0.13.2

Send2Trash==1.8.3


setuptools==75.8.0

six==1.17.0

sniffio==1.3.1

soupsieve==2.6

stack-data==0.6.3

terminado==0.18.1

threadpoolctl==3.5.0

tinycss2==1.4.0

tornado==6.4.2

traitlets==5.14.3

types-python-dateutil==2.9.0.20241206

tzdata==2025.1

uri-template==1.3.0

urllib3==2.3.0

wcwidth==0.2.13

webcolors==24.11.1

webencodings==0.5.1

websocket-client==1.8.0

widgetsnbextension==4.0.13

# Dataset Summary

The dataset is a collection of anonymised patient data with statistics on their age, gender, and health characters and history such as their BMI, whether the patient is smoking, their genetic risk of cancer, their number of hours per week of physical activity, the number of units of alcohol consumed per week, whether the patient has had cancer before, and the label column, on whether they have been diagnosed with cancer.

This is a Kaggle dataset structured for purposes of testing prediction models.

In [1]:
import pandas as pd
import numpy as np
import sklearn, statistics
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from scipy.stats import kstest
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#Get rid of warnings
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

In [3]:
# Read dataset
pd_dataset = pd.read_csv("cancer_pred_dataset.csv")

# Check if file is read properly.
print("DataFrame Check:")
print(pd_dataset.head())

# Check if there are null/na values to deal with
print("NA/NULL count:")
print(pd_dataset.isna().sum())

print(pd_dataset[["Gender"]].value_counts())
print(pd_dataset[["GeneticRisk"]].value_counts())
print(pd_dataset[["Smoking"]].value_counts())
print(pd_dataset[["CancerHistory"]].value_counts())
print(pd_dataset[["Diagnosis"]].value_counts())
print(pd_dataset.query("Gender==0")[["Diagnosis"]].value_counts())
print(pd_dataset.query("Gender==1")[["Diagnosis"]].value_counts())

print(f"Accuracy of naive model that predicts everyone has cancer: {557/1500}.")
precision = 557 / (943 + 557)
recall = 1

print(f"F1 Score of naive model that predicts everyone has cancer: {2 * (precision * recall) / (precision + recall)}.")

DataFrame Check:
   Age  Gender        BMI  Smoking  GeneticRisk  PhysicalActivity  \
0   58       1  16.085313        0            1          8.146251   
1   71       0  30.828784        0            1          9.361630   
2   48       1  38.785084        0            2          5.135179   
3   34       0  30.040296        0            0          9.502792   
4   62       1  35.479721        0            0          5.356890   

   AlcoholIntake  CancerHistory  Diagnosis  
0       4.148219              1          1  
1       3.519683              0          0  
2       4.728368              0          1  
3       2.044636              0          0  
4       3.309849              0          1  
NA/NULL count:
Age                 0
Gender              0
BMI                 0
Smoking             0
GeneticRisk         0
PhysicalActivity    0
AlcoholIntake       0
CancerHistory       0
Diagnosis           0
dtype: int64
Gender
0         764
1         736
Name: count, dtype: int64
GeneticRisk

## Categorical Value Counts:

|GeneticRisk| | |Smoking| | |CancerHistory| | |Diagnosis| | 
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|0|895| |0|1096| |0|1284| |0|943|
|1|447| |1|404| |1|216| |1|557|
|2|158| | | | |  | | | | |

Some imbalance, but stratification will be attempted as a first simple step solution to this issue first. Notably, women are more likely in this dataset to have cancer than men. There are no null or NaN values, implying the dataset is high quality in terms of cleanliness.

Aim to beat Accuracy of 0.371 and F1 Score of 0.542

In [4]:
#Basic Age Two-Sided Kolmogorov-Smirnov Test. p > 0.05 means p is likely normal

stat, p = kstest(pd_dataset["Age"], 'norm', args=(pd_dataset["Age"].mean(), pd_dataset["Age"].std()))
print(f"Age Statistics={stat}, p-value={p}")

stat, p = kstest(np.log(pd_dataset["Age"]), 'norm', args=(np.log(pd_dataset["Age"]).mean(), np.log(pd_dataset["Age"]).std()))
print(f"Log Age Statistics={stat}, p-value={p}")

boxcox_result, _ = scipy.stats.boxcox(pd_dataset["Age"])

stat, p = kstest(boxcox_result, 'norm', args=(boxcox_result.mean(), boxcox_result.std()))
print(f"Boxcox Age Statistics={stat}, p-value={p}")

stat, p = kstest(np.sqrt(pd_dataset["Age"]), 'norm', args=(np.sqrt(pd_dataset["Age"]).mean(), np.sqrt(pd_dataset["Age"]).std()))
print(f"Sqrt Age Statistics={stat}, p-value={p}")

robust_scaler_age = RobustScaler()
pd_dataset["Scaled_Age"] = robust_scaler_age.fit_transform(pd_dataset[["Age"]])

stat, p = kstest(pd_dataset["Scaled_Age"], 'norm', args=(pd_dataset["Scaled_Age"].mean(), pd_dataset["Scaled_Age"].std()))
print(f"Robust Scaler Scaled Age Statistics={stat}, p-value={p}")

Age Statistics=0.07162216580865866, p-value=3.891956357447314e-07
Log Age Statistics=0.09156205580510224, p-value=2.1516391261112486e-11
Boxcox Age Statistics=0.07219479869679402, p-value=3.0371878232906383e-07
Sqrt Age Statistics=0.07501624938587448, p-value=8.695030230524278e-08
Robust Scaler Scaled Age Statistics=0.07162216580865877, p-value=3.8919563574471276e-07


## Age Normalisation Result Attempts:

Unmodified Age p-value=3.891956357447314e-07

Log Age p-value=2.1516391261112486e-11

Boxcox Age p-value=3.0371878232906383e-07

Sqrt Age p-value=8.695030230524278e-08

Robust Scaler Scaled Age p-value=3.8919563574471276e-07

None of the values are greater than 0.05. For the purposes of Logistic Regression, will be using the robust scaler version, as it is more resilient to outliers.

In [5]:
#Basic BMI Two-Sided Kolmogorov-Smirnov Test. p > 0.05 means p is likely normal

stat, p = kstest(pd_dataset["BMI"], 'norm', args=(pd_dataset["BMI"].mean(), pd_dataset["BMI"].std()))
print(f"BMI Statistics={stat}, p-value={p}")

stat, p = kstest(np.log(pd_dataset["BMI"]), 'norm', args=(np.log(pd_dataset["BMI"]).mean(), np.log(pd_dataset["BMI"]).std()))
print(f"Log BMI Statistics={stat}, p-value={p}")

boxcox_result, _ = scipy.stats.boxcox(pd_dataset["BMI"])

stat, p = kstest(boxcox_result, 'norm', args=(boxcox_result.mean(), boxcox_result.std()))
print(f"Boxcox BMI Statistics={stat}, p-value={p}")

stat, p = kstest(np.sqrt(pd_dataset["BMI"]), 'norm', args=(np.sqrt(pd_dataset["BMI"]).mean(), np.sqrt(pd_dataset["BMI"]).std()))
print(f"Sqrt BMI Statistics={stat}, p-value={p}")

robust_scaler_bmi = RobustScaler()
pd_dataset["Scaled_BMI"] = robust_scaler_bmi.fit_transform(pd_dataset[["BMI"]])

stat, p = kstest(pd_dataset["Scaled_BMI"], 'norm', args=(pd_dataset["Scaled_BMI"].mean(), pd_dataset["Scaled_BMI"].std()))
print(f"Robust Scaler Scaled BMI Statistics={stat}, p-value={p}")


BMI Statistics=0.06083960549816192, p-value=2.8679341466106453e-05
Log BMI Statistics=0.08150339501149417, p-value=4.0836749374851926e-09
Boxcox BMI Statistics=0.06405942746565263, p-value=8.547553639087e-06
Sqrt BMI Statistics=0.06825455305362982, p-value=1.607386546593516e-06
Robust Scaler Scaled BMI Statistics=0.06083960549816192, p-value=2.8679341466106453e-05


## BMI Normalisation Result Attempts:

BMI Statistics=0.06083960549816192, p-value=2.8679341466106453e-05

Log BMI Statistics=0.08150339501149417, p-value=4.0836749374851926e-09

Boxcox BMI Statistics=0.06405942746565263, p-value=8.547553639087e-06

Sqrt BMI Statistics=0.06825455305362982, p-value=1.607386546593516e-06

Robust Scaler Scaled BMI Statistics=0.06083960549816192, p-value=2.8679341466106453e-05

None of the values are greater than 0.05. For the purposes of Logistic Regression, will be using the robust scaler version, as it is more resilient to outliers.

In [6]:
#Basic Physical Activity Two-Sided Kolmogorov-Smirnov Test. p > 0.05 means p is likely normal

stat, p = kstest(pd_dataset["PhysicalActivity"], 'norm', args=(pd_dataset["PhysicalActivity"].mean(), pd_dataset["PhysicalActivity"].std()))
print(f"PhysicalActivity Statistics={stat}, p-value={p}")

stat, p = kstest(np.log(pd_dataset["PhysicalActivity"]), 'norm', args=(np.log(pd_dataset["PhysicalActivity"]).mean(), np.log(pd_dataset["PhysicalActivity"]).std()))
print(f"Log PhysicalActivity Statistics={stat}, p-value={p}")

pd_dataset["Boxcox_PhysicalActivity"], lambda_boxcox_phy = scipy.stats.boxcox(pd_dataset["PhysicalActivity"])

stat, p = kstest(pd_dataset["Boxcox_PhysicalActivity"], 'norm', args=(boxcox_result.mean(), boxcox_result.std()))
print(f"Boxcox PhysicalActivity Statistics={stat}, p-value={p}")

stat, p = kstest(np.sqrt(pd_dataset["PhysicalActivity"]), 'norm', args=(np.sqrt(pd_dataset["PhysicalActivity"]).mean(), np.sqrt(pd_dataset["PhysicalActivity"]).std()))
print(f"Sqrt PhysicalActivity Statistics={stat}, p-value={p}")

robust_scaler_phy = RobustScaler()
pd_dataset["Scaled_PhysicalActivity"] = robust_scaler_phy.fit_transform(pd_dataset[["PhysicalActivity"]])

stat, p = kstest(pd_dataset["Scaled_PhysicalActivity"], 'norm', args=(pd_dataset["Scaled_PhysicalActivity"].mean(), pd_dataset["Scaled_PhysicalActivity"].std()))
print(f"Robust Scaler Scaled PhysicalActivity Statistics={stat}, p-value={p}")

# Scale boxcox
robust_scaler_boxcox_phy = RobustScaler()
pd_dataset["Scaled_Boxcox_PhysicalActivity"] = robust_scaler_phy.fit_transform(pd_dataset[["Boxcox_PhysicalActivity"]])

stat, p = kstest(pd_dataset["Scaled_Boxcox_PhysicalActivity"], 'norm', args=(pd_dataset["Scaled_Boxcox_PhysicalActivity"].mean(), pd_dataset["Scaled_Boxcox_PhysicalActivity"].std()))
print(f"Boxcox Robust Scaler Scaled PhysicalActivity Statistics={stat}, p-value={p}")

PhysicalActivity Statistics=0.06320157936067106, p-value=1.1873079156051877e-05
Log PhysicalActivity Statistics=0.14901562666999968, p-value=1.5348360737987005e-29
Boxcox PhysicalActivity Statistics=0.9981014559780864, p-value=0.0
Sqrt PhysicalActivity Statistics=0.07240336217258803, p-value=2.7735304269668635e-07
Robust Scaler Scaled PhysicalActivity Statistics=0.06320157936067106, p-value=1.1873079156051877e-05
Boxcox Robust Scaler Scaled PhysicalActivity Statistics=0.06185621834371713, p-value=1.9702041957745945e-05


## PhysicalActivity Normalisation Result Attempts

PhysicalActivity p-value=1.1873079156051877e-05

Log PhysicalActivity p-value=1.5348360737987005e-29

Boxcox PhysicalActivity p-value=1.9128977936562205e-05

Sqrt PhysicalActivity p-value=2.7735304269668635e-07

Robust Scaler Scaled PhysicalActivity p-value=1.1873079156051877e-05

There is an improvement of the p-value when using Boxcox, so we will use Robust Scaler on the results scaled by boxcos to get the p-value of 1.9702041957745945e-05.

In [7]:
# Basic Alcohol Intake Two-Sided Kolmogorov-Smirnov Test. p > 0.05 means p is likely normal

stat, p = kstest(pd_dataset["AlcoholIntake"], 'norm', args=(pd_dataset["AlcoholIntake"].mean(), pd_dataset["AlcoholIntake"].std()))
print(f"AlcoholIntake Statistics={stat}, p-value={p}")

stat, p = kstest(np.log(pd_dataset["AlcoholIntake"]), 'norm', args=(np.log(pd_dataset["AlcoholIntake"]).mean(), np.log(pd_dataset["AlcoholIntake"]).std()))
print(f"Log AlcoholIntake Statistics={stat}, p-value={p}")

pd_dataset["Boxcox_AlcoholIntake"], _ = scipy.stats.boxcox(pd_dataset["AlcoholIntake"])

stat, p = kstest(pd_dataset["Boxcox_AlcoholIntake"], 'norm', args=(boxcox_result.mean(), boxcox_result.std()))
print(f"Boxcox AlcoholIntake Statistics={stat}, p-value={p}")

stat, p = kstest(np.sqrt(pd_dataset["AlcoholIntake"]), 'norm', args=(np.sqrt(pd_dataset["AlcoholIntake"]).mean(), np.sqrt(pd_dataset["AlcoholIntake"]).std()))
print(f"Sqrt AlcoholIntake Statistics={stat}, p-value={p}")

robust_scaler_alc = RobustScaler()
pd_dataset["Scaled_AlcoholIntake"] = robust_scaler_alc.fit_transform(pd_dataset[["AlcoholIntake"]])

stat, p = kstest(pd_dataset["Scaled_AlcoholIntake"], 'norm', args=(pd_dataset["Scaled_AlcoholIntake"].mean(), pd_dataset["Scaled_AlcoholIntake"].std()))
print(f"Robust Scaler Scaled AlcoholIntake Statistics={stat}, p-value={p}")

AlcoholIntake Statistics=0.05897210324534585, p-value=5.624101360878484e-05
Log AlcoholIntake Statistics=0.1573101740462859, p-value=6.894405513893202e-33
Boxcox AlcoholIntake Statistics=0.999834969922996, p-value=0.0
Sqrt AlcoholIntake Statistics=0.07346645003055041, p-value=1.738732116775778e-07
Robust Scaler Scaled AlcoholIntake Statistics=0.05897210324534585, p-value=5.624101360878484e-05


## AlcoholIntake Normalisation Result Attempts

AlcoholIntake p-value=5.624101360878484e-05

Log AlcoholIntake p-value=6.894405513893202e-33

Boxcox AlcoholIntake p-value=1.1360139376053743e-262

Sqrt AlcoholIntake p-value=1.738732116775778e-07

Robust Scaler Scaled AlcoholIntake p-value=5.624101360878484e-05

None of the values are greater than 0.05. For the purposes of Logistic Regression, will be using the robust scaler version, as it is more resilient to outliers.

In [8]:
# A Standard Scaler will be used on GeneticRisk to convert it to values between 0 and 1.

standard_scaler_genetic = StandardScaler()

pd_dataset["GeneticRisk"] = standard_scaler_genetic.fit_transform(pd_dataset[["GeneticRisk"]])

# Get dummies will be used on gender. And they will be renamed for interpretabiliy's sake.

pd_dataset = pd.get_dummies(pd_dataset, columns=["Gender"])

pd_dataset.rename(columns={"Gender_0" : "Gender_Male", "Gender_1" : "Gender_Female"}, inplace=True)

# Logistic Regression Classification

No one-hot encoding will be done on GeneticRisk as there is some relationship between 0, 1 and 2.

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

X = pd_dataset[["Scaled_Age", "Gender_Male", "Gender_Female", "Scaled_BMI", "Scaled_Boxcox_PhysicalActivity", "Scaled_AlcoholIntake", "GeneticRisk", "Smoking", "CancerHistory"]]
y = pd_dataset[["Diagnosis"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=1)

clf = LogisticRegression(random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Basic Logistic Regression model accuracy: {acc}, F1 Score: {f1}")

for index in range(len(X.columns)):
    print(f"{X.columns[index]} coefficient: {np.abs(clf.coef_[0][index])}")

Basic Logistic Regression model accuracy: 0.8666666666666667, F1 Score: 0.8113207547169812
Scaled_Age coefficient: 1.348329439310833
Gender_Male coefficient: 0.9683508451568882
Gender_Female coefficient: 0.9736233027039473
Scaled_BMI coefficient: 1.2599341489373848
Scaled_Boxcox_PhysicalActivity coefficient: 1.1764590217258286
Scaled_AlcoholIntake coefficient: 1.2833416156722883
GeneticRisk coefficient: 1.003716085902707
Smoking coefficient: 1.7485106497652243
CancerHistory coefficient: 3.7105488282194896


## Results:

Basic Logistic Regression model accuracy: 0.87, F1 Score: 0.8151658767772512

Scaled_Age coefficient: 1.348329439310833

Gender_Male coefficient: 0.9683508451568882

Gender_Female coefficient: 0.9736233027039473

Scaled_BMI coefficient: 1.2599341489373848

Scaled_Boxcox_PhysicalActivity coefficient: 1.1764590217258286

Scaled_AlcoholIntake coefficient: 1.2833416156722883

GeneticRisk coefficient: 1.003716085902707

Smoking coefficient: 1.7485106497652243

CancerHistory coefficient: 3.7105488282194896

Quite reasonably, CancerHistory has the greatest impact on the model's predictions. Females are slightly more likely to have cancer in this dataset than males, even though there are (slightly) more males in this dataset than females. Smoking, Age and Alcohol intake all next have the highest impact. The genetic risk impact is lower than expected, being only higher than gender.

Stratified K-Fold will be used to get another overview of the data

In [10]:
from sklearn.model_selection import StratifiedKFold

n_splits = 5
acc_mean = 0.0
f1_mean = 0.0

age_coef_mean = 0.0
male_coef_mean = 0.0
female_coef_mean = 0.0
bmi_coef_mean = 0.0
phyact_coef_mean = 0.0
alcohol_coef_mean = 0.0
genetic_coef_mean = 0.0
smoking_coef_mean = 0.0
history_coef_mean = 0.0

skf = StratifiedKFold(n_splits=n_splits)

for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    clf = LogisticRegression(random_state=0)
    clf.fit(X.iloc[train_index], y.iloc[train_index])

    y_pred = clf.predict(X.iloc[test_index])

    acc = accuracy_score(y.iloc[test_index], y_pred)
    acc_mean += acc
    f1 = f1_score(y.iloc[test_index], y_pred)
    f1_mean += f1
    
    age_coef_mean += np.abs(clf.coef_[0][0])
    male_coef_mean += np.abs(clf.coef_[0][1])
    female_coef_mean += np.abs(clf.coef_[0][2])
    bmi_coef_mean += np.abs(clf.coef_[0][3])
    phyact_coef_mean += np.abs(clf.coef_[0][4])
    alcohol_coef_mean += np.abs(clf.coef_[0][5])
    genetic_coef_mean += np.abs(clf.coef_[0][6])
    smoking_coef_mean += np.abs(clf.coef_[0][7])
    history_coef_mean += np.abs(clf.coef_[0][8])

print(f"{n_splits} Stratified K-Folds: Mean accuracy: {acc_mean/n_splits}, Mean F1 Score: {f1_mean/n_splits}")

print(f"{X.columns[0]} coefficient: {age_coef_mean/n_splits}")
print(f"{X.columns[1]} coefficient: {male_coef_mean/n_splits}")
print(f"{X.columns[2]} coefficient: {female_coef_mean/n_splits}")
print(f"{X.columns[3]} coefficient: {bmi_coef_mean/n_splits}")
print(f"{X.columns[4]} coefficient: {phyact_coef_mean/n_splits}")
print(f"{X.columns[5]} coefficient: {alcohol_coef_mean/n_splits}")
print(f"{X.columns[6]} coefficient: {genetic_coef_mean/n_splits}")
print(f"{X.columns[7]} coefficient: {smoking_coef_mean/n_splits}")
print(f"{X.columns[8]} coefficient: {history_coef_mean/n_splits}")

5 Stratified K-Folds: Mean accuracy: 0.8513333333333334, Mean F1 Score: 0.7892168261182981
Scaled_Age coefficient: 1.5114603754516618
Gender_Male coefficient: 0.986768307923724
Gender_Female coefficient: 0.9892179805229308
Scaled_BMI coefficient: 1.3515557102419415
Scaled_Boxcox_PhysicalActivity coefficient: 1.1298400099817125
Scaled_AlcoholIntake coefficient: 1.3948561844521046
GeneticRisk coefficient: 1.0058234238941828
Smoking coefficient: 1.832735041256917
CancerHistory coefficient: 3.8359681337882905


# Evaluation of Logistic Regression (and Stratified K-Fold Logistic Regression)

|Values|Stratified Model|5-Fold Stratified Mean Model|
|-----|-----|-----|
|Accuracy|0.870|0.851|
|F1 Score|0.815|0.789|
||||
|Age Coefficient|1.348|1.511|
|Male Coefficient|0.968|0.987|
|Female Coefficient|0.974|0.989|
|BMI Coefficient|1.260|1.352|
|Physical Activity Coefficient|1.176|1.130|
|Alcohol Intake Coefficient|1.283|1.395|
|Genetic Risk Coefficient|1.004|1.006|
|Smoking Coefficient|1.749|1.833|
|Cancer History Coefficient|3.711|3.836|

Both the basic Logistic Regression (and the stratified K-Fold accuracy and F1-score means) are greater than the naive accuracy score of 0.623 and the naive F1 Score of 0.772.

The 5-Fold model places a greater emphasis on age, BMI, alcohol intake, smoking and cancer history, but less on physical impact. It also has a smaller difference between male and female gender coefficients.

## Decision Tree Model

Iterations on decision tree models will be attempted. Max depth will be used as a hyperparameter.

In [11]:
from sklearn.tree import DecisionTreeClassifier

for i in range(1, 17):    
    dec_tree = DecisionTreeClassifier(max_depth=i, random_state=1)
    dec_tree.fit(X_train, y_train)
    y_pred = dec_tree.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print(f"Decision Tree model with max depth \"{i}\" accuracy: {acc}, F1 Score: {f1}")

n_splits = 5
acc_mean = 0.0
f1_mean = 0.0

for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    dec_tree = DecisionTreeClassifier(max_depth=8, random_state=1)
    dec_tree.fit(X.iloc[train_index], y.iloc[train_index])
    y_pred = dec_tree.predict(X.iloc[test_index])
    
    acc = accuracy_score(y.iloc[test_index], y_pred)
    acc_mean += acc
    
    f1 = f1_score(y.iloc[test_index], y_pred)
    f1_mean += f1

print(f"{n_splits} Stratified K-Folds for Decision Tree Classifier, Max Depth of 8: Mean accuracy: {acc_mean/n_splits}, Mean F1 Score: {f1_mean/n_splits}")

acc_mean = 0.0
f1_mean = 0.0

for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    dec_tree = DecisionTreeClassifier(max_depth=9, random_state=1)
    dec_tree.fit(X.iloc[train_index], y.iloc[train_index])
    y_pred = dec_tree.predict(X.iloc[test_index])
    
    acc = accuracy_score(y.iloc[test_index], y_pred)
    acc_mean += acc
    
    f1 = f1_score(y.iloc[test_index], y_pred)
    f1_mean += f1

print(f"{n_splits} Stratified K-Folds for Decision Tree Classifier, Max Depth of 9: Mean accuracy: {acc_mean/n_splits}, Mean F1 Score: {f1_mean/n_splits}")

acc_mean = 0.0
f1_mean = 0.0

for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    dec_tree = DecisionTreeClassifier(max_depth=10, random_state=1)
    dec_tree.fit(X.iloc[train_index], y.iloc[train_index])
    y_pred = dec_tree.predict(X.iloc[test_index])
    
    acc = accuracy_score(y.iloc[test_index], y_pred)
    acc_mean += acc
    
    f1 = f1_score(y.iloc[test_index], y_pred)
    f1_mean += f1

print(f"{n_splits} Stratified K-Folds for Decision Tree Classifier, Max Depth of 10: Mean accuracy: {acc_mean/n_splits}, Mean F1 Score: {f1_mean/n_splits}")

acc_mean = 0.0
f1_mean = 0.0

for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    dec_tree = DecisionTreeClassifier(max_depth=11, random_state=1)
    dec_tree.fit(X.iloc[train_index], y.iloc[train_index])
    y_pred = dec_tree.predict(X.iloc[test_index])
    
    acc = accuracy_score(y.iloc[test_index], y_pred)
    acc_mean += acc
    
    f1 = f1_score(y.iloc[test_index], y_pred)
    f1_mean += f1

print(f"{n_splits} Stratified K-Folds for Decision Tree Classifier, Max Depth of 11: Mean accuracy: {acc_mean/n_splits}, Mean F1 Score: {f1_mean/n_splits}")

acc_mean = 0.0
f1_mean = 0.0

for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    dec_tree = DecisionTreeClassifier(max_depth=12, random_state=1)
    dec_tree.fit(X.iloc[train_index], y.iloc[train_index])
    y_pred = dec_tree.predict(X.iloc[test_index])
    
    acc = accuracy_score(y.iloc[test_index], y_pred)
    acc_mean += acc
    
    f1 = f1_score(y.iloc[test_index], y_pred)
    f1_mean += f1

print(f"{n_splits} Stratified K-Folds for Decision Tree Classifier, Max Depth of 12: Mean accuracy: {acc_mean/n_splits}, Mean F1 Score: {f1_mean/n_splits}")

acc_mean = 0.0
f1_mean = 0.0

for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    dec_tree = DecisionTreeClassifier(max_depth=13, random_state=1)
    dec_tree.fit(X.iloc[train_index], y.iloc[train_index])
    y_pred = dec_tree.predict(X.iloc[test_index])
    
    acc = accuracy_score(y.iloc[test_index], y_pred)
    acc_mean += acc
    
    f1 = f1_score(y.iloc[test_index], y_pred)
    f1_mean += f1

print(f"{n_splits} Stratified K-Folds for Decision Tree Classifier, Max Depth of 13: Mean accuracy: {acc_mean/n_splits}, Mean F1 Score: {f1_mean/n_splits}")

Decision Tree model with max depth "1" accuracy: 0.71, F1 Score: 0.4
Decision Tree model with max depth "2" accuracy: 0.7733333333333333, F1 Score: 0.6136363636363636
Decision Tree model with max depth "3" accuracy: 0.7633333333333333, F1 Score: 0.5798816568047337
Decision Tree model with max depth "4" accuracy: 0.7666666666666667, F1 Score: 0.6276595744680851
Decision Tree model with max depth "5" accuracy: 0.79, F1 Score: 0.64
Decision Tree model with max depth "6" accuracy: 0.8266666666666667, F1 Score: 0.7547169811320755
Decision Tree model with max depth "7" accuracy: 0.8633333333333333, F1 Score: 0.8056872037914692
Decision Tree model with max depth "8" accuracy: 0.9066666666666666, F1 Score: 0.8691588785046729
Decision Tree model with max depth "9" accuracy: 0.9133333333333333, F1 Score: 0.8796296296296297
Decision Tree model with max depth "10" accuracy: 0.9133333333333333, F1 Score: 0.8807339449541285
Decision Tree model with max depth "11" accuracy: 0.9, F1 Score: 0.863636363

## Decision Tree Evaluation

Decision trees prefer to use greedy splits to maximise the purity/gini of the splits.

|Model Type|Max Depth|Accuracy|F1 Score|
|-----|-----|-----|-----|
|Decision Tree|1|0.710|0.400|
|Decision Tree|2|0.773|0.614|
|Decision Tree|3|0.763|0.580|
|Decision Tree|4|0.767|0.628|
|Decision Tree|5|0.790|0.640|
|Decision Tree|6|0.827|0.757|
|Decision Tree|7|0.863|0.806|
|Decision Tree|8|0.907|0.869|
|Decision Tree|9|0.913|0.880|
|Decision Tree|10|0.913|0.881|
|Decision Tree|11|0.900|0.864|
|Decision Tree|12|0.890|0.851|
|Decision Tree|13|0.887|0.847|
|Decision Tree|14|0.883|0.843|
|Decision Tree|15|0.893|0.856|
|Decision Tree|16|0.893|0.856|
|5-Fold Decision Tree (Mean)|8|0.887|0.837|
|5-Fold Decision Tree (Mean)|9|0.887|0.838|
|5-Fold Decision Tree (Mean)|10|0.881|0.831|
|5-Fold Decision Tree (Mean)|11|0.879|0.828|
|5-Fold Decision Tree (Mean)|12|0.874|0.823|
|5-Fold Decision Tree (Mean)|13|0.872|0.821|

The basic decision tree has the best F1 at a max depth of 10, and shows no change in accuracy or F1 Score past a max depth of 15. For the 5-Fold Decision Tree, the best accuracy and F1 score comes at the max depth of 9.

The accuracy and F1 scores of the decision trees are better than the Logistic Regression results.

## Random Forest Classifier Evaluation

Random Forest classifiers are an ensemble model that uses multiple decision trees in a voting manner. This aims to replicate the features of the Decision Tree classifier, but taking the 'average' to minimise overfitting. This comes at the cost of interpretability, and training time.

We will be using the max depths of 8, 9 and 10 discovered as the best hyperparameters for this particular case.

In [12]:
from sklearn.ensemble import RandomForestClassifier

max_depth_list = [8, 9, 10, None]
n_estimators_list = [2, 5, 10, 50, 100, 500, 1000, 5000, 10000]

for max_depth in max_depth_list:
    for n_estimator in n_estimators_list:
        rfc = RandomForestClassifier(n_estimators=n_estimator, max_depth=max_depth, n_jobs=-1, random_state=1)
        rfc.fit(X_train, y_train)
        
        y_pred = rfc.predict(X_test)
        
        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        print(f"Basic Random Forest Classifier (max depth {max_depth}) with n_estimators {n_estimator} model accuracy: {acc}, F1 Score: {f1}")

Basic Random Forest Classifier (max depth 8) with n_estimators 2 model accuracy: 0.84, F1 Score: 0.7623762376237624
Basic Random Forest Classifier (max depth 8) with n_estimators 5 model accuracy: 0.8766666666666667, F1 Score: 0.821256038647343
Basic Random Forest Classifier (max depth 8) with n_estimators 10 model accuracy: 0.89, F1 Score: 0.8374384236453202
Basic Random Forest Classifier (max depth 8) with n_estimators 50 model accuracy: 0.9133333333333333, F1 Score: 0.875
Basic Random Forest Classifier (max depth 8) with n_estimators 100 model accuracy: 0.9333333333333333, F1 Score: 0.9074074074074074
Basic Random Forest Classifier (max depth 8) with n_estimators 500 model accuracy: 0.9266666666666666, F1 Score: 0.897196261682243
Basic Random Forest Classifier (max depth 8) with n_estimators 1000 model accuracy: 0.9233333333333333, F1 Score: 0.8909952606635071
Basic Random Forest Classifier (max depth 8) with n_estimators 5000 model accuracy: 0.9266666666666666, F1 Score: 0.89622641

## Random Forest Classifier Results (Non-K Fold)

| Num Estimators | Max Depth | Accuracy | F1    |  | Num Estimators | Max Depth | Accuracy | F1    |  | Num Estimators | Max Depth | Accuracy | F1    |  | Num Estimators | Max Depth | Accuracy | F1    |
| -------------- | --------- | -------- | ----- | ----- | -------------- | --------- | -------- | ----- | ----- | -------------- | --------- | -------- | ----- | ----- | -------------- | --------- | -------- | ----- |
| 2              | 8         | 0.84     | 0.762 |  | 2              | 9         | 0.803    | 0.712 |  | 2              | 10        | 0.827    | 0.74  |  | 2              | None      | 0.813    | 0.682 |
| 5              | 8         | 0.877    | 0.821 |  | 5              | 9         | 0.89     | 0.847 |  | 5              | 10        | 0.89     | 0.847 |  | 5              | None      | 0.887    | 0.833 |
| 10             | 8         | 0.89     | 0.837 |  | 10             | 9         | 0.913    | 0.877 |  | 10             | 10        | 0.913    | 0.877 |  | 10             | None      | 0.897    | 0.849 |
| 50             | 8         | 0.913    | 0.875 |  | 50             | 9         | 0.927    | 0.897 |  | 50             | 10        | 0.937    | 0.912 |  | 50             | None      | 0.92     | 0.888 |
| 100            | 8         | 0.933    | 0.907 |  | 100            | 9         | 0.933    | 0.907 |  | 100            | 10        | 0.94     | 0.917 |  | 100            | None      | 0.937    | 0.912 |
| 500            | 8         | 0.927    | 0.897 |  | 500            | 9         | 0.937    | 0.912 |  | 500            | 10        | 0.937    | 0.912 |  | 500            | None      | 0.933    | 0.908 |
| 1000           | 8         | 0.923    | 0.891 |  | 1000           | 9         | 0.94     | 0.917 |  | 1000           | 10        | 0.937    | 0.912 |  | 1000           | None      | 0.933    | 0.908 |
| 5000           | 8         | 0.927    | 0.896 |  | 5000           | 9         | 0.933    | 0.907 |  | 5000           | 10        | 0.937    | 0.912 |  | 5000           | None      | 0.933    | 0.908 |
| 10000          | 8         | 0.927    | 0.896 |  | 10000          | 9         | 0.933    | 0.907 |  | 10000          | 10        | 0.937    | 0.912 |  | 10000          | None      | 0.933    | 0.908 |

The best values range from 50 to 1000 estimators, and a max depth of 9 and 10. We will use 5-Fold validation to get a better mean accuracy and mean F1 score of all the models.

In [13]:
n_splits = 5

max_depth_list = [9, 10]
n_estimators_list = [50, 100, 500, 1000]
    
for max_depth in max_depth_list:
    for n_estimator in n_estimators_list:
        acc_mean = 0.0
        f1_mean = 0.0
        for i, (train_index, test_index) in enumerate(skf.split(X, y)):                
            rfc = RandomForestClassifier(n_estimators=n_estimator, max_depth=max_depth, n_jobs=-1, random_state=1)
            rfc.fit(X.iloc[train_index], y.iloc[train_index])

            y_pred = rfc.predict(X.iloc[test_index])
            
            acc_mean += accuracy_score(y.iloc[test_index], y_pred)
            f1_mean += f1_score(y.iloc[test_index], y_pred)

        print(f"{n_splits}-Fold Random Forest Classifier (max depth {max_depth}) with n_estimators {n_estimator} model accuracy: {acc_mean/n_splits}, F1 Score: {f1_mean/n_splits}")

5-Fold Random Forest Classifier (max depth 9) with n_estimators 50 model accuracy: 0.9153333333333332, F1 Score: 0.8762190231522409
5-Fold Random Forest Classifier (max depth 9) with n_estimators 100 model accuracy: 0.9193333333333333, F1 Score: 0.883273507919853
5-Fold Random Forest Classifier (max depth 9) with n_estimators 500 model accuracy: 0.9206666666666667, F1 Score: 0.8853504244201919
5-Fold Random Forest Classifier (max depth 9) with n_estimators 1000 model accuracy: 0.9186666666666667, F1 Score: 0.8820235523892478
5-Fold Random Forest Classifier (max depth 10) with n_estimators 50 model accuracy: 0.9179999999999999, F1 Score: 0.8811959529485304
5-Fold Random Forest Classifier (max depth 10) with n_estimators 100 model accuracy: 0.9213333333333333, F1 Score: 0.8854519128127644
5-Fold Random Forest Classifier (max depth 10) with n_estimators 500 model accuracy: 0.9233333333333335, F1 Score: 0.8888772441293451
5-Fold Random Forest Classifier (max depth 10) with n_estimators 100

## Random Forest Classifier Results (5-Fold)

| Num Estimators | Max Depth | Accuracy | F1    | -- | Max Depth | Accuracy | F1    |
| -------------- | --------- | -------- | ----- | -- | --------- | -------- | ----- |
| 50             | 9         | 0.915    | 0.876 |    | 10        | 0.918    | 0.881 |
| 100            | 9         | 0.919    | 0.883 |    | 10        | 0.921    | 0.885 |
| 500            | 9         | 0.921    | 0.885 |    | 10        | 0.923    | 0.889 |
| 1000           | 9         | 0.919    | 0.882 |    | 10        | 0.921    | 0.886 |

The model performs best at 500 estimators, at a max depth of 10, when evaluated with 5-fold.

# Key Findings/Analysis

From the Logistic Regression model, the best predictors of cancer are the Cancer History, smoking, and age columns. This ties in with common wisdom that people who have had cancer previously are more likely to get cancer in the future, and that smoking increases the risks of cancer, and older people are more likely to have cancer than younger people. Notably, women are only slightly likely to have cancer than men, even though the data is skewed to having more a larger proportion of women with cancer, as compared to men. It has less of an impact than Genetic Risk and Physical Activity, which have a small impact on the Logistic Regression model.

Both the Random Forest Classifier and the Decision Tree Classifiers have better performance in terms of accuracy and F1 score as compared to the Logistic Regression. In terms of this dataset, using a Random Forest Classifier of around 500 estimators, with a max depth of 10, will provide close to the best accuracy and F1 score on the validation dataset.

It is recommended that the Random Forest Classifier be used.

## Future Work
Deep Learning models may be able to provide a better model of predicting cancer. Other methods to be tested include oversampling, undersampling, and dropping columns such as gender to see if the predictive ability of the model improves with less noisy data.