# Cancer Prediction Lab for Supervised Machine Learning: Classification

Notebook Author: Tan Song Xin Alastair

Dataset Source: Kaggle

Dataset Source URL: https://www.kaggle.com/datasets/rabieelkharoua/cancer-prediction-dataset

Accessed Date: 02 February 2025

In [19]:
import pandas as pd
import numpy as np
import sklearn, statistics
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from scipy.stats import kstest
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns

In [30]:
# Read dataset
pd_dataset = pd.read_csv("cancer_pred_dataset.csv")

# Check if file is read properly.
print("DataFrame Check:")
print(pd_dataset.head())

# Check if there are null/na values to deal with
print("NA/NULL count:")
print(pd_dataset.isna().sum())

print(pd_dataset[["GeneticRisk"]].value_counts())
print(pd_dataset[["Smoking"]].value_counts())
print(pd_dataset[["CancerHistory"]].value_counts())
print(pd_dataset[["Diagnosis"]].value_counts())

DataFrame Check:
   Age  Gender        BMI  Smoking  GeneticRisk  PhysicalActivity  \
0   58       1  16.085313        0            1          8.146251   
1   71       0  30.828784        0            1          9.361630   
2   48       1  38.785084        0            2          5.135179   
3   34       0  30.040296        0            0          9.502792   
4   62       1  35.479721        0            0          5.356890   

   AlcoholIntake  CancerHistory  Diagnosis  
0       4.148219              1          1  
1       3.519683              0          0  
2       4.728368              0          1  
3       2.044636              0          0  
4       3.309849              0          1  
NA/NULL count:
Age                 0
Gender              0
BMI                 0
Smoking             0
GeneticRisk         0
PhysicalActivity    0
AlcoholIntake       0
CancerHistory       0
Diagnosis           0
dtype: int64
GeneticRisk
0              895
1              447
2              158

## Categorical Value Counts:

| GeneticRisk | Count | | Smoking | Count | | CancerHistory | Count | | Diagnosis | Count |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 895 | | 0 | 1096 | | 0 | 1284 | | 0 | 943 |
| 1 | 447 | | 1 | 404 | | 1 | 216 | | 1 | 557 |
| 2 | 158 | | | | | | | | | |

The classes are somewhat imbalanced (for example, 943 rows with Diagnosis 0 versus 557 with Diagnosis 1), so stratified sampling will be tried first as a simple way to address this.
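The stratification step mentioned above could look roughly like the sketch below. It is only an illustration: the data is a synthetic stand-in that reuses the Diagnosis counts from the table (943 and 557), the `Feature` column is random placeholder data, and `test_size=0.2` and `random_state=42` are assumed values, not choices taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same Diagnosis counts as reported above
# (943 zeros, 557 ones); the feature column is random placeholder data.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Feature": rng.normal(size=1500),
    "Diagnosis": np.repeat([0, 1], [943, 557]),
})

X = demo[["Feature"]]
y = demo["Diagnosis"]

# stratify=y keeps the class ratio approximately equal in both splits.
# test_size and random_state are illustrative assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(f"Positive share: full={y.mean():.3f}, "
      f"train={train_ratio:.3f}, test={test_ratio:.3f}")
```

Stratifying only preserves the existing class ratio in each split; it does not rebalance the classes, so further steps (such as class weights) may still be worth considering if the imbalance hurts the model.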

In [31]:
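# Two-sided Kolmogorov-Smirnov test of Age against a fitted normal distribution.
# p > 0.05 means normality cannot be rejected (the sample is consistent with a normal distribution).
# The mean and std are estimated from the same sample, so the p-values are approximate.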

stat, p = kstest(pd_dataset["Age"], 'norm', args=(pd_dataset["Age"].mean(), pd_dataset["Age"].std()))
print(f"Age Statistics={stat}, p-value={p}")

stat, p = kstest(np.log(pd_dataset["Age"]), 'norm', args=(np.log(pd_dataset["Age"]).mean(), np.log(pd_dataset["Age"]).std()))
print(f"Log Age Statistics={stat}, p-value={p}")

boxcox_result, _ = scipy.stats.boxcox(pd_dataset["Age"])

stat, p = kstest(boxcox_result, 'norm', args=(boxcox_result.mean(), boxcox_result.std()))
print(f"Boxcox Age Statistics={stat}, p-value={p}")

stat, p = kstest(np.sqrt(pd_dataset["Age"]), 'norm', args=(np.sqrt(pd_dataset["Age"]).mean(), np.sqrt(pd_dataset["Age"]).std()))
print(f"Sqrt Age Statistics={stat}, p-value={p}")

robust_scaler = RobustScaler()
pd_dataset["Scaled_Age"] = robust_scaler.fit_transform(pd_dataset[["Age"]])

stat, p = kstest(pd_dataset["Scaled_Age"], 'norm', args=(pd_dataset["Scaled_Age"].mean(), pd_dataset["Scaled_Age"].std()))
print(f"Robust Scaler Scaled Age Statistics={stat}, p-value={p}")

Age Statistics=0.07162216580865866, p-value=3.891956357447314e-07
Log Age Statistics=0.09156205580510224, p-value=2.1516391261112486e-11
Boxcox Age Statistics=0.07219479869679402, p-value=3.0371878232906383e-07
Sqrt Age Statistics=0.07501624938587448, p-value=8.695030230524278e-08
Robust Scaler Scaled Age Statistics=0.07162216580865877, p-value=3.8919563574471276e-07


## Age Normalisation Attempt Results:

| Transformation | p-value |
|-----|-----|
| Unmodified | 3.891956357447314e-07 |
| Log | 2.1516391261112486e-11 |
| Box-Cox | 3.0371878232906383e-07 |
| Square root | 8.695030230524278e-08 |
| Robust scaler | 3.8919563574471276e-07 |

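None of the p-values exceed 0.05, so normality is rejected for Age under every transformation tried. The robust scaler is a linear rescaling (centre on the median, divide by the interquartile range), so it leaves the shape of the distribution unchanged, which is why its KS statistic and p-value match those of the unmodified Age. Logistic regression does not require normally distributed features, so the robust-scaled version will be used because the scaling itself is less sensitive to outliers.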
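To illustrate why the robust-scaled result matches the unmodified one, here is a minimal sketch on synthetic data (a uniform sample, not the Age column). The helper name `ks_vs_fitted_normal` is hypothetical. It assumes `RobustScaler` is used with its default settings, which centre on the median and divide by the 25th to 75th percentile range.

```python
import numpy as np
from scipy.stats import kstest
from sklearn.preprocessing import RobustScaler

# Synthetic, deliberately non-normal sample standing in for a feature column.
rng = np.random.default_rng(0)
x = rng.uniform(20, 80, size=1500)

def ks_vs_fitted_normal(values):
    """KS test against a normal with the sample's own mean and std."""
    values = np.asarray(values, dtype=float)
    return kstest(values, "norm", args=(values.mean(), values.std(ddof=1)))

raw = ks_vs_fitted_normal(x)

# RobustScaler (defaults) computes (x - median) / IQR: a linear map.
scaled_x = RobustScaler().fit_transform(x.reshape(-1, 1)).ravel()
scaled = ks_vs_fitted_normal(scaled_x)

# Manual version of the same formula, for comparison.
q1, med, q3 = np.percentile(x, [25, 50, 75])
manual_x = (x - med) / (q3 - q1)

print(f"raw:    D={raw.statistic:.6f}, p={raw.pvalue:.3e}")
print(f"scaled: D={scaled.statistic:.6f}, p={scaled.pvalue:.3e}")
```

Because the test standardises each sample by its own mean and standard deviation, any positive linear rescaling cancels out, and the two results agree up to floating-point error.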

In [27]:
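# Two-sided Kolmogorov-Smirnov test of BMI against a fitted normal distribution.
# p > 0.05 means normality cannot be rejected (the sample is consistent with a normal distribution).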

stat, p = kstest(pd_dataset["BMI"], 'norm', args=(pd_dataset["BMI"].mean(), pd_dataset["BMI"].std()))
print(f"BMI Statistics={stat}, p-value={p}")

stat, p = kstest(np.log(pd_dataset["BMI"]), 'norm', args=(np.log(pd_dataset["BMI"]).mean(), np.log(pd_dataset["BMI"]).std()))
print(f"Log BMI Statistics={stat}, p-value={p}")

boxcox_result, _ = scipy.stats.boxcox(pd_dataset["BMI"])

stat, p = kstest(boxcox_result, 'norm', args=(boxcox_result.mean(), boxcox_result.std()))
print(f"Boxcox BMI Statistics={stat}, p-value={p}")

stat, p = kstest(np.sqrt(pd_dataset["BMI"]), 'norm', args=(np.sqrt(pd_dataset["BMI"]).mean(), np.sqrt(pd_dataset["BMI"]).std()))
print(f"Sqrt BMI Statistics={stat}, p-value={p}")

robust_scaler = RobustScaler()
pd_dataset["Scaled_BMI"] = robust_scaler.fit_transform(pd_dataset[["BMI"]])

stat, p = kstest(pd_dataset["Scaled_BMI"], 'norm', args=(pd_dataset["Scaled_BMI"].mean(), pd_dataset["Scaled_BMI"].std()))
print(f"Robust Scaler Scaled BMI Statistics={stat}, p-value={p}")


BMI Statistics=0.06083960549816192, p-value=2.8679341466106453e-05
Log BMI Statistics=0.08150339501149417, p-value=4.0836749374851926e-09
Boxcox BMI Statistics=0.06405942746565263, p-value=8.547553639087e-06
Sqrt BMI Statistics=0.06825455305362982, p-value=1.607386546593516e-06
Robust Scaler Scaled BMI Statistics=0.06083960549816192, p-value=2.8679341466106453e-05


## BMI Normalisation Attempt Results:

| Transformation | KS statistic | p-value |
|-----|-----|-----|
| Unmodified | 0.06083960549816192 | 2.8679341466106453e-05 |
| Log | 0.08150339501149417 | 4.0836749374851926e-09 |
| Box-Cox | 0.06405942746565263 | 8.547553639087e-06 |
| Square root | 0.06825455305362982 | 1.607386546593516e-06 |
| Robust scaler | 0.06083960549816192 | 2.8679341466106453e-05 |

None of the p-values exceed 0.05, so normality is rejected for BMI as well. As with Age, the robust-scaled version will be used for logistic regression, since scaling by the median and interquartile range is less sensitive to outliers.
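One way the robust-scaled features might feed into logistic regression is sketched below. This is not the notebook's own modelling code: the data is synthetic, the label is a placeholder loosely tied to the features, and the split settings are assumptions. The sketch also fits the scaler inside a pipeline on the training split only, which differs from the cells above, where the scaler is fitted on the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the Age and BMI columns (not the real dataset).
rng = np.random.default_rng(0)
n = 1500
demo = pd.DataFrame({
    "Age": rng.integers(20, 81, size=n),
    "BMI": rng.uniform(15, 40, size=n),
})

# Placeholder label, loosely tied to the features for illustration only.
logit = 0.03 * (demo["Age"] - 50) + 0.05 * (demo["BMI"] - 27)
demo["Diagnosis"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    demo[["Age", "BMI"]], demo["Diagnosis"],
    test_size=0.2, stratify=demo["Diagnosis"], random_state=42,
)

# The pipeline fits RobustScaler on the training data only, then reuses
# the learned median and IQR when transforming the test data.
model = make_pipeline(RobustScaler(), LogisticRegression())
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy on synthetic data: {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means the test split never influences the scaling parameters, which helps avoid optimistic evaluation results.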

In [28]:
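# Two-sided Kolmogorov-Smirnov test of PhysicalActivity against a fitted normal distribution.
# p > 0.05 means normality cannot be rejected (the sample is consistent with a normal distribution).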

stat, p = kstest(pd_dataset["PhysicalActivity"], 'norm', args=(pd_dataset["PhysicalActivity"].mean(), pd_dataset["PhysicalActivity"].std()))
print(f"PhysicalActivity Statistics={stat}, p-value={p}")

PhysicalActivity Statistics=0.06320157936067106, p-value=1.1873079156051877e-05
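The cells above repeat the same set of tests for each feature. A small helper could collect them in one place; the sketch below is one possible shape, run here on synthetic positive data. The function name `ks_normality_report` and its output column names are hypothetical. Like the cells above, it assumes strictly positive inputs, since the log and Box-Cox transforms require them.

```python
import numpy as np
import pandas as pd
from scipy import stats

def ks_normality_report(values):
    """Run the KS normality test on a sample and on common transforms of it.

    Each variant is compared with a normal distribution that uses the
    variant's own sample mean and standard deviation.
    """
    values = np.asarray(values, dtype=float)
    boxcox_values, _ = stats.boxcox(values)  # requires strictly positive data
    candidates = {
        "raw": values,
        "log": np.log(values),
        "sqrt": np.sqrt(values),
        "boxcox": np.asarray(boxcox_values),
    }
    rows = []
    for name, data in candidates.items():
        result = stats.kstest(
            data, "norm", args=(data.mean(), data.std(ddof=1))
        )
        rows.append({
            "transform": name,
            "statistic": result.statistic,
            "p_value": result.pvalue,
        })
    return pd.DataFrame(rows)

# Synthetic positive sample standing in for a feature column.
rng = np.random.default_rng(0)
demo_values = rng.uniform(0.5, 10, size=1500)
report = ks_normality_report(demo_values)
print(report)
```

With the real data, the same call could be made once per column, for example `ks_normality_report(pd_dataset["PhysicalActivity"])`, assuming that column contains no zero or negative values.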
