Diagnosing Patients with Dementia

You are a data scientist working for a healthcare organisation that aims to improve early detection of dementia. The organisation
has provided you with a rich dataset containing health information for 2,149 patients. Each patient is uniquely identified by a
Patient ID, and the dataset contains a variety of features such as demographic details, lifestyle factors, medical history, clinical
measurements, cognitive assessments, and symptoms.

Your task is to build a machine learning model to predict whether a patient has dementia based on the available data. This
classification model will help doctors identify high-risk patients and prioritise them for further diagnostic tests or interventions.

The dataset is named "dementia.csv" and can be downloaded from the "Project Datasets" folder on myLMS.

Appendix A has a detailed description of all the columns within the dataset.

1.1.
Perform the necessary preprocessing on the data to get it ready for training the machine learning model. This will
include removing missing values, encoding categorical variables, standardising features and any necessary feature engineering.
(5 marks)

In [None]:
# import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# load csv data
df = pd.read_csv("dementia.csv", index_col=0)

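# treat the placeholder strings 'None' and 'XXXConfid' as missing values
# (check Appendix A first: if 'None' is a valid category in some column,
#  this step followed by dropna() discards those patients)
df.replace({"None": np.nan, "XXXConfid": np.nan}, inplace=True)

# drop identifier/admin columns; errors="ignore" skips any that are absent
# (e.g. a column that already became the index via index_col=0)
df.drop(columns=["PatientID", "DoctorInCharge"], errors="ignore", inplace=True)

# drop rows with any missing values
df.dropna(inplace=True)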

# target and features
# Diagnosis may be stored as a boolean, strings, or numeric
diag = df["Diagnosis"]

if diag.dtype == object:
    # try map strings to ints
    y = diag.map({"Yes": 1, "No": 0})
    # if mapping left any NA (e.g. some values are numeric strings), try coercing to numeric
    if y.isnull().any():
        y = pd.to_numeric(diag, errors="coerce")
else:
    # numeric already (or boolean)
    y = pd.to_numeric(diag, errors="coerce")

# ensure no missing values remain in the target
if y.isnull().any():
    raise ValueError("Diagnosis column contains values that cannot be converted to a binary target. "
                     "Inspect df['Diagnosis'] for unexpected values.")

y = y.astype(int)
X = df.drop(columns=["Diagnosis"])

# encode categoricals with one-hot
cat_cols = X.select_dtypes(include="object").columns.tolist()
X = pd.get_dummies(X, columns=cat_cols, drop_first=True)

# standardise numeric features
num_cols = X.select_dtypes(include=[np.number]).columns
scaler = StandardScaler()
X[num_cols] = scaler.fit_transform(X[num_cols])

# train/test split for downstream work
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Preprocessing done. Shapes:", X_train.shape, X_test.shape, y_train.shape)

Preprocessing done. Shapes: (1362, 35) (341, 35) (1362,)
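
The cell above standardises the full table before splitting, so the test rows contribute to the scaler's mean and variance. A minimal sketch of one alternative is to split first and fit the transformers on the training rows only. It assumes a toy table with made-up columns (`Age`, `MMSE`, `Gender`) standing in for dementia.csv, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy stand-in for dementia.csv; column names and values are illustrative only
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(60, 90, size=n),
    "MMSE": rng.uniform(0, 30, size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Diagnosis": rng.integers(0, 2, size=n),
})

X_toy = toy.drop(columns=["Diagnosis"])
y_toy = toy["Diagnosis"]

# split first, so the test rows never influence the fitted transformers
Xtr_raw, Xte_raw, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

pre = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Age", "MMSE"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

Xtr_prep = pre.fit_transform(Xtr_raw)  # statistics learned from training rows only
Xte_prep = pre.transform(Xte_raw)      # test rows reuse those statistics

print("Shapes:", Xtr_prep.shape, Xte_prep.shape)
```

With this ordering the training columns have zero mean by construction, while the test columns are only approximately centred, which is the expected behaviour for unseen data.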


1.2.
Implement both of the following approaches for feature selection:

a) A filter-based method using SelectKBest

b) A wrapper-based method using Recursive Feature Elimination (RFE) with a suitable estimator

Your implementation should clearly show the selected features in each case.
(10 marks)

In [4]:

from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

# use preprocessed train split
X = X_train.copy()
y = y_train.copy().squeeze()

# choose a small k for simplicity
k = min(10, X.shape[1])

# a) Filter: SelectKBest
skb = SelectKBest(score_func=f_classif, k=k).fit(X, y)
skb_selected = list(X.columns[skb.get_support()])
print(f"SelectKBest (k={k}) selected {len(skb_selected)} features:")
for f in skb_selected:
    print(" -", f)

# b) Wrapper: RFE 
est = LogisticRegression(max_iter=1000, solver="liblinear")
rfe = RFE(estimator=est, n_features_to_select=k).fit(X, y)
rfe_selected = list(X.columns[rfe.support_])
print(f"\nRFE (n_features={k}) selected {len(rfe_selected)} features:")
for f in rfe_selected:
    print(" -", f)


SelectKBest (k=10) selected 10 features:
 - HeadInjury
 - Hypertension
 - CholesterolLDL
 - CholesterolHDL
 - MMSE
 - FunctionalAssessment
 - MemoryComplaints
 - BehavioralProblems
 - ADL
 - Gender_Male

RFE (n_features=10) selected 10 features:
 - CholesterolLDL
 - MMSE
 - FunctionalAssessment
 - MemoryComplaints
 - BehavioralProblems
 - ADL
 - Ethnicity_Other
 - EducationLevel_Higher
 - Smoking_Yes
 - Forgetfulness_Yes
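
To make the agreement between the two methods explicit, the printed lists can be compared as sets. The sketch below copies the feature names from the output above; nothing else is assumed:

```python
# feature names copied from the SelectKBest and RFE output above
skb_features = [
    "HeadInjury", "Hypertension", "CholesterolLDL", "CholesterolHDL", "MMSE",
    "FunctionalAssessment", "MemoryComplaints", "BehavioralProblems", "ADL",
    "Gender_Male",
]
rfe_features = [
    "CholesterolLDL", "MMSE", "FunctionalAssessment", "MemoryComplaints",
    "BehavioralProblems", "ADL", "Ethnicity_Other", "EducationLevel_Higher",
    "Smoking_Yes", "Forgetfulness_Yes",
]

shared = sorted(set(skb_features) & set(rfe_features))
only_skb = sorted(set(skb_features) - set(rfe_features))
only_rfe = sorted(set(rfe_features) - set(skb_features))

print(f"Selected by both ({len(shared)}):", shared)
print(f"SelectKBest only ({len(only_skb)}):", only_skb)
print(f"RFE only ({len(only_rfe)}):", only_rfe)
```

Six of the ten features appear in both lists: ADL, BehavioralProblems, CholesterolLDL, FunctionalAssessment, MMSE and MemoryComplaints.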


1.3.
For both RFE and SelectKBest:

- Discuss the advantages and limitations of the methods
(5 marks)

- Explain the process for selecting the optimal number of features for each approach
(5 marks)

SelectKBest: advantages

- Fast and simple: each feature is scored on its own with a univariate test (here f_classif), so no model has to be trained.
- Model-agnostic, easy to interpret and reproducible, because the ranking depends only on the data and the chosen score function.

SelectKBest: limitations

- Ignores feature interactions and multicollinearity.
- May select redundant features that each have a high univariate score.

RFE (wrapper): advantages

- Considers features in combination: the estimator is refitted each round and the weakest feature(s), judged by coefficient or importance, are dropped.
- Tends to produce feature sets tailored to the chosen estimator.

RFE: limitations

- Computationally expensive when there are many features.
- Model-dependent (the selection may not transfer to other estimators) and can overfit if not validated.
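
The redundancy limitation of SelectKBest can be illustrated on synthetic data. In this sketch (all column names are hypothetical) `x0_copy` is a near-duplicate of `x0`, while `x2` carries weaker but independent signal. Because each column is scored in isolation, the selector is expected to keep both copies and discard `x2`:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 500
x0 = rng.normal(size=n)
x2 = rng.normal(size=n)

demo = pd.DataFrame({
    "x0": x0,
    "x0_copy": x0 + 0.01 * rng.normal(size=n),  # near-duplicate of x0
    "x2": x2,                                   # weaker, independent signal
    "noise": rng.normal(size=n),                # unrelated to the label
})
label = (x0 + 0.5 * x2 > 0).astype(int)

selector = SelectKBest(score_func=f_classif, k=2).fit(demo, label)
picked = list(demo.columns[selector.get_support()])

print("F-scores:", dict(zip(demo.columns, selector.scores_.round(1))))
print("Selected:", picked)
```

A wrapper method or a correlation filter applied beforehand would be one way to avoid spending two of the k slots on the same information.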

How to choose k for SelectKBest

- Sweep k over a grid of values and evaluate a downstream cross-validated metric (accuracy or AUC) on X_train.
- Plot the score against k and pick the k at the maximum or where performance plateaus (or apply domain constraints).
- Use nested or held-out CV so that the reported score is not optimistic.
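
The sweep over k can be sketched as a grid search over a pipeline, so that the selector is refitted inside every CV fold. This is an illustration on synthetic data from `make_classification`, not the dementia dataset; the grid of k values and the choice of AUC as the metric are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_redundant=2, random_state=0
)

# selection lives inside the pipeline, so each fold selects on its own training part
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

k_grid = [2, 5, 10, 15, 20]
search = GridSearchCV(
    pipe,
    param_grid={"select__k": k_grid},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

for k, score in zip(k_grid, search.cv_results_["mean_test_score"]):
    print(f"k={k:2d}  mean CV AUC={score:.3f}")
print("best k:", search.best_params_["select__k"])
```

Plotting the printed scores against k gives the score-versus-k curve described in the list.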

How to choose the number of features for RFE

- Use RFECV (RFE with cross-validation) to pick the number of features automatically.
- Alternatively, run RFE over a grid of n_features_to_select values and compare CV performance; choose the smallest feature set whose score is within one standard error of the best (the 1-SE rule).
- Always validate the final selection on an independent test set to check generalisation.
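
A minimal RFECV sketch follows, again on synthetic data rather than the dementia dataset. The estimator, fold count and scoring metric are assumptions chosen to mirror the earlier cells:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# synthetic stand-in for X_train / y_train
X_rfe, y_rfe = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_redundant=2, random_state=0
)

rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,                      # drop one feature per round
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",
    min_features_to_select=1,
)
rfecv.fit(X_rfe, y_rfe)

kept = [i for i, keep in enumerate(rfecv.support_) if keep]
print("number of features kept:", rfecv.n_features_)
print("column indices kept:", kept)
```

RFECV keeps the feature count with the best mean CV score, so applying the 1-SE rule would need an extra step on top of this.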

In [6]:
# question 2 A
import os

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score


def load_feature_list(fname, fallback_cols):
    """Read a saved feature list from a one-column CSV, or return the fallback."""
    if os.path.exists(fname):
        return pd.read_csv(fname, header=None).iloc[:, 0].astype(str).tolist()
    return fallback_cols


# reuse the feature lists from 1.2 when they are still in memory;
# otherwise load them from disk, falling back to the first 10 columns
fallback = list(X_train.columns[:10])
skb_selected = globals().get("skb_selected", None)
rfe_selected = globals().get("rfe_selected", None)

if skb_selected is None:
    skb_selected = load_feature_list("selectkbest_selected_features.csv", fallback)
if rfe_selected is None:
    rfe_selected = load_feature_list("rfecv_selected_features.csv", fallback)

models = {"SelectKBest": skb_selected, "RFE": rfe_selected}

# fit one logistic regression per feature set and score it on the held-out test split
for name, feats in models.items():
    Xtr = X_train[feats]
    Xte = X_test[feats]
    clf = LogisticRegression(max_iter=1000, solver="liblinear")
    clf.fit(Xtr, y_train)
    preds = clf.predict(Xte)
    probs = clf.predict_proba(Xte)[:, 1]  # probability of the positive class
    acc = accuracy_score(y_test, preds)
    auc = roc_auc_score(y_test, probs)
    print(f"{name}: n_features={len(feats)}  Accuracy={acc:.3f}  AUC={auc:.3f}")

SelectKBest: n_features=10  Accuracy=0.880  AUC=0.940
RFE: n_features=10  Accuracy=0.865  AUC=0.943


In [None]:
# question 2 B
import os

import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.tree import DecisionTreeClassifier


def load_feature_list(fname, fallback_cols):
    """Read a saved feature list from a one-column CSV, or return the fallback."""
    if os.path.exists(fname):
        return pd.read_csv(fname, header=None).iloc[:, 0].astype(str).tolist()
    return fallback_cols


# reuse the feature lists from 1.2 when available, as in question 2 A
fallback = list(X_train.columns[:10])
skb_selected = globals().get("skb_selected", None)
rfe_selected = globals().get("rfe_selected", None)

if skb_selected is None:
    skb_selected = load_feature_list("selectkbest_selected_features.csv", fallback)
if rfe_selected is None:
    rfe_selected = load_feature_list("rfecv_selected_features.csv", fallback)

# fit one shallow decision tree per feature set and score it on the held-out test split
for name, feats in {"SelectKBest": skb_selected, "RFE": rfe_selected}.items():
    Xtr = X_train[feats]
    Xte = X_test[feats]
    clf = DecisionTreeClassifier(max_depth=5, random_state=42)
    clf.fit(Xtr, y_train)
    preds = clf.predict(Xte)
    probs = clf.predict_proba(Xte)[:, 1]  # probability of the positive class
    acc = accuracy_score(y_test, preds)
    auc = roc_auc_score(y_test, probs)
    print(f"{name}: n_features={len(feats)}  Accuracy={acc:.3f}  AUC={auc:.3f}")


SelectKBest: n_features=10  Accuracy=0.947  AUC=0.949
RFE: n_features=10  Accuracy=0.935  AUC=0.931
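
Each score above comes from a single 80/20 split, so gaps of a point or two between feature sets may not be stable. One hedged way to check is to cross-validate both classifiers, with the feature selection inside the pipeline. The sketch uses synthetic data and an assumed k of 10, so its numbers say nothing about the dementia results:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the training data
X_cmp, y_cmp = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000, solver="liblinear"),
    "DecisionTree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

summary = {}
for name, clf in candidates.items():
    # the selector is refitted in every fold, so no fold sees its own test rows
    pipe = Pipeline([("select", SelectKBest(f_classif, k=10)), ("clf", clf)])
    scores = cross_validate(pipe, X_cmp, y_cmp, cv=cv, scoring=["accuracy", "roc_auc"])
    acc = scores["test_accuracy"]
    auc = scores["test_roc_auc"]
    summary[name] = (acc.mean(), auc.mean())
    print(f"{name}: Accuracy={acc.mean():.3f} +/- {acc.std():.3f}  "
          f"AUC={auc.mean():.3f} +/- {auc.std():.3f}")
```

The per-fold standard deviation gives a rough sense of how much of a difference between models is noise.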
