# Scores Classification - Predicting K-Means Clusters

-----

This notebook classifies the questionnaire edits from their scores. The targets are the cluster labels produced by the cluster model in `/example_notebooks/03-kmeans-clustering.ipynb`. A 10-fold cross-validation is repeated with 10 different random seeds, so 100 classifiers are trained for each model.
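
The repeated cross-validation scheme can be sketched with plain scikit-learn. This is only an illustration under assumptions: the data is synthetic, the seeds are arbitrary, and stratified splitting via `StratifiedKFold` is assumed here rather than taken from the `qutools` implementation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption); the notebook uses questionnaire scores.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

seeds = range(10)  # hypothetical seeds, one 10-fold split per seed
fold_scores = []
for seed in seeds:
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        # A fresh classifier is trained for every fold of every seed.
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[train_idx], y[train_idx])
        fold_scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(len(fold_scores))  # 10 seeds x 10 folds = 100 test scores
```

Each of the 100 scores comes from a separately trained classifier, which is why the metric lists collected later in the notebook hold 100 values per metric.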

## Load Data

In [72]:
import numpy as np

import qutools.data as qtdata
from qutools.test_classifier import QuScoresClassifier
from qutools.test_classifier import QuClassifierResults
from qutools.core.classifier import ScikitClassifier

In [None]:
quconfig = qtdata.QuConfig.from_yaml("quconfigs/physics-pck.yaml")

qudata = qtdata.QuData(
    quconfig=quconfig,
    df_txt="<NOT IN THIS REPO>/pck-booklets.xlsx",
    df_scr="<NOT IN THIS REPO>/pck-scores.xlsx",
)

df_target = qtdata.read_data("<NOT IN THIS REPO>/clusters_all.csv")

[99, '99', '', '-', '--', 'NA', 'Na', 'na', 'Null', 'null', ' ', 'nan', 'NaN', 'Nan', 'NAN', nan].
Checked ID-matches. ✓
Validated ID-columns. ✓
Validated text-columns. ✓
Cleaning text-data whitespaces. ✓
All scores in correct ranges. ✓
Validated score-columns. ✓


## Classification Analysis

In [74]:
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, f1_score, cohen_kappa_score
from scipy.stats import norm

In [75]:
random_states = np.random.randint(0, 1000000, 10)
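
The seeds above are drawn from NumPy's unseeded global generator, so they differ between runs. If reproducible seeds are wanted, one option is a seeded generator. A minimal sketch, in which the seed value `42` is an arbitrary assumption:

```python
import numpy as np

# Hypothetical fixed seed; any integer works.
rng = np.random.default_rng(42)
random_states = rng.integers(0, 1_000_000, size=10)

# Re-creating the generator with the same seed reproduces the same draw.
random_states_again = np.random.default_rng(42).integers(0, 1_000_000, size=10)
print(np.array_equal(random_states, random_states_again))
```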

In [76]:
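def print_normal_ci(name: str, data: list[float], alpha: float = 0.95):
    """Print the mean and a normal-approximation CI; `alpha` is the confidence level."""
    mean = np.mean(data)
    std = np.std(data)  # population standard deviation (ddof=0)
    z = norm.ppf(1 - (1 - alpha) / 2)  # two-sided critical value, about 1.96 for 0.95
    ci = z * std / np.sqrt(len(data))  # half-width: z times the standard error
    print(f"{name}\n - mean {mean:.3f} - ci: [{mean - ci:.3f}, {mean + ci:.3f}]")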


def collect_prediction_metrics(quclf: QuScoresClassifier):
    """Run one cross-validation per seed and print CIs over the per-split test metrics."""
    metrics = {
        "accuracy": [],
        "f1": [],
        "kappa": [],
    }
    for rs in random_states:
        qcr = quclf.random_cross_validate(
            qudata=qudata, 
            df_target=df_target,
            df_strat=df_target[["ID", "cluster"]],
            strat_col="cluster",
            random_state=rs,
            verbose_split=False,
        )
        df_tst = qcr.df_preds[qcr.df_preds["mode"] == "test"]
        for split in df_tst["split"].unique():
            y_true = df_tst[df_tst["split"] == split]["cluster"]
            y_pred = df_tst[df_tst["split"] == split]["cluster_pred"]
            metrics["accuracy"].append(accuracy_score(y_true, y_pred))
            metrics["f1"].append(f1_score(y_true, y_pred, average="weighted"))
            metrics["kappa"].append(cohen_kappa_score(y_true, y_pred))
    
    for name, data in metrics.items():
        print_normal_ci(name, data)
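
The interval printed by `print_normal_ci` is the mean plus or minus the critical value z times std / sqrt(n). The sketch below recomputes it on made-up scores and returns the bounds instead of printing them. The name `normal_ci` and the sample values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm


def normal_ci(data, level=0.95):
    """Return (mean, lower, upper) of a normal-approximation confidence interval."""
    data = np.asarray(data, dtype=float)
    mean = data.mean()
    z = norm.ppf(1 - (1 - level) / 2)  # two-sided critical value
    half_width = z * data.std() / np.sqrt(data.size)  # ddof=0, as in the cell above
    return mean, mean - half_width, mean + half_width


scores = [0.90, 0.92, 0.94, 0.96, 0.98]  # made-up fold scores
mean, low, high = normal_ci(scores)
print(f"mean {mean:.3f} - ci: [{low:.3f}, {high:.3f}]")
```

Fold scores from repeated cross-validation share most of their training data and are therefore not independent, so an interval of this kind is likely narrower than the true uncertainty.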

### Dummy Classifier

In [77]:
quclf_dummy = QuScoresClassifier(
    model=ScikitClassifier(DummyClassifier(strategy="most_frequent")),
    target_name="cluster",
    feature_names=quconfig.get_task_names(),
)

In [78]:
collect_prediction_metrics(quclf_dummy)

Found 846 common IDs in the passed `qudata` and target data..


100%|██████████| 10/10 [00:00<00:00, 181.44it/s]



accuracy
 - mean 0.453 - ci: [0.452, 0.454]
f1
 - mean 0.282 - ci: [0.281, 0.283]
kappa
 - mean 0.000 - ci: [0.000, 0.000]
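
These baseline values follow directly from the `most_frequent` strategy. If the majority class has share p, a constant prediction reaches accuracy p, a weighted F1 of 2p² / (1 + p), and a Cohen's kappa of 0. With p ≈ 0.453 the F1 formula gives about 0.282, which agrees with the output above. A small check on made-up labels (the class shares are assumptions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

# Made-up labels: class 0 is the majority with a share of 0.45.
y_true = np.array([0] * 45 + [1] * 30 + [2] * 25)
y_pred = np.zeros_like(y_true)  # always predict the majority class

p = np.mean(y_true == 0)
acc = accuracy_score(y_true, y_pred)
# zero_division=0 silences the warning for classes that are never predicted.
f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
kappa = cohen_kappa_score(y_true, y_pred)

print(f"accuracy {acc:.3f}, expected {p:.3f}")
print(f"weighted f1 {f1:.3f}, expected {2 * p**2 / (1 + p):.3f}")
print(f"kappa {kappa:.3f}")
```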





### Logistic Regression

In [79]:
quclf_lr = QuScoresClassifier(
    model=ScikitClassifier(LogisticRegression(penalty="l2", C=1.0, max_iter=200)),
    target_name="cluster",
    feature_names=quconfig.get_task_names(),
)

In [80]:
collect_prediction_metrics(quclf_lr)

Found 846 common IDs in the passed `qudata` and target data..


100%|██████████| 10/10 [00:00<00:00, 38.79it/s]



accuracy
 - mean 0.943 - ci: [0.939, 0.948]
f1
 - mean 0.943 - ci: [0.938, 0.948]
kappa
 - mean 0.918 - ci: [0.911, 0.925]



