# Clinical Modality — Preprocessing 

**Goal:** Build clinical feature matrices that are **aligned to the RNA split** (same `patient_id` order),
then preprocess with **train-only fitting** (impute → encode → scale) and save artifacts for early/late fusion.


In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
import joblib

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

BASE = Path("/home/steps4growth/gmriechi/Lung-Cancer-Subtyping")

# Outputs for clinical modality
OUT_DIR = BASE / "Data" / "02_preprocessed" / "clinical"
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Cleaned clinical table
CLINICAL_CSV = BASE / "Data" / "clinical_cleaned.csv"

# Patient-level IDs from the RNA split
RNA_DIR = BASE / "Data" / "02_preprocessed" / "rna"
RNA_TRAIN_IDS = RNA_DIR / "rna_train_patient_ids.npy"
RNA_TEST_IDS  = RNA_DIR / "rna_test_patient_ids.npy"

# Labels from the RNA split (so the clinical modality uses the SAME y)
RNA_Y_TRAIN = RNA_DIR / "rna_y_train.npy"
RNA_Y_TEST  = RNA_DIR / "rna_y_test.npy"

print("OUT_DIR:", OUT_DIR)
print("Clinical CSV exists?:", CLINICAL_CSV.exists())
print("RNA train ids exists?:", RNA_TRAIN_IDS.exists())
print("RNA test ids exists? :", RNA_TEST_IDS.exists())
print("RNA y train exists?  :", RNA_Y_TRAIN.exists())
print("RNA y test exists?   :", RNA_Y_TEST.exists())


OUT_DIR: /home/steps4growth/gmriechi/Lung-Cancer-Subtyping/Data/02_preprocessed/clinical
Clinical CSV exists?: True
RNA train ids exists?: True
RNA test ids exists? : True
RNA y train exists?  : True
RNA y test exists?   : True


## 2) Decide the alignment key (CRITICAL)

Clinical data **must align exactly** with RNA and methylation.

We will:
- Use **`patient_id`** as the alignment key
- Load RNA train/test patient IDs
- Subset clinical data to:
  - RNA train patients
  - RNA test patients
- **Never re-split clinical independently**

**Output:** aligned raw clinical train/test dataframes
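
The alignment rule above can be sketched on toy data as follows. The helper name `align_to_ids` and the toy IDs are hypothetical. The sketch checks for missing and duplicated IDs explicitly before reordering, because a label-based lookup raises on missing IDs and repeats rows for duplicated ones.

```python
import numpy as np
import pandas as pd


def align_to_ids(df: pd.DataFrame, ids, key: str = "patient_id") -> pd.DataFrame:
    """Return the rows of df for `ids`, in exactly the order of `ids`.

    Hypothetical helper for illustration; not part of the notebook.
    """
    ids = list(ids)
    if df[key].duplicated().any():
        raise ValueError(f"duplicate values in {key!r}")
    missing = sorted(set(ids) - set(df[key]))
    if missing:
        raise KeyError(f"{len(missing)} IDs not found, e.g. {missing[:3]}")
    out = df.set_index(key).loc[ids].reset_index()
    # Row order must match the reference split exactly
    assert out[key].tolist() == ids
    return out


# Toy data (made up): clinical rows arrive in arbitrary order
toy_clin = pd.DataFrame({
    "patient_id": ["P3", "P1", "P4", "P2"],
    "age": [61, 70, 55, 48],
})
toy_train_ids = np.array(["P1", "P2", "P3"], dtype=object)

toy_aligned = align_to_ids(toy_clin, toy_train_ids)
print(toy_aligned)
```

Failing loudly on a missing or duplicated ID is a design choice: a silent mismatch here would misalign features and labels for every downstream fusion step.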

In [2]:
# Load clinical data
clin_df = pd.read_csv(CLINICAL_CSV)

print("Raw clinical shape:", clin_df.shape)
print("Columns include patient_id:", "patient_id" in clin_df.columns)

# Load RNA split patient IDs
train_patient_ids = np.load(RNA_TRAIN_IDS, allow_pickle=True)
test_patient_ids  = np.load(RNA_TEST_IDS, allow_pickle=True)

print("\nRNA split:")
print("Train patients:", len(train_patient_ids))
print("Test patients :", len(test_patient_ids))

# Subset clinical to RNA patients
clin_train_df = clin_df[clin_df["patient_id"].isin(train_patient_ids)].copy()
clin_test_df  = clin_df[clin_df["patient_id"].isin(test_patient_ids)].copy()

# Enforce identical ordering as RNA
clin_train_df = clin_train_df.set_index("patient_id").loc[train_patient_ids].reset_index()
clin_test_df  = clin_test_df.set_index("patient_id").loc[test_patient_ids].reset_index()

print("\nAligned clinical shapes:")
print("Train:", clin_train_df.shape)
print("Test :", clin_test_df.shape)

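# Sanity checks: same number of rows AND same patient order as the RNA split
assert clin_train_df.shape[0] == len(train_patient_ids)
assert clin_test_df.shape[0] == len(test_patient_ids)
assert clin_train_df["patient_id"].tolist() == list(train_patient_ids)
assert clin_test_df["patient_id"].tolist() == list(test_patient_ids)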

print("\nAlignment check PASSED")

Raw clinical shape: (1089, 99)
Columns include patient_id: True

RNA split:
Train patients: 660
Test patients : 165

Aligned clinical shapes:
Train: (660, 99)
Test : (165, 99)

Alignment check PASSED


## 3) Drop non-feature columns (avoid leakage + keep only predictors)

Now that the clinical table is aligned to the **RNA split**, we remove columns that should **not** be used as model features.

We drop:
- Identifiers (IDs, barcodes, project labels)
- Any columns that directly encode the outcome or are post-diagnosis follow-up outcomes
- Pure admin/metadata fields

At this stage:
- Keep raw values (no imputation/encoding yet)
- Define a clean feature set consistently for train and test

**Output:** `clin_train_X_df`, `clin_test_X_df` (predictor-only clinical tables)

In [3]:
# Columns that are NEVER model inputs (IDs / metadata)
DROP_ALWAYS = {
    "patient_id",
    "bcr_patient_barcode",
    "project",
    "submitter_id",
}

# Common outcome/leakage columns that should not be used as predictors
DROP_LEAKAGE_CANDIDATES = {
    # survival/outcome fields
    "vital_status",
    "days_to_death",
    "days_to_last_follow_up",
    "last_known_disease_status",
    "days_to_last_known_disease_status",
    "progression_or_recurrence",
    "days_to_recurrence",
    "days_to_progression",
    "days_to_progression_free_survival",
    "days_to_disease_free_survival",
    "disease_free_status",
    "overall_survival",
    "overall_survival_time",
    "disease_free_survival",
    "disease_free_survival_time",
    "progression_free_survival",
    "progression_free_survival_time",
    # anything that might directly encode label (sometimes present)
    "subtype",
    "subtype_simple",
    "tumor_type",
    "cancer_type",
}

# Drop only what exists
drop_cols = [c for c in (DROP_ALWAYS | DROP_LEAKAGE_CANDIDATES) if c in clin_train_df.columns]

print("Dropping columns (present):", drop_cols)

clin_train_X_df = clin_train_df.drop(columns=drop_cols).copy()
clin_test_X_df  = clin_test_df.drop(columns=drop_cols).copy()

print("\nClinical predictors shape:")
print("Train:", clin_train_X_df.shape)
print("Test :", clin_test_X_df.shape)

# Quick check: train/test columns identical
assert list(clin_train_X_df.columns) == list(clin_test_X_df.columns)

print("\nRemaining columns (first 20):")
print(list(clin_train_X_df.columns)[:20])


Dropping columns (present): ['days_to_death', 'days_to_last_known_disease_status', 'bcr_patient_barcode', 'vital_status', 'days_to_last_follow_up', 'project', 'submitter_id', 'progression_or_recurrence', 'last_known_disease_status', 'patient_id', 'days_to_recurrence']

Clinical predictors shape:
Train: (660, 88)
Test : (165, 88)

Remaining columns (first 20):
['morphology', 'days_to_diagnosis', 'created_datetime', 'tissue_or_organ_of_origin', 'age_at_diagnosis', 'primary_diagnosis', 'classification_of_tumor', 'tumor_of_origin', 'updated_datetime', 'diagnosis_id', 'site_of_resection_or_biopsy', 'state', 'prior_treatment', 'diagnosis_is_primary_disease', 'synchronous_malignancy', 'ajcc_pathologic_stage', 'laterality', 'prior_malignancy', 'year_of_diagnosis', 'ajcc_staging_system_edition']
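
The remaining columns still include fields that look like record metadata or identifiers rather than patient characteristics (`created_datetime`, `updated_datetime`, `diagnosis_id`, `state`). If the label is the histological subtype, `morphology` and `primary_diagnosis` may also encode it directly. One way to surface such columns is a train-only cardinality audit, sketched below on toy data. The function name `audit_cardinality` and the 0.5 ratio cutoff are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd


def audit_cardinality(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Summarise non-numeric columns and flag ID-like ones.

    A column is flagged when its number of distinct non-null values exceeds
    max_unique_ratio * number of non-null rows (hypothetical cutoff).
    """
    rows = []
    for col in df.select_dtypes(exclude="number").columns:
        non_null = df[col].dropna()
        n_unique = int(non_null.nunique())
        ratio = n_unique / max(len(non_null), 1)
        rows.append({
            "column": col,
            "n_unique": n_unique,
            "unique_ratio": ratio,
            "id_like": ratio > max_unique_ratio,
        })
    return pd.DataFrame(rows).set_index("column")


# Toy data (made up values; column names borrowed from the list above)
toy = pd.DataFrame({
    "diagnosis_id": ["a1", "b2", "c3", "d4", "e5", "f6"],
    "laterality": ["Left", "Right", "Left", "Left", "Right", None],
    "age_at_diagnosis": [60.0, 71.5, 55.2, 48.9, 66.0, 59.3],
})
report = audit_cardinality(toy)
print(report)

id_like_cols = report[report["id_like"]].index.tolist()
print("ID-like columns:", id_like_cols)
```

Columns flagged this way are candidates for manual review rather than automatic removal, since a legitimately rich categorical variable could also exceed the cutoff on a small cohort.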


## 4) Missingness audit (train-only) and drop extremely sparse features

Clinical tables typically contain many missing values.

We will:
- Compute **missingness using TRAIN data only**
- Drop columns with **very high missingness** (e.g. > 40%)
- Apply the **same column drop** to test data

**Output:** reduced clinical feature tables with acceptable missingness

In [4]:
# Compute missing fraction per column using TRAIN data only
missing_frac = clin_train_X_df.isna().mean().sort_values(ascending=False)

# Missingness threshold: 40% is a judgment call for this dataset, not a fixed standard
MISSING_THRESH = 0.40

cols_to_drop_missing = missing_frac[missing_frac > MISSING_THRESH].index.tolist()

print(f"Columns with > {int(MISSING_THRESH*100)}% missing (TRAIN only):")
print(cols_to_drop_missing)
print("\nCount:", len(cols_to_drop_missing))

# Drop these columns from BOTH train and test
clin_train_X_df = clin_train_X_df.drop(columns=cols_to_drop_missing)
clin_test_X_df  = clin_test_X_df.drop(columns=cols_to_drop_missing)

print("\nAfter missingness filtering:")
print("Train shape:", clin_train_X_df.shape)
print("Test  shape:", clin_test_X_df.shape)

# Sanity check
assert list(clin_train_X_df.columns) == list(clin_test_X_df.columns)

print("\nRemaining columns:", clin_train_X_df.shape[1])

Columns with > 40% missing (TRAIN only):
['created_datetime', 'tumor_of_origin', 'cigarettes_per_day', 'treatments_radiation_route_of_administration', 'treatments_radiation_prescribed_dose_units', 'treatments_radiation_prescribed_dose', 'treatments_radiation_number_of_cycles', 'treatments_radiation_therapeutic_agents', 'treatments_pharmaceutical_number_of_fractions', 'treatments_pharmaceutical_treatment_dose', 'treatments_pharmaceutical_treatment_dose_units', 'treatments_pharmaceutical_treatment_anatomic_sites', 'treatments_pharmaceutical_course_number', 'treatments_radiation_clinical_trial_indicator', 'treatments_pharmaceutical_initial_disease_status', 'year_of_birth', 'year_of_death', 'alcohol_history', 'NA.', 'alcohol_intensity', 'figo_stage', 'figo_staging_edition_year', 'tumor_grade', 'treatments_pharmaceutical_route_of_administration', 'treatments_pharmaceutical_prescribed_dose_units', 'treatments_pharmaceutical_prescribed_dose', 'treatments_pharmaceutical_number_of_cycles', 'tre

## 5) Separate numeric vs categorical features (schema definition)

Before we impute or encode anything, we must **explicitly define feature types**.

We will:
- Identify **numeric** features (continuous / integer)
- Identify **categorical** features (strings, enums)
- Freeze this schema so it is identical for train and test

**Output:**  
- `numeric_features`  
- `categorical_features`

In [5]:
# Identify numeric columns
numeric_features = clin_train_X_df.select_dtypes(
    include=["int64", "float64", "int32", "float32"]
).columns.tolist()

# Everything else (object, bool, ...) is treated as categorical
categorical_features = [
    c for c in clin_train_X_df.columns if c not in numeric_features
]

print("Numeric features:", len(numeric_features))
print("Categorical features:", len(categorical_features))

print("\nFirst 10 numeric features:")
print(numeric_features[:10])

print("\nFirst 10 categorical features:")
print(categorical_features[:10])

assert len(numeric_features) + len(categorical_features) == clin_train_X_df.shape[1]
assert set(clin_train_X_df.columns) == set(numeric_features + categorical_features)

print("\nFeature type separation PASSED")

Numeric features: 6
Categorical features: 38

First 10 numeric features:
['days_to_diagnosis', 'age_at_diagnosis', 'year_of_diagnosis', 'pack_years_smoked', 'age_at_index', 'days_to_birth']

First 10 categorical features:
['morphology', 'tissue_or_organ_of_origin', 'primary_diagnosis', 'classification_of_tumor', 'updated_datetime', 'diagnosis_id', 'site_of_resection_or_biopsy', 'state', 'prior_treatment', 'diagnosis_is_primary_disease']

Feature type separation PASSED


## 6A) Impute missing values (TRAIN only)

We will:
- Numeric columns: fill missing with **train median**
- Categorical columns: fill missing with **train most-frequent**
- Apply the same learned values to test

**Output:**
- `train_num_imp`, `test_num_imp`
- `train_cat_imp`, `test_cat_imp`
- Saved imputation stats

In [6]:
# Split numeric/categorical dataframes
train_num = clin_train_X_df[numeric_features].copy()
test_num  = clin_test_X_df[numeric_features].copy()

train_cat = clin_train_X_df[categorical_features].copy()
test_cat  = clin_test_X_df[categorical_features].copy()

# ---- Numeric imputation: median (train only)
num_medians = train_num.median(numeric_only=True)
train_num_imp = train_num.fillna(num_medians)
test_num_imp  = test_num.fillna(num_medians)

# ---- Categorical imputation: most frequent (train only)
cat_modes = train_cat.mode(dropna=True).iloc[0]  # one mode per column
train_cat_imp = train_cat.fillna(cat_modes)
test_cat_imp  = test_cat.fillna(cat_modes)

print("Imputed numeric shapes:", train_num_imp.shape, test_num_imp.shape)
print("Imputed categorical shapes:", train_cat_imp.shape, test_cat_imp.shape)

# Sanity: no missing after imputation in these blocks
print("\nRemaining NaNs (numeric): train", int(train_num_imp.isna().sum().sum()), "test", int(test_num_imp.isna().sum().sum()))
print("Remaining NaNs (categorical): train", int(train_cat_imp.isna().sum().sum()), "test", int(test_cat_imp.isna().sum().sum()))

# Save imputation artifacts
num_medians.to_csv(OUT_DIR / "clin_numeric_medians.csv")
cat_modes.to_csv(OUT_DIR / "clin_categorical_modes.csv")

print("\nSaved:")
print(" -", OUT_DIR / "clin_numeric_medians.csv")
print(" -", OUT_DIR / "clin_categorical_modes.csv")


Imputed numeric shapes: (660, 6) (165, 6)
Imputed categorical shapes: (660, 38) (165, 38)

Remaining NaNs (numeric): train 0 test 0
Remaining NaNs (categorical): train 0 test 0

Saved:
 - /home/steps4growth/gmriechi/Lung-Cancer-Subtyping/Data/02_preprocessed/clinical/clin_numeric_medians.csv
 - /home/steps4growth/gmriechi/Lung-Cancer-Subtyping/Data/02_preprocessed/clinical/clin_categorical_modes.csv
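
The same train-only imputation can be expressed with scikit-learn's `SimpleImputer`, which keeps the learned fill values on the fitted object (`statistics_`). This is a sketch on made-up toy data, not the code that produced the outputs above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up); object dtype keeps NaN as the missing marker for strings
toy_train = pd.DataFrame({
    "age": [60.0, np.nan, 70.0, 50.0],
    "stage": np.array(["I", "II", np.nan, "I"], dtype=object),
})
toy_test = pd.DataFrame({
    "age": [np.nan, 65.0],
    "stage": np.array([np.nan, "III"], dtype=object),
})

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# Fit on train only, then reuse the learned statistics on test
train_age = num_imputer.fit_transform(toy_train[["age"]])
test_age = num_imputer.transform(toy_test[["age"]])
train_stage = cat_imputer.fit_transform(toy_train[["stage"]])
test_stage = cat_imputer.transform(toy_test[["stage"]])

print("median learned from train:", num_imputer.statistics_)
print("mode learned from train  :", cat_imputer.statistics_)
```

A fitted imputer can be saved with `joblib.dump`, in the same way as the scaler later in this notebook, instead of writing the medians and modes to separate CSV files.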



## 6B) One-hot encode categorical features (fit on TRAIN only)

We will:
- Learn the set of categories **from TRAIN only**
- Create one-hot columns for train
- Apply the same column set to test (any unseen categories in test → ignored)
- Ensure train/test have **identical encoded columns in identical order**

**Output:**
- `X_cat_train` and `X_cat_test` (numpy arrays)
- `cat_feature_names`
- Saved one-hot column names for reproducibility

In [7]:
# 1) One-hot encode TRAIN (this "defines" the columns)
train_cat_ohe = pd.get_dummies(train_cat_imp, drop_first=False)

# 2) One-hot encode TEST, then align to TRAIN columns
test_cat_ohe = pd.get_dummies(test_cat_imp, drop_first=False)
test_cat_ohe = test_cat_ohe.reindex(columns=train_cat_ohe.columns, fill_value=0)

# Convert to numpy
X_cat_train = train_cat_ohe.to_numpy(dtype=np.float32)
X_cat_test  = test_cat_ohe.to_numpy(dtype=np.float32)

cat_feature_names = train_cat_ohe.columns.to_numpy()

print("One-hot encoded categorical shapes:")
print("X_cat_train:", X_cat_train.shape)
print("X_cat_test :", X_cat_test.shape)

# Sanity: identical feature dimensions
assert X_cat_train.shape[1] == X_cat_test.shape[1]

print("\nFirst 25 one-hot feature names:")
print(cat_feature_names[:25])

# Save one-hot feature names
np.save(OUT_DIR / "clin_cat_feature_names.npy", cat_feature_names)

print("\nSaved:")
print(" -", OUT_DIR / "clin_cat_feature_names.npy")


One-hot encoded categorical shapes:
X_cat_train: (660, 3215)
X_cat_test : (165, 3215)

First 25 one-hot feature names:
['diagnosis_is_primary_disease' 'age_is_obfuscated' 'morphology_8052/3'
 'morphology_8070/3' 'morphology_8071/3' 'morphology_8072/3'
 'morphology_8083/3' 'morphology_8140/3' 'morphology_8230/3'
 'morphology_8250/3' 'morphology_8252/3' 'morphology_8253/3'
 'morphology_8255/3' 'morphology_8260/3' 'morphology_8310/3'
 'morphology_8480/3' 'morphology_8490/3' 'morphology_8507/3'
 'morphology_8550/3' 'tissue_or_organ_of_origin_Lower lobe, lung'
 'tissue_or_organ_of_origin_Lung, NOS'
 'tissue_or_organ_of_origin_Main bronchus'
 'tissue_or_organ_of_origin_Middle lobe, lung'
 'tissue_or_organ_of_origin_Overlapping lesion of lung'
 'tissue_or_organ_of_origin_Upper lobe, lung']

Saved:
 - /home/steps4growth/gmriechi/Lung-Cancer-Subtyping/Data/02_preprocessed/clinical/clin_cat_feature_names.npy


## 6C) Standardize numeric features (fit on TRAIN only)

We will:
- Fit `StandardScaler` on **train numeric features only**
- Transform train and test using the same scaler
- Save the scaler for reproducibility

**Output:**
- `X_num_train_scaled`, `X_num_test_scaled`
- Saved scaler (`clin_numeric_scaler.joblib`)

In [8]:
scaler_num = StandardScaler()

X_num_train_scaled = scaler_num.fit_transform(train_num_imp).astype(np.float32)
X_num_test_scaled  = scaler_num.transform(test_num_imp).astype(np.float32)

print("Scaled numeric shapes:")
print("X_num_train_scaled:", X_num_train_scaled.shape)
print("X_num_test_scaled :", X_num_test_scaled.shape)

print("\nScaled numeric check (first 5 features):")
print("Train mean:", X_num_train_scaled[:, :5].mean(axis=0))
print("Train std :", X_num_train_scaled[:, :5].std(axis=0))

# Save scaler
joblib.dump(scaler_num, OUT_DIR / "clin_numeric_scaler.joblib")

print("\nSaved:")
print(" -", OUT_DIR / "clin_numeric_scaler.joblib")


Scaled numeric shapes:
X_num_train_scaled: (660, 6)
X_num_test_scaled : (165, 6)

Scaled numeric check (first 5 features):
Train mean: [ 0.0000000e+00  7.2248052e-10 -1.5894573e-08 -2.8899221e-09
  7.2248052e-10]
Train std : [0. 1. 1. 1. 1.]

Saved:
 - /home/steps4growth/gmriechi/Lung-Cancer-Subtyping/Data/02_preprocessed/clinical/clin_numeric_scaler.joblib
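
In the check above, the first feature has a train standard deviation of 0, which suggests it is constant in the training set. `StandardScaler` leaves such a column at zero rather than dividing by zero, so the column carries no information. A minimal sketch for detecting and dropping constant columns before scaling, using train data only, is shown below. The helper name `constant_columns` and the toy values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def constant_columns(train: pd.DataFrame) -> list:
    """Names of columns with at most one distinct non-null value in train."""
    return [c for c in train.columns if train[c].nunique(dropna=True) <= 1]


# Toy data (made up values)
toy_train_num = pd.DataFrame({
    "days_to_diagnosis": [0.0, 0.0, 0.0, 0.0],
    "age_at_diagnosis": [60.0, 70.0, 50.0, 65.0],
})
toy_test_num = pd.DataFrame({
    "days_to_diagnosis": [0.0, 0.0],
    "age_at_diagnosis": [58.0, 72.0],
})

# Decide on train, apply the same decision to test
to_drop = constant_columns(toy_train_num)
train_kept = toy_train_num.drop(columns=to_drop)
test_kept = toy_test_num.drop(columns=to_drop)

scaler = StandardScaler().fit(train_kept)
train_scaled = scaler.transform(train_kept)
test_scaled = scaler.transform(test_kept)

print("dropped:", to_drop)
print("train std:", train_scaled.std(axis=0))
```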


## 6D) Build final clinical feature matrix

We concatenate:
- Scaled numeric clinical features
- One-hot encoded categorical clinical features

This produces the final clinical feature matrices aligned to the RNA split.

**Output:**
- `X_clin_train`
- `X_clin_test`

In [9]:
X_clin_train = np.hstack([
    X_num_train_scaled,
    X_cat_train
]).astype(np.float32)

X_clin_test = np.hstack([
    X_num_test_scaled,
    X_cat_test
]).astype(np.float32)

print("Final clinical shapes:")
print("Train:", X_clin_train.shape)
print("Test :", X_clin_test.shape)

# Save final clinical matrices
np.save(OUT_DIR / "clin_train_scaled.npy", X_clin_train)
np.save(OUT_DIR / "clin_test_scaled.npy", X_clin_test)

print("\nSaved:")
print(" -", OUT_DIR / "clin_train_scaled.npy")
print(" -", OUT_DIR / "clin_test_scaled.npy")


Final clinical shapes:
Train: (660, 3221)
Test : (165, 3221)

Saved:
 - /home/steps4growth/gmriechi/Lung-Cancer-Subtyping/Data/02_preprocessed/clinical/clin_train_scaled.npy
 - /home/steps4growth/gmriechi/Lung-Cancer-Subtyping/Data/02_preprocessed/clinical/clin_test_scaled.npy
