# ACC Survival Analysis: Define Track A/B Cohorts

**Goal**: Define the paper-mirroring (Track A) and sensitivity (Track B) cohorts, apply the exclusion criteria for each, and create train/validation splits

**Endpoints**:
- **OS (Overall Survival)**: Death from any cause
- **CSS (Cancer-Specific Survival)**: Death attributable to ACC

**Track Definitions**:
- **Track A (Paper-mirroring)**: Exclude cases with missing TNM stage or "UNK Stage" (enables the baseline TNM-only model comparison)
- **Track B (Sensitivity)**: Keep all cases; encode missing TNM stage and "UNK Stage" as "Unknown"

**Split**: 2:1 train:validation (67%:33%), stratified by OS event status

## 1. Setup and Import Libraries

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split
import json
import warnings

warnings.filterwarnings("ignore")

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)

print("Libraries loaded successfully!")

Libraries loaded successfully!


## 2. Load Raw SEER Data

We reload the raw data to extract CSS endpoint information that wasn't included in notebook 01.

In [2]:
# Load raw SEER data
raw_data_path = Path("../ACC数据/r分析seer/SEER纯ACC数据.xlsx")
print(f"Loading: {raw_data_path}")

df_raw = pd.read_excel(raw_data_path)

print(f"\nRaw data shape: {df_raw.shape}")
print(f"Total records: {len(df_raw)}")

Loading: ../ACC数据/r分析seer/SEER纯ACC数据.xlsx

Raw data shape: (2833, 52)
Total records: 2833


In [3]:
# Check cause-of-death columns for CSS endpoint
print("SEER cause-specific death classification values:")
print(df_raw["SEER cause-specific death classification"].value_counts())

print("\n" + "=" * 60)
print("\nSEER other cause of death classification values:")
print(df_raw["SEER other cause of death classification"].value_counts())

SEER cause-specific death classification values:
SEER cause-specific death classification
Alive or dead of other cause             1893
Dead (attributable to this cancer dx)     909
Dead (missing/unknown COD)                 31
Name: count, dtype: int64


SEER other cause of death classification values:
SEER other cause of death classification
Alive or dead due to cancer                                2261
Dead (attributable to causes other than this cancer dx)     541
Dead (missing/unknown COD)                                   31
Name: count, dtype: int64


## 3. Extract All Required Variables Including CSS Endpoint

In [None]:
def recode_site_to_4_categories(site_value):
    """
    Recode ICD-O-3 site codes to 4 consolidated categories.
    """
    if pd.isna(site_value):
        return np.nan

    site_str = str(site_value)
    code = site_str.split("-")[0] if "-" in site_str else site_str
    code_prefix = code[:3] if len(code) >= 3 else code

    # Major salivary glands
    if code_prefix in ["C07", "C08"]:
        return "大唾液腺"
    # Larynx and hypopharynx
    if code_prefix in ["C32", "C12", "C13"]:
        return "喉和下咽"
    # Nasal cavity/sinuses/nasopharynx
    if code_prefix in ["C30", "C31", "C11"]:
        return "鼻腔鼻窦副鼻窦鼻咽"
    # Oral/oropharynx/other
    return "口腔口咽其它"


# Column mapping including CSS endpoint and T/N/M components
column_mapping = {
    "ID": "编号",
    "age": "年龄",
    "sex": "性别",
    "site_raw": "原发部位",
    "grade": "分化级别(thru 2017)",
    "radiotherapy": "放疗",
    "chemotherapy": "化疗",
    "tumor_number": "肿瘤数量",
    "race": "种族",
    "marital_status": "婚姻",
    "urban_rural": "城乡",
    "time_os": "存活月数",
    "event_os": "生存（截止至研究日期）",
    "TNMstage": "TNM",
    "T": "T",
    "N": "N",
    "M": "M",
    "css_classification": "SEER cause-specific death classification",
}

# Select and rename columns
df = df_raw[list(column_mapping.values())].copy()
df.columns = list(column_mapping.keys())

# Apply site recoding
df["site"] = df["site_raw"].apply(recode_site_to_4_categories)
df = df.drop(columns=["site_raw"])

print(f"Selected {len(df.columns)} variables")
print(f"Data shape: {df.shape}")

# Show T, N, M distributions
print("\nT, N, M component distributions:")
for col in ["T", "N", "M"]:
    print(f"\n{col}:")
    print(df[col].value_counts())

## 4. Define CSS Endpoint

In [5]:
# Define CSS event based on SEER cause-specific death classification
# "Dead (attributable to this cancer dx)" = cancer-specific death (event_css = 1)
# "Alive or dead of other cause" = censored for CSS (event_css = 0)
# "Dead (missing/unknown COD)" (31 records) is also coded event_css = 0 by the comparison below

df["event_css"] = (
    df["css_classification"] == "Dead (attributable to this cancer dx)"
).astype(int)

# CSS uses same time as OS (time from diagnosis to death or last follow-up)
df["time_css"] = df["time_os"]

print("CSS endpoint definition:")
print(f"  Cancer-specific deaths (event_css=1): {df['event_css'].sum()}")
print(f"  Censored (event_css=0): {(df['event_css']==0).sum()}")
print(f"  CSS event rate: {df['event_css'].mean()*100:.1f}%")

print("\nOS endpoint:")
print(f"  Deaths (event_os=1): {df['event_os'].sum()}")
print(f"  Alive (event_os=0): {(df['event_os']==0).sum()}")
print(f"  OS event rate: {df['event_os'].mean()*100:.1f}%")

CSS endpoint definition:
  Cancer-specific deaths (event_css=1): 909
  Censored (event_css=0): 1924
  CSS event rate: 32.1%

OS endpoint:
  Deaths (event_os=1): 1481
  Alive (event_os=0): 1352
  OS event rate: 52.3%
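
The CSS definition above censors every record that is not an attributed cancer death, which includes the "Dead (missing/unknown COD)" group. The following is a minimal sketch on toy data, not part of the original pipeline, showing the mapping and one way to flag unknown-cause deaths for a later sensitivity check. The helper `derive_css_event` and the `cod_unknown` column are hypothetical names introduced for illustration.

```python
import pandas as pd

CANCER_DEATH = "Dead (attributable to this cancer dx)"
UNKNOWN_COD = "Dead (missing/unknown COD)"


def derive_css_event(classification: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: CSS event indicator plus an unknown-cause flag."""
    return pd.DataFrame(
        {
            # 1 only for deaths attributed to the cancer; everything else is censored
            "event_css": (classification == CANCER_DEATH).astype(int),
            # Deaths with unknown cause end up censored; flag them for review
            "cod_unknown": classification == UNKNOWN_COD,
        }
    )


# Toy input covering the three classification values seen in the raw data
toy = pd.Series(
    [CANCER_DEATH, "Alive or dead of other cause", UNKNOWN_COD, CANCER_DEATH],
    name="css_classification",
)
css = derive_css_event(toy)
print(pd.crosstab(toy, css["event_css"]))
print(f"Unknown-COD deaths censored for CSS: {int(css['cod_unknown'].sum())}")
```

Keeping the flag would make it possible to rerun the CSS analysis with those records excluded, if that is of interest.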


In [6]:
# Drop intermediate column
df = df.drop(columns=["css_classification"])

# Endpoint columns (no renaming is performed here):
# time_os and time_css are identical (time from diagnosis)
# event_os: 1=death (any cause), 0=alive
# event_css: 1=cancer-specific death, 0=alive, other-cause death, or unknown-cause death

print("Final columns:")
print(df.columns.tolist())

Final columns:
['ID', 'age', 'sex', 'grade', 'radiotherapy', 'chemotherapy', 'tumor_number', 'race', 'marital_status', 'urban_rural', 'time_os', 'event_os', 'TNMstage', 'site', 'event_css', 'time_css']


## 5. Data Type Conversion

In [None]:
# Define candidate variables for Cox screening
# TNMstage = combined stage, T/N/M = separate components
candidate_vars = [
    "age",
    "sex",
    "site",
    "grade",
    "radiotherapy",
    "chemotherapy",
    "tumor_number",
    "race",
    "marital_status",
    "urban_rural",
    "TNMstage",
    "T",
    "N",
    "M",
]

# Convert categorical variables
for var in candidate_vars:
    df[var] = df[var].astype("category")

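# Convert numeric outcome variables
# Note: .astype(int) raises if event_os contains missing or non-numeric values,
# so such records would fail here instead of being counted in the exclusion step
df["event_os"] = pd.to_numeric(df["event_os"], errors="coerce").astype(int)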
df["event_css"] = df["event_css"].astype(int)
df["time_os"] = pd.to_numeric(df["time_os"], errors="coerce")
df["time_css"] = df["time_css"].astype(float)

print("Data type conversion complete")
print(f"\nDataset shape: {df.shape}")
print(f"\nCandidate variables: {len(candidate_vars)}")
print("  Non-staging: age, sex, site, grade, radiotherapy, chemotherapy, tumor_number, race, marital_status, urban_rural")
print("  Staging (combined): TNMstage")
print("  Staging (separate): T, N, M")

## 6. Apply Exclusion Criteria (Common to Both Tracks)

Exclusions applied to all cohorts:
1. Missing or invalid survival time (time <= 0 or NaN)
2. Missing survival status

In [8]:
print("Applying common exclusion criteria...")
print(f"\nStarting N: {len(df)}")

# Exclusion 1: Missing or invalid survival time
valid_time = (df["time_os"].notna()) & (df["time_os"] > 0)
n_invalid_time = (~valid_time).sum()
print(f"  Excluded for invalid/missing time: {n_invalid_time}")

# Exclusion 2: Missing survival status
valid_status = df["event_os"].notna()
n_invalid_status = (~valid_status).sum()
print(f"  Excluded for missing status: {n_invalid_status}")

# Apply common exclusions
df_clean = df[valid_time & valid_status].copy()
print(f"\nAfter common exclusions: {len(df_clean)} records")

Applying common exclusion criteria...

Starting N: 2833
  Excluded for invalid/missing time: 33
  Excluded for missing status: 0

After common exclusions: 2800 records
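
The cell above counts each exclusion independently against the full dataset, so a record failing both criteria would be counted twice. As an illustration only, the sketch below applies criteria sequentially and logs the remaining N after each step, which gives a flow that always sums correctly. The helper `apply_exclusions` and the toy data are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def apply_exclusions(df, criteria):
    """Hypothetical helper: apply (label, keep_mask_fn) pairs in order.

    Returns the filtered frame and a list of (label, remaining_n) tuples.
    """
    flow = [("start", len(df))]
    for label, keep_fn in criteria:
        df = df[keep_fn(df)]
        flow.append((label, len(df)))
    return df.copy(), flow


# Toy data: one zero time, one missing time, one missing status
toy = pd.DataFrame(
    {
        "time_os": [12.0, 0.0, np.nan, 30.0, 5.0],
        "event_os": [1, 0, 1, np.nan, 0],
    }
)
criteria = [
    ("valid time", lambda d: d["time_os"].notna() & (d["time_os"] > 0)),
    ("known status", lambda d: d["event_os"].notna()),
]
clean, flow = apply_exclusions(toy, criteria)
for label, n in flow:
    print(f"{label}: {n}")
```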


## 7. Define Track A Cohort (Paper-Mirroring)

**Track A**: Exclude cases with missing TNM staging
- This allows baseline TNM-only model comparison (as in the reference paper)
- Primary analysis cohort

In [9]:
print("Creating Track A cohort (paper-mirroring)...")
print(f"\nStarting N (after common exclusions): {len(df_clean)}")

# Check TNM availability
tnm_valid = df_clean["TNMstage"].notna()
n_missing_tnm = (~tnm_valid).sum()
print(f"  Records with missing TNMstage: {n_missing_tnm}")

# Also exclude "UNK Stage" as it's essentially missing
tnm_known = tnm_valid & (df_clean["TNMstage"] != "UNK Stage")
n_unk_stage = (df_clean["TNMstage"] == "UNK Stage").sum()
print(f"  Records with 'UNK Stage': {n_unk_stage}")

# Create Track A
df_trackA = df_clean[tnm_known].copy()
print(f"\nTrack A cohort: {len(df_trackA)} records")
print(
    f"  OS events: {df_trackA['event_os'].sum()} ({df_trackA['event_os'].mean()*100:.1f}%)"
)
print(
    f"  CSS events: {df_trackA['event_css'].sum()} ({df_trackA['event_css'].mean()*100:.1f}%)"
)

Creating Track A cohort (paper-mirroring)...

Starting N (after common exclusions): 2800
  Records with missing TNMstage: 1306
  Records with 'UNK Stage': 111

Track A cohort: 1383 records
  OS events: 573 (41.4%)
  CSS events: 385 (27.8%)


In [10]:
# Track A: TNMstage distribution
print("Track A TNMstage distribution:")
print(df_trackA["TNMstage"].value_counts())

Track A TNMstage distribution:
TNMstage
1            346
4A           272
2            270
3            227
4C           139
4B           118
4NOS           7
4              4
UNK Stage      0
Name: count, dtype: int64
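
"UNK Stage" still appears with a count of 0 because filtering rows of a categorical column keeps the full list of category levels. If that empty level is not wanted downstream (for example, to avoid an all-zero dummy column), it can be dropped. The sketch below uses toy data and is a suggestion, not a step taken in this notebook.

```python
import pandas as pd

# Toy stage column with one unknown and one missing value
stage = pd.Series(["1", "2", "UNK Stage", "4A", None], dtype="category")

# Same filter as Track A: drop missing and "UNK Stage"
known = stage[stage.notna() & (stage != "UNK Stage")]
print(list(known.cat.categories))  # the "UNK Stage" level is still listed

# Drop levels that no longer occur in the data
trimmed = known.cat.remove_unused_categories()
print(list(trimmed.cat.categories))
```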


## 8. Define Track B Cohort (Sensitivity Analysis)

**Track B**: Keep all cases, encode missing TNM as "Unknown"
- This maximizes sample size for ACC (where staging is often missing)
- Sensitivity analysis cohort

In [11]:
print("Creating Track B cohort (sensitivity)...")
print(f"\nStarting N (after common exclusions): {len(df_clean)}")

# Create Track B (all records, encode missing TNM)
df_trackB = df_clean.copy()

# Encode missing TNM and "UNK Stage" as "Unknown"
# First, convert to string to handle the category
df_trackB["TNMstage"] = df_trackB["TNMstage"].astype(str)
df_trackB.loc[df_trackB["TNMstage"].isin(["nan", "UNK Stage"]), "TNMstage"] = "Unknown"
df_trackB["TNMstage"] = df_trackB["TNMstage"].astype("category")

print(f"\nTrack B cohort: {len(df_trackB)} records")
print(
    f"  OS events: {df_trackB['event_os'].sum()} ({df_trackB['event_os'].mean()*100:.1f}%)"
)
print(
    f"  CSS events: {df_trackB['event_css'].sum()} ({df_trackB['event_css'].mean()*100:.1f}%)"
)

Creating Track B cohort (sensitivity)...

Starting N (after common exclusions): 2800

Track B cohort: 2800 records
  OS events: 1467 (52.4%)
  CSS events: 908 (32.4%)


In [12]:
# Track B: TNMstage distribution (including Unknown)
print("Track B TNMstage distribution:")
print(df_trackB["TNMstage"].value_counts())

Track B TNMstage distribution:
TNMstage
Unknown    1417
1           346
4A          272
2           270
3           227
4C          139
4B          118
4NOS          7
4             4
Name: count, dtype: int64
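
The encoding above relies on `astype(str)` turning missing values into the literal string "nan". A possible alternative, sketched here on toy data under the assumption that the same two conditions define "Unknown", masks on the original values instead and so does not depend on that string sentinel.

```python
import pandas as pd

# Toy stage column with one missing value and one "UNK Stage"
stage = pd.Series(["1", None, "UNK Stage", "4A"], dtype="category")

# Keep known stages; replace missing and "UNK Stage" with "Unknown"
is_known = stage.notna() & (stage != "UNK Stage")
encoded = stage.astype(object).where(is_known, "Unknown").astype("category")

print(encoded.value_counts())
```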


## 9. Create Train/Validation Splits (2:1)

For each track, create:
- Training set (67%): For model development
- Validation set (33%): For internal validation

Split is stratified by OS event status.

In [13]:
def create_train_val_split(df, track_name, random_state=RANDOM_STATE):
    """
    Create 2:1 train/validation split, stratified by OS event status.
    """
    train, val = train_test_split(
        df, test_size=0.33, random_state=random_state, stratify=df["event_os"]
    )

    print(f"\n{track_name} Split:")
    print(f"  Training: {len(train)} ({len(train)/len(df)*100:.1f}%)")
    print(f"  Validation: {len(val)} ({len(val)/len(df)*100:.1f}%)")
    print(
        f"  Train OS events: {train['event_os'].sum()} ({train['event_os'].mean()*100:.1f}%)"
    )
    print(
        f"  Val OS events: {val['event_os'].sum()} ({val['event_os'].mean()*100:.1f}%)"
    )

    return train, val


# Track A splits
trackA_train, trackA_val = create_train_val_split(df_trackA, "Track A")

# Track B splits
trackB_train, trackB_val = create_train_val_split(df_trackB, "Track B")


Track A Split:
  Training: 926 (67.0%)
  Validation: 457 (33.0%)
  Train OS events: 384 (41.5%)
  Val OS events: 189 (41.4%)

Track B Split:
  Training: 1876 (67.0%)
  Validation: 924 (33.0%)
  Train OS events: 983 (52.4%)
  Val OS events: 484 (52.4%)
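
Because the split is stratified on OS status only, the CSS event rate is not guaranteed to be balanced between training and validation sets. One option, shown below as a sketch on synthetic data rather than as the method used here, is to stratify on a combined OS/CSS label. The variable `strata` and the simulated events are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic endpoints: CSS events are a subset of OS events
rng = np.random.default_rng(42)
n = 600
event_os = rng.integers(0, 2, size=n)
event_css = event_os * rng.integers(0, 2, size=n)
toy = pd.DataFrame({"event_os": event_os, "event_css": event_css})

# Combined label with three strata: "0_0", "1_0", "1_1"
strata = toy["event_os"].astype(str) + "_" + toy["event_css"].astype(str)
train, val = train_test_split(
    toy, test_size=0.33, random_state=42, stratify=strata
)

# Compare event rates for both endpoints across the split
for name in ["event_os", "event_css"]:
    print(f"{name}: train={train[name].mean():.3f}, val={val[name].mean():.3f}")
```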


## 10. Align Categorical Levels

Ensure validation sets use the same category levels as training sets.

In [14]:
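def align_categorical_levels(train_df, val_df, full_df, candidate_vars):
    """
    Align categorical levels between train/val/full datasets.
    Uses training set categories as reference; values outside those
    categories become NaN. Returns copies, leaving the inputs unmodified.
    """
    # Copy so that slices returned by train_test_split are not modified in place
    train_df, val_df, full_df = train_df.copy(), val_df.copy(), full_df.copy()
    for var in candidate_vars: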
        if var in train_df.columns:
            train_categories = train_df[var].cat.categories
            val_df[var] = pd.Categorical(
                val_df[var],
                categories=train_categories,
                ordered=train_df[var].cat.ordered,
            )
            full_df[var] = pd.Categorical(
                full_df[var],
                categories=train_categories,
                ordered=train_df[var].cat.ordered,
            )
    return train_df, val_df, full_df


# Align Track A
trackA_train, trackA_val, df_trackA = align_categorical_levels(
    trackA_train, trackA_val, df_trackA, candidate_vars
)
print("Track A categorical levels aligned")

# Align Track B
trackB_train, trackB_val, df_trackB = align_categorical_levels(
    trackB_train, trackB_val, df_trackB, candidate_vars
)
print("Track B categorical levels aligned")

Track A categorical levels aligned
Track B categorical levels aligned
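
Recoding a column against a fixed category list silently turns any value outside that list into NaN. A minimal sketch of a pre-alignment check follows, using toy data and a hypothetical helper `unseen_levels` that compares observed values rather than declared categories.

```python
import pandas as pd


def unseen_levels(train_col: pd.Series, other_col: pd.Series) -> set:
    """Hypothetical helper: values observed in other_col but not in train_col."""
    seen = set(train_col.dropna().unique())
    return set(other_col.dropna().unique()) - seen


# Toy example: "4NOS" occurs in validation only
train = pd.DataFrame({"TNMstage": pd.Categorical(["1", "2", "4A"])})
val = pd.DataFrame({"TNMstage": pd.Categorical(["1", "4NOS"])})

missing = unseen_levels(train["TNMstage"], val["TNMstage"])
print(f"Levels unseen in training: {missing}")

# Aligning to the training categories converts the unseen level to NaN
aligned = pd.Categorical(
    val["TNMstage"], categories=train["TNMstage"].cat.categories
)
print(f"NaN after alignment: {int(pd.isna(aligned).sum())}")
```

Running such a check before alignment makes any loss of rare levels visible instead of implicit.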


## 11. Summary: Cohort Flow

In [15]:
print("=" * 70)
print("COHORT FLOW SUMMARY")
print("=" * 70)
print(f"\nRaw SEER data: {len(df_raw)} records")
print(f"After common exclusions (valid time & status): {len(df_clean)} records")
print(f"  - Excluded for invalid time: {n_invalid_time}")
print(f"  - Excluded for missing status: {n_invalid_status}")

print("\n--- Track A (Paper-mirroring, excludes missing TNM) ---")
print(f"  Full cohort: {len(df_trackA)} records")
print(f"  Training: {len(trackA_train)} records")
print(f"  Validation: {len(trackA_val)} records")
print(
    f"  OS events: {df_trackA['event_os'].sum()} ({df_trackA['event_os'].mean()*100:.1f}%)"
)
print(
    f"  CSS events: {df_trackA['event_css'].sum()} ({df_trackA['event_css'].mean()*100:.1f}%)"
)

print("\n--- Track B (Sensitivity, includes Unknown TNM) ---")
print(f"  Full cohort: {len(df_trackB)} records")
print(f"  Training: {len(trackB_train)} records")
print(f"  Validation: {len(trackB_val)} records")
print(
    f"  OS events: {df_trackB['event_os'].sum()} ({df_trackB['event_os'].mean()*100:.1f}%)"
)
print(
    f"  CSS events: {df_trackB['event_css'].sum()} ({df_trackB['event_css'].mean()*100:.1f}%)"
)

print("\n" + "=" * 70)

COHORT FLOW SUMMARY

Raw SEER data: 2833 records
After common exclusions (valid time & status): 2800 records
  - Excluded for invalid time: 33
  - Excluded for missing status: 0

--- Track A (Paper-mirroring, excludes missing TNM) ---
  Full cohort: 1383 records
  Training: 926 records
  Validation: 457 records
  OS events: 573 (41.4%)
  CSS events: 385 (27.8%)

--- Track B (Sensitivity, includes Unknown TNM) ---
  Full cohort: 2800 records
  Training: 1876 records
  Validation: 924 records
  OS events: 1467 (52.4%)
  CSS events: 908 (32.4%)



## 12. Save All Cohorts

In [16]:
# Output directory
output_dir = Path("../data/processed")
output_dir.mkdir(parents=True, exist_ok=True)

# Save Track A cohorts
trackA_train.to_pickle(output_dir / "trackA_train.pkl")
trackA_val.to_pickle(output_dir / "trackA_val.pkl")
df_trackA.to_pickle(output_dir / "trackA_full.pkl")

trackA_train.to_csv(output_dir / "trackA_train.csv", index=False)
trackA_val.to_csv(output_dir / "trackA_val.csv", index=False)
df_trackA.to_csv(output_dir / "trackA_full.csv", index=False)

# Save Track B cohorts
trackB_train.to_pickle(output_dir / "trackB_train.pkl")
trackB_val.to_pickle(output_dir / "trackB_val.pkl")
df_trackB.to_pickle(output_dir / "trackB_full.pkl")

trackB_train.to_csv(output_dir / "trackB_train.csv", index=False)
trackB_val.to_csv(output_dir / "trackB_val.csv", index=False)
df_trackB.to_csv(output_dir / "trackB_full.csv", index=False)

print("Cohort files saved:")
for f in sorted(output_dir.glob("track*")):
    print(f"  {f.name}")

Cohort files saved:
  trackA_full.csv
  trackA_full.pkl
  trackA_train.csv
  trackA_train.pkl
  trackA_val.csv
  trackA_val.pkl
  trackB_full.csv
  trackB_full.pkl
  trackB_train.csv
  trackB_train.pkl
  trackB_val.csv
  trackB_val.pkl
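
The two formats are not interchangeable for downstream notebooks: pickle preserves the categorical dtype and its aligned levels, while CSV stores only the values. The sketch below illustrates the difference with a toy frame written to a temporary directory; the file names are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with a categorical column that has an unused "Unknown" level
toy = pd.DataFrame(
    {
        "TNMstage": pd.Categorical(
            ["1", "4A", "2"], categories=["1", "2", "4A", "Unknown"]
        ),
        "time_os": [12.0, 30.0, 5.0],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    toy.to_pickle(tmp_dir / "toy.pkl")
    toy.to_csv(tmp_dir / "toy.csv", index=False)
    from_pkl = pd.read_pickle(tmp_dir / "toy.pkl")
    from_csv = pd.read_csv(tmp_dir / "toy.csv")

print("pickle dtype:", from_pkl["TNMstage"].dtype)  # categorical, levels kept
print("csv dtype:", from_csv["TNMstage"].dtype)  # no longer categorical
```

Loading the `.pkl` files is therefore the safer choice when later steps depend on the category levels.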


In [17]:
# Update data dictionary
data_dict_path = output_dir / "data_dictionary.json"

# Load existing or create new
if data_dict_path.exists():
    with open(data_dict_path, "r", encoding="utf-8") as f:
        data_dict = json.load(f)
else:
    data_dict = {}

# Add cohort information
data_dict.update(
    {
        "cohort_definition_date": pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S"),
        "endpoints": {
            "OS": {
                "time_var": "time_os",
                "event_var": "event_os",
                "description": "Overall survival (death from any cause)",
            },
            "CSS": {
                "time_var": "time_css",
                "event_var": "event_css",
                "description": "Cancer-specific survival (death attributable to ACC)",
            },
        },
        "track_A": {
            "description": "Paper-mirroring analysis - excludes missing TNM staging",
            "exclusion_criteria": [
                "Missing or invalid survival time (time <= 0)",
                "Missing survival status",
                "Missing TNM stage",
                "TNM stage = 'UNK Stage'",
            ],
            "full_n": len(df_trackA),
            "train_n": len(trackA_train),
            "val_n": len(trackA_val),
            "os_events": int(df_trackA["event_os"].sum()),
            "css_events": int(df_trackA["event_css"].sum()),
        },
        "track_B": {
            "description": "Sensitivity analysis - includes all cases, missing TNM encoded as 'Unknown'",
            "exclusion_criteria": [
                "Missing or invalid survival time (time <= 0)",
                "Missing survival status",
            ],
            "full_n": len(df_trackB),
            "train_n": len(trackB_train),
            "val_n": len(trackB_val),
            "os_events": int(df_trackB["event_os"].sum()),
            "css_events": int(df_trackB["event_css"].sum()),
        },
        "split_method": "2:1 train:validation (67%:33%), stratified by OS event status",
        "random_state": RANDOM_STATE,
        "candidate_variables": candidate_vars,
    }
)

with open(data_dict_path, "w", encoding="utf-8") as f:
    json.dump(data_dict, f, indent=2, ensure_ascii=False)

print(f"\nData dictionary updated: {data_dict_path}")


Data dictionary updated: ../data/processed/data_dictionary.json


## Summary

### Cohorts Created:

| Track | Description | Full | Train | Validation |
|-------|-------------|------|-------|------------|
| A | Paper-mirroring (excludes missing TNM) | 1383 | 926 (67%) | 457 (33%) |
| B | Sensitivity (includes Unknown TNM) | 2800 | 1876 (67%) | 924 (33%) |

### Endpoints:
- **OS (Overall Survival)**: `time_os`, `event_os` (1=death any cause, 0=alive)
- **CSS (Cancer-Specific Survival)**: `time_css`, `event_css` (1=ACC death, 0=censored)

### Next Steps:
1. **Notebook 03**: Univariate Cox screening (p < 0.05 selection)
2. **Notebook 04**: Forward-stepwise multivariate Cox
3. **Notebook 05/06**: Nomogram construction (OS/CSS)
4. **Notebook 07/08**: Internal and external validation

### Files Saved:
- `trackA_train.pkl`, `trackA_val.pkl`, `trackA_full.pkl` (+ CSV versions)
- `trackB_train.pkl`, `trackB_val.pkl`, `trackB_full.pkl` (+ CSV versions)
- Updated `data_dictionary.json`