Notes from AI Labs
Split Data before permutation importance - 70/30 split.  Can do 5 different splits

When evaluating - need evaluation matrix, true positives and true negatives with false positives and false negatives

Precision = True Positives/(True Positives + False Positives)

Accuracy = True Positives and Negatives/(all four categories)

True Positive Rate, Recall = TP/(TP+FN)

False Positive Rate, FP/(FP+TN)

C-index(Concordance Index) in Survival Analysis

Measures the agreement between predicted risk scores and observed survival outcomes

C-index = Number of concordant patient pairs/(Total comparable patient pairs)

Goal of project is to maximize C-index

Survival Analysis when it's binary can be random forest

In [1]:
#import kagglehub - no longer needed once using local .csv file
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
#from sklearn.preprocessing import LabelEncoder
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import concordance_index_censored
from sklearn.inspection import permutation_importance
# for splitting data
from sklearn.model_selection import ShuffleSplit
# Define a scorer compatible with GridSearchCV
from sklearn.metrics import make_scorer

## Skip below after first run ##

In [None]:
path = kagglehub.dataset_download("reihanenamdari/breast-cancer")
path #path to data download on local machine

Downloading from https://www.kaggle.com/api/v1/datasets/download/reihanenamdari/breast-cancer?dataset_version_number=1...


100%|██████████| 42.8k/42.8k [00:00<00:00, 1.22MB/s]

Extracting files...





In [2]:
df = pd.read_csv('Breast_Cancer.csv')
#4000 total records

In [3]:
df.head()

Unnamed: 0,Age,Race,Marital Status,T Stage,N Stage,6th Stage,differentiate,Grade,A Stage,Tumor Size,Estrogen Status,Progesterone Status,Regional Node Examined,Reginol Node Positive,Survival Months,Status
0,68,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,4,Positive,Positive,24,1,60,Alive
1,50,White,Married,T2,N2,IIIA,Moderately differentiated,2,Regional,35,Positive,Positive,14,5,62,Alive
2,58,White,Divorced,T3,N3,IIIC,Moderately differentiated,2,Regional,63,Positive,Positive,14,7,75,Alive
3,58,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,18,Positive,Positive,2,1,84,Alive
4,47,White,Married,T2,N1,IIB,Poorly differentiated,3,Regional,41,Positive,Positive,3,1,50,Alive


In [4]:
#Create numerical columns out of Race
df['White'] = (df['Race']=='White').astype(int)
df['Black'] = (df['Race']=='Black').astype(int)
df['Other'] = (df['Race']=='Other').astype(int)

In [5]:
#Drop categorical columns - drops Race, Marital Status, 6th stage, differentiate, A stage
df.drop(df.columns[[1,2,5,6,8]], axis=1, inplace=True)

In [6]:
#Encode Categorical Columns
df['T Stage '] = df['T Stage '].map({'T1':1,'T2':2, 'T3':3,'T4':4})
df['N Stage'] = df['N Stage'].map({'N1':1,'N2':2, 'N3':3})
df['Estrogen Status'] = df['Estrogen Status'].map({'Positive':1,'Negative':0})
df['Progesterone Status'] = df['Progesterone Status'].map({'Positive': 1,'Negative': 0})
df['Status'] = df['Status'].map({'Alive':1,'Dead':0})
#Force to numeric and drop those with missing grades
df['Grade'] = pd.to_numeric(df['Grade'], errors = 'coerce')
df = df.dropna(subset = ['Grade'])

In [7]:
#check shape of dataset
print(df.shape)

(4005, 14)


In [8]:
#Create a survival object dataframe
y = df[['Status','Survival Months']]

In [9]:
y.head()

Unnamed: 0,Status,Survival Months
0,1,60
1,1,62
2,1,75
3,1,84
4,1,50


Below will create a list of tuples, 'Status' and 'Survival Months' by row and keeps Status boolean
and Survival Months as a 64 bit float

These two variables define survival time and status as target variables.

This helps us use sckit-surival library for analysis.  It requires this tuple format.S

In [10]:
y_structured = np.array([(bool(status), months) for status, months in zip(y['Status'], y['Survival Months'])],
                        dtype = [('Status','bool'), ('Survival Months', 'f8')])

In [11]:
y_structured 
# Status of True means event happened (Death)

array([( True,  60.), ( True,  62.), ( True,  75.), ..., ( True,  69.),
       ( True,  72.), ( True, 100.)],
      shape=(4005,), dtype=[('Status', '?'), ('Survival Months', '<f8')])

In [12]:
#Removes target columns from original dataframe, which contains our features only
X= df.drop(columns = ['Status','Survival Months'])

In [13]:
#X becomes our Independent variables
X.head()

Unnamed: 0,Age,T Stage,N Stage,Grade,Tumor Size,Estrogen Status,Progesterone Status,Regional Node Examined,Reginol Node Positive,White,Black,Other
0,68,1,1,3.0,4,1,1,24,1,1,0,0
1,50,2,2,2.0,35,1,1,14,5,1,0,0
2,58,3,3,2.0,63,1,1,14,7,1,0,0
3,58,1,1,3.0,18,1,1,2,1,1,0,0
4,47,2,1,3.0,41,1,1,3,1,1,0,0


In [14]:
# Custom scoring function for survival analysis - closer to 1 is better.  0.5 is a random guess.  
# C-index measures how well the model predicts who survives longer
# Custom scoring function for survival analysis
def cindex_score(model, x, y_struct):
    prediction = model.predict(x)
    return concordance_index_censored(y_struct['Status'], y_struct['Survival Months'], prediction)[0]

**Training and Test Split -70/30 with 5 splits**

In [15]:
rs = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
split_scores = []
print("\n--- Cross-Validation C-Indices ---")
#Here we are training multiple models
for train_idx, test_idx in rs.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y_structured[train_idx], y_structured[test_idx]

    model = RandomSurvivalForest(n_estimators=100, min_samples_split=10, min_samples_leaf=15, random_state=42, n_jobs=-1)
    model.fit(X_train, y_train)

    c_index = cindex_score(model, X_test, y_test)
    split_scores.append(c_index)
    print(f"C-Index for one split: {c_index:.4f}")

print(f"\nAverage C-Index across 5 splits: {np.mean(split_scores):.4f}")


--- Cross-Validation C-Indices ---
C-Index for one split: 0.5230
C-Index for one split: 0.5022
C-Index for one split: 0.5188
C-Index for one split: 0.5168
C-Index for one split: 0.5340

Average C-Index across 5 splits: 0.5190


Collect all models and test sets and do permutation importance for each then average.

In [16]:
# Calculate average permutation importance across all splits
importances_list = []

for train_idx, test_idx in rs.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y_structured[train_idx], y_structured[test_idx]

    model = RandomSurvivalForest(n_estimators=100, min_samples_split=10, min_samples_leaf=15, random_state=42, n_jobs=-1)
    model.fit(X_train, y_train)

    perm_result = permutation_importance(
        estimator=model,
        X=X_test,
        y=y_test,
        n_repeats=10,
        random_state=42,
        n_jobs=-1,
        scoring=cindex_score
    )
    importances_list.append(perm_result.importances_mean)

# Average the importances
avg_importances = np.mean(importances_list, axis=0)
importance_df = pd.Series(avg_importances, index=X.columns).sort_values(ascending=False)

print("\nPermutation importances (averaged across splits):\n", importance_df)



Permutation importances (averaged across splits):
 Progesterone Status       0.010083
Age                       0.004052
Grade                     0.003538
Reginol Node Positive     0.001826
T Stage                   0.001741
Estrogen Status           0.001416
Tumor Size                0.001188
N Stage                   0.000657
Black                    -0.000366
Other                    -0.000775
White                    -0.001004
Regional Node Examined   -0.004990
dtype: float64


**Feature and Model Selection**

We do permutation based feature importance to identify top 5 features.  This reduces dimensionality and focuses on variables we think are more predictive.

Random Survival Forest is chosen as the model approach due to its robustness with non-linear relationships and handling high dimensional data.  Random Survival Forest is good for predicting survival time distributions.

In [28]:
print("\n--- Cross-Validated Top-K Features Testing ---")

rs = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

importances_list = []   
all_k_scores = {}       
best_score = 0
best_k = 0

for k in range(3, 11):
    cindex_scores_k = []

    for train_idx, test_idx in rs.split(X):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y_structured[train_idx], y_structured[test_idx]

        # Step 1: Train full model on training data
        model = RandomSurvivalForest(n_estimators=100, min_samples_split=10, min_samples_leaf=15, random_state=42, n_jobs=-1)
        model.fit(X_train, y_train)

        # Step 2: Permutation importance on training data
        perm_result = permutation_importance(
            estimator=model,
            X=X_train,
            y=y_train,
            n_repeats=10,
            random_state=42,
            n_jobs=-1,
            scoring=cindex_score
        )
        importances_list.append(perm_result.importances_mean)

        # Step 3: Select top K features from training data
        importance_train = pd.Series(perm_result.importances_mean, index=X.columns)
        top_k_features = importance_train.sort_values(ascending=False).head(k).index

        # Step 4: Train new model on only Top-K features
        model_k = RandomSurvivalForest(n_estimators=100, min_samples_split=10, min_samples_leaf=15, random_state=42, n_jobs=-1)
        model_k.fit(X_train[top_k_features], y_train)

        # Step 5: Evaluate C-index on test set
        c_index_k = cindex_score(model_k, X_test[top_k_features], y_test)
        cindex_scores_k.append(c_index_k)

    avg_cindex = np.mean(cindex_scores_k)
    all_k_scores[k] = avg_cindex

    print(f"Top {k} features average C-Index across splits: {avg_cindex:.4f}")

    if avg_cindex > best_score:
        best_score = avg_cindex
        best_k = k
print(f"\nBest K based on C-index: {best_k} with C-Index: {best_score:.4f}")

# Average the collected permutation importances across all runs
avg_importances = np.mean(importances_list, axis=0)

# Build importance DataFrame and select final features
importance_df = pd.Series(avg_importances, index=X.columns).sort_values(ascending=False)
top_features = importance_df.head(best_k).index

# Final X_selected for hyperparameter tuning
X_selected = X[top_features]


--- Cross-Validated Top-K Features Testing ---
Top 3 features average C-Index across splits: 0.4939
Top 4 features average C-Index across splits: 0.5018
Top 5 features average C-Index across splits: 0.5095
Top 6 features average C-Index across splits: 0.5154
Top 7 features average C-Index across splits: 0.5189
Top 8 features average C-Index across splits: 0.5200
Top 9 features average C-Index across splits: 0.5178
Top 10 features average C-Index across splits: 0.5184

Best K based on C-index: 8 with C-Index: 0.5200


**Hyperparameter Tuning**

Simply, we look at all the trees and number of features selected at each split.  This aims to optimize Concordance Index, which is a standard metric in survival analysis.

In [29]:
def concordance_index_scorer(estimator, X, y):
    return concordance_index_censored(y['Status'], y['Survival Months'], estimator.predict(X))[0]

param_grid = {
    'n_estimators': [50,100,150,200,250,300,350,400,450,500],
    'max_features': [1, 2, 3, 4, 5,'sqrt'],
    'min_samples_split': [3, 5,7,10,13,15],
    'min_samples_leaf': [3,5,7,10,13,15]
}

grid_search = GridSearchCV(
    estimator=RandomSurvivalForest(random_state=42),
    param_grid=param_grid,
    scoring=make_scorer(concordance_index_scorer),
    cv=3,
    n_jobs=-1
)

grid_search.fit(X_selected, y_structured)
print("\nBest parameters from GridSearchCV:", grid_search.best_params_)




Best parameters from GridSearchCV: {'max_features': 1, 'min_samples_leaf': 3, 'min_samples_split': 3, 'n_estimators': 50}


In [29]:
#Train best model
best_rsf = grid_search.best_estimator_
best_rsf.fit(X_selected, y_structured)

final_c_index = concordance_index_censored(
    event_indicator=y_structured['Status'],
    event_time=y_structured['Survival Months'],
    estimate=best_rsf.predict(X_selected)
)[0]

print('\nFinal Concordance Index (on full selected features):', final_c_index)


Final Concordance Index (on full selected features): 0.6479667954237261


**Result interpretation**

0.5: Model performs no better than random chance

Greater than 0.7: Indicates good discriminative ability

Greater than 0.8: Indicates strong predictive performance.  A higher C-index suggests that the model effectively distinguishes between patients with different survival outcomes