
# Survival Analysis in Biomedical Research

This notebook demonstrates survival analysis techniques, including Kaplan-Meier curves, log-rank tests, and Cox proportional hazards regression, using Python libraries such as `scikit-survival` and `lifelines`.

## Dataset Information and Links

1. **Veterans Lung Cancer Dataset:**
   - [Veterans Lung Cancer Dataset](https://github.com/sebp/scikit-survival/blob/v0.22.2/sksurv/datasets/data/veteran.arff)
   - Provided by the `scikit-survival` package.

2. **Breast Cancer Dataset:**
   - Bundled with the `scikit-survival` package and loaded with `load_breast_cancer()`.

### Prerequisites

Before running the code, install the required packages:
```bash
pip install scikit-survival lifelines matplotlib numpy pandas
```
        


## Kaplan-Meier Curve: Veterans Lung Cancer Dataset

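The Kaplan-Meier estimator is a nonparametric estimate of the survival function, the probability of surviving beyond a given time. It accounts for right-censored observations, that is, patients whose follow-up ended before the event was observed.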
        

In [None]:

# Import necessary libraries
from sksurv.datasets import load_veterans_lung_cancer
from sksurv.nonparametric import kaplan_meier_estimator
import matplotlib.pyplot as plt

# Load the veterans lung cancer dataset
data_x, data_y = load_veterans_lung_cancer()

# Kaplan-Meier survival estimates
time, survival_prob = kaplan_meier_estimator(data_y["Status"], data_y["Survival_in_days"])

# Plot the Kaplan-Meier curve
plt.step(time, survival_prob, where="post")
plt.ylabel("Estimated probability of survival")
plt.xlabel("Time (days)")
plt.title("Kaplan-Meier Curve: Veterans Lung Cancer")
plt.show()
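
To make the estimate less of a black box, the product-limit formula behind the curve can be written out by hand. The sketch below is a minimal pure-Python version, not the `scikit-survival` implementation; the function name `km_product_limit` and the toy data are hypothetical. At each distinct time it multiplies the running survival probability by `1 - d / n`, where `d` is the number of events at that time and `n` is the number of subjects still at risk.

```python
from collections import Counter


def km_product_limit(events, times):
    """Kaplan-Meier estimate via the product-limit formula (illustrative sketch)."""
    deaths = Counter(t for e, t in zip(events, times) if e)
    exits = Counter(times)  # subjects leaving the risk set at t (event or censored)
    at_risk = len(times)
    survival = 1.0
    out_times, out_probs = [], []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        survival *= 1.0 - d / at_risk  # censored-only times leave survival unchanged
        out_times.append(t)
        out_probs.append(survival)
        at_risk -= exits[t]
    return out_times, out_probs


# Hypothetical toy data: True = event observed, False = censored
km_events = [True, False, True, True, False]
km_times = [2, 3, 3, 5, 8]
km_t, km_s = km_product_limit(km_events, km_times)
for t, s in zip(km_t, km_s):
    print(f"t={t}: S(t)={s:.2f}")
```

Censored subjects never reduce the survival probability directly; they only shrink the risk set used at later event times.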
        


## Log-Rank Test: Comparing Treatment Groups

The log-rank test evaluates the null hypothesis that two or more groups share the same survival function. A small p-value is evidence that the survival curves of the groups differ.
        

In [None]:

from sksurv.compare import compare_survival

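# Boolean mask for the "test" arm; reused later to split the Kaplan-Meier curves
treatment = (data_x["Treatment"] == "test").to_numpy()

# Log-rank test comparing survival between the treatment arms
chi2, pvalue = compare_survival(data_y, data_x["Treatment"])
print(f"Chi-squared statistic: {chi2:.3f}")
print(f"P-value: {pvalue:.3f}")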
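
For intuition about what the test statistic measures, here is a minimal pure-Python sketch of the two-group log-rank statistic. It is an illustration under simplifying assumptions, not the `scikit-survival` code path; the function name `logrank_two_groups` and the toy data are hypothetical. At every distinct event time it compares the observed number of events in group 1 with the number expected if both groups shared one survival function, then refers the standardized sum to a chi-squared distribution with one degree of freedom.

```python
import math


def logrank_two_groups(events, times, in_group1):
    """Two-group log-rank test (illustrative sketch). Returns (chi2, pvalue)."""
    event_times = sorted({t for e, t in zip(events, times) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Subjects still at risk just before t, overall and in group 1
        n = sum(1 for u in times if u >= t)
        n1 = sum(1 for u, g in zip(times, in_group1) if u >= t and g)
        # Events at t, overall and in group 1
        d = sum(1 for e, u in zip(events, times) if e and u == t)
        d1 = sum(1 for e, u, g in zip(events, times, in_group1) if e and u == t and g)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:  # hypergeometric variance is undefined for a single subject
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-squared distribution with 1 degree of freedom
    pvalue = math.erfc(math.sqrt(chi2 / 2))
    return chi2, pvalue


# Hypothetical toy data: group 1 fails early, group 2 fails late, no censoring
lr_times = [1, 2, 3, 4, 5, 6]
lr_events = [True] * 6
lr_group1 = [True, True, True, False, False, False]
lr_chi2, lr_p = logrank_two_groups(lr_events, lr_times, lr_group1)
print(f"chi2={lr_chi2:.3f}, p={lr_p:.3f}")
```

The closed-form p-value via `math.erfc` only applies to the two-group case; with more groups the statistic has more degrees of freedom and a general chi-squared tail function is needed.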
        


## Kaplan-Meier Curves by Treatment Group

Plotting a separate Kaplan-Meier curve for each treatment group makes any difference in survival between the two arms visible.
        

In [None]:

# Separate Kaplan-Meier estimates by group
time_treatment, survival_prob_treatment = kaplan_meier_estimator(
    data_y["Status"][treatment], data_y["Survival_in_days"][treatment]
)
time_control, survival_prob_control = kaplan_meier_estimator(
    data_y["Status"][~treatment], data_y["Survival_in_days"][~treatment]
)

# Plot Kaplan-Meier curves
plt.step(time_treatment, survival_prob_treatment, where="post", label="Test Treatment")
plt.step(time_control, survival_prob_control, where="post", label="Standard Treatment")
plt.ylabel("Estimated probability of survival")
plt.xlabel("Time (days)")
plt.legend()
plt.title("Kaplan-Meier Curves by Treatment Group")
plt.show()
        


## Cox Proportional Hazards Regression: Breast Cancer Dataset

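The Cox proportional hazards model relates covariates to the hazard of the event. Each fitted coefficient is a log hazard ratio, and the model assumes that these hazard ratios stay constant over time.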
        

In [None]:

from sksurv.datasets import load_breast_cancer
from lifelines import CoxPHFitter
import pandas as pd

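# Load the breast cancer dataset
data_x, data_y = load_breast_cancer()

# Prepare data for Cox regression
df = data_x[["age"]].copy()
# Indicator for tumor size of at least 2 (in the units of the dataset's `size` column)
df["size_group"] = (data_x["size"] >= 2).astype(int)
# Estrogen receptor status is coded as "negative"/"positive" in this dataset
df["er_positive"] = (data_x["er"] == "positive").astype(int)
df["time"] = data_y["t.tdm"]    # follow-up time
df["event"] = data_y["e.tdm"]   # True if the event was observed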

# Fit Cox proportional hazards model
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()
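
To show where the coefficients in the summary come from, the sketch below maximizes the Cox partial likelihood for a single covariate with Newton-Raphson updates. It is a simplified illustration, not the `lifelines` implementation: it assumes no tied event times, uses a fixed number of iterations, and the function names and toy data are hypothetical. Exponentiating the fitted coefficient gives the hazard ratio.

```python
import math


def cox_score_and_information(beta, events, times, x):
    """First derivative and negative second derivative of the Cox partial
    log-likelihood for one covariate, assuming no tied event times."""
    score, information = 0.0, 0.0
    for e_i, t_i, x_i in zip(events, times, x):
        if not e_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set: covariate values of everyone still under observation at t_i
        risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
        weights = [math.exp(beta * x_j) for x_j in risk]
        total = sum(weights)
        mean_x = sum(w * x_j for w, x_j in zip(weights, risk)) / total
        mean_x2 = sum(w * x_j ** 2 for w, x_j in zip(weights, risk)) / total
        score += x_i - mean_x
        information += mean_x2 - mean_x ** 2
    return score, information


def fit_cox_one_covariate(events, times, x, n_iter=25):
    """Newton-Raphson on the partial log-likelihood, starting from beta = 0."""
    beta = 0.0
    for _ in range(n_iter):
        score, information = cox_score_and_information(beta, events, times, x)
        beta += score / information
    return beta


# Hypothetical toy data: binary covariate, all events observed, no ties
cox_events = [True] * 6
cox_times = [1, 2, 3, 4, 5, 6]
cox_x = [1, 0, 1, 0, 1, 0]
cox_beta = fit_cox_one_covariate(cox_events, cox_times, cox_x)
cox_hazard_ratio = math.exp(cox_beta)
print(f"coef={cox_beta:.3f}, hazard ratio={cox_hazard_ratio:.3f}")
```

A hazard ratio above 1 means the covariate is associated with a higher event rate at any given time; production fitters add tie handling, step-size control, and standard errors on top of this core update.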
        