In [None]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


<br>

# **Understanding the Cox Proportional Hazards Model**

<dd>

The Cox Proportional Hazards Model is a widely used method in survival analysis to assess the effect of predictor variables on survival time. It is particularly valuable because it does not require specifying the underlying distribution of survival times, making it a semi-parametric model.

<dt>

<br>

---

<br>

### **Key Concepts**

**1. Hazard Function in the Cox Model**

<dd>

The hazard function, denoted as $h(t)$, represents the instantaneous risk of an event occurring at time $t$, given survival up to that time. The Cox model assumes that this hazard function can be expressed as:

$$
h(t | X) = h_0(t) e^{(\beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p)}
$$

where:


*   $h_0(t)$ is the baseline hazard function, representing the risk when all predictor variables are zero.
*   $X_1, X_2, ..., X_p$ are the predictor variables (covariates).
*   $\beta_1, \beta_2, ..., \beta_p$ are the coefficients that measure the impact of each predictor on survival.


This formulation allows us to analyze the effect of covariates on survival without making assumptions about the baseline hazard $h_0(t)$.
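
As a minimal numeric sketch of the formula above, the snippet below evaluates the hazard for one individual. The coefficients, covariate values, and the baseline hazard value are all made up for illustration; none of them come from the datasets used later in this notebook.

```python
import math

# Hypothetical coefficients and covariates (illustration only)
betas = [0.5, -0.2]        # beta_1, beta_2
x = [1.0, 3.0]             # X_1, X_2 for one individual
baseline_hazard = 0.01     # assumed value of h_0(t) at some fixed time t

# Linear predictor: beta_1 * X_1 + beta_2 * X_2
linear_predictor = sum(b * xi for b, xi in zip(betas, x))

# h(t | X) = h_0(t) * exp(linear predictor)
hazard = baseline_hazard * math.exp(linear_predictor)

print(f"linear predictor = {linear_predictor:.2f}")
print(f"hazard = {hazard:.5f}")
```

A negative linear predictor scales the baseline hazard down, and a positive one scales it up.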

<dt>

**2. Proportional Hazards Assumption**

<dd>

The term proportional hazards comes from the assumption that the hazard ratios between individuals remain constant over time. That is, the effect of a covariate does not change as time progresses. Mathematically, for two individuals with predictor values $X_A$ and $X_B$:

$$
\frac{h(t | X_A)}{h(t | X_B)} = e^{(\beta_1 (X_{A1} - X_{B1}) + ... + \beta_p (X_{Ap} - X_{Bp}))}
$$

Since $h_0(t)$ cancels out, the hazard ratio is independent of time $t$.
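
The cancellation can be checked numerically. The sketch below assumes a Weibull-shaped baseline hazard and a single covariate with a made-up coefficient; both are illustrative choices, since the Cox model itself leaves $h_0(t)$ unspecified.

```python
import math

def hazard(t, x, beta=0.7, shape=1.5, scale=10.0):
    """Cox-type hazard with an assumed Weibull baseline (illustration only)."""
    baseline = (shape / scale) * (t / scale) ** (shape - 1)  # h_0(t)
    return baseline * math.exp(beta * x)

x_a, x_b = 1.0, 0.0  # two hypothetical individuals

# The baseline changes with t, but the ratio between the two individuals does not
ratios = [hazard(t, x_a) / hazard(t, x_b) for t in (1.0, 5.0, 20.0)]
print(ratios)
print(math.exp(0.7 * (x_a - x_b)))
```

Every ratio equals $e^{\beta (X_A - X_B)}$ up to floating-point rounding, whatever the time point.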

If this assumption does not hold, extensions such as time-varying coefficients or a stratified Cox model may be necessary.

<dt>

<br>

---

<br>

### **Interpreting the Cox Model**

**1. Hazard Ratio (HR)**

<dd>

The hazard ratio (HR) quantifies the effect of a predictor variable on survival. It is calculated as:

$$
HR = e^{\beta}
$$

where:


*   If $HR > 1$, the predictor increases the hazard (higher risk, shorter survival time).

*   If $HR < 1$, the predictor decreases the hazard (lower risk, longer survival time).

*   If $HR = 1$, the predictor has no effect on survival.
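
A short sketch of how a coefficient translates into a hazard ratio. The coefficient values are hypothetical and chosen only to illustrate the arithmetic.

```python
import math

beta = 0.45  # hypothetical coefficient for a binary predictor

hr = math.exp(beta)
percent_change = (hr - 1) * 100  # relative change in hazard versus the reference

print(f"HR = {hr:.2f} ({percent_change:+.0f}% hazard)")

# For a continuous predictor, a k-unit increase multiplies the hazard by exp(k * beta)
beta_per_year = 0.03  # hypothetical coefficient per year of age
hr_per_decade = math.exp(10 * beta_per_year)
print(f"HR per 10 years = {hr_per_decade:.2f}")
```

Because the model is multiplicative, the effect of a larger change in a continuous predictor is obtained by exponentiating the scaled coefficient, not by scaling the hazard ratio.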


<dt>

**2. Confidence Intervals (CI)**

<dd>

To assess statistical significance, we compute the 95% confidence interval (CI) for the hazard ratio:

$$
CI = \left[ e^{(\beta - 1.96 \cdot \sigma)}, e^{(\beta + 1.96 \cdot \sigma)} \right]
$$

where $\sigma$ is the standard error of the coefficient.

*   If the CI excludes 1, the predictor is statistically significant.
*   If the CI includes 1, the effect may be due to chance.
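
The interval can be computed directly from a coefficient and its standard error. The two input values below are hypothetical.

```python
import math

beta = 0.38   # hypothetical coefficient
se = 0.055    # hypothetical standard error of the coefficient
z = 1.96      # normal quantile for a 95% interval

hr = math.exp(beta)
ci_lower = math.exp(beta - z * se)
ci_upper = math.exp(beta + z * se)

# The interval excludes 1 when both bounds lie on the same side of 1
excludes_one = ci_lower > 1 or ci_upper < 1
print(f"HR = {hr:.2f}, 95% CI = [{ci_lower:.2f}, {ci_upper:.2f}], excludes 1: {excludes_one}")
```

Note that the interval is symmetric around $\beta$ on the log scale, so it is not symmetric around the hazard ratio itself.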


<dt>

**3. p-value**

<dd>

The p-value tests the null hypothesis that a predictor's coefficient is zero (equivalently, HR = 1). Using the conventional 0.05 threshold:

*   If p < 0.05, the variable is statistically significant.
*   If p > 0.05, there is no strong evidence that the predictor affects survival.
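
One common way to obtain such a p-value is a Wald test, which compares the coefficient with its standard error. The sketch below uses hypothetical inputs and only the standard library.

```python
import math

beta = 0.38   # hypothetical coefficient
se = 0.055    # hypothetical standard error

z = beta / se
# Two-sided p-value under a standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.2f}, p = {p_value:.2e}")
```

A statistic of about 1.96 in absolute value corresponds to p of about 0.05, which is why the same constant appears in the 95% confidence interval.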

<dt>

<br>

---

<br>

### **Advantages**


*   No assumption on survival time distribution: Unlike parametric models, the Cox model does not require specifying the shape of the survival curve.

*   Handles censored data well: It efficiently includes individuals for whom the event has not yet occurred.

*   Interpretable coefficients: The exponentiated coefficients provide direct insights into risk factors.

<br>

---

<br>


### **Limitations**

*   Proportional hazards assumption: If violated, the model may produce misleading results.

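*   Time-dependent covariates: The basic model treats each predictor as fixed at baseline. Predictors that change over time require an extended data layout and an extended form of the model.

*   Baseline hazard is left unspecified: The fitting procedure estimates relative risks only, so absolute survival probabilities require a separate estimate of the baseline hazard.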

<br>



---


<br>


### **References**

<dd>

If you wish to further explore the Cox Proportional Hazards model—its assumptions, estimation methods, applications in medical research, and extensions—there are several foundational and practical works that provide in-depth explanations. Below are some of the most recommended references for both theoretical and applied studies of the method:

* **Cox, D. R. (1972). Regression models and life-tables**. Journal of the Royal Statistical Society: Series B (Methodological), 34(2), 187–220. DOI: 10.1111/j.2517-6161.1972.tb00899.x
 * This seminal paper introduced the proportional hazards model, laying the theoretical foundation for what became the Cox regression model. It remains the cornerstone reference for understanding hazard functions and survival regression.

* **Hosmer, D. W., Lemeshow, S., & May, S. (2008). Applied Survival Analysis: Regression Modeling of Time-to-Event Data** (2nd ed.). Wiley-Interscience. DOI: 10.1002/0471754994
 * A widely used reference for practical application of Cox models in healthcare and clinical research. Includes extensive examples using statistical software such as SAS and SPSS.

* **Therneau, T. M., & Grambsch, P. M. (2000). Modelling Survival Data: Extending the Cox Model. Springer**. DOI: 10.1007/978-1-4757-3294-8
 * This book is considered essential for understanding extensions of the Cox model, including time-varying covariates and stratified models. Written by the developer of the *survival* package in R.

* **Collett, D. (2023). Modelling Survival Data in Medical Research** (4th ed.). Chapman and Hall/CRC. DOI: 10.1201/9781003282525
 * Covers both the Kaplan-Meier and Cox models with clarity and depth, using real-life medical examples and datasets. Suitable for graduate students and professionals.

* **Kleinbaum, D. G., & Klein, M. (2012). Survival Analysis: A Self-Learning Text** (3rd ed.). Springer. DOI: 10.1007/978-1-4419-6646-9
 * A reader-friendly introduction to survival models including Cox regression. Features intuitive explanations and worked examples using software like R and Stata.

<dt>

<br>


---






# **Implementing the Cox Proportional Hazards Model**

<dd>

## Library import

To perform survival analysis with the Cox Proportional Hazards model, we use three main Python libraries:

*   pandas: data manipulation
*   NumPy: numerical operations
*   lifelines: survival modeling
<dt>


In [None]:
import pandas as pd
import numpy as np
from lifelines import CoxPHFitter

The lifelines library might not be available by default, so if needed, you can install it using:

In [None]:
!pip install lifelines

<br>

---

<br>

**1. Defining the Cox Model Function**

<dd>

To streamline the process, we define a function that fits the Cox model and summarizes the key outputs.



In [None]:
def execute_cox_model(df, duration_col, event_col, predictors, labels=None):
    """
    Performs a Cox Proportional Hazards model without weights and returns a summary of the results.

    Parameters:
    - df: Pandas DataFrame containing the data.
    - duration_col: String with the name of the time variable.
    - event_col: String with the name of the outcome variable (binary event).
    - predictors: List of strings with the names of predictor variables.
    - labels: (Optional) Dictionary mapping variable names to readable labels. Default is None.

    Returns:
    - summary_df: DataFrame with the results of the Cox model.
    """

---

<br>


**2. Ensuring the Correct Structure of the Dataset**

<dd>

The dataset for a Cox model typically consists of:

*   A duration column representing the time until the event or censoring.

*   An event column (binary: 1 = event occurred, 0 = censored).

*   A set of predictor variables that influence survival.
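
For orientation, here is a toy dataset with that structure. The column names and values are invented for illustration and are unrelated to the files loaded later.

```python
import pandas as pd

# Hypothetical toy dataset (column names are illustrative)
toy = pd.DataFrame({
    "time_days": [5, 12, 30, 30, 8],        # duration until event or censoring
    "died": [1, 0, 1, 0, 1],                # 1 = event occurred, 0 = censored
    "age_group": ["<65", "65-79", ">=80", "<65", ">=80"],
    "on_vasopressors": ["no", "yes", "yes", "no", "no"],
})

# Basic structural checks before modeling
assert (toy["time_days"] > 0).all()
assert set(toy["died"].unique()) <= {0, 1}
print(toy)
```

Each row is one individual; a censored row (event = 0) still contributes its follow-up time to the analysis.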

<dt>

<br>

---

**3. Identifying and Formatting Categorical Variables**

<dd>

Categorical variables must be correctly identified and converted into an appropriate format before being used in the model.


In [None]:
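def handle_categoricals(df, predictors):
    """
    Cast text-like predictor columns to the pandas 'category' dtype.

    Returns the converted DataFrame and the names of the categorical predictors.
    """
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    categorical_vars = df.select_dtypes(include=['object', 'category']).columns.intersection(predictors)
    for var in categorical_vars:
        df[var] = df[var].astype('category')
    return df, categorical_vars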

<dd>

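This marks the categorical predictors so that they can be one-hot encoded in the next step. The cast itself does not change the values and does not address multicollinearity; that is handled by dropping one level per variable during encoding.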

<dt>

---

<br>


**4. Encoding Categorical Variables**

<dd>

Since survival models require numerical input, categorical variables must be transformed into a numeric format using one-hot encoding.


In [None]:
def apply_dummies(df, categorical_vars):
    """
    Convert categorical variables to one-hot encoding.
    """
    df = pd.get_dummies(df, columns=categorical_vars, drop_first=True)
    return df

<dd>

This transformation converts categorical variables into binary columns, preventing the model from treating them as ordinal values.
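
The effect of `drop_first=True` is easiest to see on a small example. The toy column below is hypothetical; the dropped level becomes the reference category against which the remaining hazard ratios are read.

```python
import pandas as pd

toy = pd.DataFrame({"age_group": ["<65", "65-79", ">=80", "<65"]})
toy["age_group"] = toy["age_group"].astype("category")

# Categories are ordered lexicographically, so '65-79' comes first and is dropped
encoded = pd.get_dummies(toy, columns=["age_group"], drop_first=True)
print(encoded.columns.tolist())
```

Keeping all levels would make the dummy columns sum to one in every row, which is the source of the multicollinearity that dropping one level avoids.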

<dt>


---

<br>

**5. Ensuring Numerical Consistency**

<dd>

The duration column and event column should be strictly numeric. If they are stored as strings, they need to be converted.


In [None]:
def convert_columns_to_numeric(df, duration_col, event_col):
    """
    Ensure numerical variables have the correct type.
    """
    df[duration_col] = pd.to_numeric(df[duration_col], errors='coerce')
    df[event_col] = pd.to_numeric(df[event_col], errors='coerce')
    return df

<dd>

This conversion guarantees that the survival time and event indicator are properly formatted for analysis.
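
One caveat is that `errors='coerce'` replaces unparseable entries with NaN without raising an error, and those rows are removed later by the missing-value step. A quick count, sketched here on a hypothetical series, makes that loss visible.

```python
import pandas as pd

raw = pd.Series(["12", "7.5", "unknown", None, "30"])  # hypothetical raw durations
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present but could not be parsed as numbers
n_coerced = int(converted.isna().sum() - raw.isna().sum())
print(converted.tolist())
print(f"{n_coerced} value(s) could not be parsed and became NaN")
```

If the count is large, it is usually worth inspecting the raw values before continuing.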

<dt>


---

<br>

**6. Selecting Relevant Predictors**

<dd>

After encoding categorical variables, we need to update the list of predictors to include the newly created one-hot encoded columns.


In [None]:
def update_predictors_with_dummies(df, predictors, categorical_vars):
    """
    Update predictors to include one-hot encoded columns.
    """
    updated_predictors = [
        c for c in df.columns
        if c in predictors or any(c.startswith(p + '_') for p in categorical_vars)
    ]
    return updated_predictors

<dd>

This step ensures that our model includes all relevant variables without omitting transformed features.
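
Matching on a name prefix can also pick up unrelated columns that happen to start with a categorical predictor's name followed by an underscore. A stricter alternative, sketched below with a hypothetical helper named `encode_and_track` (not part of the pipeline above), is to record exactly which columns the encoding created.

```python
import pandas as pd

def encode_and_track(df, predictors, categorical_vars):
    """Hypothetical helper: one-hot encode and return the exact predictor list."""
    encoded = pd.get_dummies(df, columns=list(categorical_vars), drop_first=True)
    # Columns that exist only after encoding are the new dummy columns
    dummy_cols = [c for c in encoded.columns if c not in df.columns]
    kept = [p for p in predictors if p not in categorical_vars]
    return encoded, kept + dummy_cols

toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "sex_notes": ["free text", "more text", "n/a"],  # not a predictor
    "age": [70, 55, 81],
})
encoded, updated = encode_and_track(toy, ["sex", "age"], ["sex"])
print(updated)
```

Here `sex_notes` shares the `sex_` prefix but is left out, because it already existed before encoding.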


<dt>

---

<br>

**7. Handling Missing Values**

<dd>

The model cannot be fitted on rows with missing values, so they must be handled before fitting. Here we take the simplest approach and drop incomplete rows.

In [None]:
def drop_missing_values(df, duration_col, event_col, predictors):
    """
    Remove rows with missing values in essential columns.
    """
    df = df.dropna(subset=[duration_col, event_col] + predictors)
    return df

<dd>

By removing rows with missing values, we ensure that the model is trained only on complete cases.
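
Because dropping rows can remove a large share of the data, it helps to report how many rows survive and which columns cause the loss. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "time_days": [5, None, 30, 8],
    "died": [1, 0, 1, None],
    "age": [70, 55, None, 81],
})
essential = ["time_days", "died", "age"]

complete = toy.dropna(subset=essential)
print(f"kept {len(complete)} of {len(toy)} rows")
print(toy[essential].isna().sum())  # missing values per column
```

If many rows are lost, complete-case results may not represent the full sample.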

<dt>

---

<br>

**8. Finalizing the Process**

<dd>

Once all preprocessing steps are completed, we extract the relevant columns for model fitting.

In [None]:
def create_cox_dataframe(df, duration_col, event_col, predictors):
    """
    Select relevant columns for the Cox model.
    """
    df_cox = df[[duration_col, event_col] + predictors]
    return df_cox

<dd>

This dataset is now fully prepared and ready for analysis using the Cox Proportional Hazards Model.

<dt>


---

<br>

**9. Fitting the Cox Model**

<dd>

We now fit the Cox model using the lifelines library.

In [None]:
from lifelines import CoxPHFitter

def fit_cox_model(df_cox, duration_col, event_col):
    """
    Fit the Cox Proportional Hazards model.
    """
    cph = CoxPHFitter()
    cph.fit(df_cox, duration_col=duration_col, event_col=event_col)
    return cph

<dd>

This step trains the Cox model on the provided dataset.

<dt>


---

<br>

**10. Extracting Model Results**

<dd>

Once the model is trained, we extract and interpret key outputs.

In [None]:
import numpy as np

def extract_cox_summary_metrics(cph):
    """
    Extract HR, 95% confidence intervals, and display-formatted p-values from the Cox model.
    """
    summary = cph.summary.copy()
    summary['HR'] = np.exp(summary['coef'])
    summary['CI_lower'] = np.exp(summary['coef'] - 1.96 * summary['se(coef)'])
    summary['CI_upper'] = np.exp(summary['coef'] + 1.96 * summary['se(coef)'])
    summary['p_adj'] = summary['p'].apply(lambda p: "<0.001" if p < 0.001 else round(p, 3))
    return summary

<dd>

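This step calculates the hazard ratios (HR), 95% confidence intervals, and p-values formatted for display (values below 0.001 are shown as `<0.001`, others are rounded to three decimals). Despite the column name `p_adj`, no multiple-comparison adjustment is applied.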

<dt>


---

<br>

**11. Formatting the Final Output**

<dd>

We structure the final results into a readable summary table.

In [None]:
def format_summary_table(summary, labels=None):
    """
    Format the summary DataFrame with selected columns and variable labels.
    """
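    summary_df = summary[['HR', 'CI_lower', 'CI_upper', 'p_adj']].reset_index()
    # lifelines names the summary index 'covariate', so the first column after
    # reset_index() is 'covariate' rather than 'index'; refer to it by position.
    name_col = summary_df.columns[0]
    summary_df = summary_df.rename(columns={'p_adj': 'p-value'})

    if labels:
        summary_df[name_col] = summary_df[name_col].map(labels).fillna(summary_df[name_col])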

    return summary_df


<dd>

This final step ensures that the output is formatted and easily interpretable.

<dt>

<br>


---

<br>

**12. Cox pipeline**

<dd>

We execute the full Cox analysis in one function.

In [None]:
def run_cox_pipeline(df, duration_col, event_col, predictors, labels=None):
    df, categorical_vars = handle_categoricals(df, predictors)
    df = apply_dummies(df, categorical_vars)
    df = convert_columns_to_numeric(df, duration_col, event_col)
    updated_predictors = update_predictors_with_dummies(df, predictors, categorical_vars)
    df = drop_missing_values(df, duration_col, event_col, updated_predictors)
    df_cox = create_cox_dataframe(df, duration_col, event_col, updated_predictors)

    cph = fit_cox_model(df_cox, duration_col, event_col)
    summary = extract_cox_summary_metrics(cph)
    formatted_summary = format_summary_table(summary, labels)

    return formatted_summary

<br>

---

# **User Case 1**


### **1. Uploading and Loading the Dataset**


In [None]:
from google.colab import files
# Upload file
uploaded = files.upload()
df_model = pd.read_csv('df_model.csv')

Saving df_model.csv to df_model.csv




---

### **2. Running the cox model**

*   Defining Time, Event, and Predictor Variables
*   Running the Cox Model
*   Visualize the Model Results




In [None]:
# Time, outcome, and predictor variables
duration_col = 'HospitalLengthStay_trunc'  # Time variable
event_col = 'HospitalDischargeCode_trunc_bin'  # Binary outcome variable
cox_predictors = [
    'period', 'Idade_Agrupada2', 'ChronicHealthStatusName', 'obesity',
    'IsImmunossupression', 'IsSteroidsUse', 'IsSevereCopd', 'IsChfNyha',
    'cancer', 'ResourceIsRenalReplacementTherapy', 'ResourceIsVasopressors',
    'Vent_Resource'
]

# Run the Cox model pipeline using the separate functions
results = run_cox_pipeline(df_model, duration_col, event_col, cox_predictors)

# Display the formatted results
print(results)

                                            covariate        HR  CI_lower  \
0                                    period_2022-2023  1.053884  0.968827   
1                                 Idade_Agrupada2_<65  0.777876  0.679785   
2                                Idade_Agrupada2_>=80  1.570852  1.421130   
3   ChronicHealthStatusName_Major assistance / bed...  1.368248  1.228323   
4            ChronicHealthStatusName_Minor assistance  1.188733  1.070908   
5                                         obesity_yes  0.820770  0.653215   
6                             IsImmunossupression_yes  1.238394  1.057788   
7                                   IsSteroidsUse_yes  1.083819  0.825148   
8                                    IsSevereCopd_yes  0.959102  0.860470   
9                                       IsChfNyha_yes  1.209235  1.072670   
10                                         cancer_yes  1.466217  1.315612   
11              ResourceIsRenalReplacementTherapy_yes  1.130405  1.011259   



---


### **3. Interpreting the Model Results**

*   Age Group (>=80): HR = 1.57, CI = [1.42, 1.73], p < 0.001 → Patients aged 80 or older have a higher hazard than the reference age group.
*   Chronic Health Status (Major assistance): HR = 1.37, p < 0.001 → Patients who need major assistance have a higher hazard.
*   Cancer: HR = 1.47, CI = [1.32, 1.63], p < 0.001 → Cancer is associated with a significantly higher hazard.
*   Vasopressor Use: HR = 1.52, p < 0.001 → Patients who need vasopressors have a higher hazard.
*   Ventilation (NIV): HR = 0.41, CI = [0.36, 0.46], p < 0.001 → Non-invasive ventilation is associated with a significantly lower hazard.











---

## **Analyzing results with Forest Plot**

<dd>

A forest plot is a graphical way to read the Cox model's result table: each point is a hazard ratio, and the horizontal line through it spans the lower and upper bounds of its confidence interval.

Below is the full code for the Forest Plot.

<dt>

In [None]:
import pandas as pd
import plotly.graph_objs as go


def fig_forest_plot(
        df, dictionary=None,
        title='Forest Plot',
        labels=['Study', 'Hazard Ratio', 'LowerCI', 'UpperCI'],
        graph_id='forest-plot', graph_label='', graph_about='',
        only_display=False):

    # Validate the input before using any of the columns
    if not set(labels).issubset(df.columns):
        error_str = f'Dataframe must contain the following columns: {labels}'
        raise ValueError(error_str)

    # Sort by hazard ratio in ascending order
    df = df.sort_values(by=labels[1], ascending=True)

    # Prepare Data Traces
    traces = []

    # Add the point estimates as scatter plot points
    traces.append(
        go.Scatter(
            x=df[labels[1]],
            y=df[labels[0]],
            mode='markers',
            name='Hazard Ratio',
            marker=dict(color='blue', size=10))
    )

    # Add the confidence intervals as lines
    for index, row in df.iterrows():
        traces.append(
            go.Scatter(
                x=[row[labels[2]], row[labels[3]]],
                y=[row[labels[0]], row[labels[0]]],
                mode='lines',
                showlegend=False,
                line=dict(color='blue', width=2))
        )

    # Define layout
    layout = go.Layout(
        title=title,
        xaxis=dict(title='Hazard Ratio'),
        yaxis=dict(
            title='', automargin=True, tickmode='array',
            tickvals=df[labels[0]].tolist(), ticktext=df[labels[0]].tolist()),
        shapes=[
            dict(
                type='line', x0=1, y0=-0.5, x1=1, y1=len(df[labels[0]])-0.5,
                line=dict(color='red', width=2)
            )],  # Line of no effect
        margin=dict(l=100, r=100, t=100, b=50),
        height=600
    )

    return go.Figure(data=traces, layout=layout)

---

## **Executing Forest Plot**

In [None]:
graph = fig_forest_plot(
    df = results,
    labels = results.columns.tolist(),
    only_display=True
)

graph.show()

---

## **User Case 2**


### **1. Uploading and Loading the Dataset**

In [None]:
from google.colab import files
# Upload file
uploaded = files.upload()
df_map = pd.read_csv('df_map.csv')

Saving df_map.csv to df_map.csv


---
### **2. Preprocessing the Dataset to Ensure Binary Outcome**

<dd>

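The event column must be coded as 1 (event) or 0 (censored), so the text labels are mapped to numbers before running the pipeline. Here, discharged patients are treated as censored. Any label missing from the mapping becomes NaN, and those rows are dropped later by the missing-value step.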

<dt>

In [None]:
df_map["outco_binary_outcome"] = df_map["outco_binary_outcome"].map({
    "Death": 1,
    "Censored": 0,
    "Discharged": 0
})



---

### **3. Running the cox model**

*   Defining Time, Event, and Predictor Variables
*   Running the Cox Model
*   Visualize the Model Results



In [None]:
# Time, outcome, and predictor variables
duration_col = 'drug14_antiviralday'  # Time variable
event_col = 'outco_binary_outcome'  # Binary outcome variable
cox_predictors = [
    'demog_sex', 'demog_race___Latin American', 'expo14_setting', 'comor_aids_art',
    'comor_chrkidney', 'comor_chrpulmona', 'comor_liverdisease', 'comor_chrcardiac',
    'adsym_mobile', 'vital_o2supp_type'
]

# Run the Cox model
results = run_cox_pipeline(df_map, duration_col, event_col, cox_predictors)

# Display the results
print(results)

ValueError: could not convert string to float: 'udkfvdxzap'
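
This error indicates that at least one column passed to the model still contains text. The notebook does not show which column is responsible. One possible route, not confirmed here, is the prefix matching in `update_predictors_with_dummies`, which can pull in a free-text column whose name starts with a categorical predictor's name. A small diagnostic such as the hypothetical `find_non_numeric_columns` below can locate the offending column before fitting.

```python
import pandas as pd

def find_non_numeric_columns(df_cox):
    """Hypothetical helper: list columns whose dtype is neither numeric nor boolean."""
    return [
        c for c in df_cox.columns
        if not (pd.api.types.is_numeric_dtype(df_cox[c])
                or pd.api.types.is_bool_dtype(df_cox[c]))
    ]

# Toy frame standing in for the model input (values are invented)
toy = pd.DataFrame({
    "time": [3.0, 9.0],
    "event": [1, 0],
    "sex_M": [True, False],
    "sex_notes": ["udkfvdxzap", "abc"],   # stray free-text column
})
print(find_non_numeric_columns(toy))
```

Running the check on the frame passed to `fit_cox_model` would show which predictor needs to be encoded or removed.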

---


# **Full Code**

In [None]:
def execute_cox_model(df, duration_col, event_col, predictors, labels=None):
    """
    Performs a Cox Proportional Hazards model without weights and returns a summary of the results.

    Parameters:
    - df: Pandas DataFrame containing the data.
    - duration_col: String with the name of the time variable.
    - event_col: String with the name of the outcome variable (binary event).
    - predictors: List of strings with the names of predictor variables.
    - labels: (Optional) Dictionary mapping variable names to readable labels. Default is None.

    Returns:
    - summary_df: DataFrame with the results of the Cox model.
    """

    # Ensure categorical variables are treated appropriately
    categorical_vars = df.select_dtypes(include=['object', 'category']).columns.intersection(predictors)
    for var in categorical_vars:
        df[var] = df[var].astype('category')

    # Convert categorical variables to dummies
    df = pd.get_dummies(df, columns=categorical_vars, drop_first=True)

    # Ensure numerical variables have the correct type
    df[duration_col] = pd.to_numeric(df[duration_col], errors='coerce')
    df[event_col] = pd.to_numeric(df[event_col], errors='coerce')

    # Update predictors to include one-hot encoded columns
    predictors = [c for c in df.columns if c in predictors or any(c.startswith(p + '_') for p in categorical_vars)]

    # Remove rows with missing values in essential columns
    df = df.dropna(subset=[duration_col, event_col] + predictors)

    # Select relevant columns
    df_cox = df[[duration_col, event_col] + predictors]

    # Fit the Cox model
    cph = CoxPHFitter()
    cph.fit(df_cox, duration_col=duration_col, event_col=event_col)

    # Model summary
    summary = cph.summary
    summary['HR'] = np.exp(summary['coef'])
    summary['CI_lower'] = np.exp(summary['coef'] - 1.96 * summary['se(coef)'])
    summary['CI_upper'] = np.exp(summary['coef'] + 1.96 * summary['se(coef)'])
    summary['p_adj'] = summary['p'].apply(lambda p: "<0.001" if p < 0.001 else round(p, 3))

    # Select relevant columns for the final summary
    summary_df = summary[['HR', 'CI_lower', 'CI_upper', 'p_adj']].reset_index()
    # lifelines names the summary index 'covariate', so refer to the first column by position
    name_col = summary_df.columns[0]
    summary_df = summary_df.rename(columns={'p_adj': 'p-value'})

    # Replace variable labels if provided
    if labels:
        summary_df[name_col] = summary_df[name_col].map(labels).fillna(summary_df[name_col])

    return summary_df

---

## **User Case 3**

<dd>

Since the dataset has already been loaded in the first example, there is no need to load it again to execute this function.

<dt>

In [None]:
# Time, outcome, and predictor variables
duration_col = 'HospitalLengthStay_trunc'  # Time variable
event_col = 'HospitalDischargeCode_trunc_bin'  # Binary outcome variable
cox_predictors = [
    'period', 'Idade_Agrupada2', 'ChronicHealthStatusName', 'obesity',
    'IsImmunossupression', 'IsSteroidsUse', 'IsSevereCopd', 'IsChfNyha',
    'cancer', 'ResourceIsRenalReplacementTherapy', 'ResourceIsVasopressors',
    'Vent_Resource'
]

# Run the Cox model
results = execute_cox_model(df_model, duration_col, event_col, cox_predictors)

# Display the results
print(results)

                                            covariate        HR  CI_lower  \
0                                    period_2022-2023  1.053884  0.968827   
1                                 Idade_Agrupada2_<65  0.777876  0.679785   
2                                Idade_Agrupada2_>=80  1.570852  1.421130   
3   ChronicHealthStatusName_Major assistance / bed...  1.368248  1.228323   
4            ChronicHealthStatusName_Minor assistance  1.188733  1.070908   
5                                         obesity_yes  0.820770  0.653215   
6                             IsImmunossupression_yes  1.238394  1.057788   
7                                   IsSteroidsUse_yes  1.083819  0.825148   
8                                    IsSevereCopd_yes  0.959102  0.860470   
9                                       IsChfNyha_yes  1.209235  1.072670   
10                                         cancer_yes  1.466217  1.315612   
11              ResourceIsRenalReplacementTherapy_yes  1.130405  1.011259   

---

## **User Case 4**

<dd>

Since the dataset has already been loaded in the second example, there is no need to load it again to execute this function.

<dt>

In [None]:
# Time, outcome, and predictor variables
duration_col = 'drug14_antiviralday'  # Time variable
event_col = 'outco_binary_outcome'  # Binary outcome variable
cox_predictors = [
    'demog_sex', 'demog_race___Latin American', 'expo14_setting', 'comor_aids_art',
    'comor_chrkidney', 'comor_chrpulmona', 'comor_liverdisease', 'comor_chrcardiac',
    'adsym_mobile', 'vital_o2supp_type'
]

# Run the Cox model
results = execute_cox_model(df_map, duration_col, event_col, cox_predictors)

# Display the results
print(results)

ValueError: could not convert string to float: 'udkfvdxzap'