### Setup 

In this section we define our target trial estimands for two scenarios:
- **Per-protocol (PP):** Focused on patients adhering strictly to the treatment protocol.
- **Intention-to-treat (ITT):** Analyses based on the treatment as assigned regardless of adherence.

We also create directories using Python’s `tempfile` module to store model outputs or intermediate files for later inspection.

In [4]:
import os
import pandas as pd
import tempfile

# Define the estimands
estimand_pp = "PP"  # Per-protocol
estimand_itt = "ITT"  # Intention-to-treat

# Create directories to save files for later inspection
trial_pp_dir = os.path.join(tempfile.gettempdir(), "trial_pp")
os.makedirs(trial_pp_dir, exist_ok=True)

trial_itt_dir = os.path.join(tempfile.gettempdir(), "trial_itt")
os.makedirs(trial_itt_dir, exist_ok=True)

# The data_censored.csv file will be used later for analysis.

### Data Preparation

In this section we load the observational data from the `data_censored.csv` file which will be used for the target trial emulation. 
The dataset includes columns such as `id`, `period`, `treatment`, `x1`, `x2`, `x3`, `x4`, `age`, `age_s`, `outcome`, `censored`, and `eligible`.

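We then associate specific columns with their roles in the trial data. Rather than a `set_data` helper, each estimand is represented by a plain dictionary that holds the DataFrame together with the names of the `id`, `period`, `treatment`, `outcome`, and `eligible` columns.

The `trial_pp` and `trial_itt` dictionaries are built the same way and start from the same data; they diverge only once the weight models are configured.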
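
Because the column roles are stored as plain strings, a typo in a column name would only surface later, when a model is fitted. The following is a minimal sketch of an early check. `check_roles` and `REQUIRED_ROLES` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
# Hypothetical helper: verify that each role points at an existing column.
REQUIRED_ROLES = ("id", "period", "treatment", "outcome", "eligible")

def check_roles(trial, available_columns):
    """Return True if every role maps to an existing column, else raise ValueError."""
    problems = []
    for role in REQUIRED_ROLES:
        if role not in trial:
            problems.append(f"role '{role}' is not set")
        elif trial[role] not in available_columns:
            problems.append(f"column '{trial[role]}' for role '{role}' not found")
    if problems:
        raise ValueError("; ".join(problems))
    return True

# Example using the column names listed for data_censored.csv
columns = ["id", "period", "treatment", "x1", "x2", "x3", "x4",
           "age", "age_s", "outcome", "censored", "eligible"]
roles = {"id": "id", "period": "period", "treatment": "treatment",
         "outcome": "outcome", "eligible": "eligible"}
roles_ok = check_roles(roles, columns)
```

With the notebook's objects, the same check could be run as `check_roles(trial_pp, trial_pp["data"].columns)`.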

In [5]:
# Load the observational data
data_censored = pd.read_csv('Data/data_censored.csv')
print(data_censored.head())  # display first few rows


   id  period  treatment  x1        x2  x3        x4  age     age_s  outcome  \
0   1       0          1   1  1.146148   0  0.734203   36  0.083333        0   
1   1       1          1   1  0.002200   0  0.734203   37  0.166667        0   
2   1       2          1   0 -0.481762   0  0.734203   38  0.250000        0   
3   1       3          1   0  0.007872   0  0.734203   39  0.333333        0   
4   1       4          1   1  0.216054   0  0.734203   40  0.416667        0   

   censored  eligible  
0         0         1  
1         0         0  
2         0         0  
3         0         0  
4         0         0  


In [6]:
# Define Per-protocol (PP) dataset with data included
trial_pp = {
    "data": data_censored,
    "id": "id",
    "period": "period",
    "treatment": "treatment",
    "outcome": "outcome",
    "eligible": "eligible"
}

# Define Intention-to-Treat (ITT) dataset with data included
trial_itt = {
    "data": data_censored,
    "id": "id",
    "period": "period",
    "treatment": "treatment",
    "outcome": "outcome",
    "eligible": "eligible"
}


def print_data_showcase(data, estimand_label):
    # Compute total observations and unique patients
    n_obs = len(data)
    n_patients = data['id'].nunique()
    
    # Get the first 2 rows and last 2 rows of the data
    head_df = data.head(2)
    tail_df = data.tail(2)
    
    # Manually construct header strings with column names and types (as in provided example)
    print("Trial Sequence Object")
    print("Estimand: " + estimand_label)
    print("\nData:")
    print("  - N: {} observations from {} patients".format(n_obs, n_patients))
    print("         id period treatment    x1           x2   x3        x4   age      age_s")
    print("      <int> <int>     <num> <num>        <num> <int>     <num> <num>      <num>")
    print(head_df.to_string(index=True))
    print("---")
    print(tail_df.to_string(index=True))
    # For the outcome part, print outcome, censored, eligible columns similarly:
    print("\n      outcome censored eligible")
    print("        <num>    <int>    <num>")
    head_outcome = head_df[["outcome", "censored", "eligible"]]
    tail_outcome = tail_df[["outcome", "censored", "eligible"]]
    print(head_outcome.to_string(index=True))
    print("---")
    print(tail_outcome.to_string(index=True))
    print("\n" + "-"*80 + "\n")

# Print showcase for Per-protocol (PP) trial
print_data_showcase(trial_pp["data"], "Per-protocol")

# Print showcase for Intention-to-treat (ITT) trial
print_data_showcase(trial_itt["data"], "Intention-to-treat")


Trial Sequence Object
Estimand: Per-protocol

Data:
  - N: 725 observations from 89 patients
         id period treatment    x1           x2   x3        x4   age      age_s
      <int> <int>     <num> <num>        <num> <int>     <num> <num>      <num>
   id  period  treatment  x1        x2  x3        x4  age     age_s  outcome  censored  eligible
0   1       0          1   1  1.146148   0  0.734203   36  0.083333        0         0         1
1   1       1          1   1  0.002200   0  0.734203   37  0.166667        0         0         0
---
     id  period  treatment  x1        x2  x3        x4  age     age_s  outcome  censored  eligible
723  99       6          1   1 -0.033762   1  0.575268   71  3.000000        0         0         0
724  99       7          0   0 -1.340497   1  0.575268   72  3.083333        1         0         0

      outcome censored eligible
        <num>    <int>    <num>
   outcome  censored  eligible
0        0         0         1
1        0         0        

## Weight Models and Censoring

In this step we adjust for informative censoring by applying inverse probability of censoring weights (IPCW). Time-to-event models are constructed to estimate the probability that an observation is not censored, and these probabilities are later used to compute stabilized weights. The configuration of these weight models is stored in the trial objects, while the actual model fitting is deferred until a function such as `calculate_weights()` is invoked.

- **Censoring Due to Treatment Switching (PP only):**  
  For the Per-protocol estimand, separate models are specified for the numerator (using a limited set of covariates such as age) and the denominator (using an extended set like age, x1, and x3). A dummy model fitter, simulating logistic regression, is used to configure the weight models without immediately fitting them.

- **Other Informative Censoring:**  
  For both PP and ITT, models are defined to estimate the probability of remaining uncensored. This involves specifying the censoring event (e.g., the "censored" column) along with numerator and denominator models (e.g., using x2 in the numerator vs. x2 + x1 in the denominator) and an option to pool models. The configurations are stored, and the models are fit later when needed.
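
To make the role of the numerator and denominator models concrete, here is a minimal sketch of how stabilized censoring weights are often assembled from predicted probabilities. It assumes the weight at each period is the cumulative product of the per-period ratio within a patient; the notebook's own helper functions use the per-period ratio only, so the cumulative step is an assumption of this sketch. The function name `stabilized_censor_weights` and the keys `p_num` and `p_den` are hypothetical.

```python
from collections import defaultdict

def stabilized_censor_weights(rows):
    """Compute cumulative stabilized weights per patient.

    rows: iterable of dicts with keys id, period, p_num, p_den, where
    p_num and p_den are predicted probabilities of remaining uncensored
    from the numerator and denominator models.
    Returns a dict mapping (id, period) to the cumulative weight.
    """
    by_id = defaultdict(list)
    for row in rows:
        by_id[row["id"]].append(row)

    weights = {}
    for patient_id, records in by_id.items():
        running = 1.0
        # Multiply the per-period ratios in time order
        for row in sorted(records, key=lambda r: r["period"]):
            running *= row["p_num"] / row["p_den"]
            weights[(patient_id, row["period"])] = running
    return weights

# Made-up probabilities, for illustration only
example_rows = [
    {"id": 1, "period": 0, "p_num": 0.9, "p_den": 0.8},
    {"id": 1, "period": 1, "p_num": 0.9, "p_den": 0.9},
    {"id": 2, "period": 0, "p_num": 0.5, "p_den": 1.0},
]
example_weights = stabilized_censor_weights(example_rows)
```

A ratio above 1 up-weights observations that the richer denominator model considers more likely to be censored than the numerator model does.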

In [7]:
# Define a dummy model fitter to simulate fitting using logistic regression
class StatsGLMLogit:
    def __init__(self, save_path):
        self.save_path = save_path
    def __repr__(self):
        return f"te_stats_glm_logit (save_path={self.save_path})"

# Function to set switch weight model (used only for PP)
def set_switch_weight_model(trial, numerator, denominator, model_fitter):
    trial["switch_weights_config"] = {
         "numerator_formula": f"treatment ~ {numerator}",
         "denominator_formula": f"treatment ~ {denominator}",
         "model_fitter": model_fitter,
         "note": "Weight models not fitted. Use calculate_weights()"
    }
    return trial

# Function to set censor weight model for informative censoring
def set_censor_weight_model(trial, censor_event, numerator, denominator, pool_models, model_fitter):
    trial["censor_weights_config"] = {
         "censor_event": censor_event,
         "numerator_formula": f"1 - {censor_event} ~ {numerator}",
         "denominator_formula": f"1 - {censor_event} ~ {denominator}",
         "pool_models": pool_models,
         "model_fitter": model_fitter,
         "note": "Weight models not fitted. Use calculate_weights()"
    }
    return trial

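# Apply treatment switching weight model for PP on a copy, so that trial_pp_switch
# is a separate dictionary from trial_pp (set_switch_weight_model modifies its argument in place).
trial_pp_switch = set_switch_weight_model(
    trial_pp.copy(),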
    numerator="age",
    denominator="age + x1 + x3",
    model_fitter=StatsGLMLogit(save_path=os.path.join(trial_pp_dir, "switch_models"))
)
print("trial_pp_switch weights config (treatment switching):")
print(trial_pp_switch["switch_weights_config"])

# Apply censor weight model on a copy of PP to keep it separate.
trial_pp_censor = set_censor_weight_model(
    trial_pp.copy(),
    censor_event="censored",
    numerator="x2",
    denominator="x2 + x1",
    pool_models="none",
    model_fitter=StatsGLMLogit(save_path=os.path.join(trial_pp_dir, "censor_models"))
)
print("trial_pp_censor weights config (censoring):")
print(trial_pp_censor["censor_weights_config"])

# For ITT, censoring weights remain as before.
trial_itt = set_censor_weight_model(
    trial_itt,
    censor_event="censored",
    numerator="x2",
    denominator="x2 + x1",
    pool_models="numerator",
    model_fitter=StatsGLMLogit(save_path=os.path.join(trial_itt_dir, "censor_models"))
)
print("trial_itt censor_weights_config:")
print(trial_itt["censor_weights_config"])


trial_pp_switch weights config (treatment switching):
{'numerator_formula': 'treatment ~ age', 'denominator_formula': 'treatment ~ age + x1 + x3', 'model_fitter': te_stats_glm_logit (save_path=C:\Users\USER\AppData\Local\Temp\trial_pp\switch_models), 'note': 'Weight models not fitted. Use calculate_weights()'}
trial_pp_censor weights config (censoring):
{'censor_event': 'censored', 'numerator_formula': '1 - censored ~ x2', 'denominator_formula': '1 - censored ~ x2 + x1', 'pool_models': 'none', 'model_fitter': te_stats_glm_logit (save_path=C:\Users\USER\AppData\Local\Temp\trial_pp\censor_models), 'note': 'Weight models not fitted. Use calculate_weights()'}
trial_itt censor_weights_config:
{'censor_event': 'censored', 'numerator_formula': '1 - censored ~ x2', 'denominator_formula': '1 - censored ~ x2 + x1', 'pool_models': 'numerator', 'model_fitter': te_stats_glm_logit (save_path=C:\Users\USER\AppData\Local\Temp\trial_itt\censor_models), 'note': 'Weight models not fitted. Use calculate_w

## Calculate Weights

In this step we fit the models configured in the previous section and combine their predictions into inverse probability of censoring weights (IPCW). Instead of a single `calculate_weights()` function, the notebook uses one helper per estimand: `calculate_itt_weights()` in the first cell and `calculate_pp_informative_weights_updated()` in the second. Each helper stores the fitted censoring models and the weighted data in the trial dictionary, and the weighted data is then written to a CSV file.

In [8]:
import os
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import joblib

def calculate_itt_weights(trial):
    # Work with the full dataset (725 observations)
    data = trial["data"].copy()
    data = data.sort_values(["id", "period"])
    data["prev_treatment"] = data.groupby("id")["treatment"].shift(1).fillna(0)
    data["not_censored"] = 1 - data["censored"]
    
    # Model n: P(censor_event = 0 | X)
    formula_n = "not_censored ~ x2"
    model_n = smf.logit(formula=formula_n, data=data).fit(disp=0)
    trial["fitted_itt_censor_numerator"] = model_n
    
    # Model d0: P(censor_event = 0 | X, previous treatment = 0)
    subset_d0 = data[data["prev_treatment"] == 0]
    formula_d = "not_censored ~ x2 + x1"
    model_d0 = smf.logit(formula=formula_d, data=subset_d0).fit(disp=0)
    trial["fitted_itt_censor_denominator_d0"] = model_d0
    
    # Model d1: P(censor_event = 0 | X, previous treatment = 1)
    subset_d1 = data[data["prev_treatment"] == 1]
    model_d1 = smf.logit(formula=formula_d, data=subset_d1).fit(disp=0)
    trial["fitted_itt_censor_denominator_d1"] = model_d1
    
    # Compute predicted probabilities for all observations
    data["pred_num"] = model_n.predict(data)
    data["pred_den"] = 0.0
    idx0 = data["prev_treatment"] == 0
    idx1 = data["prev_treatment"] == 1
    data.loc[idx0, "pred_den"] = model_d0.predict(data.loc[idx0])
    data.loc[idx1, "pred_den"] = model_d1.predict(data.loc[idx1])
    
    # Calculate weight as the ratio of predicted numerator to denominator
    data["weight"] = data["pred_num"] / data["pred_den"]
    
    # Save the full dataset with calculated weights
    trial["data_with_weights"] = data.copy()
    
    return trial

# --- Compute ITT weights ---
trial_itt = calculate_itt_weights(trial_itt)

# --- Define the path to save the CSV file ---
data_folder = r"C:\Users\USER\Documents\3rd year 2nd sem\Data Analytics\Assignments_Data_Analytics\Assignment_1_Clustering_Data_Analytics\Data"
os.makedirs(data_folder, exist_ok=True)
csv_path = os.path.join(data_folder, "trial_itt_data_with_weights.csv")

# --- Save the data with weights ---
trial_itt["data_with_weights"].to_csv(csv_path, index=False)
print("Stored ITT data with calculated weights to CSV file at:")
print(csv_path)

# --- Print the fitted models' summaries ---
print("\nModel n (Numerator) Summary:")
print(trial_itt["fitted_itt_censor_numerator"].summary2().as_text())

print("\nModel d0 (Denom. for prev_treatment = 0) Summary:")
print(trial_itt["fitted_itt_censor_denominator_d0"].summary2().as_text())

print("\nModel d1 (Denom. for prev_treatment = 1) Summary:")
print(trial_itt["fitted_itt_censor_denominator_d1"].summary2().as_text())


Stored ITT data with calculated weights to CSV file at:
C:\Users\USER\Documents\3rd year 2nd sem\Data Analytics\Assignments_Data_Analytics\Assignment_1_Clustering_Data_Analytics\Data\trial_itt_data_with_weights.csv

Model n (Numerator) Summary:
                         Results: Logit
Model:              Logit            Method:           MLE      
Dependent Variable: not_censored     Pseudo R-squared: 0.027    
Date:               2025-03-09 18:12 AIC:              397.4004 
No. Observations:   725              BIC:              406.5727 
Df Model:           1                Log-Likelihood:   -196.70  
Df Residuals:       723              LL-Null:          -202.11  
Converged:          1.0000           LLR p-value:      0.0010067
No. Iterations:     7.0000           Scale:            1.0000   
-----------------------------------------------------------------
              Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
---------------------------------------------------------------

In [9]:
import os
import pandas as pd
import statsmodels.formula.api as smf
import joblib

def calculate_pp_informative_weights_updated(trial):
    data = trial["data"].copy()
    data = data.sort_values(["id", "period"])
    data["prev_treatment"] = data.groupby("id")["treatment"].shift(1).fillna(0)
    data["not_censored"] = 1 - data["censored"]
    
    # Model n0: P(censor_event = 0 | X, previous treatment = 0) for numerator
    subset0 = data[data["prev_treatment"] == 0]
    model_n0 = smf.logit("not_censored ~ x2", data=subset0).fit(disp=0)
    trial["fitted_pp_censor_numerator_n0"] = model_n0

    # Model n1: P(censor_event = 0 | X, previous treatment = 1) for numerator
    subset1 = data[data["prev_treatment"] == 1]
    model_n1 = smf.logit("not_censored ~ x2", data=subset1).fit(disp=0)
    trial["fitted_pp_censor_numerator_n1"] = model_n1

    # Model d0: P(censor_event = 0 | X, previous treatment = 0) for denominator
    model_d0 = smf.logit("not_censored ~ x2 + x1", data=subset0).fit(disp=0)
    trial["fitted_pp_censor_denominator_d0"] = model_d0

    # Model d1: P(censor_event = 0 | X, previous treatment = 1) for denominator
    model_d1 = smf.logit("not_censored ~ x2 + x1", data=subset1).fit(disp=0)
    trial["fitted_pp_censor_denominator_d1"] = model_d1

    # Compute predicted probabilities
    data["pred_num"] = 0.0
    data["pred_den"] = 0.0
    idx0 = data["prev_treatment"] == 0
    idx1 = data["prev_treatment"] == 1

    data.loc[idx0, "pred_num"] = model_n0.predict(data.loc[idx0])
    data.loc[idx1, "pred_num"] = model_n1.predict(data.loc[idx1])
    data.loc[idx0, "pred_den"] = model_d0.predict(data.loc[idx0])
    data.loc[idx1, "pred_den"] = model_d1.predict(data.loc[idx1])

    # Calculate weight as the ratio of numerator to denominator predictions
    data["weight"] = data["pred_num"] / data["pred_den"]

    # Save the full dataset with calculated weights
    trial["data_with_weights"] = data.copy()

    return trial

# --- Compute PP weights ---
trial_pp_censor["estimand"] = "PP"
trial_pp_censor["save_dir"] = os.path.join(trial_pp_dir, "informative_censor_models")
os.makedirs(trial_pp_censor["save_dir"], exist_ok=True)
trial_pp_censor = calculate_pp_informative_weights_updated(trial_pp_censor)

# --- Define the path to save the CSV file ---
data_folder = r"C:\Users\USER\Documents\3rd year 2nd sem\Data Analytics\Assignments_Data_Analytics\Assignment_1_Clustering_Data_Analytics\Data"
os.makedirs(data_folder, exist_ok=True)
csv_path = os.path.join(data_folder, "trial_pp_data_with_weights.csv")

# --- Save the data with weights ---
trial_pp_censor["data_with_weights"].to_csv(csv_path, index=False)
print("Stored PP data with calculated weights to CSV file at:")
print(csv_path)

# --- Print the fitted models' summaries ---
print("\nPP model n0 (Numerator, prev_treatment = 0) Summary:")
print(trial_pp_censor["fitted_pp_censor_numerator_n0"].summary2().as_text())

print("\nPP model n1 (Numerator, prev_treatment = 1) Summary:")
print(trial_pp_censor["fitted_pp_censor_numerator_n1"].summary2().as_text())

print("\nPP model d0 (Denominator, prev_treatment = 0) Summary:")
print(trial_pp_censor["fitted_pp_censor_denominator_d0"].summary2().as_text())

print("\nPP model d1 (Denominator, prev_treatment = 1) Summary:")
print(trial_pp_censor["fitted_pp_censor_denominator_d1"].summary2().as_text())


Stored PP data with calculated weights to CSV file at:
C:\Users\USER\Documents\3rd year 2nd sem\Data Analytics\Assignments_Data_Analytics\Assignment_1_Clustering_Data_Analytics\Data\trial_pp_data_with_weights.csv

PP model n0 (Numerator, prev_treatment = 0) Summary:
                         Results: Logit
Model:              Logit            Method:           MLE       
Dependent Variable: not_censored     Pseudo R-squared: 0.043     
Date:               2025-03-09 18:12 AIC:              274.8722  
No. Observations:   426              BIC:              282.9811  
Df Model:           1                Log-Likelihood:   -135.44   
Df Residuals:       424              LL-Null:          -141.54   
Converged:          1.0000           LLR p-value:      0.00047787
No. Iterations:     7.0000           Scale:            1.0000    
------------------------------------------------------------------
               Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
-------------------------------

## Treatment Switching Weight Calculation

In this section, we calculate the treatment switching weights for the Per-protocol (PP) analysis.
We train four logistic regression models:
- model n1: P(treatment = 1 | previous treatment = 1) for numerator.
- model d1: P(treatment = 1 | previous treatment = 1) for denominator.
- model n0: P(treatment = 1 | previous treatment = 0) for numerator.
- model d0: P(treatment = 1 | previous treatment = 0) for denominator.

The weights are calculated as follows:
- For observations with previous treatment = 1:
    weight = (predicted probability from model n1) / (predicted probability from model d1)
- For observations with previous treatment = 0:
    weight = (predicted probability from model n0) / (predicted probability from model d0)

Logistic regression is used to estimate these probabilities, and the fitted models are saved to disk with `joblib`. Note that the cell below only fits and saves the four models; it does not add a switching weight column to the data.
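
The ratio above can be sketched for a single observation as follows. One assumption goes beyond the text: for an observation whose current treatment is 0, the sketch uses the complements of the predicted probabilities, since the models predict P(treatment = 1). This is a common convention for treatment weights and is not taken from the notebook. The function name `switch_weight` is hypothetical.

```python
def switch_weight(treatment, p_num, p_den):
    """Per-observation stabilized switching weight.

    p_num and p_den are predicted P(treatment = 1) from the numerator and
    denominator models fitted on the matching previous-treatment stratum.
    """
    if treatment == 1:
        return p_num / p_den
    # Assumption: use the probability of the treatment actually received
    return (1.0 - p_num) / (1.0 - p_den)

# Made-up probabilities, for illustration only
w_treated = switch_weight(1, p_num=0.6, p_den=0.8)
w_untreated = switch_weight(0, p_num=0.6, p_den=0.8)
```

With the fitted models, `p_num` and `p_den` would come from calling `predict` on the stratum-specific numerator and denominator models for each row.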

In [10]:
# Fit the treatment switching weight models using logistic regression
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import joblib
import os

def calculate_pp_switch_weights(trial):
    # Ensure trial is PP and prepare data
    data = trial["data"].copy()
    data = data.sort_values(["id", "period"])
    data["prev_treatment"] = data.groupby("id")["treatment"].shift(1).fillna(0)
    
    # Model n1: P(treatment = 1 | previous treatment = 1)
    subset_n1 = data[data["prev_treatment"] == 1]
    formula_n1 = "treatment ~ age"
    model_n1 = smf.logit(formula=formula_n1, data=subset_n1).fit(disp=0)
    save_path_n1 = os.path.join(trial.get("save_dir", ""), "pp_switch_num_model_n1.pkl")
    joblib.dump(model_n1, save_path_n1)
    trial["fitted_pp_switch_numerator_n1"] = model_n1
    
    # Model d1: P(treatment = 1 | previous treatment = 1)
    formula_d1 = "treatment ~ age + x1 + x3"
    model_d1 = smf.logit(formula=formula_d1, data=subset_n1).fit(disp=0)
    save_path_d1 = os.path.join(trial.get("save_dir", ""), "pp_switch_den_model_d1.pkl")
    joblib.dump(model_d1, save_path_d1)
    trial["fitted_pp_switch_denominator_d1"] = model_d1
    
    # Model n0: P(treatment = 1 | previous treatment = 0)
    subset_n0 = data[data["prev_treatment"] == 0]
    formula_n0 = "treatment ~ age"
    model_n0 = smf.logit(formula=formula_n0, data=subset_n0).fit(disp=0)
    save_path_n0 = os.path.join(trial.get("save_dir", ""), "pp_switch_num_model_n0.pkl")
    joblib.dump(model_n0, save_path_n0)
    trial["fitted_pp_switch_numerator_n0"] = model_n0
    
    # Model d0: P(treatment = 1 | previous treatment = 0)
    formula_d0 = "treatment ~ age + x1 + x3"
    model_d0 = smf.logit(formula=formula_d0, data=subset_n0).fit(disp=0)
    save_path_d0 = os.path.join(trial.get("save_dir", ""), "pp_switch_den_model_d0.pkl")
    joblib.dump(model_d0, save_path_d0)
    trial["fitted_pp_switch_denominator_d0"] = model_d0
    
    return trial

# Usage example for trial PP:
# Ensure trial_pp has an assigned save_dir (e.g., within trial_pp_dir)
trial_pp_switch["estimand"] = "PP"
trial_pp_switch["save_dir"] = os.path.join(trial_pp_dir, "switch_models")
os.makedirs(trial_pp_switch["save_dir"], exist_ok=True)
trial_pp_switch = calculate_pp_switch_weights(trial_pp_switch)

# To verify, you can print summaries:
print("Model n1 (Numerator for prev_treatment = 1) Summary:")
print(trial_pp_switch["fitted_pp_switch_numerator_n1"].summary2().as_text())
print("\nModel d1 (Denom. for prev_treatment = 1) Summary:")
print(trial_pp_switch["fitted_pp_switch_denominator_d1"].summary2().as_text())
print("\nModel n0 (Numerator for prev_treatment = 0) Summary:")
print(trial_pp_switch["fitted_pp_switch_numerator_n0"].summary2().as_text())
print("\nModel d0 (Denom. for prev_treatment = 0) Summary:")
print(trial_pp_switch["fitted_pp_switch_denominator_d0"].summary2().as_text())


Model n1 (Numerator for prev_treatment = 1) Summary:
                         Results: Logit
Model:              Logit            Method:           MLE      
Dependent Variable: treatment        Pseudo R-squared: 0.021    
Date:               2025-03-09 18:12 AIC:              386.9911 
No. Observations:   299              BIC:              394.3920 
Df Model:           1                Log-Likelihood:   -191.50  
Df Residuals:       297              LL-Null:          -195.58  
Converged:          1.0000           LLR p-value:      0.0042698
No. Iterations:     5.0000           Scale:            1.0000   
-----------------------------------------------------------------
              Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
-----------------------------------------------------------------
Intercept     2.0396    0.5421   3.7625  0.0002   0.9771   3.1021
age          -0.0311    0.0110  -2.8112  0.0049  -0.0527  -0.0094


Model d1 (Denom. for prev_treatment = 1) Summary:
     

## Specify Outcome Model
Now we can specify the outcome model. Adjustment terms for any variables in the dataset can be included. In this implementation the numerator terms from the stabilised weight models are not added automatically; they are passed explicitly through `adjustment_terms` (for example `x2` for the ITT analysis).
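
As an illustration of how the outcome formula is put together, the sketch below builds the same kind of formula string from a list of adjustment terms, which avoids having to pass a leading `" + "` by hand. `build_outcome_formula` is a hypothetical helper, not a function from the notebook; the fixed terms mirror the formula used in the cell that follows.

```python
def build_outcome_formula(adjustment_terms=()):
    """Assemble the outcome model formula from optional adjustment terms."""
    terms = [
        "assigned_treatment",
        *adjustment_terms,          # e.g. numerator terms of the weight models
        "followup_time",
        "I(followup_time**2)",
        "trial_period",
        "I(trial_period**2)",
    ]
    return "outcome ~ " + " + ".join(terms)

formula_pp = build_outcome_formula()
formula_itt = build_outcome_formula(["x2"])
```

Passing terms as a list also makes it straightforward to record them separately from the formula string.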

In [71]:
def process_outcome_data(trial, adjustment_terms=""):
    """Process outcome data, store necessary adjustment terms, and define the outcome model formula."""
    
    # Ensure the trial dictionary contains the required data
    if "data_with_weights" in trial:
        data = trial["data_with_weights"].copy()
    else:
        data = trial.get("data", pd.DataFrame()).copy()
    
    # Ensure 'treatment' column exists before setting 'assigned_treatment'
    if "treatment" in data.columns:
        data["assigned_treatment"] = data.get("assigned_treatment", data["treatment"])
    else:
        raise KeyError("Column 'treatment' not found in trial data.")
    
    # Define the outcome model formula and store it
    formula = f"outcome ~ assigned_treatment {adjustment_terms} + followup_time + I(followup_time**2) + trial_period + I(trial_period**2)"
    
    # Store processed data and adjustment terms in the trial dictionary
    trial["processed_data"] = data
    trial["adjustment_terms"] = adjustment_terms
    trial["outcome_model_formula"] = formula  # Store the formula dynamically
    
    return trial

# --- Process outcome data for PP and ITT using the data that already has calculated weights ---
trial_pp = process_outcome_data(trial_pp)
trial_itt = process_outcome_data(trial_itt, adjustment_terms=" + x2")  # ITT includes x2

# --- Print data structure summaries ---
print("PP Processed Data Sample:")
print(trial_pp["processed_data"].head())

print("\nITT Processed Data Sample:")
print(trial_itt["processed_data"].head())



## Expand Trials

We now prepare the dataset that holds the sequence of target trials. In a full expansion, each patient contributes one trial for every period in which they are eligible, and each trial carries its own follow-up time.

The `expand_trials` function below is only a placeholder for that step. It processes the data in chunks of `chunk_size` rows (500 by default), sets `trial_period` to 0, and fills `followup_time` and `weight` with random values, which overwrites any weights calculated earlier.

The same function is applied to both the Per-protocol (PP) and Intention-to-treat (ITT) data.
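
To illustrate what expanding into a sequence of trials means, here is a simplified sketch in plain Python. It assumes one new trial starts at every period in which a patient is eligible, and that each trial follows the patient through all later periods. It deliberately ignores censoring, truncation after the outcome, and weights. The function name `expand_sequence_of_trials` is hypothetical.

```python
def expand_sequence_of_trials(rows):
    """Expand person-period rows into one record per (trial, follow-up time).

    rows: list of dicts with keys id, period, treatment, eligible, outcome.
    """
    by_id = {}
    for row in rows:
        by_id.setdefault(row["id"], []).append(row)

    expanded = []
    for patient_id, records in by_id.items():
        records = sorted(records, key=lambda r: r["period"])
        for start in records:
            if start["eligible"] != 1:
                continue  # a trial can only start in an eligible period
            for row in records:
                if row["period"] < start["period"]:
                    continue
                expanded.append({
                    "id": patient_id,
                    "trial_period": start["period"],
                    "followup_time": row["period"] - start["period"],
                    "assigned_treatment": start["treatment"],  # treatment at baseline
                    "treatment": row["treatment"],
                    "outcome": row["outcome"],
                })
    return expanded

# Made-up rows for one patient, eligible in periods 0 and 1
toy_rows = [
    {"id": 1, "period": 0, "treatment": 1, "eligible": 1, "outcome": 0},
    {"id": 1, "period": 1, "treatment": 0, "eligible": 1, "outcome": 0},
    {"id": 1, "period": 2, "treatment": 0, "eligible": 0, "outcome": 1},
]
toy_expanded = expand_sequence_of_trials(toy_rows)
```

In this toy input the patient contributes two trials: one starting at period 0 with three follow-up records and one starting at period 1 with two.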


In [70]:
import os
import pandas as pd
import numpy as np

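def expand_trials(data, chunk_size=500):
    """Placeholder expansion: process rows in chunks and add the trial columns."""
    expanded_data = []
    for start in range(0, len(data), chunk_size):
        # chunk_size counts rows, not patients
        chunk = data.iloc[start:start+chunk_size].copy()
        chunk["trial_period"] = 0
        # Placeholder values: random follow-up times and weights, not derived from the data.
        # Any existing "weight" column is overwritten here.
        chunk["followup_time"] = np.random.randint(0, 10, size=len(chunk))
        chunk["weight"] = np.random.uniform(0.8, 1.2, size=len(chunk))
        expanded_data.append(chunk)
    return pd.concat(expanded_data, ignore_index=True)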

# Create sequence of trials
data_pp = expand_trials(trial_pp["processed_data"])
data_itt = expand_trials(trial_itt["processed_data"])

# Display sample of expanded trials
print(data_pp.head())


   id  period  treatment  x1        x2  x3        x4  age     age_s  outcome  \
0   1       0          1   1  1.146148   0  0.734203   36  0.083333        0   
1   1       1          1   1  0.002200   0  0.734203   37  0.166667        0   
2   1       2          1   0 -0.481762   0  0.734203   38  0.250000        0   
3   1       3          1   0  0.007872   0  0.734203   39  0.333333        0   
4   1       4          1   1  0.216054   0  0.734203   40  0.416667        0   

   censored  eligible  assigned_treatment  trial_period  followup_time  \
0         0         1                   1             0              8   
1         0         0                   1             0              4   
2         0         0                   1             0              7   
3         0         0                   1             0              1   
4         0         0                   1             0              3   

     weight  
0  1.035798  
1  0.821940  
2  1.102424  
3  1.173847  
4  1

## Load or Sample from Expanded Data

In [63]:
import pandas as pd
import numpy as np
import os

def load_expanded_data(trial, seed=1234, p_control=0.5, period_range=None):

    np.random.seed(seed)  # Set seed for reproducibility

    # Load the expanded data
    data = trial["expanded_data"].copy()

    # Apply period filtering if specified
    if period_range:
        min_period, max_period = period_range
        data = data[(data["trial_period"] >= min_period) & (data["trial_period"] <= max_period)]

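    # Apply p_control sampling: keep all outcome == 1 rows,
    # keep each outcome == 0 row with probability p_control
    outcome_0_mask = (data["outcome"] == 0)
    keep = ~outcome_0_mask | (np.random.rand(len(data)) < p_control)
    sampled_data = data.loc[keep].copy()

    # Each retained control stands in for 1 / p_control rows; cases keep weight 1.
    # fit_msm multiplies this column into the final weight.
    sampled_data["sample_weight"] = np.where(sampled_data["outcome"] == 0, 1.0 / p_control, 1.0)

    return sampled_data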

# --- Load and sample from expanded ITT data ---
# trial_itt_expanded must hold the expanded ITT data; it is not defined in the cells shown here
trial_itt["expanded_data"] = trial_itt_expanded
sampled_itt_data = load_expanded_data(trial_itt, seed=1234, p_control=0.5)

# --- Define the path to save the sampled data ---
data_folder = r"C:\Users\USER\Documents\3rd year 2nd sem\Data Analytics\Assignments_Data_Analytics\Assignment_1_Clustering_Data_Analytics\Data"
os.makedirs(data_folder, exist_ok=True)  # Ensure folder exists

csv_path_sampled = os.path.join(data_folder, "trial_itt_load.csv")

# --- Save the sampled dataset ---
sampled_itt_data.to_csv(csv_path_sampled, index=False)

print("Stored sampled ITT data at:", csv_path_sampled)

# --- Check sample output ---
print("\nSample from ITT Sampled Data:")
print(sampled_itt_data.head())


Stored sampled ITT data at: C:\Users\USER\Documents\3rd year 2nd sem\Data Analytics\Assignments_Data_Analytics\Assignment_1_Clustering_Data_Analytics\Data\trial_itt_load.csv

Sample from ITT Sampled Data:
    id  period  treatment  x1        x2  x3        x4  age     age_s  outcome  \
0    1       0          1   1  1.146148   0  0.734203   36  0.083333        0   
2    1       0          1   1  1.146148   0  0.734203   36  0.083333        0   
5    1       0          1   1  1.146148   0  0.734203   36  0.083333        0   
6    1       0          1   1  1.146148   0  0.734203   36  0.083333        0   
10   1       0          1   1  1.146148   0  0.734203   36  0.083333        0   

    censored  eligible  prev_treatment  not_censored  pred_num  pred_den  \
0          0         1             0.0             1  0.873678  0.888293   
2          0         1             0.0             1  0.873678  0.888293   
5          0         1             0.0             1  0.873678  0.888293   
6   

## Fit Marginal Structural Model

In [67]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

def fit_msm(trial, weight_cols=None, modify_weights=True):
    """Fit the marginal structural model using the specified outcome model formula."""
    if weight_cols is None:
        # Avoid a mutable default argument
        weight_cols = ["weight", "sample_weight"]

    data = trial["processed_data"].copy()  # Copy so the caller's DataFrame is not modified

    # Ensure weights exist
    for col in weight_cols:
        if col not in data.columns:
            data[col] = 1.0  # Default to 1 if missing

    # Combine weights multiplicatively
    data["final_weight"] = data[weight_cols].prod(axis=1)

    # Winsorization: Cap extreme weights at the 99th percentile
    if modify_weights:
        q99 = data["final_weight"].quantile(0.99)
        data["final_weight"] = np.minimum(data["final_weight"], q99)

    # Retrieve outcome model formula from the trial dictionary
    formula = trial.get("outcome_model_formula")
    if not formula:
        raise ValueError("Outcome model formula is missing in trial dictionary.")

    # Fit logistic regression model for the outcome
    model = sm.GLM.from_formula(formula, data, 
                                family=sm.families.Binomial(), 
                                freq_weights=data["final_weight"]).fit()

    return model

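# --- Fit MSM on ITT dataset ---
# fit_msm expects the trial dictionary, not a DataFrame; passing sampled_itt_data
# directly raises KeyError: 'processed_data'. Attach the sampled data to the trial first.
# The sampled data must contain every column named in the outcome formula
# (assigned_treatment, followup_time, trial_period and any adjustment terms).
trial_itt["processed_data"] = sampled_itt_data
trial_itt_msm = fit_msm(trial_itt)

# --- Print summary of fitted MSM model ---
print("\nMarginal Structural Model (MSM) Summary:")
print(trial_itt_msm.summary())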

## Inference