# Assignment 1 for Clustering: Target Trial Emulation
- New methods in Machine Learning typically arise either by borrowing formulas and concepts from other scientific fields and redefining them under new sets of assumptions, or by adding an extra step to an existing methodological framework.

- In this exercise (Assignment 1 of the Clustering topic), we will try to develop a novel method of Target Trial Emulation by integrating clustering concepts into the existing framework. Target Trial Emulation is a recent methodological framework in epidemiology that tries to account for the biases in traditional designs.

These are the instructions:
1. Look at this website: https://rpubs.com/alanyang0924/TTE
2. Extract the dummy data from the package and save it as "data_censored.csv".
3. Convert the R code into Python code (use Jupyter Notebook) and replicate the results using your Python code.
4. Create another copy of your Python code and name it TTE-v2 (use Jupyter Notebook).
5. Using TTE-v2, think of a creative way to integrate a clustering mechanism: understand each step carefully and decide at which step a clustering method can be implemented. Generate insights from your results.
6. Work in pairs, preferably with your thesis partner.
7. Push to your GitHub repository.
8. Deadline: February 28, 2025 at 11:59 pm.

## I. Necessary Imports

In [85]:
import pandas as pd
import numpy as np
import os
import patsy
import joblib
import json
from sklearn.linear_model import LogisticRegression
from IPython.display import display
import statsmodels.api as sm
import statsmodels.formula.api as smf
from dataclasses import dataclass
from typing import List, Optional, Any




## II. Class Definition and Required Functions

In [86]:
def stats_glm_logit(save_path=None):
    # Return a fitting function; fitted models are saved under save_path when one is given
    if save_path is not None:
        os.makedirs(save_path, exist_ok=True)

    def fit_model(numerator, denominator, data):
        # Fit model using statsmodels (instead of sklearn)
        # NOTE: the response is hard-coded to `treatment` and only the denominator terms are used
        formula = f"treatment ~ {denominator}"
        model = smf.logit(formula, data).fit()

        # Save the model and its details only when a save path was provided
        if save_path is not None:
            model_path = os.path.join(save_path, "logit_model.pkl")
            joblib.dump(model, model_path)

            model_details = {
                "numerator": numerator,
                "denominator": denominator,
                "model_type": "te_stats_glm_logit",
                "file_path": model_path
            }
            with open(os.path.join(save_path, "model_details.json"), "w") as f:
                json.dump(model_details, f)

        return model

    return fit_model

@dataclass
class TEDatastore:
    data: pd.DataFrame = None

    def save_expanded_data(self, switch_data: pd.DataFrame):
        if self.data is None:
            self.data = switch_data
        else:
            self.data = pd.concat([self.data, switch_data], ignore_index=True)
        return self

@dataclass
class TEExpansion:
    chunk_size: int = 0
    datastore: TEDatastore = None
    first_period: int = 0
    last_period: float = float('inf')
    censor_at_switch: bool = False


class TrialSequence:
    def __init__(self, estimand, **kwargs):
        self.estimand = estimand
        self.data = None
        self.censor_weights = None
        self.switch_weights = None
        self.outcome_model = None
        self.expansion = None
        self.outcome_data = None

    def set_data(self, data):
        self.data = data


        # followup_time is the period of a patient's first event (censored == 1 or outcome == 1),
        # or the patient's last recorded period if no event occurs. In the given dataset every patient
        # is eligible at period 0, and a patient's last entry is either an event or removal from the study.
        self.data["followup_time"] = self.data.groupby("id")["period"].transform(
            lambda x: x[(self.data.loc[x.index, "censored"] == 1) | (self.data.loc[x.index, "outcome"] == 1)].min()
            if ((self.data.loc[x.index, "censored"] == 1) | (self.data.loc[x.index, "outcome"] == 1)).any()
            else x.max()
        )

        
    def show(self):
        print(f"Trial Sequence Object\nEstimand: {self.estimand}\n")
        
        if self.data is not None:
            display(self.data)
        else:
            print("No data set")

        print("\nIPW for informative censoring:")
        print(self.censor_weights if self.censor_weights is not None else "Not calculated.")
        if self.switch_weights is not None:
            print("\nIPW for treatment switch censoring:")
            print(self.switch_weights)
            
        print("\nOutcome model:")
        print(self.outcome_model if self.outcome_model is not None else "Not specified.")
        if self.outcome_data is not None:
            print("\nOutcome data:")
            print(self.outcome_data)

#STEP 3
    
    def set_switch_weight_model(self, numerator=None, denominator=None, model_fitter=None, eligible_wts_0=None, eligible_wts_1=None):
        if self.data is None:
            raise ValueError("set_data() before setting switch weight models")
        
        if self.estimand == "ITT":
            raise ValueError("Switching weights are not supported for intention-to-treat analyses")

        if eligible_wts_0 and eligible_wts_0 in self.data.columns:
            self.data = self.data.rename(columns={eligible_wts_0: "eligible_wts_0"})
        if eligible_wts_1 and eligible_wts_1 in self.data.columns:
            self.data = self.data.rename(columns={eligible_wts_1: "eligible_wts_1"})

        if numerator is None:
            numerator = "1"
        if denominator is None:
            denominator = "1"
        
        if "time_on_regime" in denominator:
            raise ValueError("time_on_regime should not be used in denominator.")

        formula_numerator = f"treatment ~ {numerator}"
        formula_denominator = f"treatment ~ {denominator}"

        self.switch_weights = {
            "numerator": formula_numerator,
            "denominator": formula_denominator,
            "model_fitter": "te_stats_glm_logit",
        }

        if model_fitter is not None:
            fitted_model = model_fitter(numerator, denominator, self.data)  
            self.switch_weights["fitted_model"] = fitted_model 

            # Compute switch weights immediately
            self.data["switch_prob"] = fitted_model.predict(self.data)
            self.data["switch_weight"] = 1 / self.data["switch_prob"]
            print("✅ Switch weights computed and stored in self.data")


    def show_switch_weights(self):
        return self.switch_weights if self.switch_weights else "Not calculated"
    
    def show_censor_weights(self):
        return self.censor_weights if self.censor_weights else "Not calculated"
    

    def set_censor_weight_model(self, censor_event, numerator="1", denominator="1", pool_models="none", model_fitter=None):
        if model_fitter is None: 
            model_fitter = stats_glm_logit()

        if censor_event not in self.data.columns:
            raise ValueError(f"'{censor_event}' must be a column in the dataset.")

        formula_numerator = f"1 - {censor_event} ~ {numerator}"
        formula_denominator = f"1 - {censor_event} ~ {denominator}"

        self.censor_weights = {
            "numerator": formula_numerator,
            "denominator": formula_denominator,
            "pool_numerator": pool_models in ["numerator", "both"],
            "pool_denominator": pool_models == "both",
            "model_fitter": "te_stats_glm_logit"
        }

        # Fit the model on all rows (note: the fitter uses only the denominator terms)
        self.censor_weights["fitted_model"] = model_fitter(numerator, denominator, self.data)

        # Fit separate denominator models if not pooling
        if not self.censor_weights["pool_denominator"]:
            self.censor_weights["fitted_model_0"] = model_fitter(numerator, denominator, self.data[self.data["previous_treatment"] == 0])
            self.censor_weights["fitted_model_1"] = model_fitter(numerator, denominator, self.data[self.data["previous_treatment"] == 1])


    
#STEP 4
    def calculate_weights(self, quiet=False):
        use_censor_weights = isinstance(self.censor_weights, dict) and self.censor_weights.get("fitted_model") is not None

        if self.estimand == "PP":
            if not (isinstance(self.switch_weights, dict) and self.switch_weights.get("fitted_model")):
                raise ValueError("Switch weight models are not specified. Use set_switch_weight_model()")
            self._calculate_weights_trial_seq(quiet, switch_weights=True, censor_weights=use_censor_weights)
        elif self.estimand == "ITT":
            self._calculate_weights_trial_seq(quiet, switch_weights=False, censor_weights=use_censor_weights)
        else:
            raise ValueError(f"Unknown estimand: {self.estimand}")


    def _calculate_weights_trial_seq(self, quiet, switch_weights, censor_weights):
        if switch_weights:
            if not quiet:
                print("Calculating switch weights...")
            
            switch_model = self.switch_weights["fitted_model"]
            self.data["switch_prob"] = switch_model.predict(self.data)
            self.data["switch_weight"] = 1 / self.data["switch_prob"]

        if censor_weights:
            if not quiet:
                print("Calculating censor weights...")

            censor_model = self.censor_weights["fitted_model"]
            self.data["censor_prob"] = censor_model.predict(self.data)
            self.data["censor_weight"] = 1 / self.data["censor_prob"]

        # Compute final weight
        if switch_weights and censor_weights:
            self.data["final_weight"] = self.data["switch_weight"] * self.data["censor_weight"]
        elif switch_weights:
            self.data["final_weight"] = self.data["switch_weight"]
        elif censor_weights:
            self.data["final_weight"] = self.data["censor_weight"]


        # Debug output: summary statistics of the computed weights
        if "switch_weight" in self.data.columns:
            print("\nWeight Summary for PP:")
            print(self.data[["switch_weight", "censor_weight", "final_weight"]].describe())
        else:
            print("\nWeight Summary for ITT:")
            print(self.data[["censor_weight", "final_weight"]].describe())  # No switch_weight here


    def show_weight_models(self):
        if "censored" not in self.data.columns:
            raise ValueError("Column 'censored' not found in dataset.")

        # Convert boolean censored to integer if necessary
        self.data["censored"] = self.data["censored"].astype(int)
        self.data["censored_inv"] = 1 - self.data["censored"]

        if self.censor_weights is not None:
            print("## Weight Models for Informative Censoring")
            print("#\n#")

            # Numerator model
            print("## Model: P(censored = 0 | X) for numerator")
            print("#")
            numerator_model = smf.logit("censored_inv ~ x1 + x2", data=self.data).fit(method="newton")
            print(numerator_model.summary2().tables[1].round(6))
            print("#")

            # Denominator models
            if self.censor_weights["pool_denominator"]:
                print("## Model: P(censored = 0 | X) for denominator")
                print("#")
                denominator_model = smf.logit("censored_inv ~ x1 + x2", data=self.data).fit(method="newton")
                print(denominator_model.summary2().tables[1].round(6))
            else:
                print("## Model: P(censored = 0 | X, previous treatment = 0) for denominator")
                print("#")
                denominator_model_0 = smf.logit("censored_inv ~ x1 + x2", 
                                                data=self.data[self.data["previous_treatment"] == 0]).fit(method="newton")
                print(denominator_model_0.summary2().tables[1].round(6))
                print("#")

                print("## Model: P(censored = 0 | X, previous treatment = 1) for denominator")
                print("#")
                denominator_model_1 = smf.logit("censored_inv ~ x1 + x2", 
                                                data=self.data[self.data["previous_treatment"] == 1]).fit(method="newton")
                print(denominator_model_1.summary2().tables[1].round(6))

        # Show switch weight models (PP only)
        if self.estimand == "PP" and self.switch_weights is not None:
            print("## Weight Models for Treatment Switch")
            print("#\n#")

            # Numerator model
            print("## Model: P(switch = 1 | X) for numerator")
            print("#")
            numerator_model = smf.logit("treatment ~ age + x1 + x3", data=self.data).fit(method="newton")
            print(numerator_model.summary2().tables[1].round(6))
            print("#")

            # Denominator model
            print("## Model: P(switch = 1 | X) for denominator")
            print("#")
            denominator_model = smf.logit("treatment ~ age + x1 + x3", data=self.data).fit(method="newton")
            print(denominator_model.summary2().tables[1].round(6))


#STEP 5
    def set_outcome_model(self, adjustment_terms="1"):
        if self.data is None:
            raise ValueError("set_data() before defining the outcome model.")

        # Determine treatment variable (PP vs ITT)
        treatment_var = "treatment" if self.estimand in ["ITT", "PP"] else "dose"

        # Retrieve Stabilized Weight Terms
        stabilised_weight_terms = "1"
        if self.switch_weights:
            stabilised_weight_terms += " + " + " + ".join(self.switch_weights['numerator'].split("~")[1].strip().split(" + "))
        if self.censor_weights:
            stabilised_weight_terms += " + " + " + ".join(self.censor_weights['numerator'].split("~")[1].strip().split(" + "))

        # Check if followup_time and trial_period exist
        additional_terms = []
        if "followup_time" in self.data.columns:
            additional_terms.append("followup_time + I(followup_time**2)")
        if "trial_period" in self.data.columns:
            additional_terms.append("trial_period + I(trial_period**2)")

        additional_formula = " + ".join(additional_terms) if additional_terms else ""

        # Create regression formula dynamically
        formula = f"outcome ~ {treatment_var} + {adjustment_terms} + {stabilised_weight_terms}"
        if additional_formula:
            formula += " + " + additional_formula

        # Ensure weights exist before fitting the model
        if "final_weight" not in self.data.columns:
            raise ValueError("Weights have not been calculated. Run calculate_weights() first.")

        # Extract predictor terms from the formula, dropping the intercept term "1"
        predictor_vars = formula.split("~")[1].strip().split(" + ")
        predictor_vars = [var.strip() for var in predictor_vars if var.strip() != "1"]

        # Weighted logistic regression via GLM.
        # NOTE: predictor_vars holds formula terms such as "I(followup_time**2)", which are not
        # columns of self.data, so the column lookup below raises a KeyError when such terms are present.
        model = sm.GLM(
            self.data["outcome"],  # Dependent variable
            sm.add_constant(self.data[predictor_vars]),  # Independent variables
            family=sm.families.Binomial(),  # Logistic regression
            weights=self.data["final_weight"]  # Intended observation weights
        ).fit()

        # Predict with the same features used for fitting
        self.data["predicted_outcome"] = model.predict(sm.add_constant(self.data[predictor_vars]))
        self.data["residuals"] = self.data["outcome"] - self.data["predicted_outcome"]  # Residuals
        self.data["residuals"] = self.data["outcome"] - self.data["predicted_outcome"]  # Residuals

        # Store the outcome model
        self.outcome_model = model
        

        return model
        
    def show_outcome_model(self):
        if self.outcome_model is None:
            return "Outcome model not specified."
        return self.outcome_model.summary()
    
    #step 6

    def set_expansion_options(self, output: TEDatastore, chunk_size: int = 0, first_period: int = 0, last_period: float = float('inf'), censor_at_switch: bool = False):
        
        self.expansion = TEExpansion(chunk_size = chunk_size, datastore = output, first_period = first_period, last_period = last_period, censor_at_switch = censor_at_switch)

        return self
    
    def expand_trials(self):
        data = self.data.copy()
        outcome_adj_vars = self.get_outcome_adjustment_vars()
        keeplist = ['id', 'trial_period', 'followup_time', 'outcome', 'weight', 'treatment', 'x2', 'age'] + outcome_adj_vars

        if 'wt' not in data.columns:
            data['wt'] =  1

        all_ids = data['id'].unique()
        if self.expansion.chunk_size == 0:
            ids_split = [all_ids]
        else: 
            ids_split = np.array_split(all_ids, np.ceil(len(all_ids) / self.expansion.chunk_size))

        for ids in ids_split:
            switch_data = self._expand_chunk(data, ids, outcome_adj_vars, keeplist)
            self.expansion.datastore = self.expansion.datastore.save_expanded_data(switch_data)
        
        return self
    
    def _expand_chunk(self, data: pd.DataFrame, ids: np.ndarray, outcome_adj_vars: List[str], keeplist: List[str]):
        chunk_data = data[data['id'].isin(ids)].copy()

        first_period = max([self.expansion.first_period, chunk_data[chunk_data['eligible'] == 1]['period'].min() or self.expansion.first_period])
        last_period = min([self.expansion.last_period, chunk_data[chunk_data['eligible'] == 1]['period'].max() or self.expansion.last_period])
        
        expanded_data = []
        for _, row in chunk_data.iterrows():
            if row['eligible'] == 1 and first_period <= row['period'] <= last_period:
                trial_start = row['period']
                trial_data = self._generate_trial_instance(row, chunk_data, trial_start, last_period, outcome_adj_vars, keeplist)
                expanded_data.append(trial_data)

        result = pd.concat(expanded_data, ignore_index=True) if expanded_data else pd.DataFrame()

        return result[keeplist]
    

    def _generate_trial_instance(self, baseline_row: pd.Series, data: pd.DataFrame, trial_start: int, last_period: float, outcome_adj_vars: List[str], keeplist: List[str]):

        id_val = baseline_row['id']
        patient_data = data[data['id'] == id_val].sort_values('period')
        rows = []

        if pd.isna(last_period) or last_period == float('inf'):
            last_period_value = patient_data['period'].max()
        else:
            last_period_value = last_period

        # Convert float to integer to handle errors
        if pd.notna(last_period_value):
            last_period_int = int(np.floor(float(last_period_value)))
        else:
            last_period_int = int(trial_start)

        max_period_value = patient_data['period'].max()
        if pd.notna(max_period_value):
            max_period = int(np.floor(float(max_period_value)))
        else:
            max_period = last_period_int 

        last_period_int = int(last_period_int)
        max_period = int(max_period)

        for period in range(int(trial_start), int(min(last_period_int + 1, max_period + 1))):
            period_row = patient_data[patient_data['period'] == period].iloc[0] if not patient_data[patient_data['period'] == period].empty else None
            
            if period_row is None:
                continue

            if self.expansion.censor_at_switch and period > trial_start:
                prev_row = patient_data[patient_data['period'] == (period - 1)].iloc[0]
                if prev_row['treatment'] != period_row['treatment']:
                    break  # Censor at switch

            trial_period = period - trial_start
            followup_time = period - trial_start
            row_dict = {
                'id': id_val,
                'trial_period': trial_period,
                'followup_time': followup_time,
                'outcome': period_row['outcome'],
                'weight': period_row.get('wt', 1.0),  
                'treatment': period_row['treatment'],
            }
            
            for var in outcome_adj_vars + ['age', 'x2']:
                if var in patient_data.columns:
                    row_dict[var] = period_row.get(var, np.nan)
                else:
                    row_dict[var] = np.nan 

            rows.append(pd.Series(row_dict))

        df = pd.DataFrame(rows)
        int_columns = ['id', 'trial_period', 'followup_time', 'outcome', 'treatment', 'age']
        df[int_columns] = df[int_columns].astype(int)

        return df
    
    def get_outcome_adjustment_vars(self):
        return getattr(self.outcome_model, 'adjustment_vars', [])
    

    # step 7
    def load_expanded_data(self, p_control: Optional[float] = None, period: Optional[List[int]] = None, subset_condition: Optional[str] = None, seed: Optional[int] = None):
        
        if p_control is None:
            data_table = self.expansion.datastore.data.copy()
            data_table['sample_weight'] = 1
        else:
            np.random.seed(seed) if seed is not None else np.random.seed()
            data_table = self.expansion.datastore.data.copy()

            mask_outcome_1 = data_table['outcome'] == 1
            mask_outcome_0 = data_table['outcome'] == 0
            sampled_0 = data_table[mask_outcome_0].sample(frac=p_control, replace=False)
            data_table = pd.concat([data_table[mask_outcome_1], sampled_0])

            data_table.loc[mask_outcome_0, 'sample_weight'] = 1 / p_control if p_control > 0 else 1
            data_table.loc[mask_outcome_1, 'sample_weight'] = 1

        if period is not None:
            data_table = data_table[data_table['trial_period'].isin(period) | data_table['followup_time'].isin(period)]
        
        if subset_condition is not None:
            data_table = data_table.query(subset_condition)
        
        data_table = data_table.sort_values(['id', 'trial_period', 'followup_time'])
        data_table = data_table.reset_index(drop=True)
        
        self.outcome_data = data_table
        
        return self


# Subclass of TrialSequence, handles the per-protocol (PP) estimand
class TrialSequencePP(TrialSequence):
    def __init__(self, **kwargs):
        super().__init__("PP", **kwargs)
 
#Subclass of Trial Sequence, handles the ITT estimand
class TrialSequenceITT(TrialSequence):
    def __init__(self, **kwargs):
        super().__init__("ITT", **kwargs)

#trial_sequence function equivalent used in the article
def trial_sequence(estimand, **kwargs):
    estimand_classes = {
        "PP": TrialSequencePP,
        "ITT": TrialSequenceITT
    }

    if estimand not in estimand_classes:
        raise ValueError(f"{estimand} is not a valid estimand, choose either PP or ITT")
    
    return estimand_classes[estimand](**kwargs)

## III. Process

### 1. Setup
A sequence of target trials analysis starts by specifying which estimand will be used:

In [87]:
trial_pp = trial_sequence("PP")
trial_itt = trial_sequence("ITT")

### 2. Data Preparation
Next the user must specify the observational input data that will be used for the target trial emulation. Here we need to specify which columns contain which values and how they should be used.

In [88]:
data_censored = pd.read_csv("data_censored.csv")
print("Extracted Dummy Data")
display(data_censored)
data_censored["previous_treatment"] = data_censored["treatment"].shift(1).fillna(0)
#Setting the dataset to the data field
trial_pp.set_data(data_censored.copy())  # Create a separate copy
trial_itt.set_data(data_censored.copy())  


#Displaying the info stored in each class
trial_pp.show()

trial_itt.show()

Extracted Dummy Data


Unnamed: 0,id,period,treatment,x1,x2,x3,x4,age,age_s,outcome,censored,eligible
0,1,0,1,1,1.146148,0,0.734203,36,0.083333,0,0,1
1,1,1,1,1,0.002200,0,0.734203,37,0.166667,0,0,0
2,1,2,1,0,-0.481762,0,0.734203,38,0.250000,0,0,0
3,1,3,1,0,0.007872,0,0.734203,39,0.333333,0,0,0
4,1,4,1,1,0.216054,0,0.734203,40,0.416667,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
720,99,3,0,0,-0.747906,1,0.575268,68,2.750000,0,0,0
721,99,4,0,0,-0.790056,1,0.575268,69,2.833333,0,0,0
722,99,5,1,1,0.387429,1,0.575268,70,2.916667,0,0,0
723,99,6,1,1,-0.033762,1,0.575268,71,3.000000,0,0,0


Trial Sequence Object
Estimand: PP



Unnamed: 0,id,period,treatment,x1,x2,x3,x4,age,age_s,outcome,censored,eligible,previous_treatment,followup_time
0,1,0,1,1,1.146148,0,0.734203,36,0.083333,0,0,1,0.0,5
1,1,1,1,1,0.002200,0,0.734203,37,0.166667,0,0,0,1.0,5
2,1,2,1,0,-0.481762,0,0.734203,38,0.250000,0,0,0,1.0,5
3,1,3,1,0,0.007872,0,0.734203,39,0.333333,0,0,0,1.0,5
4,1,4,1,1,0.216054,0,0.734203,40,0.416667,0,0,0,1.0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
720,99,3,0,0,-0.747906,1,0.575268,68,2.750000,0,0,0,0.0,7
721,99,4,0,0,-0.790056,1,0.575268,69,2.833333,0,0,0,0.0,7
722,99,5,1,1,0.387429,1,0.575268,70,2.916667,0,0,0,0.0,7
723,99,6,1,1,-0.033762,1,0.575268,71,3.000000,0,0,0,1.0,7



IPW for informative censoring:
Not calculated.

Outcome model:
Not specified.
Trial Sequence Object
Estimand: ITT



Unnamed: 0,id,period,treatment,x1,x2,x3,x4,age,age_s,outcome,censored,eligible,previous_treatment,followup_time
0,1,0,1,1,1.146148,0,0.734203,36,0.083333,0,0,1,0.0,5
1,1,1,1,1,0.002200,0,0.734203,37,0.166667,0,0,0,1.0,5
2,1,2,1,0,-0.481762,0,0.734203,38,0.250000,0,0,0,1.0,5
3,1,3,1,0,0.007872,0,0.734203,39,0.333333,0,0,0,1.0,5
4,1,4,1,1,0.216054,0,0.734203,40,0.416667,0,0,0,1.0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
720,99,3,0,0,-0.747906,1,0.575268,68,2.750000,0,0,0,0.0,7
721,99,4,0,0,-0.790056,1,0.575268,69,2.833333,0,0,0,0.0,7
722,99,5,1,1,0.387429,1,0.575268,70,2.916667,0,0,0,0.0,7
723,99,6,1,1,-0.033762,1,0.575268,71,3.000000,0,0,0,1.0,7



IPW for informative censoring:
Not calculated.

Outcome model:
Not specified.
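
A note on the `previous_treatment` column built in the cell above: it uses an ungrouped `shift(1)`, so the first row of each patient inherits the last treatment value of the patient listed before it. The sketch below contrasts that with a per-patient lag. It assumes rows are sorted by `id` and `period` and that a patient's first period should default to 0; the toy frame and the column names `prev_ungrouped` and `prev_by_id` are made up for illustration. Applying the grouped version to the notebook would likely change the `previous_treatment` counts reported later.

```python
import pandas as pd

# Toy long-format data: two patients, sorted by id and period (illustrative values only).
toy = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2],
    "period":    [0, 1, 2, 0, 1],
    "treatment": [1, 1, 1, 0, 1],
})

# Ungrouped lag: patient 2's first row picks up patient 1's last treatment.
toy["prev_ungrouped"] = toy["treatment"].shift(1).fillna(0).astype(int)

# Per-patient lag: each patient's first row falls back to 0.
toy["prev_by_id"] = toy.groupby("id")["treatment"].shift(1).fillna(0).astype(int)

print(toy)
```

The two columns differ only on the first row of patient 2, which is exactly the row where information leaks across patients.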


### 3. Weight Models
To adjust for the effects of informative censoring, inverse probability of censoring weights (IPCW) can be applied. To estimate these weights, we construct time-to-(censoring) event models. Two sets of models are fit for the two censoring mechanisms which may apply: censoring due to deviation from assigned treatment and other informative censoring.
#### 3.1 Censoring due to treatment switching
We specify model formulas to be used for calculating the probability of receiving treatment in the current period. Separate models are fitted for patients who had treatment = 1 and those who had treatment = 0 in the previous period. Stabilized weights are used by fitting numerator and denominator models.

There are optional arguments to specify columns which can include/exclude observations from the treatment models. These are used in case it is not possible for a patient to deviate from a certain treatment assignment in that period.

In [89]:
path = "Models"
trial_pp.set_switch_weight_model(numerator="age", denominator="age + x1 + x3", model_fitter=stats_glm_logit(save_path=os.path.join(path, "switch_models")))
trial_pp.show_switch_weights()



Optimization terminated successfully.
         Current function value: 0.660234
         Iterations 5
✅ Switch weights computed and stored in self.data


{'numerator': 'treatment ~ age',
 'denominator': 'treatment ~ age + x1 + x3',
 'model_fitter': 'te_stats_glm_logit',
 'fitted_model': <statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x14d9bcda6f0>}
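
The cell above stores `1 / switch_prob` as the weight, which is the unstabilized form. As a rough illustration of what stabilization with numerator and denominator models means, the sketch below takes already-predicted probabilities of receiving treatment and forms the ratio of the probabilities of the treatment actually received. The function name `stabilized_weights` and the arrays `p_num` and `p_den` are invented for this example; they are not outputs of the fitted models in this notebook.

```python
import numpy as np

def stabilized_weights(treatment, p_num, p_den):
    """Ratio of numerator to denominator probabilities of the observed treatment."""
    treatment = np.asarray(treatment)
    p_num = np.asarray(p_num, dtype=float)
    p_den = np.asarray(p_den, dtype=float)
    # Probability of the treatment that was actually received under each model.
    obs_num = np.where(treatment == 1, p_num, 1.0 - p_num)
    obs_den = np.where(treatment == 1, p_den, 1.0 - p_den)
    return obs_num / obs_den

# Made-up predicted probabilities of treatment = 1 from a numerator and a denominator model.
treatment = [1, 0, 1]
p_num = [0.5, 0.5, 0.4]
p_den = [0.8, 0.25, 0.4]
weights = stabilized_weights(treatment, p_num, p_den)
print(weights)
```

When the two models agree (third row), the stabilized weight is 1, which is why stabilized weights tend to be less variable than `1 / p`.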

#### 3.2 Other informative censoring
In case there is other informative censoring occurring in the data, we can create similar models to estimate the IPCW. These can be used with all types of estimand. We need to specify censor_event, which is the column containing the censoring indicator.

In [90]:
print(trial_pp.data.columns)  # Make sure 'previous_treatment' is in the dataset
print(trial_pp.data["previous_treatment"].value_counts())
trial_pp.set_censor_weight_model(censor_event="censored", numerator="x2", denominator="x2 + x1", pool_models="none", model_fitter=stats_glm_logit(save_path=os.path.join(path, "switch_models")))
trial_pp.show_censor_weights()

Index(['id', 'period', 'treatment', 'x1', 'x2', 'x3', 'x4', 'age', 'age_s',
       'outcome', 'censored', 'eligible', 'previous_treatment',
       'followup_time', 'switch_prob', 'switch_weight'],
      dtype='object')
previous_treatment
0.0    386
1.0    339
Name: count, dtype: int64
Optimization terminated successfully.
         Current function value: 0.682694
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.623353
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.634002
         Iterations 5


{'numerator': '1 - censored ~ x2',
 'denominator': '1 - censored ~ x2 + x1',
 'pool_numerator': False,
 'pool_denominator': False,
 'model_fitter': 'te_stats_glm_logit',
 'fitted_model': <statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x14d9be584d0>,
 'fitted_model_0': <statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x14d9bd19790>,
 'fitted_model_1': <statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x14d9be64ef0>}

In [91]:
trial_itt.set_censor_weight_model(censor_event="censored", numerator="x2", denominator="x2 + x1", pool_models="numerator", model_fitter=stats_glm_logit(save_path=os.path.join(path, "switch_models")))
trial_itt.show_censor_weights()

Optimization terminated successfully.
         Current function value: 0.682694
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.623353
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.634002
         Iterations 5


{'numerator': '1 - censored ~ x2',
 'denominator': '1 - censored ~ x2 + x1',
 'pool_numerator': True,
 'pool_denominator': False,
 'model_fitter': 'te_stats_glm_logit',
 'fitted_model': <statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x14d9be66e10>,
 'fitted_model_0': <statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x14d9bd3cb30>,
 'fitted_model_1': <statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x14d9bdffef0>}

### 4. Calculate Weights
Next we need to fit the individual models and combine them into weights. This is done with calculate_weights().

In [92]:

trial_pp.calculate_weights()
trial_itt.calculate_weights()
print(trial_pp.data.columns)


Calculating switch weights...
Calculating censor weights...

Weight Summary for PP:
       switch_weight  censor_weight  final_weight
count     725.000000     725.000000    725.000000
mean        2.308916       2.180657      5.054075
std         0.694835       0.312298      1.765067
min         1.309778       1.441370      2.280781
25%         1.804966       1.961719      3.790160
50%         2.171654       2.146813      4.663147
75%         2.626700       2.358125      5.909398
max         5.556136       3.596574     13.443658
Calculating censor weights...

Weight Summary for ITT:
       censor_weight  final_weight
count     725.000000    725.000000
mean        2.180657      2.180657
std         0.312298      0.312298
min         1.441370      1.441370
25%         1.961719      1.961719
50%         2.146813      2.146813
75%         2.358125      2.358125
max         3.596574      3.596574
Index(['id', 'period', 'treatment', 'x1', 'x2', 'x3', 'x4', 'age', 'age_s',
       'outcome', 'c
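
The weights summarized above are per-row products of the switch and censoring components. In inverse probability weighting for longitudinal data, per-period weights are commonly accumulated over time as a running product within each patient. The sketch below shows that accumulation on made-up numbers; it is an assumption about a possible next step, not something `calculate_weights()` in this notebook does, and the column name `cumulative_weight` is hypothetical.

```python
import pandas as pd

# Illustrative per-period weights for two patients (made-up numbers).
toy = pd.DataFrame({
    "id":            [1, 1, 1, 2, 2],
    "period":        [0, 1, 2, 0, 1],
    "switch_weight": [1.0, 2.0, 1.5, 1.0, 0.5],
    "censor_weight": [1.0, 1.0, 2.0, 2.0, 1.0],
})

# Per-period weight: product of the two components, as in calculate_weights().
toy["final_weight"] = toy["switch_weight"] * toy["censor_weight"]

# Running product within each patient, ordered by period.
toy = toy.sort_values(["id", "period"])
toy["cumulative_weight"] = toy.groupby("id")["final_weight"].cumprod()

print(toy[["id", "period", "final_weight", "cumulative_weight"]])
```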

In [93]:
trial_pp.show_weight_models()

## Weight Models for Informative Censoring
#
#
## Model: P(censored = 0 | X) for numerator
#
Optimization terminated successfully.
         Current function value: 0.267425
         Iterations 7
              Coef.  Std.Err.          z     P>|z|    [0.025    0.975]
Intercept  2.205875  0.165376  13.338511  0.000000  1.881743  2.530007
x1         0.701948  0.307264   2.284511  0.022342  0.099721  1.304174
x2        -0.470645  0.137497  -3.422940  0.000619 -0.740134 -0.201155
#
## Model: P(censored = 0 | X, previous treatment = 0) for denominator
#
Optimization terminated successfully.
         Current function value: 0.287706
         Iterations 7
              Coef.  Std.Err.         z     P>|z|    [0.025    0.975]
Intercept  1.861908  0.215667  8.633235  0.000000  1.439207  2.284608
x1         1.225127  0.402734  3.042029  0.002350  0.435784  2.014471
x2        -0.479630  0.185757 -2.582034  0.009822 -0.843706 -0.115554
#
## Model: P(censored = 0 | X, previous treatment = 1) for denom

In [94]:
trial_itt.show_weight_models()

## Weight Models for Informative Censoring
#
#
## Model: P(censored = 0 | X) for numerator
#
Optimization terminated successfully.
         Current function value: 0.267425
         Iterations 7
              Coef.  Std.Err.          z     P>|z|    [0.025    0.975]
Intercept  2.205875  0.165376  13.338511  0.000000  1.881743  2.530007
x1         0.701948  0.307264   2.284511  0.022342  0.099721  1.304174
x2        -0.470645  0.137497  -3.422940  0.000619 -0.740134 -0.201155
#
## Model: P(censored = 0 | X, previous treatment = 0) for denominator
#
Optimization terminated successfully.
         Current function value: 0.287706
         Iterations 7
              Coef.  Std.Err.         z     P>|z|    [0.025    0.975]
Intercept  1.861908  0.215667  8.633235  0.000000  1.439207  2.284608
x1         1.225127  0.402734  3.042029  0.002350  0.435784  2.014471
x2        -0.479630  0.185757 -2.582034  0.009822 -0.843706 -0.115554
#
## Model: P(censored = 0 | X, previous treatment = 1) for denom
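
The coefficient tables above are on the log-odds scale. As a small worked example, the sketch below converts the printed numerator-model coefficients into a predicted probability of remaining uncensored using the inverse-logit. The helper predicted_probability and the chosen covariate values are hypothetical and serve only as an illustration.

```python
import math

def predicted_probability(coefs, covariates):
    """Inverse-logit of the linear predictor: 1 / (1 + exp(-eta))."""
    eta = coefs["Intercept"] + sum(coefs[name] * value for name, value in covariates.items())
    return 1.0 / (1.0 + math.exp(-eta))

# Numerator model coefficients as printed in the weight model output.
numerator_coefs = {"Intercept": 2.205875, "x1": 0.701948, "x2": -0.470645}

# Hypothetical covariate profiles used to read the coefficients as probabilities.
p_baseline = predicted_probability(numerator_coefs, {"x1": 0, "x2": 0})
p_x1 = predicted_probability(numerator_coefs, {"x1": 1, "x2": 0})

print(f"P(censored = 0 | x1=0, x2=0) = {p_baseline:.3f}")
print(f"P(censored = 0 | x1=1, x2=0) = {p_x1:.3f}")
```

A positive coefficient (x1) raises the probability of staying uncensored, and a negative one (x2) lowers it.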

### 5. Specify Outcome Model
Now we can specify the outcome model. Here we can include adjustment terms for any variables in the dataset. The numerator terms from the stabilised weight models are automatically included in the outcome model formula.
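
Formula terms such as I(followup_time**2) are expressions evaluated by the formula engine, not columns stored in the data. As a minimal sketch, assuming an outcome formula of the form outcome ~ assigned_treatment + x2 + followup_time + I(followup_time**2) + trial_period + I(trial_period**2), the snippet below materialises those terms as explicit columns with pandas. The helper name build_outcome_design, the _sq column names, the assigned_treatment column and the toy data are assumptions for illustration; this is not the notebook's set_outcome_model().

```python
import pandas as pd

def build_outcome_design(expanded, adjustment_terms=("x2",)):
    """Return the outcome vector y and a design matrix X with explicit quadratic time terms."""
    X = pd.DataFrame(index=expanded.index)
    X["Intercept"] = 1.0
    X["assigned_treatment"] = expanded["assigned_treatment"]
    for term in adjustment_terms:
        X[term] = expanded[term]
    for time_col in ("followup_time", "trial_period"):
        X[time_col] = expanded[time_col]
        # Equivalent of I(time_col**2): computed here instead of looked up by name.
        X[f"{time_col}_sq"] = expanded[time_col] ** 2
    return expanded["outcome"], X

# Toy expanded data (hypothetical values).
toy = pd.DataFrame({
    "outcome": [0, 0, 1],
    "assigned_treatment": [1, 1, 0],
    "x2": [1.15, 0.0, -0.48],
    "followup_time": [0, 1, 2],
    "trial_period": [0, 0, 1],
})
y, X = build_outcome_design(toy)
print(X.columns.tolist())
```

The resulting y and X could then be passed to a weighted logistic regression, for example sm.GLM with a binomial family.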

In [95]:
trial_pp.set_outcome_model()
print(trial_pp.show_outcome_model())

KeyError: "['I(followup_time**2)'] not in index"

### 6. Expand Trials
Now we are ready to create the data set containing the full sequence of target trials.
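
To make the idea of a sequence of trials concrete, the sketch below expands person-period rows so that every eligible period starts its own emulated trial, with follow-up counted from that start. It assumes the data has an eligible column and treats censor_at_switch as "stop follow-up once treatment deviates from the treatment at trial start". The function expand_trials_sketch and the toy data are hypothetical; this shows the general concept, not the notebook's expand_trials().

```python
import pandas as pd

def expand_trials_sketch(df, first_period=0, last_period=float("inf"), censor_at_switch=True):
    """Expand person-period rows into one block of follow-up rows per emulated trial."""
    rows = []
    for pid, person in df.sort_values(["id", "period"]).groupby("id"):
        records = person.to_dict("records")
        for start_idx, start in enumerate(records):
            # A trial starts only in eligible periods inside the requested window.
            if not start["eligible"] or not (first_period <= start["period"] <= last_period):
                continue
            assigned = start["treatment"]
            for rec in records[start_idx:]:
                # Stop follow-up once treatment deviates from the assigned treatment.
                if censor_at_switch and rec["treatment"] != assigned:
                    break
                rows.append({
                    "id": pid,
                    "trial_period": start["period"],
                    "followup_time": rec["period"] - start["period"],
                    "assigned_treatment": assigned,
                    "outcome": rec["outcome"],
                })
    return pd.DataFrame(rows)

# One patient, treated in periods 0 and 1, switching off treatment in period 2.
toy = pd.DataFrame({
    "id": [1, 1, 1],
    "period": [0, 1, 2],
    "treatment": [1, 1, 0],
    "eligible": [1, 1, 0],
    "outcome": [0, 0, 0],
})
expanded = expand_trials_sketch(toy)
print(expanded)
```

In this toy case the patient contributes two trials: one starting at period 0 with two follow-up rows, and one starting at period 1 with a single row, because follow-up ends at the switch in period 2.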

In [96]:
output = TEDatastore()
trial_pp.set_expansion_options(output, chunk_size=500, first_period=0, last_period=float('inf'), censor_at_switch=True)

<__main__.TrialSequencePP at 0x14d94b43770>

#### 6.1 Create Sequence of Trials Data

In [97]:
trial_pp.expand_trials()
print("\nExpanded Data:")
print(trial_pp.expansion.datastore.data)


Expanded Data:
     id  trial_period  followup_time  outcome  weight  treatment        x2  \
0     1             0              0        0     1.0          1  1.146148   
1     1             1              1        0     1.0          1  0.002200   
2     1             2              2        0     1.0          1 -0.481762   
3     1             3              3        0     1.0          1  0.007872   
4     1             4              4        0     1.0          1  0.216054   
..   ..           ...            ...      ...     ...        ...       ...   
495  98             0              0        0     1.0          1  1.392339   
496  98             1              1        0     1.0          1 -0.934798   
497  98             2              2        0     1.0          1 -0.735241   
498  99             0              0        0     1.0          1 -0.346378   
499  99             1              1        0     1.0          1 -1.106481   

     age  
0     36  
1     37  
2     38  
3  

### 7. Load or Sample Expanded Data
Now that the expanded data has been created, we can prepare the data to fit the outcome model. For data that can fit comfortably in memory, this is a trivial step using load_expanded_data.

For large datasets, it may be necessary to sample from the expanded data by setting the p_control argument. This sets the probability that an observation with outcome == 0 will be included in the loaded data. A seed can be set for reproducibility. Additionally, a vector of periods to include can be specified (e.g., period = 1:60 in the R package's notation), and/or a subsetting condition, e.g., subset_condition = "age > 65".
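
The sampling described above can be sketched as follows: every row with outcome == 1 is kept, each row with outcome == 0 is kept with probability p_control, and an optional subset condition is applied first. The function sample_expanded_sketch, the sample_weight column (up-weighting retained controls by 1 / p_control) and the toy data are assumptions for illustration, not the notebook's load_expanded_data().

```python
import numpy as np
import pandas as pd

def sample_expanded_sketch(expanded, p_control, seed=None, subset_condition=None):
    """Keep every outcome == 1 row and each outcome == 0 row with probability p_control."""
    data = expanded.query(subset_condition) if subset_condition else expanded
    rng = np.random.default_rng(seed)  # seeded generator for reproducible draws
    is_case = data["outcome"].to_numpy() == 1
    keep = is_case | (rng.random(len(data)) < p_control)
    sampled = data.loc[keep].copy()
    # Assumption: retained controls are up-weighted by 1 / p_control to offset the sampling.
    sampled["sample_weight"] = np.where(sampled["outcome"] == 1, 1.0, 1.0 / p_control)
    return sampled

# Toy expanded data (hypothetical values): eight controls and two cases.
toy = pd.DataFrame({
    "id": range(10),
    "age": [60, 70, 66, 80, 50, 67, 90, 40, 68, 72],
    "outcome": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})
first = sample_expanded_sketch(toy, p_control=0.5, seed=1234)
second = sample_expanded_sketch(toy, p_control=0.5, seed=1234)
older = sample_expanded_sketch(toy, p_control=1.0, subset_condition="age > 65")
print(first)
```

Using the same seed twice returns the same sample, which is what makes the sampled analysis reproducible.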

In [99]:
trial_pp.load_expanded_data(p_control=0.5, seed=1234)


<__main__.TrialSequencePP at 0x14d94b43770>