# Assignment 1 for Clustering: Target Trial Emulation
- New methods in Machine Learning are often created either by borrowing formulas and concepts from other scientific fields and redefining them under new sets of assumptions, or by adding an extra step to an existing methodological framework.

- In this exercise (Assignment 1 of the Clustering topic), we will try to develop a novel Target Trial Emulation method by integrating clustering concepts into the existing framework. Target Trial Emulation is a methodological framework in epidemiology that tries to account for the biases of older, traditional designs.

These are the instructions:
1. Look at this website: https://rpubs.com/alanyang0924/TTE
2. Extract the dummy data in the package and save it as "data_censored.csv".
3. Convert the R code into Python code (use Jupyter Notebook) and replicate the results using your Python code.
4. Create another copy of your Python code and name it TTE-v2 (use Jupyter Notebook).
5. Using TTE-v2, think of a creative way to integrate a clustering mechanism: understand each step carefully and decide at which step a clustering method can be implemented. Generate insights from your results.
6. Do this in pairs, preferably with your thesis partner.
7. Push to your GitHub repository.
8. The deadline is February 28, 2025 at 11:59 pm.

## I. Necessary Imports

In [269]:
import pandas as pd
import numpy as np
import os
import patsy
import joblib
import json
from sklearn.linear_model import LogisticRegression
from IPython.display import display
import statsmodels.formula.api as smf

## II. Class Definition and Required Functions

In [270]:
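def stats_glm_logit(save_path=None):
    """Return a fitter that fits a logistic model and, if save_path is given, saves it there."""
    if save_path is not None:
        os.makedirs(save_path, exist_ok=True)

    def fit_model(numerator, denominator, data):
        # Only the denominator model for `treatment` is fitted here;
        # the numerator is recorded in the saved metadata but not fitted.
        formula_denominator = f"treatment ~ {denominator}"
        y, X = patsy.dmatrices(formula_denominator, data, return_type="dataframe")
        model = LogisticRegression()
        model.fit(X, y.values.ravel())

        # Persist the model and its metadata only when a save path was provided
        if save_path is not None:
            model_path = os.path.join(save_path, "logit_model.txt")
            joblib.dump(model, model_path)

            model_details = {
                "numerator": numerator,
                "denominator": denominator,
                "model_type": "te_stats_glm_logit",
                "file_path": model_path,
            }

            with open(os.path.join(save_path, "model_details.json"), "w") as f:
                json.dump(model_details, f)

        return model

    return fit_model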


class TrialSequence:
    def __init__(self, estimand, **kwargs):
        self.estimand = estimand
        self.data = None
        self.censor_weights = None
        self.switch_weights = None
        self.outcome_model = None
        self.expansion = None
        self.outcome_data = None

    def set_data(self, data):
        self.data = data

    def show(self):
        print(f"Trial Sequence Object\nEstimand: {self.estimand}\n")
        
        if self.data is not None:
            display(self.data)
        else:
            print("No data set")

        print("\nIPW for informative censoring:")
        print(self.censor_weights if self.censor_weights is not None else "Not calculated.")
        if self.switch_weights is not None:
            print("\nIPW for treatment switch censoring:")
            print(self.switch_weights)
            
        print("\nOutcome model:")
        print(self.outcome_model if self.outcome_model is not None else "Not specified.")
        if self.outcome_data is not None:
            print("\nOutcome data:")
            print(self.outcome_data)
        
    def set_switch_weight_model(self, numerator=None, denominator=None, model_fitter=None, eligible_wts_0=None, eligible_wts_1=None):
        if self.data is None:
            raise ValueError("set_data() before setting switch weight models")
        
        if self.estimand == "ITT":
            raise ValueError("Switching weights are not supported for intention-to-treat analyses")

        if eligible_wts_0 and eligible_wts_0 in self.data.columns:
            self.data = self.data.rename(columns={eligible_wts_0: "eligible_wts_0"})
        if eligible_wts_1 and eligible_wts_1 in self.data.columns:
            self.data = self.data.rename(columns={eligible_wts_1: "eligible_wts_1"})

        if numerator is None:
            numerator = "1"
        if denominator is None:
            denominator = "1"
        
        if "time_on_regime" in denominator:
            raise ValueError("time_on_regime should not be used in denominator.")

        formula_numerator = f"treatment ~ {numerator}"
        formula_denominator = f"treatment ~ {denominator}"

        self.switch_weights = {
            "numerator": formula_numerator,
            "denominator": formula_denominator,
            "model_fitter": "te_stats_glm_logit",
        }

        if model_fitter is not None:
            fitted_model = model_fitter(numerator, denominator, self.data)  
            self.switch_weights["fitted_model"] = fitted_model 

    def show_switch_weights(self):
        return self.switch_weights if self.switch_weights else "Not calculated"
    
    def show_censor_weights(self):
        return self.censor_weights if self.censor_weights else "Not calculated"
    

    def set_censor_weight_model(self, censor_event, numerator="1", denominator="1", pool_models="none", model_fitter=None):
        if model_fitter is None: 
            model_fitter = stats_glm_logit()
            
        if self.data is None:
            raise ValueError("set_data() before setting censor weight models")

        if censor_event not in self.data.columns:
            raise ValueError(f"'{censor_event}' must be a column in the dataset.")
        
        formula_numerator = f"1 - {censor_event} ~ {numerator}"
        formula_denominator = f"1 - {censor_event} ~ {denominator}"

        self.censor_weights = {
            "numerator": formula_numerator,
            "denominator": formula_denominator,
            "pool_numerator": pool_models in ["numerator", "both"],
            "pool_denominator": pool_models == "both",
            "model_fitter": "te_stats_glm_logit"
        }

        self.censor_weights["fitted_model"] = model_fitter(numerator, denominator, self.data)
        return self
    

    def calculate_weights(self, quiet=False):
        # Censor weights are used only when a censoring model has been set and fitted
        use_censor_weights = isinstance(self.censor_weights, dict) and self.censor_weights.get("fitted_model") is not None

        if self.estimand == "PP":
            if not isinstance(self.switch_weights, dict) or self.switch_weights.get("fitted_model") is None:
                raise ValueError("Switch weight models are not specified. Use set_switch_weight_model()")
            self._calculate_weights_trial_seq(quiet, switch_weights=True, censor_weights=use_censor_weights)
        elif self.estimand == "ITT":
            self._calculate_weights_trial_seq(quiet, switch_weights=False, censor_weights=use_censor_weights)
        else:
            raise ValueError(f"Unknown estimand: {self.estimand}")

    def _calculate_weights_trial_seq(self, quiet, switch_weights, censor_weights):
        if switch_weights:
            if not quiet:
                print("Calculating switch weights...")
            # Logic to calculate switch weights
            pass

        if censor_weights:
            if not quiet:
                print("Calculating censor weights...")
            # Logic to calculate censor weights
            pass


    def show_weight_models(self):
        if self.censor_weights is not None:
            print("## Weight Models for Informative Censoring")
            print("#")
            print("#")

            # Numerator model
            print("## Model: P(censor_event = 0 | X) for numerator")
            print("#")
            numerator_formula = self.censor_weights["numerator"]
            numerator_model = smf.logit(numerator_formula, data=self.data).fit(disp=0)
            print(numerator_model.summary())
            print("#")

            # Denominator models
            if self.censor_weights["pool_denominator"]:
                print("## Model: P(censor_event = 0 | X) for denominator")
                print("#")
                denominator_formula = self.censor_weights["denominator"]
                denominator_model = smf.logit(denominator_formula, data=self.data).fit(disp=0)
                print(denominator_model.summary())
            else:
                print("## Model: P(censor_event = 0 | X, previous treatment = 0) for denominator")
                print("#")
                denominator_formula_0 = self.censor_weights["denominator"]
                denominator_model_0 = smf.logit(denominator_formula_0, data=self.data[self.data["previous_treatment"] == 0]).fit(disp=0)
                print(denominator_model_0.summary())
                print("#")

                print("## Model: P(censor_event = 0 | X, previous treatment = 1) for denominator")
                print("#")
                denominator_formula_1 = self.censor_weights["denominator"]
                denominator_model_1 = smf.logit(denominator_formula_1, data=self.data[self.data["previous_treatment"] == 1]).fit(disp=0)
                print(denominator_model_1.summary())

        if self.switch_weights is not None:
            print("## Weight Models for Treatment Switch")
            print("#")
            print("#")

            # Numerator model
            print("## Model: P(switch = 1 | X) for numerator")
            print("#")
            numerator_formula = self.switch_weights["numerator"]
            numerator_model = smf.logit(numerator_formula, data=self.data).fit(disp=0)
            print(numerator_model.summary())
            print("#")

            # Denominator model
            print("## Model: P(switch = 1 | X) for denominator")
            print("#")
            denominator_formula = self.switch_weights["denominator"]
            denominator_model = smf.logit(denominator_formula, data=self.data).fit(disp=0)
            print(denominator_model.summary())

# Subclass of TrialSequence that handles the per-protocol (PP) estimand
class TrialSequencePP(TrialSequence):
    def __init__(self, **kwargs):
        super().__init__("PP", **kwargs)
 
# Subclass of TrialSequence that handles the intention-to-treat (ITT) estimand
class TrialSequenceITT(TrialSequence):
    def __init__(self, **kwargs):
        super().__init__("ITT", **kwargs)

# Python equivalent of the trial_sequence() function used in the article
def trial_sequence(estimand, **kwargs):
    estimand_classes = {
        "PP": TrialSequencePP,
        "ITT": TrialSequenceITT
    }

    if estimand not in estimand_classes:
        raise ValueError(f"{estimand} is not a valid estimand, choose either PP or ITT")
    
    return estimand_classes[estimand](**kwargs)

## III. Process

### 1. Setup
A sequence of target trials analysis starts by specifying which estimand will be used, per-protocol (PP) or intention-to-treat (ITT):

In [271]:
trial_pp = trial_sequence("PP")
trial_itt = trial_sequence("ITT")

### 2. Data Preparation
Next the user must specify the observational input data that will be used for the target trial emulation. Here we need to specify which columns contain which values and how they should be used.

In [272]:
data_censored = pd.read_csv("data_censored.csv")
print("Extracted Dummy Data")
display(data_censored)
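# Previous-period treatment, shifted within each patient so values do not carry over between ids
data_censored["previous_treatment"] = data_censored.groupby("id")["treatment"].shift(1).fillna(0)
# Attach the dataset to each trial object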
trial_pp.set_data(data_censored)
trial_itt.set_data(data_censored)

#Displaying the info stored in each class
trial_pp.show()
trial_itt.show()

Extracted Dummy Data


Unnamed: 0,id,period,treatment,x1,x2,x3,x4,age,age_s,outcome,censored,eligible
0,1,0,1,1,1.146148,0,0.734203,36,0.083333,0,0,1
1,1,1,1,1,0.002200,0,0.734203,37,0.166667,0,0,0
2,1,2,1,0,-0.481762,0,0.734203,38,0.250000,0,0,0
3,1,3,1,0,0.007872,0,0.734203,39,0.333333,0,0,0
4,1,4,1,1,0.216054,0,0.734203,40,0.416667,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
720,99,3,0,0,-0.747906,1,0.575268,68,2.750000,0,0,0
721,99,4,0,0,-0.790056,1,0.575268,69,2.833333,0,0,0
722,99,5,1,1,0.387429,1,0.575268,70,2.916667,0,0,0
723,99,6,1,1,-0.033762,1,0.575268,71,3.000000,0,0,0


Trial Sequence Object
Estimand: PP



Unnamed: 0,id,period,treatment,x1,x2,x3,x4,age,age_s,outcome,censored,eligible,previous_treatment
0,1,0,1,1,1.146148,0,0.734203,36,0.083333,0,0,1,0.0
1,1,1,1,1,0.002200,0,0.734203,37,0.166667,0,0,0,1.0
2,1,2,1,0,-0.481762,0,0.734203,38,0.250000,0,0,0,1.0
3,1,3,1,0,0.007872,0,0.734203,39,0.333333,0,0,0,1.0
4,1,4,1,1,0.216054,0,0.734203,40,0.416667,0,0,0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
720,99,3,0,0,-0.747906,1,0.575268,68,2.750000,0,0,0,0.0
721,99,4,0,0,-0.790056,1,0.575268,69,2.833333,0,0,0,0.0
722,99,5,1,1,0.387429,1,0.575268,70,2.916667,0,0,0,0.0
723,99,6,1,1,-0.033762,1,0.575268,71,3.000000,0,0,0,1.0



IPW for informative censoring:
Not calculated.

Outcome model:
Not specified.
Trial Sequence Object
Estimand: ITT



Unnamed: 0,id,period,treatment,x1,x2,x3,x4,age,age_s,outcome,censored,eligible,previous_treatment
0,1,0,1,1,1.146148,0,0.734203,36,0.083333,0,0,1,0.0
1,1,1,1,1,0.002200,0,0.734203,37,0.166667,0,0,0,1.0
2,1,2,1,0,-0.481762,0,0.734203,38,0.250000,0,0,0,1.0
3,1,3,1,0,0.007872,0,0.734203,39,0.333333,0,0,0,1.0
4,1,4,1,1,0.216054,0,0.734203,40,0.416667,0,0,0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
720,99,3,0,0,-0.747906,1,0.575268,68,2.750000,0,0,0,0.0
721,99,4,0,0,-0.790056,1,0.575268,69,2.833333,0,0,0,0.0
722,99,5,1,1,0.387429,1,0.575268,70,2.916667,0,0,0,0.0
723,99,6,1,1,-0.033762,1,0.575268,71,3.000000,0,0,0,1.0



IPW for informative censoring:
Not calculated.

Outcome model:
Not specified.
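
The `previous_treatment` column is meant to hold each patient's treatment in the preceding period. As a minimal sketch on a made-up toy frame (not the real `data_censored`), the comparison below shows how a plain `shift(1)` over the whole frame differs from a shift taken within each `id`. The column names `prev_global` and `prev_by_id` are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy long-format data: two patients, one row per period (illustrative values only)
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "period": [0, 1, 2, 0, 1],
    "treatment": [1, 1, 1, 0, 1],
})
toy = toy.sort_values(["id", "period"])

# Shift over the whole frame: the last row of patient 1 leaks into patient 2
toy["prev_global"] = toy["treatment"].shift(1).fillna(0)

# Shift within each patient: the first period of every patient gets 0
toy["prev_by_id"] = toy.groupby("id")["treatment"].shift(1).fillna(0)

print(toy)
```

In the toy frame the two columns disagree only on the first row of patient 2, which is exactly the row where the global shift borrows a value from another patient.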


### 3. Weight Models
To adjust for the effects of informative censoring, inverse probability of censoring weights (IPCW) can be applied. To estimate these weights, we construct time-to-(censoring) event models. Two sets of models are fit for the two censoring mechanisms which may apply: censoring due to deviation from assigned treatment and other informative censoring.
#### 3.1 Censoring due to treatment switching
We specify model formulas to be used for calculating the probability of receiving treatment in the current period. Separate models are fitted for patients who had treatment = 1 and those who had treatment = 0 in the previous period. Stabilized weights are used by fitting numerator and denominator models.

There are optional arguments to specify columns which can include/exclude observations from the treatment models. These are used in case it is not possible for a patient to deviate from a certain treatment assignment in that period.
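
To make the idea of stabilized weights concrete, here is a minimal sketch that assumes the numerator and denominator models have already produced a predicted probability of receiving treatment for each row. The columns `p_num` and `p_den`, the toy values, and the helper `stabilized_row_weights` are hypothetical; this illustrates the general ratio and is not the notebook's implementation.

```python
import numpy as np
import pandas as pd

# Toy rows with assumed predicted probabilities of treatment = 1
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "period": [0, 1, 2, 0, 1],
    "treatment": [1, 1, 0, 0, 0],
    "p_num": [0.6, 0.6, 0.6, 0.5, 0.5],   # from the numerator model
    "p_den": [0.8, 0.7, 0.4, 0.5, 0.25],  # from the denominator model
})

def stabilized_row_weights(df):
    """Per-row stabilized weight: P(observed treatment | numerator) / P(observed treatment | denominator)."""
    treated = df["treatment"] == 1
    # Probability of the treatment that was actually received
    num = np.where(treated, df["p_num"], 1 - df["p_num"])
    den = np.where(treated, df["p_den"], 1 - df["p_den"])
    out = df.copy()
    out["wt"] = num / den
    return out

weights = stabilized_row_weights(toy)
print(weights[["id", "period", "treatment", "wt"]])
```

Because the numerator uses fewer covariates than the denominator, the ratio tends to stay closer to 1 than an unstabilized weight of `1 / den` would.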

In [273]:
path = "Models"
trial_pp.set_switch_weight_model(numerator="age", denominator="age + x1 + x3", model_fitter=stats_glm_logit(save_path=os.path.join(path, "switch_models")))
trial_pp.show_switch_weights()

{'numerator': 'treatment ~ age',
 'denominator': 'treatment ~ age + x1 + x3',
 'model_fitter': 'te_stats_glm_logit',
 'fitted_model': LogisticRegression()}

#### 3.2 Other informative censoring
If other informative censoring occurs in the data, we can create similar models to estimate the IPCW. These can be used with all types of estimand. We need to specify censor_event, which is the column containing the censoring indicator.
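
As a rough sketch of what such censoring models estimate, the example below fits two logistic regressions for the probability of remaining uncensored on simulated data and takes their ratio. The simulated data, the `uncensored` column, and the helper `prob_uncensored` are assumptions made for illustration; pooling and splitting by previous treatment are deliberately left out.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Simulated data (illustrative only): censoring is more likely when x1 == 1
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "x1": rng.integers(0, 2, size=n),
    "x2": rng.normal(size=n),
})
p_cens = 1 / (1 + np.exp(-(-2.0 + 1.0 * toy["x1"].to_numpy())))
toy["censored"] = rng.binomial(1, p_cens)
toy["uncensored"] = 1 - toy["censored"]  # the modelled event is "not censored"

def prob_uncensored(data, covariates):
    """Fit a logistic model for P(uncensored = 1 | covariates) and return fitted probabilities."""
    model = LogisticRegression()
    model.fit(data[covariates], data["uncensored"])
    # predict_proba columns follow model.classes_; select the column for class 1
    col = list(model.classes_).index(1)
    return model.predict_proba(data[covariates])[:, col]

p_num = prob_uncensored(toy, ["x2"])        # numerator: x2 only
p_den = prob_uncensored(toy, ["x2", "x1"])  # denominator: x2 + x1
toy["ipcw"] = p_num / p_den

print(toy["ipcw"].describe())
```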

In [274]:
trial_pp.set_censor_weight_model(censor_event="censored", numerator="x2", denominator="x2 + x1", pool_models="none", model_fitter=stats_glm_logit(save_path=os.path.join(path, "switch_models")))
trial_pp.show_censor_weights()

{'numerator': '1 - censored ~ x2',
 'denominator': '1 - censored ~ x2 + x1',
 'pool_numerator': False,
 'pool_denominator': False,
 'model_fitter': 'te_stats_glm_logit',
 'fitted_model': LogisticRegression()}

In [275]:
trial_itt.set_censor_weight_model(censor_event="censored", numerator="x2", denominator="x2+x1", pool_models="numerator", model_fitter=stats_glm_logit(save_path=os.path.join(path, "switch_models")))
trial_itt.show_censor_weights()

{'numerator': '1 - censored ~ x2',
 'denominator': '1 - censored ~ x2+x1',
 'pool_numerator': True,
 'pool_denominator': False,
 'model_fitter': 'te_stats_glm_logit',
 'fitted_model': LogisticRegression()}

### 4. Calculate Weights
Next we need to fit the individual models and combine them into weights. This is done with calculate_weights().
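
The combination step can be sketched as follows, assuming the per-period switching and censoring weights are multiplied and then accumulated over follow-up within each patient. The columns `wt_switch` and `wt_censor`, the toy values, and the helper `combine_weights` are hypothetical; this is one possible reading of the step, not the notebook's `calculate_weights()` logic.

```python
import numpy as np
import pandas as pd

# Toy per-period weights for two patients (illustrative values only)
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "period": [0, 1, 2, 0, 1],
    "wt_switch": [1.0, 0.5, 2.0, 1.0, 1.5],
    "wt_censor": [1.0, 1.2, 1.0, 0.8, 1.0],
})

def combine_weights(df):
    """Multiply the two weights per row, then take the running product within each patient."""
    out = df.sort_values(["id", "period"]).copy()
    out["wt"] = out["wt_switch"] * out["wt_censor"]
    out["cum_wt"] = out.groupby("id")["wt"].cumprod()
    return out

combined = combine_weights(toy)
print(combined)
```

For an ITT analysis, where switching weights are not used, the same sketch applies with `wt_switch` fixed at 1.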

In [276]:
trial_pp.calculate_weights()
trial_itt.calculate_weights()
trial_pp.show_weight_models()
trial_itt.show_weight_models()

Calculating switch weights...
Calculating censor weights...
Calculating censor weights...
## Weight Models for Informative Censoring
#
#
## Model: P(censor_event = 0 | X) for numerator
#


  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q * linpred)))


LinAlgError: Singular matrix
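
One plausible reading of the error above, offered as an interpretation that the output itself does not confirm, is that the formula parser treats `1 - censored` on the left-hand side as term arithmetic (keep the intercept, remove the term `censored`) and not as a numeric subtraction. The response would then be a constant column of ones, which a logistic model cannot fit. The sketch below checks this on a made-up toy frame with `patsy` directly; wrapping the expression in `I(...)` is the usual way to request the arithmetic meaning.

```python
import pandas as pd
import patsy

# Toy data (illustrative only)
toy = pd.DataFrame({
    "censored": [0, 1, 0, 0, 1, 0],
    "x2": [0.1, -0.4, 1.2, 0.3, -0.9, 0.5],
})

# Left-hand side parsed as formula terms: intercept kept, `censored` removed
y_terms, _ = patsy.dmatrices("1 - censored ~ x2", toy, return_type="dataframe")

# Left-hand side evaluated as arithmetic thanks to I(...)
y_arith, _ = patsy.dmatrices("I(1 - censored) ~ x2", toy, return_type="dataframe")

print(y_terms.columns.tolist(), y_terms.iloc[:, 0].tolist())
print(y_arith.columns.tolist(), y_arith.iloc[:, 0].tolist())
```

If this is indeed the cause, building the censoring formulas with `I(1 - censored)` or modelling a precomputed "uncensored" column would give the model a response that actually varies.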