<center>
    <h1>Target Trial Emulation</h1>
    By Christian Abay-abay & Thristan Jay Nakila
</center>

## **Instructions**

1. Extract the dummy data from [RPubs - TTE](https://rpubs.com/alanyang0924/TTE) and save it as `data_censored.csv`.
2. Convert the R code to Python in a Jupyter Notebook, ensuring the results match the original.
3. Create a second version (`TTE-v2.ipynb`) with additional analysis.
4. Integrate clustering in `TTE-v2`, determine where it fits, and generate insights.
5. Work in pairs, preferably with your thesis partner.
6. Push your Jupyter Notebooks (`TTE.ipynb` and `TTE-v2.ipynb`) to GitHub.
7. 📅 **Deadline:** February 28, 2025, at **11:59 PM**.


***
<center>
    <h2>R Code converted to Python</h2>
    R Code from [RPubs - TTE](https://rpubs.com/alanyang0924/TTE) converted to Python code for this notebook.
</center>

In [14]:
import os
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.stats import norm

# Load data
file_path = "data/data_censored.csv"
data_censored = pd.read_csv(file_path)
print("Dataset Head:")
print(data_censored.head())

# Create directories for saving models
trial_pp_dir = os.path.join(os.getcwd(), "trial_pp")
trial_itt_dir = os.path.join(os.getcwd(), "trial_itt")
os.makedirs(trial_pp_dir, exist_ok=True)
os.makedirs(trial_itt_dir, exist_ok=True)

# Define datasets
trial_pp = data_censored.copy()
trial_itt = data_censored.copy()

# Compute additional columns to match R transformations
trial_itt['age_s'] = (trial_itt['age'] - trial_itt['age'].min()) / 12
trial_itt['time_on_regime'] = trial_itt.groupby('id').cumcount()
trial_itt['assigned_treatment'] = trial_itt['treatment'].shift(fill_value=0)
trial_itt['trial_period'] = trial_itt.groupby('id').cumcount()

# Fit logistic models for IPW (inverse probability weighting)
print("Fitting switch weight model for Per-protocol...")
X_pp = sm.add_constant(trial_pp[['age', 'x1', 'x3']])
y_pp = trial_pp['treatment']
switch_model_pp = sm.Logit(y_pp, X_pp).fit()

print("Fitting censor weight model for Intention-to-treat...")
X_censor_itt = sm.add_constant(trial_itt[['x2', 'x1']])
y_censor_itt = trial_itt['censored']
censor_model_itt = sm.Logit(y_censor_itt, X_censor_itt).fit()

# Calculating weights
trial_itt['weight'] = 1 / censor_model_itt.predict(X_censor_itt)

# Winsorization (modify extreme weights)
def winsorize_weights(weights, quantile=0.99):
    threshold = np.quantile(weights, quantile)
    return np.minimum(weights, threshold)

trial_itt['weight'] = winsorize_weights(trial_itt['weight'])

# Fit the outcome model with `followup_time`, `trial_period`, and their squares
print("Fitting outcome model...")
X_outcome_itt = trial_itt[['assigned_treatment', 'x2', 'time_on_regime', 'trial_period']].copy()
X_outcome_itt['time_on_regime_sq'] = X_outcome_itt['time_on_regime']**2
X_outcome_itt['trial_period_sq'] = X_outcome_itt['trial_period']**2
X_outcome_itt.insert(0, 'const', 1)  # Manually add constant column
y_outcome_itt = trial_itt['outcome']
outcome_model_itt = sm.Logit(y_outcome_itt, X_outcome_itt).fit()

print("Outcome model fitted with coefficients:")
print(outcome_model_itt.params)
print("Expected feature order:")
print(X_outcome_itt.columns)

# Generate predictions
print("Generating predictions...")
predict_times = np.arange(0, 11)
preds = pd.DataFrame({'followup_time': predict_times})
preds['assigned_treatment'] = 1  # Default value to match model structure
preds['x2'] = trial_itt['x2'].mean()  # Use mean x2 value for prediction
preds['time_on_regime'] = preds['followup_time']
preds['time_on_regime_sq'] = preds['time_on_regime']**2
preds['trial_period'] = preds['followup_time']  # Assuming trial_period follows followup_time
preds['trial_period_sq'] = preds['trial_period']**2

# Ensure constant column is added before selection
X_new = preds[['assigned_treatment', 'x2', 'time_on_regime', 'time_on_regime_sq', 'trial_period', 'trial_period_sq']]
X_new.insert(0, 'const', 1)  # Manually add constant column

print("Feature order in prediction matrix before reordering:")
print(X_new.columns)

# Ensure order matches trained model
X_new = X_new[X_outcome_itt.columns]

print("Feature order in prediction matrix after reordering:")
print(X_new.columns)

preds['survival_diff'] = outcome_model_itt.predict(X_new)

# Compute confidence intervals
pred_var = np.diag(X_new @ outcome_model_itt.cov_params() @ X_new.T)
pred_std_err = np.sqrt(pred_var)
z_critical = norm.ppf(0.975)  # 95% confidence interval
preds['lower'] = preds['survival_diff'] - z_critical * pred_std_err
preds['upper'] = preds['survival_diff'] + z_critical * pred_std_err

print("Predictions with Confidence Intervals:")
print(preds.head())

# Plot survival difference with confidence intervals
plt.plot(preds['followup_time'], preds['survival_diff'], label='Survival Difference', color='black')
plt.fill_between(preds['followup_time'], preds['lower'], preds['upper'], color='red', alpha=0.3, label='95% CI')
plt.xlabel("Follow-up Time")
plt.ylabel("Survival Difference")
plt.legend()
plt.show()


Dataset Head:
   id  period  treatment  x1        x2  x3        x4  age     age_s  outcome  \
0   1       0          1   1  1.146148   0  0.734203   36  0.083333        0   
1   1       1          1   1  0.002200   0  0.734203   37  0.166667        0   
2   1       2          1   0 -0.481762   0  0.734203   38  0.250000        0   
3   1       3          1   0  0.007872   0  0.734203   39  0.333333        0   
4   1       4          1   1  0.216054   0  0.734203   40  0.416667        0   

   censored  eligible  
0         0         1  
1         0         0  
2         0         0  
3         0         0  
4         0         0  
Fitting switch weight model for Per-protocol...
Optimization terminated successfully.
         Current function value: 0.660234
         Iterations 5
Fitting censor weight model for Intention-to-treat...
Optimization terminated successfully.
         Current function value: 0.267425
         Iterations 7
Fitting outcome model...
Optimization terminated succes

LinAlgError: Singular matrix