Assignment 1 for Clustering:
Novel methods in Machine Learning are developed either by borrowing formulas and concepts from other scientific fields and redefining them under new sets of assumptions, or by adding an extra step to an existing methodological framework.

In this exercise (Assignment 1 of the Clustering Topic), we will try to develop a novel method of Target Trial Emulation by integrating clustering concepts into the existing framework. Target Trial Emulation is a new methodological framework in epidemiology that tries to account for the biases in traditional study designs.

These are the instructions:
1. Review this website: https://rpubs.com/alanyang0924/TTE
2. Extract the dummy data from the package and save it as "data_censored.csv".
3. Convert the R code into Python (use Jupyter Notebook) and replicate the results with your Python code.
4. Create a copy of your Python code and name it TTE-v2 (use Jupyter Notebook).
5. In TTE-v2, think of a creative way to integrate a clustering mechanism: understand each step carefully and decide at which step a clustering method can be applied. Generate insights from your results.
6. Work in pairs, preferably with your thesis partner.
7. Push your work to your GitHub repository.
8. Deadline is 2 weeks from today: February 28, 2025 at 11:59 pm.

HINT: For those who don't have a thesis topic yet, you can develop one out of this assignment.

I don't mind if you use A.I. tools for this assignment, but if you do, please include your prompts in the submission.

# `TrialSequence` Class Documentation

The `TrialSequence` class is a Python implementation designed to emulate clinical trials for causal inference, specifically for per-protocol (PP) and intention-to-treat (ITT) analyses. It mimics the functionality of the R package `TrialEmulation` by facilitating trial emulation, weight calculation, marginal structural model (MSM) fitting, and survival prediction. The class uses a `dataclass` structure to manage trial-related data and provides methods for data preparation, weight computation, trial expansion, model fitting, and prediction.

---

## Class Overview

The `TrialSequence` class is built to handle longitudinal data with repeated measures, allowing users to emulate hypothetical trials, adjust for confounding using inverse probability weights, and estimate treatment effects using survival analysis.

### Dependencies
- `pandas` for data manipulation
- `numpy` for numerical operations
- `statsmodels` for survival analysis (Cox proportional hazards model)
- `random` for random sampling
- Python `dataclasses` and `typing` for structured data management

---

## Attributes

The `TrialSequence` class uses a `dataclass` to define its attributes, providing a clean and structured way to store trial-related data.

| Attribute            | Type                  | Default | Description                                                                 |
|----------------------|-----------------------|---------|-----------------------------------------------------------------------------|
| `estimand`           | `str`                 | -       | The type of estimand to analyze: `"PP"` (per-protocol) or `"ITT"` (intention-to-treat). Required upon initialization. |
| `data`               | `Optional[pd.DataFrame]` | `None`  | The input dataset containing longitudinal trial data.               |
| `id_col`             | `Optional[str]`       | `None`  | Name of the column in `data` identifying unique individuals.        |
| `period_col`         | `Optional[str]`       | `None`  | Name of the column in `data` indicating time periods.               |
| `treatment_col`      | `Optional[str]`       | `None`  | Name of the column in `data` indicating treatment assignment (binary: 0 or 1). |
| `outcome_col`        | `Optional[str]`       | `None`  | Name of the column in `data` indicating the outcome (event indicator: 0 or 1). |
| `eligible_col`       | `Optional[str]`       | `None`  | Name of the column in `data` indicating eligibility for trial emulation (0 or 1). |
| `switch_weights`     | `Optional[pd.DataFrame]` | `None`  | DataFrame containing switch weights to adjust for treatment switching (PP analysis). |
| `censor_weights`     | `Optional[pd.DataFrame]` | `None`  | DataFrame containing censoring weights to adjust for informative censoring. |
| `combined_weights`   | `Optional[pd.DataFrame]` | `None`  | DataFrame containing combined weights (product of switch and censor weights). |
| `outcome_model`      | `Optional[Any]`       | `None`  | The fitted outcome model object (e.g., an instance of `OutcomeModel`) for survival analysis. |
| `expansion`          | `Optional[pd.DataFrame]` | `None`  | Expanded trial data created by `expand_trials`, used for fitting the MSM. |
| `expansion_options`  | `Optional[Dict]`      | `None`  | Dictionary containing options for trial expansion, such as chunk size and output handler. |

---

## Methods

### `__init__(estimand: str)`
Initializes a new `TrialSequence` instance.

#### Parameters
- `estimand` (`str`): The type of estimand to analyze. Must be `"PP"` (per-protocol) or `"ITT"` (intention-to-treat).

#### Example
```python
trial = TrialSequence(estimand="ITT")
```

---

### `set_data(data: pd.DataFrame, id: str, period: str, treatment: str, outcome: str, eligible: str) -> TrialSequence`
Sets the input data and column names for the trial sequence.

#### Parameters
- `data` (`pd.DataFrame`): The input dataset containing longitudinal trial data.
- `id` (`str`): Name of the column in `data` identifying unique individuals.
- `period` (`str`): Name of the column in `data` indicating time periods.
- `treatment` (`str`): Name of the column in `data` indicating treatment assignment (binary: 0 or 1).
- `outcome` (`str`): Name of the column in `data` indicating the outcome (event indicator: 0 or 1).
- `eligible` (`str`): Name of the column in `data` indicating eligibility for trial emulation (0 or 1).

#### Returns
- `self`: Returns the `TrialSequence` instance for method chaining.

#### Example
```python
data = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'period': [0, 1, 0, 1],
    'treatment': [0, 1, 0, 0],
    'outcome': [0, 1, 0, 0],
    'eligible': [1, 1, 1, 1]
})
trial.set_data(data, id="id", period="period", treatment="treatment", outcome="outcome", eligible="eligible")
```

---

### `set_switch_weight_model(numerator: str, denominator: str, model_fitter: Any) -> TrialSequence`
Sets up and calculates switch weights to adjust for treatment switching (used in PP analysis).

#### Parameters
- `numerator` (`str`): R-style formula string for the numerator model (e.g., `"~ age"`).
- `denominator` (`str`): R-style formula string for the denominator model (e.g., `"~ age + x1"`).
- `model_fitter` (`Any`): A model fitter object (e.g., `StatsGlmLogit`) to fit logistic regression models for switch weights.

#### Returns
- `self`: Returns the `TrialSequence` instance for method chaining.

#### Notes
- Switch weights are calculated using stabilized inverse probability weights to adjust for treatment switching over time.
- The `model_fitter` should have a `fit` method that accepts the data, treatment column, numerator and denominator variables, and ID and period columns.
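
The stabilized weighting described in these notes can be sketched as follows. This is a generic illustration, not the `TrialSequence` internals: the helper name `stabilized_weights` and the columns `p_num` and `p_den` (predicted probabilities of treatment from the numerator and denominator models) are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def stabilized_weights(df, id_col, period_col, treat_col, p_num_col, p_den_col):
    """Cumulative stabilized inverse probability weights per individual.

    Hypothetical helper: p_num_col / p_den_col hold P(treatment = 1) as
    predicted by the numerator and denominator models.
    """
    df = df.sort_values([id_col, period_col]).copy()
    treated = df[treat_col].to_numpy() == 1
    p_num = df[p_num_col].to_numpy()
    p_den = df[p_den_col].to_numpy()
    # Probability of the treatment actually received under each model.
    num = np.where(treated, p_num, 1.0 - p_num)
    den = np.where(treated, p_den, 1.0 - p_den)
    df["ratio"] = num / den
    # Weights accumulate over time within each individual.
    df["weight"] = df.groupby(id_col)["ratio"].cumprod()
    return df


demo = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "period": [0, 1, 0, 1],
    "treatment": [1, 1, 0, 0],
    "p_num": [0.5, 0.5, 0.5, 0.5],
    "p_den": [0.25, 0.5, 0.5, 0.75],
})
weights = stabilized_weights(demo, "id", "period", "treatment", "p_num", "p_den")
```

Using the numerator model rather than a constant 1 keeps the weights closer to 1, which usually reduces their variance.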

#### Example
```python
trial.set_switch_weight_model(
    numerator="~ age",
    denominator="~ age + x1",
    model_fitter=StatsGlmLogit(save_path="switch_models")
)
```

---

### `set_censor_weight_model(censor_event: str, numerator: str, denominator: str, pool_models: str, model_fitter: Any) -> TrialSequence`
Sets up and calculates censoring weights to adjust for informative censoring.

#### Parameters
- `censor_event` (`str`): Name of the censoring indicator column in `data` (0 or 1).
- `numerator` (`str`): R-style formula string for the numerator model (e.g., `"~ x2"`).
- `denominator` (`str`): R-style formula string for the denominator model (e.g., `"~ x2 + x1"`).
- `pool_models` (`str`): Strategy for pooling models across periods: `"none"`, `"numerator"`, or `"denominator"`.
- `model_fitter` (`Any`): A model fitter object (e.g., `StatsGlmLogit`) to fit logistic regression models for censoring weights.

#### Returns
- `self`: Returns the `TrialSequence` instance for method chaining.

#### Notes
- Censoring weights are calculated using stabilized inverse probability of censoring weights.
- Requires a `CensorWeightCalculator` object to handle the weight computation.

#### Example
```python
trial.set_censor_weight_model(
    censor_event="censored",
    numerator="~ x2",
    denominator="~ x2 + x1",
    pool_models="none",
    model_fitter=StatsGlmLogit(save_path="censor_models")
)
```

---

### `calculate_weights() -> TrialSequence`
Combines switch and censor weights into a single set of weights for analysis.

#### Returns
- `self`: Returns the `TrialSequence` instance for method chaining.

#### Raises
- `ValueError`: If neither `switch_weights` nor `censor_weights` has been calculated.

#### Notes
- Creates a DataFrame of all possible individual-period combinations.
- Merges switch and censor weights, filling missing weights with 1.0.
- Computes combined weights as the product of switch and censor weights.
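
A minimal pandas sketch of the three steps in these notes is shown below. The function name `combine_weights` and the weight column names `switch_weight` and `censor_weight` are assumptions for illustration; the actual method may name or store them differently.

```python
import pandas as pd


def combine_weights(data, id_col, period_col, switch_weights=None, censor_weights=None):
    """Hypothetical helper: merge switch and censor weights and multiply them."""
    if switch_weights is None and censor_weights is None:
        raise ValueError("Calculate switch or censor weights first.")
    # All possible individual-period combinations.
    grid = pd.MultiIndex.from_product(
        [data[id_col].unique(), data[period_col].unique()],
        names=[id_col, period_col],
    ).to_frame(index=False)
    sources = (("switch_weight", switch_weights), ("censor_weight", censor_weights))
    for name, weights in sources:
        if weights is None:
            grid[name] = 1.0
        else:
            grid = grid.merge(
                weights[[id_col, period_col, name]],
                on=[id_col, period_col],
                how="left",
            )
            # Missing weights default to 1.0 (no adjustment).
            grid[name] = grid[name].fillna(1.0)
    grid["weight"] = grid["switch_weight"] * grid["censor_weight"]
    return grid.sort_values([id_col, period_col]).reset_index(drop=True)


demo_data = pd.DataFrame({"id": [1, 1, 2, 2], "period": [0, 1, 0, 1]})
demo_switch = pd.DataFrame({"id": [1], "period": [0], "switch_weight": [2.0]})
demo_censor = pd.DataFrame({"id": [1, 2], "period": [0, 1], "censor_weight": [1.5, 0.5]})
combined = combine_weights(demo_data, "id", "period", demo_switch, demo_censor)
```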

#### Example
```python
trial.calculate_weights()
```

---

### `set_outcome_model(adjustment_terms: Optional[str] = None) -> TrialSequence`
Sets up the outcome model for survival analysis.

#### Parameters
- `adjustment_terms` (`Optional[str]`): R-style formula string for adjustment terms (e.g., `"~ x2"`). If `None`, no adjustment terms are used.

#### Returns
- `self`: Returns the `TrialSequence` instance for method chaining.

#### Notes
- Creates an `OutcomeModel` instance, optionally with adjustment variables.

#### Example
```python
trial.set_outcome_model(adjustment_terms="~ x2")
```

---

### `set_expansion_options(output: Optional[Callable] = None, chunk_size: int = 500) -> TrialSequence`
Sets options for trial data expansion.

#### Parameters
- `output` (`Optional[Callable]`): An output handler function to process expanded data (default: `None`).
- `chunk_size` (`int`): Number of individuals to process per chunk during expansion (default: 500).

#### Returns
- `self`: Returns the `TrialSequence` instance for method chaining.

#### Example
```python
trial.set_expansion_options(output=save_to_datatable(), chunk_size=500)
```

---

### `expand_trials() -> TrialSequence`
Expands the trial data for analysis, creating a dataset of emulated trials.

#### Returns
- `self`: Returns the `TrialSequence` instance for method chaining.

#### Raises
- `ValueError`: If `expansion_options` are not set (call `set_expansion_options` first).

#### Notes
- Processes individuals in chunks to manage memory usage.
- Calls `_expand_individuals` to generate trial records for each chunk.
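
The chunked processing noted here might look like the sketch below. The generator name `iter_id_chunks` is hypothetical; it only shows how individuals can be split into groups of `chunk_size` so that each group is expanded separately.

```python
import pandas as pd


def iter_id_chunks(data, id_col, chunk_size=500):
    """Yield sub-DataFrames containing at most chunk_size individuals each."""
    ids = data[id_col].unique()
    for start in range(0, len(ids), chunk_size):
        chunk_ids = ids[start:start + chunk_size]
        # Keep every row that belongs to the individuals in this chunk.
        yield data[data[id_col].isin(chunk_ids)]


demo_long = pd.DataFrame({
    "id": [1, 1, 2, 3, 3, 4, 5],
    "period": [0, 1, 0, 0, 1, 0, 0],
})
chunk_sizes = [len(chunk) for chunk in iter_id_chunks(demo_long, "id", chunk_size=2)]
```

Chunking by individual rather than by row keeps all periods of one person together, which the per-individual expansion requires.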

#### Example
```python
trial.expand_trials()
```

---

### `_expand_individuals(data: pd.DataFrame) -> pd.DataFrame`
Creates expanded trial data for a subset of individuals.

#### Parameters
- `data` (`pd.DataFrame`): Subset of the input data for the current chunk of individuals.

#### Returns
- `pd.DataFrame`: Expanded trial data for the given individuals.

#### Notes
- For each individual, identifies eligible periods and creates trial records.
- Computes survival time and event status for each trial.
- Includes baseline covariates and weights from `combined_weights`.
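
As a rough sketch of the first two notes, the function below builds one record per eligible period for a single individual. The name `expand_individual`, the output column `assigned_treatment`, and the convention that survival time is the number of periods from trial start to the event or the last observed period are all assumptions; baseline covariates and weights are omitted for brevity.

```python
import pandas as pd


def expand_individual(person, period_col="period", treatment_col="treatment",
                      outcome_col="outcome", eligible_col="eligible"):
    """Hypothetical helper: one emulated-trial record per eligible period."""
    person = person.sort_values(period_col)
    records = []
    for start in person.loc[person[eligible_col] == 1, period_col]:
        # Follow-up covers the trial's start period and everything after it.
        follow_up = person[person[period_col] >= start]
        events = follow_up[follow_up[outcome_col] == 1]
        event = int(len(events) > 0)
        end = events[period_col].iloc[0] if event else follow_up[period_col].iloc[-1]
        records.append({
            "trial_period": int(start),
            "assigned_treatment": int(follow_up[treatment_col].iloc[0]),
            "survival_time": int(end - start),
            "event": event,
        })
    return pd.DataFrame(records)


demo_person = pd.DataFrame({
    "period": [0, 1, 2],
    "treatment": [0, 1, 1],
    "outcome": [0, 0, 1],
    "eligible": [1, 1, 0],
})
trials = expand_individual(demo_person)
```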

---

### `load_expanded_data(seed: Optional[int] = None, p_control: float = 0.5) -> TrialSequence`
Loads the expanded trial data and applies sampling weights.

#### Parameters
- `seed` (`Optional[int]`): Random seed for reproducibility (default: `None`).
- `p_control` (`float`): Probability of assignment to the control arm (default: 0.5).

#### Returns
- `self`: Returns the `TrialSequence` instance for method chaining.

#### Raises
- `ValueError`: If `expansion` is `None` (call `expand_trials` first).

#### Notes
- Adds sampling weights to balance treatment assignment probabilities.

#### Example
```python
trial.load_expanded_data(seed=1234, p_control=0.5)
```

---

### `fit_msm(weight_cols: List[str], modify_weights: Optional[Callable] = None) -> TrialSequence`
Fits a marginal structural model (MSM) using a Cox proportional hazards model.

#### Parameters
- `weight_cols` (`List[str]`): List of column names for additional weights (e.g., `["sample_weight"]`).
- `modify_weights` (`Optional[Callable]`): Function to modify combined weights (e.g., winsorization at the 99th percentile).

#### Returns
- `self`: Returns the `TrialSequence` instance for method chaining.

#### Raises
- `ValueError`: If `expansion` is `None` (call `expand_trials` first).

#### Notes
- Fits the model using `survival_time` and `event` columns from the expanded data.
- Combines weights from `combined_weights` and additional `weight_cols`.

#### Example
```python
trial.fit_msm(
    weight_cols=["sample_weight"],
    modify_weights=lambda w: np.minimum(w, np.quantile(w, 0.99))
)
```
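
To illustrate how additional weight columns and `modify_weights` could interact, here is a small sketch. The helper `final_weights` and the base column name `weight` are assumptions, not the class's actual internals.

```python
import numpy as np
import pandas as pd


def final_weights(expanded, weight_cols, modify_weights=None, base_col="weight"):
    """Hypothetical helper: multiply the weight columns, then optionally modify."""
    w = expanded[base_col].to_numpy(dtype=float)
    for col in weight_cols:
        w = w * expanded[col].to_numpy(dtype=float)
    if modify_weights is not None:
        w = modify_weights(w)
    return w


demo_expanded = pd.DataFrame({
    "weight": [1.0, 2.0, 3.0, 50.0],
    "sample_weight": [1.0, 1.0, 1.0, 2.0],
})
# Cap the weights at their 99th percentile (winsorization of the upper tail).
capped = final_weights(
    demo_expanded,
    ["sample_weight"],
    modify_weights=lambda w: np.minimum(w, np.quantile(w, 0.99)),
)
```

Capping extreme weights trades a small amount of bias for a reduction in variance, which is why it is commonly applied before fitting the MSM.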

---

### `predict(newdata: pd.DataFrame, predict_times: List[int], type: str = "survival") -> Dict`
Predicts survival outcomes based on the fitted MSM.

#### Parameters
- `newdata` (`pd.DataFrame`): Data for prediction, with the same structure as the training data.
- `predict_times` (`List[int]`): List of time points at which to predict survival probabilities (e.g., `[0, 1, ..., 10]`).
- `type` (`str`): Type of prediction (default: `"survival"`).

#### Returns
- `Dict`: A dictionary containing:
  - `arm_0`: Survival predictions for the control arm (treatment = 0).
  - `arm_1`: Survival predictions for the treatment arm (treatment = 1).
  - `difference`: DataFrame with columns `followup_time`, `survival_diff`, `2.5%`, and `97.5%`, representing the difference in survival probabilities and confidence intervals.

#### Raises
- `ValueError`: If the outcome model is not fitted (call `fit_msm` first).

#### Example
```python
prediction_data = trial.expansion[trial.expansion['trial_period'] == 1]
preds = trial.predict(
    newdata=prediction_data,
    predict_times=list(range(11)),
    type="survival"
)
```

---

### `_formula_to_vars(formula: Union[str, Any]) -> List[str]`
Converts an R-style formula string to a list of variable names.

#### Parameters
- `formula` (`Union[str, Any]`): R-style formula string (e.g., `"~ age + x1"`).

#### Returns
- `List[str]`: List of variable names extracted from the formula (e.g., `["age", "x1"]`).

#### Notes
- Handles formulas by removing the `~` and splitting on `+`.
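
A minimal sketch of this parsing, assuming only additive terms (no interactions, transformations, or intercept handling), could look like this. The standalone function name is illustrative.

```python
from typing import Any, List, Union


def formula_to_vars(formula: Union[str, Any]) -> List[str]:
    """Extract variable names from a simple R-style formula such as '~ age + x1'."""
    # Drop everything up to and including the '~', then split on '+'.
    rhs = str(formula).split("~", 1)[-1]
    return [term.strip() for term in rhs.split("+") if term.strip()]


two_vars = formula_to_vars("~ age + x1")
one_var = formula_to_vars("~ x2")
```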

---

## Example Usage
Below is a complete example of using the `TrialSequence` class to emulate trials, fit an MSM, and predict survival differences:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.duration.hazard_regression import PHReg

# TrialSequence and StatsGlmLogit are assumed to be defined or imported from this project.

# Create a sample dataset
data = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'period': [0, 1, 0, 1],
    'treatment': [0, 1, 0, 0],
    'outcome': [0, 1, 0, 0],
    'eligible': [1, 1, 1, 1],
    'censored': [0, 0, 0, 1],
    'age': [30, 30, 40, 40],
    'x1': [0.5, 0.6, 0.3, 0.4]
})

# Initialize TrialSequence for ITT analysis
trial_itt = TrialSequence(estimand="ITT")

# Set data
trial_itt.set_data(
    data=data,
    id="id",
    period="period",
    treatment="treatment",
    outcome="outcome",
    eligible="eligible"
)

# Set censoring weights
trial_itt.set_censor_weight_model(
    censor_event="censored",
    numerator="~ age",
    denominator="~ age + x1",
    pool_models="none",
    model_fitter=StatsGlmLogit()
)

# Calculate weights
trial_itt.calculate_weights()

# Set outcome model with adjustment
trial_itt.set_outcome_model(adjustment_terms="~ age")

# Set expansion options and expand trials
trial_itt.set_expansion_options(chunk_size=500)
trial_itt.expand_trials()

# Load expanded data
trial_itt.load_expanded_data(seed=1234, p_control=0.5)

# Fit MSM
trial_itt.fit_msm(weight_cols=["sample_weight"])

# Predict survival differences
prediction_data = trial_itt.expansion[trial_itt.expansion['trial_period'] == 1].copy()
prediction_data['Intercept'] = 1  # Ensure intercept for prediction
preds = trial_itt.predict(
    newdata=prediction_data,
    predict_times=list(range(11)),
    type="survival"
)

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(preds['difference']['followup_time'], preds['difference']['survival_diff'], label="Survival Difference")
plt.plot(preds['difference']['followup_time'], preds['difference']['2.5%'], 'r--', label="95% CI")
plt.plot(preds['difference']['followup_time'], preds['difference']['97.5%'], 'r--')
plt.axhline(0, color='blue', linestyle='--')
plt.xlabel("Follow up")
plt.ylabel("Survival difference")
plt.title("Treatment Effect on Survival")
plt.legend()
plt.grid(True)
plt.show()
```

---

## Notes
- **Weight Calculations**: The class supports inverse probability weights for treatment switching (PP) and censoring (PP and ITT), which adjust for bias from non-adherence and informative censoring.
- **Trial Expansion**: The `expand_trials` method emulates randomized trials by creating a dataset where each eligible period for an individual starts a new trial.
- **Survival Analysis**: Uses `statsmodels`’ `PHReg` for fitting a Cox proportional hazards model, allowing for time-to-event analysis.
- **Flexibility**: The class is designed to handle large datasets by processing individuals in chunks during trial expansion.

---

## Limitations
- The current implementation assumes a binary treatment variable (`0` for control, `1` for treatment).
- Confidence intervals in `predict` are simplified and may not be statistically rigorous (consider using bootstrap methods for better intervals).
- The baseline survival estimation in `OutcomeModel` may need enhancement to align with R’s event-driven approach (e.g., using Kaplan-Meier or Breslow estimator).
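
The bootstrap suggestion in the limitations above can be sketched with a percentile bootstrap. This is a simplified illustration under stated assumptions: it resamples per-individual predicted survival probabilities at a single follow-up time, whereas a full analysis would resample individuals and refit the weights and the MSM in every replicate. The function name and inputs are hypothetical.

```python
import numpy as np


def percentile_bootstrap_ci(values_0, values_1, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in means (arm 1 minus arm 0)."""
    rng = np.random.default_rng(seed)
    values_0 = np.asarray(values_0, dtype=float)
    values_1 = np.asarray(values_1, dtype=float)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each arm with replacement and recompute the difference.
        sample_0 = rng.choice(values_0, size=len(values_0), replace=True)
        sample_1 = rng.choice(values_1, size=len(values_1), replace=True)
        diffs[b] = sample_1.mean() - sample_0.mean()
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    point = values_1.mean() - values_0.mean()
    return point, lower, upper


# Made-up survival probabilities, for illustration only.
point, lower, upper = percentile_bootstrap_ci(
    [0.80, 0.85, 0.90, 0.75],
    [0.90, 0.95, 0.92, 0.88],
)
```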