# TTE-v2 Analysis and Clustering Integration Notebook

This notebook documents the TTE-v2 analysis code, adds a clustering step on top of it, and includes additional data exploration. It covers:

1. **Data Preparation and Model Training:**
   - Loading and cleaning the input dataset.
   - Simulating the training of the TTE-v2 model by generating random placeholder risk scores.

2. **Clustering Mechanism Integration:**
   - Using K-means clustering on the generated risk scores to identify distinct subgroups (e.g., high, moderate, and low risk).
   - Explaining why and how clustering is integrated into the analysis.

3. **Visualization and Insight Generation:**
   - Visualizing the distribution of risk scores by cluster and summarizing cluster statistics.

4. **Survival Analysis Using Kaplan-Meier Estimator (Optional):**
   - Estimating survival probabilities for different treatment groups and computing the survival difference.

Each section contains detailed explanations and code examples to ensure clarity and reproducibility.

## 1. Data Preparation and Model Training

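In this section, we load the observational dataset from `data_censored.csv`, drop rows with missing values, and simulate training of the TTE-v2 model by assigning each observation a random risk score. These scores are placeholders for demonstration purposes, not the output of a fitted model.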

We then display a snapshot of the data to confirm that the loading and preprocessing steps were successful.

In [None]:
import os
import numpy as np
import pandas as pd
import logging

# Configure logging to help trace the execution
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

def load_data(file_path):
    """Load the dataset from a CSV file and perform basic cleaning."""
    try:
        data = pd.read_csv(file_path)
        # Drop rows with missing values
        data.dropna(inplace=True)
        logging.info("Data loaded and cleaned successfully from %s.", file_path)
        return data
    except Exception as e:
        logging.error("Error loading data: %s", e)
        raise

def train_tte_model(data, seed=42):
    """Simulate training of the TTE-v2 model by generating a risk score for each observation."""
    # Risk scores are uniform random numbers in [0, 1), used for demonstration only.
    # A fixed seed keeps the scores (and the clusters derived from them) reproducible.
    rng = np.random.default_rng(seed)
    data['risk_score'] = rng.random(len(data))
    logging.info("TTE-v2 model simulated: risk scores generated.")
    return data

# Load data from the CSV file (adjust the path as needed)
data = load_data('data_censored.csv')

# Simulate TTE model training
data = train_tte_model(data)

# Display the first few rows of the data to verify
print("--- Data Snapshot ---")
print(data.head())

In [None]:
# Display summary information about the dataset
print("\n--- Data Summary ---")
print(f"Number of observations: {len(data)}")
if 'id' in data.columns:
    print(f"Number of unique patients: {data['id'].nunique()}")
else:
    print("Column 'id' not found in the data. Check your dataset columns.")
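Because `load_data()` drops every row that has a missing value, it can help to check how much data that removes. The following is a minimal sketch on a hypothetical toy frame; the column names mirror those used later in this notebook, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are illustrative, not taken from data_censored.csv)
toy = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "followup_time": [0, 1, np.nan, 2],
    "outcome": [0, 1, 0, np.nan],
})

# Count missing values per column before cleaning
missing_per_column = toy.isna().sum()

# Drop incomplete rows and record how many were removed
rows_before = len(toy)
cleaned = toy.dropna()
rows_dropped = rows_before - len(cleaned)

print(missing_per_column)
print(f"Rows dropped: {rows_dropped} of {rows_before}")
```

If a large share of rows is dropped, imputation or column-wise filtering may be preferable to a blanket `dropna()`.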

## 2. Clustering Mechanism Integration

After generating the risk scores, we apply K-means clustering to identify potential subgroups within the data. The rationale is as follows:

- **Why Clustering?** Clustering the risk scores can reveal whether there are distinct subpopulations (e.g., high, moderate, and low risk) within the dataset. This can provide additional insights into the heterogeneity of the predictions from the TTE-v2 model.

- **How Clustering is Applied:**
  1. We extract the `risk_score` column as the feature for clustering.
  2. We use the K-means algorithm to partition the data into a predefined number of clusters (e.g., 3 clusters).
  3. The resulting cluster labels are appended to the original dataset for further analysis.

Below, we define a function `perform_clustering()` to carry out these steps and then apply it to our dataset.

In [None]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def perform_clustering(data, feature_col='risk_score', n_clusters=3, use_pca=False):
    """Apply K-means clustering to the specified feature in the dataset.
    
    Parameters:
        data (DataFrame): The dataset containing the feature.
        feature_col (str): The column name to use for clustering.
        n_clusters (int): The number of clusters to form.
        use_pca (bool): Whether to apply PCA for dimensionality reduction (if needed).
    
    Returns:
        data (DataFrame): The original dataset with an added 'cluster' column.
        kmeans (KMeans): The fitted KMeans model.
    """
    # Extract features
    features = data[[feature_col]].values
    
    # Optionally apply PCA; this only takes effect for multi-column input,
    # so with a single feature column the branch below is skipped
    if use_pca and features.shape[1] > 1:
        pca = PCA(n_components=2)
        features = pca.fit_transform(features)
        logging.info("PCA applied for dimensionality reduction.")
    
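    # Initialize and fit the K-means model; n_init is set explicitly so the
    # number of restarts does not depend on the installed scikit-learn version
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)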
    clusters = kmeans.fit_predict(features)
    data['cluster'] = clusters
    logging.info("K-means clustering completed with %d clusters.", n_clusters)
    return data, kmeans

# Apply clustering on the risk score
data, kmeans_model = perform_clustering(data, feature_col='risk_score', n_clusters=3)
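K-means numbers its clusters arbitrarily, so cluster `0` is not necessarily the lowest-risk group. One possible way to get interpretable labels is to rank clusters by their mean risk score. The sketch below uses NumPy only; `relabel_by_mean` is a hypothetical helper, the label array stands in for K-means output, and the names "low", "moderate", and "high" assume three clusters.

```python
import numpy as np

def relabel_by_mean(scores, labels):
    """Map arbitrary cluster labels to ranks 0..k-1, ordered by ascending mean score."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    unique = np.unique(labels)
    # Mean score of each cluster, in the same order as `unique`
    means = np.array([scores[labels == c].mean() for c in unique])
    # Cluster ids sorted from lowest to highest mean score
    order = unique[np.argsort(means)]
    rank_of = {c: r for r, c in enumerate(order)}
    return np.array([rank_of[c] for c in labels])

# Hypothetical scores and K-means labels for six observations
scores = np.array([0.9, 0.1, 0.5, 0.85, 0.15, 0.55])
labels = np.array([0, 2, 1, 0, 2, 1])

ranks = relabel_by_mean(scores, labels)
names = np.array(["low", "moderate", "high"])[ranks]
print(list(zip(scores, names)))
```

Applied to this notebook, the same idea would take `data['risk_score']` and `data['cluster']` as inputs.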

## 3. Visualizing Clusters and Generating Insights

In this section, we visualize the output of the clustering algorithm to better understand the distribution of risk scores across clusters. We:

- **Plot a Boxplot:** Displays the distribution (median, quartiles, and outliers) of risk scores for each cluster.

- **Print Summary Statistics:** Calculates and shows the mean, median, and standard deviation of risk scores for each cluster.

- **Plot a Count Plot:** Visualizes the number of observations in each cluster.

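These visualizations help identify which cluster corresponds to high-risk patients and which to the moderate- or low-risk groups. K-means assigns cluster numbers arbitrarily, so compare the cluster means rather than reading the risk level from the label itself.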

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot risk score distribution by cluster
plt.figure(figsize=(8, 6))
sns.boxplot(x='cluster', y='risk_score', data=data)
plt.title('Risk Score Distribution Across Clusters')
plt.xlabel('Cluster')
plt.ylabel('Risk Score')
plt.show()

# Calculate and display summary statistics for each cluster
cluster_summary = data.groupby('cluster')['risk_score'].agg(['mean', 'median', 'std']).reset_index()
print("--- Cluster Summary Statistics ---")
print(cluster_summary)

# Plot the count of observations per cluster
plt.figure(figsize=(6, 4))
sns.countplot(x='cluster', data=data)
plt.title('Number of Observations per Cluster')
plt.xlabel('Cluster')
plt.ylabel('Count')
plt.show()

## 4. Survival Analysis Using Kaplan-Meier Estimator

In addition to the clustering analysis, we perform survival analysis on a subset of the data to compare survival probabilities between treatment groups. 

The steps involved are:

1. **Filter the Data:** Select observations corresponding to the first trial period (or another relevant period).
2. **Fit Kaplan-Meier Models:** Fit separate Kaplan-Meier estimators for the treated and control groups.
3. **Predict Survival Probabilities:** Compute the survival probabilities at defined time points.
4. **Compute Survival Difference:** Calculate the difference in survival probabilities between the two groups and an approximate 95% band around it.
5. **Visualize the Results:** Plot the survival difference over time along with the approximate 95% band.

This analysis provides insight into how survival probabilities differ based on treatment, which can be crucial for understanding the effect of the intervention.
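For reference, the quantities in steps 2 to 4 can be written out as follows. This is the standard Kaplan-Meier product-limit form with Greenwood's variance formula, where $t_i$ are the distinct event times, $d_i$ is the number of events at $t_i$, and $n_i$ is the number at risk just before $t_i$. The interval for the difference assumes independent groups and a normal approximation; intervals reported by a particular library may be computed differently.

$$
\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)
$$

$$
\widehat{\mathrm{Var}}\big[\hat{S}(t)\big] = \hat{S}(t)^2 \sum_{i:\, t_i \le t} \frac{d_i}{n_i\,(n_i - d_i)}
$$

$$
\hat{S}_1(t) - \hat{S}_0(t) \;\pm\; 1.96\,\sqrt{\widehat{\mathrm{Var}}\big[\hat{S}_1(t)\big] + \widehat{\mathrm{Var}}\big[\hat{S}_0(t)\big]}
$$

Here $\hat{S}_1$ and $\hat{S}_0$ denote the estimates for the treated and control groups.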

In [None]:
from lifelines import KaplanMeierFitter

# Filter the data for the first trial period (adjust the condition as needed)
newdata = data[data["trial_period"] == 1] if "trial_period" in data.columns else data

# Define a range of follow-up times for prediction (0 to 10)
predict_times = np.arange(0, 11)

# Initialize Kaplan-Meier fitters for treated and control groups
kmf_treated = KaplanMeierFitter()
kmf_control = KaplanMeierFitter()

# Both groups are needed for a comparison, so fail early if the column is missing
# (fitting a Kaplan-Meier model on an empty group would fail anyway)
if "assigned_treatment" not in newdata.columns:
    raise KeyError("Column 'assigned_treatment' is required to split treated and control groups.")

# Fit the Kaplan-Meier model for the treated group
treated_mask = newdata["assigned_treatment"] == 1
kmf_treated.fit(
    durations=newdata.loc[treated_mask, "followup_time"],
    event_observed=newdata.loc[treated_mask, "outcome"],
    label="Treated"
)

# Fit the Kaplan-Meier model for the control group
control_mask = newdata["assigned_treatment"] == 0
kmf_control.fit(
    durations=newdata.loc[control_mask, "followup_time"],
    event_observed=newdata.loc[control_mask, "outcome"],
    label="Control"
)

# Predict survival probabilities at the defined time points
surv_prob_treated = kmf_treated.predict(predict_times)
surv_prob_control = kmf_control.predict(predict_times)

# Compute the difference in survival probabilities
survival_diff = surv_prob_treated - surv_prob_control

# Rough band: +/- 1.96 times the spread of the difference across time points.
# This is a heuristic, not a formal confidence interval; a proper interval
# would use the per-group variance of the Kaplan-Meier estimates.
band = 1.96 * np.std(survival_diff)
ci_lower = survival_diff - band
ci_upper = survival_diff + band

# Plot the survival probability difference over time
plt.figure(figsize=(8, 5))
plt.plot(predict_times, survival_diff, label="Survival Difference", color="blue")
plt.fill_between(predict_times, ci_lower, ci_upper, color="red", alpha=0.2, label="Approx. 95% band")
plt.xlabel("Follow-up Time")
plt.ylabel("Survival Difference")
plt.title("Survival Probability Difference Over Time")
plt.legend()
plt.show()
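A band based on the standard deviation of the difference across time points does not reflect sampling uncertainty at each time. As an alternative, the sketch below computes a pointwise interval from Greenwood's variance formula using NumPy only. The helper names `km_with_greenwood` and `survival_difference_ci` are hypothetical, the toy durations and events are made up, and the interval assumes independent groups and a normal approximation.

```python
import numpy as np

def km_with_greenwood(durations, events, times):
    """Kaplan-Meier survival and Greenwood variance evaluated at `times`."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    # The step function starts at S = 1 with zero variance before the first event
    step_t, step_s, step_v = [-np.inf], [1.0], [0.0]
    surv, gw_sum = 1.0, 0.0
    for t in np.unique(durations[events]):
        n_at_risk = np.sum(durations >= t)       # subjects still under observation at t
        d = np.sum((durations == t) & events)    # events occurring exactly at t
        surv *= 1.0 - d / n_at_risk
        if n_at_risk > d:                        # the Greenwood term is undefined once S reaches 0
            gw_sum += d / (n_at_risk * (n_at_risk - d))
        step_t.append(t)
        step_s.append(surv)
        step_v.append(surv**2 * gw_sum)
    # For each requested time, pick the last step at or before it
    idx = np.searchsorted(step_t, times, side="right") - 1
    return np.asarray(step_s)[idx], np.asarray(step_v)[idx]

def survival_difference_ci(s1, v1, s0, v0, z=1.96):
    """Pointwise normal-approximation interval for S1(t) - S0(t), assuming independent groups."""
    diff = s1 - s0
    se = np.sqrt(v1 + v0)
    return diff, diff - z * se, diff + z * se

# Hypothetical toy data: follow-up times and event indicators per group
times = np.arange(0, 11)
s_t, v_t = km_with_greenwood([2, 3, 5, 7, 8], [1, 0, 1, 1, 0], times)
s_c, v_c = km_with_greenwood([1, 2, 4, 6, 9], [1, 1, 1, 0, 1], times)

diff, lower, upper = survival_difference_ci(s_t, v_t, s_c, v_c)
print(np.round(diff, 3))
```

In this notebook, the inputs would be the `followup_time` and `outcome` columns of each treatment group, and `lower` and `upper` would replace the heuristic band in the plot.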

## Conclusion

In this notebook, we have:

- Loaded and preprocessed the input dataset from `data_censored.csv`.
- Simulated training of the TTE-v2 model by generating random placeholder risk scores.
- Integrated a K-means clustering mechanism to identify potential subgroups in the data and provided detailed visualizations and statistics for each cluster.
- Performed survival analysis using Kaplan-Meier estimators to compare the survival probabilities of treated and control groups.

The step-by-step explanations are intended to keep the implementation clear and reproducible. Future improvements could include integrating more advanced survival models or automating report generation with tools such as Sphinx or Jupyter Book.