# TTE-v2 Analysis and Clustering Integration Notebook

This notebook contains improvements and fixes to the original TTE-v2 code. Enhancements include:
1. Code refactoring and bug fixes.
2. Integration of a clustering mechanism to identify distinct risk profiles.
3. Generation of insights from the clustering analysis.

## Step 1: Data Preprocessing and Model Training

In this section, we load the dataset and simulate TTE-v2 training by generating placeholder risk scores. Preprocessing is limited to dropping rows with missing values, and logging is enabled to track progress.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s:%(message)s')

def load_data(file_path):
    """Load and preprocess the dataset."""
    try:
        data = pd.read_csv(file_path)
        # Assume necessary preprocessing steps here
        data.dropna(inplace=True)
        logging.info("Data loaded and preprocessed successfully.")
        return data
    except Exception as e:
        logging.error("Error loading data: %s", e)
        raise

def train_tte_model(data, seed=42):
    """Stand-in for TTE-v2 training.

    Placeholder only: no model is fitted. Risk scores are simulated as
    uniform random values in [0, 1), seeded so that reruns are reproducible.
    """
    rng = np.random.default_rng(seed)
    data = data.copy()  # avoid mutating the caller's DataFrame
    data['risk_score'] = rng.random(len(data))
    logging.info("Placeholder risk scores generated (no model was trained).")
    return data

# Load and prepare the data
data = load_data('data/your_dataset.csv')
data = train_tte_model(data)

## Step 2: Clustering Mechanism Integration

**Rationale:**
After computing the risk scores from the TTE-v2 model, we apply clustering to reveal potential subgroups in the data.

**Implementation:**
- We use K-means clustering on the risk scores.
- Optionally, PCA can be applied for dimensionality reduction if additional features are used.
- The cluster labels are added back to the dataset for further analysis.
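
The optional PCA step only matters when several features are clustered together. A minimal sketch of that case follows. The feature columns (`age`, `biomarker`, `prior_events`) and the synthetic data are hypothetical and not part of the original dataset. Standardizing the features first is a common choice here, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; all column names besides risk_score are hypothetical.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'risk_score': rng.random(200),
    'age': rng.normal(60, 10, 200),
    'biomarker': rng.normal(1.0, 0.3, 200),
    'prior_events': rng.integers(0, 5, 200),
})

# Put features on a common scale so no single column dominates the distances.
scaled = StandardScaler().fit_transform(demo)

# Reduce the four features to two principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)

# Cluster in the reduced space and attach the labels to the DataFrame.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
demo['cluster'] = kmeans.fit_predict(reduced)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print(demo['cluster'].value_counts().sort_index())
```

The explained variance ratio gives a rough sense of how much information the two components retain. If it is low, clustering on the reduced features may hide structure that is present in the full feature set.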

In [None]:
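def perform_clustering(data, feature_col='risk_score', n_clusters=3, use_pca=False):
    """Perform K-means clustering on one or more feature columns.

    feature_col may be a column name or a list of column names. If use_pca
    is True and more than two features are given, they are reduced to two
    principal components before clustering.
    """
    cols = [feature_col] if isinstance(feature_col, str) else list(feature_col)
    features = data[cols].to_numpy()

    # PCA to two components only helps when there are more than two columns
    if use_pca and features.shape[1] > 2:
        pca = PCA(n_components=2)
        features = pca.fit_transform(features)
        logging.info("PCA applied for dimensionality reduction.")

    # n_init is set explicitly because its default differs across scikit-learn versions
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    data['cluster'] = kmeans.fit_predict(features)
    logging.info("Clustering completed with %d clusters.", n_clusters)
    return data, kmeans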

# Apply clustering on the risk score
data, kmeans_model = perform_clustering(data, feature_col='risk_score', n_clusters=3)

## Step 3: Visualizing Clusters and Generating Insights

**Visualization:**
- Boxplots display the distribution of risk scores across clusters.
- Count plots show the number of instances in each cluster.

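**Insights:**
K-means assigns cluster numbers arbitrarily, so which cluster is "high risk" must be read from the per-cluster summary statistics rather than from the label itself. The intended interpretation of the three groups is:
- **High-risk cluster:** High average risk score, suggesting early events.
- **Moderate-risk cluster:** Intermediate risk, indicating a mixed-outcome group.
- **Low-risk cluster:** Low risk, potentially representing delayed or no events.

Because the risk scores in this notebook are simulated, these groups only partition the score range. They carry no real-world meaning until an actual model supplies the scores.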
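
One way to make the label numbers line up with risk level is to renumber clusters by descending mean risk score. The sketch below assumes a DataFrame with `risk_score` and `cluster` columns. The helper name `relabel_by_mean_risk` is hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def relabel_by_mean_risk(df, cluster_col='cluster', score_col='risk_score'):
    """Hypothetical helper: return a copy of df with clusters renumbered
    so that cluster 0 has the highest mean score."""
    order = (df.groupby(cluster_col)[score_col]
               .mean()
               .sort_values(ascending=False)
               .index)
    # Map each old label to its rank in the descending-mean ordering.
    mapping = {old: new for new, old in enumerate(order)}
    out = df.copy()
    out[cluster_col] = out[cluster_col].map(mapping)
    return out

# Synthetic scores standing in for the notebook's data.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'risk_score': rng.random(300)})
demo['cluster'] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(
    demo[['risk_score']]
)

demo = relabel_by_mean_risk(demo)
means = demo.groupby('cluster')['risk_score'].mean()
print(means)
```

After relabeling, cluster 0 is the highest-risk group and cluster 2 the lowest, so plots and summaries read in a consistent order across runs.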

In [None]:
# Boxplot of risk scores within each cluster
plt.figure(figsize=(8, 6))
sns.boxplot(x='cluster', y='risk_score', data=data)
plt.title('Risk Score Distribution Across Clusters')
plt.xlabel('Cluster')
plt.ylabel('Risk Score')
plt.show()

# Calculate and display summary statistics for each cluster
cluster_summary = data.groupby('cluster')['risk_score'].agg(['mean', 'median', 'std']).reset_index()
print("Cluster Summary Statistics:")
print(cluster_summary)

# Additional insights: Visualize cluster counts
plt.figure(figsize=(6, 4))
sns.countplot(x='cluster', data=data)
plt.title('Number of Instances per Cluster')
plt.xlabel('Cluster')
plt.ylabel('Count')
plt.show()