<div style="text-align:center;">
    <h1>🔬 Target Trial Emulation with Clustering Enhancement</h1>
    <h3>Assignment 1: Clustering Integration in Target Trial Emulation (TTE)</h3>
    <h4>Authors:</h4>
    <ul style="list-style:none;">
        <li>👤 Shawn Jurgen Mayol</li>
        <li>👤 Elgen Mar Arinasa</li>
    </ul>
    <hr>
</div>

## 📖 Introduction
Target Trial Emulation (TTE) is a methodological framework in epidemiology designed to reduce biases that arise in observational studies. It allows researchers to simulate randomized controlled trials (RCTs) using observational data. Traditional observational study designs often suffer from selection bias and confounding, leading to unreliable causal inferences.

This notebook aims to **replicate the Target Trial Emulation (TTE) process** from the `TrialEmulation` R package in **Python**, ensuring that the results match those obtained in R. Additionally, we will explore a **novel integration of clustering techniques** within the TTE framework to improve the robustness of the analysis.

## 🎯 Objectives
This notebook will accomplish the following tasks:
1. **Load and inspect** the provided dataset (`data_censored.csv`).
2. **Convert the original R-based TTE methodology to Python** while maintaining accuracy.
3. **Perform Target Trial Emulation (TTE) Analysis** following established frameworks.
4. **Introduce Clustering Methods into TTE (TTE-v2)**:
   - Implement **K-Means clustering** to identify treatment response patterns.
   - Implement **DBSCAN clustering** to detect hidden structures and potential outlier effects.
5. **Compare the performance of traditional TTE vs. TTE with Clustering**.
6. **Generate insights** from the results, discussing improvements and trade-offs.
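
To make the two clustering methods in objective 4 concrete, here is a minimal sketch on synthetic data. The points, the feature layout, and the parameter values (`n_clusters=2`, `eps=0.5`, `min_samples=5`) are assumptions chosen for illustration, not the settings used later in this notebook. It shows the main behavioral difference: K-Means assigns every point to a cluster, while DBSCAN can label an isolated point as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for patient covariates (NOT data_censored.csv):
# two compact groups of 50 points each, plus one isolated point.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
outlier = np.array([[12.0, 12.0]])
X = np.vstack([group_a, group_b, outlier])

# Standardize so both features contribute equally to distances.
X_scaled = StandardScaler().fit_transform(X)

# K-Means partitions all points into k clusters, the outlier included.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X_scaled)
kmeans_silhouette = silhouette_score(X_scaled, kmeans_labels)

# DBSCAN groups dense regions and marks isolated points as noise (-1).
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)
n_noise = int(np.sum(dbscan_labels == -1))

print("K-Means silhouette:", round(kmeans_silhouette, 3))
print("DBSCAN clusters:", len(set(dbscan_labels.tolist()) - {-1}))
print("DBSCAN noise points:", n_noise)
```

In practice `eps` and `min_samples` depend heavily on the scaled data, so they would need tuning on the actual covariates rather than being copied from this sketch.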

## 📚 Import Required Libraries
Before we begin, we will import the necessary libraries for data processing, visualization, and machine learning.

```python
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
```

## 📂 Data Loading and Initial Inspection
To begin our analysis, we load the dataset (`data_censored.csv`) and inspect its structure. This step allows us to understand the data types, check for missing values, and verify that the dataset is correctly formatted for further analysis.


```python
# Import necessary library
import pandas as pd

# Load the dataset
df = pd.read_csv("data_censored.csv")

# Display dataset shape
print("📌 Dataset Shape:", df.shape)

# Show the first few rows
df.head()
```


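```
📌 Dataset Shape: (725, 12)
```

|   | id | period | treatment | x1 | x2 | x3 | x4 | age | age_s | outcome | censored | eligible |
|---|----|--------|-----------|----|----|----|----|-----|-------|---------|----------|----------|
| 0 | 1 | 0 | 1 | 1 | 1.146148 | 0 | 0.734203 | 36 | 0.083333 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0.0022 | 0 | 0.734203 | 37 | 0.166667 | 0 | 0 | 0 |
| 2 | 1 | 2 | 1 | 0 | -0.481762 | 0 | 0.734203 | 38 | 0.25 | 0 | 0 | 0 |
| 3 | 1 | 3 | 1 | 0 | 0.007872 | 0 | 0.734203 | 39 | 0.333333 | 0 | 0 | 0 |
| 4 | 1 | 4 | 1 | 1 | 0.216054 | 0 | 0.734203 | 40 | 0.416667 | 0 | 0 | 0 |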


### 🔍 Data Inspection
We will now perform an initial exploration of the dataset by:
1. Checking for missing values.
2. Inspecting data types to ensure correct formatting.
3. Summarizing key statistics for every column.


```python
# Check for missing values
print("\n🔎 Missing Values:")
print(df.isnull().sum())

# Get dataset info (column names, data types, and non-null counts)
print("\n📜 Dataset Info:")
df.info()

# Display summary statistics
print("\n📊 Dataset Summary:")
df.describe(include="all")
```



```
🔎 Missing Values:
id           0
period       0
treatment    0
x1           0
x2           0
x3           0
x4           0
age          0
age_s        0
outcome      0
censored     0
eligible     0
dtype: int64

📜 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 725 entries, 0 to 724
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   id         725 non-null    int64
 1   period     725 non-null    int64
 2   treatment  725 non-null    int64
 3   x1         725 non-null    int64
 4   x2         725 non-null    float64
 5   x3         725 non-null    int64
 6   x4         725 non-null    float64
 7   age        725 non-null    int64
 8   age_s      725 non-null    float64
 9   outcome    725 non-null    int64
 10  censored   725 non-null    int64
 11  eligible   725 non-null    int64
dtypes: float64(3), int64(9)
memory usage: 68.1 KB

📊 Dataset Summary:
```


|   | id | period | treatment | x1 | x2 | x3 | x4 | age | age_s | outcome | censored | eligible |
|---|----|--------|-----------|----|----|----|----|-----|-------|---------|----------|----------|
| count | 725.0 | 725.0 | 725.0 | 725.0 | 725.0 | 725.0 | 725.0 | 725.0 | 725.0 | 725.0 | 725.0 | 725.0 |
| mean | 49.278621 | 7.051034 | 0.467586 | 0.405517 | -0.173552 | 0.486897 | -0.274722 | 48.093793 | 1.091149 | 0.015172 | 0.08 | 0.234483 |
| std | 28.119313 | 5.802351 | 0.499293 | 0.491331 | 0.997552 | 0.500173 | 1.008643 | 11.834472 | 0.986206 | 0.122323 | 0.27148 | 0.423968 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | -3.284355 | 0.0 | -3.003087 | 19.0 | -1.333333 | 0.0 | 0.0 | 0.0 |
| 25% | 23.0 | 2.0 | 0.0 | 0.0 | -0.809344 | 0.0 | -0.861899 | 40.0 | 0.416667 | 0.0 | 0.0 | 0.0 |
| 50% | 50.0 | 6.0 | 0.0 | 0.0 | -0.16306 | 0.0 | -0.316594 | 49.0 | 1.166667 | 0.0 | 0.0 | 0.0 |
| 75% | 73.0 | 12.0 | 1.0 | 1.0 | 0.494103 | 1.0 | 0.29951 | 56.0 | 1.75 | 0.0 | 0.0 | 0.0 |
| max | 99.0 | 19.0 | 1.0 | 1.0 | 3.907648 | 1.0 | 2.048087 | 78.0 | 3.583333 | 1.0 | 1.0 | 1.0 |


### 📌 Insights from Initial Inspection
- The dataset contains **725 rows and 12 columns**.
- **Key columns include:**
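  - `id` → Patient identifier (values 1 to 99). Each patient contributes one row per period, so the same `id` repeats across rows.
  - `period` → Index of the observation period for that patient (values 0 to 19).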
  - `treatment` → Binary variable indicating treatment assignment (`1 = treated`, `0 = control`).
  - `outcome` → Binary outcome variable (`1 = event occurred`, `0 = no event`).
  - `censored` → Whether the observation is censored (`1 = censored`, `0 = not censored`).
  - `eligible` → Binary indicator for trial eligibility in that period (`1 = eligible`, `0 = not eligible`).

The inspection found no missing values, and every column is numeric (`int64` or `float64`), so no imputation or type conversion is needed before the analysis.
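
The column roles listed above can also be checked programmatically. The sketch below uses a tiny hand-written person-period table with the same column names; the values are hypothetical and are not taken from `data_censored.csv`. The specific checks (binary indicators, one row per `id` and `period` pair) are assumptions about what a well-formed input should look like, not requirements stated by the `TrialEmulation` package.

```python
import pandas as pd

# Tiny illustrative person-period table (hypothetical values).
toy = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2],
    "period":    [0, 1, 2, 0, 1],
    "treatment": [1, 1, 0, 0, 0],
    "outcome":   [0, 0, 0, 0, 1],
    "censored":  [0, 0, 1, 0, 0],
    "eligible":  [1, 0, 0, 1, 1],
})

binary_cols = ["treatment", "outcome", "censored", "eligible"]

# 1. Total number of missing cells across the whole table.
n_missing = int(toy.isnull().sum().sum())

# 2. Binary indicators should contain only 0 and 1.
binary_ok = bool(toy[binary_cols].isin([0, 1]).all().all())

# 3. Each (id, period) pair should appear once; `id` alone repeats.
pairs_unique = not toy.duplicated(subset=["id", "period"]).any()
rows_per_patient = toy.groupby("id").size()

# 4. For a 0/1 column the sum is a count and the mean is a proportion.
event_count = int(toy["outcome"].sum())
event_rate = float(toy["outcome"].mean())

print("Missing cells:", n_missing)
print("Binary columns valid:", binary_ok)
print("Unique (id, period) pairs:", pairs_unique)
print("Rows per patient:", rows_per_patient.to_dict())
print("Events:", event_count, "| rate:", event_rate)
```

The same reading applies to the summary statistics above: an `outcome` mean of 0.015172 over 725 rows corresponds to about 11 rows with an event.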
