# Assignment 2 - Clustering with the Sessa Empirical Estimator

## 1. Read the Journals about the Sessa Empirical Estimator.

Prompt:<br>
Our Data Analysis professor gave us an assignment on clustering with the Sessa Empirical Estimator. The first step is to read the journal articles about the Sessa Empirical Estimator. [journals attached in prompt]

ChatGPT response:<br>
The Sessa Empirical Estimator (SEE) is a data-driven method designed to compute the duration of pharmacological prescriptions when information on the prescribed dose and daily consumption is missing or incomplete. It uses a k-means clustering algorithm to analyze temporal distances between filled prescriptions and estimate prescription durations.

To get a deeper understanding of the Sessa Empirical Estimator and its application in clustering, I recommend focusing on the following points from the uploaded journals:

1. **Definition and Purpose of SEE**  
   - It helps determine drug exposure status when actual prescription durations are not available.
   - It is useful in pharmacoepidemiological studies, particularly for assessing persistence, effectiveness, or safety of treatments.

2. **Methodology of SEE**  
   - Uses k-means clustering to classify temporal distances between prescription refills.
   - Filters out artificially long gaps caused by patients who restart medication after a long pause.
   - Computes the empirical cumulative distribution function (ECDF) to determine the most probable prescription duration.

3. **Applications of SEE in Data Analysis**  
   - Applied in real-world healthcare data, particularly for evaluating adherence to antihypertensive and antiseizure medications.
   - Used in large-scale Danish healthcare registers to assess co-exposure to medications.
   - Compared with traditional Researcher-Defined Duration (RDD) methods, showing higher sensitivity.

4. **Performance and Accuracy**  
   - Demonstrated 96% accuracy and sensitivity when tested on simulated data.
   - Real-world applications showed sensitivity ranging from 78% to 95%, depending on the medication.
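The ECDF filtering step in point 2 can be sketched in Python as follows. This is only a minimal illustration, not the published implementation: the interval values are made up, and the 80% cutoff is an assumption taken from the "Keep 80% ECDF" step in the conversion in Section 2.

```python
import numpy as np

# Hypothetical gaps (in days) between consecutive refills; the last two
# values mimic patients who restarted treatment after a long pause.
intervals = np.array([28, 30, 30, 31, 29, 60, 62, 58, 400, 720])

# Empirical CDF: fraction of observations at or below each sorted value.
x = np.sort(intervals)
y = np.arange(1, len(x) + 1) / len(x)

# Keep the lower 80% of the distribution (assumed cutoff), which drops
# the artificially long gaps before any clustering is done.
cutoff = 0.8
kept = x[y <= cutoff]

print(kept)
```

The two long gaps fall in the top 20% of the ECDF and are excluded, so only the plausible refill intervals would be passed on to k-means.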

Would you like a summarized report or more details on how to implement clustering with SEE in your assignment?

## 2. Convert the R Code into Python Code (Use Jupyter Notebook).

Prompt:<br>
Our next instruction is to convert the R code into Python using Jupyter Notebook. Can you teach us how to do it? [Attached SEE.R]

ChatGPT response:<br>
To convert the provided R code into Python using Jupyter Notebook, follow these steps:

### **Step 1: Understanding the R Code**
The R script you uploaded:
- Uses the `AdhereR`, `dplyr`, `plyr`, `lubridate`, `data.table`, and `stats` packages.
- Reads a dataset (`tidy`) and filters it based on an ATC drug code (`arg1`).
- Calculates empirical cumulative distribution functions (ECDF) for prescription durations.
- Applies k-means clustering to categorize drug refill intervals.
- Assigns a predicted duration based on cluster medians.

### **Step 2: Setting Up Python and Required Libraries**
You need the following Python packages:

In [1]:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import gaussian_kde
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
```

### **Step 3: Converting Key R Functions to Python**
Below is a basic conversion of the R function `See()` into Python:

**1. Load and Prepare Data**

In [2]:
```python
import pandas as pd

# Simulated dataset equivalent to `tidy`
# (pnr = patient ID, eksd = fill date, ATC = drug code)
tidy = pd.DataFrame({
    "pnr": [1, 2, 1, 3, 2, 3, 1, 3],
    "eksd": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-02-15", "2023-03-01",
                            "2023-03-15", "2023-04-01", "2023-04-15", "2023-05-01"]),
    "perday": [1, 1, 2, 1, 2, 1, 1, 1],
    "ATC": ["medA", "medB", "medA", "medB", "medA", "medB", "medA", "medB"],
    "dur_original": [30, 15, 20, 40, 35, 25, 50, 30]
})

# Function equivalent to `See()` in R
# (uses `np` and `KMeans` imported in cell In [1])
def see_python(drug_code):
    # Keep one drug and order each patient's fills by date
    df = tidy[tidy["ATC"] == drug_code].copy()
    df = df.sort_values(["pnr", "eksd"])

    # Days between each fill and the same patient's previous fill;
    # a patient's first fill has no previous date and is dropped
    df["prev_eksd"] = df.groupby("pnr")["eksd"].shift(1)
    df = df.dropna(subset=["prev_eksd"])
    df["event_interval"] = (df["eksd"] - df["prev_eksd"]).dt.days

    # ECDF calculation
    x = np.sort(df["event_interval"])
    y = np.arange(1, len(x) + 1) / len(x)

    # Keep the lower 80% of the ECDF
    df_ecdf = pd.DataFrame({"x": x, "y": y})
    df_ecdf = df_ecdf[df_ecdf["y"] <= 0.8]

    # Clustering on log-transformed intervals
    data_scaled = np.log(df_ecdf["x"]).values.reshape(-1, 1)
    best_cluster_num = 2  # Placeholder; would use silhouette analysis
    kmeans = KMeans(n_clusters=best_cluster_num, random_state=1234)
    df_ecdf["cluster"] = kmeans.fit_predict(data_scaled)

    # Assign median durations from clusters
    cluster_medians = df_ecdf.groupby("cluster")["x"].median().to_dict()
    df["Cluster"] = kmeans.predict(np.log(df["event_interval"]).values.reshape(-1, 1))
    df["Median"] = df["Cluster"].map(cluster_medians)

    return df

# Example usage
medA_result = see_python("medA")
print(medA_result)
```

```
ValueError: n_samples=1 should be >= n_clusters=2.
```
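
The `ValueError` above comes from the tiny sample data rather than from scikit-learn itself. For `medA`, only patient 1 has more than one fill, which gives two intervals (45 and 59 days). The 80% ECDF cut keeps just the first of them, so `KMeans` is asked to form two clusters from a single point. One possible way to handle this, and to replace the hard-coded `best_cluster_num = 2` placeholder, is sketched below. The helper name `choose_n_clusters`, its `max_k` parameter, and the fallback to one cluster are assumptions made for illustration; they are not part of the original R script.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def choose_n_clusters(values, max_k=5, random_state=1234):
    """Hypothetical helper: pick k by silhouette score, or return 1 when
    there are too few distinct points to cluster at all."""
    data = np.log(np.asarray(values, dtype=float)).reshape(-1, 1)
    n_samples = len(data)
    n_distinct = len(np.unique(data))

    # silhouette_score needs between 2 and n_samples - 1 clusters, and
    # k-means cannot form more clusters than there are distinct values.
    upper = min(max_k, n_distinct, n_samples - 1)
    if upper < 2:
        return 1

    scores = {}
    for k in range(2, upper + 1):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(data)
        scores[k] = silhouette_score(data, labels)

    # Return the k with the highest average silhouette score
    return max(scores, key=scores.get)


# The single medA interval that survives the 80% cut
print(choose_n_clusters([45]))

# Made-up intervals with two clear groups (about 30 and about 90 days)
print(choose_n_clusters([28, 29, 30, 31, 32, 88, 90, 91, 92, 95]))
```

With a helper like this, a caller such as `see_python` could skip clustering when the result is 1 and use the overall median interval instead. Testing on a larger simulated dataset, with several fills per patient, would also avoid the single-sample case.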