# Kaplan-Meier development
## Original community algorithm
Make sure to have R installed with the required dependencies. 

In [None]:
from importlib import import_module
_km = import_module("v6-kaplan-meier-py")

In [None]:
from vantage6.algorithm.tools.mock_client import MockAlgorithmClient
# Initialize the mock client. The datasets simulate the local datasets of the
# nodes. In this case we have three parties that all use the same parquet
# file. The module name needs to be the name of your algorithm package. This
# is the name you specified in `setup.py`, in our case that would be
# v6-kaplan-meier-py.
dataset_1 = {"database": "/Users/frankmartin/Projects/Euracan/small_df.parquet", "db_type": "parquet"}
dataset_2 = {"database": "/Users/frankmartin/Projects/Euracan/small_df.parquet", "db_type": "parquet"}
dataset_3 = {"database": "/Users/frankmartin/Projects/Euracan/small_df.parquet", "db_type": "parquet"}
org_ids = ids = [0, 1, 2]

client = MockAlgorithmClient(
    datasets=[[dataset_1], [dataset_2], [dataset_3]],
    organization_ids=org_ids,
    module="v6-kaplan-meier-py",
)

In [None]:

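import os
import base64

STRING_ENCODING = "utf-8"
ENV_VAR_EQUALS_REPLACEMENT = "="


def _encode(string: str) -> str:
    # Base32-encode the value before storing it in an environment variable.
    # Note: with ENV_VAR_EQUALS_REPLACEMENT set to "=", the replace() call
    # below is a no-op.
    return (
        base64.b32encode(string.encode(STRING_ENCODING))
        .decode(STRING_ENCODING)
        .replace("=", ENV_VAR_EQUALS_REPLACEMENT)
    )


# Set the noise type to NONE for this mock run.
os.environ["KAPLAN_MEIER_TYPE_NOISE"] = _encode("NONE")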

In [None]:
organizations = client.organization.list()
org_ids = ids = [organization["id"] for organization in organizations]

In [None]:
import pandas as pd
# Note that we only supply the task to a single organization, as we only want
# to execute the central part of the algorithm once. The central task takes
# care of the distribution to the other parties.
km_task = client.task.create(
    input_={
        "method": "central",
        "kwargs": {
            "time_column_name": "SURVIVAL_DAYS",
            "censor_column_name": "CENSOR",
        },
    },
    organizations=[org_ids[0]],
)

results = client.result.get(km_task.get("id"))
df_events_clean = pd.read_json(results)

In [None]:
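import matplotlib.pyplot as plt

fig, ax1 = plt.subplots()

# Plot the Kaplan-Meier curve for clean data
ax1.plot(
    df_events_clean["SURVIVAL_DAYS"], df_events_clean["survival_cdf"], label="Clean Data"
)
# Label the axes and show the legend so the `label` above is actually rendered
ax1.set_xlabel("SURVIVAL_DAYS")
ax1.set_ylabel("survival_cdf")
ax1.legend()
plt.show()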

## Sarcoma registry extension
- [ ] Stratify on categorical variables
- [ ] Make the data input multi-cohort / session-based


### Stratify on categorical variables
The algorithm needs to be extended to support stratification on categorical variables. This simply means splitting the dataset on the categorical variable and computing the Kaplan-Meier curve for each group (stratum).

* There are two partial functions, `get_unique_event_times` and `get_km_event_table`, see [here](https://github.com/vantage6/v6-kaplan-meier-py/blob/main/v6-kaplan-meier-py/partial.py). Both should split the dataframe on the categorical variable before reporting their results, and the results should be reported per stratum.
* The unique event times should be combined per stratum somewhere [here](https://github.com/vantage6/v6-kaplan-meier-py/blob/e7ea6466f548902945c3f757f5ed036790ae2ea9/v6-kaplan-meier-py/central.py#L68-L78)
* Finally, the same needs to be done for the event table [here](https://github.com/vantage6/v6-kaplan-meier-py/blob/e7ea6466f548902945c3f757f5ed036790ae2ea9/v6-kaplan-meier-py/central.py#L80-L96)
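
A minimal sketch of the central-side aggregation described in the list above, assuming each node returns its partial results as a dictionary keyed by stratum. The function names `combine_unique_event_times` and `kaplan_meier_per_stratum`, the keys `at_risk` and `observed`, and the shape of the results are hypothetical and are not taken from the `v6-kaplan-meier-py` code base:

```python
from collections import defaultdict


def combine_unique_event_times(node_results):
    """Union the unique event times reported by all nodes, per stratum."""
    combined = defaultdict(set)
    for node_result in node_results:
        for stratum, times in node_result.items():
            combined[stratum].update(times)
    return {stratum: sorted(times) for stratum, times in combined.items()}


def kaplan_meier_per_stratum(node_tables):
    """Sum the per-node event tables and compute one survival curve per stratum."""
    # totals[stratum][time] accumulates the counts over all nodes
    totals = defaultdict(lambda: defaultdict(lambda: {"at_risk": 0, "observed": 0}))
    for node_table in node_tables:
        for stratum, rows in node_table.items():
            for time, row in rows.items():
                totals[stratum][time]["at_risk"] += row["at_risk"]
                totals[stratum][time]["observed"] += row["observed"]

    curves = {}
    for stratum, rows in totals.items():
        survival = 1.0
        curve = []
        for time in sorted(rows):
            at_risk = rows[time]["at_risk"]
            observed = rows[time]["observed"]
            # Product-limit step: S(t) = S(t_prev) * (1 - d / n)
            if at_risk > 0:
                survival *= 1 - observed / at_risk
            curve.append((time, survival))
        curves[stratum] = curve
    return curves


# Hypothetical partial results from two nodes, stratified into "F" and "M"
times = combine_unique_event_times(
    [{"F": [5, 10], "M": [3]}, {"F": [10, 20], "M": [7]}]
)

# Hypothetical event tables for stratum "F", evaluated at the combined times
tables = [
    {"F": {5: {"at_risk": 2, "observed": 1},
           10: {"at_risk": 1, "observed": 1},
           20: {"at_risk": 0, "observed": 0}}},
    {"F": {5: {"at_risk": 2, "observed": 0},
           10: {"at_risk": 2, "observed": 1},
           20: {"at_risk": 1, "observed": 1}}},
]
curves = kaplan_meier_per_stratum(tables)
```

Keying everything by stratum keeps the existing two-step flow (first the event times, then the event table) intact. The only change is one extra level of nesting in what is sent and aggregated.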

### Multi-cohort / session
The algorithm currently works with only a single dataset supplied by the infrastructure. 

* In the current setup of the sarcoma registry we do not use this mechanism. Instead of using the node dataset, we create copies of the output of the queries. These are stored in a directory where successive algorithms can find them. This implies that we cannot use the default `@data` decorator to inject the data into the algorithm (see for example here: https://github.com/vantage6/v6-kaplan-meier-py/blob/e7ea6466f548902945c3f757f5ed036790ae2ea9/v6-kaplan-meier-py/partial.py#L20).
* Another addition is the ability to compute the Kaplan-Meier curve for multiple datasets at the same time. This saves us from having to run the algorithm multiple times on the infrastructure, which would make it very slow.

So what needs to be done:

* The `get_unique_event_times` and `get_km_event_table` functions first need to be executable without the `@data` decorator. This means that the data should be passed as an argument. In the summary algorithm I submitted a patch that does this, see the `summary_per_data_station` and `_summary_per_data_station` functions [here](https://github.com/vantage6/v6-summary-py/blob/bbdb1626bcfdf82659cd1ba82d6715e7060ad2fd/v6-summary-py/partial_summary.py#L24C5-L53). This way we can wrap our own decorator around them, one that does the same thing as the `@data` decorator but for multiple datasets and with our session setup. This new decorator is defined as:
    ```python
    from functools import wraps

    import pandas as pd

    from vantage6.algorithm.tools.util import info
    from vantage6.algorithm.tools.decorators import _get_user_database_labels


    def new_data_decorator(func: callable) -> callable:
        @wraps(func)
        def decorator(*args, mock_data: list[pd.DataFrame] | None = None, **kwargs):

            if mock_data:
                # Local testing: use the supplied dataframes and generated names
                data_frames = mock_data
                cohort_names = [f"cohort_{i}" for i in range(len(mock_data))]
            else:
                # Session setup: load one parquet file per requested cohort
                cohort_names = _get_user_database_labels()
                data_frames = []
                for cohort_name in cohort_names:
                    info(f"Loading data for cohort {cohort_name}")

                    df = pd.read_parquet(f"/mnt/data/{cohort_name}.parquet")
                    data_frames.append(df)

            # Inject the dataframes and cohort names as the first two arguments
            args = (data_frames, cohort_names, *args)
            return func(*args, **kwargs)

        decorator.wrapped_in_data_decorator = True
        return decorator
    ```
See how I have done this [here](https://github.com/Franky-Codes-BV/v6-euracan-sarcoma-algorithms/blob/feature/add-summary-stats-per-var-and-cohort/v6-algorithm-wrapper-py/v6-algorithm-wrapper-py/__init__.py).
Note that I was able to reuse the partial functions from the summary community algorithm, but I had to rewrite the central function because the aggregation steps are different (the results are nested per cohort).
