<img src="https://github.com/IKNL/guidelines/blob/master/resources/logos/iknl_nl.png?raw=true" width=200 align="right">

# Vantage6 API for the RAVEN
This notebook *should* contain all the code needed to interact with the vantage6 API from the RAVEN UI.

In [None]:
# The following structure is used in this notebook. First we will handle some
# prerequisites:
#
#   1. Authenticate with the vantage6 server - We are currently using the IKNL KeyCloak
#      server. We should be able to swap it out for the CERTH KeyCloak server (if it is
#      configured correctly).
#   2. Creating prerequisites - This is **static** content which should already be at the
#      vantage6 server. It is not needed for the RAVEN UI, and you can skip this section
#      as I have already created the required content. In the final deployment we will
#      manage these through the vantage6 UI.
#
# And then the sections which are specific to the RAVEN UI:
#
#   3. Creating a new **Study** in case a new **Workspace** is created in RAVEN
#   4. Creating a new **Session** in case a new **Analysis** is created in RAVEN
#   5. Creating a new **DataFrame** for a new **Cohort** in RAVEN using the provided `patient_ids`
#   6. Running the **Summary** analysis for data exploration
#   7. Running other analytics (crosstab, etc.)
#
# In the **future** this notebook will be extended with (1) new analytics and (2) data
# preprocessing steps (e.g. imputation, new variables, etc.)
#
# This notebook depends on several services being online. If one of them is not
# reachable, contact the right person:
#
# - `https://auth.vantage6.ai:8443` is not reachable (login gives a blank screen).
#   Contact IKNL (Anja van Gestel).
# - `https://orchestrator.idea.lst.tfo.upm.es/server/version` is not reachable.
#   Most likely the service in the orchestrator is not running; best to ask Alejandro.
# - You are able to send vantage6 tasks, but the task always stays in the `pending`
#   state. Typically the node has become unresponsive because of a bug in the alpha
#   release. Contact Daniele to restart the pod (delete it), or maybe Alejandro can help.
#
# You need to install the following packages:
#
# - !pip install requests
# - !pip install vantage6-client==5.0.0a22
# - !pip install pandas
#
# I would recommend reading the short introduction about vantage6 in our documentation:
# https://docs.vantage6.ai/en/main/introduction/introduction.html
#

In [53]:
import requests
import json
import base64

from vantage6.client import UserClient

## 1. Authenticate with the vantage6 server

In [54]:
# I authenticate using the vantage6 client library. Now that we are using Keycloak,
# obtaining the token from the Keycloak server is not trivial because of the callback
# mechanism. I expect that you know how to authenticate with Keycloak.
#
# NOTE: The refresh mechanism is not working as expected now that we moved to Keycloak.
# I've worked around this by setting the expiration time to 1 day. We need to fix this.

# Authentication will open a browser window. You can log in with the credentials I
# provided. It is also possible to just open a browser to
# https://orchestrator.idea.lst.tfo.upm.es:443 and log in. In case the login is not
# working, see the contact list at the top of this notebook.
client = UserClient(
    "https://orchestrator.idea.lst.tfo.upm.es:443/server",
    auth_url="https://auth.vantage6.ai:8443",
    auth_client="public_client",
    auth_realm="vantage7",
    log_level="INFO"
)
client.authenticate()

# Set the headers for the other requests
headers = {
    "Authorization": f"Bearer {client._access_token}"
}

# Print the server version
print("Server version: ", client.util.get_server_version())

 Welcome to
                  _                     __  
                 | |                   / /  
__   ____ _ _ __ | |_ __ _  __ _  ___ / /_  
\ \ / / _` | '_ \| __/ _` |/ _` |/ _ \ '_ \ 
 \ V / (_| | | | | || (_| | (_| |  __/ (_) |
  \_/ \__,_|_| |_|\__\__,_|\__, |\___|\___/ 
                            __/ |           
                           |___/            

 --> Join us on Discord! https://discord.gg/rwRvwyK
 --> Docs: https://docs.vantage6.ai
 --> Blog: https://vantage6.ai
------------------------------------------------------------
Cite us!
If you publish your findings obtained using vantage6, 
please cite the proper sources as mentioned in:
https://vantage6.ai/vantage6/references
------------------------------------------------------------
opening browser for login


127.0.0.1 - - [08/Jul/2025 10:11:13] "GET /callback?state=state&session_state=0c70256e-f0c8-4f56-8f93-780eda623470&iss=https%3A%2F%2Fauth.vantage6.ai%3A8443%2Frealms%2Fvantage7&code=63c397d1-30db-48ef-8320-af43a4184360.0c70256e-f0c8-4f56-8f93-780eda623470.40218515-930d-4f32-b4c8-4e650a2bfd46 HTTP/1.1" 200 -


 --> Succesfully authenticated
 --> Name: None (id=7)
 --> Organization: root (id=1)
Server version:  {'version': '5.0.0a22'}
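
Because the token refresh is not working yet, it can help to check how long the current access token remains valid before starting a long workflow. The sketch below assumes the access token is a standard JWT (the Keycloak default) and reads its `exp` claim without verifying the signature. The helper name `token_seconds_left` is hypothetical and not part of the vantage6 client.

```python
import base64
import json
import time
from typing import Optional


def token_seconds_left(access_token: str, now: Optional[float] = None) -> float:
    """Return the seconds until the JWT `exp` claim (signature is NOT verified)."""
    # A JWT consists of three base64url segments: header.payload.signature
    payload_segment = access_token.split(".")[1]
    # base64url segments omit padding; restore it before decoding
    padded = payload_segment + "=" * (-len(payload_segment) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    current = time.time() if now is None else now
    return claims["exp"] - current


# Usage in this notebook (assumption: the token is a JWT):
# print(token_seconds_left(client._access_token))
```

A negative return value would indicate that the token has already expired and a new login is needed.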


## 2. Creating prerequisites (Static content)

In [11]:
# This is 'static' content which should already be at the vantage6 server. The vantage6
# UI can be used to manage the 'static' content. It is static from the point of view of
# the RAVEN UI. All calls in this section use the vantage6 client library, as you don't
# need to implement these in RAVEN.
#
# It is also possible to use the vantage6 UI for this purpose. You can use your browser
# to navigate to `https://orchestrator.idea.lst.tfo.upm.es:443` and login with the
# credentials I provided.
#
# **YOU DO NOT NEED TO CREATE THESE, YOU CAN SKIP THIS SECTION.**

### 2.1 Create the organizations

In [12]:
# client.organization.create(
#     name="Example Organization 1",
#     address1="123 Main St",
#     address2="Apt 1",
#     zipcode="1234AB",
#     country="NL",
#     domain="example-organization-1.com",
# )

In [13]:
# client.organization.create(
#     name="Example Organization 2",
#     address1="123 Main St",
#     address2="Apt 2",
#     zipcode="1234AB",
#     country="NL",
#     domain="example-organization-2.com",
# )

In [14]:
# The organizations are created. All organizations have an ID which can be used to
# identify the organization at a later stage.
client.organization.list(fields=('id', 'name'))

[{'id': 3, 'name': 'Example Organization 2'},
 {'id': 2, 'name': 'Example Organization 1'},
 {'id': 1, 'name': 'root'}]

### 2.2 Create the users

In [15]:
# Users have certain permissions. These permissions are given in the form of `rules`. To
# make it easier to manage them, they are grouped in `roles`. We can use the ID of the
# role to assign it to a user.
client.role.list(fields=('id', 'name'))

[{'id': 2, 'name': 'container'},
 {'id': 1, 'name': 'Root'},
 {'id': 5, 'name': 'Researcher'},
 {'id': 6, 'name': 'Organization Admin'},
 {'id': 4, 'name': 'Viewer'},
 {'id': 3, 'name': 'node'},
 {'id': 7, 'name': 'Collaboration Admin'}]

In [16]:
# client.user.create(
#     username="user1",
#     password="Password123!",
#     email="user1@example-organization-1.com",
#     firstname="User 1",
#     lastname="User 1",
#     organization=2,
#     roles=[6]
# )

In [17]:
# client.user.create(
#     username="user2",
#     password="Password123!",
#     email="user2@example-organization-2.com",
#     firstname="User 2",
#     lastname="User 2",
#     organization=3,
#     roles=[6]
# )

In [18]:
# client.user.create(
#     username="raven",
#     password="Password123!",
#     email="raven@example-organization-2.com",
#     firstname="Raven",
#     lastname="Raven",
#     organization=2,
#     roles=[7]
# )

In [19]:
# TODO FM: Check that the users are both in KeyCloak and in the vantage6 server.
client.user.list(fields=('id', 'username'))

[{'id': 7, 'username': 'admin'},
 {'id': 9, 'username': 'user_2'},
 {'id': 11, 'username': 'alejandro'},
 {'id': 4, 'username': 'raven'},
 {'id': 8, 'username': 'user_1'}]

### 2.3 Create the collaboration

In [20]:
# client.collaboration.create(
#     name="Example Collaboration 1",
#     organizations=[2, 3]
# )

In [21]:
# In vantage6 multiple collaborations can be present. A collaboration is a group of
# organizations that can collaborate on a certain task. In IDEA4RC we create one
# collaboration for all CoE and Research Center organizations. We also have created a
# testing collaboration for now. We need the collaboration ID to create a new session
# and also when we want to create a new task.
client.collaboration.list(fields=('id', 'name'), scope="global")

[{'id': 2, 'name': 'Testing'}, {'id': 1, 'name': 'Example Collaboration 1'}]

### 2.4 Create the nodes

In [22]:
# client.node.create(
#     collaboration=1,
#     organization=2,
#     name="Organization 2 Node 1",
# )

In [23]:
# client.node.create(
#     collaboration=1,
#     organization=3,
#     name="Organization 3 Node 1",
# )

In [24]:
client.node.list(fields=("id", "name", "status"))

[{'id': 10, 'name': 'Testing-root-node', 'status': 'online'},
 {'id': 6, 'name': 'Organization 3 Node 1', 'status': 'offline'},
 {'id': 5, 'name': 'Organization 2 Node 1', 'status': 'offline'}]
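
When a task never leaves the `pending` state, looking at the node status may be a useful first diagnostic. The following is a minimal sketch that filters the node list shown above; `offline_nodes` is a hypothetical helper, and it only assumes the `name` and `status` fields visible in the output.

```python
def offline_nodes(nodes):
    """Return the names of all nodes whose status is not 'online'."""
    return [node["name"] for node in nodes if node.get("status") != "online"]


# Sample data copied from the output of `client.node.list(...)` above
nodes = [
    {"id": 10, "name": "Testing-root-node", "status": "online"},
    {"id": 6, "name": "Organization 3 Node 1", "status": "offline"},
    {"id": 5, "name": "Organization 2 Node 1", "status": "offline"},
]
print(offline_nodes(nodes))
```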

## 3. New Workspace
*New study in vantage6*

In [55]:
# Let's set the collaboration ID to the testing collaboration for now. Later we need to
# change this to the collaboration ID of the IDEA4RC collaboration.
COLLABORATION_ID = 2

In [56]:
# Normally, we expect all organizations to be part of 'the' Collaboration. However, our
# test collaboration does not contain all of them. So we collect the organizations that
# are part of the test collaboration and add them to the study. In RAVEN these are the
# organizations that signed the data permit for the workspace.
#
# TODO We need to map the organizations in RAVEN to the organization ids in vantage6.
#
response = requests.get(
    f"https://orchestrator.idea.lst.tfo.upm.es/server/organization?collaboration_id={COLLABORATION_ID}",
    headers=headers
)
response.json()

{'data': [{'domain': None,
   'id': 1,
   'address2': None,
   'studies': '/server/study?organization_id=1',
   'zipcode': None,
   'runs': '/server/run?organization_id=1',
   'country': None,
   'nodes': '/server/node?organization_id=1',
   'public_key': '',
   'address1': None,
   'collaborations': '/server/collaboration?organization_id=1',
   'tasks': '/server/task?init_org_id=1',
   'users': '/server/user?organization_id=1',
   'name': 'root'}],
 'links': {'first': '/server/organization?collaboration_id=2&page=1',
  'self': '/server/organization?collaboration_id=2&page=1',
  'last': '/server/organization?collaboration_id=2&page=1'}}

In [57]:
ORGANIZATION_IDS = [org["id"] for org in response.json()["data"]]
ORGANIZATION_IDS

[1]
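
The mapping between RAVEN organizations and vantage6 organization ids is still an open TODO. One possible approach, sketched here purely as an illustration, is to match on the organization name. This assumes that RAVEN and vantage6 use identical names, which has not been agreed; both helper functions are hypothetical.

```python
def organization_ids_by_name(organizations):
    """Build a lookup from organization name to vantage6 organization id."""
    return {org["name"]: org["id"] for org in organizations}


def resolve_organization_ids(raven_names, organizations):
    """Translate RAVEN organization names into vantage6 ids (assumes equal names)."""
    lookup = organization_ids_by_name(organizations)
    missing = [name for name in raven_names if name not in lookup]
    if missing:
        # Fail loudly instead of silently creating a study with fewer organizations
        raise KeyError(f"No vantage6 organization found for: {missing}")
    return [lookup[name] for name in raven_names]


# Usage with the response of the organization endpoint:
# ORGANIZATION_IDS = resolve_organization_ids(["root"], response.json()["data"])
```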

In [None]:
# To create a new study we need the organization ids (the internal ids in vantage6)
# that are included in this workspace. The name of the study needs to be unique.
response = requests.post(
    "https://orchestrator.idea.lst.tfo.upm.es/server/study",
    headers=headers,
    json={
        # The collaboration id is the vantage6 id of the collaboration. This is the
        # same for all workspaces. I use the testing collaboration (id 2) for now, but
        # this can change while we are still developing the platform.
        "collaboration_id": COLLABORATION_ID,
        # NOTE --- CHANGE THE NAME OF THE STUDY  ---
        # The name of the study needs to be unique. I guess the name of the workspace
        # is also unique, so we can use that. You could consider using a UUID.
        "name": "UPM Test Study 0",
        # The organization ids are the internal ids of the organizations in vantage6.
        "organization_ids": ORGANIZATION_IDS,
    }
)
response.json()
# In the case that:
#
# - The name is not unique
# - The collaboration id is not valid (non existing)
# - The organization ids are not valid (non existing)
#
# The API will return a 4xx error with a message. It will be of the following format:
# {
#     "msg": "Error message",
# }

{'collaboration': {'id': 2,
  'link': '/server/collaboration/2',
  'methods': ['PATCH', 'DELETE', 'GET']},
 'organizations': [{'domain': None,
   'id': 1,
   'address2': None,
   'studies': '/server/study?organization_id=1',
   'zipcode': None,
   'runs': '/server/run?organization_id=1',
   'country': None,
   'nodes': '/server/node?organization_id=1',
   'public_key': '',
   'address1': None,
   'collaborations': '/server/collaboration?organization_id=1',
   'tasks': '/server/task?init_org_id=1',
   'users': '/server/user?organization_id=1',
   'name': 'root'}],
 'id': 16,
 'tasks': '/server/task?study_id=16',
 'name': 'UPM Test Study 0'}
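
The error format described above (`{"msg": "..."}` on a 4xx status) can be turned into a Python exception so that failures do not pass unnoticed. This is a minimal sketch; `Vantage6ApiError` and `unwrap_response` are hypothetical names, and the fallback message for a missing `msg` key is an assumption.

```python
class Vantage6ApiError(Exception):
    """Raised when the server answers with an error status (hypothetical helper)."""

    def __init__(self, status_code, message):
        super().__init__(f"HTTP {status_code}: {message}")
        self.status_code = status_code
        self.message = message


def unwrap_response(status_code, body):
    """Return `body` on success, raise Vantage6ApiError on a 4xx/5xx status."""
    if status_code >= 400:
        # The notebook documents the error body as {"msg": "..."}; fall back to
        # a generic message in case the key is missing.
        if isinstance(body, dict):
            message = body.get("msg", "unknown error")
        else:
            message = str(body)
        raise Vantage6ApiError(status_code, message)
    return body


# Usage with a `requests` response:
# study = unwrap_response(response.status_code, response.json())
```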

In [60]:
# Now that we have our study ID, let's save it so we can use it later.
STUDY_ID = response.json()["id"]

In [61]:
# You can always view all studies. This endpoint is not necessarily needed for the
# RAVEN UI but I thought it would be useful to have it here.
response = requests.get("https://orchestrator.idea.lst.tfo.upm.es/server/study", headers=headers)
response.json()["data"]

[{'collaboration': {'id': 2,
   'link': '/server/collaboration/2',
   'methods': ['PATCH', 'DELETE', 'GET']},
  'organizations': '/server/organization?study_id=16',
  'id': 16,
  'tasks': '/server/task?study_id=16',
  'name': 'UPM Test Study 0'},
 {'collaboration': {'id': 1,
   'link': '/server/collaboration/1',
   'methods': ['PATCH', 'DELETE', 'GET']},
  'organizations': '/server/organization?study_id=9',
  'id': 9,
  'tasks': '/server/task?study_id=9',
  'name': 'Example Study 9'},
 {'collaboration': {'id': 1,
   'link': '/server/collaboration/1',
   'methods': ['PATCH', 'DELETE', 'GET']},
  'organizations': '/server/organization?study_id=2',
  'id': 2,
  'tasks': '/server/task?study_id=2',
  'name': 'Example Study 2'},
 {'collaboration': {'id': 1,
   'link': '/server/collaboration/1',
   'methods': ['PATCH', 'DELETE', 'GET']},
  'organizations': '/server/organization?study_id=3',
  'id': 3,
  'tasks': '/server/task?study_id=3',
  'name': 'Example Study 3'},
 {'collaboration': {'id'

In [62]:
# You can also view the organizations that are part of a study. This endpoint is not
# necessarily needed for the RAVEN UI but I thought it would be useful to have it here.
response = requests.get(
    f"https://orchestrator.idea.lst.tfo.upm.es/server/organization?study_id={STUDY_ID}",
    headers=headers
)
response.json()

{'data': [{'domain': None,
   'id': 1,
   'address2': None,
   'studies': '/server/study?organization_id=1',
   'zipcode': None,
   'runs': '/server/run?organization_id=1',
   'country': None,
   'nodes': '/server/node?organization_id=1',
   'public_key': '',
   'address1': None,
   'collaborations': '/server/collaboration?organization_id=1',
   'tasks': '/server/task?init_org_id=1',
   'users': '/server/user?organization_id=1',
   'name': 'root'}],
 'links': {'first': '/server/organization?study_id=16&page=1',
  'self': '/server/organization?study_id=16&page=1',
  'last': '/server/organization?study_id=16&page=1'}}
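
The list endpoints return a `links` object with `first`, `self` and `last` entries. A small sketch of how a client could detect that more pages are available is given below. It assumes that the page is always encoded as a `page` query parameter, as in the links shown above; both helpers are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def page_number(link):
    """Extract the value of the `page` query parameter from a link."""
    query = parse_qs(urlparse(link).query)
    return int(query["page"][0])


def has_more_pages(links):
    """Return True if the current page is not the last page."""
    return page_number(links["self"]) < page_number(links["last"])


# Links copied from the response above: there is only a single page
links = {
    "first": "/server/organization?study_id=16&page=1",
    "self": "/server/organization?study_id=16&page=1",
    "last": "/server/organization?study_id=16&page=1",
}
print(has_more_pages(links))
```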

## 4. New Analysis
*New session in vantage6*

In [None]:
# When a new analysis is created in RAVEN we need to create a new session in vantage6.
# A session is a file space on the data stations in which we can store dataframes (an
# extraction of the data from the OMOP database). We need the study id which should be
# stored in the workspace in order to create the session.
response = requests.post(
    "https://orchestrator.idea.lst.tfo.upm.es/server/session",
    headers=headers,
    json={
        # The collaboration id is the vantage6 id of the collaboration. This is the
        # same for all workspaces. I use the testing collaboration (id 2) for now, but
        # this can change while we are still developing the platform.
        "collaboration_id": COLLABORATION_ID,
        # NOTE --- CHANGE THE NAME OF THE SESSION  ---
        # The name of the session needs to be unique within the collaboration, so in the
        # case of IDEA4RC this needs to always be unique. I would use the analysis ID to
        # create a unique name. You could consider using a UUID.
        "name": "UPM Test Session 0",
        # The study id should be linked to the workspace.
        "study_id": STUDY_ID,
        # The scope is the scope of the session. In IDEA4RC we use the collaboration
        # scope. This means that other users can use the same session.
        "scope": "collaboration"
    }
)
response.json()
# In the case that:
#
# - The name is not unique
# - The study id is not valid (non existing)
# - The scope is not valid (only 'collaboration' should be used)
# - The collaboration id is not valid (non existing)
#
# The API will return a 4xx error with a message. It will be of the following format:
# {
#     "msg": "Error message",
# }

{'collaboration': {'id': 2,
  'link': '/server/collaboration/2',
  'methods': ['PATCH', 'DELETE', 'GET']},
 'last_used_at': '2025-07-07T10:36:03.934431',
 'study': {'id': 16,
  'link': '/server/study/16',
  'methods': ['PATCH', 'DELETE', 'GET']},
 'dataframes': '/server/session/13/dataframe',
 'created_at': '2025-07-07T10:36:03.934369',
 'scope': 'col',
 'id': 13,
 'ready': True,
 'owner': {'id': 7,
  'link': '/server/user/7',
  'methods': ['PATCH', 'DELETE', 'GET']},
 'tasks': '/server/task?session_id=13',
 'name': 'UPM Test Session 0'}

In [64]:
SESSION_ID = response.json()["id"]
SESSION_ID

13
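
Study and session names need to be unique. A simple way to guarantee this, sketched under the assumption that RAVEN can pass its own workspace or analysis ID, is to derive the name from that ID and to fall back to a UUID otherwise. The helper `unique_name` and the `prefix-suffix` naming scheme are illustrative only.

```python
import uuid


def unique_name(prefix, identifier=None):
    """Build a name from a prefix and an ID, or a random UUID if no ID is given."""
    suffix = str(identifier) if identifier is not None else uuid.uuid4().hex
    return f"{prefix}-{suffix}"


print(unique_name("raven-analysis", 42))  # derived from a (hypothetical) analysis ID
print(unique_name("raven-analysis"))      # random, different on every call
```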

## 5. New cohort
*Create a new dataframe in vantage6*

In [65]:
# When a new cohort is created vantage6 needs to extract the data from the OMOP database
# and store it in the session as a dataframe. This is done by executing a vantage6
# extraction task.

#
# Static content
#
image = "harbor2.vantage6.ai/idea4rc/sessions:latest"
label = "omop"

#
# Dynamic content
#
# The name of the cohort, this should be unique within a session. You can probably use
# the same name that you use in the RAVEN UI. Alternatively, we can also not send it.
# In that case the name will be generated by vantage6.
# name = "Cohort_name_84"

# Each `image` can have multiple `methods`. We need to use a different method for
# sarcoma and head and neck as we are extracting different features.
method = "create_cohort"

# The input for the task is the patient ids and which features we want to extract.
arguments = {
    "kwargs": {
        # NOTE --- CHANGE THE PATIENT IDS TO THE PATIENT IDS OF THE COHORT ---
        # These `patient_ids` should be coming from the cohort builder in RAVEN
        "patient_ids": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        # NOTE --- CHANGE THE FEATURES TO THE FEATURES OF THE COHORT ---
        # The features are the features that we want to extract from the OMOP database.
        # This can be either "sarcoma" or "head_neck".
        # TODO The head_neck is not implemented yet.
        "features": "sarcoma"
    }
}
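
Since the `features` value is restricted to `"sarcoma"` or `"head_neck"`, it may be worth validating the input before a task is sent. The sketch below builds the same `arguments` structure as the cell above; `build_cohort_arguments` is a hypothetical helper, and rejecting an empty patient list is an assumption about what RAVEN wants.

```python
# Feature sets named in this notebook; "head_neck" is not implemented yet
SUPPORTED_FEATURES = ("sarcoma", "head_neck")


def build_cohort_arguments(patient_ids, features):
    """Build the task arguments for the `create_cohort` method."""
    if features not in SUPPORTED_FEATURES:
        raise ValueError(f"features must be one of {SUPPORTED_FEATURES}, got {features!r}")
    patient_ids = list(patient_ids)
    if not patient_ids:
        # Assumption: an empty cohort is a mistake on the caller's side
        raise ValueError("patient_ids must not be empty")
    return {"kwargs": {"patient_ids": patient_ids, "features": features}}


print(build_cohort_arguments(range(1, 11), "sarcoma"))
```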

In [66]:
# Before we can create a task we need to prepare the task instructions. In vantage6 we
# can (but we don't in IDEA4RC) use end-to-end encryption, therefore we need to store
# the input for each organization individually.
payload = {
    "label": label,
    # "name": name, # optional, v6 will generate a name if not provided
    "task": {
        "method": method,
        "image": image,
        # Because vantage6 supports end-to-end encryption (not used in IDEA4RC), the
        # input is stored for each organization individually.
        "organizations": [
            {
                "id": id_,
                "input": base64.b64encode(
                    json.dumps(arguments).encode("UTF-8")
                ).decode("UTF-8")
            }
            # We always create a cohort for all organizations in the study. Even though
            # in a later stage we might send computation tasks to a subset of the
            # organizations.
            for id_ in ORGANIZATION_IDS
        ]
    }
}
payload

{'label': 'omop',
 'task': {'method': 'create_cohort',
  'image': 'harbor2.vantage6.ai/idea4rc/sessions:latest',
  'organizations': [{'id': 1,
    'input': 'eyJrd2FyZ3MiOiB7InBhdGllbnRfaWRzIjogWzEsIDIsIDMsIDQsIDUsIDYsIDcsIDgsIDksIDEwXSwgImZlYXR1cmVzIjogInNhcmNvbWEifX0='}]}}
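
The `input` field is the JSON-serialized `arguments` dictionary, base64 encoded. To make this explicit, here is a minimal sketch of the encoding and its inverse, which can be handy when inspecting the `input` of a run. The function names are hypothetical; the encoding itself is the one used in the cell above.

```python
import base64
import json


def encode_input(arguments):
    """Serialize the task arguments to JSON and base64 encode them."""
    return base64.b64encode(json.dumps(arguments).encode("utf-8")).decode("utf-8")


def decode_input(encoded):
    """Inverse of `encode_input`: base64 decode and parse the JSON."""
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


# The `input` value from the payload shown above
encoded = (
    "eyJrd2FyZ3MiOiB7InBhdGllbnRfaWRzIjogWzEsIDIsIDMsIDQsIDUsIDYsIDcsIDgsIDksIDEwXSwg"
    "ImZlYXR1cmVzIjogInNhcmNvbWEifX0="
)
print(decode_input(encoded))
```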

In [67]:
# Create a vantage6 task to extract the data from the OMOP data source and store it
# into a dataframe.
response = requests.post(
    f"https://orchestrator.idea.lst.tfo.upm.es/server/session/{SESSION_ID}/dataframe",
    headers=headers,
    json=payload
)
TASK_ID = response.json()["last_session_task"]["id"]
DATAFRAME_ID = response.json()["id"]
response.json()

{'columns': [],
 'db_label': 'omop',
 'last_session_task': {'init_org': {'id': 1,
   'link': '/server/organization/1',
   'methods': ['PATCH', 'DELETE', 'GET']},
  'job_id': 55,
  'init_user': {'id': 7,
   'link': '/server/user/7',
   'methods': ['PATCH', 'DELETE', 'GET']},
  'description': 'Data extraction step for session UPM Test Session 0 (13).This session is in the Testing collaboration. Data extraction is done on the omop database, and the dataframe name will be pedantic_chatterjee.',
  'method': 'create_cohort',
  'id': 76,
  'session': {'id': 13,
   'link': '/server/session/13',
   'methods': ['PATCH', 'DELETE', 'GET']},
  'dataframe': {'id': 44, 'db_label': 'omop', 'name': 'pedantic_chatterjee'},
  'status': 'awaiting',
  'databases': [{'label': 'omop',
    'type': 'source',
    'dataframe_id': None,
    'dataframe_name': None,
    'position': 0}],
  'parent': None,
  'created_at': '2025-07-08T08:23:38.754066',
  'children': '/server/task?parent_id=76',
  'collaboration': {'id

In [69]:

# The status of the task (in this case the task that extracts the data from the OMOP db
# in order to create the dataframe) can be one of the following:
#
# - pending: The task is waiting to be executed.
# - active: The task is being executed.
# - completed: The task has finished successfully.
# - crashed: The task crashed. You probably want to inspect the logs.
#
# You should poll the status of the task until it reaches one of the final states:
# crashed or completed.
response = requests.get(
    f"https://orchestrator.idea.lst.tfo.upm.es/server/run?task_id={TASK_ID}",
    headers=headers,
)
response.json()


{'data': [{'cleanup_at': None,
   'input': 'eyJrd2FyZ3MiOiB7InBhdGllbnRfaWRzIjogWzEsIDIsIDMsIDQsIDUsIDYsIDcsIDgsIDksIDEwXSwgImZlYXR1cmVzIjogInNhcmNvbWEifX0=',
   'finished_at': '2025-07-08T08:23:52.391678',
   'node': {'keycloak_id': '463fda17-82ac-45a5-8c14-aef4b89c9006',
    'status': 'online',
    'ip': None,
    'id': 10,
    'name': 'Testing-root-node'},
   'task': {'id': 76, 'link': '/server/task/76', 'methods': ['DELETE', 'GET']},
   'status': 'completed',
   'action': 'data_extraction',
   'id': 81,
   'organization': {'id': 1,
    'link': '/server/organization/1',
    'methods': ['PATCH', 'DELETE', 'GET']},
   'ports': [],
   'results': {'id': 81,
    'link': '/server/result/81',
    'methods': ['GET', 'PATCH']},
   'started_at': '2025-07-08T08:23:39.284465',
   'assigned_at': '2025-07-07T10:36:03.908202'}],
 'links': {'first': '/server/run?task_id=76&page=1',
  'self': '/server/run?task_id=76&page=1',
  'last': '/server/run?task_id=76&page=1'}}
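
The polling described above could look roughly like the sketch below. It treats only `completed` and `crashed` as final states, as listed in this notebook; the server may know further states that are not covered here. The function `wait_for_runs`, its parameters and the default interval and timeout are all assumptions. The HTTP call is passed in as a callable so the loop itself stays independent of `requests`.

```python
import time

# Final states as listed in this notebook; other states may exist on the server
FINAL_STATES = {"completed", "crashed"}


def wait_for_runs(fetch_runs, interval=5.0, timeout=600.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_runs()` until every run of the task is in a final state."""
    deadline = clock() + timeout
    while True:
        runs = fetch_runs()["data"]
        if runs and all(run["status"] in FINAL_STATES for run in runs):
            return runs
        if clock() >= deadline:
            raise TimeoutError("task did not reach a final state in time")
        sleep(interval)


# Usage in this notebook:
# runs = wait_for_runs(lambda: requests.get(
#     f"https://orchestrator.idea.lst.tfo.upm.es/server/run?task_id={TASK_ID}",
#     headers=headers,
# ).json())
```

Injecting `sleep` and `clock` keeps the loop testable without actually waiting.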

In [None]:

# Examples
# --------
# Some example responses. In some places I've used `...` to hide details that are not
# super important for you. These example responses are similar for all tasks. That means
# that the output can also be used as an example for the Summary Statistics task.
#
#
# (1) A pending task
# ------------------
# {'data': [{'status': 'pending',
#    'organization': {...},
#    'id': 39,
#    'log': None,
#    'cleanup_at': None,
#    'action': 'data_extraction',
#    'ports': [],
#    'input': 'eyJrd2FyZ3MiOiB7InBhdGllbnRfaWRzIjogWzEsIDIsIDMsIDQsIDUsIDYsIDcsIDgsIDksIDEwXSwgImZlYXR1cmVzIjogInNhcmNvbWEifX0=',
#    'task': {'id': 34, 'link': '/server/task/34', 'methods': ['GET', 'DELETE']},
#    'assigned_at': '2025-07-03T11:33:56.136402',
#    'finished_at': None,
#    'results': {'id': 39,
#     'link': '/server/result/39',
#     'methods': ['GET', 'PATCH']},
#    'node': {...},
#    'started_at': '2025-07-07T09:07:24.187217'}],
#  'links': {'first': '/server/run?task_id=34&page=1',
#   'self': '/server/run?task_id=34&page=1',
#   'last': '/server/run?task_id=34&page=1'}}

# (2) A running task
# ------------------
# {'data': [{'status': 'active',
#    'organization': {...},
#    'id': 41,
#    'log': None,
#    'cleanup_at': None,
#    'action': 'data_extraction',
#    'ports': [],
#    'input': 'eyJrd2FyZ3MiOiB7InBhdGllbnRfaWRzIjogWzEsIDIsIDMsIDQsIDUsIDYsIDcsIDgsIDksIDEwXSwgImZlYXR1cmVzIjogInNhcmNvbWEifX0=',
#    'task': {'id': 36, 'link': '/server/task/36', 'methods': ['GET', 'DELETE']},
#    'assigned_at': '2025-07-03T11:33:56.136402',
#    'finished_at': None,
#    'results': {'id': 41,
#     'link': '/server/result/41',
#     'methods': ['GET', 'PATCH']},
#    'node': {...},
#    'started_at': '2025-07-07T09:13:58.244181'}],
#  'links': {'first': '/server/run?task_id=36&page=1',
#   'self': '/server/run?task_id=36&page=1',
#   'last': '/server/run?task_id=36&page=1'}}

# (3) A completed task
# --------------------
# {'data': [{'status': 'completed',
#    'organization': {...},
#    'id': 43,
#    'log': 'LOGS of POD run-43-rqbqj (created by job run-43) \n\n info > wrapper for v6-sessions\ninfo > Reading input file /app/vantage6/task/input\ninfo > Dispatching ...\n/usr/local/lib/python3.10/site-packages/v6-sessions/cohort.py:2: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n  import pkg_resources\ninfo > Module \'v6-sessions\' imported!\ninfo > Setting up connection to database\n{\'uri\': \'jdbc:postgresql://omop-postgres-service.datamesh.svc.cluster.local:5432/omopdb\', \'type\': \'other\'}\n$dbms\n[1] "postgresql"\n\n$extraSettings\nNULL\n\n$oracleDriver\n[1] "thin"\n\n$pathToDriver\n[1] "/usr/local/lib/python3.10/site-packages/ohdsi/database_connector/java"\n\n$user\nfunction () \nrlang::eval_tidy(userExpression)\n<bytecode: 0x5585f6f99050>\n<environment: 0x5585fca586f8>\n\n$password\nfunction () \nrlang::eval_tidy(passWordExpression)\n<bytecode: 0x5585f6f98c28>\n<environment: 0x5585fca586f8>\n\n$server\nfunction () \nrlang::eval_tidy(serverExpression)\n<bytecode: 0x5585f758bf80>\n<environment: 0x5585fca586f8>\n\n$port\nfunction () \nrlang::eval_tidy(portExpression)\n<bytecode: 0x5585f758bb58>\n<environment: 0x5585fca586f8>\n\n$connectionString\nfunction () \nrlang::eval_tidy(csExpression)\n<bytecode: 0x5585f758b730>\n<environment: 0x5585fca586f8>\n\nattr(,"class")\n[1] "ConnectionDetails"        "DefaultConnectionDetails"\n\nConnecting using PostgreSQL driver\ninfo > Retrieving variables for cohort: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\ninfo > Loading SQL file: sarcoma\ninfo > -->  Done\ninfo > Injecting patient IDs into SQL\ninfo > -->  Done\ninfo > Executing SQL\ninfo > Converting dataframe to pandas\ninfo > -->  Done\ninfo > Done!\ninfo > Writing output to /app/vantage6/task/output\n\x1b[?25hR[write to console]: Warning messages:\n\nR[write to console]: 1: 
\nR[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :\nR[write to console]: \n \nR[write to console]:  library ‘/usr/lib/R/site-library’ contains no packages\n\nR[write to console]: 2: \nR[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :\nR[write to console]: \n \nR[write to console]:  library ‘/usr/lib/R/site-library’ contains no packages\n\nR[write to console]: 3: \nR[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :\nR[write to console]: \n \nR[write to console]:  library ‘/usr/lib/R/site-library’ contains no packages\n\nR[write to console]: 4: \nR[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :\nR[write to console]: \n \nR[write to console]:  library ‘/usr/lib/R/site-library’ contains no packages\n\nR[write to console]: 5: \nR[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :\nR[write to console]: \n \nR[write to console]:  library ‘/usr/lib/R/site-library’ contains no packages\n\n\x1b[0m \n\n\n',
#    'cleanup_at': None,
#    'action': 'data_extraction',
#    'ports': [],
#    'input': 'eyJrd2FyZ3MiOiB7InBhdGllbnRfaWRzIjogWzEsIDIsIDMsIDQsIDUsIDYsIDcsIDgsIDksIDEwXSwgImZlYXR1cmVzIjogInNhcmNvbWEifX0=',
#    'task': {'id': 38, 'link': '/server/task/38', 'methods': ['GET', 'DELETE']},
#    'assigned_at': '2025-07-03T11:33:56.136402',
#    'finished_at': '2025-07-07T09:33:56.195070',
#    'results': {'id': 43,
#     'link': '/server/result/43',
#     'methods': ['GET', 'PATCH']},
#    'node': {...},
#    'started_at': '2025-07-07T09:32:57.230257'}],
#  'links': {'first': '/server/run?task_id=38&page=1',
#   'self': '/server/run?task_id=38&page=1',
#   'last': '/server/run?task_id=38&page=1'}}

# (4) A crashed task
# ------------------
# {'data': [{'status': 'crashed',
#    'organization': {...},
#    'id': 39,
#    'log': 'LOGS of POD run-39-p9tkc (created by job run-39) \n\n info > wrapper for v6-sessions\ninfo > Reading input file /app/vantage6/task/input\ninfo > Dispatching ...\n/usr/local/lib/python3.10/site-packages/v6-sessions/cohort.py:2: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n  import pkg_resources\ninfo > Module \'v6-sessions\' imported!\ninfo > Setting up connection to database\n{\'uri\': \'jdbc:postgresql://omop-postgres-service.datamesh.svc.cluster.local:5432/omopdb\', \'type\': \'other\'}\n$dbms\n[1] "postgresql"\n\n$extraSettings\nNULL\n\n$oracleDriver\n[1] "thin"\n\n$pathToDriver\n[1] "/usr/local/lib/python3.10/site-packages/ohdsi/database_connector/java"\n\n$user\nfunction () \nrlang::eval_tidy(userExpression)\n<bytecode: 0x55c12b738f70>\n<environment: 0x55c1313a55c8>\n\n$password\nfunction () \nrlang::eval_tidy(passWordExpression)\n<bytecode: 0x55c12b738b48>\n<environment: 0x55c1313a55c8>\n\n$server\nfunction () \nrlang::eval_tidy(serverExpression)\n<bytecode: 0x55c12b8a5a80>\n<environment: 0x55c1313a55c8>\n\n$port\nfunction () \nrlang::eval_tidy(portExpression)\n<bytecode: 0x55c12b8a5658>\n<environment: 0x55c1313a55c8>\n\n$connectionString\nfunction () \nrlang::eval_tidy(csExpression)\n<bytecode: 0x55c12b8a5230>\n<environment: 0x55c1313a55c8>\n\nattr(,"class")\n[1] "ConnectionDetails"        "DefaultConnectionDetails"\n\nConnecting using PostgreSQL driver\ninfo > Retrieving variables for cohort: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\ninfo > Loading SQL file: sarcoma\ninfo > -->  Done\ninfo > Injecting patient IDs into SQL\ninfo > -->  Done\ninfo > Executing SQL\ninfo > Converting dataframe to pandas\n   PATIENT_ID  ...  N_CANCER_EPISODES\n1        10.0  ...                0.0\n2         2.0  ...                0.0\n3         5.0  ...                
0.0\n4         8.0  ...                0.0\n5         6.0  ...                0.0\n\n[5 rows x 21 columns]\ninfo > -->  Done\ninfo > Done!\ninfo > Writing output to /app/vantage6/task/output\nTraceback (most recent call last):\n  File "<string>", line 1, in <module>\n  File "/usr/local/lib/python3.10/site-packages/vantage6/algorithm/tools/wrap.py", line 75, in wrap_algorithm\n    _write_output(output, output_file)\n  File "/usr/local/lib/python3.10/site-packages/vantage6/algorithm/tools/wrap.py", line 199, in _write_output\n    pq.write_table(output, output_file)\n  File "/usr/local/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 1884, in write_table\n    where, table.schema,\nAttributeError: \'bytes\' object has no attribute \'schema\'\n\x1b[?25hR[write to console]: Warning messages:\n\nR[write to console]: 1: \nR[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :\nR[write to console]: \n \nR[write to console]:  library ‘/usr/lib/R/site-library’ contains no packages\n\nR[write to console]: 2: \nR[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :\nR[write to console]: \n \nR[write to console]:  library ‘/usr/lib/R/site-library’ contains no packages\n\nR[write to console]: 3: \nR[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :\nR[write to console]: \n \nR[write to console]:  library ‘/usr/lib/R/site-library’ contains no packages\n\nR[write to console]: 4: \nR[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :\nR[write to console]: \n \nR[write to console]:  library ‘/usr/lib/R/site-library’ contains no packages\n\nR[write to console]: 5: \nR[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :\nR[write to console]: \n \nR[write to console]:  library ‘/usr/lib/R/site-library’ contains no packages\n\n\x1b[0m 
\n\n\n',
#    'cleanup_at': None,
#    'action': 'data_extraction',
#    'ports': [],
#    'input': 'eyJrd2FyZ3MiOiB7InBhdGllbnRfaWRzIjogWzEsIDIsIDMsIDQsIDUsIDYsIDcsIDgsIDksIDEwXSwgImZlYXR1cmVzIjogInNhcmNvbWEifX0=',
#    'task': {'id': 34, 'link': '/server/task/34', 'methods': ['GET', 'DELETE']},
#    'assigned_at': '2025-07-03T11:33:56.136402',
#    'finished_at': '2025-07-07T09:07:55.422012',
#    'results': {'id': 39,
#     'link': '/server/result/39',
#     'methods': ['GET', 'PATCH']},
#    'node': {...},
#    'started_at': '2025-07-07T09:07:24.187217'}],
#  'links': {'first': '/server/run?task_id=34&page=1',
#   'self': '/server/run?task_id=34&page=1',
#   'last': '/server/run?task_id=34&page=1'}}

## 6. Summary statistics

In [276]:
# Before we can display the summary statistics we need to calculate them. This is done
# through a vantage6 algorithm. We first need to be sure the dataframe is ready to be
# used. Then we can execute the algorithm and await the results to be displayed.

In [70]:
# Define the inputs of the vantage6 task that computes the summary statistics for the
# cohort dataframe(s) created in the previous section.

#
# Static content
#
# This is the Docker image that contains the analysis `method` to compute the summary
# statistics.
image = "harbor2.vantage6.ai/idea4rc/analytics:latest"

#
# Dynamic content
#
# NOTE --- CHANGE THE NAME AND DESCRIPTION OF THE TASK ---
# The name (and description) of the task does not need to be unique. It is used to
# identify the task, so give it some meaningful name. For example include the cohorts
# that are being analysed.
name = "Summary Statistics of Cohort 1"
description = "Summary statistics of the cohort"

# Each `image` can have multiple `methods`. The `summary` method is used to compute the
# summary statistics.
method = "summary"

# This is the vantage6 action that will be executed. The `central_compute` action is
# an action that allows the method to create sub tasks.
action = "central_compute"

# NOTE --- CHANGE THE DATABASES TO THE DATABASES OF THE COHORT ---
# The user can select multiple cohorts in the RAVEN UI to be analysed at the same time.
databases = [
    [
        # In case the user selected multiple cohorts, we need to add one dict per
        # cohort here. Each dict needs to contain the dataframe ID that you obtained
        # earlier.
        {
            "type": "dataframe",
            "dataframe_id": DATAFRAME_ID
        }
        # {
        #     "type": "dataframe",
        #     "dataframe_id": DATAFRAME_ID_1
        # },
        # {
        #     "type": "dataframe",
        #     "dataframe_id": DATAFRAME_ID_2
        # }
    ]
]

# In case of the summary statistics we want to include all the organizations and all
# the columns (features/variables) of the cohort. For now we include only a subset of
# the features.
# TODO include other features as well.
arguments = {
    "kwargs": {
        "columns": ["PATIENT_ID", "AGE", "TUMOR_SIZE", "N_CANCER_EPISODES", "SEX",
                    "STATUS"],
        # It is also possible to use a subset of organizations here, in case the user
        # makes a selection of organizations in the RAVEN UI.
        "organizations_to_include": ORGANIZATION_IDS
    }
}

In [71]:
payload = {
    "name": name,
    "image": image,
    "description": description,
    "action": action,
    "method": method,
    "organizations": [
        {
            # We only send the task to one organization, as this is a central compute.
            # The central compute will create tasks for all the organizations specified
            # in the `organizations_to_include` argument.
            "id": ORGANIZATION_IDS[0],
            "input": base64.b64encode(
                json.dumps(arguments).encode("UTF-8")
            ).decode("UTF-8")
        }
    ],
    "databases": databases,
    "session_id": SESSION_ID,
    "study_id": STUDY_ID
}
payload


{'name': 'Summary Statistics of Cohort 1',
 'image': 'harbor2.vantage6.ai/idea4rc/analytics:latest',
 'description': 'Summary statistics of the cohort',
 'action': 'central_compute',
 'method': 'summary',
 'organizations': [{'id': 1,
   'input': 'eyJrd2FyZ3MiOiB7ImNvbHVtbnMiOiBbIlBBVElFTlRfSUQiLCAiQUdFIiwgIlRVTU9SX1NJWkUiLCAiTl9DQU5DRVJfRVBJU09ERVMiLCAiU0VYIiwgIlNUQVRVUyJdLCAib3JnYW5pemF0aW9uc190b19pbmNsdWRlIjogWzFdfX0='}],
 'databases': [[{'type': 'dataframe', 'dataframe_id': 44}]],
 'session_id': 13,
 'study_id': 16}
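
The `input` field in the payload is the JSON-serialised `arguments` dict, base64 encoded. When debugging it can help to decode it again and check what was actually sent. Below is a minimal sketch of both directions. The helper names `encode_input` and `decode_input` are hypothetical and not part of vantage6.

```python
import base64
import json


def encode_input(arguments: dict) -> str:
    """Serialise the arguments to JSON and base64 encode them, as in the payload."""
    return base64.b64encode(json.dumps(arguments).encode("UTF-8")).decode("UTF-8")


def decode_input(encoded: str) -> dict:
    """Reverse of encode_input: base64 decode and parse the JSON."""
    return json.loads(base64.b64decode(encoded))


# Illustrative arguments only, shaped like the ones used in this notebook.
example_arguments = {
    "kwargs": {"columns": ["AGE", "SEX"], "organizations_to_include": [1]}
}
encoded_example = encode_input(example_arguments)
decoded_example = decode_input(encoded_example)
```

The same `decode_input` can be applied to the `input` field that the `/server/run` endpoint returns for each run.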

In [72]:
# Create a vantage6 task to execute the summary analysis.
response = requests.post(
    "https://orchestrator.idea.lst.tfo.upm.es/server/task",
    headers=headers,
    json=payload
)
TASK_ID = response.json()["id"]
response.json()

# The status of the task (in this case the task that computes the summary statistics)
# can be one of the following:
#
# - pending: The task is waiting to be executed.
# - active: The task is being executed.
# - completed: The task has finished successfully.
# - crashed: The task crashed. You probably want to inspect the logs.
#
# You should poll the status of the task until it reaches one of the final states:
# crashed or completed.

{'init_org': {'id': 1,
  'link': '/server/organization/1',
  'methods': ['PATCH', 'DELETE', 'GET']},
 'job_id': 56,
 'init_user': {'id': 7,
  'link': '/server/user/7',
  'methods': ['PATCH', 'DELETE', 'GET']},
 'description': 'Summary statistics of the cohort',
 'method': 'summary',
 'id': 77,
 'session': {'id': 13,
  'link': '/server/session/13',
  'methods': ['PATCH', 'DELETE', 'GET']},
 'dataframe': None,
 'status': 'awaiting',
 'databases': [{'label': None,
   'type': 'dataframe',
   'dataframe_id': 44,
   'dataframe_name': 'pedantic_chatterjee',
   'position': 0}],
 'parent': None,
 'created_at': '2025-07-08T08:35:44.436670',
 'children': '/server/task?parent_id=77',
 'collaboration': {'id': 2,
  'link': '/server/collaboration/2',
  'methods': ['PATCH', 'DELETE', 'GET']},
 'algorithm_store': None,
 'runs': '/server/run?task_id=77',
 'results': '/server/result?task_id=77',
 'required_by': [],
 'name': 'Summary Statistics of Cohort 1',
 'finished_at': None,
 'depends_on': [],
 'stud

In [73]:
# TODO Summary per CoE
# The status of the task (in this case the task that computes the summary statistics)
# can be one of the following:
#
# - pending: The task is waiting to be executed.
# - active: The task is being executed.
# - completed: The task has finished successfully.
# - crashed: The task crashed. You probably want to inspect the logs.
#
# You should poll the status of the task until it reaches one of the final states:
# crashed or completed.
response = requests.get(
    f"https://orchestrator.idea.lst.tfo.upm.es/server/run?task_id={TASK_ID}",
    headers=headers,
)
response.json()

{'data': [{'cleanup_at': None,
   'input': 'eyJrd2FyZ3MiOiB7ImNvbHVtbnMiOiBbIlBBVElFTlRfSUQiLCAiQUdFIiwgIlRVTU9SX1NJWkUiLCAiTl9DQU5DRVJfRVBJU09ERVMiLCAiU0VYIiwgIlNUQVRVUyJdLCAib3JnYW5pemF0aW9uc190b19pbmNsdWRlIjogWzFdfX0=',
   'finished_at': '2025-07-08T08:36:12.435413',
   'node': {'keycloak_id': '463fda17-82ac-45a5-8c14-aef4b89c9006',
    'status': 'online',
    'ip': None,
    'id': 10,
    'name': 'Testing-root-node'},
   'task': {'id': 77, 'link': '/server/task/77', 'methods': ['DELETE', 'GET']},
   'status': 'completed',
   'action': 'central_compute',
   'id': 82,
   'organization': {'id': 1,
    'link': '/server/organization/1',
    'methods': ['PATCH', 'DELETE', 'GET']},
   'ports': [],
   'log': "LOGS of POD run-82-bkcx7 (created by job run-82) \n\n info > wrapper for v6-analytics\ninfo > Reading input file /app/vantage6/task/input\ninfo > Dispatching ...\ninfo > Using dataframes decorator\ninfo > Using dataframes decorator\ninfo > Using dataframes decorator\ninfo > Module 'v6
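
The polling described above can be sketched as a small loop. This is only an illustration under a few assumptions: the final states are the two listed in this notebook (`completed` and `crashed`, the server may define more), every other status is treated as "still running", and `wait_for_runs` is a hypothetical helper name. The function takes a callable so it can be tried without a server.

```python
import time

# Final states as listed in this notebook; the server may define additional ones.
FINAL_STATES = {"completed", "crashed"}


def wait_for_runs(fetch_runs, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call fetch_runs() until all runs are in a final state; return their statuses.

    fetch_runs must return the parsed JSON of `/server/run?task_id=...`, i.e. a
    dict with a `data` list in which every item has a `status` field.
    """
    deadline = time.monotonic() + timeout
    while True:
        statuses = [run["status"] for run in fetch_runs()["data"]]
        if statuses and all(status in FINAL_STATES for status in statuses):
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Runs did not finish in time: {statuses}")
        sleep(interval)


# Usage in this notebook (requires the server, so it is left commented out):
# statuses = wait_for_runs(
#     lambda: requests.get(
#         f"https://orchestrator.idea.lst.tfo.upm.es/server/run?task_id={TASK_ID}",
#         headers=headers,
#     ).json()
# )
```

If one of the returned statuses is `crashed`, inspect the `log` field of that run before requesting the result.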

In [74]:
# When the task is completed, we can retrieve the result.
response = requests.get(
    f"https://orchestrator.idea.lst.tfo.upm.es/server/result?task_id={TASK_ID}",
    headers=headers,
)

# The result is a base64 encoded string. We need to decode it to get the actual result.
# Since this is a central task (see that we've only sent the task to one organization in
# the `organizations` argument in the original request), we can just take the first
# result.
json.loads(base64.b64decode(response.json()["data"][0]["result"]))

{'pedantic_chatterjee': {'numeric': {'PATIENT_ID': {'count': 10.0,
    'min': 1.0,
    'max': 10.0,
    'missing': 0.0,
    'sum': 55.0,
    'median': [5.5],
    'q_25': [3.25],
    'q_75': [7.75],
    'mean': 5.5,
    'std': 3.0276503540974917},
   'AGE': {'count': 10.0,
    'min': 19.0,
    'max': 74.0,
    'missing': 0.0,
    'sum': 459.0,
    'median': [49.0],
    'q_25': [24.75],
    'q_75': [65.0],
    'mean': 45.9,
    'std': 21.231266252078953},
   'TUMOR_SIZE': {'count': 8.0,
    'min': 1.1,
    'max': 9.1,
    'missing': 2.0,
    'sum': 31.6,
    'median': [3.9000000000000004],
    'q_25': [1.8],
    'q_75': [4.95],
    'mean': 3.95,
    'std': 2.5718253662108777},
   'N_CANCER_EPISODES': {'count': 10.0,
    'min': 1.0,
    'max': 1.0,
    'missing': 0.0,
    'sum': 10.0,
    'median': [1.0],
    'q_25': [1.0],
    'q_75': [1.0],
    'mean': 1.0,
    'std': 0.0}},
  'categorical': {'SEX': {'count': 10, 'missing': 0},
   'STATUS': {'count': 10, 'missing': 0}},
  'num_complete_
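
To display the numeric part of this result as a table in the RAVEN UI, the nested dict can be flattened into one row per column. A minimal sketch, assuming the structure shown in the output above (dataframe name, then `numeric`, then column, then statistics). The helper name `numeric_summary_rows` is hypothetical, and the sample below is a trimmed copy of the output above.

```python
def numeric_summary_rows(result: dict) -> list:
    """Flatten the `numeric` part of a summary result into one dict per column."""
    rows = []
    for dataframe_name, summary in result.items():
        for column, stats in summary.get("numeric", {}).items():
            row = {"dataframe": dataframe_name, "column": column}
            for stat, value in stats.items():
                # The quantiles arrive wrapped in a single-element list, e.g. [5.5].
                if isinstance(value, list) and len(value) == 1:
                    value = value[0]
                row[stat] = value
            rows.append(row)
    return rows


# Trimmed copy of the result shown above.
sample_result = {
    "pedantic_chatterjee": {
        "numeric": {
            "AGE": {"count": 10.0, "min": 19.0, "max": 74.0, "missing": 0.0,
                    "median": [49.0], "mean": 45.9},
            "TUMOR_SIZE": {"count": 8.0, "min": 1.1, "max": 9.1, "missing": 2.0,
                           "median": [3.9000000000000004], "mean": 3.95},
        },
        "categorical": {"SEX": {"count": 10, "missing": 0}},
    }
}
summary_rows = numeric_summary_rows(sample_result)
```

Each row can then be rendered directly, for example with `pandas.DataFrame(summary_rows)` if pandas is available.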

## 7. Collect algorithm metadata
*Collect input arguments and their types*

In [75]:
# Obtain a list of all available algorithms in the algorithm store. For now two algorithms
# are in there:
#
# - `sessions` (`harbor2.vantage6.ai/idea4rc/sessions:latest`). We've used this
#   algorithm already when creating the cohort.
# - `analytics` (`harbor2.vantage6.ai/idea4rc/analytics:latest`). This algorithm is
#   used to create the analytics. We've already used this algorithm for the computation
#   of the summary statistics. But this package currently also contains the crosstab
#   statistics and will be extended in the future with all the other analytics.
#
# In the response, each algorithm has one or more `functions`. The `functions` are
# actual Python functions that are executed on the data stations. A `function` in
# vantage6 expects a specific set of attributes that can be modified by the user. You
# should visualize these in a form in the RAVEN UI.
#
# 1. `databases`. A list of databases (typically only one) that will be supplied by the
#    node based on the `label` or `dataframe_id`. A `label` refers to the OMOP
#    database (in the IDEA4RC case) and is only used for the extraction algorithm (=
#    `create_cohort`). The `dataframe_id` refers to the cohort dataframe which can be
#    used for the analysis. In the extraction job we do not let the user select the
#    database, as we always use the OMOP database. So no need to visualize this. In the
#    analysis job we do let the user select the database, this happens when the user
#    selects a set of cohorts.
# 2. `arguments`. A list of arguments that can be modified by the user. In the case of
#    the extraction algorithm we have two arguments: `patient_ids` and `features`. The
#    `patient_ids` should be the list of patients coming from the cohort builder and
#    `features` should be the tumor type that was selected in the RAVEN workspace.
#
# The other important metadata are:
#
# 1. `name`. The method name. Depending on which method the user selects in the UI,
#    different arguments need to be provided. You also need to provide this `name` as
#    the `method` when creating a vantage6 task.
# 2. `image`. The image is the docker image that will be used to execute the function.
#    You also need to supply this when creating a vantage6 task.
#
# You should be able to use this metadata to create the interface in the RAVEN UI in
# order to create a task.
algorithms = requests.get(
    "https://orchestrator.idea.lst.tfo.upm.es/store/algorithm",
    headers=headers
)
algorithms.json()

{'data': [{'code_url': 'https://github.com/idea4rc/v6-analytics',
   'functions': [{'arguments': [{'is_frontend_only': False,
       'default_value': '',
       'display_name': 'Columns',
       'description': 'List of columns to compute the summary statistics for',
       'conditional_operator': None,
       'type': 'column_list',
       'conditional_value': None,
       'conditional_on_id': None,
       'name': 'columns',
       'has_default_value': False,
       'id': 10},
      {'is_frontend_only': False,
       'default_value': None,
       'display_name': 'Organizations to include',
       'description': 'The organizations to include in the task. If not given, all organizations in the collaboration are included.',
       'conditional_operator': None,
       'type': 'organization_list',
       'conditional_value': None,
       'conditional_on_id': None,
       'name': 'organizations_to_include',
       'has_default_value': True,
       'id': 11}],
     'step_type': 'central_comput
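
As a starting point for the form in the RAVEN UI, the response can be reduced to one entry per argument. This is a sketch under assumptions: the argument keys are taken from the output above, while the `image` key on the algorithm and the `name` key on the function are described in the comments above but not visible in the truncated output. Treating an argument without a default value as required is also an assumption. `form_fields` is a hypothetical helper name and the sample is a trimmed, partly illustrative response.

```python
def form_fields(store_response: dict) -> list:
    """Return one dict per function argument, ready to be rendered as a form field."""
    fields = []
    for algorithm in store_response["data"]:
        for function in algorithm["functions"]:
            for argument in function["arguments"]:
                fields.append({
                    "image": algorithm.get("image"),    # assumed key, see lead-in
                    "method": function.get("name"),     # assumed key, see lead-in
                    "argument": argument["name"],
                    "label": argument["display_name"],
                    "type": argument["type"],
                    # Assumption: no default value means the user must fill it in.
                    "required": not argument["has_default_value"],
                })
    return fields


# Trimmed sample shaped like the response above; `image` and the function
# `name` are filled in for illustration.
sample_store_response = {
    "data": [{
        "image": "harbor2.vantage6.ai/idea4rc/analytics:latest",
        "functions": [{
            "name": "summary",
            "arguments": [
                {"name": "columns", "display_name": "Columns",
                 "type": "column_list", "has_default_value": False},
                {"name": "organizations_to_include",
                 "display_name": "Organizations to include",
                 "type": "organization_list", "has_default_value": True},
            ],
        }],
    }]
}
fields = form_fields(sample_store_response)
```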

## 8. Create analytics

In [None]:
# This process is identical to the one for computing the summary statistics. However,
# you need to alter the method name and the arguments.

In [88]:
# Define the inputs of the vantage6 task that computes the crosstab statistics for the
# cohort dataframe(s).

#
# Static content
#

# This is the Docker image that contains the `method` to compute the crosstab
# statistics.
image = "harbor2.vantage6.ai/idea4rc/analytics:latest"

# This is the vantage6 action that will be executed. The `central_compute` action is
# an action that allows the method to create sub tasks.
action = "central_compute"

# The method is the vantage6 method that will be executed. The `crosstab` method is
# used to compute the crosstab statistics.
method = "crosstab"

#
# Dynamic content
#

# NOTE --- CHANGE THE NAME AND DESCRIPTION OF THE TASK ---
# The name (and description) of the task does not need to be unique. It is used to
# identify the task, so give it some meaningful name. For example include the cohorts
# that are being analysed.
name = "Crosstab Statistics of Cohort 1"
description = "Crosstab statistics of the cohort"

# NOTE --- CHANGE THE DATABASES TO THE DATABASES OF THE COHORT ---
databases = [
    [
        # In case the user selected multiple cohorts, we need to add one dict per
        # cohort here. Each dict needs to contain the dataframe ID that you obtained
        # earlier.
        {
            "type": "dataframe",
            "dataframe_id": DATAFRAME_ID
        }
    ]
]

# NOTE --- CHANGE THE INPUT TO THE INPUT OF THE TASK ---
# For the crosstab statistics we need to specify the columns that we want to use for
# the crosstab. In this case we use the `SEX` column as the result column and the
# `FNCLCC_GRADE` column as the group column. It is possible to select multiple columns
# for the group column. However, for privacy reasons we should not allow more than 2
# columns.
arguments = {
    "kwargs": {
        "results_col": "SEX",
        "group_cols": ["FNCLCC_GRADE"],
        # NOTE --- CHANGE THE ORGANIZATIONS TO USER SELECTION ---
        # It is also possible to use a subset of organizations here, in case the user
        # makes a selection of organizations in the RAVEN UI.
        "organizations_to_include": ORGANIZATION_IDS
    }
}
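
The limit on the number of group columns mentioned above can be checked before the task is created. A minimal sketch: the limit of 2 comes from the comment above and is enforced here on the client side only, and `crosstab_arguments` is a hypothetical helper name.

```python
# Privacy limit taken from the comment above. This check runs in the client only.
MAX_GROUP_COLS = 2


def crosstab_arguments(results_col, group_cols, organization_ids):
    """Build the `arguments` dict for the crosstab method, enforcing the limit."""
    group_cols = list(group_cols)
    if not 1 <= len(group_cols) <= MAX_GROUP_COLS:
        raise ValueError(
            f"Expected between 1 and {MAX_GROUP_COLS} group columns, "
            f"got {len(group_cols)}"
        )
    return {
        "kwargs": {
            "results_col": results_col,
            "group_cols": group_cols,
            "organizations_to_include": list(organization_ids),
        }
    }


example_crosstab_arguments = crosstab_arguments("SEX", ["FNCLCC_GRADE"], [1])
```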

In [89]:
payload = {
    "name": name,
    "image": image,
    "description": description,
    "action": action,
    "method": method,
    "organizations": [
        {
            # We only send the task to one organization, as this is a central compute.
            # The central compute will create tasks for all the organizations specified
            # in the `organizations_to_include` argument.
            "id": ORGANIZATION_IDS[0],
            "input": base64.b64encode(
                json.dumps(arguments).encode("UTF-8")
            ).decode("UTF-8")
        }
    ],
    "databases": databases,
    "session_id": SESSION_ID,
    "study_id": STUDY_ID
}
payload



{'name': 'Crosstab Statistics of Cohort 1',
 'image': 'harbor2.vantage6.ai/idea4rc/analytics:latest',
 'description': 'Crosstab statistics of the cohort',
 'action': 'central_compute',
 'method': 'crosstab',
 'organizations': [{'id': 1,
   'input': 'eyJrd2FyZ3MiOiB7InJlc3VsdHNfY29sIjogIlNFWCIsICJncm91cF9jb2xzIjogWyJGTkNMQ0NfR1JBREUiXSwgIm9yZ2FuaXphdGlvbnNfdG9faW5jbHVkZSI6IFsxXX19'}],
 'databases': [[{'type': 'dataframe', 'dataframe_id': 44}]],
 'session_id': 13,
 'study_id': 16}
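
Since the payload for the summary and the crosstab task differs only in a few values, it can be built by one function. A sketch, assuming every analysis in the `analytics` image follows the same pattern as the two shown in this notebook. `build_task_payload` is a hypothetical helper name, and the IDs in the example are the ones from the output above.

```python
import base64
import json


def build_task_payload(*, name, description, image, method, arguments,
                       dataframe_ids, organization_ids, session_id, study_id,
                       action="central_compute"):
    """Build the JSON body for `POST /server/task` as used in this notebook."""
    encoded_input = base64.b64encode(
        json.dumps(arguments).encode("UTF-8")
    ).decode("UTF-8")
    return {
        "name": name,
        "image": image,
        "description": description,
        "action": action,
        "method": method,
        # A central compute is sent to one organization only; it creates the
        # subtasks for the organizations in `organizations_to_include` itself.
        "organizations": [{"id": organization_ids[0], "input": encoded_input}],
        # One dict per selected cohort dataframe.
        "databases": [
            [{"type": "dataframe", "dataframe_id": dataframe_id}
             for dataframe_id in dataframe_ids]
        ],
        "session_id": session_id,
        "study_id": study_id,
    }


example_payload = build_task_payload(
    name="Crosstab Statistics of Cohort 1",
    description="Crosstab statistics of the cohort",
    image="harbor2.vantage6.ai/idea4rc/analytics:latest",
    method="crosstab",
    arguments={"kwargs": {"results_col": "SEX", "group_cols": ["FNCLCC_GRADE"],
                          "organizations_to_include": [1]}},
    dataframe_ids=[44],
    organization_ids=[1],
    session_id=13,
    study_id=16,
)
```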

In [90]:
# Create a vantage6 task to execute the crosstab analysis.
response = requests.post(
    "https://orchestrator.idea.lst.tfo.upm.es/server/task",
    headers=headers,
    json=payload
)
TASK_ID = response.json()["id"]
response.json()

# The status of the task (in this case the task that computes the crosstab statistics)
# can be one of the following:
#
# - pending: The task is waiting to be executed.
# - active: The task is being executed.
# - completed: The task has finished successfully.
# - crashed: The task crashed. You probably want to inspect the logs.
#
# You should poll the status of the task until it reaches one of the final states:
# crashed or completed.

{'init_org': {'id': 1,
  'link': '/server/organization/1',
  'methods': ['PATCH', 'DELETE', 'GET']},
 'job_id': 58,
 'init_user': {'id': 7,
  'link': '/server/user/7',
  'methods': ['PATCH', 'DELETE', 'GET']},
 'description': 'Crosstab statistics of the cohort',
 'method': 'crosstab',
 'id': 82,
 'session': {'id': 13,
  'link': '/server/session/13',
  'methods': ['PATCH', 'DELETE', 'GET']},
 'dataframe': None,
 'status': 'awaiting',
 'databases': [{'label': None,
   'type': 'dataframe',
   'dataframe_id': 44,
   'dataframe_name': 'pedantic_chatterjee',
   'position': 0}],
 'parent': None,
 'created_at': '2025-07-08T08:57:21.903117',
 'children': '/server/task?parent_id=82',
 'collaboration': {'id': 2,
  'link': '/server/collaboration/2',
  'methods': ['PATCH', 'DELETE', 'GET']},
 'algorithm_store': None,
 'runs': '/server/run?task_id=82',
 'results': '/server/result?task_id=82',
 'required_by': [],
 'name': 'Crosstab Statistics of Cohort 1',
 'finished_at': None,
 'depends_on': [],
 's

In [92]:

# TODO add status of the subtasks in each center
# The status of the task (in this case the task that computes the crosstab statistics)
# can be one of the following:
#
# - pending: The task is waiting to be executed.
# - active: The task is being executed.
# - completed: The task has finished successfully.
# - crashed: The task crashed. You probably want to inspect the logs.
#
# You should poll the status of the task until it reaches one of the final states:
# crashed or completed.
response = requests.get(
    f"https://orchestrator.idea.lst.tfo.upm.es/server/run?task_id={TASK_ID}",
    headers=headers,
)
response.json()

{'data': [{'cleanup_at': None,
   'input': 'eyJrd2FyZ3MiOiB7InJlc3VsdHNfY29sIjogIlNFWCIsICJncm91cF9jb2xzIjogWyJGTkNMQ0NfR1JBREUiXSwgIm9yZ2FuaXphdGlvbnNfdG9faW5jbHVkZSI6IFsxXX19',
   'finished_at': '2025-07-08T08:57:43.631326',
   'node': {'keycloak_id': '463fda17-82ac-45a5-8c14-aef4b89c9006',
    'status': 'online',
    'ip': None,
    'id': 10,
    'name': 'Testing-root-node'},
   'task': {'id': 82, 'link': '/server/task/82', 'methods': ['DELETE', 'GET']},
   'status': 'completed',
   'action': 'central_compute',
   'id': 87,
   'organization': {'id': 1,
    'link': '/server/organization/1',
    'methods': ['PATCH', 'DELETE', 'GET']},
   'ports': [],
   'log': "LOGS of POD run-87-wk2gh (created by job run-87) \n\n info > wrapper for v6-analytics\ninfo > Reading input file /app/vantage6/task/input\ninfo > Dispatching ...\ninfo > Using dataframes decorator\ninfo > Using dataframes decorator\ninfo > Using dataframes decorator\ninfo > Module 'v6-analytics' imported!\ninfo > Defining input

In [93]:
# Once the task is completed, we can retrieve the result.
response = requests.get(
    f"https://orchestrator.idea.lst.tfo.upm.es/server/result?task_id={TASK_ID}",
    headers=headers,
)
response.json()

# The result is a base64 encoded string. We need to decode it to get the actual result.
# Since this is a central task (see that we've only sent the task to one organization in
# the `organizations` argument in the original request), we can just take the first
# result.
json.loads(base64.b64decode(response.json()["data"][0]["result"]))

{'pedantic_chatterjee': {'contingency_table': [{'FNCLCC_GRADE': 'Grade 1 tumor',
    'MALE': '2',
    'FEMALE': '4',
    'Total': '6'},
   {'FNCLCC_GRADE': 'Grade 2 tumor', 'MALE': '1', 'FEMALE': '0', 'Total': '1'},
   {'FNCLCC_GRADE': 'Grade 3 tumor', 'MALE': '2', 'FEMALE': '1', 'Total': '3'},
   {'FNCLCC_GRADE': 'Total', 'MALE': '5', 'FEMALE': '5', 'Total': '10'}],
  'chi2': {'chi2': '2.0', 'P-value': '0.36787944117144245'}}}
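
Note that all counts in the result are strings. The sketch below converts them to integers and, as a sanity check, recomputes Pearson's chi-square statistic from the table. Whether the algorithm applies any correction is not shown in this notebook; the uncorrected statistic happens to match the reported values here. The closed form for the p-value used below, `exp(-chi2 / 2)`, is only valid for 2 degrees of freedom. Both helper names are hypothetical.

```python
import math


def parse_contingency_table(rows, group_col):
    """Convert the string counts to ints and drop the 'Total' row and column."""
    table = {}
    for row in rows:
        if row[group_col] == "Total":
            continue
        table[row[group_col]] = {
            key: int(value)
            for key, value in row.items()
            if key not in (group_col, "Total")
        }
    return table


def pearson_chi2(table):
    """Return (chi2, degrees_of_freedom) for a {group: {category: count}} table."""
    categories = sorted({c for counts in table.values() for c in counts})
    row_totals = {g: sum(counts.values()) for g, counts in table.items()}
    col_totals = {c: sum(counts[c] for counts in table.values()) for c in categories}
    n = sum(row_totals.values())
    chi2 = 0.0
    for group, counts in table.items():
        for category in categories:
            expected = row_totals[group] * col_totals[category] / n
            chi2 += (counts[category] - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(categories) - 1)
    return chi2, dof


# The contingency table from the result above.
rows = [
    {"FNCLCC_GRADE": "Grade 1 tumor", "MALE": "2", "FEMALE": "4", "Total": "6"},
    {"FNCLCC_GRADE": "Grade 2 tumor", "MALE": "1", "FEMALE": "0", "Total": "1"},
    {"FNCLCC_GRADE": "Grade 3 tumor", "MALE": "2", "FEMALE": "1", "Total": "3"},
    {"FNCLCC_GRADE": "Total", "MALE": "5", "FEMALE": "5", "Total": "10"},
]
table = parse_contingency_table(rows, "FNCLCC_GRADE")
chi2, dof = pearson_chi2(table)
# For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
p_value = math.exp(-chi2 / 2) if dof == 2 else None
```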