# Step 2: Retrieve averages

As we cannot see the data in our collaboration directly, we need to obtain basic information about it in another way. Preferably, every node should have FAIR data descriptions, but these are not always available. In this notebook, we are going to retrieve the *global* average for a specific column, that is, the average calculated over the values of that column at all available nodes.

How this works is as follows:

1. Each node calculates the sum of the given column, and the number of values on which this sum is based.
2. The *aggregation node* calculates the sum of these sums, and divides it by the sum of the counts. This can be denoted as: $$ globalAverage = \dfrac{\sum_{n=1}^{m}{S_n}}{\sum_{n=1}^{m}{C_n}} $$ where $S_n$ is the sum reported by node $n$, $C_n$ is the number of values reported by node $n$, and $m$ is the number of nodes.
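
The two steps above can be sketched in plain Python. This is only an illustration of the arithmetic, not the code of the algorithm image used later in this notebook; the function names `node_partial` and `global_average` and the example values are made up for this sketch.

```python
from typing import Iterable, List, Tuple


def node_partial(values: List[float]) -> Tuple[float, int]:
    """Step 1 (runs at a node): return the sum and the number of values."""
    return (sum(values), len(values))


def global_average(partials: Iterable[Tuple[float, int]]) -> float:
    """Step 2 (runs at the aggregation node): sum of sums over sum of counts."""
    total_sum = 0.0
    total_count = 0
    for node_sum, node_count in partials:
        total_sum += node_sum
        total_count += node_count
    if total_count == 0:
        raise ValueError("no values were reported by any node")
    return total_sum / total_count


# Hypothetical data held by two nodes (never shared in the real setting).
node_a = [60, 70, 80]
node_b = [50]

partials = [node_partial(node_a), node_partial(node_b)]
print(global_average(partials))  # 65.0
```

Note that only the sum and the count leave each node. Averaging the per-node averages instead (70 and 50 in this example) would give 60, which is wrong whenever the nodes hold different numbers of values.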

Access to this collaboration and its connected nodes has been arranged by the central server. Using the given username, password and server URL, we can connect to our collaboration.

**Task: fill in the correct connection details in the cell below, and execute the first two cells**

In [None]:
vantage_broker_url = "http://home.johanvansoest.nl"
vantage_broker_username = "node1-user"
vantage_broker_password = "node1-password"

vantage_broker_encryption = None
vantage_broker_port = 5000
vantage_broker_api_path = "/api"

In [None]:
! pip install vantage6
# Setup client connection
from vantage6.client import Client
client = Client(vantage_broker_url, vantage_broker_port, vantage_broker_api_path)
client.authenticate(vantage_broker_username, vantage_broker_password)
client.setup_encryption(vantage_broker_encryption)

We are now connected to the Vantage central server, and have access to several collaborations.

**Task: execute the cell below. To which collaboration(s) do we have access? And which nodes are available in this collaboration?**

In [None]:
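import json

# List all collaborations this user can access.
collaboration_list = client.collaboration.list()
collaboration_index = 0  # use the first collaboration in the list

# Collect the IDs of the organizations in the chosen collaboration.
organization_ids_ = []
for organization in collaboration_list[collaboration_index]['organizations']:
    organization_ids_.append(organization['id'])

# Uncomment to inspect the collaborations and their organizations:
# print(json.dumps(collaboration_list, indent=2))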

Now that we know the collaboration, we can post a request to the central server. In this request, we ask for the average of a specified column. This is done by requesting execution of the Docker image named `jaspersnel/v6-average-py`. This image needs some input arguments, which are filled in on the `kwargs` line as shown below.

**Task: execute the cell below.**

In [None]:
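# Input for the algorithm: call its "master" method for the column "age".
input_ = {
    "master": "true",
    "method": "master",
    "args": [],
    "kwargs": {"column_name": "age"}
}

# Post the task; it is sent only to the first organization in the collaboration.
task = client.post_task(
    name="RetrieveVariables",
    image="jaspersnel/v6-average-py",
    collaboration_id=collaboration_list[collaboration_index]['id'],
    input_=input_,
    organization_ids=[organization_ids_[0]]
)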

The request has been sent to the given collaboration. Now we can fetch the results. As we do not know when the nodes will be finished, we use a waiting loop: the `while` loop in the cell below checks every 5 seconds whether the result is ready, and gives up after 10 attempts.

**Task: execute the cell below. What is the average?**

In [None]:
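import time

# Take the first result reference returned for the task and fetch it.
resultObjRef = task.get("results")[0]
resultObj = client.result.get(resultObjRef['id'])

# Poll every 5 seconds, for at most 10 attempts, until the result is finished.
attempts = 1
while resultObj["finished_at"] is None and attempts < 10:
    print("waiting...")
    time.sleep(5)
    resultObj = client.result.get(resultObjRef['id'])
    attempts += 1

# Show the global average as the cell output.
resultObj['result']['average']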
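
The waiting loop can also be written as a small reusable helper, which is convenient when repeating the request for several columns. The sketch below is a generic illustration and not part of the vantage6 client: `wait_for_result` and the stand-in `fake_fetch` are hypothetical names, and the only assumption taken from the cell above is that an unfinished result has `finished_at` set to `None`.

```python
import time
from typing import Any, Callable, Dict, Optional


def wait_for_result(
    fetch: Callable[[], Dict[str, Any]],
    max_attempts: int = 10,
    delay_seconds: float = 5.0,
) -> Optional[Dict[str, Any]]:
    """Call fetch() until the result is finished or max_attempts is reached.

    Returns the finished result, or None if it never finished in time.
    """
    result = fetch()
    attempts = 1
    while result.get("finished_at") is None and attempts < max_attempts:
        print("waiting...")
        time.sleep(delay_seconds)
        result = fetch()
        attempts += 1
    if result.get("finished_at") is None:
        return None
    return result


# Stand-in for the server: reports "finished" on the third call.
calls = {"count": 0}


def fake_fetch() -> Dict[str, Any]:
    calls["count"] += 1
    finished = calls["count"] >= 3
    return {
        "finished_at": "done" if finished else None,
        "result": {"average": 65.0} if finished else None,
    }


finished_result = wait_for_result(fake_fetch, delay_seconds=0)
print(finished_result["result"]["average"])
```

In this notebook, `fetch` could be something like `lambda: client.result.get(resultObjRef['id'])`. Returning `None` on a timeout makes it explicit that no average is available yet, instead of failing when the unfinished result is indexed.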

Repeat the above two cells to find the average for other columns.

**Task: what is the average for the clinical T stage, and clinical N stage?**