# Node sets and querying
## Introduction
In this tutorial we cover node sets: what they are and how to use them. We'll also demonstrate the different kinds of queries in SNAP.

## Preamble
The code in this section is identical to the code in sections "Introduction" and "Loading" from the previous tutorial. It assumes that you have already downloaded the circuit. If not, take a look at the notebook **01_circuits** (Downloading a circuit).

In [1]:
import bluepysnap
import pandas as pd

circuit_path = "sonata/circuit_sonata.json"
circuit = bluepysnap.Circuit(circuit_path)

## Node sets
As briefly mentioned in the [node properties notebook](./03_node_properties.ipynb), a node set is a predetermined collection of queries for nodes. Node sets are saved in a JSON file, which is usually added to the circuit and/or simulation config. For a more in-depth explanation, please see: [SONATA Node Sets - Circuit Documentation](https://sonata-extension.readthedocs.io/en/latest/sonata_nodeset.html).

We can access node sets directly in SNAP if they are added to the circuit config:

In [2]:
circuit.node_sets.content

{'Mosaic': ['All'],
 'All': ['thalamus_neurons'],
 'thalamus_neurons': {'population': 'thalamus_neurons'},
 'Excitatory': {'synapse_class': 'EXC'},
 'Inhibitory': {'synapse_class': 'INH'},
 'Rt_RC': {'mtype': 'Rt_RC'},
 'VPL_IN': {'mtype': 'VPL_IN'},
 'VPL_TC': {'mtype': 'VPL_TC'},
 'bAC_IN': {'etype': 'bAC_IN'},
 'cAD_noscltb': {'etype': 'cAD_noscltb'},
 'cNAD_noscltb': {'etype': 'cNAD_noscltb'},
 'dAD_ltb': {'etype': 'dAD_ltb'},
 'dNAD_ltb': {'etype': 'dNAD_ltb'},
 'mc0;Rt': {'region': 'mc0;Rt'},
 'mc0;VPL': {'region': 'mc0;VPL'},
 'mc1;Rt': {'region': 'mc1;Rt'},
 'mc1;VPL': {'region': 'mc1;VPL'},
 'mc2;Rt': {'region': 'mc2;Rt'},
 'mc2;VPL': {'region': 'mc2;VPL'},
 'mc3;Rt': {'region': 'mc3;Rt'},
 'mc3;VPL': {'region': 'mc3;VPL'},
 'mc4;Rt': {'region': 'mc4;Rt'},
 'mc4;VPL': {'region': 'mc4;VPL'},
 'mc5;Rt': {'region': 'mc5;Rt'},
 'mc5;VPL': {'region': 'mc5;VPL'},
 'mc6;Rt': {'region': 'mc6;Rt'},
 'mc6;VPL': {'region': 'mc6;VPL'},
 'IN': {'mtype': {'$regex': '.*IN'}, 'region': {'$reg

To prove a point, let's query some ids using a node set and compare the result with that of an equivalent `dict` query:

In [3]:
node_set_result = circuit.nodes.ids('VPL_TC')
print(f'Ids found: {len(node_set_result)}')

query_result = circuit.nodes.ids({'mtype': 'VPL_TC'})
print(f'Queries result in the same outcome : {node_set_result == query_result}')

Ids found: 64856
Queries result in the same outcome : True


We'll go deeper into querying later in this document. For now, let's go over other aspects of node sets.

### Node sets' features / use cases
Sometimes, we may want to work with node sets that aren't found in a circuit or simulation config. This can be because we
* are experimenting
* can't / don't want to modify the config file
* want to combine node sets
* etc.

First of all, let's see how we can open / create node sets.

#### Opening a node set file
For demonstration purposes, let's open the circuit's node sets from a file:

In [4]:
node_sets_circuit = bluepysnap.node_sets.NodeSets.from_file('./sonata/networks/nodes/node_sets.json')
print(f"Contents match: {node_sets_circuit.content == circuit.node_sets.content}")

Contents match: True
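
Since a node sets file is plain JSON, one can also be produced with the standard library alone. The following is a minimal sketch, not part of SNAP: the file name `my_node_sets.json` and the two node sets are made up for illustration, and the file is written to a temporary directory. A file written this way has the same shape as the content shown above, so it should presumably be loadable with `NodeSets.from_file`.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical node sets, written in the same shape as the content shown above.
example_node_sets = {
    "VPL_TC": {"mtype": "VPL_TC"},        # property -> value
    "nodes_0-2": {"node_id": [0, 1, 2]},  # explicit list of node ids
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # "my_node_sets.json" is a made-up file name for this sketch.
    node_sets_path = Path(tmp_dir) / "my_node_sets.json"
    node_sets_path.write_text(json.dumps(example_node_sets, indent=2))

    # Reading the file back gives the same dict we started from.
    reloaded = json.loads(node_sets_path.read_text())

print(reloaded == example_node_sets)
```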


#### Creating node sets on the fly

If we want to, for example, test node sets without having to write them to a file and load that file over and over again, we can create node sets directly from a dict:

In [5]:
node_sets_0_9 = bluepysnap.node_sets.NodeSets.from_dict({'nodes_0-9': {'node_id': [*range(10)]}})
node_sets_0_9.content

{'nodes_0-9': {'node_id': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}}

This may be handy if you can't modify the existing node sets file but want to augment it with new node sets.

#### Combining node sets
So now that we have two node sets objects, `node_sets_circuit` read from a file and `node_sets_0_9` created from a dict, let's combine them. Naturally, we could also open two node sets files and combine them.

For this purpose, node sets objects have a `NodeSets.update()` method. `update` takes another node sets object as an argument and adds all of its node sets to itself.

Let's update the `node_sets_circuit` with `node_sets_0_9`:

In [6]:
node_sets_circuit.update(node_sets_0_9)
node_sets_circuit.content

{'Mosaic': ['All'],
 'All': ['thalamus_neurons'],
 'thalamus_neurons': {'population': 'thalamus_neurons'},
 'Excitatory': {'synapse_class': 'EXC'},
 'Inhibitory': {'synapse_class': 'INH'},
 'Rt_RC': {'mtype': 'Rt_RC'},
 'VPL_IN': {'mtype': 'VPL_IN'},
 'VPL_TC': {'mtype': 'VPL_TC'},
 'bAC_IN': {'etype': 'bAC_IN'},
 'cAD_noscltb': {'etype': 'cAD_noscltb'},
 'cNAD_noscltb': {'etype': 'cNAD_noscltb'},
 'dAD_ltb': {'etype': 'dAD_ltb'},
 'dNAD_ltb': {'etype': 'dNAD_ltb'},
 'mc0;Rt': {'region': 'mc0;Rt'},
 'mc0;VPL': {'region': 'mc0;VPL'},
 'mc1;Rt': {'region': 'mc1;Rt'},
 'mc1;VPL': {'region': 'mc1;VPL'},
 'mc2;Rt': {'region': 'mc2;Rt'},
 'mc2;VPL': {'region': 'mc2;VPL'},
 'mc3;Rt': {'region': 'mc3;Rt'},
 'mc3;VPL': {'region': 'mc3;VPL'},
 'mc4;Rt': {'region': 'mc4;Rt'},
 'mc4;VPL': {'region': 'mc4;VPL'},
 'mc5;Rt': {'region': 'mc5;Rt'},
 'mc5;VPL': {'region': 'mc5;VPL'},
 'mc6;Rt': {'region': 'mc6;Rt'},
 'mc6;VPL': {'region': 'mc6;VPL'},
 'IN': {'mtype': {'$regex': '.*IN'}, 'region': {'$reg

After the update, the node sets object also contains the newly created node set `nodes_0-9`.

**WARNING:** if the node sets object already contains node sets with the same names as those in the update, those node sets will be overwritten. The names of the overwritten node sets are returned by the `update` method:

In [7]:
# Let's overwrite 'nodes_0-9'
fake_0_9_node_set = bluepysnap.node_sets.NodeSets.from_dict({'nodes_0-9': {'node_id': [1]}})
overwritten = node_sets_circuit.update(fake_0_9_node_set)
print(f'Overwritten node sets: {overwritten}')
print(f'content["nodes_0-9"]: {node_sets_circuit.content["nodes_0-9"]}')

Overwritten node sets: {'nodes_0-9'}
content["nodes_0-9"]: {'node_id': [1]}
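
The overwrite behaviour above resembles a plain `dict` merge that also reports which keys collided. Below is a minimal sketch in plain Python. It is not SNAP's implementation, and `update_node_sets` is a hypothetical helper that works on ordinary dicts rather than on `NodeSets` objects.

```python
def update_node_sets(current, new):
    """Merge `new` into `current` in place and return the overwritten names."""
    overwritten = set(current) & set(new)  # names present in both dicts
    current.update(new)                    # entries from `new` win
    return overwritten


# Toy content in the same shape as the node sets used above.
content = {
    "VPL_TC": {"mtype": "VPL_TC"},
    "nodes_0-9": {"node_id": list(range(10))},
}
overwritten_names = update_node_sets(content, {"nodes_0-9": {"node_id": [1]}})

print(overwritten_names)     # {'nodes_0-9'}
print(content["nodes_0-9"])  # {'node_id': [1]}
```

To avoid overwriting anything by accident, the same set intersection can be computed on the two `content` dicts before calling `update`.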


#### Compound node sets
Compound node sets are node sets that are composed of other node sets. Let's create a node sets object with one compound node set:

In [8]:
node_sets_compound = bluepysnap.node_sets.NodeSets.from_dict({
    'nodes_0-4': {'node_id': [*range(5)]},
    'nodes_5-9': {'node_id': [*range(5,10)]},
    'nodes_0-9': ['nodes_0-4', 'nodes_5-9'], # Compound node set with node set names in a list results in OR case
})
node_sets_compound.content

{'nodes_0-4': {'node_id': [0, 1, 2, 3, 4]},
 'nodes_5-9': {'node_id': [5, 6, 7, 8, 9]},
 'nodes_0-9': ['nodes_0-4', 'nodes_5-9']}

**NOTE:** Compound node sets always represent "OR" instead of "AND". I.e., the queries return results belonging to any of the node sets listed in a compound node set.
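
To make the "OR" semantics concrete, here is a minimal sketch in plain Python of how a compound node set can be resolved into a union of ids. It is an illustration only: `resolve_ids` is a hypothetical helper, it handles only basic node sets defined by `node_id`, and it does not reflect how SNAP resolves node sets internally.

```python
def resolve_ids(name, node_sets, seen=frozenset()):
    """Return the set of node ids selected by `name` (union over compound members)."""
    if name in seen:
        raise ValueError(f"Circular reference in node set: {name!r}")
    definition = node_sets[name]
    if isinstance(definition, list):
        # Compound node set: OR (union) of all the referenced node sets.
        ids = set()
        for member in definition:
            ids |= resolve_ids(member, node_sets, seen | {name})
        return ids
    # Basic node set: this sketch only understands explicit id lists.
    return set(definition["node_id"])


node_sets = {
    "nodes_0-4": {"node_id": list(range(5))},
    "nodes_5-9": {"node_id": list(range(5, 10))},
    "nodes_0-9": ["nodes_0-4", "nodes_5-9"],
}
resolved = sorted(resolve_ids("nodes_0-9", node_sets))
print(resolved)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```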

#### Referring to a node set in a `NodeSets` object
A `NodeSets` object works much like a `dict`: to refer to a specific node set, use the same syntax as with a `dict`:

In [9]:
node_sets_circuit['VPL_TC']

<bluepysnap.node_sets.NodeSet at 0x7fffa42ddc30>

Above, we got a `NodeSet` (not `NodeSets`!) object, i.e., one instance of a node set. For our purposes, we don't really have to know what it is, as long as we know how to access it. This will come in handy in querying.

## Querying

This section presents different ways to query data from nodes and edges. Since we just went over node sets, let's put them into the spotlight and see how we can utilize them in queries.

### Queries with node sets
As we know, we can use the plain node set name as a string (e.g., `circuit.nodes.ids('VPL_TC')`) to use a node set for queries. However, this only works for the node sets integrated in the circuit.

Luckily, if we want to use node sets external to the circuit config, we can do so by just passing the `NodeSet` object as a query:

In [10]:
# circuit.nodes.ids('nodes_0-9')  # This would raise an error: BluepySnapError: Undefined node set: 'nodes_0-9'
circuit.nodes.ids(node_sets_compound['nodes_0-9'])

CircuitNodeIds([('CorticoThalamic_projections', 0),
            ('CorticoThalamic_projections', 1),
            ('CorticoThalamic_projections', 2),
            ('CorticoThalamic_projections', 3),
            ('CorticoThalamic_projections', 4),
            ('CorticoThalamic_projections', 5),
            ('CorticoThalamic_projections', 6),
            ('CorticoThalamic_projections', 7),
            ('CorticoThalamic_projections', 8),
            ('CorticoThalamic_projections', 9),
            ('MedialLemniscus_projections', 0),
            ('MedialLemniscus_projections', 1),
            ('MedialLemniscus_projections', 2),
            ('MedialLemniscus_projections', 3),
            ('MedialLemniscus_projections', 4),
            ('MedialLemniscus_projections', 5),
            ('MedialLemniscus_projections', 6),
            ('MedialLemniscus_projections', 7),
            ('MedialLemniscus_projections', 8),
            ('MedialLemniscus_projections', 9),
            (           'thalamus_ne

## Querying in general
We can query data based on mtype, etype, node_id, region, layer or any of the properties the nodes / edges have:

In [11]:
circuit.nodes.property_names

{'@dynamics:holding_current',
 '@dynamics:threshold_current',
 'etype',
 'layer',
 'model_template',
 'model_type',
 'morph_class',
 'morphology',
 'mtype',
 'orientation_w',
 'orientation_x',
 'orientation_y',
 'orientation_z',
 'region',
 'rotation_angle_xaxis',
 'rotation_angle_yaxis',
 'rotation_angle_zaxis',
 'synapse_class',
 'x',
 'y',
 'z'}

When the query is a `dict`, a `list` value is (usually) treated as an "OR" condition, and the keys of the query are combined as an "AND" condition. E.g.,
```python
circuit.nodes.ids({                    # give me the ids of nodes that
    'mtype': ['VPL_TC', 'VPL_IN'],     # have mtype of 'VPL_TC' or 'VPL_IN' and
    'population': 'thalamus_neurons'   # belong to a population 'thalamus_neurons'
})  
```
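
The AND-over-keys, OR-over-list-values rule can be spelled out in a few lines of plain Python. This is a sketch of the semantics only, using made-up node records and a hypothetical `matches` helper. It covers categorical values and ignores the special cases (such as float ranges) that the "usually" above hints at.

```python
def matches(node, query):
    """True if `node` satisfies every key of `query` (AND); a list value means OR."""
    for key, wanted in query.items():
        allowed = wanted if isinstance(wanted, list) else [wanted]
        if node.get(key) not in allowed:
            return False
    return True


# Made-up node records for illustration.
nodes = [
    {"node_id": 0, "population": "thalamus_neurons", "mtype": "Rt_RC"},
    {"node_id": 1, "population": "thalamus_neurons", "mtype": "VPL_TC"},
    {"node_id": 2, "population": "thalamus_neurons", "mtype": "VPL_IN"},
    {"node_id": 0, "population": "other_population", "mtype": "VPL_TC"},
]
query = {"mtype": ["VPL_TC", "VPL_IN"], "population": "thalamus_neurons"}

selected = [node["node_id"] for node in nodes if matches(node, query)]
print(selected)  # [1, 2]
```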

Let's start with simple examples and work our way up to more complex queries.

### Querying with ids

#### Integers as `id`s
There are a few ways to query using ids. The simplest is to use a single integer or a list of integers:

In [12]:
print(circuit.nodes.ids(1))
print(circuit.nodes.ids([1]))

CircuitNodeIds([('CorticoThalamic_projections', 1),
            ('MedialLemniscus_projections', 1),
            (           'thalamus_neurons', 1)],
           names=['population', 'node_ids'])
CircuitNodeIds([('CorticoThalamic_projections', 1),
            ('MedialLemniscus_projections', 1),
            (           'thalamus_neurons', 1)],
           names=['population', 'node_ids'])


This returns the matching nodes from every population that contains the given id(s).

If we want to specify a population, we can do the above query with a `dict` instead:

In [13]:
circuit.nodes.ids({'node_id': 1, 'population': 'thalamus_neurons'})

CircuitNodeIds([('thalamus_neurons', 1)],
           names=['population', 'node_ids'])

#### Using `CircuitNodeIds` / `CircuitEdgeIds`

We could also use `CircuitNodeIds` (`CircuitEdgeIds` for edges) to specify the population and the node ids to consider. These objects are the same objects returned from the `ids` function and we wouldn't generally need to create them by hand.

The main thing you need to know is that these objects are returned from the `ids` functions and can be passed directly to the `get` functions.

For the sake of an example, let's create a `CircuitNodeIds` object. There are several methods to instantiate them, but we'll use `from_dict` here:

In [14]:
ids_from_dict = bluepysnap.circuit_ids.CircuitNodeIds.from_dict({'thalamus_neurons': [1]})
print(ids_from_dict)

CircuitNodeIds([('thalamus_neurons', 1)],
           names=['population', 'node_ids'])


As you can see, it's exactly what was returned by the `circuit.nodes.ids` function. Let's use it to do a `get` query:

In [15]:
result = circuit.nodes.get(ids_from_dict, properties=['layer'])
pd.concat([df for _,df in result])

| population | node_ids | layer |
| --- | --- | --- |
| thalamus_neurons | 1 | Rt |


#### Resolved `ids` in `get` queries
As mentioned, we can use the output of the `ids` function as an argument to the `get` function:
```python
ids = circuit.nodes.ids({'node_id': [1, 2], 'population': 'thalamus_neurons'})
circuit.nodes.get(ids, properties=['layer'])
```
but, better yet, we can just pass the query to the `get` function as the `id`s are resolved internally:

In [16]:
result = circuit.nodes.get({'node_id': [1, 2], 'population': 'thalamus_neurons'}, properties=['layer'])
pd.concat([df for _,df in result])

| population | node_ids | layer |
| --- | --- | --- |
| thalamus_neurons | 1 | Rt |
| thalamus_neurons | 2 | Rt |


### Querying with regex
We can use regex to query data (e.g., to cover "OR" cases) by using the key `$regex` and specifying an expression as a `str`. For example, let's search `thalamus_neurons` for nodes whose `mtype` starts with `VPL`:



In [17]:
circuit.nodes['thalamus_neurons'].get({"mtype": {"$regex": "^VPL_.*"}}, properties=['mtype'])

| node_ids | mtype |
| --- | --- |
| 5002 | VPL_TC |
| 5003 | VPL_TC |
| 5004 | VPL_TC |
| 5005 | VPL_TC |
| 5006 | VPL_TC |
| ... | ... |
| 100760 | VPL_IN |
| 100761 | VPL_IN |
| 100762 | VPL_IN |
| 100763 | VPL_IN |
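
To check which values a pattern would select before running it against the circuit, the expression can be tried on a handful of strings with Python's `re` module. This is a sketch under one assumption: it uses `fullmatch`, i.e., it assumes the pattern must match the whole value, which this notebook does not state. For the two patterns and three mtypes below, matching only from the start of the string would give the same result.

```python
import re

# The three mtypes that appear in the circuit's node sets above.
mtypes = ["Rt_RC", "VPL_IN", "VPL_TC"]

# Pattern from the query above: mtypes starting with "VPL_".
starts_with_vpl = re.compile(r"^VPL_.*")
# Pattern from the 'IN' node set shown earlier: mtypes ending in "IN".
ends_with_in = re.compile(r".*IN")

vpl_matches = [m for m in mtypes if starts_with_vpl.fullmatch(m)]
in_matches = [m for m in mtypes if ends_with_in.fullmatch(m)]

print(vpl_matches)  # ['VPL_IN', 'VPL_TC']
print(in_matches)   # ['VPL_IN']
```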


### Querying ranges of values

As shown above, we can query data based on discrete values of properties such as mtype, layer, etc. However, not all properties are created equal.

When there is a `list` in the query, it is _usually_ considered an "OR" condition. There is one exception and that is numeric properties represented by `float`s. Whenever a property is represented by a `float`, a `list` will need to specify exactly two values: start of a range and end of a range (i.e., minimum and maximum values to consider).

Let's take a look at which properties in `'thalamus_neurons'` are floats:

In [18]:
circuit.nodes['thalamus_neurons'].property_dtypes

@dynamics:holding_current       float32
@dynamics:threshold_current     float32
etype                          category
layer                            object
model_template                   object
model_type                       object
morph_class                    category
morphology                     category
mtype                          category
orientation_w                   float32
orientation_x                   float32
orientation_y                   float32
orientation_z                   float32
region                           object
rotation_angle_xaxis            float32
rotation_angle_yaxis            float32
rotation_angle_zaxis            float32
synapse_class                  category
x                               float32
y                               float32
z                               float32
dtype: object

Now, let's take `x`, `y`, `z` and query nodes that are inside a 50x50x50 box limited by:
* `100 <= x <= 150`
* `500 <= y <= 550`
* `400 <= z <= 450`

In [19]:
circuit.nodes['thalamus_neurons'].get({
    'x': [100, 150],
    'y': [500, 550],
    'z': [400, 450],    
}, properties=['x', 'y', 'z'])

| node_ids | x | y | z |
| --- | --- | --- | --- |
| 26101 | 109.218941 | 526.994873 | 433.528503 |
| 27832 | 116.552238 | 519.171082 | 414.830933 |
| 27993 | 100.712425 | 535.644287 | 401.546112 |
| 28285 | 133.70787 | 531.594727 | 444.947723 |
| 28377 | 145.187637 | 547.493591 | 403.406708 |
| 28378 | 127.572868 | 538.761536 | 422.776123 |
| 28380 | 143.356384 | 525.541626 | 409.481781 |
| 28381 | 135.235031 | 517.312378 | 428.180695 |
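
The range semantics can be illustrated with a small plain-Python filter. This is a sketch, not SNAP's implementation: `in_box` is a hypothetical helper, the first point is copied from the output above, and the other two points (ids `900001` and `900002`) are made up to show the inclusive bounds and a miss.

```python
def in_box(point, box):
    """True if every coordinate lies within its inclusive [low, high] range."""
    return all(low <= point[axis] <= high for axis, (low, high) in box.items())


box = {"x": [100, 150], "y": [500, 550], "z": [400, 450]}

points = {
    26101: {"x": 109.218941, "y": 526.994873, "z": 433.528503},  # from the output above
    900001: {"x": 100.0, "y": 550.0, "z": 400.0},   # made up: exactly on the bounds
    900002: {"x": 160.0, "y": 520.0, "z": 420.0},   # made up: x is out of range
}

inside = [node_id for node_id, point in points.items() if in_box(point, box)]
print(inside)  # [26101, 900001]
```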


### Complex (or rather, combined) queries
In some cases, we might want to specify our queries further, which makes them complex but not necessarily complicated.

For example, the issue with a compound node set is that it is always treated as an "OR" condition, since it's a list.
What if you wanted to combine two node sets as an "AND" condition? For this case, there is a key `$node_set`, which we must combine with yet another key, `$and`:

In [20]:
circuit.nodes.ids({
    '$and': [  # list of queries that are considered as: AND conditions
         {'$node_set': 'thalamus_neurons'},
         {'$node_set': 'Rt_RC'}
]})

CircuitNodeIds([('thalamus_neurons',     0),
            ('thalamus_neurons',     1),
            ('thalamus_neurons',     2),
            ('thalamus_neurons',     3),
            ('thalamus_neurons',     4),
            ('thalamus_neurons',     5),
            ('thalamus_neurons',     6),
            ('thalamus_neurons',     7),
            ('thalamus_neurons',     8),
            ('thalamus_neurons',     9),
            ...
            ('thalamus_neurons', 91247),
            ('thalamus_neurons', 91248),
            ('thalamus_neurons', 91249),
            ('thalamus_neurons', 91250),
            ('thalamus_neurons', 91251),
            ('thalamus_neurons', 91252),
            ('thalamus_neurons', 91253),
            ('thalamus_neurons', 91254),
            ('thalamus_neurons', 91255),
            ('thalamus_neurons', 91256)],
           names=['population', 'node_ids'], length=35567)

This, however, cannot be done with node sets external to the circuit.

It may also be that you want to query something as an "OR" condition rather than "AND".
For example querying
```python
circuit.nodes.ids({'mtype': 'VPL_IN', 'region': 'mc2;Rt'})
```
will never return any ids for this circuit, because both conditions must hold at once and no `VPL_IN` node is located in `mc2;Rt`. So how do we make it an "OR" case? With the help of `$or`:

In [21]:
result = circuit.nodes['thalamus_neurons'].get({
    '$or':[ # same as with $and, except the list is considered as OR condition
        {'mtype': 'VPL_IN'},
        {'region': 'mc2;Rt'}
]}, properties=['mtype', 'region'])

# Let's print all the unique mtype-region pairs:
for pair in result.groupby(['mtype', 'region'], observed=True).groups.keys():
    print(pair)

('Rt_RC', 'mc2;Rt')
('VPL_IN', 'mc0;VPL')
('VPL_IN', 'mc1;VPL')
('VPL_IN', 'mc2;VPL')
('VPL_IN', 'mc3;VPL')
('VPL_IN', 'mc4;VPL')
('VPL_IN', 'mc5;VPL')
('VPL_IN', 'mc6;VPL')
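
In terms of sets of node ids, "AND" corresponds to an intersection and "OR" to a union. The sketch below shows this on four made-up nodes. The `nodes` dict and the `select` helper are hypothetical and only mimic the mtype/region layout seen in the output above.

```python
# Made-up toy data: node id -> properties.
nodes = {
    0: {"mtype": "Rt_RC", "region": "mc2;Rt"},
    1: {"mtype": "VPL_IN", "region": "mc2;VPL"},
    2: {"mtype": "VPL_TC", "region": "mc2;VPL"},
    3: {"mtype": "VPL_IN", "region": "mc0;VPL"},
}


def select(key, value):
    """Return the ids of all nodes whose property `key` equals `value`."""
    return {node_id for node_id, props in nodes.items() if props[key] == value}


vpl_in = select("mtype", "VPL_IN")   # ids 1 and 3
mc2_rt = select("region", "mc2;Rt")  # id 0

and_result = sorted(vpl_in & mc2_rt)  # both conditions at once
or_result = sorted(vpl_in | mc2_rt)   # either condition

print(and_result)  # []
print(or_result)   # [0, 1, 3]
```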


## Conclusion
In this notebook we took a deeper look into node sets and how to query data in SNAP.