# Ensembles in Kosh

Frequently, the need arises to run an *ensemble*, that is, to produce many datasets that share some common `metadata` or `sources`.

Kosh provides a convenience class `KoshEnsemble` that helps you keep all of your datasets in sync.

## The basics

In essence, by creating a `KoshEnsemble` you lock a set of metadata that will be shared by all members of the ensemble. These metadata will be identical for all datasets in the ensemble and can only be edited from the `KoshEnsemble` object.

Additionally, you can associate data with the ensemble. The data will then appear as if it were associated with each member dataset.
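
One way to picture this sharing is as a layered lookup: a member's own attributes sit on top of the ensemble's attributes. The sketch below is only a mental model built on Python's `collections.ChainMap`. It is not how Kosh stores anything, and the variable names are made up for illustration.

```python
from collections import ChainMap

# Hypothetical stand-ins: plain dicts, not Kosh objects.
ensemble_attrs = {"root": "/root/path/for/ensemble", "project": "Example"}
ds1_attrs = {"param1": 1.0, "param2": "a"}
ds2_attrs = {"param1": 2.0, "param2": "b"}

# Each member "sees" its own attributes first, then the shared ones.
ds1_view = ChainMap(ds1_attrs, ensemble_attrs)
ds2_view = ChainMap(ds2_attrs, ensemble_attrs)

# Editing the shared layer once is visible from every member.
ensemble_attrs["root"] = "foo"
print(ds1_view["root"], ds2_view["root"])
print(ds1_view["param1"], ds2_view["param1"])
```

The toy stops there: it models only reading shared values, not the rule that shared attributes can be edited solely from the ensemble.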


In [1]:
import kosh

store = kosh.connect("ensembles_example.sql", delete_all_contents=True)

# Let's create an ensemble.
# We use the dedicated `create_ensemble` function, which works just like the `create` function for datasets.

ensemble = store.create_ensemble(name="My Example Dataset", metadata={"root":"/root/path/for/ensemble", "project":"Example"})

print(ensemble)

KOSH ENSEMBLE
	id: 51d499a7215c444181150f95d30d5aa6
	name: My Example Dataset
	creator: cdoutrix

--- Attributes ---
	creator: cdoutrix
	name: My Example Dataset
	project: Example
	root: /root/path/for/ensemble
--- Associated Data (0)---
--- Member Datasets (0)---
	[]


In [2]:
# Let's associate a file common to all datasets with the ensemble
ensemble.associate("../LICENSE", "text")
print(ensemble)

KOSH ENSEMBLE
	id: 51d499a7215c444181150f95d30d5aa6
	name: My Example Dataset
	creator: cdoutrix

--- Attributes ---
	creator: cdoutrix
	name: My Example Dataset
	project: Example
	root: /root/path/for/ensemble
--- Associated Data (1)---
	Mime_type: text
		/g/g19/cdoutrix/git/kosh/LICENSE ( 507b1b686ee24cc886fa3f98773e46ce )
--- Member Datasets (0)---
	[]


In [3]:
# Now let's add a member to our ensemble.
# We use the ensemble's `create` function, which works exactly like the store's `create` function.
ds1 = ensemble.create(name="First dataset", metadata={"param1":1., "param2": "a"})
# Notice that our ensemble attributes and associated data appear on the dataset
print(ds1)

KOSH DATASET
	id: f486f4c07ea04eeea6e3d084e2053808
	name: First dataset
	creator: cdoutrix

--- Attributes ---
	creator: cdoutrix
	name: First dataset
	param1: 1.0
	param2: a
--- Associated Data (1)---
	Mime_type: text
		/g/g19/cdoutrix/git/kosh/LICENSE ( 507b1b686ee24cc886fa3f98773e46ce )
--- Ensembles (1)---
	['51d499a7215c444181150f95d30d5aa6']
--- Ensemble Attributes ---
	--- Ensemble 51d499a7215c444181150f95d30d5aa6 ---
		project: Example
		root: /root/path/for/ensemble



In [4]:
# Dataset 1 also appears as part of the ensemble:
print(ensemble)

KOSH ENSEMBLE
	id: 51d499a7215c444181150f95d30d5aa6
	name: My Example Dataset
	creator: cdoutrix

--- Attributes ---
	creator: cdoutrix
	name: My Example Dataset
	project: Example
	root: /root/path/for/ensemble
--- Associated Data (1)---
	Mime_type: text
		/g/g19/cdoutrix/git/kosh/LICENSE ( 507b1b686ee24cc886fa3f98773e46ce )
--- Member Datasets (1)---
	['f486f4c07ea04eeea6e3d084e2053808']


In [5]:
# We can also create a dataset on its own as usual:
ds2 = store.create(name="Second dataset", metadata={"param1":2., "param2": "b"})
# And later add it to the ensemble
ensemble.add(ds2)
print(ensemble)

KOSH ENSEMBLE
	id: 51d499a7215c444181150f95d30d5aa6
	name: My Example Dataset
	creator: cdoutrix

--- Attributes ---
	creator: cdoutrix
	name: My Example Dataset
	project: Example
	root: /root/path/for/ensemble
--- Associated Data (1)---
	Mime_type: text
		/g/g19/cdoutrix/git/kosh/LICENSE ( 507b1b686ee24cc886fa3f98773e46ce )
--- Member Datasets (2)---
	['f486f4c07ea04eeea6e3d084e2053808', 'a97dae31c064496d947d71899830519e']


In [6]:
# We can also tell a dataset to join an ensemble:
# Let's create a dataset:
ds3 = store.create(name="Third dataset", metadata={"param1":3., "param2": "c"})
# Now let's ask the dataset to join the ensemble:
ds3.join_ensemble(ensemble)
print(ensemble)

KOSH ENSEMBLE
	id: 51d499a7215c444181150f95d30d5aa6
	name: My Example Dataset
	creator: cdoutrix

--- Attributes ---
	creator: cdoutrix
	name: My Example Dataset
	project: Example
	root: /root/path/for/ensemble
--- Associated Data (1)---
	Mime_type: text
		/g/g19/cdoutrix/git/kosh/LICENSE ( 507b1b686ee24cc886fa3f98773e46ce )
--- Member Datasets (3)---
	['f486f4c07ea04eeea6e3d084e2053808', 'a97dae31c064496d947d71899830519e', 'e5181f5390ed434691c5debc4d30023c']


In [7]:
# Now we can access all datasets of an ensemble:
list(ensemble.get_members(ids_only=True))

['f486f4c07ea04eeea6e3d084e2053808',
 'a97dae31c064496d947d71899830519e',
 'e5181f5390ed434691c5debc4d30023c']

In [8]:
# Similarly, a dataset can leave or be removed from an ensemble.
dataset = ensemble.create()
print("Ensemble has {} members.".format(len(list(ensemble.get_members(ids_only=True)))))
dataset.leave_ensemble(ensemble)
print("Ensemble has {} members after dataset left.".format(len(list(ensemble.get_members(ids_only=True)))))
ensemble.add(dataset)
print("Ensemble has {} members after adding dataset back.".format(len(list(ensemble.get_members(ids_only=True)))))
ensemble.delete(dataset)
print("Ensemble has {} members after removing dataset.".format(len(list(ensemble.get_members(ids_only=True)))))

Ensemble has 4 members.
Ensemble has 3 members after dataset left.
Ensemble has 4 members after adding dataset back.
Ensemble has 3 members after removing dataset.


## Attributes

As previously mentioned, the ensemble attributes appear on all of its members.

Changing or adding an ensemble attribute propagates to all of its members:


In [9]:
ensemble.root = "foo"
ensemble.new_attribute = "bar"
[(x.root, x.new_attribute) for x in ensemble.get_members()]

[('foo', 'bar'), ('foo', 'bar'), ('foo', 'bar')]

***WARNING:*** You cannot set an attribute belonging to an ensemble from one of its members.

In [10]:
try:
    ds1.root = "root_from_ds1"
except KeyError as err:
    print(err)
ds1.root

'The attribute root is controlled by ensemble: 51d499a7215c444181150f95d30d5aa6 and cannot be set here'


'foo'
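
If you would rather check than catch the exception, a small guard around `is_ensemble_attribute` is one option. The helper below is a sketch, not part of Kosh: `set_if_local` is a name invented here, and `FakeDataset` is a stand-in that mimics only the one method the helper calls, so the snippet runs without a store.

```python
class FakeDataset:
    """Hypothetical stand-in that mimics only what the helper needs."""

    def __init__(self, ensemble_attributes):
        self._ensemble_attributes = set(ensemble_attributes)

    def is_ensemble_attribute(self, name):
        return name in self._ensemble_attributes


def set_if_local(dataset, name, value):
    """Set `name` on `dataset` unless an ensemble controls it.

    Returns True when the attribute was set, False when it was skipped.
    """
    if dataset.is_ensemble_attribute(name):
        return False
    setattr(dataset, name, value)
    return True


fake = FakeDataset(ensemble_attributes=["root", "project"])
print(set_if_local(fake, "root", "root_from_ds1"))  # skipped: ensemble-controlled
print(set_if_local(fake, "param1", 5.0))            # set on the dataset itself
```

With a real dataset the call would read `set_if_local(ds1, "root", "root_from_ds1")`.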

You can ask a dataset whether one of its attributes belongs to an ensemble:

In [11]:
print("Is `root` an ensemble attribute?", ds1.is_ensemble_attribute("root"))
print("Is `param1` an ensemble attribute?", ds1.is_ensemble_attribute("param1"))

Is `root` an ensemble attribute? True
Is `param1` an ensemble attribute? False


You can also get which ensemble the attribute comes from:

In [12]:
print("Attribute `root` belongs to ensemble:", ds1.is_ensemble_attribute("root", ensemble_id=True))

Attribute `root` belongs to ensemble: 51d499a7215c444181150f95d30d5aa6


In [13]:
print("Attribute `param1` belongs to ensemble:", ds1.is_ensemble_attribute("param1", ensemble_id=True))

Attribute `param1` belongs to ensemble: 
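
Building on these two calls, attribute names can be grouped by the ensemble that controls them. The sketch below assumes, based on the empty output above, that `is_ensemble_attribute(..., ensemble_id=True)` returns something falsy for attributes that no ensemble controls. `group_by_owner` and `FakeDataset` are names invented for this example, and the id is shortened for readability.

```python
from collections import defaultdict


class FakeDataset:
    """Hypothetical stand-in; a real Kosh dataset would be passed instead."""

    def __init__(self, owners):
        self._owners = owners  # maps attribute name -> controlling ensemble id

    def is_ensemble_attribute(self, name, ensemble_id=False):
        owner = self._owners.get(name, "")
        return owner if ensemble_id else bool(owner)


def group_by_owner(dataset, names):
    """Group attribute names under the ensemble id that controls them, or 'local'."""
    groups = defaultdict(list)
    for name in names:
        owner = dataset.is_ensemble_attribute(name, ensemble_id=True)
        groups[owner or "local"].append(name)
    return dict(groups)


fake = FakeDataset({"root": "51d499a7", "project": "51d499a7"})
print(group_by_owner(fake, ["root", "project", "param1", "param2"]))
```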


## Searching

We can search a store for ensembles that have specific attribute values:

In [14]:
ensembles = store.find_ensembles(root="foo", ids_only=True)
print(list(ensembles))

['51d499a7215c444181150f95d30d5aa6']


The ensemble metadata appear as dataset metadata, so we can search for datasets based on ensemble attributes:

In [15]:
list(store.find(root="foo", ids_only=True))

['f486f4c07ea04eeea6e3d084e2053808',
 'a97dae31c064496d947d71899830519e',
 'e5181f5390ed434691c5debc4d30023c']

Just like for datasets, the `find` function is used to look up associated sources:

In [16]:
next(ensemble.find(mime_type="text", ids_only=True))

'507b1b686ee24cc886fa3f98773e46ce'

The associated data will also appear and be searchable for each individual dataset.

In [17]:
next(ds1.find(mime_type="text", ids_only=True))

'507b1b686ee24cc886fa3f98773e46ce'

We can also search for datasets within an ensemble.

In [18]:
next(ensemble.find_datasets(param1=1, ids_only=True))

'f486f4c07ea04eeea6e3d084e2053808'
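
The cells above consume search results with `next()`, which raises `StopIteration` when nothing matches. Passing a default to `next()` avoids that. The sketch below uses a plain generator in place of a Kosh search so that it runs on its own. `first_or_none` and `fake_search` are names invented for this example.

```python
def first_or_none(results):
    """Return the first item of an iterable, or None if it is empty."""
    return next(iter(results), None)


def fake_search(ids, wanted):
    """Hypothetical search: lazily yields the ids equal to `wanted`."""
    return (i for i in ids if i == wanted)


ids = ["id-a", "id-b"]
print(first_or_none(fake_search(ids, "id-b")))  # a match
print(first_or_none(fake_search(ids, "id-z")))  # no match, no exception
```

With a real ensemble this would read `first_or_none(ensemble.find_datasets(param1=1, ids_only=True))`.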

## Multiple ensembles

Datasets can be part of multiple ensembles. For example, you might run a parameter study for several problems, each with two different tools, and group the datasets both by problem and by tool.



In [19]:
problem1_ensemble = store.create_ensemble(name="problem 1", metadata={"problem":"problem1"})
problem2_ensemble = store.create_ensemble(name="problem 2", metadata={"problem":"problem2"})
tool1_ensemble = store.create_ensemble(name="tool1", metadata={"tool":"tool1"})
tool2_ensemble = store.create_ensemble(name="tool2", metadata={"tool":"tool2"})

for problem in ["problem1", "problem2"]:
    for tool in ["tool1", "tool2"]:
        for param1 in [1, 2, 3]:
            ds = store.create(metadata={"param1": param1})
            tool_ensemble = next(store.find_ensembles(tool=tool))
            ds.join_ensemble(tool_ensemble)
            problem_ensemble = next(store.find_ensembles(problem=problem))
            ds.join_ensemble(problem_ensemble)

# now let's find datasets for tool1 and problem1
datasets = list(store.find(tool="tool1", problem="problem1"))
print("We found:",len(datasets),"datasets")
ds = datasets[0]  # belongs to two ensembles
# Note that the string representation shows which attributes belong to which ensemble
ds

We found: 3 datasets


KOSH DATASET
	id: 27b0ce4cad1f430f8c8dac684418f6e5
	name: Unnamed Dataset
	creator: cdoutrix

--- Attributes ---
	creator: cdoutrix
	name: Unnamed Dataset
	param1: 1
--- Associated Data (0)---
--- Ensembles (2)---
	['1fbde2cc31dd4722befd0b4a612e1dad', '651fb4a6a60d432bbd524ed65aaee48e']
--- Ensemble Attributes ---
	--- Ensemble 1fbde2cc31dd4722befd0b4a612e1dad ---
		tool: tool1
	--- Ensemble 651fb4a6a60d432bbd524ed65aaee48e ---
		problem: problem1


***WARNING:*** For a dataset to belong to multiple ensembles, each ensemble must have a unique set of attributes.

For example, if another ensemble also had the `problem` attribute and a dataset belonged to both ensembles, we could not determine which ensemble to take the `problem` attribute from:

In [20]:
e3 = store.create_ensemble(metadata={"problem":"another problem"})
try:
    ds.join_ensemble(e3)
except Exception as err:
    print(err)

Dataset 27b0ce4cad1f430f8c8dac684418f6e5 is already part of ensemble 651fb4a6a60d432bbd524ed65aaee48e which already provides support for attribute: problem. Bailing
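
Put differently, the attribute names of all the ensembles a dataset belongs to must not overlap. A pre-check along those lines might look like the sketch below. It works on plain dictionaries that hold only the shared metadata, `conflicting_attributes` is a name invented here, and it is not how Kosh performs the check internally.

```python
def conflicting_attributes(current_ensembles, candidate_attrs):
    """Map each current ensemble id to the attribute names it shares with the candidate."""
    conflicts = {}
    for ensemble_id, attrs in current_ensembles.items():
        shared = set(attrs) & set(candidate_attrs)
        if shared:
            conflicts[ensemble_id] = sorted(shared)
    return conflicts


# Hypothetical ids standing in for the real ensemble ids.
current = {
    "tool-ensemble": {"tool": "tool1"},
    "problem-ensemble": {"problem": "problem1"},
}

# Overlaps on `problem`, so joining would be refused.
print(conflicting_attributes(current, {"problem": "another problem"}))
# No overlap, so joining would be allowed.
print(conflicting_attributes(current, {"campaign": "spring"}))
```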


Similarly, you cannot create a new attribute on an ensemble if one of its members belongs to another ensemble that already controls this attribute:


In [21]:
try:
    problem1_ensemble.tool = "some tool"
except Exception as err:
    print(err)

A member of this ensemble belongs to ensemble 1fbde2cc31dd4722befd0b4a612e1dad which already controls attribute tool
