<font size="+3">Project</font>

Each data item belongs to a certain project. 

The project exists in the database if there is at least one document (of any type) with the project name. 

# Creating a new project 


# Listing the projects in the database

The list of projects in the database can be obtained 
either from code or from the CLI. 

## Code 

In [9]:
from hera import datalayer
projectList = datalayer.getProjectList()

for pn in projectList:
    print(pn)


documentation
particlesNanoBBB
The-Project-Name
nanoParticlesBBB
loggingData
tmp
NTA2022
Demography


## CLI 

The list of projects can be obtained from the CLI 

<div class="alert alert-success" role="alert">    
    >>  hera-project project list [--onlyName] [--connectionName CONNECTIONNAME]
</div>    

This command lists all the projects and summarizes the number of documents of each type (Cache, Measurements, and Simulation documents). 

When --onlyName is specified, the output is without the summary of the documents. 
The --connectionName flag is specified to change the database connection name. 

# The Project class 

The Project class provides an interface to manage the data items in the project. 
The project name is either [supplied from a configuration file](#file) or specified [explicitly](#explicit).
The Project class also allows the user to specify the [connection name to the database](#connectionName). 
When the connectionName is not specified, the default connection is used (e.g. the Linux user name). 

Additionally, the Project object provides an interface to save configuration parameters.

First, we need to import the Project object.

In [10]:
from hera import Project 

<a id="file"></a>
## Initialize from a configuration file 

In standard usage the same project name will be used in different 
occasions (like different jupyter notebooks or python scripts). 

To simplify the usage, it is possible to initialize the project 
name from a configuration file. 

The configuration file name is always `caseConfiguration.json`, 
and its default location is the current directory. Alternatively, 
it is possible to specify the location of the configuration file. 
The structure of the configuration is delineated [below](#configurationFile).

Initializing the project using the configuration file 
is performed as follows:

In [4]:
proj = Project()

In the current example, the name of the project is testProject. 
Printing the project name confirms this:

In [5]:
print(proj.projectName)

testProject


When the case configuration is in a different directory, it is possible to 
supply the directory name

In [7]:
configurationPath = "."
proj = Project(configurationPath=configurationPath)
print(proj.projectName)

testProject


<a id="explicit"></a>
## Initializing explicitly 

The Project can also be initialized by specifying the 
project name explicitly:

In [8]:
proj = Project(projectName="myProject")
print(proj.projectName)

myProject


<a id="connectionName"></a>
## Specifying the connection name 

The connection name is specified as follows:

In [None]:
proj = Project(connectionName="theConnection")

The connection should be present in the configuration file. 
Management of the database is described [here](DataLayer.ipynb#setup).


<a id="configurationFile"></a>
## Creating a caseConfiguration file. 

The case configuration file can be created manually or using the CLI. 

### Manual 

The project name is specified in the json file. 

The structure of the JSON file is, 

```javascript
{
    "projectName" : <project name>
}
```
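
As a minimal sketch, a file with this structure can also be written with Python's standard `json` module. The helper name `write_case_configuration` is hypothetical and is not part of hera; it only produces a file with the structure shown above, here in a temporary directory.

```python
import json
import os
import tempfile


def write_case_configuration(directory, project_name):
    """Hypothetical helper (not part of hera): write caseConfiguration.json."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "caseConfiguration.json")
    with open(path, "w") as f:
        json.dump({"projectName": project_name}, f, indent=4)
    return path


# Write the file to a temporary directory and read it back.
case_dir = tempfile.mkdtemp()
config_path = write_case_configuration(case_dir, "testProject")

with open(config_path) as f:
    loaded = json.load(f)

print(loaded["projectName"])
```

A directory prepared this way could then be passed as `configurationPath` when initializing the Project, as described earlier.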

### CLI

The `caseConfiguration.json` file is created with the CLI as follows:

<div class="alert alert-success" role="alert">    
    >> hera-project project create projectName [--directory DIRECTORY]
</div>    

This command creates the caseConfiguration file with the requested project name. 
If --directory is not specified, the file is created in the current directory. 
If --directory is present, DIRECTORY is created and the `caseConfiguration.json` is written 
there. 

# Managing the data. 

Adding, removing and updating the data items is explained by example. 

First, let's create some mock-up data that we can store in the DB. 

In [2]:
import pandas
import numpy
from scipy.stats import norm

x = numpy.linspace(norm.ppf(0.01), norm.ppf(0.99), 100)

dataset1 = pandas.DataFrame(dict(x=x,y=norm.pdf(x,loc=0,scale=1)))
dataset2 = pandas.DataFrame(dict(x=x,y=norm.pdf(x,loc=0,scale=0.5)))
dataset3 = pandas.DataFrame(dict(x=x,y=norm.pdf(x,loc=0.5,scale=0.5)))

print(dataset1.head())

          x         y
0 -2.326348  0.026652
1 -2.279351  0.029698
2 -2.232354  0.033020
3 -2.185357  0.036632
4 -2.138360  0.040550


Now that we have data, we can save it. We would like to keep the connection between the data and the parameters that generated it, 
so that:

- **dataset1** is characterized by loc=0 and scale = 1 
- **dataset2** is characterized by loc=0 and scale = 0.5 
- **dataset3** is characterized by loc=0.5 and scale = 0.5

Therefore, we will save the loc and scale in the metadata.

Before we add the data to the DB, we need to save it to the disk

In [3]:
import os 

# getting the work directory
workingdir = os.path.join(os.path.abspath(os.getcwd()),"examples","datalayer")

dataset1File = os.path.join(workingdir,"dataset1.parquet")
dataset2File = os.path.join(workingdir,"dataset2.parquet")
dataset3File = os.path.join(workingdir,"dataset3.parquet")

Now we can save the datasets. We choose parquet for convenience, but any other format can be used.

In [4]:
dataset1.to_parquet(dataset1File,engine='fastparquet',compression='GZIP')
dataset2.to_parquet(dataset2File,engine='fastparquet',compression='GZIP')
dataset3.to_parquet(dataset3File,engine='fastparquet',compression='GZIP')

When we save the data to the database we need to define a project
and specify the project name. 

To do so, we need to import the project and create it with its name. 
For this example we will use the name `ExampleProject`.

In [5]:
from hera.datalayer import Project

projectName = "ExampleProject"

proj = Project(projectName=projectName)

## Adding data items to the project 

Next, we add the documents to the database. To do this, we must specify the 'type' of the documents. This type is user-defined and enables the user to query all documents of this type.

For this example, we will add the data as Measurements documents. 

In [6]:
proj.addMeasurementsDocument(type="Distribution",
                             dataFormat=proj.datatypes.PARQUET,
                             resource=dataset1File,
                             desc=dict(loc=0,scale=1));

proj.addMeasurementsDocument(type="Distribution",
                             dataFormat=proj.datatypes.PARQUET,
                             resource=dataset2File,
                             desc=dict(loc=0,scale=0.5));

proj.addMeasurementsDocument(type="Distribution",
                             dataFormat=proj.datatypes.PARQUET,
                             resource=dataset3File,
                             desc=dict(loc=0.5,scale=0.5));

## Getting the data

### Getting one record back
Now we will query the database for all the records in which loc=0 and scale=1.

In [7]:
List1 = proj.getMeasurementsDocuments(loc=0,scale=1)

print(f"The number of documents obtained from the query {len(List1)} ")
item0 = List1[0]

The number of documents obtained from the query 1 


Note that for consistency the query always returns a list.

The description of the record that matched the query is

In [8]:
import json 

print("The description of dataset 1")
print(json.dumps(item0.desc, indent=4, sort_keys=True))

The description of dataset 1
{
    "loc": 0,
    "scale": 1
}


Now, we will extract the data.

Calling getData on item0 retrieves the data:

In [9]:
dataset1FromDB = item0.getData()

Since the data is parquet, the library automatically returns a dask.DataFrame, where 
the data is not loaded until it is computed. 

Alternatively, we can pass the usePandas flag. This flag is used only 
when the datatype is PARQUET. 

In [10]:
dataset1FromDB = item0.getData(usePandas=True)

print(dataset1FromDB)

           x         y
0  -2.326348  0.026652
1  -2.279351  0.029698
2  -2.232354  0.033020
3  -2.185357  0.036632
4  -2.138360  0.040550
..       ...       ...
95  2.138360  0.040550
96  2.185357  0.036632
97  2.232354  0.033020
98  2.279351  0.029698
99  2.326348  0.026652

[100 rows x 2 columns]


### Getting multiple records back

The getMeasurementsDocuments method returns all the records that match the criteria. 

Now, let's get all the records where loc=0:

In [11]:
List2 = proj.getMeasurementsDocuments(loc=0)

print(f"The number of documents obtained from the query {len(List2)} ")

The number of documents obtained from the query 2 


As another example, let's retrieve all documents of the type 'Distribution'.

In [12]:
List3 = proj.getMeasurementsDocuments(type='Distribution')

print(f"The number of documents obtained from the query {len(List3)} ")

The number of documents obtained from the query 3 
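
The matching behavior seen in the three queries above can be illustrated without a database. The following is a sketch of the query semantics only, under the assumption that keyword arguments are compared with the `desc` fields and that `type` is compared with the document type. The function `query_documents` and the in-memory list are hypothetical and do not reflect how hera implements the query.

```python
# In-memory stand-ins for the three documents added above.
documents = [
    {"type": "Distribution", "desc": {"loc": 0, "scale": 1}},
    {"type": "Distribution", "desc": {"loc": 0, "scale": 0.5}},
    {"type": "Distribution", "desc": {"loc": 0.5, "scale": 0.5}},
]


def query_documents(docs, type=None, **desc_filters):
    """Hypothetical helper: return every document that matches all criteria."""
    matches = []
    for doc in docs:
        # Skip documents of a different type when a type is requested.
        if type is not None and doc["type"] != type:
            continue
        # Keep the document only if every requested desc field matches.
        if all(doc["desc"].get(k) == v for k, v in desc_filters.items()):
            matches.append(doc)
    return matches


print(len(query_documents(documents, loc=0, scale=1)))
print(len(query_documents(documents, loc=0)))
print(len(query_documents(documents, type="Distribution")))
```

The sketch returns a list in every case, including when a single document matches, which mirrors the counts of 1, 2 and 3 obtained above.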


## Updating the data.
The hera system stores the name of the file on disk and loads the data from it. Therefore, if the data file on disk is overwritten, the data of the record changes as well.

Let's multiply dataset1 by a factor of 2. The file name is stored in the resource attribute.

Note that if we update only the data and not the metadata, we can use the resource property to 
save the new file. 

In [13]:
dataset1['y'] *=2
dataset1FileName = item0.resource
dataset1.to_parquet(dataset1FileName,engine='fastparquet',compression='GZIP',append=False)

In [14]:
dataset1FromDB = item0.getData().compute()
print(dataset1FromDB)

           x         y
0  -2.326348  0.053304
1  -2.279351  0.059397
2  -2.232354  0.066040
3  -2.185357  0.073264
4  -2.138360  0.081099
..       ...       ...
95  2.138360  0.081099
96  2.185357  0.073264
97  2.232354  0.066040
98  2.279351  0.059397
99  2.326348  0.053304

[100 rows x 2 columns]
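
The record reflects the new values because only the file path is stored, so the data is read from disk on each call. Below is a minimal stand-alone sketch of that idea. The `Record` class is a hypothetical stand-in for a hera document, and a JSON file is used in place of parquet so that the example needs only the standard library.

```python
import json
import os
import tempfile


class Record:
    """Hypothetical stand-in for a document: stores a path, not the data."""

    def __init__(self, resource):
        self.resource = resource

    def getData(self):
        # Read the file on every call, so the latest content is returned.
        with open(self.resource) as f:
            return json.load(f)


path = os.path.join(tempfile.mkdtemp(), "dataset1.json")
with open(path, "w") as f:
    json.dump({"y": [1, 2, 3]}, f)

record = Record(path)
before = record.getData()

# Overwrite the file in place; the record itself is not touched.
with open(path, "w") as f:
    json.dump({"y": [2, 4, 6]}, f)

after = record.getData()
print(before["y"], after["y"])
```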


## Updating the metadata.

Let's assume we want to add another property to the first record. To do so, we will update item0.

In [15]:
item0.desc['new_attribute'] = "some data"
item0.save();

Let's query the database again to see what is stored there. 

In [16]:
item0_fromdb = proj.getMeasurementsDocuments(loc=0,scale=1)[0]
print(json.dumps(item0_fromdb.desc, indent=4, sort_keys=True))

{
    "loc": 0,
    "new_attribute": "some data",
    "scale": 1
}


## Deleting the metadata entry.

Metadata records are deleted in a way similar to how they are added.

The following deletes one record. The simplest method is to call delete on the document object. 

In [17]:
item0.delete()

Let's query the database again to check that the record was deleted. 

<div class="alert alert-block alert-warning">
Note that the file on the disk is not deleted by deleting the record in the DB. 
</div>

In [18]:
List1 = proj.getMeasurementsDocuments(loc=0,scale=1)

print(f"The number of documents obtained from the query {len(List1)} ")

The number of documents obtained from the query 0 


Another option is to delete the records using the Project interface. 

In [19]:
deletedList1 = proj.deleteMeasurementsDocuments(loc=0)

Let's list all the data records that we deleted:

In [20]:
for doc in deletedList1:
    print(json.dumps(doc, indent=4, sort_keys=True))

{
    "_cls": "Metadata.Measurements",
    "_id": {
        "$oid": "65a1d49d9b9128b1cb982b4c"
    },
    "dataFormat": "parquet",
    "desc": {
        "loc": 0,
        "scale": 0.5
    },
    "projectName": "ExampleProject",
    "resource": "/home/yehudaa/Development/hera/hera/doc/source/examples/datalayer/dataset2.parquet",
    "type": "Distribution"
}


## Deleting the data on the disk 

Now we can erase the files from the disk. The path of each file is stored in the resource property.

In [21]:
import shutil

for doc in deletedList1:
    if os.path.isfile(doc['resource']):
        os.remove(doc['resource'])
    else:
        shutil.rmtree(doc['resource'])
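
The loop above raises an error if a resource path no longer exists, because `shutil.rmtree` fails on a missing path. A slightly more defensive variant is sketched below. The helper name `remove_resource` is an assumption made for illustration and is not part of hera, and the example runs on temporary paths rather than on real data.

```python
import os
import shutil
import tempfile


def remove_resource(path):
    """Hypothetical helper: delete a file or a directory tree, skip missing paths."""
    if os.path.isfile(path):
        os.remove(path)
        return "file"
    if os.path.isdir(path):
        shutil.rmtree(path)
        return "directory"
    return "missing"


# Create one temporary file and one temporary directory to delete.
base = tempfile.mkdtemp()
file_path = os.path.join(base, "dataset.parquet")
with open(file_path, "w") as f:
    f.write("placeholder")
dir_path = os.path.join(base, "dataset_dir")
os.makedirs(dir_path)

# The third call targets the already-deleted file and is skipped.
results = [remove_resource(p) for p in (file_path, dir_path, file_path)]
print(results)
```

Returning a status string instead of raising keeps the clean-up loop running when some of the files were already removed by hand.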

## Delete all the metadata records 

A simple way to delete all the records (use with care):

In [22]:
[x.delete() for x in proj.getMeasurementsDocuments(type="Distribution")]

[None]

# Configuration 

The configuration interface enables users to store project parameters, which are organized as a dictionary.

The parameters are set as follows:

In [12]:
proj.setConfig(parameterName="value")

The parameter dictionary is retrieved as follows:

In [13]:
proj.getConfig()

{'parameterName': 'value'}

# Datatypes 

The Project class also provides access to the [datatypes](DataLayer.ipynb#datatypes) of the available data formats. 

Specifically, 

In [14]:
proj.datatypes.PARQUET

'parquet'