# Datasets

A [Dataset](https://nexus-forge.readthedocs.io/en/latest/interaction.html#dataset) is a specialization of a [Resource](https://nexus-forge.readthedocs.io/en/latest/interaction.html#resource) that provides users with operations to handle files, record their provenance and describe them with metadata.

In [1]:
from kgforge.core import KnowledgeGraphForge

A configuration file is needed in order to create a KnowledgeGraphForge session. A configuration can be generated using the notebook [00-Initialization.ipynb](00%20-%20Initialization.ipynb).

Note: DemoStore doesn't implement file operations yet. Use the BlueBrainNexus store instead when creating a config file.

In [2]:
forge = KnowledgeGraphForge("../../configurations/forge.yml")

## Imports

In [3]:
from kgforge.core import Resource

In [4]:
from kgforge.specializations.resources import Dataset

In [5]:
import pandas as pd

## Creation with files added as parts

In [6]:
! ls -p ../../data | grep -v '/$'

associations.tsv
my_data.xwz
my_data_derived.txt
non_existing_person.jpg
persons-with-id.csv
persons.csv
tfidfvectorizer_model_schemaorg_linking


In [7]:
persons = Dataset(forge, name="Interesting Persons")

In [8]:
persons.add_files("../../data/persons.csv")

In [9]:
forge.register(persons)

<action> _register_one
<succeeded> True


In [10]:
forge.as_json(persons)

{'id': 'https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/d009f14a-874f-4230-a4ac-d55915c68652',
 'type': 'Dataset',
 'hasPart': {'distribution': {'type': 'DataDownload',
   'atLocation': {'type': 'Location',
    'store': {'id': 'https://bluebrain.github.io/nexus/vocabulary/diskStorageDefault',
     'type': 'DiskStorage',
     '_rev': 1}},
   'contentSize': {'unitCode': 'bytes', 'value': 52},
   'contentUrl': 'https://sandbox.bluebrainnexus.io/v1/files/github-users/crisely09/https%3A%2F%2Fsandbox.bluebrainnexus.io%2Fv1%2Fresources%2Fgithub-users%2Fcrisely09%2F_%2Ff1403394-4f78-42d9-adbb-65ae94202ddd',
   'digest': {'algorithm': 'SHA-256',
    'value': '1dacd765946963fda4949753659089c5f532714b418d30788bedddfec47a389f'},
   'encodingFormat': 'text/csv',
   'name': 'persons.csv'}},
 'name': 'Interesting Persons'}
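The `digest` and `contentSize` values in the output above can be compared with a local copy of the file. The following is a minimal sketch using only the standard library. The helper `describe_file` is hypothetical and not part of kgforge, and a small stand-in file is created so the snippet runs anywhere.

```python
import hashlib
import os
import tempfile


def describe_file(path):
    """Return the size and SHA-256 digest of a file, shaped like the output above.

    Hypothetical helper for illustration; not part of kgforge.
    """
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(8192), b""):
            sha256.update(chunk)
    return {
        "contentSize": {"unitCode": "bytes", "value": os.path.getsize(path)},
        "digest": {"algorithm": "SHA-256", "value": sha256.hexdigest()},
    }


# Stand-in content; point describe_file at "../../data/persons.csv"
# to compare with the values returned by the store.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"type,name\nPerson,Jane Doe\n")
    sample_path = tmp.name

local_metadata = describe_file(sample_path)
os.remove(sample_path)
print(local_metadata)
```

If the locally computed digest differs from the registered one, the local file is probably not the one that was uploaded.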

In [11]:
associations = Dataset(forge, name="Associations data")

In [12]:
associations.add_files("../../data/associations.tsv")

In [13]:
associations.add_derivation(persons)

In [14]:
forge.register(associations)

<action> _register_one
<succeeded> True


In [15]:
forge.as_json(associations)

{'id': 'https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/3339e476-b501-4301-bb00-d1393eaf84d7',
 'type': 'Dataset',
 'derivation': {'type': 'Derivation',
  'entity': {'id': 'https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/d009f14a-874f-4230-a4ac-d55915c68652?rev=1',
   'type': 'Dataset',
   'name': 'Interesting Persons'}},
 'hasPart': {'distribution': {'type': 'DataDownload',
   'atLocation': {'type': 'Location',
    'store': {'id': 'https://bluebrain.github.io/nexus/vocabulary/diskStorageDefault',
     'type': 'DiskStorage',
     '_rev': 1}},
   'contentSize': {'unitCode': 'bytes', 'value': 477},
   'contentUrl': 'https://sandbox.bluebrainnexus.io/v1/files/github-users/crisely09/https%3A%2F%2Fsandbox.bluebrainnexus.io%2Fv1%2Fresources%2Fgithub-users%2Fcrisely09%2F_%2F8e004e65-b3d9-4bd3-b41b-d2fca9a93723',
   'digest': {'algorithm': 'SHA-256',
    'value': '789aa07948683fe036ac29811814a826b703b562f7d168eb70dee1fabde26859'},
   'encodingFor

In [16]:
# By default, files are downloaded to the current directory (path=".").
# Set "follow" to collect the URLs or files to download from a different JSON path,
# and set "path" to download the files to a different directory.
# The overwrite (bool) argument decides whether existing files with the same name are
# overwritten (True) or new files are created with a timestamp suffix (False).
# The cross_bucket argument selects whether data is downloaded from the configured bucket
# (cross_bucket=False, the default) or from a different bucket (cross_bucket=True).
# The configured store must support crossing buckets for this to work.
associations.download(source="parts")

In [17]:
# A specific path can be provided.
associations.download(path="./downloaded/", source="parts")

In [18]:
# A specific content type can be downloaded.
associations.download(path="./downloaded/", source="parts", content_type="text/tab-separated-values")
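The `content_type` value is matched against the media type recorded for each file (the `encodingFormat` shown in the outputs above). When unsure which string to pass, Python's `mimetypes` module can suggest one from a file extension. This sketch is independent of kgforge and only uses Python's built-in extension table, so a store may still record a different type.

```python
import mimetypes

# A fresh MimeTypes instance starts from Python's built-in extension table.
registry = mimetypes.MimeTypes()

# File names taken from the ../../data listing above.
guessed = {
    name: registry.guess_type(name)[0]
    for name in ("persons.csv", "associations.tsv", "non_existing_person.jpg")
}

for name, media_type in guessed.items():
    print(f"{name}: {media_type}")
```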

In [19]:
! ls -l ./downloaded

total 1824
-rw-r--r--  1 cgonzale  10067     477 Feb  2 10:09 associations.tsv
-rw-r--r--  1 cgonzale  10067     477 Feb 12 14:57 associations.tsv.20240212145722
-rw-r--r--  1 cgonzale  10067     477 Feb 12 14:57 associations.tsv.20240212145724
-rw-r--r--  1 cgonzale  10067     477 Feb 12 14:59 associations.tsv.20240212145905
-rw-r--r--  1 cgonzale  10067     477 Feb 12 19:00 associations.tsv.20240212190039
-rw-r--r--  1 cgonzale  10067     477 Feb 16 15:11 associations.tsv.20240216151123
-rw-r--r--  1 cgonzale  10067     477 May 21 13:21 associations.tsv.20240521132127
-rw-r--r--  1 cgonzale  10067      16 Feb  2 10:09 my_data.xwz
-rw-r--r--  1 cgonzale  10067      16 Feb 12 14:59 my_data.xwz.20240212145905
-rw-r--r--  1 cgonzale  10067      16 Feb 12 19:00 my_data.xwz.20240212190039
-rw-r--r--  1 cgonzale  10067      16 Feb 16 15:11 my_data.xwz.20240216151123
-rw-r--r--  1 cgonzale  10067      24 Feb  2 10:09 my_data_derived.txt
-rw-r--r--  1 cgonzale  10067      24 Feb 12 14:59 my_d

In [20]:
# ! rm -R ./downloaded/
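The listing above shows what happens when files are downloaded repeatedly without overwriting: each new copy gets a timestamp suffix such as `associations.tsv.20240212145722`. The naming rule can be sketched as follows. The helper `resolve_target` and its `now` parameter are hypothetical, and the assumption that the suffix is only added when a file of the same name already exists is inferred from the listing rather than taken from kgforge's code.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def resolve_target(directory, filename, overwrite, now=None):
    """Pick the path a download would be written to (illustrative, not kgforge code)."""
    target = Path(directory) / filename
    if overwrite or not target.exists():
        # Either replacing is allowed or there is nothing to replace.
        return target
    # Keep the existing file and append a YYYYMMDDHHMMSS suffix to the new one.
    stamp = (now or datetime.now()).strftime("%Y%m%d%H%M%S")
    return target.with_name(f"{filename}.{stamp}")


with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "associations.tsv").write_text("already here")
    fixed_time = datetime(2024, 2, 12, 14, 57, 22)
    kept = resolve_target(folder, "associations.tsv", overwrite=False, now=fixed_time)
    replaced = resolve_target(folder, "associations.tsv", overwrite=True, now=fixed_time)

print(kept.name)
print(replaced.name)
```

Passing `overwrite=True` to `download` avoids accumulating suffixed copies like the ones in the listing.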

## Creation with files added as distribution

In [21]:
persons = Dataset(forge, name="Interesting Persons")

In [22]:
persons.add_distribution("../../data/associations.tsv")

In [23]:
persons.add_image(path='../../data/non_existing_person.jpg', content_type='image/jpeg', about='Person')

In [24]:
forge.register(persons)

<action> _register_one
<succeeded> True


In [25]:
forge.as_json(persons)

{'id': 'https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/ed348e0e-da41-446f-90fe-b8103e78e979',
 'type': 'Dataset',
 'distribution': {'type': 'DataDownload',
  'atLocation': {'type': 'Location',
   'store': {'id': 'https://bluebrain.github.io/nexus/vocabulary/diskStorageDefault',
    'type': 'DiskStorage',
    '_rev': 1}},
  'contentSize': {'unitCode': 'bytes', 'value': 477},
  'contentUrl': 'https://sandbox.bluebrainnexus.io/v1/files/github-users/crisely09/https%3A%2F%2Fsandbox.bluebrainnexus.io%2Fv1%2Fresources%2Fgithub-users%2Fcrisely09%2F_%2Ffd26715d-1d04-4f58-8845-e2fe7565353c',
  'digest': {'algorithm': 'SHA-256',
   'value': '789aa07948683fe036ac29811814a826b703b562f7d168eb70dee1fabde26859'},
  'encodingFormat': 'text/tab-separated-values',
  'name': 'associations.tsv'},
 'image': {'@id': 'https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/54f4702b-45f2-46ef-b94a-65d12ff8bf96',
  'about': 'Person'},
 'name': 'Interesting Persons'}

In [26]:
# Files added as distributions can be downloaded directly, without specifying which JSON path
# to use to collect the downloadable URLs. The content_type and path arguments can still be provided.
persons.download()

## Creation with resources added as parts

In [27]:
distribution_1 = forge.attach("../../data/associations.tsv")

In [28]:
distribution_2 = forge.attach("../../data/persons.csv")

In [29]:
jane = Resource(type="Person", name="Jane Doe", distribution=distribution_1)

In [30]:
john = Resource(type="Person", name="John Smith", distribution=distribution_2)

In [31]:
persons = [jane, john]

In [32]:
forge.register(persons)

<count> 2
<action> _register_many
<succeeded> True


In [33]:
dataset = Dataset(forge, name="Interesting people")

In [34]:
dataset.add_parts(persons)

In [35]:
forge.register(dataset)

<action> _register_one
<succeeded> True


In [36]:
forge.as_json(dataset)

{'id': 'https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/876be46c-34c8-43f1-9f48-07f997d5f43c',
 'type': 'Dataset',
 'hasPart': [{'id': 'https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/04aca629-348a-450c-bf40-e1abf8e9858e?rev=1',
   'type': 'Person',
   'distribution': {'contentUrl': 'https://sandbox.bluebrainnexus.io/v1/files/github-users/crisely09/https%3A%2F%2Fsandbox.bluebrainnexus.io%2Fv1%2Fresources%2Fgithub-users%2Fcrisely09%2F_%2F1be640b8-dcd3-451b-936c-b0525737edc4'},
   'name': 'Jane Doe'},
  {'id': 'https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/72a4dc8e-347d-478e-9cac-fb61ef3e212d?rev=1',
   'type': 'Person',
   'distribution': {'contentUrl': 'https://sandbox.bluebrainnexus.io/v1/files/github-users/crisely09/https%3A%2F%2Fsandbox.bluebrainnexus.io%2Fv1%2Fresources%2Fgithub-users%2Fcrisely09%2F_%2F18500da1-f8c5-4388-9ae2-686e9429c078'},
   'name': 'John Smith'}],
 'name': 'Interesting people'}

In [37]:
dataset.download(path="./downloaded/", source="parts")
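With `source="parts"`, the URLs to download come from the distributions nested under `hasPart` rather than from a top-level `distribution`. The sketch below illustrates collecting values along a dotted JSON path on a dictionary shaped like the output above. The helper `collect_values`, the dotted-path strings, and the shortened `example.org` URLs are illustrative assumptions, not kgforge internals.

```python
def collect_values(node, dotted_path):
    """Collect every value reachable by following dotted_path, descending into lists."""
    current = [node]
    for key in dotted_path.split("."):
        following = []
        for item in current:
            # A list stands for several sibling nodes; look inside each of them.
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    following.append(candidate[key])
        current = following
    # Flatten a final level of lists, e.g. several values on one node.
    flat = []
    for value in current:
        flat.extend(value if isinstance(value, list) else [value])
    return flat


dataset_json = {
    "type": "Dataset",
    "name": "Interesting people",
    "hasPart": [
        {"type": "Person", "name": "Jane Doe",
         "distribution": {"contentUrl": "https://example.org/files/jane"}},
        {"type": "Person", "name": "John Smith",
         "distribution": {"contentUrl": "https://example.org/files/john"}},
    ],
}

part_urls = collect_values(dataset_json, "hasPart.distribution.contentUrl")
direct_urls = collect_values(dataset_json, "distribution.contentUrl")
print(part_urls)
print(direct_urls)
```

In this example the dataset has no distribution of its own, so only the path through `hasPart` yields URLs.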

In [38]:
! ls -l ./downloaded

total 1840
-rw-r--r--  1 cgonzale  10067     477 Feb  2 10:09 associations.tsv
-rw-r--r--  1 cgonzale  10067     477 Feb 12 14:57 associations.tsv.20240212145722
-rw-r--r--  1 cgonzale  10067     477 Feb 12 14:57 associations.tsv.20240212145724
-rw-r--r--  1 cgonzale  10067     477 Feb 12 14:59 associations.tsv.20240212145905
-rw-r--r--  1 cgonzale  10067     477 Feb 12 19:00 associations.tsv.20240212190039
-rw-r--r--  1 cgonzale  10067     477 Feb 16 15:11 associations.tsv.20240216151123
-rw-r--r--  1 cgonzale  10067     477 May 21 13:21 associations.tsv.20240521132127
-rw-r--r--  1 cgonzale  10067     477 May 21 13:21 associations.tsv.20240521132129
-rw-r--r--  1 cgonzale  10067      16 Feb  2 10:09 my_data.xwz
-rw-r--r--  1 cgonzale  10067      16 Feb 12 14:59 my_data.xwz.20240212145905
-rw-r--r--  1 cgonzale  10067      16 Feb 12 19:00 my_data.xwz.20240212190039
-rw-r--r--  1 cgonzale  10067      16 Feb 16 15:11 my_data.xwz.20240216151123
-rw-r--r--  1 cgonzale  10067      24 Feb  

In [39]:
# ! rm -R ./downloaded/

## Creation from resources converted to Dataset objects

In [40]:
dataset = Dataset.from_resource(forge, [jane, john], store_metadata=True)
print(*dataset, sep="\n")

{
    id: https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/04aca629-348a-450c-bf40-e1abf8e9858e
    type: Person
    distribution:
    {
        type: DataDownload
        atLocation:
        {
            type: Location
            store:
            {
                id: https://bluebrain.github.io/nexus/vocabulary/diskStorageDefault
                type: DiskStorage
                _rev: 1
            }
        }
        contentSize:
        {
            unitCode: bytes
            value: 477
        }
        contentUrl: https://sandbox.bluebrainnexus.io/v1/files/github-users/crisely09/https%3A%2F%2Fsandbox.bluebrainnexus.io%2Fv1%2Fresources%2Fgithub-users%2Fcrisely09%2F_%2F1be640b8-dcd3-451b-936c-b0525737edc4
        digest:
        {
            algorithm: SHA-256
            value: 789aa07948683fe036ac29811814a826b703b562f7d168eb70dee1fabde26859
        }
        encodingFormat: text/tab-separated-values
        name: associations.tsv
    }
    name: Jane

## Creation from a dataframe

See notebook `07 DataFrame IO.ipynb` for details on creating instances of Resource from a pandas DataFrame.

### Basics

In [41]:
dataframe = pd.read_csv("../../data/persons.csv")

In [42]:
dataframe

     type             name
0  Person      Marie Curie
1  Person  Albert Einstein


In [43]:
persons = forge.from_dataframe(dataframe)

In [44]:
forge.register(persons)

<count> 2
<action> _register_many
<succeeded> True


In [45]:
dataset = Dataset(forge, name="Interesting people")

In [46]:
dataset.add_parts(persons)

In [47]:
forge.register(dataset)

<action> _register_one
<succeeded> True


In [48]:
forge.as_json(dataset)

{'id': 'https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/1c12042a-e345-49f6-8e2f-1dc9d7838ca4',
 'type': 'Dataset',
 'hasPart': [{'id': 'https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/84a9b704-3e43-431c-af6e-e3c8e542aa74?rev=1',
   'type': 'Person',
   'name': 'Marie Curie'},
  {'id': 'https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/5f7f2e1e-0666-456e-b5c9-51901ddb926f?rev=1',
   'type': 'Person',
   'name': 'Albert Einstein'}],
 'name': 'Interesting people'}

### Advanced

In [49]:
dataframe = pd.read_csv("../../data/associations.tsv", sep="\t")

In [50]:
dataframe

Unnamed: 0,id,name,type,agent__type,agent__name,agent__gender__id,agent__gender__type,agent__gender__label,distribution
0,(missing),Curie Association,Association,Person,Marie Curie,http://purl.obolibrary.org/obo/PATO_0000383,LabeledOntologyEntity,female,../../data/scientists-database/marie_curie.txt
1,(missing),Einstein Association,Association,Person,Albert Einstein,http://purl.obolibrary.org/obo/PATO_0000384,LabeledOntologyEntity,male,../../data/scientists-database/albert_einstein...


In [51]:
dataframe["distribution"] = dataframe["distribution"].map(forge.attach)

In [52]:
associations = forge.from_dataframe(dataframe, na="(missing)", nesting="__")

In [53]:
print(*associations, sep="\n")

{
    type: Association
    agent:
    {
        type: Person
        gender:
        {
            id: http://purl.obolibrary.org/obo/PATO_0000383
            type: LabeledOntologyEntity
            label: female
        }
        name: Marie Curie
    }
    distribution: LazyAction(operation=Store.upload, args=['../../data/scientists-database/marie_curie.txt', None, <kgforge.core.forge.KnowledgeGraphForge object at 0x7fe2734726d0>])
    name: Curie Association
}
{
    type: Association
    agent:
    {
        type: Person
        gender:
        {
            id: http://purl.obolibrary.org/obo/PATO_0000384
            type: LabeledOntologyEntity
            label: male
        }
        name: Albert Einstein
    }
    distribution: LazyAction(operation=Store.upload, args=['../../data/scientists-database/albert_einstein.txt', None, <kgforge.core.forge.KnowledgeGraphForge object at 0x7fe2734726d0>])
    name: Einstein Association
}
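The output above shows the effect of the two arguments: columns such as `agent__gender__id` become nested properties, and cells equal to `(missing)` are dropped (neither resource has an `id`). The following plain-Python sketch mimics that behaviour for a single row. The helper `nest_row` is hypothetical and is not how kgforge implements `from_dataframe`.

```python
def nest_row(row, na="(missing)", nesting="__"):
    """Turn a flat row into a nested dict, skipping cells equal to `na` (illustrative only)."""
    nested = {}
    for column, value in row.items():
        if value == na:
            # Treat the placeholder as "no value" and leave the property out.
            continue
        *parents, leaf = column.split(nesting)
        target = nested
        for parent in parents:
            # Create intermediate dictionaries on demand.
            target = target.setdefault(parent, {})
        target[leaf] = value
    return nested


row = {
    "id": "(missing)",
    "name": "Curie Association",
    "type": "Association",
    "agent__type": "Person",
    "agent__name": "Marie Curie",
    "agent__gender__id": "http://purl.obolibrary.org/obo/PATO_0000383",
    "agent__gender__type": "LabeledOntologyEntity",
    "agent__gender__label": "female",
}

association = nest_row(row)
print(association)
```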


In [54]:
forge.register(associations)

<count> 2
<action> _register_many
<succeeded> True


In [55]:
dataset = Dataset(forge, name="Interesting associations")

In [56]:
dataset.add_parts(associations)

In [57]:
forge.register(dataset)

<action> _register_one
<succeeded> True


In [58]:
forge.as_json(dataset)

{'id': 'https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/7ba8f194-4bbc-49a0-8805-1f8b726c9580',
 'type': 'Dataset',
 'hasPart': [{'id': 'https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/6081711e-7265-4685-8070-a3a4a36b1400?rev=1',
   'type': 'Association',
   'distribution': {'contentUrl': 'https://sandbox.bluebrainnexus.io/v1/files/github-users/crisely09/https%3A%2F%2Fsandbox.bluebrainnexus.io%2Fv1%2Fresources%2Fgithub-users%2Fcrisely09%2F_%2F1b0fe5e1-2bb8-4e53-a199-51b190157cd4'},
   'name': 'Curie Association'},
  {'id': 'https://sandbox.bluebrainnexus.io/v1/resources/github-users/crisely09/_/afbdf767-4613-4ce8-9d56-98d07867cd01?rev=1',
   'type': 'Association',
   'distribution': {'contentUrl': 'https://sandbox.bluebrainnexus.io/v1/files/github-users/crisely09/https%3A%2F%2Fsandbox.bluebrainnexus.io%2Fv1%2Fresources%2Fgithub-users%2Fcrisely09%2F_%2F3613ea45-79a1-4a60-acd6-739d3fc12f5a'},
   'name': 'Einstein Association'}],
 'name': 'In