# Datasets

A [Dataset](https://nexus-forge.readthedocs.io/en/latest/interaction.html#dataset) is a specialization of a [Resource](https://nexus-forge.readthedocs.io/en/latest/interaction.html#resource) used to register (upload) files, optionally with metadata, in a configured store.

In [None]:
from kgforge.core import KnowledgeGraphForge

A configuration file is needed in order to create a KnowledgeGraphForge session. A configuration can be generated using the notebook [00-Initialization.ipynb](00%20-%20Initialization.ipynb).

Note: DemoStore doesn't implement file operations yet. Use the BlueBrainNexus store instead when creating a configuration file.

In [None]:
forge = KnowledgeGraphForge("../../configurations/forge.yml")

## Imports

In [None]:
from kgforge.core import Resource

In [None]:
from kgforge.specializations.resources import Dataset

In [None]:
import pandas as pd

## Creation with files

In [66]:
! ls -p ../../data | egrep -v /$

associations.tsv
my_data.xwz
my_data_derived.txt
persons.csv
tfidfvectorizer_model_schemaorg_linking
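
The shell pipeline above relies on `ls` and `egrep`, which may not be available on every platform. As a rough equivalent, the sketch below lists the non-hidden entries of a directory that are not directories, using only `pathlib`. The helper name `list_files` is an illustration introduced here and is not part of kgforge.

```python
from pathlib import Path
from typing import List, Union


def list_files(directory: Union[str, Path]) -> List[str]:
    """Return the sorted names of non-hidden entries that are not directories."""
    return sorted(
        entry.name
        for entry in Path(directory).iterdir()
        if not entry.is_dir() and not entry.name.startswith(".")
    )


data_dir = Path("../../data")  # same relative path as in the cell above
if data_dir.is_dir():  # only list when the directory is actually present
    print(*list_files(data_dir), sep="\n")
```

The ordering may differ slightly from `ls`, which sorts according to the current locale.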


In [67]:
persons = Dataset(forge, name="Interesting Persons")

In [68]:
persons.add_files("../../data/persons.csv")

In [69]:
forge.register(persons)

<action> _register_one
<succeeded> True


In [None]:
forge.as_json(persons)

In [71]:
associations = Dataset(forge, name="Associations data")

In [72]:
associations.add_files("../../data/associations.tsv")

In [73]:
associations.add_derivation(persons)

In [74]:
forge.register(associations)

<action> _register_one
<succeeded> True


In [None]:
forge.as_json(associations)

In [77]:
# overwrite (bool): whether to overwrite existing files with the same name (True) or
# to create new files whose names are suffixed with a timestamp (False).
# cross_bucket (bool): whether to download from the configured bucket (False, the default)
# or from a different bucket (True). The configured store must support crossing buckets for this to work.
associations.download(path="./downloaded/", source="parts")

In [78]:
! ls -l ./downloaded

total 8
-rw-r--r--  1 mfsy  staff  506 Aug 23 11:18 associations.tsv


In [None]:
# ! rm -R ./downloaded/

## Creation with resources

In [79]:
distribution_1 = forge.attach("../../data/associations.tsv")

In [80]:
distribution_2 = forge.attach("../../data/persons.csv")

In [81]:
jane = Resource(type="Person", name="Jane Doe", distribution=distribution_1)

In [82]:
john = Resource(type="Person", name="John Smith", distribution=distribution_2)

In [83]:
persons = [jane, john]

In [84]:
forge.register(persons)

<count> 2
<action> _register_many
<succeeded> True


In [85]:
dataset = Dataset(forge, name="Interesting people")

In [86]:
dataset.add_parts(persons)

In [87]:
forge.register(dataset)

<action> _register_one
<succeeded> True


In [None]:
forge.as_json(dataset)

In [88]:
dataset.download(path="./downloaded/", source="parts")

In [89]:
! ls -l ./downloaded

total 24
-rw-r--r--  1 mfsy  staff  506 Aug 23 11:18 associations.tsv
-rw-r--r--  1 mfsy  staff  506 Aug 23 11:18 associations.tsv.20210823111849
-rw-r--r--  1 mfsy  staff   52 Aug 23 11:18 persons.csv
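
The listing above contains `associations.tsv` twice because the file was already present from the earlier download and `overwrite` was not set, so the second copy received a timestamp suffix. The sketch below illustrates that naming rule in plain Python. It is not kgforge's implementation: the helper `target_path` is hypothetical, and the `%Y%m%d%H%M%S` suffix format is an assumption inferred from the file name shown in the listing.

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def target_path(directory: str, filename: str, overwrite: bool = False,
                now: Optional[datetime] = None) -> Path:
    """Hypothetical helper: choose where a downloaded file would be written."""
    path = Path(directory) / filename
    if overwrite or not path.exists():
        # Nothing to protect, or the caller asked to replace the existing file.
        return path
    # Keep the existing file and write next to it with a timestamp suffix.
    stamp = (now or datetime.now()).strftime("%Y%m%d%H%M%S")
    return path.with_name(f"{path.name}.{stamp}")


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a file left behind by a previous download.
    (Path(tmp) / "associations.tsv").write_text("placeholder")
    when = datetime(2021, 8, 23, 11, 18, 49)
    kept = target_path(tmp, "associations.tsv", overwrite=False, now=when)
    replaced = target_path(tmp, "associations.tsv", overwrite=True, now=when)
    fresh = target_path(tmp, "persons.csv", now=when)

print(kept.name)      # associations.tsv.20210823111849
print(replaced.name)  # associations.tsv
print(fresh.name)     # persons.csv
```

Passing `overwrite=True` to `download` would, per the comment in the earlier download cell, replace the existing file instead of creating a suffixed copy.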


In [90]:
# ! rm -R ./downloaded/

### Creation from resources

In [None]:
dataset = Dataset.from_resource(forge, [jane, john], store_metadata=True)
print(*dataset, sep="\n")

## Creation from a dataframe

See the notebook `07 DataFrame IO.ipynb` for details on converting a pandas DataFrame into instances of Resource.

### Basics

In [93]:
dataframe = pd.read_csv("../../data/persons.csv")

In [94]:
dataframe

Unnamed: 0,type,name
0,Person,Marie Curie
1,Person,Albert Einstein


In [95]:
persons = forge.from_dataframe(dataframe)

In [96]:
forge.register(persons)

<count> 2
<action> _register_many
<succeeded> True


In [97]:
dataset = Dataset(forge, name="Interesting people")

In [98]:
dataset.add_parts(persons)

In [99]:
forge.register(dataset)

<action> _register_one
<succeeded> True


In [None]:
forge.as_json(dataset)

### Advanced

In [108]:
dataframe = pd.read_csv("../../data/associations.tsv", sep="\t")

In [109]:
dataframe

Unnamed: 0,id,name,type,agent__type,agent__name,agent__gender__id,agent__gender__type,agent__gender__label,distribution
0,(missing),Curie Association,Association,Person,Marie Curie,http://purl.obolibrary.org/obo/PATO_0000383,LabeledOntologyEntity,female,../../data/scientists-database/marie_curie.txt
1,(missing),Einstein Association,Association,Person,Albert Einstein,http://purl.obolibrary.org/obo/PATO_0000384,LabeledOntologyEntity,male,../../data/scientists-database/albert_einstein...


In [110]:
dataframe["distribution"] = dataframe["distribution"].map(lambda x: forge.attach(x))

In [111]:
associations = forge.from_dataframe(dataframe, na="(missing)", nesting="__")

In [112]:
print(*associations, sep="\n")

{
    type: Association
    agent:
    {
        type: Person
        gender:
        {
            id: http://purl.obolibrary.org/obo/PATO_0000383
            type: LabeledOntologyEntity
            label: female
        }
        name: Marie Curie
    }
    distribution: LazyAction(operation=Store.upload, args=['../../data/scientists-database/marie_curie.txt', None])
    name: Curie Association
}
{
    type: Association
    agent:
    {
        type: Person
        gender:
        {
            id: http://purl.obolibrary.org/obo/PATO_0000384
            type: LabeledOntologyEntity
            label: male
        }
        name: Albert Einstein
    }
    distribution: LazyAction(operation=Store.upload, args=['../../data/scientists-database/albert_einstein.txt', None])
    name: Einstein Association
}
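
The nested structure printed above follows from the two arguments passed to `from_dataframe`: cells equal to the `na` marker are left out, and column names are split on the `nesting` separator to build nested properties. The sketch below reproduces that idea on a single row with plain dictionaries. It is only an illustration under these assumptions: `nest_row` is a hypothetical helper, not a kgforge function, and the real conversion also builds Resource objects, which is omitted here.

```python
from typing import Any, Dict


def nest_row(row: Dict[str, Any], na: str = "(missing)",
             nesting: str = "__") -> Dict[str, Any]:
    """Turn one flat row into a nested dict, dropping values equal to `na`."""
    nested: Dict[str, Any] = {}
    for column, value in row.items():
        if value == na:
            continue  # treat the marker as "no value" and omit the property
        *parents, leaf = column.split(nesting)
        target = nested
        for key in parents:
            # Walk down, creating intermediate dicts as needed.
            target = target.setdefault(key, {})
        target[leaf] = value
    return nested


# First row of the dataframe shown above (distribution column left out).
row = {
    "id": "(missing)",
    "name": "Curie Association",
    "type": "Association",
    "agent__type": "Person",
    "agent__name": "Marie Curie",
    "agent__gender__id": "http://purl.obolibrary.org/obo/PATO_0000383",
    "agent__gender__type": "LabeledOntologyEntity",
    "agent__gender__label": "female",
}
nested = nest_row(row)
print(nested)
```

The `id` column disappears because its value is the `na` marker, while `agent__gender__label` ends up at `nested["agent"]["gender"]["label"]`.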


In [113]:
forge.register(associations)

<count> 2
<action> _register_many
<succeeded> True


In [114]:
dataset = Dataset(forge, name="Interesting associations")

In [115]:
dataset.add_parts(associations)

In [116]:
forge.register(dataset)

<action> _register_one
<succeeded> True


In [None]:
forge.as_json(dataset)