# Datasets

This notebook demonstrates how to create Datasets. It can be used with one of the two stores: 
* The Demo Store, which is an in-memory store. Some features demonstrated in this notebook are not supported for this store. Do not use it for production.
* Nexus, which uses [Blue Brain Nexus](https://bluebrainnexus.io/) as a backend.

In [1]:
from kgforge.core import KnowledgeGraphForge

## Create the Forge using Demo Store and Demo Model

In [2]:
forge = KnowledgeGraphForge("../../configurations/demo-forge.yml")

## Create the Forge using Nexus Store and RDF Model

To run the demo using Nexus, use the next four cells. Provide the Nexus token and the organization and project (the `bucket`, written as `organization/project`) you want to use.

In [None]:
import getpass

In [None]:
token = getpass.getpass()

In [None]:
bucket = "dke/kgforge_tests"

In [None]:
forge = KnowledgeGraphForge("../../configurations/demo-forge-nexus.yml", bucket=bucket, token=token)
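For non-interactive runs, the token prompt above can be combined with an environment variable. The following is a minimal sketch, not part of kgforge: the helper `read_token` and the variable name `NEXUS_TOKEN` are hypothetical names chosen for illustration.

```python
import getpass
import os


def read_token(env_var="NEXUS_TOKEN"):
    """Return the token from the environment, or prompt for it if it is unset.

    `env_var` is a hypothetical variable name; use whatever your setup defines.
    """
    token = os.environ.get(env_var)
    if token:
        return token
    # Fall back to the interactive prompt used in the cell above.
    return getpass.getpass("Nexus token: ")
```

The returned value can then be passed as `token=` when creating the forge, exactly as in the cell above.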

## Imports

In [3]:
from kgforge.core import Resource

In [4]:
from kgforge.specializations.resources import Dataset

In [5]:
import pandas as pd

## Creation with files

All files inside a directory can be added to a `Dataset` using the `add_files()` function. A `DataDownload` resource is created for each file, and its metadata is extracted. The `add_parts()` function links resources to the dataset using the property `hasPart`.
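To make the idea of "one `DataDownload` per file, with extracted metadata" concrete, here is a conceptual sketch using only the standard library. It is not kgforge's implementation: the helpers `describe_files` and `build_dataset`, the metadata keys, and the choice to consider only regular files directly inside the directory are all assumptions made for illustration.

```python
import hashlib
import mimetypes
import tempfile
from pathlib import Path


def describe_files(directory):
    """Return one metadata record per regular file directly inside `directory`."""
    records = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # assumption: sub-directories are ignored in this sketch
        content = path.read_bytes()
        encoding_format, _ = mimetypes.guess_type(path.name)
        records.append({
            "type": "DataDownload",
            "name": path.name,
            "contentSize": len(content),
            "digest": hashlib.sha256(content).hexdigest(),
            "encodingFormat": encoding_format,  # None for unknown extensions
        })
    return records


def build_dataset(name, directory):
    """Build a plain-dict stand-in for a Dataset whose parts are the files."""
    return {"type": "Dataset", "name": name, "hasPart": describe_files(directory)}


# Small demonstration on a temporary directory with two made-up files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "persons.csv").write_bytes(b"name\nJane Doe\n")
    Path(tmp, "associations.tsv").write_bytes(b"a\tb\n")
    example = build_dataset("Interesting files", tmp)

print([part["name"] for part in example["hasPart"]])
```

In the notebook itself this work is deferred: `add_files()` records a lazy upload action, and the actual `DataDownload` resources only appear once a store that supports file upload registers the dataset.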

In [6]:
! ls -p ../../data | egrep -v /$

associations.tsv
persons.csv


In [7]:
dataset = Dataset(forge, name="Interesting files")

In [8]:
dataset.add_files("../../data/")

In [9]:
print(dataset)

{
    type: Dataset
    hasPart: LazyAction(operation=Store.upload, args=['../../data/'])
    name: Interesting files
}


In [10]:
# Dataset registration is not supported by DemoStore
forge.register(dataset)

<action> _register_one
<succeeded> False
<error> UploadingError: no file_resource_mapping has been configured


In [11]:
print(dataset)

{
    type: Dataset
    hasPart: LazyAction(operation=Store.upload, args=['../../data/'])
    name: Interesting files
}


## Creation with resources

The `attach()` function generates a `DataDownload` resource that can be linked to another `Resource`. In this example, the property `distribution` is used.

In [12]:
distribution_1 = forge.attach("../../data/associations.tsv")

In [13]:
distribution_2 = forge.attach("../../data/persons.csv")

In [14]:
jane = Resource(type="Person", name="Jane Doe", distribution=distribution_1)

In [15]:
john = Resource(type="Person", name="John Smith", distribution=distribution_2)

In [16]:
persons = [jane, john]

In [17]:
# Dataset registration is not supported by DemoStore
forge.register(persons)

<count> 2
<action> _register_one
<succeeded> False
<error> UploadingError: no file_resource_mapping has been configured


In [18]:
dataset = Dataset(forge, name="Interesting people")

In [19]:
dataset.add_parts(persons)

<action> _reshape
<error> AttributeError: 'Resource' object has no attribute 'id'



In [20]:
print(dataset)

{
    type: Dataset
    name: Interesting people
}


In [21]:
# Dataset registration is not supported by DemoStore
forge.register(dataset)

<action> _register_one
<succeeded> True


In [22]:
# File download is not supported by DemoStore
dataset.download("parts", "./downloaded/")

<action> collect_values
<error> DownloadingError: path to follow is incorrect



In [23]:
! ls ./downloaded

associations.tsv persons.csv


## Creation from a dataframe

See the notebook `DataFrame IO.ipynb` for details on converting a pandas DataFrame into instances of `Resource`.
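As a rough mental model for the cells below, each row of the table becomes one resource and each column becomes one property. The sketch below illustrates that mapping with the standard `csv` module and plain dictionaries instead of pandas and kgforge. The sample columns (`name`, `type`) and the helper `rows_to_records` are hypothetical; the actual content of `persons.csv` and the exact behaviour of `forge.from_dataframe()` are not shown in this notebook.

```python
import csv
import io

# Hypothetical sample data; the real persons.csv may have different columns.
SAMPLE_CSV = "name,type\nJane Doe,Person\nJohn Smith,Person\n"


def rows_to_records(text):
    """Turn CSV text into a list of dicts, one per row, skipping empty values."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: value for key, value in row.items() if value not in ("", None)}
        for row in reader
    ]


records = rows_to_records(SAMPLE_CSV)

# Plain-dict stand-in for a Dataset that links the records through hasPart.
people_dataset = {
    "type": "Dataset",
    "name": "Interesting people",
    "hasPart": records,
}

print(len(people_dataset["hasPart"]))
```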

In [None]:
dataframe = pd.read_csv("../../data/persons.csv")

In [None]:
dataframe

In [None]:
persons = forge.from_dataframe(dataframe)

In [None]:
forge.register(persons)

In [None]:
dataset = Dataset(forge, name="Interesting people")

In [None]:
dataset.add_parts(persons)

In [None]:
print(dataset)

In [None]:
forge.register(dataset)