# Datasets

A `Dataset` is a specialization of `Resource` designed for registering (uploading) files together with their metadata.

## Initialisation

Run the [Blue Brain Nexus project creation notebook](./00%20-%20Nexus_Project_Initialisation.ipynb) to create a Blue Brain Nexus project if you don't have one.

In [None]:
!pip install git+https://github.com/BlueBrain/nexus-forge

In [None]:
import getpass

The [Nexus web application](https://sandbox.bluebrainnexus.io/web) can be used to log in and get a token.

- Step 1: On the opened web page, click the login button in the right corner and follow the instructions.

![login-ui](https://raw.githubusercontent.com/BlueBrain/nexus-forge/master/examples/notebooks/use-cases/login-ui.png)

- Step 2: Once you are logged in, a token button appears in the right corner. Click it to copy the token.

![copy-token](https://raw.githubusercontent.com/BlueBrain/nexus-forge/master/examples/notebooks/use-cases/copy-token.png)


In [None]:
token = getpass.getpass()
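
When the notebook is run non-interactively, a prompt is not always practical. The sketch below shows one possible fallback: read the token from an environment variable and only prompt if it is unset. The variable name `NEXUS_TOKEN` and the helper `read_token` are assumptions made for illustration; they are not part of Nexus Forge.

```python
import getpass
import os


def read_token(env_var="NEXUS_TOKEN"):
    """Return the token from the environment, or prompt for it if unset.

    `env_var` is a hypothetical variable name chosen for this example.
    """
    value = os.environ.get(env_var)
    if value:
        # Strip stray whitespace that often sneaks in when pasting a token.
        return value.strip()
    return getpass.getpass("Nexus token: ")
```

With this helper, the cell above could become `token = read_token()`.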

In [None]:
# Clone the repository (needed when running in Google Colab)
import os

!pwd
tutorial_base_dir = "./nexus-forge"
if os.path.exists(tutorial_base_dir):
  !rm -Rf $tutorial_base_dir

!git clone --single-branch https://github.com/BlueBrain/nexus-forge.git


os.chdir(os.path.join(tutorial_base_dir, "examples/notebooks/nexus-demo"))

print("The working directory is now:")
!pwd

In [None]:
# Let's get some SHACL shapes from https://github.com/INCF/neuroshapes.git
import os 

neuroshapes_dir = "./neuroshapes"
if os.path.exists(neuroshapes_dir):
  !rm -Rf $neuroshapes_dir
! git clone https://github.com/INCF/neuroshapes.git
! cp -R "./neuroshapes/shapes/neurosciencegraph/datashapes/core/dataset" "./neuroshapes/shapes/neurosciencegraph/commons/" 
! cp -R "./neuroshapes/shapes/neurosciencegraph/datashapes/core/activity" "./neuroshapes/shapes/neurosciencegraph/commons/" 
! cp -R "./neuroshapes/shapes/neurosciencegraph/datashapes/core/entity" "./neuroshapes/shapes/neurosciencegraph/commons/" 
! cp -R "./neuroshapes/shapes/neurosciencegraph/datashapes/core/ontology" "./neuroshapes/shapes/neurosciencegraph/commons/" 
! cp -R "./neuroshapes/shapes/neurosciencegraph/datashapes/core/person" "./neuroshapes/shapes/neurosciencegraph/commons/" 

In [None]:
# Set up some configurations

org = "tutorialnexus"
project = "myProject"
bucket = f"{org}/{project}"
endpoint = "https://sandbox.bluebrainnexus.io/v1"


config = {
  "Model": {
    "name": "RdfModel",
    "origin": "directory",
    "source": "./neuroshapes/shapes/neurosciencegraph/commons/",
    "context": {
      "iri": "./neuroshapes_context.json"
    }
  },
  "Store": {
    "name": "BlueBrainNexus",
    "endpoint": "https://sandbox.bluebrainnexus.io/v1",
    "versioned_id_template": "{x.id}?rev={x._store_metadata._rev}",
    "file_resource_mapping": "../../configurations/nexus-store/file-to-resource-mapping.hjson"
  },
  "Formatters": {
    "identifier": "https://kg.example.ch/{}/{}"
  }
}
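
The two template strings in this configuration use Python format-string syntax. As a rough illustration of what they expand to, the sketch below applies plain `str.format` to them. The stand-in `resource` object and its values are made up for the example, and this only mimics the expansion; it is not Nexus Forge's own code path.

```python
from types import SimpleNamespace

# Templates copied from the configuration above.
versioned_id_template = "{x.id}?rev={x._store_metadata._rev}"
identifier_formatter = "https://kg.example.ch/{}/{}"

# Hypothetical stand-in for a registered resource; the values are made up.
resource = SimpleNamespace(
    id="https://kg.example.ch/agents/jane",
    _store_metadata=SimpleNamespace(_rev=2),
)

# Attribute access inside the braces resolves x.id and x._store_metadata._rev.
versioned_id = versioned_id_template.format(x=resource)

# Positional placeholders are filled in order.
identifier = identifier_formatter.format("agents", "jane")

print(versioned_id)
print(identifier)
```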


In [None]:
from kgforge.core import KnowledgeGraphForge

In [None]:
# Get a KnowledgeGraphForge session

forge = KnowledgeGraphForge(config, endpoint=endpoint, bucket=bucket, token=token)

## Imports

In [None]:
from kgforge.core import Resource
from kgforge.specializations.resources import Dataset
import pandas as pd

## Creation with files

In [None]:
! ls -p ../../data | grep -v /$

In [None]:
jane = Resource(type="Person", name="Jane Doe")

In [None]:
persons = Dataset(forge, name="Interesting Persons")

In [None]:
persons.add_files("../../data/persons.csv")

In [None]:
persons.add_contribution(jane)

In [None]:
forge.register(persons)

In [None]:
forge.as_json(persons)

In [None]:
associations = Dataset(forge, name="Associations data")

In [None]:
associations.add_files("../../data/associations.tsv")

In [None]:
associations.add_derivation(persons)

In [None]:
associations.add_contribution(jane)

In [None]:
forge.register(associations)

In [None]:
forge.as_json(associations)


In [None]:
associations.download("files", "./downloaded/")

In [None]:
! ls ./downloaded

In [None]:
! rm -R ./downloaded

## Creation with resources

In [None]:
distribution_1 = forge.attach("../../data/associations.tsv")

In [None]:
distribution_2 = forge.attach("../../data/persons.csv")

In [None]:
jane = Resource(type="Person", name="Jane Doe", distribution=distribution_1)

In [None]:
john = Resource(type="Person", name="John Smith", distribution=distribution_2)

In [None]:
persons = [jane, john]

In [None]:
forge.register(persons)

In [None]:
dataset = Dataset(forge, name="Interesting people")

In [None]:
dataset.add_parts(persons)

In [None]:
print(dataset)

In [None]:
forge.register(dataset)

In [None]:
dataset.download("parts", "./downloaded/")

In [None]:
! ls ./downloaded

In [None]:
! rm -R ./downloaded

### specifying custom content-type

In [None]:
data_file = forge.attach("../../data/my_data.xwz", content_type="application/xwz")
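
One reason to pass `content_type` explicitly is that extension-based guessing has nothing to go on for an unregistered extension such as `.xwz`. The sketch below uses the standard library's `mimetypes` module to show this. It is only an illustration: the helper `guess_content_type` is hypothetical, and Nexus Forge or the Nexus store may detect content types differently.

```python
import mimetypes


def guess_content_type(path, default="application/octet-stream"):
    """Guess a content type from the file extension, falling back to `default`."""
    content_type, _encoding = mimetypes.guess_type(path)
    return content_type or default


# The extension is not registered, so the generic fallback is returned.
print(guess_content_type("../../data/my_data.xwz"))
```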

In [None]:
experiment = Resource(type="Experiment", name="generated data", distribution=data_file)

In [None]:
forge.register(experiment)

In [None]:
print(forge.as_json(experiment))

## Creation from a dataframe

See the notebook `DataFrame IO.ipynb` for details on converting a pandas DataFrame to instances of `Resource`.

### basics

In [None]:
dataframe = pd.read_csv("../../data/persons.csv")

In [None]:
dataframe

In [None]:
persons = forge.from_dataframe(dataframe)

In [None]:
forge.register(persons)

In [None]:
dataset = Dataset(forge, name="Interesting people")

In [None]:
dataset.add_parts(persons)

In [None]:
print(dataset)

In [None]:
forge.register(dataset)

### advanced

In [None]:
dataframe = pd.read_csv("../../data/associations.tsv", sep="\t", usecols=["name", "type", "agent__type", "agent__name", "agent__gender__label", "distribution"])

In [None]:
dataframe

In [None]:
dataframe["distribution"] = dataframe["distribution"].map(lambda x: forge.attach(x))

In [None]:
associations = forge.from_dataframe(dataframe, na="(missing)", nesting="__")
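
To make the `nesting="__"` and `na="(missing)"` arguments more concrete, here is a plain-Python sketch of the idea: column names are split on the separator to build nested properties, and cells equal to the `na` marker are skipped. The helper `nest_row` and the sample row are made up for illustration; this is not the Nexus Forge implementation, which may differ in details.

```python
def nest_row(row, nesting="__", na="(missing)"):
    """Turn a flat mapping into a nested one, skipping values equal to `na`."""
    nested = {}
    for column, value in row.items():
        if value == na:
            continue
        # "agent__gender__label" -> parents ["agent", "gender"], leaf "label"
        *parents, leaf = column.split(nesting)
        target = nested
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = value
    return nested


# Made-up row using some of the column names selected above.
row = {
    "name": "Association 1",
    "type": "Association",
    "agent__type": "Person",
    "agent__name": "Jane Doe",
    "agent__gender__label": "(missing)",
}
result = nest_row(row)
print(result)
```

In this sketch the `agent__*` columns collapse into a single nested `agent` entry, and the missing gender label is left out entirely.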

In [None]:
print(associations[0])

In [None]:
forge.register(associations)

In [None]:
associations[0]._last_action

In [None]:
associations[0]._synchronized

In [None]:
associations[0]._store_metadata

In [None]:
forge.as_json(associations[0])

In [None]:
dataset = Dataset(forge, name="Interesting associations")

In [None]:
print(dataset)

In [None]:
dataset.add_parts(associations)

In [None]:
print(dataset)

In [None]:
forge.register(dataset)

In [None]:
dataset.download("parts", "./downloaded/")

In [None]:
! ls ./downloaded

In [None]:
! rm -R ./downloaded