# Create and Publish a Dataset

This notebook shows the Python client library equivalents of the raw HTTP calls described in the [create-and-publish](https://edelweissdata.com/docs/create-and-publish) walkthrough on the official EdelweissData documentation website.

This walkthrough shows you how to create a new dataset from a CSV file, set its description and metadata, and publish it. Later walkthroughs cover the authentication scheme, the query language and other details.

In order to create a dataset, you first need to create an In-Progress Dataset. In this stage you can make as many changes to the dataset as you want. Once you are okay with the data in the dataset and want to make the dataset available to others, you can publish it, thus creating a new version.

A published dataset is versioned and as such cannot be modified. To change a published dataset you create another In-Progress Dataset, apply the changes you'd like to make and then publish it as a new version.

Keep in mind that the old version will still be available. When accessing datasets, you can specify in the URL whether you want to retrieve the dataset at a specific version (identified by an integer version number starting at 1) or at the latest published version.
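
The versioning rules can be illustrated with a small in-memory model. This is only a sketch of the behaviour described above, not part of the `edelweiss_data` library: the `ToyDataset` class and its methods are hypothetical names chosen for illustration, and the `"latest"` keyword mirrors the special version string that the publishing section of this notebook mentions.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical in-memory stand-in for a dataset (not the real client)."""
    name: str
    versions: list = field(default_factory=list)  # published snapshots, oldest first

    def publish(self, rows):
        """Freeze a copy of the rows as a new version and return its number."""
        self.versions.append(tuple(rows))  # a tuple cannot be modified afterwards
        return len(self.versions)          # version numbers start at 1

    def get(self, version="latest"):
        """Return the rows of a specific version, or of the newest one."""
        if not self.versions:
            raise LookupError("nothing has been published yet")
        if version == "latest":
            return self.versions[-1]
        if not 1 <= version <= len(self.versions):
            raise LookupError(f"no version {version}")
        return self.versions[version - 1]


toy = ToyDataset("demo")
first = toy.publish([("John", 60)])
second = toy.publish([("John", 60), ("Jane", 59)])
```

After these two calls, `toy.get(1)` still returns the original single row, while `toy.get()` and `toy.get("latest")` both return the rows of version 2.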

**Dataset Lifecycle Flow**

![Dataset life cycle](https://edelweissdata.com/docs/images/dataset-lifecycle.png)

## Getting Started
The steps to publish a new Dataset are as follows:

1. Create a Dataset
2. Upload the Data
3. Upload or infer the Schema
4. Upload Metadata and Description
5. Set the visibility of the dataset
6. Publish the Dataset


## API initialization

See the [setup notebook](setup.ipynb) for details on how to install, initialize and authorize the library.

Creating and publishing datasets requires authentication, so the initialization code below calls the `authenticate()` method on the API object. This blocks script execution and asks you to visit the given URL, where you have to log in with your EdelweissData user and confirm the access. After you have done this once, the token is stored in your user's home directory so that this confirmation is no longer necessary on the client (you can disable this behaviour with the `cache_jwt` parameter).

In [1]:
from edelweiss_data import API, QueryExpression as Q

# Set this to the url of the Edelweiss Data server you want to interact with
edelweiss_api_url = 'https://api.edelweissdata.com'

api = API(edelweiss_api_url)

In [2]:
# exceedQuota is an extra permission that you may or may not have;
# it allows you to use more than the usual amount of storage
api.authenticate(scopes=["exceedQuota"])
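
The cells that follow read a local file called `trivial.csv`, which this notebook does not include. The following is a minimal sketch that writes a comparable file with Python's standard `csv` module. The column names and rows are an assumption reconstructed from the output of the last cell; the real file may differ, for example by carrying an extra unnamed index column.

```python
import csv

# Assumed columns and rows, reconstructed from the output of the final cell
fieldnames = ["First name", "Last name", "Age"]
rows = [
    {"First name": "John", "Last name": "Ford", "Age": 60},
    {"First name": "Jane", "Last name": "Ford", "Age": 59},
]

# newline="" lets the csv module control line endings itself
with open("trivial.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```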

## Immediate creation and publishing

The entire process outlined above can be done in one convenience call. This gives you less flexibility but is often all you need:

In [3]:
with open("trivial.csv") as f:
    metadata = {"category": "test", "metadata-dummy-number": 42.0}
    description = "Test dataset to demonstrate uploading from python. This can use **markdown formatting**."
    dataset = api.create_published_dataset_from_csv_file(
        "Test dataset 1", f, metadata=metadata, is_public=False, description=description
    )
dataset

<PublishedDataset '03482aac-9efe-41b4-b680-e8a3fc8166d8':1 - Test dataset 1>

## Fine grained creation and publishing

This is closer to what creating and publishing a dataset looks like in other languages when interacting directly with the HTTP endpoints: roughly every step outlined above translates to one method call.

### Create a Dataset

To create a dataset we have to supply the name of the dataset and call the `create_in_progress_dataset` method:

In [4]:
name = "Test dataset 2"
dataset2 = api.create_in_progress_dataset(name)

### Upload data to a Dataset

Now that we have created a dataset, we need to populate it with data. We need to read the CSV file and upload it using the `upload_data` method:

In [5]:
with open("trivial.csv") as f:
    dataset2.upload_data(f)

### Upload the Schema

At this point we have our data stored as CSV in EdelweissData™. However, it is currently stored as a bunch of string values.

To make the data interesting and allow EdelweissData™ to make sense of it, we need to supply a schema.

The schema defines the data type of each column in the data. The data types can be simple ones like `string` or `integer`, or more advanced ones like `DateTime` or [SMILES](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system).

Here is the list of data types currently supported:

| Data Type        | Representation                      |
| --               | --                                  |
| String           | xsd:string                          |
| Url              | xsd:anyURI                          |
| Boolean          | xsd:boolean                         |
| Integer          | xsd:integer                         |
| Float            | xsd:double                          |
| DateTime         | xsd:dateTime                        |
| Date             | xsd:date                            |
| DatasetId        | edelweiss:datasetid                 |
| SMILES           | cheminf:CHEMINF_000018              |
| Image            | https://schema.org/image            |
| Json             | http://edamontology.org/format_3464 |

There are currently two ways to define the schema:

1. Inference - We can tell EdelweissData™ to infer the schema
2. Upload Schema - We supply the correct schema as JSON

#### Schema Inference

EdelweissData™ can infer the schema based on some heuristics. Schema inference can only infer basic information like the data type. If you use schema inference, consider augmenting the returned schema (e.g. with richer descriptions for each column if you have them) and uploading it again (see the Schema Upload section below for details).
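
To give a feel for what heuristic type inference involves, here is a toy sketch that guesses a type from the table above for a column of string values. It is an illustration only and does not reproduce the heuristics EdelweissData™ actually applies; the function names are hypothetical, and only a handful of the supported types are covered.

```python
from datetime import date, datetime


def _parses(convert, value):
    """Return True if convert(value) succeeds without a ValueError."""
    try:
        convert(value)
        return True
    except ValueError:
        return False


def infer_column_type(values):
    """Guess the first type in the list below that fits every value."""
    if not values:
        return "xsd:string"  # nothing to inspect, so stay with plain strings
    # Ordered from most to least specific; the first full match wins
    checks = [
        ("xsd:boolean", lambda v: v.lower() in ("true", "false")),
        ("xsd:integer", lambda v: _parses(int, v)),
        ("xsd:double", lambda v: _parses(float, v)),
        ("xsd:date", lambda v: _parses(date.fromisoformat, v)),
        ("xsd:dateTime", lambda v: _parses(datetime.fromisoformat, v)),
    ]
    for type_name, fits in checks:
        if all(fits(v) for v in values):
            return type_name
    return "xsd:string"  # fallback: every value is a valid string
```

The ordering matters: every integer also parses as a float, so the integer check has to come first, and a column falls back to `xsd:string` as soon as a single value fails all other checks.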

#### Schema Upload

Schema inference works very well for basic data types; however, there are situations where you want fine-grained control over the schema. To accomplish this you need to construct an instance of the Schema class, modify one returned by schema inference, or supply a JSON-formatted schema file.

Below, `schemafile` can either be set to the filename of a schema file generated ahead of time, or left as `None`, in which case the schema is inferred on the fly:

In [6]:
schemafile = None
if schemafile is not None:
    with open(schemafile) as f:
        dataset2.upload_schema_file(f)
else:
    dataset2.infer_schema()

### Upload Metadata and Description

We have successfully inferred the schema; at this point we could move on to publish the dataset. To make our dataset more useful though, it is a good idea to add a few additional pieces of information. They are:

1. Description - Markdown text that helps users understand what the data is about
2. Metadata - A JSON object that contains pieces of structured metadata that help other people find the dataset. To learn more about how metadata can be used effectively, have a look at the [metadata documentation](metadata)

Both items (as well as the name and the schema if you want) can be uploaded in one method call by using the `update()` method on the in-progress dataset:

In [7]:
metadata = {"category": "test", "metadata-dummy-number": 42.0}
description = "Test dataset to demonstrate uploading from python. This can use **markdown formatting**."
dataset2.update(metadata = metadata, description = description)
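
Because the metadata has to be a JSON object, it can be worth checking locally that a dictionary serializes cleanly before uploading it. The helper below is a hedged sketch using only the standard library; `check_metadata` is a hypothetical name and not part of the `edelweiss_data` client.

```python
import json

metadata = {"category": "test", "metadata-dummy-number": 42.0}


def check_metadata(candidate):
    """Return a JSON round-tripped copy, or raise if it is not a JSON object."""
    if not isinstance(candidate, dict):
        raise TypeError("metadata must be a dict so that it maps to a JSON object")
    # json.dumps raises TypeError for values such as sets or datetime objects
    return json.loads(json.dumps(candidate))


round_tripped = check_metadata(metadata)
```

Values that JSON cannot represent, such as a Python `set`, make `check_metadata` raise a `TypeError` before any request is sent.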

### Set the dataset visibility

As a final step before publishing we have to decide if the dataset should be publicly visible (i.e. even anonymous HTTP requests can retrieve the dataset) or access restricted (in which case you can control which users can access the dataset and/or create new versions).

The current visibility can be queried on instances of both InProgressDataset and PublishedDataset by inspecting the `is_public` property. By default the visibility of a new dataset is set to **public**.

You can set the visibility to either public or access restricted by using the `API.change_dataset_visibility()` method.

In [8]:
api.change_dataset_visibility(dataset2.id, is_public = False)

### Publish the Dataset

Now that we have a schema for the dataset and have added metadata and a description, we can publish our dataset. In the publishing step EdelweissData™ will validate the schema and also pre-compute some information about our data.

Publishing a Dataset creates a new version of that Dataset. Once published, a version cannot be changed. If you want to update the dataset you can create a new version. The old version will still be available though. In the URL scheme of EdelweissData™ all endpoints that reference published datasets specify either a specific version by number (starting at 1), or the special version string `latest` to indicate that we want to retrieve whatever is the newest version of this dataset.

To document the reason behind publishing a new version, we need to provide a helpful changelog message when we publish. Publishing is achieved by calling the `publish` method and passing the changelog message:

In [9]:
published_dataset2 = dataset2.publish("Initial publish of the dataset")

In [10]:
published_dataset2.get_data()

Unnamed: 0,First name,Last name,Age
1,John,Ford,60
2,Jane,Ford,59
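
The fine-grained steps can be collected into one reusable function. The sketch below only chains the calls already shown in this notebook, in the same order; the function name `create_and_publish` and its parameters are hypothetical, and it assumes that `api` is an already authenticated `API` instance and that schema inference is sufficient for the data.

```python
def create_and_publish(api, name, csv_path, description, metadata,
                       is_public=False,
                       changelog="Initial publish of the dataset"):
    """Run the fine-grained steps from this notebook in order.

    `api` is expected to be an authenticated edelweiss_data API object.
    Returns whatever the in-progress dataset's publish() call returns.
    """
    # 1. Create the in-progress dataset
    dataset = api.create_in_progress_dataset(name)
    # 2. Upload the CSV data
    with open(csv_path) as f:
        dataset.upload_data(f)
    # 3. Let the server infer the schema
    dataset.infer_schema()
    # 4. Attach metadata and description
    dataset.update(metadata=metadata, description=description)
    # 5. Set the visibility
    api.change_dataset_visibility(dataset.id, is_public=is_public)
    # 6. Publish with a changelog message
    return dataset.publish(changelog)
```

Taking `api` as a parameter keeps the function free of global state, which also makes it possible to exercise the call order with a stand-in object before running it against a real server.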
