# Uploading Datasets

Please read [Downloading Datasets](./downloading.ipynb) first as it explains the general setup.

We connect to SciCat and a file server using a [Client](../generated/classes/scitacean.Client.rst):
```python
from scitacean import Client
from scitacean.transfer.ess import ESSTestFileTransfer
client = Client.from_token(url="https://scicat.ess.eu/api/v3",
                           token=...,
                           file_transfer=ESSTestFileTransfer(
                               host="login.esss.dk",
                               remote_base_path="/somewhere/on/remote/"
                           ))
```
This code is identical to the code used for [downloading](./downloading.ipynb), except for `remote_base_path`, which tells `ESSTestFileTransfer` where to put our files.

As with the downloading guide, we use a fake client instead of the real one shown above.

In [None]:
from scitacean.testing.docs import setup_fake_client
client = setup_fake_client()

This is especially useful here as datasets cannot be deleted from SciCat by regular users, and we don't want to pollute the database with our test data.

First, we need to generate some data to upload:

In [None]:
from pathlib import Path

path = Path("data/witchcraft.dat")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w") as f:
    f.write("7.9 13 666")

## Create a New Dataset

With the totally realistic data in hand, we can construct a dataset.

In [None]:
from scitacean import Dataset

dset = Dataset.new(
    dataset_name="Spellpower of the Three Witches",
    description="The spellpower of the maiden, mother, and crone.",
    dataset_type="raw",

    owner_group="wyrdsisters",
    access_groups=["witches"],

    owner="Nanny Ogg",
    investigator="Esme Weatherwax",
    contact_email="nogg@wyrd.lancre",
    creation_time="1983-05-01T00:13:42",

    data_format="space-separated",
    source_folder="...",
)

There are many more fields that can be filled in as needed.
See [scitacean.Dataset](../generated/classes/scitacean.Dataset.rst) and [scitacean.DatasetFields](../generated/classes/scitacean._dataset_fields.DatasetFields.rst).

Some fields require an explanation:

- `dataset_type` is either `raw` or `derived`. The main difference is that derived datasets point to one or more input datasets.
- `owner_group` and `access_groups` correspond to users and user groups on the file server and determine who can access the files.
- `source_folder` must be set but will be overridden by the file transfer. This will be fixed in the future. For now, use a placeholder.
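
The `creation_time` in the example above is written as an ISO 8601 string. If the time is available as a `datetime` object instead, a string of the same form can be produced with the standard library. This is a minimal sketch; the variable name is only illustrative.

```python
from datetime import datetime

# Build the timestamp from its components and format it as ISO 8601.
creation_time = datetime(1983, 5, 1, 0, 13, 42).isoformat()

print(creation_time)  # 1983-05-01T00:13:42
```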

Now we can attach our file:

In [None]:
dset.add_local_files("data/witchcraft.dat", base_path="data")

Setting `base_path` to `"data"` means that the file will be uploaded to `source-dir/witchcraft.dat`, where `source-dir` is determined by the file transfer.
If we did not set `base_path`, the file would end up in `source-dir/data/witchcraft.dat`.
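
The effect of `base_path` can be illustrated with plain `pathlib`. This is a minimal sketch of the path arithmetic only, not Scitacean's actual implementation; `source_dir` is a hypothetical placeholder for the folder chosen by the file transfer.

```python
from pathlib import PurePosixPath

# Hypothetical placeholder; the real folder is chosen by the file transfer.
source_dir = PurePosixPath("source-dir")
local_file = PurePosixPath("data/witchcraft.dat")

# With base_path="data", the leading "data" component is stripped.
with_base = source_dir / local_file.relative_to("data")

# Without base_path, the full relative path is kept.
without_base = source_dir / local_file

print(with_base)     # source-dir/witchcraft.dat
print(without_base)  # source-dir/data/witchcraft.dat
```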

Now, let's inspect the dataset.

In [None]:
dset

In [None]:
len(dset.files)

In [None]:
dset.size  # in bytes
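
The size can be cross-checked independently with the standard library. The sketch below writes the same content to a temporary directory (an assumption made to keep the example self-contained) and reads the size from the file system. Since the dataset holds a single file, we would expect `dset.size` to report the same number.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Recreate the test file with the same content as above.
    path = Path(tmp) / "witchcraft.dat"
    path.write_text("7.9 13 666")
    # Ask the file system for the size in bytes.
    size = path.stat().st_size

print(size)  # 10
```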

In [None]:
print(f"{dset.files[0].remote_access_path = }")
print(f"{dset.files[0].local_path = }")
print(f"{dset.files[0].size = } bytes")

The file has a `local_path` but no `remote_access_path`, which means that it exists on the local file system (where we put it earlier) but not on the remote file server accessible by SciCat.

Likewise, the dataset only exists in memory on our local machine and not on SciCat.
Nothing has been uploaded yet.
So we can freely modify the dataset or bail out by deleting the Python object if we need to.

## Upload the Dataset

Once the dataset is ready, we can upload it using:

In [None]:
finalized = client.upload_new_dataset_now(dset)

<div class="alert alert-warning">
    <b>WARNING:</b>

This action cannot be undone by a regular user!
Contact an admin if you uploaded a dataset accidentally.

</div>

[scitacean.Client.upload_new_dataset_now](../generated/classes/scitacean.Client.rst#scitacean.Client.upload_new_dataset_now) uploads the dataset (i.e. metadata) to SciCat and the files to the file server.
It always creates a new dataset and new files and never overwrites existing data or metadata.

It returns a new dataset that is a copy of the input with some updated information generated by SciCat and the file transfer.
For example, it has been assigned a new ID:

In [None]:
finalized.pid

And the remote access path of our file has been set:

In [None]:
finalized.files[0].remote_access_path