
# pyDataverse Data Migration Demo

**European Dataverse Workshop 2020 @ Tromso**

This Jupyter Notebook is part of the [European Dataverse Workshop 2020 at Tromso](https://github.com/AUSSDA/pyDataverse_workshop_tromso). It demonstrates how to migrate data into Dataverse with [pyDataverse](https://github.com/AUSSDA/pyDataverse).

This Jupyter Notebook is used as an executable migration script with documentation.

* Date: 24th January 2020
* Location: [UiT - The Arctic University of Norway](https://en.uit.no/startsida), Tromsø
* Trainer: Stefan Kasberger from [AUSSDA - The Austrian Social Science Data Archive](https://aussda.at).
* Materials: [GitHub Repository](https://github.com/AUSSDA/pyDataverse_workshop_tromso)

**Requirements**

* [Dataverse Docker](https://github.com/IQSS/dataverse-docker) (commit: [19b2e86](https://github.com/IQSS/dataverse-docker/commit/19b2e86bdd32a49f4c5940e284844b6d5bf99570))
* [Jupyter Docker](https://hub.docker.com/r/jupyter/datascience-notebook) (digest: [18ef2702c6a2](https://hub.docker.com/layers/jupyter/datascience-notebook/latest/images/sha256-18ef2702c6a25bd26b81e7b6dc831adb2bc294ae7bc9b011150b8f4573c41d4a))
* [pyDataverse](https://github.com/AUSSDA/pyDataverse) (branch: [develop](https://github.com/AUSSDA/pyDataverse/tree/develop))

**Overview**

1. Short introduction to pyDataverse (DONE)
2. Set up the user environment
3. Introduction to the pyDataverse CSV templates
4. Import Datasets metadata into pyDataverse
5. Upload Datasets to Dataverse
6. Import Datafiles metadata into pyDataverse
7. Upload Datafiles to Dataverse
8. Delete Datasets (optional)
9. Copy Jupyter Notebook to localhost (optional)

**Software Architecture**

![Software architecture](assets/architecture.png)


**pyDataverse Workflow**

![pyDataverse Workflow](assets/flow-chart.png)

## 1. Introduction to pyDataverse

See [presentation.pdf](https://github.com/AUSSDA/pyDataverse_workshop_tromso/presentation.pdf).

## 2. Setup

Before we start, we need to set up our working environment.

Open your terminal.

**Start Dataverse (Docker)**

If it is not already running, install and start your Dataverse Docker container. You can find out more in the [Dataverse Docker GitHub repository](https://github.com/IQSS/dataverse-docker).

**Start Jupyter Notebook (Docker)**

Download and start the Docker container for Jupyter Notebook ([jupyter/datascience-notebook](https://hub.docker.com/r/jupyter/datascience-notebook)).

```shell
$ docker pull jupyter/datascience-notebook@sha256:18ef2702c6a25bd26b81e7b6dc831adb2bc294ae7bc9b011150b8f4573c41d4a
$ docker run -p 8888:8888 jupyter/datascience-notebook@sha256:18ef2702c6a25bd26b81e7b6dc831adb2bc294ae7bc9b011150b8f4573c41d4a
```

![Start Jupyter Notebook](assets/screenshot_start-jupyter-container.png)

Now you can open the Jupyter Notebook environment by clicking the link with the token that is printed to your terminal. This should open a window in your browser, where you see the container home directory with the folder `work/`.

![Empty Notebook](assets/screenshot_empty-notebook.png)

**Go inside the Jupyter container**

To install pyDataverse and download the demo repository, we first need to get inside the container. Run `docker ps` to list all running Docker containers, find the Jupyter Notebook container you just started and copy its container ID. Then use it in place of `CONTAINER_ID` in `docker exec -it CONTAINER_ID bash`.

```shell
$ docker ps
$ docker exec -it CONTAINER_ID bash
```

This should get you inside the shell of the Jupyter Notebook container.

![Go into Jupyter Docker container](assets/screenshot_get-into-docker.png)

**Setup Jupyter environment**

Once you are inside, you can install [pyDataverse](https://github.com/AUSSDA/pyDataverse). To have a working version, we install it from the commit [fbe9755](https://github.com/AUSSDA/pyDataverse/commit/fbe9755466556f5dd5044bbc732fac0d80a1dc38).

```shell
$ pip install git+https://github.com/aussda/pyDataverse.git@fbe9755466556f5dd5044bbc732fac0d80a1dc38
```

![Install pyDataverse](assets/screenshot_install-pydataverse_2.png)

**Clone the demo repository**

Finally, download the [GitHub Repository](https://github.com/AUSSDA/pyDataverse_demo_tromso) for this demo. It contains all the scripts, data and files needed for this demo.

```shell
$ git clone https://github.com/AUSSDA/pyDataverse_demo_tromso.git
```

![Clone Workshop repository](assets/screenshot_clone-repo.png)

Now everything is installed and up and running, so we can get our hands on some real Dataverse data.

Go back to your browser. In the Jupyter Notebook Dashboard, you should now see the `pyDataverse_demo_tromso/` folder. Go into it and open the `pydataverse.ipynb` file by clicking on it.

![Workshop in Jupyter](assets/screenshot_workshop-notebook.png)

## 3. Datasets and Datafiles templates

The data import approach in this demo is to use the [pyDataverse CSV templates](https://github.com/AUSSDA/pyDataverse_templates) and their structure. The two files needed for the test data migration are already prepared in our GitHub repository. You can find them as `datasets.csv` and `datafiles.csv` in the root directory of the repository.

* [datasets.csv](https://github.com/AUSSDA/pyDataverse_demo_tromso/blob/master/datasets.csv)
* [datafiles.csv](https://github.com/AUSSDA/pyDataverse_demo_tromso/blob/master/datafiles.csv)

The general concept of Datasets and Datafiles is that a Dataset can contain multiple Datafiles. The relation between the two is established through the variable `organization.dataset_id`. It is included in both CSV files and connects every single Datafile with its parent Dataset.
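
The link between the two files can be illustrated with a minimal sketch that uses only the Python standard library. The sample rows and the `title` and `filename` column names are made up for illustration; only `organization.dataset_id` is taken from the templates, and `group_datafiles` is a hypothetical helper, not part of pyDataverse.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature versions of datasets.csv and datafiles.csv.
# Only the `organization.dataset_id` column name comes from the templates;
# `title` and `filename` are placeholder column names.
DATASETS_CSV = """organization.dataset_id,title
test_1,Internet usage 2019
test_2,Second example dataset
"""

DATAFILES_CSV = """organization.dataset_id,filename
test_1,file_a.tsv
test_1,file_b.pdf
test_2,file_c.tsv
"""


def group_datafiles(datasets_text, datafiles_text):
    """Return {dataset_id: [filename, ...]} for every row of the Datasets CSV."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(datafiles_text)):
        grouped[row["organization.dataset_id"]].append(row["filename"])
    # Keep only IDs that exist in the Datasets CSV, in their original order.
    return {
        row["organization.dataset_id"]: grouped.get(row["organization.dataset_id"], [])
        for row in csv.DictReader(io.StringIO(datasets_text))
    }


files_by_dataset = group_datafiles(DATASETS_CSV, DATAFILES_CSV)
print(files_by_dataset)
```

Each Datafile row carries the ID of its parent Dataset, so grouping by that column is enough to reconstruct which files belong to which Dataset.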

To create your own CSV files with your own metadata, we recommend using the [pyDataverse templates](https://github.com/AUSSDA/pyDataverse_templates).

**datasets.csv**

[![datasets.csv](assets/screenshot_datasets.png)](https://github.com/AUSSDA/pyDataverse_demo_tromso/blob/master/datasets.csv)

**datafiles.csv**

[![datafiles.csv](assets/screenshot_datafiles.png)](https://github.com/AUSSDA/pyDataverse_demo_tromso/blob/master/datafiles.csv)

## 4. Import Datasets metadata from template into pyDataverse

1. Load the needed Python modules
2. Import CSV to a Python dictionary
3. Create the pyDataverse `Dataset()` object
4. Import Dataset dictionary into `Dataset()`
5. Print out `Dataset()` attributes

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# load Python modules
import json
from pyDataverse.api import Api
from pyDataverse.models import Datafile
from pyDataverse.models import Dataset
from pyDataverse.utils import read_csv_to_dict
from pyDataverse.utils import read_file
from demo import create_dataset
from demo import delete_dataset
from demo import import_datafile
from demo import parse_dataset_keys
from demo import publish_dataset
from demo import upload_datafile
import os
import subprocess as sp
import time

In [3]:
ds_filename = 'datasets.csv'
license_filename = 'license.html'
terms_filename = 'terms-of-access.html'

data = {}
license_default = read_file(license_filename)
datasets_csv = read_csv_to_dict(ds_filename, delimiter=',')

In [4]:
# Import Datasets metadata from CSV file and save it in a dictionary

for dataset in datasets_csv:
    data = parse_dataset_keys(dataset, data, terms_filename)

In [5]:
# create pyDataverse Dataset object and import data from dictionary

ds_1 = Dataset()
ds_1.set(data['test_1']['metadata'])

In [6]:
# Print out some basic metadata

print(ds_1.title)
print(ds_1.dsDescription[0]['dsDescriptionValue'])

Internet usage 2019
Life Style 2019. Internet usage / media.
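
The cells above rely on `data` being a dictionary keyed by Dataset ID, where each entry holds a `metadata` dictionary. The following is a minimal sketch of that shape, built with the standard library only. `build_data` and the `title` column are hypothetical; the demo's `parse_dataset_keys` is not reproduced here and may well do more.

```python
import csv
import io

# Hypothetical one-row CSV; only `organization.dataset_id` is a known column name.
SAMPLE_CSV = """organization.dataset_id,title
test_1,Internet usage 2019
"""


def build_data(csv_text, id_column="organization.dataset_id"):
    """Return {dataset_id: {'metadata': {...}}} from CSV text."""
    data = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        ds_id = row[id_column]
        # Everything except the ID column is treated as Dataset metadata.
        metadata = {key: value for key, value in row.items() if key != id_column}
        data[ds_id] = {"metadata": metadata}
    return data


data_sketch = build_data(SAMPLE_CSV)
print(data_sketch["test_1"]["metadata"]["title"])
```

Keying the dictionary by Dataset ID is what later allows Datafiles and persistent identifiers to be attached to the right Dataset.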


## 5. Upload Dataset Metadata via API

1. Get API token
2. Connect to Dataverse API with pyDataverse
3. Upload all `Dataset()` to Dataverse via API
4. View the PIDs of the uploaded Datasets

First, we have to get an API token for the Dataverse API. 

* Go to [Dataverse API token page](http://localhost:8085/dataverseuser.xhtml?selectTab=apiTokenTab).
* Create your own API token.
* Assign your API token to the `API_TOKEN` variable in the cell below, replacing the example value.
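
If you prefer not to store the token in the notebook itself, one option is to read it from an environment variable. This is only a sketch: the variable name `DATAVERSE_API_TOKEN` and the helper `load_api_token` are assumptions for illustration and are not part of pyDataverse or the demo.

```python
import os


def load_api_token(env=os.environ, var_name="DATAVERSE_API_TOKEN"):
    """Return the API token stored under `var_name`, or raise if it is missing."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your Dataverse API token."
        )
    return token


# Example with an explicit mapping instead of the real environment.
print(load_api_token({"DATAVERSE_API_TOKEN": "SECRET"}))
```

In the notebook you could then write `API_TOKEN = load_api_token()`, provided the variable was set when the Jupyter container was started.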

In [7]:
dv_alias = 'root'
BASE_URL = 'http://localhost:8085'
API_TOKEN = '7ab830e0-a493-4e76-96c4-901d4b7ef2bf'

In [8]:
# connect with API

api = Api(BASE_URL, API_TOKEN)

In [9]:
# upload Dataset metadata via API

mapping_dsid2pid = {}

for ds_id, dataset in data.items():
    ds = Dataset()
    ds.set(dataset['metadata'])
    resp, mapping_dsid2pid = create_dataset(api, ds, dv_alias, mapping_dsid2pid, ds_id, BASE_URL)

Dataset with pid 'doi:10.5072/FK2/YQGUXH' created.
http://localhost:8085/dataset.xhtml?persistentId=doi:10.5072/FK2/YQGUXH&version=DRAFT
Dataset with pid 'doi:10.5072/FK2/Y1Y6L3' created.
http://localhost:8085/dataset.xhtml?persistentId=doi:10.5072/FK2/Y1Y6L3&version=DRAFT


In [10]:
# Print out mapping from Dataset ID to DOI

print(mapping_dsid2pid)

{'test_1': 'doi:10.5072/FK2/YQGUXH', 'test_2': 'doi:10.5072/FK2/Y1Y6L3'}
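
For orientation, creating a Dataset corresponds to a `POST` request against the Dataverse native API, authenticated with the `X-Dataverse-key` header. The sketch below only builds such a request with the standard library and does not send it. `build_create_request` is a hypothetical helper, not the demo's `create_dataset`, and the payload is a placeholder rather than valid Dataset JSON.

```python
import json
import urllib.request


def build_create_request(base_url, api_token, dv_alias, dataset_json):
    """Build (but do not send) a request that creates a Dataset in a Dataverse."""
    url = f"{base_url.rstrip('/')}/api/dataverses/{dv_alias}/datasets"
    return urllib.request.Request(
        url,
        data=json.dumps(dataset_json).encode("utf-8"),
        headers={
            "X-Dataverse-key": api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder payload; real Dataset JSON is much larger.
req = build_create_request(
    "http://localhost:8085", "SECRET", "root", {"datasetVersion": {}}
)
print(req.get_method(), req.full_url)
```

In the notebook, pyDataverse's `Api` object and the demo's `create_dataset` take care of this step, so the sketch is only meant to show what travels over the wire.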


## 6. Import Datafiles metadata from template into pyDataverse

1. Import CSV to a Python dictionary
2. Create the pyDataverse `Datafile()` object
3. Import Datafile dictionary into `Datafile()`
4. Print out attributes

In [11]:
# Import Datafile metadata from CSV and save it in the dictionary.

df_filename = 'datafiles.csv'
datafiles_csv = read_csv_to_dict(df_filename, delimiter=',')

for datafile in datafiles_csv:
    data = import_datafile(datafile, data)

In [12]:
# create pyDataverse Datafile object and import data from dictionary

df_1 = Datafile()
df_1.set(data['test_1']['datafiles']['1']['metadata'])
df_1.set({'pid': mapping_dsid2pid['test_1']})

In [13]:
# Print out some basic metadata

print(df_1.pid)
print(df_1.filename)

doi:10.5072/FK2/YQGUXH
20001_ta_de_v1_0.tsv


## 7. Upload Datafiles metadata and data to Dataverse via API

In [14]:
# upload Datafile metadata and data via API

for ds_id, dataset in data.items():
    pid = mapping_dsid2pid[ds_id]
    for df_id, datafile in dataset['datafiles'].items():
        data_tmp = datafile['metadata']
        data_tmp['pid'] = pid
        df = Datafile()
        df.set(data_tmp)
        filename = os.path.abspath(os.path.join('data', datafile['metadata']['filename']))
        resp = upload_datafile(api, pid, filename, df)
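
Before running the upload loop against a real installation, it can help to do a dry run that lists every file together with its target PID and checks that the file exists locally. The sketch below mirrors the loop above without calling the API. `plan_uploads` is a hypothetical helper, and the sample dictionaries only imitate the structure used in this notebook.

```python
import os


def plan_uploads(data, mapping_dsid2pid, data_dir="data"):
    """Yield (pid, absolute_path, exists) for every Datafile in `data`."""
    for ds_id, dataset in data.items():
        pid = mapping_dsid2pid[ds_id]
        for datafile in dataset.get("datafiles", {}).values():
            path = os.path.abspath(
                os.path.join(data_dir, datafile["metadata"]["filename"])
            )
            yield pid, path, os.path.isfile(path)


# Sample input shaped like the notebook's `data` and `mapping_dsid2pid`.
sample_data = {
    "test_1": {"datafiles": {"1": {"metadata": {"filename": "20001_ta_de_v1_0.tsv"}}}}
}
sample_mapping = {"test_1": "doi:10.5072/FK2/YQGUXH"}

for pid, path, exists in plan_uploads(sample_data, sample_mapping):
    print(pid, path, "ok" if exists else "MISSING")
```

Any line marked `MISSING` points to a filename in `datafiles.csv` that has no matching file in the `data/` directory.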

## 8. Delete Datasets via API (optional)

In [15]:
# Delete the Datasets at the end (OPTIONAL); set to True to enable
DELETE_DATASETS = False

if DELETE_DATASETS:
    for ds_id, dataset in data.items():
        resp = delete_dataset(mapping_dsid2pid[ds_id], api)

## 9. Copy Jupyter Notebook to localhost (optional)

Files can be copied out of the Docker container with the `docker cp` command. For this you need the container ID of the running Jupyter Notebook container and the local directory into which the file should be copied.

```shell
$ docker ps
$ docker cp CONTAINER_ID:/home/jovyan/work/pyDataverse_demo_tromso/pydataverse.ipynb YOUR_LOCAL_DIRECTORY
```

## Resources

* [Dataverse API Docs](http://guides.dataverse.org/en/latest/api/index.html)
* [pyDataverse templates](https://github.com/AUSSDA/pyDataverse_templates)