
# Demo: Use pyDataverse for data migrations into Dataverse

**European Dataverse Workshop @ Tromso**

This Jupyter Notebook is part of the [European Dataverse Workshop at Tromso](https://github.com/AUSSDA/pyDataverse_workshop_tromso). It offers a short demo of how to use [pyDataverse](https://github.com/AUSSDA/pyDataverse) for a data migration into Dataverse.

This notebook guides the demo as an executable script and can be reused for other purposes afterwards.

* Date: 24th January 2020
* Location: [UiT - The Arctic University of Norway](https://en.uit.no/startsida), Tromsø
* Trainer: Stefan Kasberger from [AUSSDA - The Austrian Social Science Data Archive](https://aussda.at).
* Workshop Materials: [GitHub Repository](https://github.com/AUSSDA/pyDataverse_workshop_tromso)

**Requirements**

* [Dataverse Docker](https://github.com/IQSS/dataverse-docker)
* [Jupyter Docker](https://hub.docker.com/r/jupyter/datascience-notebook)
* [pyDataverse](https://github.com/AUSSDA/pyDataverse) (branch: [develop](https://github.com/AUSSDA/pyDataverse/tree/develop))

**Overview**

What we will do:

1. Get a short introduction to pyDataverse (DONE)
2. Prepare the user environment
3. Introduce the pyDataverse CSV templates
4. Import Datasets metadata
5. Upload Datasets to Dataverse
6. Import Datafiles metadata
7. Upload Datafiles to Dataverse
8. Publish Datasets
9. Delete Datasets (optional)
10. Copy Jupyter Notebook to localhost

How we do it:

* Install a fresh Jupyter Docker container on your local machine and run the Jupyter notebook in it.
* Run the Dataverse Docker container on your local machine and test the data migration via its API.

**Software architecture**

![Software architecture](assets/architecture.png)

**How to use this Jupyter notebook**

* Execute/run a notebook cell: CTRL + ENTER

**pyDataverse Workflow**

![pyDataverse Workflow](assets/flow-chart.png)

## 1. Introduction to pyDataverse

See [presentation.pdf](https://github.com/AUSSDA/pyDataverse_workshop_tromso/presentation.pdf).

## 2. Preparations

Before we can start, we need to install some tools and prepare our working environment on our local machines.

Open your local terminal.

**Start Dataverse (Docker)**

If not already running, install and start your Dataverse Docker container first. Find out more about this in the [Dataverse Docker GitHub repository](https://github.com/IQSS/dataverse-docker).

**Start Jupyter Notebook (Docker)**

Download and start the Docker container for Jupyter Notebook on your laptop. Note that the command below starts the `jupyter/scipy-notebook` image rather than [jupyter/datascience-notebook](https://hub.docker.com/r/jupyter/datascience-notebook).

```shell
$ docker run -p 8888:8888 jupyter/scipy-notebook
```

![Start Jupyter Notebook](assets/screenshot_start-jupyter-container.png)

Now you can open your Jupyter Notebook environment by clicking the link with the token that is printed in your shell. This should open a window in your browser, where you see the container home directory with the folder `work/`.

![Empty Notebook](assets/screenshot_empty-notebook.png)

**Go inside the Jupyter container**

To install pyDataverse and download the workshop repository, we need to open a bash shell inside the container. First, `docker ps` lists all running Docker containers. Look for the `jupyter/scipy-notebook` container and copy its container ID. Then use it in place of `CONTAINER_ID` in `docker exec -it CONTAINER_ID bash`.

```shell
$ docker ps
$ docker exec -it CONTAINER_ID bash
```

This should get you inside the shell of the Jupyter Docker container.

![Go into Jupyter Docker container](assets/screenshot_get-into-docker.png)

**Setup environment**

Once you are inside, you can install [pyDataverse](https://github.com/AUSSDA/pyDataverse). To get the latest features, we install it from the develop branch.

```shell
$ pip install git+https://github.com/aussda/pyDataverse.git@develop
```

![Install pyDataverse](assets/screenshot_install-pydataverse_2.png)


**Clone the Workshop repository**

Finally, download the [pyDataverse workshop Tromso GitHub Repository](https://github.com/AUSSDA/pyDataverse_workshop_tromso). It contains all the scripts, data and files needed for this workshop.

```shell
$ git clone https://github.com/AUSSDA/pyDataverse_workshop_tromso.git
```

![Clone Workshop repository](assets/screenshot_clone-repo.png)

Now everything is installed and up and running, so we can move on to get our hands on some data.

Go back to your browser. You should now see the `pyDataverse_workshop_tromso/` folder. Open it and then open the `pydataverse.ipynb` file.

![Workshop in Jupyter](assets/screenshot_workshop-notebook.png)

**How to use Jupyter Notebook**

* Run/execute cell: CTRL + ENTER

## 3. Datasets and Datafiles templates

The two files needed for the test data migration are already prepared in our GitHub repository as `datasets.csv` and `datafiles.csv`.

* [datasets.csv](https://github.com/AUSSDA/pyDataverse_workshop_tromso/blob/master/datasets.csv)
* [datafiles.csv](https://github.com/AUSSDA/pyDataverse_workshop_tromso/blob/master/datafiles.csv)

The general concept of Datasets and Datafiles is that a Dataset can contain multiple Datafiles. The relation between a Dataset and its Datafiles is established via the variable `aussda.dataset_id`. It is included in both CSV files and connects every Datafile with its related Dataset.

To create your own CSV files with your own metadata, we recommend using the [pyDataverse templates](https://github.com/AUSSDA/pyDataverse_templates).
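
The relation described above can be sketched with the standard library alone. This is a minimal illustration, not the workshop's own import code: only the column `aussda.dataset_id` and the IDs `test_1` and `test_2` come from the workshop files, while the other column names, the file names and the function name are made up for the example.

```python
import csv
import io

# Tiny in-memory stand-ins for datasets.csv and datafiles.csv.
# Only `aussda.dataset_id` is taken from the workshop files; the other
# columns and all values are hypothetical.
DATASETS_CSV = """aussda.dataset_id,title
test_1,Example dataset one
test_2,Example dataset two
"""

DATAFILES_CSV = """aussda.dataset_id,filename
test_1,file_a.tsv
test_1,file_b.pdf
test_2,file_c.tsv
"""


def group_datafiles_by_dataset(datasets_text, datafiles_text):
    """Return {dataset_id: [filename, ...]} joined on aussda.dataset_id."""
    grouped = {}
    # Register every Dataset first, so Datasets without files still appear.
    for row in csv.DictReader(io.StringIO(datasets_text)):
        grouped[row["aussda.dataset_id"]] = []
    # Attach each Datafile to the Dataset it references.
    for row in csv.DictReader(io.StringIO(datafiles_text)):
        grouped.setdefault(row["aussda.dataset_id"], []).append(row["filename"])
    return grouped


files_per_dataset = group_datafiles_by_dataset(DATASETS_CSV, DATAFILES_CSV)
print(files_per_dataset)
```

Each Datafile row carries the ID of exactly one Dataset, so a Dataset can collect any number of Datafiles.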

**datasets.csv**

[![datasets.csv](assets/screenshot_datasets.png)](https://github.com/AUSSDA/pyDataverse_workshop_tromso/blob/master/datasets.csv)

**datafiles.csv**

[![datafiles.csv](assets/screenshot_datafiles.png)](https://github.com/AUSSDA/pyDataverse_workshop_tromso/blob/master/datafiles.csv)

## 4. Import Datasets metadata from template into pyDataverse

1. Load the needed Python modules
2. Convert the CSV data to a Python dictionary
3. Create the pyDataverse Dataset object
4. Print out metadata as JSON string
5. Print out specific metadata variables
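
Steps 2 and 4 can be illustrated without pyDataverse. The sketch below is a standard-library stand-in, not the implementation of `read_csv_to_dict`; the function name `csv_text_to_dicts` and the two-column sample are assumptions made for the example.

```python
import csv
import io
import json


def csv_text_to_dicts(text, delimiter=","):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Hypothetical two-column sample; the real datasets.csv has more columns.
sample = "aussda.dataset_id,title\ntest_1,Internet usage 2019\n"

rows = csv_text_to_dicts(sample)

# Serialize the first row as a JSON string, as in step 4.
metadata_json = json.dumps(rows[0], indent=2)
print(metadata_json)
```

Working with plain dictionaries first makes it easy to inspect the metadata before it is handed to a pyDataverse object.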

In [1]:
# load Python modules
import json
from pyDataverse.api import Api
from pyDataverse.models import Datafile
from pyDataverse.models import Dataset
from pyDataverse.utils import read_csv_to_dict
from pyDataverse.utils import read_file
from pyDataverse.utils import read_json
from workshop import create_dataset
from workshop import delete_dataset
from workshop import import_datafile
from workshop import parse_dataset_keys
from workshop import publish_dataset
from workshop import upload_datafile
import os
import subprocess as sp
import time

%load_ext autoreload
%autoreload 2

In [2]:
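# input files: Dataset metadata, default license text and terms of access
ds_filename = 'datasets.csv'
license_filename = 'license.html'
terms_filename = 'terms-of-access.html'

# collects the metadata of all Datasets (and later Datafiles), keyed by Dataset ID
data = {}
# read the default license text and the rows of the Datasets CSV
license_default = read_file(license_filename)
datasets_csv = read_csv_to_dict(ds_filename, delimiter=',')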

In [3]:
# Import Datasets metadata from CSV file and save it in a dictionary

for dataset in datasets_csv:
    data = parse_dataset_keys(dataset, data, terms_filename)

In [4]:
# create pyDataverse Dataset object and import data from dictionary

ds_1 = Dataset()
ds_1.set(data['test_1']['metadata'])

In [5]:
# Print out some basic metadata

print(ds_1.title)
print(ds_1.dsDescription[0]['dsDescriptionValue'])

Internet usage 2019
Life Style 2019. Internet usage / media.


## 5. Upload Dataset Metadata via API

First, we have to get an API token for the Dataverse API. 

* Go to [Dataverse API token page](http://localhost:8085/dataverseuser.xhtml?selectTab=apiTokenTab).
* Create your own API token.
* Assign the API token to the `API_TOKEN` variable below, instead of the value `SECRET`, and uncomment the line.
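
If you prefer not to paste the token into the notebook, one option is to read it from an environment variable. This is only a sketch: the variable name `DATAVERSE_API_TOKEN` and the helper `get_api_token` are assumptions for this example and are not part of pyDataverse or the workshop code.

```python
import os


def get_api_token(env_var="DATAVERSE_API_TOKEN"):
    """Return the API token from the environment, or raise a clear error."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Dataverse API token."
        )
    return token


# Usage inside the notebook, after exporting the variable in the container:
# API_TOKEN = get_api_token()
```

This keeps the secret out of the saved notebook file, which matters if you copy or share the notebook later.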

In [6]:
dv_alias = 'root'
BASE_URL = 'http://localhost:8085'
# API_TOKEN = 'SECRET'

In [7]:
# connect with API

api = Api(BASE_URL, API_TOKEN)

In [8]:
# upload Dataset metadata via API

mapping_dsid2pid = {}

for ds_id, dataset in data.items():
    ds = Dataset()
    ds.set(dataset['metadata'])
    resp, mapping_dsid2pid = create_dataset(api, ds, dv_alias, mapping_dsid2pid, ds_id, BASE_URL)

Dataset with pid 'doi:10.5072/FK2/1KCJLD' created.
http://localhost:8085/dataset.xhtml?persistentId=doi:10.5072/FK2/1KCJLD&version=DRAFT
Dataset with pid 'doi:10.5072/FK2/J5NHLQ' created.
http://localhost:8085/dataset.xhtml?persistentId=doi:10.5072/FK2/J5NHLQ&version=DRAFT
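
To show what such a call looks like on the HTTP level, the following sketch builds (but does not send) a request against the Dataverse native API endpoint for creating a Dataset inside a Dataverse collection. It is not the code of the workshop helper `create_dataset`; the function name `build_create_dataset_request` and the placeholder payload are assumptions for this example.

```python
import json
import urllib.request


def build_create_dataset_request(base_url, dv_alias, api_token, dataset_json):
    """Build (but do not send) the POST request that creates a Dataset."""
    url = f"{base_url.rstrip('/')}/api/dataverses/{dv_alias}/datasets"
    return urllib.request.Request(
        url,
        data=json.dumps(dataset_json).encode("utf-8"),
        headers={
            # Dataverse expects the API token in this header.
            "X-Dataverse-key": api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder payload: a real request needs the full Dataverse dataset JSON.
request = build_create_dataset_request(
    "http://localhost:8085", "root", "SECRET", {"datasetVersion": {}}
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; in this workshop that part is handled by pyDataverse.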


## 6. Import Datafiles metadata from template into pyDataverse

In [9]:
# Import Datafile metadata from CSV and save it in the dictionary.

df_filename = 'datafiles.csv'
datafiles_csv = read_csv_to_dict(df_filename, delimiter=',')

for datafile in datafiles_csv:
    data = import_datafile(datafile, data)

In [10]:
# create pyDataverse Datafile object and import data from dictionary

df_1 = Datafile()
df_1.set(data['test_1']['datafiles']['1']['metadata'])
df_1.set({'pid': mapping_dsid2pid['test_1']})

In [11]:
# Print out some basic metadata

print(df_1.pid)
print(df_1.filename)

doi:10.5072/FK2/1KCJLD
20001_ta_de_v1_0.tsv


In [12]:
# Print out mapping from Dataset ID to DOI

print(mapping_dsid2pid)

{'test_1': 'doi:10.5072/FK2/1KCJLD', 'test_2': 'doi:10.5072/FK2/J5NHLQ'}
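
The mapping can also be used to rebuild the draft URLs that were printed when the Datasets were created. The helper name `draft_url` is an assumption for this sketch; the URL pattern is copied from the output in step 5.

```python
def draft_url(base_url, pid):
    """Return the web URL of a Dataset draft, mirroring the output in step 5."""
    return f"{base_url}/dataset.xhtml?persistentId={pid}&version=DRAFT"


# Example mapping with the PIDs shown above; yours will differ.
example_mapping = {
    "test_1": "doi:10.5072/FK2/1KCJLD",
    "test_2": "doi:10.5072/FK2/J5NHLQ",
}

for ds_id, pid in example_mapping.items():
    print(ds_id, draft_url("http://localhost:8085", pid))
```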


## 7. Upload Datafiles metadata and data via API

In [13]:
# upload Datafile metadata and data via API

for ds_id, dataset in data.items():
    pid = mapping_dsid2pid[ds_id]
    for df_id, datafile in dataset['datafiles'].items():
        data_tmp = datafile['metadata']
        data_tmp['pid'] = pid
        df = Datafile()
        df.set(data_tmp)
        filename = os.path.abspath(os.path.join('data', datafile['metadata']['filename']))
        resp = upload_datafile(api, pid, filename, df)
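
Before starting a longer upload, it can help to check that every file referenced in the metadata actually exists on disk. The sketch below assumes the same nested layout of `data` as used in the loop above (`data[ds_id]['datafiles'][df_id]['metadata']['filename']`); the function name `find_missing_files` and the demo file names are made up for the example.

```python
import os
import tempfile


def find_missing_files(data, data_dir="data"):
    """Return absolute paths of Datafiles listed in `data` but absent on disk."""
    missing = []
    for dataset in data.values():
        for datafile in dataset.get("datafiles", {}).values():
            path = os.path.abspath(
                os.path.join(data_dir, datafile["metadata"]["filename"])
            )
            if not os.path.isfile(path):
                missing.append(path)
    return missing


# Demo with a throwaway directory: one file exists, one does not.
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, "present.tsv"), "w").close()
    example_data = {
        "test_1": {
            "datafiles": {
                "1": {"metadata": {"filename": "present.tsv"}},
                "2": {"metadata": {"filename": "absent.tsv"}},
            }
        }
    }
    missing_files = find_missing_files(example_data, data_dir=tmp_dir)

print([os.path.basename(path) for path in missing_files])
```

In the notebook, `find_missing_files(data)` would return an empty list if all files in the `data/` folder are in place.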

## 8. Publish Datasets via API

In [14]:
# Publish the Datasets

for ds_id, dataset in data.items():
    resp = publish_dataset(mapping_dsid2pid[ds_id], api)

{'status': 'ERROR', 'message': 'This dataset may not be published due to an error when contacting the <a href=http://status.datacite.org target="_blank"/> DataCite </a> Service. Please try again.'}
{'status': 'ERROR', 'message': 'This dataset may not be published due to an error when contacting the <a href=http://status.datacite.org target="_blank"/> DataCite </a> Service. Please try again.'}
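
The message above asks to try again, so a small retry wrapper is one way to handle it. This is a sketch under assumptions: `publish` stands for any callable that takes a PID and returns a dictionary with a `status` key shaped like the responses above; `publish_with_retry` and `fake_publish` are hypothetical names, and retrying only helps if the error is temporary.

```python
import time


def publish_with_retry(publish, pid, attempts=3, wait_seconds=2.0, sleep=time.sleep):
    """Call `publish(pid)` until its response status is not 'ERROR'."""
    resp = {}
    for attempt in range(1, attempts + 1):
        resp = publish(pid)
        if resp.get("status") != "ERROR":
            break
        # Wait before the next attempt, but not after the last one.
        if attempt < attempts:
            sleep(wait_seconds)
    return resp


# Demo with a fake publisher that fails twice and then succeeds.
calls = []


def fake_publish(pid):
    calls.append(pid)
    status = "OK" if len(calls) >= 3 else "ERROR"
    return {"status": status}


result = publish_with_retry(
    fake_publish, "doi:10.5072/FK2/1KCJLD", sleep=lambda seconds: None
)
print(result["status"], len(calls))
```

If the last attempt still fails, the wrapper returns the final error response so the caller can inspect its message.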


## 9. Delete Datasets via API

In [15]:
# Delete the Datasets at the End (OPTIONAL)
DELETE_DATASETS = False

if DELETE_DATASETS:
    for ds_id, dataset in data.items():
        resp = delete_dataset(mapping_dsid2pid[ds_id], api)

## 10. Copy Jupyter Notebook to localhost

Files can be copied out of the Docker container with the `docker cp` command. For this you need the container ID of the running Jupyter Notebook container and the local directory the file should be copied to.

```shell
$ docker ps
$ docker cp CONTAINER_ID:/home/jovyan/work/pyDataverse_workshop_tromso/pydataverse.ipynb YOUR_LOCAL_DIRECTORY
```

## Resources

* [Dataverse API Docs](http://guides.dataverse.org/en/latest/api/index.html)
* [pyDataverse templates](https://github.com/AUSSDA/pyDataverse_templates)