
# Demo: Use pyDataverse for data migrations into Dataverse

**European Dataverse Workshop @ Tromso**

This Jupyter Notebook is part of the European Dataverse Workhshop at Tromso. It offers a little demo, how to use [pyDataverse](https://github.com/AUSSDA/pyDataverse) for a data migration.

* Date: 24th January 2020
* Location: [UiT - The Arctic University of Norway](https://en.uit.no/startsida), Tromsø
* Trainer: Stefan Kasberger from [AUSSDA - The Austrian Social Science Data Archive](https://aussda.at).
* Workshop Materials: [GitHub Repository](https://github.com/AUSSDA/european-dataverse-workshop-tromso)

**Requirements**

* [Dataverse Docker](https://github.com/IQSS/dataverse-docker)
* [Jupyter Docker](https://hub.docker.com/r/jupyter/datascience-notebook)
* [pyDataverse](https://github.com/AUSSDA/pyDataverse) ([develop](https://github.com/AUSSDA/pyDataverse/tree/develop))

**Overview**

What we will do:

* Get a short introduction into the idea of pyDataverse (DONE)
* Prepare the environment for the data migration:
* Explain the pyDataverse templates and its usage
* Import the data from the pyDataverse templates into pyDataverse
* Upload the data via the API to our Dataverse

## 1. Introduction pyDataverse

See the [slides](https://github.com/AUSSDA/european-dataverse-workshop-tromso/presentation.pdf).

## 2. Preparations

Open your local terminal.

TODO

* SCREENSHOT Empty Shell

Download and start Docker container for Jupyter notebook ([jupyter/datascience-notebook](https://hub.docker.com/r/jupyter/datascience-notebook)).

```shell
$ docker run -p 8888:8888 jupyter/scipy-notebook
```

TODO

* SCREENSHOT Command, Result

Go into the bash of the Docker container:

```shell
$ docker ps
$ docker exec -it CONTAINER_ID bash
```

TODO

* SCREENSHOT Command, Result

Once you are inside, you can install [pydataverse](https://github.com/AUSSDA/pyDataverse). To have the latest features, we install from the develop branch.

```shell
$ pip install git+https://github.com/aussda/pyDataverse.git@develop
```

TODO

* SCREENSHOT Command, Result

Then you can download the [pyDataverse templates from GitHub](https://github.com/AUSSDA/pyDataverse_templates), which are needed for the import:

```shell
$ cd work/
$ git clone https://github.com/AUSSDA/pyDataverse_templates.git
```

TODO

* SCREENSHOT Command, Result

Finally, we download the [GitHub Repository for this workshop](https://github.com/AUSSDA/pyDataverse_workshop_tromso), with the Jupyter Notebook inside, prepared for this workshop.

```shell
$ git clone https://github.com/AUSSDA/pyDataverse_workshop_tromso.git
```

TODO

* SCREENSHOT Command, Result

* Open Jupyter Notebook `localhost:8888`.
* Open `work/pyDataverse_workshop_tromso/pydataverse.ipynb`

Now everything is installed and up and running, so we can move on to get our hands on some data.

## 3. pyDataverse templates

TODO
* Show templates empty
* show prepared dataset.csv and datafile.csv
* Beziehung Datasets und Datafiles erklären
* Text abändern

After we added some data to the pyDataverse template files (datasets.csv, datafiles.csv), we can import the containing data into pyDataverse.

## 4. Import Dataset Metadata from templates to pyDataverse

1. Load the needed Python modules.
2. Convert the CSV data to a Python dictionary
3. Create the pyDataverse Dataset object
4. Print out metadata as json string
5. Print out specific metadata variables

In [171]:
# load Python modules
import json
from pyDataverse.models import Datafile
from pyDataverse.models import Dataset
from pyDataverse.utils import read_csv_to_dict
from pyDataverse.utils import read_file
from pyDataverse.utils import read_json
import os
import subprocess as sp
import time

In [172]:
ds_filename = 'datasets.csv'
license_filename = 'license.html'
terms_filename = 'terms-of-access.html'

data = {}
license_default = read_file(license_filename)
datasets_csv = read_csv_to_dict(ds_filename)

for dataset in datasets_csv:
    ds_tmp = {}
    
    ds_tmp['termsOfAccess'] = read_file(terms_filename)
    for key, val in dataset.items():
        if not val == '':
            if key == 'aussda.dataset_id':
                ds_id = val
            elif key == 'dataverse.title':
                ds_tmp['title'] = val
            elif key == 'dataverse.subtitle':
                ds_tmp['subtitle'] = val
            elif key == 'dataverse.author':
                ds_tmp['author'] = json.loads(val)
            elif key == 'dataverse.dsDescription':
                ds_tmp['dsDescription'] = []
                ds_tmp['dsDescription'].append({
                    'dsDescriptionValue': val})
            elif key == 'dataverse.keywordValue':
                ds_tmp['keyword'] = json.loads(val)
            elif key == 'dataverse.topicClassification':
                ds_tmp['topicClassification'] = json.loads(val)
            elif key == 'dataverse.language':
                ds_tmp['language'] = json.loads(val)
            elif key == 'dataverse.subject':
                ds_tmp['subject'] = []
                ds_tmp['subject'].append(val)
            elif key == 'dataverse.kindOfData':
                ds_tmp['kindOfData'] = json.loads(val)
            elif key == 'dataverse.datasetContact':
                ds_tmp['datasetContact'] = json.loads(val)
    data[ds_id] = {'metadata': ds_tmp}

In [173]:
ds_1 = Dataset()
ds_1.set(data['test_1']['metadata'])
ds_2 = Dataset()
ds_2.set(data['test_2']['metadata'])

In [175]:
print('== DS 1 ==')
print('Title:', ds_1.title)
print('Description:', ds_1.dsDescription[0]['dsDescriptionValue'])
print('== DS 2 ==')
print('Title:', ds_2.title)
print('Description:', ds_2.dsDescription[0]['dsDescriptionValue'])

== DS 1 ==
Title: Internet usage 2019
Description: Life Style 2019. Internet usage / media.
== DS 2 ==
Title: Attitute of youth towards school and profession 2018
Description: Attitudes of teenagers about school and job.


## 5. Upload Dataset Metadata via API

First, we have to get an API token for the Dataverse API. Go to [localhost:8085]()

In [176]:
# load pyDataverse functionality
from pyDataverse.api import Api

dv_alias = 'root'
BASE_URL = 'http://localhost:8085'
API_TOKEN = '714d16d5-c375-4051-a173-bc7d29fc0799'

api = Api(BASE_URL, API_TOKEN)
resp = api.get_dataverse(dv_alias)
print(resp.json())

{'status': 'OK', 'data': {'id': 1, 'alias': 'root', 'name': 'Root', 'dataverseContacts': [], 'permissionRoot': True, 'description': 'The root dataverse.', 'dataverseType': 'UNCATEGORIZED', 'creationDate': '2019-03-19T08:44:01Z', 'creator': {'id': 1, 'identifier': '@dataverseAdmin', 'displayName': 'Dataverse Admin', 'firstName': 'Dataverse', 'lastName': 'Admin', 'email': 'stefan.kasberger@univie.ac.at', 'superuser': False, 'affiliation': 'AUSSDA', 'position': 'Admin', 'persistentUserId': 'dataverseAdmin', 'createdTime': '2019-03-19T08:44:01Z', 'lastLoginTime': '2020-01-20T08:56:22Z', 'lastApiUseTime': '2020-01-20T17:27:45Z', 'authenticationProviderId': 'builtin'}}}


In [177]:
mapping_dsid2pid = {}

for ds_id, dataset in data.items():
    ds = Dataset()
    ds.set(dataset['metadata'])
    resp = api.create_dataset(dv_alias, ds.json())
    pid = resp.json()['data']['persistentId']
    mapping_dsid2pid[ds_id] = pid
    time.sleep(1)

Dataset with pid 'doi:10.5072/FK2/L7I0TL' created.
Dataset with pid 'doi:10.5072/FK2/VZI3F7' created.


TODO:
* View created Dataset at localhost. BROWSER

## 6. Import Datafile metadata from templates to pyDataverse

In [178]:
df_filename = 'datafiles.csv'
datafiles_csv = read_csv_to_dict(df_filename)

for datafile in datafiles_csv:
    df_tmp = {}
    df_id = None
    ds_id = None
    for key, val in datafile.items():
        if not val == '':
            if key == 'dataverse.description':
                df_tmp['description'] = val
            elif key == 'aussda.filename':
                df_tmp['filename'] = val
            elif key == 'aussda.datafile_id':
                df_tmp['datafile_id'] = val
                df_id = val
            elif key == 'aussda.dataset_id':
                ds_id = val
                df_tmp['dataset_id'] = ds_id
            elif key == 'dataverse.categories':
                df_tmp['categories'] = json.loads(val)
    if 'datafiles' not in data[ds_id]:
        data[ds_id]['datafiles'] = {}
    if df_id not in data[ds_id]['datafiles']:
        data[ds_id]['datafiles'][df_id] = {}
    if 'metadata' not in data[ds_id]['datafiles'][df_id]:
        data[ds_id]['datafiles'][df_id]['metadata'] = {}
    data[ds_id]['datafiles'][df_id]['metadata'] = df_tmp

In [179]:
df_1 = Datafile()
df_1.set(data['test_1']['datafiles']['1']['metadata'])
df_1.set({'pid': mapping_dsid2pid['test_1']})

print('DICT:', df.dict())
print('PID:', df.pid)
print('FILENAME:', df.filename)

DICT: {'description': 'Documentation: Method report and Codebook ', 'categories': ['Documentation'], 'pid': 'doi:10.5072/FK2/YCLFRP', 'filename': '20002_do_de_v1_0.pdf'}
PID: doi:10.5072/FK2/YCLFRP
FILENAME: 20002_do_de_v1_0.pdf


## 7. Upload Datafile metadata via API

In [180]:
for ds_id, dataset in data.items():
    pid = mapping_dsid2pid[ds_id]
    for df_id, datafile in dataset['datafiles'].items():
        data_tmp = datafile['metadata']
        data_tmp['pid'] = pid
        df = Datafile()
        df.set(data_tmp)
        filename = os.path.abspath(os.path.join('data', datafile['metadata']['filename']))
        path = api.native_api_base_url
        path += '/datasets/:persistentId/add?persistentId={0}'.format(pid)
        shell_command = 'curl -H "X-Dataverse-key: {0}"'.format(API_TOKEN)
        shell_command += ' -X POST {0} -F file=@{1}'.format(path, filename)
        shell_command += " -F 'jsonData={0}'".format(df.json())
        result = sp.run(shell_command, shell=True, stdout=sp.PIPE)
        if filename[-4:] == '.sav' or filename[-4:] == '.dta':
            time.sleep(20)
        else:
            time.sleep(2)

## 8. Publish Datasets via API

In [182]:
print(mapping_dsid2pid)

{'test_1': 'doi:10.5072/FK2/L7I0TL', 'test_2': 'doi:10.5072/FK2/VZI3F7'}


In [183]:
for ds_id, dataset in data.items():
    pid = mapping_dsid2pid[ds_id]
    resp = api.publish_dataset(pid, 'major')
    print(resp.json())

{'status': 'ERROR', 'message': 'This dataset may not be published due to an error when contacting the <a href=http://status.datacite.org target="_blank"/> DataCite </a> Service. Please try again.'}
{'status': 'ERROR', 'message': 'This dataset may not be published due to an error when contacting the <a href=http://status.datacite.org target="_blank"/> DataCite </a> Service. Please try again.'}


## 8. Delete Datasets via API

In [185]:
for ds_id, dataset in data.items():
    pid = mapping_dsid2pid[ds_id]
    resp = api.delete_dataset(pid)
    time.sleep(1)

DatasetNotFoundError: ERROR: HTTP 404 - Dataset 'doi:10.5072/FK2/L7I0TL' was not found. MSG: Dataset with Persistent ID doi:10.5072/FK2/L7I0TL not found.

TODO

* View uploaded Datafiles at localhost. LINK generieren

## Resources

* [Dataverse API Docs](http://guides.dataverse.org/en/latest/api/index.html)