
# Use pyDataverse to migrate data into Dataverse

European Dataverse Workshop @ Tromso

Trainer: Stefan Kasberger from [AUSSDA - The Austrian Social Science Data Archive](https://aussda.at).
[GitHub Repository](https://github.com/AUSSDA/european-dataverse-workshop-tromso)

* [pyDataverse](https://github.com/AUSSDA/pyDataverse) ([develop](https://github.com/AUSSDA/pyDataverse/tree/develop))


[Documentation Dataverse API](http://guides.dataverse.org/en/latest/api/index.html). pyDataverse mostly uses the native API.

## Overview

What we will do:

* Get a short introduction to the idea behind pyDataverse (DONE)
* Prepare the environment for the data migration:
  * Get and run docker container for Jupyter Notebook
  * Install pyDataverse
  * Get pyDataverse templates for Datasets and Datafiles
* Explain the pyDataverse templates and their usage
* Import the data from the pyDataverse templates into pyDataverse
* Upload the data via the API to our Dataverse



## Introduction to pyDataverse

See the [slides](https://github.com/AUSSDA/european-dataverse-workshop-tromso/presentation.pdf).

## Preparations

Open your local terminal.

SCREENSHOT
Empty Shell

Download and start the Docker container for Jupyter Notebook ([jupyter/scipy-notebook](https://hub.docker.com/r/jupyter/scipy-notebook)).

```shell
$ docker run -p 8888:8888 jupyter/scipy-notebook
```

SCREENSHOT
Command, Result

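The `docker run` command keeps running in the foreground, so open a second terminal and start a bash shell inside the container. `docker ps` lists the running containers; replace `CONTAINER_ID` with the ID it shows: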

```shell
$ docker ps
$ docker exec -it CONTAINER_ID bash
```

SCREENSHOT
Command, Result

Once you are inside, you can install [pyDataverse](https://github.com/AUSSDA/pyDataverse). To get the latest features, we install from the develop branch.

```shell
$ pip install git+https://github.com/aussda/pyDataverse.git@develop
```

SCREENSHOT
Command, Result
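
To confirm that the installation worked, you can check from Python whether the package is importable. This is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once the pip command above has finished successfully.
print("pyDataverse installed:", is_installed("pyDataverse"))
```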

Then you can download the [pyDataverse templates from GitHub](https://github.com/AUSSDA/pyDataverse_templates), which are needed for the import:

```shell
$ cd work/
$ git clone https://github.com/AUSSDA/pyDataverse_templates.git
```

SCREENSHOT
Command, Result

Finally, we download the [GitHub Repository for this workshop](https://github.com/AUSSDA/pyDataverse_workshop_tromso), with the Jupyter Notebook inside, prepared for this workshop.

```shell
$ git clone https://github.com/AUSSDA/pyDataverse_workshop_tromso.git
```

SCREENSHOT
Command, Result

* Open Jupyter Notebook at `localhost:8888` (the output of `docker run` shows the login URL including its token).
* Open `work/pyDataverse_workshop_tromso/pydataverse.ipynb`

Now everything is installed and up and running, so we can move on to get our hands on some data.

## The pyDataverse templates

In [1]:
* Show the empty templates
* Show the prepared datasets.csv and datafiles.csv
* Explain the relationship between Datasets and Datafiles
* Revise the text


## Import the pyDataverse template files into pyDataverse

After adding data to the pyDataverse template files (`datasets.csv`, `datafiles.csv`), we can import the data they contain into pyDataverse.
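
To illustrate what a template row looks like once it is read, here is a standard-library sketch that parses CSV text into one dict per row. It only shows the general idea; pyDataverse's own `read_csv_to_dict` may behave differently. The sample rows, the second title, and the helper name `read_rows` are hypothetical; the column names follow the ones used in this notebook.

```python
import csv
import io
import json

# Hypothetical excerpt of a datasets.csv template.
# Cells with structured values hold JSON; empty cells stay empty strings.
SAMPLE_CSV = '''aussda.dataset_id,dataverse.title,dataverse.keywordValue
1,Internet usage 2019,"[{""keywordValue"": ""internet""}]"
2,Example study,
'''


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row, keyed by the header."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE_CSV)
# JSON cells need an explicit json.loads() before they can be used.
keywords = json.loads(rows[0]['dataverse.keywordValue'])
print(rows[0]['dataverse.title'], keywords)
```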

Load pyDataverse functions.

In [11]:
# load pyDataverse functionality
from pyDataverse.models import Datafile
from pyDataverse.models import Dataset
from pyDataverse.utils import read_csv_to_dict
from pyDataverse.utils import read_file
from pyDataverse.utils import read_json
import os
import json

In [None]:
* Show data in pyDataverse
* add license, ToU

In [17]:
ds_filename = 'datasets.csv'
license_filename = 'license.html'
terms_filename = 'terms-of-access.html'

data = {}
# read both texts once; the license is not attached to the metadata yet
license_default = read_file(license_filename)
terms_default = read_file(terms_filename)
datasets_csv = read_csv_to_dict(ds_filename)

for dataset in datasets_csv:
    # reset per row, so a missing ID can not reuse the one from the previous row
    ds_id = None
    ds_tmp = {}

    ds_tmp['termsOfAccess'] = terms_default
    for key, val in dataset.items():
        # skip empty cells
        if val != '':
            if key == 'aussda.dataset_id':
                ds_id = val
            elif key == 'dataverse.title':
                ds_tmp['title'] = val
            elif key == 'dataverse.subtitle':
                ds_tmp['subtitle'] = val
            elif key == 'dataverse.author':
                ds_tmp['author'] = json.loads(val)
            elif key == 'dataverse.dsDescription':
                ds_tmp['dsDescription'] = []
                ds_tmp['dsDescription'].append({
                    'dsDescriptionValue': val})
            elif key == 'dataverse.keywordValue':
                ds_tmp['keyword'] = json.loads(val)
            elif key == 'dataverse.topicClassification':
                ds_tmp['topicClassification'] = json.loads(val)
            elif key == 'dataverse.language':
                ds_tmp['language'] = json.loads(val)
            elif key == 'dataverse.subject':
                ds_tmp['subject'] = []
                ds_tmp['subject'].append(val)
            elif key == 'dataverse.kindOfData':
                ds_tmp['kindOfData'] = json.loads(val)
            elif key == 'dataverse.datasetContact':
                ds_tmp['datasetContact'] = json.loads(val)
    if ds_id is None:
        print('WARNING: row without aussda.dataset_id skipped.')
        continue
    data[ds_id] = {'metadata': ds_tmp}
#print(data)
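
The column-by-column mapping in the cell above can also be written with lookup tables, which keeps the list of supported columns in one place. This is only a sketch of that alternative: the function name `row_to_metadata` and the two table names are made up, the terms-of-access text is left out, and the sample row is hypothetical.

```python
import json

# CSV columns whose cell is copied as a plain string.
PLAIN_COLUMNS = {
    'dataverse.title': 'title',
    'dataverse.subtitle': 'subtitle',
}
# CSV columns whose cell holds JSON and must be parsed first.
JSON_COLUMNS = {
    'dataverse.author': 'author',
    'dataverse.keywordValue': 'keyword',
    'dataverse.topicClassification': 'topicClassification',
    'dataverse.language': 'language',
    'dataverse.kindOfData': 'kindOfData',
    'dataverse.datasetContact': 'datasetContact',
}


def row_to_metadata(row):
    """Convert one template row into (dataset_id, metadata dict)."""
    ds_id = None
    meta = {}
    for key, val in row.items():
        if val == '':
            continue  # empty cells carry no information
        if key == 'aussda.dataset_id':
            ds_id = val
        elif key in PLAIN_COLUMNS:
            meta[PLAIN_COLUMNS[key]] = val
        elif key in JSON_COLUMNS:
            meta[JSON_COLUMNS[key]] = json.loads(val)
        elif key == 'dataverse.dsDescription':
            meta['dsDescription'] = [{'dsDescriptionValue': val}]
        elif key == 'dataverse.subject':
            meta['subject'] = [val]
    return ds_id, meta


# Hypothetical row, shaped like the output of the CSV import.
example_row = {
    'aussda.dataset_id': '1',
    'dataverse.title': 'Internet usage 2019',
    'dataverse.subtitle': '',
    'dataverse.language': '["de"]',
    'dataverse.subject': 'Social Sciences',
}
example_id, example_meta = row_to_metadata(example_row)
print(example_id, example_meta)
```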

In [19]:
df_filename = 'datafiles.csv'
datafiles_csv = read_csv_to_dict(df_filename)

for datafile in datafiles_csv:
    df_tmp = {}
    df_id = None
    ds_id = None
    for key, val in datafile.items():
        if not val == '':
            if key == 'dataverse.description':
                df_tmp['description'] = val
            elif key == 'aussda.filename':
                df_tmp['filename'] = val
            elif key == 'aussda.datafile_id':
                df_tmp['datafile_id'] = val
                df_id = val
            elif key == 'aussda.dataset_id':
                ds_id = val
                df_tmp['dataset_id'] = ds_id
            elif key == 'dataverse.categories':
                df_tmp['categories'] = val
    if 'datafiles' not in data[ds_id]:
        data[ds_id]['datafiles'] = {}
    if df_id not in data[ds_id]['datafiles']:
        data[ds_id]['datafiles'][df_id] = {}
    if 'metadata' not in data[ds_id]['datafiles'][df_id]:
        data[ds_id]['datafiles'][df_id]['metadata'] = {}
    data[ds_id]['datafiles'][df_id]['metadata'] = df_tmp
#print(data)
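
The cell above nests each Datafile under its Dataset as `data[ds_id]['datafiles'][df_id]['metadata']`. A compact way to build the same nesting is `dict.setdefault`, combined with an explicit error when a Datafile points to a Dataset that was never imported. The helper name `attach_datafile` and the sample values are hypothetical.

```python
def attach_datafile(data, ds_id, df_id, metadata):
    """Store datafile metadata under data[ds_id]['datafiles'][df_id]['metadata']."""
    if ds_id not in data:
        raise KeyError(
            'Datafile {0} references unknown Dataset {1}'.format(df_id, ds_id))
    # setdefault creates the intermediate dicts only when they are missing
    datafiles = data[ds_id].setdefault('datafiles', {})
    datafiles.setdefault(df_id, {})['metadata'] = metadata
    return data


# Hypothetical example: one Dataset with one Datafile.
example_data = {'1': {'metadata': {'title': 'Internet usage 2019'}}}
attach_datafile(example_data, '1', '10',
                {'filename': 'example.csv', 'dataset_id': '1'})
print(example_data['1']['datafiles'])
```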

In [20]:
ds_1 = Dataset()
ds_1.set(data['1']['metadata'])
ds_2 = Dataset()
ds_2.set(data['2']['metadata'])

In [21]:
ds_1.dict()

{'datasetVersion': {'metadataBlocks': {'citation': {'fields': [{'typeName': 'kindOfData',
      'multiple': True,
      'typeClass': 'primitive',
      'value': ['numeric']},
     {'typeName': 'language',
      'multiple': True,
      'typeClass': 'controlledVocabulary',
      'value': ['de']},
     {'typeName': 'subject',
      'multiple': True,
      'typeClass': 'controlledVocabulary',
      'value': ['Social Sciences']},
     {'typeName': 'subtitle',
      'multiple': False,
      'typeClass': 'primitive',
      'value': 'How Austrians use the internet in 2019.'},
     {'typeName': 'title',
      'multiple': False,
      'typeClass': 'primitive',
      'value': 'Internet usage 2019'},
     {'typeName': 'author',
      'multiple': True,
      'typeClass': 'compound',
      'value': [{'authorName': {'typeName': 'authorName',
         'typeClass': 'primitive',
         'multiple': False,
         'value': 'Ausserhofer, Julian'},
        'authorAffiliation': {'typeName': 'authorAffilia

In [22]:
ds_2.json()

'{\n  "datasetVersion": {\n    "metadataBlocks": {\n      "citation": {\n        "fields": [\n          {\n            "typeName": "kindOfData",\n            "multiple": true,\n            "typeClass": "primitive",\n            "value": [\n              "numeric"\n            ]\n          },\n          {\n            "typeName": "language",\n            "multiple": true,\n            "typeClass": "controlledVocabulary",\n            "value": [\n              "de"\n            ]\n          },\n          {\n            "typeName": "subject",\n            "multiple": true,\n            "typeClass": "controlledVocabulary",\n            "value": [\n              "Social Sciences"\n            ]\n          },\n          {\n            "typeName": "subtitle",\n            "multiple": false,\n            "typeClass": "primitive",\n            "value": "How Austrian teenagers see school and professional time"\n          },\n          {\n            "typeName": "title",\n            "multiple": 
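
Both outputs above follow the same layout: a `datasetVersion` with a `citation` metadata block, whose `fields` list holds one entry per metadata field with `typeName`, `multiple`, `typeClass` and `value`. The following sketch rebuilds a small part of that layout by hand to make the structure easier to read. It is an illustration only, not how pyDataverse generates its output, and the helper names are made up.

```python
import json


def primitive_field(type_name, value, multiple=False):
    """Build one field entry of type class 'primitive'."""
    return {
        'typeName': type_name,
        'multiple': multiple,
        'typeClass': 'primitive',
        'value': value,
    }


def citation_block(fields):
    """Wrap a list of field entries in the datasetVersion envelope."""
    return {'datasetVersion': {'metadataBlocks': {'citation': {'fields': fields}}}}


doc = citation_block([
    primitive_field('title', 'Internet usage 2019'),
    primitive_field('subtitle', 'How Austrians use the internet in 2019.'),
    primitive_field('kindOfData', ['numeric'], multiple=True),
])
print(json.dumps(doc, indent=2))
```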

In [35]:
# load pyDataverse functionality
from pyDataverse.api import Api

dv_alias = 'root'
base_url = 'http://localhost:8085'
api_token = '' # get API_TOKEN

api = Api(base_url, api_token)
#api.dataverse_version
#api.get_dataverse(dv_alias)

ERROR: Could not establish connection to url http://localhost:8085/api/v1/info/server HTTPConnectionPool(host='localhost', port=8085): Max retries exceeded with url: /api/v1/info/server (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f239ddc8da0>: Failed to establish a new connection: [Errno 111] Connection refused')).
Dataverse build version can not be retrieved.
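
The error above means that nothing accepted the connection at `base_url`. One possible cause in this setup: inside the Docker container, `localhost` refers to the container itself, not to the host machine. Below is a minimal standard-library sketch for checking reachability before creating the `Api` object. The URL path is taken from the error message; the helper names `info_url` and `is_reachable` are made up for this example.

```python
import urllib.error
import urllib.request


def info_url(base_url):
    """Build the server info URL that appears in the error message."""
    return base_url.rstrip('/') + '/api/v1/info/server'


def is_reachable(base_url, timeout=5):
    """Return True if the Dataverse server answers with HTTP 200."""
    try:
        with urllib.request.urlopen(info_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # connection refused, timeout, DNS failure or HTTP error status
        return False
```

Calling `is_reachable('http://localhost:8085')` should return `False` as long as the connection is refused.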


In [None]:
import time

# collects the identifier assigned by Dataverse for each local dataset ID
update_dict = {}

for ds_id, ds in [('1', ds_1), ('2', ds_2)]:
    resp = api.create_dataset(dv_alias, ds.json())
    resp_json = resp.json()
    if 'status' in resp_json:
        if resp_json['status'] == 'OK':
            if 'data' in resp_json:
                if 'persistentId' in resp_json['data']:
                    pid = resp_json['data']['persistentId']
                    update_dict[ds_id] = {'aussda_doi': pid}
                elif 'id' in resp_json['data']:
                    # no PID in the response: keep the database ID instead
                    dataset_id = resp_json['data']['id']
                    update_dict[ds_id] = {'dataverse_datasetId': str(dataset_id)}
                else:
                    print('ERROR: Create Dataset {0} - no \'persistentId\' in API response.'.format(ds_id))
            else:
                print('ERROR: Create Dataset {0} - no \'data\' in API response.'.format(ds_id))
        else:
            print('ERROR: Create Dataset {0} API Request Status not OK. STATUS = {1}, MSG = {2}'.format(ds_id, resp_json['status'], resp_json.get('message')))
    else:
        print('ERROR: Create Dataset {0} API Request not working.'.format(ds_id))
    # short pause between two requests
    time.sleep(1)
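
The nested checks on the API response can be collected in one small function that works on the parsed JSON only, which makes them easy to try out without a running Dataverse. This is a sketch: the function name `extract_pid` is made up, and the sample responses below are hypothetical, modelled on the keys checked in the cell above.

```python
def extract_pid(resp_json):
    """Return (pid, error); exactly one of the two values is None."""
    if 'status' not in resp_json:
        return None, 'no status in API response'
    if resp_json['status'] != 'OK':
        return None, 'status not OK: {0}'.format(resp_json['status'])
    data = resp_json.get('data')
    if data is None:
        return None, "no 'data' in API response"
    if 'persistentId' not in data:
        return None, "no 'persistentId' in API response"
    return data['persistentId'], None


# Hypothetical responses, for illustration only.
ok_response = {'status': 'OK', 'data': {'id': 1, 'persistentId': 'doi:10.5072/FK2/EXAMPLE'}}
error_response = {'status': 'ERROR', 'message': 'something went wrong'}
print(extract_pid(ok_response))
print(extract_pid(error_response))
```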

In [None]:
ds = Dataset()
ds.import_metadata(ds_filename)
ds.title
ds.description
ds.dict()
ds.json()

df = Datafile()
df.dict()


## Store metadata as JSON

Save the metadata of each Dataset and Datafile locally as JSON.

In [None]:
dataset

In [None]:
datafile
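
One possible way to store the metadata: take the JSON string (for example the return value of `ds_1.json()`), check that it parses, and write it to a file. The sketch below uses only the standard library; the helper names `save_json` and `load_json`, the filename, and the sample string are hypothetical.

```python
import json
import os
import tempfile


def save_json(json_str, filename):
    """Validate a JSON string and write it to `filename` as UTF-8."""
    parsed = json.loads(json_str)  # raises ValueError if the string is not JSON
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(parsed, f, indent=2, ensure_ascii=False)
    return filename


def load_json(filename):
    """Read a JSON file back into Python objects."""
    with open(filename, 'r', encoding='utf-8') as f:
        return json.load(f)


# Demo in a temporary directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = save_json('{"title": "Internet usage 2019"}',
                     os.path.join(tmp_dir, 'dataset_1.json'))
    restored = load_json(path)
print(restored)
```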

## Upload the data to Dataverse via API

First, we need an API token for the Dataverse API. Go to [localhost:8085](http://localhost:8085), log in, and copy the token from the API Token page in your account menu.

In [None]:
api_token = 'SECRET'

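
To keep the token out of the notebook, it can be read from an environment variable instead of being written into a cell. A minimal sketch, assuming a variable named `DATAVERSE_API_TOKEN`; both this name and the helper `get_api_token` are made up for this example.

```python
import os


def get_api_token(env_var='DATAVERSE_API_TOKEN'):
    """Read the API token from the environment; fail loudly if it is missing."""
    token = os.environ.get(env_var, '').strip()
    if not token:
        raise RuntimeError(
            'Set the {0} environment variable first.'.format(env_var))
    return token
```

In the notebook, `api_token = get_api_token()` would then replace the hard-coded string.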

In [None]:
# set dataverse, in which the dataset should be created
dv_alias = 'root'

resp = api.get_dataverse(dv_alias)
resp.json()
resp = api.create_dataset(dv_alias, ds.json())

View created Dataset at localhost.

In [None]:
api.upload_datafile()

View uploaded Datafiles at localhost.