iConference2019

Towards More Reproducible Data Wrangling with OpenRefine

Overview

OpenRefine (OR) is a popular data wrangling tool. During data cleaning , not only a processed dataset will be generated but also some other provenance-related byproducts. One of them is a native OR recipe (JSON file), and the other one is the Operation History (OH) that lists a series of human-readable data cleaning steps, both of which promote research transparency to some extent by containing some (but incomplete) prospective provenance and partial retrospective provenance information ¹, making them difficult to be directly used for reuse and Reproducibility from OpenRefine Web API. In this poster, a prototype consisting of two sub-systems, one of which extends the native OR recipe to generate a complete recipe (a.k.a. enhanced receipt) followed by the second re-runner system, is created to complement the missing information between the actual data cleaning operations and the native OR recipe, which meanwhile facilitates transparency, reproducibility and reusability.

Prototype

The prototype includes two sub-systems.

1.Command-Line OpenRefine Prototype for Enhanced Recipe （CLOPER）

CLOPER aims to enhance transparency and reusability of the native OR recipe, which reads in the original "messy" dataset (d1.csv) and communicates with an OR server through the interface provided by the OR-client. The outputs consist of three products: an enhanced recipe (EnhancedRecipe.JSON) is generated at the back-end; a "cleaned" dataet (d2.csv) and a native OR recipe (NativeORRecipe.JSON) are exported from the OR web UI.

2.Enhanced Recipe Re-Runner (ERRR)

In regards to reproducibility, ERRR re-implements the enhanced recipe (EnhancedRecipe.JSON) that is derived from CLOPER, applies to the same original "messy" dataset (d1.csv). Again ERRR connects to an OR server via OR-client and obtains the same output (d2.csv) associated with the native OR recipe (NativeORRecipe.JSON).

Run prototype on your own computer

To take advantage of all of the sub-systems in the prototype, you need install the software on your computer

1.Check installed version of Python

This prototype requires Python version 2.7. To determine the version of Python installed on your computer use the version option to the python command. For example,

 $ python 
  Python 2.7.10 (default, Jul 15 2017, 17:16:57) 
  [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>>

Instructions for installing Python may be found at Python Download Website.

2.Project directory layout

The directories and the descriptions are listed below:

Directory	Description
OR_Client_Library	The OpenRefine Python Client Library provides an interface to communicating with an OpenRefine server.
CLOPER&ERRR/Dataset	Input "messy" dataset csv file.
CLOPER&ERRR/Enhanced_JSON	Enhanced Recipe generated by CLOPER system.
CLOPER&ERRR/OR_JSON	Native OR Recipe generated by OpenRefine server.
CLOPER&ERRR/OutputDataset	Ouput "clean(ed)" dataset aftering running enhanced recipe.
CLOPER&ERRR/CLOPER.py	Command Line OpenRefine Prototype for Enhanced Recipe.
CLOPER&ERRR/ERRR.py	Enhanced Recipe Re-Runner.
CLOPER&ERRR/OpenRefineOperations.py	Communicate with OR-client library and provides a list of operations to CLOPER and ERRR.
CLOPER&ERRR/Reproducibility.sh	Bash file for testing the reproducibility of ERRR
README/PNG	Figures in Readme.md

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
CLOPER&ERRR		CLOPER&ERRR
OR_Client_Library		OR_Client_Library
README/PNG		README/PNG
poster-abstract		poster-abstract
.gitignore		.gitignore
NOTICE		NOTICE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

iConference2019

Towards More Reproducible Data Wrangling with OpenRefine

Overview

Prototype

1.Command-Line OpenRefine Prototype for Enhanced Recipe （CLOPER）

2.Enhanced Recipe Re-Runner (ERRR)

Run prototype on your own computer

1.Check installed version of Python

2.Project directory layout

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

iConference2019

Towards More Reproducible Data Wrangling with OpenRefine

Overview

Prototype

1.Command-Line OpenRefine Prototype for Enhanced Recipe （CLOPER）

2.Enhanced Recipe Re-Runner (ERRR)

Run prototype on your own computer

1.Check installed version of Python

2.Project directory layout

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages