# Managing Data with OpenRefine

*Digital Literacy Support Workshop, October 23rd 2020*

Today's workshop will introduce how to use OpenRefine for data management. The purpose of the workshop is both to introduce OpenRefine as a data management tool and to teach data management practices that are common within quantitative data analysis.

**Main teaching objective of the workshop**: Understand the common challenges in preparing data for analysis and solve these common challenges using OpenRefine.

## Schedule

08.30-09.00: Support for installation and preparations for the workshop

09.00-09.15: Introduction: Data, data sets and data management

09.15-09.25: What is OpenRefine?

09.25-10.00: Introduction to working with OpenRefine

10.00-10.10: Break

10.10-10.30: Filtering and sorting with OpenRefine

10.30-10.50: Examining numbers in OpenRefine

10.50-11.05: Exporting and saving data from OpenRefine

11.05-11.15: Break

11.15-11.25: Saving your work as a script

11.25-11.50: Data management in Python with [Pandas](https://pandas.pydata.org/) - A brief glimpse

11.50-12.00: Wrap-up and final questions

# Data, data sets and data management

Working with data is of course not a new practice. However, the growing amount of information being produced and digitized has increased the popularity of computational data analysis (e.g. the use of machine learning). This is due to an increase in digitized material (newspapers, books, texts), an increase in digitally produced data (social media data) and a general increase in volume, which necessitates methods for providing a better overview and gaining insights.

The increased popularity of "AI", machine learning and automation can lead one to the erroneous assumption that these tools can simply be applied to any data willy-nilly. While these tools can be applied to all sorts of data (numerical, text, images, video, audio and so on), applying data analysis techniques (machine learning or not) is almost never plug-and-play.

*Data has to be pre-processed so that it is "compatible" with the given method*

Pre-processing includes steps like making corrections (fixing errors in the data), filtering (removing data not suitable for the analysis technique) and mutating (combining or aggregating information in the data). These steps are broadly referred to as data management, data wrangling or data manipulation (although the latter has other connotations as well...).
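As a rough illustration, the three steps could look like this in plain Python. The records and the field names (`name`, `age`, `city`) are invented for this sketch and do not come from any workshop data set.

```python
from collections import Counter

# Hypothetical raw records; the values are made up for this sketch.
records = [
    {"name": "Lars", "age": "34", "city": "Aalborg"},
    {"name": "Gertrud", "age": "62", "city": "aalborg "},
    {"name": "Henning", "age": "", "city": "Aarhus"},
    {"name": "Agnes", "age": "38", "city": "Aarhus"},
]

# 1. Correcting: normalise inconsistent spelling and whitespace in "city".
corrected = [{**r, "city": r["city"].strip().title()} for r in records]

# 2. Filtering: drop records without an age, then convert age to a number.
filtered = [{**r, "age": int(r["age"])} for r in corrected if r["age"]]

# 3. Mutating: aggregate the records into a count per city.
per_city = Counter(r["city"] for r in filtered)

print(per_city)
```

Each step produces a new list rather than changing `records`, so the raw data stays untouched.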

## Data and data sets

*Data* can be understood broadly as a type of information about something. Name, age, education, income, favourite movie, forum posts, e-mails, most played songs in 2020, number of times travelling by bus, supermarket transactions, house sizes, energy consumption. All such information can be analyzed to say something about people, organizations, discourses, trends and the like.

Data as a more or less continuous stream of arbitrary information is however very difficult to work with. It always has to be systematized in some way - regardless of whether it is to be analyzed by a computer or by a person.

When referring to a data set, one usually refers to a delimited amount of data that is systematized in some way.

## Structured and unstructured data

It is common to distinguish between *structured* and *unstructured* data.

**Structured data**

*Structured data* is data that is systematized in some way. The "classic" representation of structured data is data in a tabular format. In such a format, each row contains an *observation* and each column contains a *variable* (statistics terminology) or *feature* (computer science / machine learning terminology).

An observation can be a person, country, text, company, data, municipality etc. while a variable/feature is some information about the given observation.

The table below shows an example of structured data:

|Name |Age |Occupation |
|-----|------|--------------|
|Lars | 34 | Butcher|
|Gertrud | 62 | Consultant|
|Henning | 43 | Accountant|
|Agnes | 38 | Carpenter|

Structured data is characterised by being systematized in such a way that it is almost immediately suitable for some sort of analysis or inquiry (e.g. how many observations are above the age of 40, who has a first name starting with "H", who works in construction).
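To make this concrete, here is a minimal sketch of those three questions asked of the example table using pandas. The variable names and the list of construction occupations are assumptions made for illustration; the table itself has no "sector" variable.

```python
import pandas as pd

# The example table from above as a pandas DataFrame.
people = pd.DataFrame(
    {
        "Name": ["Lars", "Gertrud", "Henning", "Agnes"],
        "Age": [34, 62, 43, 38],
        "Occupation": ["Butcher", "Consultant", "Accountant", "Carpenter"],
    }
)

# How many observations are above the age of 40?
n_over_40 = int((people["Age"] > 40).sum())

# Who has a first name starting with "H"?
starts_with_h = people.loc[people["Name"].str.startswith("H"), "Name"].tolist()

# Who works in construction? Assumption: a hand-made list of occupations
# that count as construction, since the data does not say so itself.
construction_jobs = ["Carpenter"]
in_construction = people.loc[
    people["Occupation"].isin(construction_jobs), "Name"
].tolist()

print(n_over_40, starts_with_h, in_construction)
```

Note that the last question already requires a small recoding decision (which occupations count as construction), even though the data is structured.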

**Unstructured data**

*Unstructured data* is data which - in a nutshell - is not systematized. Text, images and video are typical examples of unstructured data, as these consist of raw information without any way of separating one type of information from the other. Many modern data analysis techniques focus on unstructured data, where one either develops techniques for providing some broad overview or techniques for systematizing the data in some way.

Below is an example of unstructured data:

```
["Hvorfor går man ikke i dialog med ⁦@DRC_dk⁩ i stedet for at opsige kontrakten uden varsel. Er det kun for at føre stærk mand politik? DRC yder en fremragende indsats på baggrund af den opgave de har fået #dkpol https://jyllands-posten.dk/indland/ECE12248020/tesfaye-forsoeger-sig-med-en-ny-loesning-paa-alle-udlaendingeministres-problem/ …",
"Alle tæller ❤️ https://twitter.com/cekicozlem/status/1276034922587832326 …",
"Det er så godt arbejde💚 https://twitter.com/fannybroholm/status/1275360842847080449 …",
"Tilfreds med den klima og energiaftale, der er lavet nu. Det er den første delaftale om at nå 70% reduktion i 2030. Særligt glad for at den indeholder principaftale om en CO2 afgiftsreform #dkpol #dkgreen pic.twitter.com/3slrMxLT5B",
"Godt første skridt for den fri natur #dkpol #dkgreen ⁦@alternativet_⁩ https://www.altinget.dk/miljoe/artikel/wermelin-lander-aftale-om-de-foerste-naturnationalparker …",
"Spændende udmelding. ⁦@alternativet_⁩ ønsker også en grøn   Klimaafgift, hvor udgangspunktet er at forureneren betaler #dkgreen #dkpol https://www.altinget.dk/artikel/venstre-og-radikale-laegger-faelles-pres-paa-regeringen-vil-have-ensartet-co2-afgift?SNSubscribed=true&ref=newsletter&refid=fredag-middag-190620&utm_campaign=altingetdk%20Altinget.dk&utm_medium%09=e-mail&utm_source=nyhedsbrev …",
"Så vigtigt at KL tager ansvar for den proces #dkpol #dkgreen https://www.altinget.dk/miljoe/artikel/professor-om-affaldsaftale-kl-og-kommunerne-skal-gribe-chancen-for-at-loese-tingene-selv …",
"Hurra - stor dag for Danmark💚👏🏼👏🏼 https://twitter.com/alternativet_/status/1273555055476723713 …",
"Til klimaforhandlinger i Finansministeriet. Vi sidder og diskuterer rammerne - de næste dage bliver intensive #dkpol #dkgreen @alternativet_ @ Christiansborg Palace  https://www.instagram.com/p/CBi3d0oB9lB/?igshid=ii78cjnx2n72 …",
"Aftale om mindre affald, mindre forbrænding og mere genbrug - god dag for klimaet og miljøet. 1. skridt i en stor miljøpakke #dkpol ⁦@alternativet_⁩ https://www.dr.dk/nyheder/indland/live-regeringen-praesenterer-ny-aftale-om-affald …"]
```

One can still apply the concept of an observation to the data above (in this case a tweet from a Danish politician) but there are no variables or features provided. Therefore there is no immediate structure making it possible to analyze it.
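One way to systematize such data is to derive variables from each observation. The sketch below uses two of the tweets above (with invisible formatting characters removed) and extracts hashtags, mentions and a URL count using regular expressions. The patterns and the function name `tweet_to_row` are simple assumptions for illustration, not a complete tweet parser.

```python
import re

# Two of the tweets from the example above.
tweets = [
    "Godt første skridt for den fri natur #dkpol #dkgreen @alternativet_ "
    "https://www.altinget.dk/miljoe/artikel/wermelin-lander-aftale-om-de-foerste-naturnationalparker …",
    "Så vigtigt at KL tager ansvar for den proces #dkpol #dkgreen "
    "https://www.altinget.dk/miljoe/artikel/professor-om-affaldsaftale-kl-og-kommunerne-skal-gribe-chancen-for-at-loese-tingene-selv …",
]

# Deliberately simple patterns; real tweets may need more careful rules.
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")


def tweet_to_row(text):
    """Turn one raw tweet into a dict of simple variables."""
    urls = URL.findall(text)
    # Remove URLs first so "#" or "@" inside a link is not counted.
    text_without_urls = URL.sub("", text)
    return {
        "hashtags": HASHTAG.findall(text_without_urls),
        "mentions": MENTION.findall(text_without_urls),
        "n_urls": len(urls),
    }


rows = [tweet_to_row(t) for t in tweets]
for row in rows:
    print(row)
```

Each tweet is now an observation with three variables, which is the first step from unstructured towards structured data.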

## From data to data analysis

Whether data is structured or unstructured, it is almost never immediately ready for analysis. Data almost always has to be adapted to the analysis one wants to conduct and to the research question one wants to explore.

For structured data this involves identifying the relevant parts of the data, correcting errors and recoding information.

For unstructured data it involves systematizing or standardizing the information in a way so that it can be analyzed.

# What is OpenRefine?

https://datacarpentry.org/openrefine-socialsci/01-introduction/index.html

OpenRefine is a free, open source data management tool for working with structured data. It provides a lot of easy to use features for performing some of the most common data management tasks (correcting errors, filtering, recoding etc.).

## Motivations for learning OpenRefine


- Data is often very messy. OpenRefine provides a set of tools to allow you to identify and amend the messy data.
- It is important to know what you did to your data. Additionally, journals, granting agencies, and other institutions are requiring documentation of the steps you took when working with your data. With OpenRefine, you can capture all actions applied to your raw data and share them with your publication as supplemental material.
- All actions are easily reversed in OpenRefine.
- If you save your work it will be to a new file. OpenRefine always uses a copy of your data and does not modify your original dataset.
- Data cleaning steps often need repeating with multiple files. OpenRefine keeps track of all of your actions and allows them to be applied to different datasets.
- Some concepts such as clustering algorithms are quite complex, but OpenRefine makes it easy to introduce them, use them, and show their power.
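To give a feel for what clustering means here, the sketch below groups values that are probably the same by reducing each one to a normalised key, in the spirit of a "fingerprint" style key-collision method. It is a simplified illustration under stated assumptions and not OpenRefine's actual implementation; the function name and the example values are invented.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Simplified key: trim, lowercase, strip punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


# Hypothetical messy spellings of two occupations.
values = [
    "Carpenter",
    "carpenter ",
    "Carpenter.",
    "Tax Accountant",
    "accountant, tax",
]

# Values that share a key end up in the same cluster.
clusters = defaultdict(list)
for v in values:
    clusters[fingerprint(v)].append(v)

for key, members in clusters.items():
    print(key, "->", members)
```

Values in the same cluster are candidates for being merged into one spelling; a person still decides whether the merge is correct.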

## Features


- Open source ([source on GitHub](https://github.com/OpenRefine/OpenRefine)).
- A large, growing community, from novice to expert, ready to help. See the "Getting help for OpenRefine" section below.
- Works with large-ish datasets (100,000 rows). Can adjust memory allocation to accommodate larger datasets.
- OpenRefine always keeps your data private on your own computer until you choose to share it. It works by running a small server on your computer and using your web browser to interact with it, but your private data never leaves your computer unless you want it to.

## Getting help for OpenRefine

You can find out a lot more about OpenRefine at http://openrefine.org and check out some great introductory videos. These videos and others on OpenRefine can also be found on YouTube by searching for 'OpenRefine'. There is a [Google Group](https://groups.google.com/forum/?hl=en#!forum/openrefine) that can answer a lot of beginner questions and problems. Help can also be found on [StackOverflow](https://stackoverflow.com/questions/tagged/openrefine). As with other programs of this type, OpenRefine libraries are available too, where you can find a script you need and copy it into your OpenRefine instance to run it on your dataset.