# Data as objects and architectures

Sections:

- Data types, files, and tables
- Data table efficiency
- Standardized data organization

This lecture draws from Wickham, Hadley. "Tidy data." Journal of Statistical Software 59.10 (2014): 1-23 and Gorgolewski, K. J., Auer, T., Calhoun, V. D., Craddock, R. C., Das, S., Duff, E. P., ... & Handwerker, D. A. (2016). The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Scientific Data, 3, 160044.


# 1. Data types, files, and tables

We don't typically think about the nature of the data that we interact with in science. We just think of it as a singular entity "data". But not all data is created equal. In this section we will consider what types of data you will deal with and how it can be stored or accessed.



## Data types

Let's first think of the type of data that we typically collect in psychology and neuroscience. Here is a rough list of the types of variables we typically consider:

* reaction times
* accuracy
* choice
* treatment group
* trial duration
* time on task

Of course there are many more, but this is just an example list. Notice that these are not all the same _type_ of data. They all, however, fall into one of two classes of data types:

* **Quantitative**: Numerical values that represent some kind of direct measurement with meaningful distances between units. 

* **Qualitative**: Category or label values that place an individual into one of several groups. Each observation can be placed in only one category, and the categories are mutually exclusive.


In the list above, _reaction time, trial duration, and time on task_ all reflect **quantitative** data, while _accuracy (correct/incorrect), choice, and treatment group_ all reflect **qualitative** data.

Understanding the data type of all of your variables is critical because it defines the assumptions you make in your statistical methods. *Therefore the first step in data science is knowing what your data is in the first place.*
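
As a rough illustration of the two classes, the sketch below sorts the fields of one trial into quantitative or qualitative based on the Python type of each value. The record, its field names, and the `data_class` helper are all hypothetical and exist only for this example.

```python
# Hypothetical single-trial record; names and values are made up for illustration.
trial = {
    "reaction_time": 0.482,        # seconds
    "trial_duration": 2.0,         # seconds
    "accuracy": True,              # correct / incorrect
    "choice": "left",
    "treatment_group": "control",
}

def data_class(value):
    """Classify a value as quantitative (numeric) or qualitative (label)."""
    # bool is a subclass of int in Python, so exclude it explicitly:
    # True/False are labels, not measurements.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "quantitative"
    return "qualitative"

classes = {name: data_class(value) for name, value in trial.items()}
for name, cls in classes.items():
    print(f"{name}: {cls}")
```

A check like this is only a starting point. A treatment group coded as 1 or 2 would look numeric to the computer even though it is qualitative, which is why you need to know what each variable means and not just how it is stored.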


## Data files

Once you know what type of data you'll be working with, you'll have to access it. We can safely assume that you will be starting with data that is stored in a digital file (if you have data in another format then it'll have to be digitized for you to do anything with it). For the purposes of this course, any digital file containing values to be analyzed is a **data file**. 

Data files can reflect many different things, and you may have many data files for any given analytical goal. For example, the firing rates of different neurons recorded in the same experiment may be saved as separate data files, or each subject in your experiment may have a single data file that reflects their performance, and you wish to analyze across subjects. 

One thing to keep in mind is the format of your data file. In some cases your data will be stored in a format that is easily readable by almost any software environment. We call these **human readable** formats because you can open them up in a text editor and see your data. 

Examples include:
* CSV: comma separated value.
* TXT: a text file with characters (e.g., tab, space) that indicate transitions between cells.
* JSON: JavaScript object notation, lightweight data-interchange format.

However, when storing large amounts of information or data with complex hierarchical relationships (or sometimes when evil software companies want to prevent you from using other software), data will be stored in a **binary** format. In this case, rather than being stored as easily readable text, the data is encoded in binary form and requires a translator function to read it. 

Examples of binary formats include:
* MAT: Matlab data format
* SAV: SPSS data format
* RData: R data format
* HDF5: Hierarchical data format (5th version)

Now many software packages have tools to read data in different binary formats, but sometimes it helps to think carefully about how to store your data in a manner that maximizes ease of access (for you and other users).

Taken together then, the things you want to know about your data files are:
* What information is in each file?
* Do I know how to read the file?
* Do I need to aggregate across files for my analytical goals?
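
To make the idea of human readable formats concrete, here is a minimal sketch that reads CSV and JSON content with Python's standard `csv` and `json` modules. The file contents are hypothetical and are held in strings so the example runs on its own; with real files you would pass an opened file instead.

```python
import csv
import io
import json

# Hypothetical file contents, made up for illustration.
csv_text = "subject,rt,accuracy\ns0001,380,0\ns0001,599,1\n"
json_text = '{"subject": "s0001", "task": "rest", "n_trials": 2}'

# CSV: each data row becomes a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: key/value structure parsed directly into Python objects.
meta = json.loads(json_text)

print(rows[0])
print(meta["task"], meta["n_trials"])
```

Note that the CSV reader returns every value as text (the reaction time comes back as the string `"380"`), whereas JSON keeps numbers as numbers. Knowing how to read a file therefore includes knowing which values need converting before analysis.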


## Data Tables

It is important to differentiate a **data file**, which is a way of storing information, from a **data table**, which we define as

* **Data table:** aggregated data that is organized in a way that allows for it to be analyzed so as to meet a set of empirical goals. 

In this way a data table can be information aggregated from many data files. However, in many cases the only data file you may have access to is a data table itself.

In the next section we consider the way you organize data tables in an efficient manner.

# 2. Data Table Efficiency

<br>
This section pulls heavily from 
[Wickham 2014](https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf)

Perhaps one of the most time consuming parts of data analysis is organization and cleaning of data. **Data preparation**, the process of getting data files ready for visualization and statistical analysis, can often constrain how you look at your data. Therefore it is worthwhile to spend time thinking about how your data is organized.

Here we will define the concept of **tidy data**, a conceptual approach to guide how you organize your data files themselves.



## Definition of Terms:

Before we get into the idea of organizing your aggregate data file, we should define some key terms.

* **Dataset:** A collection of values.
* **Values:** An analytical unit, either a number or a string.
* **Variable:** All values that measure the same underlying attribute across units (e.g., age, height, group). Depending on its role in an analysis, a variable may also be called a _feature, predictor (independent) variable, or response (dependent) variable_.
* **Observation:** All values measured on the same unit (e.g., a person, a trial) across attributes. This is also sometimes known as a _case or record_.
* **Table:** A collection of _variables_ and _observations_ organized as a 2-dimensional array with rows and columns.
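
These terms can be sketched with a small table stored as a list of rows. The people, ages, and groups below are hypothetical values invented for the example.

```python
# Hypothetical table: each dict is one row, each key is one column.
table = [
    {"person": "Joe",  "age": 34, "group": "control"},
    {"person": "June", "age": 29, "group": "treatment"},
    {"person": "Mary", "age": 41, "group": "control"},
]

# Observation: all values measured on the same unit (one row).
observation = table[1]

# Variable: all values of the same attribute across units (one column).
variable = [row["age"] for row in table]

# Dataset: the full collection of values.
values = [value for row in table for value in row.values()]

print(observation)
print(variable)
print(len(values))
```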

## Example Table
<br>
Consider this hypothetical data set in Table 1:


|  person |  a  |  b  |
|------|-----|-----|
| Joe  |  -   |  2  |
| June | 16  | 11  |
| Mary |  3  |  1  |


Here the values are ***person***, ***treatment*** (a or b), and ***result***. When organized in this manner, ***person*** is the observation while ***treatment*** and ***result*** are variables. Do you see why?


## Tidy Data

How you can analyze your data will greatly depend on how the data is organized. So it's best to follow a prescribed set of rules when thinking about formatting the data file you will be using in your analysis.


***Tidy data***: a standard way of mapping the _meaning_ (i.e., analytical goals) of a dataset to its structure. 
* A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.


The structure of tidy data is as follows:
* Each _variable_ forms a _column_.
* Each _observation_ forms a _row_.
* Each type of _observational unit_ forms a _table_.

Let's think of this within the example of Table 1 presented above. Here treatment effect was entered as two variables: a variable for _treatment a_ and a variable for _treatment b_, with the values split across the two variables.

This does not fit the definition of tidy data. Why? Because each row contains two observations: one for _treatment a_ and one for _treatment b_.

So then let's look at restructuring Table 1 to be consistent with the principle of tidy data.

| person |  treatment  | result|
|--------|-------------|-------|
| Joe    |  a  | - |
| June    |  a  | 16 |
| Mary    |  a  | 3 |
| Joe    |  b  | 2 |
| June    |  b  | 11 |
| Mary    |  b  | 1 |

Note that here each unique observation is isolated in a single row and the variables (person, treatment, result) are now strictly defined by column.
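
One practical payoff of the tidy layout is that grouping becomes a single, repeatable pattern. The sketch below computes the mean result per treatment from the tidy version of Table 1, using only the standard library and treating the missing value (shown as "-" in the table) as `None`.

```python
from collections import defaultdict
from statistics import mean

# Tidy version of Table 1; None stands in for the missing "-" entry.
tidy = [
    {"person": "Joe",  "treatment": "a", "result": None},
    {"person": "June", "treatment": "a", "result": 16},
    {"person": "Mary", "treatment": "a", "result": 3},
    {"person": "Joe",  "treatment": "b", "result": 2},
    {"person": "June", "treatment": "b", "result": 11},
    {"person": "Mary", "treatment": "b", "result": 1},
]

# Collect the non-missing results under each treatment level.
by_treatment = defaultdict(list)
for row in tidy:
    if row["result"] is not None:
        by_treatment[row["treatment"]].append(row["result"])

means = {treatment: mean(results) for treatment, results in by_treatment.items()}
print(means)
```

The same loop would work unchanged if a third treatment were added, because the treatment level lives in the rows and not in the column names.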


## How to make messy data tidy

<br>
Before we get into principles of how to make messy data tidy, let's consider how your data can be messy. Wickham (2014) identifies five common problems that make messy data. They are:

* **Problem 1:** Column headers are values, not variable names.
* **Problem 2:** Multiple variables are stored in one column.
* **Problem 3:** Variables are stored in both rows and columns.
* **Problem 4:** Multiple types of observational units are stored in the same table.
* **Problem 5:** A single observational unit is stored in multiple tables.

<br>
___

<br>

**Problem 1: Column headers are values, not variable names.**

The original form of Table 1 above commits Problem 1. Here the variable *treatment* was split across two columns. In this case we fixed the table by combining two columns into a single column and adding a new column that defines the original difference between the _treatment a_ and _treatment b_ columns. This defines a process known as **melting**.

<br>

__Melting:__ unifying data across multiple columns that are subordinate to a common variable. 

<br> 

This is what was done to Table 1. But here's a more formal description of the melting process.

    
**Raw Table**


|  row  |  a  |  b  |  c  |
|-------|-----|-----|-----|
| A | 1 | 4 | 7 |
| B | 2 | 5 | 8 |
| C | 3 | 6 | 9 | 


<br>

**Tidy Table (melted)**

| row | column | value |
|-----|--------|-------|
| A | a | 1 |
| B | a | 2 |
| C | a | 3 |
| A | b | 4 |
| B | b | 5 |
| C | b | 6 |
| A | c | 7 |
| B | c | 8 |
| C | c | 9 |


Notice how the variable with the three levels _a, b, and c_ was split across multiple columns. By stacking the values so that each row is a unique observation and adding a new column that identifies the level each value belongs to, you've made this data tidy.
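
The melting step can be sketched in a few lines of plain Python. The `melt` helper below is a hypothetical function written for this example, not a library API, although data analysis libraries such as pandas provide a ready-made `melt` that follows the same idea.

```python
def melt(rows, id_col, value_cols, var_name="column", value_name="value"):
    """Stack value_cols into (var_name, value_name) pairs, one row per value."""
    return [
        {id_col: row[id_col], var_name: col, value_name: row[col]}
        for col in value_cols      # one pass per original column
        for row in rows
    ]

# The raw table from above.
raw = [
    {"row": "A", "a": 1, "b": 4, "c": 7},
    {"row": "B", "a": 2, "b": 5, "c": 8},
    {"row": "C", "a": 3, "b": 6, "c": 9},
]

tidy = melt(raw, id_col="row", value_cols=["a", "b", "c"])
for record in tidy:
    print(record)
```

Three rows with three value columns become nine rows, one per value, in the same order as the melted table.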

<br>

___

<br>

**Problem 2: Multiple variables are stored in one column.**

Next let's consider when multiple variables are stored in a single column. 

Consider a hypothetical table reporting the number of reported cases of dyslexia in two separate countries (in this example it is not a complete table, just to keep things simple) reported by sex and age range.

**Raw Table**

| Country | Group | Cases |
|---------|-------|-------|
| US | M014 | 0 |
| US | M1524 | 0 |
| US | M2534 | 1 |
| US | F014 | 0 |
| US | F1524 | 0 |
| US | F2534 | 2 |
| UK | M014 | 1 |
| UK | M1524 | 0 |
| UK | M2534 | 3 |


Now the data table is not committing Problem 1 because each row is a unique observation, but it is committing Problem 2 because _sex_ and _age_, which are separate variables, are stored together in the _Group_ column (column 2). We will need a new way of cleaning the data, called _splitting_.

<br>

* __Splitting:__ splitting a single column, with multiple variables, into separate columns reflecting unique variables.

<br>

Let's take a look at splitting Table 2. 

**Tidy Table (split)**

| Country | Sex | Age Range | Cases |
|---------|-----|-----------|-------|
| US | M | 0-14 | 0 |
| US | M | 15-24 | 0 |
| US | M | 25-34 | 1 |
| US | F | 0-14 | 0 |
| US | F | 15-24 | 0 |
| US | F | 25-34 | 2 |
| UK | M | 0-14 | 1 |
| UK | M | 15-24 | 0 |
| UK | M | 25-34 | 3 |

Notice that now the levels of _sex_ (i.e., M or F) are unique to column two and the _age_ groups are indicated in a separate variable. Most importantly, the column _cases_ didn't change. Now Table 2 conforms to our definition of tidy data. Each variable is its own unique column and each observation is a unique row to form a specific table.
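
Splitting can be sketched with a regular expression. The pattern below assumes the encoding seen in the example (one letter for sex, followed by the lower age bound and a two-digit upper bound); a real dataset may use a different scheme, so treat the pattern as an assumption to verify against your own codes.

```python
import re

# Raw table from above as (Country, Group, Cases) tuples.
raw = [
    ("US", "M014", 0), ("US", "M1524", 0), ("US", "M2534", 1),
    ("US", "F014", 0), ("US", "F1524", 0), ("US", "F2534", 2),
    ("UK", "M014", 1), ("UK", "M1524", 0), ("UK", "M2534", 3),
]

# Assumed encoding: sex letter, lower bound, then a two-digit upper bound.
pattern = re.compile(r"^([MF])(\d+)(\d{2})$")

def split_group(code):
    """Split a code such as 'M1524' into ('M', '15-24')."""
    sex, low, high = pattern.match(code).groups()
    return sex, f"{int(low)}-{int(high)}"

tidy = []
for country, group, cases in raw:
    sex, age_range = split_group(group)
    tidy.append({"Country": country, "Sex": sex, "Age Range": age_range, "Cases": cases})

for record in tidy:
    print(record)
```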

<br>

___

<br>

**Problem 3: Variables are stored in both rows and columns.**

So far so good. We took slightly messy data and made it tidy. But now let's look at some _REALLY_ messy data!

What happens when variables are stored in both rows and columns (i.e., Problem 3)? Let's take a look at an example of this. The hypothetical table below reports the mean and standard deviation (stdev) of the ages of students in an online class each month.

**Raw Table**

| Date | Measure | Value | 
|------|---------|-------|
| 1/20 |  mean | 23 |
| 1/20 |  stdev | 10 |
| 2/20 |  mean | 35 |
| 2/20 |  stdev | 7 |
| 3/20 |  mean | 29 |
| 3/20 |  stdev | 15 |

Here the _Measure_ column holds the names of two variables (_mean_ and _stdev_), so variables are stored in the rows as well as the columns (Problem 3). We can improve efficiency by getting rid of the variable _Measure_ and reporting each as its own variable. We can do this using a process known as **casting** (also known as **unstacking**).

<br>

* __Casting:__ The inverse of melting where values in a single column, reflecting two different types of variables, are rotated around into separate columns.

<br>

**Tidy Table (cast)**

| Date | Mean | Stdev |
|------|------|----------|
| 1/20 | 23 | 10 |
| 2/20 | 35 | 7 |
| 3/20 | 29 | 15 |



Notice that _mean_ and _stdev_ are two different types of values that are part of a singular observation (a sample from the same monthly online class). Having them separated as variables makes this data table more analytically tractable. Thus, by casting we've made a data table that conforms to the definition of tidy data.
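
A minimal sketch of casting in plain Python follows. It groups the long-format rows by date and turns each entry in the _Measure_ column into its own key; libraries such as pandas offer `DataFrame.pivot` for the same purpose.

```python
# Long-format raw table from above.
long_rows = [
    {"Date": "1/20", "Measure": "mean",  "Value": 23},
    {"Date": "1/20", "Measure": "stdev", "Value": 10},
    {"Date": "2/20", "Measure": "mean",  "Value": 35},
    {"Date": "2/20", "Measure": "stdev", "Value": 7},
    {"Date": "3/20", "Measure": "mean",  "Value": 29},
    {"Date": "3/20", "Measure": "stdev", "Value": 15},
]

wide = {}
for row in long_rows:
    # setdefault creates the output row the first time a date is seen.
    out = wide.setdefault(row["Date"], {"Date": row["Date"]})
    # The measure name becomes a column; its value fills the cell.
    out[row["Measure"]] = row["Value"]

cast = list(wide.values())
for record in cast:
    print(record)
```

Six rows collapse into three, one per date, each carrying both the mean and the standard deviation.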

<br>

___

<br>

**Problem 4: Multiple types of observational units are stored in the same table.**

So far we have been going over when each table contains a single observational unit. But sometimes you may have a dataset where a data table combines multiple types of observational units (Problem 4). 

This is illustrated in the hypothetical table below showing information on two published papers for the first four years after publishing.

**Raw Table**

| DOI  |  Author  |   Title   | Year | Citations |
|------|----------|-----------|------|-----------|
| .001 | Verstynen| "Big Data"| 2011 | 2   |
| .001 | Verstynen| "Big Data"| 2012 | 10  |
| .001 | Verstynen| "Big Data"| 2013 | 50  |
| .001 | Verstynen| "Big Data"| 2014 | 101 |
| .002 | Holt     | "Theory!" | 2015 | 10  |
| .002 | Holt     | "Theory!" | 2016 | 211 |
| .002 | Holt     | "Theory!" | 2017 | 561 |
| .002 | Holt     | "Theory!" | 2018 | 1014 |

There is a lot going on in this table. There is the identification of each paper, with variables for _DOI_, _author_, and _title_, but there is also information on _year_ and _number of citations_. So there are two types of observational units here: 1) the features that are associated with each paper (DOI, author, title), and 2) the citations over time. 

Now you can use tools like melting, splitting, and casting here, but keeping them all in the same table means that you're necessarily looking at two separate questions. Here it's better to _split the data into separate tables_, with each table representing a unique observational unit. 

<br>

* __Parsing:__ Taking a table with multiple observational units and breaking (or parsing) it into multiple tables each with unique observational units.

<br>

**Tidy Table 1**

| DOI  |  Author  |   Title   |
|------|----------|-----------|
| .001 | Verstynen| "Big Data"|
| .002 | Holt     | "Theory!" |

<br>

**Tidy Table 2**

| DOI  | Year | Citations |
|------|------|-----------|
| .001 | 2011 | 2   |
| .001 | 2012 | 10  |
| .001 | 2013 | 50  |
| .001 | 2014 | 101 |
| .002 | 2015 | 10  |
| .002 | 2016 | 211 |
| .002 | 2017 | 561 |
| .002 | 2018 | 1014 |


Notice how the two tables present different types of information. The first characterizes each paper and the second reports its performance over time. They are linked by the ID number (the DOI) to allow for asking questions on the same paper.
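
The parsing step, and the way the DOI links the two tables back together, can be sketched as follows. This is one simple way to do it in plain Python under the assumption that the DOI uniquely identifies a paper.

```python
# Raw table from above as (DOI, Author, Title, Year, Citations) tuples.
raw = [
    (".001", "Verstynen", "Big Data", 2011, 2),
    (".001", "Verstynen", "Big Data", 2012, 10),
    (".001", "Verstynen", "Big Data", 2013, 50),
    (".001", "Verstynen", "Big Data", 2014, 101),
    (".002", "Holt", "Theory!", 2015, 10),
    (".002", "Holt", "Theory!", 2016, 211),
    (".002", "Holt", "Theory!", 2017, 561),
    (".002", "Holt", "Theory!", 2018, 1014),
]

papers = {}      # paper-level table, keyed by DOI so each paper appears once
citations = []   # citation-level table, one row per paper per year
for doi, author, title, year, count in raw:
    papers[doi] = {"DOI": doi, "Author": author, "Title": title}
    citations.append({"DOI": doi, "Year": year, "Citations": count})

paper_table = list(papers.values())

# The shared DOI lets us look up paper details for any citation row.
latest = citations[-1]
print(papers[latest["DOI"]]["Author"], latest["Year"], latest["Citations"])
```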

This problem brings up a deeper point about data tables. **Data tables themselves should be thought of (and generated) as discrete analyzable structures for specific questions.** What this means is that, while you may have a "master" data table with many variables and observations, for a lot of analyses, you will want to parse and reformat a subset of the master table into smaller tables, each with specific analytical goals. Thus, the data table itself is much less static than you might want to think.

<br>

___

**Problem 5: A single observational unit is stored in multiple tables.**

<br>

Let's end by considering the final problem of having a single observation separated across multiple tables. This problem can come up often in psychology or neuroscience experiments where each subject's data file is stored separately. For example, you might have tidy data for one subject, which might look like this,

**Raw Table (Subject s0001)**

| Trial | Condition | RT  | Accuracy
| ------|:---------:|----:|----:|
| 1     |     A     | 380 |  0
| 2     |     B     | 599 |  1
| 3     |     A     | 240 |  1

and a second table for another subject that looks like this,

**Raw Table (Subject s0002)**

| Trial | Condition | RT  | Accuracy
| ------|:---------:|----:|----:|
| 1     |     A     | 692 |  0
| 2     |     B     | 476 |  1
| 3     |     A     | 301 |  1

Since all subjects go through the same experiment, these data files reflect a common observational unit split across subjects. The easiest way to handle this is to concatenate across the separate data table files to make a master data table that has each subject. 

**Tidy Table**

| Subject | Trial | Condition | RT  | Accuracy
| --------| ------|:---------:|----:|----:|
| s0001   | 1     |     A     | 380 |  0
| s0001   | 2     |     B     | 599 |  1
| s0001   | 3     |     A     | 240 |  1
| s0002   | 1     |     A     | 692 |  0
| s0002   | 2     |     B     | 476 |  1
| s0002   | 3     |     A     | 301 |  1


If each of the individual subject data tables is tidy _and contains the same variables_, then the concatenated table will, by definition, be tidy too. 
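
Concatenation across subjects can be sketched as below. The per-subject tables are held in a dictionary here for simplicity; in practice each would be read from its own data file. The sketch also checks that every table has the same columns before combining them, and pandas users can get a similar result with `pandas.concat`.

```python
# Per-subject tables from above, keyed by subject ID.
subject_tables = {
    "s0001": [
        {"Trial": 1, "Condition": "A", "RT": 380, "Accuracy": 0},
        {"Trial": 2, "Condition": "B", "RT": 599, "Accuracy": 1},
        {"Trial": 3, "Condition": "A", "RT": 240, "Accuracy": 1},
    ],
    "s0002": [
        {"Trial": 1, "Condition": "A", "RT": 692, "Accuracy": 0},
        {"Trial": 2, "Condition": "B", "RT": 476, "Accuracy": 1},
        {"Trial": 3, "Condition": "A", "RT": 301, "Accuracy": 1},
    ],
}

# Concatenation is only safe if every row has the same set of columns.
columns = {tuple(row) for rows in subject_tables.values() for row in rows}
if len(columns) != 1:
    raise ValueError("Subject tables do not share the same columns")

# Add the subject ID as a new column and stack the rows.
master = [
    {"Subject": subject, **row}
    for subject, rows in subject_tables.items()
    for row in rows
]

for record in master:
    print(record)
```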

# 3. Standardized data organization

(We'll be borrowing a lot of concepts from Gorgolewski et al. 2016 in this section: https://www.nature.com/articles/sdata201644)

<br>
So far we have gone over the concepts of data files and well organized data tables, but now let's take a step back and think carefully about how we organize our data files. Remember that you want each data table to reflect a unique observational unit, but your experiment may have multiple observational units. Ideally you'll organize your _data files_ so that you can query them to organize new _data tables_ depending on your analytical goals.

<br>
Thus the concept of data standardization becomes critical. 

* __Data standardization:__  The process of bringing data into a common format, directory organization, and file naming convention that allows for collaborative research, large-scale analytics, and sharing.

Data standardization has several advantages (from Gorgolewski et al. 2016).

* *Minimized curation:* Common standards make it possible for researchers who were not directly involved in data collection to understand and work with the data. This is particularly important to ensure that data remain accessible and usable by different researchers over time, including within a laboratory, between labs, or on public data sharing resources. 

* *Error reduction:* Standards reduce errors that stem from misunderstanding the meaning of a given datum (e.g., when variable names are not explicitly stated in the data file or not standardized across files).

* *Optimized usage of data analysis software* is made possible when the metadata necessary for analysis (i.e., details of the task or imaging protocol) are easily accessible in a standardized and machine-readable way. This enables the application of completely automated analysis workflows, which greatly enhances reproducibility and efficiency.

* *Development of automated tools* for verifying the consistency and completeness of datasets becomes possible. Such tools make it easier to spot missing metadata that limit how the data could be analyzed in the future.

<br>
Therefore, it is good to get into the habit early on of deciding how you want to organize your data across studies, using a format that will maximize efficiency regardless of the nature of the individual experiments. To do this, you will want to think about how to standardize four key things: 
* File types
* File and directory naming conventions
* Directory hierarchy
* Documentation files

## Example: The Brain Imaging Data Structure (BIDS)

In some cases, certain fields will have a standardized way of organizing their data. But many fields, including nearly all the subfields in psychology & neuroscience, do not. 

<br>
The field of neuroimaging, however, is moving towards a standardized data structure called the Brain Imaging Data Structure (BIDS) format. I present it here as an example of how to think about consistent organizing structures.

<br>
The figure below illustrates the logic of the BIDS format.

![BIDS](imgs/L2BIDS.png)

<br>
Data starts in an unstructured format (in this case a binary format called _dicom_), shown on the left side of the figure. Here each directory of dicom files represents a type of scan (e.g., structural MRI, functional MRI) on the same subject on the same day. The goal of BIDS is to convert and organize this data into a standard, intuitive format.

<br> 
On the right is the same data converted into BIDS. Note the hierarchical organization of the directories. The root directory (*my\_dataset*) is at the experiment level. Thus everything in this folder is related to the experiment itself.

Within the experiment directory are two things: a human readable data file (_participants.tsv_) and a set of directories for each subject in the experiment. The data file provides information about each of the subjects. Within each subject directory, there are separate directories for the type of data collected (e.g., anatomical images, fMRI images, diffusion images). Within each of these directories, the original _dicom_ files have been converted into more usable formats. In some cases these are binary files (e.g., *sub-01\_T1w.nii.gz*) that can be read by multiple data analysis packages. Along with these binary data files, there may also be human readable data files that contain relevant information for interpreting the imaging data (e.g., *sub-01\_task-rest\_bold.json*). 

<br>
There are several things to notice from the BIDS format:

* The hierarchy of every experiment is the same.
* The naming conventions are the same.
* The naming conventions are intuitive.
* The data formats are meant to maximize accessibility.
* All information needed for analyzing the data set is available and easy to find.
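
Because the hierarchy and naming conventions are fixed, file locations can be generated by a rule instead of being remembered. The sketch below builds BIDS-style paths for the two example files mentioned above. The `bids_path` helper is hypothetical, and the directory names `anat` and `func` are assumed here to be the folders for anatomical and functional images; the sketch covers only a small part of the convention and ignores details such as sessions and runs.

```python
from pathlib import PurePosixPath

def bids_path(root, subject, datatype, suffix, extension, task=None):
    """Build a BIDS-style path, e.g. my_dataset/sub-01/anat/sub-01_T1w.nii.gz."""
    parts = [f"sub-{subject}"]
    if task is not None:
        parts.append(f"task-{task}")   # optional task label, e.g. task-rest
    parts.append(suffix)
    filename = "_".join(parts) + extension
    # Hierarchy: experiment root / subject directory / data type directory / file
    return PurePosixPath(root) / f"sub-{subject}" / datatype / filename

anat = bids_path("my_dataset", "01", "anat", "T1w", ".nii.gz")
func = bids_path("my_dataset", "01", "func", "bold", ".json", task="rest")

print(anat)
print(func)
```

The same function call works for every subject in every experiment that follows the convention, which is what makes automated analysis workflows practical.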

