# Data Management


## General Introduction
Data management can seem trivial to beginners, but starting with best practices will make your future work significantly easier and help ensure that your data is well organized and secure.



### Where are we now?

Before starting a lesson, it is usually helpful to reflect on what we already know about the topic and what the lesson might teach us.

So please take a few minutes to reflect on the concept of data management with the following questions.


**1.1 What is your understanding of data management?**

**1.2 What could you possibly learn?**

**1.3 How do you usually store/manage data?**


```{note}
Feel free to do this in your head or in a separate document. Remember, to engage with the material interactively, either open it in MyBinder (the small rocket button at the top of the website) or download the course material (the small download button at the top), go through the [setup process]() and open this file (i.e., digital_literacy.ipynb in the introduction folder) using Jupyter Notebook or VS Code.
```

## Roadmap

- **Goals**
- Data management
    1. Data management plan
    2. Set up a local folder structure

## Goals

Specific goals of this first session are to:

* gain a general understanding of data management
* get familiar with the data management process
* provide a checklist to follow
* understand why project design is an essential step



## Data Management

-----------------------------------

### 1. Data Management Plan

An initial step when starting any research project should be to set up a data management plan (DMP).
This helps you to flesh out, describe, and document what data exactly you want to collect, what you'll be doing with it, and where and how it's stored and eventually shared. 

A DMP helps you stay organized and reduces the potential for surprises in the future (e.g., due to too limited data storage capacities or unexpected costs).
A DMP is sometimes also required, e.g., by your university or by agencies funding your research.

#### Need more Motivation?


For the public good

- DMPs and data management standards make it easier to reproduce code and analysis pipelines built by others, thereby reducing scientific waste and improving efficiency

For yourself

- You are likely the future user of the data and data analysis pipelines you’ve developed, so keeping your file structure standardized removes the need to remember where you've stored specific pieces of data, etc.

- DMPs enable and simplify collaboration: they allow readers and collaborators to quickly understand what data you'll be collecting, where this data can be found, and what exactly you're planning to do with it

- Reviewers and funding agencies like to see clear, reproducible results (and you can't really ignore their opinions, can you?)

- Open-science-based funding opportunities and awards are available to incentivize good practices (for instance, the OHBM Replication Award, Mozilla Open Science Fellowship, and so on)

-------------------------------------


### What to consider in your data management plan

Most universities provide templates, tools, or guidance on how to create a DMP, so it is a good idea to check your university's online presence or get in contact with your local library.

For the Goethe University Frankfurt, researchers can use the following tool: [(German) Datenmanagementpläne mit dem Goethe-RDMO](https://rdmorganiser.github.io/)

There are also public tools to collect and share DMPs, such as [DMPonline](https://dmponline.dcc.ac.uk/) for the UK.

There, you can also find [publicly published plans](https://dmponline.dcc.ac.uk/public_plans) that you can use to check what your DMP could/should contain.

The [Turing Way Project](https://the-turing-way.netlify.app/reproducible-research/rdm.html#rr-rdm) lists the following considerations when creating a DMP. Many of the specific points of this checklist have already been discussed in the previous steps.

### Turing Way DMP checklist
 

`1. Roles and Responsibilities of project team members`

    - discuss who is responsible for different tasks related to project/data management
    - e.g., who is responsible for maintaining the dataset, who takes care of the research ethics review

----------------------

`2. Type and size of data collected and documentation/metadata generated`

    - i.e., raw, preprocessed, or finalised data (lead to different considerations, as e.g., raw data can generally not be openly shared)
    - the expected size of the dataset
    - how well is the dataset described in additional (metadata) files, 
        - what abbreviations are used, how are, e.g. experimental conditions coded
        - where, when, and how was data collected
        - description of the sample population
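
One way to document abbreviations and condition codes is a small machine-readable "data dictionary" stored next to the dataset. The following is a minimal sketch, not a prescribed format: the function name `write_data_dictionary`, the variable names `cond` and `rt`, and their descriptions are hypothetical and only illustrate the idea.

```python
import json
from pathlib import Path


def write_data_dictionary(path, variables):
    """Write a JSON file that describes each variable (column) of a dataset."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(variables, indent=2), encoding="utf-8")
    return path


# Hypothetical variables of an illustrative reaction-time experiment.
example_variables = {
    "cond": {
        "Description": "Experimental condition",
        "Levels": {"1": "congruent", "2": "incongruent"},
    },
    "rt": {"Description": "Reaction time", "Units": "s"},
}

# Usage (writes a file, so it is left commented out here):
# write_data_dictionary("data/data_dictionary.json", example_variables)
```

Keeping such a file under the same folder as the data means the explanation of each code travels with the dataset when it is shared or archived.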

----------------------

`3. Type of data storage used and backup procedures that are in place`

    - where is data stored
    - data protection procedures
    - how are backups handled, i.e. location and frequency of backups
    - will a version control system be used?
    - directory structure, file naming conventions
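
Whatever backup procedure you choose, it can help to verify that a backup copy is identical to the original. A common technique is to compare checksums. The sketch below uses SHA-256 from Python's standard library; the function names `sha256_of_file` and `backup_matches` are illustrative choices, not part of any specific backup tool.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large files do not fill the memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original, backup):
    """Return True if both files have the same SHA-256 checksum."""
    return sha256_of_file(original) == sha256_of_file(backup)
```

Storing the checksums alongside the data also allows you to detect silent corruption later on.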

------------------------
`4. Preservation of the research outputs after the project.`

    - public repositories or local storage
    - e.g. OSF
    

--------------------------
`5. Reuse of your research outputs by others`

    - Is the code and coding environment shared? (e.g. GitHub)
    - conditions for reuse of collected dataset (licensing etc.)
  
-------------------------  
`6. Costs`

    - potential costs of equipment and personnel for data collection
    - costs for data storage
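
A rough back-of-the-envelope calculation is often enough to anticipate storage needs and costs. The sketch below is only an illustration: all numbers (participants, sessions, file sizes, number of copies, and the price per GB) are hypothetical placeholders that you would replace with the values of your own project and storage provider.

```python
def estimate_storage_gb(n_participants, sessions_per_participant,
                        gb_per_session, n_copies=3):
    """Estimate total storage in GB, including backup copies."""
    return n_participants * sessions_per_participant * gb_per_session * n_copies


def estimate_yearly_cost(total_gb, price_per_gb_per_year):
    """Estimate the yearly storage cost for a given price per GB and year."""
    return total_gb * price_per_gb_per_year


# Hypothetical project: 40 participants, 2 sessions each, 1.5 GB per session,
# stored three times (working copy plus two backups).
total_gb = estimate_storage_gb(40, 2, 1.5, n_copies=3)
print(f"Estimated storage: {total_gb} GB")  # Estimated storage: 360.0 GB

# Hypothetical price of 0.10 currency units per GB and year.
print(f"Estimated yearly cost: {estimate_yearly_cost(total_gb, 0.10):.2f}")
```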

To create your DMP, you can either use the discussed tools or create a first draft by noting your thoughts/expectations regarding the above checklist in a document.

We will be touching on most of these points in the coming lesson, but the focus of this workshop will be on the metadata, data storage and the preservation of the research outputs. 

Let's start with a general plan on how to organize your local system/storage and build from there.




### 2. Your local system: setting up a folder structure

It is recommended to adopt a standardized approach to structuring your data, as this not only helps you stay consistent but also allows you and possible collaborators to easily identify where specific data is located.

#### General File Naming Conventions

To make sure that it is easily understood what a file contains and to make files easier for computers to process, you should follow certain naming conventions:

    - be consistent
    - use the date in the format YYYYMMDD
    - use underscores `(_)` instead of spaces, or
    - use camelCase (capitalize the first letter of each word except the first) instead of spaces
    - avoid spaces, special characters `(+-"'|?!~@*%{[<>)`,  punctuation `(.,;:)`, slashes and backslashes `(/\)`
    - avoid "version" names, e.g., v1, vers1, final, final_really, etc. (instead, use a version control system like Git, e.g., hosted on GitHub)

- [MIT cheatsheet for file naming conventions](https://www.dropbox.com/s/ttv3boomxlfgiz5/Handout_fileNaming.pdf?dl=0)
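
The conventions above can also be applied programmatically. The following is a minimal sketch using only Python's standard library; the helper names `make_file_name` and `follows_conventions` are hypothetical, and the rule set (letters, digits, and underscores only) is one possible reading of the list above, not an official standard.

```python
import re
from datetime import date
from pathlib import PurePath

# Only letters, digits, and underscores are allowed in the file stem.
ALLOWED_STEM = re.compile(r"[A-Za-z0-9_]+")


def make_file_name(parts, extension, day=None):
    """Build a file name of the form YYYYMMDD_part1_part2.ext."""
    day = day or date.today()
    cleaned = []
    for part in parts:
        # Replace spaces and special characters with single underscores.
        part = re.sub(r"[^A-Za-z0-9]+", "_", part).strip("_")
        if part:
            cleaned.append(part)
    stem = "_".join([day.strftime("%Y%m%d"), *cleaned])
    return f"{stem}.{extension.lstrip('.')}"


def follows_conventions(file_name):
    """Return True if the part before the extension uses only allowed characters."""
    stem = PurePath(file_name).stem
    return ALLOWED_STEM.fullmatch(stem) is not None


print(make_file_name(["pilot study", "reaction-times"], "csv", date(2024, 3, 5)))
# 20240305_pilot_study_reaction_times.csv
print(follows_conventions("final report (v2).docx"))  # False
```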


#### Establish a folder hierarchy

Before you begin working on your project, you should start setting up the local folder structure on your system. This helps you keep organized and saves you a lot of work in the long run. 

Your folder hierarchy, of course, depends on your project's specific needs (e.g., folders for data, documents, images, etc.) and should be as clear and consistent as possible. The easiest way to achieve this is to copy and adapt an already existing folder hierarchy template for research projects.
    
    
One example (including a template) is the [Transparent project management template for the OSF platform](https://osf.io/4sdn3/) by [C.H.J. Hartgerink](https://osf.io/5fukm/)

   
The contained folder structure would then look like this:

```
project_name/
    ├── archive/
    ├── analyses/
    ├── bibliography/
    ├── data/
    ├── figure/
    ├── functions/
    ├── materials/
    ├── preregister/
    ├── submission/
    └── supplement/
```   
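
Instead of creating these folders by hand, you could generate them with a few lines of Python. This is a minimal sketch using `pathlib` from the standard library; the function name `create_project` is an illustrative choice, and the folder list simply mirrors the structure shown above.

```python
from pathlib import Path

# Top-level folders taken from the structure shown above.
TOP_LEVEL_FOLDERS = [
    "archive", "analyses", "bibliography", "data", "figure",
    "functions", "materials", "preregister", "submission", "supplement",
]


def create_project(root, project_name, folders=TOP_LEVEL_FOLDERS):
    """Create the project folder and its top-level subfolders."""
    project = Path(root) / project_name
    for folder in folders:
        # exist_ok=True makes the function safe to run more than once.
        (project / folder).mkdir(parents=True, exist_ok=True)
    return project


# Usage (creates folders, so it is left commented out here):
# create_project(Path.home() / "projects", "project_name")
```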


--------------------------------------------------


Another example would be the ["research project structure"](http://nikola.me/folder_structure.html) by [Nikola Vukovic](http://nikola.me/#home)

Where the folder hierarchy would look like this:
       
       
</br>

```
project_name/
    ├── projectManagement/
    │   ├── proposals/
    │   ├── finance/
    │   └── reports/
    ├── EthicsGovernance/
    │   ├── ethicsApproval/
    │   └── consentForms/
    ├── ExperimentOne/
    │   ├── inputs/
    │   ├── data/
    │   ├── analysis/
    │   └── outputs/
    └── Dissemination/
        ├── presentations/
        ├── publications/
        └── publicity/
```   
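
A nested hierarchy like this one can be described as a nested dictionary and created recursively. The sketch below is one possible approach, not part of the original template; the function name `create_tree` is hypothetical and the folder names are adapted from the tree above.

```python
from pathlib import Path

# Each key is a folder name; each value describes its subfolders.
STRUCTURE = {
    "projectManagement": {"proposals": {}, "finance": {}, "reports": {}},
    "EthicsGovernance": {"ethicsApproval": {}, "consentForms": {}},
    "ExperimentOne": {"inputs": {}, "data": {}, "analysis": {}, "outputs": {}},
    "Dissemination": {"presentations": {}, "publications": {}, "publicity": {}},
}


def create_tree(parent, structure):
    """Recursively create the folders described by a nested dictionary."""
    created = []
    for name, children in structure.items():
        folder = Path(parent) / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
        # Descend into the subfolders of the folder that was just created.
        created.extend(create_tree(folder, children))
    return created


# Usage (creates folders, so it is left commented out here):
# create_tree("project_name", STRUCTURE)
```

Describing the hierarchy as data makes it easy to adapt the template, for example by adding an `ExperimentTwo` entry.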


</br>

</br>

-------------------------------------

#### Incorporating experimental data/BIDS standard

Now, both of these examples provide an "experiment folder" but tend to utilize/establish their own standards. 

However, we aim to make our folder structure easily understandable, interoperable (e.g., between systems and programs), and reproducible. Therefore, it is best to adapt our "experiment folder" to industry standards.

For most experimental data in the empirical sciences, the most promising approach will be the [BIDS](https://bids.neuroimaging.io/) (Brain Imaging Data Structure) specification. Originally conceptualized as a standardized format for the organization and description of fMRI data, the format has been extended to encompass other kinds of neuroimaging and behavioral data. Using the BIDS standard will facilitate the integration of your data into most neuroscience analysis pipelines. 
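
To give a first impression of what BIDS naming looks like, the sketch below builds the relative path of a functional MRI file from its subject, session, task, and run labels. It is a simplified illustration covering only a few BIDS entities; the helper name `bids_bold_path` is hypothetical, and the BIDS specification remains the authoritative reference for all naming rules.

```python
import re
from pathlib import PurePosixPath


def bids_bold_path(subject, task, session=None, run=None):
    """Return the relative BIDS-style path of a functional (BOLD) image."""
    labels = [subject, task] + [x for x in (session, run) if x is not None]
    for label in labels:
        # Labels are restricted to letters and digits in this sketch.
        if not re.fullmatch(r"[A-Za-z0-9]+", str(label)):
            raise ValueError(f"Label must be alphanumeric: {label!r}")

    folder = PurePosixPath(f"sub-{subject}")
    entities = [f"sub-{subject}"]
    if session is not None:
        folder = folder / f"ses-{session}"
        entities.append(f"ses-{session}")
    entities.append(f"task-{task}")
    if run is not None:
        entities.append(f"run-{run}")

    file_name = "_".join(entities) + "_bold.nii.gz"
    return folder / "func" / file_name


print(bids_bold_path("01", "rest"))
# sub-01/func/sub-01_task-rest_bold.nii.gz
```

Because every file name encodes the same key-value pairs in a fixed order, tools and collaborators can locate data without project-specific knowledge.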



### Acknowledgments:

The go-to resource for creating and maintaining scientific projects was created by the [Turing Way Project](https://the-turing-way.netlify.app/welcome.html). We've adapted some of their material for the Data Management Plan section of the lesson above.

The Turing Way Community. (2022). The Turing Way: A handbook for reproducible, ethical and collaborative research (1.0.2). Zenodo. https://doi.org/10.5281/zenodo.7625728

[Transparent project management template for the OSF platform](https://osf.io/4sdn3/) by [C.H.J. Hartgerink](https://osf.io/5fukm/)

[BIDS Standard Guide](https://bids-standard.github.io/bids-starter-kit/index.html).


- BIDS SPEC
- FELIX 
- PEER




<!-- ## TLDR -->