# Data management

When working on a project, it is important to keep your files organized.

Most tools generate a great number of output files, 
of which only a few will be used for subsequent analyses. 
Sometimes it is necessary to try different tools or different parameters, 
compare the results and decide which ones perform best. 

Without proper data management, you will get lost in a labyrinth of scripts and files. 
This will cost you time and nerves, and more often than not, it will lead to mistakes in your analysis.

Every project is different, and even with a solid initial plan in place, steps in the analysis may need to change as you go.
You therefore need a framework that provides enough structure to keep your data organized,
but that is flexible enough that you can adapt it on the fly.

There are of course multiple ways to do this, and ultimately you need to find what works best for you.
To get you started, let's begin with the basics.

# Setting up your Project

For this tutorial, you can think of your "project" as the topic of this practical course. In more general terms, I use the word "project" to describe a set of "analyses" that together aim to answer a scientific question.

Each analysis uses data as input, processes it using some software, and produces results. A simple directory structure for each analysis would therefore keep these three parts (input data, scripts, and results) in separate subdirectories.
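
As a minimal sketch of such a structure, the commands below create one directory per analysis with three subdirectories. The names `data`, `scripts` and `results`, as well as the analysis name `genome_assembly`, are illustrative assumptions and not prescribed by this tutorial.

```shell
# Hypothetical analysis name; pick one that describes the goal of the analysis.
analysis="genome_assembly"

# Create one subdirectory each for inputs, code, and outputs.
# -p creates missing parent directories and does not fail if they already exist.
mkdir -p "$analysis/data" "$analysis/scripts" "$analysis/results"

# A README is a convenient place for notes on what was run and why.
touch "$analysis/README.md"

# List what was created.
find "$analysis" | sort
```

Keeping inputs and results apart makes it harder to overwrite raw data by accident, but feel free to adapt the layout to your own needs.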

## Cookiecutter

You have learned to set up directories with the `mkdir` command. However, setting up multiple directories for each analysis can quickly become tedious.

Alternatively, you can use **cookiecutter** to set up your working directory, using a wealth of [available templates](https://www.cookiecutter.io/templates).

### Installation

You can install cookiecutter using `mamba install cookiecutter` in your *base* environment.

### Setting up your first cookiecutter project/analysis

Here is an example call for cookiecutter that will generate a basic directory structure that will serve you well for most projects:

```
cookiecutter gh:patrickmineault/true-neutral-cookiecutter
```

### Template using the Project/Analysis structure

If you want, you can also use the [cookiecutter template of our working group](https://github.com/BIONF/bionf_cookiecutter). This one allows you to first set up a "project" which then contains multiple "analyses".

### Disclaimer

It is okay if you prefer to set up directories by hand and not use cookiecutter. Try to find what works best for you :)

# General Words of Advice

### Organize your project in modules (called "analyses" in this tutorial)

* You can use the "work packages" for your course in our DokuWiki as a starting point for what might be a module in a data analysis project
* Feel free to define more fine-grained modules wherever needed

### Give your analyses names that refer to their goal 

* "genome_assembly" is a better directory name than "Flye" or "illumina", since you might end up using different strategies to accomplish the task of assembling a genome

### Try to identify which results are "key findings"

* This will be everything (e.g. tables, figures) that might be relevant when the time comes to present your findings (i.e. writing a protocol/thesis/manuscript)
* Note down where these results are located and how they were generated
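
One lightweight way to note down key findings is a plain-text log in the project root. The sketch below assumes a hypothetical file `KEY_FINDINGS.md` and a hypothetical helper function `note_finding`. Both names, and the example path and script name, are made up for illustration.

```shell
# Hypothetical helper: append one entry to a key-findings log.
# Arguments: 1 = path to the result file, 2 = short note on how it was generated.
note_finding() {
    printf '* %s | %s | %s\n' "$(date +%F)" "$1" "$2" >> KEY_FINDINGS.md
}

# Example entry; the path and the script name are invented for illustration.
note_finding "genome_assembly/results/assembly_stats.tsv" \
    "scripts/run_assembly.sh with default parameters"

# Show the log so far.
cat KEY_FINDINGS.md
```

Each entry records the date, the location of the result, and how it was produced, which is usually enough to find your way back when writing up.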

### Don't be afraid to reorganize your project structure when things become messy

* You will thank yourself later

### Document as you go

* Everything not saved will be lost

### Keep track of your workflow

* Which analysis generated data that was used as input for which downstream analysis?
* Document the flow of information, either with specialised tools like [**Xmind**](https://xmind.com/) or with tools you most likely already know, like **PowerPoint**.
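
The flow of information can also be recorded directly in the directory structure. As a sketch, assuming two hypothetical analyses `genome_assembly` and `gene_annotation` with `results` and `data` subdirectories (all names are illustrative), a symbolic link makes it visible which upstream result a downstream analysis uses:

```shell
# Hypothetical upstream and downstream analyses (names are illustrative).
mkdir -p genome_assembly/results gene_annotation/data

# Stand-in for a real result file produced by the upstream analysis.
touch genome_assembly/results/assembly.fasta

# Link the upstream result into the downstream input directory instead of copying it.
# The target path is interpreted relative to the location of the link itself.
ln -sf ../../genome_assembly/results/assembly.fasta gene_annotation/data/assembly.fasta

# ls -l shows where the link points, which documents the dependency.
ls -l gene_annotation/data
```

Linking instead of copying avoids duplicate files, but the link breaks if the upstream file is moved or renamed, so this is one option among several rather than a rule.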


# Advanced reading

* You can read more about setting up research projects and managing virtual environments in the [Good Research Code Handbook](https://goodresearch.dev/setup) by Patrick Mineault.