## Corpora structure [[documentation]](https://childproject.readthedocs.io/en/latest/format.html)

This short notebook introduces the structure of corpora of long-form recordings formatted for use with the ChildProject package.

Corpora are divided into three main folders:

 - Metadata (about each recording, participant, and set of annotations)
 - Recordings (the actual audio data)
 - Annotations (the contents of each set of annotations)

![](https://childproject.readthedocs.io/en/latest/_images/structure.png)

For this tutorial, we will use the open-access "VanDam" corpus.

**<- You can browse the structure of the corpus by opening it from the tree to the left**

(In ``data/vandam-data``)

- Metadata are stored in ``data/vandam-data/metadata`` as CSV files.
- Recordings are stored in ``data/vandam-data/recordings``.
- Annotations are stored in ``data/vandam-data/annotations``. Each set of annotations contains the raw annotations (as produced by the annotation software or by the algorithm that generated them) together with "converted", standardized annotations stored as CSV files.
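
The layout described above can also be checked from Python. The snippet below is a minimal sketch using only the standard library, not part of the package: the helper name `summarize_corpus` is hypothetical, and the folder names are simply those listed in this section. It counts the files found under each of the three folders and reports `None` for a folder that is missing.

```python
from pathlib import Path

# The three top-level folders described in this section.
EXPECTED_FOLDERS = ("metadata", "recordings", "annotations")


def summarize_corpus(root):
    """Count the files found under each expected folder of a corpus.

    Returns a dict mapping each folder name to the number of files it
    contains (searched recursively), or to None if the folder is missing.
    This helper is hypothetical and only illustrates the layout.
    """
    root = Path(root)
    summary = {}
    for name in EXPECTED_FOLDERS:
        folder = root / name
        if folder.is_dir():
            summary[name] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            summary[name] = None
    return summary


# Path taken from this tutorial; adjust it to wherever the corpus lives.
for name, count in summarize_corpus("data/vandam-data").items():
    status = "missing" if count is None else f"{count} file(s)"
    print(f"{name}: {status}")
```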

The command-line interface of our package also lets you get a quick overview of the contents of a corpus.

## The Python package

Our Python package provides several functions for working with the metadata. Let's see what it can do.
First, we open the corpus:

In [1]:
from ChildProject.projects import ChildProject

project = ChildProject("data/vandam-data")
project.read()

We might want to check that the data contain no errors. For that we can run ``project.validate()``, which returns a list of errors and a list of warnings.

In [2]:
project.validate()

([], [])
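
Since the result is a pair of lists, it can be unpacked and reported line by line. The following is a minimal sketch: `summarize_validation` is a hypothetical helper, not a function of the package, and it assumes only the `(errors, warnings)` shape shown above.

```python
def summarize_validation(result):
    """Turn an (errors, warnings) pair into a short, readable report."""
    errors, warnings = result
    lines = [f"{len(errors)} error(s), {len(warnings)} warning(s)"]
    lines += [f"ERROR: {e}" for e in errors]
    lines += [f"WARNING: {w}" for w in warnings]
    return "\n".join(lines)


# Using the result shown above for the VanDam corpus:
print(summarize_validation(([], [])))  # prints "0 error(s), 0 warning(s)"
```

With a loaded project, the same helper could be called as `summarize_validation(project.validate())`.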

Let's look at the contents of the metadata. ChildProject loads the metadata for recordings and children as pandas DataFrames:

In [3]:
project.recordings

| Unnamed: 0 | experiment | child_id | date_iso | start_time | recording_device_type | recording_filename | duration |
|---|---|---|---|---|---|---|---|
| 2 | vandam-daylong | 1 | 2010-07-24 | 06:58 | lena | BN32_010007.mp3 | 50464512 |


In [4]:
project.children

| Unnamed: 0 | experiment | child_id | child_dob | dob_criterion | dob_accuracy |
|---|---|---|---|---|---|
| 2 | vandam-daylong | 1 | 2009-07-24 | extrapolated | month |


Notice that the metadata contain each recording's date and each child's date of birth. Based on these, we might want to calculate the age of the child at each recording (in this case, there is just one):

In [5]:
project.recordings["age"] = project.compute_ages()
project.recordings

| Unnamed: 0 | experiment | child_id | date_iso | start_time | recording_device_type | recording_filename | duration | age |
|---|---|---|---|---|---|---|---|---|
| 2 | vandam-daylong | 1 | 2010-07-24 | 06:58 | lena | BN32_010007.mp3 | 50464512 | 11.991786 |
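
To see where a value such as 11.991786 can come from, here is a standard-library sketch of an age-in-months calculation. It is an assumption about the convention, not the package's actual implementation: it divides the number of days between `child_dob` and `date_iso` by an average month length of 365.25 / 12 days, which happens to reproduce the value shown above. The function name `age_in_months` is hypothetical.

```python
from datetime import date

# Average month length in days (assumption: 365.25 days per year / 12).
DAYS_PER_MONTH = 365.25 / 12


def age_in_months(child_dob, recording_date):
    """Age in months between two ISO dates (YYYY-MM-DD strings)."""
    dob = date.fromisoformat(child_dob)
    recorded = date.fromisoformat(recording_date)
    return (recorded - dob).days / DAYS_PER_MONTH


# Dates taken from the children and recordings metadata shown above.
age = age_in_months("2009-07-24", "2010-07-24")
print(round(age, 6))  # prints 11.991786
```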


Now that you know how metadata is organized within corpora, let's see how to manipulate annotations.