## Samplers [[documentation]](https://childproject.readthedocs.io/en/latest/samplers.html)

Samplers are tools for drawing and extracting portions of the recordings, 
which can then be annotated by humans.

Samplers can draw audio segments randomly, periodically, or in a targeted fashion,
by looking for relevant portions of audio (e.g. conversational blocks, or segments that are known to contain speech).

ChildProject implements the following samplers: `PeriodicSampler`, `VocalizationSampler`, `ConversationSampler`, `HighVolubilitySampler`, `EnergyDetectionSampler`. However, it is also possible to implement custom samplers.
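
To make the difference between periodic and random sampling concrete, here is a minimal sketch in plain Python. It does not use ChildProject's sampler classes: the function names `periodic_segments` and `random_segments` and their parameters are hypothetical, chosen only for illustration. Timestamps are in milliseconds, and the recording duration is the one listed for `BN32_010007.mp3` in the recordings table of this tutorial.

```python
import random


def periodic_segments(duration, length, step, start=0):
    """Return (onset, offset) pairs of `length` ms, one every `step` ms.

    Hypothetical helper, not part of ChildProject.
    """
    segments = []
    onset = start
    # stop as soon as a full-length segment no longer fits in the recording
    while onset + length <= duration:
        segments.append((onset, onset + length))
        onset += step
    return segments


def random_segments(duration, length, count, seed=0):
    """Draw `count` segments of `length` ms at uniformly random onsets.

    Hypothetical helper; the segments it returns may overlap.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    onsets = sorted(rng.randrange(0, duration - length + 1) for _ in range(count))
    return [(onset, onset + length) for onset in onsets]


recording_duration = 50464512  # ms

# one 1-minute segment every hour
periodic = periodic_segments(recording_duration, length=60_000, step=3_600_000)

# five 1-minute segments at random positions
drawn = random_segments(recording_duration, length=60_000, count=5)

print(len(periodic), "periodic segments;", len(drawn), "random segments")
```

A targeted sampler follows the same pattern, except that the candidate segments come from existing annotations rather than from a fixed grid or a random draw.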

Below, we draw portions of audio with a lot of turn-taking between the key child and the mother, using the `ConversationSampler`.

### Loading the project

```python
from ChildProject.projects import ChildProject

project = ChildProject("/mnt/data/vandam-data")
project.read()
project.recordings
```

| | experiment | child_id | date_iso | start_time | recording_device_type | recording_filename | duration |
|---|---|---|---|---|---|---|---|
| 2 | vandam-daylong | 1 | 2010-07-24 | 06:58 | lena | BN32_010007.mp3 | 50464512 |


### Initializing and running the sampler

If you are not a Python aficionado, the sampler can also be run from the command line:
```bash
child-project sampler /mnt/data/vandam-data segments.csv conversations --annotation-set vtc --interval 1000
```

```python
from ChildProject.pipelines.samplers import ConversationSampler

# initialize sampler
sampler = ConversationSampler(
    project=project,
    annotation_set="vtc",  # use VTC annotations to search for conversational blocks
    by="recording_filename",  # sample each recording individually
    count=5,  # select the five conversations with the most turns per recording
    interval=1000,  # less than 1000 ms between turns
    speakers=["CHI", "FEM"],  # only consider turns between these speakers
)

# retrieve samples
samples = sampler.sample()
samples
```

| | recording_filename | segment_onset | segment_offset | turns |
|---|---|---|---|---|
| 0 | BN32_010007.mp3 | 38313992 | 38349761 | 20 |
| 1 | BN32_010007.mp3 | 47086707 | 47111291 | 14 |
| 2 | BN32_010007.mp3 | 45935525 | 45953497 | 11 |
| 3 | BN32_010007.mp3 | 24294392 | 24310992 | 11 |
| 4 | BN32_010007.mp3 | 38418011 | 38432698 | 11 |
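
The `interval` and `speakers` parameters used above control which segments count as turns. The following is a simplified sketch of that idea in plain Python, not ChildProject's actual implementation. It assumes that a turn is a change of speaker occurring less than `interval` milliseconds after the previous segment ends, and that a conversation is an unbroken chain of such turns. The function `find_conversations` and the toy segments are hypothetical.

```python
def find_conversations(segments, interval=1000, speakers=("CHI", "FEM")):
    """Group segments into conversations and count the turns in each.

    Simplified illustration only: a turn is a speaker change that starts
    less than `interval` ms after the previous segment ended.
    """
    # keep only the speakers of interest, in chronological order
    kept = sorted(
        (s for s in segments if s["speaker"] in speakers),
        key=lambda s: s["onset"],
    )

    conversations = []
    current = None   # conversation being built, if any
    previous = None  # previous kept segment

    for seg in kept:
        is_turn = (
            previous is not None
            and seg["speaker"] != previous["speaker"]
            and seg["onset"] - previous["offset"] < interval
        )
        if is_turn:
            if current is None:
                # a conversation starts with the segment preceding the first turn
                current = {"onset": previous["onset"], "offset": previous["offset"], "turns": 0}
            current["offset"] = max(current["offset"], seg["offset"])
            current["turns"] += 1
        elif current is not None:
            # the chain is broken: close the current conversation
            conversations.append(current)
            current = None
        previous = seg

    if current is not None:
        conversations.append(current)

    # conversations with the most turns first
    return sorted(conversations, key=lambda c: c["turns"], reverse=True)


# toy segments (ms), made up for illustration
toy_segments = [
    {"speaker": "CHI", "onset": 0, "offset": 800},
    {"speaker": "FEM", "onset": 1000, "offset": 1900},
    {"speaker": "CHI", "onset": 2500, "offset": 3000},
    {"speaker": "FEM", "onset": 9000, "offset": 9500},
    {"speaker": "CHI", "onset": 9800, "offset": 10400},
    {"speaker": "MAL", "onset": 10500, "offset": 11000},  # ignored: not in `speakers`
]

conversations = find_conversations(toy_segments)
print(conversations)
```

With these toy segments, the first three form a conversation with two turns, and the 6-second silence before the fourth segment starts a new conversation with a single turn.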


The output of a sampler is a list of segments characterized by their corresponding recording and the onset/offset timestamps (in milliseconds). In order to have them annotated, one might want to extract the corresponding audio clips. 
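
Since onsets and offsets are expressed in milliseconds, the duration of each sampled segment follows by subtraction. The sketch below recomputes the durations from the values shown in the table above, using plain Python dictionaries rather than the DataFrame returned by the sampler.

```python
# values copied from the sampler output above (timestamps in milliseconds)
sampled = [
    {"recording_filename": "BN32_010007.mp3", "segment_onset": 38313992, "segment_offset": 38349761, "turns": 20},
    {"recording_filename": "BN32_010007.mp3", "segment_onset": 47086707, "segment_offset": 47111291, "turns": 14},
    {"recording_filename": "BN32_010007.mp3", "segment_onset": 45935525, "segment_offset": 45953497, "turns": 11},
    {"recording_filename": "BN32_010007.mp3", "segment_onset": 24294392, "segment_offset": 24310992, "turns": 11},
    {"recording_filename": "BN32_010007.mp3", "segment_onset": 38418011, "segment_offset": 38432698, "turns": 11},
]

# duration of each segment, in milliseconds
durations_ms = [s["segment_offset"] - s["segment_onset"] for s in sampled]

for s, duration in zip(sampled, durations_ms):
    print(f'{s["recording_filename"]}: {duration / 1000:.1f} s, {s["turns"]} turns')

total_ms = sum(durations_ms)
print(f"total audio to annotate: {total_ms / 1000:.1f} s")
```

The five segments add up to about 110 seconds of audio, which gives an idea of the annotation effort that this sample represents.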

### Extracting audio clips


```python
# export sampled clips into separate audio files
sampler.export_audio("clips")
```

```python
from glob import glob
from IPython.display import Audio
from pydub import AudioSegment

# list extracted audio clips
clips = glob("clips/*.*")

# listen to the first
AudioSegment.from_mp3(clips[0])
```

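Here the sampler is run again, this time using the `cha/aligned` annotation set instead of `vtc`:

```python
sampler = ConversationSampler(
    project=project,
    annotation_set="cha/aligned",
    count=5,
    interval=1000,
    speakers=["CHI", "FEM"],
)

# retrieve samples
samples = sampler.sample()
samples
```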

| | recording_filename | segment_onset | segment_offset | turns |
|---|---|---|---|---|
| 0 | BN32_010007.mp3 | 39557599 | 39673273 | 33 |
| 1 | BN32_010007.mp3 | 38156306 | 38210816 | 20 |
| 2 | BN32_010007.mp3 | 16835166 | 17030556 | 18 |
| 3 | BN32_010007.mp3 | 12931631 | 12992949 | 17 |
| 4 | BN32_010007.mp3 | 1265878 | 1292774 | 16 |


```python
# export the new clips and load the first one
sampler.export_audio("cha_clips")
clips = glob("cha_clips/*.*")
AudioSegment.from_mp3(clips[0])
```

### Exercise

Compare the rate of CHI vocalizations per hour that would be estimated from human annotation of clips extracted with each of the following sampling schemes:

 - 20 clips of 30 seconds each
 - 10 clips of 1 minute each
 - 5 clips of 2 minutes each

Which sampling strategy provides the most accurate estimate?
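
As a possible starting point, the rate estimate itself can be written as a small helper: count the vocalizations that begin inside the annotated clips, then divide by the total annotated duration in hours. This is a sketch in plain Python under the assumption that clips do not overlap. The function `rate_per_hour` and the toy values are hypothetical, and the comparison between the three schemes is left to the reader.

```python
MS_PER_HOUR = 3_600_000


def rate_per_hour(vocalization_onsets, clips):
    """Estimate vocalizations per hour from the annotated clips only.

    `vocalization_onsets` is a list of onsets (ms) and `clips` a list of
    non-overlapping (onset, offset) pairs (ms). Hypothetical helper.
    """
    # vocalizations that start within one of the clips
    count = sum(
        1
        for t in vocalization_onsets
        if any(onset <= t < offset for onset, offset in clips)
    )
    annotated_ms = sum(offset - onset for onset, offset in clips)
    return count / (annotated_ms / MS_PER_HOUR)


# toy example: two 30-second clips and four vocalizations,
# three of which fall inside the clips
toy_clips = [(0, 30_000), (60_000, 90_000)]
toy_onsets = [1_000, 5_000, 45_000, 61_000]

estimate = rate_per_hour(toy_onsets, toy_clips)
print(f"{estimate:.0f} vocalizations per hour")
```

Applying the same helper to the clips produced by each scheme, and comparing the results to the rate computed over the whole recording, is one way to assess which scheme is most accurate.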
