# Working with phonetic dataframes

This notebook illustrates some commonly used operations on dataframes that contain phonetic labels. See the ['Using audiolabel' notebook](using_audiolabel.ipynb) for instructions on reading label files, such as Praat textgrids, into dataframes.

In [1]:
import os
import pandas as pd
from audiolabel import read_label

Load the label file with `read_label`. The source textgrid has three tiers, as shown in the image. A few labels are not visible. ![Image of label tiers](this_is_a_label_file.png)

In [2]:
relpath = '../test/'
fname = 'this_is_a_label_file.TextGrid'
tgpath = os.path.join(relpath, fname)
[phdf, wddf, ctxdf] = read_label(
    tgpath,
    ftype='praat',
    tiers=['phone', 'word', 'context']
)
phdf

Unnamed: 0,t1,t2,phone,fname
0,0.012472,0.192063,DH,../test/this_is_a_label_file.TextGrid
1,0.192063,0.291837,IH1,../test/this_is_a_label_file.TextGrid
2,0.291837,0.441497,S,../test/this_is_a_label_file.TextGrid
3,0.441497,0.501361,IH1,../test/this_is_a_label_file.TextGrid
4,0.501361,0.611111,Z,../test/this_is_a_label_file.TextGrid
5,0.611111,0.660998,AH0,../test/this_is_a_label_file.TextGrid
6,0.660998,0.80068,L,../test/this_is_a_label_file.TextGrid
7,0.80068,0.970295,EY1,../test/this_is_a_label_file.TextGrid
8,0.970295,1.000227,B,../test/this_is_a_label_file.TextGrid
9,1.000227,1.030159,AH0,../test/this_is_a_label_file.TextGrid


## Saving to a `.csv` file

If you need to work with your labels in a spreadsheet or in R, you can save your dataframe to a `.csv` file with `to_csv`. The index is usually not useful as a column, so we pass `index=False` to leave it out.

In [None]:
ctxdf.to_csv('context.csv', index=False)
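
The effect of `index=False` is easiest to see by reading the file back in. The following is a minimal sketch that uses a toy dataframe and an in-memory buffer instead of a file on disk; the `roundtrip` helper and the data values are made up for illustration.

```python
import io
import pandas as pd

# Toy stand-in for a label dataframe (values are illustrative only).
df = pd.DataFrame({
    't1': [0.0, 0.5],
    't2': [0.5, 1.0],
    'context': ['happy', 'sad'],
})

def roundtrip(frame, **to_csv_kwargs):
    '''Write `frame` to an in-memory CSV and read it back (hypothetical helper).'''
    buf = io.StringIO()
    frame.to_csv(buf, **to_csv_kwargs)
    buf.seek(0)
    return pd.read_csv(buf)

with_index = roundtrip(df)                  # index written as an unnamed column
without_index = roundtrip(df, index=False)  # index omitted

print(with_index.columns.tolist())     # ['Unnamed: 0', 't1', 't2', 'context']
print(without_index.columns.tolist())  # ['t1', 't2', 'context']
```

When the index is written, it comes back as an extra column named `Unnamed: 0`, which is rarely what you want in a spreadsheet or in R.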

## Add a 'duration' column

Label durations are simply the difference between the `t1` and `t2` columns.

In [3]:
phdf['dur_ph'] = phdf.t2 - phdf.t1
phdf

Unnamed: 0,t1,t2,phone,fname,dur_ph
0,0.012472,0.192063,DH,../test/this_is_a_label_file.TextGrid,0.179592
1,0.192063,0.291837,IH1,../test/this_is_a_label_file.TextGrid,0.099773
2,0.291837,0.441497,S,../test/this_is_a_label_file.TextGrid,0.14966
3,0.441497,0.501361,IH1,../test/this_is_a_label_file.TextGrid,0.059864
4,0.501361,0.611111,Z,../test/this_is_a_label_file.TextGrid,0.109751
5,0.611111,0.660998,AH0,../test/this_is_a_label_file.TextGrid,0.049887
6,0.660998,0.80068,L,../test/this_is_a_label_file.TextGrid,0.139683
7,0.80068,0.970295,EY1,../test/this_is_a_label_file.TextGrid,0.169615
8,0.970295,1.000227,B,../test/this_is_a_label_file.TextGrid,0.029932
9,1.000227,1.030159,AH0,../test/this_is_a_label_file.TextGrid,0.029932


<a name="str_extract"></a>
## Extracting columns from a string column

String columns can be parsed into additional variables with the [`str.extract` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html). In our 'phone' column the labels identify individual phones with an optional stress value, which we extract into 'barephone' and 'stress' columns. For convenience we also use `fillna` to ensure cells with missing values contain an empty string instead of NaN.

The names of the capture groups in the [regular expression](https://docs.python.org/3/library/re.html) become the corresponding column names in the output. See [pythex.org](https://pythex.org/) for a convenient way to practice and test your regular expressions.

In [4]:
phdf.phone.str.extract(r'(?P<barephone>[^\d]+)(?P<stress>\d*)').fillna('')

Unnamed: 0,barephone,stress
0,DH,
1,IH,1.0
2,S,
3,IH,1.0
4,Z,
5,AH,0.0
6,L,
7,EY,1.0
8,B,
9,AH,0.0
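
To see what the two capture groups and `fillna` each contribute, here is a small sketch on a toy series. The label `'2'` is a made-up example of a label that the pattern cannot match, because it contains no non-digit characters.

```python
import pandas as pd

# Toy labels: vowels carry a stress digit, consonants do not.
# '2' is a hypothetical malformed label with no phone symbol at all.
phones = pd.Series(['DH', 'IH1', 'S', 'AH0', '2'])

pattern = r'(?P<barephone>[^\d]+)(?P<stress>\d*)'

# Labels that do not match the pattern yield NaN in every group;
# fillna('') replaces those NaN values with empty strings.
parts = phones.str.extract(pattern).fillna('')

print(parts['barephone'].tolist())
print(parts['stress'].tolist())
```

The extracted values are strings, so a stress of `'1'` is text rather than a number until you convert it.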


Use [`pd.concat`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) to add the extracted columns to `phdf`. The `axis='columns'` argument indicates that we are adding columns; by default `pd.concat` adds rows.

In [5]:
phdf = pd.concat(
    [
        phdf,
        phdf.phone.str.extract(r'(?P<barephone>[^\d]+)(?P<stress>\d*)').fillna('')
    ],
    axis='columns'
)
phdf

Unnamed: 0,t1,t2,phone,fname,dur_ph,barephone,stress
0,0.012472,0.192063,DH,../test/this_is_a_label_file.TextGrid,0.179592,DH,
1,0.192063,0.291837,IH1,../test/this_is_a_label_file.TextGrid,0.099773,IH,1.0
2,0.291837,0.441497,S,../test/this_is_a_label_file.TextGrid,0.14966,S,
3,0.441497,0.501361,IH1,../test/this_is_a_label_file.TextGrid,0.059864,IH,1.0
4,0.501361,0.611111,Z,../test/this_is_a_label_file.TextGrid,0.109751,Z,
5,0.611111,0.660998,AH0,../test/this_is_a_label_file.TextGrid,0.049887,AH,0.0
6,0.660998,0.80068,L,../test/this_is_a_label_file.TextGrid,0.139683,L,
7,0.80068,0.970295,EY1,../test/this_is_a_label_file.TextGrid,0.169615,EY,1.0
8,0.970295,1.000227,B,../test/this_is_a_label_file.TextGrid,0.029932,B,
9,1.000227,1.030159,AH0,../test/this_is_a_label_file.TextGrid,0.029932,AH,0.0


## Including preceding/following labels

The [`shift` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html) can be used to shift label values by one or more rows. Use this to add surrounding context (e.g. previous/next phone) to the labels.

The following cell shifts the phones down one row. The shift inserts NaN into the first row, which we fill with an empty string.

In [6]:
phdf.barephone.shift(1).fillna('')

0       
1     DH
2     IH
3      S
4     IH
5      Z
6     AH
7      L
8     EY
9      B
10    AH
11     L
12     F
13    AY
14     L
Name: barephone, dtype: object

Negative shifts move the values up. Now the last row is filled with an empty string.

In [7]:
phdf.barephone.shift(-1).fillna('')

0     IH
1      S
2     IH
3      Z
4     AH
5      L
6     EY
7      B
8     AH
9      L
10     F
11    AY
12     L
13    sp
14      
Name: barephone, dtype: object

Assign the `shift` values to new columns that record the phone context.

In [8]:
phdf['prev_ph'] = phdf.barephone.shift(1).fillna('')
phdf['next_ph'] = phdf.barephone.shift(-1).fillna('')
phdf

Unnamed: 0,t1,t2,phone,fname,dur_ph,barephone,stress,prev_ph,next_ph
0,0.012472,0.192063,DH,../test/this_is_a_label_file.TextGrid,0.179592,DH,,,IH
1,0.192063,0.291837,IH1,../test/this_is_a_label_file.TextGrid,0.099773,IH,1.0,DH,S
2,0.291837,0.441497,S,../test/this_is_a_label_file.TextGrid,0.14966,S,,IH,IH
3,0.441497,0.501361,IH1,../test/this_is_a_label_file.TextGrid,0.059864,IH,1.0,S,Z
4,0.501361,0.611111,Z,../test/this_is_a_label_file.TextGrid,0.109751,Z,,IH,AH
5,0.611111,0.660998,AH0,../test/this_is_a_label_file.TextGrid,0.049887,AH,0.0,Z,L
6,0.660998,0.80068,L,../test/this_is_a_label_file.TextGrid,0.139683,L,,AH,EY
7,0.80068,0.970295,EY1,../test/this_is_a_label_file.TextGrid,0.169615,EY,1.0,L,B
8,0.970295,1.000227,B,../test/this_is_a_label_file.TextGrid,0.029932,B,,EY,AH
9,1.000227,1.030159,AH0,../test/this_is_a_label_file.TextGrid,0.029932,AH,0.0,B,L
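
Once labels from several files live in one dataframe, a plain `shift` would let the last phone of one file become the 'previous' phone of the next file. One possible way to avoid that, sketched here on toy data with hypothetical file names, is to shift within groups defined by the `fname` column.

```python
import pandas as pd

# Toy data: two hypothetical files with two phones each.
df = pd.DataFrame({
    'fname': ['a.TextGrid', 'a.TextGrid', 'b.TextGrid', 'b.TextGrid'],
    'barephone': ['DH', 'IH', 'S', 'AH'],
})

# Shift within each file so context never crosses a file boundary.
by_file = df.groupby('fname')['barephone']
df['prev_ph'] = by_file.shift(1).fillna('')
df['next_ph'] = by_file.shift(-1).fillna('')

print(df)
```

Here the first phone of `b.TextGrid` gets an empty `prev_ph` instead of inheriting 'IH' from the end of `a.TextGrid`.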


## Including other metadata

Looking ahead, and assuming you will combine multiple label files from multiple subjects, you can also add subject-related metadata to your dataframes before merging tiers and files. As an example, we'll add a `speaker` identifier to `ctxdf`.

In [9]:
ctxdf['speaker'] = 'SID1'
ctxdf

Unnamed: 0,t1,t2,context,fname,speaker
0,0.012472,0.611111,happy,../test/this_is_a_label_file.TextGrid,SID1
1,0.611111,1.139909,sad,../test/this_is_a_label_file.TextGrid,SID1
2,1.139909,1.648753,happy,../test/this_is_a_label_file.TextGrid,SID1


Here we manually assigned an identifier. In a research project you can keep track of your subjects in a way that allows you to assign speaker metadata automatically. A couple of possibilities:

1. Include subject info in your file naming scheme so that it can be parsed automatically from the `fname` column using `str.extract`, similar to the way we [added `barephone` and `stress` columns](#str_extract).
1. Keep a dataframe-compatible text file, spreadsheet, or database that maps subject metadata to filepaths. Use one of the Pandas loading functions, such as `pd.read_csv`, to load the metadata and merge it with one of your dataframes.
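
Both possibilities can be sketched briefly. The file naming scheme (`<speaker>_<task>.TextGrid`), the speaker IDs, and the `age` column below are all invented for illustration; in practice the metadata table would usually come from `pd.read_csv` rather than being typed in.

```python
import pandas as pd

# Hypothetical naming scheme: '<speaker>_<task>.TextGrid'.
df = pd.DataFrame({
    'fname': ['../test/S01_read.TextGrid', '../test/S02_read.TextGrid'],
    'context': ['happy', 'sad'],
})

# Possibility 1: parse the speaker ID out of the file path.
# With a single capture group and expand=False the result is a Series.
df['speaker'] = df.fname.str.extract(r'(?P<speaker>S\d+)_', expand=False)

# Possibility 2: merge with a metadata table keyed on the speaker ID.
meta = pd.DataFrame({
    'speaker': ['S01', 'S02'],
    'age': [31, 27],  # illustrative values only
})
df = df.merge(meta, on='speaker', how='left')

print(df)
```

A left merge keeps every label row even if a speaker is missing from the metadata table; the missing metadata cells are then NaN.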

<a name="merging-tiers"></a>
## Merging tiers

It can be useful to merge tiers based on the starting time of the labels (`t1`). For instance, you can add the 'word' metadata to the 'phone' tier with the [`pd.merge_asof` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html).

Phonetic tiers often have labels of different durations, and `merge_asof` works best if the left dataframe is the one with the shorter labels. In this case multiple 'phone' labels make up each 'word' label, so `phdf` is used as the left dataframe, which means it is the first argument to `merge_asof`.

**The examples shown here assume the tiers are strictly hierarchical, meaning each 'phone' belongs to one 'word' only, and each 'word' to one 'context' only.** If you need to merge non-hierarchical tiers check the `merge_asof` documentation to determine how to handle your data.

Use tests like the following to ensure a strict hierarchy exists. The `assert` statements check that every boundary in a containing tier matches one of the boundaries in the contained tier. If any of the following tests fail, then your tiers are not strictly hierarchical.

In [10]:
# words contain phones
assert(wddf.t1.isin(phdf.t1).all())
assert(wddf.t2.isin(phdf.t2).all())

# contexts contain words
assert(ctxdf.t1.isin(wddf.t1).all())
assert(ctxdf.t2.isin(wddf.t2).all())
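
If you would rather get a yes/no answer than an exception, for instance to skip a problem file, the same checks can be wrapped in a function. This is only a sketch; `is_contained` is a hypothetical helper name and the tiers are toy data.

```python
import pandas as pd

def is_contained(outer, inner):
    '''Return True if every boundary in `outer` is also a boundary in `inner`.'''
    starts_ok = outer.t1.isin(inner.t1).all()
    ends_ok = outer.t2.isin(inner.t2).all()
    return bool(starts_ok and ends_ok)

# Toy tiers (times are illustrative only).
phones = pd.DataFrame({'t1': [0.0, 0.2, 0.5], 't2': [0.2, 0.5, 0.9]})
words = pd.DataFrame({'t1': [0.0, 0.5], 't2': [0.5, 0.9]})
bad_words = pd.DataFrame({'t1': [0.0, 0.3], 't2': [0.3, 0.9]})

print(is_contained(words, phones))      # True
print(is_contained(bad_words, phones))  # False
```

Like the `assert` version, this relies on exact equality of the boundary times, which is reasonable when both tiers come from the same label file.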

Since the tests succeed we can proceed to merging the 'phone' and 'word' tiers.

In [11]:
phwddf = pd.merge_asof(
    phdf.rename({'t1': 't1_ph', 't2': 't2_ph'}, axis='columns'),
    wddf.drop('fname', axis='columns') \
        .rename({'t1': 't1_wd', 't2': 't2_wd'}, axis='columns'),
    left_on='t1_ph',
    right_on='t1_wd'
)
phwddf

Unnamed: 0,t1_ph,t2_ph,phone,fname,dur_ph,barephone,stress,prev_ph,next_ph,t1_wd,t2_wd,word
0,0.012472,0.192063,DH,../test/this_is_a_label_file.TextGrid,0.179592,DH,,,IH,0.012472,0.441497,THIS
1,0.192063,0.291837,IH1,../test/this_is_a_label_file.TextGrid,0.099773,IH,1.0,DH,S,0.012472,0.441497,THIS
2,0.291837,0.441497,S,../test/this_is_a_label_file.TextGrid,0.14966,S,,IH,IH,0.012472,0.441497,THIS
3,0.441497,0.501361,IH1,../test/this_is_a_label_file.TextGrid,0.059864,IH,1.0,S,Z,0.441497,0.611111,IS
4,0.501361,0.611111,Z,../test/this_is_a_label_file.TextGrid,0.109751,Z,,IH,AH,0.441497,0.611111,IS
5,0.611111,0.660998,AH0,../test/this_is_a_label_file.TextGrid,0.049887,AH,0.0,Z,L,0.611111,0.660998,A
6,0.660998,0.80068,L,../test/this_is_a_label_file.TextGrid,0.139683,L,,AH,EY,0.660998,1.139909,LABEL
7,0.80068,0.970295,EY1,../test/this_is_a_label_file.TextGrid,0.169615,EY,1.0,L,B,0.660998,1.139909,LABEL
8,0.970295,1.000227,B,../test/this_is_a_label_file.TextGrid,0.029932,B,,EY,AH,0.660998,1.139909,LABEL
9,1.000227,1.030159,AH0,../test/this_is_a_label_file.TextGrid,0.029932,AH,0.0,B,L,0.660998,1.139909,LABEL
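
A toy-sized version of the same merge makes the matching rule easier to see. By default `merge_asof` searches backward: each left row is paired with the last right row whose key is less than or equal to the left key, and both dataframes must be sorted by their keys. The times and labels below are invented for illustration.

```python
import pandas as pd

# Five phones and the two words that contain them (toy values).
phones = pd.DataFrame({
    't1_ph': [0.0, 0.2, 0.4, 0.6, 0.7],
    'phone': ['DH', 'IH1', 'S', 'IH1', 'Z'],
})
words = pd.DataFrame({
    't1_wd': [0.0, 0.6],
    'word': ['THIS', 'IS'],
})

# Each phone picks up the most recent word that started at or before it.
merged = pd.merge_asof(phones, words, left_on='t1_ph', right_on='t1_wd')

print(merged)
```

The phone starting at 0.4 is matched to the word starting at 0.0, because the next word does not begin until 0.6.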


#### Adding more variables

Now that the 'phone' and 'word' tiers are merged, you might want to add variables that indicate whether a phone is word-initial or word-final. You can do that by comparing the phone and word times.

In [12]:
phwddf['is_wdinit_ph'] = phwddf.t1_ph == phwddf.t1_wd
phwddf['is_wdfin_ph'] = phwddf.t2_ph == phwddf.t2_wd
phwddf

Unnamed: 0,t1_ph,t2_ph,phone,fname,dur_ph,barephone,stress,prev_ph,next_ph,t1_wd,t2_wd,word,is_wdinit_ph,is_wdfin_ph
0,0.012472,0.192063,DH,../test/this_is_a_label_file.TextGrid,0.179592,DH,,,IH,0.012472,0.441497,THIS,True,False
1,0.192063,0.291837,IH1,../test/this_is_a_label_file.TextGrid,0.099773,IH,1.0,DH,S,0.012472,0.441497,THIS,False,False
2,0.291837,0.441497,S,../test/this_is_a_label_file.TextGrid,0.14966,S,,IH,IH,0.012472,0.441497,THIS,False,True
3,0.441497,0.501361,IH1,../test/this_is_a_label_file.TextGrid,0.059864,IH,1.0,S,Z,0.441497,0.611111,IS,True,False
4,0.501361,0.611111,Z,../test/this_is_a_label_file.TextGrid,0.109751,Z,,IH,AH,0.441497,0.611111,IS,False,True
5,0.611111,0.660998,AH0,../test/this_is_a_label_file.TextGrid,0.049887,AH,0.0,Z,L,0.611111,0.660998,A,True,True
6,0.660998,0.80068,L,../test/this_is_a_label_file.TextGrid,0.139683,L,,AH,EY,0.660998,1.139909,LABEL,True,False
7,0.80068,0.970295,EY1,../test/this_is_a_label_file.TextGrid,0.169615,EY,1.0,L,B,0.660998,1.139909,LABEL,False,False
8,0.970295,1.000227,B,../test/this_is_a_label_file.TextGrid,0.029932,B,,EY,AH,0.660998,1.139909,LABEL,False,False
9,1.000227,1.030159,AH0,../test/this_is_a_label_file.TextGrid,0.029932,AH,0.0,B,L,0.660998,1.139909,LABEL,False,False


Last, we merge the labels from the 'context' tier. The 't1' and 't2' column names from `ctxdf` do not match any column names in `phwddf`, so they won't get an automatic suffix. We therefore rename them to add the '\_ctx' suffix before merging.

In [13]:
pwcdf = pd.merge_asof(
    phwddf,
    ctxdf.drop('fname', axis='columns').rename({'t1': 't1_ctx', 't2': 't2_ctx'}, axis='columns'),
    left_on='t1_ph',
    right_on='t1_ctx'
)
pwcdf

Unnamed: 0,t1_ph,t2_ph,phone,fname,dur_ph,barephone,stress,prev_ph,next_ph,t1_wd,t2_wd,word,is_wdinit_ph,is_wdfin_ph,t1_ctx,t2_ctx,context,speaker
0,0.012472,0.192063,DH,../test/this_is_a_label_file.TextGrid,0.179592,DH,,,IH,0.012472,0.441497,THIS,True,False,0.012472,0.611111,happy,SID1
1,0.192063,0.291837,IH1,../test/this_is_a_label_file.TextGrid,0.099773,IH,1.0,DH,S,0.012472,0.441497,THIS,False,False,0.012472,0.611111,happy,SID1
2,0.291837,0.441497,S,../test/this_is_a_label_file.TextGrid,0.14966,S,,IH,IH,0.012472,0.441497,THIS,False,True,0.012472,0.611111,happy,SID1
3,0.441497,0.501361,IH1,../test/this_is_a_label_file.TextGrid,0.059864,IH,1.0,S,Z,0.441497,0.611111,IS,True,False,0.012472,0.611111,happy,SID1
4,0.501361,0.611111,Z,../test/this_is_a_label_file.TextGrid,0.109751,Z,,IH,AH,0.441497,0.611111,IS,False,True,0.012472,0.611111,happy,SID1
5,0.611111,0.660998,AH0,../test/this_is_a_label_file.TextGrid,0.049887,AH,0.0,Z,L,0.611111,0.660998,A,True,True,0.611111,1.139909,sad,SID1
6,0.660998,0.80068,L,../test/this_is_a_label_file.TextGrid,0.139683,L,,AH,EY,0.660998,1.139909,LABEL,True,False,0.611111,1.139909,sad,SID1
7,0.80068,0.970295,EY1,../test/this_is_a_label_file.TextGrid,0.169615,EY,1.0,L,B,0.660998,1.139909,LABEL,False,False,0.611111,1.139909,sad,SID1
8,0.970295,1.000227,B,../test/this_is_a_label_file.TextGrid,0.029932,B,,EY,AH,0.660998,1.139909,LABEL,False,False,0.611111,1.139909,sad,SID1
9,1.000227,1.030159,AH0,../test/this_is_a_label_file.TextGrid,0.029932,AH,0.0,B,L,0.660998,1.139909,LABEL,False,False,0.611111,1.139909,sad,SID1


## Combining multiple label files

Concatenating dataframes from multiple files is another common way to combine label data. Normally you will do this after you have merged the tiers within each label file.

We'll start by observing what happens when you use `pd.concat` to add the `ctxdf` dataframe to itself. By default the rows of the input dataframes are stacked. Note that the index has repeated values.

In [14]:
pd.concat([ctxdf, ctxdf])

Unnamed: 0,t1,t2,context,fname,speaker
0,0.012472,0.611111,happy,../test/this_is_a_label_file.TextGrid,SID1
1,0.611111,1.139909,sad,../test/this_is_a_label_file.TextGrid,SID1
2,1.139909,1.648753,happy,../test/this_is_a_label_file.TextGrid,SID1
0,0.012472,0.611111,happy,../test/this_is_a_label_file.TextGrid,SID1
1,0.611111,1.139909,sad,../test/this_is_a_label_file.TextGrid,SID1
2,1.139909,1.648753,happy,../test/this_is_a_label_file.TextGrid,SID1


To clean up the index we add `ignore_index=True`. Now the index values are consecutive integers.

In [15]:
pd.concat([ctxdf, ctxdf], ignore_index=True)

Unnamed: 0,t1,t2,context,fname,speaker
0,0.012472,0.611111,happy,../test/this_is_a_label_file.TextGrid,SID1
1,0.611111,1.139909,sad,../test/this_is_a_label_file.TextGrid,SID1
2,1.139909,1.648753,happy,../test/this_is_a_label_file.TextGrid,SID1
3,0.012472,0.611111,happy,../test/this_is_a_label_file.TextGrid,SID1
4,0.611111,1.139909,sad,../test/this_is_a_label_file.TextGrid,SID1
5,1.139909,1.648753,happy,../test/this_is_a_label_file.TextGrid,SID1


### Combine multiple label files by iteration

`pd.concat` can combine dataframes from a list of arbitrary length. An efficient way to construct a single dataframe from a large set of input label files is to iterate over the input filenames and create a list of dataframes, one per textgrid, then stack them with `pd.concat`. (This is faster than incrementally adding to the master dataframe by using `pd.concat` every time a new textgrid is loaded.)

We start with a set of label file names in the form of a dataframe, as defined in the next cell. An easy way to construct a similar dataframe from a directory tree is provided by the [`dir2df` function](https://github.com/rsprouse/phonlab/blob/master/doc/Retrieving%20filenames%20in%20a%20directory%20tree%20with%20%60dir2df()%60.ipynb).

In [16]:
tgdf = pd.DataFrame({
    'relpath': '../test',
    'fname': ['this_is_a_label_file.TextGrid', 'this_is_a_label_file_scaled.TextGrid'],
    'subject': ['1', '2']
})
tgdf

Unnamed: 0,relpath,fname,subject
0,../test,this_is_a_label_file.TextGrid,1
1,../test,this_is_a_label_file_scaled.TextGrid,2


Next we define a function that reads a textgrid from a row of `tgdf` and returns a dataframe that was merged using the techniques described in the ['Merging tiers' section](#merging-tiers) above.

If you wish, you can add metadata, such as duration, as you load the textgrid tiers in the `tg2df` function. Adding phone durations and a speaker identifier is shown.

In [17]:
def tg2df(row):
    '''Load 'phone', 'word', and 'context' tiers from a textgrid and merge them.

    Parameters
    ----------
    row : namedtuple
        A namedtuple as provided by `itertuples`. `row.relpath` and `row.fname`
        identify the Praat textgrid to load, and `row.subject` is stored in the
        'speaker' column. The textgrid is expected to have 'phone', 'word', and
        'context' tiers.

    Returns
    -------
    mergedf : DataFrame
        The merged dataframe.
    '''
    [phdf, wddf, ctxdf] = read_label(
        os.path.join(row.relpath, row.fname),
        ftype='praat',
        tiers=['phone', 'word', 'context']
    )
    # Throw an error if tiers are not strictly hierarchical.
    # words contain phones
    assert(wddf.t1.isin(phdf.t1).all())
    assert(wddf.t2.isin(phdf.t2).all())

    # contexts contain words
    assert(ctxdf.t1.isin(wddf.t1).all())
    assert(ctxdf.t2.isin(wddf.t2).all())
    
    # Add phone duration and speaker
    phdf['dur_ph'] = phdf.t2 - phdf.t1
    phdf['speaker'] = row.subject

    # Merge phone and word tiers.
    phwddf = pd.merge_asof(
        phdf.rename({'t1': 't1_ph', 't2': 't2_ph'}, axis='columns'),
        wddf.drop('fname', axis='columns') \
            .rename({'t1': 't1_wd', 't2': 't2_wd'}, axis='columns'),
        left_on='t1_ph',
        right_on='t1_wd'
    )

    # Add word-init and -final columns
    phwddf['is_wdinit_ph'] = phwddf.t1_ph == phwddf.t1_wd
    phwddf['is_wdfin_ph'] = phwddf.t2_ph == phwddf.t2_wd

    # Merge context tier and return the result.
    return pd.merge_asof(
        phwddf,
        ctxdf.drop('fname', axis='columns').rename({'t1': 't1_ctx', 't2': 't2_ctx'}, axis='columns'),
        left_on='t1_ph',
        right_on='t1_ctx'
    )

The [`itertuples` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.itertuples.html) iterates over the rows of a dataframe. We can use it to apply the `tg2df` function to each textgrid in `tgdf` and compile the results into a list of dataframes. 

In [18]:
dflist = [tg2df(row) for row in tgdf.itertuples()]
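
To see what each `row` looks like inside that list comprehension, here is a small sketch with hypothetical file names. Each row is a namedtuple whose fields are the column names, plus an `Index` field that holds the row's index value.

```python
import os
import pandas as pd

files = pd.DataFrame({
    'relpath': '../test',
    'fname': ['a.TextGrid', 'b.TextGrid'],  # hypothetical file names
})

# Column values are available as attributes of each row.
paths = [os.path.join(row.relpath, row.fname) for row in files.itertuples()]
indexes = [row.Index for row in files.itertuples()]

print(paths)
print(indexes)
```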

`pd.concat` stacks the dataframes from each of the textgrids. The `ignore_index=True` argument ensures that the index of the combined dataframe has no repeated values. Otherwise the index would have repetitions starting with 0 for each input textgrid.

In [19]:
alldf = pd.concat(dflist, ignore_index=True)
alldf

Unnamed: 0,t1_ph,t2_ph,phone,fname,dur_ph,speaker,t1_wd,t2_wd,word,is_wdinit_ph,is_wdfin_ph,t1_ctx,t2_ctx,context
0,0.012472,0.192063,DH,../test/this_is_a_label_file.TextGrid,0.179592,1,0.012472,0.441497,THIS,True,False,0.012472,0.611111,happy
1,0.192063,0.291837,IH1,../test/this_is_a_label_file.TextGrid,0.099773,1,0.012472,0.441497,THIS,False,False,0.012472,0.611111,happy
2,0.291837,0.441497,S,../test/this_is_a_label_file.TextGrid,0.14966,1,0.012472,0.441497,THIS,False,True,0.012472,0.611111,happy
3,0.441497,0.501361,IH1,../test/this_is_a_label_file.TextGrid,0.059864,1,0.441497,0.611111,IS,True,False,0.012472,0.611111,happy
4,0.501361,0.611111,Z,../test/this_is_a_label_file.TextGrid,0.109751,1,0.441497,0.611111,IS,False,True,0.012472,0.611111,happy
5,0.611111,0.660998,AH0,../test/this_is_a_label_file.TextGrid,0.049887,1,0.611111,0.660998,A,True,True,0.611111,1.139909,sad
6,0.660998,0.80068,L,../test/this_is_a_label_file.TextGrid,0.139683,1,0.660998,1.139909,LABEL,True,False,0.611111,1.139909,sad
7,0.80068,0.970295,EY1,../test/this_is_a_label_file.TextGrid,0.169615,1,0.660998,1.139909,LABEL,False,False,0.611111,1.139909,sad
8,0.970295,1.000227,B,../test/this_is_a_label_file.TextGrid,0.029932,1,0.660998,1.139909,LABEL,False,False,0.611111,1.139909,sad
9,1.000227,1.030159,AH0,../test/this_is_a_label_file.TextGrid,0.029932,1,0.660998,1.139909,LABEL,False,False,0.611111,1.139909,sad


## Getting summary statistics

Pandas has features for calculating [summary statistics](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html). Several are illustrated below.

### Aggregating statistics

#### Mean durations

In [27]:
alldf.mean()

t1_ph           1.203016e+00
t2_ph           1.366644e+00
dur_ph          1.636281e-01
speaker         3.703704e+27
t1_wd           9.885034e-01
t2_wd           1.544240e+00
is_wdinit_ph    4.000000e-01
is_wdfin_ph     4.000000e-01
t1_ctx          8.288662e-01
t2_ctx          1.649002e+00
dtype: float64
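
The `speaker` entry in the output above is not meaningful, since `speaker` holds identifier strings rather than measurements. In recent pandas releases, calling `mean()` on a dataframe that also contains string columns may raise a `TypeError` instead of returning a result. A sketch of two ways to restrict the calculation, using toy data: select the column of interest, or pass `numeric_only=True`.

```python
import pandas as pd

# Toy data with string and numeric columns (values are illustrative only).
df = pd.DataFrame({
    'phone': ['DH', 'IH1', 'S'],
    'speaker': ['1', '1', '2'],
    'dur_ph': [0.18, 0.10, 0.14],
})

# Mean of a single column.
mean_dur = df['dur_ph'].mean()

# Means of all numeric columns; string columns are left out.
numeric_means = df.mean(numeric_only=True)

print(mean_dur)
print(numeric_means)
```

The same `numeric_only=True` argument is accepted by `median()`.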

#### Median durations

In [28]:
alldf.median()

t1_ph           1.016440
t2_ph           1.181066
dur_ph          0.139683
speaker         1.500000
t1_wd           0.882993
t2_wd           1.222222
is_wdinit_ph    0.000000
is_wdfin_ph     0.000000
t1_ctx          0.611111
t2_ctx          1.222222
dtype: float64

#### `describe`

In [29]:
alldf.describe()

Unnamed: 0,t1_ph,t2_ph,dur_ph,t1_wd,t2_wd,t1_ctx,t2_ctx
count,30.0,30.0,30.0,30.0,30.0,30.0,30.0
mean,1.203016,1.366644,0.163628,0.988503,1.54424,0.828866,1.649002
std,0.840377,0.86019,0.110719,0.791842,0.893928,0.756236,0.853182
min,0.012472,0.192063,0.019955,0.012472,0.441497,0.012472,0.611111
25%,0.590533,0.695918,0.069841,0.4839,0.882993,0.024943,1.139909
50%,1.01644,1.181066,0.139683,0.882993,1.222222,0.611111,1.222222
75%,1.621939,1.86763,0.226984,1.321995,2.279819,1.222222,2.279819
max,3.257596,3.297506,0.458957,3.257596,3.297506,2.279819,3.297506


### Aggregating statistics by group

#### Mean durations by context

In [30]:
alldf.groupby('context').mean()

context,t1_ph,t2_ph,dur_ph,t1_wd,t2_wd,is_wdinit_ph,is_wdfin_ph,t1_ctx,t2_ctx
happy,1.159448,1.344029,0.18458,0.994822,1.513643,0.444444,0.444444,0.770333,1.608428
sad,1.268367,1.400567,0.1322,0.979025,1.590136,0.333333,0.333333,0.916667,1.709864
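
If only the durations are of interest, you can select the `dur_ph` column after grouping and request several statistics at once with `agg`. This is a sketch on toy data; in recent pandas releases a grouped `mean()` over a dataframe that includes string columns may also need `numeric_only=True`.

```python
import pandas as pd

# Toy data (values are illustrative only).
df = pd.DataFrame({
    'context': ['happy', 'happy', 'sad', 'sad'],
    'phone': ['DH', 'IH1', 'AH0', 'B'],
    'dur_ph': [0.20, 0.10, 0.06, 0.04],
})

# One row per context, one column per statistic.
dur_stats = df.groupby('context')['dur_ph'].agg(['mean', 'median', 'count'])

print(dur_stats)
```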


#### Mean durations by context and phone

In [32]:
alldf.groupby(['context', 'phone']).mean()

context,phone,t1_ph,t2_ph,dur_ph,t1_wd,t2_wd,is_wdinit_ph,is_wdfin_ph,t1_ctx,t2_ctx
happy,AY1,1.889456,2.233673,0.344218,1.709864,2.443197,0.0,0.0,1.709864,2.473129
happy,DH,0.018707,0.288095,0.269388,0.018707,0.662245,1.0,0.0,0.018707,0.916667
happy,F,1.709864,1.889456,0.179592,1.709864,2.443197,1.0,0.0,1.709864,2.473129
happy,IH1,0.47517,0.594898,0.119728,0.340476,0.789456,0.5,0.0,0.018707,0.916667
happy,L,2.233673,2.443197,0.209524,1.709864,2.443197,0.0,1.0,1.709864,2.473129
happy,S,0.437755,0.662245,0.22449,0.018707,0.662245,0.0,1.0,0.018707,0.916667
happy,Z,0.752041,0.916667,0.164626,0.662245,0.916667,0.0,1.0,0.018707,0.916667
happy,sp,2.443197,2.473129,0.029932,2.443197,2.473129,1.0,1.0,1.709864,2.473129
sad,AH0,1.208503,1.268367,0.059864,0.954082,1.35068,0.5,0.5,0.916667,1.709864
sad,B,1.455442,1.50034,0.044898,0.991497,1.709864,0.0,0.0,0.916667,1.709864


#### `describe` by categories

In [33]:
alldf.groupby(['context', 'phone']).describe()

,,t1_ph,t1_ph,t1_ph,t1_ph,t1_ph,t1_ph,t1_ph,t1_ph,t2_ph,t2_ph,...,t1_ctx,t1_ctx,t2_ctx,t2_ctx,t2_ctx,t2_ctx,t2_ctx,t2_ctx,t2_ctx,t2_ctx
context,phone,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
happy,AY1,2.0,1.889456,0.890698,1.259637,1.574546,1.889456,2.204365,2.519274,2.0,2.233673,...,1.994841,2.279819,2.0,2.473129,1.165844,1.648753,2.060941,2.473129,2.885317,3.297506
happy,DH,2.0,0.018707,0.008819,0.012472,0.01559,0.018707,0.021825,0.024943,2.0,0.288095,...,0.021825,0.024943,2.0,0.916667,0.432121,0.611111,0.763889,0.916667,1.069444,1.222222
happy,F,2.0,1.709864,0.806038,1.139909,1.424887,1.709864,1.994841,2.279819,2.0,1.889456,...,1.994841,2.279819,2.0,2.473129,1.165844,1.648753,2.060941,2.473129,2.885317,3.297506
happy,IH1,4.0,0.47517,0.292057,0.192063,0.336111,0.412812,0.551871,0.882993,4.0,0.594898,...,0.024943,0.024943,4.0,0.916667,0.352825,0.611111,0.611111,0.916667,1.222222,1.222222
happy,L,2.0,2.233673,1.052964,1.489116,1.861395,2.233673,2.605952,2.978231,2.0,2.443197,...,1.994841,2.279819,2.0,2.473129,1.165844,1.648753,2.060941,2.473129,2.885317,3.297506
happy,S,2.0,0.437755,0.20636,0.291837,0.364796,0.437755,0.510714,0.583673,2.0,0.662245,...,0.021825,0.024943,2.0,0.916667,0.432121,0.611111,0.763889,0.916667,1.069444,1.222222
happy,Z,2.0,0.752041,0.354515,0.501361,0.626701,0.752041,0.877381,1.002721,2.0,0.916667,...,0.021825,0.024943,2.0,0.916667,0.432121,0.611111,0.763889,0.916667,1.069444,1.222222
happy,sp,2.0,2.443197,1.151734,1.628798,2.035998,2.443197,2.850397,3.257596,2.0,2.473129,...,1.994841,2.279819,2.0,2.473129,1.165844,1.648753,2.060941,2.473129,2.885317,3.297506
sad,AH0,4.0,1.208503,0.585272,0.611111,0.902948,1.111224,1.41678,2.000454,4.0,1.268367,...,1.222222,1.222222,4.0,1.709864,0.658127,1.139909,1.139909,1.709864,2.279819,2.279819
sad,B,2.0,1.455442,0.686102,0.970295,1.212868,1.455442,1.698016,1.94059,2.0,1.50034,...,1.069444,1.222222,2.0,1.709864,0.806038,1.139909,1.424887,1.709864,1.994841,2.279819


### Count records by category

In [34]:
alldf['phone'].value_counts()

L      6
IH1    4
AH0    4
S      2
DH     2
EY1    2
F      2
AY1    2
Z      2
B      2
sp     2
Name: phone, dtype: int64