
# Data curation

Man Ho Wong | m.wong@pitt.edu | Feb 27th, 2022

This notebook searches for the datasets needed for this project in the following database:
- [CHILDES](https://childes.talkbank.org/)  
  *Reference:* MacWhinney, B. (2000). The CHILDES Project: Tools for analyzing talk. Third Edition. Mahwah, NJ: Lawrence Erlbaum Associates.

I may not need datasets from [Wordbank](http://wordbank.stanford.edu/) as I found that CHILDES probably has all the data I need.

Along the way, I will explore the datasets to get a sense of their contents and structure (participant information, annotations, data format, etc.) and compute some basic statistics. After that, I will identify the information I need in the datasets and compile the data for later processing.

---


# 1 Searching for suitable corpora in CHILDES

CHILDES is a multilingual database containing corpora with transcriptions, audio recordings and/or video recordings of child speech and child-directed speech (CDS) at different developmental stages. Each corpus has a separate directory for each participant, and each directory contains the recording transcripts stored in the CHAT format. ([Example](https://childes.talkbank.org/access/Eng-NA/Brown.html))

For this project, I will need to collect the transcripts for both the child speech and the associated CDS. Additionally, I will need the participant information (i.e. the child's age, sex and socioeconomic status (SES), and the mother's education) and some basic annotations of the words (i.e. morphemes and lexical categories). Participant information can be found in the header of each CHAT file as the metadata of the file. Annotation information can be found in the dependent tiers embedded in the transcription.
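To illustrate how participant information sits in a CHAT header: each participant has an `@ID` line of pipe-separated fields. The sketch below splits one such line with plain Python. The sample line is written for illustration rather than copied from a file, and the helper name `parse_id_line` is hypothetical. `PyLangAcq` does this parsing for you, so this only shows the layout.

```python
# Field order of a CHAT '@ID' header line.
ID_FIELDS = ['language', 'corpus', 'code', 'age', 'sex',
             'group', 'ses', 'role', 'education', 'custom']

def parse_id_line(line):
    """Split one '@ID:' header line into a dict of participant fields."""
    _, _, body = line.partition(':')  # drop the '@ID' label
    values = body.strip().split('|')[:len(ID_FIELDS)]
    # Pad with empty strings in case trailing fields are missing.
    values += [''] * (len(ID_FIELDS) - len(values))
    return dict(zip(ID_FIELDS, values))

# Illustrative line (an assumption, not copied from a real file):
sample = '@ID:\teng|Brown|CHI|2;03.04|male|TD|MC|Target_Child|||'
info = parse_id_line(sample)
print(info['age'], info['sex'], info['ses'])
```

Empty fields come back as empty strings, which is also how missing SES or education values appear in the headers.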

I will first search for the datasets meeting the needs of the project. Here are the basic search criteria:
- Language: North American English (Note: I may need other languages later for comparison)
- Participants: contains only child and mother (i.e. other people such as investigators were not involved)
- Child information: contains child age, sex and socioeconomic status (SES)
- Mother information: contains socioeconomic status (SES), education
- Annotation: contains morpheme and/or grammatical tiers

Let's take a quick look at a sample CHAT file first to see how the data is organized.


## 1.1 Reading CHAT file

The `PyLangAcq` package allows users to read CHAT files directly from a zip file. You can download and install it with the following command:  
`$ pip install --upgrade pylangacq`

For documentation, you can visit their [website](https://pylangacq.org/).

I will use the Brown Corpus of CHILDES as an example below. The corpus has been downloaded from [here](https://childes.talkbank.org/data/Eng-NA/Brown.zip) and stored under `data_samples/childes/Brown.zip`. There are three folders in this corpus, and each folder contains a dataset (a collection of CHAT files) for one child:

```
Brown.zip/  
    |--Adam/  
    |--Eve/  
    |--Sarah/
```
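If you want to confirm a layout like the one above before reading a corpus, the standard library can list the archive contents. This is a minimal sketch: the helper `cha_files_per_folder` and the file names in the demo archive are hypothetical, and the demo builds a tiny in-memory zip instead of opening the real `Brown.zip`.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath

def cha_files_per_folder(zip_source):
    """Count the '.cha' files under each folder of a zip archive."""
    counts = Counter()
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            path = PurePosixPath(name)
            if path.suffix == '.cha':
                counts[path.parent.name] += 1
    return counts

# Demo on a tiny in-memory archive (hypothetical file names):
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    for name in ['Brown/Adam/a1.cha', 'Brown/Adam/a2.cha',
                 'Brown/Eve/e1.cha', 'Brown/Sarah/s1.cha']:
        zf.writestr(name, '@UTF8\n')
buffer.seek(0)
print(cha_files_per_folder(buffer))
```

For a downloaded corpus, pass the path to the zip file instead of the in-memory buffer.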

I will use the `read_chat()` function of `PyLangAcq` to read all the CHAT files in the dataset `Adam`:

In [1]:
import pylangacq

# Read CHAT files in the dataset 'Adam' in 'Brown.zip':
path = 'data_samples/childes/Brown.zip'
adam = pylangacq.read_chat(path, 'Adam')

print(type(adam))
print('Number of CHAT files:', adam.n_files())

<class 'pylangacq.chat.Reader'>
Number of CHAT files: 55


In [2]:
# Ages when recordings were made
print('Ages (year, month, day):', adam.ages())  # output: a list of tuples

Ages (year, month, day): [(2, 3, 4), (2, 3, 18), (2, 4, 3), (2, 4, 15), (2, 4, 30), (2, 5, 12), (2, 6, 3), (2, 6, 17), (2, 7, 1), (2, 7, 14), (2, 8, 1), (2, 8, 16), (2, 9, 4), (2, 9, 18), (2, 10, 2), (2, 10, 16), (2, 10, 30), (2, 11, 13), (2, 11, 28), (3, 0, 11), (3, 0, 25), (3, 1, 9), (3, 1, 26), (3, 2, 9), (3, 2, 21), (3, 3, 4), (3, 3, 18), (3, 4, 1), (3, 4, 18), (3, 5, 1), (3, 5, 15), (3, 5, 29), (3, 6, 9), (3, 7, 7), (3, 8, 1), (3, 8, 14), (3, 8, 26), (3, 9, 16), (3, 10, 15), (3, 11, 1), (3, 11, 14), (4, 0, 14), (4, 1, 15), (4, 2, 17), (4, 3, 9), (4, 4, 1), (4, 4, 13), (4, 5, 11), (4, 6, 24), (4, 7, 1), (4, 7, 29), (4, 9, 2), (4, 10, 2), (4, 10, 23), (5, 2, 12)]


As shown above, `read_chat()` reads the CHAT files and creates a `Reader` object, which stores the data and metadata of all the CHAT files in `Adam`. You can access the data stored in the `Reader` by calling the appropriate methods. For example, `.n_files()` returns the number of CHAT files in the dataset (55 for `Adam`), and `.ages()` returns the child's age at each recording. Let's see what other information we can get from the `Reader` object in the next section.
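Ages as `(year, month, day)` tuples are easy to read but awkward to compare or plot. One possible way to turn them into a single number is sketched below. The helper `age_in_months` is hypothetical, and it assumes a 30-day month, which is only an approximation.

```python
def age_in_months(age):
    """Convert a (years, months, days) tuple to decimal months.

    Assumes a 30-day month, which is an approximation.
    """
    years, months, days = age
    return years * 12 + months + days / 30

# First three ages from the output above:
first_three = [(2, 3, 4), (2, 3, 18), (2, 4, 3)]
print([round(age_in_months(a), 2) for a in first_three])
```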

## 1.2 Accessing metadata stored in a CHAT file

Metadata such as age range, date of recording, participants, etc. are stored in the header of each CHAT file. We can access this information by calling the `.headers()` method. Here is the header for the first CHAT file in `adam`:

In [3]:
adam.headers()[0]

{'UTF8': '',
 'PID': '11312/c-00015632-1',
 'Languages': ['eng'],
 'Participants': {'CHI': {'name': 'Adam',
   'language': 'eng',
   'corpus': 'Brown',
   'age': '2;03.04',
   'sex': 'male',
   'group': 'TD',
   'ses': 'MC',
   'role': 'Target_Child',
   'education': '',
   'custom': ''},
  'MOT': {'name': 'Mother',
   'language': 'eng',
   'corpus': 'Brown',
   'age': '',
   'sex': 'female',
   'group': '',
   'ses': '',
   'role': 'Mother',
   'education': '',
   'custom': ''},
  'URS': {'name': 'Ursula_Bellugi',
   'language': 'eng',
   'corpus': 'Brown',
   'age': '',
   'sex': '',
   'group': '',
   'ses': '',
   'role': 'Investigator',
   'education': '',
   'custom': ''},
  'RIC': {'name': 'Richard_Cromer',
   'language': 'eng',
   'corpus': 'Brown',
   'age': '',
   'sex': '',
   'group': '',
   'ses': '',
   'role': 'Investigator',
   'education': '',
   'custom': ''},
  'COL': {'name': 'Colin_Fraser',
   'language': 'eng',
   'corpus': 'Brown',
   'age': '',
   'sex': '',
   

The output above is a multilevel `dictionary`. To retrieve a specific piece of information, we can use the `dictionary` keys as usual.  
Let's check if 'Adam' is male, as his biblical name suggests:

In [4]:
adam.headers()[0]['Participants']['CHI']['sex']

'male'
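Direct key access raises a `KeyError` when a participant is not listed, and many fields are empty strings rather than missing. A small helper can treat both cases the same way. This is a sketch on an excerpt of the header shown above. The function name `participant_field` is hypothetical.

```python
# Excerpt of the header shown above, kept to the fields used here.
header = {
    'Participants': {
        'CHI': {'name': 'Adam', 'age': '2;03.04', 'sex': 'male',
                'ses': 'MC', 'education': ''},
        'MOT': {'name': 'Mother', 'age': '', 'sex': 'female',
                'ses': '', 'education': ''},
    }
}

def participant_field(header, code, field, default=None):
    """Return a participant field, or `default` if absent or empty."""
    value = header.get('Participants', {}).get(code, {}).get(field, '')
    return value if value != '' else default

print(participant_field(header, 'CHI', 'sex'))
print(participant_field(header, 'MOT', 'ses'))        # empty in this file
print(participant_field(header, 'FAT', 'education'))  # no such participant
```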


## 1.3 Accessing annotations

Next, I will check what kinds of annotation information are stored in each CHAT file. I will use the `.tokens()` method to access the tokens with annotation information. This method creates a `list` of `Token` objects:

In [5]:
tokens = adam.tokens()
tokens[:5]  # first five tokens

[Token(word='play', pos='n', mor='play', gra=Gra(dep=1, head=2, rel='MOD')),
 Token(word='checkers', pos='n', mor='checker-PL', gra=Gra(dep=2, head=0, rel='INCROOT')),
 Token(word='.', pos='.', mor='', gra=Gra(dep=3, head=2, rel='PUNCT')),
 Token(word='big', pos='adj', mor='big', gra=Gra(dep=1, head=2, rel='MOD')),
 Token(word='drum', pos='n', mor='drum', gra=Gra(dep=2, head=0, rel='INCROOT'))]

Each `Token` is a `dataclass` with attributes (e.g. `word`, `pos`, etc.) as shown in the above example.  
Annotations for each word are stored as the `Token`'s attributes (i.e. attributes other than `word`):

In [6]:
print("Second token in 'Adam':")
print('Word: {}\nMorpheme: {}\nPart of speech: {}'.format(
    tokens[1].word, tokens[1].mor, tokens[1].pos))

Second token in 'Adam':
Word: checkers
Morpheme: checker-PL
Part of speech: n
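Since every token carries a part-of-speech tag, a quick tally of the tags gives a first impression of the lexical categories in a dataset. The sketch below uses a `namedtuple` as a stand-in for `PyLangAcq`'s `Token` (an assumption made so the example runs without the corpus) and the five tokens shown earlier. The function name `pos_counts` is hypothetical.

```python
from collections import Counter, namedtuple

# Stand-in for pylangacq's Token, holding only the fields used here.
Token = namedtuple('Token', ['word', 'pos', 'mor'])

tokens = [  # the five tokens shown above
    Token('play', 'n', 'play'),
    Token('checkers', 'n', 'checker-PL'),
    Token('.', '.', ''),
    Token('big', 'adj', 'big'),
    Token('drum', 'n', 'drum'),
]

def pos_counts(tokens, skip=('.',)):
    """Count part-of-speech tags, ignoring the tags listed in `skip`."""
    return Counter(t.pos for t in tokens if t.pos not in skip)

print(pos_counts(tokens))
```

The same function should work on the list returned by `.tokens()`, since it only relies on the `pos` attribute. Other punctuation tags may need to be added to `skip`.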



## 1.4 Searching for suitable corpora

Now that we know what kinds of information are stored in each dataset and how we can access them, we can start searching for the corpora for the project according to the criteria set previously.

There are dozens of English corpora in CHILDES. We don't need to download them all at once just to look for the corpora we need. `PyLangAcq` allows users to read a corpus directly from its URL, so we can read the corpora one by one and keep only the ones we need. To get the URLs for all the North American English corpora, one could scrape the links from the database's website and look for the corpus URLs there. However, a much faster way is to take advantage of TalkBank's [browsable database](https://sla.talkbank.org/TBB/childes):  
1. navigate to CHILDES's North American English datasets (Eng-NA)
2. copy the list of corpora directly to a spreadsheet program and save it as a `csv` file (example: `data_samples/childes/eng_NA_corpus_list.csv`)
3. construct the URL for each corpus simply from the name of the corpus we get from step 2 (see below)

CHILDES has a very well organized structure. Each corpus has the same URL format, as follows:  
`https://childes.talkbank.org/data/LANGUAGE/NAME_OF_CORPUS.zip`  
For example, the URL for the Brown Corpus is: https://childes.talkbank.org/data/Eng-NA/Brown.zip
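
The URL pattern can be wrapped in a small function so that other languages can be swapped in later. This is a minimal sketch: the names `BASE_URL` and `corpus_url` are hypothetical, and it assumes every corpus follows the pattern above.

```python
BASE_URL = 'https://childes.talkbank.org/data'

def corpus_url(name, language='Eng-NA'):
    """Build the download URL of a CHILDES corpus from its name."""
    return '{}/{}/{}.zip'.format(BASE_URL, language, name)

print(corpus_url('Brown'))
```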

In [7]:
import pandas as pd

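# Read the list of corpora into a Pandas Series:
# (the 'squeeze' argument of read_csv was removed in pandas 2.0, so the
# single-column DataFrame is squeezed after reading instead)
corpus_list = pd.read_csv('data/childes/eng_NA_corpus_list.csv',
                          header=None, index_col=False).squeeze('columns')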

print('There are {} NA English corpora in CHILDES.'.format(len(corpus_list)))
print('Here are the first 10 corpora:')
corpus_list.head(10)

There are 47 NA English corpora in CHILDES.
Here are the first 10 corpora:


0        Bates
1    Bernstein
2        Bliss
3        Bloom
4     Bohannon
5    Braunwald
6        Brent
7        Brown
8        Clark
9    Demetras1
Name: 0, dtype: object

There are 47 corpora in CHILDES that we can potentially use! Let's look for the corpora matching some of the criteria set previously, e.g. corpora containing information about the child's or mother's SES and educational background. As shown previously, we can use `.headers()` to access this information in the CHAT files.
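
The criteria can also be written as predicates on a single header dictionary, which makes them easy to check in isolation before scanning whole corpora. In the sketch below, `has_ses_or_education` follows the checks used in this notebook's search (child and mother present, no participant coded `INV`, some SES or education value). `only_child_and_mother` is a stricter variant added here as an assumption, because investigators can be listed under other codes, such as `URS` in the Brown header shown earlier. Both function names and both sample headers are hypothetical.

```python
def has_ses_or_education(header):
    """CHI and MOT present, no 'INV' listed, and SES or education given."""
    people = header.get('Participants', {})
    if 'CHI' not in people or 'MOT' not in people or 'INV' in people:
        return False
    return any([
        people['MOT'].get('ses', '') != '',
        people['CHI'].get('ses', '') != '',
        people['MOT'].get('education', '') != '',
    ])

def only_child_and_mother(header):
    """Stricter check: the participant codes are exactly CHI and MOT."""
    return set(header.get('Participants', {})) == {'CHI', 'MOT'}

# Hypothetical headers for illustration:
dyad = {'Participants': {'CHI': {'ses': 'MC', 'education': ''},
                         'MOT': {'ses': '', 'education': ''}}}
with_visitor = {'Participants': {'CHI': {'ses': 'MC', 'education': ''},
                                 'MOT': {'ses': '', 'education': ''},
                                 'URS': {'ses': '', 'education': ''}}}

print(has_ses_or_education(dyad), only_child_and_mother(dyad))
print(has_ses_or_education(with_visitor), only_child_and_mother(with_visitor))
```

The second header passes the first check but not the stricter one, which is the kind of file that may need to be filtered out during later data processing.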

In [8]:
search_result = []  # To store a list of corpora matching the criteria

# Search each corpus in the list:
for corpus_name in corpus_list:

    # Construct the URL for the corpus from its name:
    corpus_url = 'https://childes.talkbank.org/data/Eng-NA/'+corpus_name+'.zip'

    # Read the corpus into a Reader object:
    print('Searching {}...'.format(corpus_name))
    corpus = pylangacq.read_chat(corpus_url)

    # Inspect the header of each CHAT file in the corpus:
    for header in corpus.headers():
        participants = header['Participants']

        # Search criteria:
        # file must list child ('CHI') and mother ('MOT') as participants and
        # no participant coded 'INV' (as we are looking at child-directed
        # speech). Note that investigators listed under other codes are not
        # excluded by this check.
        # file must include child's or mother's SES/education.
        if (
            ('CHI' in participants) and
            ('MOT' in participants) and
            ('INV' not in participants)

            and

            ((participants['MOT']['ses'] != '') or
             (participants['CHI']['ses'] != '') or
             (participants['MOT']['education'] != ''))
        ):

            print('{} matches the criteria!\n'.format(corpus_name))
            search_result.append(corpus_name)

            # Stop at the first match and move on to the next corpus. (If one
            # CHAT file contains the needed information, I will keep the whole
            # corpus for data processing later and remove individual CHAT
            # files missing the needed information.)
            break

print('Search completed!')

Searching Bates...
Bates matches the criteria!

Searching Bernstein...
Bernstein matches the criteria!

Searching Bliss...
Searching Bloom...
Searching Bohannon...
Searching Braunwald...
Searching Brent...
Searching Brown...
Brown matches the criteria!

Searching Clark...
Clark matches the criteria!

Searching Demetras1...
Searching Demetras2...
Demetras2 matches the criteria!

Searching Evans...
Searching Feldman...
Searching Garvey...
Searching Gathercole...
Searching Gelman...
Searching Gleason...
Gleason matches the criteria!

Searching Gopnik...
Searching HSLLD...
HSLLD matches the criteria!

Searching Haggerty...
Searching Hall...
Hall matches the criteria!

Searching Hicks...
Searching Higginson...
Searching Kuczaj...
Searching MacWhinney...
Searching McCune...
Searching McMillan...
Searching Morisset...
Searching Nelson...
Searching NewEngland...
Searching NewmanRatner...
NewmanRatner matches the criteria!

Searching Peters...
Searching PetersonMcCabe...
Searching Post...
Post 

Let's see which corpora contain the data we need:

In [9]:
print('\n{} corpora matching the criteria:'.format(len(search_result)))
search_result


11 corpora matching the criteria:


['Bates',
 'Bernstein',
 'Brown',
 'Clark',
 'Demetras2',
 'Gleason',
 'HSLLD',
 'Hall',
 'NewmanRatner',
 'Post',
 'VanHouten']

Nice! We have narrowed down the number of corpora we need to process from 47 to 11.
Next, I will create a list of `Reader` objects, one per corpus, each containing all the CHAT files in that corpus:

In [10]:
corpora_to_use = []

for corpus_name in search_result:
    
    # Construct the URL for the corpus from its name:
    corpus_url = 'https://childes.talkbank.org/data/Eng-NA/'+corpus_name+'.zip'
   
    # read the corpus into a Reader object:
    print('Reading {}...'.format(corpus_name))
    corpus = pylangacq.read_chat(corpus_url)
    
    # create a list of corpora (as Reader objects) to use in this project
    corpora_to_use.append(corpus)

corpora_to_use

Reading Bates...
Reading Bernstein...
Reading Brown...
Reading Clark...
Reading Demetras2...
Reading Gleason...
Reading HSLLD...
Reading Hall...
Reading NewmanRatner...
Reading Post...
Reading VanHouten...


[<pylangacq.chat.Reader at 0x1f4953e37f0>,
 <pylangacq.chat.Reader at 0x1f493fefbe0>,
 <pylangacq.chat.Reader at 0x1f4a20af610>,
 <pylangacq.chat.Reader at 0x1f4a20af460>,
 <pylangacq.chat.Reader at 0x1f493fef5b0>,
 <pylangacq.chat.Reader at 0x1f4abcd53d0>,
 <pylangacq.chat.Reader at 0x1f4dc783970>,
 <pylangacq.chat.Reader at 0x1f4dc7920a0>,
 <pylangacq.chat.Reader at 0x1f48ed80f40>,
 <pylangacq.chat.Reader at 0x1f48edcf2b0>,
 <pylangacq.chat.Reader at 0x1f48edb4160>]

---

# 2 Basic statistics

## 2.1 Token count

We can use the methods `.tokens()` and `.utterances()` to access the token and utterance information stored in the `Reader` objects. Let's see how many tokens there are in these corpora:

In [11]:
all_tokens = []  # To store a list of lists of tokens for all corpora
token_sum = 0    # Total token counter

for corpus in corpora_to_use:
    corpus_name = corpus.headers()[0]['Participants']['CHI']['corpus']
    
    # Get token info and store them in 'all_tokens':
    tokens = corpus.tokens()  # list of Token objects
    all_tokens.append(tokens)
    
    # Print result
    print('Token count in {}: {}'.format(corpus_name, len(tokens)))
    token_sum = token_sum + len(tokens)

# Print result
print('\nTotal token count: {}'.format(token_sum))

Token count in Bates: 56304
Token count in Bernstein: 83040
Token count in Brown: 880322
Token count in Clark: 258699
Token count in Demetras2: 99363
Token count in Gleason: 317306
Token count in HSLLD: 1650889
Token count in Hall: 1340000
Token count in NewmanRatner: 1049697
Token count in Post: 185246
Token count in VanHouten: 63884

Total token count: 5984750



## 2.2 Utterance count

Similarly, we can access the utterance information stored in each `Reader` object. For example:

In [12]:
corpora_to_use[0].utterances()[1]  # second utterance (index 1)

0,1
*CHI:,.
%gpx:,looks at chicken
%act:,holds nesting cups
%pho:,wi


The example above shows the second utterance in the first corpus ('Bates') together with its annotation information, including the words, the speaker and more. To look at child-directed speech (CDS) specifically, we can set the `participants` option of `.utterances()` to `MOT`:

In [13]:
corpora_to_use[0].utterances(participants='MOT')[0] # first utterance by mother

0,1,2,3,4
*MOT:,what's,CLITIC,that,?
%mor:,pro:int|what,cop|be&3S,pro:dem|that,?
%gra:,1|2|SUBJ,2|0|ROOT,3|2|PRED,4|2|PUNCT
%act:,holds object out to Amy,holds object out to Amy,holds object out to Amy,holds object out to Amy


Next, I will look at how many utterances there are in the data: how many in the child speech, and how many in the mother's CDS?

In [14]:
all_utt_chi = []  # To store a list of lists of child utterances for all corpora
all_utt_mot = []  # To store a list of lists of mother utterances for all corpora
utt_chi_sum = 0   # Total child utterance counter
utt_mot_sum = 0   # Total mother utterance counter
utt_sum = 0       # Total utterance counter

for corpus in corpora_to_use:
    corpus_name = corpus.headers()[0]['Participants']['CHI']['corpus']
    
    # Get utterances and store them in 'all_utt_chi' or 'all_utt_mot':
    utt_chi = corpus.utterances(participants='CHI')  # list of Utterance objects
    utt_mot = corpus.utterances(participants='MOT')  # list of Utterance objects
    all_utt_chi.append(utt_chi)
    all_utt_mot.append(utt_mot)
    
    # Print results
    print('Child utterance count in {}: {}'.format(corpus_name, len(utt_chi)))
    print('Mother utterance count in {}: {}'.format(corpus_name, len(utt_mot)))
    utt_chi_sum = utt_chi_sum + len(utt_chi)
    utt_mot_sum = utt_mot_sum + len(utt_mot)

utt_sum = utt_chi_sum + utt_mot_sum
utt_chi_pc = round(utt_chi_sum/utt_sum*100, 2)  # Child utterance percentage
utt_mot_pc = round(utt_mot_sum/utt_sum*100, 2)  # Mother utterance percentage

# Print results
print('\nTotal child utterance count: {}'.format(utt_chi_sum))
print('\nTotal mother utterance count: {}'.format(utt_mot_sum))
print('\nTotal utterance count: {}'.format(utt_sum))
print('\nUtterance count percentage: {}% by child; {}% by mother'
      .format(utt_chi_pc, utt_mot_pc))

Child utterance count in Bates: 5572
Mother utterance count in Bates: 8579
Child utterance count in Bernstein: 167
Mother utterance count in Bernstein: 11749
Child utterance count in Brown: 96952
Mother utterance count in Brown: 60252
Child utterance count in Clark: 18169
Mother utterance count in Clark: 1944
Child utterance count in Demetras2: 9411
Mother utterance count in Demetras2: 6227
Child utterance count in Gleason: 20137
Mother utterance count in Gleason: 19545
Child utterance count in HSLLD: 112615
Mother utterance count in HSLLD: 148301
Child utterance count in Hall: 75655
Mother utterance count in Hall: 37028
Child utterance count in NewmanRatner: 28934
Mother utterance count in NewmanRatner: 160889
Child utterance count in Post: 8380
Mother utterance count in Post: 20189
Child utterance count in VanHouten: 5132
Mother utterance count in VanHouten: 4770

Total child utterance count: 381124

Total mother utterance count: 479473

Total utterance count: 860597

Utterance count
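
The totals hide large differences between corpora, so it can help to look at the child's share of utterances per corpus. The sketch below recomputes this from the counts printed above. The dictionary `utt_counts` and the function `child_share` are hypothetical names introduced for illustration.

```python
# (child, mother) utterance counts copied from the output above.
utt_counts = {
    'Bates': (5572, 8579),
    'Bernstein': (167, 11749),
    'Brown': (96952, 60252),
    'Clark': (18169, 1944),
    'Demetras2': (9411, 6227),
    'Gleason': (20137, 19545),
    'HSLLD': (112615, 148301),
    'Hall': (75655, 37028),
    'NewmanRatner': (28934, 160889),
    'Post': (8380, 20189),
    'VanHouten': (5132, 4770),
}

def child_share(child, mother):
    """Percentage of utterances produced by the child."""
    return round(child / (child + mother) * 100, 2)

for name, (chi, mot) in utt_counts.items():
    print('{:<14}{:>7}%'.format(name, child_share(chi, mot)))
```

Corpora with a very low or very high child share (for example Bernstein or Clark) may need separate handling when child speech and CDS are compared.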

---

# 3 Data objects for further analysis

The code above has created several data objects ready for further analysis: the `Reader`, `Token` and `Utterance` objects. I will pickle these objects so that I don't need to create them again every time.

In [15]:
import pickle

data = [corpora_to_use, all_tokens, all_utt_chi, all_utt_mot]

# Use a context manager so the file is closed even if pickling fails:
with open('data/childes/selected_corpora.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
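
To reuse the pickled objects in a later notebook, they can be loaded back in the same order they were stored. The sketch below shows the round trip on stand-in data in a temporary directory. The helpers `save_pickle` and `load_pickle` are hypothetical names, and the stand-in list is not the real data.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Pickle `obj` to `path` with the highest available protocol."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    """Load a pickled object from `path` (only use with trusted files)."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Round trip with stand-in data in a temporary directory:
stand_in = [['Bates', 'Brown'], [56304, 880322]]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.pkl')
    save_pickle(stand_in, path)
    restored = load_pickle(path)

print(restored == stand_in)
```

For the file written above, the four objects would be unpacked in the order they were stored in `data`. `pylangacq` needs to be installed in the environment that loads the file, since unpickling rebuilds its classes.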