
# Data curation

Man Ho Wong | m.wong@pitt.edu | Feb 27th, 2022

This notebook searches for the datasets needed for this project in the following databases:
- [CHILDES](https://childes.talkbank.org/)
- [Wordbank](http://wordbank.stanford.edu/) (I may not need this dataset, as CHILDES probably has all the data I need.)

Along the way, I will also explore the datasets to get a sense of their contents and structure (such as participant information, annotations, data format, etc.), as well as some basic statistics. After that, I will identify the information I need in the datasets and compile the data for later processing.

---


# 1 CHILDES

CHILDES Intro, format, example etc.

For this project, I will need to collect the transcripts for both the child speech and the child-directed speech (CDS). Additionally, I will need the participant information (i.e. child age, sex and socioeconomic status (SES), mother's education) and some basic annotations of the words (i.e. morphemes and lexical categories). Participant information can be found in the header of each CHAT file as the metadata of the file. Annotation information can be found in the dependent tiers embedded in the transcription.

I will first search for the datasets meeting the needs of the project. Here are the basic search criteria:
- Language: North American English (Note: I may need other languages later for comparison)
- Participants: contains only child and mother (i.e. other people such as investigators were not involved)
- Child information: contains child age, sex and socioeconomic status (SES)
- Mother information: contains socioeconomic status (SES), education
- Annotation: contains morpheme and/or grammatical tiers

Let's take a quick look at a sample CHAT file first to see how the data is organized.


## 1.1 Reading CHAT file

The `PyLangAcq` package allows users to read CHAT files directly from a zip file. You can download and install it with the following code:  
`$ pip install --upgrade pylangacq`

For documentation, you can visit their [website](https://pylangacq.org/).

I will use the Brown Corpus of CHILDES as an example below. The corpus has been downloaded from [here](https://childes.talkbank.org/data/Eng-NA/Brown.zip) and stored under `data_samples/childes/Brown.zip`. There are three folders in this corpus; each folder contains a dataset (a collection of CHAT files) for one child:

```
Brown.zip/  
    |--Adam/  
    |--Eve/  
    |--Sarah/
```

I will use the `read_chat()` function of `PyLangAcq` to read all the CHAT files in the dataset `Adam`:

In [4]:
import pylangacq

# Read CHAT files in the dataset 'Adam' in 'Brown.zip':
path = 'data_samples/childes/Brown.zip'
adam = pylangacq.read_chat(path, 'Adam')

print(type(adam))
print('Number of CHAT files:', adam.n_files())

<class 'pylangacq.chat.Reader'>
Number of CHAT files: 55


In [9]:
# Ages when recordings were made
print('Ages (year, month, day):', adam.ages())  # output: a list of tuples

Ages (year, month, day): [(2, 3, 4), (2, 3, 18), (2, 4, 3), (2, 4, 15), (2, 4, 30), (2, 5, 12), (2, 6, 3), (2, 6, 17), (2, 7, 1), (2, 7, 14), (2, 8, 1), (2, 8, 16), (2, 9, 4), (2, 9, 18), (2, 10, 2), (2, 10, 16), (2, 10, 30), (2, 11, 13), (2, 11, 28), (3, 0, 11), (3, 0, 25), (3, 1, 9), (3, 1, 26), (3, 2, 9), (3, 2, 21), (3, 3, 4), (3, 3, 18), (3, 4, 1), (3, 4, 18), (3, 5, 1), (3, 5, 15), (3, 5, 29), (3, 6, 9), (3, 7, 7), (3, 8, 1), (3, 8, 14), (3, 8, 26), (3, 9, 16), (3, 10, 15), (3, 11, 1), (3, 11, 14), (4, 0, 14), (4, 1, 15), (4, 2, 17), (4, 3, 9), (4, 4, 1), (4, 4, 13), (4, 5, 11), (4, 6, 24), (4, 7, 1), (4, 7, 29), (4, 9, 2), (4, 10, 2), (4, 10, 23), (5, 2, 12)]


As shown above, `read_chat()` reads the CHAT files and creates a `Reader` object, which stores data and metadata across all the CHAT files in `Adam`. You can access the data stored in the `Reader` by calling the appropriate methods, such as `.n_files()` for the number of CHAT files in the dataset. For example, `Adam` has 55 CHAT files. We can also get the ages when recordings were made by calling `.ages()`. Let's see what other information we can get from the `Reader` object in the next section.
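
Since `.ages()` returns `(year, month, day)` tuples, it may be convenient to convert them into a single number for later analysis. Here is a minimal sketch of such a conversion. The helper `age_in_months` is hypothetical (it is not part of `PyLangAcq`), and it assumes 30 days per month, so the result is only approximate:

```python
def age_in_months(age):
    """Convert a (years, months, days) tuple to an approximate age in months."""
    years, months, days = age
    # Assumption: every month has 30 days, so days are converted approximately.
    return years * 12 + months + days / 30


# A few ages copied from the output of adam.ages() above:
sample_ages = [(2, 3, 4), (2, 3, 18), (5, 2, 12)]
for age in sample_ages:
    print(age, '->', round(age_in_months(age), 2), 'months')
```

A numeric age like this is easier to sort, bin or plot than a tuple, but the 30-day assumption should be revisited if exact ages matter.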

## 1.2 Accessing metadata stored in a CHAT file

Metadata such as age range, date of recording, participants, etc. are stored in the header of each CHAT file. We can access this information by calling the `.headers()` method, which returns one header per CHAT file. Here is the header for the first CHAT file in `adam`:

In [13]:
adam.headers()[0]

{'UTF8': '',
 'PID': '11312/c-00015632-1',
 'Languages': ['eng'],
 'Participants': {'CHI': {'name': 'Adam',
   'language': 'eng',
   'corpus': 'Brown',
   'age': '2;03.04',
   'sex': 'male',
   'group': 'TD',
   'ses': 'MC',
   'role': 'Target_Child',
   'education': '',
   'custom': ''},
  'MOT': {'name': 'Mother',
   'language': 'eng',
   'corpus': 'Brown',
   'age': '',
   'sex': 'female',
   'group': '',
   'ses': '',
   'role': 'Mother',
   'education': '',
   'custom': ''},
  'URS': {'name': 'Ursula_Bellugi',
   'language': 'eng',
   'corpus': 'Brown',
   'age': '',
   'sex': '',
   'group': '',
   'ses': '',
   'role': 'Investigator',
   'education': '',
   'custom': ''},
  'RIC': {'name': 'Richard_Cromer',
   'language': 'eng',
   'corpus': 'Brown',
   'age': '',
   'sex': '',
   'group': '',
   'ses': '',
   'role': 'Investigator',
   'education': '',
   'custom': ''},
  'COL': {'name': 'Colin_Fraser',
   'language': 'eng',
   'corpus': 'Brown',
   'age': '',
   'sex': '',
   

The output above is a multilevel `dictionary`. To retrieve a specific piece of information, we can use the `dictionary` keys as usual.  
Let's check whether Adam is male, as his biblical name suggests:

In [14]:
adam.headers()[0]['Participants']['CHI']['sex']

'male'
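
Chained key lookups like the one above raise a `KeyError` if a participant is not listed in a file, and many fields are present but empty (e.g. the mother's `ses` above). Below is a minimal sketch of a more forgiving lookup. The helper `get_participant_field` is hypothetical, and `sample_header` is a hand-copied subset of the header shown above rather than the output of a real call:

```python
def get_participant_field(header, participant, field, default=None):
    """Return a participant's field from a header, or `default` if the
    participant or field is missing, or if the field is an empty string."""
    participants = header.get('Participants', {})
    value = participants.get(participant, {}).get(field, '')
    return value if value != '' else default


# Subset of the first header of 'Adam', copied from the output above:
sample_header = {
    'Participants': {
        'CHI': {'name': 'Adam', 'age': '2;03.04', 'sex': 'male',
                'ses': 'MC', 'education': ''},
        'MOT': {'name': 'Mother', 'age': '', 'sex': 'female',
                'ses': '', 'education': ''},
    }
}

print(get_participant_field(sample_header, 'CHI', 'sex'))
print(get_participant_field(sample_header, 'MOT', 'ses', default='unknown'))
print(get_participant_field(sample_header, 'FAT', 'sex', default='unknown'))
```

Treating empty strings as missing keeps later checks simple, since both cases can be handled with the same default value.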


## 1.3 Accessing annotations

Next, I will check what kinds of annotation information are stored in each CHAT file. I will use the `.tokens()` method to access the tokens with annotation information. This method creates a `list` of `Token` objects:

In [15]:
tokens = adam.tokens()
tokens[:5]  # first five tokens

[Token(word='play', pos='n', mor='play', gra=Gra(dep=1, head=2, rel='MOD')),
 Token(word='checkers', pos='n', mor='checker-PL', gra=Gra(dep=2, head=0, rel='INCROOT')),
 Token(word='.', pos='.', mor='', gra=Gra(dep=3, head=2, rel='PUNCT')),
 Token(word='big', pos='adj', mor='big', gra=Gra(dep=1, head=2, rel='MOD')),
 Token(word='drum', pos='n', mor='drum', gra=Gra(dep=2, head=0, rel='INCROOT'))]

Each `Token` is a `dataclass` with attributes (e.g. `word`, `pos`, etc.) as shown in the above example.  
Annotations for each word are stored as the `Token`'s attributes (i.e. attributes other than `word`):

In [17]:
print("Second token in 'Adam':")
print('Word: {}\nMorpheme: {}\nPart of speech: {}'.format(
    tokens[1].word, tokens[1].mor, tokens[1].pos))

Second token in 'Adam':
Word: checkers
Morpheme: checker-PL
Part of speech: n
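
Because every token carries its annotations as attributes, basic statistics are easy to compute. The sketch below counts part-of-speech tags with `collections.Counter`. To keep it self-contained, it uses a `namedtuple` stand-in for `Token` (an assumption for illustration only) filled with the five tokens shown above; the hypothetical helper `count_pos` only relies on the `pos` attribute, so it should work the same way on the list returned by `.tokens()`:

```python
from collections import Counter, namedtuple

# Stand-in for PyLangAcq's Token, with only the attributes needed here:
Token = namedtuple('Token', ['word', 'pos', 'mor'])

# The first five tokens of 'Adam', copied from the output above:
sample_tokens = [
    Token('play', 'n', 'play'),
    Token('checkers', 'n', 'checker-PL'),
    Token('.', '.', ''),
    Token('big', 'adj', 'big'),
    Token('drum', 'n', 'drum'),
]


def count_pos(tokens):
    """Count how many times each part-of-speech tag occurs in `tokens`."""
    return Counter(token.pos for token in tokens)


pos_counts = count_pos(sample_tokens)
print(pos_counts.most_common())
```

Note that punctuation marks are tokens too (tagged `'.'` in this sample), so they may need to be filtered out before computing word-level statistics.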



## 1.4 Searching for suitable datasets

Now that we know what kinds of information are stored in each dataset and how we can access them, we can start searching for the datasets for the project according to the criteria set previously.

There are dozens of English corpora in CHILDES. We don't need to download them all at once just to look for the corpora we need. `PyLangAcq` allows users to read a corpus directly from the corpus's URL. We can read the corpora one by one, and keep only the ones we need. To get the URLs for all the North American English corpora, one can use a web scraping tool to collect all the links from the database's website and look for the corpus URLs among them. However, a much faster way is to take advantage of TalkBank's [browsable database](https://sla.talkbank.org/TBB/childes):  
1. navigate to CHILDES's North American English datasets (Eng-NA)
2. copy the list of corpora directly to a spreadsheet program and save it as a `csv` file (example: `data_samples/childes/eng_NA_corpus_list.csv`)
3. construct the URL for each corpus simply from the name of the corpus we get from step 2 (see below)

CHILDES has a very well organized structure. Each corpus has the same URL format, as follows:  
`https://childes.talkbank.org/data/LANGUAGE/NAME_OF_CORPUS.zip`  
For example, the URL for the Brown Corpus is: https://childes.talkbank.org/data/Eng-NA/Brown.zip
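
The URL pattern above can be sketched as a small function. The name `build_corpus_url` and the default language folder are my own choices for illustration; the base URL and the `Eng-NA` folder are taken from the Brown Corpus URL above:

```python
BASE_URL = 'https://childes.talkbank.org/data'


def build_corpus_url(corpus_name, language='Eng-NA'):
    """Build the download URL of a CHILDES corpus from its name."""
    return '{}/{}/{}.zip'.format(BASE_URL, language, corpus_name)


print(build_corpus_url('Brown'))
```

Keeping the language folder as a parameter should make it easier to reuse the function if corpora in other languages are needed later for comparison.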

In [26]:
import pandas as pd

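# Read the list of corpora into a Pandas Series. (The `squeeze` argument of
# read_csv() was removed in pandas 2.0, so the DataFrame is squeezed instead.)
corpus_list = pd.read_csv('data/childes/eng_NA_corpus_list.csv',
                          header=None, index_col=False).squeeze('columns')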

print('There are {} NA English corpora in CHILDES.'.format(len(corpus_list)))
print('Here are the first 10 corpora:')
corpus_list.head(10)

There are 47 NA English corpora in CHILDES.
Here are the first 10 corpora:


0        Bates
1    Bernstein
2        Bliss
3        Bloom
4     Bohannon
5    Braunwald
6        Brent
7        Brown
8        Clark
9    Demetras1
Name: 0, dtype: object

Let's try to search for the corpora matching some of the criteria set previously, e.g. corpora containing information about the child's/mother's SES and educational background. As shown previously, we can use `.headers()` to access this information in the CHAT files.

In [31]:
search_result = []  # To store a list of corpora matching the criteria

# Search each corpus in the list:
for corpus_name in corpus_list:

    # Construct the URL for the corpus from its name:
    corpus_url = 'https://childes.talkbank.org/data/Eng-NA/' + corpus_name + '.zip'

    # Read the corpus into a Reader object:
    print('Searching {}...'.format(corpus_name))
    corpus = pylangacq.read_chat(corpus_url)

    # Inspect the header of each CHAT file in the corpus:
    for header in corpus.headers():
        participants = header['Participants']

        # Search criteria:
        # 1. The file must include the child ('CHI') and the mother ('MOT')
        #    and must not include a participant coded 'INV' (as we are looking
        #    at child-directed speech). Note: investigators listed under other
        #    codes (e.g. 'URS' in Brown) are not excluded by this check.
        # 2. The file must include the child's or mother's SES, or the
        #    mother's education.
        if (
            'CHI' in participants and 'MOT' in participants
            and 'INV' not in participants
            and (participants['MOT']['ses'] != ''
                 or participants['CHI']['ses'] != ''
                 or participants['MOT']['education'] != '')
        ):
            print('{} matches the criteria!\n'.format(corpus_name))
            search_result.append(corpus_name)

            # Move on to the next corpus once one file matches. (If one CHAT
            # file contains the needed information, I will keep the whole
            # corpus for data processing later and remove individual CHAT
            # files missing the needed information.)
            break

print('\nCorpora matching the criteria:')
search_result

Searching Bates...
Bates matches the criteria!

Searching Bernstein...
Bernstein matches the criteria!

Searching Bliss...
Searching Bloom...
Searching Bohannon...
Searching Braunwald...
Searching Brent...
Searching Brown...
Brown matches the criteria!

Searching Clark...
Clark matches the criteria!

Searching Demetras1...
Searching Demetras2...
Demetras2 matches the criteria!

Searching Evans...
Searching Feldman...
Searching Garvey...
Searching Gathercole...
Searching Gelman...
Searching Gleason...
Gleason matches the criteria!

Searching Gopnik...
Searching HSLLD...
HSLLD matches the criteria!

Searching Haggerty...
Searching Hall...
Hall matches the criteria!

Searching Hicks...
Searching Higginson...
Searching Kuczaj...
Searching MacWhinney...
Searching McCune...
Searching McMillan...
Searching Morisset...
Searching Nelson...
Searching NewEngland...
Searching NewmanRatner...
NewmanRatner matches the criteria!

Searching Peters...
Searching PetersonMcCabe...
Searching Post...
Post 

['Bates',
 'Bernstein',
 'Brown',
 'Clark',
 'Demetras2',
 'Gleason',
 'HSLLD',
 'Hall',
 'NewmanRatner',
 'Post',
 'VanHouten']
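
The per-file check used in the search above can also be written as a standalone function, which makes the criteria easier to test and to change later. This is only a sketch: the function name `matches_criteria` and the sample headers are made up for illustration, and the function mirrors the check in the loop (it excludes the participant code `'INV'` only, not investigators listed under other codes):

```python
def matches_criteria(header):
    """Return True if a CHAT file header meets the search criteria."""
    participants = header.get('Participants', {})

    # The child and the mother must both be listed:
    if 'CHI' not in participants or 'MOT' not in participants:
        return False

    # Files with a participant coded 'INV' are excluded:
    if 'INV' in participants:
        return False

    # At least one SES/education field must be filled in:
    child, mother = participants['CHI'], participants['MOT']
    return bool(child.get('ses') or mother.get('ses') or mother.get('education'))


# Made-up headers for illustration (not taken from any corpus):
header_ok = {'Participants': {'CHI': {'ses': 'MC'},
                              'MOT': {'ses': '', 'education': ''}}}
header_no_ses = {'Participants': {'CHI': {'ses': ''},
                                  'MOT': {'ses': '', 'education': ''}}}
header_inv = {'Participants': {'CHI': {'ses': 'MC'},
                               'MOT': {'ses': '', 'education': ''},
                               'INV': {'ses': ''}}}

for header in (header_ok, header_no_ses, header_inv):
    print(matches_criteria(header))
```

With such a function, a corpus matches if `any(matches_criteria(h) for h in corpus.headers())` is true, and the same function can be reused later to drop individual CHAT files that are missing the needed information.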