
# PyLangAcq Notes

The `PyLangAcq` package allows users to read CHAT data from CHILDES directly.  
Reference: https://pylangacq.org/quickstart.html

Install or upgrade with:  
`$ pip install --upgrade pylangacq`


In [1]:
import pylangacq

---

## 1 Reading CHAT data

- Directory path: can be local or remote (e.g. a URL).
- Windows users: may need to put the code under `if __name__ == "__main__":` to avoid errors.
- `.read_chat(PATH, FOLDER)`
    - `PATH` is the data source; `FOLDER` limits reading to the matching part of it (e.g. one child's folder inside a corpus).
    - This creates a `Reader` object (for associated methods, see [this](https://pylangacq.org/api.html#pylangacq.Reader)).

- The following example reads the `Eve` data from the Brown Corpus: https://sla.talkbank.org/TBB/childes/Eng-NA/Brown/Eve

In [5]:
url = "https://childes.talkbank.org/data/Eng-NA/Brown.zip"
eve = pylangacq.read_chat(url, "Eve")
print('Number of files:', eve.n_files())    # number of CHAT files in 'eve'

Number of files: 20
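
The second argument in the call above narrows the corpus to Eve's files. Below is a minimal stdlib sketch of that kind of path filtering. It assumes the argument is treated as a pattern searched for in each file path, which is an assumption here; check the `read_chat` API reference for the exact matching rule. The file paths and the `select_paths` helper are hypothetical.

```python
import re

# Hypothetical file paths inside an unzipped corpus; real file names may differ.
paths = [
    "Brown/Adam/adam01.cha",
    "Brown/Eve/eve01.cha",
    "Brown/Eve/eve02.cha",
    "Brown/Sarah/sarah001.cha",
]


def select_paths(paths, pattern):
    """Keep only the paths in which the regular expression is found."""
    regex = re.compile(pattern)
    return [p for p in paths if regex.search(p)]


eve_paths = select_paths(paths, "Eve")
print(eve_paths)  # ['Brown/Eve/eve01.cha', 'Brown/Eve/eve02.cha']
```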


---

## 2 Accessing metadata stored in the Header of a CHAT file

The header stores metadata such as age, date, and participants. See [this](https://pylangacq.org/headers.html#headers).

In [6]:
print('Ages:', eve.ages())    # ages when recordings were made
                              # format: year, month, day

Ages: [(1, 6, 0), (1, 6, 0), (1, 7, 0), (1, 7, 0), (1, 8, 0), (1, 9, 0), (1, 9, 0), (1, 9, 0), (1, 10, 0), (1, 10, 0), (1, 11, 0), (1, 11, 0), (2, 0, 0), (2, 0, 0), (2, 1, 0), (2, 1, 0), (2, 2, 0), (2, 2, 0), (2, 3, 0), (2, 3, 0)]
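
For plotting or regression, a single number per recording is often handier than a `(year, month, day)` tuple. Below is a minimal sketch of the conversion in plain Python. The `age_in_months` helper is hypothetical (not part of PyLangAcq), and treating a month as 30 days is an assumption of this sketch.

```python
def age_in_months(age):
    """Convert a (years, months, days) tuple to an age in months (float)."""
    years, months, days = age
    # Assumption: one month is counted as 30 days for the fractional part.
    return years * 12 + months + days / 30


# A few of the tuples from the output above.
ages = [(1, 6, 0), (1, 9, 0), (2, 3, 0)]
months = [age_in_months(a) for a in ages]
print(months)  # [18.0, 21.0, 27.0]
```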


---

## 3 Accessing Transcriptions and Annotations

#### Accessing transcriptions with `.words()` method

`.words()` signature and options:  
`words(participants=None, exclude=None, by_utterances=False, by_files=False)`  
- `participants`: Participants to include, e.g. `'CHI'` (child), `'MOT'` (mother), `{'MOT','INV'}` (mother and investigator).
- `exclude`: Participants to exclude; cannot be combined with `participants`.
- `by_utterances`: `False` by default; if `True`, the output is grouped by utterance.
- `by_files`: `False` by default; if `True`, the output is grouped by file.


In [19]:
# To access words across ALL CHAT files of 'eve':
words = eve.words()    # list of strings
print('Total word count:', len(words))

# To access words in individual CHAT files:
words_by_files = eve.words(by_files=True)  # list of lists of strings
for i, words_one_file in enumerate(words_by_files):
    print('Word count in file', i, ':', len(words_one_file))

# Example: First 8 words in the first CHAT file:
words_by_files[0][0:8]

Total word count: 119799
Word count in file 0 : 5810
Word count in file 1 : 5258
Word count in file 2 : 2493
Word count in file 3 : 5742
Word count in file 4 : 5707
Word count in file 5 : 4338
Word count in file 6 : 5298
Word count in file 7 : 8901
Word count in file 8 : 4454
Word count in file 9 : 4535
Word count in file 10 : 4196
Word count in file 11 : 6193
Word count in file 12 : 4444
Word count in file 13 : 5202
Word count in file 14 : 8075
Word count in file 15 : 7361
Word count in file 16 : 10870
Word count in file 17 : 8407
Word count in file 18 : 6903
Word count in file 19 : 5612


['more', 'cookie', '.', 'you', '0v', 'more', 'cookies', '?']
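
The flat and per-file outputs hold the same words, only grouped differently. The sketch below illustrates this with a hypothetical two-file stand-in for `eve.words(by_files=True)`, then checks that the 20 per-file counts printed above add up to the reported total.

```python
from itertools import chain

# Hypothetical stand-in for eve.words(by_files=True): one inner list per file.
words_by_files = [
    ["more", "cookie", "."],
    ["you", "0v", "more", "cookies", "?"],
]

# Flattening the per-file lists gives the shape of the default (flat) output.
all_words = list(chain.from_iterable(words_by_files))
print(len(all_words))  # 8

# Per-file word counts copied from the output above.
counts = [5810, 5258, 2493, 5742, 5707, 4338, 5298, 8901, 4454, 4535,
          4196, 6193, 4444, 5202, 8075, 7361, 10870, 8407, 6903, 5612]
print(sum(counts))  # 119799
```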


#### Accessing annotations

The `.words()` method returns words without annotations. To access words together with their annotation info, use `.tokens()`:  
`.tokens(participants=None, exclude=None, by_utterances=False, by_files=False)`

This returns a `list` of `Token` objects:

In [22]:
some_tokens = eve.tokens()[:5]    # first five tokens in 'eve'
some_tokens

[Token(word='more', pos='qn', mor='more', gra=Gra(dep=1, head=2, rel='QUANT')),
 Token(word='cookie', pos='n', mor='cookie', gra=Gra(dep=2, head=0, rel='INCROOT')),
 Token(word='.', pos='.', mor='', gra=Gra(dep=3, head=2, rel='PUNCT')),
 Token(word='you', pos='pro:per', mor='you', gra=Gra(dep=1, head=2, rel='SUBJ')),
 Token(word='0v', pos='0v', mor='v', gra=Gra(dep=2, head=0, rel='ROOT'))]

`Token` is a `dataclass` with the attributes shown in the example above (`word`, `pos`, `mor`, `gra`).  

To access the annotation info of each word (i.e. the `Token` attributes other than `word`), read the attributes directly:

In [23]:
for token in some_tokens:
    print(token.word, token.pos)

more qn
cookie n
. .
you pro:per
0v 0v
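
Because tokens are plain dataclass instances, filtering by annotation is an ordinary list comprehension. The sketch below uses stand-in `Token` and `Gra` dataclasses defined locally, with fields copied from the output above. They are illustrative stand-ins for the library's own classes, so the example runs without downloading any data.

```python
from dataclasses import dataclass


@dataclass
class Gra:
    """Stand-in for the grammatical relation of one token."""
    dep: int
    head: int
    rel: str


@dataclass
class Token:
    """Stand-in for one annotated word."""
    word: str
    pos: str
    mor: str
    gra: Gra


# The five tokens shown in the output above.
tokens = [
    Token("more", "qn", "more", Gra(1, 2, "QUANT")),
    Token("cookie", "n", "cookie", Gra(2, 0, "INCROOT")),
    Token(".", ".", "", Gra(3, 2, "PUNCT")),
    Token("you", "pro:per", "you", Gra(1, 2, "SUBJ")),
    Token("0v", "0v", "v", Gra(2, 0, "ROOT")),
]

nouns = [t.word for t in tokens if t.pos == "n"]
roots = [t.word for t in tokens if t.gra.head == 0]  # head 0 marks a root
print(nouns)  # ['cookie']
print(roots)  # ['cookie', '0v']
```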


To access the unsegmented transcription and the annotation info of each utterance (e.g. time marks or any unparsed tiers), use:  
`.utterances(participants=None, exclude=None, by_files=False)`

In [27]:
eve.utterances()[0]   # first utterance in 'eve'

0,1,2,3
*CHI:,more,cookie,.
%mor:,qn|more,n|cookie,.
%gra:,1|2|QUANT,2|0|INCROOT,3|2|PUNCT
%int:,"distinctive , loud","distinctive , loud","distinctive , loud"
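
The tiers in this output line up position by position with the `Token` fields shown earlier. Below is a minimal sketch of that alignment in plain Python, using the tier items from the utterance above. It handles only the simple `pos|stem` and `dep|head|rel` forms seen here; real `%mor` items can be more complex, and this is not how the library itself parses them.

```python
# Tier items copied from the utterance above.
words = ["more", "cookie", "."]
mor = ["qn|more", "n|cookie", "."]
gra = ["1|2|QUANT", "2|0|INCROOT", "3|2|PUNCT"]

rows = []
for word, m, g in zip(words, mor, gra):
    # "qn|more" -> pos "qn", stem "more"; punctuation has no "|" and no stem.
    pos, _, stem = m.partition("|")
    dep, head, rel = g.split("|")
    rows.append((word, pos, stem, int(dep), int(head), rel))

for row in rows:
    print(row)
# ('more', 'qn', 'more', 1, 2, 'QUANT')
# ('cookie', 'n', 'cookie', 2, 0, 'INCROOT')
# ('.', '.', '', 3, 2, 'PUNCT')
```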



## 4 Linguistic analysis

#### Word Frequencies and Ngrams

`.word_frequencies(keep_case=True, participants=None, exclude=None, by_files=False)`:
- `keep_case`: If `False`, words are lowercased before subsequent processing. (The default is `True`. CHILDES does not capitalize the first word of an utterance, so keeping this `True` should be fine.)

`.word_ngrams(n, keep_case=True, participants=None, exclude=None, by_files=False)`:
- `n`: The n-gram size, e.g. `2` for bigrams.

In [38]:
word_freq = eve.word_frequencies()    # a collections.Counter object
print('Most common words:\n', word_freq.most_common(5))

bigrams = eve.word_ngrams(2)          # a collections.Counter object
print('Most common bigrams:\n', bigrams.most_common(5))

Most common words:
 [('.', 20071), ('?', 6358), ('you', 3695), ('the', 2524), ('it', 2363)]
Most common bigrams:
 [(('it', '.'), 703), (('that', '?'), 619), (('what', '?'), 560), (('yeah', '.'), 510), (('there', '.'), 471)]
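
Both results are `collections.Counter` objects, so the counting itself can be sketched with the standard library. The example below uses two toy utterances built from the words shown earlier. It assumes n-grams are formed within an utterance and do not cross utterance boundaries, which is an assumption of this sketch; the `ngrams` helper is hypothetical. Since punctuation dominates the counts above, the last step filters it out.

```python
from collections import Counter

# Toy utterances (lists of words) built from the words shown earlier.
utterances = [
    ["more", "cookie", "."],
    ["you", "0v", "more", "cookies", "?"],
]


def ngrams(words, n):
    """Yield consecutive n-word tuples from one utterance."""
    return zip(*(words[i:] for i in range(n)))


word_freq = Counter(w for utt in utterances for w in utt)
# Assumption: n-grams never span two utterances.
bigrams = Counter(bg for utt in utterances for bg in ngrams(utt, 2))

print(word_freq.most_common(1))     # [('more', 2)]
print(bigrams[("more", "cookie")])  # 1

# Drop utterance-final punctuation before ranking content words.
PUNCT = {".", "?", "!"}
content_freq = Counter({w: c for w, c in word_freq.items() if w not in PUNCT})
print(len(content_freq))            # 5
```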


#### Developmental Measures

- `.mlu([participant])`: mean lengths of utterance (MLU); documented as equivalent to `.mlum()`
- `.mlum([participant])`: mean lengths of utterance by morphemes
- `.mluw([participant])`: mean lengths of utterance by words
- `.ttr([keep_case, participant])`: type-token ratios (TTR)
- `.ipsyn([participant])`: indexes of productive syntax (IPSyn)

As the `.mlu()` output below shows, the result has one value per CHAT file.

In [41]:
eve.mlu(participant='CHI')

[2.309041835357625,
 2.488372093023256,
 2.8063241106719365,
 2.6153846153846154,
 2.8866855524079322,
 3.208955223880597,
 3.179732313575526,
 3.4171011470281543,
 3.840077071290944,
 3.822669104204753,
 3.883668903803132,
 4.177847113884555,
 4.2631578947368425,
 3.976890756302521,
 4.457182320441989,
 4.422776911076443,
 4.498338870431894,
 4.292035398230088,
 4.3813169984686064,
 3.320964749536178]
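
To make the measures concrete, here is a conceptual sketch of MLU by words and TTR on two toy utterances. It is not PyLangAcq's implementation: the library follows its own counting conventions, which differ from this sketch (for instance, it may skip some utterances or forms such as the omitted-word marker `0v`, which is counted here). The `mlu_words` and `ttr` helpers are hypothetical.

```python
PUNCT = {".", "?", "!"}


def mlu_words(utterances):
    """Mean number of words per utterance, ignoring final punctuation."""
    lengths = [sum(1 for w in utt if w not in PUNCT) for utt in utterances]
    return sum(lengths) / len(lengths)


def ttr(utterances):
    """Type-token ratio: distinct words divided by total words."""
    words = [w for utt in utterances for w in utt if w not in PUNCT]
    return len(set(words)) / len(words)


# Toy data; note that "0v" is counted as a word in this sketch.
utterances = [
    ["more", "cookie", "."],
    ["you", "0v", "more", "cookies", "?"],
]

print(mlu_words(utterances))  # 3.0, i.e. (2 + 4) / 2
print(ttr(utterances))        # 5 types / 6 tokens
```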