
# PyLangAcq Notes

Man Ho Wong | m.wong@pitt.edu | Feb 27th, 2022

The `PyLangAcq` package allows users to read CHAT data from CHILDES directly.  
Reference: https://pylangacq.org/quickstart.html

Install (or upgrade) with:  
`$ pip install --upgrade pylangacq`


In [1]:
import pylangacq

---

## 1 Reading CHAT data

- Directory path: can be local or remote (e.g. a URL)
- Windows users: you may need to put your code under `if __name__ == "__main__":` to avoid errors
- `.read_chat(PATH, FOLDER)`
    - the second argument is matched against the file paths, so a folder name such as `"Adam"` keeps only the files in that folder
    - this creates a `Reader` object (for associated methods, see [this](https://pylangacq.org/api.html#pylangacq.Reader))

- The following example reads the `Adam` data from the Brown Corpus: https://sla.talkbank.org/TBB/childes/Eng-NA/Brown/Adam  
Reference: Brown, R. (1973). A first language: The early stages. Cambridge, MA: Harvard University Press.

In [2]:
url = "https://childes.talkbank.org/data/Eng-NA/Brown.zip"
adam = pylangacq.read_chat(url, "Adam")
print('Number of files:', adam.n_files())    # number of CHAT files in 'adam'

Number of files: 55


---

## 2 Accessing metadata stored in the Header of a CHAT file

The header stores metadata such as age, date, and participants. See [this](https://pylangacq.org/headers.html#headers).

In [3]:
print('Ages:', adam.ages())    # ages when recordings were made
                              # format: year, month, day

Ages: [(2, 3, 4), (2, 3, 18), (2, 4, 3), (2, 4, 15), (2, 4, 30), (2, 5, 12), (2, 6, 3), (2, 6, 17), (2, 7, 1), (2, 7, 14), (2, 8, 1), (2, 8, 16), (2, 9, 4), (2, 9, 18), (2, 10, 2), (2, 10, 16), (2, 10, 30), (2, 11, 13), (2, 11, 28), (3, 0, 11), (3, 0, 25), (3, 1, 9), (3, 1, 26), (3, 2, 9), (3, 2, 21), (3, 3, 4), (3, 3, 18), (3, 4, 1), (3, 4, 18), (3, 5, 1), (3, 5, 15), (3, 5, 29), (3, 6, 9), (3, 7, 7), (3, 8, 1), (3, 8, 14), (3, 8, 26), (3, 9, 16), (3, 10, 15), (3, 11, 1), (3, 11, 14), (4, 0, 14), (4, 1, 15), (4, 2, 17), (4, 3, 9), (4, 4, 1), (4, 4, 13), (4, 5, 11), (4, 6, 24), (4, 7, 1), (4, 7, 29), (4, 9, 2), (4, 10, 2), (4, 10, 23), (5, 2, 12)]
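
The `(year, month, day)` tuples above are often easier to compare or plot as a single number. The following is a minimal sketch, not part of PyLangAcq, that converts a few of them to decimal months. The helper name `age_in_months` and the 30-days-per-month approximation are assumptions made for illustration only.

```python
# A few (years, months, days) tuples copied from the output above.
sample_ages = [(2, 3, 4), (2, 3, 18), (3, 0, 11), (5, 2, 12)]


def age_in_months(age, days_per_month=30):
    """Convert a (years, months, days) tuple to decimal months.

    Assumption: every month is treated as `days_per_month` days long.
    """
    years, months, days = age
    return years * 12 + months + days / days_per_month


# Round to two decimals for readability.
ages_in_months = [round(age_in_months(age), 2) for age in sample_ages]
print(ages_in_months)
```

With the real data, `sample_ages` could be replaced by the list returned by `adam.ages()`.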


---

## 3 Accessing Transcriptions and Annotations

#### Accessing transcriptions with `.words()` method

`.words()` options:
`words(participants=None, exclude=None, by_utterances=False, by_files=False)`  
- `participants`: Participants to be included, e.g. `'CHI'` (child), `'MOT'` (mother), `{'MOT','INV'}` (mother and investigators).
- `exclude`: Participants to be excluded; cannot be used with `participants`.
- `by_utterances`: `False` by default; if `True`, output will be organized by utterances.
- `by_files`: `False` by default; if `True`, output will be organized by files.


In [4]:
# To access words across ALL CHAT files of 'adam':
words = adam.words()    # list of strings
print('Total word count:',len(words))

# To access words in individual CHAT files:
words_by_files = adam.words(by_files=True)  # list of lists of strings
for i, words_one_file in enumerate(words_by_files):
    print('Word count in file',i,':', len(words_one_file))

# Example: First 8 words in the first CHAT file:
words_by_files[0][0:8]

Total word count: 353686
Word count in file 0 : 6304
Word count in file 1 : 7567
Word count in file 2 : 5427
Word count in file 3 : 4429
Word count in file 4 : 5346
Word count in file 5 : 4558
Word count in file 6 : 5618
Word count in file 7 : 4807
Word count in file 8 : 5760
Word count in file 9 : 6118
Word count in file 10 : 5769
Word count in file 11 : 5124
Word count in file 12 : 3888
Word count in file 13 : 4303
Word count in file 14 : 4841
Word count in file 15 : 4285
Word count in file 16 : 6514
Word count in file 17 : 6605
Word count in file 18 : 8304
Word count in file 19 : 7854
Word count in file 20 : 6446
Word count in file 21 : 6507
Word count in file 22 : 7607
Word count in file 23 : 6341
Word count in file 24 : 7587
Word count in file 25 : 7370
Word count in file 26 : 7819
Word count in file 27 : 7284
Word count in file 28 : 7369
Word count in file 29 : 7693
Word count in file 30 : 6385
Word count in file 31 : 7194
Word count in file 32 : 5500
Word count in file 33 : 6014

['play', 'checkers', '.', 'big', 'drum', '.', 'big', 'drum']
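
As the last output shows, punctuation marks such as `'.'` are returned as items of the word list, so the counts above include them. The following is a minimal sketch for leaving them out, using the eight items printed above. The contents of `PUNCTUATION` are an assumption and may need extending for other transcripts.

```python
# The first eight items of the first file, copied from the output above.
sample_words = ['play', 'checkers', '.', 'big', 'drum', '.', 'big', 'drum']

# Assumed set of punctuation marks to drop (hypothetical, extend as needed).
PUNCTUATION = {'.', '?', '!', ','}

# Keep only the items that are not punctuation marks.
content_words = [word for word in sample_words if word not in PUNCTUATION]

print(len(sample_words), len(content_words))  # prints: 8 6
```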


#### Accessing annotations

The `.words()` method returns words without annotations. To access words together with their annotations, use `.tokens()`:  
`.tokens(participants=None, exclude=None, by_utterances=False, by_files=False)`

A `list` of `Token` objects will be created:

In [5]:
tokens = adam.tokens()
tokens[:5]  # first five tokens in 'adam'

[Token(word='play', pos='n', mor='play', gra=Gra(dep=1, head=2, rel='MOD')),
 Token(word='checkers', pos='n', mor='checker-PL', gra=Gra(dep=2, head=0, rel='INCROOT')),
 Token(word='.', pos='.', mor='', gra=Gra(dep=3, head=2, rel='PUNCT')),
 Token(word='big', pos='adj', mor='big', gra=Gra(dep=1, head=2, rel='MOD')),
 Token(word='drum', pos='n', mor='drum', gra=Gra(dep=2, head=0, rel='INCROOT'))]

`Token` is a `dataclass` whose attributes (`word`, `pos`, `mor`, and `gra`) are shown in the example above.  

To access annotation info of each word (i.e. `Token` attributes other than `word`):

In [6]:
for token in tokens[:5]:
    print(token.word, token.pos)

play n
checkers n
. .
big adj
drum n
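
Once each word carries a part-of-speech tag, the tags can be counted or used as a filter. The following is a minimal sketch that rebuilds the five tokens above with a hypothetical stand-in, `SampleToken` (a `namedtuple` keeping only `word` and `pos`, not the real `Token` class), so that it runs without downloading the corpus.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for pylangacq's Token, keeping only two attributes.
SampleToken = namedtuple('SampleToken', ['word', 'pos'])

# The five tokens shown above.
sample_tokens = [
    SampleToken('play', 'n'),
    SampleToken('checkers', 'n'),
    SampleToken('.', '.'),
    SampleToken('big', 'adj'),
    SampleToken('drum', 'n'),
]

# Count how often each part-of-speech tag occurs.
pos_counts = Counter(token.pos for token in sample_tokens)

# Keep only the words tagged as nouns.
nouns = [token.word for token in sample_tokens if token.pos == 'n']

print(pos_counts)
print(nouns)
```

Because the real `Token` objects expose the same `word` and `pos` attributes, the two comprehensions should work unchanged on `adam.tokens()`.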


To access the unsegmented transcription and annotation info of each utterance (e.g. time marks or any unparsed tiers), use:  
`.utterances(participants=None, exclude=None, by_files=False)`

In [7]:
adam.utterances()[0]   # first utterance in 'adam'

| 0 | 1 | 2 | 3 |
|---|---|---|---|
| *CHI: | play | checkers | . |
| %mor: | n\|play | n\|checker-PL | . |
| %gra: | 1\|2\|MOD | 2\|0\|INCROOT | 3\|2\|PUNCT |
| %pho: | <1> pe | <1> pe | <1> pe |



## 4 Linguistic analysis

#### Word Frequencies and Ngrams

`.word_frequencies(keep_case=True, participants=None, exclude=None, by_files=False)`:
- `keep_case`: If `False`, all words are lowercased before counting. (The default is `True`. Since CHILDES transcripts do not capitalize the first word of an utterance, keeping this `True` should be fine.)

`.word_ngrams(n, keep_case=True, participants=None, exclude=None, by_files=False)`:
- `n`: Specify the ngram type, e.g. `2` for bigrams.

In [8]:
word_freq = adam.word_frequencies()    # a collections.Counter object
print('Most common words:\n', word_freq.most_common(5))

bigrams = adam.word_ngrams(2)          # a collections.Counter object
print('Most common bigrams:\n', bigrams.most_common(5))

Most common words:
 [('.', 49060), ('?', 22259), ('you', 11353), ('I', 9465), ('it', 7573)]
Most common bigrams:
 [(('it', '.'), 1908), (('do', 'you'), 1622), (('that', '?'), 1538), (('what', '?'), 1350), (('I', "don't"), 1334)]
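
To make the two counts concrete, here is a minimal sketch of the same idea using only `collections.Counter` on three sample utterances. The helper `count_ngrams` is hypothetical, and it assumes that n-grams do not cross utterance boundaries; these notes do not confirm how PyLangAcq handles boundaries internally.

```python
from collections import Counter

# Three sample utterances (lists of words, punctuation included).
sample_utterances = [
    ['play', 'checkers', '.'],
    ['big', 'drum', '.'],
    ['big', 'drum', '.'],
]


def count_ngrams(utterances, n):
    """Count n-grams (as tuples) without crossing utterance boundaries."""
    counts = Counter()
    for utterance in utterances:
        # Zipping n shifted copies of the utterance yields its n-grams.
        counts.update(zip(*(utterance[i:] for i in range(n))))
    return counts


# Word frequencies: flatten the utterances and count each word.
sample_freq = Counter(word for utt in sample_utterances for word in utt)
sample_bigrams = count_ngrams(sample_utterances, 2)

print(sample_freq['.'], sample_freq['big'])
print(sample_bigrams[('big', 'drum')])
```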


#### Developmental Measures

Each method returns a list with one value per CHAT file:

- `.mlu([participant])`: mean lengths of utterance (MLU)
- `.mlum([participant])`: mean lengths of utterance by morphemes
- `.mluw([participant])`: mean lengths of utterance by words
- `.ttr([keep_case, participant])`: type-token ratios (TTR)
- `.ipsyn([participant])`: indexes of productive syntax (IPSyn)
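
To show what two of these measures compute, the following is a simplified sketch of MLU by words and TTR on three sample utterances. The function names are hypothetical, and the sketch assumes that punctuation is excluded. PyLangAcq's own methods may treat punctuation, case, and special forms differently, so the numbers are not expected to match the library's output exactly.

```python
# Three sample utterances (lists of words, punctuation included).
sample_utterances = [
    ['play', 'checkers', '.'],
    ['big', 'drum', '.'],
    ['big', 'drum', '.'],
]

# Assumed set of punctuation marks to ignore (hypothetical).
PUNCTUATION = {'.', '?', '!'}


def strip_punctuation(utterance):
    """Return the words of an utterance without punctuation marks."""
    return [word for word in utterance if word not in PUNCTUATION]


def mean_length_in_words(utterances):
    """MLU by words: total number of words / number of utterances."""
    lengths = [len(strip_punctuation(utt)) for utt in utterances]
    return sum(lengths) / len(lengths)


def type_token_ratio(utterances):
    """TTR: number of distinct words / total number of words."""
    tokens = [word for utt in utterances for word in strip_punctuation(utt)]
    return len(set(tokens)) / len(tokens)


sample_mluw = mean_length_in_words(sample_utterances)
sample_ttr = type_token_ratio(sample_utterances)
print(sample_mluw, round(sample_ttr, 3))
```

Here the six words over three utterances give an MLU by words of 2.0, and the four distinct words over six tokens give a TTR of about 0.667.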

In [9]:
adam.mlu(participant='CHI')

[3.004731861198738,
 3.0116822429906542,
 3.3062645011600926,
 2.656449553001277,
 3.1534883720930234,
 3.1988023952095808,
 3.4627720504009165,
 3.8165413533834585,
 3.4375,
 3.448559670781893,
 3.6364653243847873,
 3.3973509933774833,
 3.3095238095238093,
 3.294915254237288,
 3.587096774193548,
 3.4803030303030305,
 3.9605110336817653,
 3.421859039836568,
 4.182600382409178,
 4.301318267419962,
 5.064278187565859,
 4.820202020202021,
 4.525600835945664,
 4.158956109134045,
 4.449136276391554,
 4.435255712731229,
 4.96242774566474,
 4.756550218340611,
 5.106502242152466,
 4.7413249211356465,
 4.696286472148541,
 4.814593301435407,
 5.117554858934169,
 5.095052083333333,
 5.263636363636364,
 4.773690078037904,
 5.147688838782413,
 5.349804941482445,
 5.291743119266055,
 5.296572280178838,
 5.3102981029810294,
 5.640463917525773,
 5.798695246971109,
 5.728958630527817,
 6.333333333333333,
 6.167098445595855,
 5.884197828709288,
 6.019059720457434,
 6.125408942202835,
 5.737684729064039,