This repo contains scripts to process CHILDES data annotated with POS tags and syntactic dependency relations.
- Gloss inconsistency: in many cases “wanna” is changed into “want to” in the gloss; I wonder whether childes-db is to some extent looking at the stem (or whether there is a minor bug in their code)
- childes-db's way of handling repetition: for
  my [/] my paper
  they treat it as 3 tokens, with the gloss being “my my paper”
- There's no specific documentation of how cases such as
  <that> dat book
  should be treated, which potentially leads to different estimates of the number of tokens
  - In general, the glosses associated with these codes are not clearly defined:
    [/?], [/-], [/], [//], and [///]
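To make the token-count discrepancy concrete, here is a minimal sketch that counts tokens in a CHAT-style utterance with and without the retraced material. It assumes the retraced material is either the `<...>` group or the single word immediately before the code; the names `count_tokens` and `keep_retraced` are hypothetical and not part of the scripts in this repo.

```python
import re

# Repetition/retracing codes listed above: [/], [//], [///], [/?], [/-]
MARKER = r"\[/(?:/{1,2}|\?|-)?\]"
MARKER_ONLY = re.compile(MARKER)
# Assumption: the retraced material is the <...> group, or else the single
# word, that immediately precedes the code.
RETRACED = re.compile(r"(?:<[^>]*>|\S+)\s*" + MARKER)


def count_tokens(utterance: str, keep_retraced: bool) -> int:
    """Count whitespace-separated tokens, keeping or dropping retraced material."""
    if keep_retraced:
        # Drop only the codes and the angle brackets; keep every word.
        text = MARKER_ONLY.sub(" ", utterance)
        text = text.replace("<", " ").replace(">", " ")
    else:
        # Drop the codes together with the material they mark as retraced.
        text = RETRACED.sub(" ", utterance)
    return len(text.split())


print(count_tokens("my [/] my paper", keep_retraced=True))   # 3
print(count_tokens("my [/] my paper", keep_retraced=False))  # 2
```

With `keep_retraced=True` the count matches the 3-token treatment described above for childes-db; with `keep_retraced=False` the repeated word is excluded.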
- POS does not match*
  - Based on the CHILDES annotation,
    what dat
    has the POS sequence pro:int, adv
  - Based on childes-db, it's
    pro:int, det
    which is potentially more accurate but still inconsistent (and I'm hesitant to make my own subjective decisions)
- Whether unintelligible tokens should be counted as separate tokens
  - This was handled in our data generation process, where I added a separate column for the number of unintelligible tokens
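As a rough illustration of keeping unintelligible tokens in a separate count, the sketch below splits a token list into an intelligible count and an unintelligible count. It assumes unintelligible speech is transcribed as `xxx` (the usual CHAT convention); the function name and the default marker set are hypothetical and may differ from what the data generation scripts actually do.

```python
from typing import Iterable, Tuple

# Assumption: "xxx" marks unintelligible speech. Extend this set if other
# placeholders should be counted the same way.
UNINTELLIGIBLE = frozenset({"xxx"})


def split_counts(tokens: Iterable[str],
                 markers: frozenset = UNINTELLIGIBLE) -> Tuple[int, int]:
    """Return (number of intelligible tokens, number of unintelligible tokens)."""
    tokens = list(tokens)
    n_unintelligible = sum(1 for tok in tokens if tok in markers)
    return len(tokens) - n_unintelligible, n_unintelligible


# Illustrative utterance, not taken from the corpus.
print(split_counts("I want xxx".split()))  # (2, 1)
```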
- POS information in childes-db is not complete; this was handled in our data generation process as well
- Overall, the scripts here count the number of tokens based on the stem
- Annotation errors (some examples)
  - gloss: what else is on there ?
    stem: what else be&3S on there
    pos: pro:int post aux prep n
    dependency relations: 1|0|INCROOT 2|1|PUNCT 3|2|INCROOT 4|3|JCT 5|4|POBJ 6|3|PUNCT
  - gloss: does your writing look like his ?
    stem: do&3S your write-PRESP look like his
    pos: mod det:poss n:gerund v conj det:poss
    dependency relations: 1|4|AUX 2|3|DET 3|4|SUBJ 4|0|ROOT 5|4|JCT 6|7|DET 7|5|POBJ
  - gloss: ⌊come on , let's go→⌋
    stem: come on cm let~us go
    pos: v adv cm v~pro:obj v
    dependency relations: 1|0|ROOT 2|1|JCT 3|1|LP 4|1|ENUM 5|4|OBJ 6|1|PUNCT
    (Sprott/01Goldilocks.cha): more dependency relations than the number of tokens
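Length mismatches like the last example can be surfaced automatically. The following is a minimal sketch, assuming the dependency relations are space-separated `index|head|label` triples as shown above; `parse_relations` and `length_report` are hypothetical helper names, not functions from this repo.

```python
from typing import List, Tuple


def parse_relations(gra: str) -> List[Tuple[int, int, str]]:
    """Parse 'index|head|label' triples into (index, head, label) tuples."""
    triples = []
    for item in gra.split():
        index, head, label = item.split("|")
        triples.append((int(index), int(head), label))
    return triples


def length_report(stem: str, gra: str) -> Tuple[int, int]:
    """Return (number of stem tokens, number of dependency relations).

    Note: the final PUNCT relation has no stem token, and a clitic such as
    let~us is a single whitespace-separated token, so how the two counts
    should be compared is a decision left to the caller.
    """
    return len(stem.split()), len(parse_relations(gra))


stem = "come on cm let~us go"
gra = "1|0|ROOT 2|1|JCT 3|1|LP 4|1|ENUM 5|4|OBJ 6|1|PUNCT"
print(length_report(stem, gra))  # (5, 6)
```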
- Manually added target_child_sex in 020010.cha and 020324.cha in Bloom/Peter
- Lacking morphosyntactic and dependency annotation in the 2014-Indiff folder in the Gelman corpus
- ⌈nay , nay , nay , nay , nay⌉ (Sprott/01Goldilocks.cha): what to do?
- **dependency relation between auxiliaries and head verb
- Each sentence starts with two lines of additional information
  - file name + speaker name
    ### 020304.cha Adam
    (for comparison to original .cha file)
  - utterance order + gloss
    ### 1 play checkers .
- The format follows the 10-column tab-delimited format of Universal Dependencies
  - ID: word index
  - FORM: word form or punctuation symbol; currently it is the lemma based on tokenization from CHILDES
  - LEMMA: currently the same as FORM, except for cases with n't (not as FORM, n't as LEMMA, for ease of identification)
  - POS
  - Left empty
  - FEATS: empty
  - HEAD: index of syntactic head
  - DEPREL: dependency relation with the syntactic head
  - Speaker name + speaker code + speaker role (e.g. Adam CHI Target_Child), separated by spaces
  - Child gender + age + type (e.g. male 27 TD), separated by spaces
- The gloss is not matched yet (see notes above), but this does not necessarily affect current work on negation per se
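For reading the output back in, the two `###` header lines and the ten tab-delimited columns described above can be parsed roughly as follows. This is a sketch under assumptions: the column keys, the `parse_block` name, the `_` placeholder for empty columns, and the POS/DEPREL values in the sample rows are all illustrative and not taken from the actual output files.

```python
from typing import Any, Dict

# Keys for the ten columns, in the order listed above (names are assumptions).
COLUMNS = ["id", "form", "lemma", "pos", "empty", "feats",
           "head", "deprel", "speaker", "child"]


def parse_block(block: str) -> Dict[str, Any]:
    """Parse one sentence: two '###' header lines plus 10-column rows."""
    lines = [ln for ln in block.splitlines() if ln.strip()]
    # First header line: file name + speaker name.
    file_name, speaker = lines[0].lstrip("#").split(maxsplit=1)
    # Second header line: utterance order + gloss.
    order, gloss = lines[1].lstrip("#").strip().split(" ", 1)
    rows = []
    for ln in lines[2:]:
        fields = ln.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}: {ln!r}")
        row: Dict[str, Any] = dict(zip(COLUMNS, fields))
        row["id"] = int(row["id"])
        row["head"] = int(row["head"])
        rows.append(row)
    return {"file": file_name, "speaker": speaker,
            "order": int(order), "gloss": gloss, "rows": rows}


# Illustrative block: header lines follow the examples above; the tag values
# and the "_" placeholders in the rows are made up for demonstration.
SAMPLE = (
    "### 020304.cha Adam\n"
    "### 1 play checkers .\n"
    "1\tplay\tplay\tv\t_\t_\t0\tROOT\tAdam CHI Target_Child\tmale 27 TD\n"
    "2\tchecker\tchecker\tn\t_\t_\t1\tOBJ\tAdam CHI Target_Child\tmale 27 TD\n"
)

parsed = parse_block(SAMPLE)
print(parsed["file"], parsed["order"], parsed["gloss"])
print([(r["form"], r["head"], r["deprel"]) for r in parsed["rows"]])
```

The speaker and child columns are kept as raw strings here; splitting them on spaces gives the name/code/role and gender/age/type fields respectively.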