# CHILDES Narrative English Hicks Corpus

Caroline Gish | cng18@pitt.edu

nbviewer view: <https://nbviewer.org/github/Data-Science-for-Linguists-2022/UDS-child-speech/blob/main/notebooks/childes_exploration.ipynb>


*This notebook contains exploration of the dataset and explains how I got to the information I wanted.*

---

### Table of Contents

- [1. Overview of dataset](http://localhost:8888/notebooks/notebooks/childes_exploration.ipynb#1.-Overview-of-dataset) provides relevant sources and gives a run-down of what the dataset tiers are
- [2. Trying out `chamd`](http://localhost:8888/notebooks/notebooks/childes_exploration.ipynb#2.-Trying-out-chamd) briefly discusses my attempt to utilize the `chamd` library (unsuccessful)
- [3. Trying out `PyLangAcq`](http://localhost:8888/notebooks/notebooks/childes_exploration.ipynb#3.-Trying-out-PyLangAcq) details my initial exploration of the Hicks dataset using the `pylangacq` library
    - [3-1. More exploration](http://localhost:8888/notebooks/notebooks/childes_exploration.ipynb#3-1.-More-exploration) contains more data exploration
- [4. Creating DataFrame](http://localhost:8888/notebooks/notebooks/childes_exploration.ipynb#4.-Creating-DataFrame) is where I create a DataFrame with the desired data
- [5. Linguistic phenomenon of interest](http://localhost:8888/notebooks/notebooks/childes_exploration.ipynb#5.-Linguistic-phenomenon-of-interest) explains the linguistic phenomenon I will be further exploring and analyzing
    - [5-1. "try to" constructions](http://localhost:8888/notebooks/notebooks/childes_exploration.ipynb#5-1.-%22try-to%22-constructions) explains how I got to the linguistic phenomenon of interest

## 1. Overview of dataset

**Source:** https://childes.talkbank.org/access/Eng-NA/Hicks.html 
(also has description of research goals and annotation schema)

Initial exploration will involve trying to read in the CHAT files. 

**Resources:**

- .cha file viewer: https://filext.com/file-extension/CHA

- paper on child genre development: https://www.jstor.org/stable/40171463?seq=1#metadata_info_tab_contents

- CLAN documentation: https://talkbank.org/manuals/CLAN.pdf

- CHAT documentation: https://talkbank.org/manuals/CHAT.pdf

**Relevant tiers:**

- %mor: Morphological Tier
    - This tier codes morphemic segments by type and part of speech.

- %gra: Grammatical Relations Tier            
    - This tier is used to code dependency structures with tagged grammatical relations (Sagae, Davis, Lavie, MacWhinney, & Wintner, 2007; Sagae, Lavie, & MacWhinney, 2005; Sagae, MacWhinney, & Lavie, 2004).

- %cod: Coding Tier                                       
    - This is the general purpose coding tier. It can be used for mixing codes into a single tier for economy or ease of entry.

In [1]:
import numpy as np
import pandas as pd
import pickle as pkl

%pprint

Pretty printing has been turned OFF


In [2]:
# display the output of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## 2. Trying out `chamd`

In [3]:
from chamd import ChatReader

In [4]:
reader = ChatReader()
chat = reader.read_file('../data/Hicks/1st/event/evt004.cha') # or read_string

My first attempt at exploring the CHAT files used the `chamd` library. I've deleted almost all of the code I wrote with `chamd`, but I've kept this section so that I have a record of trying it (and as a tip for anyone else: skip it and use `PyLangAcq` instead, as I do below).

## 3. Trying out `PyLangAcq`

PyLangAcq: Language Acquisition Research in Python: https://pylangacq.org/

reading CHAT data: https://pylangacq.org/read.html

In [5]:
import pylangacq

- "`read_chat()` automatically handles everything behind the scenes for you, from downloading the ZIP file, unzipping it, traversing through the CHAT files found, as well as parsing the files"

In [6]:
# initializing a reader from URL

url = 'https://childes.talkbank.org/data/Eng-NA/Hicks.zip'
hicks = pylangacq.read_chat(url)

In [7]:
hicks.info(verbose=True)  # verbose=True lists every file

213 files
14273 utterances
104908 words
        Utterance Count    Word Count  File Path
----  -----------------  ------------  -------------------------------
#1                   66           522  Hicks/1st/event/evt004.cha
#2                   79           560  Hicks/1st/event/evt005.cha
#3                   57           423  Hicks/1st/event/evt009.cha
#4                   58           395  Hicks/1st/event/evt010.cha
#5                   57           469  Hicks/1st/event/evt012.cha
#6                   79           623  Hicks/1st/event/evt016.cha
#7                   41           309  Hicks/1st/event/evt018.cha
#8                   90           658  Hicks/1st/event/evt019.cha
#9                   73           531  Hicks/1st/event/evt021.cha
#10                  82           587  Hicks/1st/event/evt024.cha
#11                  75           553  Hicks/1st/event/evt026.cha
#12                  89           698  Hicks/1st/event/evt027.cha
#13                  77           513  Hicks/1st

This is looking much more promising! Access to individual files and the ability to count utterances and words are already built in.

Not sure how/why I didn't find this before `chamd`. PyLangAcq has a lot more documentation (that's also organized well) and it's taking care of everything I need.

In [8]:
hicks.head()
hicks.tail()

0,1,2,3,4,5,6,7,8
*RES:,so,this,is,again,Deborah,and,Ari,.
%mor:,adv|so,pro:dem|this,cop|be&3S,adv|again,n:prop|Deborah,coord|and,n:prop|Ari,.
%gra:,1|2|JCT,2|3|SUBJ,3|0|ROOT,4|3|JCT,5|3|PRED,6|5|CONJ,7|6|COORD,8|3|PUNCT

0,1,2,3,4,5,6,7,8
*RES:,and,we're,CLITIC,gonna,CLITIC,be,sportscasters,.
%mor:,coord|and,pro:sub|we,aux|be&PRES,part|go-PRESP,inf|to,cop|be,n|+n|sports+n|caster-PL,.
%gra:,1|4|LINK,2|4|SUBJ,3|4|AUX,4|0|ROOT,5|6|INF,6|4|COMP,7|6|PRED,8|4|PUNCT

0,1,2,3,4,5,6,7,8,9
*RES:,and,say,what's,CLITIC,happening,in,the,movie,.
%mor:,coord|and,v|say,pro:int|what,aux|be&3S,part|happen-PRESP,prep|in,det:art|the,n|movie,.
%gra:,1|2|LINK,2|0|ROOT,3|5|SUBJ,4|5|AUX,5|2|COMP,6|5|JCT,7|8|DET,8|6|POBJ,9|2|PUNCT

0,1,2,3,4,5,6,7
*RES:,and,I'm,CLITIC,gonna,CLITIC,start,.
%mor:,coord|and,pro:sub|I,aux|be&1S,part|go-PRESP,inf|to,v|start,.
%gra:,1|4|LINK,2|4|SUBJ,3|4|AUX,4|0|ROOT,5|6|INF,6|4|COMP,7|4|PUNCT

0,1,2,3,4,5,6,7,8,9
*RES:,the,little,boy,is,getting,some,money,out,.
%mor:,det:art|the,adj|little,n|boy,aux|be&3S,part|get-PRESP,qn|some,n|money,adv|out,.
%gra:,1|3|DET,2|3|MOD,3|5|SUBJ,4|5|AUX,5|0|ROOT,6|7|QUANT,7|5|OBJ,8|5|JCT,9|5|PUNCT


0,1,2,3,4,5,6,7,8,9,10
*CHI:,and,and,then,all,the,people,loosed,their,balloon,.
%mor:,coord|and,coord|and,adv:tem|then,qn|all,det:art|the,n|person&PL,v|loose-PAST,det:poss|their,n|balloon,.
%gra:,1|7|LINK,2|7|LINK,3|7|JCT,4|6|QUANT,5|6|DET,6|7|SUBJ,7|0|ROOT,8|9|DET,9|7|OBJ,10|7|PUNCT
%cod:,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq

0,1,2,3,4,5,6,7,8,9
*CHI:,and,lots_of,balloons,came,flying,to,the,boy,.
%mor:,coord|and,qn|lots_of,n|balloon-PL,v|come&PAST,part|fly-PRESP,prep|to,det:art|the,n|boy,.
%gra:,1|4|LINK,2|3|QUANT,3|4|SUBJ,4|0|ROOT,5|4|XJCT,6|5|JCT,7|8|DET,8|6|POBJ,9|4|PUNCT
%cod:,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas

0,1,2,3,4,5,6,7,8,9
*EXP:,wow,is,that,the,end,of,the,story,?
%mor:,co|wow,cop|be&3S,comp|that,det:art|the,n|end,prep|of,det:art|the,n|story,?
%gra:,1|2|COM,2|0|ROOT,3|2|PRED,4|5|DET,5|3|OBJ,6|5|NJCT,7|8|DET,8|6|POBJ,9|2|PUNCT

0,1,2
*XXX:,yeah,.
%mor:,co|yeah,.
%gra:,1|0|INCROOT,2|1|PUNCT

0,1,2
*EXP:,great,!
%mor:,adj|great,!
%gra:,1|0|INCROOT,2|1|PUNCT


This is in a great format that preserves all the separation between tokens across all tiers. When a child is speaking, the %cod tier gets tacked on as the fourth and final tier. 

Of note for now: Contractions and clitics are transcribed in a specific way with the word itself first and then CLITIC as the next "word". Can read more about this in the CHAT manual. 
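To make the clitic convention concrete: in the %mor strings that appear later in this notebook, a contraction such as "he's" is a single item whose parts are joined by `~` (for example `pro:sub|he~aux|be&3S`). Below is a minimal sketch of pulling such an item apart. `split_mor_item` is a hypothetical helper of mine, not part of PyLangAcq, and it does not unpack compounds such as `n|+n|sports+n|caster-PL`.

```python
def split_mor_item(item):
    """Split one %mor item into (pos, stem) pairs, one pair per clitic part."""
    parts = []
    for piece in item.split("~"):          # "~" joins a host word and its clitic
        pos, _, stem = piece.partition("|")  # split on the first "|" only
        parts.append((pos, stem))
    return parts

print(split_mor_item("pro:sub|he~aux|be&3S"))
print(split_mor_item("n|boy"))
```

The first call yields two pairs (the pronoun and the auxiliary), which is why the tabular view needs a separate CLITIC slot to stay aligned with the %mor tier.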

In [9]:
# all file paths

hicks.file_paths()[:10]
hicks.file_paths()[-10:]

['Hicks/1st/event/evt004.cha', 'Hicks/1st/event/evt005.cha', 'Hicks/1st/event/evt009.cha', 'Hicks/1st/event/evt010.cha', 'Hicks/1st/event/evt012.cha', 'Hicks/1st/event/evt016.cha', 'Hicks/1st/event/evt018.cha', 'Hicks/1st/event/evt019.cha', 'Hicks/1st/event/evt021.cha', 'Hicks/1st/event/evt024.cha']

['Hicks/Kinder/story/story046.cha', 'Hicks/Kinder/story/story047.cha', 'Hicks/Kinder/story/story051.cha', 'Hicks/Kinder/story/story052.cha', 'Hicks/Kinder/story/story053.cha', 'Hicks/Kinder/story/story054.cha', 'Hicks/Kinder/story/story056.cha', 'Hicks/Kinder/story/story057.cha', 'Hicks/Kinder/story/story058.cha', 'Hicks/Kinder/story/story059.cha']

In [10]:
# setting variables for grade level

first = pylangacq.read_chat(url, match="1st")
second = pylangacq.read_chat(url, match="2nd")
fifth = pylangacq.read_chat(url, match="5th")
Del = pylangacq.read_chat(url, match="Del")  # only has event and story, though
kinder = pylangacq.read_chat(url, match="Kinder")

The "Del" directory of files only contains narratives of the event and story genre (no report genre). To keep things consistent across all the grade levels, I may end up not using the "Del" files. I'll keep them in for now, though.

In [11]:
for file in first, second, fifth, Del, kinder:
    print('=====\n')
    print('number of files:', file.n_files())
    print('file path example:', file.file_paths()[:2])
    # assumes 3 genres per grade; "Del" has only 2 (event, story), so its figure is off
    print('number of files per genre:', file.n_files() // 3)
    print('total number of utterances:', len(file.utterances()))
    print('total number of words:', len(file.words()), '\n')

=====

number of files: 60
file path example: ['Hicks/1st/event/evt004.cha', 'Hicks/1st/event/evt005.cha']
number of files per genre: 20
total number of utterances: 3981
total number of words: 28408 

=====

number of files: 54
file path example: ['Hicks/2nd/event/evt001.cha', 'Hicks/2nd/event/evt002.cha']
number of files per genre: 18
total number of utterances: 3843
total number of words: 26823 

=====

number of files: 15
file path example: ['Hicks/5th/event/evt01.cha', 'Hicks/5th/event/evt02.cha']
number of files per genre: 5
total number of utterances: 1233
total number of words: 8545 

=====

number of files: 24
file path example: ['Hicks/Del/event/evt01.cha', 'Hicks/Del/event/evt02.cha']
number of files per genre: 8
total number of utterances: 1408
total number of words: 10560 

=====

number of files: 60
file path example: ['Hicks/Kinder/event/evt013.cha', 'Hicks/Kinder/event/evt014.cha']
number of files per genre: 20
total number of utterances: 3808
total number of words: 2724

There are far fewer files for fifth-grade students: 15 files across 3 genres means only 5 files per genre.
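Dividing the file count by 3 only works when a grade has all three genres, which is not the case for "Del". A safer sketch is to read the grade and genre off each path and count directly. This assumes every path has the shape `Hicks/<grade>/<genre>/<file>`; `files_per_genre` is a hypothetical helper, and the "Del" story file name in the sample is made up for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath

def files_per_genre(paths):
    """Count files per (grade, genre), read off 'Hicks/<grade>/<genre>/<file>'."""
    counts = Counter()
    for path in paths:
        _, grade, genre, _ = PurePosixPath(path).parts
        counts[(grade, genre)] += 1
    return counts

sample_paths = [
    "Hicks/5th/event/evt01.cha",
    "Hicks/5th/event/evt02.cha",
    "Hicks/Del/event/evt01.cha",
    "Hicks/Del/story/story01.cha",  # hypothetical file name, for illustration only
]
print(files_per_genre(sample_paths))
```

With the real data this would be called as `files_per_genre(hicks.file_paths())`.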

In [12]:
# setting variables for narrative genre

event = pylangacq.read_chat(url, match="event")
report = pylangacq.read_chat(url, match="report")
story = pylangacq.read_chat(url, match="story")

In [13]:
for file in event, report, story:
    print('=====\n')
    print('number of files:', file.n_files())
    print('file path example:', file.file_paths()[:2])
    print('total number of utterances:', len(file.utterances()))
    print('total number of words:', len(file.words()), '\n')

=====

number of files: 75
file path example: ['Hicks/1st/event/evt004.cha', 'Hicks/1st/event/evt005.cha']
total number of utterances: 5623
total number of words: 38753 

=====

number of files: 63
file path example: ['Hicks/1st/report/rep004.cha', 'Hicks/1st/report/rep005.cha']
total number of utterances: 3331
total number of words: 24338 

=====

number of files: 75
file path example: ['Hicks/1st/story/story004.cha', 'Hicks/1st/story/story005.cha']
total number of utterances: 5319
total number of words: 38486 



This seems to match up - remember the "Del" directory did not have any report files (so 63 report files instead of 75), but it did have event and story files. 

So, overall, it looks like the sets (by genre) are comparable in size. 
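As a quick sanity check, the file and utterance counts printed above should add up to the corpus totals (213 files, 14273 utterances) whether the corpus is split by genre or by grade. The numbers below are copied from the outputs above; word counts are left out of this check.

```python
TOTAL_FILES, TOTAL_UTTERANCES = 213, 14273

# (number of files, number of utterances), copied from the printed output
by_genre = {"event": (75, 5623), "report": (63, 3331), "story": (75, 5319)}
by_grade = {
    "1st": (60, 3981),
    "2nd": (54, 3843),
    "5th": (15, 1233),
    "Del": (24, 1408),
    "Kinder": (60, 3808),
}

def totals(groups):
    """Sum files and utterances over all groups."""
    files = sum(f for f, _ in groups.values())
    utterances = sum(u for _, u in groups.values())
    return files, utterances

print(totals(by_genre) == (TOTAL_FILES, TOTAL_UTTERANCES))
print(totals(by_grade) == (TOTAL_FILES, TOTAL_UTTERANCES))
```

Both splits account for every file and utterance, so `match=` is not silently dropping or double-counting files.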

### 3-1. More exploration

#### Accessing CHAT headers:

In [14]:
# ages()
# returns CHI ages in tuple of 3 integers
# (years, months, days)

# ages of first graders
first.ages()

# same ages in months
first.ages(months=True)

[(5, 9, 0), (6, 6, 0), (7, 0, 0), (6, 5, 0), (6, 8, 0), (6, 6, 0), (6, 1, 0), (6, 1, 0), (6, 7, 0), (6, 10, 0), (6, 7, 0), (6, 6, 0), (6, 7, 0), (6, 11, 0), (6, 9, 0), (6, 7, 0), (6, 5, 0), (6, 7, 0), (6, 6, 0), (6, 9, 0), (5, 9, 0), (6, 6, 0), (7, 0, 0), (6, 5, 0), (6, 8, 0), (6, 6, 0), (6, 1, 0), (6, 1, 0), (6, 7, 0), (6, 10, 0), (6, 7, 0), (6, 6, 0), (6, 7, 0), (6, 11, 0), (6, 9, 0), (6, 7, 0), (6, 5, 0), (6, 7, 0), (6, 6, 0), (6, 9, 0), (5, 9, 0), (6, 6, 0), (7, 0, 0), (6, 5, 0), (6, 8, 0), (6, 6, 0), (6, 1, 0), (6, 1, 0), (6, 7, 0), (6, 10, 0), (6, 7, 0), (6, 6, 0), (6, 7, 0), (6, 11, 0), (6, 9, 0), (6, 7, 0), (6, 5, 0), (6, 7, 0), (6, 6, 0), (6, 9, 0)]

[69.0, 78.0, 84.0, 77.0, 80.0, 78.0, 73.0, 73.0, 79.0, 82.0, 79.0, 78.0, 79.0, 83.0, 81.0, 79.0, 77.0, 79.0, 78.0, 81.0, 69.0, 78.0, 84.0, 77.0, 80.0, 78.0, 73.0, 73.0, 79.0, 82.0, 79.0, 78.0, 79.0, 83.0, 81.0, 79.0, 77.0, 79.0, 78.0, 81.0, 69.0, 78.0, 84.0, 77.0, 80.0, 78.0, 73.0, 73.0, 79.0, 82.0, 79.0, 78.0, 79.0, 83.0, 81.0, 79.0, 77.0, 79.0, 78.0, 81.0]
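The two outputs above are related by a simple conversion, sketched here by hand. `age_in_months` is my own illustrative function, and treating a day as 1/30 of a month is an assumption; since every age in this dataset has 0 days, that choice does not affect these values.

```python
def age_in_months(age):
    """Convert a (years, months, days) tuple to months as a float."""
    years, months, days = age
    # Assumption: one day counts as 1/30 of a month.
    return years * 12 + months + days / 30

first_three = [(5, 9, 0), (6, 6, 0), (7, 0, 0)]
print([age_in_months(a) for a in first_three])  # [69.0, 78.0, 84.0]
```

Note that the list of 60 ages is the same run of 20 ages repeated three times, once per genre directory.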

In [15]:
# dates_of_recording()
# returns dates of recording

# recording date by file
first.dates_of_recording(by_files=True)

# unique recording dates
first.dates_of_recording()

[{datetime.date(1987, 10, 8)}, {datetime.date(1987, 10, 14)}, {datetime.date(1987, 10, 15)}, {datetime.date(1987, 10, 16)}, {datetime.date(1987, 10, 16)}, {datetime.date(1987, 10, 19)}, {datetime.date(1987, 10, 19)}, {datetime.date(1987, 10, 19)}, {datetime.date(1987, 10, 22)}, {datetime.date(1987, 10, 23)}, {datetime.date(1987, 10, 29)}, {datetime.date(1987, 10, 19)}, {datetime.date(1987, 10, 30)}, {datetime.date(1987, 10, 30)}, {datetime.date(1987, 11, 5)}, {datetime.date(1987, 11, 5)}, {datetime.date(1987, 11, 5)}, {datetime.date(1987, 11, 6)}, {datetime.date(1987, 11, 6)}, {datetime.date(1987, 11, 6)}, {datetime.date(1987, 10, 8)}, {datetime.date(1987, 10, 14)}, {datetime.date(1987, 10, 15)}, {datetime.date(1987, 10, 16)}, {datetime.date(1987, 10, 16)}, {datetime.date(1987, 10, 19)}, {datetime.date(1987, 10, 19)}, {datetime.date(1987, 10, 19)}, {datetime.date(1987, 10, 22)}, set(), {datetime.date(1987, 10, 29)}, {datetime.date(1987, 10, 29)}, {datetime.date(1987, 10, 30)}, {datetim

{datetime.date(1987, 11, 6), datetime.date(1987, 10, 22), datetime.date(1987, 10, 30), datetime.date(1987, 10, 16), datetime.date(1987, 10, 23), datetime.date(1987, 10, 29), datetime.date(1987, 10, 8), datetime.date(1987, 10, 14), datetime.date(1987, 10, 19), datetime.date(1987, 11, 5), datetime.date(1987, 10, 15)}

All of the first-grade recordings are from 1987 (October 8 to November 6).

In [16]:
# languages()
# returns language info 

first.languages()

{'eng'}

As expected, this dataset is all English.

In [17]:
# participants()
# returns participants

first.participants()

{'RES', 'XXX', 'CHI'}

Note to self: look up what the 'XXX' participant label is.

&darr;

According to the [CHAT transcription manual](https://talkbank.org/manuals/CHAT.pdf), XXX stands for the three-letter speaker ID, which here would be RES (researcher) or CHI (child). The XXX showing up here could be a leftover from a transcript where the template placeholder was never replaced with the actual ID.

In [18]:
# full header information in a dict 

first.headers()[0]
first.headers()[-1]

{'UTF8': '', 'PID': '11312/c-00025368-1', 'Languages': ['eng'], 'Participants': {'CHI': {'name': 'Ari', 'language': 'eng', 'corpus': 'Hicks', 'age': '5;09.', 'sex': 'female', 'group': '', 'ses': '', 'role': 'Target_Child', 'education': '', 'custom': ''}, 'RES': {'name': 'Deborah', 'language': 'eng', 'corpus': 'Hicks', 'age': '', 'sex': '', 'group': '', 'ses': '', 'role': 'Investigator', 'education': '', 'custom': ''}}, 'Comment': 'Task is eventcast', 'Date': {datetime.date(1987, 10, 8)}, 'Types': 'cross, narrative, TD'}

{'UTF8': '', 'PID': '11312/c-00025427-1', 'Languages': ['eng'], 'Participants': {'CHI': {'name': 'Emily', 'language': 'eng', 'corpus': 'Hicks', 'age': '6;09.', 'sex': 'female', 'group': '', 'ses': '', 'role': 'Target_Child', 'education': '', 'custom': ''}, 'RES': {'name': 'Deborah', 'language': 'eng', 'corpus': 'Hicks', 'age': '', 'sex': '', 'group': '', 'ses': '', 'role': 'Investigator', 'education': '', 'custom': ''}}, 'Comment': 'Task is story', 'Date': {datetime.date(1987, 11, 6)}, 'Types': 'cross, narrative, TD'}

There's not a whole lot of information in the headers of this dataset, but for demographic information, I think I will want to have access to the participants' age and sex. Not sure if I need their names? The grade level comes from the file name. As for additional information, it might be nice to see the comments, though, actually, this might only tell me about the narrative genre (which I also already know from the file name).

*Edit: For this project, what matters is the actual text and the dependent tiers, mainly the %cod tier.*
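If the demographic fields do turn out to be useful, they can be pulled out of a header dict like the ones printed above. This is a sketch only: `parse_chat_age` and `child_record` are hypothetical helpers, the header below is trimmed from the first output, and pairing it with the first file path is my assumption.

```python
import re

# CHAT ages look like "years;months.days"; months and days may be missing.
AGE_RE = re.compile(r"^(\d+);(\d*)\.?(\d*)$")

def parse_chat_age(age):
    """Turn a CHAT age string such as '5;09.' into (years, months, days)."""
    match = AGE_RE.match(age)
    if match is None:
        return None  # e.g. the investigator's empty age field
    years, months, days = match.groups()
    return int(years), int(months or 0), int(days or 0)

def child_record(header, file_path):
    """Collect the child's age and sex, plus grade and genre from the path."""
    chi = header["Participants"]["CHI"]
    _, grade, genre, _ = file_path.split("/")
    return {"age": parse_chat_age(chi["age"]), "sex": chi["sex"],
            "grade": grade, "genre": genre}

header = {
    "Participants": {
        "CHI": {"name": "Ari", "age": "5;09.", "sex": "female"},
        "RES": {"name": "Deborah", "age": "", "sex": ""},
    },
    "Comment": "Task is eventcast",
}
print(child_record(header, "Hicks/1st/event/evt004.cha"))
```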

#### Accessing transcriptions and annotations: 

For the purposes of my project, the transcriptions and annotations are what I actually need.

- asterisk that comes before the participant code signals a transcription line
- transcriptions are word-segmented by spaces
- punctuation treated as words
- dependent tiers marked by a % 
    - %mor tier
    - %gra tier
    - %cod tier (only on the child's (*CHI) utterances in this dataset)

In [19]:
# only care about CHI utterances
# group by utterance to create a list of lists

for file in fifth[:2]:
    print(file.file_paths(), file.words(participants="CHI", by_utterances=True))

['Hicks/5th/event/evt01.cha'] [['he', 'puts', 'his', 'balloon', '.'], ['and', 'says', '.'], ["what's", 'he', 'saying', "what's", 'he', 'saying', '?'], ["he's", 'saying', '"/.'], ['stay', 'there', '!'], ['he', 'goes', 'into', 'the', 'bakery', 'shop', '.'], ['and', 'he', 'gets', 'something', '.'], ['oh', 'no', 'along', 'come', 'the', 'mean', 'boys', '.'], ["they're", 'looking', 'too', '.'], ['ah', "they're", 'looking', 'too', '.'], ["they're", 'feasting', 'their', 'eyes', '.'], ['agh', 'one', 'boy', 'spots', 'the', 'balloon', '!'], ['oh', 'they', 'run', '.'], ['and', 'they', 'catch', 'it', '.'], ['and', 'they', 'grab', 'it', '.'], ['and', 'they', 'run', '.'], ['the', 'little', 'boy', 'walks', 'out', 'of', 'the', 'pastry', 'shop', '.'], ['and', 'looks', 'for', 'his', 'balloon', '.'], ['he', 'walks', 'out', '.'], ['he', 'looks', 'up', '.'], ['he', 'looks', 'down', '.'], ['here', 'go', 'the', 'boys', '.'], ['up', 'the', 'hill', 'to', 'the', 'big', 'top', 'carrying', 'the', 'balloon', '.'], 

In [20]:
# .tokens() gives word-based annotations

hicks_tokens = hicks.tokens(participants="CHI")
hicks_tokens[:2]

[Token(word='so', pos='adv', mor='so', gra=Gra(dep=1, head=2, rel='JCT')), Token(word='then', pos='adv:tem', mor='then', gra=Gra(dep=2, head=4, rel='JCT'))]

In [21]:
for token in hicks_tokens[:2]:
    print("word:", token.word)
    print("part-of-speech tag:", token.pos)
    print("morphological information:", token.mor)
    print("grammatical relation:", token.gra, '\n')

word: so
part-of-speech tag: adv
morphological information: so
grammatical relation: Gra(dep=1, head=2, rel='JCT') 

word: then
part-of-speech tag: adv:tem
morphological information: then
grammatical relation: Gra(dep=2, head=4, rel='JCT') 
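For working with a raw %gra string rather than `Token` objects, the `dep|head|rel` triples can be parsed by hand. A minimal sketch follows; `Relation`, `parse_gra`, and `find_root` are hypothetical names of mine that mirror the fields of the `Gra` objects printed above.

```python
from typing import NamedTuple

class Relation(NamedTuple):
    dep: int   # position of the dependent word (1-based)
    head: int  # position of its head; 0 means the word is the root
    rel: str   # grammatical relation label, e.g. SUBJ or JCT

def parse_gra(gra_tier):
    """Parse a %gra string such as '1|2|JCT 2|4|JCT' into Relation tuples."""
    relations = []
    for item in gra_tier.split():
        dep, head, rel = item.split("|")
        relations.append(Relation(int(dep), int(head), rel))
    return relations

def find_root(relations):
    """Return the position of the first word whose head is 0, if any."""
    return next((r.dep for r in relations if r.head == 0), None)

gra = "1|2|JCT 2|4|JCT 3|4|SUBJ 4|0|ROOT 5|6|DET 6|4|OBJ 7|4|JCT 8|9|DET 9|7|POBJ 10|4|PUNCT"
relations = parse_gra(gra)
print(relations[0])
print(find_root(relations))
```

Checking `head == 0` rather than the `ROOT` label also catches the `INCROOT` relations seen in short utterances like "yeah .".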



In [22]:
# utterances - info beyond tokens
# filter by CHI utterances

hicks_utterances = hicks.utterances(participants="CHI")

In [23]:
hicks_utterances[0]

0,1,2,3,4,5,6,7,8,9,10
*CHI:,so,then,they,catch,the,balloon,around,the,bakery,.
%mor:,adv|so,adv:tem|then,pro:sub|they,v|catch,det:art|the,n|balloon,prep|around,det:art|the,n|bakery,.
%gra:,1|2|JCT,2|4|JCT,3|4|SUBJ,4|0|ROOT,5|6|DET,6|4|OBJ,7|4|JCT,8|9|DET,9|7|POBJ,10|4|PUNCT
%cod:,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal


Coding Tier `%cod`:

This is the general purpose coding tier. It can be used for mixing codes into a single
tier for economy or ease of entry. 

Here is an example:

*MOT: you want Mommy to do it?

`%cod: $MLU=6 $NMV=2 $RDE $EXP`

So, it looks like the %cod tier annotations do not line up with the tokens...interesting. This makes things a bit more challenging since I don't know what tokens are being annotated. 
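Since the codes attach to the whole utterance, one option is to treat each %cod string as a list of utterance-level codes. A minimal sketch, where `parse_cod` is a hypothetical helper and reading the part after the first `:` as a "value" is my assumption about the coding scheme:

```python
def parse_cod(cod):
    """Split a %cod string into (code, value) pairs; value is None if absent."""
    codes = []
    for raw in cod.split():
        # Drop the leading "$", then split on the first ":" only.
        name, _, value = raw.lstrip("$").partition(":")
        codes.append((name, value or None))
    return codes

print(parse_cod("$ind $pr $t:sq $str $pro:kds $np:bal"))
```

Codes with more than one colon, such as `$c:c:pur`, keep everything after the first colon in the value (`"c:pur"`).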

In [24]:
tiers = [u.tiers for u in hicks_utterances]
tiers[0]

{'CHI': 'so then they catch [!] the balloon around the bakery .', '%mor': 'adv|so adv:tem|then pro:sub|they v|catch det:art|the n|balloon prep|around det:art|the n|bakery .', '%gra': '1|2|JCT 2|4|JCT 3|4|SUBJ 4|0|ROOT 5|6|DET 6|4|OBJ 7|4|JCT 8|9|DET 9|7|POBJ 10|4|PUNCT', '%cod': '$ind $pr $t:sq $str $pro:kds $np:bal'}

In [25]:
# formatted view of all tiers

for u in hicks_utterances[:5]:
    print('text tier  ', u.tiers['CHI'])
    print('mor tier:  ', u.tiers['%mor'])
    print('gra tier:  ', u.tiers['%gra'])
    print('cod tier:  ', u.tiers['%cod'], '\n')

text tier   so then they catch [!] the balloon around the bakery .
mor tier:   adv|so adv:tem|then pro:sub|they v|catch det:art|the n|balloon prep|around det:art|the n|bakery .
gra tier:   1|2|JCT 2|4|JCT 3|4|SUBJ 4|0|ROOT 5|6|DET 6|4|OBJ 7|4|JCT 8|9|DET 9|7|POBJ 10|4|PUNCT
cod tier:   $ind $pr $t:sq $str $pro:kds $np:bal 

text tier   and then the boy comes out .
mor tier:   coord|and adv:tem|then det:art|the n|boy v|come-3S adv|out .
gra tier:   1|5|LINK 2|5|JCT 3|4|DET 4|5|SUBJ 5|0|ROOT 6|5|JCT 7|5|PUNCT
cod tier:   $ind $pr $t:sq $np:pas 

text tier   and looks for his balloon .
mor tier:   coord|and v|look-3S prep|for det:poss|his n|balloon .
gra tier:   1|2|LINK 2|0|ROOT 3|2|JCT 4|5|DET 5|3|POBJ 6|2|PUNCT
cod tier:   $ind $pr $zero:pas $np:bal 

text tier   and he walks out .
mor tier:   coord|and pro:sub|he v|walk-3S adv|out .
gra tier:   1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT
cod tier:   $ind $pr $pro:pas 

text tier   and he looks up .
mor tier:   coord|and pro:sub|he v|

In [26]:
# create variables per tier

text_tier = [u.tiers['CHI'] for u in hicks_utterances]
mor_tier = [u.tiers['%mor'] for u in hicks_utterances]
gra_tier = [u.tiers['%gra'] for u in hicks_utterances]
cod_tier = [u.tiers['%cod'] for u in hicks_utterances if '%cod' in u.tiers] # thank you Man Ho!

In [27]:
text_tier[:2]

['so then they catch [!] the balloon around the bakery .', 'and then the boy comes out .']

In [28]:
cod_tier[:5]
len(cod_tier)

['$ind $pr $t:sq $str $pro:kds $np:bal', '$ind $pr $t:sq $np:pas', '$ind $pr $zero:pas $np:bal', '$ind $pr $pro:pas', '$ind $pr $pro:pas']

8123

In [29]:
print('mor:', len(mor_tier), 'gra:', len(gra_tier), 'cod:', len(cod_tier))

mor: 8992 gra: 8992 cod: 8123


So the tiers are not all equally represented: there are 8992 text tiers, 8992 mor tiers, and 8992 gra tiers, but only 8123 cod tiers (a difference of 869).
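One consequence worth spelling out: because `cod_tier` skips utterances without a %cod tier, its positions no longer line up with `text_tier`. A sketch of two ways around that, using small made-up `sample_tiers` dicts in the same shape as `u.tiers` (the two utterances are taken from this dataset, one with codes and one without):

```python
sample_tiers = [
    {"CHI": "and they walk away .", "%cod": "$ind $pr $pro:kds"},
    {"CHI": "and tried to (.) hide it ."},  # no %cod tier on this utterance
]

# Keep one entry per utterance so positions stay aligned; None marks a gap.
aligned_cod = [t.get("%cod") for t in sample_tiers]

# List the utterances that have no %cod tier at all.
missing = [t["CHI"] for t in sample_tiers if "%cod" not in t]

print(aligned_cod)
print(missing)
```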

In [30]:
mor_tier[:5]
gra_tier[:5]
cod_tier[:5]

['adv|so adv:tem|then pro:sub|they v|catch det:art|the n|balloon prep|around det:art|the n|bakery .', 'coord|and adv:tem|then det:art|the n|boy v|come-3S adv|out .', 'coord|and v|look-3S prep|for det:poss|his n|balloon .', 'coord|and pro:sub|he v|walk-3S adv|out .', 'coord|and pro:sub|he v|look-3S adv|up .']

['1|2|JCT 2|4|JCT 3|4|SUBJ 4|0|ROOT 5|6|DET 6|4|OBJ 7|4|JCT 8|9|DET 9|7|POBJ 10|4|PUNCT', '1|5|LINK 2|5|JCT 3|4|DET 4|5|SUBJ 5|0|ROOT 6|5|JCT 7|5|PUNCT', '1|2|LINK 2|0|ROOT 3|2|JCT 4|5|DET 5|3|POBJ 6|2|PUNCT', '1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT', '1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT']

['$ind $pr $t:sq $str $pro:kds $np:bal', '$ind $pr $t:sq $np:pas', '$ind $pr $zero:pas $np:bal', '$ind $pr $pro:pas', '$ind $pr $pro:pas']

In [31]:
filenames = [f for f in hicks.file_paths()]

In [32]:
filenames[:5]

['Hicks/1st/event/evt004.cha', 'Hicks/1st/event/evt005.cha', 'Hicks/1st/event/evt009.cha', 'Hicks/1st/event/evt010.cha', 'Hicks/1st/event/evt012.cha']

## 4. Creating DataFrame

I will use the dictionary output from calling tiers on the CHI utterances to create a DataFrame.

In [33]:
# variable reminder
# tiers = [u.tiers for u in hicks_utterances]

tiers[0]

{'CHI': 'so then they catch [!] the balloon around the bakery .', '%mor': 'adv|so adv:tem|then pro:sub|they v|catch det:art|the n|balloon prep|around det:art|the n|bakery .', '%gra': '1|2|JCT 2|4|JCT 3|4|SUBJ 4|0|ROOT 5|6|DET 6|4|OBJ 7|4|JCT 8|9|DET 9|7|POBJ 10|4|PUNCT', '%cod': '$ind $pr $t:sq $str $pro:kds $np:bal'}

In [34]:
# create DataFrame

tiers_df = pd.DataFrame(tiers)

In [35]:
tiers_df.sample(10)

Unnamed: 0,CHI,%mor,%gra,%cod,%com
131,now the little boy sees it .,adv|now det:art|the adj|little n|boy v|see-3S ...,1|5|JCT 2|4|DET 3|4|MOD 4|5|SUBJ 5|0|ROOT 6|5|...,$ind $pr $npap:pas $pro:bal,
3064,ooh (.) I think he's trying to open the door .,v|ooh pro:sub|I v|think pro:sub|he~aux|be&3S p...,1|0|ROOT 2|3|SUBJ 3|1|COMP 4|6|SUBJ 5|6|AUX 6|...,$ind $pr $pg $modv $pro:pas,
6429,and tried to (.) hide it .,coord|and v|try-PAST inf|to v|hide pro:per|it .,1|2|LINK 2|0|ROOT 3|4|INF 4|2|COMP 5|4|OBJ 6|2...,,
7586,and they walk away .,coord|and pro:sub|they v|walk adv|away .,1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT,$ind $pr $pro:kds,
8282,and noise started to come .,coord|and n|noise v|start-PAST inf|to v|come .,1|3|LINK 2|3|SUBJ 3|0|ROOT 4|5|INF 5|3|COMP 6|...,$ind $pa $a:v,
697,now (.) he's checking .,adv|now pro:sub|he~aux|be&3S part|check-PRESP .,1|4|JCT 2|4|SUBJ 3|4|AUX 4|0|ROOT 5|4|PUNCT,$mc $pr $pg $pro:pas,
2956,now the boys went over the wall .,adv|now det:art|the n|boy-PL v|go&PAST prep|ov...,1|4|JCT 2|3|DET 3|4|SUBJ 4|0|ROOT 5|4|JCT 6|7|...,$ind $pa $np:kds,
1132,to stay right there .,inf|to v|stay adv|right adv|there .,1|2|INF 2|0|ROOT 3|4|JCT 4|2|JCT 5|2|PUNCT,$cmp $inf $emp:bal,
5709,and they catch glimpse of the balloon .,coord|and pro:sub|they v|catch n|glimpse prep|...,1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|OBJ 5|4|NJCT 6|...,$mc $pr $note,
6237,now he's going into the shop to get something ...,adv|now pro:sub|he~aux|be&3S part|go-PRESP pre...,1|4|JCT 2|4|SUBJ 3|4|AUX 4|0|ROOT 5|4|JCT 6|7|...,$c:c:pur,


This is looking pretty good overall. I will need to change the column names because the % might cause issues, and I want the columns to be more self-documenting than they are currently. I was not expecting to get %com, which I am fairly certain is the comments tier, and it looks like most of the values are NaN, so I will remove this column. I do not need it for the purposes of my analysis.

In [36]:
# get rid of %com column

tiers_df = tiers_df.drop(columns='%com')
tiers_df.sample(2)


Unnamed: 0,CHI,%mor,%gra,%cod
2421,he went out of church with it .,pro:sub|he v|go&PAST adv|out prep|of n|church ...,1|2|SUBJ 2|0|ROOT 3|2|JCT 4|3|JCT 5|4|POBJ 6|2...,$mc $pa $pro:pas $pro:bal
3251,to come back .,inf|to v|come adv|back .,1|2|INF 2|0|ROOT 3|2|JCT 4|2|PUNCT,$cmp $inf $emp:bal


In [37]:
# change column names to remove the %

tiers_df.columns = ['CHI_text', 'mor_tier', 'gra_tier', 'cod_tier']

In [38]:
tiers_df

Unnamed: 0,CHI_text,mor_tier,gra_tier,cod_tier
0,so then they catch [!] the balloon around the ...,adv|so adv:tem|then pro:sub|they v|catch det:a...,1|2|JCT 2|4|JCT 3|4|SUBJ 4|0|ROOT 5|6|DET 6|4|...,$ind $pr $t:sq $str $pro:kds $np:bal
1,and then the boy comes out .,coord|and adv:tem|then det:art|the n|boy v|com...,1|5|LINK 2|5|JCT 3|4|DET 4|5|SUBJ 5|0|ROOT 6|5...,$ind $pr $t:sq $np:pas
2,and looks for his balloon .,coord|and v|look-3S prep|for det:poss|his n|ba...,1|2|LINK 2|0|ROOT 3|2|JCT 4|5|DET 5|3|POBJ 6|2...,$ind $pr $zero:pas $np:bal
3,and he walks out .,coord|and pro:sub|he v|walk-3S adv|out .,1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT,$ind $pr $pro:pas
4,and he looks up .,coord|and pro:sub|he v|look-3S adv|up .,1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT,$ind $pr $pro:pas
...,...,...,...,...
8987,and throwing rocks at it .,coord|and part|throw-PRESP n|rock-PL prep|at p...,1|0|INCROOT 2|1|COORD 3|2|OBJ 4|2|JCT 5|4|POBJ...,$ind $pa $pg $zero:kds $pro:bal
8988,and then [//] and one of the rocks hit it .,coord|and coord|and pro:indef|one prep|of det:...,1|7|LINK 2|7|LINK 3|7|SUBJ 4|3|NJCT 5|6|DET 6|...,$ind $pa $t:sq $pro:bal
8989,and then it popped .,coord|and adv:tem|then pro:per|it v|pop-PAST .,1|4|LINK 2|4|JCT 3|4|SUBJ 4|0|ROOT 5|4|PUNCT,$ind $pa $t:sq $pro:bal
8990,and then [/] and then all the people loosed th...,coord|and coord|and adv:tem|then qn|all det:ar...,1|7|LINK 2|7|LINK 3|7|JCT 4|6|QUANT 5|6|DET 6|...,$ind $pa $t:sq


In [39]:
tiers_df.info()
tiers_df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8992 entries, 0 to 8991
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   CHI_text  8992 non-null   object
 1   mor_tier  8992 non-null   object
 2   gra_tier  8992 non-null   object
 3   cod_tier  8123 non-null   object
dtypes: object(4)
memory usage: 281.1+ KB


Unnamed: 0,CHI_text,mor_tier,gra_tier,cod_tier
count,8992,8992,8992,8123
unique,8217,8032,4042,3057
top,to see .,inf|to v|see .,1|3|LINK 2|3|SUBJ 3|0|ROOT 4|5|DET 5|3|OBJ 6|3...,$ind $pa $pro:pas
freq,21,21,137,269


As I discovered earlier, and as this confirms, not all of the child utterances have a %cod tier. Frustrating, but I will have to deal with it. It means that I may not catch every instance of the phenomenon I am analyzing, since I am working specifically from the %cod annotations. I can also see here that many of the %cod annotations must be identical, since there are only 3057 unique values out of a total of 8123.
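To see which annotations repeat, the %cod strings can be counted both as whole strings and as individual codes. A small sketch with `collections.Counter`, using the first five %cod values shown earlier as a stand-in for the full column:

```python
from collections import Counter

cod_sample = [
    "$ind $pr $t:sq $str $pro:kds $np:bal",
    "$ind $pr $t:sq $np:pas",
    "$ind $pr $zero:pas $np:bal",
    "$ind $pr $pro:pas",
    "$ind $pr $pro:pas",
]

# How often each complete annotation string occurs.
whole_strings = Counter(cod_sample)

# How often each individual code occurs across all annotations.
single_codes = Counter(code for cod in cod_sample for code in cod.split())

print(whole_strings.most_common(1))
print(single_codes["$ind"], single_codes["$t:sq"])
```

On the real data, the same two counters could be built from `tiers_df['cod_tier'].dropna()`.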

## 5. Linguistic phenomenon of interest

The easiest way to search through all the CHAT files was to grep through them on the command line. I initially wanted to find instances of verbs participating in the dative alternation ("give", "tell").

- ex) from the local *Hicks* directory: `grep -i 'told' */*/*cha | grep 'CHI'` (only want child lines)

This went nowhere: the dataset features speech from children reporting the events they saw in a short, simple video, and no utterances using these verbs turned up.

In my original project plan I noted that "verb forms, aspectual markers, timemarkers and logical connectors" would be my focus in part because that is what the Hicks dataset claimed to be exploring and therefore annotating. Looking through the %cod tier and comparing it against the coding scheme (Coding 1.1 section of https://childes.talkbank.org/access/Eng-NA/Hicks.html), I do not see much evidence that the researchers followed through with this path of exploration.

However, one thing I have been noticing that comes up a lot is the code $modv which stands for "modal verb." This is almost always some form of “try to”. It’s most likely so prevalent because of the nature of the task (getting kids to talk about something they have just watched) and probably has something to do with the kids not wanting to say something actually happened with certainty because they only witnessed what happened in a video, not in real life. 
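One way to check the "almost always some form of try to" impression would be to test the %mor tier of each `$modv` utterance for a form of "try" followed by infinitival "to". The pattern below is a sketch under assumptions: that "try" is tagged `v|` or `part|` (as other verbs and participles are in this dataset) and that `inf|to` follows it directly. `has_try_to` is a hypothetical helper.

```python
import re

# Assumption: "try" is tagged v| or part| and is directly followed by inf|to.
TRY_TO = re.compile(r"(?:^|\s)(?:v|part)\|try\S*\s+inf\|to(?=\s|$)")

def has_try_to(mor):
    """Return True if a %mor string contains a 'try to' sequence."""
    return TRY_TO.search(mor) is not None

mor_sample = [
    "coord|and v|try-PAST inf|to v|hide pro:per|it .",
    "coord|and n|noise v|start-PAST inf|to v|come .",
    "coord|and pro:sub|they v|walk adv|away .",
]
print([has_try_to(m) for m in mor_sample])  # [True, False, False]
```

Searching the %mor tier instead of the text tier avoids pauses such as `(.)` that can sit between "tried to" and the next word in the transcription.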

This phenomenon is probably highly affected by the narrative task, so it is not really naturalistic speech. Then again, “try to” constructions are interesting because the action being “tried” is completed to varying degrees.

How could the UDS framework apply to this phenomenon?

### 5-1. "try to" constructions 

from *Hicks* directory: `grep -i '$modv' */*/*cha`
- goes through *all* files (5 subdirectories categorized by child age/subdirectories categorized by genre/files) and grabs coding tiers that have the `$modv` (modal verb) code
- only the participant (*CHI) utterances should have the %cod tier, and the `$modv` annotation only appears in the %cod tier, so I didn't need to restrict the search further with `grep 'CHI'` (the matched %cod lines do not contain the speaker ID, so piping to it would filter them all out)
- use -B# to grab # lines before and see what the text of the utterance is
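The same search can be sketched in Python, which makes it easier to keep each `$modv` line together with its utterance text. This is an illustration rather than what I ran: `utterances_with_code` is a hypothetical helper, the sample lines are a simplified stand-in for a `.cha` file, and it assumes each tier sits on a single line (no tab-indented continuation lines).

```python
def utterances_with_code(lines, code="$modv", speaker="*CHI:"):
    """Pair each %cod line containing `code` with the utterance it annotates."""
    hits = []
    current = None  # most recent main-tier line (these start with "*")
    for line in lines:
        if line.startswith("*"):
            current = line
        elif (line.startswith("%cod:") and code in line.split()
              and current is not None and current.startswith(speaker)):
            hits.append((current, line))
    return hits

sample_lines = [
    "*CHI:\tooh (.) I think he's trying to open the door .",
    "%cod:\t$ind $pr $pg $modv $pro:pas",
    "*RES:\twow .",
    "*CHI:\tand they walk away .",
    "%cod:\t$ind $pr $pro:kds",
]
for text, cod in utterances_with_code(sample_lines):
    print(text)
    print(cod)
```

For the real files, each file's lines could be read with `pathlib.Path.read_text().splitlines()` and passed to the same function.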