# Initial exploration of CHILDES Narrative English Hicks Corpus

## Caroline Gish | cng18@pitt.edu | for 2022-02-24

---

*This notebook accounts for work done for progress report 1 and progress report 2.*

**Source:** https://childes.talkbank.org/access/Eng-NA/Hicks.html 
(also has description of research goals and annotation schema)

Initial exploration will involve trying to read in the CHAT files. 

**Resources:**

- .cha file viewer: https://filext.com/file-extension/CHA

- paper on child genre development: https://www.jstor.org/stable/40171463?seq=1#metadata_info_tab_contents

- CLAN documentation: https://talkbank.org/manuals/CLAN.pdf

- CHAT documentation: https://talkbank.org/manuals/CHAT.pdf

**Relevant tiers:**

- %mor: Morphological Tier
    - This tier codes morphemic segments by type and part of speech.

- %gra: Grammatical Relations Tier            
    - This tier is used to code dependency structures with tagged grammatical relations (Sagae, Davis, Lavie, MacWhinney, & Wintner, 2007; Sagae, Lavie, & MacWhinney, 2005; Sagae, MacWhinney, & Lavie, 2004).

- %cod: Coding Tier                                       
    - This is the general purpose coding tier. It can be used for mixing codes into a single tier for economy or ease of entry.
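
To make the %gra format concrete, here is a minimal sketch that splits one %gra string into (position, head, relation) triples using plain string handling. The `GraItem` and `parse_gra` names are my own and are not part of any CHAT library. The sample string is one researcher utterance from this corpus.

```python
from typing import NamedTuple


class GraItem(NamedTuple):
    """One %gra entry: token position, head position, relation label."""
    dep: int
    head: int
    rel: str


def parse_gra(gra_tier: str) -> list:
    """Split a %gra string such as '1|2|JCT 2|3|SUBJ' into typed triples."""
    items = []
    for chunk in gra_tier.split():
        dep, head, rel = chunk.split("|")
        items.append(GraItem(int(dep), int(head), rel))
    return items


# %gra string for the utterance "so this is again Deborah and Ari ."
gra = "1|2|JCT 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PRED 6|5|CONJ 7|6|COORD 8|3|PUNCT"
parsed = parse_gra(gra)

# The root of the dependency structure is the item whose head is 0.
root = next(item for item in parsed if item.head == 0)
print(root)
```
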

In [1]:
import numpy as np
import pandas as pd
import pickle as pkl

%pprint

Pretty printing has been turned OFF


In [2]:
# every returned Out[] displayed  
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Trying out `chamd`:

In [3]:
from chamd import ChatReader

In [4]:
reader = ChatReader()
chat = reader.read_file('../data/Hicks/1st/event/evt004.cha') # or read_string

#for item in chat.metadata:
    #print(item)
#for line in chat.lines:
    #for item in line.metadata:
        #print(item)
    #print(line.text)

In [5]:
help(ChatReader)

Help on class ChatReader in module chamd.chat_reader:

class ChatReader(builtins.object)
 |  Methods defined here:
 |  
 |  __init__(self)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  read_file(self, filename: str) -> chamd.chat_reader.ChatFile
 |  
 |  read_string(self, content: str, filename: str) -> chamd.chat_reader.ChatFile
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)



In [6]:
dir(ChatReader)

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'read_file', 'read_string']

In [7]:
#for line in chat.lines:
    #print(line.text)

I've spent wayyy too much time researching and trying to figure out the `chamd` library, so it's time to see if there is anything else I can use.

## Trying out PyLangAcq:

PyLangAcq: Language Acquisition Research in Python: https://pylangacq.org/

reading CHAT data: https://pylangacq.org/read.html

In [8]:
import pylangacq

- "`read_chat()` automatically handles everything behind the scenes for you, from downloading the ZIP file, unzipping it, traversing through the CHAT files found, as well as parsing the files"

In [9]:
# initializing a reader from URL

url = 'https://childes.talkbank.org/data/Eng-NA/Hicks.zip'
hicks = pylangacq.read_chat(url)

In [10]:
hicks.info(verbose=True)  # set 'verbose' to True
                          # to see all files

213 files
14273 utterances
104908 words
        Utterance Count    Word Count  File Path
----  -----------------  ------------  -------------------------------
#1                   66           522  Hicks/1st/event/evt004.cha
#2                   79           560  Hicks/1st/event/evt005.cha
#3                   57           423  Hicks/1st/event/evt009.cha
#4                   58           395  Hicks/1st/event/evt010.cha
#5                   57           469  Hicks/1st/event/evt012.cha
#6                   79           623  Hicks/1st/event/evt016.cha
#7                   41           309  Hicks/1st/event/evt018.cha
#8                   90           658  Hicks/1st/event/evt019.cha
#9                   73           531  Hicks/1st/event/evt021.cha
#10                  82           587  Hicks/1st/event/evt024.cha
#11                  75           553  Hicks/1st/event/evt026.cha
#12                  89           698  Hicks/1st/event/evt027.cha
#13                  77           513  Hicks/1st

This is looking much more promising! Access to different files and the ability to count different things are already built in.

Not sure how/why I didn't find this before `chamd`. PyLangAcq has a lot more documentation (that's also organized well) and it's taking care of everything I need.

In [11]:
hicks.head()
hicks.tail()

0,1,2,3,4,5,6,7,8
*RES:,so,this,is,again,Deborah,and,Ari,.
%mor:,adv|so,pro:dem|this,cop|be&3S,adv|again,n:prop|Deborah,coord|and,n:prop|Ari,.
%gra:,1|2|JCT,2|3|SUBJ,3|0|ROOT,4|3|JCT,5|3|PRED,6|5|CONJ,7|6|COORD,8|3|PUNCT

0,1,2,3,4,5,6,7,8
*RES:,and,we're,CLITIC,gonna,CLITIC,be,sportscasters,.
%mor:,coord|and,pro:sub|we,aux|be&PRES,part|go-PRESP,inf|to,cop|be,n|+n|sports+n|caster-PL,.
%gra:,1|4|LINK,2|4|SUBJ,3|4|AUX,4|0|ROOT,5|6|INF,6|4|COMP,7|6|PRED,8|4|PUNCT

0,1,2,3,4,5,6,7,8,9
*RES:,and,say,what's,CLITIC,happening,in,the,movie,.
%mor:,coord|and,v|say,pro:int|what,aux|be&3S,part|happen-PRESP,prep|in,det:art|the,n|movie,.
%gra:,1|2|LINK,2|0|ROOT,3|5|SUBJ,4|5|AUX,5|2|COMP,6|5|JCT,7|8|DET,8|6|POBJ,9|2|PUNCT

0,1,2,3,4,5,6,7
*RES:,and,I'm,CLITIC,gonna,CLITIC,start,.
%mor:,coord|and,pro:sub|I,aux|be&1S,part|go-PRESP,inf|to,v|start,.
%gra:,1|4|LINK,2|4|SUBJ,3|4|AUX,4|0|ROOT,5|6|INF,6|4|COMP,7|4|PUNCT

0,1,2,3,4,5,6,7,8,9
*RES:,the,little,boy,is,getting,some,money,out,.
%mor:,det:art|the,adj|little,n|boy,aux|be&3S,part|get-PRESP,qn|some,n|money,adv|out,.
%gra:,1|3|DET,2|3|MOD,3|5|SUBJ,4|5|AUX,5|0|ROOT,6|7|QUANT,7|5|OBJ,8|5|JCT,9|5|PUNCT


0,1,2,3,4,5,6,7,8,9,10
*CHI:,and,and,then,all,the,people,loosed,their,balloon,.
%mor:,coord|and,coord|and,adv:tem|then,qn|all,det:art|the,n|person&PL,v|loose-PAST,det:poss|their,n|balloon,.
%gra:,1|7|LINK,2|7|LINK,3|7|JCT,4|6|QUANT,5|6|DET,6|7|SUBJ,7|0|ROOT,8|9|DET,9|7|OBJ,10|7|PUNCT
%cod:,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq,$ind $pa $t:sq

0,1,2,3,4,5,6,7,8,9
*CHI:,and,lots_of,balloons,came,flying,to,the,boy,.
%mor:,coord|and,qn|lots_of,n|balloon-PL,v|come&PAST,part|fly-PRESP,prep|to,det:art|the,n|boy,.
%gra:,1|4|LINK,2|3|QUANT,3|4|SUBJ,4|0|ROOT,5|4|XJCT,6|5|JCT,7|8|DET,8|6|POBJ,9|4|PUNCT
%cod:,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas,$ind $pa $pg $indef:bls $np:pas

0,1,2,3,4,5,6,7,8,9
*EXP:,wow,is,that,the,end,of,the,story,?
%mor:,co|wow,cop|be&3S,comp|that,det:art|the,n|end,prep|of,det:art|the,n|story,?
%gra:,1|2|COM,2|0|ROOT,3|2|PRED,4|5|DET,5|3|OBJ,6|5|NJCT,7|8|DET,8|6|POBJ,9|2|PUNCT

0,1,2
*XXX:,yeah,.
%mor:,co|yeah,.
%gra:,1|0|INCROOT,2|1|PUNCT

0,1,2
*EXP:,great,!
%mor:,adj|great,!
%gra:,1|0|INCROOT,2|1|PUNCT


This is in a great format that preserves all the separation between tokens across all tiers. When a child is speaking, the %cod tier gets tacked on as the fourth and final tier. 

Of note for now: contractions and clitics are displayed in a specific way, with the full word first and then CLITIC as the next "word", so that each item on the %mor and %gra tiers still lines up with a column. The CHAT manual has more on this.
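
A small sketch of how that alignment could be used, working only from the columns printed above. Treating `CLITIC` as "this %mor item still belongs to the previous word" is my reading of the display, not something I have confirmed in the PyLangAcq docs.

```python
# Word and %mor columns copied from the second utterance shown by hicks.head()
words = ["and", "we're", "CLITIC", "gonna", "CLITIC", "be", "sportscasters", "."]
mor = ["coord|and", "pro:sub|we", "aux|be&PRES", "part|go-PRESP", "inf|to",
       "cop|be", "n|+n|sports+n|caster-PL", "."]

# The display keeps both tiers the same length, so zip() pairs them safely.
assert len(words) == len(mor)

aligned = []
surface = None
for word, analysis in zip(words, mor):
    # Assumption: a CLITIC placeholder continues the previous surface word.
    if word != "CLITIC":
        surface = word
    aligned.append((surface, analysis))

for pair in aligned:
    print(pair)
```

With this pairing, "we're" ends up attached to both `pro:sub|we` and `aux|be&PRES`.
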

In [12]:
# all file paths

hicks.file_paths()[:10]
hicks.file_paths()[-10:]

['Hicks/1st/event/evt004.cha', 'Hicks/1st/event/evt005.cha', 'Hicks/1st/event/evt009.cha', 'Hicks/1st/event/evt010.cha', 'Hicks/1st/event/evt012.cha', 'Hicks/1st/event/evt016.cha', 'Hicks/1st/event/evt018.cha', 'Hicks/1st/event/evt019.cha', 'Hicks/1st/event/evt021.cha', 'Hicks/1st/event/evt024.cha']

['Hicks/Kinder/story/story046.cha', 'Hicks/Kinder/story/story047.cha', 'Hicks/Kinder/story/story051.cha', 'Hicks/Kinder/story/story052.cha', 'Hicks/Kinder/story/story053.cha', 'Hicks/Kinder/story/story054.cha', 'Hicks/Kinder/story/story056.cha', 'Hicks/Kinder/story/story057.cha', 'Hicks/Kinder/story/story058.cha', 'Hicks/Kinder/story/story059.cha']

In [13]:
# setting variables for grade level

first = pylangacq.read_chat(url, match="1st")
second = pylangacq.read_chat(url, match="2nd")
fifth = pylangacq.read_chat(url, match="5th")
Del = pylangacq.read_chat(url, match="Del")  # only has event and story, though
kinder = pylangacq.read_chat(url, match="Kinder")

The "Del" directory of files only contains narratives of the event and story genre (no report genre). To keep things consistent across all the grade levels, I may end up not using the "Del" files. I'll keep them in for now, though.

In [14]:
for file in first, second, fifth, Del, kinder:
    print('=====\n')
    print('number of files:', file.n_files())
    print('file path example:', file.file_paths()[:2])
    print('number of files per genre:', int(file.n_files()/3))
    print('total number of utterances:', len(file.utterances()))
    print('total number of words:', len(file.words()), '\n')

=====

number of files: 60
file path example: ['Hicks/1st/event/evt004.cha', 'Hicks/1st/event/evt005.cha']
number of files per genre: 20
total number of utterances: 3981
total number of words: 28408 

=====

number of files: 54
file path example: ['Hicks/2nd/event/evt001.cha', 'Hicks/2nd/event/evt002.cha']
number of files per genre: 18
total number of utterances: 3843
total number of words: 26823 

=====

number of files: 15
file path example: ['Hicks/5th/event/evt01.cha', 'Hicks/5th/event/evt02.cha']
number of files per genre: 5
total number of utterances: 1233
total number of words: 8545 

=====

number of files: 24
file path example: ['Hicks/Del/event/evt01.cha', 'Hicks/Del/event/evt02.cha']
number of files per genre: 8
total number of utterances: 1408
total number of words: 10560 

=====

number of files: 60
file path example: ['Hicks/Kinder/event/evt013.cha', 'Hicks/Kinder/event/evt014.cha']
number of files per genre: 20
total number of utterances: 3808
total number of words: 2724

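There are far fewer files for fifth-grade students: with 3 genres, that works out to only 5 files per genre. Note that the "files per genre" figure printed for Del (8) also divides by 3, but Del only has two genres (event and story), so it really has 12 files per genre.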
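
Because the Del directory has no report files, dividing `n_files()` by 3 does not hold for every grade. A minimal sketch of counting files per grade and genre straight from the file paths instead; the `grade_and_genre` helper is hypothetical, and the paths are a handful copied from the listings in this notebook.

```python
from collections import Counter
from pathlib import PurePosixPath

# Paths follow the pattern Hicks/<grade>/<genre>/<file>.cha
paths = [
    "Hicks/1st/event/evt004.cha",
    "Hicks/1st/report/rep004.cha",
    "Hicks/1st/story/story004.cha",
    "Hicks/Del/event/evt01.cha",
    "Hicks/Kinder/story/story046.cha",
]


def grade_and_genre(path):
    """Return the (grade, genre) directory names from a corpus file path."""
    _, grade, genre, _ = PurePosixPath(path).parts
    return grade, genre


# Count files for each (grade, genre) combination that actually occurs.
counts = Counter(grade_and_genre(p) for p in paths)
for (grade, genre), n in sorted(counts.items()):
    print(grade, genre, n)
```

On the real reader, `paths = hicks.file_paths()` should give the full picture without assuming three genres per grade.
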

In [15]:
# setting variables for narrative genre

event = pylangacq.read_chat(url, match="event")
report = pylangacq.read_chat(url, match="report")
story = pylangacq.read_chat(url, match="story")

In [16]:
for file in event, report, story:
    print('=====\n')
    print('number of files:', file.n_files())
    print('file path example:', file.file_paths()[:2])
    print('total number of utterances:', len(file.utterances()))
    print('total number of words:', len(file.words()), '\n')

=====

number of files: 75
file path example: ['Hicks/1st/event/evt004.cha', 'Hicks/1st/event/evt005.cha']
total number of utterances: 5623
total number of words: 38753 

=====

number of files: 63
file path example: ['Hicks/1st/report/rep004.cha', 'Hicks/1st/report/rep005.cha']
total number of utterances: 3331
total number of words: 24338 

=====

number of files: 75
file path example: ['Hicks/1st/story/story004.cha', 'Hicks/1st/story/story005.cha']
total number of utterances: 5319
total number of words: 38486 



This seems to match up - remember the "Del" directory did not have any report files (so 63 report files instead of 75), but it did have event and story files. 

So, overall, it looks like the sets (by genre) are comparable in size. 

I am not sure yet in what way I want to apply the UDS framework to this CHILDES narrative dataset - across age or narrative genre? 

There is a lot more I can do with `pylangacq` than this, but I only have the basic info about the datasets for now because I exhausted so much of my energy trying to get `chamd` to work... 

### More exploration:

#### Accessing CHAT headers:

In [17]:
# ages()
# returns CHI ages in tuple of 3 integers
# (years, months, days)

# ages of first graders
first.ages()

# same ages in months
first.ages(months=True)

[(5, 9, 0), (6, 6, 0), (7, 0, 0), (6, 5, 0), (6, 8, 0), (6, 6, 0), (6, 1, 0), (6, 1, 0), (6, 7, 0), (6, 10, 0), (6, 7, 0), (6, 6, 0), (6, 7, 0), (6, 11, 0), (6, 9, 0), (6, 7, 0), (6, 5, 0), (6, 7, 0), (6, 6, 0), (6, 9, 0), (5, 9, 0), (6, 6, 0), (7, 0, 0), (6, 5, 0), (6, 8, 0), (6, 6, 0), (6, 1, 0), (6, 1, 0), (6, 7, 0), (6, 10, 0), (6, 7, 0), (6, 6, 0), (6, 7, 0), (6, 11, 0), (6, 9, 0), (6, 7, 0), (6, 5, 0), (6, 7, 0), (6, 6, 0), (6, 9, 0), (5, 9, 0), (6, 6, 0), (7, 0, 0), (6, 5, 0), (6, 8, 0), (6, 6, 0), (6, 1, 0), (6, 1, 0), (6, 7, 0), (6, 10, 0), (6, 7, 0), (6, 6, 0), (6, 7, 0), (6, 11, 0), (6, 9, 0), (6, 7, 0), (6, 5, 0), (6, 7, 0), (6, 6, 0), (6, 9, 0)]

[69.0, 78.0, 84.0, 77.0, 80.0, 78.0, 73.0, 73.0, 79.0, 82.0, 79.0, 78.0, 79.0, 83.0, 81.0, 79.0, 77.0, 79.0, 78.0, 81.0, 69.0, 78.0, 84.0, 77.0, 80.0, 78.0, 73.0, 73.0, 79.0, 82.0, 79.0, 78.0, 79.0, 83.0, 81.0, 79.0, 77.0, 79.0, 78.0, 81.0, 69.0, 78.0, 84.0, 77.0, 80.0, 78.0, 73.0, 73.0, 79.0, 82.0, 79.0, 78.0, 79.0, 83.0, 81.0, 79.0, 77.0, 79.0, 78.0, 81.0]

In [18]:
# dates_of_recording()
# returns dates of recording

# recording date by file
first.dates_of_recording(by_files=True)

# unique recording dates
first.dates_of_recording()

[{datetime.date(1987, 10, 8)}, {datetime.date(1987, 10, 14)}, {datetime.date(1987, 10, 15)}, {datetime.date(1987, 10, 16)}, {datetime.date(1987, 10, 16)}, {datetime.date(1987, 10, 19)}, {datetime.date(1987, 10, 19)}, {datetime.date(1987, 10, 19)}, {datetime.date(1987, 10, 22)}, {datetime.date(1987, 10, 23)}, {datetime.date(1987, 10, 29)}, {datetime.date(1987, 10, 19)}, {datetime.date(1987, 10, 30)}, {datetime.date(1987, 10, 30)}, {datetime.date(1987, 11, 5)}, {datetime.date(1987, 11, 5)}, {datetime.date(1987, 11, 5)}, {datetime.date(1987, 11, 6)}, {datetime.date(1987, 11, 6)}, {datetime.date(1987, 11, 6)}, {datetime.date(1987, 10, 8)}, {datetime.date(1987, 10, 14)}, {datetime.date(1987, 10, 15)}, {datetime.date(1987, 10, 16)}, {datetime.date(1987, 10, 16)}, {datetime.date(1987, 10, 19)}, {datetime.date(1987, 10, 19)}, {datetime.date(1987, 10, 19)}, {datetime.date(1987, 10, 22)}, set(), {datetime.date(1987, 10, 29)}, {datetime.date(1987, 10, 29)}, {datetime.date(1987, 10, 30)}, {datetim

{datetime.date(1987, 11, 6), datetime.date(1987, 10, 30), datetime.date(1987, 10, 23), datetime.date(1987, 10, 19), datetime.date(1987, 10, 29), datetime.date(1987, 11, 5), datetime.date(1987, 10, 8), datetime.date(1987, 10, 14), datetime.date(1987, 10, 16), datetime.date(1987, 10, 22), datetime.date(1987, 10, 15)}

All of the recordings are from 1987.

In [19]:
# languages()
# returns language info 

first.languages()

{'eng'}

Already know that this dataset is all English.

In [20]:
# participants()
# returns participants

first.participants()

{'XXX', 'CHI', 'RES'}

Note to self: look up what the 'XXX' participant label is.

&darr;

According to the [CHAT transcription manual](https://talkbank.org/manuals/CHAT.pdf), XXX stands in for the three-letter speaker ID, which here would be RES (researcher) or CHI (child). The XXX popping up here could be a leftover from a transcript where the template placeholder was never replaced with the actual ID?

In [21]:
# full header information in a dict 

first.headers()[0]
first.headers()[-1]

{'UTF8': '', 'PID': '11312/c-00025368-1', 'Languages': ['eng'], 'Participants': {'CHI': {'name': 'Ari', 'language': 'eng', 'corpus': 'Hicks', 'age': '5;09.', 'sex': 'female', 'group': '', 'ses': '', 'role': 'Target_Child', 'education': '', 'custom': ''}, 'RES': {'name': 'Deborah', 'language': 'eng', 'corpus': 'Hicks', 'age': '', 'sex': '', 'group': '', 'ses': '', 'role': 'Investigator', 'education': '', 'custom': ''}}, 'Comment': 'Task is eventcast', 'Date': {datetime.date(1987, 10, 8)}, 'Types': 'cross, narrative, TD'}

{'UTF8': '', 'PID': '11312/c-00025427-1', 'Languages': ['eng'], 'Participants': {'CHI': {'name': 'Emily', 'language': 'eng', 'corpus': 'Hicks', 'age': '6;09.', 'sex': 'female', 'group': '', 'ses': '', 'role': 'Target_Child', 'education': '', 'custom': ''}, 'RES': {'name': 'Deborah', 'language': 'eng', 'corpus': 'Hicks', 'age': '', 'sex': '', 'group': '', 'ses': '', 'role': 'Investigator', 'education': '', 'custom': ''}}, 'Comment': 'Task is story', 'Date': {datetime.date(1987, 11, 6)}, 'Types': 'cross, narrative, TD'}

There's not a whole lot of information in the headers of this dataset, but for demographic information, I think I will want to have access to the participants' age and sex. Not sure if I need their names? The grade level comes from the file name. As for additional information, it might be nice to see the comments, though, actually, this might only tell me about the narrative genre (which I also already know from the file name).
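
A sketch of pulling just the child's age and sex out of one header dict. The dict below is a trimmed copy of the first header printed above, and `age_in_months` is a hypothetical helper of mine. It relies on the CHAT age format `years;months.days` and ignores the days part.

```python
import re

# Trimmed copy of first.headers()[0]
header = {
    "Participants": {
        "CHI": {"name": "Ari", "age": "5;09.", "sex": "female", "role": "Target_Child"},
        "RES": {"name": "Deborah", "age": "", "sex": "", "role": "Investigator"},
    },
    "Comment": "Task is eventcast",
}


def age_in_months(chat_age):
    """Convert a CHAT age string like '5;09.' to whole months; None if empty or malformed."""
    match = re.fullmatch(r"(\d+);(\d+)\.(\d*)", chat_age)
    if match is None:
        return None
    years, months, _days = match.groups()
    return int(years) * 12 + int(months)  # days are ignored in this sketch


child = header["Participants"]["CHI"]
record = {"age_months": age_in_months(child["age"]), "sex": child["sex"]}
print(record)
```

For this header the result agrees with the first value from `first.ages(months=True)`, which is 69.0.
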

*For this project, what is important is the actual text and the dependent tiers.*

#### Accessing transcriptions and annotations: 

For the specific purposes of my project, the transcription and annotation information is the important information needed.  

- asterisk that comes before the participant code signals a transcription line
- transcriptions are word-segmented by spaces
- punctuation treated as words
- dependent tiers marked by a % 
    - %mor tier
    - %gra tier
    - %cod tier (only for the child's (CHI) utterances in this dataset)

In [22]:
# only care about CHI utterances
# group by utterance to create a list of lists

for file in fifth:
    print(file.file_paths(), file.words(participants="CHI", by_utterances=True))

['Hicks/5th/event/evt01.cha'] [['he', 'puts', 'his', 'balloon', '.'], ['and', 'says', '.'], ["what's", 'he', 'saying', "what's", 'he', 'saying', '?'], ["he's", 'saying', '"/.'], ['stay', 'there', '!'], ['he', 'goes', 'into', 'the', 'bakery', 'shop', '.'], ['and', 'he', 'gets', 'something', '.'], ['oh', 'no', 'along', 'come', 'the', 'mean', 'boys', '.'], ["they're", 'looking', 'too', '.'], ['ah', "they're", 'looking', 'too', '.'], ["they're", 'feasting', 'their', 'eyes', '.'], ['agh', 'one', 'boy', 'spots', 'the', 'balloon', '!'], ['oh', 'they', 'run', '.'], ['and', 'they', 'catch', 'it', '.'], ['and', 'they', 'grab', 'it', '.'], ['and', 'they', 'run', '.'], ['the', 'little', 'boy', 'walks', 'out', 'of', 'the', 'pastry', 'shop', '.'], ['and', 'looks', 'for', 'his', 'balloon', '.'], ['he', 'walks', 'out', '.'], ['he', 'looks', 'up', '.'], ['he', 'looks', 'down', '.'], ['here', 'go', 'the', 'boys', '.'], ['up', 'the', 'hill', 'to', 'the', 'big', 'top', 'carrying', 'the', 'balloon', '.'], 

In [23]:
# .tokens() gives word-based annotations

hicks_tokens = hicks.tokens(participants="CHI")
hicks_tokens[:2]

[Token(word='so', pos='adv', mor='so', gra=Gra(dep=1, head=2, rel='JCT')), Token(word='then', pos='adv:tem', mor='then', gra=Gra(dep=2, head=4, rel='JCT'))]

In [24]:
for token in hicks_tokens[:2]:
    print("word:", token.word)
    print("part-of-speech tag:", token.pos)
    print("morphological information:", token.mor)
    print("grammatical relation:", token.gra, '\n')

word: so
part-of-speech tag: adv
morphological information: so
grammatical relation: Gra(dep=1, head=2, rel='JCT') 

word: then
part-of-speech tag: adv:tem
morphological information: then
grammatical relation: Gra(dep=2, head=4, rel='JCT') 
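
Since each token carries its own part-of-speech tag, tallying tags is straightforward. A minimal sketch using a stand-in `Token` namedtuple (not the PyLangAcq class), with words and tags copied from the first CHI utterance in this corpus:

```python
from collections import Counter, namedtuple

# Stand-in for the real token objects: only the two fields used here.
Token = namedtuple("Token", ["word", "pos"])

tokens = [
    Token("so", "adv"),
    Token("then", "adv:tem"),
    Token("they", "pro:sub"),
    Token("catch", "v"),
    Token("the", "det:art"),
    Token("balloon", "n"),
    Token("around", "prep"),
    Token("the", "det:art"),
    Token("bakery", "n"),
]

# Full tags, e.g. 'adv:tem'
pos_counts = Counter(t.pos for t in tokens)

# Coarse tags: keep only the part before the first colon, e.g. 'adv'
coarse_counts = Counter(t.pos.split(":")[0] for t in tokens)

print(pos_counts)
print(coarse_counts)
```

The same two `Counter` lines should work on `hicks_tokens`, assuming every token has a string `pos`; I have not checked how punctuation or untagged tokens come through.
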



In [25]:
# utterances - info beyond tokens
# filter by CHI utterances

hicks_utterances = hicks.utterances(participants="CHI")

In [26]:
hicks_utterances[0]

0,1,2,3,4,5,6,7,8,9,10
*CHI:,so,then,they,catch,the,balloon,around,the,bakery,.
%mor:,adv|so,adv:tem|then,pro:sub|they,v|catch,det:art|the,n|balloon,prep|around,det:art|the,n|bakery,.
%gra:,1|2|JCT,2|4|JCT,3|4|SUBJ,4|0|ROOT,5|6|DET,6|4|OBJ,7|4|JCT,8|9|DET,9|7|POBJ,10|4|PUNCT
%cod:,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal,$ind $pr $t:sq $str $pro:kds $np:bal


In [27]:
t = [u.tiers for u in hicks_utterances]
t[0]

{'CHI': 'so then they catch [!] the balloon around the bakery .', '%mor': 'adv|so adv:tem|then pro:sub|they v|catch det:art|the n|balloon prep|around det:art|the n|bakery .', '%gra': '1|2|JCT 2|4|JCT 3|4|SUBJ 4|0|ROOT 5|6|DET 6|4|OBJ 7|4|JCT 8|9|DET 9|7|POBJ 10|4|PUNCT', '%cod': '$ind $pr $t:sq $str $pro:kds $np:bal'}

In [28]:
for u in hicks_utterances[:5]:
    print('mor tier:  ', u.tiers['%mor'])
    print('gra tier:  ', u.tiers['%gra'])
    print('cod tier:  ', u.tiers['%cod'], '\n')

mor tier:   adv|so adv:tem|then pro:sub|they v|catch det:art|the n|balloon prep|around det:art|the n|bakery .
gra tier:   1|2|JCT 2|4|JCT 3|4|SUBJ 4|0|ROOT 5|6|DET 6|4|OBJ 7|4|JCT 8|9|DET 9|7|POBJ 10|4|PUNCT
cod tier:   $ind $pr $t:sq $str $pro:kds $np:bal 

mor tier:   coord|and adv:tem|then det:art|the n|boy v|come-3S adv|out .
gra tier:   1|5|LINK 2|5|JCT 3|4|DET 4|5|SUBJ 5|0|ROOT 6|5|JCT 7|5|PUNCT
cod tier:   $ind $pr $t:sq $np:pas 

mor tier:   coord|and v|look-3S prep|for det:poss|his n|balloon .
gra tier:   1|2|LINK 2|0|ROOT 3|2|JCT 4|5|DET 5|3|POBJ 6|2|PUNCT
cod tier:   $ind $pr $zero:pas $np:bal 

mor tier:   coord|and pro:sub|he v|walk-3S adv|out .
gra tier:   1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT
cod tier:   $ind $pr $pro:pas 

mor tier:   coord|and pro:sub|he v|look-3S adv|up .
gra tier:   1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT
cod tier:   $ind $pr $pro:pas 
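
Each %cod string packs several `$`-prefixed codes into one line, so a sketch of breaking them apart may help later. Reading `$np:bal` as a (category, value) pair is my assumption about the annotation schema; `split_codes` is a hypothetical helper.

```python
from collections import Counter

# The five %cod strings printed above
cod_tiers = [
    "$ind $pr $t:sq $str $pro:kds $np:bal",
    "$ind $pr $t:sq $np:pas",
    "$ind $pr $zero:pas $np:bal",
    "$ind $pr $pro:pas",
    "$ind $pr $pro:pas",
]


def split_codes(cod):
    """Split a %cod string into (category, value) pairs; value is None if absent."""
    codes = []
    for raw in cod.split():
        category, _, value = raw.lstrip("$").partition(":")
        codes.append((category, value or None))
    return codes


# Frequency of each code across the sample utterances
code_counts = Counter(code for cod in cod_tiers for code in split_codes(cod))
print(code_counts)
```
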



In [29]:
mor_tier = [[u.tiers['%mor']] for u in hicks_utterances]
gra_tier = [[u.tiers['%gra']] for u in hicks_utterances]
#cod_tier = [[u.tiers['%cod']] for u in hicks_utterances]

It's not letting me use a list comprehension to create a list out of the %cod tier... It may be because it is an extra tier not required by the CHAT data format? Since `u.tiers` is a plain dict, indexing it with '%cod' will fail for any utterance that has no %cod line, so it's also possible that some CHI utterances simply aren't coded.

Try for loop?

In [30]:
cod_tier_l = []

for u in hicks_utterances:
    cod_tier = [u.tiers['%cod']]
    cod_tier_l.append(cod_tier)

KeyError: '%cod'

In [31]:
cod_tier_l[:5]
len(cod_tier_l)

[['$ind $pr $t:sq $str $pro:kds $np:bal'], ['$ind $pr $t:sq $np:pas'], ['$ind $pr $zero:pas $np:bal'], ['$ind $pr $pro:pas'], ['$ind $pr $pro:pas']]

302

In [32]:
print(len(mor_tier), len(gra_tier), len(cod_tier_l))

8992 8992 302


I'm not exactly sure what's going on here. The loop raises `KeyError: '%cod'`, but `cod_tier_l` still holds the 302 entries that were appended before the error, which suggests that the 303rd CHI utterance is the first one without a %cod tier. Either way, I only get 302 entries instead of the 8992 I need to match up with the '%mor' and '%gra' tiers.
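
One way to keep the three lists the same length is to use `dict.get`, which returns `None` instead of raising when a key is missing. A minimal sketch with stand-in utterance objects: the first uses tier strings printed above, and the second is a hypothetical utterance with no %cod tier.

```python
from types import SimpleNamespace

# Stand-ins for utterance objects: each just needs a .tiers dict.
utterances = [
    SimpleNamespace(tiers={
        "%mor": "coord|and pro:sub|he v|walk-3S adv|out .",
        "%gra": "1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT",
        "%cod": "$ind $pr $pro:pas",
    }),
    # Hypothetical utterance that was never coded (no '%cod' key).
    SimpleNamespace(tiers={
        "%mor": "co|yeah .",
        "%gra": "1|0|INCROOT 2|1|PUNCT",
    }),
]

# .get() yields None for a missing tier, so every utterance gets an entry.
mor_tier = [u.tiers.get("%mor") for u in utterances]
gra_tier = [u.tiers.get("%gra") for u in utterances]
cod_tier = [u.tiers.get("%cod") for u in utterances]

# How many utterances lack a %cod tier?
n_missing = sum(cod is None for cod in cod_tier)
print(len(mor_tier), len(gra_tier), len(cod_tier), n_missing)
```

Applied to `hicks_utterances`, this should give three lists of equal length that can go into one DataFrame, with the uncoded rows showing up as missing values.
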

In [33]:
mor_tier[:5]
gra_tier[:5]

[['adv|so adv:tem|then pro:sub|they v|catch det:art|the n|balloon prep|around det:art|the n|bakery .'], ['coord|and adv:tem|then det:art|the n|boy v|come-3S adv|out .'], ['coord|and v|look-3S prep|for det:poss|his n|balloon .'], ['coord|and pro:sub|he v|walk-3S adv|out .'], ['coord|and pro:sub|he v|look-3S adv|up .']]

[['1|2|JCT 2|4|JCT 3|4|SUBJ 4|0|ROOT 5|6|DET 6|4|OBJ 7|4|JCT 8|9|DET 9|7|POBJ 10|4|PUNCT'], ['1|5|LINK 2|5|JCT 3|4|DET 4|5|SUBJ 5|0|ROOT 6|5|JCT 7|5|PUNCT'], ['1|2|LINK 2|0|ROOT 3|2|JCT 4|5|DET 5|3|POBJ 6|2|PUNCT'], ['1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT'], ['1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT']]

In [34]:
tier_df = pd.DataFrame({'mor_tier':mor_tier,
                        'gra_tier':gra_tier,
                        #'cod_tier':cod_tier_l
                        })

In [35]:
tier_df

Unnamed: 0,mor_tier,gra_tier
0,[adv|so adv:tem|then pro:sub|they v|catch det:...,[1|2|JCT 2|4|JCT 3|4|SUBJ 4|0|ROOT 5|6|DET 6|4...
1,[coord|and adv:tem|then det:art|the n|boy v|co...,[1|5|LINK 2|5|JCT 3|4|DET 4|5|SUBJ 5|0|ROOT 6|...
2,[coord|and v|look-3S prep|for det:poss|his n|b...,[1|2|LINK 2|0|ROOT 3|2|JCT 4|5|DET 5|3|POBJ 6|...
3,[coord|and pro:sub|he v|walk-3S adv|out .],[1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT]
4,[coord|and pro:sub|he v|look-3S adv|up .],[1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT]
...,...,...
8987,[coord|and part|throw-PRESP n|rock-PL prep|at ...,[1|0|INCROOT 2|1|COORD 3|2|OBJ 4|2|JCT 5|4|POB...
8988,[coord|and coord|and pro:indef|one prep|of det...,[1|7|LINK 2|7|LINK 3|7|SUBJ 4|3|NJCT 5|6|DET 6...
8989,[coord|and adv:tem|then pro:per|it v|pop-PAST .],[1|4|LINK 2|4|JCT 3|4|SUBJ 4|0|ROOT 5|4|PUNCT]
8990,[coord|and coord|and adv:tem|then qn|all det:a...,[1|7|LINK 2|7|LINK 3|7|JCT 4|6|QUANT 5|6|DET 6...
