Skip to content

mekline/CLANtoR

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CLANtoR

DOI

To cite this repository:

Kline, Melissa. (2012) CLANtoR (Github repository). Available at http://github.com/mekline/CLANtoR/ DOI:10.5281/zenodo.1196626

This is a fairly simple R script that just takes a single CHILDES-formatted corpus file and turns it into an R dataframe with all corpus information (e.g. child, age, date) listed on each utterance. From there it can be written out as a CSV or other standard file format.

Note: this is older code, and a much better solution for converting these corpora to tabular data now exists! Go check out childes-db

In addition to copying any tiers that it finds into columns of the dataframe, CLANtoR also creates a gloss of the utterance that removes corpus notation, leaving an unannotated version that is suitable for displaying to experimental participants or conference audiences :) In the example script I've also done some additional processing that deletes ending punctuation and various whitespace anomalies in order to find even more sentences with matched gloss and %mor lines.

Be wary of the output you get with this script, especially if you wish to use tiers other than gloss and mor. If you are not already, become familiar with text encoding, control characters, and csv formats to understand what might be happening if a line looks funny. Especially watch out if there are commas or quotation marks anywhere in your corpus file.

In the Brown (1973) corpus this finds matching lengths for approximately 97% of adult sentences and 95% of child sentences- most of the exceptions are repairs (He has, she has a doll) and sentence fragments. The script is pretty slow (10 minutes or so to read the entire Eve corpus on my machine) but you should only need to run it once per corpus.

I also include two additional Python files that may make this code more useful, by cleaning up some common errors and splitting the csv output into manageable sizes.

ClantoR.R - The main script

example.R Example script for the Brown corpus

fixer.py - Corrects for some of the text encoding anomalies

splitter.py - Gives you a set of smaller csv slices that are somewhat less likely to crash your spreadsheet program.

About

Code to turn a CHILDES-style corpus file into a nicely formatted dataframe

Resources

License

Stars

Watchers

Forks

Packages

No packages published