![acqva](https://slcladal.github.io/images/acqva.jpg)



# Preparation{-}


In [None]:
# set options
options(stringsAsFactors = F)          # no automatic data transformation
options("scipen" = 100, "digits" = 12) # suppress math annotation
options(max.print=1000)                # show maximally 1000 elements in the output
# install packages
install.packages(c("tidyverse", "flextable", "here"))
# specify path to corpus
corpuspath <- here::here("data", "HSLLD")


In a next step, we activate the packages that we have installed.



In [None]:
library(tidyverse)
library(flextable)


Before we continue, it is important to think about what we want to do!

In this workshop, we want to load the CHILDES data and convert it into a format from which we can extract information. Ideally, the processed data should have the following format:


In [None]:
id <- data.frame(1:6)
id <- id %>%
  dplyr::rename("id" = colnames(id)[1]) %>%
  dplyr::mutate(file = c("aab", "aab", "aab", "aab", "aab", "aab"),
                childage = c("4;6", "4;6", "4;6", "4;6", "4;6", "4;6"),
                child = c("ben", "ben", "ben", "ben", "ben", "ben"),
                speaker = c("MOT", "MOT", "ben", "MOT", "ben", "MOT"),
                utterance = c("How are you ?", "Ben ?", "Okay", "Are you hungry ?", "No", "Sure ?"),
                tagged = c("How|WH are|BE you|PN ?|PC", "Ben|NNP ?|PC", "Okay|RB", "Are|BE you|PN hungry|JJ ?|PC", "No|NG", "Sure|RB ?|PC"),
                comment = c("", "", "", "", "shakes head", ""))
# inspect data
flextable::flextable(id) %>%
  flextable::autofit()


So we want the data in a tabular format in which each utterance occupies a separate row, and each row also contains information about the speaker and the file.

# Data processing

The analysis builds on the R session that we prepared above: we set options, installed and loaded the packages that we need, and specified the path to the unzipped CHILDES data that we will use.

In a first step, we create a list with the paths to the individual files. This tells the computer where to find the files it is supposed to load.


In [None]:
# list corpus files
cha <- list.files(path = corpuspath, 
                  pattern = "\\.cha$",   # escape the dot so that it matches a literal "."
                  all.files = T,
                  full.names = T, 
                  recursive = T, 
                  ignore.case = T)
# use only first 6 files for testing
#cha <- cha[1:6]
# check the first 6 file paths
head(cha)


We now load the data and split each file into its individual elements.

The `sapply` function loops over the elements of the `cha` object and performs the specified actions on each of them (here: loading the content via the `scan` function, removing superfluous white space, and splitting the file wherever it finds a sequence such as `*ABC1:` or `%abc:`).


In [None]:
# create version of corpus fit for concordancing
corpus <- sapply(cha, function(x) {
  # load data
  x <- scan(x, what = "char", sep = "\t", quiet = T, quote = "", skipNul = T)
  # clean data
  x <- stringr::str_trim(x, side = "both") # remove superfluous white spaces at the edges of strings
  x <- stringr::str_squish(x)              # remove superfluous white spaces within strings
  x <- paste0(x, collapse = " ")           # paste all utterances in a file together
  # split files into individual utterances (marker before each *SPEAKER: or %tier: label)
  x <- strsplit(gsub("([%*][a-zA-Z]{2,4}[0-9]{0,1}:)", "~~~\\1", x), "~~~")
})
# inspect results
str(corpus[1:3])
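
To see what the splitting step does, here is a minimal sketch on a made-up transcript string. The text in `toy_chat` is invented for illustration and is not taken from the corpus, and the pattern is a lightly simplified stand-in for the one used on the real files. A marker is inserted in front of every tier label such as `*MOT:` or `%mor:`, and the string is then split at that marker.

```r
# invented one-line transcript, collapsed in the same way as the corpus files
toy_chat <- paste("@Begin @Participants: CHI Target_Child, MOT Mother",
                  "*MOT: how are you ? %mor: adv:wh|how v|be pro|you ?",
                  "*CHI: okay . %act: nods")
# put a marker in front of every tier label, then split at the marker
toy_marked <- gsub("([%*][a-zA-Z]{2,4}[0-9]?:)", "~~~\\1", toy_chat)
toy_pieces <- strsplit(toy_marked, "~~~")[[1]]
# inspect
toy_pieces
```

The first piece holds the header information and every following piece starts with a tier label.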


We have now loaded the files into R, but the data is not yet structured in a way that we can use. Remember: we want the data to be in a tabular format.

## Extract file information{-}

Now, we extract information about the recording, e.g., the participants, the age of the child, and the date of the recording. For this, we extract the first element of each file because this first element contains all the relevant information about the recording.


In [None]:
# extract file info for each file
fileinfo <- sapply(corpus, function(x){ 
  # extract first element of each corpus file because this contains the file info
  x <- x[1]
  })
#inspect
fileinfo[1:3]


Now, we have one element for each file that contains all the relevant information about the file, such as when the recording took place, how old the target child was, and who was present during the recording.

## Extract file content{-}

Now, we extract the raw content from which we will later extract the speaker, the utterance, the pos-tagged utterance, and any comments. Here, we loop over the `corpus` object with the `sapply` function and remove the first element of each file (the file information). We then paste everything else together and split the whole conversation into utterances that start with a speaker id (e.g. `*MOT:`).

  


In [None]:
content <- sapply(corpus, function(x){
  x <- x[-1]   # drop the first element (file info); also safe if a file has no utterances
  x <- paste0(x, collapse = " ")
  x <- stringr::str_split(stringr::str_replace_all(x, "(\\*[A-Z])", "~~~\\1"), "~~~")
})
# inspect data
content[[1]][1:6]
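
The following minimal sketch walks through the same three steps on an invented file (all strings in `toy_file` are made up for illustration). It uses base R functions instead of `stringr`, but follows the same logic: drop the header, paste the rest together, and split before each speaker id.

```r
# invented pieces of one file: header first, then speaker and dependent tiers
toy_file <- c("@Begin @Participants: CHI Target_Child, MOT Mother",
              "*MOT: how are you ?", "%mor: adv:wh|how v|be pro|you ?",
              "*CHI: okay .", "%act: nods")
# drop the header, glue the tiers back together, split before each speaker id
toy_body <- paste0(toy_file[-1], collapse = " ")
toy_utts <- strsplit(gsub("(\\*[A-Z])", "~~~\\1", toy_body), "~~~")[[1]]
# inspect
toy_utts
```

Because the pasted text starts with a speaker id, the first element of the result is an empty string. Elements like this have no speaker, and the later filter on empty speakers removes them.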


Each element now contains an utterance together with its pos-tagged version and any comments. We will use this form of the data to extract the clean utterances, the pos-tagged utterances, and the comments, and store them in separate columns.


## Extract information{-}

Now, we extract how many elements (or utterances) there are in each file. 


In [None]:
elements <- sapply(content, function(x){
  x <- length(x)
})
# inspect
head(elements)


## Generate table{-}

We use this information to generate a first table which holds the file information in one column and the raw file content in another.


In [None]:
files <- rep(names(elements), elements)
fileinfo <- rep(fileinfo, elements)
rawcontent <- as.vector(unlist(content))
chitb <- data.frame(1:length(rawcontent),
                    files,
                    fileinfo,
                    rawcontent)
# inspect data
flextable::flextable(head(chitb, 3)) %>%
  flextable::autofit()
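
The key step here is `rep`, which repeats each file name once for every utterance in that file so that the file column lines up with the unlisted content. A minimal sketch with two invented files (the names `fileA` and `fileB` and all `toy_` objects are hypothetical):

```r
# two invented files with two and one utterances
toy_content <- list(fileA = c("*MOT: hi .", "*CHI: hi ."),
                    fileB = c("*MOT: look !"))
# number of utterances per file
toy_n <- lengths(toy_content)
# repeat each file name as often as the file has utterances
toy_files <- rep(names(toy_n), toy_n)
# flatten the list into one vector of utterances
toy_raw <- as.vector(unlist(toy_content))
toy_tb <- data.frame(id = seq_along(toy_raw),
                     files = toy_files,
                     rawcontent = toy_raw)
# inspect
toy_tb
```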


# Process table{-}

We can now use the information in the two columns to extract specific pieces of information from the data (and we store that information in separate columns). But first, we rename the first column and create a clean file column. We do this by removing everything up to and including the last `/` and then removing the `.cha` extension.


In [None]:
childes <- chitb %>%
  # rename id column
  dplyr::rename(id = colnames(chitb)[1]) %>%
  # clean file names
  dplyr::mutate(files = stringr::str_remove_all(files, ".*/"),
                files = stringr::str_remove_all(files, "\\.cha$")) 
# inspect data
flextable::flextable(head(childes, 3)) %>%
  flextable::autofit()


We now continue in the same manner: we remove what comes before and after the part that interests us, and store the extracted pieces of information in new columns.

Renaming the id column and cleaning the file names (this time in a single `gsub` call).


In [None]:
childes <- chitb %>%
  # rename id column
  dplyr::rename(id = colnames(chitb)[1]) %>%
  # clean file names
  dplyr::mutate(files = gsub(".*/(.*?)\\.cha", "\\1", files))


Creating a speaker column.



In [None]:
childes <- childes %>%  
  dplyr::mutate(speaker = stringr::str_remove_all(rawcontent, ":.*"),
                speaker = stringr::str_remove_all(speaker, "\\W"))
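
As a rough illustration of the two removal steps, the sketch below applies the same patterns with base R to three invented strings (the `toy_` objects are hypothetical). Everything from the first colon onwards is cut, and then the remaining non-word character (the asterisk) is removed.

```r
# invented raw content: one empty element and two utterances
toy_raw <- c("", "*MOT: how are you ? %mor: adv:wh|how v|be pro|you ?", "*CHI: okay .")
# keep only what precedes the first colon
toy_speaker <- sub(":.*", "", toy_raw)
# drop non-word characters such as the asterisk
toy_speaker <- gsub("\\W", "", toy_speaker)
# inspect
toy_speaker
```

The empty element stays empty, which is how rows without a speaker can be identified later.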


Creating an utterance column.



In [None]:
childes <- childes %>%  
  dplyr::mutate(utterance = stringr::str_remove_all(rawcontent, "%mor:.*"),
                utterance = stringr::str_remove_all(utterance, "%gpx:.*"),
                utterance = stringr::str_remove_all(utterance, "%act:.*"),
                utterance = stringr::str_remove_all(utterance, "%par:.*"),
                utterance = stringr::str_remove_all(utterance, "%add:.*"),
                utterance = stringr::str_remove_all(utterance, "\\*\\w{2,6}:"),
                utterance = stringr::str_squish(utterance))


Creating a column with the pos-tagged utterances.



In [None]:
childes <- childes %>%  
  dplyr::mutate(postag = stringr::str_remove_all(rawcontent, ".*%mor:"),
                postag = stringr::str_remove_all(postag, "%.*"),
                postag = stringr::str_remove_all(postag, "\\*\\w{2,6}:"),
                postag = stringr::str_squish(postag))
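
The utterance and the pos-tagged utterance are two halves of the same raw string: the utterance is what precedes the `%mor:` tier, and the tags are what follows it up to the next tier. A minimal sketch on one invented string, using base R equivalents of the `stringr` calls (the `toy_` objects are hypothetical):

```r
# invented raw content with a %mor tier and a %act tier
toy_raw <- "*MOT: how are you ? %mor: adv:wh|how v|be pro|you ? %act: waves"
# utterance: cut the %mor tier and everything after it, then the speaker id
toy_utt <- gsub("%mor:.*", "", toy_raw)
toy_utt <- gsub("\\*\\w{2,6}:", "", toy_utt)
toy_utt <- trimws(gsub("\\s+", " ", toy_utt))
# pos tags: cut everything up to %mor:, then any later tier
toy_pos <- gsub(".*%mor:", "", toy_raw)
toy_pos <- gsub("%.*", "", toy_pos)
toy_pos <- trimws(gsub("\\s+", " ", toy_pos))
# inspect
toy_utt
toy_pos
```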


Creating a column with comments.



In [None]:
childes <- childes %>%  
  dplyr::mutate(comment = stringr::str_remove_all(rawcontent, ".*%mor:"),
                comment = stringr::str_remove(comment, ".*?%"),
                # escape the pipe: an unescaped ".*|.*" matches (and removes) everything
                comment = stringr::str_remove_all(comment, ".*\\|.*"),
                comment = stringr::str_squish(comment))


Creating a column with the participants that were present during the recording.



In [None]:
childes <- childes %>%  
  dplyr::mutate(participants = gsub(".*@Participants:(.*?)@.*", "\\1", fileinfo))


Creating a column with the age of the target child.



In [None]:
childes <- childes %>%
  dplyr::mutate(age_targetchild = gsub(".*\\|([0-9]{1,3};[0-9]{1,3}\\.[0-9]{1,3})\\|.*", "\\1", fileinfo)) 


Creating a column with the age of the target child in years.



In [None]:
childes <- childes %>%
  dplyr::mutate(age_years_targetchild = stringr::str_remove_all(age_targetchild, ";.*")) 
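
The sketch below shows how the age pattern behaves on two invented header lines (the `@ID` strings are made up for illustration and only mimic the CHAT format). When the pattern matches, `gsub` returns just the age. When it does not match, `gsub` returns the input unchanged, which is why implausibly long values in the age column point to files without a usable age.

```r
# invented header lines: one with an age, one without
toy_info <- c("@ID: eng|HSLLD|CHI|4;06.15|female|||Target_Child|||",
              "@ID: eng|HSLLD|CHI||female|||Target_Child|||")
# extract age in the format years;months.days
toy_age <- gsub(".*\\|([0-9]{1,3};[0-9]{1,3}\\.[0-9]{1,3})\\|.*", "\\1", toy_info)
# keep only the years
toy_years <- sub(";.*", "", toy_age)
# inspect
toy_age
toy_years
```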


Creating a column with the gender of the target child.



In [None]:
childes <- childes %>%
  dplyr::mutate(gender_targetchild = gsub(".*\\|(male|female)\\|.*", "\\1", fileinfo))


Creating columns with the date-of-birth of the target child, more comments, and the date of the recording.



In [None]:
childes <- childes %>%  
  # create dob_targetchild column
  dplyr::mutate(dob_targetchild = gsub(".*@Birth of CHI:(.*?)@.*","\\1", fileinfo)) %>%
  # create comment_file column
  dplyr::mutate(comment_file = gsub(".*@Comment: (.*?)@.*", "\\1", fileinfo)) %>%
  # create date column
  dplyr::mutate(date = gsub(".*@Date: (.*?)@.*", "\\1", fileinfo))


Creating columns with the location where the recording took place and the situation type of the recording.



In [None]:
childes <- childes %>%  
  # create location column,
  dplyr::mutate(location = gsub(".*@Location: (.*?)@.*", "\\1", fileinfo)) %>%
  # create situation column
  dplyr::mutate(situation = gsub(".*@Situation: (.*?)@.*", "\\1", fileinfo))
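
All of these header fields follow the same scheme: find `@Field:`, capture as little as possible, and stop at the next `@`. As a sketch of how this could be wrapped up, the hypothetical helper `extract_field` below (not part of the original code) builds the pattern from a field name. The header string is invented, and `perl = TRUE` is an assumption made here so that the lazy quantifier `.*?` is handled by the PCRE engine.

```r
# hypothetical helper: extract the value of one header field
extract_field <- function(info, field) {
  pattern <- paste0(".*@", field, ": ?(.*?)@.*")
  trimws(gsub(pattern, "\\1", info, perl = TRUE))
}
# invented header information
toy_info <- "@Begin @Date: 12-MAY-1990 @Location: Home @Situation: Home visit 1; book reading @End"
# inspect
extract_field(toy_info, "Date")
extract_field(toy_info, "Location")
extract_field(toy_info, "Situation")
```

If a field is missing from the header, the pattern does not match and the whole (trimmed) header string is returned, so it is worth checking the resulting columns for unexpectedly long values.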


Creating columns with the activity during the recording and the homevisit number.



In [None]:
childes <- childes %>%  
  # create homevisit_activity column
  dplyr::mutate(homevisit_activity = stringr::str_remove_all(situation, ";.*")) %>%
  # create activity column
  dplyr::mutate(activity = gsub(".*@Activities: (.*?)@.*", "\\1", fileinfo)) %>%
  # create homevisit column
  dplyr::mutate(homevisit = stringr::str_sub(files, 4, 6))


Creating a column with the number of words in each utterance.



In [None]:
childes <- childes %>%  
  # create words column
  dplyr::mutate(words = stringr::str_replace_all(utterance, "\\W", " "),
                words = stringr::str_squish(words),
                words = stringr::str_count(words, "\\w+"))
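
To make the counting rule concrete, here is a small base R sketch on three invented utterances (the `toy_` objects are hypothetical). Non-word characters are first replaced by spaces, and the remaining runs of word characters are counted.

```r
# invented utterances, including a contraction and an empty string
toy_utt <- c("how are you ?", "don't go !", "")
# replace non-word characters with spaces
toy_clean <- gsub("\\W", " ", toy_utt)
# count runs of word characters
toy_words <- lengths(regmatches(toy_clean, gregexpr("\\w+", toy_clean)))
# inspect
toy_words
```

Note that a contraction such as *don't* is split at the apostrophe and therefore counts as two words under this rule.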


Cleaning the data: removing rows without speakers, removing rows where the age of the target child is incorrect, and removing superfluous columns.



In [None]:
childes <- childes %>%  
  # remove rows without speakers (contain only metadata)
  dplyr::filter(speaker != "") %>%
  # remove rows with incorrect age of child
  dplyr::filter(nchar(age_years_targetchild) < 5) %>%
  # remove superfluous columns
  dplyr::select(-fileinfo, -rawcontent, -situation)


In [None]:
# inspect data
flextable::flextable(head(childes)) %>%
  flextable::autofit()


Check the speakers.



In [None]:
table(childes$speaker)



We can use the table of speakers to classify speakers into different groups, e.g. siblings and peers (SIB), secondary caregivers (SCG), primary caregivers (PCG), and everyone else. In addition, we add proper labels to the activities.



In [None]:
# define groups for siblings and peers.
SIB <- c("BR1", "BR2", "BR3", "BRI", "BRO", "BRO1", "BRO2", "SI1",
         "SI2", "SI3", "SIS", "SIS1", "SIS2", "SIS3", "CO2", "CO3", 
         "COS", "COU", "COU2", "COU3", "FRE", "FRI", "KID", "FR1")
# define group for primary caregivers
PCG <- c("MOT", "FAT")
# define group for secondary caregivers
SCG <- c("ANT", "AUN", "GFA", "GMA", "GPA", "GRA", "GRM", "UNC")
# clean column names and add interlocutor column
childes <- childes %>%
  # create interlocutor
  dplyr::mutate(interlocutor = dplyr::case_when(speaker %in% SIB ~ "peer",
                                                speaker %in% PCG ~ "primarycaregiver",
                                                speaker %in% SCG ~ "secondarycaregiver",
                                                speaker == "CHI" ~ "child", 
                                                T ~ "other")) %>%
  # code activity
  dplyr::mutate(visit = substr(files, 6, 6)) %>%
  dplyr::mutate(situation = substr(files, 4, 5),
                # match whole codes so that letters inside a new label are not replaced again
                situation = dplyr::recode(situation,
                                          "br" = "Book reading",
                                          "er" = "Elicited report",
                                          "et" = "Experimental task",
                                          "lw" = "Letter writing",
                                          "md" = "Mother defined situation",
                                          "mt" = "Meal time",
                                          "re" = "Reading",
                                          "tp" = "Toy play"))
# inspect data
table(childes$interlocutor)
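
When short codes are turned into labels, it matters that whole codes are matched rather than substrings, because a label such as *Book reading* itself contains the code `re`. The sketch below contrasts an exact lookup via a named vector with chained substring replacement. All `toy_` objects are hypothetical and only three of the situation codes are used.

```r
# a few situation codes plus one unknown code
toy_codes <- c("br", "er", "re", "xx")
# lookup table: code = label
toy_labels <- c(br = "Book reading", er = "Elicited report", re = "Reading")
# exact lookup; unknown codes are kept as they are
toy_situation <- unname(ifelse(toy_codes %in% names(toy_labels),
                               toy_labels[toy_codes], toy_codes))
# for comparison: chained substring replacement also hits the new label
toy_chained <- gsub("re", "Reading", gsub("br", "Book reading", "br"))
# inspect
toy_situation
toy_chained
```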


# Saving the CHILDES table on your computer 

Now that we have the data in a neat format, we may want to store it on our computer. To save this table, you can use the `saveRDS` function and the `here` function as shown below. The first argument of `saveRDS` is the object that we want to save. The `file` argument is the path, i.e., where to store the data. It makes sense to build this path with the `here` function because `here` constructs the path relative to the project folder, so the same code works on different computers.


In [None]:
base::saveRDS(childes, file = here::here("data", "childes.rda"))
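
If you want to convince yourself that nothing is lost when saving, a quick round trip on a small invented table can help. This is only a sketch: it writes to a temporary folder (via `tempdir`) rather than to the `data` folder, and the `toy_` objects are hypothetical.

```r
# small invented table
toy_df <- data.frame(id = 1:2,
                     utterance = c("hi", "no"),
                     stringsAsFactors = FALSE)
# save to a temporary file and load it again
toy_path <- file.path(tempdir(), "toy.rds")
saveRDS(toy_df, file = toy_path)
toy_back <- readRDS(toy_path)
# compare the original and the reloaded table
all.equal(toy_df, toy_back)
```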



# Case studies

Now that we have the data in a format that we can use, we can use this table to perform searches.

In case the above processing has not worked for you, simply visit `https://github.com/AcqVALab/RCHILDES/` and download the file manually. If you store that file in your `data` folder, you can load it by executing the code chunk below.


In [None]:
childes <- base::readRDS(here::here("data", "childes.rda"))
# inspect data
childes[1:3, 1:4]


## Example 1: Extract uses of the word "No" by children {-}

To extract all instances of a single word (in this example, the word *no*) that are uttered by a specific interlocutor, we first filter by speaker and keep only rows where the speaker is equal to `CHI` (target child). We then keep only those utterances in which the word occurs.


In [None]:
no <- childes %>%
  dplyr::filter(speaker == "CHI") %>%
  dplyr::filter(stringr::str_detect(utterance, "\\b[Nn][Oo]\\b"))


In [None]:
# inspect data
flextable::flextable(head(no)) %>%
  flextable::autofit()
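
The word boundaries (`\\b`) in the search pattern are what keep words like *know* or *nobody* out of the results. A minimal check on a few invented utterances (the `toy_` objects are hypothetical, and `perl = TRUE` is an assumption made here to use the PCRE engine in base R):

```r
# invented utterances: three contain the word "no", two only contain the letters
toy_utt <- c("no", "No !", "I know", "nobody came", "oh no")
toy_hits <- grepl("\\b[Nn][Oo]\\b", toy_utt, perl = TRUE)
# inspect
data.frame(utterance = toy_utt, hit = toy_hits)
```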


We summarize the results in a table. 



In [None]:
no_no <- no %>%
  dplyr::group_by(files, gender_targetchild, age_years_targetchild) %>%
  dplyr::summarise(nos = dplyr::n())   # n() counts rows per group; nrow(.) would return the total
head(no_no)


We can also extract the number of words uttered by children to check if the use of *no* shows a relative increase or decrease over time.



In [None]:
no_words <- childes %>%
  dplyr::filter(speaker == "CHI") %>%
  dplyr::group_by(files, gender_targetchild, age_years_targetchild) %>%
  dplyr::mutate(nos = stringr::str_detect(utterance, "\\b[Nn][Oo]\\b")) %>%
  dplyr::summarise(nos = sum(nos),
                   words = sum(words)) %>%
  # add relative frequency
  dplyr::mutate(freq = round(nos/words*1000, 3))
# inspect data
head(no_words)
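
The relative frequency is simply the number of hits divided by the number of words, multiplied by 1,000. The sketch below reproduces the calculation with base R on an invented mini table (the `toy_` objects are hypothetical, and `aggregate` stands in for the `group_by` and `summarise` steps).

```r
# invented data: two files with word counts per utterance
toy_df <- data.frame(files = c("a", "a", "b"),
                     utterance = c("no !", "I know", "no no"),
                     words = c(1, 2, 2))
# does the utterance contain the word "no"?
toy_df$has_no <- grepl("\\b[Nn][Oo]\\b", toy_df$utterance, perl = TRUE)
# sum hits and words per file
toy_agg <- aggregate(cbind(has_no, words) ~ files, data = toy_df, FUN = sum)
# hits per 1,000 words
toy_agg$freq <- round(toy_agg$has_no / toy_agg$words * 1000, 3)
# inspect
toy_agg
```

Because the hits are detected per utterance, an utterance such as *no no* counts once. The measure is thus the number of utterances containing *no* per 1,000 words rather than the number of *no* tokens.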
  


We can also visualize the trends using the `ggplot` function. To learn how to visualize data in R see [this tutorial](https://slcladal.github.io/dviz.html).



In [None]:
p_no <- no_words %>%
  dplyr::mutate(age_years_targetchild = as.numeric(age_years_targetchild)) %>%
  ggplot(aes(x = age_years_targetchild, y = freq)) +
  geom_smooth() +
  theme_bw() +
  labs(x = "Age of target child", y = "Relative frequency of NOs \n (per 1,000 words)")
# save the plot (ggsave is called on its own; it cannot be added to a plot with +)
ggsave(here::here("images", "no_words.png"), plot = p_no, width = 6, height = 4, units = "cm")
# show the plot
p_no


## Example 2: Extracting all questions by mothers {-}

Here, we want to extract all questions uttered by mothers. We operationalize questions as utterances containing a question mark.


In [None]:
questions <- childes %>%
  dplyr::filter(speaker == "MOT") %>%
  dplyr::filter(stringr::str_detect(utterance, "\\?"))
# inspect data
head(questions)


We could now check if the rate of questions changes over time.



In [None]:
qmot <- childes %>%
  dplyr::filter(speaker == "MOT") %>%
  dplyr::mutate(questions = ifelse(stringr::str_detect(utterance, "\\?") == T, 1,0),
                utterances = 1) %>%
  dplyr::group_by(age_years_targetchild) %>%
  dplyr::summarise(utterances = sum(utterances),
                questions = sum(questions),
                percent = round(questions/utterances*100, 2))
# inspect data
head(qmot)


In [None]:
qmot %>%
  dplyr::mutate(age_years_targetchild = as.numeric(age_years_targetchild)) %>%
  ggplot(aes(x = age_years_targetchild, y = percent)) +
  geom_smooth() +
  theme_bw() +
  labs(x = "Age of target child", y = "Percent \n (questions)")


## Example 3: Extracting aux + part by mothers {-}

Here we want to extract all occurrences of an auxiliary plus a participle (e.g. *is swimming*) produced by mothers.


In [None]:
auxv <- childes %>%
  dplyr::filter(speaker == "MOT") %>%
  dplyr::filter(stringr::str_detect(postag, "aux\\|\\S{1,} part\\|"))
# inspect data
head(auxv)


We can now extract all the participle forms from the pos-tagged utterances.



In [None]:
auxv_verbs <- auxv %>%
  dplyr::mutate(participle = gsub(".*part\\|(\\w{1,})-.*", "\\1", postag)) %>%
  dplyr::pull(participle)
head(auxv_verbs)
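
The two patterns can be tried out on a pair of invented tag strings. The strings below only imitate the style of a `%mor` tier and are not taken from the corpus, and the `toy_` objects are hypothetical. The first pattern asks for an `aux|` tag directly followed by a `part|` tag, and the second captures the word characters between `part|` and the following hyphen.

```r
# invented pos-tagged utterances: one with aux + participle, one without
toy_pos <- c("pro|she aux|be&3S part|swim-PRESP .",
             "pro|she v|swim-3S .")
# detect an auxiliary that is directly followed by a participle
toy_has_aux_part <- grepl("aux\\|\\S{1,} part\\|", toy_pos, perl = TRUE)
# extract the participle stem from the matching utterances
toy_participle <- gsub(".*part\\|(\\w{1,})-.*", "\\1", toy_pos[toy_has_aux_part])
# inspect
toy_has_aux_part
toy_participle
```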


In [None]:
auxv_verbs_df <- auxv_verbs %>%
  as.data.frame(.)  %>%
  dplyr::rename("verb" = colnames(.)[1]) %>%
  dplyr::group_by(verb) %>%
  dplyr::summarise(freq = n()) %>%
  dplyr::arrange(-freq) %>%
  head(20)
# inspect
head(auxv_verbs_df)


We can again visualize the results. In this case, we create a bar plot (see the `geom_bar` layer).



In [None]:
auxv_verbs_df %>%
  ggplot(aes(x = reorder(verb, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  labs(x = "Verb", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 90))


## Example 4: How many verbs do children use by age? {-}

Here we extract all lexical verbs and words uttered by children by year and then see if the rate of verbs changes over time.


In [None]:
nverbs <- childes %>%
  dplyr::filter(speaker == "CHI") %>%
  dplyr::mutate(nverbs = stringr::str_count(postag, "^v\\|| v\\|"),
  age_years_targetchild = as.numeric(age_years_targetchild)) %>%
  dplyr::group_by(age_years_targetchild) %>%
  dplyr::summarise(words = sum(words),
                verbs = sum(nverbs)) %>%
  dplyr::mutate(verb.word.ratio = round(verbs/words, 3))
# inspect data
nverbs
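
The verb pattern only counts tags that start with `v|`, either at the beginning of the string or after a space, so that tags which merely end in `v|` (such as `adv|`) are not counted. A minimal sketch on invented tag strings, using base R instead of `str_count` (the `toy_` objects are hypothetical):

```r
# invented pos-tagged utterances
toy_pos <- c("v|go pro|you v|want n|cookie",
             "adv|now v|go",
             "aux|be part|go-PRESP")
# count matches of the verb pattern in each string
toy_nverbs <- lengths(regmatches(toy_pos, gregexpr("^v\\|| v\\|", toy_pos)))
# inspect
toy_nverbs
```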


We can also visualize the results to show any changes over time. 



In [None]:
nverbs %>%
  ggplot(aes(x = age_years_targetchild, y = verb.word.ratio)) +
  geom_line() +
  coord_cartesian(ylim = c(0, 0.2)) +
  theme_bw() +
  labs(x = "Age of target child", y = "Verb-Word Ratio")


# Saving data to your computer{-}

To save results on your computer, you can use the `write.table` function as shown below.


In [None]:
write.table(nverbs, here::here("tables", "nverbs.txt"), sep = "\t", row.names = F)



***

# Citation & Session Info {-}

Schweinberger, Martin. `r format(Sys.time(), '%Y')`. *Working with the Child Language Data Exchange System (CHILDES) using R: code book*. Tromsø: The Arctic University of Norway. url: https://slcladal.github.io/mmws.html (Version `r format(Sys.time(), '%Y.%m.%d')`).


In [None]:
@manual{schweinberger`r format(Sys.time(), '%Y')`mmws,
  author = {Schweinberger, Martin},
  title = {Working with the Child Language Data Exchange System (CHILDES) using R},
  note = {https://slcladal.github.io/mmws.html},
  year = {2021},
  organization = {Arctic University of Norway, AcqVA Aurora Center},
  address = {Tromsø},
  edition = {`r format(Sys.time(), '%Y.%m.%d')`}
}


In [None]:
sessionInfo()



***

[Back to top](#introduction)

***
