

In [None]:
knitr::include_graphics("https://slcladal.github.io/images/uq1.jpg")



# Introduction{-}

This section presents case studies that show how to carry out corpus-based analyses by implementing procedures introduced in other LADAL tutorials. In other words, the aim is to show how the content of other tutorials can be put into practice. Please keep in mind, however, that the case studies merely exemplify ways in which R can be used in language-based research; they do not provide detailed procedures for doing corpus-based research. The R markdown document of this case study can be downloaded [here](https://slcladal.github.io/rscripts/corplingr.Rmd).

# What is Corpus Linguistics?{-}


In [None]:
knitr::include_graphics("https://slcladal.github.io/images/antconc.png")



Corpus Linguistics (CL) can be considered both a methodology and a field of study. The defining feature of corpus linguistics research is the use of corpora (plural of corpus) to understand language [@biber1998corpus]. A corpus is a collection of machine-readable (electronic) texts. CL emerged in the 1960s but has only really expanded since the 1990s, when the use of computers and software made it possible for researchers to analyze corpora efficiently [@lindquist2009corpus].

The texts represented in a corpus can be very different in nature and reflect many different uses of language - texts in corpora can, for example, represent newspaper articles, parliamentary debates, dinner conversations, mothers talking to children, interviews, student essays, letters, lectures, etc. [@mcenery2001corpuslinguistics].

Corpora can also be very different. *Monitor corpora*, for example, aim to represent the whole variety of contexts in which people speaking a particular language, e.g. English, use that language. *Specialized corpora*, however, try to reflect language use in a specific context or register, for example the use of English in Academia or in business transactions or, for instance, how parents talk with children. 


In [None]:
knitr::include_graphics("https://slcladal.github.io/images/quantcl_gries.png")



In addition, we can differentiate between diachronic (or historical) and synchronic corpora: *diachronic corpora* contain language samples collected across different points in time so that changes in the use of language can be studied. *Synchronic corpora* represent language samples collected at one point (or period) in time and aim to reflect the state of language use during that one period (monitor corpora are typically synchronic). 


In addition to providing examples of actual, naturally occurring language use, corpora offer frequency and probability information. In other words, corpora provide information about how frequent certain phenomena or linguistic variants are compared to other phenomena and which factors condition their use [@gries2009whatiscorpuslinguistics]. 

The use of corpora has increased dramatically as corpora offer a relatively cheap, comparatively easy, flexible, and externally valid method for analyzing language use and testing hypotheses about linguistic behavior.

## Notes on loading corpus data into R{-}

Before we continue with the case studies, it is important to see how what we will be doing in the case studies differs from what you will most likely do if you conduct corpus-based research.

If a corpus is not accessed via a web application, it typically (almost always) comes in the form of text or audio files in a folder. That is, researchers typically download the corpus from a repository (for instance a website) or copy it from some other storage medium (for instance a CD or USB stick). This means that non-web-based corpora are typically stored somewhere on a researcher's computer, from where they can be loaded into some software, e.g. AntConc. 

For the present tutorial, however, we will simply load data that is available via the LADAL GitHub repository. Nonetheless, it is important to know how to load corpus data into R - which is why I will show this below. 

Loading corpus data into R consists of two steps: 

1. create a list of paths of the corpus files

2. loop over these paths and load the data in the files identified by the paths.

To create a list of corpus files, you could use the code chunk below (the code chunk assumes that the corpus data is in a folder called *Corpus* in the *data* sub-folder of your Rproject folder).


In [None]:
corpusfiles <- list.files(here::here("data/Corpus"), # path to the corpus data
                          # file types you want to analyze, e.g. txt-files
                          # (the dot is escaped and $ anchors the match at the end of the name)
                          pattern = "\\.txt$",
                          # full paths - not just the names of the files
                          full.names = TRUE)


You can then use the `sapply` function to loop over the paths and load the data into R using e.g. the `scan` function as shown below. In addition to loading the file content, we also paste all the content together using the `paste0` function and remove superfluous white spaces using the `str_squish` function from the `stringr` package.



In [None]:
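corpus <- sapply(corpusfiles, function(x){
  # read the file as a vector of white-space separated tokens
  x <- scan(x, 
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = TRUE, 
            skipNul = TRUE)
  # paste all tokens of a file together into one string
  x <- paste0(x, collapse = " ")
  # remove superfluous white spaces
  x <- stringr::str_squish(x)
})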


Once you have loaded your data into R, you can then continue with processing and transforming the data according to your needs.
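
The two steps can be tried out without a real corpus. The following is a minimal sketch in base R that assumes nothing but a temporary folder with two made-up text files (the folder name `Corpus_demo`, the file names, and the texts are invented for illustration); it uses `readLines` instead of `scan`, which is just one of several possible ways to read the files.

```r
# made-up demo data: a temporary folder with two tiny text files
demo_dir <- file.path(tempdir(), "Corpus_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("This is   the first", "text ."), file.path(demo_dir, "text1.txt"))
writeLines("And this is the second text .", file.path(demo_dir, "text2.txt"))

# step 1: create a vector of paths to the txt-files
demo_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# step 2: loop over the paths and read each file into one string
demo_corpus <- sapply(demo_files, function(f) {
  txt <- paste(readLines(f, warn = FALSE), collapse = " ")  # join the lines
  gsub("\\s+", " ", trimws(txt))                            # squeeze white space
})
# use the file names (not the full paths) as element names
names(demo_corpus) <- basename(demo_files)
demo_corpus
```

The result is a named character vector with one string per file, which is the same shape that the `sapply` call above produces.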

> NOTE
> 
> There are many different ways in which you can load text data into R. What I have shown above is just one way of doing this, but I have found this procedure very useful. In the case study which exemplifies how you can analyze sociolinguistic variation, we show how you can load text data in a very similar yet slightly different way (the tidyverse style of loading text data).

## Preparation and session set up{-}

The case studies shown below are based on R. Thus, you should already be familiar with R and RStudio. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R [here](https://slcladal.github.io/intror.html). In addition, it is recommended to be familiar with regular expressions ([this tutorial](https://slcladal.github.io/regex.html) contains an overview of the regular expressions that are used in this tutorial). 

You should also have downloaded  and installed R and RStudio. [This tutorial](https://slcladal.github.io/intror.html) contains links to detailed how-tos on how to download and install R and RStudio.

For this case study, we need to install certain *packages* from an R *library* so that the scripts shown below are executed without errors. If you have already installed the packages mentioned below, you can skip this section. To install the necessary packages, simply run the following code. It may take some time (between 1 and 5 minutes) to install all of the packages, so you do not need to worry if it takes a while.


In [None]:
# installing packages
install.packages("tidyverse")
install.packages("flextable")
install.packages("knitr")
install.packages("here")
install.packages("quanteda")
install.packages("cfa")


In [None]:
# set options
options(stringsAsFactors = F)         # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# load packages
library(tidyverse)
library(flextable)
library(knitr)
library(here)
library(quanteda)
library(cfa)


Once you have installed RStudio and initiated the session by executing the code shown above, you are good to go.

# Studying First Language Acquisition{-}

This case study shows how you can use and analyze the *Child Language Data Exchange System* (CHILDES) [@macwhinney1996childes] database using R, to showcase how you can use it in your own studies. This section of the tutorial consists of two parts: 

* in the first part, we load, process, and transform the data so that we have all pieces of information in a tidy table format. 

* in the second part, we perform selected case studies showing how to extract information from the table we have created out of the corpus files.

The first part is necessary because the corpus data in CHILDES comes in a special format that makes it somewhat tedious to extract information from it. Thus, we want to separate all information into different columns. In the end, the corpus data should have the following format:


In [None]:
id <- data.frame(1:6)
id <- id %>%
  dplyr::rename("id" = colnames(id)[1]) %>%
  dplyr::mutate(file = c("aab", "aab", "aab", "aab", "aab", "aab"),
                childage = c("4;6", "4;6", "4;6", "4;6", "4;6", "4;6"),
                child = c("ben", "ben", "ben", "ben", "ben", "ben"),
                speaker = c("MOT", "MOT", "ben", "MOT", "ben", "MOT"),
                utterance = c("How are you ?", "Ben ?", "Okay", "Are you hungry ?", "No", "Sure ?"),
                tagged = c("How|WH are|BE you|PN ?|PC", "Ben|NNP ?|PC", "Okay|RB", "Are|BE you|PN hungry|JJ ?|PC", "No|NG", "Sure|RB ?|PC"),
                comment = c("", "", "", "", "shakes head", ""))
# inspect data
flextable::flextable(id) %>%
  flextable::autofit()


So, after the processing, the data should contain each utterance in a separate line and each line should also contain information about the speaker and the file.


## Using CHILDES data{-}

The [*Child Language Data Exchange System* (CHILDES)](https://childes.talkbank.org/) [@macwhinney1996childes] is a browsable data base which provides corpora consisting of transcripts of conversations with children. CHILDES was established in 1984 by Brian MacWhinney and Catherine Snow and it represents the central repository for data of first language acquisition. Its earliest transcripts date from the 1960s, and it now has contents (transcripts, audio, and video) in 26 languages from 130 different corpora, all of which are publicly available worldwide. 


In [None]:
knitr::include_graphics("https://slcladal.github.io/images/childes01.png")



CHILDES is the child language part of the TalkBank system which is a system for sharing and studying conversational interactions.

<br><br>

To download corpora from CHILDES:

* Go to the [CHILDES website](https://childes.talkbank.org/) - the landing page looks like the website shown in the image to the right. For the present tutorial, the only relevant part of that website is the section labeled *Database*, which contains the links to the different CHILDES corpora that you can download for free. 


* In the section called *Database*, click on *Index to Corpora*, which will take you to a table that contains links to different kinds of corpora - all containing transcripts of children's speech. The corpora available cover many different languages and include monolingual and bilingual children, children with speech disorders, transcripts of frog stories, etc. 


In [None]:
knitr::include_graphics("https://slcladal.github.io/images/childes02.png")



* To download a corpus, click on one of the sections, e.g. on *Eng-NA* which stands for *English recorded in North America* (but you can, of course, also download other CHILDES corpora), and then scroll down to the corpus you are interested in and click on it, e.g. scroll down to and click on *HSLLD*. 



In [None]:
knitr::include_graphics("https://slcladal.github.io/images/childes03.png")



* Click on *Download transcripts* and then download and store the zip-folder somewhere on your computer.

* Next, unzip the zip-file and store the resulting unzipped corpus in the *data* sub-folder in your Rproject folder.

Once you have downloaded, stored the data on your computer, and unzipped it, you are good to go and you can now access and analyze data from CHILDES.

<br><br>

## HSLLD corpus{-}

For this case study, we will use data from the *Home-School Study of Language and Literacy Development* corpus (HSLLD), which is part of the CHILDES database. The *Home-School Study of Language and Literacy Development* began in 1987 under the leadership of Patton Tabors and with Catherine E. Snow and David K. Dickinson as primary investigators. The original purpose of the HSLLD was to investigate the social prerequisites to literacy success. 

The initial number of participants was 83 American English speaking, racially diverse, preschool age children from low-income families growing up in or around Boston, Massachusetts. Seventy-four of these children were still participating at age 5. The sample consists of 38 girls and 36 boys. Forty-seven children were Caucasian, 16 were African American, six were of Hispanic origin, and five were biracial. 

Children were visited once a year in their home from age 3 – 5 and then again when they were in 2nd and 4th grade. Each visit lasted between one and three hours. Home visits consisted of a number of different tasks depending on the year. An outline of the different tasks for each visit is presented below. 

Activities during **Home Visit 1** (HV1): Book reading (BR), Elicited report (ER), Mealtime (MT), Toy Play (TP)  

Activities during **Home Visit 2** (HV2): Book reading (BR), Elicited report (ER), Mealtime (MT), Toy Play (TP)  

Activities during **Home Visit 3** (HV3): Book reading (BR), Elicited report (ER), Experimental task (ET), Mealtime (MT), Reading (RE), Toy play (TP)  

Activities during **Home Visit 5** (HV5): Book reading (BR), Letter writing (LW), Mealtime (MT)  

Activities during **Home Visit 7** (HV7): Experimental task (ET), Letter writing (LW), Mother definitions (MD), Mealtime (MT)  

## Data processing{-}

We now load the data and inspect its structure using the `str` function - as the HSLLD has many files, we will only check the first 3.


In [None]:
url <- "https://slcladal.github.io/data/hslld.rda"
# download the file (mode = "wb" keeps the binary file intact on Windows)
download.file(url, "hslld.rds", mode = "wb")
hslld <- readRDS("hslld.rds")
# inspect
str(hslld[1:3])


We continue and split the content of each file into its individual parts. The `sapply` function loops over each element in the `hslld` object and performs the specified actions on it (here, removing superfluous white spaces, pasting the lines of each file together, and splitting the file wherever it finds sequences such as `*ABC1:` or `%ABC:`).



In [None]:
# create version of corpus fit for concordancing
corpus <- sapply(hslld, function(x) {
  # clean data
  x <- stringr::str_trim(x, side = "both") # remove superfluous white spaces at the edges of strings
  x <- stringr::str_squish(x)              # remove superfluous white spaces within strings
  x <- paste0(x, collapse = " ")           # paste all utterances in a file together
  # split files into individual utterances
  x <- strsplit(gsub("([%|*][a-z|A-Z]{2,4}[0-9]{0,1}:)", "~~~\\1", x), "~~~")
})
# inspect results
str(corpus[1:3])


We have now loaded the files into R, but the data is not yet structured in a way that we can use - remember: we want the data to be in a tabular format.
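
To see what the splitting step does, it can help to run it on a tiny example. The sketch below uses a made-up mini transcript in a CHAT-like format (the lines are invented and not taken from the HSLLD) and applies the same regular expression as above: a marker (`~~~`) is inserted before every tier label and the string is then split at that marker.

```r
# made-up mini transcript in a CHAT-like format (not taken from the HSLLD)
toy <- c("@Begin", "@Participants: CHI Child, MOT Mother",
         "*MOT: how are you ?", "%mor: adv|how cop|be pro|you ?",
         "*CHI: okay .", "%act: nods")
# paste all lines together into one string
toy <- paste0(trimws(toy), collapse = " ")
# insert the marker ~~~ before every tier label, then split at the marker
toy_split <- strsplit(gsub("([%|*][a-z|A-Z]{2,4}[0-9]{0,1}:)", "~~~\\1", toy), "~~~")[[1]]
toy_split
```

The first element holds the header lines (those starting with `@`), and every following element starts with a tier label such as `*MOT:` or `%mor:`.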

**Extract file information**

Now, we extract information about the recording, e.g., the participants, the age of the child, the date of the recording etc. For this, we extract the first element of each file (because this first element contains all the relevant information about the recording). To do this, we again use the `sapply` function (which is our looping function) and then tell R that it shall only retain the first element of each element (`x <- x[1]`).


In [None]:
# extract file info for each file
fileinfo <- sapply(corpus, function(x){ 
  # extract first element of each corpus file because this contains the file info
  x <- x[1]
  })
#inspect
fileinfo[1:3]


Now, we have one element for each file that contains all the relevant information about the file, like when the recording took place, how old the target child was, who was present during the recording etc.

**Extract file content**

Now, we extract the raw content from which we will extract the speaker, the utterance, the pos-tagged utterance, and any comments. Here, we loop over the `corpus` object with the `sapply` function and remove the first element of each file, that is, we retain the second through the last element (`x <- x[2:length(x)]`). Then we paste everything else together using the `paste0` function and split the whole conversation into utterances that start with a speaker id (e.g. `*MOT:`). The latter is done by the sequence `stringr::str_split(stringr::str_replace_all(x, "(\\*[A-Z])", "~~~\\1"), "~~~")`.

  


In [None]:
content <- sapply(corpus, function(x){
  x <- x[2:length(x)]
  x <- paste0(x, collapse = " ")
  x <- stringr::str_split(stringr::str_replace_all(x, "(\\*[A-Z])", "~~~\\1"), "~~~")
})
# inspect data
content[[1]][1:6]


The data now consists of utterances but also the pos-tagged utterances and any comments. However, we use this form of the data to extract the clean utterances, the pos-tagged utterances and the comments and store them in different columns. 


**Extract information**

Now, we extract how many elements (or utterances) there are in each file by looping over the `content` object and extracting the number of elements within each element of the `content` object by using the `length` function. 


In [None]:
elements <- sapply(content, function(x){
  x <- length(x)
})
# inspect
head(elements)


**Generate table**

We now have the file names, the metadata for each file, and the content of each file (that is split into utterances). We use this information to generate a first table which holds the file name in one column, the file information in one column, and the raw file content in another column. To combine these three pieces of information though, we need to repeat the file names and the file information as often as there are utterances in each file. We perform this repetition using the `rep` function. Once we have as many file names and file information as there are utterances in each file, we can combine these three vectors into a table using the `data.frame` function. 


In [None]:
files <- rep(names(elements), elements)
fileinfo <- rep(fileinfo, elements)
rawcontent <- as.vector(unlist(content))
chitb <- data.frame(1:length(rawcontent),
                    files,
                    fileinfo,
                    rawcontent)
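
The repetition step is easiest to follow with small numbers. The following sketch assumes two hypothetical files, `a` with two utterances and `b` with three (names, counts, and utterances are invented), and shows how `rep` aligns the file names with the utterances.

```r
# hypothetical counts: file "a" has 2 utterances, file "b" has 3
toy_elements <- c(a = 2, b = 3)
# repeat each file name as often as the file has utterances
toy_files <- rep(names(toy_elements), toy_elements)
toy_files
# combine the repeated file names with the (made-up) utterances
toy_tb <- data.frame(id = seq_along(toy_files),
                     files = toy_files,
                     rawcontent = c("*MOT: hi .", "*CHI: hi .",
                                    "*MOT: look !", "*CHI: what ?", "*MOT: a dog ."))
toy_tb
```

Because the second argument of `rep` is a vector of counts, each name is repeated its own number of times, so the vectors have the same length and can be combined with `data.frame`.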


The table in its current form is shown below. We can see that the table has four columns: the first column is a running index of the utterances, the second holds the path to each file, the third contains the file information, and the fourth the utterances.



In [None]:
# inspect data
chitb %>% 
  head(3) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::border_outer()


**Process table**

We can now use the information in the two last columns to extract specific pieces of information from the data (which we will store in additional columns that we add to the table). But first, we rename the first column (which is simply an index of each utterance) to `id` using the `rename` function from the `dplyr` package. Then, we clean the file name column (called `files`) so that it only contains the name of the file, that is, we remove the rest of the path information that we do not need anymore. We do this by using the `mutate` function from the `dplyr` package (which changes columns or creates new columns). Within the `mutate` function, we use the `gsub` function which substitutes something with something else: here, the full path is replaced with only that part of the path that contains the file name. The `gsub` function has the following form

> gsub(*look for pattern*, *replacement of the pattern*, object)

This means that the `gsub` function needs an object and in that object it looks for a pattern and then replaces instances of that pattern with something.

In our case, what we look for is the file name, which is located between the last `/` and the file ending (`.cha`). So, we match everything that comes between a `/` and `.cha` in the path and keep it in R's memory (this is done by placing that part of the regular expression in round brackets). Then, we paste what we have stored back by using `\\1`, which grabs the first element in memory and puts it into the *replacement* part of the `gsub` function. 


In [None]:
hslld <- chitb %>%
  # rename id column
  dplyr::rename(id = colnames(chitb)[1]) %>%
  # clean file names
  dplyr::mutate(files = gsub(".*/(.*?)\\.cha", "\\1", files))
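
The capture-group idea can be checked on a couple of invented paths. The sketch below uses a slightly stricter variant of the pattern (`[^/]+` instead of `.*?`, and `$` to anchor the file ending), which is an assumption made here for clarity rather than the pattern used in this tutorial; the paths are hypothetical.

```r
# hypothetical file paths
toy_paths <- c("data/HSLLD/HV1/BR/acebr1.cha", "data/HSLLD/HV2/MT/admmt2.cha")
# keep only what stands between the last "/" and the file ending ".cha"
toy_names <- gsub(".*/([^/]+)\\.cha$", "\\1", toy_paths)
toy_names
# base R helpers that should give the same result for such paths
toy_names2 <- tools::file_path_sans_ext(basename(toy_paths))
```

The round brackets store the file name, and `\\1` in the replacement puts it back, so everything outside the brackets is dropped.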


Let's have a look at the data.



In [None]:
# inspect data
hslld %>% 
  head() %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::border_outer()


We now continue in the same manner (by removing what comes before and after the part that interests us) and thereby extract pieces of information that we store in new columns.

Creating a speaker column. We create a new column called `speaker` using the `mutate` function from the `dplyr` package. Then, we use the `str_remove_all` function from the `stringr` package to remove the first `:` and everything that comes after it. *Everything that comes after* can be defined by a regular expression - in this case the sequence `.*`. The `.` is a regular expression that stands for *any symbol* - be it a letter, a number, a punctuation symbol, or a white space. The `*` is a quantifier that tells R how many times the preceding expression (the `.`) is repeated - in our case, the `*` stands for *zero to an infinite number of times*. So the sequence `.*` stands for any symbol, repeated zero to an infinite number of times. In combination, the sequence `:.*` stands for *look for a colon and anything that comes after*. And because we have put this into the `str_remove_all` function, the colon and everything that comes after it is removed. In a second step, we remove all non-word symbols (`\\W`), such as the `*` in front of the speaker id.


In [None]:
hslld <- hslld %>%  
  dplyr::mutate(speaker = stringr::str_remove_all(rawcontent, ":.*"),
                speaker = stringr::str_remove_all(speaker, "\\W"))
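
As a quick check of the two removal steps, the following sketch applies them to two made-up raw utterances (the strings are invented for illustration).

```r
# made-up raw utterances
toy_raw <- c("*MOT: how are you ?", "*CHI: okay . %act: nods")
# step 1: remove the first colon and everything after it
spk <- stringr::str_remove_all(toy_raw, ":.*")
# step 2: remove all non-word symbols (here the leading asterisk)
spk <- stringr::str_remove_all(spk, "\\W")
spk
```

After the first step the strings are `*MOT` and `*CHI`; the second step strips the asterisk so that only the speaker id remains.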


In the following, we will create many different columns, but we will always follow the same scheme: generate a new column using the `mutate` function from the `dplyr` package and then remove everything we do not need using the `str_remove_all` function from the `stringr` package or the `gsub` function, which is a simple replacement function. We can also use `str_squish` to get rid of superfluous white spaces. The sequences we remove are always defined by a string (a sequence of characters) and a regular expression, which consists of a symbol that determines what type of character R is supposed to look for and a quantifier that tells R how many times that character can occur. For example, `%mor:.*` tells R to look for the sequence `%mor:` and any symbol, repeated between zero and an infinite number of times, that comes after it. As this is put into the `str_remove_all` function and applied to the `rawcontent` column, it removes the sequence `%mor:` itself and everything that comes after it.

Creating an utterance column.


In [None]:
hslld <- hslld %>%  
  dplyr::mutate(utterance = stringr::str_remove_all(rawcontent, "%mor:.*"),
                utterance = stringr::str_remove_all(utterance, "%.*"),
                utterance = stringr::str_remove_all(utterance, "\\*\\w{2,6}:"),
                utterance = stringr::str_squish(utterance))


Creating a column with the pos-tagged utterances.



In [None]:
hslld <- hslld %>%  
  dplyr::mutate(postag = stringr::str_remove_all(rawcontent, ".*%mor:"),
                postag = stringr::str_remove_all(postag, "%.*"),
                postag = stringr::str_remove_all(postag, "\\*\\w{2,6}:"),
                postag = stringr::str_squish(postag))


Creating a  column with comments. In the following chunk, we use the `?` in combination with `.*`. In this case, the `?` does not mean the literal symbol `?` but it tells R to be what is called *non-greedy* which means that R will look for something until the first occurrence of something. So the sequence `.*?%` tells R to look for any symbol repeated between zero and an infinite number of times until *the first occurrence*(!) of the symbol `%`. If we did not include the `?`, R would look until the last (not the first) occurrence of `%`.



In [None]:
hslld <- hslld %>%  
  dplyr::mutate(comment = stringr::str_remove_all(rawcontent, ".*%mor:"),
                comment = stringr::str_remove(comment, ".*?%"),
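                # remove remaining pos-tag material (anything containing a literal |);
                # the | must be escaped, otherwise ".*|.*" matches and removes everything
                comment = stringr::str_remove_all(comment, ".*\\|.*"),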
                comment = stringr::str_squish(comment))
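
The difference between greedy and non-greedy matching can be seen on a single made-up string with two `%` symbols (the tier content is invented for illustration).

```r
# made-up string with two dependent tiers
x <- "pro|I v|want . %gra: 1|2|SUBJ %act: shakes head"
# non-greedy: remove everything up to the FIRST %
non_greedy <- stringr::str_remove(x, ".*?%")
# greedy: remove everything up to the LAST %
greedy <- stringr::str_remove(x, ".*%")
non_greedy
greedy
```

The non-greedy version leaves `gra: 1|2|SUBJ %act: shakes head`, whereas the greedy version leaves only `act: shakes head`.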


Creating a  column with the participants that were present during the recording.



In [None]:
hslld <- hslld %>%  
  dplyr::mutate(participants = gsub(".*@Participants:(.*?)@.*", "\\1", fileinfo))


Creating a column with the age of the target child. In the following, the sequence `[0-9]{1,3}` means look for any sequence containing between 1 and 3 (this is defined by the `{1,3}`) digits (the digits are defined by the `[0-9]` part). Also, when we put `\\` before a symbol, we tell R that we mean the actual symbol and not its meaning as a regular expression. For example, the symbol `|` is a regular expression that means *or* as in *You can paint my walls blue OR orange*, but if we put `\\` before `|`, we tell R that we really mean the symbol `|`.



In [None]:
hslld <- hslld %>%
  dplyr::mutate(age_targetchild = gsub(".*\\|([0-9]{1,3};[0-9]{1,3}\\.[0-9]{1,3})\\|.*", "\\1", fileinfo)) 


Creating a  column with the age of the target child in years.



In [None]:
hslld <- hslld %>%
  dplyr::mutate(age_years_targetchild = stringr::str_remove_all(age_targetchild, ";.*")) 
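
The age pattern can be tried out on two invented header strings, one with and one without an age entry (both strings are made up and only imitate the general shape of an `@ID` line). The second string illustrates a property of `gsub` that matters later on: if the pattern does not match, the input is returned unchanged.

```r
# made-up header strings: the second one lacks the age entry
toy_info <- c("@ID: eng|HSLLD|CHI|3;07.08|female|||Target_Child|||",
              "@ID: eng|HSLLD|CHI||female|||Target_Child|||")
# extract the age (years;months.days) that stands between two literal | symbols
age <- gsub(".*\\|([0-9]{1,3};[0-9]{1,3}\\.[0-9]{1,3})\\|.*", "\\1", toy_info)
# keep only the years by removing the semicolon and everything after it
age_years <- gsub(";.*", "", age)
age_years
```

For the first string this yields the age `3;07.08` and the year `3`. For the second string, nothing is extracted and the whole header is kept, which is why overly long values in the age column point to files without a usable age.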


Creating a  column with the gender of the target child.



In [None]:
hslld <- hslld %>%
  dplyr::mutate(gender_targetchild = gsub(".*\\|([female]{4,6})\\|.*", "\\1", fileinfo))


Creating columns with the date-of-birth of the target child, more comments, and the date of the recording.



In [None]:
hslld <- hslld %>%  
  # create dob_targetchild column
  dplyr::mutate(dob_targetchild = gsub(".*@Birth of CHI:(.*?)@.*","\\1", fileinfo)) %>%
  # create comment_file column
  dplyr::mutate(comment_file = gsub(".*@Comment: (.*?)@.*", "\\1", fileinfo)) %>%
  # create date column
  dplyr::mutate(date = gsub(".*@Date: (.*?)@.*", "\\1", fileinfo))


Creating columns with the location where the recording took place and the situation type of the recording.



In [None]:
hslld <- hslld %>%  
  # create location column,
  dplyr::mutate(location = gsub(".*@Location: (.*?)@.*", "\\1", fileinfo)) %>%
  # create situation column
  dplyr::mutate(situation = gsub(".*@Situation: (.*?)@.*", "\\1", fileinfo))


Creating columns with the activity during the recording and the homevisit number.



In [None]:
hslld <- hslld %>%  
  # create homevisit_activity column
  dplyr::mutate(homevisit_activity = stringr::str_remove_all(situation, ";.*")) %>%
  # create activity column
  dplyr::mutate(activity = gsub(".*@Activities: (.*?)@.*", "\\1", fileinfo)) %>%
  # create homevisit column
  dplyr::mutate(homevisit = stringr::str_sub(files, 4, 6))


Creating a column with the number of words in each utterance.



In [None]:
hslld <- hslld %>%  
  # create words column
  dplyr::mutate(words = stringr::str_replace_all(utterance, "\\W", " "),
                words = stringr::str_squish(words),
                words = stringr::str_count(words, "\\w+"))
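
To see how the word count behaves, the sketch below runs the same three steps on two made-up utterances. Note that, under this operationalization, a contraction such as *don't* is counted as two words because the apostrophe is replaced by a white space.

```r
# made-up utterances
toy_utts <- c("How are you ?", "I don't know .")
# replace all non-word symbols with white spaces
w <- stringr::str_replace_all(toy_utts, "\\W", " ")
# remove superfluous white spaces
w <- stringr::str_squish(w)
# count sequences of word symbols
toy_words <- stringr::str_count(w, "\\w+")
toy_words
```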


Cleaning the data: removing rows without speakers, rows where the age of the target child was incorrect, and removing superfluous columns.



In [None]:
hslld <- hslld %>%  
  # remove rows without speakers (contain only metadata)
  dplyr::filter(speaker != "") %>%
  # remove rows with incorrect age of child
  dplyr::filter(nchar(age_years_targetchild) < 5) %>%
  # remove superfluous columns
  dplyr::select(-fileinfo, -rawcontent, -situation)  %>%
  # add collection and corpus columns
  dplyr::mutate(collection = "EngNA",
                corpus = "HSLLD") %>%
  dplyr::rename(transcript_id = files) %>%
    # code activity
  dplyr::mutate(visit = substr(transcript_id, 6, 6)) %>%
  dplyr::mutate(situation = substr(transcript_id, 4, 5),
                # patterns are anchored (^...$) so that only the two-letter codes are
                # replaced and not letter sequences inside labels inserted earlier
                situation = str_replace_all(situation, "^br$", "Book reading"),
                situation = str_replace_all(situation, "^er$", "Elicited report"),
                situation = str_replace_all(situation, "^et$", "Experimental task"),
                situation = str_replace_all(situation, "^lw$", "Letter writing"),
                situation = str_replace_all(situation, "^md$", "Mother defined situation"),
                situation = str_replace_all(situation, "^mt$", "Meal time"),
                situation = str_replace_all(situation, "^re$", "Reading"),
                situation = str_replace_all(situation, "^tp$", "Toy play"))
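
Two-letter codes can also be translated with a named lookup vector, which is sketched below as one possible alternative. The transcript ids are made up and only some of the codes are included; the last line illustrates why replacing a bare code such as `re` anywhere in a string is risky once longer labels have been inserted.

```r
# lookup vector: names are the codes, values are the labels (only some codes shown)
situation_labels <- c(br = "Book reading", er = "Elicited report",
                      mt = "Meal time", re = "Reading", tp = "Toy play")
# made-up transcript ids; characters 4 and 5 hold the situation code
toy_ids <- c("acebr1", "admmt2", "aimer3")
codes <- substr(toy_ids, 4, 5)
toy_situation <- unname(situation_labels[codes])
toy_situation
# pitfall: an unanchored replacement also hits "re" inside "reading"
pitfall <- gsub("re", "Reading", "Book reading")
pitfall
```

Indexing the named vector by the codes returns the matching labels in one step, and no label can be altered by a later replacement.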


In [None]:
# inspect data
hslld %>% 
  head() %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::border_outer()


Now that we have the data in a format that we can use, we can use this table to continue with our case studies.

## Case study 1: Use of NO {-}

To extract all instances of a single word, in this example the word *no*, that are uttered by a specific interlocutor, we filter by speaker and retain only rows where the speaker is `CHI` (the target child).


In [None]:
no <- hslld %>%
  dplyr::filter(speaker == "CHI") %>%
  dplyr::filter(stringr::str_detect(utterance, "\\b[Nn][Oo]\\b")) 


In [None]:
# inspect data
no %>% 
  head() %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::border_outer()


We summarize the results in a table. 



In [None]:
no_no <- no %>%
  dplyr::group_by(transcript_id, gender_targetchild, age_years_targetchild) %>%
  # n() counts the rows per group (nrow(.) would return the total number of rows)
  dplyr::summarise(nos = dplyr::n())
head(no_no)
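
Counting rows per group can be checked on a small made-up table. In the sketch below, `dplyr::n()` returns the number of rows in the current group, so transcript `a` (two rows) and transcript `b` (one row) receive different counts; the data are invented for illustration.

```r
library(dplyr)
# made-up utterances containing "no" from two transcripts
toy_no <- data.frame(transcript_id = c("a", "a", "b"),
                     utterance = c("no .", "no no !", "no ."))
# count the number of rows (utterances) per transcript
toy_counts <- toy_no %>%
  group_by(transcript_id) %>%
  summarise(nos = n())
toy_counts
```

Note that this counts utterances that contain *no*, not individual tokens: the utterance `no no !` contributes one row.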


We can also extract the number of words uttered by children to check if the use of *no* shows a relative increase or decrease over time.



In [None]:
no_words <- hslld %>%
  dplyr::filter(speaker == "CHI") %>%
  dplyr::group_by(transcript_id, gender_targetchild, age_years_targetchild) %>%
  dplyr::mutate(nos = stringr::str_detect(utterance, "\\b[Nn][Oo]\\b")) %>%
  dplyr::summarise(nos = sum(nos),
                   words = sum(words)) %>%
  # add relative frequency
  dplyr::mutate(freq = round(nos/words*1000, 3))
# inspect data
head(no_words)
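
The relative frequency is simply the number of hits divided by the number of words, multiplied by 1,000. A short worked example with invented counts:

```r
# invented counts for three transcripts
toy_nos   <- c(3, 0, 5)
toy_total <- c(250, 120, 1000)
# hits per 1,000 words
toy_freq <- round(toy_nos / toy_total * 1000, 3)
toy_freq
```

Three hits in 250 words correspond to 12 hits per 1,000 words, which makes transcripts of different lengths comparable.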


We can also visualize the trends using the `ggplot` function. To learn how to visualize data in R see [this tutorial](https://slcladal.github.io/dviz.html).



In [None]:
no_words %>%
  dplyr::mutate(age_years_targetchild = as.numeric(age_years_targetchild)) %>%
  ggplot(aes(x = age_years_targetchild, y = freq)) +
  geom_smooth() +
  theme_bw() +
  labs(x = "Age of target child", y = "Relative frequency of NOs \n (per 1,000 words)")


## Case study 2: extracting questions {-}

Here, we want to extract all questions uttered by mothers. We operationalize questions as utterances containing a question mark.


In [None]:
questions <- hslld %>%
  dplyr::filter(speaker == "MOT") %>%
  dplyr::filter(stringr::str_detect(utterance, "\\?"))
# inspect data
head(questions)


We could now check if the rate of questions changes over time.



In [None]:
qmot <- hslld %>%
  dplyr::filter(speaker == "MOT") %>%
  dplyr::mutate(questions = ifelse(stringr::str_detect(utterance, "\\?") == T, 1,0),
                utterances = 1) %>%
  dplyr::group_by(age_years_targetchild) %>%
  dplyr::summarise(utterances = sum(utterances),
                questions = sum(questions),
                percent = round(questions/utterances*100, 2))
# inspect data
head(qmot)
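
As a small illustration of this operationalization, the sketch below computes the percentage of questions for four made-up utterances (the vector `toy_mot` is an assumption for illustration). Because the mean of a logical vector is the proportion of `TRUE` values, the percentage can be calculated directly from the output of `str_detect`.

```r
# made-up maternal utterances, for illustration only
toy_mot <- c("what is that ?", "look at the dog .",
             "do you want more ?", "good girl !")
# TRUE for utterances that contain a question mark
is_question <- stringr::str_detect(toy_mot, "\\?")
# proportion of TRUE values, expressed as a percentage
percent <- round(mean(is_question) * 100, 2)
percent
```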


In [None]:
qmot %>%
  dplyr::mutate(age_years_targetchild = as.numeric(age_years_targetchild)) %>%
  ggplot(aes(x = age_years_targetchild, y = percent)) +
  geom_smooth() +
  theme_bw() +
  labs(x = "Age of target child", y = "Percent \n (questions)")


## Case study 3: extracting aux + parts {-}

Here we want to extract all occurrences of an auxiliary plus a participle (e.g. *is swimming*) produced by mothers.


In [None]:
auxv <- hslld %>%
  dplyr::filter(speaker == "MOT") %>%
  dplyr::filter(stringr::str_detect(postag, "aux\\|\\S{1,} part\\|"))
# inspect data
head(auxv)


We can now extract all the participle forms from the pos-tagged utterances.



In [None]:
auxv_verbs <- auxv %>%
  dplyr::mutate(participle = gsub(".*part\\|(\\w{1,})-.*", "\\1", postag)) %>%
  dplyr::pull(participle)
head(auxv_verbs)
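
The two regular expressions can be tried out on made-up tag strings. The sketch assumes CHILDES-style tags of the form `pos|stem-SUFFIX`; the strings in `toy_tags` are invented for illustration and are not taken from the data.

```r
# made-up tag strings in an assumed "pos|stem-SUFFIX" format
toy_tags <- c("pro:sub|she aux|be&3S part|swim-PRESP",
              "pro:sub|I v|like n|dog-PL")
# step 1: keep strings in which an auxiliary is directly followed by a participle
has_aux_part <- stringr::str_detect(toy_tags, "aux\\|\\S{1,} part\\|")
# step 2: capture the stem between "part|" and the following hyphen
stems <- gsub(".*part\\|(\\w{1,})-.*", "\\1", toy_tags[has_aux_part])
stems
```

If a participle tag contains no hyphen after the stem, `gsub` finds no match and returns the whole string unchanged, so it is worth inspecting the extracted forms before counting them.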


In [None]:
auxv_verbs_df <- auxv_verbs %>%
  as.data.frame(.)  %>%
  dplyr::rename("verb" = colnames(.)[1]) %>%
  dplyr::group_by(verb) %>%
  dplyr::summarise(freq = n()) %>%
  dplyr::arrange(-freq) %>%
  head(20)
# inspect
head(auxv_verbs_df)


We can again visualize the results. In this case, we create a bar plot (see `geom_bar`).



In [None]:
auxv_verbs_df %>%
  ggplot(aes(x = reorder(verb, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  labs(x = "Verb", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 90))


## Case study 4: ratio of verbs to words {-}

Here we extract all lexical verbs and words uttered by children by year and then see if the rate of verbs changes over time.


In [None]:
nverbs <- hslld %>%
  dplyr::filter(speaker == "CHI") %>%
  dplyr::mutate(nverbs = stringr::str_count(postag, "^v\\|| v\\|"),
  age_years_targetchild = as.numeric(age_years_targetchild)) %>%
  dplyr::group_by(age_years_targetchild) %>%
  dplyr::summarise(words = sum(words),
                verbs = sum(nverbs)) %>%
  dplyr::mutate(verb.word.ratio = round(verbs/words, 3))
# inspect data
nverbs
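
To check what the verb pattern counts, we can apply it to made-up tag strings (again assuming tags of the form `pos|stem`; `toy_tags` is invented for illustration). The pattern only matches the tag `v|` at the beginning of the string or after a space, so tags such as `adv|` or `aux|` are not counted.

```r
# made-up tag strings, for illustration only
toy_tags <- c("pro:sub|I v|want inf|to v|go adv|home",
              "v|look",
              "adv|there aux|be&3S")
# number of lexical verb tags per string
n_verbs <- stringr::str_count(toy_tags, "^v\\|| v\\|")
n_verbs
```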


We can also visualize the results to show any changes over time. 



In [None]:
nverbs %>%
  ggplot(aes(x = age_years_targetchild, y = verb.word.ratio)) +
  geom_line() +
  coord_cartesian(ylim = c(0, 0.2)) +
  theme_bw() +
  labs(x = "Age of target child", y = "Verb-Word Ratio")


## Case study 5: type-token ratio over time {-}

Here we extract all tokens (words with repetition) and types (words without repetition) uttered by children by year and then see if the type-token ratio changes over time.

In a first step, we create a table with the age of the children in years. We then collapse all utterances of the children into one long utterance and clean this long utterance by removing non-word characters, digits, the transcription codes `xxx` and `zzz`, and superfluous white spaces.

> Tip: A more accurate way of doing this would be to create one utterance for each child per home visit, as this would give us a distribution of type-token ratios rather than a single value.


In [None]:
utterance_tb <- hslld %>%
  dplyr::filter(speaker == "CHI") %>%
  dplyr::group_by(age_years_targetchild) %>%
  dplyr::summarise(allutts = paste0(utterance, collapse = " ")) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(age_years_targetchild = as.numeric(age_years_targetchild),
                # clean utterance
                allutts = stringr::str_replace_all(allutts, "\\W", " "),
                allutts = stringr::str_replace_all(allutts, "\\d", " "),
                allutts = stringr::str_remove_all(allutts, "xxx"),
                allutts = stringr::str_remove_all(allutts, "zzz"),
                allutts = tolower(allutts)) %>%
  # remove superfluous white spaces
  dplyr::mutate(allutts = gsub(" {2,}", " ", allutts)) %>%
  dplyr::mutate(allutts = stringr::str_squish(allutts))
# inspect data
head(utterance_tb)


We now extract the number of tokens and the number of types and calculate the type-token ratio.



In [None]:
tokens <- stringr::str_count(utterance_tb$allutts, " ") +1
types <- stringr::str_split(utterance_tb$allutts, " ")
types <- sapply(types, function(x){
  x <- length(names(table(x)))
})
ttr <- utterance_tb %>%
  dplyr::mutate(tokens = tokens,
                types = types) %>%
  dplyr::select(-allutts) %>%
  dplyr::mutate(TypeTokenRatio = round(types/tokens, 3))
# inspect 
ttr
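
The calculation can be sketched on made-up data. The helper `ttr_of` and the data frame `toy` are hypothetical and only serve as an illustration; the second part follows the tip above and computes one ratio per (made-up) home visit rather than a single overall value.

```r
# type-token ratio of one cleaned, lower-cased string
ttr_of <- function(txt) {
  toks <- strsplit(txt, " ", fixed = TRUE)[[1]]
  length(unique(toks)) / length(toks)
}
# 4 types (the, dog, saw, cat) divided by 5 tokens
ttr_of("the dog saw the cat")

# made-up utterances from two hypothetical home visits
toy <- data.frame(visit = c("v1", "v1", "v2"),
                  utterance = c("the dog", "the cat", "no no no"))
# collapse the utterances of each visit, then compute one ratio per visit
per_visit <- sapply(split(toy$utterance, toy$visit),
                    function(u) ttr_of(paste(u, collapse = " ")))
per_visit
```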


Plot the type-token ratio against age of the target child.



In [None]:
ttr %>%
  ggplot(aes(x = age_years_targetchild, y = TypeTokenRatio)) +
  geom_line() +
  coord_cartesian(ylim = c(0, 0.75)) +
  theme_bw() +
  labs(x = "Age of target child", y = "Type-Token Ratio")


# Studying Sociolinguistic Variation{-}

This case study is a corpus-based study of sociolinguistic variation that asks whether swearing differs across social groups. In particular, it analyzes whether speakers of different age groups and genders differ in their use of swear words, i.e. whether old or young speakers, or men or women, swear more. The analysis is based on a sub-sample of the Irish component of the [International Corpus of English (ICE)](https://www.ice-corpora.uzh.ch/en.html). The case study is a simplified version of the analysis presented in [this paper](https://www.sciencedirect.com/science/article/pii/S0024384117304357) [@schweinberger2018swearing].

## Data processing{-}

In a first step, we load the data into R. Loading the corpus is somewhat awkward in this example because the data is stored in a server directory rather than on a local hard drive. The first code chunk below shows a generic tidyverse-style way of reading all text files in a local directory; the second chunk, which is the one used in this case study, reads the files from the server.


In [None]:
# load all txt files in the working directory (tidyverse style)
tbl <- list.files(pattern = "\\.txt$") %>% 
        purrr::map_chr(~ readr::read_file(.)) %>% 
        tibble::tibble(text = .)


In [None]:
# define path to corpus
corpuspath <- "https://slcladal.github.io/data/ICEIrelandSample/"
# define corpusfiles
files <- paste(corpuspath, "S1A-00", 1:20, ".txt", sep = "")
files <- gsub("[0-9]([0-9][0-9][0-9])", "\\1", files)
# load corpus files
corpus <- sapply(files, function(x){
  x <- readLines(x)
  x <- paste(x, collapse = " ")
  x <- tolower(x)
})
# inspect corpus
str(corpus)
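
The file names can also be built in a single step by zero-padding the file number with `sprintf`. The sketch below is an alternative, not the way the files are defined in this case study; it only constructs the names (nothing is downloaded) and checks that both constructions yield the same result. The object names `files_alt` and `files_orig` are introduced for illustration.

```r
corpuspath <- "https://slcladal.github.io/data/ICEIrelandSample/"
# zero-pad the file number to three digits: 1 -> "001", 20 -> "020"
files_alt <- paste0(corpuspath, sprintf("S1A-%03d.txt", 1:20))
# the two-step paste/gsub construction, for comparison
files_orig <- paste(corpuspath, "S1A-00", 1:20, ".txt", sep = "")
files_orig <- gsub("[0-9]([0-9][0-9][0-9])", "\\1", files_orig)
# both approaches produce identical file names
identical(files_alt, files_orig)
```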


If the corpus data is stored on your own computer (and not on a server, as is the case in the present example), you need to adapt the path. Simply replace the value of `corpuspath` with the path to the data on your computer (e.g. `"D:\\Uni\\UQ\\LADAL\\SLCLADAL.github.io\\data\\ICEIrelandSample"`). Make sure that the path ends with a path separator because the file names are appended directly to it.


**Data processing and extraction**

Now that the corpus data is loaded, we can prepare the searches by defining the search patterns. We use regular expressions to retrieve all variants of the swear words. The sequence `\\b` denotes a word boundary, while `[a-z]{0,3}` means that a stem such as *crap* can be followed by up to three lower-case letters (so that the search also retrieves *crappy*). For *ass*, the pattern instead uses `[ingedholes]{0,6}`, which allows up to six following characters drawn only from the letters listed in the brackets (so that the search also retrieves, e.g., *asses* and *asshole*). We separate the search patterns by `|`, which means *or*.


In [None]:
searchpatterns <- c("\\bass[ingedholes]{0,6}\\b|\\bbitch[a-z]{0,3}\\b|\\b[a-z]{0,}fuck[a-z]{0,3}\\b|\\bshit[a-z]{0,3}\\b|\\bcock[a-z]{0,3}\\b|\\bwanker[a-z]{0,3}\\b|\\bboll[io]{1,1}[a-z]{0,3}\\b|\\bcrap[a-z]{0,3}\\b|\\bbugger[a-z]{0,3}\\b|\\bcunt[a-z]{0,3}\\b")
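
To illustrate how the boundaries and the quantifier restrict the matches, the sketch below applies one component of the search pattern to a few made-up tokens (the vector `toy_tokens` is an assumption for illustration).

```r
# made-up tokens, for illustration only
toy_tokens <- c("crap", "crappy", "scrap", "crappiest")
# "scrap" fails at the initial word boundary,
# "crappiest" has more than three letters after the stem
crap_hits <- stringr::str_detect(toy_tokens, "\\bcrap[a-z]{0,3}\\b")
toy_tokens[crap_hits]
```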



After defining the search pattern(s), we extract the kwics (keyword(s) in context) of the swear words. 



In [None]:
# extract kwic
kwicswears <- quanteda::kwic(quanteda::tokens(corpus), searchpatterns, window = 10, valuetype = "regex")


In [None]:
# inspect data
kwicswears %>%
  as.data.frame() %>%
  head(10) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::border_outer()
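
The structure of a kwic table is easier to see on a tiny example. The sketch below uses a made-up sentence and assumes a recent version of `quanteda`, in which `kwic` operates on a tokens object; the document name `toy1` and the window size are arbitrary choices.

```r
# made-up text; the document name "toy1" is arbitrary
toy_txt <- c(toy1 = "well that was a crappy day and the film was crap too")
# keyword in context with three tokens on either side
toy_kwic <- quanteda::kwic(quanteda::tokens(toy_txt),
                           pattern = "\\bcrap[a-z]{0,3}\\b",
                           window = 3, valuetype = "regex")
toy_df <- as.data.frame(toy_kwic)
# preceding context, matched token, and following context
toy_df[, c("pre", "keyword", "post")]
```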


We now clean the kwic so that it is easier to see the relevant information.



In [None]:
kwicswearsclean <- kwicswears %>%
  as.data.frame() %>%
  dplyr::rename("File" = colnames(.)[1], 
                "StartPosition" = colnames(.)[2], 
                "EndPosition" = colnames(.)[3], 
                "PreviousContext" = colnames(.)[4], 
                "Token" = colnames(.)[5], 
                "FollowingContext" = colnames(.)[6], 
                "SearchPattern" = colnames(.)[7]) %>%
  dplyr::select(-StartPosition, -EndPosition, -SearchPattern) %>%
  dplyr::mutate(File = stringr::str_remove_all(File, ".*/"),
                File = stringr::str_remove_all(File, "\\.txt"))


In [None]:
# inspect data
kwicswearsclean %>% 
  head(10) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::border_outer()


We now create another kwic, but with much more context, because we want to extract the speaker who uttered the swear word. To this end, we remove everything that precedes the `$` symbol (as the speakers are identified by the characters that follow the `$` symbol), remove everything that follows the `>` symbol (which ends the speaker identification sequence), remove remaining white spaces, and convert the remaining characters to upper case. 



In [None]:
# extract kwic
kwiclong <- quanteda::kwic(quanteda::tokens(corpus), searchpatterns, window = 1000, valuetype = "regex")
kwiclong <- as.data.frame(kwiclong)
colnames(kwiclong) <- c("File", "StartPosition", "EndPosition", "PreviousContext", "Token", "FollowingContext", "SearchPattern")
kwiclong <- kwiclong %>%
  dplyr::select(-StartPosition, -EndPosition, -SearchPattern) %>%
  dplyr::mutate(File = stringr::str_remove_all(File, ".*/"),
         File = stringr::str_remove_all(File, "\\.txt"),
         Speaker = str_remove_all(PreviousContext, ".*\\$"),
         Speaker = str_remove_all(Speaker, ">.*"),
         Speaker = str_squish(Speaker),
         Speaker = toupper(Speaker)) %>%
  dplyr::select(Speaker)
# inspect results
head(kwiclong)
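
The individual cleaning steps can be traced on a made-up preceding context. The string `toy_pre` is an assumption about what the tokenized context might look like and is not taken from the corpus. Because `.*` is greedy, everything up to the last `$` is removed, so the speaker tag closest to the swear word is the one that is kept.

```r
# made-up preceding context (assumed format), for illustration only
toy_pre <- "< s1a-001 $ a > well i think < s1a-001 $ b > oh that was"
# remove everything up to and including the last "$"
speaker <- stringr::str_remove_all(toy_pre, ".*\\$")
# remove the closing ">" and everything after it
speaker <- stringr::str_remove_all(speaker, ">.*")
# remove surrounding white space and convert to upper case
speaker <- toupper(stringr::str_squish(speaker))
speaker
```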


We now add the speaker to our initial kwic. This way, we combine the swear word kwic with the speaker and, as we already have the file, we can use the file plus speaker identification to check if the speaker was a man or a woman.



In [None]:
swire <- cbind(kwicswearsclean, kwiclong)



In [None]:
# inspect data
swire %>% 
  head(10) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::border_outer()


Now, we inspect the extracted swear word tokens to check if our search strings have indeed captured swear words. 



In [None]:
# convert tokens to lower case
swire$Token <- tolower(swire$Token)
# inspect tokens
table(swire$Token)
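
Inspecting the tokens matters because a search pattern can also retrieve words that are not swear words. As a sketch, the component of the search pattern for *ass* is applied below to a made-up list of tokens (`toy_tokens` is an assumption for illustration): words such as *assess* or *assign* consist of the stem plus letters from the bracketed set and are therefore matched as well.

```r
# made-up tokens that start with or contain the string "ass"
toy_tokens <- c("ass", "asses", "assess", "assign", "assume", "class")
# apply the component of the search pattern that targets "ass"
matched <- stringr::str_detect(toy_tokens, "\\bass[ingedholes]{0,6}\\b")
toy_tokens[matched]
```

Tokens of this kind would have to be removed manually before the frequencies are calculated.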


FUCK and its variants are by far the most common swear words in our corpus. However, we do not need the type of swear word to answer our research question, and we thus summarize the table to show which speaker in which file has used how many swear words.



In [None]:
swire <- swire %>%
  dplyr::group_by(File, Speaker) %>%
  dplyr::summarise(Swearwords = n())


In [None]:
# inspect data
swire %>% 
  head(10) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::border_outer()


Now that we have extracted how many swear words the speakers in the corpus have used, we can load the biodata of the speakers.



In [None]:
# load bio data
bio <- read.table("https://slcladal.github.io/data/data01.txt", header = T, sep = "\t")


In [None]:
# inspect data
bio %>% 
  head(10) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::border_outer()


In [None]:
bio <- bio %>%
  dplyr::rename(File = text.id, 
         Speaker = spk.ref,
         Gender = sex,
         Age = age,
         Words = word.count) %>%
  dplyr::select(File, Speaker, Gender, Age, Words)


In [None]:
# inspect data
bio %>% 
  head(10) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::border_outer()


In a next step, we combine the table with the speaker information with the table showing the swear word use.



In [None]:
# combine frequencies and biodata
swire <- dplyr::left_join(bio, swire, by = c("File", "Speaker")) %>%
  # replace NA with 0
  dplyr::mutate(Swearwords = ifelse(is.na(Swearwords), 0, Swearwords),
                File = factor(File),
                Speaker = factor(Speaker),
                Gender = factor(Gender),
                Age = factor(Age))
# inspect data
head(swire)
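
The logic of the join can be illustrated with two small made-up tables (`toy_bio` and `toy_swears` are assumptions for illustration). A left join keeps every speaker in the biodata, so speakers without any swear words receive `NA`, which we then replace with 0.

```r
library(dplyr)
# made-up speaker metadata and swear-word counts, for illustration only
toy_bio <- data.frame(File = c("S1A-001", "S1A-001", "S1A-002"),
                      Speaker = c("A", "B", "A"),
                      Words = c(1200, 800, 1500))
toy_swears <- data.frame(File = "S1A-001", Speaker = "B", Swearwords = 4)
# keep all speakers from the biodata; unmatched speakers get NA, then 0
toy_joined <- dplyr::left_join(toy_bio, toy_swears, by = c("File", "Speaker")) %>%
  dplyr::mutate(Swearwords = ifelse(is.na(Swearwords), 0, Swearwords))
toy_joined
```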


In [None]:
# inspect data
swire %>% 
  head(10) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::border_outer()


We now clean the table by removing speakers for whom we lack information on age or gender. In addition, we add the total number of words, the total number of swear words, and the relative frequency of swear words (per 1,000 words) for each combination of age and gender.



In [None]:
# clean data
swire <- swire %>%
  dplyr::filter(!is.na(Gender),
         !is.na(Age)) %>%
  dplyr::group_by(Age, Gender) %>%
  dplyr::mutate(SumWords = sum(Words),
                SumSwearwords = sum(Swearwords),
                FrequencySwearwords = round(SumSwearwords/SumWords*1000, 3)) 


In [None]:
# inspect data
swire %>% 
  head(10) %>%
  flextable::flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::border_outer()


**Tabulating and visualizing the data**

We now summarize and visualize the data and exclude speakers between the ages of 0 and 18 as there are too few speakers within that age range to be representative.


In [None]:
swire %>%
  dplyr::filter(Age != "0-18") %>%
  dplyr::group_by(Age, Gender) %>%
  # one value per age and gender group
  dplyr::summarise(Swears_ptw = sum(Swearwords)/sum(Words)*1000) %>%
  tidyr::spread(Gender, Swears_ptw)


Now that we have prepared our data, we can plot swear word use by gender. 



In [None]:
swire %>%
  dplyr::filter(Age != "0-18") %>%
  dplyr::group_by(Age, Gender) %>%
  # one value per age and gender group
  dplyr::summarise(Swears_ptw = sum(Swearwords)/sum(Words)*1000) %>%
  ggplot(aes(x = Age, y = Swears_ptw, group = Gender, fill = Gender)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  theme_bw() +
  scale_fill_manual(values = c("orange", "darkgrey")) +
  labs(y = "Relative frequency \n swear words per 1,000 words")


The graph suggests that the genders do not differ in their use of swear words except for the age bracket from 26 to 41: men swear more among speakers aged between 26 and 33 while women swear more between 34 and 41 years of age. 

## Statistical analysis{-}

We now perform a statistical test, here a Configural Frequency Analysis (CFA), to check which groups in the data significantly over- or under-use swear words.


In [None]:
cfa_swear <- swire %>%
  dplyr::group_by(Gender, Age) %>%
  dplyr::summarise(Words = sum(Words),
                   Swearwords = sum(Swearwords)) %>%
  dplyr::mutate(Words = Words - Swearwords) %>%
  tidyr::gather(Type, Frequency,Words:Swearwords) %>%
  dplyr::filter(Age != "0-18")


After transforming the data, it has the following format.



In [None]:
# inspect data
head(cfa_swear, 20) %>%  
  as.data.frame() %>%
  flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::border_outer()


In [None]:
# define configurations
configs <- cfa_swear %>%
  dplyr::select(Age, Gender, Type)
# define counts
counts <- cfa_swear$Frequency


Now that configurations and counts are separated, we can perform the configural frequency analysis.



In [None]:
# perform cfa
cfa(configs,counts)$table %>%
  as.data.frame() %>%
  dplyr::filter(p.chisq < .1,
                stringr::str_detect(label, "Swear")) %>%
  dplyr::select(-z, -p.z, -sig.z, -sig.chisq, -Q)
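
The idea behind this type of analysis is to compare the observed frequency of each configuration with the frequency we would expect if the factors were independent. The sketch below illustrates this idea for a simple two-way table with made-up counts. It is not the procedure implemented in the `cfa` function, and the object names are assumptions for illustration.

```r
# made-up counts: rows are speaker groups, columns are word types
obs <- matrix(c(10, 30, 990, 970), nrow = 2,
              dimnames = list(Group = c("A", "B"),
                              Type = c("Swearwords", "Words")))
# expected counts if group and word type were independent
expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)
# Pearson residuals: positive = more frequent than expected, negative = less
pearson <- (obs - expd) / sqrt(expd)
round(pearson, 2)
```

In this toy table, group B uses more swear words than expected while group A uses fewer.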


After filtering out significant overuse of non-swear words from the results of the CFA, we find that men and women in the age bracket 26 to 33 use significantly more swear words than other groups in the data.

It has to be borne in mind, though, that this is merely a case study and that a more fine-grained analysis of a substantially larger data set would be necessary to get a more reliable impression.

# Citation & Session Info {-}

Schweinberger, Martin. `r format(Sys.time(), '%Y')`. *Corpus Linguistics with R*. Brisbane: The University of Queensland. url: https://slcladal.github.io/corplingr.html   (Version `r format(Sys.time(), '%Y.%m.%d')`).


In [None]:
@manual{schweinberger2021cl,
  author = {Schweinberger, Martin},
  title = {Corpus Linguistics with R},
  note = {https://slcladal.github.io/corplingr.html},
  year = {`r format(Sys.time(), '%Y')`},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address = {Brisbane},
  edition = {`r format(Sys.time(), '%Y.%m.%d')`}
}


In [None]:
sessionInfo()





# References{-}
