![This is an interactive LADAL notebook.](https://slcladal.github.io/images/uq1.jpg)

***

Please copy this Jupyter notebook so that you are able to edit it.

Simply go to: File > Save a copy in Drive.

If you want to run this notebook on your own computer, you need to do 2 things:

1. Make sure that you have R installed.

2. You need to download the [bibliography file](https://slcladal.github.io/bibliography.bib) and store it in the same folder where you store the Rmd file.

Once you have done that, you are good to go.

***

# Concordancing with R

This tutorial introduces how to extract concordances and keyword-in-context (KWIC) displays with R. The entire R Notebook for the tutorial can be downloaded [here](https://slcladal.github.io/kwics.Rmd). If you want to render the R Notebook on your machine, i.e. knit the document to HTML or PDF, you need to make sure that you have R installed and you also need to download the [bibliography file](https://slcladal.github.io/bibliography.bib) and store it in the same folder where you store the Rmd file. 

This tutorial is aimed at beginners and intermediate users of R. It showcases how to extract keywords and key phrases from textual data and how to process the resulting concordances using R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods associated with concordancing.

In the language sciences, concordancing refers to the extraction of words from a given text or texts (Lindquist 2009, 5). Commonly, concordances are displayed in the form of keyword-in-context displays (KWICs) where the search term is shown in context, i.e. with preceding and following words. Concordancing is central to analyses of text and it often represents the first step in more sophisticated analyses of language data (Stefanowitsch 2020). Concordances play such a key role in the language sciences because they are extremely valuable for understanding how a word or phrase is used, how often it is used, and in which contexts it is used. As concordances allow us to analyze the context in which a word or phrase occurs and provide frequency information about word use, they also enable us to analyze collocations or the collocational profiles of words and phrases (Stefanowitsch 2020, 50-51). Finally, concordances are also very commonly used to extract examples. 
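To make the idea of a keyword-in-context display concrete, the following minimal sketch builds a tiny concordance by hand in base R. The toy sentence, the helper name `simple_kwic`, and the window size are made up for illustration; the rest of the tutorial relies on the `kwic` function from the `quanteda` package instead.

```r
# toy text (an assumption for illustration), split into lower-case tokens at spaces
text <- "natural selection acts slowly and natural selection acts everywhere"
tokens <- strsplit(tolower(text), " ", fixed = TRUE)[[1]]

# hypothetical helper: show each hit with up to `window` words on either side
simple_kwic <- function(tokens, keyword, window = 2) {
  hits <- which(tokens == keyword)
  idx <- seq_along(tokens)
  # words directly before each hit
  pre <- vapply(hits, function(i) {
    paste(tokens[idx >= i - window & idx < i], collapse = " ")
  }, character(1))
  # words directly after each hit
  post <- vapply(hits, function(i) {
    paste(tokens[idx > i & idx <= i + window], collapse = " ")
  }, character(1))
  data.frame(position = hits, pre = pre, keyword = tokens[hits], post = post)
}

# inspect the hand-made concordance
conc <- simple_kwic(tokens, "selection")
conc
```

Each row corresponds to one occurrence of the search term, which is why the number of rows doubles as the frequency of the term.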

![Concordance produced with AntConc.](https://slcladal.github.io/images/AntConcConcordance.png)

There are various very good software packages that can be used to create concordances, both for offline use, e.g. [*AntConc*](https://www.laurenceanthony.net/software/antconc/) (Anthony 2004), [*SketchEngine*](https://www.sketchengine.eu/) (Kilgarriff et al. 2004), [*MONOCONC*](https://www.monoconc.com/) (Barlow 1999), and [*ParaConc*](https://paraconc.com/) (Barlow 2002), and for online use (see e.g. [here](https://lextutor.ca/conc/)). 

In addition, many available corpora, such as the [BYU corpora](https://corpus.byu.edu/overview.asp), can be accessed via web interfaces that have built-in concordancing functions.  
  
![Concordance in the Corpus of Contemporary American English (COCA).](https://slcladal.github.io/images/KwicCocaLanguage.png)  
  

While these packages are very user-friendly and offer various additional functionalities, and while almost everyone who is engaged in analyzing language has used concordance software, they all suffer from shortcomings that render R a viable alternative. In particular, these applications  
  
* are black boxes: researchers do not have full control over them and do not know exactly what is going on within the software

* are not open source

* hinder replication because replicating an analysis is more time consuming than with analyses based on Notebooks

* are commonly not free of charge or have other restrictions on use (a notable exception is *AntConc*)

R represents an alternative to ready-made concordancing applications because it:

* is extremely flexible and enables researchers to perform their entire analysis in a single environment

* allows full transparency and documentation as analyses can be based on Notebooks

* offers version control measures (this means that the specific versions of the involved software are traceable)

* makes research more replicable as entire analyses can be reproduced by simply running the Notebooks that the research is based on 

The full transparency and replicability that R enables are particularly relevant given the ongoing *Replication Crisis* (see, e.g. Diener and Biswas-Diener 2018). The Replication Crisis is an ongoing methodological crisis, beginning in the early 2010s, that primarily affects parts of the social and life sciences (see also Fanelli 2009). Replication is important so that other researchers, or the public for that matter, can see or, indeed, reproduce, exactly what you have done. Fortunately, R allows you to document your entire workflow as you can store everything you do in what is called a script or a notebook (in fact, this document was originally an R notebook). If someone is then interested in how you conducted your analysis, you can simply share this notebook or the script you have written with that person.

**Preparation and session set up**

For this tutorial, we need to install certain *packages* into an R *library* so that the scripts shown below are executed without errors. Before turning to the rest of the tutorial, please install the packages by running the code below this paragraph. Installing all of the packages for the first time may take between 5 and 15 minutes, so you do not need to worry if it takes some time.


In [None]:
# install packages
install.packages("quanteda")
install.packages("tidyverse")
install.packages("gutenbergr")


Now that we have installed the packages, we activate them as shown below.



In [None]:
# set options
options(stringsAsFactors = FALSE)      # do not convert strings to factors
options("scipen" = 100, "digits" = 12) # suppress scientific notation
# activate packages
library(quanteda)
library(gutenbergr)
library(tidyverse)


Once you have installed the packages and initiated the session by executing the code shown above, you are good to go.

## Loading and processing textual data

For this tutorial, we will use Charles Darwin's *On the Origin of Species by means of Natural Selection* which we download from the [Project Gutenberg](https://www.gutenberg.org/) archive (see Stroube 2003). Thus, Darwin's *Origin of Species* forms the basis of our analysis. You can use the code below to download this text into R (but you have to have access to the internet to do so).


In [None]:
origin <- gutenberg_works(
  # define id of darwin's origin in project gutenberg
  gutenberg_id == "1228") %>%
  # download text
  gutenberg_download(meta_fields = "gutenberg_id", 
                     mirror = "http://mirrors.xmission.com/gutenberg/") %>%
  # remove empty rows
  dplyr::filter(text != "")
# inspect data
origin %>%
  # convert to data frame
  as.data.frame() %>%
  # show first 10 lines
  head(10) 


The table above shows that Darwin's *Origin of Species* requires formatting so that we can use it. Therefore, we collapse it into a single object (or text) and remove superfluous white spaces.



In [None]:
origin <- origin$text %>%
  # collapse lines into a single  text
  paste0(collapse = " ") %>%
  # remove superfluous white spaces
  str_squish()
# inspect data
origin %>%
  # show first 1000 characters
  substr(start=1, stop=1000) %>%
  # convert to data frame
  as.data.frame()


The result confirms that the entire text is now combined into a single character object. 
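What collapsing and squishing do is easier to see on a small made-up example than on the full book. The sketch below applies the same two steps to a toy vector of lines (the content of `toy_lines` is an assumption for illustration).

```r
library(stringr)

# toy lines with irregular spacing (stand-in for the downloaded text column)
toy_lines <- c("On the  Origin", "of   Species ", " by Means of", "Natural Selection")

# collapse the lines into one string and reduce runs of white space to single spaces
single_text <- str_squish(paste0(toy_lines, collapse = " "))

# inspect the result
single_text
length(single_text)
```

The four lines end up as one character string of length 1, with leading, trailing, and repeated white spaces removed.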

You can also use your own data. The code chunk below shows you how to upload two files from your own computer. To be able to load your own data, however, you first need to click on the folder symbol to the left of the screen:

![Colab Folder Symbol](https://slcladal.github.io/images/ColabFolder.png)

Then click on the upload symbol. 

![Colab Upload Symbol](https://slcladal.github.io/images/ColabUpload.png)

Next, upload the files you want to analyze and then enter the respective file names in the `file` argument of the `scan` function. When you then execute the code (like the code chunk below), you will load your own data.


In [None]:
mytext1 <- scan(file = "linguistics01.txt",
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = TRUE, 
            skipNul = TRUE) %>%
            paste0(collapse = " ")
mytext2 <- scan(file = "linguistics02.txt",
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = TRUE, 
            skipNul = TRUE) %>%
            paste0(collapse = " ")
# inspect
mytext1; mytext2


To apply the code and functions below to your own data, you will need to modify the code chunks and replace the data we use here with your own data object. 
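If you have more than a couple of files, typing one `scan` call per file becomes tedious. The sketch below shows one possible way to read every text file in a folder in a single step. It uses `readLines` rather than `scan`, and the folder name `my_corpus` and the two small files it writes are made up so that the example runs on its own.

```r
# create two small example files in a temporary folder (stand-ins for your own data)
corpus_dir <- file.path(tempdir(), "my_corpus")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines(c("Linguistics is the", "study of language."),
           file.path(corpus_dir, "linguistics01.txt"))
writeLines(c("Corpora are collections", "of texts."),
           file.path(corpus_dir, "linguistics02.txt"))

# list all txt files in the folder
my_files <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)

# read each file and collapse its lines into a single string
my_texts <- vapply(my_files, function(f) {
  paste0(readLines(f, warn = FALSE), collapse = " ")
}, character(1))

# use the bare file names as names of the texts
names(my_texts) <- basename(my_files)

# inspect
my_texts
```

The result is a named character vector with one element per file, which can be tokenized in the same way as the single text used in this tutorial.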



## Creating simple concordances

Now that we have loaded the data, we can easily extract concordances using the `kwic` function from the `quanteda` package. The `kwic` function takes the text (`x`) and the search pattern (`pattern`) as its main arguments but it also allows the specification of the context window, i.e. how many words/elements are shown to the left and right of the key word (we will go over this later on).


In [None]:
kwic_natural <- kwic(
  # define text
  origin, 
  # define search pattern
  pattern = "selection")
# inspect data
kwic_natural %>%
  # convert to data frame
  as.data.frame() %>%
  # show first 10 results
  head(10) 


You will see that you get a warning stating that you should use `tokens` before extracting concordances (more recent versions of `quanteda` may even stop with an error because they no longer accept plain character input). This can be done as shown below. Also, we can specify the package from which we want to use a function by adding the package name plus `::` before the function (see below).



In [None]:
kwic_natural <- quanteda::kwic(
  # define and tokenize text
  quanteda::tokens(origin), 
  # define search pattern
  pattern = "selection")
# inspect data
kwic_natural %>%
  # convert to data frame
  as.data.frame() %>%
  # show first 10 results
  head(10) 


We can easily extract the frequency of the search term (*selection*) using the `nrow` or the `length` functions which provide the number of rows of a table (`nrow`) or the length of a vector (`length`).



In [None]:
nrow(kwic_natural)



In [None]:
length(kwic_natural$keyword)



The results show that there are 414 instances of the search term (*selection*) but we can also find out how often different variants (lower case versus upper case) of the search term were found using the `table` function. This is especially useful when searches involve many different search terms (while it is, admittedly, less useful in the present example). 



In [None]:
table(kwic_natural$keyword)
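
Because `kwic` ignores case by default, upper and lower case variants end up in the same concordance, and `table` is what separates them again. The following minimal sketch illustrates this on a made-up sentence (the text in `toy` is an assumption for illustration) and also shows the `case_insensitive` argument of `kwic`, which restricts a search to exact case matches.

```r
library(quanteda)

# toy text containing three case variants of the search term
toy <- "Selection acts slowly. Natural selection and SELECTION by breeders differ."

# default search: case is ignored
toy_kwic <- kwic(tokens(toy), pattern = "selection", window = 3)
nrow(toy_kwic)
table(toy_kwic$keyword)

# case-sensitive search: only the lower-case variant is retrieved
toy_kwic_cs <- kwic(tokens(toy), pattern = "selection", case_insensitive = FALSE)
nrow(toy_kwic_cs)
```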



To get a better understanding of the use of a word, it is often useful to extract more context. This is easily done by increasing the size of the context window. To do this, we specify the `window` argument of the `kwic` function. In the example below, we set the context window size to 10 words/elements rather than using the default (which is 5 words/elements).



In [None]:
kwic_natural_longer <- kwic(
  # define and tokenize text
  quanteda::tokens(origin), 
  # define search pattern
  pattern = "selection", 
  # define context window size
  window = 10)
# inspect data
kwic_natural_longer %>%
  # convert to data frame
  as.data.frame() %>%
  head(10) 


## Extracting more than single words

While extracting single words is very common, you may want to extract more than just one word. To extract phrases, all you need to do is to specify that the pattern you are looking for is a phrase, as shown below.


In [None]:
kwic_naturalselection <- kwic(quanteda::tokens(origin), pattern = phrase("natural selection"))
# inspect data
kwic_naturalselection %>%
  # convert to data frame
  as.data.frame() %>%
  # show first 10 results
  head(10) 
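
Why `phrase` is needed becomes clearer when the same search is run with and without it. The sketch below uses a made-up sentence (an assumption for illustration). Without `phrase`, the pattern is compared against individual tokens, and since no single token contains a white space, the search is not expected to return any hits.

```r
library(quanteda)

# toy text with two occurrences of the phrase
toy_tokens <- tokens("Natural selection is slow, but natural selection is steady.")

# with phrase(): the pattern is treated as a sequence of two tokens
with_phrase <- kwic(toy_tokens, pattern = phrase("natural selection"), window = 2)
as.data.frame(with_phrase)

# without phrase(): the pattern is compared against single tokens
without_phrase <- kwic(toy_tokens, pattern = "natural selection", window = 2)
nrow(without_phrase)
```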


Of course you can extend this to longer sequences such as entire sentences. However, you may want to extract more abstract patterns rather than concrete words or phrases. To search for patterns rather than words, you need to include regular expressions in your search pattern. 


## Searches using regular expressions

Regular expressions allow you to search for abstract patterns rather than concrete words or phrases, which provides you with extreme flexibility in what you can retrieve. A regular expression (in short also called *regex* or *regexp*) is a special sequence of characters that describes a pattern. You can think of regular expressions as very powerful combinations of wildcards or as wildcards on steroids. For example, the sequence `[a-z]{1,3}` is a regular expression that stands for one up to three lower case characters and if you searched for this regular expression, you would get, for instance, *is*, *a*, *an*, *of*, *the*, *my*, *our*, and many other short words as results.
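As a small illustration of the `[a-z]{1,3}` example, the sketch below applies the expression to the words of a made-up sentence using base R. The anchors `^` and `$` are added here (they are not part of the expression discussed above) so that only words consisting entirely of one to three lower case letters are kept.

```r
# toy sentence (an assumption for illustration), split into words at spaces
sentence <- "it is a truth that the origin of species matters"
words <- strsplit(sentence, " ", fixed = TRUE)[[1]]

# keep words that consist of one to three lower case letters
short_words <- words[grepl("^[a-z]{1,3}$", words)]
short_words
```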

There are three basic types of regular expressions:

* regular expressions that stand for individual symbols and determine frequencies

* regular expressions that stand for classes of symbols

* regular expressions that stand for structural properties

The regular expressions below show the first type of regular expressions, i.e. regular expressions that stand for individual symbols and determine frequencies.


In [None]:
symbols1 <- c("?", "\\*", "\\+", "{n}", "{n,}", "{n,m}")
explanation1 <- c("The preceding item is optional and will be matched at most once", "The preceding item will be matched zero or more times", "The preceding item will be matched one or more times", "The preceding item is matched exactly n times", "The preceding item is matched n or more times", "The preceding item is matched at least n times, but not more than m times")
example1 <- c("walk[a-z]? = walk, walks", 
             "walk[a-z]* = walk, walks, walked, walking", 
             "walk[a-z]+ = walks, walked, walking", 
             "walk[a-z]{2} = walked", 
             "walk[a-z]{2,} = walked, walking", 
             "walk[a-z]{2,3} = walked, walking")
df_regex <- data.frame(symbols1, explanation1, example1)
colnames(df_regex) <- c("RegEx Symbol/Sequence", "Explanation", "Example")
# inspect data
df_regex %>%
  as.data.frame()
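
The examples in the table can be checked directly in base R. In the sketch below, each quantifier is tested against four forms of *walk*. The names of the patterns are made up for readability, and `^` and `$` are added so that a pattern has to cover the entire word.

```r
# word forms to test
forms <- c("walk", "walks", "walked", "walking")

# one pattern per quantifier (names are illustrative labels only)
quantified <- c(optional     = "^walk[a-z]?$",
                zero_or_more = "^walk[a-z]*$",
                one_or_more  = "^walk[a-z]+$",
                exactly_two  = "^walk[a-z]{2}$",
                two_or_more  = "^walk[a-z]{2,}$",
                two_to_three = "^walk[a-z]{2,3}$")

# for each pattern, list the forms it matches
matches <- lapply(quantified, function(p) forms[grepl(p, forms)])
matches
```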


The regular expressions below show the second type of regular expressions, i.e. regular expressions that stand for classes of symbols.



In [None]:
symbols2 <- c("[ab]", "[AB]", "[12]", "[:digit:]", "[:lower:]", "[:upper:]", "[:alpha:]", "[:alnum:]", "[:punct:]", "[:graph:]", "[:blank:]", "[:space:]", "[:print:]")
explanations2 <- c("lower case a or b", 
                   "upper case A or B", 
                   "digits 1 or 2", 
                   "digits: 0 1 2 3 4 5 6 7 8 9", 
                   "lower case characters: a–z", 
                   "upper case characters: A–Z", 
                   "alphabetic characters: a–z and A–Z", 
                   "digits and alphabetic characters", 
                   "punctuation characters: . , ; etc.", 
                   "graphical characters: [:alnum:] and [:punct:]", 
                   "blank characters: Space and tab", 
                   "space characters: Space, tab, newline, and other space characters", 
                   "printable characters: [:alnum:], [:punct:] and [:space:]")
df_regex <- data.frame(symbols2, explanations2)
colnames(df_regex) <- c("RegEx Symbol/Sequence", "Explanation")
# inspect data
df_regex %>%
  as.data.frame()
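
The character classes can be tried out on a short made-up string, as sketched below. One detail to be aware of: base R functions such as `gsub` and `gregexpr` expect these classes inside an additional pair of square brackets (e.g. `[[:digit:]]`), whereas functions from the `stringr` package also accept the shorter form shown in the table.

```r
# toy string (an assumption for illustration)
x <- "Darwin, C. (1859): 502 pages!"

# remove all punctuation characters
no_punct <- gsub("[[:punct:]]", "", x)
no_punct

# extract all upper case letters
upper <- regmatches(x, gregexpr("[[:upper:]]", x))[[1]]
upper

# extract all sequences of digits
numbers <- regmatches(x, gregexpr("[[:digit:]]+", x))[[1]]
numbers
```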


The regular expressions that denote classes of symbols are enclosed in `[]` and `:`. The last type of regular expressions, i.e. regular expressions that stand for structural properties, is shown below.



In [None]:
# each symbol is displayed as it is typed inside an R string, e.g. "\\w"
symbols3 <- c("\\\\w", "\\\\W", "\\\\s", "\\\\S", "\\\\d", "\\\\D", "\\\\b", "\\\\B", "\\\\<", "\\\\>", "^", "$")
explanations3 <- c("Word characters: [[:alnum:]_]",
                   "No word characters: [^[:alnum:]_]",
                   "Space characters: [[:space:]]",
                   "No space characters: [^[:space:]]",
                   "Digits: [[:digit:]]",
                   "No digits: [^[:digit:]]",
                   "Word edge",
                   "No word edge",
                   "Word beginning",
                   "Word end",
                   "Beginning of a string",
                   "End of a string")
df_regex <- data.frame(symbols3, explanations3)
colnames(df_regex) <- c("RegEx Symbol/Sequence", "Explanation")
# inspect data
df_regex %>%
  as.data.frame()
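
The structural symbols are easiest to understand by trying them out. The sketch below uses base R on made-up examples (the sentence and the word list are assumptions for illustration). `perl = TRUE` is set where `\\b` and `\\d` are used so that these symbols are interpreted as Perl-compatible regular expressions.

```r
# word boundary: extract words starting with "natu" from a toy sentence
sentence <- "an unnatural and a natural choice"
natu_words <- regmatches(sentence, gregexpr("\\bnatu\\w*", sentence, perl = TRUE))[[1]]
natu_words

# toy word list for the remaining symbols
words <- c("natural", "unnatural", "selection", "deselect", "1859")

# beginning of a string
starts_selec <- words[grepl("^selec", words)]
starts_selec

# end of a string
ends_al <- words[grepl("al$", words)]
ends_al

# strings that consist only of digits
only_digits <- words[grepl("^\\d+$", words, perl = TRUE)]
only_digits
```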


To include regular expressions in your KWIC searches, you include them in your search pattern and set the argument `valuetype` to `"regex"`. The search pattern `"\\bnatu.*|\\bselec.*"` retrieves elements that contain `natu` or `selec` followed by any characters and where the `n` in `natu` and the `s` in `selec` are at a word boundary, i.e. where they are the first letters of a word. Hence, our search would not retrieve words like *unnatural* or *deselect*. The `|` is an operator (like `+`, `-`, or `*`) that stands for *or*.



In [None]:
# define search patterns
patterns <- c("\\bnatu.*|\\bselec.*")
kwic_regex <- kwic(
  # define and tokenize text
  quanteda::tokens(origin), 
  # define search pattern
  patterns, 
  # define valuetype
  valuetype = "regex")
# inspect data
kwic_regex %>%
  # convert to data frame
  as.data.frame() %>%
  # show first 10 results
  head(10)
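
To see which words the search pattern keeps and which it leaves out, it can help to run it on a short made-up sentence that contains both kinds of words. The sketch below does this (the sentence is an assumption for illustration).

```r
library(quanteda)

# toy text with words that should and should not be retrieved
toy_tokens <- tokens("An unnatural choice, natural selection, and selective breeding deselect nothing.")

# regular expression search
toy_regex <- kwic(toy_tokens,
                  pattern = "\\bnatu.*|\\bselec.*",
                  valuetype = "regex",
                  window = 2)

# inspect the retrieved keywords
as.character(toy_regex$keyword)
```

Under these assumptions, *natural*, *selection*, and *selective* should be retrieved while *unnatural* and *deselect* should not, because in the latter two the sequences `natu` and `selec` do not start at a word boundary.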


## Piping concordances

Quite often, we only want to retrieve patterns if they occur in a certain context. For instance, we might be interested in instances of *selection* but only if the preceding word is *natural*. Such conditional concordances could be extracted using regular expressions but they are easier to retrieve by piping. Piping is done using the `%>%` operator, which is made available by the `dplyr` package (it originates from the `magrittr` package), and the piping sequence can be translated as *and then*. We can then filter those concordances that contain *natural* using the `filter` function from the `dplyr` package. Note that the `$` stands for the end of a string so that *natural$* means that *natural* is the last element in the string that is preceding the keyword.


In [None]:
kwic_pipe <- kwic(x = quanteda::tokens(origin), pattern = "selection") %>%
  # convert to data frame so that dplyr verbs can be applied
  as.data.frame() %>%
  dplyr::filter(stringr::str_detect(pre, "natural$|NATURAL$"))
# inspect data
kwic_pipe %>%
  # convert to data frame
  as.data.frame() %>%
  # show first 10 results
  head(10) 


Piping is a very useful technique and it is very frequently used in R - not only in the context of text processing but in all data science related domains.
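The filtering step itself does not depend on `quanteda` and can be tried out on a small hand-made table. The sketch below uses made-up preceding contexts (the content of `toy_conc` is an assumption for illustration) and shows one possible alternative to listing case variants by hand: the `regex` helper from `stringr` with `ignore_case = TRUE`.

```r
library(dplyr)
library(stringr)

# toy concordance table with a pre and a keyword column
toy_conc <- data.frame(
  pre = c("the theory of natural", "modified by sexual",
          "THROUGH NATURAL", "quite naturally by"),
  keyword = "selection",
  stringsAsFactors = FALSE)

# take the table and then keep rows where pre ends in "natural" (any case)
natural_only <- toy_conc %>%
  dplyr::filter(stringr::str_detect(pre, stringr::regex("natural$", ignore_case = TRUE)))

# inspect
natural_only
```

The row ending in *naturally by* is dropped because `$` requires *natural* to be the very last element of the preceding context.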

## Arranging concordances and adding frequency information

When inspecting concordances, it is useful to re-order the concordances so that they do not appear in the order that they appeared in the text or texts but by the context. To reorder concordances, we can use the `arrange` function from the `dplyr` package which takes the column according to which we want to re-arrange the data as its main argument. 

In the example below, we extract all instances of *natural* and then arrange the instances according to the content of the `post` column in alphabetical order.


In [None]:
kwic_ordered <- kwic(x = quanteda::tokens(origin), pattern = "natural") %>%
  # convert to data frame so that dplyr verbs can be applied
  as.data.frame() %>%
  dplyr::arrange(post)
# inspect data
kwic_ordered %>%
  # convert to data frame
  as.data.frame() %>%
  # show first 10 results
  head(10) 


Arranging concordances alphabetically may, however, not be the most useful option. A more useful option may be to arrange concordances according to the frequency of co-occurring terms or collocates. In order to do this, we need to extract the co-occurring words and calculate their frequency. We can do this by combining the `mutate`, `group_by`, and `n()` functions from the `dplyr` package with the `str_remove_all` function from the `stringr` package. Then, we arrange the concordances by the frequency of the collocates in descending order (that is why we put a `-` in the `arrange` function). In detail, we need to 

1. create a new variable or column which represents the word that co-occurs with, or, as in the example below, immediately follows the search term. In the example below, we use the `mutate` function to create a new column called `post_word`. We then use the `str_remove_all` function to remove everything except for the word that immediately follows the search term (we simply remove the first white space and everything that follows it).

2. group the data by the word that immediately follows the search term.

3. create a new column called `post_word_freq` which represents the frequencies of all the words that immediately follow the search term.

4. arrange the concordances by the frequency of the collocates in descending order.


In [None]:
kwic_ordered_coll <- kwic(
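  # define and tokenize text
  x = quanteda::tokens(origin), 
  # define search pattern
  pattern = "natural") %>%
  # convert to data frame so that dplyr verbs can be applied
  as.data.frame() %>%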
  # extract word following the keyword
  dplyr::mutate(post_word = str_remove_all(post, " .*")) %>%
  # group following words
  dplyr::group_by(post_word) %>%
  # extract frequencies of the following words
  dplyr::mutate(post_word_freq = n()) %>%
  # arrange/order by the frequency of the following word
  dplyr::arrange(-post_word_freq)
# inspect data
kwic_ordered_coll %>%
  # convert to data frame
  as.data.frame() %>%
  # show first 10 lines
  head(10) 


We could add more columns according to which we could arrange the concordance following the same schema. For example, we could add another column that represents the frequency of words that immediately precede the keyword and then arrange according to this column.
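A minimal sketch of this idea on a small hand-made table is shown below. The content of `toy_conc` and the column names `pre_word` and `pre_word_freq` are made up for illustration. To get the word immediately before the keyword, everything up to and including the last white space of the preceding context is removed.

```r
library(dplyr)
library(stringr)

# toy concordance table (an assumption for illustration)
toy_conc <- data.frame(
  pre = c("the theory of natural", "acting by means of",
          "the laws of natural", "effects of sexual"),
  keyword = "selection",
  stringsAsFactors = FALSE)

toy_ordered <- toy_conc %>%
  # extract word preceding the keyword (remove everything up to the last space)
  dplyr::mutate(pre_word = stringr::str_remove(pre, ".* ")) %>%
  # group preceding words
  dplyr::group_by(pre_word) %>%
  # extract frequencies of the preceding words
  dplyr::mutate(pre_word_freq = dplyr::n()) %>%
  # ungroup
  dplyr::ungroup() %>%
  # arrange by the frequency of the preceding word
  dplyr::arrange(dplyr::desc(pre_word_freq))

# inspect
as.data.frame(toy_ordered)
```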


We now extract the three words following the keyword (*selection*) and organize the concordances by the frequencies of the following words. We start by taking the concordances and creating a clean post column that is all in lower case and that does not contain any punctuation.


In [None]:
kwic_natural %>%
  # convert to data frame
  as.data.frame() %>%
  # create new CleanPost
  dplyr::mutate(CleanPost = stringr::str_remove_all(post, "[:punct:]"),
                CleanPost = stringr::str_squish(CleanPost),
                CleanPost = tolower(CleanPost))-> kwic_natural_following
# inspect
head(kwic_natural_following)


In a next step, we extract the first, second, and third words following the keyword.



In [None]:
kwic_natural_following %>%
  # extract first element after keyword
  dplyr::mutate(FirstWord = stringr::str_remove_all(CleanPost, " .*")) %>%
  # extract second element after keyword
  dplyr::mutate(SecWord = stringr::str_remove(CleanPost, ".*? "),
                SecWord = stringr::str_remove_all(SecWord, " .*")) %>%
  # extract third element after keyword
  dplyr::mutate(ThirdWord = stringr::str_remove(CleanPost, ".*? "),
                ThirdWord = stringr::str_remove(ThirdWord, ".*? "),
                ThirdWord = stringr::str_remove_all(ThirdWord, " .*")) -> kwic_natural_following
# inspect
head(kwic_natural_following)
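
As a possible alternative to chaining several `str_remove` calls, the `word` function from the `stringr` package extracts a word by its position. The sketch below applies it to a few made-up cleaned contexts (the content of `clean_post` is an assumption for illustration); it is not the approach used in the code above, merely another way of arriving at the same columns.

```r
library(stringr)

# toy cleaned post-contexts
clean_post <- c("acts only by the", "is a slow process", "will always tend")

# extract words by position
first_word  <- stringr::word(clean_post, 1)
second_word <- stringr::word(clean_post, 2)
third_word  <- stringr::word(clean_post, 3)

# inspect
data.frame(clean_post, first_word, second_word, third_word)
```

One difference worth checking on your own data is how short contexts are handled: the `str_remove` approach leaves a context unchanged if it contains no further white space, so a one-word context would show up again as second and third word.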


Next, we calculate the frequencies of the subsequent words and order in descending order from the first to the third word following the keyword.



In [None]:
kwic_natural_following %>%
  # calculate frequency of following words
  # 1st word
  dplyr::group_by(FirstWord) %>%
  dplyr::mutate(FreqW1 = n()) %>%
  # 2nd word
  dplyr::group_by(SecWord) %>%
  dplyr::mutate(FreqW2 = n()) %>%
  # 3rd word
  dplyr::group_by(ThirdWord) %>%
  dplyr::mutate(FreqW3 = n()) %>%
  # ungroup
  dplyr::ungroup() %>%
  # arrange by following words
  dplyr::arrange(-FreqW1, -FreqW2, -FreqW3) -> kwic_natural_following
# inspect results
head(kwic_natural_following, 10)


The results now show the concordance arranged by the frequency of the words following the keyword.

## Concordances from transcriptions

As many analyses use transcripts as their primary data and because transcripts have features that require additional processing, we will now perform concordancing based on transcripts. As a first step, we load five example transcripts that represent the first five files from the Irish component of the [International Corpus of English](https://www.ice-corpora.uzh.ch/en.html).


In [None]:
# define corpus files
files <- paste("https://slcladal.github.io/data/ICEIrelandSample/S1A-00", 1:5, ".txt", sep = "")
# load corpus files
transcripts <- sapply(files, function(x){
  x <- readLines(x)
  })
# inspect first 10 lines of first transcript
transcripts[[1]][1:10] %>%
  # convert to data frame
  as.data.frame()


The first ten lines shown above let us know that, after the header (`<S1A-001 Riding>`) and the symbol which indicates the start of the transcript (`<I>`), each utterance is preceded by a sequence which indicates the section, file, and speaker (e.g. `<S1A-001$A>`). The first utterance is thus uttered by speaker `A` in file `001` of section `S1A`. In addition, there are several sequences that provide meta-linguistic information which indicate the beginning of a speech unit (`<#>`), pauses (`<,>`), and laughter (`<&> laughter </&>`).

To perform the concordancing, we need to change the format of the transcripts because the `kwic` function only works on character, corpus, or tokens objects (with tokens objects being the recommended input), whereas in their present form, the transcripts represent a list which contains vectors of strings. To change the format, we collapse the individual utterances into a single character vector for each transcript.


In [None]:
transcripts_collapsed <- sapply(files, function(x){
  # read-in text
  x <- readLines(x)
  # paste all lines together
  x <- paste0(x, collapse = " ")
  # remove superfluous white spaces
  x <- str_squish(x)
})
# inspect data
transcripts_collapsed %>%
  # extract the first 500 characters  
  substr(start=1, stop=500) %>%
  # convert to data frame
  as.data.frame()


We can now extract the concordances. 



In [None]:
kwic_trans <- quanteda::kwic(
  # tokenize transcripts
  quanteda::tokens(transcripts_collapsed), 
  # define search pattern
  pattern = phrase("you know"))
# inspect resulting kwic
kwic_trans %>%
  # convert to data frame
  as.data.frame() %>%
  head(10) # first 10 results


The results show that each non-alphanumeric character is counted as a single word which reduces the context of the keyword substantially. Also, the *docname* column contains the full path to the data which makes it hard to parse the content of the table. To address the first issue, we specify a tokenizer that splits the text at white spaces only so that the annotation is not disrupted too much. In addition, we clean the *docname* column and extract only the file name. Lastly, we will expand the context window to 10 so that we have a better understanding of the context in which the phrase was used.



In [None]:
kwic_trans <- quanteda::kwic(
  # tokenize transcripts
  quanteda::tokens(transcripts_collapsed, what = "fasterword"), 
  # define search
  pattern = phrase("you know"),
  # extend context
  window = 10) %>%
  # convert to data frame so that dplyr verbs can be applied
  as.data.frame() %>%
  # clean docnames
  dplyr::mutate(docname = str_replace_all(docname, ".*/([A-Z][0-9][A-Z]-[0-9]{1,3})\\.txt", "\\1"))
# inspect data
kwic_trans %>%
  # convert to data frame
  as.data.frame() %>%
  # show first 10 results
  head(10)
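
For file names that do not follow a fixed schema, a regular expression like the one above may be hard to write. A possible alternative, sketched below, is to strip the path with `basename` and the file extension with `file_path_sans_ext` from the `tools` package (which ships with R). The vector `files` is rebuilt here so that the example runs on its own.

```r
# full paths of the five transcripts
files <- paste0("https://slcladal.github.io/data/ICEIrelandSample/S1A-00", 1:5, ".txt")

# remove the path and the file extension
short_names <- tools::file_path_sans_ext(basename(files))

# inspect
short_names
```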


Extending the context can also be used to identify the speaker that has uttered the search pattern that we are interested in. We will do just that as this is a common task in linguistic analyses.

To extract speakers, we need to follow these steps:

1. Create normal concordances of the pattern that we are interested in.

2. Generate concordances of the pattern that we are interested in with a substantially enlarged context window size.

3. Extract the speakers from the enlarged context window size.

4. Add the speakers to the normal concordances using the `mutate` function from the `dplyr` package.


In [None]:
kwic_normal <- quanteda::kwic(
  # tokenize transcripts
  quanteda::tokens(transcripts_collapsed, what = "fasterword"), 
  # define search
  pattern = phrase("you know")) %>%
  as.data.frame()
kwic_speaker <- quanteda::kwic(
    # tokenize transcripts
  quanteda::tokens(transcripts_collapsed, what = "fasterword"), 
  # define search
  pattern = phrase("you know"), 
  # extend search window
  window = 500) %>%
  # convert to data frame
  as.data.frame() %>%
  # extract speaker (comes after $ and before >)
  dplyr::mutate(speaker = stringr::str_replace_all(pre, ".*\\$(.*?)>.*", "\\1")) %>%
  # extract speaker
  dplyr::pull(speaker)
# add speaker to normal kwic
kwic_combined <- kwic_normal %>%
  # add speaker
  dplyr::mutate(speaker = kwic_speaker) %>%
  # simplify docname
  dplyr::mutate(docname = stringr::str_replace_all(docname, ".*/([A-Z][0-9][A-Z]-[0-9]{1,3}).txt", "\\1")) %>%
  # remove superfluous columns
  dplyr::select(-to, -from, -pattern)
# inspect data
kwic_combined %>%
  # convert to data frame
  as.data.frame() %>%
  # extract first 10 concordances
  head(10)


The resulting table shows that we have successfully extracted the speakers (identified by the letters in the `speaker` column) and cleaned the file names (in the `docname` column).
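To see why the regular expression used for the speaker extraction returns the most recent speaker, it helps to apply it to a short made-up stretch of preceding context. The sketch below uses base R (`sub` with `perl = TRUE`) instead of `stringr`, and the two strings in `pre_context` are invented for illustration. The first `.*` is greedy and therefore consumes everything up to the last `$` in the string, so that the captured group is the speaker label closest to the keyword.

```r
# toy preceding contexts (invented for illustration)
pre_context <- c(
  "<S1A-001$A> <#> Well how did the riding go tonight <S1A-001$B> <#> It was good so it was <,>",
  "<S1A-002$C> <#> I was saying to her"
)

# keep only what stands between the last $ and the next >
speaker <- sub(".*\\$(.*?)>.*", "\\1", pre_context, perl = TRUE)

# inspect
speaker
```

In the first string, speaker `A` is skipped because a later speaker label (`B`) stands between it and the end of the context.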


# Citation & Session Info 

Schweinberger, Martin. 2021. *Concordancing with R*. Brisbane: The University of Queensland. url: https://slcladal.github.io/kwics.html.


In [None]:
sessionInfo()



# References 



Anthony, Laurence. 2004. “AntConc: A Learner and Classroom Friendly, Multi-Platform Corpus Analysis Toolkit.” Proceedings of IWLeL, 7–13.

Barlow, Michael. 1999. Monoconc 1.5 and Paraconc. *International Journal of Corpus Linguistics* 4(1): 173–84.

Barlow, Michael. 2002. "ParaConc: Concordance Software for Multilingual Parallel Corpora". In *Proceedings of the Third International Conference on Language Resources and Evaluation. Workshop on Language Resources in Translation Work and Research*, 20–24.

Diener, Edward, and Robert Biswas-Diener. 2018. "The Replication Crisis in Psychology." In Gilad Feldman, *HKU PSYCH2020 Fundamental of Social Psychology*, 6-18. NOBA. url: https://nobaproject.com/modules/the-replication-crisis-in-psychology.

Fanelli, Daniele. 2009. How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data. *PLoS One* 4(5): e5738.

Kilgarriff, Adam, Pavel Rychly, Pavel Smrz, and David Tugwell. 2004. Itri-04-08 the Sketch Engine. *Information Technology* 105: 116.

Lindquist, Hans. 2009. *Corpus Linguistics and the Description of English*. Edinburgh: Edinburgh University Press.

Stefanowitsch, Anatol. 2020. *Corpus Linguistics. A Guide to the Methodology*. Textbooks in Language Sciences. Berlin: Language Science Press.

Stroube, Bryan. 2003. Literary Freedom: Project Gutenberg. *XRDS: Crossroads, the ACM Magazine for Students* 10(1): 3.
