![An interactive LADAL notebook](https://slcladal.github.io/images/uq1.jpg)

# Concordancing with R

This tutorial is the interactive Jupyter notebook accompanying the [*Language Technology and Data Analysis Laboratory* (LADAL) tutorial **Concordancing with R**](https://ladal.edu.au/kwics.html). 


**Preparation and session set up**

We start by activating the packages we need for this tutorial.


In [None]:
# set options
options(warn = -1)  # suppress warnings
# activate packages
library(quanteda) # for concordancing
library(dplyr)    # for table processing
library(stringr)  # for text processing
library(writexl)  # for saving data
library(here)     # for easy pathing


<div class="warning" style='padding:0.1em; background-color: rgba(251,184,0,.5); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>If you are using this notebook on your own computer and you have not already installed the R packages listed above, you need to install them.<br> <a href=
"https://www.dataquest.io/blog/install-package-r/">
        <div class="text">
        <p style='margin-top:1em; text-align:center'>
            Here is a Dataquest post on how to install packages in R.
            </p>
        </div>
    </a>
    </b>
</p>
</span>
</div>

<br>
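
As a minimal sketch of the installation step, the chunk below checks which of the packages used in this notebook are missing and installs only those. The helper name `missing_packages` and the choice of CRAN mirror are assumptions made for illustration; they are not part of the original tutorial.

```r
# Packages used in this notebook
pkgs <- c("quanteda", "dplyr", "stringr", "writexl", "here")

# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install only what is missing (the mirror URL is an assumption; any CRAN mirror works)
to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
```

You only need to run this once per R installation; afterwards, the `library()` calls above are enough.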

## Using your own data

While the tutorial uses data from the LADAL website, you can also use your own data. To use your own data, follow the instructions below.

To load your own data, click on the folder symbol on the left side of the screen:

![Binder Folder Symbol](https://slcladal.github.io/images/binderfolder.JPG)


Then, when the menu has unfolded, click on the smaller folder symbol (encircled in red in the picture below).

![Small Binder Folder Symbol](https://slcladal.github.io/images/upload2.png)

Now, you are in the main menu and can click on the 'MyData' folder.

![MyData Folder Symbol](https://slcladal.github.io/images/upload3.png)

Now that you are in the MyData folder, you can click on the upload symbol.

![Binder Upload Symbol](https://slcladal.github.io/images/binderupload.JPG)

Select and upload the files you want to analyze. When you then execute the code chunk below, your own data is loaded into R and you can use it in this notebook.

<div class="warning" style='padding:0.1em; background-color: rgba(251,184,0,.5); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
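<b>IMPORTANT: here, we assume that you upload some form of text data - not tabular data! You can upload only txt and docx files! Note that the code chunk below reads files with `scan`, which expects plain text, so docx files may need to be converted to txt first.</b><br>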
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>


In [None]:
myfiles <- list.files(here::here("MyData"), # path to the corpus data
                      # full paths - not just the names of the files
                      full.names = TRUE)
# load files: read each file word by word and collapse it into one string
mytext <- sapply(myfiles, function(x){
  x <- scan(x, 
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = TRUE, 
            skipNul = TRUE)
  x <- paste(x, collapse = " ")
  stringr::str_squish(x)
})
# inspect
str(mytext)
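
If the folder may contain files of other types, one option is to restrict the file list to plain-text files with the `pattern` argument of `list.files`. The sketch below illustrates this on a temporary demo folder; the folder name `MyDataDemo` and the two demo files are made up for illustration and stand in for the `MyData` folder.

```r
# Assumption: a temporary folder stands in for the "MyData" folder
demo_dir <- file.path(tempdir(), "MyDataDemo")
dir.create(demo_dir, showWarnings = FALSE)

# Create two demo files: one plain-text file and one with another extension
writeLines("Alice was beginning to get very tired.", file.path(demo_dir, "a.txt"))
writeLines("placeholder content", file.path(demo_dir, "b.docx"))

# Keep only files whose names end in .txt (upper or lower case)
txt_files <- list.files(demo_dir,
                        pattern = "\\.txt$",
                        full.names = TRUE,
                        ignore.case = TRUE)
# inspect the selected file names
basename(txt_files)
```

The same `pattern` argument could be added to the `list.files` call above to skip anything that is not a txt file.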


<div class="warning" style='padding:0.1em; background-color: rgba(251,184,0,.5); color:#51247a'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>If you are using your own data, skip the next code chunk and rename `mytext` to `text` in the code chunk above.</b><br>
</p>
<p style='margin-left:1em;'>
</p></span>
</div>

<br>


If you do not use your own data, you can load the default data, Lewis Carroll's *Alice's Adventures in Wonderland*, by executing the following code chunk.


In [None]:
text <- base::readRDS(url("https://slcladal.github.io/data/alice.rda", "rb"))
# inspect first 6 text elements
head(text)


The data consists of many separate text elements. Next, we combine the elements into a single text, clean it by removing superfluous white spaces, and split it into individual words (this is called tokenising).



In [None]:
text <- text %>%
  # collapse lines into a single text
  paste0(collapse = " ") %>%
  # remove superfluous white spaces
  str_squish() %>%
  # tokenize
  tokens()
# inspect
head(text)


The text is now split into individual tokens (words and punctuation marks).
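
By default, `tokens` keeps punctuation marks as separate tokens. The following sketch, which uses a made-up demo sentence rather than the tutorial data, compares the default behaviour with `remove_punct = TRUE`, in case you prefer concordance contexts without punctuation.

```r
library(quanteda)

# Demo sentence (an assumption for illustration, not the tutorial data)
demo <- "Alice was beginning to get very tired, and of having nothing to do."

# Default tokenization: punctuation marks are kept as tokens
toks_all <- tokens(demo)

# Tokenization without punctuation marks
toks_words <- tokens(demo, remove_punct = TRUE)

# compare the number of tokens in both versions
ntoken(toks_all)
ntoken(toks_words)
```

Which version is preferable depends on the analysis: keeping punctuation preserves sentence boundaries in the context window, while removing it yields word-only contexts.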

## Creating simple concordances

Now we can extract concordances using the `kwic` function from the `quanteda` package. This function takes the following arguments:

+ `x`: a text or collection of texts (here, the tokenized text)  
+ `pattern`: a keyword defined by a search pattern  
+ `window`: the size of the context window (how many words before and after)  
+ `valuetype`: the type of pattern matching  
  + "glob" for "glob"-style wildcard expressions;  
  + "regex" for regular expressions; or  
  + "fixed" for exact matching  
+ `separator`: a character to separate words in the output  
+ `case_insensitive`: logical; if TRUE, ignore case when matching a pattern or dictionary values


In [None]:
mykwic <- kwic(
  # define text
  text, 
  # define target word (this is called the "search pattern")
  pattern = phrase("alice"),
  # 5 words before and after
  window = 5,
  # no regex
  valuetype = "fixed",
  # words separated by whitespace
  separator = " ",
  # search should be case insensitive
  case_insensitive = TRUE)
# inspect
mykwic %>%
  as.data.frame() %>%
  head()


## Exporting concordances

To export a concordance table as an MS Excel spreadsheet, we use `write_xlsx`. Note that we use the `here` function to build the file path relative to the project folder, so the file is saved in the `MyOutput` folder.


In [None]:
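# show the current working directory
getwd()
# show the full path the file will be written to
here::here("MyOutput/mykwic.xlsx")
# convert the concordance to a plain data frame and save it
write_xlsx(as.data.frame(mykwic), here::here("MyOutput/mykwic.xlsx"))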
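
Writing the file may fail if the target folder does not exist yet. The sketch below shows one way to guard against this by creating the folder first; it uses a temporary folder (`MyOutputDemo`) and a small made-up table in place of the `MyOutput` folder and the real concordance, so all names here are assumptions.

```r
library(writexl)

# Assumption: a temporary folder stands in for the "MyOutput" folder
out_dir <- file.path(tempdir(), "MyOutputDemo")

# Create the folder if it does not exist yet
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Made-up stand-in for a concordance table
demo_kwic <- data.frame(pre = "down the",
                        keyword = "rabbit",
                        post = "hole",
                        stringsAsFactors = FALSE)

# Save the table and confirm that the file was written
out_file <- file.path(out_dir, "mykwic.xlsx")
write_xlsx(demo_kwic, out_file)
file.exists(out_file)
```

In the notebook itself, the same check could be applied to `here::here("MyOutput")` before calling `write_xlsx`.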


[Back to LADAL](https://ladal.edu.au/kwics.html)

