![Regular Expressions in R](https://slcladal.github.io/images/uq1.jpg)

# Regular Expressions in R

This tutorial introduces regular expressions and how they can be used when working with language data. The entire R markdown document for the sections below can be downloaded [here](https://slcladal.github.io/regex.Rmd).

This tutorial is aimed at beginners and intermediate users of R and showcases how to use regular expressions (or wild cards) in R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful functions and methods associated with regular expressions.


How can you search texts for complex patterns or combinations of patterns? This question will be answered in this tutorial, and at the end you will be able to perform very complex searches yourself. The key concept of this tutorial is that of a regular expression. A regular expression (also called *regex* or *regexp* for short) is a special sequence of characters (or string) that describes a search pattern. You can think of regular expressions as very powerful combinations of wildcards or as wildcards on steroids.

If you would like to get deeper into regular expressions, I can recommend Friedl (2006) and, in particular, chapter 17 of Peng (2020) for further study (the latter uses base R rather than tidyverse functions, but this does not affect the usefulness of its discussion of regular expressions). There is also a so-called cheatsheet about regular expressions written by Ian Kopacka and provided by RStudio.

**Preparation and session set up**

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it [here](https://slcladal.github.io/Intror.html). For this tutorial, we need to install certain *packages* from an R *library* so that the scripts shown below are executed without errors. Before turning to the rest of the tutorial, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, you can skip this section. Installing all of the packages may take some time (between 1 and 5 minutes), so you do not need to worry if it does not finish immediately.


In [None]:
# set options
options(stringsAsFactors = FALSE)     # do not convert strings to factors automatically
options("scipen" = 100, "digits" = 4) # suppress scientific notation
# install packages
install.packages("tidyverse")
install.packages("flextable")


In a next step, we load the packages.



In [None]:
library(tidyverse)
library(flextable)


Once you have installed the packages and initiated the session by executing the code shown above, you are good to go.

# Getting started with Regular Expressions

To put regular expressions into practice, we need some text that we will perform our searches on. In this tutorial, we will use texts from Wikipedia about grammar.


In [None]:
# read in first text
text1 <- readLines("https://slcladal.github.io/data/testcorpus/linguistics02.txt")
et <- paste(text1, collapse = " ")
# inspect example text
et


***

**You can also use your own data**

The code chunk below shows you how to load two files from your own computer. **BUT** to be able to load your own data, you first need to click on the folder symbol to the left of the screen:

![Colab Folder Symbol](https://slcladal.github.io/images/ColabFolder.png)

Then click on the upload symbol.

![Colab Upload Symbol](https://slcladal.github.io/images/ColabUpload.png)

Next, upload the files you want to analyze and then enter the respective file names in the `file` argument of the `scan` function. When you then execute the code (like the code chunk below), you will load your own data (in this case two plain-text files, each of which is collapsed into a single string).


In [None]:
# load first text and collapse it into a single string
mytext1 <- scan(file = "linguistics01.txt",
                what = "char",
                sep = "",
                quote = "",
                quiet = TRUE,
                skipNul = TRUE) %>%
  paste0(collapse = " ")
# load second text and collapse it into a single string
mytext2 <- scan(file = "linguistics02.txt",
                what = "char",
                sep = "",
                quote = "",
                quiet = TRUE,
                skipNul = TRUE) %>%
  paste0(collapse = " ")
# inspect
mytext1; mytext2


To apply the code and functions below to your own data, you will need to modify the code chunks and replace the data we use here with your own data object. 
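
As a minimal sketch of this replacement step, the code below writes a small temporary file, reads it in the same way as above, and stores the result in a new object. The file content and the object name `my_et` are hypothetical; in practice, you would use your uploaded file and pass your own object to the functions wherever the tutorial uses `et`.

```r
# create a temporary text file as a stand-in for an uploaded file (hypothetical content)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Grammar is a system of rules.", "It governs language."), tmp)

# read the file word by word and collapse the words into a single string
my_et <- scan(file = tmp, what = "char", sep = "", quote = "",
              quiet = TRUE, skipNul = TRUE)
my_et <- paste0(my_et, collapse = " ")

# inspect; my_et can now be used in place of et
my_et
```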

***


In addition, we will split the example text into words to have another resource we can use to understand regular expressions.


In [None]:
# split example text
set <- str_split(et, " ") %>%
  unlist()
# inspect
head(set)
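
Splitting on a single space can leave empty strings behind when a text contains consecutive spaces or other white space such as tabs. The following sketch uses a made-up string (`toy` is a hypothetical example, not the tutorial data) to compare splitting on `" "` with splitting on the regular expression `\\s+` (one or more white space characters).

```r
library(stringr)

# hypothetical example with a double space and a tab
toy <- "Grammar  is a\tsystem"

# splitting on a single space keeps an empty string and does not split at the tab
split_space <- str_split(toy, " ")[[1]]
split_space

# splitting on one or more white space characters yields clean words
split_ws <- str_split(toy, "\\s+")[[1]]
split_ws
```

Whether the simpler split is sufficient depends on how clean the text is; for the example text used here, inspecting `set` for empty strings is an easy check.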


Before we delve into using regular expressions, we will have a look at the regular expressions that can be used in R and also check what they stand for.

There are three basic types of regular expressions:

* regular expressions that stand for individual symbols and determine frequencies

* regular expressions that stand for classes of symbols

* regular expressions that stand for structural properties

The regular expressions below show the first type of regular expressions, i.e. regular expressions that stand for individual symbols and determine frequencies.

![Regular expressions that stand for individual symbols and determine frequencies.](https://slcladal.github.io/images/regextb1.png)
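
To make frequency markers (quantifiers) more concrete, here is a minimal sketch on a made-up vector. The vector `toy` is hypothetical, and because the table above is an image, the selection of quantifiers shown here is an assumption about its content. Each pattern is anchored with `^` and `$` so that the whole string has to match.

```r
library(stringr)

# hypothetical toy vector
toy <- c("wal", "walk", "walkk", "walkkk")

q_optional <- str_detect(toy, "^walk?$")     # zero or one k:     TRUE  TRUE FALSE FALSE
q_star     <- str_detect(toy, "^walk*$")     # zero or more k:    TRUE  TRUE  TRUE  TRUE
q_plus     <- str_detect(toy, "^walk+$")     # one or more k:    FALSE  TRUE  TRUE  TRUE
q_exact    <- str_detect(toy, "^walk{2}$")   # exactly two k:    FALSE FALSE  TRUE FALSE
q_atleast  <- str_detect(toy, "^walk{2,}$")  # two or more k:    FALSE FALSE  TRUE  TRUE
q_range    <- str_detect(toy, "^walk{1,2}$") # one or two k:     FALSE  TRUE  TRUE FALSE

# show which strings match the last pattern
toy[q_range]
```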


The regular expressions below show the second type of regular expressions, i.e. regular expressions that stand for classes of symbols.

![Regular expressions that stand for classes of symbols.](https://slcladal.github.io/images/regextb2.png)
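
The following sketch illustrates a few character classes on a made-up vector. The vector `toy` is hypothetical, and the classes shown are an assumption about what the table image above contains. The single-bracket form (e.g. `[:upper:]`) follows the usage in this tutorial and works with stringr functions; base R functions such as `grepl` would need the double-bracket form (e.g. `[[:upper:]]`).

```r
library(stringr)

# hypothetical toy vector
toy <- c("Grammar", "rules", "2022", "e.g.", "B2")

has_digit   <- toy[str_detect(toy, "[0-9]")]      # contains a digit
starts_caps <- toy[str_detect(toy, "^[:upper:]")] # begins with an upper case letter
has_punct   <- toy[str_detect(toy, "[:punct:]")]  # contains punctuation
only_lower  <- toy[str_detect(toy, "^[a-z]+$")]   # consists of lower case letters only

has_digit    # "2022" "B2"
starts_caps  # "Grammar" "B2"
has_punct    # "e.g."
only_lower   # "rules"
```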




The regular expressions that denote classes of symbols are enclosed in `[]` and `:`. The last type of regular expressions, i.e. regular expressions that stand for structural properties, is shown below.




![Regular expressions that stand for structural properties.](https://slcladal.github.io/images/regextb3.png)
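
As a rough illustration of structural regular expressions, the sketch below applies word characters (`\\w`), white space (`\\s`), word boundaries (`\\b`), and the anchors `^` and `$` to a made-up sentence. The string `toy` is hypothetical, and the selection of expressions is an assumption about the content of the table image above.

```r
library(stringr)

# hypothetical example sentence
toy <- "Grammar is the system of a language."

words      <- str_extract_all(toy, "\\w+")[[1]]           # sequences of word characters
n_spaces   <- str_count(toy, "\\s")                       # number of white space characters
two_letter <- str_extract_all(toy, "\\b[a-z]{2}\\b")[[1]] # lower case words with exactly two letters
starts     <- str_detect(toy, "^Grammar")                 # string begins with "Grammar"
ends_dot   <- str_detect(toy, "\\.$")                     # string ends in a literal full stop

words       # "Grammar" "is" "the" "system" "of" "a" "language"
n_spaces    # 6
two_letter  # "is" "of"
```

Note that the full stop has to be escaped (`\\.`) to match a literal `.`, because an unescaped `.` stands for any character.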

# Practice

In this section, we will explore how to use regular expressions. At the end, we will go through some exercises to help you understand how you can best utilize regular expressions.

Show all words in the split example text that contain `a` or `n`.


In [None]:
set[str_detect(set, "[an]")]



Show all words in the split example text that begin with a lower case `a`.



In [None]:
set[str_detect(set, "^a")]



Show all words in the split example text that end in a lower case `s`.



In [None]:
set[str_detect(set, "s$")]



Show all words in the split example text in which there is an `e`, then any other character, and then an `n`.



In [None]:
set[str_detect(set, "e.n")]



Show all words in the split example text in which there is an `e`, then two other characters, and then an `n`.



In [None]:
set[str_detect(set, "e.{2,2}n")]
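
The effect of `.` and of the quantifier is easier to check on a small made-up vector. In this sketch (`toy` is a hypothetical example, not the tutorial data), `e.{2,2}n`, the shorter `e.{2}n`, and `e..n` all select the same words.

```r
library(stringr)

# hypothetical toy vector
toy <- c("fern", "lemon", "hen", "kernel", "even")

one_between <- toy[str_detect(toy, "e.n")]      # e, any one character, n
two_between <- toy[str_detect(toy, "e.{2,2}n")] # e, any two characters, n
two_short   <- toy[str_detect(toy, "e.{2}n")]   # shorter notation for the same pattern
two_dots    <- toy[str_detect(toy, "e..n")]     # same pattern written with two dots

one_between  # "fern" "kernel"
two_between  # "lemon" "even"
```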



Show all words that consist of exactly three alphabetical characters in the split example text.



In [None]:
set[str_detect(set, "^[:alpha:]{3,3}$")]



Show all words that consist of six or more alphabetical characters in the split example text.



In [None]:
set[str_detect(set, "^[:alpha:]{6,}$")]
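
Because the pattern is anchored with `^` and `$`, words with attached punctuation (such as a trailing comma) are not matched. The sketch below shows this on a made-up vector (`toy` is hypothetical) and one possible way around it, namely removing punctuation before filtering.

```r
library(stringr)

# hypothetical toy vector with attached punctuation
toy <- c("language", "system,", "rules", "grammar.")

# only words that consist entirely of six or more letters are matched
long_strict <- toy[str_detect(toy, "^[:alpha:]{6,}$")]
long_strict  # "language"

# remove punctuation first, then filter
toy_clean  <- str_remove_all(toy, "[:punct:]")
long_clean <- toy_clean[str_detect(toy_clean, "^[:alpha:]{6,}$")]
long_clean   # "language" "system" "grammar"
```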



Replace all lower case `a`s with upper case `E`s in the example text.



In [None]:
str_replace_all(et, "a", "E")
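
To see what the replacement does, it can help to try it on a short made-up string first. The sketch below (`toy` is a hypothetical example) contrasts `str_replace`, which only replaces the first match, with `str_replace_all`, and uses word boundaries (`\\b`) to restrict the replacement to the word *a*.

```r
library(stringr)

# hypothetical example string
toy <- "a grammar and a language"

first_only <- str_replace(toy, "a", "E")            # replaces only the first match
all_a      <- str_replace_all(toy, "a", "E")        # replaces every match
word_a     <- str_replace_all(toy, "\\ba\\b", "E")  # replaces a only when it is a word of its own

first_only  # "E grammar and a language"
all_a       # "E grEmmEr End E lEnguEge"
word_a      # "E grammar and E language"
```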



Remove all non-word characters in the split example text (`\\W` matches everything except letters, digits, and the underscore).



In [None]:
str_remove_all(set, "\\W")
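
Note that `\\W` keeps digits and underscores because they count as word characters. If the goal is to keep letters only, a negated character class is one option. The following sketch compares both patterns on a made-up vector (`toy` is a hypothetical example).

```r
library(stringr)

# hypothetical toy vector
toy <- c("rules,", "B2", "e.g.", "snake_case")

# remove non-word characters: digits and underscores are kept
no_nonword <- str_remove_all(toy, "\\W")
no_nonword  # "rules" "B2" "eg" "snake_case"

# remove everything that is not a letter
only_letters <- str_remove_all(toy, "[^[:alpha:]]")
only_letters  # "rules" "B" "eg" "snakecase"
```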



Remove all white spaces in the example text.



In [None]:
str_remove_all(et, " ")
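
The pattern `" "` only matches the space character. Tabs and line breaks are also white space and can be matched with `\\s`. The sketch below illustrates the difference on a made-up string (`toy` is hypothetical) and also shows `str_squish`, which reduces repeated white space to single spaces rather than removing it.

```r
library(stringr)

# hypothetical example with a space, a tab, and a line break
toy <- "a b\tc\nd"

no_space <- str_remove_all(toy, " ")    # removes only the space character
no_ws    <- str_remove_all(toy, "\\s")  # removes all white space characters
squished <- str_squish("  a   b  ")     # trims the ends and collapses repeated white space

no_space  # "ab\tc\nd"
no_ws     # "abcd"
squished  # "a b"
```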



**Highlighting patterns**

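We use the `str_view` and `str_view_all` functions to show the occurrences of regular expressions in the example text. (In stringr 1.5.0 and later, `str_view_all` is deprecated and `str_view` itself highlights all matches.)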

To begin with, we match an exactly defined pattern (`ang`).


In [None]:
str_view_all(et, "ang")



Now, we include `.`, which stands for any symbol (except a new line symbol).



In [None]:
str_view_all(et, ".n.")
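
Highlighting is useful for visual inspection, but the matches can also be retrieved as data. The sketch below uses a made-up string (`toy` is a hypothetical example) to extract, count, and locate matches with `str_extract_all`, `str_count`, and `str_locate_all`.

```r
library(stringr)

# hypothetical example string
toy <- "a given language change"

dot_matches <- str_extract_all(toy, ".n.")[[1]]  # any symbol, then n, then any symbol
n_ang       <- str_count(toy, "ang")             # number of times ang occurs
ang_starts  <- str_locate_all(toy, "ang")[[1]][, "start"]  # start positions of ang

dot_matches  # "en " "ang" "ang"
n_ang        # 2
ang_starts   # 10 20
```

The first match (`en ` with a trailing space) shows that `.` also matches a space.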



# Citation & Session Info 

Schweinberger, Martin. 2022. *Regular Expressions in R*. Brisbane: The University of Queensland. url: https://slcladal.github.io/regex.html.


In [None]:
sessionInfo()



# References 



Friedl, Jeffrey EF. 2006. *Mastering Regular Expressions*. Sebastopol, CA: O’Reilly Media.

Peng, Roger D. 2020. *R Programming for Data Science*. Leanpub. https://bookdown.org/rdpeng/rprogdatascience/.
