![](https://github.com/MartinSchweinberger/SLAT7829/blob/master/images/bannerSLAT7829.jpeg?raw=true)

# Case Study: Detecting errors in student writing

This tutorial shows how you can detect errors in student essays based on a subsample of the *International Corpus of Learner English* (ICLE).

This case study is a corpus-based study of orthographic errors (not grammatical or stylistic errors). The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods associated with corpus linguistics.

## Preparation and session set up

Activate required packages.


In [None]:
# load packages
library(dplyr)
library(stringr)
library(ggplot2)
library(quanteda)
library(quanteda.textstats)
library(hunspell)


## Loading the corpus data

Loading corpus data consists of two steps: 

1. create a list of paths of the corpus files

2. loop over these paths and load the data in the files identified by the paths.

To create a list of corpus files, you could use the code chunk below (the code chunk assumes that the corpus data is in a folder called *ICLE* in your Rproject folder).


In [None]:
corpusfiles <- list.files(here::here("ICLE"), # path to the corpus data
                          # full paths - not just the names of the files
                          full.names = TRUE)
# inspect
head(corpusfiles)


You can then use the `sapply` function to loop over the paths and load the data into R using, e.g., the `scan` function as shown below. In addition to loading the file content, we also paste all the content of a file together into a single string using the `paste0` function and remove superfluous white spaces using the `str_squish` function from the `stringr` package.



In [None]:
corpus <- sapply(corpusfiles, function(x){
  x <- scan(x, 
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = TRUE, 
            skipNul = TRUE)
  x <- paste0(x, collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect
str(corpus)
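
The two loading steps can also be tried without the ICLE files. The following is a minimal sketch that uses only base R; the folder `toycorpus`, the file names, and the texts are made up for illustration.

```r
# create a temporary folder with two tiny, made-up text files
# (hypothetical stand-ins for the corpus files)
tmpdir <- file.path(tempdir(), "toycorpus")
dir.create(tmpdir, showWarnings = FALSE)
writeLines(c("This  is   the first", "toy essay ."), file.path(tmpdir, "GE001.txt"))
writeLines(c("Here is a   second", "toy  essay ."), file.path(tmpdir, "SP001.txt"))
# step 1: list the full paths of the files
toyfiles <- list.files(tmpdir, full.names = TRUE)
# step 2: loop over the paths, read each file word by word,
# and collapse the words of each file into one string
toycorpus <- sapply(toyfiles, function(x) {
  x <- scan(x, what = "char", sep = "", quote = "", quiet = TRUE, skipNul = TRUE)
  paste(x, collapse = " ")
})
# inspect
str(toycorpus)
```

Because `scan` with `sep = ""` splits the input at any white space, line breaks and repeated spaces in the files do not survive into the collapsed strings.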


Once you have loaded your data into R, you can then continue with processing and transforming the data according to your needs.


***

## Using your own data

You can also use your own data. You can see below what you need to do to upload and use your own data.

To be able to load your own data, you need to click on the folder symbol to the left of the screen:

![Binder Folder Symbol](https://slcladal.github.io/images/binderfolder.JPG)

Then, click on the `New Folder` symbol and create a new folder and call it `MyData`.

![Binder New Folder Symbol](https://slcladal.github.io/images/bindernewfolder.JPG)

Then click on the upload symbol and upload your files into the `MyData` folder.

![Binder Upload Symbol](https://slcladal.github.io/images/binderupload.JPG)

Select and upload the files you want to analyze (**IMPORTANT**: here, we assume that you upload some form of text data - not tabular data!). When you then execute the code chunk below, you will load your own data into R and you can then use it in this notebook.


In [None]:
myfiles <- list.files(here::here("MyData"), # path to the corpus data
                          # full paths - not just the names of the files
                          full.names = TRUE) 
# load your own files
mycorpus <- sapply(myfiles, function(x){
  x <- scan(x, 
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = TRUE, 
            skipNul = TRUE)
  x <- paste0(x, collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect
str(mycorpus)


**Keep in mind though that you need to adapt the names of the texts in the code chunks below so that the code below works on your own texts!**

***


## Data processing


Now that the corpus data is loaded, we extract the file names.


In [None]:
filenames <- names(corpus) %>%
  stringr::str_remove_all(".*/")  %>%
  stringr::str_remove_all("\\..*")
# inspect
head(filenames)
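
To see what the two removal steps do, here is a small sketch in base R. The paths are hypothetical and only serve as an illustration; `sub` stands in for `str_remove_all`, which is an assumption of this sketch rather than part of the tutorial code.

```r
# hypothetical file paths (not actual ICLE file names)
paths <- c("/home/user/ICLE/GE001.txt", "/home/user/ICLE/SP002.txt")
# remove everything up to the last slash, then everything from the first dot
ids <- sub("\\..*", "", sub(".*/", "", paths))
# base R alternative: basename() drops the folder,
# file_path_sans_ext() drops the file extension
ids_alt <- tools::file_path_sans_ext(basename(paths))
# inspect
ids
```

The two approaches agree for simple names like these, but they differ when a file name contains more than one dot: the regular expression cuts at the first dot, while `file_path_sans_ext` only removes the final extension.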


Next, we clean the data by removing markup tags (anything between `<` and `>`) and superfluous white spaces.



In [None]:
corpus <- corpus %>%
  stringr::str_remove_all("<.*?>") %>%
  stringr::str_squish()
# add names
names(corpus) <- filenames
# inspect
substr(corpus[1], start=1, stop=200)
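
The pattern `<.*?>` is non-greedy, which matters when a text contains more than one tag. The sketch below illustrates this with base R on a made-up string (not taken from ICLE); `gsub` and `trimws` are used here as stand-ins for the `stringr` functions.

```r
# made-up example text with markup tags
raw <- "<ICLE-GE-001>  I  think <b>television</b> is   usefull . "
# non-greedy: each tag is removed separately, the text between tags is kept
no_tags <- gsub("<.*?>", "", raw, perl = TRUE)
# collapse repeated white space and trim both ends
clean <- trimws(gsub("\\s+", " ", no_tags))
# greedy, for comparison: everything from the first < to the last > is removed
greedy <- gsub("<.*>", "", raw, perl = TRUE)
# inspect
clean
greedy
```

With the greedy pattern, the words between the first and the last tag are lost, which is why the non-greedy version is the safer choice for texts with several tags.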


In the next step, we perform the error detection. To do this, we loop over the corpus and use the `hunspell()` function, which returns the words that are not found in the dictionary. Within the `hunspell()` function, we can also specify the dictionary (for example, whether we want to use a British or an American English dictionary) via the `dict` argument.



In [None]:
errors_gb <- sapply(corpus, function(x){
  x <- hunspell(x, dict = 'en_GB') })
# inspect 
head(errors_gb)


We can see that many proper nouns (names) are not in the dictionary and are thus detected as errors. In addition, we see that the text is written in American English, so we repeat the analysis with the American English dictionary.

 


In [None]:
errors_us <- sapply(corpus, function(x){
  x <- hunspell(x, dict = 'en_US') })
# inspect 
head(errors_us)
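
The effect of the `dict` argument can be checked on a short example. This is only a sketch: the sentence is made up, and it assumes that the `en_GB` and `en_US` dictionaries are available in your `hunspell` installation.

```r
library(hunspell)
# made-up sentence mixing British and American spellings
txt <- "The colour of the center was analysed"
# words not found in the British English dictionary
flagged_gb <- hunspell(txt, dict = "en_GB")
# words not found in the American English dictionary
flagged_us <- hunspell(txt, dict = "en_US")
# inspect
flagged_gb
flagged_us
```

Comparing the two results shows which spellings each dictionary accepts, which can help you decide which dictionary fits your texts.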


A better way to perform error detection is to create a table of the textual data in which each word occupies a separate row. This allows you to inspect each flagged word and to correct falsely detected errors.



In [None]:
texttable <- lapply(corpus, function(x){
  # split text into words
  x <- quanteda::tokens(x) %>%
    # flatten data
  unlist() %>%
    # convert into a data frame
  as.data.frame() %>%
    # rename first column
  dplyr::rename(words = 1) %>%
    # create an id column and check the spelling of each word
    # (error is TRUE if the word is in the dictionary, FALSE if it is flagged)
  dplyr::mutate(id = 1:nrow(.),
                error = hunspell::hunspell_check(words, dict = "en_US")) %>%
  # move the id column to the first position
  dplyr::relocate(id) })
# combine list of tables into one table
ttable <- do.call("rbind", texttable) %>%
  dplyr::mutate(corpus = stringr::str_replace_all(rownames(.), "(^[A-Z]{2}).*", "\\1"))
# inspect
head(ttable)


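We can now inspect the errors that were detected. Because `hunspell_check()` returns `TRUE` for words that it finds in the dictionary, the flagged words are the rows where `error` is `FALSE`.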



In [None]:
errors_raw <- ttable %>%
  dplyr::filter(error == FALSE)
# inspect
head(errors_raw, 20)


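We can now correct falsely detected errors: tokens that contain non-word characters (such as punctuation marks), capitalized tokens (mostly proper nouns), and individual words that we know to be correct are re-coded as correct (`TRUE`).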



In [None]:
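ttable <- ttable %>%
  # tokens containing non-word characters (e.g. punctuation) are not errors
  dplyr::mutate(error = ifelse(str_detect(words, "\\W"), TRUE, error)) %>%
  # capitalized tokens are treated as proper nouns, not errors
  dplyr::mutate(error = ifelse(str_detect(words, "^[A-Z]"), TRUE, error)) %>%
  # individual words can also be cleared by hand
  dplyr::mutate(error = ifelse(words == "Basinger", TRUE, error))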
# inspect
head(ttable, 20)
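
The logic of these rules can be followed on a toy table. The sketch below uses base R only; the words and the values in `error` are made up to mimic the output of the spell check (they are not actual `hunspell` results).

```r
# toy table: error is TRUE if a word counts as correct, FALSE if it is flagged
toy <- data.frame(words = c("I", "saw", "Basinger", ",", "realy", "well-known"),
                  error = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
                  stringsAsFactors = FALSE)
# tokens with a non-word character (punctuation, hyphens) are not errors
toy$error[grepl("\\W", toy$words)] <- TRUE
# capitalized tokens are assumed to be proper nouns, not errors
toy$error[grepl("^[A-Z]", toy$words)] <- TRUE
# show the tokens that remain flagged
toy[toy$error == FALSE, ]
```

Note that these rules are deliberately lenient: a misspelled word at the beginning of a sentence or a misspelled hyphenated word is also cleared, so the resulting counts are conservative.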


We can now re-inspect the errors that were detected.



In [None]:
errors_checked <- ttable %>%
  dplyr::filter(error == FALSE)
# inspect
head(errors_checked, 20)


Once the falsely detected errors have been corrected, we can tabulate the remaining errors by language background.



In [None]:
tb1 <- ttable %>%
  # rename levels of corpus
  dplyr::mutate(corpus = case_when(corpus == "GE" ~ "German",
                                   corpus == "SP" ~ "Spanish",
                                   corpus == "PO" ~ "Polish",
                                   corpus == "RU" ~ "Russian",
                                   TRUE ~ corpus)) %>%
  # groups by corpus
  dplyr::group_by(corpus) %>%
  # count flagged words (error == FALSE) and all words
  dplyr::summarise(errors = sum(!error),
                   words = n()) %>%
# calculate relative frequency
  dplyr::mutate(Frequency = round(errors/words*1000, 2))
# inspect
tb1
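
The relative frequency is the number of flagged words divided by the number of all words, multiplied by 1,000. The following sketch works through this calculation in base R on a made-up toy table (the group labels and values are assumptions for illustration only).

```r
# toy table: error is TRUE for correctly spelled words, FALSE for flagged words
toy <- data.frame(corpus = c("GE", "GE", "GE", "GE", "SP", "SP", "SP", "SP", "SP"),
                  error  = c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE))
# count flagged words and all words per group
errors <- tapply(!toy$error, toy$corpus, sum)
words  <- tapply(toy$error, toy$corpus, length)
# errors per 1,000 words
freq <- round(errors / words * 1000, 2)
# inspect
freq
```

Counting with `sum(!error)` stays correct even when a group has no flagged words at all (as for `SP` here). Indexing a frequency table by position would be fragile in that case, because the `FALSE` category is then missing and the first cell holds the count of correct words instead.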


Now, as a final step, we can visualize the results.



In [None]:
tb1 %>%
  ggplot(aes(x = corpus, y = Frequency)) +
  geom_bar(stat = "identity") +
  geom_text(aes(y = Frequency-5, label = Frequency), color = "white", size=5) +
  theme_bw() +
  labs(x = "L1 of student", y = "Relative Frequency \n(orthographic errors per 1,000 words)")


We end the session by calling the session info which tells us what packages and what version of the software and packages we have used.



In [None]:
sessionInfo()



***

[Back to HOME](https://github.com/MartinSchweinberger/SLAT7829Tutorials)

***
