![](https://github.com/MartinSchweinberger/SLAT7829/blob/master/images/bannerSLAT7829.jpeg?raw=true)

# Case Study: Article Use in ESL 

This tutorial presents a case study on analyzing article use among L2 learners of English based on a subsample of the *International Corpus of Learner English* (ICLE).

This case study is a corpus-based study of article use. We test the hypothesis that learners of English who speak a Slavic L1 (Slavic languages do not have articles in the way English does) have more profound difficulties in using articles than learners with a German or Spanish language background, whose L1 is similar to English with respect to article use.



## Preparation and session set up

Activate required packages.


In [None]:
# load packages
library(dplyr)
library(stringr)
library(ggplot2)
library(quanteda)
library(udpipe)
library(here)
library(tidyr)


## Loading the corpus data

Loading corpus data consists of two steps: 

1. create a list of paths of the corpus files

2. loop over these paths and load the data in the files identified by the paths.

To create a list of corpus files, you could use the code chunk below (the code chunk assumes that the corpus data is in a folder called *ICLE* in your Rproject folder).


In [None]:
corpusfiles <- list.files(here::here("ICLE"), # path to the corpus data
                          # full paths - not just the names of the files
                          full.names = T) 
# inspect
head(corpusfiles)


You can then use the `sapply` function to loop over the paths and load the data into R using, e.g., the `scan` function as shown below. In addition to loading the file content, we also paste all the content together using the `paste0` function and remove superfluous white spaces using the `str_squish` function from the `stringr` package.



In [None]:
corpus <- sapply(corpusfiles, function(x){
  x <- scan(x, 
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = T, 
            skipNul = T)
  x <- paste0(x, sep = " ", collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect
str(corpus)


Once you have loaded your data into R, you can then continue with processing and transforming the data according to your needs.
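
As a minimal sketch of the two loading steps, the chunk below builds a small throwaway folder and runs the same sequence on it: list the files, read each one with `scan`, join the words, and squish the white space. The folder `demo_corpus`, the file names, and the texts are made up for illustration and are not part of ICLE; `paste(x, collapse = " ")` is used here as a plain way to join the words.

```r
library(stringr)

# hypothetical demo folder with two tiny "essays" (not ICLE data)
demo_dir <- file.path(tempdir(), "demo_corpus")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("The  cat sat", "on   the mat."), file.path(demo_dir, "GE_demo1.txt"))
writeLines(c("I like", "", "books."), file.path(demo_dir, "SP_demo1.txt"))

# step 1: collect the full paths of all files in the folder
demo_files <- list.files(demo_dir, full.names = TRUE)

# step 2: read each file and collapse it into a single string
demo_corpus <- sapply(demo_files, function(x) {
  x <- scan(x, what = "char", sep = "", quote = "", quiet = TRUE, skipNul = TRUE)
  x <- paste(x, collapse = " ")
  stringr::str_squish(x)
})

# the result is a named character vector: names are paths, values are texts
str(demo_corpus)
```

Because the names of the vector are the file paths, the file names can later be recovered from `names(demo_corpus)`.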


***

## Using your own data

You can also use your own data. You can see below what you need to do to upload and use your own data.

To be able to load your own data, you need to click on the folder symbol to the left of the screen:

![Binder Folder Symbol](https://slcladal.github.io/images/binderfolder.JPG)

Then, click on the `New Folder` symbol and create a new folder and call it `MyData`.

![Binder New Folder Symbol](https://slcladal.github.io/images/bindernewfolder.JPG)

Then click on the upload symbol and upload your files into the `MyData` folder.

![Binder Upload Symbol](https://slcladal.github.io/images/binderupload.JPG)

Select and upload the files you want to analyze (**IMPORTANT**: here, we assume that you upload some form of text data - not tabular data!). When you then execute the code chunk below, you will upload your own data and you can then use it in this notebook.


In [None]:
myfiles <- list.files(here::here("MyData"), # path to the corpus data
                          # full paths - not just the names of the files
                          full.names = T) 
# load your own files
mycorpus <- sapply(myfiles, function(x){
  x <- scan(x, 
            what = "char", 
            sep = "", 
            quote = "", 
            quiet = T, 
            skipNul = T)
  x <- paste0(x, sep = " ", collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect
str(mycorpus)


**Keep in mind, though, that you need to adapt the names of the texts in the code chunks below so that the code works on your own texts!**

***


## Data processing


Now that the corpus data is loaded, we extract the file names.


In [None]:
filenames <- names(corpus) %>%
  stringr::str_remove_all(".*/")  %>%
  stringr::str_remove_all("\\..*")
# inspect
head(filenames)
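
To make the two regular expressions more concrete, here is a small sketch on a single made-up path (the path and the file name `GE_demo1.txt` are hypothetical). The first pattern removes everything up to the last slash, the second removes everything from the first dot onwards.

```r
library(stringr)

# hypothetical path, for illustration only
demo_path <- "/home/user/project/ICLE/GE_demo1.txt"

# remove the directory part (everything up to the last "/")
no_dir <- stringr::str_remove_all(demo_path, ".*/")
# remove the extension (everything from the first "." onwards)
no_ext <- stringr::str_remove_all(no_dir, "\\..*")

no_dir
no_ext

# a base R alternative that should give the same name for simple file names
tools::file_path_sans_ext(basename(demo_path))
```

Note that the second pattern cuts at the first dot, so a file name that itself contains dots would be shortened further than just the extension.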


Next, we can clean the data.



In [None]:
corpus <- corpus %>%
  stringr::str_remove_all("<.*?>") %>%
  stringr::str_squish()
# inspect
substr(corpus[1], start=1, stop=200)
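
The cleaning step removes anything in angle brackets. The sketch below, on a made-up string with invented tags, shows why the non-greedy pattern `<.*?>` matters: the greedy variant `<.*>` would delete everything between the first `<` and the last `>`.

```r
library(stringr)

# made-up text with markup (not taken from ICLE)
demo_raw <- "<doc id=1> The  government <b>should</b> act. "

# non-greedy: each tag is removed on its own
demo_clean <- stringr::str_squish(stringr::str_remove_all(demo_raw, "<.*?>"))

# greedy: one match spans from the first "<" to the last ">"
demo_greedy <- stringr::str_squish(stringr::str_remove_all(demo_raw, "<.*>"))

demo_clean
demo_greedy
```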


In the next step, we add part-of-speech tags. To do this, we download the English language model and then load this model into R.



In [None]:
# download language model
m_eng <- udpipe::udpipe_download_model(language = "english-ewt")


In [None]:
# load language model
m_eng <- udpipe_load_model(file = here::here("english-ewt-ud-2.5-191206.udpipe"))


After loading the language model, we can use it to pos-tag the essays. As a first check, we annotate only the first essay.



In [None]:
text_anndf <- udpipe::udpipe_annotate(m_eng, x = corpus[1]) %>%
  as.data.frame() %>%
  dplyr::select(-sentence)
# inspect
head(text_anndf, 10)


In [None]:
ger <- corpus[str_detect(filenames, "^GE")]
spa <- corpus[str_detect(filenames, "^SP")]
pol <- corpus[str_detect(filenames, "^PO")]
rus <- corpus[str_detect(filenames, "^RU")]
# inspect the German subcorpus
substr(ger[1], start=1, stop=200)
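
The chunk above splits the corpus by L1 using the prefix of each file name. As a small sketch of how this logical indexing works, and of how one might check how many texts fall under each prefix, consider the toy vectors below (texts and names are invented, and the two-letter prefixes are assumed to mark the L1).

```r
library(stringr)

# invented texts and file names
demo_texts <- c("text a", "text b", "text c")
demo_names <- c("GE_demo1", "SP_demo1", "GE_demo2")

# keep only the texts whose file name starts with "GE"
demo_ger <- demo_texts[stringr::str_detect(demo_names, "^GE")]
demo_ger

# count the texts per two-letter prefix
demo_counts <- table(substr(demo_names, 1, 2))
demo_counts
```

A prefix without any matching files simply yields an empty vector, so inspecting the counts first can help to spot misspelled prefixes.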


Pos-tag L1 German data.



In [None]:
# tokenise, tag, dependency parsing
ger_tagged <- udpipe::udpipe_annotate(m_eng, ger) %>%
  as.data.frame() %>%
  dplyr::group_by(doc_id) %>%
  summarise(tagged = paste(token, "/", xpos, collapse = " ", sep = "")) %>%
  pull(tagged)
# inspect tagged text
substr(ger_tagged[1], start=1, stop=200)


Pos-tag L1 Spanish data.



In [None]:
# tokenise, tag, dependency parsing
spa_tagged <- udpipe::udpipe_annotate(m_eng, spa) %>%
  as.data.frame() %>%
  dplyr::group_by(doc_id) %>%
  summarise(tagged = paste(token, "/", xpos, collapse = " ", sep = "")) %>%
  pull(tagged)
# inspect tagged text
substr(spa_tagged[1], start=1, stop=200)


Pos-tag L1 Polish data.



In [None]:
# tokenise, tag, dependency parsing
pol_tagged <- udpipe::udpipe_annotate(m_eng, pol) %>%
  as.data.frame() %>%
  dplyr::group_by(doc_id) %>%
  summarise(tagged = paste(token, "/", xpos, collapse = " ", sep = "")) %>%
  pull(tagged)
# inspect tagged text
substr(pol_tagged[1], start=1, stop=200)


Pos-tag L1 Russian data.



In [None]:
# tokenise, tag, dependency parsing
rus_tagged <- udpipe::udpipe_annotate(m_eng, rus) %>%
  as.data.frame() %>%
  dplyr::group_by(doc_id) %>%
  summarise(tagged = paste(token, "/", xpos, collapse = " ", sep = "")) %>%
  pull(tagged)
# inspect tagged text
substr(rus_tagged[1], start=1, stop=200)
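
The four tagging chunks apply the same steps to each subcorpus. The sketch below shows the collapsing step on its own, using a hand-built data frame that only mimics the `doc_id`, `token`, and `xpos` columns of the udpipe output, so it runs without a language model. The helper name `collapse_tags` is hypothetical and not part of any package.

```r
library(dplyr)

# hand-built stand-in for an annotated data frame (not real udpipe output)
demo_ann <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  token  = c("The", "cat", "sleeps", "Dogs", "bark"),
  xpos   = c("DT", "NN", "VBZ", "NNS", "VBP"),
  stringsAsFactors = FALSE
)

# hypothetical helper: one token/TAG string per document
collapse_tags <- function(ann) {
  ann %>%
    dplyr::group_by(doc_id) %>%
    dplyr::summarise(tagged = paste(token, "/", xpos, collapse = " ", sep = "")) %>%
    dplyr::pull(tagged)
}

demo_out <- collapse_tags(demo_ann)
demo_out
```

With such a helper, each subcorpus could be processed by passing the annotated data frame to the same function instead of repeating the pipeline.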


Now, we can extract all the nouns from the four tagged subcorpora. 



In [None]:
# extract nn
ger_nns <- quanteda::kwic(tokens(ger_tagged, what = "fastestword"), ".*NN.*", window = 10, valuetype = "regex", case_insensitive = F) %>% as.data.frame() %>% dplyr::select(-from, -to, -pattern) %>% dplyr::mutate(l1 = "ger")
spa_nns <- quanteda::kwic(tokens(spa_tagged, what = "fastestword"), ".*NN.*", window = 10, valuetype = "regex", case_insensitive = F) %>% as.data.frame() %>% dplyr::select(-from, -to, -pattern) %>% dplyr::mutate(l1 = "spa")
pol_nns <- quanteda::kwic(tokens(pol_tagged, what = "fastestword"), ".*NN.*", window = 10, valuetype = "regex", case_insensitive = F) %>% as.data.frame() %>% dplyr::select(-from, -to, -pattern) %>% dplyr::mutate(l1 = "pol")
rus_nns <- quanteda::kwic(tokens(rus_tagged, what = "fastestword"), ".*NN.*", window = 10, valuetype = "regex", case_insensitive = F) %>% as.data.frame() %>% dplyr::select(-from, -to, -pattern) %>% dplyr::mutate(l1 = "rus")
# combine into one table
nns <- rbind(ger_nns, spa_nns, pol_nns, rus_nns)
# inspect data
head(nns)
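
To illustrate what the concordance step returns, the sketch below runs the same `kwic` call on two hand-made tagged strings (invented examples in the token/TAG format). Splitting with `what = "fastestword"` breaks the text at spaces only, so each token keeps its tag.

```r
library(quanteda)
library(dplyr)

# hand-made tagged strings (not real learner data)
demo_tagged <- c("The/DT cat/NN sleeps/VBZ ./.", "Dogs/NNS bark/VBP ./.")

demo_kwic <- quanteda::kwic(
  quanteda::tokens(demo_tagged, what = "fastestword"),  # split on spaces only
  pattern = ".*NN.*",          # any token containing "NN"
  window = 10,                 # up to 10 tokens of context on each side
  valuetype = "regex",
  case_insensitive = FALSE
) %>%
  as.data.frame() %>%
  dplyr::select(-from, -to, -pattern)

demo_kwic
```

Each row holds one noun in `keyword`, with the preceding tokens in `pre` and the following tokens in `post`.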


Next, we determine the type of each noun and check whether a determiner, cardinal number, or pronoun (tags `DT`, `CD`, `PRP`) occurs in its left context after the last full stop, comma, or verb.



In [None]:
nns <- nns %>%
  dplyr::mutate(tag = stringr::str_remove_all(keyword, ".*/"),
                noun = stringr::str_remove_all(keyword, "/.*"),
                pre = stringr::str_remove_all(pre, ".*\\./\\."),
                pre = stringr::str_remove_all(pre, ".*,/,"),
                pre = stringr::str_remove_all(pre, ".*VB[A-Z]{0,1}")) %>%
  dplyr::mutate(article = ifelse(stringr::str_detect(pre, "DT|CD|PRP"), 1, 0))
# inspect
head(nns)
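
The sketch below walks through the trimming and detection steps on five invented left contexts, so that the effect of each pattern can be checked by hand. The strings are made up and only follow the token/TAG format used above.

```r
library(stringr)

# invented left contexts in token/TAG format
demo_pre <- c(
  "I/PRP saw/VBD the/DT big/JJ",   # determiner after the verb
  "He/PRP likes/VBZ",              # pronoun only before the verb
  "the/DT man/NN ,/, old/JJ",      # determiner only before the comma
  "The/DT dog/NN ./. My/PRP$",     # possessive after the full stop
  "the/DT cat/NN on/IN"            # determiner belongs to an earlier noun
)

# drop everything up to the last full stop, comma, or verb tag
cleaned <- stringr::str_remove_all(demo_pre, ".*\\./\\.")
cleaned <- stringr::str_remove_all(cleaned, ".*,/,")
cleaned <- stringr::str_remove_all(cleaned, ".*VB[A-Z]{0,1}")

# 1 if a determiner, cardinal number, or pronoun tag remains, else 0
demo_article <- ifelse(stringr::str_detect(cleaned, "DT|CD|PRP"), 1, 0)

data.frame(pre = demo_pre, cleaned = cleaned, article = demo_article)
```

The last case shows a limitation of this heuristic: a determiner that belongs to an earlier noun in the same stretch of text is still counted for the later noun.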


Tabulate the results.



In [None]:
nnstb <- nns %>%
  dplyr::group_by(tag, l1) %>%
  dplyr::summarise(Freq = n(),
                   Determiners = sum(article),
                   Percent = round(Determiners/Freq*100, 1))
# inspect results
nnstb
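
As a worked example of the tabulation, the sketch below applies the same grouping and summary to five invented rows, so the percentages can be verified by hand. The argument `.groups = "drop"` is an addition here that returns an ungrouped table.

```r
library(dplyr)

# invented rows: noun tag, L1, and whether a determiner was found
demo_nns <- data.frame(
  tag     = c("NN", "NN", "NN", "NNS", "NNS"),
  l1      = c("ger", "ger", "ger", "pol", "pol"),
  article = c(1, 1, 0, 0, 1)
)

demo_tb <- demo_nns %>%
  dplyr::group_by(tag, l1) %>%
  dplyr::summarise(Freq = n(),                     # number of nouns
                   Determiners = sum(article),     # nouns with a determiner
                   Percent = round(Determiners / Freq * 100, 1),
                   .groups = "drop")

demo_tb
```

Here two of the three singular nouns have a determiner (66.7 percent) and one of the two plural nouns does (50 percent).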


Visualize the results.



In [None]:
nnstb %>%
  dplyr::rename(Type = tag) %>%
  dplyr::mutate(l1 = dplyr::case_when(l1 == "ger" ~ "German",
                                      l1 == "spa" ~ "Spanish",
                                      l1 == "pol" ~ "Polish",
                                      l1 == "rus" ~ "Russian",
                                      T ~ l1),
                l1 = factor(l1, levels = c("German", "Spanish", "Polish", "Russian"))) %>%
  ggplot(aes(x = l1, y = Percent, fill = Type)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_text(aes(y = Percent+2, label = Percent), color = "gray20", size=3,position = position_dodge(0.9)) + 
  theme_bw() +
  scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9", "gray80"), 
                       name="Noun type",
                       breaks=c("NN", "NNS", "NNP", "NNPS"),
                       labels=c("Common noun singular", 
                                "Common noun plural", 
                                "Proper noun singular", 
                                "Proper noun plural")) +
  labs(y = "Percent of nouns preceded by \na determiner, cardinal number, or possessive pronoun", 
       x = "Language background of learners", 
       title = "Percentage of noun types preceded by determiners \nacross language backgrounds of English learners") 


We end the session by printing the session information, which shows the versions of R and of the packages we have used.



In [None]:
sessionInfo()



***

[Back to HOME](https://github.com/MartinSchweinberger/SLAT7829Tutorials)

***
