![This is an interactive notebook that introduces R produced by the AcqVA Aurora Centre.](https://slcladal.github.io/images/acqva.jpg)

***

Please copy this Jupyter notebook so that you are able to edit it.

Simply go to: File > Save a copy in Drive.

If you want to run this notebook on your own computer, you need to do 2 things:

1. Make sure that you have R installed.

2. You need to download the [bibliography file](https://slcladal.github.io/bibliography.bib) and store it in the same folder where you store the Rmd file.

Once you have done that, you are good to go.

***

# A practical introduction to R

This tutorial shows how to get started with R, with a specific focus on using R to analyze language data. The R markdown document for this tutorial can be downloaded [here](https://slcladal.github.io/intror.Rmd). If you already have experience with R, both Wickham (2016) (see [here](https://r4ds.had.co.nz/)) and Gillespie and Lovelace (2016) (see [here](https://bookdown.org/csgillespie/efficientR/)) are excellent resources for improving your coding abilities and workflows in R.

## Goals of this tutorial

By the end of this tutorial, you should know how to:

- get started with R
- orient yourself in R and RStudio
- create and work in R projects
- find help and learn more about R
- work with data at a basic level: load data, save data, work with tables, and create a simple plot
- apply some best practices for using R scripts, data, and projects
- use objects, functions, and indexing at a basic level

## Audience

This tutorial is aimed at beginners: no prior knowledge of R is required.

If you want to know more, would like to get some more practice, or would like to have another approach to R, please check out the various online resources available to learn R (you can check out a very recommendable introduction [here](https://uvastatlab.github.io/phdplus/intror.html)). 

## Installing R and RStudio

* You have NOT yet installed **R** on your computer? 

  + You have a Windows computer? Then click [here](https://cran.r-project.org/bin/windows/base/R-4.0.2-win.exe) for downloading and installing R

  + You have a Mac? Then click [here](https://cran.r-project.org/bin/macosx/R-4.0.2.pkg) for downloading and installing R

* You have NOT yet installed **RStudio** on your computer?

  + Click [here](https://rstudio.com/products/rstudio/download/#download) for downloading and installing RStudio.

* You have NOT yet downloaded the **materials** for this workshop?

  + Click [here](https://slcladal.github.io/data/data.zip) to download the data for this session

  + Click [here](https://slcladal.github.io/cbs/intror_cb.Rmd) to download the Rmd-file of this workshop

You can find a more elaborate explanation of how to download and install R and RStudio [here](https://gitlab.com/stragu/DSH/blob/master/R/Installation.md).

### How to use the workshop materials

You can follow this workshop in different ways, depending on your preferences and your prior experience with R. The options below are ordered from least to most demanding:

* You can simply **sit back and follow** the workshop

* You can load the Rmd-file in RStudio and **execute the code snippets** in this Rmd-file as we go (we will talk about what Rmd-files are, how they work, and how to work in RStudio below)
  
  + If you decide to do this, I suggest that you use one section of your screen for Zoom (to see what I do) and another section to work within your own R project (we will see what an R project is below)

* You can load the Rmd-file in RStudio, **create a new Rmd-file** (or RNotebook) and then copy and paste the code snippets in this new Rmd-file and execute them as we go.

  + This option requires some knowledge of R and RStudio
  
  + If you decide to do this, I suggest that you use one section of your screen for Zoom (to see what I do) and another section to work within your own R project (as with the previous option)

Future workshops will be interactive and allow you to write your own code into code boxes on the website - unfortunately, I was not able to integrate that for this workshop.

# Preparation

Before you actually open R or RStudio, there are things to consider that make working in R much easier and give your workflow a better structure. 

Imagine it like this: when you want to write a book, you could simply take pen and paper and start writing *or* you could think about what you want to write about, what different chapters your book would consist of, which chapters to write first, what these chapters will deal with, etc. The same is true for R: you could simply open R and start writing code *or* you can prepare your session and structure what you will be doing.

## Folder Structure and R projects

Before actually starting with writing code, you should prepare the session by going through the following steps:

### 1. Create a folder for your project

In that folder, create the following sub-folders (you can, of course, adapt this folder template to match your needs)

  - data (you do not create this folder for the present workshop as you can simply use the data folder that you downloaded for this workshop instead)
  - images
  - tables
  - docs

The folder for your project could look like the one shown below.

![New folder](https://slcladal.github.io/images/RStudio_newfolder.png)

Once you have created your project folder, you can go ahead with RStudio.

### 3. Open RStudio

This is what RStudio looks like when you first open it: 

![RStudio: first time](https://slcladal.github.io/images/RStudio_empty.png)

In RStudio, click on `File` 
  
![RStudio: new file](https://slcladal.github.io/images/RStudio_file.png)

You can use the drop-down menu to create an `R project`.

### 4. R Projects

In RStudio, click on `New Project`
  
![RStudio: new project](https://slcladal.github.io/images/RStudio_newfile.png)

  
Next, confirm by clicking `OK` and select `Existing Directory`.

Then, navigate to where you have just created the project folder for this workshop.
  
![RStudio: existing directory](https://slcladal.github.io/images/RStudio_existingdirectory.png)

  
Once you click on `Open`, you have created a new `R project`.
  
### 5. RNotebooks
  
In this project, click on `File`
  
![RStudio: new file](https://slcladal.github.io/images/RStudio_file.png)
  
Click on `New File` and then on `RNotebook` as shown below.

![RStudio: R Notebook](https://slcladal.github.io/images/RStudio_newnotebook.png)

This `RNotebook` will be the file in which you do all your work.

### 6. Getting started with RNotebooks

You can now start writing in this RNotebook. For instance, you could start by changing the title of the RNotebook and describe what you are doing (what this Notebook contains).

Below is a picture of what this document looked like when I started writing it.

![RStudio: first R Notebook](https://slcladal.github.io/images/RStudio_editMD.png)
 

When you write in the RNotebook, you use what is called `R Markdown` which is explained below.

## R Markdown

The Notebook is an [R Markdown document](http://rmarkdown.rstudio.com/). An Rmd (R Markdown) file is more than a flat text document: it is a program that you can run in R and which allows you to combine prose and code, so readers can see the technical aspects of your work while reading about their interpretive significance. 

You can get a nice and short overview of the formatting options in R Markdown (Rmd) files [here](https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf).

R Markdown allows you to make your research fully transparent and reproducible! If a couple of years down the line another researcher or a journal editor asked you how you have done your analysis, you can simply send them the Notebook or even the entire R-project folder. 

As such, Rmd files are a type of document that allows you to 

+ include snippets of code (and any outputs such as tables or graphs) in plain text while 

+ encoding the *structure* of your document by using simple typographical symbols to encode formatting (rather than HTML tags or format types such as *Main header* or *Header level 1* in Word).  

Markdown is really quite simple to learn and these resources may help:

+ The [Markdown Wikipedia page](https://en.wikipedia.org/wiki/Markdown) includes a very handy chart of the syntax.

+ John Gruber developed Markdown and his [introduction to the syntax](https://daringfireball.net/projects/markdown/syntax) is worth browsing.

+ This [interactive Markdown tutorial](http://www.markdowntutorial.com/) will teach you the syntax in a few minutes.

# R and RStudio Basics

RStudio is a so-called IDE - Integrated Development Environment. The interface provides easy access to R. The advantage of this application is that R programs and files as well as a project directory can be managed easily. The environment is capable of editing and running program code, viewing outputs and rendering graphics. Furthermore, it is possible to view variables and data objects of an R-script directly in the interface. 

## RStudio: Panes

The GUI - Graphical User Interface - that RStudio provides divides the screen into four areas that are called **panes**:

1. File editor
2. Environment variables
3. R console
4. Management panes (File browser, plots, help display and R packages).

The two most important are the R console (bottom left) and the File editor (or Script in the top left).
The Environment variables and Management panes are on the right of the screen and they contain: 

* **Environment** (top): Lists all currently defined objects and data sets
* **History** (top): Lists all commands recently used or associated with a project
* **Plots** (bottom): Graphical output goes here
* **Help** (bottom): Find help for R packages and functions.  Don't forget you can type `?` before a function name in the console to get info in the Help section. 
* **Files** (bottom): Shows the files available to you in your working directory

These RStudio panes are shown below.

![RStudio: panes](https://slcladal.github.io/images/RStudioscreenshot.png)


### R Console (bottom left pane)

The console pane allows you to quickly and immediately execute R code. You can experiment with functions here, or quickly print data for viewing. 

Type next to the `>` and press `Enter` to execute. For example, typing `1 + 1` and pressing `Enter` returns `2`.

Here, the plus sign is the **operator**. Operators are symbols that represent some sort of action. However, R is, of course, much more than a simple calculator. To use R more fully, we need to understand **objects**, **functions**, and **indexing**, which we will learn about as we go.

For now, think of *objects as nouns* and *functions as verbs*. 
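
As a minimal sketch of these ideas, the lines below use an operator, create an object, and apply two functions to it. The object name `word_lengths` and its values are made up for illustration and are not part of the workshop data.

```r
# an operator: the plus sign adds two numbers
1 + 1

# an object (a "noun"): a vector with the lengths of four hypothetical words
word_lengths <- c(3, 5, 2, 6)

# functions ("verbs") do something with the object
total_length <- sum(word_lengths)     # adds up all values
average_length <- mean(word_lengths)  # calculates the average

# typing the name of an object prints its content
total_length
average_length
```

You can type these lines one by one into the console or run them from a script.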

## Running commands from a script

To run code from a script, insert your cursor on a line with a command, and press `CTRL/CMD+Enter`.

Or highlight some code to only run certain sections of the command, then press `CTRL/CMD+Enter` to run.

Alternatively, use the `Run` button at the top of the pane to execute the current line or selection (see below).

![RStudio: run code](https://slcladal.github.io/images/RStudio_run.png)
  

### Script Editor (top left pane)

In contrast to the R console, which quickly runs code, the Script Editor (in the top left) does not automatically execute code. The Script Editor allows you to save the code essential to your analysis.  You can re-use that code in the moment, refer back to it later, or publish it for replication.  

Now, that we have explored RStudio, we are ready to get started with R!

# Getting started with R

This section introduces some basic concepts and procedures that help optimize your workflow in R. 

## Setting up an R session

At the beginning of a session, it is common practice to define some basic parameters. This is not required, but it can help avoid problems further down the line. This session preparation may include specifying options. In the present case, we 

+ want R to treat text in tables as character strings rather than converting it to factors (this has been the default since R 4.0.0, so the option only matters for older versions of R)

+ want R to show numbers in ordinary (fixed) notation rather than in scientific notation (in scientific notation, 0.007 would be represented as `7e-03`); the higher the value of `scipen`, the more strongly R prefers fixed notation

+ want R to show maximally 100 results (otherwise, it can happen that R prints out page after page of numbers).

Again, the session preparation is not required, but it can help avoid errors. 


In [None]:
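# set options
options(stringsAsFactors = FALSE)  # keep text as character strings, not factors
options(scipen = 100)              # prefer fixed over scientific notation
options(max.print = 100)           # print at most 100 entries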


In the script editor pane of RStudio, this would look like this:

![RStudio: script editor](https://slcladal.github.io/images/RStudio_preparation.png)

## Packages

When using R, most functions are not loaded or even installed automatically. Instead, most functions are contained in what are called **packages**. 

R comes with about 30 packages ("base R").  There are over 10,000 user-contributed packages; you can discover these packages online.  A prevalent collection of packages is the Tidyverse, which includes ggplot2, a package for making graphics. 

Before being able to use a package, we need to install the package (using the `install.packages` function) and load the package (using the `library` function). However, a package only needs to be installed once(!) and can then simply be loaded. When you install a package, this will likely install several other packages it depends on.  You should have already installed tidyverse before the workshop. 

You must load the package in any new R session where you want to use that package. Below I show what you need to type when you want to install the `tidyverse`, `tidytext`, `quanteda`, `readxl`, `tm`, `tokenizers`, and `here` packages (which are the packages that we will need in this workshop).


In [None]:
# install packages (only needed once per computer)
install.packages("tidyverse")
install.packages("tidytext")
install.packages("quanteda")
install.packages("readxl")
install.packages("tm")
install.packages("tokenizers")
install.packages("here")


To load these packages, use the `library` function which takes the package name as its main argument.



In [None]:
# load packages from library
library(tidyverse)
library(tidytext)
library(quanteda)
library(readxl)
library(tm)
library(tokenizers)
library(here)


The session preparation section of your Rmd file will thus also state which packages a script relies on.

In the script editor pane of RStudio, the code blocks that install and activate packages would look like this:

![RStudio: install packages](https://slcladal.github.io/images/RStudio_packages.png)
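
Because a package only needs to be installed once, it can be handy to check which packages are still missing before installing anything. The following is a minimal sketch of one way to do this with the base R function `requireNamespace`; the object names `packages`, `installed`, and `missing_packages` are made up for illustration, and the installation line is commented out so that nothing is installed by accident.

```r
# packages used in this workshop
packages <- c("tidyverse", "tidytext", "quanteda", "readxl", "tm", "here")

# requireNamespace() returns TRUE if a package is installed
installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)

# names of the packages that still need to be installed
missing_packages <- packages[!installed]
missing_packages

# install only what is missing (remove the # to run this line)
# if (length(missing_packages) > 0) install.packages(missing_packages)
```

If `missing_packages` is empty, everything is already installed and you only need to load the packages with `library`.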
 

## Getting help

When working with R, you will encounter issues and face challenges. A very good thing about R is that it provides various ways to get help or find information about the issues you face.

### Finding help within R

To find out what functions a package contains, which arguments a function takes, or how to use a function, you can use the `help` function or the `apropos` function. Alternatively, you can simply type a `?` before the name of the function or package, or two question marks (`??`) if this does not give you any answers. 


In [None]:
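help(tidyverse)       # open the help page for the tidyverse package
apropos("tidyverse")  # list available objects whose names contain "tidyverse"
?require              # open the help page for the require function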


There are also other "official" help resources from R/RStudio. 

* Read official package documentation, see vignettes, e.g., Tidyverse <https://cran.r-project.org/package=tidyverse>

* Use the RStudio Cheat Sheets at <https://www.rstudio.com/resources/cheatsheets/>

* Use the RStudio Help viewer by typing `?` before a function or package

* Check out `Keyboard Shortcuts Help` under `Tools` in RStudio for some good tips 

### Finding help online

One great thing about R is that you can very often find an answer to your question online.

* Google your error! See <http://r4ds.had.co.nz/introduction.html#getting-help-and-learning-more> for excellent suggestions on how to find help for a specific question online.

# Working with tables

We will now start working with data in R. As most of the data that we work with comes in tables, we will focus on this first before moving on to working with text data.

## Loading data from the web

To show how data can be downloaded from the web, we will download a tab-separated txt-file. Translated to prose, the code below means *Create an object called* `icebio` *and in that object, store the result of the* `read.delim` *function*. 

`read.delim` stands for *read delimited file* and it takes the URL from which to load the data (or the path to the data on your computer) as its first argument. The second argument, `sep`, stands for separator, and `\t` specifies that the columns are separated by tabs. The third argument, `header`, can take either T(RUE) or F(ALSE) and tells R whether the data has column names (headers) or not. 

## Functions and Objects

In R, functions always have the following form: `function(argument1, argument2, ..., argumentN)`. Typically, a function does something to an object (e.g. a table), so the first argument usually specifies the data to which the function is applied. Other arguments then allow you to add further information. Just as a side note, functions are also objects, but they contain instructions rather than data.

To assign content to an object, we use `<-` or `=`: we provide a name for the object and then assign some content to it. For example, `MyObject <- 1:3` means *Create an object called `MyObject`. This object should contain the numbers 1 to 3*.
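
The following minimal sketch illustrates assignment and the role of arguments with two base R functions. `MyObject` is the example from the text; the names `MyMean` and `MyRounded` are made up for illustration.

```r
# create an object called MyObject that contains the numbers 1 to 3
MyObject <- 1:3

# the first argument is the data; further arguments add information
MyMean <- mean(MyObject)                # one argument: the data
MyRounded <- round(10 / 3, digits = 2)  # second argument: number of decimals

# typing the name of an object prints its content
MyObject
MyMean
MyRounded
```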


In [None]:
# load data
icebio <- read.delim("https://slcladal.github.io/data/BiodataIceIreland.txt", 
                      sep = "\t", header = TRUE)


## Inspecting data

There are many ways to inspect data. We will briefly go over the most common ways to inspect data.

The `head` function takes the data-object as its first argument and automatically shows the first 6 elements of an object (or rows if the data-object has a table format).


In [None]:
head(icebio)



We can also use the `head` function to inspect more or fewer elements by specifying the number of elements (or rows) that we want to inspect as a second argument. In the example below, the `4` tells R that we only want to see the first 4 rows of the data.



In [None]:
head(icebio, 4)



## Accessing individual cells in a table

If you want to access specific cells in a table, you can do so by typing the name of the object and then specifying the rows and columns in square brackets (i.e. **data[row, column]**). For example, `icebio[2, 4]` would show the value of the cell in the second row and fourth column of the object `icebio`. We can also use the colon to define a range (as shown below, where 1:5 means from 1 to 5 and 1:3 means from 1 to 3). The command `icebio[1:5, 1:3]` thus means:

*Show me the first 5 rows and the first 3 columns of the data-object that is called icebio*. 
 


In [None]:
icebio[1:5, 1:3]
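
To make the **data[row, column]** logic easier to see, here is a minimal sketch with a small made-up table. The table `toy` and its values are invented for illustration and are not the `icebio` data.

```r
# a small made-up table with 4 rows and 3 columns
toy <- data.frame(id = 1:4,
                  speaker = c("A", "B", "A", "C"),
                  word.count = c(765, 1298, 23, 391),
                  stringsAsFactors = FALSE)

one_cell <- toy[2, 3]          # one cell: second row, third column
first_rows <- toy[1:2, ]       # rows 1 to 2, all columns
speakers <- toy[, "speaker"]   # all rows of the column called speaker

one_cell
first_rows
speakers
```

Leaving the position before or after the comma empty means *all rows* or *all columns*, respectively.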



**Inspecting the structure of data**

You can use the `str` function to inspect the structure of a data set. This function shows the number of observations (rows) and variables (columns) and tells you what type each variable is:

- **int** = integer
- **chr** = character string
- **num** = numeric
- **Factor** = factor (shown as **fct** in tibbles)


In [None]:
str(icebio)



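The `summary` function summarizes the data: for numeric columns, it reports the minimum, the quartiles, the mean, and the maximum; for character columns, it reports only the length, class, and mode of the column.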



In [None]:
summary(icebio)



## Tabulating data

We can use the `table` function to create basic tables that extract raw frequency information. The following command tells us how many instances there are of each level of the variable `date` in the `icebio` data. 


In [None]:
table(icebio$date) 



Alternatively, you could, of course, index the column by using its position in the data set like this: `icebio[, 6]` - the result of `table(icebio[, 6])` and `table(icebio$date)` are the same! Also note that here we leave out indexes for rows to tell R that we want all rows.

When you want to cross-tabulate columns, it is often better to use the `ftable` function (`ftable` stands for *flat table*, i.e. a flat contingency table). 


In [None]:
ftable(icebio$age, icebio$sex)
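
The difference between `table` and `ftable` is easier to see on a small scale. The sketch below uses a made-up table called `toy_speakers` (the values are invented for illustration and are not the `icebio` data).

```r
# five made-up speakers
toy_speakers <- data.frame(sex = c("female", "male", "female", "female", "male"),
                           age = c("19-25", "26-33", "19-25", "34-41", "19-25"),
                           stringsAsFactors = FALSE)

# frequencies of the levels of one column
sex_counts <- table(toy_speakers$sex)
sex_counts

# cross-tabulation of two columns: age groups in rows, sex in columns
age_by_sex <- ftable(toy_speakers$age, toy_speakers$sex)
age_by_sex
```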



## Saving data to your computer

To save tabular data on your computer, you can use the `write.table`  function. 

**WARNING**: This only works on your own computer and if you have a data-folder in the project folder.

This function requires the data that you want to save as its first argument, the location where you want to save the data as the second argument and the type of delimiter as the third argument. 

The command to save the data on your disc would be `write.table(icebio, here::here("data", "icebio.txt"), sep = "\t")`. However, in Google Colab, we need to use a slightly different command (shown below).


In [None]:
write.table(icebio, file = "icebio.txt", sep = "\t")



**A word about paths**

In the command shown above, the sequence `here::here("data", "icebio.txt")` is a handy way to define a path. A path is simply the location where a file is stored on your computer or on the internet (which typically is a server, which is just a fancy term for a computer, somewhere on the globe). The `here` function from the `here` package allows you to simply state which folder a certain file is in and which file you are talking about. 

In this case, we want to access the file `icebio` (which is a `txt` file and thus has the extension `.txt`) in the `data` folder. R will always start looking in the folder in which your project is stored. If you want to access a file that is stored somewhere else on your computer, you can also define the full path to the folder in which the file is. In my case, this would be `D:/UiT/Workshop/IntroR/data`. However, as the `data` folder is in the folder where my Rproj file is, I only need to specify that the file is in the `data` folder within the folder in which my Rproj file is located.
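
If you want to see how such a path is put together without the `here` package, base R offers `file.path`. The sketch below is an alternative for illustration only, not the approach used in this workshop: unlike `here::here`, a path built with `file.path` is interpreted relative to the current working directory, which you can check with `getwd`. The object name `relative_path` is made up.

```r
# file.path() joins folder and file names with a forward slash
relative_path <- file.path("data", "icebio.txt")
relative_path

# getwd() shows the folder in which R currently starts looking for files
getwd()

# file.exists() checks whether a file is really there before you try to load it
file.exists(relative_path)
```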

**A word about package naming**

Another thing that is notable in the sequence `here::here("data", "icebio.txt")` is that I specified that the `here` function is part of the `here` package. This is what I meant by writing `here::here`, which simply means: use the `here` function from the `here` package (`package::function`). This may appear somewhat redundant, but it happens quite frequently that different packages have functions with the same name. In such cases, R will simply choose the function from the package that was loaded last. To prevent R from using the wrong function, it makes sense to specify the package AND the function (as I did in the sequence `here::here`). I only use functions without specifying the package if the function is part of base R.

## Loading data from your computer

 To load tabular data from within your project folder (if it is in a tab-separated txt-file) you can also use the `read.delim` function. The only difference to loading from the web is that you use a path instead of a URL. If the txt-file is in the folder called *data* in your project folder, you would load the data as shown below. 
 
**WARNING**: This only works on your own computer and if you have a data-folder in the project folder.

The command to read data from your disc would be `icebio <- read.delim(here::here("data", "icebio.txt"), sep = "\t", header = T)`. However, in Google Colab, we need to use a slightly different command (shown below).


In [None]:
icebio <- read.delim("icebio.txt", sep = "\t", header = T)



To check if this has worked, we will use the `head` function to see the first 6 rows of the data.



In [None]:
head(icebio)



## Renaming, Piping, and Filtering 

To rename existing columns in a table, you can use the `rename` function, which takes the table as its first argument, followed by one or more pairs of the form `NewName = OldName`: the new name comes first, then an equals sign (=), and then the old name. For example, renaming a column *OldName* as *NewName* in a table called *MyTable* would look like this: `rename(MyTable, NewName = OldName)`.  

Piping is done using the `%>%` sequence and it can be translated as **and then**. In the example below, we create a new object (icebio_edit) from the existing object (icebio) *and then* we rename the columns in the new object. When we use piping, we do not need to name the data we are using as this is provided by the previous step.


In [None]:
icebio_edit <- icebio %>%
  dplyr::rename(Id = id,
         FileSpeakerId = file.speaker.id,
         File = colnames(icebio)[3],
         Speaker = colnames(icebio)[4])
# inspect data
icebio_edit[1:5, 1:6]


A very handy way to rename many columns simultaneously is the `str_to_title` function, which capitalizes the first letter of a word and converts the remaining letters to lower case (which is why `FileSpeakerId` becomes `Filespeakerid`). In the example below, we apply it to all column names of our current data.



In [None]:
colnames(icebio_edit) <- stringr::str_to_title(colnames(icebio_edit))
# inspect data
icebio_edit[1:5, 1:6]


To remove rows based on values in columns you can use the `filter` function.



In [None]:
icebio_edit2 <- icebio_edit %>%
  dplyr::filter(Speaker != "?",
                # keep only rows where Zone is not missing
                !is.na(Zone),
                Date == "2002-2005",
                Word.count > 5)
# inspect data
head(icebio_edit2)


To select specific columns you can use the `select` function.



In [None]:
icebio_selection <- icebio_edit2 %>%
  dplyr::select(File, Speaker, Word.count)
# inspect data
head(icebio_selection)


You can also use the `select` function to remove specific columns.



In [None]:
icebio_selection2 <- icebio_edit2 %>%
  dplyr::select(-Id, -File, -Speaker, -Date, -Zone, -Age)
# inspect data
head(icebio_selection2)


## Ordering data

To order data, for instance, in ascending order according to a specific column you can use the `arrange` function.


In [None]:
icebio_ordered_asc <- icebio_selection2 %>%
  dplyr::arrange(Word.count)
# inspect data
head(icebio_ordered_asc)


To order data in descending order you can also use the `arrange` function and simply add a - before the column according to which you want to order the data.



In [None]:
icebio_ordered_desc <- icebio_selection2 %>%
  dplyr::arrange(-Word.count)
# inspect data
head(icebio_ordered_desc)


The output shows that the female speaker in file S2A-005 with the speaker identity A has the highest word count with 2,355 words. 


## Creating and changing variables

New columns are created, and existing columns can be changed, by using the `mutate` function. If the data is passed in via a pipe, `mutate` takes expressions of the form `name = value`: the part before the = is the (new) name of the column that you want to create, and the part after the = is what you want to store in that column.

In the example below, we create a new column called *Texttype*. 

This new column should contain 

  + the value *PrivateDialogue* if *Filespeakerid* contains the sequence *S1A*, 
  
  + the value *PublicDialogue* if *Filespeakerid* contains the sequence *S1B*, 
  
  + the value *UnscriptedMonologue* if *Filespeakerid* contains the sequence *S2A*, 
  
  + the value *ScriptedMonologue* if *Filespeakerid* contains the sequence *S2B*, 
  
  + the value of *Filespeakerid* if *Filespeakerid* contains none of *S1A*, *S1B*, *S2A*, or *S2B*.


In [None]:
icebio_texttype <- icebio_selection2 %>%
  dplyr::mutate(Texttype = 
                  dplyr::case_when(str_detect(Filespeakerid ,"S1A") ~ "PrivateDialogue",
                                   str_detect(Filespeakerid ,"S1B") ~ "PublicDialogue",
                                   str_detect(Filespeakerid ,"S2A") ~ "UnscriptedMonologue",
                                   str_detect(Filespeakerid ,"S2B") ~ "ScriptedMonologue",
                                   TRUE ~ Filespeakerid))
# inspect data
head(icebio_texttype)


## If-statements

We should briefly talk about if-statements (or `case_when` in the present case). The `case_when` function is both very powerful and extremely helpful as it allows you to assign values based on a test. As such, `case_when`-statements can be read as:

*When/If X is the case, then do A and if X is not the case do B!* (When/If -> Then -> Else)

The nice thing about `ifelse` or `case_when`-statements is that they can be used in succession as we have done above. This can then be read as:

*If X is the case, then do A, if Y is the case, then do B, else do Z* 
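
The two readings above can be sketched with the base R function `ifelse`. The vector `word_counts`, the thresholds, and the labels are made up for illustration.

```r
# four made-up word counts
word_counts <- c(3, 250, 1200, 40)

# one test: if the count is above 100, use "long", else use "short"
simple <- ifelse(word_counts > 100, "long", "short")

# tests in succession: the first test that is TRUE decides the label
nested <- ifelse(word_counts > 1000, "very long",
                 ifelse(word_counts > 100, "long", "short"))

simple
nested
```

With more than two or three tests, nested `ifelse` calls become hard to read, which is one reason to prefer `case_when` in such cases.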



## Summarizing data

The `summarise` function collapses a table into summary values. In the example below, we calculate the total number of words in the data.


In [None]:
icebio_summary1 <- icebio_texttype %>%
  dplyr::summarise(Words = sum(Word.count))
# inspect data
head(icebio_summary1)


To get summaries of sub-groups or by variable level, we can use the `group_by` function and then use the `summarise` function.



In [None]:
icebio_summary2 <- icebio_texttype %>%
  dplyr::group_by(Texttype, Sex) %>%
  dplyr::summarise(Speakers = n(),
            Words = sum(Word.count))
# inspect data
head(icebio_summary2)


## Gathering and spreading data

The `tidyr` package has two very useful functions for gathering and spreading data that can be used to transform data to long and wide formats (you will see what this means below). The functions are called `gather` and `spread`.

We will use the data set called `icebio_summary2`, which we created above, to demonstrate how this works.

We will first check out the `spread`-function to create different columns for women and men that show how many of them are represented in the different text types. 


In [None]:
icebio_summary_wide <- icebio_summary2 %>%
  dplyr::select(-Words) %>%
  tidyr::spread(Sex, Speakers)
# inspect
icebio_summary_wide


The data is now in what is called a `wide`-format as values are distributed across columns.

To reformat this back to a `long`-format where each column represents exactly one variable, we use the `gather`-function:


In [None]:
icebio_summary_long <- icebio_summary_wide %>%
  tidyr::gather(Sex, Speakers, female:male)
# inspect
icebio_summary_long
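
Note that newer versions of `tidyr` (1.0.0 and later) mark `gather` and `spread` as superseded and recommend `pivot_longer` and `pivot_wider` instead. The older functions still work, so this is a side note rather than a correction. The sketch below shows the same round trip with the newer functions on a small made-up table; `toy_long` and its values are invented for illustration and are not the workshop data.

```r
library(tidyr)

# made-up speaker counts in long format
toy_long <- data.frame(Texttype = c("Dialogue", "Dialogue", "Monologue", "Monologue"),
                       Sex = c("female", "male", "female", "male"),
                       Speakers = c(10, 8, 4, 6),
                       stringsAsFactors = FALSE)

# long to wide: one column per level of Sex (counterpart of spread)
toy_wide <- pivot_wider(toy_long, names_from = Sex, values_from = Speakers)

# wide to long again (counterpart of gather)
toy_long2 <- pivot_longer(toy_wide, cols = c(female, male),
                          names_to = "Sex", values_to = "Speakers")

toy_wide
toy_long2
```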


# More on working with text

We have now worked through how to load, save, and edit tabulated data. However, R is also perfectly equipped for working with textual data, which is what we are going to concentrate on now. 

## Loading text data

To load text data from the web, we can use the `readRDS` function in combination with the `url` function, which takes the URL of the data as its first argument. In this case, we will load Donald Trump's 2016 rally speeches.


In [None]:
Trump <- base::readRDS(url("https://slcladal.github.io/data/Trump.rda", "rb"))
# inspect data
str(Trump)


It is very easy to extract frequency information and to create frequency lists. We can do this by first using the `unnest_tokens` function, which splits texts into individual words, and then using the `count` function to get the raw frequencies of all word types in a text.



In [None]:
Trump %>%
  tibble(text = SPEECH) %>%
  unnest_tokens(word, text) %>%
  dplyr::count(word, sort=T)


Extracting n-grams is also very easy, as the `unnest_tokens` function takes an argument called `token` in which we can specify that we want to extract n-grams. If we do this, then we need to specify `n` as a separate argument. Below we specify that we want the frequencies of all 4-grams.



In [None]:
Trump %>%
  tibble(text = SPEECH) %>%
  unnest_tokens(word, text, token="ngrams", n=4) %>%
  dplyr::count(word, sort=T) %>%
  head(10)


## Splitting-up texts

We can use the `str_split` function to split texts. However, there are two issues when using this (very useful) function:

  + the pattern that we want to split on disappears 

  + the output is a list (a special type of data format)

To remedy these issues, we 

  + add something right at the beginning of the pattern that we use to split the text

  + combine the `str_split` function with the `unlist` function 

To add something to the beginning of the pattern that we want to split the text by, we use the `str_replace_all` function. The `str_replace_all` function takes three arguments: 1. the **text**, 2. the **pattern** that should be replaced, 3. the **replacement**. In the example below, we add `~~~` to the sequence `SPEECH` and then split on the `~~~` rather than on the sequence "SPEECH" (in other words, we replace `SPEECH` with `~~~SPEECH` and then split on `~~~`).


In [None]:
Trump_split <- unlist(str_split(
  stringr::str_replace_all(Trump, "SPEECH", "~~~SPEECH"),
  pattern = "~~~"))
# inspect data
nchar(Trump_split) 
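
The two issues and the workaround are easier to see on a short string. The sketch below uses a made-up string (`toy`) and hypothetical object names, and it assumes that the `stringr` package is installed.

```r
library(stringr)

# toy text (invented for illustration) with two "SPEECH" markers
toy <- "SPEECH one text SPEECH two text"

# plain split: the result is a list and the pattern "SPEECH" is gone
plain_split <- str_split(toy, "SPEECH")
plain_split

# workaround: insert a marker before the pattern, split on the marker,
# and flatten the resulting list into a character vector
toy_split <- unlist(str_split(
  str_replace_all(toy, "SPEECH", "~~~SPEECH"),
  pattern = "~~~"))
toy_split
```

Because the string starts with the marker, the first element of `toy_split` is empty. This is one reason why very short elements are worth filtering out during cleaning.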


Let's also have a look at the structure of the `Trump_split` object.



In [None]:
str(Trump_split)



## Cleaning texts

When working with texts, we usually need to clean the data. Below, we do some very basic cleaning using a pipeline.


In [None]:
Trump_split_clean <- Trump_split %>%
  # replace line breaks with spaces
  stringr::str_replace_all(fixed("\n"), " ") %>%
  # replace everything that is neither alphanumeric nor punctuation with a space
  stringr::str_replace_all("[^[:alnum:][:punct:]]+", " ") %>%
  # remove double quotes
  stringr::str_remove_all("\"") %>%
  # remove superfluous white spaces
  stringr::str_squish()
# remove very short elements
Trump_split_clean <- Trump_split_clean[nchar(Trump_split_clean) > 5]
# inspect data
nchar(Trump_split_clean)
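
To check what each cleaning step does, the sketch below runs the same steps on two made-up strings. The names `raw` and `clean` are hypothetical, and the sketch assumes that `stringr` and `magrittr` (for the pipe) are installed.

```r
library(stringr)
library(magrittr)

# toy input (invented for illustration): quotes, a line break, extra spaces,
# and one very short element
raw <- c("  \"Hello\",\n  world!  ", "ok")

clean <- raw %>%
  # replace line breaks with spaces
  str_replace_all(fixed("\n"), " ") %>%
  # replace everything that is neither alphanumeric nor punctuation
  str_replace_all("[^[:alnum:][:punct:]]+", " ") %>%
  # remove double quotes
  str_remove_all("\"") %>%
  # collapse repeated spaces and trim both ends
  str_squish()

# drop very short elements
clean <- clean[nchar(clean) > 5]
clean
```

Only the first element survives the length filter, cleaned to `Hello, world!`.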


We now inspect the fifth element of the cleaned data.



In [None]:
Trump_split_clean[5]



## Concordancing and KWICs

Creating concordances or key-word-in-context displays is one of the most common practices when dealing with text data. Fortunately, there exist ready-made functions that make this a very easy task in R. We will use the `kwic` function from the `quanteda` package to create kwics here. 


In [None]:
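# tokenize first: recent quanteda versions expect kwic() to receive a tokens object
kwic_multiple <- quanteda::kwic(quanteda::tokens(Trump_split_clean), 
       pattern = quanteda::phrase("great again"),
       window = 3, 
       valuetype = "regex") %>%
  as.data.frame()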
# inspect data
head(kwic_multiple)


We can now also select concordances based on specific features. For example, we may want only those instances of "great again" in which the preceding word is "America". 



In [None]:
kwic_multiple_select <- kwic_multiple %>%
  # last element before search term is "America"
  dplyr::filter(str_detect(pre, "America$"))
# inspect data
head(kwic_multiple_select)
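
The following sketch repeats both steps (creating a concordance and filtering it by the preceding context) on two made-up sentences, so that the result can be checked by eye. The object names and texts are hypothetical. The sketch assumes that `quanteda` is installed and tokenizes the texts before passing them to `kwic`.

```r
library(quanteda)

# toy texts (invented for illustration)
toy_texts <- c(d1 = "We will make America great again",
               d2 = "This will be great again")

# concordance of the phrase "great again" with two words of context
toy_kwic <- as.data.frame(
  kwic(tokens(toy_texts), pattern = phrase("great again"), window = 2)
)

# keep only hits whose preceding context ends in "America"
toy_kwic_america <- toy_kwic[grepl("America$", toy_kwic$pre), ]
toy_kwic_america
```

Both texts contain the phrase, but only the hit in `d1` is preceded by "America".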


Again, we can use the `write.table` function to save our kwics to disc.

**WARNING**: This only works on your own computer and if you have a data-folder in the project folder.

The command to save the data on your disc would be `write.table(kwic_multiple_select, here::here("data", "kwic_multiple_select.txt"), sep = "\t")`. However, in Google Colab, we need to use a slightly different command (shown below).


In [None]:
write.table(kwic_multiple_select, file = "kwic_multiple_select.txt", sep = "\t")
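
If you want to verify that a saved table can be read back in, a round trip through a temporary file is a quick check. The sketch below uses only base R. The toy data frame and the file name `toy_kwic.txt` are made up for illustration.

```r
# toy table (invented for illustration)
toy <- data.frame(pre = "make America",
                  keyword = "great again",
                  stringsAsFactors = FALSE)

# write to a temporary folder so that no project files are touched
out <- file.path(tempdir(), "toy_kwic.txt")
write.table(toy, file = out, sep = "\t")

# read the tab-separated file back in and inspect it
back <- read.delim(out)
back
```

Because `write.table` stores row names by default, `read.delim` uses the first column of the file as row names, so `back` has the same two columns as `toy`.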



## Tokenization and counting words

We will now use the `tokenize_words` function from the `tokenizers` package to find out how many words are in each speech. Before we count the words, however, we will clean the data by removing all numbers and all punctuation.
 


In [None]:
words <- as.vector(sapply(Trump_split_clean, function(x){
  # remove numbers and punctuation
  x <- tm::removeNumbers(x)
  x <- tm::removePunctuation(x)
  # split into words and return the number of words
  x <- unlist(tokenize_words(x))
  length(x)}))
words
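
A shorter route to comparable counts, sketched here on two made-up strings, is to let `tokenize_words` do the cleaning itself. By default it lowercases the text and strips punctuation, and `strip_numeric = TRUE` additionally drops numbers. This is an alternative to the `tm`-based cleaning above, not the same procedure, and the object names are hypothetical.

```r
library(tokenizers)

# toy texts (invented for illustration)
toy_texts <- c("Hello, world!", "It is 2016.")

# one vector of words per text; punctuation and numbers are dropped
toy_tokens <- tokenize_words(toy_texts, strip_numeric = TRUE)

# number of words per text
toy_counts <- unname(lengths(toy_tokens))
toy_counts
```

Each of the two toy texts contains two words once punctuation and numbers are removed.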


The nice thing about the `tokenizers` package is that it also allows us to split texts into sentences.
To show this, we return to the rally speeches by Donald Trump and split one of them (the sixth element of `Trump_split_clean`) into sentences.


In [None]:
Sentences <- unlist(tokenize_sentences(Trump_split_clean[6]))
# inspect
head(Sentences)
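
On a short made-up text, the effect of `tokenize_sentences` is easy to verify. The names `toy_text` and `toy_sentences` are hypothetical, and the sketch assumes that the `tokenizers` package is installed.

```r
library(tokenizers)

# toy text (invented for illustration) with three sentences
toy_text <- "We will win. We will win again! Will we?"

# tokenize_sentences returns a list, so we flatten it with unlist
toy_sentences <- unlist(tokenize_sentences(toy_text))
toy_sentences
```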


We now turn to data visualization basics.

# Working with figures

There are numerous functions in R that we can use to visualize data. We will use the `ggplot` function from the `ggplot2` package here to visualize the data. 

The `ggplot2` package was developed by Hadley Wickham in 2005 and it implements the graphics scheme described in the book *The Grammar of Graphics* by Leland Wilkinson.

The idea behind the  *Grammar of Graphics* can be boiled down to 5 bullet points (see Wickham 2016: 4):

- a statistical graphic is a mapping from data to **aes**thetic attributes (location, color, shape, size) of **geom**etric objects (points, lines, bars). 

- the geometric objects are drawn in a specific **coord**inate system.

- **scale**s control the mapping from data to aesthetics and provide tools to read the plot (i.e., axes and legends).

- the plot may also contain **stat**istical transformations of the data (means, medians, bins of data, trend lines).

- **facet**ing can be used to generate the same plot for different subsets of the data.
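
The sketch below shows where each of these five components appears in `ggplot2` code. The data frame `toy` and its values are made up for illustration, and the sketch assumes that `ggplot2` (version 3.3.0 or later, for the `fun` argument) is installed.

```r
library(ggplot2)

# toy data (invented for illustration)
toy <- data.frame(
  age   = rep(c(20, 30, 40, 50), times = 2),
  words = c(300, 450, 500, 420, 280, 390, 610, 480),
  sex   = rep(c("female", "male"), each = 4)
)

p <- ggplot(toy, aes(x = age, y = words, color = sex)) +  # data and aesthetics
  geom_point() +                                          # geometric objects
  stat_summary(fun = mean, geom = "line",                 # statistical transformation
               aes(group = 1)) +
  scale_color_manual(values = c(female = "gray20",        # scale
                                male = "gray50")) +
  coord_cartesian(ylim = c(0, 700)) +                     # coordinate system
  facet_wrap(vars(sex))                                   # faceting

p
```

Each component is added with `+`, so any of them can be removed or swapped without touching the rest of the plot.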

## Basics of ggplot2 syntax 

**Specify data, aesthetics and geometric shapes** 

`ggplot(data, aes(x=, y=, color=, shape=, size=)) +`   
`geom_point()`, or `geom_histogram()`, or `geom_boxplot()`, etc.   

- This combination is very effective for exploratory graphs. 

- The data must be a data frame.

- The `aes()` function maps columns of the data frame to aesthetic properties of geometric shapes to be plotted.

- `ggplot()` defines the plot; the `geoms` show the data; each component is added with `+` 

- Some examples should make this clear.

## Practical examples  

We will now create some basic visualizations or plots.

Before we start plotting, we will create data that we want to plot. In this case, we will extract the mean word counts by gender and age.


In [None]:
plotdata <- ICE_Ire %>%
  # only private dialogue
  dplyr::filter(stringr::str_detect(File, "S1A"),
         # without speaker younger than 19
         Age != "0-18",
         Age != "NA") %>%
  dplyr::group_by(Sex, Age) %>%
  dplyr::summarise(Words = mean(Word.count))
# inspect
head(plotdata)
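
The grouping and summarising step can be checked on a toy table with the same column names. The values below are made up for illustration, and the sketch assumes that `dplyr` (version 1.0.0 or later, for the `.groups` argument) is installed.

```r
library(dplyr)

# toy data (invented for illustration) with the columns used above
toy <- data.frame(
  Sex        = c("female", "female", "male", "male"),
  Age        = c("19-25", "19-25", "19-25", "26-33"),
  Word.count = c(100, 300, 200, 400)
)

toy_means <- toy %>%
  # one group per combination of Sex and Age
  group_by(Sex, Age) %>%
  # mean word count per group; .groups = "drop" returns an ungrouped table
  summarise(Words = mean(Word.count), .groups = "drop")

toy_means
```

The four rows collapse into three groups: the two rows for women aged 19-25 are averaged to 200, while the two rows for men remain separate because their ages differ.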


In the example below, we specify that we want to visualize the `plotdata` and that the x-axis should represent `Age` and the y-axis `Words` (the mean word count). We also tell R that we want to group the data by `Sex` (i.e. that we want to distinguish between men and women). Then, we add `geom_line`, which tells R that we want a line graph. The result of this is shown below. 



In [None]:
ggplot(plotdata, aes(x = Age, y = Words, color = Sex, group = Sex)) +
  geom_line()


Once you have a basic plot like the one above, you can prettify the plot. For example, you can 

+ change the width of the lines (`size = 1.25`)

+ change the y-axis limits (`coord_cartesian(ylim = c(0, 1500))`)

+ use a different theme (`theme_bw()` means black and white theme)

+ move the legend to the top

+ change the default colors to colors you like (`scale_color_manual ...`)

+ change the linetype (`scale_linetype_manual ...`)


In [None]:
ggplot(plotdata, aes(x = Age, y = Words,
                     color = Sex, 
                     group = Sex, 
                     linetype = Sex)) +
  geom_line(size = 1.25) +
  coord_cartesian(ylim = c(0, 1500)) +
  theme_bw() + 
  theme(legend.position = "top") + 
  scale_color_manual(breaks = c("female", "male"),
                     values = c("gray20", "gray50")) +
  scale_linetype_manual(breaks = c("female", "male"),
                        values = c("solid", "dotted"))


An additional and very handy feature of this way of producing graphs is that you 

+ can integrate them into pipes 

+ can easily combine plots.


In [None]:
ICE_Ire %>%
  dplyr::filter(Sex != "NA",
         Age != "NA",
         !is.na(Sex),
         !is.na(Age)) %>%
  dplyr::mutate(Age = factor(Age),
         Sex = factor(Sex)) %>%
  ggplot(aes(x = Age, 
             y = Word.count,
             color = Sex,
             linetype = Sex)) +
  geom_point() +
  stat_summary(fun=mean, geom="line", aes(group=Sex)) +
  coord_cartesian(ylim = c(0, 2000)) +
  theme_bw() + 
  theme(legend.position = "top") + 
  scale_color_manual(breaks = c("female", "male"),
                     values = c("indianred", "darkblue")) +
  scale_linetype_manual(breaks = c("female", "male"),
                        values = c("solid", "dotted"))


You can also create different types of graphs very easily and split them into different facets.



In [None]:
ICE_Ire %>%
  drop_na() %>%
  dplyr::filter(Age != "NA") %>%
  dplyr::mutate(Date = factor(Date)) %>%
  ggplot(aes(x = Age, 
             y = Word.count, 
             fill = Sex)) +
  facet_grid(vars(Date)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 2000)) +
  theme_bw() + 
  theme(legend.position = "top") + 
  scale_fill_manual(breaks = c("female", "male"),
                     values = c("#E69F00", "#56B4E9"))


# Ending R sessions

At the end of each session, you can extract information about the session itself (e.g. which R version you used and which versions of packages). This can help others (or even your future self) to reproduce the analysis that you have done.

## Extracting session information

You can extract the session information by running the `sessionInfo` function (without any arguments).


In [None]:
sessionInfo()
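
If you only need parts of this information, for example to quote the R version in a report, you can store the result and access its elements. The sketch below uses only base R. The object name `si` is hypothetical.

```r
# store the session information in an object
si <- sessionInfo()

# the R version as a single string
si$R.version$version.string

# names of the attached base packages
si$basePkgs

# version of a single package (here the built-in stats package)
as.character(packageVersion("stats"))
```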



# Going further

If you want to know more, would like to get some more practice, or would like to have another approach to R, please check out the workshops and resources on R provided by the [UQ library](https://web.library.uq.edu.au/library-services/training). In addition, there are various online resources available to learn R (you can check out a very recommendable introduction [here](https://uvastatlab.github.io/phdplus/intror.html)). 

Here are also some additional resources that you may find helpful:

* Grolemund, G., and Wickham, H., [*R 4 Data Science*](http://r4ds.had.co.nz/), 2017.
    + Highly recommended! (especially chapters 1, 2, 4, 6, and 8)
* Stat545 - Data wrangling, exploration, and analysis with R. University of British Columbia.  <http://stat545.com/>
* Swirlstats, a package that teaches you R and statistics within R: <https://swirlstats.com/>
* DataCamp's (free) *Intro to R* interactive tutorial: <https://www.datacamp.com/courses/free-introduction-to-r>
    + DataCamp's advanced R tutorials require a subscription.
* Twitter: 
    + Explore RStudio Tips <https://twitter.com/rstudiotips>
    + Explore #rstats, #rstudioconf

# Citation & Session Info 

Schweinberger, Martin. 2021. *A practical introduction to R*. Tromsø: Arctic University of Norway. url: https://slcladal.github.io/intror.html 


In [None]:
sessionInfo()



***

# References 


Gillespie, Colin, and Robin Lovelace. 2016. *Efficient R Programming: A Practical Guide to Smarter Programming*. O’Reilly Media, Inc.

Wickham, Hadley, and Garrett Grolemund. 2016. *R for Data Science: Import, Tidy, Transform, Visualize, and Model Data*. O’Reilly Media, Inc.
