# An Introduction to R

R is a language and environment for statistical computing and graphics. You can read more about it <a href="https://www.r-project.org/about.html" target="_blank">here</a>.

In this Notebook we will point out **some** useful basics of R to help you get started. If you want to learn more, there are some selected resources listed in the guided study on Galen.

We will be assuming you are following along using RStudio, but you can also interact with the code by clicking the rocket launch button in the top right to open this in Google Colab.

To get started open RStudio, open a new R Script (e.g. File->New File->R Script) and save it (e.g. as **RBasics.R**).

## Getting to know R/RStudio

When you are working in RStudio you should see that the window is split into four panels. Commonly these will be:
* top left - any current script(s) (e.g. our RBasics.R script)
* bottom left - a console for experimental commands and any R output
* top right - the working environment with any saved variables or data structures
* bottom right - extras such as a files tab and R Help

If you have no script open, the left-hand side will be taken up completely by the console. If you want to change the layout you can go to View->Panes and select your preferred option.

![panels](https://github.com/CicelyKrystyna/MD4002_RWorkshop/blob/main/images/panels.png?raw=1)

### The Console

If you want to test out a line of code (command) you can run it in the console before copying it to your script. Some useful features:
* You can use your keyboard's up/down arrows to scroll back through previous commands.
* You can use `^R` (Ctrl+R) to search your command history.
* You can clear the console using `^L` (Ctrl+L).

```{tip}
You can also use the console to access R's help guides (which will appear in the bottom right panel).

You can search R's documentation for a keyword by typing `??keyword`. You can get help on a particular function or package by typing `?function`.
```

In [None]:
# asking R to search its documentation for the keyword "mean"
??mean
# asking R to provide help on the function mean
?mean

```{note}
Lines of code typed into the console are not saved. This is why we create scripts to store our code.
```

## Preamble

Our script is where we will write and then execute our code. Let us start our script **RBasics.R** by adding some preamble.

In RStudio you can create (collapsible) sections to keep your code organised. To create a section in RStudio go to Code->Insert Section; a pop-up box will appear for you to name your section. We will start by creating a section called "Preamble". Once you click OK you should get something that looks like this:

In [None]:
# Preamble ----------------------------------------------------------------

```{tip}
The # symbol is used to indicate a comment in R. This is a line which R will skip over and not try to run as code. The best code is well commented, so feel free to add your own comments to help you understand it.
```

### Documentation

You may wish to start your code with some documentation. This could look something like:

In [None]:
# Author: [Your Name]
# Date: [Date]
# Description: [Brief description of what the script does]

### Install/Load Packages

Next you will need to install/load packages. We **install** packages in R once. We **load** the packages we are going to use every time we open/start R/RStudio.

In [None]:
# Install your packages (first use only)
# install.packages("tidyverse")

# Load your packages (every time you restart R)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


```{tip}
There are thousands of R packages, many of which you will never use and a handful of which you will use all the time. We will try to point you to the most relevant, but you will also find more information in the references section.
```
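As a hedged sketch of the install-once, load-every-session pattern, the snippet below checks whether a package is available before you try to load it. The helper name `has_package()` is hypothetical and introduced only for illustration; it wraps base R's `requireNamespace()`. The snippet only checks and reports, it does not install anything.

```r
# has_package() is a hypothetical helper, not part of base R or the tidyverse.
# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
has_package <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

pkg <- "tidyverse"
if (has_package(pkg)) {
  message("'", pkg, "' is installed; load it with library(", pkg, ").")
} else {
  message("'", pkg, "' is not installed; run install.packages(\"", pkg, "\") once.")
}
```

If the second message appears, run the `install.packages()` line once and then use `library()` as normal in every new session.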

### Clear R Environment

A common line you will often see in the Preamble of R scripts is one which clears the current workspace/environment.

In [None]:
# Clear everything from the environment
rm(list = ls())

```{caution}
This will delete/unassign all of the variables and data structures you have created, so only use it if you won't lose something from the environment that you still need.
```

### Set Working Directory

It may keep things streamlined if you set the path to your preferred working directory. Imagine you have a folder called **DataProject** sitting in your **Documents** folder. Setting the working directory to that folder means that, by default, inputs will be read from it and outputs will be written to it.

In [None]:
# Set working directory using setwd("path/to/your/directory")
# setwd("~/Documents/DataProject")
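A minimal sketch of checking the working directory before changing it. It assumes the **DataProject** folder from the text, which may not exist on your machine, so the call to `setwd()` is guarded; `project_path` is an illustrative name.

```r
# Report the directory R is currently using for inputs and outputs
current_dir <- getwd()
cat("Current working directory:", current_dir, "\n")

# Build the path piece by piece; file.path() joins the parts with "/"
project_path <- file.path("~", "Documents", "DataProject")
cat("Project path:", project_path, "\n")

# Only change directory if the folder really exists on this machine
# (path.expand() replaces "~" with your home directory)
if (dir.exists(path.expand(project_path))) {
  setwd(project_path)
}
```

Running `getwd()` again afterwards confirms whether the change took effect.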

## Basic Operations and Print Statements

Let's set up another section within our R script and call it **The Basics**.

In [None]:
# The Basics --------------------------------------------------------------

### Basic Operations

Let's have a play with some very basic functions. Most of these are unremarkable. Firstly mathematical operations:
* to add numbers use `+`
* to subtract numbers use `-`
* to multiply numbers use `*`
* to divide numbers use `/`
* to find powers use `^`
* to calculate the square root use `sqrt()`


In [None]:
# a simple mathematical calculation using basic mathematical operations
1+2*2+3^2-4/2-sqrt(4)

```{note}
R, like most coding languages, follows the traditional mathematical rules of precedence, i.e. <a href="https://www.bbc.co.uk/bitesize/articles/znm8cmn#:~:text=BIDMAS%20is%20an%20acronym%20used,%2C%20Multiplication%2C%20Addition%2C%20Subtraction" target="_blank">BIDMAS</a> or <a href="https://www.mathsisfun.com/operation-order-pemdas.html" target="_blank">PEMDAS</a>.
```
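To see the order of operations at work, the calculation above can be broken into its parts. This is an illustrative sketch; the object names (`powers`, `product` and so on) are made up for the example.

```r
# Each piece of the calculation, evaluated on its own
powers   <- 3^2      # 9
product  <- 2 * 2    # 4
quotient <- 4 / 2    # 2
root     <- sqrt(4)  # 2

# R applies powers first, then * and /, then + and -
result <- 1 + 2 * 2 + 3^2 - 4 / 2 - sqrt(4)
cat("Result:", result, "\n")                 # 10

# Brackets override the default order
cat("With brackets:", (1 + 2) * 2, "\n")     # 6
cat("Without brackets:", 1 + 2 * 2, "\n")    # 5
```

When in doubt, adding brackets makes the intended order explicit and costs nothing.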

There are many other simple functions such as
* the exponential function `exp()`
* the natural logarithmic function `log()`
* to calculate the sum `sum()`
* to calculate the mean `mean()`
* to round numbers `round()`

Some of these functions can take additional arguments such as `round()` which allows you to specify how many decimal places you would like to round to, for example,

In [None]:
round(pi,6)

**Remember:** If you want to see if R has a function for something you can use the help to search that keyword (e.g. ``??exponential``).
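The functions listed above can be tried on a small set of numbers. This is a sketch with made-up values; `values`, `total`, `average` and `rounded` are illustrative names.

```r
values <- c(2, 4, 6, 8)

total   <- sum(values)    # 2 + 4 + 6 + 8 = 20
average <- mean(values)   # 20 / 4 = 5
cat("Sum:", total, "Mean:", average, "\n")

# log() is the natural logarithm, so it undoes exp()
cat("log(exp(2)):", log(exp(2)), "\n")

# round() takes the number of decimal places as its second argument
rounded <- round(2 / 3, 2)
cat("2/3 rounded to 2 decimal places:", rounded, "\n")   # 0.67
```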

### Print Statements

Print statements, like comments, are very useful when coding. Although you can make use of the simple print function `print`, we would recommend the concatenate and print function `cat`. Here's how it works:

In [None]:
# an example of a simple print statement in R
cat("The sum of the first ten numbers is:", sum(1:10), "\n")

The sum of the first ten numbers is: 55 


Anything you want R to print verbatim you put inside " " and anything else must either be an object or some function or operation which R can carry out. The final "\n" tells R to start a new line at the end of the statement.
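For comparison, here is a short sketch of `print()` next to `cat()`, plus `paste()` for building a message you want to store rather than display straight away. The names `total` and `message_text` are illustrative.

```r
total <- sum(1:10)

# print() shows the value with an index prefix: [1] 55
print(total)

# cat() joins text and values, separated by spaces by default
cat("The sum of the first ten numbers is:", total, "\n")

# paste() builds the same kind of message as a string you can keep
message_text <- paste("The sum of the first ten numbers is:", total)
cat(message_text, "\n")
```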

## Objects in R

### Assigning Objects

Objects in R are essentially containers. They may contain single pieces of data (e.g. a variable) or be more complex structures of data (e.g. lists and data frames, see below). Objects are created by **assigning** data to an object using `<-`.

```{note}
There are actually no fewer than five different assignment operators as per this <a href="https://stat.ethz.ch/R-manual/R-patched/library/base/html/assignOps.html" target="_blank">documentation</a>. However, at a basic level using `<-` will serve you well.
```

Let's set up a section in our R script called **Objects in R** and look at some examples.

In [None]:
# Objects in R ------------------------------------------------------------

# assign the value 16 to the object my_variable
my_variable<-16
# assign a set of values {1,2,3} to the object my_set
my_set<-c(1,2,3)
# assign a list of values {1,two,3} to the object my_list
my_list<-list(1,"two",3)
# assign a list of sets {1,2,3} and {A, B, C, D} to the object my_listofsets
my_listofsets<-list(c(1,2,3),c("A","B","C","D"))
# assign a name to each of the sets in my_listofsets and save as my_listofsets2
my_listofsets2<-list(my_set1=c(1,2,3),my_set2=c("A","B","C","D"))

```{note}
We store strings/words using quotation marks - single or double depending on user preference.
```
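Once objects like those above are assigned, a few base R functions let you look inside them. The sketch below re-creates two of the objects so that it runs on its own.

```r
my_set <- c(1, 2, 3)
my_listofsets2 <- list(my_set1 = c(1, 2, 3), my_set2 = c("A", "B", "C", "D"))

# length() counts the entries in an object
cat("my_set has", length(my_set), "entries\n")                    # 3
cat("my_listofsets2 has", length(my_listofsets2), "entries\n")    # 2

# names() lists the labels given to the entries of a named list
cat("Names in my_listofsets2:", names(my_listofsets2), "\n")

# $ pulls out one named entry
cat("my_set2 contains:", my_listofsets2$my_set2, "\n")
```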


You can get R to list all the currently assigned objects using `ls()`.

In [None]:
# list all of the currently assigned objects
ls()

```{note}
We saw above that `rm(list = ls())` will remove (and de-assign) all of the objects in the current session. You can also remove/deassign individual objects, for example `rm(my_listofsets)`.
```
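A small sketch of removing a single object and confirming that it has gone. The object names are hypothetical throwaways, and `exists()` is the base R function that reports whether a name is currently assigned.

```r
# Two throwaway objects (hypothetical names, for illustration only)
scratch_value <- 42
keep_value <- "important"

# Remove just one of them
rm(scratch_value)

# exists() reports whether a name is currently assigned
cat("scratch_value exists:", exists("scratch_value"), "\n")   # FALSE
cat("keep_value exists:", exists("keep_value"), "\n")         # TRUE
```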

### The Environment Panel

After assigning these objects our environment panel should look like this:

![environment](https://github.com/CicelyKrystyna/MD4002_RWorkshop/blob/main/images/environment.png?raw=1)

The blue arrow buttons allow you to expand or collapse the information about more complicated objects like lists and dataframes. The environment panel provides details on objects which have been assigned, the data they contain and their class.

### Classes of Objects

Objects in R are automatically given a **class** (e.g. "numeric", "character"). These indicate how data is stored in R.

In [None]:
# give the class of my_variable
cat("The object my_variable is of class:", class(my_variable), "\n")
# give the class of my_set
cat("The object my_set is of class:", class(my_set), "\n")
# give the class of my_list
cat("The object my_list is of class:", class(my_list), "\n")
# give the class of my_listofsets
cat("The object my_listofsets is of class:", class(my_listofsets), "\n")
# give the class of each set in my_listofsets2
cat("The set my_set1 within object my_listofsets2 is of class:", class(my_listofsets2$my_set1), "\n")
cat("The set my_set2 within object my_listofsets2 is of class:", class(my_listofsets2$my_set2), "\n")

The object my_variable is of class: numeric 
The object my_set is of class: numeric 
The object my_list is of class: list 
The object my_listofsets is of class: list 
The set my_set1 within object my_listofsets2 is of class: numeric 
The set my_set2 within object my_listofsets2 is of class: character 


```{note}
We can also use the function `typeof()`. If an object contains strings its **type and class** will be "character". If an object contains numbers, `typeof()` reports whether they are stored as "integer" or "double"; `class()` reports "integer" for the former and "numeric" for the latter.
```

In [None]:
# give the type of my_variable
cat("The object my_variable is of type:", typeof(my_variable), "\n")

The object my_variable is of type: double 
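The difference between class and type is easiest to see side by side. In this sketch the `L` suffix asks R to store a whole number as an integer; the object names are illustrative.

```r
whole_number   <- 5L   # the L suffix asks R for an integer
decimal_number <- 5    # without it, R stores a double

cat("class(whole_number):", class(whole_number), "\n")         # integer
cat("typeof(whole_number):", typeof(whole_number), "\n")       # integer
cat("class(decimal_number):", class(decimal_number), "\n")     # numeric
cat("typeof(decimal_number):", typeof(decimal_number), "\n")   # double

# is.numeric() is TRUE for both, whichever way the number is stored
cat("Both numeric:", is.numeric(whole_number) && is.numeric(decimal_number), "\n")
```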


R is reasonably good at selecting the correct **type** and **class** for each object. If it is wrong it will typically be because of the way the data has been entered.

```{note}
A list in R can contain data of different forms (e.g. numbers and words) because each entry in a list can have a different class/type. Other structures are restricted to being of one class/type only.
```

In [None]:
# assign a set of values {1,"two",3} to the object my_mixedset
my_mixedset<-c(1,"two",3)
# determine the class of my_mixedset
cat("The object my_mixedset is of class:", class(my_mixedset), "\n")

The object my_mixedset is of class: character 


Here, we included a string in my_mixedset. R automatically classes my_mixedset as "character" and all values are stored as strings. We can check the class using an `is` function, e.g. `is.numeric()`, and change the class using an `as` function, e.g. `as.numeric()`.

In [None]:
# ask R if my_mixedset is of class "numeric"
is.numeric(my_mixedset)
# ask R to store my_mixedset as class "numeric" (note reassignment)
my_mixedset<-as.numeric(my_mixedset)
# ask R if my_mixedset is of class "numeric"
is.numeric(my_mixedset)

“NAs introduced by coercion”


```{warning}
When we force my_mixedset to be stored as a numeric object, any string that cannot be read as a number (here "two") is lost irrevocably and replaced by the missing data indicator NA. In the R Workshop there is a more detailed example of how we can convert the "two" to a 2 so as to not lose this data entry.
```
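One possible way to keep the entry, sketched under the assumption that "two" is the only word in the data, is to recode it before converting. This is not necessarily the approach taken in the R Workshop; `raw_values`, `converted` and `cleaned` are illustrative names.

```r
raw_values <- c("1", "two", "3")

# Direct conversion: "two" cannot be read as a number, so it becomes NA
# (suppressWarnings() hides the coercion warning for this demonstration)
converted <- suppressWarnings(as.numeric(raw_values))
cat("Direct conversion:", converted, "\n")    # 1 NA 3

# Recode the word first, then convert
cleaned <- raw_values
cleaned[cleaned == "two"] <- "2"
cleaned <- as.numeric(cleaned)
cat("After recoding:", cleaned, "\n")         # 1 2 3
```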

We can also change the more specific type of an object, for example,

In [None]:
# ask R to store my_variable as type "integer" (note reassignment)
my_variable<-as.integer(my_variable)
# give the type of my_variable
cat("The object my_variable is of type:", typeof(my_variable), "\n")

The object my_variable is of type: integer 


### Logicals and Binary Data

Sometimes our data may be stored as a set of true/false responses. R stores these as **logicals**. We tell R this by writing TRUE or FALSE or more simply using T or F, for example,

In [None]:
# assign a set of true/false responses to the object my_truefalseset
my_truefalseset<-c(T,F,T,T,F,T)
# give the class of my_truefalseset
cat("The object my_truefalseset is of class:", class(my_truefalseset), "\n")

The object my_truefalseset is of class: logical 


Logicals can be converted to numeric by flagging TRUE as 1 and FALSE as 0, i.e. any Ts become 1s and Fs become 0s.

In [None]:
# ask R to store my_truefalseset as class "numeric" (note reassignment)
my_truefalseset<-as.numeric(my_truefalseset)
# print my_truefalseset
cat("my_truefalseset:", my_truefalseset, "\n")

my_truefalseset: 1 0 1 1 0 1 
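Because TRUE counts as 1 and FALSE as 0, functions such as `sum()` and `mean()` can be applied to logicals directly. A short sketch using the same responses as above; `responses`, `n_true` and `prop_true` are illustrative names.

```r
responses <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)

# sum() counts the TRUE values; mean() gives the proportion that are TRUE
n_true    <- sum(responses)    # 4
prop_true <- mean(responses)   # 4 out of 6

cat("Number of TRUE responses:", n_true, "\n")
cat("Proportion of TRUE responses:", round(prop_true, 2), "\n")   # 0.67
```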


Indeed any binary data (data restricted to only two possible responses) can be stored numerically or logically. Suppose we had another set of binary data e.g. a set of happy or sad responses. We can convert this to either numeric or logical data. We would just need to specify which of happy or sad is to be flagged as TRUE/1.

In [None]:
# assign a set of happy/sad responses to the object my_happysadset
my_happysadset<-c("happy", "sad", "sad", "sad", "happy")
# give the class of my_happysadset
cat("The object my_happysadset is of class:", class(my_happysadset), "\n")
# print my_happysadset as a logical flagging happy as TRUE
cat("my_happysadset as logical:", as.logical(my_happysadset=="happy"), "\n")
# print my_happysadset as binary numerical flagging sad as 1
cat("my_happysadset as numeric:", as.numeric(my_happysadset=="sad"), "\n")

The object my_happysadset is of class: character 
my_happysadset as logical: TRUE FALSE FALSE FALSE TRUE 
my_happysadset as numeric: 0 1 1 1 0 


### Factors



We can use **factors** to store categorical data numerically. Each category is given a numeric value from 1 to n (the number of distinct categories). R will automatically order the categories into **levels**. This will be done alphabetically unless you specify an alternative. If your data is nominal (no discernible order) you don't need to worry about the order of the **levels**, but if your data is ordinal (clear implied order), as in the following example, you do.

```{tip}
Factors are useful because saving data numerically is more computationally efficient.
```

In [None]:
# assign a set of responses {low, medium, high} to a factor my_factor
my_factor<-factor(c("low", "medium", "high"))
# print my_factor as it is stored in R
cat("my_factor is stored numerically as:", my_factor, "\n")
# assign the same set of responses but tell R how to order the data
my_factor<-factor(c("low", "medium", "high"), levels=c("low", "medium", "high"))
# print my_factor as it is stored in R
cat("my_factor is stored numerically as:", my_factor, "\n")

my_factor is stored numerically as: 2 3 1 
my_factor is stored numerically as: 1 2 3 
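To inspect how a factor is put together, `levels()` shows the ordered categories and `as.integer()` shows the stored codes. A small sketch with one repeated response; `responses` is an illustrative name.

```r
responses <- factor(c("low", "medium", "high", "low"),
                    levels = c("low", "medium", "high"))

# The levels in the order we specified
cat("Levels:", levels(responses), "\n")         # low medium high

# The integer code stored for each response
cat("Codes:", as.integer(responses), "\n")      # 1 2 3 1

# as.character() recovers the original labels
cat("Labels:", as.character(responses), "\n")   # low medium high low
```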


### Data Frames

If you are working with data you are almost certainly going to be using **data frames**. A data frame is essentially a table of observations for different variables of interest. Each column represents a variable and the rows are individual observations.

```{note}
Data frames can contain data of different types/class but each column/variable will be designated a single class. The length of each column must be the same, i.e. there must be an observation provided for each variable or a missing placeholder (NA) must be used.
```

In [None]:
# assign a table of data consisting of five observations to the object my_dataframe
# each new observation - perhaps a person filling in a survey will be given a unique ID
ID<-c(1:5)
mood<-c("happy", "happy", "sad", NA, "sad")
score<-c("low", "medium", "high", "high", "low")
my_dataframe<-data.frame(ID, mood, score)
# display my_dataframe
my_dataframe

| ID | mood | score |
|---|---|---|
| `<int>` | `<chr>` | `<chr>` |
| 1 | happy | low |
| 2 | happy | medium |
| 3 | sad | high |
| 4 | NA | high |
| 5 | sad | low |
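A few base R functions give a quick overview of a data frame. The sketch below rebuilds the same table under the illustrative name `survey` so that it runs on its own.

```r
survey <- data.frame(
  ID    = 1:5,
  mood  = c("happy", "happy", "sad", NA, "sad"),
  score = c("low", "medium", "high", "high", "low")
)

# nrow() and ncol() give the number of observations and variables
cat("Rows:", nrow(survey), "Columns:", ncol(survey), "\n")   # 5 and 3

# $ extracts a single column; is.na() flags the missing entries in it
n_missing <- sum(is.na(survey$mood))
cat("Missing mood values:", n_missing, "\n")                 # 1
```

`str(survey)` prints a compact summary of every column and its class in one go.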


We could store all of this data numerically. We could convert each variable into a numeric variable, for example,

In [None]:
# assign mood and score responses as numeric
numeric_mood<-as.numeric(mood=="happy")
score<-factor(score, levels=c("low", "medium", "high"))
numeric_score<-as.numeric(score)
# assign the dataframe
my_numeric_dataframe<-data.frame(ID, numeric_mood, numeric_score)
# display my_numeric_dataframe
my_numeric_dataframe

| ID | numeric_mood | numeric_score |
|---|---|---|
| `<int>` | `<dbl>` | `<dbl>` |
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 0 | 3 |
| 4 | NA | 3 |
| 5 | 0 | 1 |


or we could make use of the `stringsAsFactors` option,

In [None]:
# use the stringsAsFactors option in my_dataframe
my_dataframe<-data.frame(ID, mood, score, stringsAsFactors = TRUE)
# display my_dataframe
my_dataframe

| ID | mood | score |
|---|---|---|
| `<int>` | `<fct>` | `<fct>` |
| 1 | happy | low |
| 2 | happy | medium |
| 3 | sad | high |
| 4 | NA | high |
| 5 | sad | low |


```{tip}
This second option allows us to see the raw data inputs but the computer stores them in a memory efficient way.
```

A quick way to see the variables (column headers) used in your dataframe is to use `names()`, for example:

In [None]:
# ask for the column/variable names from my_numeric_dataframe
cat("The variables in my_numeric_dataframe are:", names(my_numeric_dataframe), "\n")

The variables in my_numeric_dataframe are: ID numeric_mood numeric_score 


```{note}
You can also change the variable names using the `names()` function.
```

In [None]:
# Rename numeric_mood to mood and numeric_score to score
names(my_numeric_dataframe)[names(my_numeric_dataframe) == "numeric_mood"] <- "mood"
names(my_numeric_dataframe)[names(my_numeric_dataframe) == "numeric_score"] <- "score"
# display the dataframe
my_numeric_dataframe

| ID | mood | score |
|---|---|---|
| `<int>` | `<dbl>` | `<dbl>` |
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 0 | 3 |
| 4 | NA | 3 |
| 5 | 0 | 1 |


## Importing/Exporting Data

There are many ways to read data into R because there are many ways to save and store data. One of the more common formats is CSV (comma-separated values). We can read a CSV file into R using the `read_csv()` function from the `readr` package, which is loaded as part of the `tidyverse`.

It may be that you are creating your own data and storing it in an Excel spreadsheet. You can easily create a CSV file from Excel by going to File->Save As and selecting CSV from the File Format dropdown menu. If you are creating your own data, use the first row to store sensible, short but informative column headings, which will become the names of your variables in R.
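As a rough sketch of exporting and re-importing a CSV, the example below writes a small made-up data frame to a temporary file and reads it back. It uses base R's `write.csv()` and `read.csv()` so that it runs without any extra packages; with the tidyverse loaded, `read_csv()` could be used for the import step instead. The file location and the object names are assumptions for illustration only.

```r
# A small made-up data set; the first row of the CSV will hold these headings
example_data <- data.frame(
  ID    = 1:3,
  score = c("low", "medium", "high")
)

# Write to a temporary file (in practice, use a path in your working directory)
csv_path <- tempfile(fileext = ".csv")
write.csv(example_data, csv_path, row.names = FALSE)

# Read the file back in; the column headings become the variable names
imported <- read.csv(csv_path)
cat("Variables read from file:", names(imported), "\n")
cat("Number of observations:", nrow(imported), "\n")
```

Setting `row.names = FALSE` stops R from writing an extra unnamed column of row numbers into the file.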

This is a very brief introduction to the basics of R. Carry on with Task 2 in the guided study to get some more hands-on practice.