-
-
Notifications
You must be signed in to change notification settings - Fork 25
/
01-intro_to_base_R.Rmd
526 lines (375 loc) · 19.2 KB
/
01-intro_to_base_R.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
---
title: "Introduction to R and RStudio"
author: Originally authored by Stephanie J. Spielman,<br>adapted by CCDL for ALSF
date: 2021
output:
html_notebook:
toc: true
toc_float: true
---
## Objectives
This notebook will demonstrate how to:
- Navigate the RStudio environment
- Use R for simple calculations, both mathematical and logical
- Define and use variables in base R
- Understand and apply base R functions
- Understand, define, and use R data types, including vector manipulation and indexing
- Understand the anatomy of a data frame
---
#### *More resources for learning R*
- [Swirl, an interactive tutorial](https://swirlstats.com/)
- [_R for Data Science_ book](https://r4ds.had.co.nz/)
- [Tutorial on R, RStudio and R Markdown](https://ismayc.github.io/rbasics-book/)
- [Handy R cheatsheets](https://www.rstudio.com/resources/cheatsheets/)
- [R Markdown website](https://rmarkdown.rstudio.com)
- [_R Markdown: The Definitive Guide_](https://bookdown.org/yihui/rmarkdown/)
## What is R?
**R** is a statistical computing language that is _open source_, meaning the underlying code for the language is freely available to anyone.
You do not need a special license or set of permissions to use and develop code in R.
R itself is an _interpreted computer language_ and comes with functionality that comes bundled with the language itself, known as **"base R"**.
But there is also rich additional functionality provided by **external packages**, or libraries of code that assist in accomplishing certain tasks and can be freely downloaded and loaded for use.
In the next notebook and subsequent modules, we will be using a suite of packages collectively known as [**The Tidyverse**](https://tidyverse.org).
The `tidyverse` is geared towards intuitive data science applications that follow a shared data philosophy.
But there are still many core features of base R which are important to be aware of, and we will be using concepts from both base R and the tidyverse in our analyses, as well as task specific packages for analyses such as gene expression.
### What is RStudio?
RStudio is a _graphical environment_ ("integrated development environment" or IDE) for writing and developing R code. RStudio is NOT a separate programming language - it is an interface we use to facilitate R programming.
In other words, you can program in R without RStudio, but you can't use the RStudio environment without R.
For more information about RStudio than you ever wanted to know, see this [RStudio IDE Cheatsheet (pdf)](https://github.com/rstudio/cheatsheets/raw/main/rstudio-ide.pdf).
## The RStudio Environment
The RStudio environment has four main **panes**, each of which may have a number of tabs that display different information or functionality. (their specific location can be changed under Tools -> Global Options -> Pane Layout).
![RStudio Appearance](screenshots/rstudio-panes.png)
1. The **Editor** pane is where you can write R scripts and other documents. Each tab here is its own document.
This is your _text editor_, which will allow you to save your R code for future use.
Note that change code here will not run automatically until you run it.
2. The **Console** pane is where you can _interactively_ run R code.
+ There is also a **Terminal** tab here which can be used for running programs outside R on your computer
3. The **Environment** pane primarily displays the variables, sometimes known as _objects_ that are defined during a given R session, and what data or values they might hold.
4. The **Help viewer** pane has several tabs all of which are pretty important:
+ The **Files** tab shows the structure and contents of files and folders (also known as directories) on your computer.
+ The **Plots** tab will reveal plots when you make them
+ The **Packages** tab shows which installed packages have been loaded into your R session
+ The **Help** tab will show the help page when you look up a function
+ The **Viewer** pane will reveal compiled R Markdown documents
## Basic Calculations
### Mathematical operators
The most basic use of R is as a regular calculator:
| Operation | Symbol |
|-----------|--------|
| Add | `+` |
| Subtract | `-` |
| Multiply | `*` |
| Divide | `/` |
| Exponentiate | `^` or `**` |
For example, we can do some simple multiplication like this.
When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the *Run* button within the chunk or by
placing your cursor inside it and pressing *Cmd+Shift+Enter*.
```{r calculator}
5 * 6
```
Use the console to calculate other expressions. Standard order of operations applies (mostly), and you can use parentheses `()` as you might expect (but not brackets `[]` or braces`{}`, which have special meanings). Note however, that you must **always** specify multiplication with `*`; implicit multiplication such as `10(3 + 4)` or `10x` will not work and will generate an error, or worse.
```{r expressions, live = TRUE}
10 * (3 + 4)^2
```
### Defining and using variables
To define a variable, we use the _assignment operator_ which looks like an arrow: `<-`, for example `x <- 7` takes the value on the right-hand side of the operator and assigns it to the variable name on the left-hand side.
```{r var-define, live = TRUE}
# Define a variable x to equal 7, and print out the value of x
x <- 7
# We can have R repeat back to us what `x` is by just using `x`
x
```
Some features of variables, considering the example `x <- 7`:
Every variable has a **name**, a **value**, and a **type**.
This variable's name is `x`, its value is `7`, and its type is `numeric` (7 is a number!).
Re-defining a variable will overwrite the value.
```{r var-redefine}
x <- 5.5
x
```
We can modify an existing variable by reassigning it to its same name.
Here we'll add `2` to `x` and reassign the result back to `x`.
```{r var-modify, live = TRUE}
x <- x + 2
x
```
### Variable naming note:
As best you can, it is a good idea to make your variable names informative (e.g. `x` doesn't mean anything, but `sandwich_price` is meaningful... if we're talking about the cost of sandwiches, that is..).
### Comments
Arguably the __most important__ aspect of your coding is comments: Small pieces of explanatory text you leave in your code to explain what the code is doing and/or leave notes to yourself or others.
Comments are invaluable for communicating your code to others, but they are most important for **Future You**.
Future You comes into existence about one second after you write code, and has no idea what on earth Past You was thinking.
Comments in R code are indicated with pound signs (*aka* hashtags, octothorps). R will _ignore_ any text in a line after the pound sign, so you can put whatever text you like there.
```{r comments}
22/7 # not quite pi
# If we need a better approximation of pi, we can use Euler's formula
# This uses atan(), which calculates arctangent.
20 * atan(1/7) + 8 * atan(3/79)
```
Help out Future You by adding lots of comments!
Future You next week thinks Today You is an idiot, and the only way you can convince Future You that Today You is reasonably competent is by adding comments in your code explaining why Today You is actually not so bad.
## Functions
We can use pre-built computation methods called "functions" for other operations.
Functions have the following format, where the _argument_ is the information we are providing to the function for it to run.
An example of this was the `atan()` function used above.
```r
function_name(argument)
```
To learn about functions, we'll examine one called `log()` first.
To know what a function does and how to use it, use the question mark which will reveal documentation in the **help pane**: `?log`
![rhelp](screenshots/rhelp-log.png)
The documentation tells us that `log()` is derived from `{base}`, meaning it is a function that is part of base R.
It provides a brief description of what the function does and shows several examples of to how use it.
In particular, documentation tells about how what argument(s) to provide:
+ The first _required_ argument is the value we'd like to take the log of, by default its _natural log_
+ The second _optional_ argument can specify a different base rather than the default `e`.
Functions also _return_ values for us to use.
In the case of `log()`, the returned value is the log'd value the function computed.
```{r log}
log(73)
```
Here we can specify an _argument_ of `base` to calculate log base 3.
```{r log3}
log(81, base = 3)
```
If we don't specify the _argument_ names, it assumes they are in the order that `log` defines them.
See `?log` to see more about its arguments.
```{r log2, live = TRUE}
log(8, 2)
```
We can switch the order if we specify the argument names.
```{r log-order}
log(base = 10, x = 4342)
```
We can also provide variables as arguments in the same way as the raw values.
```{r log-variable}
meaning <- 42
log(meaning)
```
## Working with variables
### Variable Types
Variable types in R can sometimes be _coerced_ (converted) from one type to another.
```{r}
# Define a variable with a number
x <- 15
```
The function `class()` will tell us the variable's type.
```{r}
class(x)
```
Let's coerce it to a character.
```{r}
x <- as.character(x)
class(x)
```
See it now has quotes around it? It's now a character and will behave as such
```{r}
x
```
Use this chunk to try to perform calculations with `x`, now that it is a character, what happens?
```{r live = TRUE}
# Try to perform calculations on `x`
```
But we can't coerce everything:
```{r}
# Let's create a character variable
x <- "look at my character variable"
```
Let's try making this a numeric variable:
```{r coerce-char, error=TRUE}
x <- as.numeric(x)
```
Print out `x`.
```{r}
x
```
R is telling us it doesn't know how to convert this to a numeric variable, so it has returned `NA` instead.
For reference, here's a summary of some of the most important variable types.
| Variable Type | Definition | Examples | Coercion |
|---------------|------------|----------| --------|
| `numeric` | Any number value | `5`<br>`7.5` <br>`-1`| `as.numeric()`
| `integer` | Any _whole_ number value (no decimals) | `5` <br> `-100` | `as.integer()`
|`character` | Any collection of characters defined within _quotation marks_. Also known as a "string". | `"a"` (a single letter) <br>`"stringofletters"` (a whole bunch of characters put together as one) <br> `"string of letters and spaces"` <br> `"5"` <br> `'single quotes are also good'` | `as.character()`
|`logical` | A value of `TRUE`, `FALSE`, or `NA` | `TRUE` <br> `FALSE` <br> `NA` (not defined) | `as.logical()`
|`factor` | A special type of variable that denotes specific categories of a categorical variable | (stay tuned..) | `as.factor()`
### Vectors
You will have noticed that all your computations tend to pop up with a `[1]` preceding them in R's output.
This is because, in fact, all (ok mostly all) variables are _by default_ vectors, and our answers are the first (in these cases only) value in the vector.
As vectors get longer, new index indicators will appear at the start of new lines.
```{r}
# This is actually an vector that has one item in it.
x <- 7
```
```{r vector-length}
# The length() functions tells us how long an vector is:
length(x)
```
We can define vectors with the function `c()`, which stands for "combine".
This function takes a comma-separated set of values to place in the vector, and returns the vector itself:
```{r make-vector}
my_numeric_vector <- c(1, 1, 2, 3, 5, 8, 13, 21)
my_numeric_vector
```
We can build on vectors in place by redefining them
```{r fibbonacci, live = TRUE}
# add the next two Fibonacci numbers to the series.
my_numeric_vector <- c(my_numeric_vector, 34, 55)
my_numeric_vector
```
We can pull out specific items from an vector using a process called _indexing_, which uses brackets `[]` to specify the position of an item.
```{r subset1}
# Grab the fourth value from my_numeric_vector
# This gives us an vector of length 1
my_numeric_vector[4]
```
Colons are also a nice way to quickly make ordered numeric vectors
Use a colon to specify an inclusive range of indices
This will return an vector with 2, 3, 4, and 5.
```{r subset-many}
my_numeric_vector[2:5]
```
One major benefit of vectors is the concept of **vectorization**, where R by default performs operations on the _entire vector at once_.
For example, we can get the log of all numbers 1-2 with a single, simple call, and more!
```{r vectorize}
values_1_to_20 <- 1:20
```
```{r vectorize-log, live = TRUE}
# calculate the log of values_1_to_20
log(values_1_to_20)
```
Finally, we can apply logical expressions to vectors, just as we can do for single values.
The output here is a logical vector telling us whether each value in example_vector is TRUE or FALSE
```{r vector-compare}
# Which values are <= 3?
values_1_to_20 <= 3
```
There are several key functions which can be used on vectors containing numeric values, some of which are below.
+ `mean()`: The average value in the vector
+ `min()`: The minimum value in the vector
+ `max()`: The maximum value in the vector
+ `sum()`: The sum of all values in the vector
We can try out these functions on the vector `values_1_to_20` we've created.
```{r vector-funcs}
mean(values_1_to_20)
# Try out some of the other functions we've listed above
```
### A note on variable naming
We have learned functions such as `c`, `length`, `sum`, and etc.
Imagine defining a variable called `c`: This will work, but it will lead to a
lot of unintended bugs, so its best to avoid this.
### The `%in%` logical operator
`%in%` is useful for determining whether a given item(s) are in an vector.
```{r in-operator}
# is `7` in our vector?
7 %in% values_1_to_20
```
```{r in2, live = TRUE}
# is `50` in our vector?
50 %in% values_1_to_20
```
We can test a vector of values being within another vector of values.
```{r vector-in, live = TRUE}
question_values <- c(1:3, 7, 50)
# Are these values in our vector?
question_values %in% values_1_to_20
```
## Data frames
_Data frames are one of the most useful tools for data analysis in R._
They are tables which consist of rows and columns, much like a _spreadsheet_.
Each column is a variable which behaves as a _vector_, and each row is an observation.
We will begin our exploration with dataset of measurements from three penguin species measured, which we can find in the [`palmerpenguins` package](https://allisonhorst.github.io/palmerpenguins/).
We'll talk more about packages soon!
To use this dataset, we will load it from the `palmerpenguins` package using a `::` (more on this later) and assign it to a variable named `penguins` in our current environment.
```{r penguin-library}
penguins <- palmerpenguins::penguins
```
![drawings of penguin species](diagrams/lter_penguins.png) Artwork by @allison_horst
### Exploring data frames
The first step to using any data is to look at it!!!
RStudio contains a special function `View()` which allows you to literally view a variable.
You can click on the object in the environment pane.
Some useful functions for exploring our data frame include:
+ `head()` to see the first 6 rows of a data frame. Additional arguments supplied can change the number of rows.
+ `tail()` to see the last 6 rows of a data frame. Additional arguments supplied can change the number of rows.
+ `names()` to see the column names of the data frame.
+ `nrow()` to see how many rows are in the data frame
+ `ncol()` to see how many columns are in the data frame.
We can additionally explore _overall properties_ of the data frame with two different functions: `summary()` and `str()`.
This provides summary statistics for each column:
```{r penguins-summary}
summary(penguins)
```
This provides a short view of the **str**ucture and contents of the data frame.
```{r penguins-str}
str(penguins)
```
You'll notice that the column `species` is a _factor_: This is a special type of character variable that represents distinct categories known as "levels".
We have learned here that there are three levels in the `species` column: Adelie, Chinstrap, and Gentoo.
We might want to explore individual columns of the data frame more in-depth.
We can examine individual columns using the dollar sign `$` to select one by name:
```{r penguins-subset}
# Extract bill_length_mm as a vector
penguins$bill_length_mm
# indexing operators can be used too
penguins$bill_depth_mm[1:10]
```
We can perform our regular vector operations on columns directly.
```{r penguins-col-mean, live = TRUE}
# calculate the mean of the bill_length_mm column
mean(penguins$bill_length_mm,
na.rm = TRUE) # remove missing values before calculating the mean
```
We can also calculate the full summary statistics for a single column directly.
```{r penguins-col-summary, live = TRUE}
# show a summary of the bill_length_mm column
summary(penguins$bill_length_mm)
```
Extract `Species` as a vector and subset it to see a preview.
```{r penguins-col-subset, live = TRUE}
# get the first 10 values of the Species column
penguins$species[1:10]
```
And view its _levels_ with the `levels()` function.
```{r penguin-levels}
levels(penguins$species)
```
## Files and directories
In many situations, we will be reading in tabular data from a file and using it as a data frame.
To practice, we will read in a file we will be using in the next notebook as well, `gene_results_GSE44971.tsv`, in the `data` folder.
File paths are relative to the location where this notebook file (.Rmd) is saved.
Here we will use a function, `read_tsv()` from the `readr` package.
Before we are able to use the function, we have to load the package using `library()`.
```{r readr}
library(readr)
```
`file.path()` creates a properly formatted file path by adding a path separator (`/` on Mac and Linux operating system, which is the operating system that our RStudio Server runs on) between separate folders or directories.
Because file path separators can differ between your computer and the computer of someone who wants to use your code, we use `file.path()` instead of typing out `"data/gene_results_GSE44971.tsv"`.
Each _argument_ to `file.path()` is a directory or file name.
You'll notice each argument is in quotes, we specify `data` first because the file, `gene_results_GSE44971.tsv` is in the `data` folder.
```{r file.path}
file.path("data", "gene_results_GSE44971.tsv")
```
We can store this file path as a variable in our environment.
```{r file.path-variable}
gene_file_path <- file.path("data", "gene_results_GSE44971.tsv")
```
Now we are ready to use `read_tsv()` to read the file into R.
The resulting data frame will be stored in a variable named `stats_df`.
Note the `<-` which is responsible for saving this to our global environment.
```{r read-stats}
# read in the file `gene_results_GSE44971.tsv` from the data directory
stats_df <- read_tsv(gene_file_path)
```
Take a look at your environment panel to see what `stats_df` looks like.
We can also print out a preview of the `stats_df` data frame here.
```{r show-stats, live = TRUE}
# display stats_df
stats_df
```
### Session Info
At the end of every notebook, you will see us print out `sessionInfo`.
This aids in the reproducibility of your code by showing exactly what packages
and versions were being used the last time the notebook was run.
```{r}
sessionInfo()
```