-
Notifications
You must be signed in to change notification settings - Fork 1
/
RPrimer.Rmd
664 lines (406 loc) · 19 KB
/
RPrimer.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
---
title: "An R primer for humdrumR users"
author: "Nathaniel Condit-Schultz"
date: "July 2022"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{An R primer for humdrumR users}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
markdown:
wrap: sentence
---
```{r, include = FALSE, message=FALSE, echo = FALSE }
source('vignette_header.R')
```
`r Hm` is a package for the R programming language.
You don't need to be an R master to use `r hm`, but there are some basic concepts from R
that you will need to learn, and ultimately, if you want to get really advanced, you'll
need to develop some R skills.
This document is a basic primer for R, which will teach you the basics you need to know in order to
make the most out of `r hm`.
# Basic Commands
R code is made up of "expressions" like `2 + 2`, `sqrt(2)`, or `(x - mean(x))^2`.
As you can see, you can create very intuitive *arithmetic* expressions, like `5 / 2` or `3 * 3`.
However, the most common elements of R expressions are "calls" to *functions*.
A "function" in R is a pre-built bit of code that does something.
Most functions take one or more input *arguments*, and "return" some kind of output.
For example, the function `sqrt()` takes a number as an input argument, and "returns" the square root of that number.
```{r}
sqrt(2)
sqrt(9)
```
To "call" a function, we write the function's name, followed by parentheses (`()`).
Any input arguments to the function must go inside the parentheses, separated by commas if there are more than one.
Here are some examples of common functions being "called" with zero or more input arguments:
```{r}
abs(-3)
mean(1:5)
max(1, 5) # two arguments
c(1, 2, 3) # three arguments
Sys.time() # no arguments!
```
Different functions have different arguments they recognize, with specific names.
For example, the function `log()` takes two arguments, called `x` and `base`.
Other functions can take any number of arguments, with any name.
You can learn about a function, including the arguments it accepts, by typing `?functionName` at the command line; for example, `?sqrt` or `?mean`.
If you see an argument called `...`, that tells you that the function can take any number of arguments.
We can explicitly "name" the function arguments we want by putting `argname = argument` into our calls:
For example, you could say `log(10, base = 2)`.
Named arguments are very useful when we are creating data, like vectors and `data.frame`s (see below).
### Pipes
Complex expressions might involve a large number of function calls, which can get tiresome to read (or write).
For example, something like
```
log(round(sqrt(mean(x^2)), base = 2)
```
calls *four* functions!
An expression like that is a bit tricky to read, and it can be really easy to make a mistake where
you put the wrong number of parentheses.
As an alternative, R gives us the option of calling functions in a "pipe."
The way this works is we use the "pipe" command `|>`, which takes an input on the left and "pipes" it into a function call on the right.
For example, we can rewrite the previous command as:
```
x^2 |> mean() |> sqrt() |> round() |> log(base = 2)
```
Much better!
To make things even cleaner, R will understand if you spread your expressions across multiple lines, by putting a new line after each `|>`, or function argument:
```
x^2 |>
mean() |>
sqrt() |>
round() |>
log(base = 2)
max(sqrt(2),
log(2),
exp(2),
pi / 2)
```
## Variables
When coding in R, you'll often want to "save" data or other objects so you can reuse them.
We do this by "assigning" something (often the result of a function) to a "*variable*".
This is done using the assignment operators, either `<-` or `->`.
A variable name can be any combination of upper and lowercase letters.
Let's calculate the square-root of two and save it to a variable:
```{r}
tworoot <- sqrt(2)
```
We can then reuse that value as many times as we want:
```{r}
tworoot^2
tworoot * 2
c(tworoot, tworoot)
```
You can also assign from left to right, using `->`.
This is useful in combination with pipes:
```{r}
tworoot |> exp() |> round() -> newvalue
```
----
Note your variable names can also include `_`, `.`, or numeric digits, as long as they aren't at the beginning of the name.
For example, `X1` or `my_name` are valid names---but not `2X`.
# Basic Data Structures
In R, there two fundamental data structures that are used all the time:
+ "atomic" **vectors**
+ data.frames
## Vectors
In R, the basic units---the *atoms*, if you will---of information are called "atomic" vectors.
There are three basic atomic data types:
+ **Numbers**---`numeric` values.
+ Examples: `3`, `4.2`, `-13`, `254.30`
+ **Strings of characters**: `character` values.
+ Examples: `"note"`, `"a"`, `"do, a dear, a female dear"`
+ **Logical Trues and Falses**: `logical` values.
+ Examples: `TRUE`, `FALSE`
You might be wondering, why are we calling these basic atoms "*vectors*"?
Well, in R, the basic atomic data types are always considered a *collection* of ordered values.
These ordered collections are called vectors.
In the simple examples above, each vector only had a single value, so it just looks like one
value---single values like this are often called "scalars".
However, R doesn't *really* distinguish between scalars (single values) and vectors
(multiple values)---everything is always a vector.
(Still, we sometimes refer to length-1 vectors as scalars.)
To make a vector from scratch in R, use `c()`, as so:
```{r}
c(1, 2, 3)
c("Bach", "Mozart", "Beethoven", "Brahms")
c(TRUE, FALSE)
c(32.3)
32.3
```
In this example, we've created five vectors.
+ A `numeric` vector of length 3.
+ A `character` vector of length four (composers).
+ A `logical` vector of length 2.
+ Two `numeric` vectors of length 1.
+ That's right, `c(32.3)` and `32.3` are the same thing---a vector of length 1.
---
Notice that vectors can't mix-and-match different data types; which makes sense because a vector *is* a single type of thing.
But this means that commands like `c(3, "a")` will actually create a `character` vector, where the `3` is forced to be a character (`"3"`).
### Vectorization
Having everything be a vector all the time is very useful, because it allows us to think of and
use collections of data as single thing.
If I give you, say, ten thousand numbers, you don't have to worry about manipulating ten thousand things:
rather, you just work with *one* thing: a vector, which happens to be of length 10,000.
In R, we call this **vectorization**---generally, in R and in `r hm` we will constantly
be taking advantage of vectorization to make our lives super easy!
---
For an example of vectorization, watch this:
```{r}
c(1, 1, 2, 3, 5, 8, 13, 21) * 2
```
We created two `numeric` vectors:
1. The first eight numbers of the Fibonacci sequence
2. the single number `2`
and multiplied them together!
Notice that the entire Fibonacci vector is multiplied by two!
We don't have to worry about multiplying each number of the vector, it's done for us.
----
There are two ideal circumstances for working with vectors.
1. They are the same length.
2. One vector is length 1, and the other isn't.
In the first case, we work with multiple vectors that are all the same length,
each value in each vector is "lined" up with values in the other vector.
If we, for example, add two such vectors together, each "lined up" pair of numbers is added:
```{r}
c(1, 2, 3) + c(5, 4, 3)
paste(c('a', 'b', 'c'), c(1, 2, 3))
```
In the second case, one of the vectors is length-1 (a "scalar").
In this case, the scalar value is paired with each value in the longer vector (as in the
Fibonacci example above).
```{r}
c(1, 2, 3) + 5
paste(c('a', 'b', 'c'), 1)
```
----
What happens if we have vectors that are longer than one, but are not the same length?
Well, R will generally attempt to "recycle" the shorter vector---which means repeat it---
as necessary to match the length of the longer vector.
If the shorter vector evenly divides the longer vector, you generally won't have a problem:
```{r}
c(1, 2, 3, 4) * c(2, 3)
```
If the division is not perfect, R will still "recycle" the shorter vector, but you'll get a warning:
```{r}
c(1, 2, 3, 4) * c(2, 3, 4)
```
You see the warning message R have us?
> "longer object length is not a multiple of shorter object length"
That's R telling us that we've got an obvious mismatch in the lengths of our vectors.
---
Generally, it is best to work with vectors that are all the same length and/or scalar values (length-1 vectors),
so you can avoid worrying about how exactly R is "recycling" values.
This brings us too...
### Factors
Factors are a useful modification of `character` vectors, which keep track of all the possible values ("levels")
you expect in your data, *even when some of those levels are missing from the vector*.
This is mainly useful when we are counting data with `table()`.
For example, let's consider the built-in R object called `letters`:
```{r}
letters
```
What happens if we call `table()` on `letters`?:
```{r}
table(letters)
```
Every letter appears once in the table, duh!
What if we randomly sample a handful letters and table the result?
```{r}
sample(letters, 15, replace = TRUE) |> table()
```
Notice that not all the letters from table appear in the output.
E.g., if a letter never appears in the sample, it doesn't get counted.
Let's try something new: before sampling, I will call the command `factor()` on `letters`:
```{r}
factor(letters) |> sample(15, replace = TRUE) |> table()
```
Ah! Now our table includes all possible letters, even though many of them appear `0` times.
----
So how does this work?
Well the `factor()` function looks at a `character` vector and outputs a new "factor" vector.
The factor vector acts just like a `character` vector, except it remembers all the unique values,
or "levels", in the vector:
```{r}
factor(letters)
```
Even if we remove some values from the factor vector, the vector will "remember" these levels.
The factor will also remember the *order* of the levels, so you can make tables ordered the way you want them.
You can access, or set, the levels of a factor using these using the `levels()` function, or with the `levels`
argument to the `factor()` function itself.
Maybe we want to tabulate the letters, but put the vowels first:
```{r}
factor(letters, levels = c('a', 'e', 'i', 'o', 'u',
"b", "c", "d", "f", "g", "h", "j", "k",
"l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x", "y", "z")) |>
sample(15) |>
table()
```
Note that if a `character` string contains values that you don't include in your `levels`, the value will show up `NA` in the resulting factor, and you may see warnigns like "`invalid factor level, NA generated`."
## Data frames
Data frames are the heart and soul of R.
A `data.frame` is simply a collection of vectors that are *all the same length*---ideal for vectorized operations!
The vectors in a `data.frame` are arranged as columns in a two dimension table.
Let's make a data frame, by feeding some vectors to the [data.frame()] function:
```{r}
X <- c("C", "D", "E", "F", "G", "A", "B", "C")
Y <- c(0, 2, 4, 5, 7, 9, 11, 12)
Z <- c("P1", "M2", "M3", "P4", "P5", "M6", "M7", "P8")
df <- data.frame(X, Y, Z)
df
```
Notice that each of your columns/vectors can be a different type, with no problem.
Also notice, that each column has a name; we can inspect these names using the `colnames()` function.
```{r}
colnames(df)
```
Or change them:
```{r}
colnames(df) <- c('Letters', "Semitones", "Intervals")
df
```
Finally, it's also possible to assign the column name we want when creating the data frame:
```{r}
data.frame(Letters = X, Semitones = Y, Intervals = Z)
```
----
Remember, the vectors in a `data.frame` must all be the same length.
If you tried to make a `data.frame` with a vectors that don't match in length, you'll get an error "`arguments imply differing number of rows`."
The one exception is that you can call `data.frame` with some *scalar* single values, which will be automatically recycled to match
the length of the other vectors.
### With and Within You
We often want to access the columns/vectors held in a data frame.
We can do this several ways.
One approach is with the `$` operator, combined with the name of the column we want.
For example, we can get the `Letters` column from the data frame we made above using `df$Letters`.
Often, we'll want to write code that uses a bunch of different columns from the same data.frame---in fact, this is the *main* thing we do most of the time in R!
To avoid writing `df$` over and over again, we can use the `with()` function.
`with()` allows us to drop "inside" our `data.frame`, where our R commands can "see" the columns variables:
```{r}
with(df,
paste(Intervals, Semitones, sep = ' = '))
```
## Missing Data
Sometimes we'll encounter data points which are irrelevant, meaningless, or "not applicable."
In other cases, there may be relevant data that is "missing."
R provides two distinct ways to represent missing/irrelevant data: `NULL` and `NA`.
`NULL` is a special R object/variable, which is used represent something that is totally missing or empty.
`NULL` has no length (`length(NULL) == 0`) and no value. It cannot be indexed. Many functions will give an error if passed a `NULL`.
`NA` is quite different than `NULL`.
Any atomic vector can have `NA` value at any (or all) indices---in fact, you can have vectors or `NA` values.
The `NA` values are still "values" in a vector, but they are used indicate when there are values that are missing or problematic.
Passing a vector with `NA` values to most functions does not lead to an error, though you'll often get a warning message instead.
For example, consider what happens if we apply the command `as.numeric()` to the following strings:
```{r}
numbers <- as.numeric(c("1", "2", "apple", "4.2"))
numbers
```
Four of the strings in this vector are converted to numbers without a problem,
but the string `"apple"` makes no sense as a number.
So what does R do?
It converts the three strings to numbers, just like `as.numeric()` is supposed to,
but the `"apple"` string appears as `NA` in the outut.
We also get warning message: `NAs introduced by coercion`.
You might see that warning sometimes, so now you know what it means!
What would happen if we tried applying a different function onto our vector with an `NA`?
```{r}
sqrt(numbers)
```
The `sqrt()` function has no problem taking the square-roots of the three numbers, and it simply "propogates" the `NA` value in its input through to its output.
The "propogation" of missing values is a very useful feature in R:
it makes sure that we *keep track* of what data is missing, while keeping our vectors all their original lengths.
# Common Functions
+ `getwd()` --- Get R's current working directory.
+ `setwd()` --- Set R's working directory.
+ `summary()` --- Summarize the contents of an R object.
## Vector functions
+ `sort()` --- Put values of a vector into ascending order.
+ set `decreasing = TRUE` for decreasing order.
+ `rev()` --- Reverse the order of a vector.
+ `rep()` --- Repeat a vector.
+ `unique()` --- Returns only the unique values of a vector.
+ `x %in% y` --- Which elements of the vector `x` appear in the vector `y`?
+ `length()` --- How long is the vector (or `list()`)?
+ `head()` and `tail()` --- Return the first or last $N$ elements of a vector.
+ Provide the `n` argument a natural number to control $N$.
### Sequences and Indices
+ `x:y` --- Create a sequence of integers from `x` to `y`.
+ For example, `1:10` makes a vector of integers from one to ten.
+ `seq()` --- Create arbitrary sequences of numbers.
+ `which()` --- Which indices in a logical vector are true?
+ For example, `which(c(TRUE, FALSE, TRUE))` returns `c(1,3)`.
## String functions
+ `paste()` --- Paste together multiple `character` strings.
+ `nchar()` --- Counts how many characters there are in each string of a character vector.
## Math
### Arithmetic
+ `x + y` --- Addition; $x + y$.
+ `x - y` --- Subtraction; $x - y$.
+ `-x` --- Negation; $-x$.
+ `x * y` --- Multiplication; $xy$
+ `x^y` --- Exponentiation; $x^y$.
+ Use parentheses for things like `x^(1/3)`; $x^{\frac{1}{3}}$.
+ `x / y` --- Real division; $\frac{x}{y}$.
+ `x %/% y` --- [Euclidean division](https://en.wikipedia.org/wiki/Euclidean_division); $\lfloor \frac{x}{y} \rfloor$.
+ E.g., whole-number division with remainder.
+ `x %% y` --- `x` modulo `y`; $x \mod y$.
+ E.g., remainder after whole-number division.
+ `diff(x)` --- This function calculates the differences between consecutive values in a numeric vector.
+ `diff(c(5, 3))` is the same as `3 - 5`.
### Other Math functions
+ `sqrt(x)` --- Square-root of numbers; $\sqrt{x}$.
+ `abs(x)` --- Absolute value of numbers; $|x|$
+ `round(x)` --- Round number to nearest integer; $\lfloor x \rceil$
+ `log(x)` --- Log of number (natural log by default); $\log(x)$
+ `sign(x)` --- Sign (1, -1, or 0) of x; $\text{sgn}\ x$
### Distribution and Tendency Functions
+ `sum(x)` --- The sum of a numeric vector.
+ `max(x)` --- The maximum value in a numeric vector.
+ `min(x)` --- The minimum value in a numeric vector.
+ `range(x)` --- The minimum *and* maximum values of a numeric vector.
+ To get the size of the range, use `diff(range(x))`.
+ `mean(x)` --- The arithmetic mean of numeric vector.
+ `median(x)` --- The median of numeric vector.
+ `quantile(x)` --- Other distribution [quantiles](https://en.wikipedia.org/wiki/Quantile) of numeric vector.
## Randomization functions
+ `sample()` --- Takes a random sample from a vector. Can also be used to randomize the order of a vector.
## Analysis Functions
+ `table()` --- Tabulate all unique values in vector, or cross-tabulate across multiple vectors.
+ When using `r hm`, you should use the similar [count()] instead!
# Useful tricks
+ Would you like to know how many elements in a vector match a logical criteria? Take the sum of the logic:
+ `sum((1:100) > 55)`
+ `sum(letters %in% c('a', 'e', 'i', 'o', 'u'))`
+ Would you like to know what *proportion* of values match your logical criteria? Take the *mean* of the logic:
+ `mean((1:100) > 55)`
+ `sum(letters %iN% c('a', 'e', 'i', 'o', 'u'))`
# Making your own functions
To make your function in R, you use the `function` keyword, like so:
```
function(argument1, argument2, etc.) {
Expressions to evaluate here, involving the arguments
}
```
For example, let's make a function that subtracts the mean from a vector of numbers.
We'll have one argument, which we'll call `numbers`.
```{r}
myfunc <- function(numbers) {
mean <- mean(numbers)
numbers - mean
}
```
We've created our function, and assigned it the name `myfunc`, just like any other assignment.
Let's try it out:
```{r}
myfunc(1:9)
```
Notice that the *last* expression in your function definition is the value that gets "returned" by the function.
---
If you are feeling lazy, you can also define a function using a few less keystrokes using the command `\()` instead of `function()`.
For example,
```{r}
\(x) x + 1
```
# Common Errors and Warnings
TBA