# Week 1 {#week1}
```{r, echo=FALSE, message=FALSE, error=FALSE}
library(tidyverse)
library(magrittr)
library(knitr)
library(kableExtra)
# this course's URL
myurl <- "https://csde-uw.github.io/csde502-winter-2022"
# path to this file name
if (!interactive()) {
fnamepath <- current_input(dir = TRUE)
fnamestr <- paste0(Sys.getenv("COMPUTERNAME"), ": ", fnamepath)
} else {
fnamepath <- ""
}
```
<h2>Topics:</h2>
* [Getting started on terminal server 4](#gettingstarted)
* [Introduction to R/RStudio/R Markdown](#intrormd)
* [R data types](#rdatatypes)
* [R data structures](#rdatastructures)
* [File systems](#filesystems)
* [Data manipulation in the `tidyverse`](#tidyverse)
* [Data sets:](#datasets001)
    * Employee turnover data
<hr>
Today's lessons will cover getting started with computing at CSDE, and quickly introduce R, RStudio, and R Markdown.
It is assumed that students in this course have a basic working knowledge of using R, including how to create variables with the assignment operator ("`<-`"), and how to run simple functions (e.g., `mean(dat$age)`). Often in courses that include using R for statistical analysis, some of the following foundations are not explained fully. This section is not intended to be a comprehensive treatment of R data types and structures, but should provide some background for students who are either relatively new at using R or who have not had a systematic introduction.
The other main topic for today is [`tidyverse`](https://www.tidyverse.org/), which refers to a related set of R packages for data management, analysis, and display. See Hadley Wickham's [tidy tools manifesto](https://tidyverse.tidyverse.org/articles/manifesto.html) for the logic behind the suite of tools. For a brief description of the specific R packages, see [Tidyverse packages](https://www.tidyverse.org/packages/). This is not intended to be a complete introduction to the `tidyverse`, but should provide sufficient background for data handling to support most of the technical aspects of the rest of the course and CSDE 533.
## Getting started on Terminal Server 4 {#gettingstarted}
First, if you are not on campus, make sure you have the Husky OnNet VPN application running and have connected to the UW network. You should see the f5 icon in your task area:
![](images/week01/2021-01-07_21_40_25-.png)
Connect to TS4: `csde-ts4.csde.washington.edu`
If you are using the Windows Remote Desktop Protocol (RDP) connection, your connection parameters should look like this:
![](images/week01/2021-01-07_21_48_03-Remote Desktop Connection.png)
If you are using mRemoteNG, the connection parameters will match this:
![](images/week01/2021-01-07_21_37_36-Window.png)
Once you are connected you should see a number of icons on the desktop and application shortcuts in the Start area.
![](images/week01/2021-01-07_21_59_38-.png)
![](images/week01/2021-01-07_22_00_14-Window.png)
Open a Windows Explorer window (if you are running RDP in full-screen mode, you should be able to use the key combination Win-E).
Before doing anything, let's change some of the annoying default settings of the Windows Explorer. Tap `File > Options`. In the `View` tab, make sure that `Always show menus` is checked and `Hide extensions for known file types` is unchecked. The latter setting is very important because we want to see the complete file name for all files at all times.
![](images/week01/2021-01-07_22_30_46-Folder_Options.png)
Click `Apply to Folders` so that these settings become default. Click `Yes` to the next dialog.
![](images/week01/2021-01-07_22_31_37-FolderViews.png)
Now let's make a folder for the files in this course.
Navigate to This PC:
![](images/week01/2021-01-07_22_05_59-Window.png)
You should see the `H:` drive. This is the mapped drive that links to your [U Drive](https://itconnect.uw.edu/wares/online-storage/u-drive-central-file-storage-for-users/), and is the place where all of the data for this course is to be stored. __Do not store any data on the `C:` drive!__ The `C:` drive can be wiped without any prior notification.
__Be very careful with your files on the U Drive!__ If you delete files, there is no "undo" functionality. When you are deleting files, you will get a warning that you should take seriously:
![](images/week01/2021-01-07_23_01_10-Delete_Folder.png)
Navigate into `H:` and create a new folder named `csde502_winter_2022`. Note the use of lowercase letters and underscores rather than spaces. This will be discussed in the section on file systems later in this lesson.
![](images/week01/2021-01-07_22_32_29-new_folder.png)
## Introduction to R Markdown in RStudio {#intrormd}
### Create a project
Now we will use RStudio to create the first R Markdown source file and render it to HTML.
Start RStudio by either double-clicking the desktop shortcut or navigating to the alphabetical R section of the Start menu:
![](images/week01/2021-01-07_23_05_49-Window.png)
:::{.rmdnote}
***A brief aside: install R packages.***
To get started, open a second RStudio session, because installation usually takes some time. To install `tidyverse` and the other packages for CSDE 502, CSDE 533, and this lesson, download the file [`packages.R`](tools/packages.R).
Open the file in your second RStudio session and in the upper right of the source code pane, click `Source > Source`.
![](images/week01/2022-01-06 21_47_15-source.png)
Now continue on with the lesson in your original RStudio session.
:::
Create a new project (`File > New Project...`).
![](images/week01/2021-01-07_23_08_34-rstudiorappbroker.csde.washington.edu.png)
Since we just created the directory to house the project, select `Existing Directory`.
![](images/week01/2021-01-07_23_09_11-csde502_winter_2021_course-RStudiorappbroker.csde.washington.edu.png)
Navigate to that directory and select `Open`.
![](images/week01/2021-01-07_23_09_48-ChooseDirectoryrappbroker.csde.washington.edu.png)
Click `Create Project`.
![](images/week01/2021-01-07_23_10_02-csde502_winter_2021_course-RStudiorappbroker.csde.washington.edu.png)
You will now have a blank project with only the project file.
![](images/week01/2021-01-07_23_11_16-csde502_winter_2021-RStudiorappbroker.csde.washington.edu.png)
### Create an R Markdown file from built-in RStudio functionality
Let's make an R Markdown file (`File > New File > R Markdown...`).
![](images/week01/2021-01-07_23_12_31-csde502_winter_2021-RStudiorappbroker.csde.washington.edu.png)
Do not change any of the metadata ... this is just for a quick example.
![](images/week01/2021-01-07_23_13_41-csde502_winter_2021-RStudiorappbroker.csde.washington.edu.png)
Click `OK` and then name the file `week_01.Rmd`.
![](images/week01/2021-01-07_23_14_59-SaveFile-Untitled1rappbroker.csde.washington.edu.png)
#### Render the Rmd file as HTML
At the console prompt, enter `rmarkdown::render("w` and tap the `TAB` key. This should bring up a list of files that have the character "w" in the file name. Click `week_01.Rmd`.
The syntax here means "run the `render()` function from the `rmarkdown` package on the file `week_01.Rmd`".
![](images/week01/2021-01-07_23_15_32-csde502_winter_2021-RStudiorappbroker.csde.washington.edu.png)
After a few moments, the process should complete with a message that the output has been created.
![](images/week01/2021-01-07_23_16_13-csde502_winter_2021-RStudiorappbroker.csde.washington.edu.png)
If the HTML page does not open automatically, look for `week_01.html` in the list of files. Click it and select `View in Web Browser`.
![](images/week01/2021-01-07_23_16_39-csde502_winter_2021-RStudiorappbroker.csde.washington.edu.png)
You will now see the bare-bones HTML file.
![](images/week01/2021-01-07_23_17_10-Untitled.png)
Compare the output of this file with the source code in `week_01.Rmd`. Note there are section headers that begin with hash marks, and R code is indicated with the starting characters
<code>
\`\`\`\{r\}
</code>
and the ending characters
<code>
\`\`\`
</code>
Next, we will explore some enhancements to the basic R Markdown syntax.
### Create an R Markdown file with some enhancements
Download this version of [`week_01.Rmd`](files/week_01.Rmd) and overwrite the version you just created.
If RStudio prints a message that some packages are required but are not installed, click `Install`.
![](images/week01/2021-01-07_23_26_55-csde502_winter_2021-RStudiorappbroker.csde.washington.edu.png)
Change line 3 to include your name and e-mail address as shown.
![](images/week01/2021-01-08_00_35_18-Window.png)
#### Render and view the enhanced output
Repeat the rendering process (`rmarkdown::render("week_01.Rmd")`).
The new HTML file has a number of enhancements, including a mailto: hyperlink for your name, a table of contents at the upper left, a table that is easier to read, a Leaflet map, captions and cross-references for the figures and table, an image derived from a PNG file referenced by a URL, the code used to generate various parts of the document that are produced by R code, and the complete source code for the document. A downloadable version of the rendered file: [week_01.html](files/week_01.html).
![](images/week01/2021-01-08_00_19_27-Week01.png)
Including the source code for the document is especially useful for readers of your documents because it lets them see exactly what you did. An entire research chain can be documented in this way, from reading in raw data, performing data cleaning and analysis, and generating results.
## R data types {#rdatatypes}
You may want to download the file [week01.R](files/week01/week01.R), which contains many of the examples below.
There are six fundamental data types in R:
1. logical
1. numeric
1. integer
1. complex
1. character
1. raw
Every atomic object in R has one of these data types, described below. An atomic object can hold a value; it can be `NA`, which represents an observation with no data (e.g., a missing measurement) but still carries a data type; or it can be `NULL`, which is not really a value at all but an empty object of length zero.
You will encounter other data types, such as `Date` or `POSIXct` if you are working with dates or time stamps. These other data types are extensions of the fundamental data types.
To determine what data type an object is, use `is(obj)`, `str(obj)`, or `class(obj)`.
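```{r}
# is() lists the class of an object and the classes it extends
print(is("a"))
# str() prints its own summary and returns NULL, so it is not wrapped in print()
str(TRUE)
print(class(123.45))
print(class(as.integer(1000)))
# a value far too large for an integer is still a valid numeric
n <- as.numeric(999999999999999999999)
print(class(n))
```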
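The difference between `NA` and `NULL` described above can be sketched with a few base R calls. This is a minimal illustration rather than a full treatment; the object name `ex_ages` is made up for this example.

```{r}
# NA is a placeholder for a missing value; it occupies a position in a vector
ex_ages <- c(39, NA, 52)
length(ex_ages)
is.na(ex_ages)
# most summaries return NA unless missing values are explicitly removed
mean(ex_ages)
mean(ex_ages, na.rm = TRUE)
# NA takes on the data type of the vector that contains it
class(NA)
class(c(1.5, NA))
# NULL is an empty object; it has length zero and is dropped by c()
length(NULL)
length(c(39, NULL, 52))
class(NULL)
```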
### Logical
Use `logical` values for characteristics that are either `TRUE` or `FALSE`. Note that `logical` elements can also have an `NA` value if the observation is missing, as in the following examples.
```{r}
# evaluate as logical, test whether 1 is greater than 2
(a <- 1 > 2)
```
```{r}
# create two numerical values, one being NA, representing ages
age_john <- 39
age_jane <- NA
# logical NA from Jane's undefined age
(jo <- age_john > 50)
(ja <- age_jane > 50)
```
Logical values are often expressed in binary format as 0 = `FALSE` and 1 = `TRUE`. In R these values are interconvertible. Other software (e.g., Excel, MS Access) may convert logical values to numbers that you do not expect.
```{r}
(t <- as.logical(1))
(f <- as.logical(0))
```
### Numeric
`Numeric` values are numbers, which can include decimals, with magnitudes ranging from about 2e-308 to 2e+308, depending on the computer you are using. You can see the possible range by entering `.Machine` at the R console. For more information, see [Double-precision floating-point format](https://en.wikipedia.org/wiki/Double-precision_floating-point_format).
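As a rough sketch of what those limits mean in practice, the following looks up two entries of `.Machine` and shows two consequences of double-precision storage. The object name `ex_big` is only for illustration.

```{r}
# largest and smallest positive normalized double-precision values
.Machine$double.xmax
.Machine$double.xmin
# exceeding the largest representable value gives Inf rather than an error
ex_big <- .Machine$double.xmax * 10
ex_big
is.infinite(ex_big)
# decimal fractions are stored approximately, so test equality with a tolerance
0.1 + 0.2 == 0.3
isTRUE(all.equal(0.1 + 0.2, 0.3))
```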
### Integer
`Integer` values are numerical, but can only take on whole, rather than fractional, values, and have a truncated range compared to `numeric`. For example, see below what happens if we try to create an integer that is out of range. The object we created is an integer, but because it is out of range, its value is set to `NA`.
```{r}
i <- as.integer(999999999999999999999)
print(class(i))
```
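To make the integer range concrete, here is a short sketch using `.Machine$integer.max`. The name `ex_over` is hypothetical, and `suppressWarnings()` is used only so that the overflow warning does not clutter the output.

```{r}
# the largest integer R can store (2^31 - 1)
.Machine$integer.max
# the L suffix creates an integer; without it the value is numeric
class(5L)
class(5)
# as.integer() truncates toward zero rather than rounding
as.integer(2.9)
as.integer(-2.9)
# going past the maximum gives NA (normally with a warning)
ex_over <- suppressWarnings(.Machine$integer.max + 1L)
is.na(ex_over)
```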
### Complex
The `complex` type is used in mathematics and you are unlikely to use it in applied social science research unless you get into some heavy statistics. See [Complex number](https://en.wikipedia.org/wiki/Complex_number) for a full treatment.
### Character
`Character` data include the full set of keys on your keyboard that print out a character, typically [A-Z], [a-z], [0-9], punctuation, etc. Characters beyond the basic ASCII set are also supported, e.g., the _accent aigu_ in Café:
```{r}
print(class("Café"))
```
Also numbers can function as characters. Be careful in converting between numerical and character versions. For example, see these ZIP codes:
```{r error=TRUE}
# this is a character
my_zip <- "98115"
# it is not numeric.
my_zip + 2
```
```{r}
# we can convert it to numeric, although it would be silly to do with ZIP codes, which are nominal values
as.numeric(my_zip) + 2
# Boston has ZIP codes starting with zeros
boston_zip <- "02134"
as.numeric(boston_zip)
```
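If a ZIP code has already been converted to a number, the leading zero can be restored when converting back to character. One possible approach, sketched here with hypothetical object names, is to pad with `sprintf()`:

```{r}
# the numeric version of a Boston ZIP code has lost its leading zero
ex_zip_num <- as.numeric("02134")
ex_zip_num
# as.character() alone does not bring the zero back
as.character(ex_zip_num)
# sprintf() pads the integer to five digits with leading zeros
ex_zip_chr <- sprintf("%05d", as.integer(ex_zip_num))
ex_zip_chr
```

Storing ZIP codes as character from the start avoids the problem entirely.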
### Raw
`Raw` values are used to store raw bytes in hexadecimal format. You are unlikely to use them in applied social science research. For example, the hexadecimal value for the character `z` is `7a`:
```{r}
print(charToRaw("z"))
class(charToRaw("z"))
```
## R data structures {#rdatastructures}
![](images/week02/data_structures.png)
There are 5 basic data structures in R, as shown in the graphic:
1. vector
1. matrix
1. array
1. list
1. data frame
In addition, the `factor` data type is very important.
### Vector
A vector is an ordered set of one or more elements of the same data type. Vectors with several elements are created using the `c()` constructor function, but even a single value is a vector:
```{r}
# create a vector of length 1
a <- 1
is(a)
```
If you try creating a vector with mixed data types, you may get unexpected results; mixing character elements with other type elements will result in character representations, e.g.,
```{r}
c(1, "a", TRUE, charToRaw("z"))
```
Results will depend on the data types you are mixing. For example, because logical values can be expressed numerically, the `TRUE` and `FALSE` values are converted to `1` and `0`, respectively.
```{r}
(c(1:3, TRUE, FALSE))
```
But if a character is added, all elements are converted to characters.
```{r}
c(1:3, TRUE, FALSE, "awesome!")
```
Order is important, i.e.,
`1, 2, 3` is not the same as `1, 3, 2`
R will maintain the order of elements in vectors unless a process is initiated that changes the order of those elements:
```{r}
# a vector
(v <- c(1, 3, 2))
(sort(v))
```
You can get some information about vectors, such as length and data type:
```{r}
# create a vector of 1000 random normal values
set.seed(5)
normvec1000 <- rnorm(n = 1000)
length(normvec1000)
class(normvec1000)
class(normvec1000 > 1)
```
Elements of vectors are specified with their index number (1 .. n):
```{r}
v <- seq(from = 0, to = 10, by = 2)
v[4]
```
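Index numbers can do more than pick out a single element. The following sketch, using a copy of the same sequence under the hypothetical name `ex_v`, shows a few common indexing patterns:

```{r}
ex_v <- seq(from = 0, to = 10, by = 2)
# several elements at once, by position
ex_v[c(1, 3)]
# a range of positions
ex_v[2:4]
# a negative index drops that position
ex_v[-1]
# a logical test keeps the elements for which the test is TRUE
ex_v[ex_v > 5]
# which() returns the positions that meet the test
which(ex_v > 5)
```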
### Matrix
A matrix is like a vector, in that it can contain only one data type, but it is two-dimensional, having rows and columns. A simple example:
```{r}
# make a vector 1 to 100
(v <- 1:100)
# load to a matrix
(m1 <- matrix(v, ncol = 10, byrow = TRUE))
# different r, c ordering
(m2 <- matrix(v, ncol = 10, byrow = FALSE))
```
If you try to force a vector into a matrix whose row $\times$ col length does not match the length of the vector, the elements will be recycled, which may not be what you want. At least R will give you a warning.
```{r}
(m3 <- matrix(letters, ncol = 10, nrow = 10))
```
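For comparison, here is a small sketch of a matrix whose dimensions match the length of the input, along with row and column indexing. The name `ex_m` is made up for this example.

```{r}
# 12 elements fit a 3 x 4 matrix exactly, so nothing is recycled
ex_m <- matrix(1:12, nrow = 3, ncol = 4)
dim(ex_m)
# one element: row 2, column 3
ex_m[2, 3]
# leave an index blank to get a whole row or column
ex_m[2, ]
ex_m[, 3]
# compare the vector length with rows x columns before filling a matrix
length(letters) == 10 * 10
```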
### Array
An array is similar to a matrix, but it can have more than two dimensions. These can be useful for analyzing time series data or other multidimensional data. We will not be using array data in this course, but a simple example of creating and viewing the contents of an array:
```{r}
# a vector 1 to 27
v <- 1:27
# create an array, 3 x 3 x 3
(a <- array(v, dim = c(3, 3, 3)))
# array index is r, c, m (row, column, matrix), e.g., row 1 column 2 matrix 3:
(a[1,2,3])
```
### List
R lists are ordered collections of objects that do not need to be of the same data type. Those objects can be single-value vectors, multiple-value vectors, matrices, data frames, other lists, etc. Because of this, lists are a very flexible data structure. But because they can have as little or as much structure as you want, they can become difficult to manage and analyze.
Here is an example of a list composed of single-value vectors of different data types. Compare this with the attempt to make a vector composed of elements of different data types:
```{r}
(l <- list("a", 1, TRUE))
```
Let's modify that list a bit:
```{r}
(l <- list("a",
1:20,
as.logical(c(0,1,1,0))))
```
The top-level indexing for a list is denoted using two sets of square brackets. For example, the first element of our list can be accessed by `l[[1]]`, and the mean of element 2 is obtained by `mean(l[[2]])`: ``r mean(l[[2]])``.
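The difference between single and double square brackets is a common source of confusion, so here is a brief sketch. It rebuilds the same list under the hypothetical name `ex_list` and adds a named list, `ex_named`, which is not part of the examples above.

```{r}
ex_list <- list("a", 1:20, as.logical(c(0, 1, 1, 0)))
# single brackets return a list (here, a list of length 1)
class(ex_list[2])
# double brackets return the element itself
class(ex_list[[2]])
mean(ex_list[[2]])
# an index can follow the double brackets to reach inside an element
ex_list[[2]][5]
# named elements can also be reached with $
ex_named <- list(id = "a", scores = 1:20)
ex_named$scores[1:3]
```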
To perform operations on all elements of a list, use `lapply()`:
```{r}
# show the data types
(lapply(X = l, FUN = class))
# mean, maybe?
(lapply(X = l, FUN = function(x) {mean(x)}))
```
### Factor
Factors are similar to vectors, in that they are one-dimensional ordered sets. However, factors also use informational labels. For example, you may have a variable with household income as a text value:
* "<$10,000"
* "$10,000-$49,999"
* "$50,000-$99,999"
* "$100,000-$200,000"
* ">$200,000"
As a vector:
```{r}
(income <- c("<$10,000"
, "$10,000-$49,999"
, "$50,000-$99,999"
, "$100,000-$200,000"
, ">$200,000"))
```
Because these are characters, they do not sort in proper numeric order:
```{r}
sort(income)
```
If these are treated as a factor, the levels can be set for proper ordering:
```{r}
# create a factor from income and set the levels
(income_factor <- factor(x = income, levels = income))
# sort again
(sort(income_factor))
```
As a factor, the data can also be used in statistical models and the magnitude of the variable will also be correctly ordered.
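To see what a factor stores, the sketch below builds a factor from a set of hypothetical survey responses (`ex_resp`). Note that it adds `ordered = TRUE`, which the example above does not use; that argument is what allows comparisons between levels.

```{r}
ex_levels <- c("<$10,000", "$10,000-$49,999", "$50,000-$99,999",
               "$100,000-$200,000", ">$200,000")
# hypothetical responses from six people
ex_resp <- c("$50,000-$99,999", "<$10,000", ">$200,000",
             "<$10,000", "$10,000-$49,999", "$50,000-$99,999")
ex_inc <- factor(x = ex_resp, levels = ex_levels, ordered = TRUE)
# the levels keep the order in which they were supplied
levels(ex_inc)
# internally each value is stored as the integer position of its level
as.integer(ex_inc)
# table() reports counts in level order, including levels with no observations
table(ex_inc)
# an ordered factor supports comparisons between levels
ex_inc > "$10,000-$49,999"
```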
### Data frame
Other than vectors, data frames are probably the most used data structure in R. You can think of data frames as matrices that allow columns with different data types. For example, you might have a data set that represents subject IDs as characters, sex or gender as text, height, weight, and age as numerical values, income as a factor, and smoking status as logical. Because a matrix requires only one data type, it would not be possible to store all of these as a matrix. An example:
```{r}
# income levels
inc <- c("<$10,000"
, "$10,000-$49,999"
, "$50,000-$99,999"
, "$100,000-$200,000"
, ">$200,000")
BMI <- data.frame(
sid = c("A1001", "A1002", "B1001"),
gender = c("Male", "Male","Female"),
height_cm = c(152, 171.5, 165),
weight_kg = c(81, 93, 78),
age_y = c(42, 38, 26),
income = factor(c("$50,000-$99,999", "$100,000-$200,000", "<$10,000"), levels = inc)
)
print(BMI)
```
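A few base R operations are useful for inspecting and extending a data frame. The sketch below uses a separate, hypothetical data frame named `ex_df` so that the `BMI` object above is left unchanged; the calculated `bmi` column is added only for illustration.

```{r}
# a small hypothetical data frame, separate from BMI above
ex_df <- data.frame(
    sid = c("A1001", "A1002", "B1001"),
    height_cm = c(152, 171.5, 165),
    weight_kg = c(81, 93, 78)
)
# one line per column: name, data type, first values
str(ex_df)
# a single column is a vector
ex_df$height_cm
class(ex_df$height_cm)
# rows and columns are indexed like a matrix: [row, column]
ex_df[2, "weight_kg"]
# a new column calculated from existing columns (kg / m^2)
ex_df$bmi <- ex_df$weight_kg / (ex_df$height_cm / 100)^2
round(ex_df$bmi, 1)
```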
## File systems {#filesystems}
Although a full treatment of effective uses of file systems is beyond the scope of this course, a few basic rules are worth covering:
1. Never use spaces in folder or file names.
Ninety-nine and 44/100ths percent of the time, most modern software will have no problems handling file names with spaces. But that 0.56% of the time when software chokes, you may wonder why your processes are failing. If your directory and file names do not have spaces, then you can at least rule that out!
1. Use lowercase letters in directory and file names.
In the olden days (MS-DOS), there was no case sensitivity in file names. UNIX has always used case-sensitive file names. So
`MyGloriousPhDDissertation.tex` and `mygloriousphddissertation.tex` could actually be different files. Macs, being based on a UNIX kernel, can also employ case sensitivity in file names, although the default macOS file system is case-insensitive. But Windows? No. Consider the following: there cannot be both `foo.txt` and `FOO.txt` in the same directory.
![](images/week01/2021-01-08_01_13_50-CommandPrompt.png)
So if Windows doesn't care, why should we? Save yourself some keyboarding time and confusion by using only lowercase characters in your file names.
1. Include dates in your file names.
If you expect to have multiple files that are sequential versions of a file in progress, an alternative to using a version control system such as [git](https://git-scm.com/), particularly for binary files such as Word documents or SAS data files, is to keep multiple versions of the files and include the date as part of the file name. If you expect to have multiple versions on the same date, include a lowercase alphabetical character; it is improbable that you would have more than 26 versions of a file on a single calendar date. If you are paranoid, use a suffix number `0000`, `0001` .. `9999`. If you have ten thousand versions of the same file on a given date, you are probably doing something that is not right.
Now that you are convinced that including dates in file names is a good idea, _please_ use the format `yyyy-mm-dd` or `yyyymmdd`. If you do so, your file names will sort in temporal order.
1. Make use of directories!
Although a folder containing 100,000 files can be handled programmatically (if file naming conventions are used), it is not possible for a human being to visually scan 100,000 file names. If you have a lot of files for your project, consider creating directories, e.g.,
- raw_data
- processed_data
- analysis_results
- scripts
- manuscript
1. Agonize over file names.
Optimally when you look at your file names, you will be able to know something about the content of the file. We spend a lot of time doing analysis and creating output. Spending an extra minute thinking about good file names is time well spent.
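Some of the rules above can be sketched in R. The file names here are hypothetical, and the directories are created under R's temporary folder (`tempdir()`) so that nothing is written to your project; in practice you would create them in your project directory.

```{r}
# hypothetical file names with a date prefix in yyyy-mm-dd format
ex_iso <- c("2022-01-15_analysis.R", "2021-12-31_analysis.R", "2022-01-02_analysis.R")
sort(ex_iso)
# the same dates written as mm-dd-yyyy do not sort in temporal order
ex_mdy <- c("01-15-2022_analysis.R", "12-31-2021_analysis.R", "01-02-2022_analysis.R")
sort(ex_mdy)
# build a dated file name from today's date
ex_fname <- paste0(format(Sys.Date(), "%Y-%m-%d"), "_analysis_results.csv")
ex_fname
# create project subdirectories under a temporary folder
ex_dirs <- file.path(tempdir(), c("raw_data", "processed_data", "scripts"))
for (ex_d in ex_dirs) dir.create(ex_d, showWarnings = FALSE, recursive = TRUE)
basename(ex_dirs[dir.exists(ex_dirs)])
```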
## Data manipulation in the `tidyverse` {#tidyverse}
One of the R packages we will use frequently is [`tidyverse`](https://www.tidyverse.org/packages/), which is itself a collection of several other packages, each with a specific domain:
* `ggplot2` (graphics)
* `dplyr` (data manipulation)
* `tidyr` (reformatting data for efficient processing)
* `readr` (reading rectangular R x C data)
* `purrr` (functional programming, e.g., to replace `for()` loops)
* `tibble` (enhanced data frames)
* `stringr` (string, i.e., text manipulation)
* `forcats` (handling factor, i.e., categorical variables)
We will touch on some of these during this course, but there will not be a full review or treatment of the `tidyverse`.
This section will introduce some of the main workhorse functions in tidy data handling.
Installing tidyverse is straightforward but it may take some time to download and install all of the packages. If you have not done so yet, use
```
install.packages("tidyverse")
```
For today's lesson we will be using one of the Add Health public use data sets, [AHwave1_v1.dta](data/AHwave1_v1.dta).
```{r warning=FALSE, message=FALSE}
# load pacman if necessary
package.check <- lapply("pacman", FUN = function(x) {
if (!require(x, character.only = TRUE)) {
install.packages(x, dependencies = TRUE)
library(x, character.only = TRUE)
}
})
# load readstata13 if necessary
pacman::p_load(readstata13)
# read the dta file
dat <- readstata13::read.dta13(file.path(myurl, "data/AHwave1_v1.dta"))
```
The data set includes variable labels, which make handling the data easier. Here we print the column names and their labels. Wrapping this in `DT::datatable()` presents a nice interface that shows only a few variables at a time and allows sorting and searching.
```{r}
x <- data.frame(colname = names(dat), label = attributes(dat)$var.labels)
DT::datatable(data = x, caption = "Column names and labels in AHwave1_v1.dta.")
```
### magrittr {#magrittr}
![](images/week02/unepipe.jpeg)
The R package [`magrittr`](https://cran.r-project.org/web/packages/magrittr/index.html) allows the use of "pipes". In UNIX, pipes are used to take the output of one program and feed it as input to another program. For example, the UNIX command `cat` prints the contents of a text file. This would print the contents of the file `00README.txt`:
```cat 00README.txt```
but with large files, the entire contents would scroll by too fast to read. Using a "pipe", denoted with the vertical bar character `|` allowed using the `more` command to print one screen at a time by tapping the `Enter` key for each screen full of text:
```cat 00README.txt | more```
As shown in these two screen captures:
![](images/week02/cat_more.png)
![](images/week02/cat_more2.png)
The two main pipe operators we will use in `magrittr` are `%>%` and `%<>%`.
`%>%` is the pipe operator, which functions as a UNIX pipe, that is, to take something on the left hand side of the operator and feed it to the right hand side.
`%<>%` is the assignment pipe operator, which takes something on the left hand side of the operator, feeds it to the right hand side, and replaces the object on the left-hand side.
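The difference between the two operators can be sketched with a short numeric vector. The object names `ex_x` and `ex_y` are made up for this illustration.

```{r}
library(magrittr)
ex_x <- c(9, 4, 16)
# %>% passes ex_x to sqrt(); ex_x itself is unchanged
ex_x %>% sqrt()
ex_x
# the piped result is kept only if it is assigned
ex_y <- ex_x %>% sqrt()
# %<>% passes ex_x to sqrt() and overwrites ex_x with the result
ex_x %<>% sqrt()
ex_x
```

Because `%<>%` overwrites its input, it is worth using with some care; rerunning the same line changes the object again.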
For a simple example of the pipe, to list only the first 6 lines of a data frame in base R, we use `head()`, e.g.,
```{r}
head(iris)
```
Using a tidy version of this:
```{r}
iris %>% head()
```
In the R base version, we first read `head`, so we know we will be printing the first 6 elements of something, but we don't know what that "something" is. We have to read ahead to know we are reading the first 6 records of `iris`. In the tidy version, we start by knowing we are doing something to the data set, after which we know we are printing the first 6 records.
In base R functions, the process is evaluated from the inside out. For example, to get the mean sepal length of the _setosa_ species in iris, we would do this:
```{r}
mean(iris[iris$Species == 'setosa', "Sepal.Length"])
```
From the inside out, we read that we are making a subset of `iris` where Species = "setosa", we are selecting the column "Sepal.Length", and taking the mean. However, it requires reading from the inside out. For a large set of nested functions, we would have `y <- f(g(h(i(x))))`, which would require first evaluating the innermost function (`i()`) and then working outward.
In a tidy approach this would be more like `y <- x %>% i() %>% h() %>% g() %>% f()` because the first function applied to the data set `x` is `i()`. Revisiting the example of the mean sepal length of _setosa_ irises, under a tidy approach we would do this:
```{r}
iris %>% filter(Species == 'setosa') %>% summarise(mean(Sepal.Length))
```
Which, read from left to right, translates to "using the iris data frame, make a subset of records where species is _setosa_, and summarize those records to get the mean value of sepal length." The tidy version is intended to be easier to write, read, and understand. The command uses the `filter()` function, which will be described below.
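The equivalence of the nested and piped forms can be checked with ordinary functions. In this sketch, `sqrt()`, `sum()`, and `round()` stand in for the generic `i()`, `h()`, and `g()`, and the object names are hypothetical.

```{r}
library(magrittr)
ex_vals <- c(1, 4, 9, 16)
# nested: read from the inside out (sqrt, then sum, then round)
ex_nested <- round(sum(sqrt(ex_vals)), digits = 1)
# piped: read from left to right in the order the steps are applied
ex_piped <- ex_vals %>% sqrt() %>% sum() %>% round(digits = 1)
ex_nested
identical(ex_nested, ex_piped)
```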
### Data subsetting (dplyr)
`dplyr` is the tidyverse R package used most frequently for data manipulation. Selection of records (i.e., subsetting) is done using logical tests to determine what is in the selected set. First we will look at logical tests and then we will cover subsetting rows and columns from data frames.
#### Logical tests
If elements meet a logical test, they will end up in the selected set. If data frame records have values in variables that meet logical criteria, the records will be selected.
Some logical tests are shown below.
###### `==`: equals
```{r}
# numeric tests
(1 == 2)
```
```{r}
(1 == 3 - 2)
```
```{r}
# character test (actually a factor)
(dat$imonth %>% head() %>% str_c(collapse = ", "))
((dat$imonth == "(6) June") %>% head())
```
```{r}
# character test for multiple patterns
(dat$imonth %in% c("(6) June", "(7) July") %>% head())
```
###### `>`, `>=`, `<`, `<=`: numeric comparisons
```{r}
1 < 2
```
```{r}
1 > 2
```
```{r}
1 <= -10:10
```
```{r}
1 >= -10:10
```
###### `!=`: not equals
```{r}
1 != 2
```
```{r}
# those of the first 6 days that are not 14
(dat$iday %>% head())
((dat$iday != 14) %>% head())
```
###### `!`: invert, or "not"
Sometimes it is more convenient to negate a single condition rather than enumerating all possible matching conditions.
```{r}
dat$imonth %>% head(20)
((!dat$imonth %in% c("(6) June", "(7) July")) %>% head(20))
```
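The same pattern can be tried on a short vector where the result is easy to verify by eye. This is a sketch with made-up values in `ex_months`; it relies on R evaluating `%in%` before `!`, so the negation applies to the whole membership test.

```{r}
ex_months <- c("(5) May", "(6) June", "(7) July", "(8) August")
# membership test
ex_months %in% c("(6) June", "(7) July")
# negated membership test; equivalent to !(ex_months %in% ...)
!ex_months %in% c("(6) June", "(7) July")
# use the logical vector to keep the elements that are not June or July
ex_not_summer <- ex_months[!ex_months %in% c("(6) June", "(7) July")]
ex_not_summer
```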
#### Subset rows (`filter()`)
The `filter()` function creates a subset of records based on a logical test. Logical tests can be combined as "and" statements using the `&` operator and "or" statements using the `|` operator. Here we will perform a few filters on a subset of the data.
```{r}
# first 20 records, first 10 columns
dat_sub <- dat[1:20, 1:10]
kable(dat_sub, format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
```
Records from one month:
```{r}
# from May
(dat_sub %>% filter(imonth == "(5) May"))
```
Records from one month from females:
```{r}
(dat_sub %>% filter(imonth == "(5) May" & bio_sex == "(2) Female"))
```
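Records from females in one month, or from any record where the day of the month was before the 15th (which will probably include some males). Because `&` is evaluated before `|`, the parentheses here only make that grouping explicit:
```{r}
(dat_sub %>% filter((imonth == "(5) May" & bio_sex == "(2) Female") | iday < 15))
```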
```
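The placement of parentheses matters when `&` and `|` are mixed. The following sketch uses a made-up tibble (`toy_filter` and its column names are assumptions for illustration only) to show how two groupings of the same three conditions select different rows.

```{r}
library(dplyr)

# hypothetical toy data, not taken from dat
toy_filter <- tibble(
  id    = 1:4,
  month = c("May", "May", "June", "June"),
  sex   = c("Female", "Male", "Female", "Male"),
  day   = c(20, 20, 10, 20)
)

# (May AND Female) OR day < 15: & is evaluated before |
and_first <- toy_filter %>%
  filter(month == "May" & sex == "Female" | day < 15)
and_first$id

# May AND (Female OR day < 15): parentheses change the grouping
or_first <- toy_filter %>%
  filter(month == "May" & (sex == "Female" | day < 15))
or_first$id
```

The first filter keeps rows 1 and 3, while the second keeps only row 1, so it is worth adding parentheses whenever the intended grouping might be misread.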
Although these examples are silly and trivial, they show how `filter()` is used to create a selected set of data.
#### Subset columns (`select()`)
A subset of columns can be extracted from a data frame using the `select()` function, most simply by naming the columns to keep.
```{r}
# select 3 columns
(dat_sub_sel <- dat_sub %>%
select("aid", "imonth", "iday"))
```
```{r}
# select all but two named columns
(dat_sub_sel <- dat_sub %>%
select(-"imonth", -"iday"))
```
```{r}
# select columns by position and whose name matches a pattern, in this case the regular expression "^i" meaning "starts with lowercase i"
(dat_sub_sel <- dat_sub %>%
select(1, matches("^i")))
```
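Besides `matches()`, `select()` accepts other selection helpers. This is a brief sketch on a made-up tibble (`toy_cols` is hypothetical and only mimics a few column names from `dat`), showing `starts_with()`, `where()`, and a range of adjacent columns.

```{r}
library(dplyr)

# hypothetical toy data that mimics a few columns of dat
toy_cols <- tibble(
  aid     = c("a1", "a2"),
  imonth  = c("(5) May", "(6) June"),
  iday    = c(3L, 20L),
  bio_sex = c("(1) Male", "(2) Female")
)

# columns whose names start with "i"
toy_i <- toy_cols %>% select(starts_with("i"))
names(toy_i)

# columns selected by type rather than by name
toy_chr <- toy_cols %>% select(where(is.character))
names(toy_chr)

# a range of adjacent columns, from aid through iday
toy_range <- toy_cols %>% select(aid:iday)
names(toy_range)
```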
`select()` can also be used to rename columns:
```{r}
# select three columns, renaming two of them
(dat_sub_sel %>%
select(aid, Month = imonth, Day = iday))
```
Or column renaming can be done with `rename()`, which maintains all input data and only changes the named columns:
```{r}
(dat_sub_sel %>%
rename(Month = imonth, Day = iday))
```
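The difference between renaming with `select()` and with `rename()` is easiest to see from the column names that come back. This sketch uses a made-up tibble (`toy_names` is hypothetical).

```{r}
library(dplyr)

# hypothetical toy data
toy_names <- tibble(
  aid    = 1:2,
  imonth = c("(5) May", "(6) June"),
  iday   = c(3L, 20L)
)

# select() keeps only the columns that are listed
sel_names <- names(select(toy_names, Month = imonth))
sel_names

# rename() keeps every column and changes only the listed names
ren_names <- names(rename(toy_names, Month = imonth))
ren_names
```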
#### Subset rows and columns: `filter()` and `select()`
We can combine `filter()` and `select()` with a pipe to create a new data frame with a subset of rows and columns:
```{r}
# records with day of month > 15 and the first 3 named columns
(x <- dat_sub %>%
filter(iday > 15) %>%
select(aid, imonth, iday)
)
```
#### Create or calculate columns: `mutate()`
`mutate()` will create new named columns or re-calculate existing columns. Here we will make a column that stratifies birth month, with the cut at June.
Although the birth month column (`h1gi1m`) is a factor, it is unordered, so we need to make it ordered before using a factor label in a `<` or `>` comparison. Fortunately, the factor levels are already stored in calendar order:
```{r}
# is this ordered?
is.ordered(dat$h1gi1m)
```
```{r}
# what are the levels?
(levels(dat$h1gi1m))
```
Assign order, create a new column, and print nicely:
```{r}
# make birth month ordered
dat$h1gi1m <- factor(dat$h1gi1m, ordered = TRUE)
# now is it ordered?
is.ordered(dat$h1gi1m)
```
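Why the ordering step is needed can be sketched with a made-up factor (`toy_month` is hypothetical and uses only three of the month labels). Comparing an unordered factor with `<` returns `NA` with a warning, whereas the ordered version compares by level position.

```{r}
# hypothetical toy factor with levels given explicitly in calendar order
toy_month <- factor(
  c("(1) January", "(7) July", "(12) December"),
  levels = c("(1) January", "(7) July", "(12) December")
)

# unordered factor: "<" is not meaningful, so the result is all NA
unordered_result <- suppressWarnings(toy_month < "(7) July")
unordered_result

# ordered factor: the comparison uses the position of each level
toy_month_ord <- factor(toy_month, ordered = TRUE)
ordered_result <- toy_month_ord < "(7) July"
ordered_result
```

Note that the levels are supplied explicitly in this sketch; left to the default, R would sort these labels alphabetically, which would put "(12) December" before "(7) July".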
```{r}
# perform the mutate() using the string representation of the factor for comparison
dat %>%
filter(iday > 15) %>%
select(aid, imonth, iday, birth_month = h1gi1m) %>%
mutate(birth_1st_half = (birth_month < "(7) July")) %>%
head(20) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
```
A silly example, but it shows that `mutate()` can change the values of existing columns:
```{r}
(X <- dat_sub %>%
mutate(iday = -1000 + iday))
```
... so do be careful!
Other functions that can be used with `mutate()` include (but are by no means limited to!):
* `if_else()`: create a column by assigning values based on logical criteria
* `case_when()`: similar to `if_else()` but for multiple conditions
* `recode()`: change particular values
When we stratified the birth month above, the output was a `logical` data type. If we wanted to create a `character` or `factor`, we could use `if_else()`. Here we are creating a new data frame based on several operations on `dat`.
```{r}
dat_1 <- dat %>%
filter(iday > 15) %>%
head(20) %>%
select(aid, imonth, iday, birth_month = h1gi1m) %>%
  mutate(birth_year_half = if_else(condition = birth_month < "(7) July", true = "first", false = "last"))
# make that a factor
dat_1$birth_year_half <- factor(dat_1$birth_year_half, levels = c("first", "last"))
# print
kable(dat_1) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
```
If one of your variables contains multiple values and you want to create classes, use `case_when()`. Here is a verbose example stratifying months into quarters. We also use the `magrittr` assignment pipe (`%<>%`), which pipes the data frame into the statement and assigns the result back to it, i.e., `dat_1` will be overwritten by the output of the commands we use. __Be careful using the assignment pipe because it will change your data frame.__
`case_when()` evaluates its conditions in the order they are written and assigns the value for the first condition that is true, so for months and quarters it is not necessary to specify both ends of each quarter. Any cases that are not explicitly handled can be addressed with a final `TRUE ~ ...` condition; in this case, any records with a birth month after September are assigned to quarter 4.
```{r}
dat_1 %<>%
  mutate(quarter = case_when(
    # January through March
    birth_month <= "(3) March" ~ 1,
    # April through June
    birth_month <= "(6) June" ~ 2,
    # July through September
    birth_month <= "(9) September" ~ 3,
    # everything else
    TRUE ~ 4))
# print
kable(dat_1) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")
```
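Because the first matching condition wins, the order of the conditions changes the result. A minimal sketch with a made-up numeric vector (`toy_x` and the class labels are hypothetical):

```{r}
library(dplyr)

# hypothetical toy values
toy_x <- c(1, 5, 9)

# narrowest condition first: each value falls into its own class
narrow_first <- case_when(
  toy_x < 4 ~ "low",
  toy_x < 7 ~ "mid",
  TRUE      ~ "high"
)
narrow_first

# broad condition first: 1 is caught by "< 7" before "< 4" is ever tested
broad_first <- case_when(
  toy_x < 7 ~ "mid",
  toy_x < 4 ~ "low",
  TRUE      ~ "high"
)
broad_first
```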
`recode()` is used here to create a new column, `birth_year_half_split`, with relabeled values of the `birth_year_half` column:
```{r}
(dat_1 %<>%
mutate(birth_year_half_split = recode(birth_year_half,
"first" = "early",
"last" = "late")))
```
#### Summarizing/aggregating data
We will spend more time later in the course on data summaries, but a brief introduction using `dplyr` is worthwhile at this stage. The two main functions are `summarise()` and `group_by()`.
A simple summary will tabulate the count of respondents and the mean age. The filter `! str_detect(h1gi1y, "Refused")` drops records from respondents who refused to give their birth year.
```{r}
dat %>%
filter(! str_detect(h1gi1y, "Refused")) %>%
mutate(yeari = str_replace_all(iyear, ".* ", "") %>% as.integer(),
yearb = str_replace_all(h1gi1y, ".* ", "") %>% as.integer()) %>%
summarise(n = n(),
mean_age = mean(yeari - yearb))
```
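The pattern `".* "` in `str_replace_all()` is greedy: it matches everything up to and including the last space, so only the trailing token is left. The sketch below assumes labels of the form `"(95) 1995"` and `"(96) Refused"`; these strings are made up to resemble the factor labels and may not match the real data exactly.

```{r}
library(stringr)

# hypothetical labels, assumed to resemble the year columns
toy_year <- c("(95) 1995", "(77) 1977")

# drop everything up to the last space, then convert to integer
toy_year_int <- as.integer(str_replace_all(toy_year, ".* ", ""))
toy_year_int

# a non-numeric label cannot be converted and becomes NA
toy_refused <- suppressWarnings(
  as.integer(str_replace_all("(96) Refused", ".* ", ""))
)
toy_refused
```

Under these assumptions, filtering out the "Refused" records before the conversion avoids `NA` values that would otherwise make `mean()` return `NA`.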
Here we will summarize age by sex using the `group_by()` function, then pipe the result to `mutate()` with `prop.table()` to get the percentage of respondents in each group:
```{r}
dat %>%
filter(! str_detect(h1gi1y, "Refused")) %>%
mutate(yeari = str_replace_all(iyear, ".* ", "") %>% as.integer(),
yearb = str_replace_all(h1gi1y, ".* ", "") %>% as.integer()) %>%
group_by(bio_sex) %>%
summarise(mean_age = mean(yeari - yearb),
sd_age = sd(yeari - yearb),
n = n(),
.groups = "drop_last") %>%
mutate(pct = prop.table(n) * 100)
```
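The same pattern can be checked by hand on a made-up tibble (`toy_resp` is hypothetical). With a single grouping variable, `summarise()` returns one row per group, and `prop.table()` then divides each count by the total across those rows.

```{r}
library(dplyr)

# hypothetical toy respondents
toy_resp <- tibble(
  sex = c("F", "F", "M", "M", "M"),
  age = c(15, 17, 16, 18, 14)
)

toy_summary <- toy_resp %>%
  group_by(sex) %>%
  summarise(mean_age = mean(age),
            n = n(),
            .groups = "drop") %>%
  # each group's share of all respondents, as a percentage
  mutate(pct = prop.table(n) * 100)
toy_summary
```

Here both groups have a mean age of 16, and the percentages (40 and 60) sum to 100 because the summary table is no longer grouped when `mutate()` runs.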
#### purrr: efficient iterating over elements in vectors and lists
More attention will be paid to `purrr` in the lesson on [functions](#week2).
The workhorse function in `purrr` is `map()`, which applies a function over a list or atomic vector.
A brief example uses the vector `c(9, 16, 25)`; `map()` applies `sqrt()` to each element and returns the results as a list:
```{r}
# apply the sqrt() function to each element of a numeric vector
map(c(9, 16, 25), sqrt)
```
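When a plain vector is more convenient than a list, `purrr` provides typed variants of `map()`. A short sketch using the same input with `map_dbl()`, which returns a numeric vector, and a formula-style anonymous function:

```{r}
library(purrr)

# map_dbl() returns a numeric (double) vector instead of a list
roots <- map_dbl(c(9, 16, 25), sqrt)
roots

# a formula defines an anonymous function; .x is the current element
roots_plus_one <- map_dbl(c(9, 16, 25), ~ sqrt(.x) + 1)
roots_plus_one
```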
Other resources for `purrr`: [Learn to purrr](https://www.rebeccabarter.com/blog/2019-08-19_purrr/), [purrr tutorial](https://jennybc.github.io/purrr-tutorial/)
## Data sets {#datasets001}
### Edward Babushkin's Employee turnover data {#babushkin1}
Some data that will be used in CSDE 533: Edward Babushkin's employee turnover data, explained a bit at [kaggle.com](https://www.kaggle.com/davinwijaya/employee-turnover) and as a [downloadable file](https://github.com/teuschb/hr_data/blob/master/datasets/turnover_babushkin.csv).
Here we will load the data set from a URL:
```{r}
etdata <- read.csv("https://raw.githubusercontent.com/teuschb/hr_data/master/datasets/turnover_babushkin.csv")
```
Just to get a bit of `tidyverse` in at the last minute, let's get mean and standard deviation of job tenure by gender and 10-year age class:
```{r}
# create 10-year age classes
etdata %<>%
mutate(age_decade = plyr::round_any(age, 10, f = ceiling))
# summarize
etdata %>%
# group by gender and age class
group_by(gender, age_decade) %>%
# mean and sd
summarize(mean_tenure_months = mean(tenure) %>% round(1),
sd_tenure_months = sd(tenure) %>% round(1),
.groups = "keep") %>%
# order the output by gender, then by mean tenure
arrange(gender, mean_tenure_months) %>%
# print it nicely
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
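To see what the `age_decade` classes mean, here is a base R sketch of what rounding up to the nearest 10 does, which is my understanding of what `plyr::round_any(age, 10, f = ceiling)` computes. The ages in `toy_age` are made up.

```{r}
# hypothetical ages
toy_age <- c(21, 30, 31, 39, 40)

# divide by the class width, round up, and scale back
toy_decade <- ceiling(toy_age / 10) * 10
toy_decade
```

Each class is therefore labeled by its upper bound: ages 21 through 30 fall in class 30, and ages 31 through 40 fall in class 40.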
<hr>
Rendered at <tt>`r Sys.time()`</tt>
## Source code
File is at `r fnamestr`.
### R code used in this document
```{r ref.label=knitr::all_labels(), echo=TRUE, eval=FALSE}
```
### Complete Rmd code
```{r comment=''}
cat(readLines(fnamepath), sep = '\n')
```