# Week 9 {#week9}
```{r, echo=FALSE, warning=FALSE, message=FALSE}
library(tidyverse)
library(magrittr)
library(knitr)
library(kableExtra)
library(haven)
library(curl)
library(ggplot2)
# URL home
urlhome <- ""
# path to this file name
if (!interactive()) {
fnamepath <- current_input(dir = TRUE)
fnamestr <- paste0(Sys.getenv("COMPUTERNAME"), ": ", fnamepath)
} else {
fnamepath <- ""
}
```
<h2>Topic: Miscellaneous data processing </h2>
This week's lesson will cover a set of miscellaneous data processing topics that can be useful in different situations.
Mostly this is a set of coded examples with explanations.
Download [template.Rmd.txt](files/template.Rmd.txt) as a template for this lesson. Save it in your working folder for the course, renamed to `week_09.Rmd` (with no `.txt` extension). Make any necessary changes to the YAML header.
## Substituting text
### `paste()`, `paste0()`
Pasting text allows you to substitute the values of variables within a text string. This is useful, for example, when you are running a long loop over a series of files and want to report which file name and loop iteration you are on.
The function `paste()` combines a set of strings and adds a space between the strings, e.g., combining the first values from the `LETTERS` and the `letters` built-in vectors:
```{r}
paste(LETTERS[1], letters[1])
```
whereas `paste0` does not add spaces:
```{r}
paste0(LETTERS[1], letters[1])
```
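Beyond the default single space, `paste()` takes a `sep` argument for the separator and a `collapse` argument that joins the elements of a vector into one string. A brief sketch, with made-up values chosen only for illustration:

```{r}
# use a custom separator instead of the default space
(s_sep <- paste("week", 9, sep = "_"))
# paste0() is vectorized: one output string per element
(s_vec <- paste0("file_", 1:3, ".csv"))
# collapse a vector into a single string
(s_collapse <- paste(letters[1:3], collapse = ", "))
```

The vectorized form is often enough to build a whole set of file names without a loop.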
The code below uses the function `tempdir()` to specify a folder that is automatically generated per R session; for this rendering of the book, the location was ``r tempdir()`` but will almost certainly be different for your session. The code downloads and unzips the file [quickfox](files/quickfox.zip) to the `tempdir()` location. The zip file contains a separate file for each word in the phrase "the quick brown fox jumps over the lazy dog". The code then uses a loop and `paste0()` to show the contents of each separate file along with its file name.
```{r}
library(curl)
# zip file
zipfile <- file.path(tempdir(), "quickfox.zip")
# download
curl_download(url = "http://staff.washington.edu/phurvitz/csde502_winter_2021/files/quickfox.zip", destfile = zipfile)
# unzip
unzip(zipfile = zipfile, overwrite = TRUE, exdir = tempdir())
# files in the zip file
fnames <- unzip(zipfile = file.path(tempdir(), "quickfox.zip"), list = TRUE) %>%
pull(Name) %>%
file.path(tempdir(), .)
# read each file
for (i in seq_len(length(fnames))) {
# the file name with a forward slash
fname <- fnames[i] %>% normalizePath(winslash = "/")
# read the file
mytext <- scan(file = fname, what = "character", quiet = TRUE)
# vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# make a string using `paste0()` and a tab
mystr <- paste0(mytext, "\t", i, " of ", length(fnames), "; file = ", fname)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# print the message
message(mystr)
}
```
### `sprintf()`
`sprintf()` can be used to format text. Here are just a few examples. The result is a formatted text string.
#### Formatting numerical values
<u>Leading zeros</u>
Numeric values can be formatted as character strings with a specific number of decimal places or leading zeros. For example, ZIP codes imported from CSV files often are converted to integers. The following code chunk converts some numerical ZIP code-like values to text values with the correct format.
Bad ZIP codes; the 5-digit numerical values are read in as double-precision numbers, so the leading zeros are dropped:
```{r}
# some numerical ZIP codes
(zip_bad <- data.frame(id = 1:3, zipcode = c(90201, 02134, 00501)))
```
Good ZIP codes:
```{r}
# fix them up
(zip_good <- zip_bad %>%
mutate(
zipcode = sprintf("%05d", zipcode)
))
```
The `sprintf()` format `%05d` formats a whole number (`d`) with a minimum width of 5 characters (`5`), padding on the left with zeros (`0`) when the number has fewer than 5 digits.
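Note that `%05d` expects numeric input; if the ZIP codes have already been read as character strings, `sprintf("%05d", ...)` raises an error. As a sketch of two alternatives (the values are invented for illustration), base R's `formatC()` pads numbers and `stringr::str_pad()` pads strings:

```{r}
library(stringr)
# numeric ZIP-like values, padded with formatC()
zip_num <- c(90201, 2134, 501)
(zip_fc <- formatC(zip_num, width = 5, format = "d", flag = "0"))
# character ZIP-like values, padded with str_pad()
zip_chr <- c("90201", "2134", "501")
(zip_sp <- str_pad(zip_chr, width = 5, side = "left", pad = "0"))
```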
<u>Decimal places</u>
Numerical values with different numbers of decimal places can be rendered with a specific number of decimal places.
```{r}
# numbers with a variety of decimal places
v <- c(1.2, 2.345, 1e+5 + 00005)
# four fixed decimal places
v %>% sprintf("%0.4f", .)
```
Note that this is distinct from `round()`, which results in a numeric vector:
```{r}
# round to 4 places
v %>% round(., 4)
```
<u>Commas or spaces for large numbers</u>
Large numbers in narrative text are easier to read when they are formatted with the thousands separator and decimal mark that the audience expects. For this, `prettyNum()` can be used. Here are a few examples:
```{r}
# a big number
bignum <- pi * 1e+08
# commas
(us_format <- bignum %>%
prettyNum(big.mark = ",", digits = 15)) %>%
paste("US:", .)
# spaces, common Euro format
(euro_format1 <- bignum %>%
prettyNum(big.mark = " ", digits = 15, decimal.mark = ",")) %>%
paste("some Europe:", .)
# other Euro format
(euro_format2 <- bignum %>% prettyNum(big.mark = "'", digits = 15, decimal.mark = ",")) %>%
paste("other Europe:", .)
```
This literal text:
```
For example, used in inline code:
"$\pi$ multiplied by `r format(1e+8, scientific = FALSE)`
equals approximately `r us_format`."
```
appears as:
For example, used in inline code: "$\pi$ multiplied by `r format(1e+8, scientific = FALSE)` equals approximately `r us_format`."
#### String substitutions
`sprintf()` can also be used to achieve the same substitution in the file reading loop above. Each `%s` is substituted in order of the position of the arguments following the string. Also note that `\t` inserts a `TAB` character.
```{r}
# read each file
for (i in seq_len(length(fnames))) {
# the file name
fname <- fnames[i]
# read the file
mytext <- scan(file = fname, what = "character", quiet = TRUE)
# vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# make a string using `sprintf()`
mystr <- sprintf("%s\t%s of %s:\t%s\n", mytext, i, length(fnames), fname)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# print the message
cat(mystr)
}
```
### `str_replace()`, `str_replace_all()`
The `stringr` functions `str_replace()` and `str_replace_all()` can be used to substitute specific strings in other strings. For example, we might create a generic function to run over a set of subject IDs that generates a file for each subject.
```{r}
subjects <- c("a1", "b2", "c3")
f <- function(id) {
# create an output file name by substituting in the subject ID
outfname <- file.path(tempdir(), "xIDx.csv") %>%
str_replace(pattern = "xIDx", id)
# ... do a bunch of stuff, for example
val <- rnorm(1)
# write the file
message(paste0("writing subject ", id, "'s data to ", outfname))
write.csv(x = val, file = outfname)
}
for (i in subjects) {
f(i)
}
```
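The example above uses only `str_replace()`, which substitutes the first match in each string. `str_replace_all()` substitutes every match. A minimal sketch, using a hypothetical file name template that contains the placeholder twice:

```{r}
library(stringr)
# a hypothetical template with the placeholder appearing twice
template <- "xIDx/results_xIDx.csv"
# only the first occurrence is replaced
(first_only <- str_replace(template, pattern = "xIDx", replacement = "a1"))
# every occurrence is replaced
(all_matches <- str_replace_all(template, pattern = "xIDx", replacement = "a1"))
```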
## Showing progress
A text-based progress bar can be shown using `txtProgressBar()`. Here we run the same loop for reading the text files, but rather than printing the loop iteration and file names, we show the progress bar and the file contents. If no other text is printed to the console (unlike what is demonstrated below with `cat()`), the progress bar will stay on a single line rather than printing across several lines.
It is generally not recommended to put a progress bar in an R Markdown document since the output is usually read as a static file. However, progress bars may be helpful in `shiny` R Markdown documents if a process will take longer than a few seconds.
```{r}
n_fnames <- length(fnames)
# create progress bar
pb <- txtProgressBar(min = 0, max = n_fnames, style = 3)
for (i in 1:n_fnames) {
# delay a bit
Sys.sleep(0.1)
# update progress bar
setTxtProgressBar(pb, i)
# read and print from the file
txt <- scan(fnames[i], what = "character", quiet = TRUE)
cat("\n", txt, "\n")
}
close(pb)
```
For other implementations of progress bars, see [progress: Terminal Progress Bars](https://cran.r-project.org/web/packages/progress).
## Turning text into code: `eval(parse(text = "some string"))`
Sometimes you may have variables whose values you want to use in a command or function. For example, suppose you wanted to write a set of files, one for each ZIP code in a data frame, with a file name including the ZIP code. We would not want to use the column name `zipcode`, but we want the actual values in the column.
We can generate a string that represents a command using the same kind of text substitution as above with `sprintf()`. The loop processes each ZIP code record, `pull()`ing the ZIP code value at each iteration. A `write.csv()` command is generated for each iteration, setting the output file name to include the current iteration's ZIP code.
Finally, the last step in each iteration is `eval(parse(text = cmd))` which executes the `cmd` string as a command.
```{r}
verbose = TRUE
for (i in zip_good %>% pull(zipcode)) {
# do some stuff
vals <- rnorm(n = 3)
y <- bind_cols(zipcode = i, v = vals)
# a writing command using sprintf() to substitute %s = ZIP code
cmd <- sprintf("write.csv(x = y, file = file.path(tempdir(), '%s.csv'), row.names = FALSE)", i)
# show what the command is
if(verbose){
cat(cmd, "\n")
}
# this runs the command
eval(parse(text = cmd))
}
```
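Because `eval(parse())` runs whatever text it is given, it can be hard to debug, and for simple cases like this one the same result can often be had by passing the constructed file name straight to the function. The sketch below is an alternative under that assumption, not a replacement for the technique shown above; the ZIP code vector is a hypothetical stand-in for `zip_good$zipcode`.

```{r}
# stand-in ZIP codes (hypothetical; mirrors zip_good$zipcode)
zips <- c("90201", "02134", "00501")
outfiles <- character(0)
for (z in zips) {
  # some values to write
  y <- data.frame(zipcode = z, v = rnorm(n = 3))
  # build the file name directly rather than building a command string
  outfname <- file.path(tempdir(), sprintf("%s.csv", z))
  write.csv(x = y, file = outfname, row.names = FALSE)
  outfiles <- c(outfiles, outfname)
}
# confirm the files were written
file.exists(outfiles)
```

`eval(parse())` remains useful when the command itself, not just one of its arguments, has to be assembled from text.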
## SQL in R with `RSQLite` and `sqldf`
Sometimes R's syntax for processing data can be difficult and confusing. For programmers who are familiar with structured query language (SQL), it is possible to run SQL statements within R using a supported database back end (by default SQLite) and the `sqldf()` function.
For example, the mean sepal length and width by species from the built-in `iris` data set can be obtained with SQL (Table \@ref(tab:iris)) or with `tidyverse` functions (Table \@ref(tab:iris2)).
```{r iris, message=FALSE}
library(sqldf)
library(kableExtra)
sqlc <- '
select
"Species" as species
, avg("Sepal.Length") as mean_sepal_length
, avg("Sepal.Width") as mean_sepal_width
from iris
group by "Species";'
iris_summary <- sqldf(x = sqlc)
iris_summary %>%
kable(caption = "Mean sepal length from the iris data set") %>%
kable_styling(full_width = FALSE, position = "left")
```
This would be equivalent in `tidyverse` as
```{r iris2, message=FALSE}
iris_summary2 <- iris %>%
group_by(Species) %>%
summarise(
mean_sepal_length = mean(Sepal.Length),
mean_sepal_width = mean(Sepal.Width)) %>%
select(species = Species, everything())
iris_summary2 %>%
kable(caption = "Mean sepal length from the iris data set (tidyverse approach)") %>%
kable_styling(full_width = FALSE, position = "left")
```
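Other SQL clauses work the same way. As an illustrative sketch (the threshold of 4.5 is arbitrary), a `where` clause filters rows before grouping and `order by` sorts the result:

```{r}
library(sqldf)
# count flowers with long petals, by species
sqlc2 <- '
select
    "Species" as species
    , count(*) as n
from iris
where "Petal.Length" > 4.5
group by "Species"
order by n desc;'
(long_petals <- sqldf(x = sqlc2))
```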
## Downloading files from password-protected web sites
Some web sites are protected by simple username/password protection. For example, try opening http://staff.washington.edu/phurvitz/csde502_winter_2021/password_protected/foo.csv. The username/password pair is csde/502, which will allow you to see the contents of the web folder.
If you try downloading the file through R, you will get an error because no password is supplied.
```{r}
try(
read.csv("http://staff.washington.edu/phurvitz/csde502_winter_2021/password_protected/foo.csv")
)
```
However, the username and password can be supplied as part of the URL, as below. When the username and password are supplied, they will be cached for that site for the duration of the R session (so if you try running this again, you will succeed without a password).
```{r}
try(
read.csv("http://csde:502@staff.washington.edu/phurvitz/csde502_winter_2021/password_protected/foo.csv")
)
```
## Dates and time stamps: `POSIXct` and `lubridate`
R uses POSIX-style time stamps, which are stored internally as the number of seconds (including fractions of a second) since January 1, 1970 UTC. It is imperative that your control over time stamps is commensurate with the temporal accuracy and precision of your data. For example, in the measurement of years of residence, precision is not substantially important. For measurement of chemical reactions, fractional seconds may be very important. For applications such as merging body-worn sensor data from GPS units and accelerometers for estimating where and when physical activity occurs, minutes of error can result in statistically significant mis-estimations.
For example, you can see the numeric value of these seconds as `options(digits = 22); Sys.time() %>% as.numeric()`.
```{r}
options(digits = 22)
Sys.time() %>% as.numeric()
```
If you have time stamps in text format, they can be converted to POSIX time stamps, e.g., the supposed time Neil Armstrong stepped on the moon:
```{r}
# 10:56 PM US Eastern Daylight Time; %I and %p parse the 12-hour clock and AM/PM
(eagle <- as.POSIXct(x = "7/20/69 10:56 PM", tz = "EST5EDT", format = "%m/%d/%y %I:%M %p"))
```
Formats can be specified using specific codes, see `strptime()`.
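As a brief sketch of a few of those codes (the time stamp is an arbitrary example): `%Y`, `%m`, and `%d` are the four-digit year, the month, and the day; `%H`, `%M`, and `%S` are the 24-hour clock hour, the minutes, and the seconds; `%I` with `%p` gives the 12-hour clock with AM/PM. The same codes are used for parsing text and for formatting output.

```{r}
# parse an ISO-style time stamp
ts <- as.POSIXct(x = "2021-03-04 13:05:09", tz = "UTC", format = "%Y-%m-%d %H:%M:%S")
# format with a 24-hour clock
(ts_24 <- format(ts, "%Y/%m/%d %H:%M"))
# format with a 12-hour clock and AM/PM (the AM/PM text depends on the locale)
(ts_12 <- format(ts, "%I:%M %p"))
```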
The `lubridate` package has a large number of functions for handling date and time stamps. For example, to convert a time stamp in the current time zone to a different time zone, we first get the current time:
```{r, message=FALSE}
library(lubridate)
# set the option for fractional seconds
options(digits.secs = 3)
(now <- Sys.time() %>%
strptime("%Y-%m-%d %H:%M:%OS"))
```
And convert to UTC:
```{r}
# show this at time zone UTC
(with_tz(time = now, tzone = "UTC"))
```
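`with_tz()` changes only how the instant is displayed. Its counterpart `force_tz()` keeps the clock time and changes the underlying instant, which is what you want when a time stamp was recorded with the wrong time zone. A sketch with an arbitrary time stamp:

```{r}
library(lubridate)
# noon UTC on an arbitrary date
x_utc <- ymd_hms("2021-03-04 12:00:00", tz = "UTC")
# same instant, displayed in Pacific time
(x_with <- with_tz(time = x_utc, tzone = "America/Los_Angeles"))
# same clock time, reinterpreted as Pacific time
(x_force <- force_tz(time = x_utc, tzone = "America/Los_Angeles"))
# the forced time is a different instant; the difference is the UTC offset
(offset_hours <- difftime(time1 = x_force, time2 = x_utc, units = "hours"))
```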
or show in a different format:
```{r}
# in different format
now %>%
format("%A, %B %d, %Y %I:%M %p %Z")
```
```{r, echo=FALSE}
# reset the digits
options(digits = 7)
```
## Timing with `Sys.time()` and `difftime()`
It is easy to determine how long a process takes by using sequential `Sys.time()` calls, one before and one after the process, and getting the difference with `difftime()`. For example,
```{r}
# mark time and run a process
t0 <- Sys.time()
# delay 5 seconds
Sys.sleep(5)
# mark the time now that the 5 second delay has run
t1 <- Sys.time()
# difftime() unqualified will make its best decision about what to print
(difftime(time1 = t1, time2 = t0))
# time between moon step and now-ish
(difftime(time1 = t0, time2 = eagle))
```
`difftime()` can also be forced to report the time difference in the units of choice. Here is the `difftime()` from the 5-second delay we created above:
```{r}
(difftime(time1 = t1, time2 = t0, units = "secs") %>%
as.numeric()) %>% round(0)
(difftime(time1 = t1, time2 = t0, units = "mins") %>%
as.numeric()) %>% round(2)
(difftime(time1 = t1, time2 = t0, units = "hours") %>%
as.numeric()) %>% round(4)
(difftime(time1 = t1, time2 = t0, units = "days") %>%
as.numeric()) %>% round(6)
```
... and the time since the Eagle had landed to now:
```{r}
(difftime(time1 = t1, time2 = eagle, units = "secs") %>%
as.numeric())
(difftime(time1 = t1, time2 = eagle, units = "mins") %>%
as.numeric())
(difftime(time1 = t1, time2 = eagle, units = "hours") %>%
as.numeric())
(difftime(time1 = t1, time2 = eagle, units = "days") %>%
as.numeric())
```
In order to report intervals as years, use `lubridate::time_length()`:
```{r}
(time_length(x = difftime(time1 = t1, time2 = eagle), unit = "years") %>%
as.numeric() %>% round(1))
```
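For timing a single expression, base R's `system.time()` wraps the before-and-after bookkeeping in one call. A minimal sketch, using a short `Sys.sleep()` as a stand-in for real work:

```{r}
# time an expression; the "elapsed" element is wall-clock seconds
timing <- system.time(Sys.sleep(0.5))
(elapsed_s <- timing[["elapsed"]])
```

The paired `Sys.time()` approach is still the more flexible choice when the code being timed spans several chunks or steps.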
## Faster files with `fst()`
The `fst` package is great for rapid reading and writing of data frames. The format can also result in much smaller file sizes using compression. Here we will examine the large Add Health file. First, a download, unzip, and read as necessary:
```{r}
library(fst)
library(haven)
myUrl <- "http://staff.washington.edu/phurvitz/csde502_winter_2021/data/21600-0001-Data.dta.zip"
# zip file in $temp
zipfile <- file.path(tempdir(), basename(myUrl))
# download
curl_download(url = myUrl, destfile = zipfile)
# dta file in $temp
dtafname <- tools::file_path_sans_ext(zipfile)
# check if the dta file exists
if (!file.exists(dtafname)) {
# if the dta file doesn't exist, check for the zip file
# check if the zip file exists, download if necessary
if (!file.exists(zipfile)) {
curl::curl_download(url = myUrl, destfile = zipfile)
}
# unzip the downloaded zip file
unzip(zipfile = zipfile, exdir = tempdir())
}
# read the file
dat <- read_dta(dtafname)
# save as a CSV, along with timing
t0 <- Sys.time()
csvfname <- dtafname %>% str_replace(pattern = "dta", replacement = "csv")
write.csv(x = dat, file = csvfname, row.names = FALSE)
t1 <- Sys.time()
csvwrite_time <- difftime(time1 = t1, time2 = t0, units = "secs") %>%
as.numeric() %>%
round(1)
# file size
csvsize <- file.info(csvfname) %>%
pull(size) %>%
sprintf("%0.f", .)
# save as FST, along with timing
t0 <- Sys.time()
fstfname <- dtafname %>% str_replace(pattern = "dta", replacement = "fst")
write.fst(x = dat, path = fstfname)
t1 <- Sys.time()
# file size
fstsize <- file.info(fstfname) %>%
pull(size) %>%
sprintf("%0.f", .)
fstwrite_time <- difftime(time1 = t1, time2 = t0, units = "secs") %>%
as.numeric() %>%
round(1)
```
It took `r csvwrite_time` s to write `r csvsize` bytes as CSV, and `r fstwrite_time` s to write `r fstsize` bytes as a FST file (with the default compression amount of 50). Reading speeds are comparable.
___It should be noted___ that some file attributes will not be saved in FST format and therefore it should be used with caution if you have a highly attributed data set (e.g., a Stata DTA file with extensive labeling, or a data frame with a lot of customized attribute labels). You will lose those attributes! But for data sets with a simple structure, including factors, the FST format is a good option. With a little work, the attributes of a data frame could be saved as a list (e.g., as an `.RData` file along with the `.fst` file) and then applied after the `.fst` file is loaded.
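One way the attribute workaround might look is sketched below. It assumes that column-level attributes are the ones that matter, uses a small made-up data frame, and stores the attribute list with `saveRDS()` rather than as an `.RData` file; the file names are hypothetical.

```{r}
library(fst)
# a small data frame with a column-level attribute (hypothetical data)
d <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))
attr(d$score, "label") <- "Test score"
# collect the attributes of each column as a list
col_attrs <- lapply(d, attributes)
# write the data as FST and the attributes as RDS
fst_file <- file.path(tempdir(), "d.fst")
rds_file <- file.path(tempdir(), "d_attributes.rds")
write.fst(x = d, path = fst_file)
saveRDS(object = col_attrs, file = rds_file)
# read both back and reapply the attributes column by column
d2 <- read.fst(path = fst_file)
saved <- readRDS(file = rds_file)
for (nm in names(saved)) {
  attributes(d2[[nm]]) <- saved[[nm]]
}
# the label is available again
attr(d2$score, "label")
```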
## Load US Census Boundary and Attribute Data as 'tidyverse' and 'sf'-Ready Data Frames: `tigris`, `tidycensus`
*[This has been covered previously, but is included here as a quick recap; skip to [RVerbalExpressions](#rverbalexpressions).]*
Dealing with US Census data can be overwhelming, particularly if using the raw text-based data. The Census Bureau has an API that allows more streamlined downloads of variables (as data frames) and geographies (as simple features, i.e., `sf` objects). It is necessary to get an API key, available for free. See [tidycensus](https://walker-data.com/tidycensus/) and [tidycensus basic usage](https://walker-data.com/tidycensus/articles/basic-usage.html).
`tidycensus` uses [`tigris`](https://www.rdocumentation.org/packages/tigris/versions/1.0), which downloads the geographic data portion of the census files.
### Download data
A simple example will download the variables representing the count of White, Black/African American, American Indian/Native American, and Asian persons from the American Community Survey (ACS) data for King County in 2019. For this example to run, you need to have your US Census API key installed, e.g.,
<tt>
tidycensus::census_api_key("*****************", install = TRUE)<br>
<font color="red">
Your API key has been stored in your .Renviron and can be accessed by Sys.getenv("CENSUS_API_KEY").<br>
To use now, restart R or run `readRenviron("~/.Renviron")`
</font>
</tt>
The labels from the census API are:
```
"Estimate!!Total"
"Estimate!!Total!!White alone"
"Estimate!!Total!!Black or African American alone"
"Estimate!!Total!!American Indian and Alaska Native alone"
"Estimate!!Total!!Asian alone"
```
```{r, warning=FALSE, message=FALSE}
library(tidycensus)
# the census variables
census_vars <- c(
p_denom_race = "B02001_001",
p_n_white = "B02001_002",
p_n_afram = "B02001_003",
p_n_aian = "B02001_004",
p_n_asian = "B02001_005"
)
# get the data
ctdat <- get_acs(
geography = "tract",
variables = census_vars,
cache_table = TRUE,
year = 2019,
output = "wide",
state = "WA",
county = "King",
geometry = TRUE,
survey = "acs5",
progress_bar = FALSE
)
```
A few values are shown in Table \@ref(tab:census).
```{r census}
# print a few records
ctdat %>%
head() %>%
kable(caption = "Selected census tract variables from the 5-year ACS from 2019 for King County, WA") %>%
kable_styling(full_width = FALSE, position = "left")
```
### Mapping census data
A simple `leaflet` map is shown in Figure \@ref(fig:ct), with the percent of African American residents and the tract identifier shown as hover labels.
```{r ct, fig.cap="Percent African American in census tracts in King County, 2019 ACS 5-year estimate", warning=FALSE, message=FALSE}
library(leaflet)
library(htmltools)
library(sf)
# transform to WGS84 (EPSG:4326), which leaflet expects
ctdat %<>% st_transform(4326)
# percent Black
ctdat %<>%
mutate(pct_black = (p_n_aframE / p_denom_raceE * 100) %>% round(1))
# a label
labels <- sprintf("%s<br/>%s%s", ctdat$GEOID, ctdat$pct_black, "%") %>% lapply(htmltools::HTML)
bins <- 0:50
pal <- colorBin(
palette = "Reds",
domain = ctdat$pct_black,
bins = bins
)
bins2 <- seq(0, 50, by = 10)
pal2 <- colorBin(
palette = "Reds",
domain = ctdat$pct_black,
bins = bins2
)
# the leaflet map
m <- leaflet(height = "500px") %>%
# add polygons from tracts
addPolygons(
data = ctdat,
weight = 1,
fillOpacity = 0.8,
# fill using the palette
fillColor = ~ pal(pct_black),
# highlighting
highlight = highlightOptions(
weight = 5,
color = "#666",
fillOpacity = 0.7,
bringToFront = TRUE
),
# popup labels
label = labels,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto"
)
) %>%
# legend
addLegend(
position = "bottomright", pal = pal2, values = ctdat$pct_black,
title = "% African American",
opacity = 1
)
m %>% addTiles()
```
### Creating population pyramids from census data
See [Estimates of population characteristics](https://walker-data.com/tidycensus/articles/other-datasets.html#estimates-of-population-characteristics-1).
Refer also back to CSDE 533 Week 2 age structure; Week 7 interpreting age structure.
## Easier regular expressions with `RVerbalExpressions` {#rverbalexpressions}
Regular expressions are powerful but take some time and trial-and-error to master. The `RVerbalExpressions` package can be used to more easily generate regular expressions. See the help for `rx()` and associated functions.
These examples show two constructions of regular expressions for matching two similar but different URLs. First we build a regex using easy-to-understand controls:
```{r}
library(RVerbalExpressions)
# a pattern
x <- rx_start_of_line() %>%
rx_find("http") %>%
rx_maybe("s") %>%
rx_find("://") %>%
rx_maybe("www.") %>%
rx_anything_but(" ") %>%
rx_end_of_line()
# print the expression
(x)
```
That regex is then used to try matching against two URLs:
```{r}
# search for a pattern in some URLs
urls <- c(
"http://www.google.com",
"http://staff.washington.edu/phurvitz/csde502_winter_2021/"
)
grepl(pattern = x, x = urls)
```
We can try a slightly different regex pattern. The former pattern used the less strict `rx_maybe("www.")`, whereas the following pattern uses the more strict `rx_find("www.")`.
```{r}
# a different pattern
y <- rx_start_of_line() %>%
rx_find("http") %>%
rx_maybe("s") %>%
rx_find("://") %>%
rx_find("www.") %>%
rx_anything_but(" ") %>%
rx_end_of_line()
# print the expression
(y)
# search for a pattern in the two URLs, matches one, does not match the other
grepl(pattern = y, x = urls)
```
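For comparison, hand-written patterns with the same intent can be passed to `grepl()` directly. These are a sketch of roughly equivalent expressions, not the exact strings that `RVerbalExpressions` generates:

```{r}
# the same two URLs as above
urls <- c(
  "http://www.google.com",
  "http://staff.washington.edu/phurvitz/csde502_winter_2021/"
)
# optional "s", optional "www.", then anything except a space
x_manual <- "^https?://(www\\.)?[^ ]*$"
(match_x <- grepl(pattern = x_manual, x = urls))
# optional "s", required "www.", then anything except a space
y_manual <- "^https?://www\\.[^ ]*$"
(match_y <- grepl(pattern = y_manual, x = urls))
```

The hand-written versions are shorter, but the chained `rx_*()` calls make the intent of each piece easier to read.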
## Quick copy from Excel (Windows only)
Under Windows, it is possible to copy selected cells from an Excel worksheet directly to R. This is not an endorsement for using Excel, but there are some cases in which Excel may be able to produce some quick data that you don't want to develop in other ways.
As a demonstration, you can use [words_analysis.xlsx](files/words_analysis.xlsx). Download and open the file, then select and copy a block of cells. The selection of cells that was copied for this example is shown here.
![](images/week09/excel.png)
The code below shows how the data can be copied. The Windows clipboard can be read as a "file" by `read.table()`, using a tab as the field separator.
```{r, echo=FALSE}
xlsclip <- fst::read.fst("files/xlsclip.fst")
```
```{r, eval=FALSE}
xlsclip <- read.table(file = "clipboard", sep = "\t", header = TRUE)
xlsclip %>%
kable() %>%
kable_styling(
full_width = FALSE,
position = "left"
)
```
```{r, echo=FALSE}
xlsclip %>%
kable() %>%
kable_styling(
full_width = FALSE,
position = "left"
)
```
## Running system commands
R can run arbitrary system commands that you would normally run in a terminal or command window. The `system()` function is used to run commands, optionally with the results returned as a character vector. Under Mac and Linux, the usage is quite straightforward, for example, to list files in a specific directory:
```
tempdirfiles <- system("ls $TEMP", intern = TRUE)
```
Under Windows, it takes a bit of extra code. To do the same requires the prefix `cmd /c` in the `system()` call before the command itself. Also any backslashes in path names need to be specified as double-backslashes for R.
```{r}
library(magrittr)
# R prefers and automatically generates forward slashes
tmpdir <- dirname(tempdir())
# what is the OS
os <- .Platform$OS.type
# construct a system command
# under Windows
if (os == "windows") {
# under Windows, path delimiters are backslashes so need to be rendered in R as double backslashes
tmpdir %<>% str_replace_all("/", "\\\\")
# formulate the command
cmd <- sprintf("cmd /c dir %s", tmpdir)
# run the command
tempdirfiles <- system(command = cmd, intern = TRUE)
}
# under *NIX
if (os == "unix") {
cmd <- sprintf("ls %s", tmpdir)
tempdirfiles <- system(command = cmd, intern = TRUE)
}
```
If you are running other programs or utilities that are executed in a terminal or command window, this can be very helpful. Use `intern = TRUE` to return the results of the command as an object in the R environment.
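For the specific task of listing files, a shell command is not strictly needed: base R's `list.files()` returns the same kind of character vector on any operating system. A minimal sketch, offered as a portable alternative rather than a replacement for `system()`:

```{r}
# the parent of the session's temporary directory
parent_dir <- dirname(tempdir())
# list its contents without calling the shell
tempdirfiles_r <- list.files(path = parent_dir)
head(tempdirfiles_r)
```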
## Code styling
Good code should meet at least two functional requirements: getting the job done and being easy to read. Code that gets the job done but is not easy to read will cause problems later when you try to figure out how or why you did something.
The [`styler`](https://github.com/r-lib/styler) package can help clean up your code so that it conforms to a specific style such as that in the [tidyverse style guide](https://style.tidyverse.org/). `styler` can be integrated into RStudio for interactive use. It can reformat selected code, an entire file, or an entire project. An example is shown:
![](images/week09/styler_0.1.gif)
[`lintr`](https://github.com/jimhester/lintr) is also useful for identifying potential style errors.
## Session information
It may be helpful for troubleshooting or for complete documentation to report the complete session information. For example, sometimes outdated versions of packages may contain errors. The session information is printed with `sessionInfo()`.
```{r}
sessionInfo()
```
## Commenting out Rmd/HTML code
To comment out entire parts of your Rmd so they do not appear in your rendered HTML, use HTML comments, which are specified with the delimiters `<!--` and `-->`. For example, you will not see anything between these two blocks of angle brackets in the HTML output, but if you look at the complete code for the Rmd file that generated this document (below), you will get a treat.
<!--
from https://www.gnu.org/fun/jokes/error-haiku.txt:
IMAGINE IF INSTEAD OF CRYPTIC TEXT STRINGS,
YOUR COMPUTER PRODUCED ERROR MESSAGES IN HAIKU...
A file that big?
It might be very useful.
But now it is gone.
- - - - - - - - - - - -
Yesterday it worked
Today it is not working
Windows is like that
- - - - - - - - - - - -
Stay the patient course
Of little worth is your ire
The network is down
- - - - - - - - - - - -
Three things are certain:
Death, taxes, and lost data.
Guess which has occurred.
- - - - - - - - - - - -
You step in the stream,
but the water has moved on.
This page is not here.
- - - - - - - - - - - -
Out of memory.
We wish to hold the whole sky,
But we never will.
- - - - - - - - - - - -
Having been erased,
The document you're seeking
Must now be retyped.
- - - - - - - - - - - -
Rather than a beep
Or a rude error message,
These words: "File not found."
- - - - - - - - - - - -
Serious error.
All shortcuts have disappeared.
Screen. Mind. Both are blank.
- - - - - - - - - - - -
The Web site you seek
cannot be located but
endless more exist.
- - - - - - - - - - - -
Chaos reigns within.
Reflect, repent, and reboot.
Order shall return.
- - - - - - - - - - - -
ABORTED effort:
Close all that you have.
You ask way too much.
- - - - - - - - - - - -
First snow, then silence.
This thousand dollar screen dies
so beautifully.
- - - - - - - - - - - -
With searching comes loss
and the presence of absence:
"My Novel" not found.
- - - - - - - - - - - -
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.
- - - - - - - - - - - -
Windows NT crashed.
I am the Blue Screen of Death.
No one hears your screams.
- - - - - - - - - - - -
A crash reduces
your expensive computer
to a simple stone.
- - - - - - - - - - - -
Error messages
cannot completely convey.
We now know shared loss.
-- Anonymous Author
-->
<hr>
Rendered at <tt>`r Sys.time()`</tt>
## Source code
File is at `r fnamestr`.
### R code used in this document
```{r ref.label=knitr::all_labels(), echo=TRUE, eval=FALSE}
```
### Complete Rmd code
```{r comment=''}
cat(readLines(fnamepath), sep = '\n')
```