---
title: "R Training"
output: learnr::tutorial
runtime: shiny_prerendered
description: Code format of the R Training
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(learnr)
library(tidyverse)
library(palmerpenguins)
library(lubridate)
```
## Visualisation
In R, the ggplot2 package is used to create plots.
`ggplot(data = <dataframe>)`
- Creates an empty plot
- A "geom" must be added to plot something
ggplot template:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
Replace the bracketed sections in the template to create a plot.
The mapping defines how variables in the dataset are mapped to visual properties (aesthetics) such as position, colour, and size.
### Create a plot
Using the "penguins" dataset, plot "bill_length_mm" vs "bill_depth_mm"
```{r ggplot_1, warning=FALSE}
ggplot(data = penguins) +
geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm))
```
Exercise:
For the "mpg" dataset, plot "displ" by "hwy"
```{r ggplot_2, exercise=TRUE}
ggplot(data = ) +
geom_point(mapping = aes(x = , y = ))
```
```{r ggplot_2-solution}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
Exercise:
For the "storms" dataset, plot "wind" vs "pressure"
```{r ggplot_3, exercise=TRUE}
```
```{r ggplot_3-solution}
ggplot(data = storms) +
geom_point(mapping = aes(x = wind, y = pressure))
```
### ggplot - aesthetics
Mappings can be used to add more information to the plots
#### Add colour to the penguins plot
Using the "penguins" plot above, colour by "species"
```{r ggplot_colour, warning=FALSE}
ggplot(data = penguins) +
geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm, colour = species))
```
#### Use size, shape, and alpha
Change the above to use different sizes for different "species"
```{r ggplot_size, warning=FALSE}
ggplot(data = penguins) +
geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm, size = species))
```
Change the above to use different shapes for different "species"
```{r ggplot_shape, warning=FALSE}
ggplot(data = penguins) +
geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm, shape = species))
```
Change the above to use different transparency for different "species"
```{r ggplot_alpha, warning=FALSE}
ggplot(data = penguins) +
geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm, alpha = species))
```
#### Manually set the colour and transparency
To set the colour (or another property) manually for every point, place the argument outside of aes(), as an argument to the geom itself
For the "penguins" plot above, colour all the points blue with a transparency of 0.8
```{r ggplot_manual_colour, warning=FALSE}
ggplot(data = penguins) +
geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm), colour = "blue", alpha = 0.8)
```
Exercise:
For the "mpg" plot above, add "class" as a colour
```{r ggplot_aes_1, exercise=TRUE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, colour = ))
```
```{r ggplot_aes_1-solution}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, colour = class))
```
Exercise:
For the "storms" plot above, add "status" as a colour and change shape to be a square
```{r ggplot_aes_2, exercise=TRUE}
ggplot(data = storms) +
geom_point(mapping = aes(x = wind, y = pressure))
```
```{r ggplot_aes_2-solution}
ggplot(data = storms) +
geom_point(mapping = aes(x = wind, y = pressure, colour = status), shape = "square")
```
### ggplot - other geoms
#### geom_line
Use "economics_long" to create a line chart of "value" for the different dates in "date", coloured by "variable"
```{r geom_line, warning=FALSE}
ggplot(data = economics_long) +
geom_line(mapping = aes(x = date, y = value, colour = variable))
```
#### geom_boxplot
Use "penguins" to create a box plot showing the spread of "flipper_length_mm" for each "species", coloured by "species"
```{r geom_boxplot, warning=FALSE}
ggplot(data = penguins) +
geom_boxplot(mapping = aes(x = species, y = flipper_length_mm, colour = species))
```
#### geom_bar
For the "mpg" dataset, create a bar chart for the number of entries for "class", coloured by "drv"
```{r geom_bar, warning=FALSE}
ggplot(data = mpg) +
geom_bar(mapping = aes(x = class, fill = drv))
```
Take the same plot, but position the bars next to each other
```{r geom_bar_2, warning=FALSE}
ggplot(data = mpg) +
geom_bar(mapping = aes(x = class, fill = drv), position = "dodge")
```
Exercise:
Using "diamonds" create a boxplot of "carat" for each value of "cut"
```{r geom_boxplot_2, exercise=TRUE}
ggplot(data = ) +
geom_boxplot(mapping = aes(x = , y = ))
```
```{r geom_boxplot_2-solution}
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = cut, y = carat))
```
Exercise:
Using "diamonds" create a bar chart of "color", coloured by "cut", with the bars next to each other
```{r geom_bar_3, exercise=TRUE}
```
```{r geom_bar_3-solution}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = color, fill = cut), position = "dodge")
```
geom_col is a variation of geom_bar: geom_bar counts the rows for each x value, whereas geom_col takes both an x and a y mapping, so the bar heights come directly from a column in the data
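As a minimal sketch of geom_col, the chunk below plots bar heights that are already stored in a column. The data frame `toy_counts` and its values are made up for illustration and do not come from any dataset used in this training.

```{r geom_col_sketch, warning=FALSE}
library(ggplot2) # already loaded via tidyverse in the setup chunk; repeated so the chunk runs on its own

# Hypothetical, hand-made data: one row per bar, heights already known
toy_counts <- data.frame(
  group = c("a", "b", "c"),
  total = c(3, 7, 5)
)

# geom_col needs both x and y: the bar height is read from the "total" column
col_plot <- ggplot(data = toy_counts) +
  geom_col(mapping = aes(x = group, y = total))
col_plot
```

With geom_bar the same data would give three bars of height 1, because each group appears in exactly one row.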
### ggplot - facetting
#### facet_wrap
From "diamonds" plot "carat" against "price" and facet by "cut"
```{r facet_wrap, warning=FALSE}
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))+
facet_wrap(~cut)
```
#### facet_grid
For "penguins" plot the "bill_length_mm" against "bill_depth_mm" and facet by island and species
```{r facet_grid, warning=FALSE}
ggplot(data = penguins) +
geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm))+
facet_grid(island~species)
```
Exercise:
Using "mpg" create a plot of "displ" vs "hwy" and facet by "class"
```{r facet_wrap_2, exercise=TRUE}
ggplot(data = ) +
geom_point(mapping = aes(x = , y = )) +
facet_wrap(~)
```
```{r facet_wrap_2-solution}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~class)
```
Exercise:
Using "mtcars" create a plot of "mpg" vs "wt" and facet by the "am" and "gear" variables
```{r facet_grid_2, exercise=TRUE}
```
```{r facet_grid_2-solution}
ggplot(data = mtcars) +
geom_point(mapping = aes(x = mpg, y = wt)) +
facet_grid(am~gear)
```
### Test your knowledge
#### Question 1
```{r ggplot_q1, echo=FALSE}
question("What is the output from running the code ggplot()?",
answer("A scatterplot of x vs y", message = "To add data to the plot you need to add a geom (geom_point for a scatterplot)"),
answer("No output", message = "ggplot() is a valid function - creates an empty plot with no data until a geom is added"),
answer("An empty plot", correct = TRUE),
answer("An error", message = "ggplot() is a valid function - creates an empty plot with no data until a geom is added"))
```
#### Question 2
```{r ggplot_q2, echo=FALSE}
question("Which of these geoms will colour the points by class?",
answer("geom_point(mapping = aes(x = displ, y = hwy, colour = class))", correct = TRUE),
answer("point(mapping = aes(x = displ, y = hwy, colour = class))", message = "check the function - shouldn't this be geom_point?"),
answer("geom_point(mapping = aes(x = displ, y = hwy), colour = class)", message = "check the position of the colour argument - if it's not inside the aesthetics function it will not be able to use parameters from the dataset"),
answer("geom_point(mapping = aes(x = displ, y = hwy, fill = class))", message = "geom_point uses colour or color, rather than fill"))
```
#### Question 3
```{r ggplot_q3, echo=FALSE}
question("Will this code change the shape based on species: geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm, shape = 'species'))?",
answer("Yes", message = "'species' in quotations will be read as a character, rather than identified as a parameter, which will result in one shape being used for the plot, labelled 'species' in the legend"),
answer("No", correct = TRUE, message = "'species' in quotations will be read as a character, rather than identified as a parameter"))
```
#### Question 4
```{r ggplot_q4, echo=FALSE}
question("Which of these is correct:",
answer("geom_bar(mapping = aes(x = species, colour = island))", message = "for geom_bar, 'colour' only changes the bar outline; use 'fill' to colour the bars"),
answer("geom_bar(mapping = aes(x = species, y = average_bill_length, fill = island))", message = "geom_bar does not have a 'y' argument, use geom_col instead"),
answer("geom_bar(mapping = aes(x = species, fill = island))", correct = TRUE),
answer("geom_col(mapping = aes(x = species, y = average_bill_length, fill = island))", correct = TRUE),
answer("geom_col(mapping = aes(x = species, fill = island))", message = "geom_col requires a 'y' argument, use geom_bar instead"))
```
#### Question 5
Using the "penguins" dataset create a bar chart of "sex", coloured by "species" and facetted by "island", with bars next to each other
```{r ggplot_q5, exercise = TRUE, warning = FALSE}
```
```{r ggplot_q5-solution}
ggplot(data = penguins) +
geom_bar(mapping = aes(x = sex, fill = species), position = "dodge")+
facet_wrap(~island)
```
#### Question 6
For the "DNase" dataset, plot "conc" vs "density" using a scatterplot, change the colour to blue and the shape to a triangle, and facet by "Run"
```{r ggplot_q6, exercise = TRUE}
```
```{r ggplot_q6-solution}
ggplot(data = DNase)+
geom_point(mapping = aes(x = conc, y = density), colour = "blue", shape = "triangle")+
facet_wrap(~Run)
```
#### Question 7
Using the "BOD" dataset, plot "demand" for each value of "Time" on a bar chart and colour all bars lightpink
```{r ggplot_q7, exercise = TRUE}
```
```{r ggplot_q7-solution}
ggplot(data = BOD)+
geom_col(mapping = aes(x = Time, y = demand), fill = "lightpink")
```
#### Question 8
Using the "InsectSprays" dataset, use a box plot to plot "count" for each "spray", and colour by "spray"
```{r ggplot_q8, exercise = TRUE}
```
```{r ggplot_q8-solution}
ggplot(data = InsectSprays) +
geom_boxplot(aes(x = spray, y = count, colour = spray))
```
#### Question 9
Using the "ChickWeight" dataset, plot "Time" vs "weight" using a line plot, colour by "Chick", and facet by "Diet"
```{r ggplot_q9, exercise = TRUE}
```
```{r ggplot_q9-solution}
ggplot(data = ChickWeight) +
geom_line(mapping = aes(x = Time, y = weight, colour = Chick))+
facet_wrap(~Diet)
```
#### Question 10
Correct the code:
```{r ggplot_q10, exercise = TRUE}
gplot(data = ToothGrowth)
geompoint(aes = (x = does, y = len), colour = supp)
```
```{r ggplot_q10-solution}
ggplot(data = ToothGrowth) + #spelling mistake, "+" missing
geom_point(mapping = aes(x = dose, y = len, colour = supp)) # spelling mistakes, remove "=" after aes, colour argument is outside of aes
```
## Coding Basics
R can be used as a calculator
```{r calculator, warning = FALSE}
5 * 6
```
Values or objects can be stored by assigning them to a name. Run the object name on its own to print its value
```{r storing_objects, warning = FALSE}
variable <- 5
object <- 2 * 5
variable
object
```
- Names can include letters (upper or lower case), numbers, underscores, and full stops, and should start with a letter. Do not use special characters or spaces.
- Snake case is recommended: i_am_snake_case (all lower case with underscores separating words)
- Make object names descriptive
To see what a dataframe looks like:
```{r view_dataframe, warning = FALSE}
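str(storms)      # structure: column names, types, and first values
storms           # print the data frame
head(storms, 3)  # first 3 rows only
glimpse(storms)  # one line per column, with its type and first values
view(storms)     # open the data viewer (interactive sessions)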
```
Functions come in the form of function_name(arg1 = val1, arg2 = val2, ...)
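A small sketch of this pattern using two base R functions, round() and seq(). The object names `rounded` and `odd_numbers` are only illustrative.

```{r function_call_sketch, warning = FALSE}
# function_name(arg1 = val1, arg2 = val2)
rounded <- round(x = 3.14159, digits = 2)
rounded # 3.14

odd_numbers <- seq(from = 1, to = 10, by = 2)
odd_numbers # 1 3 5 7 9

# Arguments can also be matched by position instead of by name
identical(seq(1, 10, 2), odd_numbers) # TRUE
```

Naming the arguments is longer to type but makes the call easier to read.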
In R, brackets and quotation marks always come in pairs
See the next section to practise using functions
## Data Transformation
### Filter
Filter allows you to subset observations based on their values
The first argument is the name of the data frame, the second and subsequent arguments are the expressions that filter the data frame
Filter "storms" for "name" is equal to "Amy"
```{r filter, warning = FALSE}
filter(storms, name == "Amy")
```
Note the use of "==", which tests for equality (a single "=" assigns a value instead)
To save the result for use later on, you need to assign the code to a variable name:
Assign the filtered dataset to a variable, then call the variable
```{r filter_name, warning = FALSE}
filtered_dataset <- filter(storms, name == "Amy")
filtered_dataset
```
The function is able to filter for multiple conditions:
Filter "storms" for "name" is equal to "Amy" and "status" is equal to "tropical depression"
```{r filter_multiple, warning = FALSE}
filter(storms, name == "Amy", status == "tropical depression")
```
Exercise:
Filter "penguins" for "species" is equal to "Adelie" and "sex" is equal to "female". Save to a variable called "female_adelies" and call the variable to see the output
```{r filter_2, exercise = TRUE}
<- filter(penguins, species == , sex == )
female_adelies
```
```{r filter_2-solution, warning = FALSE}
female_adelies <- filter(penguins, species == "Adelie", sex == "female")
female_adelies
```
Exercise:
Filter "diamonds" for "cut" is equal to "Premium" and "color" is equal to "I". Save to a variable called "premium_diamonds" and call the variable to see the output
```{r filter_3, exercise = TRUE}
```
```{r filter_3-solution, warning = FALSE}
premium_diamonds <- filter(diamonds, cut == "Premium", color == "I")
premium_diamonds
```
#### Filter - Logical Operators
Multiple conditions can be combined using "and", "or", and "not"
"and" = "&", "or" = "|", "not" = "!"
Filtering "storms" for "name" is "Amy" AND "wind" is 30:
```{r filter_and, warning = FALSE}
filter(storms, name == "Amy" & wind == 30)
```
Filtering "storms" for "name" is "Amy" OR "name" is "Caroline":
```{r filter_or, warning = FALSE}
filter(storms, name == "Amy" | name == "Caroline")
```
This could also be written using the %in% operator:
```{r filter_in, warning = FALSE}
filter(storms, name %in% c("Amy", "Caroline"))
```
Filter "storms" for "name" is neither "Amy" nor "Caroline" (both conditions must hold, so they are combined with "&")
```{r filter_or_2, warning = FALSE}
filter(storms, name != "Amy" & name != "Caroline")
```
This could also be written using the %in% operator:
```{r filter_in_2, warning = FALSE}
filter(storms, !name %in% c("Amy", "Caroline"))
```
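The difference between "&" and "|" when excluding values is easy to get wrong, so here is a minimal sketch on a made-up three-row data frame (`toy_storms` and its values are hypothetical, not the real "storms" data).

```{r filter_not_sketch, warning = FALSE}
library(dplyr) # already loaded via tidyverse in the setup chunk

# Hypothetical toy data, not the real storms dataset
toy_storms <- data.frame(name = c("Amy", "Caroline", "Doris"),
                         wind = c(30, 45, 60))

# "Not Amy AND not Caroline" keeps only the remaining name
with_and <- filter(toy_storms, name != "Amy" & name != "Caroline")
with_and$name # "Doris"

# The negated %in% form keeps the same rows
with_in <- filter(toy_storms, !name %in% c("Amy", "Caroline"))
identical(with_and$name, with_in$name) # TRUE

# With "|" every row passes, because no name equals both values at once
with_or <- filter(toy_storms, name != "Amy" | name != "Caroline")
nrow(with_or) # 3
```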
Can use comparison operators as well
Filter "storms" for "wind" greater than or equal to 30
```{r filter_greater, warning = FALSE}
filter(storms, wind >= 30)
```
Can filter for comparison between columns in the dataset
Filter "storms" for "wind" less than "pressure"
```{r filter_compare, warning = FALSE}
filter(storms, wind < pressure)
```
To filter for NA's, use is.na()
In the "starwars" dataset, filter for "hair_color" is NA
```{r filter_na, warning = FALSE}
filter(starwars, is.na(hair_color))
```
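A short sketch of why is.na() is needed rather than "== NA". The data frame `toy_people` is made up for illustration. Comparing anything with NA gives NA rather than TRUE or FALSE, and filter() only keeps rows where the condition is TRUE.

```{r filter_na_sketch, warning = FALSE}
library(dplyr) # already loaded via tidyverse in the setup chunk

# Hypothetical toy data with one missing value
toy_people <- data.frame(id = 1:3,
                         hair_color = c("blond", NA, "brown"))

NA == NA # NA, not TRUE

nrow(filter(toy_people, hair_color == NA))   # 0: every comparison is NA, so no row is kept
nrow(filter(toy_people, is.na(hair_color)))  # 1: the row with the missing value
nrow(filter(toy_people, !is.na(hair_color))) # 2: the rows without missing values
```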
Exercise:
Filter "diamonds" for "color" equal to "E" or "H" and "x" less than or equal to "y" and "x" greater than "z"
```{r filter_4, exercise = TRUE}
filter(diamonds, %in% & <= & > )
```
```{r filter_4-solution}
filter(diamonds, color %in% c("E", "H") & x <= y & x > z)
```
Exercise:
Filter "penguins" for "body_mass_g" less than 3500 or "body_mass_g" greater than 4000
```{r filter_5, exercise = TRUE}
```
```{r filter_5-solution}
filter(penguins, body_mass_g < 3500 | body_mass_g > 4000)
```
Exercise:
Filter "penguins" for "body_mass_g" less than 3500 or "body_mass_g" greater than 4000 and "island" does not equal "Torgersen" and save the result to an object called "filtered_penguins". Call the object to see the output.
```{r filter_6, exercise = TRUE}
```
```{r filter_6-solution}
filtered_penguins <- filter(penguins, (body_mass_g < 3500 | body_mass_g > 4000) & island != "Torgersen")
filtered_penguins
```
### Arrange
Arrange works similarly to filter, except that it changes the order of the rows instead of selecting them
It takes a dataframe, and a set of column names to order by
If more than one column name is provided, it orders by the first column, then by the second, etc.
Arrange storms by month, then day, then hour
```{r arrange, warning = FALSE}
arrange(storms, month, day, hour)
```
Arrange storms by descending month, descending day, then descending hour
```{r arrange_desc, warning = FALSE}
arrange(storms, desc(month), desc(day), desc(hour))
```
Arrange storms by descending month, descending day, then descending hour using "-"
```{r arrange_desc_2, warning = FALSE}
arrange(storms, -month, -day, -hour)
```
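A minimal sketch comparing the two approaches on made-up data (`toy_winds` is hypothetical). Both give the same order for a numeric column. For a character column only desc() applies, since negating text with "-" raises an error.

```{r arrange_desc_sketch, warning = FALSE}
library(dplyr) # already loaded via tidyverse in the setup chunk

# Hypothetical toy data
toy_winds <- data.frame(name = c("Bob", "Amy", "Cid"),
                        wind = c(30, 60, 45))

# Numeric column: desc() and "-" give the same order
by_wind_desc <- arrange(toy_winds, desc(wind))
by_wind_minus <- arrange(toy_winds, -wind)
by_wind_desc$wind # 60 45 30
identical(by_wind_desc$wind, by_wind_minus$wind) # TRUE

# Character column: use desc()
by_name_desc <- arrange(toy_winds, desc(name))
by_name_desc$name # "Cid" "Bob" "Amy"
```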
Exercise:
Arrange "diamonds" by "cut", "color", then descending "carat"
```{r arrange_1, exercise=TRUE}
arrange(diamonds, , , desc())
```
```{r arrange_1-solution}
arrange(diamonds, cut, color, desc(carat))
```
Exercise:
Arrange "penguins" by "year", "sex", then descending "body_mass_g"
```{r arrange_2, exercise=TRUE}
```
```{r arrange_2-solution}
arrange(penguins, year, sex, -body_mass_g)
```
### Select
Allows narrowing in on the variables of interest
Retains or drops the specified columns
Can be used to rearrange the order of the columns
Keeping only the specified columns:
For "storms" select the "name", "year", "status", and "pressure" columns
```{r select, warning = FALSE}
select(storms, name, year, status, pressure)
```
Keeping columns except those specified:
For "storms" drop the "month", "day", "hour", "lat", and "long" columns
```{r select_2, warning = FALSE}
select(storms, -month, -day, -hour, -lat, -long)
```
Rearranging the order of the columns:
For "storms" put the "status" column after the "name" column
```{r select_3, warning = FALSE}
select(storms, name, status, everything())
```
Exercise:
For "diamonds" select the "cut", "color", and "carat" columns
```{r select_4, exercise = TRUE}
select(diamonds, , , )
```
```{r select_4-solution}
select(diamonds, cut, color, carat)
```
Exercise:
For "penguins", remove the "sex" and "year" columns
```{r select_5, exercise=TRUE}
```
```{r select_5-solution}
select(penguins, -sex, -year)
```
There are helper functions that can be used with select
starts_with():
For "diamonds", select all columns that start with "c"
```{r select_starts, warning = FALSE}
select(diamonds, starts_with("c"))
```
ends_with():
For "penguins" select the species column and any column ending with "mm"
```{r select_ends, warning = FALSE}
select(penguins, species, ends_with("mm"))
```
contains():
For "penguins", select all columns that contain "length"
```{r select_contains, warning = FALSE}
select(penguins, contains("length"))
```
Can also select a range of columns:
For "storms" select all columns from "name" to "hour"
```{r select_range, warning = FALSE}
select(storms, name:hour)
```
Distinct can be used to select columns and return unique values in those columns:
For "storms", select the unique options in the name and status columns
```{r distinct, warning = FALSE}
distinct(storms, name, status)
```
Exercise:
For penguins select species, any columns that end with mm, and all columns from sex to year
```{r select_6, exercise = TRUE}
select(penguins, , , )
```
```{r select_6-solution}
select(penguins, species, ends_with("mm"), sex:year)
```
Exercise:
For "penguins", select the "species" column and any columns that contain "length" or "depth"
```{r select_contains_2, exercise = TRUE}
```
```{r select_contains_2-solution}
select(penguins, species, contains("length") | contains("depth"))
```
Exercise:
For "penguins", remove any columns that end with "mm"
```{r select_ends_2, exercise = TRUE}
```
```{r select_ends_2-solution}
select(penguins, -ends_with("mm"))
```
### Mutate
Mutate adds a new column, often as a function of another column
The new column is added at the end of the dataset
For "storms" create a new column called "modified_pressure" which divides "pressure" by 1000
```{r mutate, warning = FALSE}
mutate(storms, modified_pressure = pressure / 1000)
```
It can also be used to modify an existing column:
For "storms" modify the column called "pressure" to divide "pressure" by 1000
```{r mutate_2, warning = FALSE}
mutate(storms, pressure = pressure / 1000)
```
You can create multiple columns in the same mutate function, and also refer to newly created columns:
For "diamonds" create a new column called "volume" which multiplies "x", "y", and "z", then use this column to create a new column called "cubic_volume" by dividing "volume" by 1000
```{r mutate_3, warning = FALSE}
mutate(diamonds,
volume = x * y * z,
cubic_volume = volume / 1000)
```
If you only want to keep the new columns, use transmute():
For "diamonds" create a new column called "price_per_carat" by dividing "price" by "carat" and create a new column called "volume" which multiplies "x", "y", and "z". Return the new columns only.
```{r transmute, warning = FALSE}
transmute(diamonds,
price_per_carat = price / carat,
volume = x * y * z)
```
Exercise:
For "penguins", create a new column called "bill_area_mm" by multiplying "bill_length_mm" and "bill_depth_mm", then create another column called bill_area_cm by dividing the new column by 100
```{r mutate_4, exercise = TRUE}
mutate(penguins,
= * ,
= / 100)
```
```{r mutate_4-solution}
mutate(penguins,
bill_area_mm = bill_length_mm * bill_depth_mm,
bill_area_cm = bill_area_mm / 100)
```
Exercise:
For "starwars" modify the "height" column to be "height" / 100 and modify the "mass" column to be "mass" * 2.2
```{r mutate_5, exercise = TRUE}
```
```{r mutate_5-solution}
mutate(starwars,
height = height / 100,
mass = mass * 2.2)
```
### Pipes
Pipes allow multiple operations to be combined or strung together
Makes code more readable
Focuses on the transformations, rather than what is being transformed
Removes the need for intermediate steps
Takes the resulting dataframe and uses this as the first argument to the next function
With piping:
```{r pipes, warning = FALSE}
new_penguins_dataset <- penguins %>%
select(species, ends_with("mm")) %>%
filter(!is.na(bill_length_mm)) %>%
mutate(bill_area_mm = bill_length_mm * bill_depth_mm) %>%
distinct(species, bill_area_mm)
new_penguins_dataset
```
Without piping:
```{r pipes_2, warning = FALSE}
selected_penguins <- select(penguins, species, ends_with("mm"))
filtered_penguins <- filter(selected_penguins, !is.na(bill_length_mm))
penguins_with_area <- mutate(filtered_penguins, bill_area_mm = bill_length_mm * bill_depth_mm)
new_penguins_dataset <- distinct(penguins_with_area, species, bill_area_mm)
new_penguins_dataset
```
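The same idea on a plain numeric vector, as a minimal sketch: `x %>% f()` is another way of writing `f(x)`, so a chain of pipes gives the same result as the equivalent nested call. The object names are illustrative only.

```{r pipe_sketch, warning = FALSE}
library(dplyr) # provides %>%; already loaded via tidyverse in the setup chunk

# Read left to right: take the numbers, take square roots, add them up
piped <- c(1, 4, 9) %>% sqrt() %>% sum()

# The same calculation written as nested calls, read inside out
nested <- sum(sqrt(c(1, 4, 9)))

piped # 6
identical(piped, nested) # TRUE
```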
Exercise:
Filter "diamonds" for "cut" is either "Premium" or "Good", select the "cut", "color", "price" and "carat" columns, create a new column called "price_per_carat" which divides "price" by "carat", then filter for the new column with values over 1500
```{r pipes_3, exercise = TRUE}
diamonds %>%
filter(cut %in% ) %>%
select( , , , ) %>%
mutate(price_per_carat = / ) %>%
filter( > 1500)
```
```{r pipes_3-solution}
diamonds %>%
filter(cut %in% c("Premium", "Good")) %>%
select(cut, color, price, carat) %>%
mutate(price_per_carat = price / carat) %>%
filter(price_per_carat > 1500)
```
Exercise:
Filter "storms" for "ts_diameter" is not NA, select all columns except "year", "month", "day", "hour", "lat", and "long", change the "pressure" column to be divided by 1000, change the "ts_diameter" and "hu_diameter" columns to be divided by 10, and then filter for "status" is equal to "tropical storm".
```{r pipes_4, exercise = TRUE}
```
```{r pipes_4-solution}
storms %>%
filter(!is.na(ts_diameter)) %>%
select(-year, -month, -day, -hour, -lat, -long) %>%
mutate(pressure = pressure / 1000,
ts_diameter = ts_diameter / 10,
hu_diameter = hu_diameter / 10) %>%
filter(status == "tropical storm")
```
### Group and Summarise
We can use a function called group_by() to specify groups within the dataframe. This then allows calculations to be performed by group, rather than by row or over the entire dataframe.
To demonstrate the use of group_by, we can also combine with the summarise() function to calculate summaries for each group.
Summarise will collapse the dataframe down to a single row per group, and will only retain the group columns and the summarised columns.
For "storms", group by "name" and "status", then create a summarised table with new columns "avg_wind" and "avg_pressure"
```{r group, warning = FALSE}
storms %>%
group_by(name, status) %>%
summarise(avg_wind = mean(wind, na.rm = T),
avg_pressure = mean(pressure, na.rm = T))
```
Note: na.rm controls how missing values (NAs) are handled. Setting it to TRUE makes the summary ignore the NAs. If it is left as FALSE (the default) and at least one NA is present, the summary value will be NA.
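A minimal base R sketch of the effect of na.rm, using a made-up vector called `values`:

```{r na_rm_sketch, warning = FALSE}
# Hypothetical values with one missing entry
values <- c(10, NA, 30)

mean(values)               # NA: the missing value makes the result unknown
mean(values, na.rm = TRUE) # 20: the NA is dropped before averaging
```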
If mutate() is used instead of summarise(), all columns and rows are retained:
For "storms", create two new columns ("avg_wind" and "avg_pressure") which have the average wind and pressure for each "name" and "status" group
```{r group_2, warning = FALSE}
storms %>%
group_by(name, status) %>%
mutate(avg_wind = mean(wind, na.rm = T),
avg_pressure = mean(pressure, na.rm = T))
```
Exercise:
For "diamonds", group by "cut" and "color" and find the average "price" and "carat" for each group
```{r group_3, exercise = TRUE}
diamonds %>%
group_by(, ) %>%
summarise(avg_price = ,
avg_carat = )
```
```{r group_3-solution}
diamonds %>%
group_by(cut, color) %>%
summarise(avg_price = mean(price, na.rm = T),
avg_carat = mean(carat, na.rm = T))
```
Exercise:
For "penguins" calculate the average bill length (from "bill_length_mm"), average bill depth (from "bill_depth_mm"), and average flipper length (from "flipper_length_mm") for each combination of species and island
```{r group_4, exercise = TRUE}
```
```{r group_4-solution}
penguins %>%
group_by(species, island) %>%
summarise(avg_bill_length = mean(bill_length_mm, na.rm = T),
avg_bill_depth = mean(bill_depth_mm, na.rm = T),
avg_flipper_length = mean(flipper_length_mm, na.rm = T))
```
### Summarising
There are other ways to create summaries.
Use the summarise_all() function to find the average "wind" and "pressure" for each group of "name" and "status" for "storms"
```{r summarise, warning = FALSE, echo = FALSE}
storms %>%
select(name, status, wind, pressure) %>%
group_by(name, status) %>%
summarise_all(mean, na.rm = T)
```
Use the summarise_at() function to find the average "wind" and "pressure" for each group of "name" and "status" for "storms"
```{r summarise_2, warning = FALSE, echo = FALSE}
storms %>%
group_by(name, status) %>%
summarise_at(c("wind", "pressure"),
mean, na.rm = T)
```
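In dplyr 1.0.0 and later, summarise_all() and summarise_at() still work but are documented as superseded by across(). The chunk below is a sketch of the across() form, assuming dplyr 1.0.0 or later is installed. It runs on a made-up data frame (`toy_summary_data` is hypothetical) so the results can be checked by hand.

```{r summarise_across_sketch, warning = FALSE, message = FALSE}
library(dplyr) # already loaded via tidyverse in the setup chunk

# Hypothetical toy data, not the real storms dataset
toy_summary_data <- data.frame(
  name = c("Amy", "Amy", "Bob"),
  status = c("hurricane", "hurricane", "tropical storm"),
  wind = c(30, 50, 40),
  pressure = c(1000, 990, 1005)
)

# across() applies the same function to each of the listed columns
with_across <- toy_summary_data %>%
  group_by(name, status) %>%
  summarise(across(c(wind, pressure), ~ mean(.x, na.rm = TRUE)))

# The summarise_at() form used above, for comparison
with_at <- toy_summary_data %>%
  group_by(name, status) %>%
  summarise_at(c("wind", "pressure"), mean, na.rm = TRUE)

with_across$wind     # 40 40
with_across$pressure # 995 1005
```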
Exercise:
For "diamonds", use summarise_all to find the average "price" and "carat" for each group of "cut" and "color"
```{r summarise_3, exercise = TRUE}
diamonds %>%
select(, , , ) %>%
group_by(, ) %>%
summarise_all()
```
```{r summarise_3-solution}
diamonds %>%
select(cut, color, price, carat) %>%
group_by(cut, color) %>%
summarise_all(mean, na.rm = T)
```
Exercise:
For "penguins" calculate the average bill length (from "bill_length_mm"), average bill depth (from "bill_depth_mm"), and average flipper length (from "flipper_length_mm") for each combination of species and island using the summarise_at function
```{r summarise_4, exercise = TRUE}
```
```{r summarise_4-solution}
penguins %>%
group_by(species, island) %>%
summarise_at(c("bill_length_mm",
"bill_depth_mm",
"flipper_length_mm"),
mean, na.rm = T)
```
To ungroup a dataframe use ungroup()
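A minimal sketch of what ungroup() changes, on made-up data (`toy_values` is hypothetical). While the data frame is grouped, mutate() calculates per group. After ungroup(), the same calculation uses all rows.

```{r ungroup_sketch, warning = FALSE}
library(dplyr) # already loaded via tidyverse in the setup chunk

# Hypothetical toy data
toy_values <- data.frame(group = c("a", "a", "b"),
                         value = c(1, 3, 10))

# Grouped: the mean is calculated within each group
grouped <- toy_values %>%
  group_by(group) %>%
  mutate(group_mean = mean(value))
grouped$group_mean # 2 2 10

# Ungrouped: the mean is calculated over every row (14 / 3)
ungrouped <- grouped %>%
  ungroup() %>%
  mutate(overall_mean = mean(value))

is_grouped_df(grouped)   # TRUE
is_grouped_df(ungrouped) # FALSE
```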
### Useful Summary Functions
mean, median, min, max, sum, sd (standard deviation), IQR (interquartile range), quantile, first, nth, and last are some of the summary functions that can be used in R.
For "storms", group by "name" and "status" and calculate a summary using the functions above for the "wind" column
```{r summary, warning = FALSE, echo = FALSE}
storms %>%
group_by(name, status) %>%
summarise(avg_wind = mean(wind, na.rm=T),
med_wind = median(wind, na.rm = T),
min_wind = min(wind, na.rm = T),
max_wind = max(wind, na.rm = T),
sum_wind = sum(wind, na.rm = T),
sd_wind = sd(wind, na.rm = T),
iqr_wind = IQR(wind, na.rm = T),
q25_wind = quantile(wind, 0.25, na.rm = T),
first_wind = first(wind),
fifth_wind = nth(wind, 5),
last_wind = last(wind))
```
Another useful function is n() which returns a count
For "storms" find the count for each group of "name" and "status"
```{r count, warning = FALSE, echo = FALSE}
storms %>%
group_by(name, status) %>%
summarise(count = n())
```
Exercise:
For "diamonds", for each group of "cut" and "color", find the min and max price, the 25th and 75th quantiles of carat, and a count for each group
```{r summary_1, exercise = TRUE}
diamonds %>%
group_by(cut, color) %>%
summarise(min_price = ,
max_price = ,
q25_carat = ,
q75_carat = ,
count = )
```
```{r summary_1-solution}
diamonds %>%
group_by(cut, color) %>%
summarise(min_price = min(price, na.rm = T),
max_price = max(price, na.rm = T),
q25_carat = quantile(carat, 0.25, na.rm = T),
q75_carat = quantile(carat, 0.75, na.rm = T),
count = n())
```
Exercise:
For "penguins", for each group of "species" and "island", find the mean and standard deviation for "body_mass_g", the first and last entries for "year", and a count for each group
```{r summary_2, exercise = TRUE}
```
```{r summary_2-solution}
penguins %>%
group_by(species, island) %>%
summarise(mean_body_mass = mean(body_mass_g, na.rm = T),
sd_body_mass = sd(body_mass_g, na.rm = T),
first_year = first(year),
last_year = last(year),
count = n())
```
Logical values can also be used with summary functions.
When x is logical, sum(x) gives the number of TRUE values in x and mean(x) gives the proportion of TRUE values
For "storms", group by "name" and find the number and proportion of values in "status" which are equal to "tropical depression"
```{r summary_3, warning = FALSE, echo = FALSE}
storms %>%
group_by(name) %>%
summarise(n_status = sum(status == "tropical depression", na.rm = T),
tropical_depression_prop = mean(status == "tropical depression", na.rm = T))
```
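This works because TRUE counts as 1 and FALSE as 0 in arithmetic. A minimal base R sketch on a made-up logical vector (`is_depression` is hypothetical):

```{r logical_summary_sketch, warning = FALSE}
# Hypothetical result of a comparison such as status == "tropical depression"
is_depression <- c(TRUE, FALSE, TRUE, FALSE)

sum(is_depression)  # 2: number of TRUE values
mean(is_depression) # 0.5: proportion of TRUE values
```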
Exercise:
For "diamonds", group by "cut" and find the number and proportion of "E"'s in "color"
```{r summary_4, exercise = TRUE}
diamonds %>%
group_by() %>%
summarise(e_number = ,
e_prop = )
```
```{r summary_4-solution}
diamonds %>%
group_by(cut) %>%
summarise(e_number = sum(color == "E", na.rm = T),
e_prop = mean(color == "E", na.rm = T))
```
Exercise:
For "penguins", for each group of "species" and "island" find the number and proportion of males
```{r summary_5, exercise = TRUE}
```
```{r summary_5-solution}
penguins %>%
group_by(species, island) %>%
summarise(number_males = sum(sex == "male", na.rm = T),
prop_males = mean(sex == "male", na.rm = T))
```