<!DOCTYPE html>
<html lang="" xml:lang="">
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<title>Codes for STEM</title>
<meta name="description" content="This is a collection of codes for analytics projects from import, wrangling, analyzing, visualizing, to reporting" />
<meta name="generator" content="bookdown #bookdown:version# and GitBook 2.6.7" />
<meta property="og:title" content="Codes for STEM" />
<meta property="og:type" content="book" />
<meta property="og:description" content="This is a collection of codes for analytics projects from import, wrangling, analyzing, visualizing, to reporting" />
<meta name="twitter:card" content="summary" />
<meta name="twitter:title" content="Codes for STEM" />
<meta name="twitter:description" content="This is a collection of codes for analytics projects from import, wrangling, analyzing, visualizing, to reporting" />
<meta name="author" content="Noushin Nabavi" />
<meta name="date" content="2020-09-27" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black" />
<!--bookdown:link_prev-->
<!--bookdown:link_next-->
<script src="libs/jquery-2.2.3/jquery.min.js"></script>
<link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-bookdown.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
<script src="libs/gitbook-2.6.7/js/app.min.js"></script>
<script src="libs/gitbook-2.6.7/js/lunr.js"></script>
<script src="libs/gitbook-2.6.7/js/clipboard.min.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
<script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-clipboard.js"></script>
<style type="text/css">
a.sourceLine { display: inline-block; line-height: 1.25; }
a.sourceLine { pointer-events: none; color: inherit; text-decoration: inherit; }
a.sourceLine:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode { white-space: pre; position: relative; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
code.sourceCode { white-space: pre-wrap; }
a.sourceLine { text-indent: -1em; padding-left: 1em; }
}
pre.numberSource a.sourceLine
{ position: relative; left: -4em; }
pre.numberSource a.sourceLine::before
{ content: attr(data-line-number);
position: relative; left: -1em; text-align: right; vertical-align: baseline;
border: none; pointer-events: all; display: inline-block;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
color: #aaaaaa;
}
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
div.sourceCode
{ }
@media screen {
a.sourceLine::before { text-decoration: underline; }
}
code span.al { color: #ff0000; font-weight: bold; } /* Alert */
code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code span.at { color: #7d9029; } /* Attribute */
code span.bn { color: #40a070; } /* BaseN */
code span.bu { } /* BuiltIn */
code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code span.ch { color: #4070a0; } /* Char */
code span.cn { color: #880000; } /* Constant */
code span.co { color: #60a0b0; font-style: italic; } /* Comment */
code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code span.do { color: #ba2121; font-style: italic; } /* Documentation */
code span.dt { color: #902000; } /* DataType */
code span.dv { color: #40a070; } /* DecVal */
code span.er { color: #ff0000; font-weight: bold; } /* Error */
code span.ex { } /* Extension */
code span.fl { color: #40a070; } /* Float */
code span.fu { color: #06287e; } /* Function */
code span.im { } /* Import */
code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
code span.kw { color: #007020; font-weight: bold; } /* Keyword */
code span.op { color: #666666; } /* Operator */
code span.ot { color: #007020; } /* Other */
code span.pp { color: #bc7a00; } /* Preprocessor */
code span.sc { color: #4070a0; } /* SpecialChar */
code span.ss { color: #bb6688; } /* SpecialString */
code span.st { color: #4070a0; } /* String */
code span.va { color: #19177c; } /* Variable */
code span.vs { color: #4070a0; } /* VerbatimString */
code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
</style>
<link rel="stylesheet" href="style.css" type="text/css" />
</head>
<body>
<!--bookdown:title:start-->
<div id="header">
<h1 class="title">Codes for STEM</h1>
<p class="author"><em>Noushin Nabavi</em></p>
<p class="date"><em>2020-09-27</em></p>
</div>
<!--bookdown:title:end-->
<!--bookdown:toc:start-->
<div class="book without-animation with-summary font-size-2 font-family-1" data-basepath=".">
<div class="book-summary">
<nav role="navigation">
<!--bookdown:toc2:start-->
<ul>
<li><a href="#coding-for-stem"><span class="toc-section-number">1</span> Coding for STEM</a></li>
<li><a href="#introduction"><span class="toc-section-number">2</span> Introduction</a></li>
<li><a href="#importing-data-into-r"><span class="toc-section-number">3</span> Importing data into R</a></li>
<li><a href="#useful-r-functions-examples"><span class="toc-section-number">4</span> Useful R Functions + Examples</a><ul>
<li><a href="#contents"><span class="toc-section-number">4.1</span> Contents</a></li>
<li><a href="#r-syntax"><span class="toc-section-number">4.2</span> R Syntax</a></li>
<li><a href="#functional-examples"><span class="toc-section-number">4.3</span> Functional examples</a></li>
</ul></li>
<li><a href="#demo-for-dplyr"><span class="toc-section-number">5</span> Demo for dplyr</a></li>
<li><a href="#working-with-dates-in-r"><span class="toc-section-number">6</span> Working with dates in R</a></li>
<li><a href="#demo-for-data.table"><span class="toc-section-number">7</span> Demo for Data.table</a></li>
<li><a href="#tests-for-experiments"><span class="toc-section-number">8</span> Tests for experiments</a></li>
<li><a href="#demo-for-ab-testing"><span class="toc-section-number">9</span> Demo for A/B testing</a></li>
<li><a href="#working-with-time-series-in-r"><span class="toc-section-number">10</span> Working with time-series in R</a></li>
<li><a href="#feature-engineering-in-r"><span class="toc-section-number">11</span> Feature engineering in R</a></li>
<li><a href="#impute-missingness"><span class="toc-section-number">12</span> Impute missingness</a></li>
<li><a href="#r-for-reporting"><span class="toc-section-number">13</span> R for Reporting</a><ul>
<li><a href="#usage-demonstrations"><span class="toc-section-number">13.1</span> Usage demonstrations</a><ul>
<li><a href="#inline-code"><span class="toc-section-number">13.1.1</span> Inline code</a></li>
<li><a href="#code-chunks"><span class="toc-section-number">13.1.2</span> Code chunks</a></li>
</ul></li>
<li><a href="#resources"><span class="toc-section-number">13.2</span> Resources</a></li>
</ul></li>
<li><a href="#resources">Resources</a><ul>
<li><a href="#beginner-resources-by-topic"><span class="toc-section-number">13.3</span> Beginner Resources by Topic</a><ul>
<li><a href="#getting-set-up-with-r-rstudio"><span class="toc-section-number">13.3.1</span> Getting Set-Up with R & RStudio</a></li>
<li><a href="#specialist-topics"><span class="toc-section-number">13.3.2</span> Specialist Topics</a></li>
<li><a href="#operational-basics"><span class="toc-section-number">13.3.3</span> Operational Basics</a></li>
</ul></li>
</ul></li>
</ul>
<!--bookdown:toc2:end-->
</nav>
</div>
<div class="book-body">
<div class="body-inner">
<div class="book-header" role="navigation">
<h1>
<i class="fa fa-circle-o-notch fa-spin"></i><a href="./">Codes for STEM</a>
</h1>
</div>
<div class="page-wrapper" tabindex="-1" role="main">
<div class="page-inner">
<section class="normal" id="section-">
<!--bookdown:toc:end-->
<!--bookdown:body:start-->
<div id="coding-for-stem" class="section level1">
<h1><span class="header-section-number">1</span> Coding for STEM</h1>
<blockquote>
<p>The tools and capabilities of data science are changing every day!</p>
</blockquote>
<p>This is how I understand it today:</p>
<p><strong>Data can:</strong></p>
<ul>
<li>Describe the current state of an organization or process</li>
<li>Detect anomalous events</li>
<li>Diagnose the causes of events and behaviors</li>
<li>Predict future events</li>
</ul>
<p><strong>Data science workflows can be developed for:</strong></p>
<ul>
<li>Data collection and management</li>
<li>Exploration and visualization</li>
<li>Experimentation and prediction</li>
</ul>
<p><strong>Applications of data science can include:</strong></p>
<ul>
<li>Traditional machine learning: e.g. finding the probabilities of events using labeled data and algorithms</li>
<li>Deep learning: layers of neurons work together for image and natural language recognition, but this requires more training data</li>
<li>Internet of Things (IoT): e.g. smartwatch algorithms that detect and analyze motion-sensor data</li>
</ul>
<p><strong>Data science teams can consist of:</strong></p>
<ul>
<li>Data engineers: SQL, Java, Scala, Python</li>
<li>Data analysts: dashboards, hypothesis tests, and visualization using spreadsheets, SQL, and BI tools (Tableau, Power BI, Looker)</li>
<li>Machine learning scientists: predictions, extrapolations, classification, etc., using R or Python</li>
<li>Data employees can be isolated, embedded, or hybrid</li>
</ul>
<p>Data use can come with the risk of identifying personal information. Policies for personally identifiable information may need to consider:</p>
<ul>
<li>Sensitivity and caution</li>
<li>Pseudonymization and anonymization</li>
</ul>
<p>Preferences can be stated or revealed through the data, so questions need to be specific, avoid loaded language, be calibrated, and yield actionable results.</p>
<p><strong>Data storage and retrieval may include:</strong></p>
<ul>
<li>Parallel storage solutions (e.g. a cluster or server)</li>
<li>Cloud storage (Google, Amazon, Azure)</li>
<li>Types of data: 1) unstructured (email, text, video, audio, web, and social media), stored in document databases; 2) structured, stored in relational databases</li>
<li>Data querying: NoSQL and SQL</li>
</ul>
<p><strong>Communication of data can include:</strong></p>
<ul>
<li>Dashboards</li>
<li>Markdown documents</li>
<li>BI tools</li>
<li>R Shiny or D3.js</li>
</ul>
<p><strong>Team management around data can use:</strong></p>
<ul>
<li>Trello, Slack, Rocket.Chat, or Jira to communicate due dates and priorities</li>
</ul>
<p><strong>A/B testing:</strong></p>
<ul>
<li>Control and variation groups in samples</li>
<li>Four steps in A/B testing: pick a metric to track, calculate the sample size, run the experiment, and check significance</li>
</ul>
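<p>The four steps can be sketched in base R. This is only an illustration, assuming the metric is a conversion rate; the baseline rate, the target rate, and the simulated visitor data are made-up values, not results from a real experiment.</p>

```r
# Step 1: the metric is a conversion rate (share of visitors who convert).
baseline_rate <- 0.10  # assumed conversion rate of the control
target_rate   <- 0.12  # assumed rate we hope the variation reaches

# Step 2: sample size per group for 80% power at a 5% significance level.
power_calc <- power.prop.test(p1 = baseline_rate, p2 = target_rate,
                              sig.level = 0.05, power = 0.80)
n_per_group <- ceiling(power_calc$n)

# Step 3: "run" the experiment (simulated here instead of collecting real data).
set.seed(42)
control_conversions   <- rbinom(1, size = n_per_group, prob = baseline_rate)
variation_conversions <- rbinom(1, size = n_per_group, prob = target_rate)

# Step 4: check significance with a two-sample test of proportions.
ab_test <- prop.test(x = c(control_conversions, variation_conversions),
                     n = c(n_per_group, n_per_group))
ab_test$p.value
```

<p>A p-value below the chosen significance level (0.05 here) would suggest that the difference between control and variation is unlikely to be due to chance alone.</p>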
<p>Machine learning (ML) can be used for time series forecasting (investigating seasonality on any time scale), natural language processing (word counts and word embeddings to create features that group similar words), neural networks, deep learning, and AI.</p>
<p><strong>Learning can be classified into:</strong></p>
<p><em>Supervised</em>: labels and features; the model is evaluated on train and test data, with applications in:</p>
<ul>
<li>Recommendation systems</li>
<li>Subscription predictions</li>
<li>Email subject optimization</li>
</ul>
<p><em>Unsupervised</em>: unlabeled data with only features, with applications in:</p>
<ul>
<li>Clustering</li>
</ul>
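<p>Evaluating a supervised model on train and test data can be sketched as follows. This is a minimal example, assuming the built-in mtcars data set, a 70/30 split, and a linear model; all three are illustrative choices, not a recommendation from the text.</p>

```r
# Built-in mtcars data: mpg is the label, wt and hp are the features.
set.seed(1)
train_rows <- sample(nrow(mtcars), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

# Fit the model on the training rows only.
model <- lm(mpg ~ wt + hp, data = train)

# Evaluate on rows the model has not seen.
predictions <- predict(model, newdata = test)
rmse <- sqrt(mean((test$mpg - predictions)^2))
rmse
```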
<p><strong>Deep learning and AI requirements:</strong></p>
<ul>
<li>Prediction is more feasible than explanation</li>
<li>A very large amount of training data</li>
</ul>
<!--chapter:end:index.Rmd-->
</div>
<div id="introduction" class="section level1">
<h1><span class="header-section-number">2</span> Introduction</h1>
<p>This is an introduction</p>
<!--chapter:end:01-intro.Rmd-->
</div>
<div id="importing-data-into-r" class="section level1">
<h1><span class="header-section-number">3</span> Importing data into R</h1>
<p>Working with Excel, CSV, TXT, and TSV files in R.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" data-line-number="1"><span class="kw">library</span>(readr) </a>
<a class="sourceLine" id="cb1-2" data-line-number="2"><span class="kw">library</span>(data.table)</a>
<a class="sourceLine" id="cb1-3" data-line-number="3"><span class="kw">library</span>(readxl)</a>
<a class="sourceLine" id="cb1-4" data-line-number="4"><span class="kw">library</span>(gdata)</a>
<a class="sourceLine" id="cb1-5" data-line-number="5"><span class="kw">library</span>(httr)</a>
<a class="sourceLine" id="cb1-6" data-line-number="6"><span class="kw">library</span>(rvest)</a>
<a class="sourceLine" id="cb1-7" data-line-number="7"><span class="kw">library</span>(xml2)</a>
<a class="sourceLine" id="cb1-8" data-line-number="8"><span class="kw">library</span>(rlist)</a>
<a class="sourceLine" id="cb1-9" data-line-number="9"><span class="kw">library</span>(jsonlite)</a>
<a class="sourceLine" id="cb1-10" data-line-number="10"><span class="kw">library</span>(dplyr)</a></code></pre></div>
<p>Import a CSV file:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">pools <- read.csv("swimming_pools.csv", stringsAsFactors = FALSE)</code></pre></div>
<p>With stringsAsFactors, you tell R whether it should convert strings in the flat file to factors.</p>
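<p>The effect of stringsAsFactors can be seen on a small example. The CSV content below is made up for illustration and is passed through the text argument, so no file is needed.</p>

```r
# Made-up CSV content; text = lets read.csv() parse a string instead of a file.
csv_text <- "name,city\nPool A,Brisbane\nPool B,Ipswich"

as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_strings$city)  # "character"
class(as_factors$city)  # "factor"
```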
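<p>Import a TXT file with read.delim():</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">hotdogs <- read.delim("hotdogs.txt", header = FALSE)</code></pre></div>
<p>Import a TXT file with read.table():</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Assumed: path points to the tab-delimited hotdogs.txt used above
path <- "hotdogs.txt"
hotdogs <- read.table(path,
                      sep = "\t",
                      col.names = c("type", "calories", "sodium"))</code></pre></div>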
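<p>Import with readr functions; read_csv(), read_tsv(), and read_delim() are part of this package.</p>
<p>Column names can be specified before import:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">properties <- c("area", "temp", "size", "storage", "method",
                "texture", "flavor", "moistness")</code></pre></div>
<p>Import potatoes.txt:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">potatoes <- read_tsv("potatoes.txt", col_names = properties)</code></pre></div>
<p>Import potatoes.txt using read_delim():</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">potatoes <- read_delim("potatoes.txt", delim = "\t", col_names = properties)</code></pre></div>
<p>Import 5 observations from potatoes.txt:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">potatoes_fragment <- read_tsv("potatoes.txt", skip = 6, n_max = 5, col_names = properties)</code></pre></div>
<p>Import all data, but force all columns to be character:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">potatoes_char <- read_tsv("potatoes.txt", col_types = "cccccccc", col_names = properties)</code></pre></div>
<p>Import without col_types:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">hotdogs <- read_tsv("hotdogs.txt", col_names = c("type", "calories", "sodium"))</code></pre></div>
<p>The collectors you will need to import the data:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">fac <- col_factor(levels = c("Beef", "Meat", "Poultry"))
int <- col_integer()</code></pre></div>
<p>Set the col_types argument to import the data correctly:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">hotdogs_factor <- read_tsv("hotdogs.txt",
                           col_names = c("type", "calories", "sodium"),
                           col_types = list(fac, int, int))</code></pre></div>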
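<p>To see the collectors at work without the original file, the sketch below writes a few made-up rows to a temporary file and reads them back. The rows and the name hotdogs_demo are assumptions for illustration only.</p>

```r
library(readr)

# Write a tiny tab-separated file to a temporary location (made-up rows).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Beef\t186\t495", "Meat\t173\t458", "Poultry\t129\t430"), tmp)

# Collectors are matched to columns by position.
hotdogs_demo <- read_tsv(tmp,
                         col_names = c("type", "calories", "sodium"),
                         col_types = list(col_factor(levels = c("Beef", "Meat", "Poultry")),
                                          col_integer(),
                                          col_integer()))

# Inspect the resulting column classes.
sapply(hotdogs_demo, class)
```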
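<p>Import potatoes.csv with fread() from data.table:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">potatoes <- fread("potatoes.csv")</code></pre></div>
<p>Import columns 6 and 8 of potatoes.csv:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">potatoes <- fread("potatoes.csv", select = c(6, 8))</code></pre></div>
<p>Plot texture (x) and moistness (y) of potatoes:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">plot(potatoes$texture, potatoes$moistness)</code></pre></div>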
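<p>The select argument of fread() can take column positions or column names. A small sketch, assuming made-up CSV content passed through the text argument instead of potatoes.csv:</p>

```r
library(data.table)

# Made-up CSV content, parsed from a string rather than a file.
csv_text <- "area,temp,texture,moistness\n1,1,2.9,3.0\n1,2,2.3,2.5\n2,1,3.1,2.9"

# select accepts column positions or column names.
by_position <- fread(text = csv_text, select = c(3, 4))
by_name     <- fread(text = csv_text, select = c("texture", "moistness"))

names(by_position)
identical(names(by_position), names(by_name))
```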
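<p>Print the names of all worksheets using the readxl package:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">excel_sheets("urbanpop.xlsx")</code></pre></div>
<p>Read the sheets one by one:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">pop_1 <- read_excel("urbanpop.xlsx", sheet = 1)
pop_2 <- read_excel("urbanpop.xlsx", sheet = 2)
pop_3 <- read_excel("urbanpop.xlsx", sheet = 3)</code></pre></div>
<p>Put pop_1, pop_2, and pop_3 in a list:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">pop_list <- list(pop_1, pop_2, pop_3)</code></pre></div>
<p>Read all Excel sheets with lapply():</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">pop_list <- lapply(excel_sheets("urbanpop.xlsx"), read_excel, path = "urbanpop.xlsx")</code></pre></div>
<p>Import the first sheet of urbanpop_nonames.xlsx and let R assign the column names:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">pop_a <- read_excel("urbanpop_nonames.xlsx", col_names = FALSE)</code></pre></div>
<p>Import the first sheet of urbanpop_nonames.xlsx and specify col_names:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">cols <- c("country", paste0("year_", 1960:1966))
pop_b <- read_excel("urbanpop_nonames.xlsx", col_names = cols)</code></pre></div>
<p>Import the second sheet of urbanpop.xlsx, skipping the first 21 rows:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">urbanpop_sel <- read_excel("urbanpop.xlsx", sheet = 2, col_names = FALSE, skip = 21)</code></pre></div>
<p>Print the first observation from urbanpop_sel:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">urbanpop_sel[1, ]</code></pre></div>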
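<p>Import a local file with the gdata package. Similar to the readxl package, read.xls() imports a single sheet from an Excel file to start your analysis in R.</p>
<p>Import the second sheet of urbanpop.xls:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">urban_pop <- read.xls("urbanpop.xls", sheet = "1967-1974")</code></pre></div>
<p>Print the first 11 observations using head():</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">head(urban_pop, n = 11)</code></pre></div>
<p>Column names for urban_pop:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">columns <- c("country", paste0("year_", 1967:1974))</code></pre></div>
<p>Import the second sheet again, skipping the first 50 rows and supplying the column names:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">urban_pop <- read.xls("urbanpop.xls", sheet = 2,
                      skip = 50, header = FALSE, stringsAsFactors = FALSE,
                      col.names = columns)</code></pre></div>
<p>Import all sheets from urbanpop.xls:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">path <- "urbanpop.xls"
urban_sheet1 <- read.xls(path, sheet = 1, stringsAsFactors = FALSE)
urban_sheet2 <- read.xls(path, sheet = 2, stringsAsFactors = FALSE)
urban_sheet3 <- read.xls(path, sheet = 3, stringsAsFactors = FALSE)</code></pre></div>
<p>Combine the three sheets with cbind(), dropping the repeated first column of sheets 2 and 3:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">urban <- cbind(urban_sheet1, urban_sheet2[-1], urban_sheet3[-1])</code></pre></div>
<p>Remove all rows with NAs from urban:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">urban_clean <- na.omit(urban)</code></pre></div>
<p>Print a summary of urban_clean:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">summary(urban_clean)</code></pre></div>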
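<p>The cbind() and na.omit() steps can be tried on two small data frames. The values and names below are made up for illustration and stand in for the Excel sheets.</p>

```r
# Two made-up "sheets" that share the same first column.
sheet_a <- data.frame(country = c("A", "B", "C"),
                      year_1960 = c(10, NA, 30))
sheet_b <- data.frame(country = c("A", "B", "C"),
                      year_1967 = c(15, 25, 35))

# sheet_b[-1] drops the duplicated country column before binding.
combined <- cbind(sheet_a, sheet_b[-1])

# na.omit() removes every row that contains at least one NA.
combined_clean <- na.omit(combined)
combined_clean
```

<p>Note that cbind() matches rows by position, not by country, so the sheets must list the countries in the same order.</p>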
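<p>When working with the XLConnect package, the first step is to load a workbook into your R session with loadWorkbook(); this function builds a “bridge” between your Excel file and your R session.</p>
<p>Build a connection to urbanpop.xlsx:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(XLConnect)
my_book <- loadWorkbook("urbanpop.xlsx")</code></pre></div>
<p>List the sheets in my_book:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">getSheets(my_book)</code></pre></div>
<p>Import the second sheet in my_book:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">readWorksheet(my_book, sheet = 2)</code></pre></div>
<p>Import columns 3, 4, and 5 from the second sheet in my_book:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">urbanpop_sel <- readWorksheet(my_book, sheet = 2, startCol = 3, endCol = 5)</code></pre></div>
<p>Import the first column from the second sheet in my_book:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">countries <- readWorksheet(my_book, sheet = 2, startCol = 1, endCol = 1)</code></pre></div>
<p>Combine countries and urbanpop_sel with cbind():</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">selection <- cbind(countries, urbanpop_sel)</code></pre></div>
<p>Add a worksheet named “data_summary” to my_book:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">createSheet(my_book, "data_summary")</code></pre></div>
<p>Use getSheets() on my_book to confirm the new sheet:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">getSheets(my_book)</code></pre></div>
<p>Create a data frame with the dimensions of the first three sheets:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">sheets <- getSheets(my_book)[1:3]
dims <- sapply(sheets, function(x) dim(readWorksheet(my_book, sheet = x)), USE.NAMES = FALSE)
summ <- data.frame(sheets = sheets,
                   nrows = dims[1, ],
                   ncols = dims[2, ])</code></pre></div>
<p>Add the data in summ to the “data_summary” sheet:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">writeWorksheet(my_book, summ, "data_summary")</code></pre></div>
<p>Rename the “data_summary” sheet to “summary”:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">renameSheet(my_book, "data_summary", "summary")</code></pre></div>
<p>Remove the fourth sheet:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">removeSheet(my_book, 4)</code></pre></div>
<p>Save the workbook to “renamed.xlsx”:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">saveWorkbook(my_book, file = "renamed.xlsx")</code></pre></div>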
<p>Import files directly from the web. The URLs are just normal strings, so they can be passed straight to read.csv() and read.delim():</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb2-1" data-line-number="1">csv_url <-<span class="st"> "http://s3.amazonaws.com/assets.datacamp.com/production/course_1561/datasets/chickwts.csv"</span></a>
<a class="sourceLine" id="cb2-2" data-line-number="2">tsv_url <-<span class="st"> "http://s3.amazonaws.com/assets.datacamp.com/production/course_3026/datasets/tsv_data.tsv"</span></a>
<a class="sourceLine" id="cb2-3" data-line-number="3"></a>
<a class="sourceLine" id="cb2-4" data-line-number="4"><span class="co"># Read a file in from the CSV URL and assign it to csv_data</span></a>
<a class="sourceLine" id="cb2-5" data-line-number="5">csv_data <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="dt">file =</span> csv_url)</a>
<a class="sourceLine" id="cb2-6" data-line-number="6"></a>
<a class="sourceLine" id="cb2-7" data-line-number="7"><span class="co"># Read a file in from the TSV URL and assign it to tsv_data</span></a>
<a class="sourceLine" id="cb2-8" data-line-number="8">tsv_data <-<span class="st"> </span><span class="kw">read.delim</span>(<span class="dt">file =</span> tsv_url)</a>
<a class="sourceLine" id="cb2-9" data-line-number="9"></a>
<a class="sourceLine" id="cb2-10" data-line-number="10"><span class="co"># Examine the objects with head()</span></a>
<a class="sourceLine" id="cb2-11" data-line-number="11"><span class="kw">head</span>(csv_data, <span class="dt">n =</span> <span class="dv">2</span>)</a></code></pre></div>
<pre><code>## weight feed
## 1 179 horsebean
## 2 160 horsebean</code></pre>
<div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb4-1" data-line-number="1"><span class="kw">head</span>(tsv_data, <span class="dt">n =</span> <span class="dv">2</span>)</a></code></pre></div>
<pre><code>## weight feed
## 1 179 horsebean
## 2 160 horsebean</code></pre>
<p>Download the file with download.file()</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb6-1" data-line-number="1"><span class="kw">download.file</span>(<span class="dt">url =</span> csv_url, <span class="dt">destfile =</span> <span class="st">"feed_data.csv"</span>)</a>
<a class="sourceLine" id="cb6-2" data-line-number="2"></a>
<a class="sourceLine" id="cb6-3" data-line-number="3"><span class="co"># Read it in with read.csv()</span></a>
<a class="sourceLine" id="cb6-4" data-line-number="4">csv_data <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="dt">file =</span> <span class="st">"feed_data.csv"</span>)</a>
<a class="sourceLine" id="cb6-5" data-line-number="5"></a>
<a class="sourceLine" id="cb6-6" data-line-number="6"></a>
<a class="sourceLine" id="cb6-7" data-line-number="7"><span class="co"># Add a new column: square_weight</span></a>
<a class="sourceLine" id="cb6-8" data-line-number="8">csv_data<span class="op">$</span>square_weight <-<span class="st"> </span>(csv_data<span class="op">$</span>weight <span class="op">^</span><span class="st"> </span><span class="dv">2</span>)</a></code></pre></div>
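<p>Save it to disk with saveRDS():</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">saveRDS(object = csv_data, file = "modified_feed_data.RDS")</code></pre></div>
<p>Read it back in with readRDS():</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">modified_feed_data <- readRDS(file = "modified_feed_data.RDS")</code></pre></div>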
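<p>A quick round trip shows that an RDS file preserves the object exactly, including column types. The data frame below is a made-up stand-in for csv_data and is written to a temporary file.</p>

```r
# Made-up stand-in for csv_data, with a factor column.
feed_demo <- data.frame(weight = c(179, 160),
                        feed = factor(c("horsebean", "horsebean")))

rds_path <- tempfile(fileext = ".RDS")
saveRDS(object = feed_demo, file = rds_path)
restored <- readRDS(file = rds_path)

# The restored object is identical to the original, factor column included.
identical(feed_demo, restored)
```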
<p>Using data from API clients</p>
<p>Example 1: load the pageviews package, an API client for Wikipedia page-view data.</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb7-1" data-line-number="1"><span class="kw">library</span>(pageviews)</a>
<a class="sourceLine" id="cb7-2" data-line-number="2"></a>
<a class="sourceLine" id="cb7-3" data-line-number="3"><span class="co"># Get the pageviews for "Hadley Wickham"</span></a>
<a class="sourceLine" id="cb7-4" data-line-number="4">hadley_pageviews <-<span class="st"> </span><span class="kw">article_pageviews</span>(<span class="dt">project =</span> <span class="st">"en.wikipedia"</span>, <span class="dt">article =</span> <span class="st">"Hadley Wickham"</span>)</a>
<a class="sourceLine" id="cb7-5" data-line-number="5"></a>
<a class="sourceLine" id="cb7-6" data-line-number="6"><span class="co"># Examine the resulting object</span></a>
<a class="sourceLine" id="cb7-7" data-line-number="7"><span class="kw">str</span>(hadley_pageviews)</a></code></pre></div>
<pre><code>## 'data.frame': 1 obs. of 8 variables:
## $ project : chr "wikipedia"
## $ language : chr "en"
## $ article : chr "Hadley_Wickham"
## $ access : chr "all-access"
## $ agent : chr "all-agents"
## $ granularity: chr "daily"
## $ date : POSIXct, format: "2015-10-01"
## $ views : num 53</code></pre>
<p>Load the httr package:</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb9-1" data-line-number="1"><span class="kw">library</span>(httr)</a>
<a class="sourceLine" id="cb9-2" data-line-number="2"></a>
<a class="sourceLine" id="cb9-3" data-line-number="3"><span class="co"># Make a GET request to http://httpbin.org/get</span></a>
<a class="sourceLine" id="cb9-4" data-line-number="4">get_result <-<span class="st"> </span><span class="kw">GET</span>(<span class="dt">url =</span> <span class="st">"http://httpbin.org/get"</span>)</a>
<a class="sourceLine" id="cb9-5" data-line-number="5"></a>
<a class="sourceLine" id="cb9-6" data-line-number="6"><span class="co"># Print it to inspect it</span></a>
<a class="sourceLine" id="cb9-7" data-line-number="7">get_result</a></code></pre></div>
<pre><code>## Response [http://httpbin.org/get]
## Date: 2020-09-28 03:14
## Status: 200
## Content-Type: application/json
## Size: 365 B
## {
## "args": {},
## "headers": {
## "Accept": "application/json, text/xml, application/xml, */*",
## "Accept-Encoding": "deflate, gzip",
## "Host": "httpbin.org",
## "User-Agent": "libcurl/7.54.0 r-curl/4.3 httr/1.4.2",
## "X-Amzn-Trace-Id": "Root=1-5f715508-67454af2eaac626ad9a751a8"
## },
## "origin": "99.229.26.120",
## ...</code></pre>
<p>Make a POST request to <a href="http://httpbin.org/post" class="uri">http://httpbin.org/post</a> with the body “this is a test”</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb11-1" data-line-number="1">post_result <-<span class="st"> </span><span class="kw">POST</span>(<span class="dt">url =</span> <span class="st">"http://httpbin.org/post"</span>, <span class="dt">body =</span> <span class="st">"this is a test"</span>)</a>
<a class="sourceLine" id="cb11-2" data-line-number="2"></a>
<a class="sourceLine" id="cb11-3" data-line-number="3"><span class="co"># Print it to inspect it</span></a>
<a class="sourceLine" id="cb11-4" data-line-number="4">post_result</a></code></pre></div>
<pre><code>## Response [http://httpbin.org/post]
## Date: 2020-09-28 03:14
## Status: 200
## Content-Type: application/json
## Size: 472 B
## {
## "args": {},
## "data": "this is a test",
## "files": {},
## "form": {},
## "headers": {
## "Accept": "application/json, text/xml, application/xml, */*",
## "Accept-Encoding": "deflate, gzip",
## "Content-Length": "14",
## "Host": "httpbin.org",
## ...</code></pre>
<p>Handling HTTP failures: make a GET request to a URL that does not exist and save the result:</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb13-1" data-line-number="1">fake_url <-<span class="st"> "http://google.com/fakepagethatdoesnotexist"</span></a>
<a class="sourceLine" id="cb13-2" data-line-number="2"></a>
<a class="sourceLine" id="cb13-3" data-line-number="3"><span class="co"># Make the GET request</span></a>
<a class="sourceLine" id="cb13-4" data-line-number="4">request_result <-<span class="st"> </span><span class="kw">GET</span>(fake_url)</a></code></pre></div>
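<p>One way to act on a failed request is to check the response with http_error() before using it. The helper below is a hypothetical sketch (fetch_or_null is not part of httr); it returns NULL both when the connection fails and when the server answers with an error status.</p>

```r
library(httr)

# Hypothetical helper: return the parsed body, or NULL if the request fails.
fetch_or_null <- function(url) {
  # tryCatch() covers network-level failures (e.g. no connection).
  resp <- tryCatch(GET(url), error = function(e) NULL)
  if (is.null(resp)) {
    return(NULL)
  }
  # http_error() is TRUE for status codes of 400 and above.
  if (http_error(resp)) {
    warning("Request failed with status ", status_code(resp))
    return(NULL)
  }
  content(resp)
}

# http_error() also accepts a bare status code, which makes the rule easy to check.
http_error(404L)  # TRUE
http_error(200L)  # FALSE
```

<p>With this helper, fetch_or_null(fake_url) would be expected to return NULL with a warning, provided the server responds with an error status for that page.</p>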
<p>A start-to-finish example using the httr package, beginning with the API URL:</p>
<div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb14-1" data-line-number="1">base_url <-<span class="st"> "https://en.wikipedia.org/w/api.php"</span></a>
<a class="sourceLine" id="cb14-2" data-line-number="2"></a>
<a class="sourceLine" id="cb14-3" data-line-number="3"><span class="co"># Set query parameters</span></a>
<a class="sourceLine" id="cb14-4" data-line-number="4">query_params <-<span class="st"> </span><span class="kw">list</span>(<span class="dt">action =</span> <span class="st">"parse"</span>, </a>
<a class="sourceLine" id="cb14-5" data-line-number="5"> <span class="dt">page =</span> <span class="st">"Hadley Wickham"</span>, </a>
<a class="sourceLine" id="cb14-6" data-line-number="6"> <span class="dt">format =</span> <span class="st">"xml"</span>)</a></code></pre></div>
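<p>The remaining steps: read the page contents as HTML (<code>library(rvest)</code>), extract the page name element from the infobox (<code>library(xml2)</code>), and create a data frame for the full name.
For reproducibility, these steps can be wrapped in a function:</p>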
<div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb15-1" data-line-number="1">get_infobox <-<span class="st"> </span><span class="cf">function</span>(title){</a>
<a class="sourceLine" id="cb15-2" data-line-number="2"> base_url <-<span class="st"> "https://en.wikipedia.org/w/api.php"</span></a>
<a class="sourceLine" id="cb15-3" data-line-number="3"> </a>
<a class="sourceLine" id="cb15-4" data-line-number="4"><span class="co"># Change "Hadley Wickham" to title</span></a>
<a class="sourceLine" id="cb15-5" data-line-number="5"></a>
<a class="sourceLine" id="cb15-6" data-line-number="6">query_params <-<span class="st"> </span><span class="kw">list</span>(<span class="dt">action =</span> <span class="st">"parse"</span>, </a>
<a class="sourceLine" id="cb15-7" data-line-number="7"> <span class="dt">page =</span> title, </a>
<a class="sourceLine" id="cb15-8" data-line-number="8"> <span class="dt">format =</span> <span class="st">"xml"</span>)</a>
<a class="sourceLine" id="cb15-9" data-line-number="9"> </a>
<a class="sourceLine" id="cb15-10" data-line-number="10">resp <-<span class="st"> </span><span class="kw">GET</span>(<span class="dt">url =</span> base_url, <span class="dt">query =</span> query_params)</a>
<a class="sourceLine" id="cb15-11" data-line-number="11">resp_xml <-<span class="st"> </span><span class="kw">content</span>(resp)</a>
<a class="sourceLine" id="cb15-12" data-line-number="12"> </a>
<a class="sourceLine" id="cb15-13" data-line-number="13">page_html <-<span class="st"> </span><span class="kw">read_html</span>(<span class="kw">xml_text</span>(resp_xml))</a>
<a class="sourceLine" id="cb15-14" data-line-number="14">infobox_element <-<span class="st"> </span><span class="kw">html_node</span>(<span class="dt">x =</span> page_html, <span class="dt">css =</span><span class="st">".infobox"</span>)</a>
<a class="sourceLine" id="cb15-15" data-line-number="15">page_name <-<span class="st"> </span><span class="kw">html_node</span>(<span class="dt">x =</span> infobox_element, <span class="dt">css =</span> <span class="st">".fn"</span>)</a>
<a class="sourceLine" id="cb15-16" data-line-number="16"></a>
<a class="sourceLine" id="cb15-17" data-line-number="17"><span class="co"># Return the page name element; the closing brace belongs here so the whole body is inside the function</span></a>
<a class="sourceLine" id="cb15-18" data-line-number="18">page_name</a>
<a class="sourceLine" id="cb15-19" data-line-number="19">}</a></code></pre></div>
<p>Construct a directory-based API URL to <code>http://swapi.co/api</code>,
looking for person <code>1</code> in <code>people</code>:</p>
<div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb16-1" data-line-number="1">directory_url <-<span class="st"> </span><span class="kw">paste</span>(<span class="st">"http://swapi.co/api"</span>, <span class="st">"people"</span>, <span class="st">"1"</span>, <span class="dt">sep =</span> <span class="st">"/"</span>)</a>
<a class="sourceLine" id="cb16-2" data-line-number="2"></a>
<a class="sourceLine" id="cb16-3" data-line-number="3"><span class="co"># Make a GET call with it</span></a>
<a class="sourceLine" id="cb16-4" data-line-number="4">result <-<span class="st"> </span><span class="kw">GET</span>(directory_url)</a>
<a class="sourceLine" id="cb16-5" data-line-number="5"></a>
<a class="sourceLine" id="cb16-6" data-line-number="6"><span class="co"># Create list with nationality and country elements</span></a>
<a class="sourceLine" id="cb16-7" data-line-number="7">query_params <-<span class="st"> </span><span class="kw">list</span>(<span class="dt">nationality =</span> <span class="st">"americans"</span>, </a>
<a class="sourceLine" id="cb16-8" data-line-number="8"> <span class="dt">country =</span> <span class="st">"antigua"</span>)</a>
<a class="sourceLine" id="cb16-9" data-line-number="9"></a>
<a class="sourceLine" id="cb16-10" data-line-number="10"><span class="co"># Make parameter-based call to httpbin, with query_params</span></a>
<a class="sourceLine" id="cb16-11" data-line-number="11">parameter_response <-<span class="st"> </span><span class="kw">GET</span>(<span class="st">"https://httpbin.org/get"</span>, <span class="dt">query =</span> query_params)</a>
<a class="sourceLine" id="cb16-12" data-line-number="12"></a>
<a class="sourceLine" id="cb16-13" data-line-number="13"><span class="co"># Print parameter_response</span></a>
<a class="sourceLine" id="cb16-14" data-line-number="14">parameter_response</a></code></pre></div>
<pre><code>## Response [https://httpbin.org/get?nationality=americans&country=antigua]
## Date: 2020-09-28 03:14
## Status: 200
## Content-Type: application/json
## Size: 465 B
## {
## "args": {
## "country": "antigua",
## "nationality": "americans"
## },
## "headers": {
## "Accept": "application/json, text/xml, application/xml, */*",
## "Accept-Encoding": "deflate, gzip",
## "Host": "httpbin.org",
## "User-Agent": "libcurl/7.54.0 r-curl/4.3 httr/1.4.2",
## ...</code></pre>
<p>Using user agents
Informative user-agents are a good way of being respectful of the developers running the API you’re interacting with. They make it easy for them to contact you in the event something goes wrong. I always try to include
my email address and a URL for the project the code is a part of, if it’s got a URL.</p>
<p>Do not change the url:</p>
<div class="sourceCode" id="cb18"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb18-1" data-line-number="1">url <-<span class="st"> "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Aaron_Halfaker/daily/2015100100/2015103100"</span></a></code></pre></div>
<p>Add the email address and the test sentence inside <code>user_agent()</code>:
<code>server_response <- GET(url, user_agent("my@email.address this is a test"))</code></p>
<p>Rate-limiting
The next stage of respectful API usage is rate-limiting: making sure you only make a certain number of requests to the server in a given time period.
Your limit will vary from server to server, but the implementation is always pretty much the same and involves a call to Sys.sleep().
This function takes one argument, a number, which represents the number of seconds to “sleep” (pause) the R session for.
So if you call Sys.sleep(15), it’ll pause for 15 seconds before allowing further code to run.</p>
<p>Construct a vector of 2 URLs, then loop over it, sending a GET request to each URL and delaying for 5 seconds between requests:
<code>for(url in urls){ result <- GET(url); Sys.sleep(5) }</code></p>
<div class="sourceCode" id="cb19"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb19-1" data-line-number="1">urls <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"http://httpbin.org/status/404"</span>,</a>
<a class="sourceLine" id="cb19-2" data-line-number="2"> <span class="st">"http://httpbin.org/status/301"</span>)</a></code></pre></div>
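<p>Put together, the rate-limited loop might look like the sketch below. It assumes httr is installed and the machine is online; the <code>message()</code> call is an addition for illustration only, so that each request leaves a visible trace:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(httr)

urls <- c("http://httpbin.org/status/404",
          "http://httpbin.org/status/301")

for (url in urls) {
  # Send a GET request to url
  result <- GET(url)
  # Report the status code (illustrative addition)
  message(url, " returned status ", status_code(result))
  # Delay for 5 seconds between requests
  Sys.sleep(5)
}</code></pre></div>
<p>The pause sits inside the loop, so the session waits after every request rather than once at the end.</p>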
<p>Tying it all together:</p>
<div class="sourceCode" id="cb20"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb20-1" data-line-number="1">get_pageviews <-<span class="st"> </span><span class="cf">function</span>(article_title){</a>
<a class="sourceLine" id="cb20-2" data-line-number="2"> url <-<span class="st"> </span><span class="kw">paste</span>(</a>
<a class="sourceLine" id="cb20-3" data-line-number="3"> <span class="st">"https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents"</span>, </a>
<a class="sourceLine" id="cb20-4" data-line-number="4"> article_title, </a>
<a class="sourceLine" id="cb20-5" data-line-number="5"> <span class="st">"daily/2015100100/2015103100"</span>, </a>
<a class="sourceLine" id="cb20-6" data-line-number="6"> <span class="dt">sep =</span> <span class="st">"/"</span></a>
<a class="sourceLine" id="cb20-7" data-line-number="7"> ) </a>
<a class="sourceLine" id="cb20-8" data-line-number="8"></a>
<a class="sourceLine" id="cb20-9" data-line-number="9">response <-<span class="st"> </span><span class="kw">GET</span>(url, <span class="kw">user_agent</span>(<span class="st">"my@email.com this is a test"</span>)) </a>
<a class="sourceLine" id="cb20-10" data-line-number="10"> <span class="co"># Is there an HTTP error?</span></a>
<a class="sourceLine" id="cb20-11" data-line-number="11"> <span class="cf">if</span>(<span class="kw">http_error</span>(response)){ </a>
<a class="sourceLine" id="cb20-12" data-line-number="12"> <span class="co"># Throw an R error</span></a>
<a class="sourceLine" id="cb20-13" data-line-number="13"> <span class="kw">stop</span>(<span class="st">"the request failed"</span>) </a>
<a class="sourceLine" id="cb20-14" data-line-number="14"> }</a>
<a class="sourceLine" id="cb20-15" data-line-number="15"> <span class="co"># Return the response's content</span></a>
<a class="sourceLine" id="cb20-16" data-line-number="16"> <span class="kw">content</span>(response)</a>
<a class="sourceLine" id="cb20-17" data-line-number="17">}</a></code></pre></div>
<p>Working with JSON files (for more information see: www.json.org)
While JSON is a useful format for sharing data, your first step will often be to parse it into an R object, so you can manipulate it with R.</p>
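<p>As a small illustration of that parsing step, the sketch below uses <code>fromJSON()</code> from the jsonlite package. The choice of jsonlite and the JSON text itself are assumptions made for this example, not something taken from the course material:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(jsonlite)

# A made-up JSON string standing in for an API response
json_text <- '{"name": "Hadley Wickham", "packages": ["ggplot2", "dplyr"]}'

# Parse the JSON text into an R list
parsed <- fromJSON(json_text)

parsed$name      # a character string
parsed$packages  # a character vector with two elements</code></pre></div>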
<p>Web scraping 101
The first step in web scraping is reading the HTML in.
This can be done with <code>read_html()</code>, a function from xml2 that rvest imports.
It accepts a single URL and returns a big blob of XML that we can use further on.</p>
<p>Hadley Wickham’s Wikipedia page:</p>
<div class="sourceCode" id="cb21"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb21-1" data-line-number="1">test_url <-<span class="st"> "https://en.wikipedia.org/wiki/Hadley_Wickham"</span></a>
<a class="sourceLine" id="cb21-2" data-line-number="2"> </a>
<a class="sourceLine" id="cb21-3" data-line-number="3"><span class="co"># Read the URL stored as "test_url" with read_html()</span></a>
<a class="sourceLine" id="cb21-4" data-line-number="4">test_xml <-<span class="st"> </span><span class="kw">read_html</span>(test_url)</a>
<a class="sourceLine" id="cb21-5" data-line-number="5"> </a>
<a class="sourceLine" id="cb21-6" data-line-number="6"><span class="co"># Print test_xml</span></a>
<a class="sourceLine" id="cb21-7" data-line-number="7">test_xml</a></code></pre></div>
<pre><code>## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...</code></pre>
<p>Next comes <code>html_node()</code>, which extracts individual chunks of HTML from an HTML document.
There are a couple of ways of identifying and filtering nodes, and for now we’re going to use XPATHs:
unique identifiers for individual pieces of an HTML document.</p>
<p>Extract the element of <code>table_element</code> referred to by <code>second_xpath_val</code> and store it as <code>page_name</code>:
<code>page_name <- html_node(x = table_element, xpath = second_xpath_val)</code></p>
<p>Extract the text from page_name:</p>
<div class="sourceCode" id="cb23"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb23-1" data-line-number="1">page_title <-<span class="st"> </span><span class="kw">html_text</span>(page_name)</a>
<a class="sourceLine" id="cb23-2" data-line-number="2"></a>
<a class="sourceLine" id="cb23-3" data-line-number="3"><span class="co"># Print page_title</span></a>
<a class="sourceLine" id="cb23-4" data-line-number="4">page_title</a></code></pre></div>
<pre><code>## [1] "Hadley Wickham"</code></pre>
<div class="sourceCode" id="cb25"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb25-1" data-line-number="1"><span class="co"># Turn table_element into a data frame and assign it to wiki_table</span></a>
<a class="sourceLine" id="cb25-2" data-line-number="2"><span class="co"># wiki_table <- rvest::html_table(table_element)</span></a>
<a class="sourceLine" id="cb25-3" data-line-number="3"></a>
<a class="sourceLine" id="cb25-4" data-line-number="4"><span class="co"># Print wiki_table</span></a>
<a class="sourceLine" id="cb25-5" data-line-number="5"><span class="co"># wiki_table</span></a></code></pre></div>
<p>Cleaning a data frame
Rename the columns of wiki_table:</p>
<p>CSS web scraping
CSS is a way to add design information to HTML, that instructs the browser on how to display the content. You can leverage these design instructions to identify content on the page.</p>
<!--chapter:end:02-Importing-data.Rmd-->
</div>
<div id="useful-r-functions-examples" class="section level1">
<h1><span class="header-section-number">4</span> Useful R Functions + Examples</h1>
<blockquote>
<p>This is <em>NOT</em> intended to be a fully comprehensive list of every useful R function that exists, but is a practical demonstration of selected relevant examples presented in a user-friendly format, all available in base R. For a wider collection to work through, this Reference Card is recommended: <a href="https://cran.r-project.org/doc/contrib/Baggott-refcard-v2.pdf" class="uri">https://cran.r-project.org/doc/contrib/Baggott-refcard-v2.pdf</a></p>
</blockquote>
<blockquote>
<p>Additional CRAN reference cards and R guides (including non-English documentation) can be found here: <a href="https://cran.r-project.org/other-docs.html" class="uri">https://cran.r-project.org/other-docs.html</a></p>
</blockquote>
<div id="contents" class="section level2">
<h2><span class="header-section-number">4.1</span> Contents</h2>
<p>A. Essentials<br />
* 1. <code>getwd()</code>, <code>setwd()</code><br />
* 2. <code>?foo</code>, <code>help(foo)</code>, <code>example(foo)</code><br />
* 3. <code>install.packages("foo")</code>, <code>library("foo")</code><br />
* 4. <code>devtools::install_github("username/packagename")</code><br />
* 5. <code>data("foo")</code><br />
* 6. <code>read.csv</code>, <code>read.table</code><br />
* 7. <code>write.table()</code><br />
* 8. <code>save()</code>, <code>load()</code></p>
<p>B. Basics<br />
* 9. <code>c()</code>, <code>cbind()</code>, <code>rbind()</code>, <code>matrix()</code><br />
* 10. <code>length()</code>, <code>dim()</code><br />
* 11. <code>sort()</code>, <code>'vector'[]</code>, <code>'matrix'[]</code><br />
* 12. <code>data.frame()</code>, <code>class()</code>, <code>names()</code>, <code>str()</code>, <code>summary()</code>, <code>View()</code>, <code>head()</code>, <code>tail()</code>, <code>as.data.frame()</code></p>
<p>C. Core<br />
* 13. <code>df[order(),]</code><br />
* 14. <code>df[,c()]</code>, <code>df[which(),]</code><br />
* 15. <code>table()</code><br />
* 16. <code>mean()</code>, <code>median()</code>, <code>sd()</code>, <code>var()</code>, <code>sum()</code>, <code>min()</code>, <code>max()</code>, <code>range()</code><br />
* 17. <code>apply()</code><br />
* 18. <code>lapply()</code> using <code>list()</code><br />
* 19. <code>tapply()</code></p>
<p>D. Common<br />
* 20. <code>if</code> statement, <code>if...else</code> statement<br />
* 21. <code>for</code> loop<br />
* 22. <code>function()...</code></p>
</div>
<div id="r-syntax" class="section level2">
<h2><span class="header-section-number">4.2</span> R Syntax</h2>
<p><em>REMEMBER: KEY R LANGUAGE SYNTAX</em></p>
<ul>
<li><strong>Case Sensitivity</strong>: as per most UNIX-based packages, R is case sensitive, hence <code>X</code> and <code>x</code> are different symbols and would refer to different variables.<br />
</li>
<li><strong>Expressions vs Assignments</strong>: an expression, like <code>3 + 5</code>, can be given as a command; it is evaluated and the value is printed immediately, but not stored. An assignment, like <code>sum <- 3 + 5</code>, uses the assignment operator <code><-</code> to evaluate the same expression and store the value in the object <code>sum</code> without printing it. To print the result, call the object <code>sum</code>.<br />
</li>
<li><strong>Reserved Words</strong>: choice for naming objects is almost entirely free, except for these reserved words: <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html" class="uri">https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html</a><br />
</li>
<li><strong>Spacing</strong>: outside of the function structure, spaces don’t matter, e.g. <code>3+5</code> is the same as <code>3+ 5</code> is the same as <code>3 + 5</code>. For more best practices for R code, Hadley Wickham’s Style Guide is a useful reference: <a href="http://adv-r.had.co.nz/Style.html" class="uri">http://adv-r.had.co.nz/Style.html</a></li>
<li><strong>Comments</strong>: add comments within your code using a hashtag, <code>#</code>. R will ignore everything to the right of the hashtag on that line.</li>
</ul>
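<p>The syntax points above can be tried out in a few lines. This is only a sketch; the object names <code>total</code>, <code>X</code> and <code>x</code> are made up for the example:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">3 + 5            # expression: evaluated and printed, but not stored

total <- 3 + 5   # assignment: the value is stored, nothing is printed
total            # calling the object prints the stored value

X <- 1           # X and x are different objects because R is case sensitive
x <- 2
X == x           # FALSE

3+5 == 3 + 5     # spacing does not change the result</code></pre></div>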
</div>
<div id="functional-examples" class="section level2">
<h2><span class="header-section-number">4.3</span> Functional examples</h2>
<ol style="list-style-type: decimal">
<li>Working Directory management</li>
</ol>
<ul>
<li><code>getwd()</code>, <code>setwd()</code>
R/RStudio is always pointed at a specific directory on your computer, so it’s important to be able to check what’s the current directory using <code>getwd()</code>, and to be able to change and specify a different directory to work in using <code>setwd()</code>.</li>
</ul>
<p>Check the directory R is currently pointed at with <code>getwd()</code>.</p>
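<p>Changing directory works the same way. The sketch below switches to R's temporary directory, purely as a stand-in for a real project folder, and then switches back:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">old_dir <- getwd()   # remember where R is currently pointed

setwd(tempdir())     # point R at a different directory (a temporary one here)
getwd()              # confirm the change

setwd(old_dir)       # return to the original directory</code></pre></div>
<p>Storing the original path first makes it easy to undo the change at the end of a script.</p>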
<ol start="2" style="list-style-type: decimal">
<li>Bring up help documentation & examples</li>
</ol>
<ul>
<li><code>?foo</code>, <code>help(foo)</code>, <code>example(foo)</code></li>
</ul>
<div class="sourceCode" id="cb26"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb26-1" data-line-number="1">?boxplot</a>
<a class="sourceLine" id="cb26-2" data-line-number="2"><span class="kw">help</span>(boxplot)</a>
<a class="sourceLine" id="cb26-3" data-line-number="3"><span class="kw">example</span>(boxplot)</a></code></pre></div>
<hr />
<ol start="3" style="list-style-type: decimal">
<li>Load & Call CRAN Packages</li>
</ol>
<ul>
<li><code>install.packages("foo")</code>, <code>library("foo")</code>
Packages are add-on functionality built for R but not pre-installed (base R), hence you need to install/load the packages you want yourself. The majority of packages you’d want have been submitted to and are available via CRAN. At time of writing, the CRAN package repository featured 8,592 available packages.</li>
</ul>
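<p>A common pattern is to install a package only when it is missing and then load it. The sketch below uses the MASS package purely as an example, and <code>install.packages()</code> needs an internet connection when it actually runs:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Install the package only if it is not already available
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}

# Load the package for this session
library("MASS")

# Functions and datasets from the package can now be used
head(painters, n = 3)</code></pre></div>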
<ol start="4" style="list-style-type: decimal">
<li>Load & Call Packages from GitHub</li>
</ol>
<ul>
<li><code>devtools::install_github("username/packagename")</code>
Not all packages you’ll want will be available via CRAN, and you’ll likely need to get certain packages from GitHub accounts. This example shows how to install the <code>shinyapps</code> package from RStudio’s GitHub account.</li>
<li><code>install.packages("devtools")</code> #pre-requisite for <code>devtools...</code> function</li>
<li><code>devtools::install_github("rstudio/shinyapps")</code> #install specific package from specific GitHub account</li>
<li><code>library("shinyapps")</code> #Call package</li>
</ul>
<ol start="5" style="list-style-type: decimal">
<li>Load datasets from base R & Loaded Packages</li>
</ol>
<ul>
<li><code>data("foo")</code></li>
</ul>
<div class="sourceCode" id="cb27"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb27-1" data-line-number="1"><span class="co">#AIM: show available datasets</span></a>
<a class="sourceLine" id="cb27-2" data-line-number="2"><span class="kw">data</span>() </a>
<a class="sourceLine" id="cb27-3" data-line-number="3"></a>
<a class="sourceLine" id="cb27-4" data-line-number="4"><span class="co">#AIM: load an available dataset</span></a>
<a class="sourceLine" id="cb27-5" data-line-number="5"><span class="kw">data</span>(<span class="st">"iris"</span>) </a></code></pre></div>
<hr />
<ol start="6" style="list-style-type: decimal">
<li>I/O Loading Existing Local Data</li>
</ol>
<ul>
<li><code>read.csv</code>, <code>read.table</code></li>
</ul>
<ol style="list-style-type: lower-alpha">
<li>I/O When already in the working directory where the data is</li>
</ol>
<p>Import a local <strong>csv</strong> file (i.e. where data is separated by <strong>commas</strong>), saving it as an object:
- <code>object <- read.csv("xxx.csv")</code></p>
<p>Import a local tab delimited file (i.e. where data is separated by <strong>tabs</strong>), saving it as an object:
- <code>object <- read.table("xxx.txt", header = FALSE, sep = "\t")</code></p>
<ol start="2" style="list-style-type: lower-alpha">
<li>I/O When NOT in the working directory where the data is</li>
</ol>
<p>For example to import and save a local <strong>csv</strong> file from a different working directory you either need to specify the file path (operating system specific), e.g.:</p>
<p>on a mac:
- <code>object <- read.csv("~/Desktop/R/data.csv")</code></p>
<p>on windows:
- <code>object <- read.csv("C:/Desktop/R/data.csv")</code></p>
<p>OR</p>
<p>You can use the <code>file.choose()</code> command, which interactively opens the file dialog box for you to browse and select the local file, e.g.:
- <code>object <- read.csv(file.choose())</code></p>
<ol start="3" style="list-style-type: lower-alpha">
<li>I/O Copying & Pasting Data</li>
</ol>
<p>For relatively small amounts of data you can do an equivalent copy paste (operating system specific):</p>
<p>on a mac:
- <code>object <- read.table(pipe("pbpaste"))</code></p>
<p>on windows:
- <code>object <- read.table(file = "clipboard")</code></p>
<ol start="4" style="list-style-type: lower-alpha">
<li>I/O Loading Non-Numerical Data - character strings</li>
</ol>
<p>Be careful when loading text data! R may assume character strings are statistical factor variables, e.g. “low”, “medium”, “high”, when they are just individual labels like names. (Since R 4.0.0 the default is <code>stringsAsFactors = FALSE</code>; earlier versions convert strings to factors by default.) To specify text data NOT to be converted into factor variables, add <code>stringsAsFactors = FALSE</code> to your <code>read.csv/read.table</code> command:
- <code>object <- read.table("xxx.txt", stringsAsFactors = FALSE)</code></p>
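<p>The difference is easy to see on a tiny dataset. In this sketch the data is supplied inline through the <code>text</code> argument of <code>read.table()</code> instead of a file, and the names and levels are made up:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Two columns of text data, supplied inline rather than from a file
raw_text <- "name level\nAnn low\nBob high"

as_factors <- read.table(text = raw_text, header = TRUE, stringsAsFactors = TRUE)
as_strings <- read.table(text = raw_text, header = TRUE, stringsAsFactors = FALSE)

class(as_factors$name)   # "factor"
class(as_strings$name)   # "character"</code></pre></div>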
<ol start="5" style="list-style-type: lower-alpha">
<li>I/O Downloading Remote Data</li>
</ol>
<p>For accessing files from the web you can use the same <code>read.csv/read.table</code> commands. However, the file being downloaded does need to be in an R-friendly format (maximum of 1 header row, subsequent rows are the equivalent of one data record per row, no extraneous footnotes etc.). Here is an example downloading an online csv file of coffee harvest data used in a Nature study:
- <code>object <- read.csv("http://sumsar.net/files/posts/2014-02-04-bayesian-first-aid-one-sample-t-test/roubik_2002_coffe_yield.csv")</code></p>
<ol start="7" style="list-style-type: decimal">
<li>I/O Exporting Data Frame</li>
</ol>
<ul>
<li><code>write.table()</code></li>
</ul>
<p>Navigate to the working directory you want to save the data table into, then run the command (in this case creating a tab delimited file):
- <code>write.table(object, "xxx.txt", sep = "\t")</code></p>
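<p>A quick way to check an export is to read the file straight back in. The sketch below writes a small made-up data frame to a tab delimited file in R's temporary directory and restores it; the object and file names are illustrative only:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># A small example data frame
scores <- data.frame(id = 1:3, grade = c("A", "B", "C"))

# Write it out as a tab delimited file, without row names
out_file <- file.path(tempdir(), "scores.txt")
write.table(scores, out_file, sep = "\t", row.names = FALSE)

# Read it back in to confirm the round trip
restored <- read.table(out_file, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
restored</code></pre></div>
<p>Setting <code>row.names = FALSE</code> keeps the row numbers out of the file, so the restored data frame has the same columns as the original.</p>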
<ol start="8" style="list-style-type: decimal">
<li>I/O Saving Down & Loading Objects</li>
</ol>
<ul>
<li><code>save()</code>, <code>load()</code></li>
</ul>
<p>These two commands allow you to save a named R object to a file and restore that object again.<br />
Navigate to the working directory you want to save the object in then run the command:
- <code>save(object, file = "xxx.rda")</code></p>
<p>Reload the object:
- <code>load("xxx.rda")</code></p>
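<p>The sketch below shows the full save and restore cycle on a made-up vector, using R's temporary directory so nothing is left behind in the working directory:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">scores <- c(90, 85, 77)

# Save the named object to a file
rda_file <- file.path(tempdir(), "scores.rda")
save(scores, file = rda_file)

# Remove the object from the session, then restore it from the file
rm(scores)
load(rda_file)
scores</code></pre></div>
<p>Note that <code>load()</code> restores the object under the name it was saved with, so there is no assignment on that line.</p>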
<ol start="9" style="list-style-type: decimal">
<li>Vector & Matrix Construction</li>
</ol>
<ul>
<li><code>c()</code>, <code>cbind()</code>, <code>rbind()</code>, <code>matrix()</code>
Vectors (one-dimensional sequences of values of a single type) & matrices (two-dimensional arrays) are very common R data structures.</li>
</ul>
<div class="sourceCode" id="cb28"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb28-1" data-line-number="1"><span class="co">#use c() to construct a vector by concatenating data</span></a>
<a class="sourceLine" id="cb28-2" data-line-number="2">foo <-<span class="st"> </span><span class="kw">c</span>(<span class="dv">1</span>, <span class="dv">2</span>, <span class="dv">3</span>, <span class="dv">4</span>) <span class="co">#example of a numeric vector</span></a>
<a class="sourceLine" id="cb28-3" data-line-number="3">oof <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"A"</span>, <span class="st">"B"</span>, <span class="st">"C"</span>, <span class="st">"D"</span>) <span class="co">#example of a character vector</span></a>
<a class="sourceLine" id="cb28-4" data-line-number="4">ofo <-<span class="st"> </span><span class="kw">c</span>(<span class="ot">TRUE</span>, <span class="ot">FALSE</span>, <span class="ot">TRUE</span>, <span class="ot">TRUE</span>) <span class="co">#example of a logical vector</span></a>
<a class="sourceLine" id="cb28-5" data-line-number="5"></a>
<a class="sourceLine" id="cb28-6" data-line-number="6"><span class="co">#use cbind() & rbind() to construct matrices</span></a>
<a class="sourceLine" id="cb28-7" data-line-number="7">coof <-<span class="st"> </span><span class="kw">cbind</span>(foo, oof) <span class="co">#bind vectors in column concatenation to make a matrix</span></a>
<a class="sourceLine" id="cb28-8" data-line-number="8">roof <-<span class="st"> </span><span class="kw">rbind</span>(foo, oof) <span class="co">#bind vectors in row concatenation to make a matrix</span></a>
<a class="sourceLine" id="cb28-9" data-line-number="9"></a>
<a class="sourceLine" id="cb28-10" data-line-number="10"><span class="co">#use matrix() to construct matrices</span></a>
<a class="sourceLine" id="cb28-11" data-line-number="11">moof <-<span class="st"> </span><span class="kw">matrix</span>(<span class="dt">data =</span> <span class="dv">1</span><span class="op">:</span><span class="dv">12</span>, <span class="dt">nrow=</span><span class="dv">3</span>, <span class="dt">ncol=</span><span class="dv">4</span>) <span class="co">#creates matrix by specifying set of values, no. of rows & no. of columns</span></a></code></pre></div>
<ol start="10" style="list-style-type: decimal">
<li>Vector & Matrix Explore</li>
</ol>
<ul>
<li><code>length()</code>, <code>dim()</code></li>
</ul>
<div class="sourceCode" id="cb29"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb29-1" data-line-number="1"><span class="kw">length</span>(foo) <span class="co">#length of vector</span></a>
<a class="sourceLine" id="cb29-2" data-line-number="2"></a>
<a class="sourceLine" id="cb29-3" data-line-number="3"><span class="kw">dim</span>(coof) <span class="co">#returns dimensions (no. of rows & columns) of matrix/dataframe (NULL for a plain vector)</span></a></code></pre></div>
<ol start="11" style="list-style-type: decimal">
<li>Vector & Matrix Sort & Select</li>
</ol>
<ul>
<li><code>sort()</code>, <code>'vector'[]</code>, <code>'matrix'[]</code></li>
</ul>
<div class="sourceCode" id="cb30"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb30-1" data-line-number="1"><span class="co">#create another numeric vector</span></a>
<a class="sourceLine" id="cb30-2" data-line-number="2">jumble <-<span class="st"> </span><span class="kw">c</span>(<span class="dv">4</span>, <span class="dv">1</span>, <span class="dv">2</span>, <span class="dv">3</span>)</a>
<a class="sourceLine" id="cb30-3" data-line-number="3"><span class="kw">sort</span>(jumble) <span class="co">#sorts a numeric vector in ascending order (default)</span></a>
<a class="sourceLine" id="cb30-4" data-line-number="4"><span class="kw">sort</span>(jumble, <span class="dt">decreasing =</span> <span class="ot">TRUE</span>) <span class="co">#specify the decreasing arg to reverse default order</span></a>
<a class="sourceLine" id="cb30-5" data-line-number="5"></a>
<a class="sourceLine" id="cb30-6" data-line-number="6"><span class="co">#create another character vector</span></a>
<a class="sourceLine" id="cb30-7" data-line-number="7">mumble <-<span class="st"> </span><span class="kw">c</span>( <span class="st">"D"</span>, <span class="st">"B"</span>, <span class="st">"C"</span>, <span class="st">"A"</span>)</a>
<a class="sourceLine" id="cb30-8" data-line-number="8"><span class="kw">sort</span>(mumble) <span class="co">#sorts a character vector in alphabetical order (default)</span></a>
<a class="sourceLine" id="cb30-9" data-line-number="9"><span class="kw">sort</span>(mumble, <span class="dt">decreasing =</span> <span class="ot">TRUE</span>) <span class="co">#specify the decreasing arg to reverse default order</span></a>
<a class="sourceLine" id="cb30-10" data-line-number="10"></a>
<a class="sourceLine" id="cb30-11" data-line-number="11">jumble[<span class="dv">1</span>] <span class="co">#selects first value in our jumble vector</span></a>
<a class="sourceLine" id="cb30-12" data-line-number="12"><span class="kw">tail</span>(jumble, <span class="dt">n=</span><span class="dv">1</span>) <span class="co">#selects last value</span></a>
<a class="sourceLine" id="cb30-13" data-line-number="13">jumble[<span class="kw">c</span>(<span class="dv">1</span>,<span class="dv">3</span>)] <span class="co">#selects the 1st & 3rd values</span></a>
<a class="sourceLine" id="cb30-14" data-line-number="14">jumble[<span class="op">-</span><span class="kw">c</span>(<span class="dv">1</span>,<span class="dv">3</span>)] <span class="co">#selects everything except the 1st & 3rd values</span></a>
<a class="sourceLine" id="cb30-15" data-line-number="15"></a>
<a class="sourceLine" id="cb30-16" data-line-number="16">coof[<span class="dv">1</span>,] <span class="co">#selects the 1st row of our coof matrix</span></a>
<a class="sourceLine" id="cb30-17" data-line-number="17">coof[,<span class="dv">1</span>] <span class="co">#selects the 1st column</span></a>
<a class="sourceLine" id="cb30-18" data-line-number="18">coof[<span class="dv">2</span>,<span class="dv">1</span>] <span class="co">#selects the value in the 2nd row, 1st column</span></a>
<a class="sourceLine" id="cb30-19" data-line-number="19">coof[,<span class="st">"oof"</span>] <span class="co">#selects the column named "oof"</span></a>
<a class="sourceLine" id="cb30-20" data-line-number="20">coof[<span class="dv">1</span><span class="op">:</span><span class="dv">3</span>,] <span class="co">#selects rows 1 to 3 inclusive</span></a>
<a class="sourceLine" id="cb30-21" data-line-number="21">coof[<span class="kw">c</span>(<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">3</span>),] <span class="co">#selects the 1st, 2nd & 3rd rows (same as previous)</span></a></code></pre></div>
<ol start="12" style="list-style-type: decimal">
<li>Create & Explore Data Frames</li>
</ol>
<ul>
<li><code>data.frame()</code>, <code>class()</code>, <code>names()</code>, <code>str()</code>, <code>summary()</code>, <code>View()</code>, <code>head()</code>, <code>tail()</code>, <code>as.data.frame()</code>
A data frame is a matrix-like data structure made up of a list of equal-length variables (columns), which can be of differing data types (numeric, character, factor etc.). By contrast, all values in a matrix must be of the same data type.</li>
</ul>
<div class="sourceCode" id="cb31"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb31-1" data-line-number="1"><span class="co">#create a data frame with 3 columns with 4 rows each</span></a>
<a class="sourceLine" id="cb31-2" data-line-number="2">doof <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="st">"V1"</span>=<span class="dv">1</span><span class="op">:</span><span class="dv">4</span>, <span class="st">"V2"</span>=<span class="kw">c</span>(<span class="st">"A"</span>,<span class="st">"B"</span>,<span class="st">"C"</span>,<span class="st">"D"</span>), <span class="st">"V3"</span>=<span class="dv">5</span><span class="op">:</span><span class="dv">8</span>)</a>
<a class="sourceLine" id="cb31-3" data-line-number="3"></a>
<a class="sourceLine" id="cb31-4" data-line-number="4"><span class="kw">class</span>(doof) <span class="co">#check data frame object class</span></a>
<a class="sourceLine" id="cb31-5" data-line-number="5"><span class="kw">names</span>(doof) <span class="co"># returns column names</span></a>
<a class="sourceLine" id="cb31-6" data-line-number="6"><span class="kw">str</span>(doof) <span class="co">#see structure of data frame</span></a>
<a class="sourceLine" id="cb31-7" data-line-number="7"><span class="kw">summary</span>(doof) <span class="co">#returns basic summary stats</span></a>
<a class="sourceLine" id="cb31-8" data-line-number="8"><span class="kw">View</span>(doof) <span class="co">#invokes spreadsheet-style viewer</span></a>
<a class="sourceLine" id="cb31-9" data-line-number="9"><span class="kw">head</span>(doof, <span class="dt">n=</span><span class="dv">2</span>) <span class="co">#shows first parts of object, here requesting the first 2 rows</span></a>
<a class="sourceLine" id="cb31-10" data-line-number="10"><span class="kw">tail</span>(doof, <span class="dt">n=</span><span class="dv">2</span>) <span class="co">#shows last parts of object, here requesting the last 2 rows</span></a>
<a class="sourceLine" id="cb31-11" data-line-number="11"></a>
<a class="sourceLine" id="cb31-12" data-line-number="12">convert <-<span class="st"> </span><span class="kw">as.data.frame</span>(coof) <span class="co">#convert a non-data frame object into a data frame</span></a></code></pre></div>
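<p>The difference between a matrix and a data frame can be seen by building both from the same two vectors. This is a minimal sketch; the names <code>m</code> and <code>d</code> are made up for illustration and do not appear elsewhere in this guide.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#hypothetical objects, used only for illustration
m <- cbind(id = 1:3, grade = c("A", "B", "C")) #cbind() on vectors builds a matrix
d <- data.frame(id = 1:3, grade = c("A", "B", "C"))

class(m[, "id"]) #"character": the numbers were coerced to match the letters
class(d$id) #"integer": the data frame column keeps its own type
str(d) #shows each column with its own type
</code></pre></div>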
<ol start="13" style="list-style-type: decimal">
<li>Data Frame Sort</li>
</ol>
<ul>
<li><code>df[order(),]</code></li>
</ul>
<div class="sourceCode" id="cb32"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb32-1" data-line-number="1"><span class="co">#use 'painters' data frame</span></a>
<a class="sourceLine" id="cb32-2" data-line-number="2"><span class="kw">library</span>(<span class="st">"MASS"</span>) <span class="co">#call package with the required data</span></a>
<a class="sourceLine" id="cb32-3" data-line-number="3"><span class="kw">data</span>(<span class="st">"painters"</span>) <span class="co">#load required data</span></a>
<a class="sourceLine" id="cb32-4" data-line-number="4"><span class="kw">View</span>(painters) <span class="co">#scan dataset</span></a>
<a class="sourceLine" id="cb32-5" data-line-number="5"></a>
<a class="sourceLine" id="cb32-6" data-line-number="6"><span class="co">#syntax for using a specific variable: df=data frame, '$', V1=variable name</span></a>
<a class="sourceLine" id="cb32-7" data-line-number="7">df<span class="op">$</span>V1 </a>
<a class="sourceLine" id="cb32-8" data-line-number="8"></a>
<a class="sourceLine" id="cb32-9" data-line-number="9"><span class="co">#AIM: print the 'School' variable column</span></a>
<a class="sourceLine" id="cb32-10" data-line-number="10">painters<span class="op">$</span>School</a>
<a class="sourceLine" id="cb32-11" data-line-number="11"></a>
<a class="sourceLine" id="cb32-12" data-line-number="12"><span class="co">#syntax for df[order(),]</span></a>
<a class="sourceLine" id="cb32-13" data-line-number="13">df[<span class="kw">order</span>(df<span class="op">$</span>V1, df<span class="op">$</span>V2...),] <span class="co">#function arguments: df=data frame; inside order() list the columns to sort the ROWS by (default ordering is Ascending, later columns break ties). The trailing comma returns all the columns in the df; to return only certain columns, specify them after the comma.</span></a>
<a class="sourceLine" id="cb32-14" data-line-number="14"></a>
<a class="sourceLine" id="cb32-15" data-line-number="15"><span class="co">#AIM: order the dataset rows based on the painters' Composition Score column, in Ascending order</span></a>
<a class="sourceLine" id="cb32-16" data-line-number="16">painters[<span class="kw">order</span>(painters<span class="op">$</span>Composition),] <span class="co">#Composition is the sorting variable</span></a>
<a class="sourceLine" id="cb32-17" data-line-number="17"></a>
<a class="sourceLine" id="cb32-18" data-line-number="18"><span class="co">#AIM: order the dataset rows based on the painters' Composition Score column, in Descending order</span></a>
<a class="sourceLine" id="cb32-19" data-line-number="19">painters[<span class="kw">order</span>(<span class="op">-</span>painters<span class="op">$</span>Composition),] <span class="co">#put a minus sign in front of a numeric variable to sort by it in Descending order</span></a>
<a class="sourceLine" id="cb32-20" data-line-number="20"></a>
<a class="sourceLine" id="cb32-21" data-line-number="21"><span class="co">#AIM: order the dataset rows based on the painters' Composition Score column, in Descending order but return just the first 3 columns</span></a>
<a class="sourceLine" id="cb32-22" data-line-number="22">painters[<span class="kw">order</span>(<span class="op">-</span>painters<span class="op">$</span>Composition), <span class="kw">c</span>(<span class="dv">1</span><span class="op">:</span><span class="dv">3</span>)]</a></code></pre></div>
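<p>The sorting syntax above extends to several sort columns and to non-numeric columns. The sketch below uses a small made-up data frame (<code>scores</code> is hypothetical, not part of the MASS package) so that the result is easy to check by eye.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#hypothetical data frame, used only for illustration
scores <- data.frame(name = c("Ann", "Bob", "Cat", "Dan"),
                     team = c("red", "blue", "red", "blue"),
                     points = c(7, 9, 7, 4))

#sort by points (Descending), breaking ties by name (Ascending)
scores[order(-scores$points, scores$name), ] #row order: Bob, Ann, Cat, Dan

#the minus sign only works on numeric columns; for a character column,
#decreasing = TRUE reverses the order instead
scores[order(scores$team, decreasing = TRUE), ] #red rows first, then blue rows
</code></pre></div>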
<ol start="14" style="list-style-type: decimal">
<li>Data Frame Select & Deselect</li>
</ol>
<ul>
<li><code>df[,c()]</code>, <code>df[which(),]</code></li>
</ul>
<div class="sourceCode" id="cb33"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb33-1" data-line-number="1"><span class="co">#use 'painters' data frame</span></a>
<a class="sourceLine" id="cb33-2" data-line-number="2"></a>
<a class="sourceLine" id="cb33-3" data-line-number="3"><span class="co">#syntax for select & deselect based on column variables</span></a>
<a class="sourceLine" id="cb33-4" data-line-number="4">df[, <span class="kw">c</span>(<span class="st">"V1"</span>, <span class="st">"V2"</span>...)] <span class="co">#function arguments: df=data frame, in square brackets specify columns to select or deselect. The comma specifies returning all the rows. If certain rows are wanted this can be specified before the comma.</span></a>
<a class="sourceLine" id="cb33-5" data-line-number="5"></a>
<a class="sourceLine" id="cb33-6" data-line-number="6"><span class="co">#AIM: select the Composition & Drawing variables based on their column name</span></a>
<a class="sourceLine" id="cb33-7" data-line-number="7">painters[, <span class="kw">c</span>(<span class="st">"Composition"</span>, <span class="st">"Drawing"</span>)] <span class="co">#subset the df, selecting just the named columns (and all the rows)</span></a>
<a class="sourceLine" id="cb33-8" data-line-number="8"></a>
<a class="sourceLine" id="cb33-9" data-line-number="9"><span class="co">#AIM: select the Composition & Drawing variables based on their column positions in the painters data frame</span></a>
<a class="sourceLine" id="cb33-10" data-line-number="10">painters[, <span class="kw">c</span>(<span class="dv">1</span>,<span class="dv">2</span>)] <span class="co">#subset the df, selecting just the 1st & 2nd columns (and all the rows)</span></a>
<a class="sourceLine" id="cb33-11" data-line-number="11"></a>
<a class="sourceLine" id="cb33-12" data-line-number="12"><span class="co">#AIM: drop the Expression variable based on its column position in the painters data frame and return just the first 5 rows</span></a>
<a class="sourceLine" id="cb33-13" data-line-number="13">painters[<span class="kw">c</span>(<span class="dv">1</span><span class="op">:</span><span class="dv">5</span>), <span class="dv">-4</span>] <span class="co">#returns the first 5 rows of the df with the 4th column (Expression) dropped</span></a>
<a class="sourceLine" id="cb33-14" data-line-number="14"></a>
<a class="sourceLine" id="cb33-15" data-line-number="15"></a>
<a class="sourceLine" id="cb33-16" data-line-number="16"><span class="co">#syntax for select & deselect based on row variable values</span></a>
<a class="sourceLine" id="cb33-17" data-line-number="17">df[<span class="kw">which</span>(),] <span class="co">#df=data frame; put the condition on the variable's values inside `which()` to choose the rows to keep. Again, the trailing comma returns all the columns; to return only certain columns, specify them after the comma.</span></a>
<a class="sourceLine" id="cb33-18" data-line-number="18"></a>
<a class="sourceLine" id="cb33-19" data-line-number="19"><span class="co">#AIM: select all rows where the painters' School is the 'A' category</span></a>
<a class="sourceLine" id="cb33-20" data-line-number="20">painters[<span class="kw">which</span>(painters<span class="op">$</span>School <span class="op">==</span><span class="st"> "A"</span>),] <span class="co">#returns the subsetted df where equality holds true, i.e. row value in School variable column is 'A'</span></a>
<a class="sourceLine" id="cb33-21" data-line-number="21"></a>
<a class="sourceLine" id="cb33-22" data-line-number="22"><span class="co">#AIM: deselect all rows where the painters' School is the 'A' category, i.e. return df subset without 'A' values, AND also only select rows where Colour score > 10</span></a>
<a class="sourceLine" id="cb33-23" data-line-number="23">painters[<span class="kw">which</span>(painters<span class="op">$</span>School <span class="op">!=</span><span class="st"> "A"</span> <span class="op">&</span><span class="st"> </span>painters<span class="op">$</span>Colour <span class="op">></span><span class="st"> </span><span class="dv">10</span>),] <span class="co">#returns the rows where both conditions hold true, i.e. the School value is not 'A' AND the Colour score is above 10.</span></a></code></pre></div>
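<p>One reason to wrap the condition in <code>which()</code> rather than index with the condition directly is the handling of missing values. The following sketch assumes a made-up data frame (<code>pets</code> is hypothetical) containing one <code>NA</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#hypothetical data frame with a missing value, used only for illustration
pets <- data.frame(name = c("Rex", "Tom", "Kit"),
                   age = c(3, NA, 8))

pets[pets$age > 5, ] #2 rows: an all-NA row for Tom, plus Kit
pets[which(pets$age > 5), ] #1 row: Kit only, because which() drops the NA
</code></pre></div>
<p>With no missing values in the filtered column, both forms return the same rows.</p>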
<ol start="15" style="list-style-type: decimal">
<li>Data Frame Frequency Calculations</li>
</ol>
<ul>
<li><code>table()</code></li>
</ul>
<div class="sourceCode" id="cb34"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb34-1" data-line-number="1"><span class="co">#create new data frame</span></a>
<a class="sourceLine" id="cb34-2" data-line-number="2">flavour <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"choc"</span>, <span class="st">"strawberry"</span>, <span class="st">"vanilla"</span>, <span class="st">"choc"</span>, <span class="st">"strawberry"</span>, <span class="st">"strawberry"</span>) </a>
<a class="sourceLine" id="cb34-3" data-line-number="3">gender <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"F"</span>, <span class="st">"F"</span>, <span class="st">"M"</span>, <span class="st">"M"</span>, <span class="st">"F"</span>, <span class="st">"M"</span>)</a>
<a class="sourceLine" id="cb34-4" data-line-number="4">icecream <-<span class="st"> </span><span class="kw">data.frame</span>(flavour, gender) <span class="co">#icecream df made up of 2 categorical variables, flavour & gender, with 3 & 2 distinct values respectively (choc/strawberry/vanilla & F/M); stored as factors before R 4.0.0, and as character columns from R 4.0.0 onwards unless stringsAsFactors=TRUE is passed</span></a>
<a class="sourceLine" id="cb34-5" data-line-number="5"></a>
<a class="sourceLine" id="cb34-6" data-line-number="6"><span class="co">#AIM: create a frequency distribution table which shows the count of each gender in the df</span></a>
<a class="sourceLine" id="cb34-7" data-line-number="7"><span class="kw">table</span>(icecream<span class="op">$</span>gender) </a>
<a class="sourceLine" id="cb34-8" data-line-number="8"></a>
<a class="sourceLine" id="cb34-9" data-line-number="9"><span class="co">#AIM: create a frequency distribution table which shows the count of each flavour in the df</span></a>
<a class="sourceLine" id="cb34-10" data-line-number="10"><span class="kw">table</span>(icecream<span class="op">$</span>flavour)</a>
<a class="sourceLine" id="cb34-11" data-line-number="11"></a>
<a class="sourceLine" id="cb34-12" data-line-number="12"><span class="co">#AIM: create Contingency/2-Way Table showing the counts for each combination of flavour & gender level </span></a>
<a class="sourceLine" id="cb34-13" data-line-number="13"><span class="kw">table</span>(icecream<span class="op">$</span>flavour, icecream<span class="op">$</span>gender)</a></code></pre></div>
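<p>The object returned by <code>table()</code> can be stored and indexed by level name, much like a matrix. A short sketch, reusing the same flavour and gender values as above; the name <code>counts</code> is an assumption made for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flavour <- c("choc", "strawberry", "vanilla", "choc", "strawberry", "strawberry")
gender <- c("F", "F", "M", "M", "F", "M")
counts <- table(flavour, gender) #hypothetical name for the stored 2-way table

counts["strawberry", "F"] #2: two F rows chose strawberry
counts["vanilla", "F"] #0: combinations that never occur are still listed
sum(counts) #6: the cell counts add up to the number of rows
</code></pre></div>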
<ol start="16" style="list-style-type: decimal">
<li>Descriptive/Summary Stats Functions</li>
</ol>
<ul>
<li><code>mean()</code>, <code>median()</code>, <code>sd()</code>, <code>var()</code>, <code>sum()</code>, <code>min()</code>, <code>max()</code>, <code>range()</code></li>
</ul>
<div class="sourceCode" id="cb35"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb35-1" data-line-number="1"><span class="co">#re-using the jumble vector from before</span></a>
<a class="sourceLine" id="cb35-2" data-line-number="2">jumble <-<span class="st"> </span><span class="kw">c</span>(<span class="dv">4</span>, <span class="dv">1</span>, <span class="dv">2</span>, <span class="dv">3</span>) </a>
<a class="sourceLine" id="cb35-3" data-line-number="3"></a>
<a class="sourceLine" id="cb35-4" data-line-number="4"><span class="kw">mean</span>(jumble)</a>
<a class="sourceLine" id="cb35-5" data-line-number="5"><span class="kw">median</span>(jumble)</a>
<a class="sourceLine" id="cb35-6" data-line-number="6"><span class="kw">sd</span>(jumble)</a>
<a class="sourceLine" id="cb35-7" data-line-number="7"><span class="kw">var</span>(jumble)</a>
<a class="sourceLine" id="cb35-8" data-line-number="8"><span class="kw">sum</span>(jumble)</a>
<a class="sourceLine" id="cb35-9" data-line-number="9"><span class="kw">min</span>(jumble)</a>
<a class="sourceLine" id="cb35-10" data-line-number="10"><span class="kw">max</span>(jumble)</a>
<a class="sourceLine" id="cb35-11" data-line-number="11"><span class="kw">range</span>(jumble)</a></code></pre></div>
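<p>Two details of these functions are easy to miss: <code>range()</code> returns two values rather than a single spread, and most of them return <code>NA</code> when the input contains a missing value. A minimal sketch follows; the <code>gappy</code> vector is made up for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">jumble <- c(4, 1, 2, 3)
range(jumble) #1 4: the minimum and the maximum, not their difference
diff(range(jumble)) #3: the spread between them
all.equal(sd(jumble)^2, var(jumble)) #TRUE: sd is the square root of var

gappy <- c(4, 1, NA, 3) #hypothetical vector with a missing value
mean(gappy) #NA
mean(gappy, na.rm = TRUE) #8/3, i.e. the mean of 4, 1 and 3
</code></pre></div>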
<ol start="17" style="list-style-type: decimal">
<li>Apply Functions</li>
</ol>
<ul>
<li><code>apply()</code>
<code>apply()</code> returns a vector, array or list of values obtained by applying a specified function to the ‘margins’ (rows, columns or both) of an array or matrix.</li>
</ul>
<div class="sourceCode" id="cb36"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb36-1" data-line-number="1"><span class="co">#re-using the moof matrix from before</span></a>
<a class="sourceLine" id="cb36-2" data-line-number="2">moof <-<span class="st"> </span><span class="kw">matrix</span>(<span class="dt">data =</span> <span class="dv">1</span><span class="op">:</span><span class="dv">12</span>, <span class="dt">nrow=</span><span class="dv">3</span>, <span class="dt">ncol=</span><span class="dv">4</span>) </a>
<a class="sourceLine" id="cb36-3" data-line-number="3"></a>
<a class="sourceLine" id="cb36-4" data-line-number="4"><span class="co">#apply syntax</span></a>
<a class="sourceLine" id="cb36-5" data-line-number="5"><span class="kw">apply</span>(X, MARGIN, FUN,...) <span class="co">#function arguments: X=an array, MARGIN=1 to apply to rows/2 to apply to cols, FUN=function to apply</span></a>
<a class="sourceLine" id="cb36-6" data-line-number="6"></a>
<a class="sourceLine" id="cb36-7" data-line-number="7"><span class="co">#AIM: using the moof matrix, apply the sum function to the rows</span></a>
<a class="sourceLine" id="cb36-8" data-line-number="8"><span class="kw">apply</span>(moof, <span class="dv">1</span>, sum) </a>
<a class="sourceLine" id="cb36-9" data-line-number="9"></a>
<a class="sourceLine" id="cb36-10" data-line-number="10"><span class="co">#AIM: using the moof matrix, apply the sum function to the columns</span></a>
<a class="sourceLine" id="cb36-11" data-line-number="11"><span class="kw">apply</span>(moof, <span class="dv">2</span>, sum) </a></code></pre></div>
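<p>Because <code>matrix()</code> fills column by column, the row and column sums of <code>moof</code> can be checked by hand. The sketch below also passes a user-written function as <code>FUN</code>; the anonymous function and its argument name <code>v</code> are assumptions made for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">moof <- matrix(data = 1:12, nrow = 3, ncol = 4)

apply(moof, 1, sum) #22 26 30, the same values as rowSums(moof)
apply(moof, 2, sum) #6 15 24 33, the same values as colSums(moof)

#FUN can be any function, including one written on the spot
apply(moof, 2, function(v) max(v) - min(v)) #2 2 2 2: spread within each column
</code></pre></div>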
<ol start="18" style="list-style-type: decimal">
<li>Apply Functions</li>
</ol>
<ul>
<li><code>lapply()</code> using <code>list()</code>
A list, a common data structure, is a generic vector whose elements can be objects of any type.
<code>lapply()</code> returns a list of the same length as its input, where each element is the result of applying a specified function to the corresponding object in the input list.</li>
</ul>
<div class="sourceCode" id="cb37"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb37-1" data-line-number="1"><span class="co">#create list of various vectors and matrices</span></a>
<a class="sourceLine" id="cb37-2" data-line-number="2">bundle <-<span class="st"> </span><span class="kw">list</span>(moof, jumble, foo) </a>
<a class="sourceLine" id="cb37-3" data-line-number="3"></a>
<a class="sourceLine" id="cb37-4" data-line-number="4"><span class="co">#lapply syntax</span></a>
<a class="sourceLine" id="cb37-5" data-line-number="5"><span class="kw">lapply</span>(X, FUN,...) <span class="co">#function arguments: X=a list, FUN=function to apply</span></a>
<a class="sourceLine" id="cb37-6" data-line-number="6"></a>
<a class="sourceLine" id="cb37-7" data-line-number="7"><span class="co">#AIM: using the bundle list, apply the mean function to each object in the list</span></a>
<a class="sourceLine" id="cb37-8" data-line-number="8"><span class="kw">lapply</span>(bundle, mean)</a></code></pre></div>
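<p>To make the shape of the result concrete, here is a self-contained sketch with a made-up list (<code>parts</code> is hypothetical and stands in for <code>bundle</code>). It also shows <code>sapply()</code>, the base R variant that simplifies the result to a vector where it can.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#hypothetical list, used only for illustration
parts <- list(x = 1:4, y = c(10, 20), z = matrix(1:6, nrow = 2))

means <- lapply(parts, mean)
class(means) #"list"
length(means) #3: one result per element of parts
means$y #15

sapply(parts, mean) #named numeric vector: x = 2.5, y = 15, z = 3.5
</code></pre></div>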
<ol start="19" style="list-style-type: decimal">
<li>Apply Functions</li>
</ol>
<ul>
<li><code>tapply()</code>
<code>tapply()</code> applies a specified function to each group of values in a vector, where the groups are defined by the levels of a factor variable.</li>
</ul>
<div class="sourceCode" id="cb38"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb38-1" data-line-number="1"><span class="co">#tapply syntax</span></a>
<a class="sourceLine" id="cb38-2" data-line-number="2"><span class="kw">tapply</span>(X, INDEX, FUN,...) <span class="co">#function arguments: X=an atomic object, INDEX=list of 1+ factors of X length, FUN=function to apply</span></a>
<a class="sourceLine" id="cb38-3" data-line-number="3"></a>
<a class="sourceLine" id="cb38-4" data-line-number="4"><span class="co">#AIM: calculate the mean Drawing Score of the painters, but grouped by School category</span></a>
<a class="sourceLine" id="cb38-5" data-line-number="5"><span class="kw">tapply</span>(painters<span class="op">$</span>Drawing, painters<span class="op">$</span>School, mean) <span class="co">#grouping the data by the 8 different Schools, apply the mean function to the Drawing Score variable to return the 8 mean scores</span></a></code></pre></div>
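<p>The grouping can be checked by hand on a small example. This sketch assumes two made-up vectors (<code>score</code> and <code>group</code> are hypothetical) and compares the <code>tapply()</code> result with a manual calculation for one group.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#hypothetical vectors, used only for illustration
score <- c(10, 12, 8, 14, 6)
group <- factor(c("A", "B", "A", "B", "A"))

tapply(score, group, mean) #A = 8, B = 13

#the same result for group A, worked out manually
mean(score[group == "A"]) #8
</code></pre></div>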
<ol start="20" style="list-style-type: decimal">
<li>Programming Tools</li>
</ol>
<ul>
<li><code>if</code> statement, <code>if...else</code> statement
An <code>if</code> statement is used when certain computations are conditional and only execute when a specific condition is met - if the condition is not met, nothing executes. The <code>if...else</code> statement extends the <code>if</code> statement by adding on a computation to execute when the condition is not met, i.e. the ‘else’ part of the statement.</li>
</ul>
<div class="sourceCode" id="cb39"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb39-1" data-line-number="1"><span class="co">#if-statement syntax</span></a>
<a class="sourceLine" id="cb39-2" data-line-number="2"><span class="cf">if</span> (<span class="st">'test expression'</span>)</a>
<a class="sourceLine" id="cb39-3" data-line-number="3"> {</a>
<a class="sourceLine" id="cb39-4" data-line-number="4"> <span class="st">'statement'</span></a>
<a class="sourceLine" id="cb39-5" data-line-number="5"> }</a>
<a class="sourceLine" id="cb39-6" data-line-number="6"></a>