# Ordered and unordered factors
<p><a href="" id="index-Factors"></a> <a href="" id="index-Ordered-factors"></a></p>
<p>A <em>factor</em> is a vector object used to specify a discrete classification (grouping) of the components of other vectors of the same length. R provides both <em>ordered</em> and <em>unordered</em> factors. While the “real” application of factors is with model formulae (see <a href="statistical-models-in-r.html#Contrasts">Contrasts</a>), we here look at a specific example.</p>
<p><a href="" id="A-specific-example"></a></p>
<h3 id="a-specific-example" class="section">4.1 A specific example</h3>
<p>Suppose, for example, we have a sample of 30 tax accountants from all the states and territories of Australia<a href="appendix-f-references.html#FOOT14" id="DOCF14"><sup>14</sup></a> and their individual state of origin is specified by a character vector of state mnemonics as</p>
<div class="example">
<pre class="example1"><code>> state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
"qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
"sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
"sa", "act", "nsw", "vic", "vic", "act")</code></pre>
</div>
<p>Notice that in the case of a character vector, “sorted” means sorted in alphabetical order; this is the order in which the levels of a factor are listed by default.</p>
<p>A <em>factor</em> is created from such a vector using the <code class="calibre2">factor()</code> function: <a href="" id="index-factor"></a></p>
<div class="example">
<pre class="example1"><code>> statef <- factor(state)</code></pre>
</div>
<p>The <code class="calibre2">print()</code> function handles factors slightly differently from other objects:</p>
<div class="example">
<pre class="example1"><code>> statef
 [1] tas sa  qld nsw nsw nt  wa  wa  qld vic nsw vic qld qld sa
[16] tas sa  nt  wa  vic qld nsw nsw wa  sa  act nsw vic vic act
Levels: act nsw nt qld sa tas vic wa</code></pre>
</div>
<p>To find out the levels of a factor the function <code class="calibre2">levels()</code> can be used. <a href="" id="index-levels"></a></p>
<div class="example">
<pre class="example1"><code>> levels(statef)
[1] "act" "nsw" "nt"  "qld" "sa"  "tas" "vic" "wa"</code></pre>
</div>
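<p>As a rough sketch of what a factor holds, the lines below inspect <code class="calibre2">statef</code> with a few standard base R functions. The names <code class="calibre2">codes</code> and <code class="calibre2">counts</code> are introduced here purely for illustration and do not appear elsewhere in the text.</p>

```r
state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
           "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
           "sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
           "sa", "act", "nsw", "vic", "vic", "act")
statef <- factor(state)

nlevels(statef)               # number of distinct levels

# Each element is stored as an integer code indexing into levels(statef)
codes <- as.integer(statef)   # illustrative name
head(codes)

# Indexing the levels by the codes recovers the original strings
identical(levels(statef)[codes], state)

# Number of accountants recorded for each state
counts <- table(statef)       # illustrative name
counts
```

<p>Seen this way, a factor is essentially a vector of integer codes together with a <code class="calibre2">levels</code> attribute that supplies the labels.</p>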
<hr />
<p><a href="" id="The-function-tapply_0028_0029-and-ragged-arrays"></a> <a href="" id="The-function-tapply_0028_0029-and-ragged-arrays-1"></a></p>
<h3 id="the-function-tapply-and-ragged-arrays" class="section">4.2 The function <code class="calibre14">tapply()</code> and ragged arrays</h3>
<p><a href="" id="index-tapply"></a></p>
<p>To continue the previous example, suppose we have the incomes of the same tax accountants in another vector (in suitably large units of money)</p>
<div class="example">
<pre class="example1"><code>> incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
59, 46, 58, 43)</code></pre>
</div>
<p>To calculate the sample mean income for each state we can now use the special function <code class="calibre2">tapply()</code>:</p>
<div class="example">
<pre class="example1"><code>> incmeans <- tapply(incomes, statef, mean)</code></pre>
</div>
<p>giving a means vector with the components labelled by the levels</p>
<div class="example">
<pre class="example1"><code>   act    nsw     nt    qld     sa    tas    vic     wa
44.500 57.333 55.500 53.600 55.000 60.500 56.000 52.250</code></pre>
</div>
<p>The function <code class="calibre2">tapply()</code> is used to apply a function, here <code class="calibre2">mean()</code>, to each group of components of the first argument, here <code class="calibre2">incomes</code>, defined by the levels of the second argument, here <code class="calibre2">statef</code><a href="appendix-f-references.html#FOOT15" id="DOCF15"><sup>15</sup></a>, as if they were separate vector structures. The result is a structure of the same length as the levels attribute of the factor containing the results. The reader should consult the help document for more details.</p>
<p>Suppose further we needed to calculate the standard errors of the state income means. To do this we need to write an R function to calculate the standard error for any given vector. Since there is a built-in function <code class="calibre2">var()</code> to calculate the sample variance, such a function is a very simple one-liner, specified by the assignment:</p>
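<p>One way to picture the grouping described above is to perform it by hand with <code class="calibre2">split()</code> and <code class="calibre2">sapply()</code>. This is only a sketch of the idea, not a description of how <code class="calibre2">tapply()</code> is implemented; the names <code class="calibre2">groups</code> and <code class="calibre2">byHand</code> are illustrative.</p>

```r
state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
           "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
           "sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
           "sa", "act", "nsw", "vic", "vic", "act")
incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
             61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
             59, 46, 58, 43)
statef <- factor(state)

# Step 1: collect the incomes into one vector per level of the factor
groups <- split(incomes, statef)

# Step 2: apply the function to each group separately
byHand <- sapply(groups, mean)

# The same result in a single call
incmeans <- tapply(incomes, statef, mean)

all.equal(as.vector(byHand), as.vector(incmeans))
```

<p>The two results agree in value and in labels; <code class="calibre2">tapply()</code> simply combines the two steps in one call.</p>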
<div class="example">
<pre class="example1"><code>> stdError <- function(x) sqrt(var(x)/length(x))</code></pre>
</div>
<p>(Writing functions will be considered later in <a href="grouping-loops-and-conditional-execution.html#Writing-your-own-functions">Writing your own functions</a>. Note that R’s built-in function <code class="calibre2">sd()</code> is something different: it computes the sample standard deviation, not the standard error of the mean.) <a href="" id="index-sd"></a> <a href="" id="index-var-1"></a> After this assignment, the standard errors are calculated by</p>
<div class="example">
<pre class="example1"><code>> incster <- tapply(incomes, statef, stdError)</code></pre>
</div>
<p>and the values calculated are then</p>
<div class="example">
<pre class="example1"><code>> incster
act    nsw  nt    qld     sa tas   vic     wa
1.5 4.3102 4.5 4.1061 2.7386 0.5 5.244 2.6575</code></pre>
</div>
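<p>To make the difference between <code class="calibre2">sd()</code> and <code class="calibre2">stdError()</code> concrete, the short check below uses the two Tasmanian incomes from the example (60 and 61). The variable name <code class="calibre2">tas</code> is chosen here for illustration only.</p>

```r
stdError <- function(x) sqrt(var(x)/length(x))

tas <- c(60, 61)   # the two incomes recorded for "tas" in the example

sd(tas)            # sample standard deviation of the observations
stdError(tas)      # standard error of their mean

# The standard error is the standard deviation divided by sqrt(n)
all.equal(stdError(tas), sd(tas) / sqrt(length(tas)))
```

<p>The standard error for this group comes out as 0.5, in agreement with the <code class="calibre2">tas</code> entry of <code class="calibre2">incster</code>.</p>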
<p>As an exercise you may care to find the usual 95% confidence limits for the state mean incomes. To do this you could use <code class="calibre2">tapply()</code> once more with the <code class="calibre2">length()</code> function to find the sample sizes, and the <code class="calibre2">qt()</code> function to find the percentage points of the appropriate <em>t</em>-distributions. (You could also investigate R’s facilities for <em>t</em>-tests.)</p>
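<p>One possible approach to this exercise is sketched below, assuming the usual two-sided interval: the mean plus or minus the 0.975 quantile of the <em>t</em>-distribution times the standard error. The names <code class="calibre2">n</code>, <code class="calibre2">tcrit</code> and <code class="calibre2">ci</code> are illustrative, and other solutions are equally valid.</p>

```r
state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
           "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
           "sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
           "sa", "act", "nsw", "vic", "vic", "act")
incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
             61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
             59, 46, 58, 43)
statef <- factor(state)
stdError <- function(x) sqrt(var(x)/length(x))

incmeans <- as.vector(tapply(incomes, statef, mean))
incster  <- as.vector(tapply(incomes, statef, stdError))
n        <- as.vector(tapply(incomes, statef, length))  # sample sizes

# Two-sided 95% limits use the 0.975 quantile with n - 1 degrees of freedom
tcrit <- qt(0.975, df = n - 1)

ci <- data.frame(state = levels(statef),
                 lower = incmeans - tcrit * incster,
                 mean  = incmeans,
                 upper = incmeans + tcrit * incster)
ci
```

<p>States with only two accountants have a single degree of freedom, so their intervals are very wide even when the standard error is small.</p>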
<p>The function <code class="calibre2">tapply()</code> can also be used to handle more complicated indexing of a vector by multiple categories. For example, we might wish to split the tax accountants by both state and sex. However, in this simple instance (just one factor) what happens can be thought of as follows. The values in the vector are collected into groups corresponding to the distinct entries in the factor. The function is then applied to each of these groups individually. The value is a vector of function results, labelled by the <code class="calibre2">levels</code> attribute of the factor.</p>
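<p>For the multiple-category case mentioned above, a list of factors can be passed as the second argument of <code class="calibre2">tapply()</code>. The sketch below uses a small made-up data set: the vectors <code class="calibre2">toyIncome</code>, <code class="calibre2">toyState</code> and <code class="calibre2">toySex</code> are hypothetical and are not part of the tax accountant example, which records no sex variable.</p>

```r
# Hypothetical toy data, invented purely for illustration
toyIncome <- c(60, 49, 40, 61, 64, 60, 59, 54)
toyState  <- factor(c("nsw", "nsw", "vic", "vic", "nsw", "vic", "nsw", "vic"))
toySex    <- factor(c("f", "m", "f", "m", "m", "f", "f", "m"))

# With a list of two factors the result is a matrix:
# rows follow the levels of toyState, columns the levels of toySex
means <- tapply(toyIncome, list(toyState, toySex), mean)
means

means["nsw", "f"]   # mean income for the nsw/f cell
```

<p>Each cell of the result holds the function value for one combination of levels; combinations that do not occur in the data are returned as <code class="calibre2">NA</code>.</p>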
<p>The combination of a vector and a labelling factor is an example of what is sometimes called a <em>ragged array</em>, since the subclass sizes are possibly irregular. When the subclass sizes are all the same the indexing may be done implicitly and much more efficiently, as we see in the next section.</p>
<hr />
<p><a href="" id="Ordered-factors"></a> <a href="" id="Ordered-factors-1"></a></p>
<h3 id="ordered-factors" class="section">4.3 Ordered factors</h3>
<p><a href="" id="index-ordered"></a></p>
<p>The levels of factors are stored in alphabetical order, or in the order they were specified to <code class="calibre2">factor</code> if they were specified explicitly.</p>
<p>Sometimes the levels will have a natural ordering that we want to record and want our statistical analysis to make use of. The <code class="calibre2">ordered()</code> <a href="" id="index-ordered-1"></a> function creates such ordered factors but is otherwise identical to <code class="calibre2">factor</code>. For most purposes the only visible difference between ordered and unordered factors is that the former are printed showing the ordering of the levels; however, the contrasts generated for them in fitting linear models are also different.</p>
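<p>A minimal sketch of the difference follows, using a hypothetical vector of ratings that is not part of the example above. The levels are given explicitly so that they are stored in their natural order rather than alphabetically.</p>

```r
# Hypothetical ratings, used only for illustration
rating <- c("low", "high", "medium", "low", "high")
lev    <- c("low", "medium", "high")   # natural order, not alphabetical

ratingf <- factor(rating, levels = lev)    # unordered, explicit level order
ratingo <- ordered(rating, levels = lev)   # ordered factor

is.ordered(ratingf)
is.ordered(ratingo)

ratingo              # the Levels line is printed with "<" between levels

# Comparisons and max() are meaningful only for the ordered factor
ratingo < "high"
max(ratingo)
```

<p>Both factors store the same levels in the same order; only the ordered one carries the information that the order is meaningful.</p>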
<hr />
<p><a href="" id="Arrays-and-matrices"></a> <a href="" id="Arrays-and-matrices-1"></a></p>
<div id="calibre_pb_12" class="calibre8">
</div>