11-R-intro.Rmd

#  Probability distributions
<p><a href="" id="index-Probability-distributions"></a></p>
<hr />
<p><a href="" id="R-as-a-set-of-statistical-tables"></a> <a href="" id="R-as-a-set-of-statistical-tables-1"></a></p>
<h3 id="r-as-a-set-of-statistical-tables" class="section">8.1 R as a set of statistical tables</h3>
<p>One convenient use of R is to provide a comprehensive set of statistical tables. Functions are provided to evaluate the cumulative distribution function P(X &lt;= x), the probability density function and the quantile function (given <em>q</em>, the smallest <em>x</em> such that P(X &lt;= x) &gt; q), and to simulate from the distribution.</p>
<blockquote>
<table>
<thead>
<tr class="header">
<th align="left">Distribution</th>
<th align="left">R name</th>
<th align="left">additional arguments</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">beta</td>
<td align="left"><code class="calibre2">beta</code></td>
<td align="left"><code class="calibre2">shape1, shape2, ncp</code></td>
</tr>
<tr class="even">
<td align="left">binomial</td>
<td align="left"><code class="calibre2">binom</code></td>
<td align="left"><code class="calibre2">size, prob</code></td>
</tr>
<tr class="odd">
<td align="left">Cauchy</td>
<td align="left"><code class="calibre2">cauchy</code></td>
<td align="left"><code class="calibre2">location, scale</code></td>
</tr>
<tr class="even">
<td align="left">chi-squared</td>
<td align="left"><code class="calibre2">chisq</code></td>
<td align="left"><code class="calibre2">df, ncp</code></td>
</tr>
<tr class="odd">
<td align="left">exponential</td>
<td align="left"><code class="calibre2">exp</code></td>
<td align="left"><code class="calibre2">rate</code></td>
</tr>
<tr class="even">
<td align="left">F</td>
<td align="left"><code class="calibre2">f</code></td>
<td align="left"><code class="calibre2">df1, df2, ncp</code></td>
</tr>
<tr class="odd">
<td align="left">gamma</td>
<td align="left"><code class="calibre2">gamma</code></td>
<td align="left"><code class="calibre2">shape, scale</code></td>
</tr>
<tr class="even">
<td align="left">geometric</td>
<td align="left"><code class="calibre2">geom</code></td>
<td align="left"><code class="calibre2">prob</code></td>
</tr>
<tr class="odd">
<td align="left">hypergeometric</td>
<td align="left"><code class="calibre2">hyper</code></td>
<td align="left"><code class="calibre2">m, n, k</code></td>
</tr>
<tr class="even">
<td align="left">log-normal</td>
<td align="left"><code class="calibre2">lnorm</code></td>
<td align="left"><code class="calibre2">meanlog, sdlog</code></td>
</tr>
<tr class="odd">
<td align="left">logistic</td>
<td align="left"><code class="calibre2">logis</code></td>
<td align="left"><code class="calibre2">location, scale</code></td>
</tr>
<tr class="even">
<td align="left">negative binomial</td>
<td align="left"><code class="calibre2">nbinom</code></td>
<td align="left"><code class="calibre2">size, prob</code></td>
</tr>
<tr class="odd">
<td align="left">normal</td>
<td align="left"><code class="calibre2">norm</code></td>
<td align="left"><code class="calibre2">mean, sd</code></td>
</tr>
<tr class="even">
<td align="left">Poisson</td>
<td align="left"><code class="calibre2">pois</code></td>
<td align="left"><code class="calibre2">lambda</code></td>
</tr>
<tr class="odd">
<td align="left">signed rank</td>
<td align="left"><code class="calibre2">signrank</code></td>
<td align="left"><code class="calibre2">n</code></td>
</tr>
<tr class="even">
<td align="left">Student’s t</td>
<td align="left"><code class="calibre2">t</code></td>
<td align="left"><code class="calibre2">df, ncp</code></td>
</tr>
<tr class="odd">
<td align="left">uniform</td>
<td align="left"><code class="calibre2">unif</code></td>
<td align="left"><code class="calibre2">min, max</code></td>
</tr>
<tr class="even">
<td align="left">Weibull</td>
<td align="left"><code class="calibre2">weibull</code></td>
<td align="left"><code class="calibre2">shape, scale</code></td>
</tr>
<tr class="odd">
<td align="left">Wilcoxon</td>
<td align="left"><code class="calibre2">wilcox</code></td>
<td align="left"><code class="calibre2">m, n</code></td>
</tr>
</tbody>
</table>
</blockquote>
<p>Prefix the name given here by ‘d’ for the density, ‘p’ for the CDF, ‘q’ for the quantile function and ‘r’ for simulation (<em>r</em>andom deviates). The first argument is <code class="calibre2">x</code> for <code class="calibre2">dxxx</code>, <code class="calibre2">q</code> for <code class="calibre2">pxxx</code>, <code class="calibre2">p</code> for <code class="calibre2">qxxx</code> and <code class="calibre2">n</code> for <code class="calibre2">rxxx</code> (except for <code class="calibre2">rhyper</code>, <code class="calibre2">rsignrank</code> and <code class="calibre2">rwilcox</code>, for which it is <code class="calibre2">nn</code>). In not quite all cases is the non-centrality parameter <code class="calibre2">ncp</code> currently available: see the on-line help for details.</p>
<p>The <code class="calibre2">pxxx</code> and <code class="calibre2">qxxx</code> functions all have logical arguments <code class="calibre2">lower.tail</code> and <code class="calibre2">log.p</code> and the <code class="calibre2">dxxx</code> ones have <code class="calibre2">log</code>. This allows, e.g., getting the cumulative (or “integrated”) <em>hazard</em> function, H(t) = - log(1 - F(t)), by</p>
<div class="example">
<pre class="example1"><code> - pxxx(t, ..., lower.tail = FALSE, log.p = TRUE)</code></pre>
</div>
<p>or more accurate log-likelihoods (by <code class="calibre2">dxxx(..., log = TRUE)</code>), directly.</p>
<p>In addition there are functions <code class="calibre2">ptukey</code> and <code class="calibre2">qtukey</code> for the distribution of the studentized range of samples from a normal distribution, and <code class="calibre2">dmultinom</code> and <code class="calibre2">rmultinom</code> for the multinomial distribution. Further distributions are available in contributed packages, notably <a href="https://CRAN.R-project.org/package=SuppDists"><strong>SuppDists</strong></a>.</p>
<p>Here are some examples</p>
<div class="example">
<pre class="example1"><code>&gt; ## -tailed p-value for t distribution
&gt; 2*pt(-2.43, df = 13)
&gt; ## upper 1% point for an F(2, 7) distribution
&gt; qf(0.01, 2, 7, lower.tail = FALSE)</code></pre>
</div>
<p>See the on-line help on <code class="calibre2">RNG</code> for how random-number generation is done in R.</p>
<hr />
<p><a href="" id="Examining-the-distribution-of-a-set-of-data"></a> <a href="" id="Examining-the-distribution-of-a-set-of-data-1"></a></p>
<h3 id="examining-the-distribution-of-a-set-of-data" class="section">8.2 Examining the distribution of a set of data</h3>
<p>Given a (univariate) set of data we can examine its distribution in a large number of ways. The simplest is to examine the numbers. Two slightly different summaries are given by <code class="calibre2">summary</code> and <code class="calibre2">fivenum</code> <a href="" id="index-summary"></a> <a href="" id="index-fivenum"></a> and a display of the numbers by <code class="calibre2">stem</code> (a “stem and leaf” plot). <a href="" id="index-stem"></a></p>
<div class="example">
<pre class="example1"><code>&gt; attach(faithful)
&gt; summary(eruptions)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.600   2.163   4.000   3.488   4.454   5.100
&gt; fivenum(eruptions)
[1] 1.6000 2.1585 4.0000 4.4585 5.1000
&gt; stem(eruptions)

  The decimal point is 1 digit(s) to the left of the |

  16 | 070355555588
  18 | 000022233333335577777777888822335777888
  20 | 00002223378800035778
  22 | 0002335578023578
  24 | 00228
  26 | 23
  28 | 080
  30 | 7
  32 | 2337
  34 | 250077
  36 | 0000823577
  38 | 2333335582225577
  40 | 0000003357788888002233555577778
  42 | 03335555778800233333555577778
  44 | 02222335557780000000023333357778888
  46 | 0000233357700000023578
  48 | 00000022335800333
  50 | 0370</code></pre>
</div>
<p>A stem-and-leaf plot is like a histogram, and R has a function <code class="calibre2">hist</code> to plot histograms. <a href="" id="index-hist"></a></p>
<div class="example">
<pre class="example1"><code>&gt; hist(eruptions)
## make the bins smaller, make a plot of density
&gt; hist(eruptions, seq(1.6, 5.2, 0.2), prob=TRUE)
&gt; lines(density(eruptions, bw=0.1))
&gt; rug(eruptions) # show the actual data points</code></pre>
</div>
<p><a href="" id="index-density"></a> <a href="" id="index-Density-estimation"></a></p>
<p>More elegant density plots can be made by <code class="calibre2">density</code>, and we added a line produced by <code class="calibre2">density</code> in this example. The bandwidth <code class="calibre2">bw</code> was chosen by trial-and-error as the default gives too much smoothing (it usually does for “interesting” densities). (Better automated methods of bandwidth choice are available, and in this example <code class="calibre2">bw = &quot;SJ&quot;</code> gives a good result.)</p>
<p><img src="hist.png" alt="images/hist" class="calibre20" /></p>
<p>We can plot the empirical cumulative distribution function by using the function <code class="calibre2">ecdf</code>. <a href="" id="index-ecdf"></a> <a href="" id="index-Empirical-CDFs"></a></p>
<div class="example">
<pre class="example1"><code>&gt; plot(ecdf(eruptions), do.points=FALSE, verticals=TRUE)</code></pre>
</div>
<p>This distribution is obviously far from any standard distribution. How about the right-hand mode, say eruptions of longer than 3 minutes? Let us fit a normal distribution and overlay the fitted CDF.</p>
<div class="example">
<pre class="example1"><code>&gt; long &lt;- eruptions[eruptions &gt; 3]
&gt; plot(ecdf(long), do.points=FALSE, verticals=TRUE)
&gt; x &lt;- seq(3, 5.4, 0.01)
&gt; lines(x, pnorm(x, mean=mean(long), sd=sqrt(var(long))), lty=3)</code></pre>
</div>
<p><img src="ecdf.png" alt="images/ecdf" class="calibre20" /></p>
<p>Quantile-quantile (Q-Q) plots can help us examine this more carefully. <a href="" id="index-Quantile_002dquantile-plots"></a> <a href="" id="index-qqnorm"></a> <a href="" id="index-qqline"></a></p>
<div class="example">
<pre class="example1"><code>par(pty=&quot;s&quot;)       # arrange for a square figure region
qqnorm(long); qqline(long)</code></pre>
</div>
<p>which shows a reasonable fit but a shorter right tail than one would expect from a normal distribution. Let us compare this with some simulated data from a <em>t</em> distribution</p>
<p><img src="QQ.png" alt="images/QQ" class="calibre20" /></p>
<div class="example">
<pre class="example1"><code>x &lt;- rt(250, df = 5)
qqnorm(x); qqline(x)</code></pre>
</div>
<p>which will usually (if it is a random sample) show longer tails than expected for a normal. We can make a Q-Q plot against the generating distribution by</p>
<div class="example">
<pre class="example1"><code>qqplot(qt(ppoints(250), df = 5), x, xlab = &quot;Q-Q plot for t dsn&quot;)
qqline(x)</code></pre>
</div>
<p>Finally, we might want a more formal test of agreement with normality (or not). R provides the Shapiro-Wilk test <a href="" id="index-Shapiro_002dWilk-test"></a> <a href="" id="index-shapiro_002etest"></a></p>
<div class="example">
<pre class="example1"><code>&gt; shapiro.test(long)

         Shapiro-Wilk normality test

data:  long
W = 0.9793, p-value = 0.01052</code></pre>
</div>
<p>and the Kolmogorov-Smirnov test <a href="" id="index-Kolmogorov_002dSmirnov-test"></a> <a href="" id="index-ks_002etest"></a></p>
<div class="example">
<pre class="example1"><code>&gt; ks.test(long, &quot;pnorm&quot;, mean = mean(long), sd = sqrt(var(long)))

         One-sample Kolmogorov-Smirnov test

data:  long
D = 0.0661, p-value = 0.4284
alternative hypothesis: two.sided</code></pre>
</div>
<p>(Note that the distribution theory is not valid here as we have estimated the parameters of the normal distribution from the same sample.)</p>
<hr />
<p><a href="" id="One_002d-and-two_002dsample-tests"></a> <a href="" id="One_002d-and-two_002dsample-tests-1"></a></p>
<h3 id="one--and-two-sample-tests" class="section">8.3 One- and two-sample tests</h3>
<p><a href="" id="index-One_002d-and-two_002dsample-tests"></a></p>
<p>So far we have compared a single sample to a normal distribution. A much more common operation is to compare aspects of two samples. Note that in R, all “classical” tests including the ones used below are in package <strong>stats</strong> which is normally loaded.</p>
<p>Consider the following sets of data on the latent heat of the fusion of ice (<em>cal/gm</em>) from Rice (1995, p.490)</p>
<div class="example">
<pre class="example1"><code>Method A: 79.98 80.04 80.02 80.04 80.03 80.03 80.04 79.97
          80.05 80.03 80.02 80.00 80.02
Method B: 80.02 79.94 79.98 79.97 79.97 80.03 79.95 79.97</code></pre>
</div>
<p>Boxplots provide a simple graphical comparison of the two samples.</p>
<div class="example">
<pre class="example1"><code>A &lt;- scan()
79.98 80.04 80.02 80.04 80.03 80.03 80.04 79.97
80.05 80.03 80.02 80.00 80.02

B &lt;- scan()
80.02 79.94 79.98 79.97 79.97 80.03 79.95 79.97

boxplot(A, B)</code></pre>
</div>
<p><a href="" id="index-boxplot"></a> <a href="" id="index-Box-plots"></a></p>
<p>which indicates that the first group tends to give higher results than the second.</p>
<p><img src="ice.png" alt="images/ice" class="calibre20" /></p>
<p>To test for the equality of the means of the two examples, we can use an <em>unpaired</em> <em>t</em>-test by <a href="" id="index-Student_0027s-t-test"></a> <a href="" id="index-t_002etest"></a></p>
<div class="example">
<pre class="example1"><code>&gt; t.test(A, B)

         Welch Two Sample t-test

data:  A and B
t = 3.2499, df = 12.027, p-value = 0.00694
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.01385526 0.07018320
sample estimates:
mean of x mean of y
 80.02077  79.97875</code></pre>
</div>
<p>which does indicate a significant difference, assuming normality. By default the R function does not assume equality of variances in the two samples (in contrast to the similar S-PLUS <code class="calibre2">t.test</code> function). We can use the F test to test for equality in the variances, provided that the two samples are from normal populations.</p>
<div class="example">
<pre class="example1"><code>&gt; var.test(A, B)

         F test to compare two variances

data:  A and B
F = 0.5837, num df = 12, denom df =  7, p-value = 0.3938
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.1251097 2.1052687
sample estimates:
ratio of variances
         0.5837405</code></pre>
</div>
<p><a href="" id="index-var_002etest"></a></p>
<p>which shows no evidence of a significant difference, and so we can use the classical <em>t</em>-test that assumes equality of the variances.</p>
<div class="example">
<pre class="example1"><code>&gt; t.test(A, B, var.equal=TRUE)

         Two Sample t-test

data:  A and B
t = 3.4722, df = 19, p-value = 0.002551
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.01669058 0.06734788
sample estimates:
mean of x mean of y
 80.02077  79.97875</code></pre>
</div>
<p>All these tests assume normality of the two samples. The two-sample Wilcoxon (or Mann-Whitney) test only assumes a common continuous distribution under the null hypothesis.</p>
<p><a href="" id="index-Wilcoxon-test"></a> <a href="" id="index-wilcox_002etest"></a></p>
<div class="example">
<pre class="example1"><code>&gt; wilcox.test(A, B)

         Wilcoxon rank sum test with continuity correction

data:  A and B
W = 89, p-value = 0.007497
alternative hypothesis: true location shift is not equal to 0

Warning message:
Cannot compute exact p-value with ties in: wilcox.test(A, B)</code></pre>
</div>
<p>Note the warning: there are several ties in each sample, which suggests strongly that these data are from a discrete distribution (probably due to rounding).</p>
<p>There are several ways to compare graphically the two samples. We have already seen a pair of boxplots. The following</p>
<div class="example">
<pre class="example1"><code>&gt; plot(ecdf(A), do.points=FALSE, verticals=TRUE, xlim=range(A, B))
&gt; plot(ecdf(B), do.points=FALSE, verticals=TRUE, add=TRUE)</code></pre>
</div>
<p>will show the two empirical CDFs, and <code class="calibre2">qqplot</code> will perform a Q-Q plot of the two samples. The Kolmogorov-Smirnov test is of the maximal vertical distance between the two ecdf’s, assuming a common continuous distribution:</p>
<div class="example">
<pre class="example1"><code>&gt; ks.test(A, B)

         Two-sample Kolmogorov-Smirnov test

data:  A and B
D = 0.5962, p-value = 0.05919
alternative hypothesis: two-sided

Warning message:
cannot compute correct p-values with ties in: ks.test(A, B)</code></pre>
</div>
<hr />
<p><a href="" id="Loops-and-conditional-execution"></a> <a href="" id="Grouping_002c-loops-and-conditional-execution"></a></p>
<div id="calibre_pb_20" class="calibre8">

</div>