# Sampling Techniques

## Objectives
- Identify the standard methods of obtaining data and the advantages and disadvantages of each.
- Construct a random sample from a population using simple random sampling, systematic sampling, stratified sampling, and cluster sampling.

## Sampling and Bias
Gathering information about an entire population often costs too much or is virtually impossible. Instead, we gather information about a sample of the population. **A sample should have the same characteristics as the population it is representing.** If the sample does not have the same characteristics as the population, the statistic may not be a good estimator for the parameter. When this happens, we say the sample statistic is **biased**.

For example, suppose a pollster wants to find out if the United States population believes a college education is important for a successful career. If the pollster only samples college students, then we might expect the proportion of the sample that believes college is important to be higher than the proportion of the population that believes college is important. Because the sample (which includes only college students) does not have the same characteristics as the population (which includes college students, college graduates, those who have never been to college, etc.), the sample statistic probably isn't a good estimate of the population parameter.

To try to make sure that the sample has the same characteristics as the population, most statisticians use various methods of **random sampling**. There are several different methods of random sampling. In each form of random sampling, each member of a population initially has an equal chance of being selected for the sample. Each method has pros and cons. We describe some of the most common methods below.

## Simple Random Sampling
The easiest method to describe is called a **simple random sample**. In simple random sampling, each member of the population has an equal chance of being chosen for the sample and is chosen independently of any other member of the population.

```{figure} Simple_random_sampling.png
---
width: 60%
alt: A simple random sample of four individuals is chosen from a population of twelve individuals.
name: simple_random_sampling
---
A simple random sample of four individuals is chosen from a population of twelve individuals.[^simple-random-attribution]
```

[^simple-random-attribution]: {numref}`Figure {number} <simple_random_sampling>` was [created by Dan Kernler](https://commons.wikimedia.org/wiki/File:Simple_random_sampling.PNG) and is licensed under the [Creative Commons Attribution-Share Alike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/deed.en) license.

For example, suppose Professor Baldwin wants to form a six-person study group from the students in his precalculus class, which has 30 students. To choose a simple random sample of size six from the students of his class, Professor Baldwin could put all 30 names in a hat, shake the hat, close his eyes, and pick out six names. A more technological way is for Professor Baldwin to first list the last names of the members of his class together with a number, as in {numref}`name-table`.

```{list-table} Professor Baldwin's Class Roster.
:header-rows: 1
:name: name-table

* - Number
  - Name
  - Number
  - Name
  - Number
  - Name
* - 1
  - Anselmo
  - 11
  - Khan
  - 21
  - Roquero
* - 2
  - Bautista
  - 12
  - Legeny
  - 22
  - Roth
* - 3
  - Bayani
  - 13
  - Lundquist
  - 23
  - Rowell
* - 4
  - Cheng
  - 14
  - Macierz
  - 24
  - Salangsang
* - 5
  - Cuarismo
  - 15
  - Motogawa
  - 25
  - Slade
* - 6
  - Cuningham
  - 16
  - Okimoto
  - 26
  - Stratcher
* - 7
  - Fontecha
  - 17
  - Patel
  - 27
  - Tallai
* - 8
  - Hong
  - 18
  - Price
  - 28
  - Tran
* - 9
  - Hoobler
  - 19
  - Quizon
  - 29
  - Wai
* - 10
  - Jiao
  - 20
  - Reyes
  - 30
  - Wood
```

Professor Baldwin could now use a computer to generate six random numbers and choose the individuals on the table matching the numbers. We can do this in R with the <code>sample</code> function:

```
sample(x, size)
```

Here, <code>x</code> is a list of the members of the population, and <code>size</code> is the size of the sample we desire. The <code>sample</code> function will randomly choose a simple random sample from population <code>x</code> of the given <code>size</code>.

In this example, we would want <code>x</code> to be a list of numbers from 1 to 30 corresponding to the numbers on our table. R makes it easy to create a list of consecutive numbers. Simply type <code>start:end</code>, where <code>start</code> is the first number in the list and <code>end</code> is the last number in the list. For a list from 1 to 30, we would type <code>1:30</code>.

And since we want to choose six students from the class to be in the study group, <code>size</code> will be 6.

```r
# Randomly choose 6 students from the class
sample(1:30, size = 6)
```

````{margin} Simple Random Sampling Study Group
```{list-table} Students in the study group chosen by simple random sampling.
:header-rows: 1
:name: simple-random-table
* - Number
  - Name
* - 4
  - Cheng
* - 25
  - Slade
* - 19
  - Quizon
* - 7
  - Fontecha
* - 17
  - Patel
* - 18
  - Price
```
````


In this case, the <code>sample</code> function randomly chose students 4, 25, 19, 7, 17, and 18 (Cheng, Slade, Quizon, Fontecha, Patel, and Price) to be in the study group. (See {numref}`simple-random-table`.)
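
The numbers returned by <code>sample</code> can be turned back into names by indexing a vector of names. The following is a minimal sketch, assuming the roster is stored in a character vector called <code>roster</code> (a name chosen here for illustration) in the same order as the numbered table. Because the selection is random, the names will differ from run to run.

```r
# Roster in the same order as the numbered class table
roster = c("Anselmo", "Bautista", "Bayani", "Cheng", "Cuarismo", "Cuningham",
           "Fontecha", "Hong", "Hoobler", "Jiao", "Khan", "Legeny",
           "Lundquist", "Macierz", "Motogawa", "Okimoto", "Patel", "Price",
           "Quizon", "Reyes", "Roquero", "Roth", "Rowell", "Salangsang",
           "Slade", "Stratcher", "Tallai", "Tran", "Wai", "Wood")

# Draw 6 distinct positions from 1 to 30, then look up the matching names
chosen = sample(1:30, size = 6)
study_group = roster[chosen]
study_group
```

Indexing with the numbers from the text, <code>roster[c(4, 25, 19, 7, 17, 18)]</code>, gives Cheng, Slade, Quizon, Fontecha, Patel, and Price.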

For large samples, simple random sampling tends to do a very good job of representing the characteristics of a population. However, the smaller the sample is, the more likely it is that some characteristics or groups of a population won't be accurately represented in the sample. Also, to make sure that each member of the population has an equal chance of being chosen, simple random sampling requires a list of the full population, which may be difficult or impossible to obtain for large populations. Because of this, simple random sampling can sometimes be tedious, time-consuming, and expensive to implement.

```r
# Generates the image of the two bar charts comparing human randomness with expected randomness below.
responses = c(3, 7, 10, 14, 5, 24, 21, 5, 20, 4)
n = sum(responses)

png("human_randomness.png", width = 1000, height = 500)

par(mfrow = c(1, 2), cex.main = 2, cex.axis = 1.5, cex.lab = 2, mar = c(5, 5, 5, 0))

# "Random" numbers chosen by students
barplot(height = responses, names.arg = 1:10, ylim = c(0, 25), ylab = "Frequency", xlab = "Number Chosen", main = paste(n, "\"Random\" Numbers\nChosen by Students"))

# Expected distribution with true randomness
barplot(height = rep(n/10, 10), names.arg = 1:10, ylim = c(0, 25), ylab = "Frequency", xlab = "Number Chosen", main = paste("Expected Distribution of", n, "\nNumbers Chosen Truly Randomly"))

dev.off()
```


````{warning}
Humans don't have a good intuition for randomness. To demonstrate this point, the author asked 113 of his math students to pick a random number from 1 to 10. {numref}`Figure {number} <human_randomness>` shows a bar chart on the left of how many times each number was chosen by the students, as well as a bar chart on the right of how many times we would expect each number to be chosen by a truly uniformly random process.
```{figure} human_randomness.png
---
width: 100%
alt: Two bar charts. The bar chart on the left shows how many times each number from 1 to 10 was chosen by 113 students. The bar chart on the right shows how many times we would expect each number from 1 to 10 to be chosen by a random process.
name: human_randomness
---
The bar chart on the left shows how many times each number from 1 to 10 was chosen by 113 students. The bar chart on the right shows how many times we would expect each number to be chosen if the numbers were actually chosen randomly. The dramatic differences in the bar charts suggest that the students did not truly choose their numbers randomly.
```
With a truly random process, we would expect 1 to be chosen about as often as 6 is chosen. However, only three of the 113 students chose 1, while twenty-four students chose 6.

When you need random values, **do not trust yourself to be random.** Instead, use a random process like rolling a die, picking values out of a hat, or (as we will most often do in this class) using a computer to generate random values.
````
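
To see what 113 truly random picks from 1 to 10 can look like, we can let the computer make the choices. This is only an illustrative sketch: the argument <code>replace = TRUE</code> (not used elsewhere in this section) allows the same number to be picked more than once, as happens when 113 people each pick a number independently. The counts change from run to run and are rarely exactly equal.

```r
# 113 independent picks from 1 to 10; replace = TRUE allows repeated values
random_picks = sample(1:10, size = 113, replace = TRUE)

# Count how often each number from 1 to 10 was picked
counts = table(factor(random_picks, levels = 1:10))
counts
```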

***

### Example 1.3.1

{numref}`revenues-per-capita-1` contains real data on the revenues per capita of each city in Riverside County, California in 2019. The table is laid out the way a spreadsheet might display the data. Sample five cities using simple random sampling.

```{csv-table} Real data from the cities in Riverside County, California in 2019. The data is organized how data might be organized in a spreadsheet with each row labeled with a number and each column labeled with a letter. [^revenues-per-capita-attribution]
:file: RiversideCountyRevenuesPerCapita.csv
:header-rows: 2
:stub-columns: 1
:widths: 5 19 19 19 19 19
:name: revenues-per-capita-1
```

[^revenues-per-capita-attribution]: Riverside County data on 2019 city revenues per capita obtained from https://data.ca.gov/dataset/city-revenues-per-capita

#### Solution
Rather than assign each city a number, we can use the row numbers already on the spreadsheet. Note, though, that row $1$ contains the column headers. The actual data is contained in rows $2$ to $29$. To perform simple random sampling, we use the <code>sample</code> function to select five random row numbers between $2$ and $29$.

```r
sample(2:29, size = 5)
```

Our sample includes the city in row 2 (Banning), row 10 (Desert Hot Springs), row 13 (Indian Wells), row 23 (Palm Springs), and row 17 (La Quinta).

***

## Systematic Sampling
In a **systematic sample**, the population is in some order, and every $k$th member of the population is sampled for some number $k$ called the **sampling interval**. (For example, if $k = 7$, then every $7$th member of the population is included.) To make sure we sample members throughout the whole population, we calculate the sampling interval $k$ using the formula

$$k = \frac{N}{n}, $$

where $N$ is the size of the population, and $n$ is the size of the sample we want. To make the sampling random, one of the first $k$ members of the population is randomly chosen as the first member of the sample and the starting point of the sampling process.

```{figure} Systematic_sampling.png
---
width: 90%
alt: Four individuals are chosen from a population of twelve individuals by sampling every third individual.
name: systematic_sampling
---
Four individuals are chosen from a population of twelve individuals using systematic sampling. The second individual is randomly chosen as a starting point. Since there are $N = 12$ individuals in the population and the size of the sample is $n = 4$, the interval between each individual sampled is $k = 12/4 = 3$. So every 3rd individual is chosen from the starting point.[^systematic-attribution]
```

[^systematic-attribution]: {numref}`Figure {number} <systematic_sampling>` was [created by Dan Kernler](https://commons.wikimedia.org/wiki/File:Systematic_sampling.PNG) and is licensed under the [Creative Commons Attribution-Share Alike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/deed.en) license.

For example, suppose Professor Baldwin wants to choose six students from his precalculus class for a study group using systematic sampling. The students are already in an ordered list in {numref}`name-table`. To begin the systematic sampling, Professor Baldwin first calculates the sampling interval $k$. Since there are $N = 30$ total students in the class, and since Professor Baldwin wants to sample $n = 6$ of the students for the study group, the sampling interval is

$$ k = \frac{30}{6} = 5. $$

So Professor Baldwin will sample every $5$th student.

To determine which student is sampled first, Professor Baldwin must randomly choose a starting point from among the first $k = 5$ students. To make sure the choice is truly random, we can use the <code>sample</code> function to choose $1$ student from students $1$ through $5$.

```r
# Randomly choose one student as a starting point
sample(1:5, size = 1)
```

````{margin} Systematic Sampling Study Group
```{list-table} Students in the study group chosen by systematic sampling.
:header-rows: 1
:name: systematic-table
* - Number
  - Name
* - $4$
  - Cheng
* - $9$
  - Hoobler
* - $14$
  - Macierz
* - $19$
  - Quizon
* - $24$
  - Salangsang
* - $29$
  - Wai
```
````

In this case, the <code>sample</code> function returned $4$, so the $4$th student in {numref}`name-table`, Cheng, is the first person in the study group and the starting point of the systematic sample.

Since the sampling interval is $k = 5$, Professor Baldwin next samples every $5$th student in the population after Cheng. After Cheng, the next student in the study group is student $4 + 5 = 9$, Hoobler. The next student after Hoobler is student number $9 + 5 = 14$, Macierz. The next student is number $14 + 5 = 19$, Quizon, then student number $19 + 5 = 24$, Salangsang. The final student is student number $24 + 5 = 29$, Wai. (See {numref}`systematic-table`.)
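
The steps above can be sketched in R. This is one possible way to write it, using the base function <code>seq</code> to step through the list; the variable names are chosen here for illustration.

```r
N = 30      # population size
n = 6       # desired sample size
k = N / n   # sampling interval: 5

# Random starting point among the first k students
start = sample(1:k, size = 1)

# Every k-th student, beginning at the starting point
chosen = seq(from = start, by = k, length.out = n)
chosen
```

With a starting point of 4, <code>seq(from = 4, by = 5, length.out = 6)</code> gives 4, 9, 14, 19, 24, and 29, matching the study group above.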

Systematic sampling is usually an effective sampling technique that is easy to understand and implement. However, if members of the population with a certain characteristic regularly repeat in the order, systematic sampling may over-sample or under-sample these members. For example, suppose that a microchip manufacturer performs quality testing on every $9$th microchip in the assembly line. If there is a fault on the assembly line that causes every $3$rd microchip manufactured to malfunction, then the quality testing will either sample none of the faulty microchips (if the systematic sampling is started on a working microchip), or the quality testing will sample only faulty microchips (if the systematic sampling is started on a faulty microchip). In either case, the sample would be biased.
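
The microchip scenario can be checked with a short simulation. This sketch assumes a hypothetical run of 900 microchips in which chips 3, 6, 9, and so on are faulty; the run length and variable names are made up for illustration.

```r
chips = 1:900
faulty = chips %% 3 == 0   # every 3rd chip is faulty

# Test every 9th chip, starting on a working chip (chip 1)
tested_from_1 = seq(from = 1, to = 900, by = 9)

# Test every 9th chip, starting on a faulty chip (chip 3)
tested_from_3 = seq(from = 3, to = 900, by = 9)

mean(faulty)                  # proportion faulty in the whole run (one third)
mean(faulty[tested_from_1])   # proportion faulty among tested chips: 0
mean(faulty[tested_from_3])   # proportion faulty among tested chips: 1
```

Neither sample comes close to the true proportion of one third, because the sampling interval 9 is a multiple of the fault's period 3.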

***

### Example 1.3.2

The spreadsheet below contains real data of the revenues per capita of each city in Riverside County, California in 2019. Sample five cities using systematic sampling.

```{csv-table} Real data from the cities in Riverside County, California in 2019. The data is organized how data might be organized in a spreadsheet with each row labeled with a number and each column labeled with a letter. [^revenues-per-capita-attribution]
:file: RiversideCountyRevenuesPerCapita.csv
:header-rows: 2
:stub-columns: 1
:widths: 5 19 19 19 19 19
:name: revenues-per-capita-2
```

#### Solution
We must first determine the sampling interval $k$ between the cities in the sample. Since there are $N = 28$ cities in the population, and we want a sample size of $n = 5$, we calculate

$$ k = \frac{28}{5} = 5.6. $$

But the size of the sampling interval must be a whole number, so we round $5.6$ to a sampling interval of $k = 6$.

Next, we will randomly choose one city with which to begin the systematic sampling. We use the <code>sample</code> function to choose one city from the first $k = 6$ cities in the population. (Note that the first $6$ cities are on rows $2$ to $7$.)

```r
sample(2:7, size = 1)
```

We will begin our systematic sampling with the city on row $3$ (Beaumont). Since the sampling interval is $k = 6$, we sample every $6$th city after Beaumont. The next city to be sampled is Corona on row $3 + 6 = 9$. After Corona is the city on row $9 + 6 = 15$, Jurupa Valley. Next is Norco on row $15 + 6 = 21$. The last city to be included in the sample is San Jacinto on row $21 + 6 = 27$.
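
The row numbers in this solution can be reproduced with <code>seq</code>. The sketch below also counts, for each possible starting row, how many of the five selected rows fall within the data (rows 2 to 29). Because 5.6 was rounded up to 6, a start on row 6 or 7 would place the fifth selection past row 29; how to handle that case is not addressed here, and the code only reports it.

```r
first_row = 2
last_row = 29
k = round(28 / 5)   # 5.6 rounds to 6

# Rows sampled when the starting row is 3
rows = seq(from = 3, by = k, length.out = 5)
rows   # 3 9 15 21 27

# For each possible starting row, count the selected rows that exist in the data
starts = first_row:(first_row + k - 1)
rows_in_range = sapply(starts, function(s) {
    sum(seq(from = s, by = k, length.out = 5) <= last_row)
})
names(rows_in_range) = starts
rows_in_range
```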

***

## Stratified Sampling

To choose a **stratified sample**, divide the population into groups called strata, then randomly choose a **proportionate** number of members from each stratum. For example, if a certain stratum consists of 27% of the population, the sample should have 27% of its members taken from that stratum. Usually, strata are chosen to guarantee that certain important characteristics of the population are present in the sample. For instance, imagine a pollster wants to make sure their sample accurately represents each ethnicity of the population. The pollster can do this using stratified sampling where the strata are the different ethnicities of the population. If the population is 13.4% African American and 5.9% Asian, then the pollster's stratified sample will be 13.4% African American and 5.9% Asian.

```{figure} Stratified_sampling.png
---
width: 60%
alt: Four individuals are chosen from a population of twelve individuals by dividing the population into strata (groups) and sampling a proportionate number of individuals from each stratum.
name: stratified_sampling
---
Four individuals are chosen from a population of twelve individuals by dividing the population into strata (groups) based on color, then sampling a proportionate number of individuals from each stratum. In this example, one individual is sampled from the white stratum, two individuals from the black stratum, and one individual from the gray stratum. Twice as many individuals were sampled from the black stratum as from each of the other strata because the black stratum is twice as large as each of the other strata.[^stratified-attribution]
```

[^stratified-attribution]: {numref}`Figure {number} <stratified_sampling>` was [created by Dan Kernler](https://commons.wikimedia.org/wiki/File:Stratified_sampling.PNG) and is licensed under the [Creative Commons Attribution-Share Alike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/deed.en) license.

To better illustrate stratified sampling, suppose Professor Baldwin wants to choose six students from his precalculus class for a study group, and he wants students with an A, students with a B, and students with a C or lower to be proportionately represented in the study group. He divides the 30 students in the class into three different strata based on their grade, as shown in {numref}`stratified-list`. (Note that we have assigned each student a new number in the stratum they are in. This will make it easier to randomly sample students in each stratum later.)

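```r
# Randomly split the student numbers 1 to 30 into three strata of sizes 10, 15, and 5
S = 1:30
A = sort(sample(S, size = 10))       # 10 students with an A
A
B = sort(sample(S[-A], size = 15))   # 15 of the remaining students with a B
B
C = S[-union(A, B)]                  # the 5 students left over have a C or lower
C
```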

````{list-table} Students are divided into strata based on their grade in the course.
:header-rows: 1
:name: stratified-list

* - Grade A Students
  - Grade B Students
  - Grade C or Lower Students
* - ```{list-table}
    :header-rows: 1
    * - Number in Stratum
      - Name
    * - 1
      - Bautista
    * - 2
      - Cheng
    * - 3
      - Hong
    * - 4
      - Hoobler
    * - 5
      - Jiao
    * - 6
      - Lundquist
    * - 7
      - Patel
    * - 8
      - Salangsang
    * - 9
      - Stratcher
    * - 10
      - Tran
    ```
  - ```{list-table}
    :header-rows: 1
    * - Number in Stratum
      - Name
    * - 1
      - Anselmo
    * - 2
      - Bayani
    * - 3
      - Cuarismo
    * - 4
      - Fontecha
    * - 5
      - Khan
    * - 6
      - Legeny
    * - 7
      - Okimoto
    * - 8
      - Price
    * - 9
      - Quizon
    * - 10
      - Reyes
    * - 11
      - Roquero
    * - 12
      - Roth
    * - 13
      - Slade
    * - 14
      - Tallai
    * - 15
      - Wai
    ```
  - ```{list-table}
    :header-rows: 1
    * - Number in Stratum
      - Name
    * - 1
      - Cuningham
    * - 2
      - Macierz
    * - 3
      - Motogawa
    * - 4
      - Rowell
    * - 5
      - Wood
    ```
````

There are 10 students with an A out of the 30 total students in the class, so the proportion of students in the class in the "Grade A" stratum is $\frac{10}{30}$. Since the size of the study group is 6, there should be $\frac{10}{30} \times 6 = 2$ students from the "Grade A" stratum in the study group. Similarly, the "Grade B" stratum has 15 students out of the 30 total students in the class, so there should be $\frac{15}{30} \times 6 = 3$ students from the "Grade B" stratum in the study group. And there should be $\frac{5}{30} \times 6 = 1$ student from the "Grade C or Lower" stratum in the study group, since 5 of the 30 students in the class have a grade of C or lower.
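
The same arithmetic can be done for all three strata at once. This is a small sketch; the vector names are chosen here for illustration.

```r
stratum_sizes = c("Grade A" = 10, "Grade B" = 15, "Grade C or Lower" = 5)
sample_size = 6

# Proportion of the class in each stratum, times the size of the study group
allocation = stratum_sizes / sum(stratum_sizes) * sample_size
allocation
```

The result is 2, 3, and 1 students for the three strata, which add up to the six students in the study group.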

We can use simple random sampling using the <code>sample</code> function to choose the students for the study group in each stratum. We want to choose 2 students from the 10 students in the "Grade A" stratum.

```r
# Randomly choose 2 students from the Grade A stratum
sample(1:10, size = 2)
```

So student 1 (Bautista) and student 9 (Stratcher) from the "Grade A" stratum will be in the study group.

Similarly, we choose 3 of the 15 students in the "Grade B" stratum for the study group.

```r
# Randomly choose 3 students from the Grade B stratum
sample(1:15, size = 3)
```

We include student 2 (Bayani), student 5 (Khan), and student 15 (Wai) from the "Grade B" stratum in the study group.

Finally, we want to select 1 of the 5 students in the "Grade C or Lower" stratum for the study group.

```r
# Randomly choose 1 student from the Grade C or Lower stratum
sample(1:5, size = 1)
```

````{margin} Stratified Sampling Study Group
```{list-table} Students in the study group chosen by stratified sampling.
:header-rows: 1
:name: stratified-table
* - Stratum
  - Number in Stratum
  - Name
* - Grade A
  - 1
  - Bautista
* - Grade A
  - 9
  - Stratcher
* - Grade B
  - 2
  - Bayani
* - Grade B
  - 5
  - Khan
* - Grade B
  - 15
  - Wai
* - Grade C or Lower
  - 2
  - Macierz
```
````

The last student in the study group is student 2 (Macierz) from the "Grade C or Lower" stratum. (See {numref}`stratified-table`.)
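
The three separate draws can also be written as one step. The sketch below is one possible approach, not the only one: it stores each stratum as a character vector in a list (the names <code>strata</code> and <code>allocation</code> are chosen here for illustration) and uses the base function <code>Map</code> to take a simple random sample of the right size from each stratum. The names selected will vary from run to run.

```r
strata = list(
    "Grade A" = c("Bautista", "Cheng", "Hong", "Hoobler", "Jiao",
                  "Lundquist", "Patel", "Salangsang", "Stratcher", "Tran"),
    "Grade B" = c("Anselmo", "Bayani", "Cuarismo", "Fontecha", "Khan",
                  "Legeny", "Okimoto", "Price", "Quizon", "Reyes",
                  "Roquero", "Roth", "Slade", "Tallai", "Wai"),
    "Grade C or Lower" = c("Cuningham", "Macierz", "Motogawa", "Rowell", "Wood")
)

# Number of students to sample from each stratum, in the same order
allocation = c(2, 3, 1)

# Simple random sample of the allocated size within each stratum
study_group = Map(function(members, size) sample(members, size = size),
                  strata, allocation)
study_group
```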

Unlike other sampling methods, stratified sampling guarantees that important characteristics of the population (the characteristics that determine the strata) are properly represented in the sample. However, it can be difficult to implement, since the size of each stratum and which stratum each member of the population belongs to must be known beforehand.

***

### Example 1.3.3

An economist wants to survey 400 renters regarding their income and budgets. She wants to make sure her sample accurately reflects the different prices of rental units in the United States, so she will use stratified sampling to collect her sample, where the strata are the different price levels of rental units according to the nationwide data obtained by the Census Bureau shown in {numref}`rent-table`.

```{list-table} Occupied Units Paying Rent in the United States in 2019.[^rent-attribution]
:header-rows: 1
:widths: 500 100
:name: rent-table
* - Monthly Rent
  - Percentage
* - Less than \$500
  - 9.2%
* - \$500 to \$999
  - 34.1%
* - \$1,000 to \$1,499
  - 29.9%
* - \$1,500 to \$1,999
  - 15.2%
* - \$2,000 to \$2,499
  - 6.2%
* - \$2,500 to \$2,999
  - 2.7%
* - \$3,000 or More
  - 2.7%
```

[^rent-attribution]: The data in {numref}`rent-table` is from the United States Census Bureau's [2019 American Community Survey 1-Year Estimates](https://data.census.gov/cedsci/table?q=United%20States&g=0100000US&tid=ACSDP1Y2019.DP04&hidePreview=true).

How many renters with rents between \$1,500 and \$1,999 should the economist include in her sample?

#### Solution
To find the number of renters with rents between \$1,500 and \$1,999 that should be included in the sample, we multiply the proportion of renters in the population with rents between \$1,500 and \$1,999 by the size of the sample. From the table, we can see that the proportion of renters in this stratum is 15.2% = 0.152. (Note, when performing mathematics with percentages, we must always first convert percentages to decimals.) Since the economist wants a sample of size 400, we simply multiply 0.152 by 400 to find the number of individuals we should sample from the stratum.

```r
0.152 * 400
```

We can't sample exactly 60.8 people, so we round up to 61. Thus, of the 400 renters in the sample, 61 of those renters should pay monthly rents of between \$1,500 and \$1,999.
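
The same calculation can be carried out for every stratum in the rent table at once. In this sketch, the percentages from the table are entered as decimals in a named vector (the vector names are chosen here for illustration), and each product is rounded to the nearest whole number.

```r
rent_share = c(
    "Less than $500"   = 0.092,
    "$500 to $999"     = 0.341,
    "$1,000 to $1,499" = 0.299,
    "$1,500 to $1,999" = 0.152,
    "$2,000 to $2,499" = 0.062,
    "$2,500 to $2,999" = 0.027,
    "$3,000 or More"   = 0.027
)

# Number of renters to sample from each stratum, rounded to whole people
renters_per_stratum = round(rent_share * 400)
renters_per_stratum

# Total after rounding
sum(renters_per_stratum)
```

Rounding each stratum separately does not always reproduce the target exactly. Here the rounded counts total 401 rather than 400, so in practice one count would need a small adjustment.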

***

## Cluster Sampling

To choose a **cluster sample**, divide the population into clusters (groups) and then randomly select some of the clusters. *All* the members from the selected clusters are in the sample. 

```{figure} Cluster_sampling.png
---
width: 60%
alt: Four individuals are chosen from a population of twelve individuals by dividing the population into six clusters (groups) and sampling all individuals in two of the clusters.
name: cluster_sampling
---
Four individuals are chosen from a population of twelve individuals by dividing the population into six clusters, then randomly selecting two of the clusters. All individuals in the two chosen clusters are sampled. [^cluster-attribution]
```

[^cluster-attribution]: {numref}`Figure {number} <cluster_sampling>` was [created by Dan Kernler](https://commons.wikimedia.org/wiki/File:Cluster_sampling.PNG) and is licensed under the [Creative Commons Attribution-Share Alike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/deed.en) license.

For example, suppose Professor Baldwin's precalculus students are seated in class at tables in groups of three. (See {numref}`Figure {number} <table_clusters>`.) He wants to use cluster sampling to select a six-student study group, where the clusters are the groups of three students at each table.


```r
# Code to create table_clusters.png, to be displayed below.
library("plotrix")
library(repr)

#options(repr.plot.width = 14, repr.plot.height = 18)

# Draw one round table centered at (x, y) with radius r, labeled with its
# table number, and three chairs labeled with the given student names.
draw_table = function(x, y, r, tablenum, names) {
    
    name_text_size = 2.1
    table_text_size = 3
    name_offset = 1.375
    vert_offset = 1.375
    
    # Top Chair
    draw.circle(x, y+1.25*r, r/1.5, lwd = 2)
    text(x, y+name_offset*r, names[1], cex = name_text_size)
    
    # Left Chair
    draw.circle(x + 1.25*r*cos(pi/2 + 2*pi/3), y + 1.25*r*sin(pi/2 + 2*pi/3), r/1.5, lwd = 2)
    text(x + name_offset*r*cos(pi/2 + 2*pi/3), y + vert_offset*r*sin(pi/2 + 2*pi/3), names[2], cex = name_text_size, srt = -60)
    
    # Right Chair
    draw.circle(x + 1.25*r*cos(pi/2 + 4*pi/3), y + 1.25*r*sin(pi/2 + 4*pi/3), r/1.5, lwd = 2)
    text(x + name_offset*r*cos(pi/2 + 4*pi/3), y + vert_offset*r*sin(pi/2 + 4*pi/3), names[3], cex = name_text_size, srt = 60)
    
    # Table (drawn last so it overlaps the chairs)
    draw.circle(x, y, r, col = "gray70", border = "white", lwd = 5)
    text(x, y, paste("Table", tablenum, sep = "\n"), cex = table_text_size)
}

students = c("Anselmo", "Bautista", "Bayani", "Cheng", "Cuarismo", "Cuningham", "Fontecha", "Hong", "Hoobler", "Jiao", "Khan", "Legeny", "Lundquist", "Macierz", "Motogawa", "Okimoto", "Patel", "Price", "Quizon", "Reyes", "Roquero", "Roth", "Rowell", "Salangsang", "Slade", "Stratcher", "Tallai", "Tran", "Wai", "Wood")

# Shuffle the students so the seating is random
students = sample(students)

png("table_clusters.png", width = 1200, height = 1500)

par(mar = c(0, 0, 0, 0))
plot(c(-14, 14), c(-19, 21), type = "n", asp = 1, xaxs = "i", yaxs = "i", axes = FALSE, ann = FALSE)

draw_table(-10, 10, 3, tablenum = 1, names = students[1:3])
draw_table(-10, 0, 3, tablenum = 2, names = students[4:6])
draw_table(-10, -10, 3, tablenum = 3, names = students[7:9])

draw_table(0, 15, 3, tablenum = 4, names = students[10:12])
draw_table(0, 5, 3, tablenum = 5, names = students[13:15])
draw_table(0, -5, 3, tablenum = 6, names = students[16:18])
draw_table(0, -15, 3, tablenum = 7, names = students[19:21])

draw_table(10, 10, 3, tablenum = 8, names = students[22:24])
draw_table(10, 0, 3, tablenum = 9, names = students[25:27])
draw_table(10, -10, 3, tablenum = 10, names = students[28:30])

dev.off()
```

```{figure} table_clusters.png
---
width: 100%
alt: Ten tables are shown, with each table numbered. At each table, the names of three individuals seated at the table are given.
name: table_clusters
---
Three students sit at each of the ten tables in Professor Baldwin's precalculus classroom. The students at each table form a cluster.
```

Since each cluster has three students, Professor Baldwin will need to randomly select two of the ten tables to create the six-person study group. As with the other sampling methods described above, we can make a random selection with the `sample` function.

```r
# Randomly select two of the ten tables
sample(1:10, size = 2)
```

````{margin} Cluster Sampling Study Group
```{list-table} Students in the study group chosen by cluster sampling.
:header-rows: 1
:name: cluster-table
* - Cluster
  - Name
* - Table 2
  - Hong
* - Table 2
  - Quizon
* - Table 2
  - Cuarismo
* - Table 8
  - Okimoto
* - Table 8
  - Tran
* - Table 8
  - Bautista
```
````

The students in cluster 2 and cluster 8 have been randomly selected to be included in the study group. So *all* the students at table 2 (Hong, Quizon, and Cuarismo) and *all* the students at table 8 (Okimoto, Tran, and Bautista) have been selected for the study group. None of the students at any of the other tables are included in the study group. (See {numref}`cluster-table`.)
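
The step from selected clusters to sampled individuals can be sketched as follows. The seating used here is hypothetical and does not reproduce the figure above: students are given generic labels, and the first three are assumed to sit at table 1, the next three at table 2, and so on. The variable names are chosen for illustration.

```r
# Hypothetical class list and seating: three students per table, ten tables
students = paste("Student", 1:30)
table_number = rep(1:10, each = 3)

# Randomly select two of the ten tables (the clusters)
chosen_tables = sample(1:10, size = 2)

# Every student seated at a chosen table is in the sample
study_group = students[table_number %in% chosen_tables]
study_group
```

Note that the random selection is applied to the tables, not to the students. The students are included only because of where they sit.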

Cluster sampling can be more convenient to implement than other sampling methods because it often involves sampling members of a population that are close together. Because of this, cluster sampling can be an economical sampling method. However, the clusters must be carefully chosen to reflect the population, or the sample may be biased.

***

### Example 1.3.4
About 130 planes fly out of Ontario International Airport in California each day. Airport management wants to survey customer satisfaction using cluster sampling by sampling the passengers on six different flights on a given day. Use R to randomly choose the flights that should be sampled.

#### Solution
The clusters in this sample are the flights. We will use the <code>sample</code> function to randomly choose six flights from the 130 total.

```r
sample(1:130, size = 6)
```

We will sample *all* the passengers on the 36th, 67th, 77th, 79th, 26th, and 106th flights to leave for the day.

***

## Convenience Sampling: A Method to Avoid

A type of sampling that is non-random is **convenience sampling**. Convenience sampling involves using results that are convenient and readily available. For example, a computer software store conducts a marketing study by interviewing potential customers who happen to be in the store browsing through the available software. The results of convenience sampling may be very good in some cases and highly biased (favoring certain outcomes) in others. Because the outcome is often biased, convenience sampling should generally be avoided.

For example, suppose Professor Baldwin chooses a six-student study group by conveniently choosing the six students sitting nearest to the front of the classroom. Note that Professor Baldwin's choice is *not* random: students who tend to sit at the back of the classroom did not have an equal chance of being chosen for the study group. The study group is also very likely to be biased: it is well established that students who sit at the front of the classroom tend to do better in the class, which means the study group is likely to have a higher proportion of students with high grades than the class as a whole.

## Sample Size and Variability
When collecting a sample, the **sample size** is important. The larger the size of the sample, the more likely it is that the sample statistic will accurately approximate the population parameter. A random sample of 10 members of a population is less likely to give a good approximation of the population parameter than a random sample of 100 members. The examples you have seen in this book so far have been small. Samples of only a few hundred observations, or even smaller, are sufficient for many purposes. 

Additionally, two or more samples from the same population, taken randomly, and having close to the same characteristics as the population will likely be different from each other. This **sample variability** is completely natural. Suppose Doreen and Jung both decide to study the average amount of time students at their college sleep each night. Doreen and Jung each use simple random sampling to sample 500 students. Doreen's sample will be different from Jung's sample. Neither sample is wrong, but purely by chance, Doreen will sample students that Jung doesn't sample, and Jung will sample students that Doreen doesn't sample. So we shouldn't be surprised if the students in Doreen's sample sleep an average of 7.38 hours per night, while the students in Jung's sample sleep an average of 7.41 hours per night. While the results are different, both statistics are likely a good approximation of the population average.

If Doreen and Jung took larger samples (i.e. the number of data values is increased), their sample statistics (the average amount of time a student sleeps) might be closer to the actual population average. But still, their samples would be, in all likelihood, different from each other.
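
Sample variability and the effect of sample size can be explored with a simulation. The population below is simulated, not real data: it assumes a hypothetical college of 20,000 students whose nightly sleep is generated with <code>rnorm</code> (mean 7.4 hours, standard deviation 1.2 hours, both made up for illustration). The sketch draws one sample each for Doreen and Jung, then repeats the sampling 200 times for two different sample sizes to compare how much the sample averages vary.

```r
# Hypothetical population: nightly sleep (in hours) for 20,000 students
population = rnorm(20000, mean = 7.4, sd = 1.2)

# Doreen and Jung each take a simple random sample of 500 students
doreen = sample(population, size = 500)
jung = sample(population, size = 500)
mean(doreen)
mean(jung)
mean(population)

# Repeat the sampling 200 times with samples of size 10 and of size 500
means_small = replicate(200, mean(sample(population, size = 10)))
means_large = replicate(200, mean(sample(population, size = 500)))

# Spread of the sample averages for each sample size
sd(means_small)
sd(means_large)
```

The two sample averages will typically differ slightly from each other and from the population average, and the averages from samples of size 500 should vary much less than those from samples of size 10.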