Skip to content

04 Sampling Distribution

Serena Kim edited this page Feb 22, 2024 · 3 revisions

Import dataset

In this tutorial, we will use North Carolina's house tenure data by census blocks. Alternatively, you can download the file in this directory.

House tenure data:

Calculating the proportion of renters in each block

Let's calculate the proportion of renters in each block.

nc_housing$prop_renter = nc_housing$Renter / nc_housing$Total

Replacing missing values to 0

The rows with 0 households result in a special floating-point value called "NaN," which stands for "Not-a-Number," for the prop_renter column. This is because Division by zero being undefined. In mathematics, you cannot divide any number by zero because there is no meaningful answer. It's not possible to determine what value the result should be when dividing by zero because it represents a mathematical inconsistency.

However, the blocks without any renters should get 0 for the value. Therefore, we should change NaN to 0. To do this, you can use the following command:

nc_housing$prop_renter[is.nan(nc_housing$prop_renter)] <- 0
  • nc_housing$prop_renter accesses the prop_renters column of the my_data dataframe.

  • is.nan(nc_housing$prop_renter) creates a logical vector that is TRUE for elements that are NaN and FALSE otherwise.

  • [is.nan(nc_housing$prop_renter)] uses this logical vector to index the column, effectively replacing NaN values with 0.

Now you can see the NaN values are replaced.

Population distribution

Let's create the distribution of prop_renter, which is our variable of interest. As this variable is continuous variable, we'll create a histogram:

ggplot(nc_housing, aes(x = prop_renter)) +
  geom_histogram(binwidth = 0.04, fill = "pink", color = "black") +
  labs(x = "Proportion of renters in each block",
       y = "Frequency")
  • ggplot(nc_housing, aes(x = prop_renter)): This line initializes a ggplot object using the ggplot() function. It specifies that you want to create a plot using the nc_housing dataframe, and you are mapping the x-axis (horizontal axis) to the variable prop_renter from your dataframe. You're telling ggplot to use the values in the prop_renter column for the x-axis.
  • The geom_histogram() function is used to create histograms. binwidth = 0.04 specifies the width of the bins in the histogram. fill = "pink" sets the fill color for the bars in the histogram to pink. color = "black" sets the border color of the bars to black.
  • labs(): This line is used to set the labels for the x-axis and y-axis of the plot.

It is highly skewed as you can see:

Measuring the skewness of a distribution

We will measure Pearson skewness to quantify the skewness of the distribution. We can use the e1071 pacakge. The Pearson skewness formula is:

$ Pearson Skewness = \frac{3 \cdot (Mean - Median)}{Standard Deviation} $

First, we will need to install the package:

install.packages("e1071")
library(e1071)

And then, you can calculate the skewness of the prop_renter in the population data:

population_rent_skewness <- skewness(nc_housing$prop_renter, type=1)
population_rent_skewness
  • skewness(): This is a function used to calculate the skewness of a dataset.
  • nc_housing$prop_renter: This part of the code specifies the variable for which you want to calculate the skewness. It accesses the prop_renter column within the nc_housing dataframe.
  • type = 1: The type argument is set to 1, indicating that you want to calculate Pearson's skewness coefficient.
  • population_rent_skewness: You can print population_rent_skewness to see the skewness value for the prop_renter variable.

A sampling distribution of sample means

A sampling distribution is a theoretical distribution of a sample statistic (such as the mean, median, variance, or any other summary statistic) calculated from infinite samples drawn randomly from the same population. The Central limit theorem (CLT) suggests that as sample size increases, the sampling distribution of sample means becomes increasingly close to a normal distribution. Statisticians often use a rule of thumb that suggests a sample size of at least n ≥ 30 (or n ≥ 50) for the normal approximation to work well. Of course, we can draw the samples infinitely, so we will limit the number of sampling to 1000 times.

Therefore, first create the num_sampling to be 1000:

num_sampling <- 1000

1 When the sample size is 2 (n=2)

sample_size <-2 
sample_means <- numeric(num_sampling)
  • sample_size: a new variable which has value of 2 (sample size)
  • sample_means: Create a one-dimensional vector to store the sample means
for (i in 1:num_sampling) {
  sample_data <- nc_housing[sample(nrow(nc_housing), sample_size), ]
  sample_means[i] <- mean(sample_data$prop_renter)
}
  • for (i in 1:num_sampling) {: This line starts a for loop that will iterate from 1 to num_sampling (which is 1000), where num_sampling is a variable that represents the number of samples you want to draw and calculate means for.
  • sample_data is newly created by sample(nrow(nc_housing), sample_size), which generates random sample indices from 1 to the number of rows (nrow) in the nc_housing data frame. The number of indices generated is equal to sample_size which is 2.
  • sample_means[i] <- mean(sample_data$prop_renter): sample_means is the one dimensional empty vector generated above. There are 1000 empty values. In each iteration of the loop, a new mean is calculated and stored in the vector. Each mean is the mean of randomly sampled two prop_renter values (sample_data$prop_renter).

Now let's store the vector with 1,000 sample means in a column of a newly generated dataframe called sample_means_df:

sample_means_df <- data.frame(sample_means_n2 = sample_means)
  • As you can imagine, the column name we used here is sample_means_n2. This column takes the value from the vector sample_means.

Now let's create a histogram of this sampling distribution (n=2):

library(ggplot2)
ggplot(sample_means_df, aes (x=sample_means_n2)) +
  geom_histogram(binwidth = 0.04, fill = "blue", color = "black") +
  labs(x = "Sample (n=2) means of the proportion of renters in each block",
       y = "Frequency")
  • You're telling ggplot to use the values in the sample_means_n2 column in the sample_means_df dataframe for the x-axis.

Skewness:

library(e1071)
sample_n2_skewness <- skewness(sample_means_df$sample_means_n2, type=1)
sample_n2_skewness
  • library(e1071): This line loads the e1071 library in R. *sample_n2_skewness <- skewness(sample_means_df$sample_means_n2, type=1): sample_n2_skewness is a new value generated by calculating Pearson Skewness by the using the function skewness. We're telling R to use the sample_means_n2 column in sample_means_df.
  • You are displaying sample_n2_skewness

2 When the sample size is 5 (n=5)

num_sampling <- 1000
sample_size <-5 
  • Notice we are not updating the number of sampling (how many samples will be selected). However, now we are sampling 5 blocks each time when we sample.
sample_means <- numeric(num_sampling)
for (i in 1:num_sampling) {
  sample_data <- nc_housing[sample(nrow(nc_housing), sample_size), ]
  sample_means[i] <- mean(sample_data$prop_renter)
}

Now we will store the sample means into our dataframe that has the sample means from the previous sampling (n=2):

sample_means_df$sample_means_n5 <- sample_means

We can create the histogram of the sample means when the sample size is 5:

library(ggplot2)
ggplot(sample_means_df, aes (x=sample_means_n5)) +
  geom_histogram(binwidth = 0.04, fill = "yellow", color = "black") +
  labs(x = "Sample (n=5) means of the proportion of renters in each block",
       y = "Frequency")

Let's also calculate the Pearson skewness:

library(e1071)
sample_n5_skewness <- skewness(sample_means_df$sample_means_n5, type=1)

3 When the sample size is 10 (n=10)

num_sampling <- 1000
sample_size <-10 
  • Notice we are not updating the number of sampling (how many samples will be selected). However, now we are sampling 10 blocks each time when we sample.
sample_means <- numeric(num_sampling)
for (i in 1:num_sampling) {
  sample_data <- nc_housing[sample(nrow(nc_housing), sample_size), ]
  sample_means[i] <- mean(sample_data$prop_renter)
}

Now we will store the sample means into our dataframe that has the sample means from the previous samplings:

sample_means_df$sample_means_n10 <- sample_means

We can create the histogram of the sample means when the sample size is 10:

library(ggplot2)
ggplot(sample_means_df, aes (x=sample_means_n10)) +
  geom_histogram(binwidth = 0.04, fill = "brown", color = "black") +
  labs(x = "Sample (n=10) means of the proportion of renters in each block",
       y = "Frequency")

Let's also calculate the Pearson skewness:

library(e1071)
sample_n10_skewness <- skewness(sample_means_df$sample_means_n10, type=1)

4 When the sample size is 30 (n=30)

num_sampling <- 1000
sample_size <- 30
  • Notice we are not updating the number of sampling (how many samples will be selected). However, now we are sampling 30 blocks each time when we sample.
sample_means <- numeric(num_sampling)
for (i in 1:num_sampling) {
  sample_data <- nc_housing[sample(nrow(nc_housing), sample_size), ]
  sample_means[i] <- mean(sample_data$prop_renter)
}

Now we will store the sample means into our dataframe that has the sample means from the previous samplings (n=2, n=5, & n=10):

sample_means_df$sample_means_n30 <- sample_means

We can create the histogram of the sample means when the sample size is 30:

library(ggplot2)
ggplot(sample_means_df, aes (x=sample_means_n30)) +
  geom_histogram(binwidth = 0.03, fill = "green", color = "black") +
  labs(x = "Sample (n=30) means of the proportion of renters in each block",
       y = "Frequency")

Let's also calculate the Pearson skewness:

library(e1071)
sample_n30_skewness <- skewness(sample_means_df$sample_means_n30, type=1)
sample_n30_skewness

5 When the sample size is 500 (n=500)

num_sampling <- 1000
sample_size <- 500
  • Notice we are not updating the number of sampling (how many samples will be selected). However, now we are sampling 30 blocks each time when we sample.
sample_means <- numeric(num_sampling)
for (i in 1:num_sampling) {
  sample_data <- nc_housing[sample(nrow(nc_housing), sample_size), ]
  sample_means[i] <- mean(sample_data$prop_renter)
}

Now we will store the sample means into our dataframe that has the sample means from the previous samplings:

sample_means_df$sample_means_n500 <- sample_means

We can create the histogram of the sample means when the sample size is 500:

library(ggplot2)
ggplot(sample_means_df, aes (x=sample_means_n500)) +
  geom_histogram(binwidth = 0.008, fill = "purple", color = "black") +
  labs(x = "Sample (n=500) means of the proportion of renters in each block",
       y = "Frequency")

Let's also calculate the Pearson skewness:

library(e1071)
sample_n500_skewness <- skewness(sample_means_df$sample_means_n500, type=1)
sample_n500_skewness