# Full Stack Citizen Science, Education Stack, Part 1: Reduced Form Modeling with Joint Distributions

Part 1 of the Education Stack will lay foundations for modeling the world via joint statistical distributions. Don't worry if you don't know what that means yet -- the purpose of this part is to teach you. 

We'll start with distributions of a single variable and build intuition about what a distribution *is*, and what it means to build a distribution directly.

Then we will focus on *functions of distributions,* before learning about *distributions of functions of distributions*.

This will set us up to learn about what a hypothesis test actually is, and why counterfactual reasoning is a core element of the scientific process.

Finally, we will learn how to read regression tables, which are simply another set of hypothesis tests.


# Table of Contents

So far only the bolded items below have been completed. Each section after the Introduction will have a homework assignment.

Concepts:

1. **Introduction**
2. **Distributions of One Variable**
    1. **The Probability Mass Function**
    2. **Expected Value / Mean, and Average**
        - **Homework 1**
    4. The Cumulative Distribution Function (CDF and ECDF); the Probability Function
    5. Median and quantiles
    6. Variance and interquartile range (and higher level shapes)
    7. Concepts:
        - location of the "mass" of the distribution (mean, median, mode)
        - dispersion of the distribution (var/stdev, interquartile range)
        - skewness and kurtosis (we'll return to these)
    8. The basic shapes of distributions: two main, and multi-modal
        - single-peak centered
        - single-peak, skewed
        - multi-modal
3. Functions of distributions
    1. mean and variances
    2. differences in means
4. Distributions of functions of distributions 
    1. The scientific process, and how that is actualized in statistics.
        - Hypotheses
        - Formulating a counterfactual
        - The null hypothesis, and probability of the null hypothesis given the data.
        - Bayesian reasoning: generalizing the null hypothesis
    2. Hypothesis test: difference of means
    3. Constructing the distribution of the difference of means
        - Direct construction of the distribution when you control the data generating process
        - Bootstrapping when you do not
        - Analytical statistics and asymptotics
5. Distributions of Two Variables
    1. Conditional and marginal distributions
    2. Covariance and correlation
    3. Dot-plot as 2d histogram
    4. Distributions of N variables
6. Analyzing hypothesis tests for more than 2 variables
    1. Ceteris paribus reasoning and marginal *functions*
    2. Manually constructing the marginal (the point estimate)
    3. Manually constructing the variance of the marginal (the variance)
7. Reading Regression Tables
    1. Nearest neighbor regression: conditional distribution with N dimensions
    2. Linear regression: conditional distribution with N dimensions
    3. Linear regression: the partial dependence function
    4. Linear regression: bootstrapping the partial dependence function
    5. Reading a regression table
8. Further topics: causal inference, time series, higher levels of modeling

# 1. Introduction

Distributions are the core of statistics.

There are some values that are not random -- think days of the week, or inches in a foot, or the age you were a year ago. Those values are known with certainty -- they are called "deterministic."

There are other values that are *random* -- you don't know them for certain until you see them, but you still know something about how likely different possibilities are. For example, how much rain is likely to fall tomorrow (probably somewhere in the range of 0-5 inches, probably *not* 1,000+ inches).

If you have a *random variable X*, then that variable is described by an object called a *distribution*. There are two common ways to represent a distribution, which we will explore further in later sections:

- the probability *density* or *mass* function, abbreviated PDF or PMF, usually expressed as $f_X(x)$
- the *cumulative distribution* function, abbreviated CDF, usually expressed as $F_X(x)$.

Don't worry if you are unfamiliar with these; we will define them in more detail later. All you need to take away for the moment is that these are two alternative ways of looking at the same probability distribution.

Approximately all of statistics can be characterized as:

1. Distributions ($f_X(x)$ or $F_X(x)$)
2. Functions of Distributions
3. Distributions of Functions of Distributions

Abusing notation slightly and using $g()$ for "functions," we can write this compactly as:

1. $F(X)$
2. $g(F(X))$
3. $F(g(F(X)))$

We will return to this framing to build up intuition of what statistics is as a discipline, and how the final step is used in scientific reasoning. 

### Discretization

To accelerate learning and intuition about distributions, we are going to use a trick called "discretization."

Many distributions are *continuous* (discussed more in Section 2), and you need calculus (derivatives and integrals) to deal with them.

*Discrete* distributions, on the other hand, only require addition, subtraction, multiplication, and division to work with them. Thus nearly everyone can learn statistics using discrete distributions. 

*Discretization* is the process of turning a continuous distribution into a discrete distribution. Generally speaking, a variable can be discretized by chopping its range of values into a finite set of "buckets," and then acting as though everything that falls into the same bucket can be represented the same way. We will work through an example of this in the next section.

Discretization sacrifices some of the accuracy of the true distribution, but comes with at least two benefits. First, as noted, it makes intuition and learning much easier. Second, it makes some highly complicated problems solvable that might otherwise be intractable. Some of the most advanced modeling techniques rely on discretization to capture complicated real-world problems; more on that later.

We will only directly discretize the *probability density* version of a distribution, $f_X(x)$, and we will do this in a way that always lets us track the moments of the true underlying continuous distribution. (For those with more statistical background: we discretize the PDF using samples of data, such that the analytical mean, variance, etc. of the discrete distribution exactly match the sample mean, variance, etc. This is possible by employing the laws of total expectation, total variance, and total covariance, the laws of propagation of errors, and other similar 'laws of moments' of the distribution. The end result is a discrete distribution that is close to the observed data.)

The cumulative distribution version of a distribution, $F_X(x)$, has a simple representation that is easy to understand and work with, and that does not require explicit discretization.

Finally, once you learn the discretized version of distributions, it is relatively straightforward to learn some of the continuous distribution functions, and eventually to move on to calculus-based or measure-based probability if desired.

### Background reading

Optional resources if you want to read more:

- Think Statistics: this free book discusses distributions and takes the discretization approach. Read [Chapter Two](https://greenteapress.com/thinkstats2/html/thinkstats2003.html) for more background.
- Wikipedia has many articles on distributions, histograms, continuous vs discrete variables, and probability mass functions, but they can be a bit too advanced for learning from scratch. Feel free to read them, but don't get scared away if they don't make sense! (In fact, a lot of learning stats, and then more science, is learning how to read things you don't yet understand, not getting scared away, learning, then reading again... so maybe these Wikipedia articles are good practice for that!)



# 2. Distributions of One Variable

Let's start with an example: our random variable $X$ will be *human height*. Human height is *continuous* -- imagine that you had a perfect measuring stick, and it could give you as many decimal places of precision as you wanted. You measure someone and decide you want to have 31 decimal places of precision. Your measuring stick tells you that the person is 5.8240312119261876655316423270011 feet tall (or approximately 5'10" tall).

What are the odds that you can find another person *exactly* as tall as that? Pretty low. Now imagine that you asked your measuring stick for 100 decimal places of precision, or 1,000. For values that are *continuous*, you can just keep asking for more and more precision, and you will always get more. 

That's not true of a discrete variable. For example, consider the number of days in a month. There are only a few options: 28 days, 29 days, 30 days, or 31 days. The month you choose (and whether it is a leap year) will determine the number of days in that month. For example, if we choose May then the number of days is 31. It doesn't matter if you measure the number of days in May *very precisely*. Even if you ask for 15 decimal places, it will still be 31 days in May.


To turn a continuous variable (like height) into a discrete variable (like days in a month), you can create a discrete set of "buckets" to contain the continuous values. An obvious choice of "buckets" for height is 1-foot intervals, like so:

- 0-1 feet
- 1-2 feet
- 2-3 feet
- 3-4 feet
- 4-5 feet
- 5-6 feet
- 6-7 feet
- 7+  feet

Using our example earlier, the person with a height of 5.8240312119261876655316423270011 feet tall (or approximately 5'10") would go into the 5-6 feet bucket. 
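As a rough illustration of the bucketing rule, here is a minimal Python sketch. The function name `bucket_label` is made up for this example, and it assumes that a height sitting exactly on a boundary (say 5.0) goes into the upper bucket, a choice the text above does not specify.

```python
import math


def bucket_label(height_ft: float) -> str:
    """Return the 1-foot bucket that a height (in feet) falls into."""
    if height_ft < 0:
        raise ValueError("height must be non-negative")
    if height_ft >= 7:
        # Everything 7 feet and above shares the open-ended top bucket.
        return "7+ feet"
    low = math.floor(height_ft)  # lower edge of the bucket
    return f"{low}-{low + 1} feet"


# The person from the example above lands in the 5-6 feet bucket.
print(bucket_label(5.8240312119261876655316423270011))
```

No matter how many decimal places the measuring stick reports, the bucket label stays the same, which is the whole point of discretizing.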

Let's take a sample of height values, and put them into buckets, to generate our discrete distribution of heights.

## 2.A Probability Mass Function (PMF)

Recall that our random variable $X$ is height. Say that we have a *sample* of 40 people's heights. We can write down the sample as $\vec{x}$. The little arrow over the top of $\vec{x}$ means this is a *vector* or a set of values. 

In general we write a vector as a set of values indexed like so:

$$
\vec{x} = [x_1, x_2, ..., x_i, ..., x_{I}]
$$

The vector has length indicated by $I$, and we use $i$ as a placeholder for an arbitrary element anywhere in the vector. 

Let's say we have a sample of 40 observations of people's heights, in feet. Here is the sample we will use for our illustrations:

$$
\begin{aligned}
\vec{x} = [&2.91, 3.09, 5.19, 5.76, 5.20, 4.93, 4.96, 5.60, 5.67, 5.86, \\
           &4.85, 5.60, 5.22, 5.71, 5.16, 5.27, 5.05, 5.07, 5.58, 3.65, \\
           &4.78, 5.07, 5.31, 6.17, 5.68, 5.50, 5.90, 5.33, 4.68, 4.85, \\
           &4.70, 4.51, 5.58, 3.82, 6.02, 4.46, 5.85, 6.16, 5.08, 3.51]
\end{aligned}
$$

The first element of this vector is $x_1 = 2.91$ft, the $10^{th}$ element is $x_{10} = 5.86$ft, and the capital-I $I^{th}$ (last) element is $x_I = 3.51$ft. 

Because no value is less than 2 or greater than 7, our buckets will be:

- 2-3ft
- 3-4ft
- 4-5ft
- 5-6ft
- 6-7ft

To make our discrete distribution, we will count how many of these values fall into each of the bins defined above.

We could imagine "dropping each observation" into the bins randomly like so:

![Height in Buckets](./buckets_of_height_count.png)

In this illustration, the positions inside each bucket don't matter -- we've just randomly dropped each one in. The number of observations in each bucket is the thing that matters, and we've written it above each bucket.

To turn this into a distribution, we need to do two things:
1. Turn the counts into a probability weight, and
2. Decide on an "x-value" to represent each bucket.

Turning counts into probability weights is simple: just divide the count in each bucket by the total number of observations. 

Here's the same image, except we've divided each count by the total number of observations:

![Height in Buckets](./buckets_of_height_fraction.png)
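The counting and dividing steps can be sketched in a few lines of Python. This is only an illustration using the 40-value sample above; it relies on `int()` truncating each height to the lower edge of its 1-foot bucket, which works here because no observation sits exactly on a bucket boundary.

```python
from collections import Counter

heights = [
    2.91, 3.09, 5.19, 5.76, 5.20, 4.93, 4.96, 5.60, 5.67, 5.86,
    4.85, 5.60, 5.22, 5.71, 5.16, 5.27, 5.05, 5.07, 5.58, 3.65,
    4.78, 5.07, 5.31, 6.17, 5.68, 5.50, 5.90, 5.33, 4.68, 4.85,
    4.70, 4.51, 5.58, 3.82, 6.02, 4.46, 5.85, 6.16, 5.08, 3.51,
]

# Count how many observations fall into each bucket (keyed by lower edge).
counts = Counter(int(h) for h in heights)
n_total = len(heights)

# Divide each count by the total to get the probability weight per bucket.
fractions = {low: counts[low] / n_total for low in sorted(counts)}

for low, fraction in fractions.items():
    print(f"{low}-{low + 1}ft: count={counts[low]:2d}, fraction={fraction:.3f}")
```

For this sample the counts come out to 1, 4, 9, 23, and 3, and the fractions add up to 1.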

The formal version of these buckets is the *histogram,* which represents, in 2D, the discrete probability distribution for these values. Each "bucket" will be represented by a rectangle whose area (height $\times$ width) is equal to the fraction of values in the bucket. 

The area of each bar, in blue, is typically called the "probability mass" of the distribution -- the area of each rectangle is literally how much probability that bar represents.

If you were to calculate the area of each blue rectangle and add them all up, you would see that the total amount of probability mass in this histogram is 1. 

Here is a histogram where each bar is outlined:

![Height in Buckets](./histogram.png)

Histograms can also be represented as a line function, like so: 

![Height in Buckets](./histogram_step.png)

The key element is that if you were to calculate the area under this line function, you would find that it adds up to 1. All probability functions, discrete or continuous, have this property: the area under the curve (step function in this case) sums up to 1.

A note on units: probabilities are typically talked about as a fraction: 0.5 means 1/2, which means 50%. Percentages are a very common way to talk about the likelihood of something, as in, "there is a 50% chance of rain tomorrow." Percent literally means "per one hundred," and to convert from probabilities to percentages, you multiply the probability by 100 (and then make sure to indicate that it is in percent units by adding the % symbol, e.g. 50% not just 50). We will use both probabilities and percentages; the important thing to be aware of is that $probability \times 100 = percentage \;\%$.

We still need to assign an x-value to represent each bucket. There are at least two ways to do this. We can simply choose the midpoint of the bins, or we can choose the average of all observations that fall into each bin. We will use bin average for reasons that will become clear in the first homework. 

We will represent the bin averages as black dots in the histogram below, and draw a line from the dot to the x-axis to make it easy to see the x-axis value. 

![Histogram and PMF Combo](./histogram_with_mid_mean_and_straight_line.png)

The black dots and lines are the *probability mass function* or *PMF* -- the height of each dot corresponds to the area of the bars of the histogram. Because the width of each bar = 1 in our example, the PMF matches the height of the bars in the histogram.

If we keep gathering data and keep shrinking the width of the bars, we will eventually converge back to the continuous version of this function, the so-called probability density function. More on that to follow!

In a previous section we noted that distributions are often abbreviated as lower-case $f_X(x)$ or upper-case $F_X(x)$. The PMF (and its continuous version) is typically referred to by the lower-case $f_X(x)$.
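To make the construction concrete, here is a small Python sketch that builds the PMF from the 40-value sample, using the bin average as each bin's x-value and printing the midpoint alongside for comparison. The variable names are illustrative only, and `int()` is used to find each observation's 1-foot bin.

```python
from collections import defaultdict

heights = [
    2.91, 3.09, 5.19, 5.76, 5.20, 4.93, 4.96, 5.60, 5.67, 5.86,
    4.85, 5.60, 5.22, 5.71, 5.16, 5.27, 5.05, 5.07, 5.58, 3.65,
    4.78, 5.07, 5.31, 6.17, 5.68, 5.50, 5.90, 5.33, 4.68, 4.85,
    4.70, 4.51, 5.58, 3.82, 6.02, 4.46, 5.85, 6.16, 5.08, 3.51,
]

# Group the observations by 1-foot bin; int() truncates, so 5.76 lands in bin 5.
observations_by_bin = defaultdict(list)
for h in heights:
    observations_by_bin[int(h)].append(h)

n_total = len(heights)
pmf = []  # (x-value, probability) pairs, one per bin
for low in sorted(observations_by_bin):
    values = observations_by_bin[low]
    bin_average = sum(values) / len(values)   # x-value used for the dot
    probability = len(values) / n_total       # height of the dot
    pmf.append((bin_average, probability))
    print(f"{low}-{low + 1}ft: midpoint={low + 0.5:.2f}, "
          f"bin average={bin_average:.2f}, probability={probability:.3f}")
```

The bin averages and midpoints differ noticeably in the sparse bins (for example, the 2-3ft bin holds a single observation at 2.91, far from the midpoint of 2.5).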

<!-- If we define the function $Pr(x)$ as "the probability that $x$ equals some value," then for our discretization we would define: 

$$
\begin{aligned}
f_X(x) = Pr(x \in B_x),
\end{aligned}
$$

where $x \in B_x$ is read as "the probability that $x$ falls within the bin it belongs to $B_x$, instead of any other bin.
 -->

The histogram and PMF serve different purposes, and we will use both:

- The histogram is easiest to use for *visual assessment of a distribution.* You can learn a lot about a distribution by just looking at it, and the histogram can be great for that (with caveats).
- The PMF is easier to use when you are doing math; we will see this later.

One final note: typically the histogram and PMF are not drawn on the same plot. Instead they may look like: 

![Histogram and PMF Side-by-Side](./histogram_and_PMF.png)

**Some Properties of the Example**

Let's talk about some *properties of this distribution.* 

The most common x-value in this distribution is called its *mode*. Visually this appears as the "peak" of a distribution, or the tallest bin. Here the modal bin is the 5-6ft bin, and the modal value ('the mode') is about 5.45, the x-value associated with this bin. We can see this directly by finding the tallest point in the PMF and finding its x-value.

The distribution is single-peaked -- there is only one obvious peak / mode. The distribution isn't perfectly symmetric -- there is more *probability mass* (aka the blue space in the histogram) to the left of the mode than to the right of the mode. 

We can figure out how many people fall between, say, 4-6ft by just adding up the probabilities for those bins from the PMF -- it turns out 80% (0.225 + 0.575) of people in this sample fall between 4-6ft.

There are only a few people in the lowest and highest bins, often called the "tails" of the distribution -- about 2.5% and 7.5% respectively. We will talk more about that when we talk about shapes of distributions.
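These properties can be read straight off the PMF. The sketch below hard-codes the five bins with bin averages rounded to two decimals and probabilities recomputed from the 40-value sample (1, 4, 9, 23, and 3 observations out of 40); the variable names are illustrative.

```python
# One entry per 1-foot bin: label, representative x-value, probability.
bins = ["2-3ft", "3-4ft", "4-5ft", "5-6ft", "6-7ft"]
a = [2.91, 3.52, 4.75, 5.45, 6.12]
p = [0.025, 0.100, 0.225, 0.575, 0.075]

# Mode: the bin with the largest probability, and its x-value.
mode_index = max(range(len(p)), key=lambda n: p[n])
print("modal bin:", bins[mode_index], "mode:", a[mode_index])

# Probability of falling between 4 and 6 feet: add the two middle bins.
prob_4_to_6 = p[2] + p[3]
print(f"P(4-6ft) = {prob_4_to_6:.3f}")

# Tails: the lowest and highest bins.
lower_tail, upper_tail = p[0], p[-1]
print(f"lower tail = {lower_tail:.3f}, upper tail = {upper_tail:.3f}")
```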

To talk about more properties of the distribution, we need to learn a few more concepts.

## 2.B Expected Value / Mean / Average

Let's talk about one of the most powerful and common functions of a distribution: the "expected value," $E[X]$. This is often also called the "mean." It is also called the "first moment," which we will discuss later. The expected value is the foundation of many other descriptions of a distribution.  

The expected value is the *weighted average* of all values in a distribution, where the weights are the probabilities associated with each value. 

We'll write that down explicitly below. First we need to define some math terms. 

As we mentioned earlier, we will use the histogram when we want to look at a distribution for intuition, and we will use the PMF when we talk about math. We're talking about math now, so we will mostly use the PMF.

**Some Math**

Our discrete distribution is defined by a few things: 

- The set of cutoffs $B$ that define our bins. These were listed in a bullet list above, but we can write them formally as $B = [(b_0, b_1], (b_1, b_2], ..., (b_{N-1}, b_{N}]]$, such that there are $N$ bins, and each one individually is labelled as $B_n=(b_{n-1}, b_n]$. We will go into detail about this notation in an appendix. This is the formal way to write our bullet list of buckets from earlier.
- The probabilities that are associated with each bin, which should sum to 1: $\vec{p}$
- The value that represents each bin, which we will denote with $\vec{a}$


The little arrow over the top of $\vec{p}$ and $\vec{a}$ simply means that the variable is a *collection* of values where the order matters -- the first value is conceptually different from the second value, etc. For a PMF, the order matters because the first probability in $\vec{p}$ corresponds to the first value in $\vec{a}$, etc. This kind of variable is called a "vector."

We write out $\vec{p}$ and $\vec{a}$ as:

$$
\begin{aligned}
\vec{p} &= [p_1, p_2, ...,p_n, ...,  p_{N}]\\
\vec{a} &= [a_1, a_2, ...,a_n, ..., a_{N}]
\end{aligned}
$$

Here we will use $N$ as the number of elements in each vector. For our example of height, with its five 1-foot bins, $N=5$.

To make this more clear, the first elements of our PMF vectors are: 
$$
\begin{aligned}
p_{1} &= 0.025 \\
a_{1} &= 2.91
\end{aligned}
$$
This is the left-most dot in the PMF plot. It means that the value $a_1=2.91$, or almost 3ft, occurs with probability $p_1=0.025,$ or 2.5%.

Here is our full PMF, with the bin averages rounded to two decimal places:
$$
\begin{aligned}
\vec{a} &= [2.91,  3.52,  4.75,  5.45,  6.12]\\
\vec{p} &= [0.025, 0.10,  0.225, 0.575, 0.075]
\end{aligned}
$$

If we plot a dot for each of the $a_{n},\;p_{n}$ values in order, we get our **PMF plot** above. 

If we plot a box for each bin $B_n$, where the area of the box (width $\times$ height) is equal to the probability associated with each bin, then we get our **histogram plot** above. 

**Calculating the Expected Value E[X]**

The expectation of the random variable X, written $E[X]$, is the weighted average of all the x-values that define a distribution, weighted by their probabilities.

Mathematically this is written: 

$$
E[X] = \sum\limits_{n=1}^{N} p_{n} \times a_{n}
$$

The $\Sigma$ is the capital Greek letter "sigma," which is used in math notation to indicate "add up everything that follows, iterating over the set of values from $n=1$ (below the sigma) to $n=N$ (above the sigma)." Sigma corresponds to our letter "S," for "summation."

For any specific problem, you replace the $N$ with whatever is the actual number.

For our example above, $N=5$ and the summation would look like:

$$
\begin{aligned}
E[X] = &\; 2.91\times 0.025 + 3.52\times 0.10 + 4.75\times 0.225 + \\
       &\; 5.45\times 0.575 + 6.12\times 0.075 \\
       \\
     \approx &\; 5.082
\end{aligned}
$$

(The symbol $\approx$ means "approximately equal"; we use it here because the bin averages shown are rounded to two decimal places and the final answer is rounded as well. Plugging in the rounded values exactly as printed gives about 5.086.)
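The weighted sum is easy to check by machine. The sketch below is one possible way to write it; the helper name `expected_value` is made up here. It uses the bin averages rounded to two decimals and probabilities recomputed from the 40-value sample (the 3-4ft bin holds 4 of 40 observations, so its probability is 0.10). Because the bin averages are rounded, the result lands near 5.086; with unrounded bin averages it would be about 5.082.

```python
a = [2.91, 3.52, 4.75, 5.45, 6.12]        # bin averages (rounded)
p = [0.025, 0.100, 0.225, 0.575, 0.075]   # bin probabilities


def expected_value(values, probabilities):
    """Weighted average of `values`, weighted by `probabilities`."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(prob * value for prob, value in zip(probabilities, values))


e_x = expected_value(a, p)
print(f"E[X] = {e_x:.3f}")
```

The guard on the probabilities is a cheap sanity check: a PMF whose weights do not sum to 1 is not a valid distribution, and the "expected value" computed from it would be meaningless.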

The expected value is our first basic example of a *function of a distribution*.

A *function of a distribution* takes all the values that define a distribution, the $\vec{a}$ and $\vec{p}$, and applies some mathematical function to these values to get a single value. Here that is the weighted sum. 

A function of a distribution tells you something about the entire distribution. The expected value tells you something about the "location" of the distribution. (There are other functions that tell you something about the location of the distribution as well, which we will discuss.)

One useful property of $E[X]$ is that we can apply a function to $X$, such as $X^2$, and the expected value is calculated the same way, now applying the function to $X$ throughout the summation.

In our example, 

$$
\begin{aligned}
E[X^2] = &\; 2.91^2\times 0.025 + 3.52^2\times 0.10 + 4.75^2\times 0.225 + \\
         &\; 5.45^2\times 0.575 + 6.12^2\times 0.075 \\
         \\
       \approx  &\; 26.42 
\end{aligned}
$$

Note that this is not equal to just squaring $E[X]$ from above: $E[X]^2 \approx 5.082^2 \approx 25.83 \neq 26.42$. This is not just due to rounding, but rather is due to the properties of the underlying math. We will learn these rules for $E[X]$ in an appendix later. Either way, you can always just directly calculate the expectation of a function of X as we just did. When in doubt, calculate directly! (You can also calculate directly as a gut check, after you've learned the underlying math rules.) 

The expected value is one way to measure the "location" of the "center" of a distribution, but it is not the only way. Other measures include mode, as described earlier, and median, which we will discuss next when we discuss the empirical cumulative distribution function.
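"Calculate directly" can be written as a small general-purpose helper. In this sketch the name `expectation_of` is hypothetical; it takes any function `g` and applies it to each x-value inside the summation. The PMF values are the rounded bin averages and the probabilities recomputed from the 40-value sample.

```python
a = [2.91, 3.52, 4.75, 5.45, 6.12]        # bin averages (rounded)
p = [0.025, 0.100, 0.225, 0.575, 0.075]   # bin probabilities


def expectation_of(g, values, probabilities):
    """E[g(X)]: apply g to each x-value, then take the weighted sum."""
    return sum(prob * g(value) for prob, value in zip(probabilities, values))


e_x = expectation_of(lambda x: x, a, p)          # E[X]
e_x_squared = expectation_of(lambda x: x**2, a, p)  # E[X^2]

print(f"E[X^2]  = {e_x_squared:.2f}")
print(f"E[X]^2  = {e_x**2:.2f}")
```

Squaring inside the sum and squaring after the sum give different numbers, so the order of operations matters.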

## The Sample Average

Let's go back to our random sample of heights, $\vec{x}$ with length $I$. (We use $i$ and $I$ to indicate that these are *individual* values or observations.)

When you are given a random sample of values, you can directly calculate what is called the "sample average." 

The *average* of the sample of values is denoted $\bar{x}$ (that's a little "bar" over the top, not an arrow), and defined mathematically as:

$$
\begin{aligned}
\bar{x} = \frac{1}{I}  \sum\limits_{i=1}^{I} x_i
\end{aligned}
$$

In words: you add up all the numbers and then divide the resulting total by the number of observations. Alternatively, this is a weighted sum where the weight is equal to $1/I$ for all values.

Recall our sample of 40 values, copied from above: 

$$
\begin{aligned}
\vec{x} = [&2.91, 3.09, 5.19, 5.76, 5.20, 4.93, 4.96, 5.60, 5.67, 5.86, \\
           &4.85, 5.60, 5.22, 5.71, 5.16, 5.27, 5.05, 5.07, 5.58, 3.65, \\
           &4.78, 5.07, 5.31, 6.17, 5.68, 5.50, 5.90, 5.33, 4.68, 4.85, \\
           &4.70, 4.51, 5.58, 3.82, 6.02, 4.46, 5.85, 6.16, 5.08, 3.51]
\end{aligned}
$$

If we add these all up we get $203.29$, and if we divide that by 40, we get $203.29/40 = 5.08225$.

This is very close to the expected value $E[X]=5.08234$ that we calculated for our distribution. 

This is not an accident -- the average $\bar{x}$ is an *estimator* of the theoretical expected value $E[X]$ of a distribution. An *estimator* is a mathematical expression that is often simpler to calculate than a theoretical value, and with enough data, the estimator will converge to the theoretical value if the right conditions are in place. We will talk more about estimators in the future. 

In addition, earlier, when we represented each bin in our histogram by the *average within a bin,* we were taking advantage of something called the *[Law of Total Expectation](https://en.wikipedia.org/wiki/Law_of_total_expectation),* which we will talk more about when we get to conditional expectations in a later chapter. Using the average of a bin to represent a bin exploits this property to make the moments of our discrete distribution extremely close to the moments of the sample used to create the discrete distribution. (More on all this later.)
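The following sketch checks this numerically on the 40-value sample. It computes the sample average directly, then rebuilds the expected value from bin probabilities and *unrounded* bin averages. Under that assumption the two agree up to floating-point error, so any small gap seen when working by hand would come from rounding the bin averages. Variable names are illustrative.

```python
heights = [
    2.91, 3.09, 5.19, 5.76, 5.20, 4.93, 4.96, 5.60, 5.67, 5.86,
    4.85, 5.60, 5.22, 5.71, 5.16, 5.27, 5.05, 5.07, 5.58, 3.65,
    4.78, 5.07, 5.31, 6.17, 5.68, 5.50, 5.90, 5.33, 4.68, 4.85,
    4.70, 4.51, 5.58, 3.82, 6.02, 4.46, 5.85, 6.16, 5.08, 3.51,
]
n_total = len(heights)

# Sample average: add everything up and divide by the number of observations.
sample_average = sum(heights) / n_total

# Group observations into 1-foot bins (int() gives the lower bin edge).
observations_by_bin = {}
for h in heights:
    observations_by_bin.setdefault(int(h), []).append(h)

# Expected value from the PMF: probability of each bin times its bin average.
e_x = sum(
    (len(values) / n_total) * (sum(values) / len(values))
    for values in observations_by_bin.values()
)

print(f"sample average = {sample_average:.5f}")
print(f"E[X] from PMF  = {e_x:.5f}")
```

The reason is visible in the arithmetic: each term is (count / total) times (bin sum / count), so the counts cancel and what remains is the sum of all observations divided by the total.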


### 2.B Homework: Histograms and Expected Value

You have been given a dataset called "health_data.csv." This is from the public CDC NHANES dataset (https://wwwn.cdc.gov/Nchs/Nhanes/) and contains the following variables: ParticipantID (anonymized id per person), Gender (1=Male, 2=Female), Age (years), Weight (lbs), Height (feet), BMI (kg/m^2)


1. Your job is to create a histogram of each of the following:
    1. height
    2. weight, and
    3. age

For each variable, choose bin cutoffs and then construct the histogram and PMF. Draw the resulting histograms and label them. The problem will be scored based on how well the histogram has been constructed (once you make the choice about bin cutoffs, there will be a correct answer, and the score will be based on that).

Make sure to find the point-probabilities for each bin and also write down the PMF for your distribution. You can insert the dots into the histogram if there is room, or draw the PMF side-by-side with the histogram as we did above. 

For height and age, use the average of the values within the bin for your dots/PMF. For weight, use the midpoint of the bins as your dots/PMF. 

2. Describe your histograms in words. Are they single-peaked? What is the mode for each? What is the probability in the smallest and largest tail?

3. Calculate the expected value $E[X]$ for height, weight, and age, using your PMF functions. Then calculate the *sample average* $\bar{x}$ using just the sample of the data itself (not using your histogram). How does the $E[X]$ compare to the $\bar{x}$ for each variable?

4. Finally, calculate $E[ (703 \times Weight) / (12 \times Height)^2 ]$ using the approach described at the end of the "Calculating the Expected Value $E[X]$" section. This is the expected value of BMI in the dataset. Compare this to the *sample average* $\overline{BMI}$, directly calculated on the BMI variable in the dataset. What are these two values? Provide the two values you calculate and describe any difference that you see. 

5. Looking at this data, what stands out to you? (This is just meant to be a "free form" description, say whatever you like here.)


## 2.C Cumulative Distribution Function (CDF and ECDF); the Probability Function

So far we have described the *probability density function* and its discretized version, the *probability mass function* (PMF). This is often expressed as $f_X(x)$. The PMF gives us the probability that a random value from the distribution falls into the bucket that $x$ falls into. 

Another major way to describe a distribution is the following: when given some value $x$, produce the probability that a random value from the distribution is *less than or equal to* $x$. This is called the *cumulative distribution function* (CDF), and is often written as $F_X(x)$. 

We don't need to explicitly discretize this function to use it easily, because when we construct this function from data, it discretizes itself automatically. The version of $F_X(x)$ constructed from data is called the "empirical cumulative distribution function," or ECDF. We will discuss how this is constructed in the next section.

One final important note: this is a good time to emphasize that both the PMF $f_X(x)$ and CDF $F_X(x)$ do something similar: when you give them an $x$, they *tell you something about the distribution.* 

Specifically, they tell you something about the probability of drawing a random variable that relates to $x$ in a particular way. For $f_X(x)$ this is the probability that a random draw from the entire distribution will come from the same bin that $x$ falls into. For $F_X(x)$ this is the probability that a random draw is less than or equal to $x$. (You might notice that you can use $F_X(x)$ to create $f_X(x)$ -- you would be correct! We will demonstrate that in the homework for this section.) 

This can be confusing the first few times you encounter it. It seems like the functions should tell you something about $x$ -- and they do, in the sense that they are telling you something about the *distribution of x*.
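As a quick bin-level illustration of how the two functions relate, the sketch below turns the five bin probabilities from the height example into cumulative probabilities at each bin's upper edge, and then recovers the bin probabilities by taking differences. It works with binned values only, which is a simplification; the variable names are illustrative.

```python
from itertools import accumulate

upper_edges = [3, 4, 5, 6, 7]             # upper edge of each 1-foot bin
p = [0.025, 0.100, 0.225, 0.575, 0.075]   # f_X: probability of each bin

# F_X at each upper edge: running total of the bin probabilities.
cdf_at_edges = list(accumulate(p))

# Going back: each bin probability is the jump in F_X across that bin.
recovered_p = [cdf_at_edges[0]] + [
    upper - lower for lower, upper in zip(cdf_at_edges, cdf_at_edges[1:])
]

for edge, cumulative in zip(upper_edges, cdf_at_edges):
    print(f"F_X({edge}) = {cumulative:.3f}")
```

Both directions use only addition and subtraction, which is why the two representations carry the same information.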

#### Constructing the CDF / ECDF

Let's use the same dataset as before to construct the ECDF from this data.

Recall our sample of 40 values, copied from above: 

$$
\begin{aligned}
\vec{x} = [&2.91, 3.09, 5.19, 5.76, 5.20, 4.93, 4.96, 5.60, 5.67, 5.86, \\
           &4.85, 5.60, 5.22, 5.71, 5.16, 5.27, 5.05, 5.07, 5.58, 3.65, \\
           &4.78, 5.07, 5.31, 6.17, 5.68, 5.50, 5.90, 5.33, 4.68, 4.85, \\
           &4.70, 4.51, 5.58, 3.82, 6.02, 4.46, 5.85, 6.16, 5.08, 3.51]
\end{aligned}
$$

We can draw the empirical cumulative distribution function by sorting the data and then, for each datapoint, computing the fraction of datapoints that are less than or equal to it (the count divided by the total number of observations). If we plot those points only, we get something that looks like this: 

![Points only](./ECDF_points_only.png)

Notice that the right-most point is all the way up at 1 -- for this value, when counting itself, the fraction of observations that are <= this value is all of them, or 1. 
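The sort-and-count recipe can be sketched as follows. This version uses `bisect_right` from Python's standard library to count how many sorted values are less than or equal to `x`, which also handles repeated values in the sample (such as the two 5.60s). The function name `ecdf` is just a label for this example.

```python
from bisect import bisect_right

heights = [
    2.91, 3.09, 5.19, 5.76, 5.20, 4.93, 4.96, 5.60, 5.67, 5.86,
    4.85, 5.60, 5.22, 5.71, 5.16, 5.27, 5.05, 5.07, 5.58, 3.65,
    4.78, 5.07, 5.31, 6.17, 5.68, 5.50, 5.90, 5.33, 4.68, 4.85,
    4.70, 4.51, 5.58, 3.82, 6.02, 4.46, 5.85, 6.16, 5.08, 3.51,
]

sorted_heights = sorted(heights)
n_total = len(sorted_heights)


def ecdf(x: float) -> float:
    """Fraction of observations that are less than or equal to x."""
    return bisect_right(sorted_heights, x) / n_total


# One (x, fraction) point per observation, ready to plot.
ecdf_points = [(x, ecdf(x)) for x in sorted_heights]

print(ecdf_points[0])    # smallest observation
print(ecdf_points[-1])   # largest observation reaches 1.0
```

The function can be evaluated at any `x`, not only at observed values: for example, `ecdf(5.0)` gives the fraction of the sample at or below 5 feet.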

...
