# TOOLS OF PARAMETER ESTIMATION: THE PDF, CDF, AND QUANTILE FUNCTION

#### The most common mathematical use of the PDF is in integration, to solve for probabilities associated with various ranges, just as we did in the previous section. However, we can save ourselves a lot of effort with the cumulative distribution function (CDF), which accumulates the probability of every value up to a given point, replacing much of the calculus work.
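
#### As a minimal sketch of that shortcut, the snippet below computes the probability of a range two ways: by numerically integrating the PDF and by subtracting two CDF values. The Beta(2, 5) distribution and the range 0.2 to 0.5 are illustrative assumptions, not values from the text.

```python
from scipy import integrate, stats

# Hypothetical distribution and range, chosen only for illustration
dist = stats.beta(2, 5)
a, b = 0.2, 0.5

# Probability of the range by numerically integrating the PDF
prob_integral, _ = integrate.quad(dist.pdf, a, b)

# Same probability from the CDF: P(a < X <= b) = F(b) - F(a)
prob_cdf = dist.cdf(b) - dist.cdf(a)

print(f"integral of PDF: {prob_integral:.6f}")
print(f"CDF difference:  {prob_cdf:.6f}")
```

#### Both lines should print the same number up to numerical precision, which is why the CDF can stand in for the integration step.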

## Visualizing and Interpreting the CDF
#### The PDF is most useful visually for quickly estimating where the peak of a distribution is, and for getting a rough sense of the width (variance) and shape of a distribution. However, with the PDF it is very difficult to reason about the probability of various ranges visually. The CDF is a much better tool for this. For example, we can use the CDF in Figure 13-4 to visually reason about a much wider range of probabilistic estimates for our problem than we can using the PDF alone. 

## Finding the Median
#### The median is the point in the data at which half the values fall on one side and half on the other—it is the exact middle value of our data. In other words, the probability of a value being greater than the median and the probability of it being less than the median are both 0.5. The median is particularly useful for summarizing the data in cases where it contains extreme values.
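
#### To illustrate why the median holds up against extreme values, here is a small sketch comparing it with the mean. The five measurements are made up for the example and do not come from the text.

```python
import numpy as np

# Hypothetical measurements with one extreme value (illustrative only)
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

mean_value = np.mean(values)      # pulled toward the extreme value
median_value = np.median(values)  # middle value, unaffected by the outlier

# Count how many values fall on each side of the median
below = int(np.sum(values < median_value))
above = int(np.sum(values > median_value))

print(f"mean = {mean_value}, median = {median_value}")
print(f"{below} values below the median, {above} above")
```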

## Estimating Confidence Intervals

#### Looking at the probability of ranges of values leads us to a very important concept in probability: the confidence interval. A confidence interval is a lower and upper bound of values, typically centered on the mean, describing a range of high probability, usually 95, 99, or 99.9 percent. When we say something like “The 95 percent confidence interval is from 12 to 20,” what we mean is that there is a 95 percent probability that our true measurement is somewhere between 12 and 20. Confidence intervals provide a good method of describing the range of possibilities when we’re dealing with uncertain information.

#### While computing the inverse of the CDF visually is easy for estimates, we need a separate mathematical function to compute it for exact values. The inverse of the CDF is an incredibly common and useful tool called the quantile function. To compute an exact value for our median and confidence interval, we need to use the quantile function for the beta distribution. Just like the CDF, the quantile function is often very tricky to derive and use mathematically, so instead we rely on software to do the hard work for us.
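
#### A short sketch of how software handles this: SciPy exposes the quantile function of each distribution as `ppf` (percent point function). The Beta(2, 5) parameters below are an assumption for illustration only; substitute the parameters of the actual problem.

```python
from scipy import stats

# Hypothetical beta distribution (illustrative parameters only)
dist = stats.beta(2, 5)

# Median: the value whose cumulative probability is 0.5
median = dist.ppf(0.5)

# Interval covering the central 95% of the probability
lower, upper = dist.ppf(0.025), dist.ppf(0.975)

# The quantile function inverts the CDF, so the round trip recovers 0.5
round_trip = dist.cdf(median)

print(f"median: {median:.4f}")
print(f"95% interval: ({lower:.4f}, {upper:.4f})")
print(f"cdf(median): {round_trip:.4f}")
```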

## CDF vs. PDF

#### Key Difference: The PDF describes the probability density at a given value, while the CDF provides the cumulative probability up to that value. In simple terms, the PDF tells you the relative likelihood of the random variable at a precise value, and the CDF tells you the likelihood of the random variable being less than or equal to a certain value.
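
#### The sketch below makes the difference concrete. It uses a deliberately narrow normal distribution (mean 0, standard deviation 0.1, both assumptions chosen for illustration) to show that a density can exceed 1 while a cumulative probability cannot, and that the PDF is the slope of the CDF.

```python
from scipy import stats

# Hypothetical narrow normal distribution (illustrative only)
dist = stats.norm(loc=0.0, scale=0.1)

x = 0.0
density = dist.pdf(x)      # a density, not a probability; it may exceed 1
cumulative = dist.cdf(x)   # P(X <= x), always between 0 and 1

# The PDF is the derivative of the CDF: approximate it with a central difference
h = 1e-5
slope = (dist.cdf(x + h) - dist.cdf(x - h)) / (2 * h)

print(f"pdf({x}) = {density:.4f}")
print(f"cdf({x}) = {cumulative:.4f}")
print(f"slope of cdf at {x} = {slope:.4f}")
```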

## Exercise 2:

#### Returning to the task of measuring snowfall from Chapter 10, say you have the following measurements (in inches) of snowfall: 7.8, 9.4, 10.0, 7.9, 9.4, 7.0, 7.0, 7.1, 8.9, 7.4. What is your 99.9 percent confidence interval for the true value of snowfall?

In [1]:
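import numpy as np
import scipy.stats as stats

# Snowfall measurements in inches
data = np.array([7.8, 9.4, 10.0, 7.9, 9.4, 7.0, 7.0, 7.1, 8.9, 7.4])

# Sample mean, sample standard deviation (ddof=1), and sample size
mean_snowfall = np.mean(data)
std_snowfall = np.std(data, ddof=1)
n = len(data)

# Two-tailed 99.9% critical value from the t distribution with n - 1 degrees of freedom
t_critical = stats.t.ppf(0.9995, n - 1)

# Margin of error: critical value times the standard error of the mean
margin_error = t_critical * (std_snowfall / np.sqrt(n))

confidence_interval = (mean_snowfall - margin_error, mean_snowfall + margin_error)

print(f"99.9% Confidence Interval for the true mean snowfall: {confidence_interval}")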

99.9% Confidence Interval for the true mean snowfall: (6.474413856518159, 9.905586143481845)


## Using Normal Distribution

In [2]:
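# Quantiles of a normal distribution with the sample mean and sample standard deviation.
# Note: scale is the standard deviation of the data, not the standard error of the mean,
# so this interval is wider than the t-based interval for the mean computed earlier.
lower = stats.norm.ppf(0.0005, loc=mean_snowfall, scale=std_snowfall)
upper = stats.norm.ppf(0.9995, loc=mean_snowfall, scale=std_snowfall)

print(f"99.9% Confidence Interval (Normal): ({lower:.2f}, {upper:.2f})")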

99.9% Confidence Interval (Normal): (4.46, 11.92)


##### Two-Tailed Test/Interval: This involves looking for extreme values at both ends of the distribution. For a confidence interval, this means we're concerned with deviations from the mean in both directions—both above and below.
##### In the context of a 99.9% confidence interval for mean snowfall, saying it's "two-tailed" means we're considering the possibility that the true mean could be either significantly higher or significantly lower than our sample mean. Thus, we split our alpha level (the remaining 0.1% probability) between the lower and upper ends of the distribution, placing 0.05% in each tail.
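
##### The alpha split described here can be sketched as follows. The example uses the standard normal distribution purely for illustration; the variable names are not from the original notebook.

```python
from scipy import stats

confidence = 0.999
alpha = 1 - confidence          # total probability left outside the interval

# Split alpha evenly between the two tails
lower_q = alpha / 2             # 0.0005 in the lower tail
upper_q = 1 - alpha / 2         # 0.9995, leaving 0.0005 in the upper tail

# Critical values on the standard normal distribution
z_lower = stats.norm.ppf(lower_q)
z_upper = stats.norm.ppf(upper_q)

# The probability between the two critical values equals the confidence level
coverage = stats.norm.cdf(z_upper) - stats.norm.cdf(z_lower)

print(f"quantiles: {lower_q:.4f}, {upper_q:.4f}")
print(f"critical values: {z_lower:.4f}, {z_upper:.4f}")
print(f"coverage: {coverage:.4f}")
```

##### This is why the snowfall code passes 0.0005 and 0.9995 to `ppf` rather than 0.001 and 0.999.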

## Exercise 3:

#### A child is going door to door selling candy bars. So far she has visited 30 houses and sold 10 candy bars. She will visit 40 more houses today. What is the 95 percent confidence interval for how many candy bars she will sell the rest of the day?

## Standard Error (SE)
#### Definition: The standard error of a statistic (in this case, the sample proportion) is the standard deviation of that statistic's sampling distribution: it measures how much the estimate would vary from sample to sample around the true population value. It quantifies the uncertainty inherent in estimating a population parameter from a sample.
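
#### One way to see what the standard error measures is a small simulation, sketched here under stated assumptions: a hypothetical true proportion of 1/3, samples of 30 houses, and a fixed random seed. It compares the formula sqrt(p(1 - p) / n) with the spread of proportions observed across many repeated samples.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

p_true = 1 / 3       # hypothetical true proportion (assumption)
n_houses = 30        # sample size per simulated day
n_repeats = 100_000  # number of simulated samples

# Sample proportion from each simulated sample of n_houses houses
sample_props = rng.binomial(n_houses, p_true, size=n_repeats) / n_houses

# Standard error from the formula, and its simulated counterpart
se_formula = np.sqrt(p_true * (1 - p_true) / n_houses)
se_simulated = sample_props.std()

print(f"formula:   {se_formula:.4f}")
print(f"simulated: {se_simulated:.4f}")
```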

In [3]:
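from scipy.stats import norm

# Observed sales rate: 10 candy bars sold at 30 houses
p = 10 / 30
n = 30

# Standard error of the sample proportion
SE = np.sqrt(p * (1 - p) / n)

# z critical value for a two-tailed 95% interval, from the normal quantile function (ppf)
z = norm.ppf(0.975)

# Margin of error and interval for the sales rate
MOE = z * SE
CI_lower = p - MOE
CI_upper = p + MOE

# Scale the rate interval to the 40 remaining houses
sales_lower = CI_lower * 40
sales_upper = CI_upper * 40

print(f"95% Confidence Interval for candy bars sold in next 40 houses: {sales_lower:.2f} to {sales_upper:.2f}")

95% Confidence Interval for candy bars sold in next 40 houses: 6.59 to 20.08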
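
#### The calculation above relies on a normal approximation. As a hedged alternative sketch, the quantile function of a beta distribution (the tool described earlier in this section) can be applied to the same sales rate. Modeling the rate as Beta(10, 20), that is 10 sales and 20 non-sales with no prior, is an assumption made here for illustration, and scaling the rate by 40 ignores the extra randomness of the individual sales.

```python
from scipy import stats

sold, not_sold = 10, 20   # observed so far: 10 sales at 30 houses
houses_left = 40

# Assumed model for the sales rate (no prior information added)
rate = stats.beta(sold, not_sold)

# Central 95% interval for the rate via the quantile function
rate_lower, rate_upper = rate.ppf(0.025), rate.ppf(0.975)

# Scale the rate interval to the remaining houses
beta_sales_lower = rate_lower * houses_left
beta_sales_upper = rate_upper * houses_left

print(f"95% interval (beta): {beta_sales_lower:.2f} to {beta_sales_upper:.2f}")
```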
