# Random Sampling and Sample Bias
A **sample** is a subset of data from a larger data set; statisticians call this larger
data set the **population**. A population in statistics is not the same thing as in
biology: it is a large, defined (but sometimes theoretical or imaginary) set of
data.

### KEY TERMS FOR RANDOM SAMPLING
**Sample**:
A subset from a larger data set.

**Population**:
The larger data set or idea of a data set.

**N (n)**:
The size of the population (sample).

**Random sampling**:
Drawing elements into a sample at random.

**Stratified sampling**:
Dividing the population into strata and randomly sampling from each stratum.

**Simple random sample**:
The sample that results from random sampling without stratifying the population.

**Sample bias**:
A sample that misrepresents the population.
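
The difference between a simple random sample and a stratified sample can be sketched in a few lines of NumPy. The population, the stratum labels `"A"` and `"B"`, and the sample size are made up for illustration; this is one possible way to do it, not a prescribed procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 1,000 units, 80% in stratum "A" and 20% in stratum "B"
N = 1000
strata = np.array(["A"] * 800 + ["B"] * 200)
values = np.where(strata == "A", rng.normal(50, 5, N), rng.normal(70, 5, N))

n = 100  # sample size

# Simple random sample: every unit has the same chance of selection
srs_idx = rng.choice(N, size=n, replace=False)

# Stratified sample: sample each stratum separately, in proportion to its size
strat_idx = np.concatenate([
    rng.choice(np.flatnonzero(strata == s),
               size=n * np.sum(strata == s) // N,
               replace=False)
    for s in ("A", "B")
])

print("population mean:          ", values.mean())
print("simple random sample mean:", values[srs_idx].mean())
print("stratified sample mean:   ", values[strat_idx].mean())
print("share of B, simple random:", np.mean(strata[srs_idx] == "B"))
print("share of B, stratified:   ", np.mean(strata[strat_idx] == "B"))
```

The stratified sample always contains exactly 20% of its units from stratum B, whereas the share in the simple random sample varies by chance.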

# Selection Bias
Selection bias refers to the practice of selectively choosing data, consciously or
unconsciously, in a way that leads to a conclusion that is misleading or
ephemeral.

### KEY TERMS
**Bias**:
Systematic error.

**Data snooping**:
Extensive hunting through data in search of something interesting.

**Vast search effect**:
Bias or nonreproducibility resulting from repeated data modeling, or modeling data with large

# Sampling Distribution of a Statistic
The term **sampling distribution** of a statistic refers to the distribution of some
sample statistic, over many samples drawn from the same population. Much of
classical statistics is concerned with making inferences from (small) samples to
(very large) populations.

### KEY TERMS
**Sample statistic**:
A metric calculated for a sample of data drawn from a larger population.

**Data distribution**:
The frequency distribution of individual values in a data set.

**Sampling distribution**:
The frequency distribution of a sample statistic over many samples or resamples.

**Central limit theorem**:
The tendency of the sampling distribution to take on a normal shape as sample size rises.

**Standard error**:
The variability (standard deviation) of a sample statistic over many samples (not to be confused
with standard deviation, which, by itself, refers to variability of individual data values).
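
The terms above can be illustrated with a small simulation. The sketch below assumes a made-up, right-skewed population (exponential) and compares the sampling distribution of the mean for a few sample sizes; the helper `skewness` and all sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical, strongly right-skewed population (exponential with mean 1)
population = rng.exponential(scale=1.0, size=100_000)

def skewness(a):
    """Third standardized moment; 0 for a symmetric distribution."""
    a = np.asarray(a)
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

results = {}
for n in (5, 20, 100):
    # Sampling distribution of the mean: 2,000 samples of size n
    means = rng.choice(population, size=(2000, n)).mean(axis=1)
    results[n] = (means.std(ddof=1), skewness(means))
    print(f"n={n:>3}  sd of sample means={results[n][0]:.3f}  "
          f"skewness={results[n][1]:.2f}")

print("skewness of the data distribution:", round(skewness(population), 2))
```

As the sample size rises, the spread of the sample means (the standard error) shrinks and their skewness moves toward 0, which is the behavior the central limit theorem describes.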


## KEY IDEA
Specifying a hypothesis, then collecting data following randomization and random sampling
principles, ensures against bias.

All other forms of data analysis run the risk of bias resulting from the data collection/analysis
process (repeated running of models in data mining, data snooping in research, and after-the-fact
selection of interesting events).
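
The risk from an extensive search can be made concrete with a small simulation. In this sketch the target and all predictors are pure noise (an assumption chosen for illustration), yet hunting through enough of them still turns up a predictor that looks strongly related to the target.

```python
import numpy as np

rng = np.random.default_rng(42)

n_obs, n_predictors = 30, 1000

# Target and predictors are independent noise: no real relationship exists
y = rng.normal(size=n_obs)
X = rng.normal(size=(n_obs, n_predictors))

# Correlation of every predictor with the target
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_predictors)])

best = np.argmax(np.abs(corrs))
print("typical |correlation| (median):", np.median(np.abs(corrs)))
print("strongest correlation found:   ", corrs[best])
```

The strongest correlation is an artifact of the search itself; a hypothesis fixed before looking at the data would not benefit from this selection.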

# Sampling Distribution of a Statistic

### Standard Error
The standard error is a single metric that sums up the variability in the sampling
distribution for a statistic. The standard error can be estimated using a statistic
based on the standard deviation s of the sample values, and the sample size n:

$ \text{Standard Error} = SE = \frac{s}{\sqrt{n}} $

In fact, you don’t need to rely on the central limit
theorem to understand standard error. Consider the following approach to measure
standard error:
1. Collect a number of brand new samples from the population.
2. For each new sample, calculate the statistic (e.g., mean).
3. Calculate the standard deviation of the statistics computed in step 2; use
this as your estimate of standard error.
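
The two routes to the standard error can be compared in a short simulation. This is a sketch under stated assumptions: `draw_sample` is a hypothetical stand-in for collecting a brand new sample, and the population (normal, mean 50, standard deviation 10) is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40

def draw_sample(size):
    """Hypothetical stand-in for collecting a brand new sample from the population."""
    return rng.normal(loc=50, scale=10, size=size)

# Formula-based estimate from a single sample: SE = s / sqrt(n)
sample = draw_sample(n)
se_formula = sample.std(ddof=1) / np.sqrt(n)

# Steps 1-3: many new samples, one mean per sample, then the sd of those means
means = [draw_sample(n).mean() for _ in range(2000)]
se_repeated = np.std(means, ddof=1)

print("SE from formula (one sample):", se_formula)
print("SE from repeated sampling:   ", se_repeated)
```

Both numbers estimate the same quantity. In practice, collecting thousands of new samples is rarely feasible, which is what motivates resampling from the one sample you have.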

# The Bootstrap
### KEY TERMS
**Bootstrap sample**:
A sample taken with replacement from an observed data set.

**Resampling**:
The process of taking repeated samples from observed data; includes both bootstrap and
permutation (shuffling) procedures.

One easy and effective way to estimate the sampling distribution of a statistic, or
of model parameters, is to draw additional samples, with replacement, from the
sample itself and recalculate the statistic or model for each resample. This
procedure is called the bootstrap, and it does not necessarily involve any
assumptions about the data or the sample statistic being normally distributed.

Conceptually, you can imagine the bootstrap as replicating the original sample
thousands or millions of times so that you have a hypothetical population that
embodies all the knowledge from your original sample (it’s just larger). You can
then draw samples from this hypothetical population for the purpose of estimating
a sampling distribution.

In practice, it is not necessary to actually replicate the sample a huge number of
times. We simply replace each observation after each draw; that is, we sample
with replacement. In this way we effectively create an infinite population in
which the probability of an element being drawn remains unchanged from draw to
draw. The algorithm for a bootstrap resampling of the mean is as follows, for a
sample of size n:
1. Draw a sample value, record it, and replace it.
2. Repeat n times.
3. Record the mean of the n resampled values.
4. Repeat steps 1–3 R times.
5. Use the R results to:
a. Calculate their standard deviation (this estimates sample mean
standard error).
b. Produce a histogram or boxplot.
c. Find a confidence interval.
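
The steps above can be sketched directly in NumPy. This is a minimal illustration using the 15 data values from this section; the number of resamples `R` and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

data = np.array([7, 9, 10, 10, 12, 14, 15, 16, 16, 17, 19, 20, 21, 21, 23])
n = len(data)
R = 5000  # number of bootstrap resamples (arbitrary choice)

# Steps 1-4: draw n values with replacement, record the mean, repeat R times
boot_means = np.array([
    rng.choice(data, size=n, replace=True).mean() for _ in range(R)
])

# Step 5a: the standard deviation of the R means estimates the standard error
se_boot = boot_means.std(ddof=1)

print("sample mean:             ", data.mean())
print("bootstrap SE of the mean:", se_boot)
print("formula SE, s / sqrt(n): ", data.std(ddof=1) / np.sqrt(n))
```

The array `boot_means` can also be passed to a histogram or boxplot routine (step 5b) to inspect the shape of the estimated sampling distribution.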

In [22]:
from scipy.stats import bootstrap
import numpy as np
import random

#define array of data values
data = [7, 9, 10, 10, 12, 14, 15, 16, 16, 17, 19, 20, 21, 21, 23]

#wrap the data in a tuple: bootstrap() expects a sequence of samples
data = (data,)

#calculate 95% bootstrapped confidence interval for median
bootstrap_ci = bootstrap(data, np.median, confidence_level=0.95,
                         random_state=1, method='percentile')

#view 95% bootstrapped confidence interval
print(bootstrap_ci.confidence_interval)

ConfidenceInterval(low=10.0, high=20.0)


**For example, we can change np.median to np.std within the bootstrap() function to instead calculate a 95% confidence interval for the standard deviation:**

In [2]:
#calculate 95% bootstrapped confidence interval for standard deviation
bootstrap_ci = bootstrap(data, np.std, confidence_level=0.95,
                         random_state=1, method='percentile')

#view 95% bootstrapped confidence interval
print(bootstrap_ci.confidence_interval)

ConfidenceInterval(low=3.3199732261303283, high=5.66478399066117)


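**Or run the bootstrap by hand, resampling with replacement:**

In [3]:
x = np.random.normal(loc=15, size=100)
#The loc parameter controls the mean of the output data. Default = 0.
#The scale parameter controls the standard deviation of the normal distribution. Default = 1.
print(np.mean(x))

sample_mean = []
for i in range(50):
    #resample with replacement, same size as the original sample
    #(random.sample draws without replacement, which is not a bootstrap)
    y = random.choices(x.tolist(), k=len(x))
    avg = np.mean(y)
    sample_mean.append(avg)

#no seed is set, so the printed values change from run to run
print(np.mean(sample_mean))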

14.977760970939983
14.9760282677799


### KEY IDEAS
The bootstrap (sampling with replacement from a data set) is a powerful tool for assessing the
variability of a sample statistic.

The bootstrap can be applied in similar fashion in a wide variety of circumstances, without
extensive study of mathematical approximations to sampling distributions.

It also allows us to estimate sampling distributions for statistics where no mathematical
approximation has been developed.

When applied to predictive models, aggregating multiple bootstrap sample predictions (bagging)
typically outperforms the use of a single model.

# Confidence Intervals
Frequency tables, histograms, boxplots, and standard errors are all ways to
understand the potential error in a sample estimate. Confidence intervals are
another.

### KEY TERMS
**Confidence level**:
The percentage of confidence intervals, constructed in the same way from the same population,
expected to contain the statistic of interest.

**Interval endpoints**:
The top and bottom of the confidence interval.

Given a sample of size n, and a sample statistic of interest, the algorithm for a
bootstrap confidence interval is as follows:
1. Draw a random sample of size n with replacement from the data (a
resample).
2. Record the statistic of interest for the resample.
3. Repeat steps 1–2 many (R) times.
4. For an x% confidence interval, trim [(100 - x) / 2]% of the R
resample results from either end of the distribution.
5. The trim points are the endpoints of an x% bootstrap confidence interval.

**Ex1:** If x is 95, trim (100 - 95) / 2 = 2.5% of the results from each end. The endpoints of the interval are then the 2.5th and 97.5th percentiles of the R resample results.
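
The trimming procedure can be written out explicitly. The function below is a minimal sketch, not a library API: the name `bootstrap_ci`, its parameters, and the default of 2,000 resamples are assumptions made for this example. It reuses the 15 data values from the bootstrap section.

```python
import numpy as np

def bootstrap_ci(data, statistic, x=95, R=2000, seed=0):
    """Percentile bootstrap interval obtained by trimming sorted resample results."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)

    # Steps 1-3: R resamples of size n, one statistic per resample
    results = np.sort([
        statistic(rng.choice(data, size=n, replace=True)) for _ in range(R)
    ])

    # Step 4: trim (100 - x) / 2 percent of the results from each end
    k = int(R * (100 - x) / 200)
    trimmed = results[k:R - k]

    # Step 5: the trim points are the interval endpoints
    return trimmed[0], trimmed[-1]

data = [7, 9, 10, 10, 12, 14, 15, 16, 16, 17, 19, 20, 21, 21, 23]
for level in (90, 95, 99):
    low, high = bootstrap_ci(data, np.mean, x=level)
    print(f"{level}% interval for the mean: ({low:.2f}, {high:.2f})")
```

Because the same seed is used for each call, the three intervals come from the same resamples and differ only in how much is trimmed.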

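**Ex2:** The sample size is 25, the sample mean is 4.5, and the sample standard deviation is 2.5. For a 97% confidence level under a normal approximation, the multiplier is the critical value z ≈ 2.17 (the 98.5th percentile of the standard normal distribution), not the confidence level 0.97 itself:

Confidence interval = 4.5 ± 2.17(2.5 ÷ √25) = 4.5 ± 2.17(2.5 ÷ 5) = 4.5 ± 2.17(0.5) = 4.5 ± 1.085, that is, from 3.415 to 5.585

With a sample this small, a critical value from the t-distribution (24 degrees of freedom) would be somewhat larger, giving a slightly wider interval.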

The percentage associated with the confidence interval is termed the level of
confidence. The higher the level of confidence, the wider the interval. Also, the
smaller the sample, the wider the interval (i.e., the more uncertainty). Both make
sense: the more confident you want to be, and the less data you have, the wider
you must make the confidence interval to be sufficiently assured of capturing the
true value.
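
Both effects can be checked numerically. The sketch below uses a normal-approximation interval (mean ± z · s / √n) rather than the bootstrap, with the sample values from Ex2 (mean 4.5, standard deviation 2.5, n = 25); the function name `normal_ci` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist

def normal_ci(mean, s, n, level):
    """Normal-approximation interval: mean +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    half_width = z * s / sqrt(n)
    return mean - half_width, mean + half_width

# Higher confidence level, same data: wider interval
for level in (0.90, 0.95, 0.99):
    low, high = normal_ci(4.5, 2.5, 25, level)
    print(f"level={level:.2f}  n=25   width={high - low:.3f}")

# Larger sample, same confidence level: narrower interval
for n in (25, 100, 400):
    low, high = normal_ci(4.5, 2.5, n, 0.95)
    print(f"level=0.95  n={n:<4} width={high - low:.3f}")
```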

### KEY IDEAS
Confidence intervals are the typical way to present estimates as an interval range.

The more data you have, the less variable a sample estimate will be.

The lower the level of confidence you can tolerate, the narrower the confidence interval will be.

The bootstrap is an effective way to construct confidence intervals.

# Normal Distribution
The bell-shaped normal distribution is iconic in traditional statistics. The fact
that distributions of sample statistics are often normally shaped has made it a
powerful tool in the development of mathematical formulas that approximate those
distributions.

### KEY TERMS
**Error**:
The difference between a data point and a predicted or average value.

**Standardize**:
Subtract the mean and divide by the standard deviation.

**z-score**:
The result of standardizing an individual data point.

**Standard normal**:
A normal distribution with mean = 0 and standard deviation = 1.

**QQ-Plot**:
A plot to visualize how close a sample distribution is to a normal distribution.

**In a normal distribution (Figure 2-10), 68% of the data lies within one standard
deviation of the mean, and 95% lies within two standard deviations.**


![image.png](attachment:image.png)
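
The 68% and 95% figures can be verified from the normal cumulative distribution function. A minimal check using Python's standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

within_1 = z.cdf(1) - z.cdf(-1)  # share within one standard deviation
within_2 = z.cdf(2) - z.cdf(-2)  # share within two standard deviations

print(round(within_1, 4))  # 0.6827
print(round(within_2, 4))  # 0.9545
```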

# Standard Normal and QQ-Plots
A standard normal distribution is one in which the units on the x-axis are
expressed in terms of standard deviations away from the mean. 

To compare data to a standard normal distribution, you subtract the mean and then divide by the
standard deviation; this is also called normalization or standardization. Note that
“standardization” in this sense is unrelated to database record standardization
(conversion to a common format). The transformed value is termed a z-score, and the
normal distribution is sometimes called the z-distribution.

![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)
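
Standardization is a one-line computation. The sketch below applies it to the 15 data values from the bootstrap section; using the sample standard deviation (`ddof=1`) is a choice made here, not something the text prescribes.

```python
import numpy as np

data = np.array([7, 9, 10, 10, 12, 14, 15, 16, 16, 17, 19, 20, 21, 21, 23])

# z-score: subtract the mean, then divide by the standard deviation
z_scores = (data - data.mean()) / data.std(ddof=1)

print(np.round(z_scores, 2))
print("mean of z-scores:", z_scores.mean())
print("sd of z-scores:  ", z_scores.std(ddof=1))
```

After the transformation the values have mean 0 and standard deviation 1 (up to floating-point rounding), so each z-score reads as a number of standard deviations from the mean.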

## QQ-plot
A QQ-Plot is used to visually determine how close a sample is to the normal
distribution. The QQ-Plot orders the z-scores from low to high, and plots each
value’s z-score on the y-axis; the x-axis is the corresponding quantile of a normal
distribution for that value’s rank. Since the data is normalized, the units
correspond to the number of standard deviations the data lies from the mean.
If the points roughly fall on the diagonal line, then the sample distribution can be
considered close to normal. Figure 2-11 shows a QQ-Plot for a sample of 100
values randomly generated from a normal distribution; as expected, the points
closely follow the line.

![image.png](attachment:image.png)
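
A QQ-Plot like this can be produced with `scipy.stats.probplot`. The sketch below generates its own 100 normal values (the seed is arbitrary, so it will not reproduce the figure exactly) and reports the fit of the points to a straight line instead of drawing the plot.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(size=100)

# Standardize, then pair the ordered z-scores with normal quantiles
z = (sample - sample.mean()) / sample.std(ddof=1)
(theoretical_q, ordered_z), (slope, intercept, r) = stats.probplot(z, dist="norm")

print("slope:", slope)
print("intercept:", intercept)
print("correlation of points with the fitted line:", r)

# To draw the plot itself (requires matplotlib):
# import matplotlib.pyplot as plt
# stats.probplot(z, dist="norm", plot=plt)
# plt.show()
```

A correlation close to 1 with a slope near 1 and an intercept near 0 corresponds to points that fall close to the diagonal.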

### KEY IDEAS
The normal distribution was essential to the historical development of statistics, as it permitted
mathematical approximation of uncertainty and variability.

While raw data is typically not normally distributed, errors often are, as are averages and totals in
large samples.

To convert data to z-scores, you subtract the mean of the data and divide by the standard deviation;
you can then compare the data to a normal distribution.

# Long-Tailed Distributions
Despite the importance of the normal distribution historically in statistics, and in
contrast to what the name would suggest, data is generally not normally
distributed.

### KEY TERMS FOR LONG-TAIL DISTRIBUTION
**Tail**:
The long narrow portion of a frequency distribution, where relatively extreme values occur at low
frequency.

**Skew**:
Where one tail of a distribution is longer than the other.

![image.png](attachment:image.png)

### KEY IDEAS FOR LONG-TAIL DISTRIBUTION
Most data is not normally distributed.

Assuming a normal distribution can lead to underestimation of extreme events (“black swans”).
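
The underestimation can be demonstrated with simulated data. This sketch assumes a made-up long-tailed data set (Student's t with 3 degrees of freedom) and compares the observed share of extreme values with what a normal model would predict.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(5)

# Hypothetical long-tailed data: Student's t with 3 degrees of freedom
x = rng.standard_t(df=3, size=200_000)

# Standardize with the sample mean and standard deviation
z = (x - x.mean()) / x.std(ddof=1)

threshold = 4  # "extreme" = more than 4 standard deviations from the mean
observed = np.mean(np.abs(z) > threshold)
predicted = 2 * (1 - NormalDist().cdf(threshold))  # what a normal model expects

print("observed share beyond 4 sd:       ", observed)
print("share predicted by a normal model:", predicted)
```

The observed share of extreme values is far larger than the normal prediction, which is the sense in which a normal assumption understates tail risk.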

# Student’s t-Distribution
The t-distribution is a normally shaped distribution, but a bit thicker and longer
on the tails. It is used extensively in depicting distributions of sample statistics.
Distributions of sample means are typically shaped like a t-distribution, and there
is a family of t-distributions that differ depending on how large the sample is. The
larger the sample, the more normally shaped the t-distribution becomes.
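
The convergence toward the normal distribution can be seen by comparing critical values. A minimal sketch with `scipy.stats`, where the degrees of freedom shown are arbitrary illustrative choices:

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # about 1.96

# 97.5th percentile of the t-distribution for increasing degrees of freedom
t_crits = {df: stats.t.ppf(0.975, df) for df in (2, 5, 10, 30, 100, 1000)}

for df, t_crit in t_crits.items():
    print(f"df={df:>4}  t critical value={t_crit:.3f}  normal={z_crit:.3f}")
```

For small degrees of freedom the t critical value is noticeably larger (thicker tails); as the degrees of freedom grow it approaches the normal value.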
