# Section 2.2 Introduction to Samples

In [3]:
import numpy as np
import pandas as pd
import pymc as pm
from scipy import stats
import scipy
import arviz as az
import matplotlib.pyplot as plt


In [2]:
az.style.use("arviz-darkgrid")
RANDOM_SEED = 8265
np.random.seed(RANDOM_SEED)

# Introduction to Samples
Samples are fundamental to modern Bayesian statistics and a central concept for this course.

# We'll focus on two kinds of samples
* Population samples
* Distribution samples

# Population Samples
These are samples from a "real world" population. Population samples are the samples you typically read about in statistics textbooks.


![image.png](attachment:image.png)
https://en.wikipedia.org/wiki/Sample_(statistics)

# Population Sample Examples

* Calling 10 people to ask who they'll vote for
* Testing a new drug in a treatment trial on 10 people
* Collecting water from 10 lakes to measure toxins
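
As a rough illustration of the first example, the sketch below simulates a population of voters and "calls" 10 of them at random. The population size, the 55% support level, the seed, and all variable names are assumptions made up for this illustration; they are not data from the course.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical population: 1 = votes for candidate A, 0 = votes for someone else
population = (rng.random(100_000) < 0.55).astype(int)

# A population sample: call 10 distinct people chosen at random
poll = rng.choice(population, size=10, replace=False)

print(poll)
print("Share for A in the sample:    ", poll.mean())
print("Share for A in the population:", population.mean())
```

With only 10 people the sample share can land noticeably far from the population share, which is part of why how we treat samples matters.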

# (Computational) Distribution Samples
The other, related concept is drawing random values from a distribution defined on a computer. These samples are not something that we've observed or counted. They are purely random draws generated by a computer.

For example, we can define a Bernoulli distribution computationally and take 100,000 samples.

In [7]:
# Let's draw 100,000 random samples and see what we get
p = 0.3
bern = stats.bernoulli(p)

num_samples = 100000
samples = bern.rvs(num_samples)

samples[:10]

array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0])

From those samples we can calculate the proportion of ones (the proportion of zeros is one minus this).

In [None]:
samples.sum() / samples.shape[0]

This is, unsurprisingly, a number very close to 0.3, the value of `p` we set above.
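
To see why the proportion ends up close to 0.3, it can help to watch it settle as more samples are used. The following is a minimal sketch under assumptions: the seed and the checkpoints are arbitrary choices, and `draws` is a new name used here to avoid overwriting `samples`.

```python
import numpy as np
from scipy import stats

p = 0.3
bern = stats.bernoulli(p)

# Arbitrary seed so the sketch is reproducible
draws = bern.rvs(100_000, random_state=1234)

# Proportion of ones using only the first n draws
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, draws[:n].mean())

final_proportion = draws.mean()
```

The early proportions tend to bounce around, while the later ones sit much closer to `p`.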

# Samples from a Normal distribution
We can do the same with Normal distributions. Here are 10 samples from a standard Normal (mean 0, standard deviation 1).

In [3]:
# Draw 10 samples from a Normal with mean 0 and standard deviation 1
unit_norm = stats.norm(0, 1).rvs(10)
unit_norm

array([ 0.16528152,  0.87889885,  0.56182566,  0.58892851, -0.48045392,
        0.36210852,  1.13471608,  0.69627419, -0.74007821, -2.98966773])

# But why do this?
If we already know the distribution we're sampling from, why are we taking samples? What's even the point?

*In real problems we don't know the true distribution, but by taking samples in a clever way we can estimate it.*
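
One way to build intuition for this is to pretend we only have the samples and check how well they recover properties of a distribution we secretly know. The sketch below does that for a standard Normal. The seed, the sample size, and the choice of P(X > 1) as the quantity to estimate are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

dist = stats.norm(0, 1)
norm_samples = dist.rvs(50_000, random_state=42)  # arbitrary seed

# Estimates that use only the samples
est_mean = norm_samples.mean()
est_sd = norm_samples.std()
est_tail = (norm_samples > 1).mean()  # estimate of P(X > 1)

# Exact value, available here only because we chose the distribution ourselves
true_tail = dist.sf(1)  # survival function: 1 - CDF

print("mean:", est_mean, " sd:", est_sd)
print("P(X > 1) from samples:", est_tail, " exact:", true_tail)
```

In a real problem the exact value would not be available, and the samples would be the only handle we have on the distribution.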



# Sampling is the foundation of Markov Chain Monte Carlo
MCMC is one of the most popular ways of putting Bayes' theorem to work in practice.
It relies entirely on taking samples in clever ways, and the information from those samples is what enables all the inferential magic we're looking for.

Computers made all of this possible, which is why you're seeing such an explosion in the use of Bayesian methods today.

If this doesn't make sense, don't worry too much right now. It'll be covered in other lessons.

# Observed and Computational Sampling Work Together
In other words, the clever use of *observed* samples coupled with *computational sampling* is how we make estimates about the world.
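
As a toy picture of the two kinds of samples working together, suppose we called 10 people and 3 said yes (made-up numbers). The sketch below draws many candidate values for the unknown proportion from a computational distribution, simulates a 10-person poll for each candidate, and keeps only the candidates whose simulated poll matches what was observed. This is simple rejection sampling, not MCMC, and not necessarily how later lessons do it; the uniform candidates, the seed, and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(8265)  # arbitrary seed

# Observed (population) sample: hypothetical poll, 3 of 10 people say yes
n_called, n_yes = 10, 3

# Computational samples: candidate proportions drawn uniformly from [0, 1]
candidates = rng.uniform(0, 1, size=200_000)

# For each candidate proportion, simulate a poll of the same size
simulated_yes = rng.binomial(n_called, candidates)

# Keep only the candidates that reproduce the observed count
kept = candidates[simulated_yes == n_yes]

print("Candidates kept:", kept.size)
print("Average of kept proportions:", kept.mean())
print("Middle 90% of kept proportions:", np.percentile(kept, [5, 95]))
```

The kept values cluster around the observed share but with a wide spread, reflecting how little 10 phone calls can tell us.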



# Summary
* There are two kinds of samples we'll be using in this course.
    * Observed data, which are samples from the "real world"
    * Computational samples, which are samples from computationally defined distributions
* Computational sampling is the foundation of Markov Chain Monte Carlo (MCMC)
    * MCMC is the family of algorithms that makes Bayes so accessible in the modern world
* Using both together is what enables *inference*
