### Generating Random Variables

In [2]:
from __future__ import print_function, division

import numpy as np

import scipy.stats

import matplotlib.pyplot as pyplot

from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

# seed the random number generator so we all get the same results
np.random.seed(17)

# some nice colors from http://colorbrewer2.org/
COLOR1 = '#7fc97f'
COLOR2 = '#beaed4'
COLOR3 = '#fdc086'
COLOR4 = '#ffff99'
COLOR5 = '#386cb0'

%matplotlib inline

In [8]:
mu1, sig1 = 178, 7.7
male_height_dist = scipy.stats.norm(mu1, sig1)

In [9]:
mu2, sig2 = 163, 7.3
female_height_dist = scipy.stats.norm(mu2, sig2)

In [12]:
male_height_dist.mean()

178.0

In [13]:
male_height_dist.std()

7.7

### CDF

CDF stands for cumulative distribution function.  It gives the probability that a value drawn from the distribution is less than or equal to a given number.

For example, let's look at it with our normal distribution.

In [52]:
import scipy.stats as stats

In [53]:
std_normal = stats.norm(0, 1)

> Above we initialized a standard normal distribution with a mean of 0 and a standard deviation of 1.

This means that our distribution is centered at 0, and a typical datapoint lies about 1 away from that mean.  Now let's see what the cdf is for a couple of values.

In [54]:
std_normal.cdf(0)

0.5

Zero is our midpoint.  So this is saying that the probability of drawing a value lower than 0 is .5.  What about a value lower than -1?

In [55]:
std_normal.cdf(-1)

0.15865525393145707

The probability of drawing a number lower than -1 is only `.159`.  And what about the probability of a number 1 or lower?

In [56]:
std_normal.cdf(1)

0.8413447460685429

Well, with a standard normal distribution this happens with a probability of about .84.
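One way to build intuition for these cdf values is to compare them against a simulation. The sketch below assumes a sample size of 100,000 and a seed of 17 (both arbitrary choices, not from the cells above), draws from the standard normal, and reports the fraction of draws at or below each value next to the exact cdf.

```python
import numpy as np
import scipy.stats as stats

std_normal = stats.norm(0, 1)

# Sample size and seed are arbitrary choices made for this illustration.
rng = np.random.default_rng(17)
sample = std_normal.rvs(size=100_000, random_state=rng)

for value in (-1, 0, 1):
    empirical = np.mean(sample <= value)  # fraction of draws at or below value
    exact = std_normal.cdf(value)         # probability given by the cdf
    print(value, round(empirical, 3), round(exact, 3))
```

The two columns should agree closely, and the agreement generally tightens as the sample grows.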

### Using Percentiles

Now let's imagine that we want to get a sense of different percentiles.  

* For example, what number is higher than 99 percent of numbers drawn from our distribution?

In [57]:
std_normal.ppf(.99)

2.3263478740408408

That number is about 2.33.

What number is greater than `.8413` of the numbers drawn from our population?

Well, we already saw this number above.

In [60]:
std_normal.ppf(.8413)

0.9998150936147446

That number is approximately 1.

So the percent point function (`ppf`) is the *inverse* of the cdf.
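The inverse relationship can be checked with a round trip: start from probabilities, map them to values with `ppf`, then map back with `cdf`. This is a minimal sketch; the particular probabilities in the array are just illustrative picks.

```python
import numpy as np
import scipy.stats as stats

std_normal = stats.norm(0, 1)

# Illustrative probabilities, including the ones used in the cells above.
probabilities = np.array([0.01, 0.1587, 0.5, 0.8413, 0.99])

values = std_normal.ppf(probabilities)  # probability -> value
recovered = std_normal.cdf(values)      # value -> probability

print(values)
print(np.allclose(recovered, probabilities))
```

If `ppf` and `cdf` are inverses, `recovered` matches `probabilities` up to floating-point error.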

In [38]:
x = np.linspace(std_normal.ppf(0.01), std_normal.ppf(0.99), 100)

From here, we can use the `pdf` and `cdf` methods to indicate how likely given values are to occur.

In [49]:
# norm.pdf

In [45]:
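# assumption: probs is the density at each x (it is not defined anywhere in the original cells)
probs = std_normal.pdf(x)
sum(x*probs)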

4.0245584642661925e-16

In [36]:
# norm.pdf(np.linspace(norm.ppf(0.01), norm.ppf(0.99), 100))

In [20]:
norm = scipy.stats.norm()

In [29]:
nums = np.linspace(-1, 1, 21)
sum(norm.pdf(nums))

7.064831455238545
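Note that a sum of pdf values is not itself a probability: densities have to be weighted by the spacing between points to approximate an area. As a rough sketch, the code below applies the trapezoid rule (written out by hand, one of several possible approximations) to the same 21 points and compares the result with the exact probability from the cdf.

```python
import numpy as np
import scipy.stats

norm = scipy.stats.norm()

nums = np.linspace(-1, 1, 21)
dx = nums[1] - nums[0]       # spacing between neighbouring points (0.1)
densities = norm.pdf(nums)

# Trapezoid rule: the two end points get half weight, the rest full weight.
area = dx * (densities.sum() - 0.5 * (densities[0] + densities[-1]))

# Exact probability of a draw landing between -1 and 1.
exact = norm.cdf(1) - norm.cdf(-1)

print(area, exact)
```

Both numbers should come out near 0.68, the familiar share of a normal distribution within one standard deviation of the mean.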

The following function evaluates the normal (Gaussian) probability density function (PDF) within 4 standard deviations of the mean.  It takes an rv object and returns a pair of NumPy arrays.

In [9]:
def eval_pdf(rv, num=4):
    mean, std = rv.mean(), rv.std()
    xs = np.linspace(mean - num*std, mean + num*std, 100)
    ys = rv.pdf(xs)
    return xs, ys

In [None]:
np.linspace(0, 1, 5)  # 5 evenly spaced values from 0 to 1, endpoints included

Here's what the two distributions look like.

In [None]:
xs, ys = eval_pdf(male_height_dist)
pyplot.plot(xs, ys, label='male', linewidth=4, color=COLOR2)

xs, ys = eval_pdf(female_height_dist)
pyplot.plot(xs, ys, label='female', linewidth=4, color=COLOR3)
pyplot.xlabel('height (cm)')
pyplot.legend()  # show the 'male' / 'female' labels set above
None

Let's assume for now that those are the true distributions for the population.

I'll use `rvs` to generate random samples from the population distributions.  Note that these are totally random, totally representative samples, with no measurement error!

In [None]:
male_sample = male_height_dist.rvs(1000)

In [None]:
female_sample = female_height_dist.rvs(1000)
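A quick sanity check on samples like these is to compare their mean and standard deviation with the population parameters. The sketch below rebuilds the two distributions so it runs on its own, and it passes an explicit `random_state` (a seed chosen only for this illustration) instead of relying on the global NumPy seed.

```python
import numpy as np
import scipy.stats

# Same population parameters as the cells near the top of the notebook.
male_height_dist = scipy.stats.norm(178, 7.7)
female_height_dist = scipy.stats.norm(163, 7.3)

# The seed is an arbitrary choice so the example is reproducible.
rng = np.random.default_rng(17)
male_sample = male_height_dist.rvs(size=1000, random_state=rng)
female_sample = female_height_dist.rvs(size=1000, random_state=rng)

# Sample statistics should land near the population values, but not on them.
print(male_sample.mean(), male_sample.std())
print(female_sample.mean(), female_sample.std())
```

Even with perfectly random sampling, the sample statistics differ slightly from 178 / 7.7 and 163 / 7.3 because of sampling variation.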

[Normal random variable](https://www.tutorialspoint.com/scipy/scipy_stats.htm)