In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from ipywidgets import interact, FloatSlider, IntSlider
%matplotlib inline

## Calculations involving Binomial Coefficients

In this section, you will learn how to perform calculations for a binomially distributed variable.

You can import the special module from the scipy library, which contains a function for calculating binomial coefficients.

In [None]:
from scipy import special

For example, to find **3 choose 1**:

In [None]:
special.binom(3,1)
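
As a cross-check, the binomial coefficient can also be computed from its factorial definition, $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$. The sketch below uses only the standard library's `math` module (an addition here, not something the notebook imports); the helper name `choose` is made up for illustration. Note that `math.comb` returns an exact integer, whereas `special.binom` returns a float.

```python
import math

def choose(n, k):
    """Binomial coefficient from the factorial definition n! / (k! (n - k)!)."""
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

# 3 choose 1: three ways to pick which single flip lands heads
print(choose(3, 1))      # 3
print(math.comb(3, 1))   # 3, the built-in equivalent (Python 3.8+)
```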

Your turn. How many ways are there to choose 5 successes out of 20 trials?

In [None]:
# Your code here

We saw in the slides that the probability of exactly one coin landing on heads out of a sequence of 3 flips of a bent coin was calculated as 

<img src="images/binom_normal/01.png" width="400">

Let's see how to perform this calculation in Python. Recall that multiplication is done using `*`. For this calculation, we also need to exponentiate, which is done using `**`.

In [None]:
special.binom(3,1) * (0.7)**1 * (0.3)**2
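
The calculation above is one instance of the general binomial formula $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$. A minimal sketch wrapping it in a function might look like the following; the name `binomial_probability` is hypothetical, and `math.comb` from the standard library stands in for `special.binom`.

```python
import math

def binomial_probability(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# The bent-coin example from above: exactly 1 head in 3 flips, P(heads) = 0.7
print(binomial_probability(k=1, n=3, p=0.7))   # approximately 0.189

# The probabilities of all possible outcomes (0 to 3 heads) should sum to 1
total = sum(binomial_probability(k, 3, 0.7) for k in range(4))
print(total)
```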

Now that you have seen what goes into this calculation, let's look at some examples involving more trials.

**Example:** For a sequence of ten flips of a fair coin (probability of heads = 0.5 = probability of tails), what is the probability of exactly six flips landing on heads?

In [None]:
# Your code here

**Example:** For a sequence of ten flips of a fair coin (probability of heads = 0.5 = probability of tails), what is the probability of exactly one flip landing on heads?

In [None]:
# Your code here

**Example:** For a sequence of ten flips of a bent coin, where the probability of heads is 0.7, what is the probability of exactly 1 flip landing on heads?

In [None]:
# Your code here

# Working with Probability Distributions

Most common probability distributions are contained in the `scipy.stats` module.

## The Binomial Distribution

For example, if you want to work with the binomial distribution, you can use:

In [None]:
from scipy.stats import binom

To calculate the probability of a specific number of successes, you can use the `pmf` method. Note that pmf stands for "probability mass function".

To find the probability of exactly 7 successes, you need to specify the following arguments:
* k: desired number of successes
* n: total number of trials
* p: probability of success

<img style="float: left;" src="images/binom_normal/02.png" width="600">

In [None]:
binom.pmf(k = 7, n = 10, p = 0.5)
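
To see where this number comes from, note that for a fair coin every sequence of ten flips has probability $0.5^{10} = 1/1024$, so the pmf is the count of sequences with $k$ heads divided by 1024. The sketch below (standard library only; `pmf_table` is an illustrative name) tabulates the whole distribution. The entry for $k = 7$ should agree with `binom.pmf` up to floating-point rounding.

```python
import math

n = 10
# Number of sequences with k heads, divided by the 2**10 equally likely sequences
pmf_table = {k: math.comb(n, k) / 2**n for k in range(n + 1)}

for k, prob in pmf_table.items():
    print(k, prob)

print(pmf_table[7])   # 120 / 1024 = 0.1171875
```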

**Example:** For ten flips of the **bent** coin, what is the probability of exactly 7 heads?

In [None]:
# Your code here.

<img style="float: left;" src="images/binom_normal/03.png" width="600">

In [None]:
binom.pmf(k = 6, n = 10, p = 0.5) + binom.pmf(k = 7, n = 10, p = 0.5)

**Example:** For ten flips of the **bent** coin, what is the probability of either 6 or 7 heads?

In [None]:
# Your code here.

<img style="float: left;" src="images/binom_normal/04.png" width="600">

If you want to find the probability of $x$ _or fewer_ successes, you can use the **cumulative distribution function**, or **cdf**, rather than summing values of the probability mass function.

For a random variable, the cdf is defined as

$$F(x) := P(X \leq x) = \text{the probability that the value of the random variable is } x \text{ or less}$$

<img style="float: left;" src="images/binom_normal/05.png" width="600">

<img style="float: left;" src="images/binom_normal/06.png" width="600">

You can compute this value using the `.cdf` function:

In [None]:
binom.cdf(k = 7, n = 10, p = 0.5)
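
For a discrete distribution, the cdf is a running total of the pmf: $F(7) = P(X=0) + P(X=1) + \dots + P(X=7)$. A small sketch of that sum for the fair coin, using only the standard library (the helper `pmf` is defined here for illustration and is not scipy's):

```python
import math

def pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# P(7 or fewer heads in 10 fair flips): add up the pmf for k = 0, 1, ..., 7
cdf_7 = sum(pmf(k, 10, 0.5) for k in range(8))
print(cdf_7)   # 968 / 1024 = 0.9453125
```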

**Exercise:** For ten flips of a **bent** coin, what is the probability of 7 or fewer heads?

In [None]:
# Your code here

<img style="float: left;" src="images/binom_normal/07.png" width="600">

<img style="float: left;" src="images/binom_normal/08.png" width="800">

In [None]:
binom.cdf(k = 6, n = 10, p = 0.5) - binom.cdf(k = 3, n = 10, p = 0.5)
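
Note the lower bound: because the cdf includes its endpoint, subtracting the cdf at 3 (not at 4) removes the outcomes 0 through 3 and keeps 4, 5, and 6. The sketch below checks this against a direct sum of the pmf, using hand-rolled `pmf` and `cdf` helpers (illustrative names, standard library only) rather than scipy.

```python
import math

def pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def cdf(k, n, p):
    """P(X <= k): running total of the pmf from 0 up to and including k."""
    return sum(pmf(i, n, p) for i in range(k + 1))

direct = pmf(4, 10, 0.5) + pmf(5, 10, 0.5) + pmf(6, 10, 0.5)
by_subtraction = cdf(6, 10, 0.5) - cdf(3, 10, 0.5)

print(direct)           # 672 / 1024 = 0.65625
print(by_subtraction)   # same value
```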

**Exercise:** For 10 flips of the **bent** coin, what is the probability of between 4 and 6 heads, inclusive? 

In [None]:
#Your code here

<img style="float: left;" src="images/binom_normal/09.png" width="600">

<img style="float: left;" src="images/binom_normal/10.png" width="800">

In [None]:
1 - binom.cdf(k = 4, n = 10, p = 0.5)
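
The same endpoint care applies here: "5 or more" is the complement of "4 or fewer", so the cdf is evaluated at 4, not 5. A quick standard-library check for the fair coin (the `pmf` helper is illustrative):

```python
import math

def pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Complement rule: P(X >= 5) = 1 - P(X <= 4)
at_most_4 = sum(pmf(k, 10, 0.5) for k in range(5))
at_least_5 = sum(pmf(k, 10, 0.5) for k in range(5, 11))

print(1 - at_most_4)   # 638 / 1024 = 0.623046875
print(at_least_5)      # same value, summed directly
```

The distributions in `scipy.stats` also provide an `sf` (survival function) method, which returns `1 - cdf` directly.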

**Exercise:** For the **bent** coin, what is the probability of 5 or more heads?

In [None]:
# Your code here.

## The Normal Distribution

If you are going to work with normal distributions, import `norm` from `scipy.stats`.

In [None]:
from scipy.stats import norm

The following interactive widget demonstrates how the two parameters, $\mu$ and $\sigma$, affect the shape and location of a normal distribution.

In [None]:
@interact(mu = FloatSlider(value = 0, min = -3, max = 3, step = 0.1),
         sigma = FloatSlider(value = 1, min = 0.1, max = 3, step = 0.1))
def normal_pdf(mu, sigma):
    x = np.arange(start = -4, stop = 4, step = 0.01)
    plt.plot(x, norm.pdf(x, loc = mu, scale = sigma), color = 'black')
    plt.fill_between(x, norm.pdf(x, loc = mu, scale = sigma))
    plt.hlines(y = 0, xmin = -4, xmax = 4, color = 'black')
    plt.ylabel('Density')
    plt.title('Normal Distribution\n' + r'$\mu$ = {}, $\sigma$ = {}'.format(mu, sigma))
    plt.xlim(-4, 4);

For calculating probabilities with the normal distribution, you will usually need to use its cdf.

**Example:** For a random variable which is normally distributed with a mean of 100 and standard deviation of 10, what is the probability that the variable is less than 80?

Recall that the cdf gives the probability that the random variable is $x$ or less.

When using the `cdf` or `pdf` for a normal distribution, you need to specify the value(s) of $x$ along with
* loc: the mean
* scale: the standard deviation

In [None]:
norm.cdf(x = 80, loc = 100, scale = 10)
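
This calculation amounts to standardizing: 80 is $z = (80 - 100)/10 = -2$ standard deviations from the mean, and the answer is the standard normal cdf evaluated at $-2$. The sketch below reproduces it with the standard library, where `statistics.NormalDist` is used as a stand-in for `scipy.stats.norm` and the variable names are illustrative.

```python
import math
from statistics import NormalDist

mean, sd = 100, 10
z = (80 - mean) / sd   # -2.0: two standard deviations below the mean

# Standard normal cdf written in terms of the error function
phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(z)
print(phi)                                     # roughly 0.0228
print(NormalDist(mu=mean, sigma=sd).cdf(80))   # same probability, no manual z-score
```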

<img style="float: left;" src="images/binom_normal/11.png" width="600">

**Example:** For a random variable which is normally distributed with a mean of 100 and standard deviation of 10, what is the probability that the variable is more than 85 but less than 115?

To answer this, subtract one area from another: the area to the left of 115 minus the area to the left of 85. Remember that the cdf only gives the probability of a particular value or less.

In [None]:
norm.cdf(x = 115, loc = 100, scale = 10) - norm.cdf(x = 85, loc = 100, scale = 10)
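
Here 85 and 115 sit 1.5 standard deviations on either side of the mean, so the result is the area of the central band of the bell curve. A standard-library sketch of the same subtraction, with `statistics.NormalDist` standing in for `norm`:

```python
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=10)

below_115 = dist.cdf(115)   # area to the left of 115
below_85 = dist.cdf(85)     # area to the left of 85

print(below_115 - below_85)   # roughly 0.866
```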

<img style="float: left;" src="images/binom_normal/12.png" width="600">

**Example:** For a random variable which is normally distributed with a mean of 100 and standard deviation of 10, what is the probability that the variable is more than 90?

Again, you will need to subtract, this time from 1: the area above 90 is the total area minus the area at or below 90.

In [None]:
1 - norm.cdf(x = 90, loc = 100, scale = 10)
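
Because the normal distribution is symmetric about its mean, the area above 90 (one standard deviation below the mean) equals the area below 110 (one standard deviation above it). A brief check using `statistics.NormalDist` from the standard library as a stand-in for `norm`:

```python
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=10)

above_90 = 1 - dist.cdf(90)   # complement of "90 or less"
below_110 = dist.cdf(110)     # mirror image across the mean

print(above_90)    # roughly 0.841
print(below_110)   # same area by symmetry
```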

<img style="float: left;" src="images/binom_normal/13.png" width="600">

### Using the Normal Distribution to Estimate Probabilities

The dataset NHANES_heights_weights.csv contains a sample of participants in the National Health and Nutrition Examination Survey. Specifically, it contains the heights and weights of all male participants between the ages of 30 and 40.

In [None]:
nhanes = pd.read_csv('../data/NHANES_heights_weights.csv')

In [None]:
nhanes.head()

You can get a quick glimpse at the characteristics of the dataset using the `.describe()` method.

In [None]:
nhanes.height_cm.describe()

To get a better idea of the distribution of values, we can look at a histogram with a density estimate overlaid.

In [None]:
sns.histplot(nhanes.height_cm, kde = True, stat = 'density');

You can see that the data is roughly bell-shaped. There are some statistical tests which can be used to check whether a sample appears to have come from a normal distribution. 

Another option is to use what's called a **quantile-quantile plot**, or **Q-Q plot**. This type of plot can be used to assess whether it is plausible that a set of observations came from a particular distribution.

Specifically, a Q-Q plot is a scatterplot which shows the theoretical quantiles from the candidate distribution against the observed quantiles from the sample. If the plot is close to the identity plot (the diagonal line), then we can conclude that it is plausible (but not certain) that the sample came from that distribution. 

When using a Q-Q plot to evaluate whether it is plausible to model a distribution as normal, the theoretical quantiles are calculated from a normal distribution with the same mean and standard deviation as the dataset.
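
To make the construction concrete, here is a rough sketch of how the points of a normal Q-Q plot can be computed. It is not the implementation behind any particular plotting function; the simulated sample, the plotting positions $(i - 0.5)/n$, and all variable names are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, mean, pstdev

random.seed(0)
# Hypothetical sample: 200 simulated heights in cm (not real survey data)
sample = [random.gauss(175, 7) for _ in range(200)]

n = len(sample)
observed = sorted(sample)   # observed quantiles

# Normal distribution with the same mean and standard deviation as the sample
fitted = NormalDist(mu=mean(sample), sigma=pstdev(sample))

# Theoretical quantiles at the plotting positions (i - 0.5) / n
theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Each (theoretical, observed) pair is one point on the Q-Q plot
for t, o in list(zip(theoretical, observed))[:5]:
    print(round(t, 1), round(o, 1))
```

Plotting `observed` against `theoretical` and comparing the points with the line $y = x$ gives the kind of plot described above.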

You will use the following function to create Q-Q plots. You just need to pass in the column of interest.

In [None]:
from nssstats.plots import qq_plot

In [None]:
qq_plot(nhanes.height_cm)

You can see that, with just a few exceptions, the sample data hugs the diagonal line. You are probably safe to model the overall distribution using a normal distribution.

You can approximate the population distribution using a normal distribution with the same mean and standard deviation as the sample.

In [None]:
mu = np.mean(nhanes.height_cm)
sigma = np.std(nhanes.height_cm)

print('mu = {}'.format(mu))
print('sigma = {}'.format(sigma))
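
One detail worth knowing: `np.std` divides by $n$ by default (`ddof=0`, the population standard deviation), while the sample standard deviation divides by $n - 1$ (`ddof=1`). For a large sample the difference is negligible, but it is visible on a tiny one. The sketch below illustrates the distinction with the standard library's `statistics` module and made-up values.

```python
from statistics import pstdev, stdev

values = [170, 175, 180]   # made-up heights in cm

print(pstdev(values))   # divides by n, like np.std(values)
print(stdev(values))    # divides by n - 1, like np.std(values, ddof=1); gives 5.0
```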

Let's take a look at the hypothetical normal distribution against the sample data.

In [None]:
x = np.arange(start = -4 * sigma + mu, stop = 4 * sigma + mu, step = 0.01)
plt.plot(x, norm.pdf(x, loc = mu, scale = sigma), color = 'black')
nhanes.height_cm.hist(density = True);

You can see that it's not a perfect fit, but is reasonably close.

Using this distribution, you can make predictions about the overall population.

What proportion of 30- to 40-year-old males will be under 5 feet tall (152.4 cm)?

**Hint:** We calculated the mean and standard deviation of the distribution above and saved them as mu and sigma, so you can pass those variables in as arguments.

In [None]:
#Your code here

What proportion of 30- to 40-year-old males will be over 6 feet tall (182.88 cm)?

In [None]:
#Your code here

What proportion of 30- to 40-year-old males will be over 7 feet tall (213.36 cm)?

In [None]:
#Your code here

What about weights?

In [None]:
nhanes.weight_kg.hist();

It appears that weights are skewed to the right.

In [None]:
qq_plot(nhanes.weight_kg)

You can also see this in the Q-Q plot. The values in both the upper and lower quantiles are considerably larger than would be expected from a normal distribution.

Distributions with a large tail on the right can sometimes be approximated with a normal distribution after transforming the values. A common transformation is the logarithm.
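
If $\log(W)$ is approximately normal with mean $\mu$ and standard deviation $\sigma$, then a probability about $W$ can be computed on the log scale, since $W < w$ exactly when $\log(W) < \log(w)$. The sketch below uses made-up parameters (`mu_log`, `sigma_log`) and a made-up threshold purely to show the mechanics; they are not estimates from the NHANES data, and `statistics.NormalDist` stands in for `norm`.

```python
import math
from statistics import NormalDist

# Hypothetical parameters of log(weight); not fitted to any real dataset
mu_log, sigma_log = 4.4, 0.2
threshold_kg = 70

log_weights = NormalDist(mu=mu_log, sigma=sigma_log)

# P(W < 70) = P(log W < log 70): transform the threshold, then use the cdf
p_below = log_weights.cdf(math.log(threshold_kg))
print(p_below)
```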

In [None]:
nhanes.weight_kg.apply(np.log).hist();

In [None]:
qq_plot(nhanes.weight_kg.apply(np.log))

It is still somewhat skewed to the right, but you are probably okay to make some estimates.

In [None]:
mu = np.mean(nhanes.weight_kg.apply(np.log))
sigma = np.std(nhanes.weight_kg.apply(np.log))

print('mu = {}'.format(mu))
print('sigma = {}'.format(sigma))

In [None]:
x = np.arange(start = -4 * sigma + mu, stop = 4 * sigma + mu, step = 0.01)
plt.plot(x, norm.pdf(x, loc = mu, scale = sigma), color = 'black')
nhanes.weight_kg.apply(np.log).hist(density = True);

What proportion of males between the ages of 30 and 40 do you expect to weigh less than 100 lbs (45.3592 kg)?

**Hint:** You may need to use the `np.log()` function as part of this calculation.

In [None]:
# Your code here

What proportion do you expect to weigh more than 300 lbs (136.078 kg)?

In [None]:
# Your code here

## Connection Between Binomial and Normal Distributions

Even though the binomial distribution is a discrete probability distribution and the normal distribution is a continuous probability distribution, the binomial distribution can be well-approximated by the normal, for sufficiently large values of $n$.

This fact will be useful when making inferences about population proportions in the next couple of weeks.
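
As a rough numerical illustration, the approximating normal distribution has the same mean and standard deviation as the binomial: $\mu = np$ and $\sigma = \sqrt{np(1-p)}$. The sketch below compares an exact binomial probability with its normal approximation using only the standard library. The choice of $n = 100$, $p = 0.5$, the cutoff 55, and the continuity correction (evaluating the normal cdf at 55.5 rather than 55) are assumptions added for this example.

```python
import math
from statistics import NormalDist

n, p = 100, 0.5

# Exact: P(X <= 55) as a sum of binomial probabilities
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(56))

# Approximation: normal with matching mean and standard deviation
mu = n * p
sigma = math.sqrt(n * p * (1 - p))
approx = NormalDist(mu=mu, sigma=sigma).cdf(55.5)   # 55.5 = continuity correction

print(exact)
print(approx)
print(abs(exact - approx))   # the two should agree closely
```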

In [None]:
from nssstats.demos import binom_normal_plot
from ipywidgets import interact, IntSlider, FloatSlider

This widget shows how the normal distribution can be used to approximate the binomial distribution. The approximation improves as the number of trials, $n$, increases.

Notice that even if the binomial distribution is not symmetric (i.e., $p \neq 0.5$), the approximation is very close for large values of $n$.

In [None]:
interact(binom_normal_plot,
        n = IntSlider(value = 5, min = 1, max = 100, continuous_update = False),
        p = FloatSlider(value = 0.5, min = 0, max = 1, step = 0.01, continuous_update = False));