# Statistical Distributions 
Today, we're going to focus on how we can describe probability distributions. This will give us a tool set to make inferences about samples and populations.

>> Aside - We will be using a Mathematical Model as a Proxy to the Population's Distribution

>If we know what a population distribution should look like, we can mathematically define a probability model that stands in for our population.

>This is analogous to something like using a perfect circle to approximate a car's tire. Is it exactly the same? No, there are some deviations from the circle but it's close enough for most applications.

>There are many parametric probability distributions which can be described mathematically and can be very convenient for us depending on the problem but here we will focus on the most useful distributions.

Earlier we looked at descriptive statistics: starting with a dataset and making various observations (overall shape, histogram, outliers, etc.) as well as calculations of quantities that can characterize the dataset as a whole (mean, median, mode, variance, standard deviation, quartiles, percentiles, etc.).

To make the move into inferential statistics, we need to imagine now that we don't have (or anyway cannot measure) all the data of interest.

And this is, of course, the typical situation. Consider:

- A zoologist wanting to know the typical lifespan of a Siberian tiger
- A cosmologist wanting to know the mass of a normal white dwarf star
- A businesswoman wanting to know how many M&M's her customers should expect to find in their Party Size bags
- A botanist wanting to know how tall California redwoods usually grow

The zoologist could, in principle:

- Keep track of every currently existing Siberian tiger;
- record their (more or less) exact ages at their moments of death;
- add up those ages and divide by the number of tigers to calculate an average lifespan.

But only in principle. In all of these situations, there is no realistic or practical opportunity to check each relevant data point.

![](https://pictures-of-cats.org/wp-content/uploads/2012/09/bengal-tiger-size-weight-1.jpg)

What we can do, however, is to check some of the data points we want to check. That is, we'll draw a sample of data from our population of interest. We can then use the techniques of descriptive statistics to characterize our sample.

Does this help? The hope, of course, is that our sample will be representative of the population as a whole, which would justify our using facts about the sample to infer things about the population as a whole. But naturally we'll expect a certain amount of error: If I take the mean of a sample, $\bar{x}$, and project it as an estimate of the mean of the whole population, $\mu$, the estimate is bound to be imperfect.
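
To make that estimation error concrete, here is a minimal sketch. The "population" is simulated, and its mean of 15 and spread of 3 are made-up values chosen purely for illustration, not real tiger data.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 simulated lifespans in years.
# loc=15 and scale=3 are assumptions for illustration only.
population = rng.normal(loc=15, scale=3, size=100_000)
mu = population.mean()  # population parameter (normally unknown to us)

# Draw a small random sample without replacement and compute its mean.
sample = rng.choice(population, size=50, replace=False)
x_bar = sample.mean()  # sample statistic

print(f"population mean (mu): {mu:.3f}")
print(f"sample mean (x_bar):  {x_bar:.3f}")
print(f"estimation error:     {x_bar - mu:+.3f}")
```

Re-running with a different seed gives a different sample mean each time, which is exactly the imperfection described above.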


## Parameters vs. Statistics 

We define the **population** as the whole group we're interested in. We abstract this to the **population** being the whole set of possible outcomes.
When we randomly **sample** from the population, we call the resulting subset of individuals/outcomes/items a **sample**.

> The sample statistic is calculated from the sample data and the population parameter is inferred (or estimated) from this sample statistic. Let me say that again: Statistics are calculated, parameters are estimated. 

**Know the differences - Population v Sample Terminology**

Characteristics of populations are called *parameters*<br/>
Characteristics of a sample are called *statistics*

![](https://media.cheggcdn.com/media/7ac/7ac1a812-3b41-4873-8413-b6a7b8fab530/CL-26481V_image_006.png)


## Discrete VS Continuous
A fundamental distinction among kinds of distributions is the distinction between discrete and continuous distributions. 
A **discrete distribution** (or variable) takes on countable values, like integers. Every possible outcome has a positive probability.
A **continuous distribution** takes on a continuum of values, like real numbers. It assigns probabilities to ranges of values. 

![](https://miro.medium.com/max/1022/1*7DwXV_h_t7_-TkLAImKBaQ.png)


In [None]:
import pandas as pd 
import numpy as np
from scipy import stats 
import matplotlib.pyplot as plt
%matplotlib inline 

import seaborn as sns
sns.set_style('darkgrid')

In [None]:
sb_data = {'drink_orders' : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 
          'freq' : [4, 20, 13, 6, 4, 2, 0, 0, 0, 0, 1]}

In [None]:
df = pd.DataFrame(sb_data)
df

In [None]:
df.freq.sum()

In [None]:
df['r_freq'] = df['freq'] / df['freq'].sum()  # relative frequency: share of all observations
df

In [None]:
plt.bar(df['drink_orders'], df['freq']);

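**What if we had the wait time of each customer? Would relative frequency be helpful?**<br/>
Not directly. A discrete variable takes specific, repeatable values, so the relative frequency of each value is meaningful. Wait times are continuous, so exact values rarely repeat; we would have to group them into ranges before relative frequencies became useful.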
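
As a rough illustration of the wait-time question, the sketch below uses made-up wait times (drawn from an exponential distribution with a mean of 4 minutes, which is an assumption, not data from this lesson). It contrasts per-value relative frequencies with relative frequencies over ranges.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical wait times in minutes (simulated, not real customer data).
wait_times = pd.Series(rng.exponential(scale=4, size=50))

# Exact continuous values essentially never repeat, so each one
# gets a relative frequency of about 1/50: not very informative.
print(wait_times.value_counts(normalize=True).head())

# Grouping into ranges gives relative frequencies we can interpret.
bins = [0, 2, 4, 6, 8, np.inf]
binned = pd.cut(wait_times, bins=bins, right=False)
rel_freq = binned.value_counts(normalize=True).sort_index()
print(rel_freq)
```

The bin edges here are arbitrary; different edges would give a different but equally valid summary.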

## PMF: Probability Mass Function
The probability mass function (pmf) for a random variable gives, at any value $k$, the probability that the random variable takes the value $k$. It is used for discrete random variables, which take a finite or countably infinite set of values.

**Suppose, for example, that I have a jar full of lottery balls containing:**

- 50 balls marked "1"
- 25 balls marked "2"
- 15 balls marked "3"
- 10 balls marked "4"

In [None]:
# For each number, we calculate the probability that we pull it from the jar

numbers = range(1, 5)
counts = [50, 25, 15, 10]

# calculate the probs by dividing each count by the total number of balls.

probs = [count/sum(counts) for count in counts]

lotto_dict = {number: prob for number, prob in zip(numbers, probs)}
lotto_dict

In [None]:
# Plot here!

x = list(lotto_dict.keys())
y = list(lotto_dict.values())

fig, ax = plt.subplots(1, 1, figsize=(6, 6))
ax.plot(x, y, 'bo', ms=8, label='lotto pmf')
ax.vlines(x, 0, y, 'r', lw=5)
ax.legend(loc='best');

In [None]:
print("Probability of drawing a 1 or a 2:", sum(probs[:2]))


## PDF: Probability Density Function
> Probability density functions are similar to PMFs, in that they describe the probability of a result within a range of values. But where PMFs are appropriate for discrete variables and so can be described with bar plots, PDFs are smooth curves that describe continuous random variables. They are helpful for identifying regions in a distribution where observations are more likely to occur.

![](https://raw.githubusercontent.com/learn-co-students/dsc-probability-density-function-onl01-dtsc-ft-030220/master/images/pdf2.jpg)
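
One way to see how a PDF turns into probabilities is to measure the area under the curve. The sketch below uses a standard normal distribution, chosen only as a convenient example, and integrates its PDF numerically with `scipy.integrate.quad`.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

dist = stats.norm(loc=0, scale=1)  # standard normal, used only for illustration

# The probability of landing in [-1, 1] is the area under the PDF there.
area, _ = quad(dist.pdf, -1, 1)
print(f"area under pdf on [-1, 1]: {area:.4f}")
print(f"cdf(1) - cdf(-1):          {dist.cdf(1) - dist.cdf(-1):.4f}")

# The total area under any PDF is 1.
total, _ = quad(dist.pdf, -np.inf, np.inf)
print(f"total area under pdf:      {total:.4f}")
```

The height of the curve at a point is a density, not a probability; only areas under the curve are probabilities.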

## CDF: Cumulative Distribution Function
The cumulative distribution function describes the probability that your result will be less than or equal to a certain value. It applies to both discrete and continuous random variables.

For the lotto ball scenario, the CDF would describe the probability of drawing a ball equal to or below a certain number.

In order to create the CDF from a sample, we:

- align the values from least to greatest
- for each value, count the number of values that are less than or equal to the current value
- divide that count by the total number of values
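
The three steps above can be sketched as a small helper. The function name `empirical_cdf` and the list of draws are hypothetical, introduced here only for illustration.

```python
import numpy as np

def empirical_cdf(sample):
    """Return the sorted unique values and P(X <= value) for each one."""
    data = np.sort(np.asarray(sample))                    # order least to greatest
    values, counts = np.unique(data, return_counts=True)  # tally each distinct value
    cum_probs = np.cumsum(counts) / data.size             # running count / total
    return values, cum_probs

draws = [1, 1, 2, 1, 3, 2, 4, 1, 2, 1]  # hypothetical draws from a jar
values, cum_probs = empirical_cdf(draws)

for v, p in zip(values, cum_probs):
    print(f"P(X <= {v}) = {p:.1f}")
# P(X <= 1) = 0.5
# P(X <= 2) = 0.8
# P(X <= 3) = 0.9
# P(X <= 4) = 1.0
```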

The CDF of the Lotto example plots how likely we are to get a ball less than or equal to a given number.

Let's create the CDF for our Lotto example

In [None]:
# align the values

lotto_dict = {0:0, 1:50, 2:25, 3:15, 4:10}
values = list(lotto_dict.keys())

# count the number of values that are less than
# or equal to the current value

count_less_than_equal = np.cumsum(list(lotto_dict.values()))

# divide by total number of values
prob_less_than_or_equal = count_less_than_equal/sum(lotto_dict.values())

In [None]:
fig, ax = plt.subplots()
ax.plot(values, prob_less_than_or_equal, 'bo', ms=8, label='lotto cdf')
for i in range(0, 5):
    ax.hlines(prob_less_than_or_equal[i], i,i+1, 'r', lw=5,)
for i in range(0, 4):
    ax.vlines(i+1, prob_less_than_or_equal[i+1],
              prob_less_than_or_equal[i], linestyles='dotted')
ax.legend(loc='best' )
ax.set_ylim(0);

### Recap - PMF, PDF, CDF
A probability mass function (PMF), also called a frequency function, gives you probabilities for discrete random variables. “Random variables” are variables from experiments like dice rolls, choosing a number out of a hat, or getting a high score on a test. The “discrete” part means that there’s a set number of outcomes. For example, you can only roll a 1, 2, 3, 4, 5, or 6 on a die.

Its counterpart is the probability density function, which gives probabilities for continuous random variables. A continuous variable has so many possible values that the probability of any single exact value is zero, so we compute probabilities over intervals instead.

You can use the CDF to figure out probabilities above a certain value, below a certain value, or between two values. For example, if you had a CDF that showed weights of cats, you can use it to figure out:

- The probability of a cat weighing more than 11 pounds.
- The probability of a cat weighing less than 11 pounds.
- The probability of a cat weighing between 11 and 15 pounds.

In a scenario like this, a veterinary pharmaceutical company would want to know the probability of cats weighing a certain amount so that it can produce the right volume of medication for each weight range.
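
As a sketch of those three CDF lookups, suppose (purely as an assumption) that cat weights follow a normal distribution with a mean of 10 pounds and a standard deviation of 2.5 pounds. These numbers are invented for illustration.

```python
from scipy import stats

# Hypothetical model of cat weights in pounds; the parameters are assumptions.
cat_weights = stats.norm(loc=10, scale=2.5)

p_less_11 = cat_weights.cdf(11)                        # P(weight <= 11)
p_more_11 = 1 - cat_weights.cdf(11)                    # P(weight > 11)
p_between = cat_weights.cdf(15) - cat_weights.cdf(11)  # P(11 < weight <= 15)

print(f"P(weight < 11):      {p_less_11:.3f}")
print(f"P(weight > 11):      {p_more_11:.3f}")
print(f"P(11 < weight < 15): {p_between:.3f}")
```

All three answers come from the same CDF: "below" is a direct lookup, "above" is one minus the lookup, and "between" is the difference of two lookups.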

### Now let's work with a dataset 

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/learn-co-curriculum/dsc-lp-STATISTICS-pmf-pdf-cdf/master/baby_weights.csv')
df.head()

**Let's take a look at the percentiles of this data.** <br/>
**Question: What is the 90th percentile? What does it mean?**

In [None]:
for i in range(10):
    print('{}th percentile: {}'.format(i*10, df.weight.quantile(q=i/10.0)))

In [None]:
plt.figure(figsize=(12,8))
sns.distplot(df.weight,
             hist_kws=dict(cumulative=True),
             kde_kws=dict(cumulative=True)
            )
for i in range(10):
    print('{}th percentile: {}'.format(i*10, df.weight.quantile(q=i/10.0)))
    plt.scatter(df.weight.quantile(q=i/10.0), i/10.0, c='red')
plt.title('CDF of Weights')
plt.xlabel('Weight')
plt.ylabel('Cumulative Probability')

**Q: Looking at the CDF, approximately what is the median of this dataset? The first quartile? The 75th percentile?**

In [None]:
#Precise calculations....
print(df.weight.quantile(q=.25))
print(df.weight.quantile(q=.5))
print(df.weight.quantile(q=.75))

### PDF - Probability Density Function

Probability density functions serve to outline the underlying theoretical distribution of continuous variables. A PDF shows the relative likelihood of a given observation. For example, in our current example, you might wonder: what is the probability that an individual has a weight of 120 pounds? When working with PDFs, the answer is 0. For a continuous variable, you can only calculate probabilities over an interval. While a lot of individuals may have a weight of approximately 120 pounds, the probability of having a weight of exactly 120 pounds is zero. You must define an interval, however small, such as 119.999 to 120.001, in order to have a non-zero probability.
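
The shrinking-interval idea can be sketched numerically. The normal model below, with a mean of 120 and a standard deviation of 18, is an assumption for illustration only; it is not fitted to the dataset.

```python
from scipy import stats

# Assumed weight model (illustrative parameters, not estimated from the data).
weights = stats.norm(loc=120, scale=18)

probs = []
for half_width in [5, 1, 0.1, 0.001]:
    low, high = 120 - half_width, 120 + half_width
    prob = weights.cdf(high) - weights.cdf(low)  # P(low <= weight <= high)
    probs.append(prob)
    print(f"P({low} <= weight <= {high}) = {prob:.6f}")
```

As the interval narrows around 120, the probability keeps shrinking toward zero, which is why a single exact value carries no probability.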

In [None]:
sns.distplot(df.weight)

A PDF describes the theoretical distribution underlying a dataset, but it can be helpful to compare it with a histogram of observed frequencies for a specific dataset to gain a deeper understanding:

In [None]:
df.weight.hist(bins=25);

An appropriate PDF to characterize our dataset could be the normal distribution.
## The Normal Distribution & Standard Normal Distribution
### Normal distribution
Q: What are the parameters that characterize the normal distribution?

A: The normal distribution is characterized by two parameters, $\mu$ and $\sigma$, which correspond to the mean of the distribution and the standard deviation of the distribution, respectively.

- The mean, $\mu$, is a measure of the central tendency of the distribution.
- The standard deviation, $\sigma$, measures the spread of the data about this mean.

#### The empirical rule
The empirical rule states that approximately 68% of the data in a normal distribution is found within 1 standard deviation of the mean, 95% within 2 standard deviations of the mean, and 99.7% within 3 standard deviations of the mean. The empirical rule is also known as the 68-95-99.7 rule for this reason.
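
The 68-95-99.7 figures can be checked directly against the normal CDF. This is a quick sketch using `scipy.stats.norm`; because the rule is stated in units of standard deviations, the standard normal covers every normal distribution.

```python
from scipy import stats

# Probability within k standard deviations of the mean of a normal distribution.
coverage = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} standard deviation(s): {p:.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```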

Use numpy to create a normal distribution containing 3000 values with mean $\mu = 20$ and standard deviation $\sigma = 0.5$

In [None]:
import numpy as np

mu, sigma = 20, 0.5
n = 3000

s = np.random.normal(mu, sigma, n)

In [None]:
counts, bin_edges, _ = plt.hist(s, bins=20)  # avoid overwriting n (the sample size)


## Standard normal distributions
Compare and contrast the normal distribution and the standard normal distribution. What is the empirical rule for the standard normal distribution?

A: The standard normal distribution is a special case of the normal distribution. It is a normal distribution with a mean of 0 and standard deviation of 1.

The empirical rule for the standard normal distribution is as follows:

- 68% of the area under the standard normal distribution lies between -1 and 1
- 95% of the area under the standard normal distribution lies between -2 and 2
- 99.7% of the area under the standard normal distribution lies between -3 and 3

How do you standardize a normal distribution?

A: To standardize normally distributed data you first subtract the mean of the data from each point and then divide this difference by the data's standard deviation.

In [None]:
standard_s = (s - np.mean(s))/np.std(s)
sns.distplot(standard_s, kde=True);

## Standard score (z-score)
Why is the standard score a useful statistic?

>The z-score tells us how many standard deviations above or below the mean an observation is. Calculating the z-score allows us to understand how extreme a certain result is.

>The z-score allows us to compute the probability of a score occurring in a normal distribution and it allows us to compare scores from different normal distributions.

![norm_to_z](images/norm_to_z.png)
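
To illustrate comparing scores from different normal distributions, the sketch below uses two invented exams. The helper `z_score` and all of the means, standard deviations, and scores are hypothetical.

```python
from scipy import stats

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical exams on very different scales.
z_exam_a = z_score(82, mean=70, sd=8)      # (82 - 70) / 8 = 1.5
z_exam_b = z_score(620, mean=500, sd=100)  # (620 - 500) / 100 = 1.2

for name, z in [("exam A", z_exam_a), ("exam B", z_exam_b)]:
    share_below = stats.norm.cdf(z)  # assumes scores are normally distributed
    print(f"{name}: z = {z:.2f}, share of scores below = {share_below:.3f}")
```

Although 620 is the larger raw number, the score of 82 is the more extreme result because it sits further above its own mean in standard-deviation units.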

### Back to our baby weights dataset 
Q: What is the probability that an individual has a weight between 115 and 125 pounds?

In [None]:
import scipy.stats as stats

In [None]:
lower_z = (115 - df.weight.mean())/df.weight.std() #z-score for 115 pounds
lower_z

In [None]:
stats.norm.cdf(lower_z)

There is a ~40% chance that an individual will have a weight below 115 pounds according to our parameters.

In [None]:
upper_z = (125 - df.weight.mean())/df.weight.std() #z-score for 125 pounds
upper_z

In [None]:
stats.norm.cdf(upper_z)

There is a ~61.7% chance that an individual will have a weight under 125 pounds.

In [None]:
#Putting it all together; probability of having weight between 115 and 125 pounds
stats.norm.cdf(upper_z) - stats.norm.cdf(lower_z)

What is the observed probability of having a weight between 115 and 125 pounds according to the dataset?

In [None]:
#Answer
total_obs = len(df)
relevant_obs = len(df[(df.weight >= 115) & (df.weight <=125)])
relevant_obs / total_obs

### Confidence Intervals for Normally Distributed Data
Because sample statistics are imperfect representations of the true population values, it is often appropriate to state these estimates with confidence intervals.
#### Key Ideas 
- Sample statistics are supplemented with confidence intervals to estimate population parameters
- We generally believe the sample statistic is in the neighborhood of the true population parameter
- The larger the sample, the less likely it is that we drew only the "weirdo" data points from the population
- We trade precision for certainty by widening our interval
- Taking multiple samples (experiments) gives us more evidence about where the true population parameter lies
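
A minimal sketch of how such an interval might be computed follows. It uses a t-based interval for the mean, which is one common choice when the population standard deviation is unknown; the source does not prescribe a specific formula, so treat this as an assumption. The sample of ages is simulated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Hypothetical sample of 40 ages (simulated; the parameters are made up).
sample = rng.normal(loc=27, scale=5, size=40)

x_bar = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(sample.size)  # standard error of the mean

widths = []
for level in (0.90, 0.95, 0.99):
    # Two-sided critical value from the t distribution with n - 1 degrees of freedom.
    t_crit = stats.t.ppf((1 + level) / 2, df=sample.size - 1)
    low, high = x_bar - t_crit * sem, x_bar + t_crit * sem
    widths.append(high - low)
    print(f"{level:.0%} interval: ({low:.2f}, {high:.2f})")
```

The intervals widen as the confidence level rises: more certainty costs precision.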

#### Interpreting Confidence Intervals 
**Wrong:**

There is a 95% probability that the mean age is between 26.3 and 28.3

**Correct:**

If we draw 100 random samples and build a 95% confidence interval from each, we expect about 95 of those intervals to contain the true population mean age.

We are confident in this interval because we expect that a true population mean outside of this interval would produce these results 5% or less of the time. In other words, only an unlikely (but not impossible) sampling event could have caused us to calculate this interval, if the true mean is outside of this interval.

The true population mean is a specific value and we do not know what it is. The confidence level you choose is a question of how often you are willing to find an interval that does not include the true population mean, but it doesn't tell you whether this particular sample + interval calculation gave you the "right" answer.
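
The repeated-sampling interpretation can be checked with a small simulation. In this sketch the true mean and standard deviation are invented values that we pretend not to know, and the intervals are t-based, which is an assumption rather than something the text specifies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

true_mean, true_sd = 27.3, 5.0  # hypothetical population parameters
n, n_experiments, level = 30, 1000, 0.95

t_crit = stats.t.ppf((1 + level) / 2, df=n - 1)

# Each row is one experiment: a fresh random sample of size n.
samples = rng.normal(true_mean, true_sd, size=(n_experiments, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Does each experiment's interval contain the true mean?
covered = (means - t_crit * sems <= true_mean) & (true_mean <= means + t_crit * sems)
coverage = covered.mean()

print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The share should land close to 0.95, and any single interval either contains the true mean or it does not, which is why the confidence level describes the procedure rather than one particular interval.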