<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Statistical-Distributions" data-toc-modified-id="Statistical-Distributions-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Statistical Distributions</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Why-is-this-distinction-so-important?" data-toc-modified-id="Why-is-this-distinction-so-important?-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>Why is this distinction so important?</a></span></li><li><span><a href="#The-difference-between-PMF,-PDF-and-CDF" data-toc-modified-id="The-difference-between-PMF,-PDF-and-CDF-1.0.2"><span class="toc-item-num">1.0.2&nbsp;&nbsp;</span>The difference between PMF, PDF and CDF</a></span></li><li><span><a href="#So-how-do-we-calculate-the-AUC?" data-toc-modified-id="So-how-do-we-calculate-the-AUC?-1.0.3"><span class="toc-item-num">1.0.3&nbsp;&nbsp;</span>So how do we calculate the AUC?</a></span></li><li><span><a href="#Z--Score" data-toc-modified-id="Z--Score-1.0.4"><span class="toc-item-num">1.0.4&nbsp;&nbsp;</span>Z- Score</a></span></li></ul></li></ul></li></ul></div>

# Statistical Distributions
- Binomial 
- PMF
- PDF 
- CDF 
- normal/standard normal 

A **random variable** is actually a function: it assigns a numerical value to each outcome of a random process.

A fundamental distinction among distributions is whether they are discrete or continuous. 
A **discrete distribution** (or variable) takes on countable values, like integers, and each possible value has its own probability.
A **continuous distribution** takes on a continuum of values, like real numbers. It assigns probabilities to ranges of values, since any single exact value has probability zero. 
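
To make the distinction concrete, here is a minimal sketch using `scipy.stats`. The binomial parameters (10 trials, success probability 0.5) and the standard normal are illustrative assumptions, not values taken from this notebook's data.

```python
from scipy import stats

# Discrete: number of successes in 10 trials (n and p are illustrative)
flips = stats.binom(n=10, p=0.5)
p_exactly_5 = flips.pmf(5)           # P(X = 5) is a genuine probability

# Continuous: a standard normal variable
z = stats.norm(loc=0, scale=1)
density_at_0 = z.pdf(0)              # a density value, not a probability
p_between = z.cdf(1) - z.cdf(-1)     # P(-1 <= X <= 1), probability of a range

print(p_exactly_5, density_at_0, p_between)
```

For the discrete variable we can ask for the probability of one exact value; for the continuous one, only a range gives a meaningful probability.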

![](https://miro.medium.com/max/1022/1*7DwXV_h_t7_-TkLAImKBaQ.png)

[More Explanation](https://mathbitsnotebook.com/Algebra1/FunctionGraphs/FNGContinuousDiscrete.html)

### Why is this distinction so important? 

In [None]:
import pandas as pd 
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt 
%matplotlib inline 

In [None]:
sb_data = {'drink_orders' : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'freq' : [4, 20, 13, 6, 4, 2, 0, 0, 0, 0, 1],}

In [None]:
df = pd.DataFrame(sb_data)
df

In [None]:
df.freq.sum()

In [None]:
# relative frequency: divide by the total number of customers (50 here)
df['r_freq'] = df['freq'] / df['freq'].sum()
df
    

In [None]:
# Discrete distribution: limited number of outcomes,
# so a distribution table (and a bar chart of it) is very helpful
plt.bar(df['drink_orders'], df['freq'])

A frequency table like this would not be helpful for a continuous variable such as the wait times of the 50 Starbucks customers. Almost every wait time would be a different value, so the table would be all over the place. Since continuous random variables have an endless number of possible values, we need alternative ways of calculating and visualizing their probabilities.
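
The point about wait times can be sketched with simulated data. The wait times below are hypothetical, drawn from a gamma distribution chosen only for illustration; the notebook has no real wait-time data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical wait times in minutes for 50 customers (simulated, not real data)
wait_times = rng.gamma(shape=2.0, scale=1.5, size=50)

# Nearly every continuous value is unique, so counting exact values tells us little
n_unique = np.unique(wait_times).size

# Grouping the values into bins gives a usable density estimate instead
density, edges = np.histogram(wait_times, bins=8, density=True)
bin_widths = np.diff(edges)
total_area = np.sum(density * bin_widths)   # the bar areas add up to 1

print(n_unique, total_area)
```

With `density=True` the histogram bars behave like a rough probability density: the area of each bar, not its height, is the share of customers in that bin.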

![](https://raw.githubusercontent.com/learn-co-students/dsc-probability-density-function-onl01-dtsc-ft-030220/master/images/pdf2.jpg)

### The difference between PMF, PDF and CDF 

![alt text](https://bokeh.pydata.org/en/0.8.2/_images/charts_histogram_cdf.png)

A probability mass function (PMF), also called a frequency function, gives you probabilities for discrete random variables. “Random variables” are variables from experiments like dice rolls, choosing a number out of a hat, or scoring on a test. The “discrete” part means that the outcomes form a countable set of distinct values. For example, you can only roll a 1, 2, 3, 4, 5, or 6 on a die.
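
As a small sketch of the die example, `scipy.stats.randint` can model a fair six-sided die, and `scipy.stats.binom` gives the PMF of a binomial count. The binomial setup (number of sixes in 10 rolls) is an added illustration, not something from the data above.

```python
import numpy as np
from scipy import stats

# Fair six-sided die: discrete uniform on 1..6 (high is exclusive)
faces = np.arange(1, 7)
die = stats.randint(low=1, high=7)
die_pmf = die.pmf(faces)             # each face has probability 1/6

# Binomial: number of sixes in 10 rolls (illustrative setup)
k = np.arange(0, 11)
sixes_pmf = stats.binom.pmf(k, n=10, p=1 / 6)

print(die_pmf)
print(sixes_pmf.sum())               # a PMF sums to 1 over all outcomes
```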

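The PMF's counterpart for continuous random variables is the probability density function (PDF). A PDF gives a density rather than a probability: the probability of a range of values is the area under the PDF curve over that range.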

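The cumulative distribution function (CDF) gives the probability that a random variable is less than or equal to a given value. You can use the CDF to figure out probabilities above a certain value, below a certain value, or between two values. For example, if you had a CDF of the weights of cats, you could use it to figure out: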

- The probability of a cat weighing more than 11 pounds.
- The probability of a cat weighing less than 11 pounds.
- The probability of a cat weighing between 11 and 15 pounds.

In this scenario, a veterinary pharmaceutical company, say, would want to know the probability of cats falling into each weight range in order to produce the right volume of medication for each range.
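
The three cat-weight questions can be sketched with a CDF as follows. The assumption that cat weights are normally distributed with mean 10 lb and standard deviation 2 lb is hypothetical and used only to have numbers to work with.

```python
from scipy import stats

# Hypothetical model: cat weights ~ Normal(mean=10 lb, sd=2 lb)
cats = stats.norm(loc=10, scale=2)

p_less_11 = cats.cdf(11)                  # P(weight < 11)
p_more_11 = cats.sf(11)                   # P(weight > 11) = 1 - cdf(11)
p_11_to_15 = cats.cdf(15) - cats.cdf(11)  # P(11 < weight < 15)

print(p_less_11, p_more_11, p_11_to_15)
```

Every one of the three answers comes from the CDF alone: "below" is the CDF itself, "above" is one minus the CDF, and "between" is a difference of two CDF values.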

In [None]:
x = np.linspace(-5, 5, 5000)
mu = 0
sigma = 1

y_pdf = stats.norm.pdf(x, mu, sigma) # the normal pdf
y_cdf = stats.norm.cdf(x, mu, sigma) # the normal cdf

fig, ax = plt.subplots(1, 1, figsize=(6, 6))
ax.plot(x, y_pdf, 'r', label='pdf', linewidth=5)
ax.plot(x, y_cdf, 'k', label='cdf', linewidth=5)
ax.legend(loc='best');

### So how do we calculate the AUC? 
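The AUC here is the area under the PDF curve over a range, which is the probability of that range. To calculate it, we should learn a bit more about the bell-shaped (normal) distribution. 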
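
One way to see what the area under the curve means is to compute it two ways for the standard normal: by numerically integrating the PDF with `scipy.integrate.quad`, and by taking a difference of CDF values. This is a sketch for illustration; the ranges of 1, 2, and 3 standard deviations are chosen as examples.

```python
from scipy import stats
from scipy.integrate import quad

areas = {}
for k in (1, 2, 3):
    # Area under the standard normal PDF between -k and +k
    area_numeric, _err = quad(stats.norm.pdf, -k, k)
    # The same area as a difference of two CDF values
    area_cdf = stats.norm.cdf(k) - stats.norm.cdf(-k)
    areas[k] = (area_numeric, area_cdf)
    print(k, round(area_numeric, 4), round(area_cdf, 4))
```

The two numbers agree for each range, which is why in practice the CDF is used instead of integrating the PDF by hand.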

[Normal Distribution Learn.co](https://learn.co/tracks/module-2-data-science-career-2-1/statistics-ab-testing-and-linear-regression/section-12-statistical-distributions/the-normal-distribution)

<img src = "https://github.com/learn-co-students/dsc-0-09-12-gaussian-distributions-online-ds-ft-031119/blob/master/formula.jpg?raw=true" width=300>

<img src='https://github.com/learn-co-students/dsc-0-09-12-gaussian-distributions-online-ds-ft-031119/blob/master/normalsd.jpg?raw=true' width=700/>

[Practice](https://www.intmath.com/counting-probability/normal-distribution-graph-interactive.php)

![alt text](https://trello-attachments.s3.amazonaws.com/5c9820e82b57e23871ddad9a/5c982e562847357b452cccd7/4bb7f068f92283d8ce096d7b4cabbfce/skew1.jpeg)

### Z- Score 
**The area under the whole of a normal distribution curve is 1, or 100 percent. The z-table helps by telling us what percentage of that area lies under the curve up to any particular point: the value listed for a z-score is the probability of the random variable falling to the left of that score (some tables instead show the area between the mean and a z-score to the right of the mean). In other words, the z-table tabulates the cumulative distribution function of the standard normal distribution. A z-score is computed as:**

$$z = \frac{x-\mu}{\sigma}$$

which is just the number of standard deviations that $x$ lies from the mean.
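
A short sketch of the z-score in use: standardize a value, then read the area to its left from the standard normal CDF, which is what a z-table lookup does. The numbers (a score of 85 with mean 70 and standard deviation 10) are hypothetical.

```python
from scipy import stats

# Hypothetical values for illustration
x, mu, sigma = 85, 70, 10

z = (x - mu) / sigma                      # number of standard deviations from the mean
p_left = stats.norm.cdf(z)                # area to the left of z (the z-table value)
p_left_direct = stats.norm.cdf(x, mu, sigma)  # same area without standardizing first

print(z, p_left, p_left_direct)
```

Standardizing and then using the standard normal CDF gives the same probability as evaluating the original normal CDF directly, which is why a single z-table works for every normal distribution.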